Proliant DL740 hangs with sles8 (k_smp 2.4.21-190) 2004-03-05 - By Miquel Colom
Hello Andrew
Your assumptions seem correct to me:
1- Megaraid driver issue.
2- Motherboard issue.
Have you tried asking or searching on the Dell PowerEdge list at lists.us.dell.com?
Best regards
Miquel Colom Pizá, Director, Technical Area, Systems Dept., Hotelbeds S.L.
c-Joan Muntaner i Bordoy s/n Bjs. 07006 - Palma de Mallorca Tel. +34 971178839 Fax +34 971465062
"McAllister, Andrew" <McAllisterA@(protected)> 04/03/2004 23:47
To: <suse-oracle@(protected)>
cc: "Michael Hasenstein" <mha@(protected)>, "Sightler, Tom" <tsightler@(protected)>
Subject: RE: [suse-oracle] [OT] Proliant DL740 hangs with sles8 (k_smp 2.4.21-190)
Update on this thread...
Our test 2650 has been running flawlessly for over 24 hours with over 7000 of our stress test data loads and a load average of 7.5+. No errors of any kind. Config is as follows: Dell 2650, SuSE SLES 8 SP3 fully patched by YOU in automatic mode, kernel 2.4.21-190_smp, 2 GB RAM, Adaptec RAID on the motherboard, aacraid driver, LVM, reiserfs, Oracle 9.2.0.3, async IO turned on. Max open files per user 1024.
Our standby 6650 has been running for 24 hours and is performing VERY poorly. The latest change on this box was to disable async IO, and since we turned it OFF, performance has gone into the dumpster: only 170 stress test loads have finished in the last 20 hours, compared with the 7000 loads completed by the 2650 above. Config is as follows: Dell 6650, SuSE SLES 8 SP3 fully patched by YOU in automatic mode, kernel 2.4.21-190_smp, 4 GB RAM, Perc4/DC (LSI Megaraid 320 dual channel), megaraid2 2.00.8, NO LVM, reiserfs, Oracle 9.2.0.3, async IO turned OFF. Max open files per user 1024. No errors, but this may be because the lack of async IO is keeping the box from experiencing "heavy load". Two databases are running on this box: a Data Guard standby of our production environment and a stress test database.
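To put the standby's slowdown in numbers, here is a rough throughput comparison from the figures above. It treats "over 7000 loads in over 24 hours" as exactly 7000 in 24, so the values are approximate:

```latex
\[
r_{2650} \approx \frac{7000\ \text{loads}}{24\ \text{h}} \approx 292\ \text{loads/h},
\qquad
r_{6650,\text{standby}} = \frac{170\ \text{loads}}{20\ \text{h}} = 8.5\ \text{loads/h},
\qquad
\frac{r_{2650}}{r_{6650,\text{standby}}} \approx 34
\]
```

The two boxes differ in more than async IO (hardware, RAID driver, LVM), so the ratio measures the size of the gap, not its cause.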
Production 6650 has been running for 22 hours. We rebooted it last night to fix multiple Oracle listener hangs and shell scripts hung on pipes. During the reboot we ran fsck on all 8 of our database reiserfs file systems, and no corruption was found. The database and listener were restarted, and our normal data load last night produced two "ORA-12599: TNS:cryptographic checksum mismatch" and "ORA-03113: end-of-file on communications channel" messages. If history is any indication, we will get more and more of these errors each night for the next 3 nights, while other system problems arise. Otherwise, operations today were mostly normal. Config is as follows: Dell 6650, SuSE SLES 8 SP3 fully patched by YOU in automatic mode, kernel 2.4.21-190_smp, 4 GB RAM, Perc4/DC (LSI Megaraid 320 dual channel), megaraid2 2.00.8, LVM active, reiserfs, Oracle 9.2.0.3, async IO turned ON. Max open files ulimit set to 16384 for the oracle user PRIOR to starting the database.
So from this info I am making the following deductions:
1) Reiserfs doesn't appear to be a problem. It is running on both working and broken servers, and at the last reboot we fsck'd all mount points on a broken system with no corruption found.
2) Async IO doesn't appear to be a problem. It is running on both broken and working servers.
3) The logical volume manager doesn't appear to be a problem. It is running on both broken and working servers.
4) A max open files limit of 1024 for the oracle user doesn't appear to be a problem. It is set to 1024 on working and non-working systems, and to 16384 on a non-working system.
5) The Broadcom GigE cards may or may not have contributed to the problems.
6) With hyperthreading on or off, the broken servers are still broken.
Based on this, I think there is a fairly good chance (I HOPE) the problem is one of:
1) Dell 6650 chipset/motherboard issues with kernel 2.4.21-190_smp and kin.
2) Megaraid controller hardware or megaraid2 driver issues (we are running 2.00.8 from SuSE, while 2.10.0.1 is current from LSI).
3) Some other kernel or driver problem that is not directly related to anything above, but eventually causes resource starvation that does affect the components tested. I hope this isn't the problem.
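The elimination reasoning above (a setting seen on a working server is not, by itself, the culprit) can be sketched in a few lines of Python. This is a hypothetical helper, not a tool used in this thread. The config dictionaries only encode what the three server descriptions state, and, following deduction 4, the standby 6650 is counted as non-working. Broadcom NICs and hyperthreading are not modeled.

```python
from typing import Dict, List, Tuple

# Settings transcribed from the three server descriptions in this message.
# "working" follows the author's own classification (deduction 4 treats the
# standby 6650 as non-working). NICs and hyperthreading are not modeled.
SERVERS: Dict[str, dict] = {
    "test 2650": {
        "working": True,
        "config": {"model": "Dell 2650", "raid_driver": "aacraid", "lvm": True,
                   "fs": "reiserfs", "async_io": True, "nofile": 1024,
                   "ram_gb": 2, "kernel": "2.4.21-190_smp"},
    },
    "standby 6650": {
        "working": False,
        "config": {"model": "Dell 6650", "raid_driver": "megaraid2 2.00.8", "lvm": False,
                   "fs": "reiserfs", "async_io": False, "nofile": 1024,
                   "ram_gb": 4, "kernel": "2.4.21-190_smp"},
    },
    "production 6650": {
        "working": False,
        "config": {"model": "Dell 6650", "raid_driver": "megaraid2 2.00.8", "lvm": True,
                   "fs": "reiserfs", "async_io": True, "nofile": 16384,
                   "ram_gb": 4, "kernel": "2.4.21-190_smp"},
    },
}


def suspects(servers: Dict[str, dict]) -> List[Tuple[str, object]]:
    """Return (setting, value) pairs shared by every broken server and absent from every working one."""
    broken = [set(s["config"].items()) for s in servers.values() if not s["working"]]
    working = [set(s["config"].items()) for s in servers.values() if s["working"]]
    common_to_broken = set.intersection(*broken)  # settings all broken boxes share
    seen_on_working = set().union(*working)       # anything a working box also has
    return sorted(common_to_broken - seen_on_working)


if __name__ == "__main__":
    for setting, value in suspects(SERVERS):
        print(f"{setting}: {value}")
```

With these inputs the survivors are the 6650 model, the megaraid2 driver, and the 4 GB RAM size, which lines up with candidates 1 and 2. Single-factor elimination cannot catch interactions (for example, the 6650 chipset misbehaving only with this kernel, which is itself eliminated here because the 2650 runs it), and that is the kind of problem candidates 1 and 3 allow for.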
Next steps? I guess our next step is to go off the supported kernel RPM and upgrade to the megaraid2 2.10.xx driver from LSI. Does anyone have other suggestions to help diagnose the problems?
Thanks Andy
P.S. Michael, Oracle hasn't yet responded to our TAR, which we updated last night at 9:20pm.
> -----Original Message-----
> From: McAllister, Andrew [mailto:McAllisterA@(protected)]
> Sent: Thursday, March 04, 2004 8:53 AM
> To: Sightler, Tom
> Cc: suse-oracle@(protected)
> Subject: RE: [suse-oracle] [OT] Proliant DL740 hangs with sles8 (k_smp 2.4.21-190)
>
> > -----Original Message-----
> > From: Sightler, Tom [mailto:tsightler@(protected)]
> snip
> > > To me open files looks low at 1024, but our old system is set this way
> > > and never hung once.
> >
> > Did your old system run the same OS and same version of
> > Oracle with async IO? If not then it's probably not fair to
> > compare the two.
>
> Oracle version was the same, but we ran SuSE 8 PRO, not SLES, and had no
> async IO.
>
> > I'd at least look at this as, based on your proc output, your
> Snip
> Good advice.
> > to the same as the system limit. Lots of Oracle setup docs
> > suggest setting both the system file number limit and the
> > Oracle user's ulimit to 65535 or greater.
> >
> > We used to get lots of ORA-03113 errors before we increased
> > our limits. Are you having some of those? Our current
> Yes we are, a fair amount. Usually on the second day after a reboot.
> They get worse and worse as time goes on.
>
> When we restarted our server last night we set the file limit for oracle
> to 16384.
>
> Thanks
> Andy
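The quoted advice (raise both the system-wide file limit and the oracle user's open-files ulimit, with 65535 or more as the figure from Oracle setup docs) can be checked with a short Python sketch. It assumes a Linux host, where the system-wide limit is read from /proc/sys/fs/file-max, and it checks the RLIMIT_NOFILE of whatever process runs it, so it would need to run as the oracle user in the same environment that starts the database. The 65535 threshold is only the number quoted above, not a verified Oracle requirement, and the helper names are hypothetical.

```python
import resource
from typing import List

# Threshold quoted in the thread; a rule of thumb, not a verified requirement.
RECOMMENDED_MIN = 65535


def read_file_max(path: str = "/proc/sys/fs/file-max") -> int:
    """Read the system-wide open-file limit (Linux only)."""
    with open(path) as f:
        return int(f.read().split()[0])


def below(limit: int, minimum: int) -> bool:
    """True if a finite limit is under the minimum; RLIM_INFINITY never counts as below."""
    return limit != resource.RLIM_INFINITY and limit < minimum


def check_limits(soft: int, hard: int, file_max: int,
                 minimum: int = RECOMMENDED_MIN) -> List[str]:
    """Return one warning string per limit that falls under the minimum."""
    warnings = []
    if file_max < minimum:
        warnings.append(f"fs.file-max is {file_max}, below {minimum}")
    if below(soft, minimum):
        warnings.append(f"soft RLIMIT_NOFILE is {soft}, below {minimum}")
    if below(hard, minimum):
        warnings.append(f"hard RLIMIT_NOFILE is {hard}, below {minimum}")
    return warnings


if __name__ == "__main__":
    # Limits of the current process, i.e. of the user/shell running this script.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    try:
        file_max = read_file_max()
    except OSError as exc:  # e.g. not running on Linux
        print(f"cannot read file-max: {exc}")
    else:
        for line in check_limits(soft, hard, file_max) or ["limits look OK"]:
            print(line)
```

By this yardstick, even the production box's new 16384 setting would still be reported as below 65535.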
-- To unsubscribe, email: suse-oracle-unsubscribe@(protected) For additional commands, email: suse-oracle-help@(protected) Please see http://www.suse.com/oracle/ before posting