Closed Bug 284325 Opened 20 years ago Closed 20 years ago

reptile gets hit by the Linux OOM killer

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

x86
Linux
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chase, Assigned: justdave)

References

Details

reptile is getting hit by the Linux OOM killer. Filing this to track the issue and possible solutions. This hit us on 2/28 around 23:57, and before that about 2.5 days earlier. reptile hosts the database used by dmo and sfx, so when it falls, a loud sound is made.

RH Bug 131251 - kernel Out of Memory: Killed process
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=131251

RH Bug 149635 - SCSI errors and kernel OOM killer under scsi load under RHEL4
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149635

An lkml thread describing the introduction of this problem in 2.6.9-rc3:
http://lkml.org/lkml/2004/10/23/128
Blocks: 284141
Just nailed us again, 3/3 at ~7:00am. That is 5 hours earlier than expected at the previous failure rate. Installed the 2.6.9-6.16.ELsmp kernel that was offered by blizzard and rebooted. (We had been running 2.6.9-5.0.3.ELsmp, which is the current officially supported version.)
Memory prior to shutdown/reboot, but with MySQL already dead:

MemTotal: 4025564 kB
MemFree: 1032460 kB
Buffers: 70444 kB
Cached: 1662196 kB
SwapCached: 0 kB
Active: 2098728 kB
Inactive: 125560 kB
HighTotal: 3145600 kB
HighFree: 985984 kB
LowTotal: 879964 kB
LowFree: 46476 kB
SwapTotal: 2047992 kB
SwapFree: 2047992 kB
Dirty: 828 kB
Writeback: 0 kB
Mapped: 262940 kB
Slab: 752476 kB
Committed_AS: 767160 kB
PageTables: 1684 kB
VmallocTotal: 106488 kB
VmallocUsed: 3168 kB
VmallocChunk: 102548 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB
Memory after reboot with everything including MySQL running again:

MemTotal: 4025572 kB
MemFree: 3131848 kB
Buffers: 28604 kB
Cached: 659876 kB
SwapCached: 0 kB
Active: 531752 kB
Inactive: 317776 kB
HighTotal: 3145600 kB
HighFree: 2319104 kB
LowTotal: 879972 kB
LowFree: 812744 kB
SwapTotal: 2047992 kB
SwapFree: 2047992 kB
Dirty: 17780 kB
Writeback: 0 kB
Mapped: 170280 kB
Slab: 27936 kB
Committed_AS: 784020 kB
PageTables: 1484 kB
VmallocTotal: 106488 kB
VmallocUsed: 3172 kB
VmallocChunk: 102808 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB

Note the difference in LowMem: LowFree was 46476 kB before the reboot and is 812744 kB after it.
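To put the LowMem difference in proportion, here is a minimal Python sketch that parses /proc/meminfo-style text and reports LowFree as a share of LowTotal. The LowTotal, LowFree, and Slab values are copied from the two snapshots in this bug; the function names `parse_meminfo` and `low_free_percent` are made up for illustration and are not part of any tool used on reptile.

```python
import re

# Values copied from the two /proc/meminfo snapshots in this bug (kB).
BEFORE_REBOOT = "LowTotal: 879964 kB LowFree: 46476 kB Slab: 752476 kB"
AFTER_REBOOT = "LowTotal: 879972 kB LowFree: 812744 kB Slab: 27936 kB"


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer} dict.

    Works on both the normal one-field-per-line layout and on text
    where the fields have been flattened onto a single line.
    """
    return {name: int(value)
            for name, value in re.findall(r"(\w+):\s+(\d+)", text)}


def low_free_percent(fields):
    """Return LowFree as a percentage of LowTotal."""
    return 100.0 * fields["LowFree"] / fields["LowTotal"]


if __name__ == "__main__":
    snapshots = (("before reboot", BEFORE_REBOOT),
                 ("after reboot", AFTER_REBOOT))
    for label, text in snapshots:
        fields = parse_meminfo(text)
        print(f"{label}: LowFree is {low_free_percent(fields):.1f}% "
              f"of LowTotal, Slab is {fields['Slab']} kB")
```

By this arithmetic, about 5% of low memory was free before the reboot and about 92% after it. Slab also shrank from 752476 kB to 27936 kB, which would be consistent with kernel allocations consuming low memory, although nothing in this bug confirms that as the cause.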
We've had LowMem monitoring in place since the last reboot. It hit warning levels earlier this afternoon and has been dropping steadily (but slowly) since. Restarting MySQL did not affect the memory levels. It would appear to me that the memory leak in the kernel is still there, albeit quite a bit slower than it used to be. At the rate it's going, I'm expecting the OOM killer to start nailing things around 3:00pm PST Sunday.
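A prediction like the one above can be made by fitting a straight line to the monitored LowFree readings and seeing where it crosses zero. The sketch below shows one way to do that in Python with a least-squares fit; it is not the monitoring actually used on reptile. The function name `estimate_exhaustion` and every sample value are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from datetime import datetime, timedelta


def estimate_exhaustion(samples, floor_kb=0):
    """Estimate when LowFree reaches floor_kb.

    samples is a list of (datetime, low_free_kb) readings. A least-squares
    line is fitted through them and extended down to floor_kb. Returns a
    datetime, or None if there are too few samples or no downward trend.
    """
    if len(samples) < 2:
        return None
    start = samples[0][0]
    xs = [(when - start).total_seconds() for when, _ in samples]
    ys = [float(kb) for _, kb in samples]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return None  # all readings share one timestamp
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    if slope >= 0:
        return None  # LowFree is flat or recovering
    intercept = mean_y - slope * mean_x
    return start + timedelta(seconds=(floor_kb - intercept) / slope)


if __name__ == "__main__":
    # Hypothetical readings, NOT taken from reptile: 100000 kB lost per hour.
    first = datetime(2005, 1, 1, 12, 0)
    readings = [(first + timedelta(hours=h), 800000 - 100000 * h)
                for h in range(3)]
    print("LowFree projected to reach 0 kB at", estimate_exhaustion(readings))
```

A linear projection only holds while the leak rate stays constant, so the estimate is best recomputed as new readings arrive.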
The RHEL4 bug at RH that we've been following was recently closed as NOTABUG, and the other bug is for Fedora Core 2. I just opened a new bug at Red Hat for this specific issue:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150971
OK, based on levels this last week, we (myself and the guys at RH) think this might be fixed in the kernel we're currently running. We've gone 2 weeks on the current kernel with no crashes. LowMem has been low multiple times and always recovers. Looks like whatever was forgetting to do garbage collection before is actually doing it now. Will leave this open for another week before I sign off on it though.
Severity: normal → critical
OS: Windows XP → Linux
Priority: -- → P1
uptime 20 days 16 hours :) I think I'll call this fixed. If it crashes again I'll reopen it.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard