Closed Bug 284325 Opened 20 years ago Closed 20 years ago

reptile gets hit by the Linux OOM killer

Categories

(mozilla.org Graveyard :: Server Operations, task, P1)

x86
Linux
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chase, Assigned: justdave)

reptile is getting hit by the Linux OOM killer.  Filing this to track the issue
and possible solutions.  It hit us on 2/28 around 23:57, and once before that
about 2.5 days earlier.  reptile hosts the database used by dmo and sfx, so when
it falls, a loud sound is made.

  RH Bug 131251 - kernel Out of Memory: Killed process
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=131251

  RH Bug 149635 - SCSI errors and kernel OOM killer under scsi load under RHEL4
  https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=149635

  An lkml thread describing the introduction of this problem in 2.6.9-rc3
  http://lkml.org/lkml/2004/10/23/128
Blocks: 284141
Just nailed us again, 3/3 at ~7:00am.  That's 5 hours earlier than expected at
the previous failure rate.

Installed the 2.6.9-6.16.ELsmp kernel that was offered by blizzard and rebooted.
(We had been running 2.6.9-5.0.3.ELsmp, which is the current officially
supported version.)
Memory prior to shutdown/reboot, but with MySQL already dead:

MemTotal:      4025564 kB
MemFree:       1032460 kB
Buffers:         70444 kB
Cached:        1662196 kB
SwapCached:          0 kB
Active:        2098728 kB
Inactive:       125560 kB
HighTotal:     3145600 kB
HighFree:       985984 kB
LowTotal:       879964 kB
LowFree:         46476 kB
SwapTotal:     2047992 kB
SwapFree:      2047992 kB
Dirty:             828 kB
Writeback:           0 kB
Mapped:         262940 kB
Slab:           752476 kB
Committed_AS:   767160 kB
PageTables:       1684 kB
VmallocTotal:   106488 kB
VmallocUsed:      3168 kB
VmallocChunk:   102548 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB
Memory after reboot, with everything including MySQL running again:

MemTotal:      4025572 kB
MemFree:       3131848 kB
Buffers:         28604 kB
Cached:         659876 kB
SwapCached:          0 kB
Active:         531752 kB
Inactive:       317776 kB
HighTotal:     3145600 kB
HighFree:      2319104 kB
LowTotal:       879972 kB
LowFree:        812744 kB
SwapTotal:     2047992 kB
SwapFree:      2047992 kB
Dirty:           17780 kB
Writeback:           0 kB
Mapped:         170280 kB
Slab:            27936 kB
Committed_AS:   784020 kB
PageTables:       1484 kB
VmallocTotal:   106488 kB
VmallocUsed:      3172 kB
VmallocChunk:   102808 kB
HugePages_Total:     0
HugePages_Free:      0
Hugepagesize:     2048 kB

Note the difference in LowMem: LowFree is 46476 kB before the reboot versus
812744 kB after.
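For comparing two snapshots like the ones above without eyeballing them, here is a minimal Python sketch that parses /proc/meminfo-style text and prints the change per field.  The function names `parse_meminfo` and `diff_meminfo` are made up for illustration; the numbers are copied from the two dumps in this bug.  Slab is included only because it is the other field that moves a lot between the two dumps; that is an observation, not a diagnosis.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # skip blank or malformed lines
        values[name.strip()] = int(parts[0])
    return values


def diff_meminfo(before, after, fields):
    """Return {field: (before, after, after - before)} for the chosen fields."""
    return {f: (before[f], after[f], after[f] - before[f]) for f in fields}


# Values copied from the two snapshots in this bug (kB).
BEFORE = """\
MemFree:       1032460 kB
LowTotal:       879964 kB
LowFree:         46476 kB
Slab:           752476 kB
"""
AFTER = """\
MemFree:       3131848 kB
LowTotal:       879972 kB
LowFree:        812744 kB
Slab:            27936 kB
"""

if __name__ == "__main__":
    delta = diff_meminfo(parse_meminfo(BEFORE), parse_meminfo(AFTER),
                         ["MemFree", "LowFree", "Slab"])
    for field, (b, a, d) in delta.items():
        print(f"{field:<8} {b:>8} kB -> {a:>8} kB ({d:+d} kB)")
```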
We've had LowMem monitoring in place since the last reboot.  It hit warning
levels earlier this afternoon and has been dropping steadily (but slowly)
since.  Restarting MySQL did not affect the memory levels.
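The bug does not say how the LowMem monitoring is implemented, so the following is only a sketch of what such a check could look like in Python.  The thresholds `WARN_KB` and `CRIT_KB`, the function names, and the 0/1/2/3 return codes (the common monitoring-plugin convention for OK, warning, critical, unknown) are all assumptions.

```python
# Hypothetical thresholds in kB; the bug does not say what levels were used.
WARN_KB = 200_000
CRIT_KB = 50_000


def read_field(text, field):
    """Return one /proc/meminfo field as an int, or None if it is absent."""
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and name.strip() == field and parts:
            return int(parts[0])
    return None


def classify(low_free_kb, warn_kb=WARN_KB, crit_kb=CRIT_KB):
    """Map a LowFree reading to a (status code, label) pair."""
    if low_free_kb is None:
        return 3, "UNKNOWN"
    if low_free_kb <= crit_kb:
        return 2, "CRITICAL"
    if low_free_kb <= warn_kb:
        return 1, "WARNING"
    return 0, "OK"


def check(text):
    """Classify a full /proc/meminfo dump by its LowFree value."""
    return classify(read_field(text, "LowFree"))
```

On a live host this would be called as `check(open("/proc/meminfo").read())`.  LowFree is only reported by kernels built with highmem support, so on other kernels the sketch returns UNKNOWN rather than guessing.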

It would appear to me that the memory leak in the kernel is still there, albeit
quite a bit slower than it used to be.  At the rate it's going, I'm expecting
the OOM killer to start nailing things around 3:00pm PST Sunday.
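An estimate like this can be made by fitting a straight line to the LowFree readings and projecting where it crosses a floor.  The Python sketch below shows that arithmetic; it is not the method actually used here, which the bug does not describe.  The function names and the `floor_kb` parameter are hypothetical, and a linear fit assumes the leak rate stays constant.

```python
from datetime import datetime, timedelta


def fit_line(samples):
    """Least-squares fit of y = slope * t + intercept over (t, y) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def projected_exhaustion(readings, floor_kb=0):
    """Estimate when LowFree reaches floor_kb, or None if it is not falling.

    readings: list of (datetime, low_free_kb) pairs with at least two
    distinct timestamps, oldest first.
    """
    start = readings[0][0]
    # Express each timestamp as hours since the first reading.
    samples = [((ts - start).total_seconds() / 3600.0, kb)
               for ts, kb in readings]
    slope, intercept = fit_line(samples)
    if slope >= 0:
        return None  # flat or recovering: no projected exhaustion
    hours = (floor_kb - intercept) / slope
    return start + timedelta(hours=hours)
```

For example, with made-up readings of 800000, 700000 and 600000 kB taken 10 hours apart, the fitted slope is -10000 kB/hour and the projection lands 80 hours after the first reading.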
The RHEL4 bug at RH that we've been following was recently closed as NOTABUG,
and the other bug is for Fedora Core 2.

I just opened a new bug at RedHat for this specific issue.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=150971
OK, based on levels this last week, we (myself and the guys at RH) think this
might be fixed in the kernel we're currently running.  We've gone 2 weeks on the
current kernel with no crashes.  LowMem has been low multiple times and always
recovers.  Looks like whatever was forgetting to do garbage collection before is
actually doing it now.  Will leave this open for another week before I sign off
on it though.
Severity: normal → critical
OS: Windows XP → Linux
Priority: -- → P1
uptime 20 days 16 hours :)

I think I'll call this fixed.  If it crashes again I'll reopen it.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard