I saw ganglia1.private.releng.scl3.mozilla.com start alerting on swap at 5:20 PST today. When I logged in, there were nearly 2000 sendmail processes running, eating all of the available memory/swap. It turns out that sendmail was choking because the root filesystem had gone read-only. I rebooted the VM and it seems to be back to normal now, but I'd really like to know how this happened and what we can do to make sure it stops happening (since I know it's been a chronic issue as of late). I suspect this was one of the VMs that got storage-vMotioned last night.
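For anyone triaging the same symptoms, the quick checks amount to counting the runaway processes and inspecting the mount flags on `/`. A minimal sketch, assuming a GNU/Linux host with procps (these are not the exact commands run during this incident):

```shell
# Triage sketch (assumptions: GNU/Linux + procps; not taken verbatim
# from this incident):
ps -C sendmail --no-headers | wc -l          # how many sendmail processes are running?
awk '$2 == "/" { print $4 }' /proc/mounts    # mount options for /; "ro,..." means trouble
```

If the second command prints options starting with `ro`, the root filesystem has been remounted read-only and a reboot (or remount) is needed before sendmail can drain its queue.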
vsphere says this svmo'ed at 11/6/2012 10:11:25 pm PDT ("last night," as I write this).

[firstname.lastname@example.org ~]# ls -lrt /var/log/secure*
-rw------- 1 root root 560726 Oct  7 03:36 /var/log/secure-20121007
-rw------- 1 root root 560884 Oct 14 03:31 /var/log/secure-20121014
-rw------- 1 root root 555027 Oct 21 03:31 /var/log/secure-20121021
-rw------- 1 root root 293397 Nov  7 06:19 /var/log/secure-20121107
-rw------- 1 root root   3204 Nov  7 07:03 /var/log/secure

[email@example.com ~]# cat /var/log/secure-20121107 | grep -B2 Nov | head -3
Oct 24 21:19:11 ganglia1 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5% -p /
Oct 24 21:21:48 ganglia1 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5% -p /
Nov  7 05:32:52 ganglia1 runuser: pam_unix(runuser:session): session opened for user root by (uid=0)

The VM had already stopped logging (for around two weeks), implying that this is probably a victim of the already-understood dedupe-based crashes of bug 804413, rather than something new from svmo's.
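The two-week gap in /var/log/secure is itself a detectable signal: a read-only root filesystem silently stops log writes long before anything alerts. A minimal staleness check might look like this (the one-day threshold and the focus on /var/log/secure are my assumptions, not anything prescribed in this bug):

```shell
# Sketch: flag a host whose security log has gone quiet.
# The 24-hour (1440-minute) threshold is an assumption; tune it per host.
log_is_stale() {
    # Succeeds if the file exists but has not been written in the last day.
    [ -e "$1" ] && [ -n "$(find "$1" -mmin +1440 2>/dev/null)" ]
}

if log_is_stale /var/log/secure; then
    echo "WARNING: /var/log/secure has not been written in over a day"
fi
```

Run from cron or wrapped as a Nagios check, this would have flagged ganglia1 roughly two weeks before the swap alert fired.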
Yeesh. Okay, so that brings up another question... Is there any way we can do a pass over the VMs to determine which might be in this state and just unnoticed? I know pir said that monitoring was planned for this, but is there anything we can do short term?
Short of writing a big `for` loop, not really. This is a woe-of-the-OS that isn't readily apparent to vmware directly, because it's basically "ok, I timed out sometime in the past, I give up and am not talking to you." Even if we tried to go to vmware for diagnosis by inference (disk writes being at zero), it's time-consuming data to gather (mostly reading graphs), and even then it would likely have too many false positives to be of any use.
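That "big `for` loop" could be sketched roughly as follows. The `$HOSTS` inventory variable and ssh-as-root access are assumptions (neither comes from this bug), and parsing `/proc/mounts` is one plausible way to spot a root filesystem that has flipped to read-only:

```shell
# Brute-force sweep sketch (assumptions: $HOSTS holds the VM inventory and
# we can ssh to each host; neither detail comes from this bug).
HOSTS="${HOSTS:-}"   # supply the real host list here

is_root_ro() {
    # Reads a /proc/mounts-style listing on stdin; succeeds if / is mounted ro.
    awk '$2 == "/" { print $4 }' | tr ',' '\n' | grep -qx 'ro'
}

for h in $HOSTS; do
    if ssh -o ConnectTimeout=5 "$h" cat /proc/mounts | is_root_ro; then
        echo "$h: root filesystem is read-only"
    fi
done
```

A write test (e.g. `touch` a scratch file) would catch the same condition more directly, but reading `/proc/mounts` avoids leaving debris on healthy hosts.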
Okay, we'll call this resolved and wait on the pending checks from the SRE group, then. Thanks.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations