I saw ganglia1.private.releng.scl3.mozilla.com start alerting on swap at 5:20 PST today. When I logged in, there were nearly 2000 sendmail processes running, eating all of the available memory/swap. It turns out that sendmail was choking because the root filesystem had gone read-only. I rebooted the VM and it seems to be back to normal now, but I'd really like to know how this happened and what we can do to make sure it stops happening (since I know it's been a chronic issue as of late). I suspect this was one of the VMs that got storage-vMotioned last night.
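For anyone triaging the same symptoms, the quick checks amount to counting the runaway processes and inspecting the mount flags on `/`. A minimal sketch, assuming a GNU/Linux host with procps (these are not the exact commands run during this incident):

```shell
# Triage sketch (assumptions: GNU/Linux + procps; not taken verbatim
# from this incident):
ps -C sendmail --no-headers | wc -l          # how many sendmail processes are running?
awk '$2 == "/" { print $4 }' /proc/mounts    # mount options for /; "ro,..." means trouble
```

If the second command prints options starting with `ro`, the root filesystem has been remounted read-only and a reboot (or remount) is needed before sendmail can drain its queue.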
vsphere says this svmo'ed at 11/6/2012 10:11:25 pm PDT ("last night," as I write this).

[firstname.lastname@example.org ~]# ls -lrt /var/log/secure*
-rw------- 1 root root 560726 Oct  7 03:36 /var/log/secure-20121007
-rw------- 1 root root 560884 Oct 14 03:31 /var/log/secure-20121014
-rw------- 1 root root 555027 Oct 21 03:31 /var/log/secure-20121021
-rw------- 1 root root 293397 Nov  7 06:19 /var/log/secure-20121107
-rw------- 1 root root   3204 Nov  7 07:03 /var/log/secure

[email@example.com ~]# cat /var/log/secure-20121107 | grep -B2 Nov | head -3
Oct 24 21:19:11 ganglia1 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5% -p /
Oct 24 21:21:48 ganglia1 sudo: nagios : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/lib64/nagios/plugins/check_disk -w 10% -c 5% -p /
Nov  7 05:32:52 ganglia1 runuser: pam_unix(runuser:session): session opened for user root by (uid=0)

The VM had already stopped logging (for around two weeks), implying that this is probably a victim of the already-understood dedupe-based crashes of bug 804413, rather than something new from svmo's.
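The two-week gap in /var/log/secure is itself a detectable signal: a read-only root filesystem silently stops log writes long before anything alerts. A minimal staleness check might look like this (the one-day threshold and the focus on /var/log/secure are my assumptions, not anything prescribed in this bug):

```shell
# Sketch: flag a host whose security log has gone quiet.
# The 24-hour (1440-minute) threshold is an assumption; tune it per host.
log_is_stale() {
    # Succeeds if the file exists but has not been written in the last day.
    [ -e "$1" ] && [ -n "$(find "$1" -mmin +1440 2>/dev/null)" ]
}

if log_is_stale /var/log/secure; then
    echo "WARNING: /var/log/secure has not been written in over a day"
fi
```

Run from cron or wrapped as a Nagios check, this would have flagged ganglia1 roughly two weeks before the swap alert fired.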
Yeesh. Okay, so that brings up another question... Is there any way we can do a pass over the VMs to determine which might be in this state and just unnoticed? I know pir said that monitoring was planned for this, but is there anything we can do short term?
Short of writing a big `for` loop, not really. This is a woe-of-the-OS that isn't readily apparent to vmware directly, because it's basically "ok, I timed out sometime in the past, I give up and am not talking to you." Even if we tried to go to vmware for diagnosis by inference (disk writes being at zero), it's time-consuming data to gather (mostly reading graphs), and even then it would likely have too many false positives to be of any use.
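That "big `for` loop" could be sketched roughly as follows. The `$HOSTS` inventory variable and ssh-as-root access are assumptions (neither comes from this bug), and parsing `/proc/mounts` is one plausible way to spot a root filesystem that has flipped to read-only:

```shell
# Brute-force sweep sketch (assumptions: $HOSTS holds the VM inventory and
# we can ssh to each host; neither detail comes from this bug).
HOSTS="${HOSTS:-}"   # supply the real host list here

is_root_ro() {
    # Reads a /proc/mounts-style listing on stdin; succeeds if / is mounted ro.
    awk '$2 == "/" { print $4 }' | tr ',' '\n' | grep -qx 'ro'
}

for h in $HOSTS; do
    if ssh -o ConnectTimeout=5 "$h" cat /proc/mounts | is_root_ro; then
        echo "$h: root filesystem is read-only"
    fi
done
```

A write test (e.g. `touch` a scratch file) would catch the same condition more directly, but reading `/proc/mounts` avoids leaving debris on healthy hosts.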
Okay, we'll call this resolved and wait on the pending checks from the SRE group, then. Thanks.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations