Closed Bug 461685 Opened 16 years ago Closed 16 years ago

balsa-18branch is down

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 467634

People

(Reporter: nthomas, Assigned: phong)

Details

Attachments

(1 file)

Nagios first alerted that all the service checks on this box were timing out. It turns out something is using up all the CPU, but I can't find out what, as ssh sessions won't open and even the VI console is useless. Rebooted, then fixed some errors with fsck.
Back up, with a clobber of the build dir for good measure.
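For the record, the disk check was roughly the following, run after the reboot (the device names here are a guess, not copied from the box):
  fsck -f -y /dev/sda1
  fsck -f -y /dev/sdb1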
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Priority: -- → P1
Resolution: --- → FIXED
Looks like the same problem just started again, investigating.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
100% CPU, can't get an ssh session, and the console doesn't let me move the mouse to the running terminal - hence can't get any top output. Now rebooted; will fsck each drive and hope there is something in the system logs.
It's back building now.

There's nothing in the system logs at all, but the build log got as far as the cvs co of SeaMonkeyAll (and the last mod time of that file matches the point at which it started using 100% CPU, according to the VI client). So possibly a cvs/ssh/network glitch, or an I/O issue, since a cvs update is very I/O intensive. 

I've left a 'watch ps' running in the console terminal, so hopefully we will have some more info if this occurs again. I also left the focus on that terminal so that we might be able to kill any errant process - I couldn't find the magical keystroke to focus it during the two outages, and the mouse was non-responsive.
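Roughly what's running in that terminal (the refresh interval and the head count are my own choices, nothing special):
  watch -n 5 'ps auxw --sort=-pcpu | head -20'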
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
It's down *again*, I'm trying to check out the console right now...
Assignee: nthomas → bhearsum
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Unfortunately, I'm unable to switch to the system console or to focus the terminal window... I've rebooted the machine again, started tinderbox, and started 'watch ps aux' in the terminal inside of X - which we should be able to see next time.

It's up again for now..resolving this bug.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
And down again. Nothing showing in the X session, so X must be crashing out and getting restarted by the script. Off to disk check land we go; I'll probably reinstall vmware tools too.
Assignee: bhearsum → nthomas
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It was pulling SeaMonkeyAll from CVS when it died (again), which is pretty I/O intensive. No errors were found by fsck on either drive, so I've enabled syslog on runlevel 3 to try to get some info in the system logs.
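For reference, enabling that was roughly the following (assuming the stock Red Hat init scripts on this box):
  chkconfig --level 3 syslog on
  service syslog start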

I also looked for errors on the host machine, which was bm-vmware07 at the time of the failure. There's nothing at 19:30, but there is this a little earlier in /var/log/vmkwarning:
Nov  4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts 
Nov  4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: ScsiDevice: 2740: Failed for vml.020001000060a98000433466344a344969744731474c554e202020: SCSI reservation conflict 

mrz, is that netapp-c-001 ?
Priority: P1 → P2
And again about 3 hours ago. Everything in /var/log is populated but looks clean, and the mtime of the X log indicates it only started once. Failed out doing cvs again:
  cvs -q -z 3 co -P -r MOZILLA_1_8_BRANCH -D 11/06/2008 07:14 +0000 SeaMonkeyAll
I put a cron job in to dump 'ps auxw --sort -pcpu' to /tmp/ps.log every minute (nagios will warn when / is filling up).
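The crontab entry is roughly this (the date stamp and the redirect details are my additions, so the samples can be lined up with outage times):
  * * * * * date >> /tmp/ps.log; ps auxw --sort -pcpu >> /tmp/ps.log 2>&1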

Ideas for more exhaustive checks welcome.
And again, this time doing "c++ ... mozilla/dom/src/base/nsGlobalWindow.cpp", which is what the ps log shows too. Tinderbox restarted.
Went again at around 09:50:00, checking out SeaMonkeyAll. And again when I was just poking around on the box.
Attached file |rpm -Va| output
I've set VMware to expect Red Hat Enterprise Linux 2 rather than 4, and done a clean reinstall of the VMware tools. Not really expecting that to fix it up, but here's hoping.
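The tools reinstall was the usual untar-and-run dance, roughly (exact tarball name and path are from memory and may differ):
  tar xzf VMwareTools-*.tar.gz -C /tmp
  cd /tmp/vmware-tools-distrib
  ./vmware-install.pl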
Fixed for now.
Assignee: nthomas → nobody
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Continues to be problematic. As a last-ditch attempt, I'm moving the VM from the netapp-c-001 storage (which also holds the templates for new VMs) to netapp-d-sata-003.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Haven't really been paying attention to this, but I'd actually recommend moving to equallogic (yes, despite yesterday's problems) rather than the netapp.  Let me know if you need help doing that.
It's running on netapp-d-sata-003 for now, but feel free to move it to a suitable equallogic partition. I'm not sure how much free space you're keeping on each, and I can't see reliable free space numbers using the VI client (because only the build VMs are visible to me?). This VM only needs 14GB.
netapp-d-sata-003 didn't help, trying eql01-bm06.
... and it went to 100% CPU shortly after.

More worryingly, we now have a second machine hanging "randomly": prometheus-vm went boom at around 1800 PST. It uses eql-bm02, and was hosted on bm-vmware05 at the time; it's running "Red Hat Enterprise Linux AS release 3 (Taroon Update 8)" and a 2.4.21-27.0.4.EL kernel. 

mrz, could you please take a look at the ESX and storage array logs for any problems?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → mrz
phong is best equipped to look at this.
Assignee: mrz → phong
FYI, balsa-18branch is currently off; prometheus-vm hasn't gone nuts again.
gentle ping... any update?
What are you looking for in these logs?
e.g. lost connections/dropped transactions between the ESX hosts and the network storage
(In reply to comment #24)
> e.g. lost connections/dropped transactions between the ESX hosts and the network storage

Actual dropped connections, as I know from past experience, result in total chaos and ESX hosts that die.  That is not the case here.
Did this happen a few weeks ago, when we had the issue with the equallogic?
I think this is related to the ESX build cluster being under high load - see bug 467634.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → DUPLICATE
VM is now restarted; let's see if this canary down the mine will sing again.
balsa went boom again at 14:28 today, and started chewing a lot of CPU. I've re-educated the little punk with a reboot.
Product: mozilla.org → mozilla.org Graveyard