Closed Bug 461685 Opened 16 years ago Closed 16 years ago

balsa-18branch is down

Categories

(mozilla.org Graveyard :: Server Operations, task, P2)

x86
Linux

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 467634

People

(Reporter: nthomas, Assigned: phong)

Details

Attachments

(1 file)

Nagios first alerted that all the service checks on this box were timing out. It turns out something is using up all the CPU, but I can't find out what, as ssh sessions won't open and even the VI console is useless. Rebooted, then fixed some errors with fsck.
Back up, with a clobber of the build dir for good measure.
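For the record, the disk check was roughly the following, run after the reboot (the device names here are a guess, not copied from the box):
  fsck -f -y /dev/sda1
  fsck -f -y /dev/sdb1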
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Priority: -- → P1
Resolution: --- → FIXED
Looks like the same problem just started again, investigating.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
100% CPU, can't get an ssh session, and the console doesn't let me move the mouse to the running terminal - hence can't get any top output. Now rebooted; will fsck each drive and hope there is something in the system logs.
It's back building now.

There's nothing in the system logs at all, but the build log got as far as the cvs co of SeaMonkeyAll (and the last mod time of that file matches the point at which it started using 100% CPU, according to the VI client). So possibly a cvs/ssh/network glitch, or an I/O issue, since a cvs update is very I/O intensive. 

I've left a 'watch ps' running in the console terminal, so hopefully we will have some more info if this occurs again. I also left the focus on that terminal so that we might be able to kill any errant process - I couldn't find the magical keystroke to focus it during the two outages, and the mouse was non-responsive.
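Roughly what's running in that terminal (the refresh interval and the head count are my own choices, nothing special):
  watch -n 5 'ps auxw --sort=-pcpu | head -20'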
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
It's down *again*, I'm trying to check out the console right now...
Assignee: nthomas → bhearsum
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Unfortunately, I'm unable to switch to the system console or to focus the terminal window... I've rebooted the machine again, started tinderbox, and started 'watch ps aux' in the terminal inside of X - which we should be able to see next time.

It's up again for now..resolving this bug.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
And down again. Nothing showing in the X session, so X must be crashing out and getting restarted by the script. Off to disk check land we go; I'll probably reinstall vmware tools too.
Assignee: bhearsum → nthomas
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It was pulling SeaMonkeyAll from CVS when it died (again), which is pretty I/O intensive. No errors were found by fsck on either drive, so I've enabled syslog on runlevel 3 to try to get some info in the system logs.
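For reference, enabling that was roughly the following (assuming the stock Red Hat init scripts on this box):
  chkconfig --level 3 syslog on
  service syslog start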

I also looked for errors on the host machine, which was bm-vmware07 at the time of the failure. There's nothing at 19:30, but there is this a little earlier in /var/log/vmkwarning:
Nov  4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: SCSI: 119: Failing I/O due to too many reservation conflicts 
Nov  4 16:47:57 bm-vmware07 vmkernel: 180:05:53:45.427 cpu1:1035)WARNING: ScsiDevice: 2740: Failed for vml.020001000060a98000433466344a344969744731474c554e202020: SCSI reservation conflict 

mrz, is that netapp-c-001 ?
Priority: P1 → P2
And again about 3 hours ago. Everything in /var/log is populated but looks clean, and the mtime of the X log indicates it only started once. Failed out doing cvs again:
  cvs -q -z 3 co -P -r MOZILLA_1_8_BRANCH -D 11/06/2008 07:14 +0000 SeaMonkeyAll
I put a cron job in to dump 'ps auxw --sort -pcpu' to /tmp/ps.log every minute (nagios will warn when / is filling up).
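The crontab entry is roughly this (the date stamp and the redirect details are my additions, so the samples can be lined up with outage times):
  * * * * * date >> /tmp/ps.log; ps auxw --sort -pcpu >> /tmp/ps.log 2>&1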

Ideas for more exhaustive checks welcome.
And again, this time doing "c++ ... mozilla/dom/src/base/nsGlobalWindow.cpp", which is what the ps log shows too. Tinderbox restarted.
Went again at around 09:50:00, checking out SeaMonkeyAll. And again when I was just poking around on the box.
Attached file |rpm -Va| output
I've set VMware to expect Red Hat Enterprise Linux 2 rather than 4, and done a clean reinstall of the VMware tools. Not really expecting that to fix it up, but here's hoping.
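The tools reinstall was the usual untar-and-run dance, roughly (exact tarball name and path are from memory and may differ):
  tar xzf VMwareTools-*.tar.gz -C /tmp
  cd /tmp/vmware-tools-distrib
  ./vmware-install.pl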
Fixed for now.
Assignee: nthomas → nobody
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Continues to be problematic. As a last-ditch attempt, I'm moving the VM from the netapp-c-001 storage (which also holds the templates for new VMs) to netapp-d-sata-003.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Haven't really been paying attention to this, but I'd actually recommend moving to equallogic (yes, despite yesterday's problems) rather than the netapp.  Let me know if you need help doing that.
It's running on netapp-d-sata-003 for now, but feel free to move it to a suitable equallogic partition. I'm not sure how much free space you're keeping on each, and I can't see reliable free space numbers using the VI client (because only the build VMs are visible to me?). This VM only needs 14GB.
netapp-d-sata-003 didn't help, trying eql01-bm06.
... and it went to 100% CPU shortly after.

More worryingly, we now have a second machine hanging "randomly": prometheus-vm went boom at around 1800 PST. It uses eql-bm02, and was hosted on bm-vmware05 at the time; it's running "Red Hat Enterprise Linux AS release 3 (Taroon Update 8)" and a 2.4.21-27.0.4.EL kernel. 

mrz, could you please take a look at the ESX and storage array logs for any problems?
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → mrz
phong is best equipped to look at this.
Assignee: mrz → phong
FYI, balsa-18branch is currently off; prometheus-vm hasn't gone nuts again.
gentle ping... any update?
What are you looking for in these logs?
e.g. lost connections/dropped transactions between the ESX hosts and the network storage
(In reply to comment #24)
> e.g. lost connections/dropped transactions between the ESX hosts and the network storage

Actual dropped connections, as I know from past experience, result in total chaos and ESX hosts that die.  That is not the case here.
Did this happen a few weeks ago, when we had the issue with the equallogic?
I think this is related to the ESX build cluster being under high load - see bug 467634.
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → DUPLICATE
VM is now restarted; let's see if this canary down the mine will sing again.
balsa went boom again at 14:28 today, and started chewing a lot of CPU. I've re-educated the little punk with a reboot.
Product: mozilla.org → mozilla.org Graveyard