Closed Bug 423809 Opened 16 years ago Closed 16 years ago

many build/unittest/talos/try machines down after MPT colo failure

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: anodelman)

References

Details

From 8:01 pm PDT until 9:25 pm PDT, there was a switch failure in MPT, which caused a packet storm and took down the entire colo. Nagios has been disabled for now because so many machines are still out. From justdave on IRC, here is a partial list of build machines still offline:

cerebus-vm
bm-xserve16
fx-win32-1.9-slave2
fx-win32-tbox
fxdbug-win32-tbox
l10n-win32-tbox
moz2-win32-slave1
patrocles
production-pacifica-vm
sm-staging-try1-win32-slave
sm-try1-win32-slave
sm-try2-win32-slave
tbnewref-win32-tbox
xr-win32-tbox

(staging-build-console has out-of-disk-space problems, which is a different issue and already acknowledged)
Assignee: nobody → joduinn
Priority: -- → P1
Unable to start fx-win32-tbox because of cvs.lock read errors. justdave fixed the cvs partition, which was not mounted correctly. fx-win32-tbox has now started OK and is building.
Note: I first had to stop/start the fx-win32-tbox VM within the VI client before I could even log in to it. Otherwise, RDC was consistently failing with "client could not connect with remote computer". After restarting the VM, I could connect with RDC and log in as expected.
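For reference only (not something that was actually run here), a minimal sketch of the kind of check one could run on the CVS server, or wherever the repository partition lives, to catch an unmounted or unreadable cvs partition before builds start failing with cvs.lock read errors. The mount point path is an assumption, not the real one.

# Hypothetical pre-flight check: confirm the CVS repository partition is
# really mounted and readable. The path below is an assumed example, not
# the actual mount point used in MPT.
import os
import sys

CVS_MOUNT = "/builds/cvsmirror"   # assumed location of the cvs partition

def cvs_partition_ok(path=CVS_MOUNT):
    # os.path.ismount() is only true for a real mount point, which catches
    # the "directory exists but nothing is mounted on it" failure mode.
    if not os.path.ismount(path):
        return False
    try:
        os.listdir(path)  # an I/O error here would surface cvs.lock-style read problems
    except OSError:
        return False
    return True

if __name__ == "__main__":
    if not cvs_partition_ok():
        sys.exit("cvs partition missing or unreadable; not starting builds")
    print("cvs partition looks fine")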

cerebus-vm now up and building after restart.
bm-xserve16 is physically up and accepting VNC connections. Looks like this machine was testing PGO-on-mac builds, so it's not clear whether any build slaves are supposed to be running here. Nagios is all happy with bm-xserve16, so I'm declaring success.
fx-win32-1.9-slave2 is physically up and accepting RDC connections. Nagios also thinks it's OK. Not sure if it's supposed to be used to run fx19l10nrel, fx129nit, or fx19rel.
fx-win32-1.9-slave2 is physically up and accepting work. Looks OK to me. Nagios also thinks it's OK.
(In reply to comment #6)
> fx-win32-1.9-slave2 is physically up and accepting work. Looks OK to me.
> Nagios also thinks it's OK.
Sorry, copy/paste error, I meant to say:

fxdbug-win32-tbox is physically up and accepting work. Nagios also thinks this VM is ok.
l10n-win32-tbox has been rebooted and is now physically up and accepting work. Nagios thinks this VM is OK now.
Some people and machines were reporting CVS locks held by xrbld, so I've shut down all the machines using that account (argo, xr-linux-tbox (had a read-only filesystem), xr-win32-tbox, bm-xserve07). Pinging justdave for a fix on the server side.
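As a rough illustration of the server-side cleanup being requested here (not the actual fix justdave applied), the sketch below walks a repository looking for CVS lock files held by one account. The repository path and the lock-file naming (#cvs.lock, #cvs.rfl.*, #cvs.wfl.*) are assumptions based on stock CVS behavior, so review the output before deleting anything.

# Hypothetical stale-lock scan: list CVS lock files owned by a given account
# so an admin can review and remove them by hand. Paths and lock names are
# assumed, not taken from the real MPT setup.
import os
import pwd

CVSROOT = "/cvsroot"      # assumed repository location
SUSPECT = "xrbld"         # account reported as holding locks

def find_locks(root=CVSROOT, owner=SUSPECT):
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name == "#cvs.lock" or name.startswith(("#cvs.rfl", "#cvs.wfl")):
                full = os.path.join(dirpath, name)
                try:
                    uid = os.stat(full).st_uid
                    if pwd.getpwuid(uid).pw_name == owner:
                        hits.append(full)
                except (OSError, KeyError):
                    pass
    return hits

if __name__ == "__main__":
    for path in find_locks():
        print(path)       # review this list before removing anything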
Assignee: joduinn → nrthomas
Rebooted fx-linux-tbox, fxdbug-linux-tbox, and xr-linux-tbox for read-only file systems.
Boxes are able to pull from CVS now, so things are looking up. Had to remove the source tree on fx-win32-tbox as it lost a bunch of CVS/Entries files. Restarted fxdbug-win32-tbox correctly.
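For illustration only (not from the original comments): one way to spot a checkout that lost its CVS/Entries files, instead of discovering it mid-build, is to walk the tree for directories that contain a CVS/ subdirectory but no Entries file. The tree path below is a made-up example.

# Hypothetical checkout sanity scan: any directory with a CVS/ metadata
# directory but no CVS/Entries file indicates a damaged working copy that
# should be clobbered. The path is an assumed example slave location.
import os

TREE = r"c:\builds\tinderbox\fx-win32-tbox"   # assumed checkout location

def missing_entries(tree=TREE):
    bad = []
    for dirpath, dirnames, filenames in os.walk(tree):
        if "CVS" in dirnames:
            entries = os.path.join(dirpath, "CVS", "Entries")
            if not os.path.isfile(entries):
                bad.append(dirpath)
    return bad

if __name__ == "__main__":
    damaged = missing_entries()
    if damaged:
        print("Corrupt checkout, clobber recommended:")
        for d in damaged:
            print("  " + d)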
Alright, I've fixed the following machines:
fx-win32-1.9-slave1 (had to reboot)
fx-linux64-1.9-slave1 (had to reboot + fsck)
sm-staging-try1-win32-slave (had to reboot)
sm-try1-win32-slave (had to reboot)
sm-try2-win32-slave (had to reboot)
sm-try2-linux-slave (had to reboot + fsck)
staging-master (had to reboot + fsck)
moz2-linux-slave1 (had to reboot + fsck)
moz2-win32-slave1 (had to reboot)


fx-linux-1.9-slave1 is dead. The / filesystem is too corrupt to be useful. I've filed bug 423850 and will be cloning a replacement shortly.
Some talos machines are unreachable; alice just filed bug 423882.
Depends on: 423882
No longer depends on: 423882
qm-mini-vista05 is currently burning; alice is watching. It lost its connection to the graph server; we expect it to be OK on the next cycle.
Tinderbox machines are all back up and cycling. We need to check on the current cycle of qm-centos5-01 (a full clobber).
Assignee: nrthomas → anodelman
(forgot to say) over to Alice for the Talos fixes.
Depends on: 423882, 423923
Summary: many build machines down after MPT colo failure → many build/unittest/talos/try machines down after MPT colo failure
Severity: normal → blocker
qm-mini-xp01,03
qm-mini-vista01,03

are still missing. Is there an ETA on getting them back up?
(In reply to comment #17)
> qm-mini-xp01,03
> qm-mini-vista01,03
> 
> are still missing. Is there an ETA on getting them back up?

Are they really missing, or is bug 419071 just hiding them? See bug 424034, too.

What does http://qm-rhel02.mozilla.org:2006/ say about them?
These machines are up; they are just idle. The PGO builds take so long to create that a lot of the talos Windows boxes have been starved off the tree. I'm working through a few possible solutions, but this doesn't have anything to do with the colo failure.

If that's the only issue holding this bug open, it should be closed.
Closing. If you know of any machine still down since Tuesday's colo outage, please reopen this bug.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
The talos machines that you are referring to are starved of builds because of the long time it takes to produce PGO Windows builds (upwards of 1 hour and 40 minutes). Because of how buildbot allocates slaves, we end up exercising only a single talos tester machine. We've seen this behavior since PGO builds were enabled.

These machines are not down; they are only idle. There are other bugs tracking that issue, but it doesn't have to do with the MPT colo failure.
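For context, and not as a description of what was actually deployed: one buildbot-side mitigation for this kind of starvation is to override how a builder picks its next slave, so that infrequent PGO-triggered runs get spread across all attached testers instead of landing on the first available one. The builder/slave names and factory below are placeholders, and the nextSlave hook is the one exposed by the 0.8.x-era BuilderConfig API.

# Sketch of a master.cfg fragment: pick a random idle slave instead of the
# default "first available", so a single talos tester does not absorb every
# PGO-triggered run. All names here are placeholders.
import random
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory

c = BuildmasterConfig = {}          # standard master.cfg boilerplate
c['builders'] = []

talos_factory = BuildFactory()      # placeholder; real steps defined elsewhere

def pick_random_slave(builder, available_slaves):
    # available_slaves is the list of idle slaves attached to this builder;
    # returning a random one spreads the occasional builds around.
    return random.choice(available_slaves)

c['builders'].append(BuilderConfig(
    name="talos-xp-tp",                            # placeholder builder name
    slavenames=["qm-mini-xp01", "qm-mini-xp03"],   # placeholder slave names
    factory=talos_factory,
    nextSlave=pick_random_slave,
))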

Re-closing.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering