Closed Bug 423809 Opened 16 years ago Closed 16 years ago

many build/unittest/talos/try machines down after MPT colo failure

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: joduinn, Assigned: anodelman)

References

Details

From 8:01 pm PDT until 9:25 pm PDT, there was a switch failure in MPT, which caused a packet storm and took down the entire colo. Nagios has been disabled for now because so many machines are still out. From justdave on IRC, here is a partial list of build machines still offline:

cerebus-vm
bm-xserve16
fx-win32-1.9-slave2
fx-win32-tbox
fxdbug-win32-tbox
l10n-win32-tbox
moz2-win32-slave1
patrocles
production-pacifica-vm
sm-staging-try1-win32-slave
sm-try1-win32-slave
sm-try2-win32-slave
tbnewref-win32-tbox
xr-win32-tbox

(staging-build-console has out-of-disk-space problems, which is a different issue and already acknowledged)
Assignee: nobody → joduinn
Priority: -- → P1
Unable to start fx-win32-tbox because of cvs.lock read errors. justdave fixed the cvs partition, which was not mounted correctly. fx-win32-tbox has now started OK and is building.
Note: I first had to stop/start the fx-win32-tbox VM within the VI client before I could even log in to it. Otherwise, RDC was consistently failing with "client could not connect with remote computer". After restarting the VM, I could connect with RDC and log in as expected.
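For reference only (not something that was actually run here), a minimal sketch of the kind of check one could run on the CVS server, or wherever the repository partition lives, to catch an unmounted or unreadable cvs partition before builds start failing with cvs.lock read errors. The mount point path is an assumption, not the real one.

# Hypothetical pre-flight check: confirm the CVS repository partition is
# really mounted and readable. The path below is an assumed example, not
# the actual mount point used in MPT.
import os
import sys

CVS_MOUNT = "/builds/cvsmirror"   # assumed location of the cvs partition

def cvs_partition_ok(path=CVS_MOUNT):
    # os.path.ismount() is only true for a real mount point, which catches
    # the "directory exists but nothing is mounted on it" failure mode.
    if not os.path.ismount(path):
        return False
    try:
        os.listdir(path)  # an I/O error here would surface cvs.lock-style read problems
    except OSError:
        return False
    return True

if __name__ == "__main__":
    if not cvs_partition_ok():
        sys.exit("cvs partition missing or unreadable; not starting builds")
    print("cvs partition looks fine")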

cerebus-vm now up and building after restart.
bm-xserve16 is physically up and accepting VNC connections. Looks like this machine was testing PGO-on-mac builds, so it's not clear whether any build slaves are supposed to be running here. Nagios is all happy with bm-xserve16, so I'm declaring success.
fx-win32-1.9-slave2 is physically up and accepting RDC connections. Nagios also thinks it's OK. Not sure if it's supposed to be used to run fx19l10nrel, fx129nit, or fx19rel.
fx-win32-1.9-slave2 is physically up and accepting work. Looks OK to me. Nagios also thinks it's OK.
(In reply to comment #6)
> fx-win32-1.9-slave2 is physically up and accepting work. Looks OK to me.
> Nagios also thinks it's OK.
Sorry, copy/paste error, I meant to say:

fxdbug-win32-tbox is physically up and accepting work. Nagios also thinks this VM is ok.
l10n-win32-tbox has been rebooted and is now physically up and accepting work. Nagios thinks this VM is OK now.
Some people and machines were reporting CVS locks held by xrbld, so I've shut down all the machines using that account (argo, xr-linux-tbox (had a read-only filesystem), xr-win32-tbox, bm-xserve07). Pinging justdave for a fix on the server side.
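As a rough illustration of the server-side cleanup being requested here (not the actual fix justdave applied), the sketch below walks a repository looking for CVS lock files held by one account. The repository path and the lock-file naming (#cvs.lock, #cvs.rfl.*, #cvs.wfl.*) are assumptions based on stock CVS behavior, so review the output before deleting anything.

# Hypothetical stale-lock scan: list CVS lock files owned by a given account
# so an admin can review and remove them by hand. Paths and lock names are
# assumed, not taken from the real MPT setup.
import os
import pwd

CVSROOT = "/cvsroot"      # assumed repository location
SUSPECT = "xrbld"         # account reported as holding locks

def find_locks(root=CVSROOT, owner=SUSPECT):
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if name == "#cvs.lock" or name.startswith(("#cvs.rfl", "#cvs.wfl")):
                full = os.path.join(dirpath, name)
                try:
                    uid = os.stat(full).st_uid
                    if pwd.getpwuid(uid).pw_name == owner:
                        hits.append(full)
                except (OSError, KeyError):
                    pass
    return hits

if __name__ == "__main__":
    for path in find_locks():
        print(path)       # review this list before removing anything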
Assignee: joduinn → nrthomas
Rebooted fx-linux-tbox, fxdbug-linux-tbox, and xr-linux-tbox for read-only file systems.
Boxes are able to pull from CVS now, so things are looking up. Had to remove the source tree on fx-win32-tbox as it lost a bunch of CVS/Entries files. Restarted fxdbug-win32-tbox correctly.
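For illustration only (not from the original comments): one way to spot a checkout that lost its CVS/Entries files, instead of discovering it mid-build, is to walk the tree for directories that contain a CVS/ subdirectory but no Entries file. The tree path below is a made-up example.

# Hypothetical checkout sanity scan: any directory with a CVS/ metadata
# directory but no CVS/Entries file indicates a damaged working copy that
# should be clobbered. The path is an assumed example slave location.
import os

TREE = r"c:\builds\tinderbox\fx-win32-tbox"   # assumed checkout location

def missing_entries(tree=TREE):
    bad = []
    for dirpath, dirnames, filenames in os.walk(tree):
        if "CVS" in dirnames:
            entries = os.path.join(dirpath, "CVS", "Entries")
            if not os.path.isfile(entries):
                bad.append(dirpath)
    return bad

if __name__ == "__main__":
    damaged = missing_entries()
    if damaged:
        print("Corrupt checkout, clobber recommended:")
        for d in damaged:
            print("  " + d)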
Alright, I've fixed the following machines:
fx-win32-1.9-slave1 (had to reboot)
fx-linux64-1.9-slave1 (had to reboot + fsck)
sm-staging-try1-win32-slave (had to reboot)
sm-try1-win32-slave (had to reboot)
sm-try2-win32-slave (had to reboot)
sm-try2-linux-slave (had to reboot + fsck)
staging-master (had to reboot + fsck)
moz2-linux-slave1 (had to reboot + fsck)
moz2-win32-slave1 (had to reboot)


fx-linux-1.9-slave1 is dead. The / filesystem is too corrupt to be useful. I've filed bug 423850 and will be cloning a replacement shortly.
Some talos machines are unreachable; alice just filed bug 423882.
Depends on: 423882
No longer depends on: 423882
qm-mini-vista05 is currently burning; alice is watching. It lost its connection to the graph server; we expect it to be OK on the next cycle.
Tinderbox machines are all back up and cycling. We need to check on the current cycle of qm-centos5-01 (a full clobber).
Assignee: nrthomas → anodelman
(forgot to say) over to Alice for the Talos fixes.
Depends on: 423882, 423923
Summary: many build machines down after MPT colo failure → many build/unittest/talos/try machines down after MPT colo failure
Severity: normal → blocker
qm-mini-xp01,03
qm-mini-vista01,03

are still missing. Is there an ETA on getting them back up?
(In reply to comment #17)
> qm-mini-xp01,03
> qm-mini-vista01,03
> 
> are still missing. Is there an ETA on getting them back up?

Are they really missing, or is bug 419071 just hiding them? See bug 424034, too.

What does http://qm-rhel02.mozilla.org:2006/ say about them?
These machines are up; they are just idle. The PGO builds take so long to create that a lot of the talos Windows boxes have been starved off the tree. I'm working through a few possible solutions, but this doesn't have anything to do with the colo failure.

If that's the only issue holding this bug open, it should be closed.
Closing. If you know of any machine still down since Tuesday's colo outage, please reopen this bug.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
The talos machines that you are referring to are starved of builds because of the long time it takes to produce PGO Windows builds (upwards of 1 hour and 40 minutes). Because of how buildbot allocates slaves, we end up exercising only a single talos tester machine. We've seen this behavior since PGO builds were enabled.

These machines are not down; they are only idle. There are other bugs tracking that issue, but it doesn't have to do with the MPT colo failure.
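For context, and not as a description of what was actually deployed: one buildbot-side mitigation for this kind of starvation is to override how a builder picks its next slave, so that infrequent PGO-triggered runs get spread across all attached testers instead of landing on the first available one. The builder/slave names and factory below are placeholders, and the nextSlave hook is the one exposed by the 0.8.x-era BuilderConfig API.

# Sketch of a master.cfg fragment: pick a random idle slave instead of the
# default "first available", so a single talos tester does not absorb every
# PGO-triggered run. All names here are placeholders.
import random
from buildbot.config import BuilderConfig
from buildbot.process.factory import BuildFactory

c = BuildmasterConfig = {}          # standard master.cfg boilerplate
c['builders'] = []

talos_factory = BuildFactory()      # placeholder; real steps defined elsewhere

def pick_random_slave(builder, available_slaves):
    # available_slaves is the list of idle slaves attached to this builder;
    # returning a random one spreads the occasional builds around.
    return random.choice(available_slaves)

c['builders'].append(BuilderConfig(
    name="talos-xp-tp",                            # placeholder builder name
    slavenames=["qm-mini-xp01", "qm-mini-xp03"],   # placeholder slave names
    factory=talos_factory,
    nextSlave=pick_random_slave,
))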

Re-closing.
Status: REOPENED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Talos → Release Engineering
Product: mozilla.org → Release Engineering