Closed Bug 423809 • Opened 17 years ago • Closed 17 years ago
many build/unittest/talos/try machines down after MPT colo failure
Categories: Release Engineering :: General, defect, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: joduinn; Assigned: anodelman
From 8:01 pm PDT until 9:25 pm PDT, there was a switch failure in MPT, which caused a packet storm that took down the entire colo. Nagios has been disabled for now, because so many machines are still out. From justdave on IRC, a partial list of build machines still offline:
cerebus-vm
bm-xserve16
fx-win32-1.9-slave2
fx-win32-tbox
fxdbug-win32-tbox
l10n-win32-tbox
moz2-win32-slave1
patrocles
production-pacifica-vm
sm-staging-try1-win32-slave
sm-try1-win32-slave
sm-try2-win32-slave
tbnewref-win32-tbox
xr-win32-tbox
(staging-build-console has out-of-disk-space problems, which is different and already acknowledged)
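A minimal sketch of how the hosts above could be checked for reachability (an illustration only, not part of the original report; it assumes a Linux host where these short names resolve, Python 3, and Linux ping options):

# Illustration only: ping each host from the list above once and report which
# ones answer. Assumes Linux ping options (-c, -W) and Python 3.
import subprocess

HOSTS = [
    "cerebus-vm", "bm-xserve16", "fx-win32-1.9-slave2", "fx-win32-tbox",
    "fxdbug-win32-tbox", "l10n-win32-tbox", "moz2-win32-slave1", "patrocles",
    "production-pacifica-vm", "sm-staging-try1-win32-slave",
    "sm-try1-win32-slave", "sm-try2-win32-slave", "tbnewref-win32-tbox",
    "xr-win32-tbox",
]

for host in HOSTS:
    rc = subprocess.call(["ping", "-c", "1", "-W", "2", host],
                         stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    print("%-28s %s" % (host, "up" if rc == 0 else "still down"))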
Reporter | Updated • 17 years ago
Assignee: nobody → joduinn
Priority: -- → P1
Reporter | Comment 1 • 17 years ago
Unable to start fx-win32-tbox because of cvs.lock read errors. justdave fixed the CVS partition, which was not mounted correctly. fx-win32-tbox has now started OK and is building.
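A minimal sketch of the kind of check that would catch this failure mode (an illustration only; the Linux CVS server and the /cvs mount point are assumptions, not details from this bug):

# Illustration only: warn if the expected CVS partition is not mounted,
# which is the failure mode that produced the cvs.lock read errors.
CVS_MOUNT = "/cvs"  # hypothetical mount point; not stated in this bug

with open("/proc/mounts") as f:
    mounted = [line.split()[1] for line in f]

if CVS_MOUNT in mounted:
    print("%s is mounted" % CVS_MOUNT)
else:
    print("WARNING: %s is not mounted; CVS will fail with lock/read errors" % CVS_MOUNT)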
Reporter | Comment 2 • 17 years ago
Note: I had to first stop/start the fx-win32-tbox VM within the VI client before I could even login to it. Otherwise, RDC was consistently failing out with "client could not connect with remote computer". After restarting the VM, I could connect with RDC, and login as expected.
Reporter | Comment 3 • 17 years ago
cerebus-vm now up and building after restart.
Reporter | Comment 4 • 17 years ago
bm-xserve16 is physically up and accepting VNC connections. It looks like this machine was testing PGO-on-mac builds, so it's not clear whether any build slaves are supposed to be running here. Nagios is all happy with bm-xserve16, so declaring success.
Reporter | Comment 5 • 17 years ago
fx-win32-1.9-slave2 is physically up and accepting RDC connections. Nagios also thinks it's OK. Not sure if it's supposed to be used to run fx19l10nrel, fx129nit, or fx19rel.
Reporter | Comment 6 • 17 years ago
fx-win32-1.9.slave2 is physically up and accepting work. Looks OK to me. Nagios also thinks it's OK.
Reporter | Comment 7 • 17 years ago
(In reply to comment #6)
> fx-win32-1.9.slave2 is physically up and accepting work. Looks OK to me.
> Nagios also thinks it's OK.
Sorry, copy/paste error, I meant to say:
fxdbug-win32-tbox is physically up and accepting work. Nagios also thinks this VM is ok.
Reporter | Comment 8 • 17 years ago
l10n-win32-tbox has been rebooted and is now physically up and accepting work. Nagios thinks this VM is OK now.
Comment 9 • 17 years ago
Some people and machines were reporting CVS locks held by xrbld, so I've shut down all the machines using that account (argo, xr-linux-tbox (had a read-only filesystem), xr-win32-tbox, bm-xserve07). Pinging justdave for a fix on the server side.
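A minimal sketch of how stale locks like these could be located on the server (illustrative only, not from this bug; the /cvs repository path is an assumption, and CVS names its lock files #cvs.lock, #cvs.rfl.*, and #cvs.wfl.*):

# Illustration only: walk the repository and list CVS lock files/directories
# together with their owners, to spot stale locks left by a build account.
import os
import pwd

CVSROOT = "/cvs"  # hypothetical repository path; not stated in this bug

for dirpath, dirnames, filenames in os.walk(CVSROOT):
    for name in dirnames + filenames:
        if name == "#cvs.lock" or name.startswith("#cvs.rfl") or name.startswith("#cvs.wfl"):
            path = os.path.join(dirpath, name)
            owner = pwd.getpwuid(os.stat(path).st_uid).pw_name
            print("%s (owned by %s)" % (path, owner))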
Assignee: joduinn → nrthomas
Comment 10 • 17 years ago
Rebooted fx-linux-tbox, fxdbug-linux-tbox, and xr-linux-tbox because of read-only file systems.
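A minimal sketch of how a read-only filesystem could be detected on these Linux VMs before deciding to reboot + fsck (illustrative only, not from this bug; some pseudo-filesystems are legitimately read-only and would need to be filtered):

# Illustration only: list filesystems currently mounted read-only; a read-only
# root filesystem is the usual cue that a reboot plus fsck is needed.
with open("/proc/mounts") as f:
    for line in f:
        device, mountpoint, fstype, options = line.split()[:4]
        if "ro" in options.split(","):
            print("read-only: %s on %s (%s)" % (device, mountpoint, fstype))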
Comment 11 • 17 years ago
Boxes are able to pull from CVS now, so things are looking up. Had to remove the source tree on fx-win32-tbox as it lost a bunch of CVS/Entries files. Restarted fxdbug-win32-tbox correctly.
Comment 12 • 17 years ago
Alright, I've fixed the following machines:
fx-win32-1.9-slave1 (had to reboot)
fx-linux64-1.9-slave1 (had to reboot + fsck)
sm-staging-try1-win32-slave (had to reboot)
sm-try1-win32-slave (had to reboot)
sm-try2-win32-slave (had to reboot)
sm-try2-linux-slave (had to reboot + fsck)
staging-master (had to reboot + fsck)
moz2-linux-slave1 (had to reboot + fsck)
moz2-win32-slave1 (had to reboot)
fx-linux-1.9-slave1 is dead. The / filesystem is too corrupt to be useful. I've filed bug 423850 and will be cloning a replacement shortly.
Reporter | Comment 13 • 17 years ago
Some talos machines are unreachable; alice just filed bug 423882.
Depends on: 423882
Reporter | Comment 14 • 17 years ago
qm-mini-vista05 is currently burning; alice is watching. It lost its connection with the graph server, and is expected to be OK on the next cycle.
Comment 15 • 17 years ago
Tinderbox machines are all back up and cycling. We need to check on the current cycle of qm-centos5-01 (a full clobber).
Assignee: nrthomas → anodelman
Comment 16 • 17 years ago
(forgot to say) over to Alice for the Talos fixes.
Reporter | Updated • 17 years ago
Updated • 17 years ago
Severity: normal → blocker
Comment 17 • 17 years ago
qm-mini-xp01,03 and qm-mini-vista01,03 are still missing. Is there an ETA on getting them back up?
Comment 18 • 17 years ago
(In reply to comment #17)
> qm-mini-xp01,03 and qm-mini-vista01,03 are still missing. Is there an ETA on
> getting them back up?
Are they really missing, or is bug 419071 just hiding them? See bug 424034, too.
What does http://qm-rhel02.mozilla.org:2006/ say about them?
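A minimal sketch of how that page could be polled for the four testers (illustrative only; it assumes the page is plain HTML that mentions each connected slave by name):

# Illustration only: fetch the buildbot status page and check whether each
# tester name appears in it. Assumes plain HTML listing slaves by name.
import urllib.request

URL = "http://qm-rhel02.mozilla.org:2006/"
TESTERS = ["qm-mini-xp01", "qm-mini-xp03", "qm-mini-vista01", "qm-mini-vista03"]

html = urllib.request.urlopen(URL, timeout=10).read().decode("utf-8", "replace")
for name in TESTERS:
    print("%-16s %s" % (name, "listed" if name in html else "not found"))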
Assignee | Comment 19 • 17 years ago
These machines are up; they are just idle. PGO builds take long enough to create that a lot of the Talos Windows boxes have been starved off the tree. I'm working through a few possible solutions, but this doesn't have anything to do with the colo failure.
If that's the only issue holding this bug open, it should be closed.
Reporter | Comment 20 • 17 years ago
Closing. If you know of any machine still down since Tuesday's colo outage, please reopen this bug.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Comment 21 • 17 years ago
http://graphs.mozilla.org/#spst=range&spss=1204638511.5477388&spse=1206124794&spstart=1196881975&spend=1206124794&bpst=cursor&bpstart=1204638511.5477388&bpend=1206124794&m1tid=53218&m1bl=0&m1avg=0
http://graphs.mozilla.org/#spst=range&spss=1204638511.5477388&spse=1206124794&spstart=1196882424&spend=1206124488&bpst=cursor&bpstart=1204638511.5477388&bpend=1206124794&m1tid=53236&m1bl=0&m1avg=0
xp1,3 report every few days. Did they just come online?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | Comment 22 • 17 years ago
The talos machines that you are referring to are starved of builds due to the long time it takes to produce PGO Windows builds - upwards of 1 hour and 40 minutes. Because of how buildbot allocates slaves, we end up only exercising a single talos tester machine. We've seen this behavior since PGO builds were enabled.
These machines are not down; they are only idle. There are other bugs tracking that issue, but it doesn't have to do with the MPT colo failure.
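As a toy illustration of the starvation effect described above (this is not buildbot's actual scheduler code, just a sketch of the behavior): if each new build is handed to the first idle tester, and builds arrive rarely because a PGO build takes ~1h40m, every tester is idle again by the time the next build lands, so the same machine is always chosen.

# Toy model only (not buildbot code): each build goes to the first idle tester.
# Because PGO builds are slow and arrive one at a time, every tester is idle
# when a build lands, so the first one in the list does all the work.
testers = ["qm-mini-xp01", "qm-mini-xp03", "qm-mini-vista01", "qm-mini-vista03"]
run_counts = {name: 0 for name in testers}

for build in range(10):          # ten infrequent PGO builds
    idle = list(testers)         # all idle by the time each slow build finishes
    chosen = idle[0]             # "first idle" is therefore always the same box
    run_counts[chosen] += 1

print(run_counts)                # only qm-mini-xp01 ever gets exercised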
Re-closing.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering