+++ This bug was initially created as a clone of Bug #620948 +++

The following computers require manual "looking at" to determine why they are offline.

bm-xserve06
linux-ix-slave42
talos-r3-w7-020
talos-r3-w7-036
try-mac-slave42
moz2-darwin9-slave40 is refusing all SSH and NRPE connections.
(moz2-darwin9-slave40 moved to bug 629763, since it seems quite ill)
Created attachment 507973
[deployed] disable temporarily slaves 20 & 36

(In reply to comment #0)
> talos-r3-w7-020
> talos-r3-w7-036

Can we prevent these two from starting buildbot? I need them to come up online so that I can install DirectX and the Nvidia driver update.

I can't think of any solution other than temporarily removing them from buildbot. (We could try to boot them and make the change very quickly, but that would require finishing before either slave picks up a change, and that is not easy AFAIK!)
talos-r3-w7-040 doesn't even seem to be in DNS.
Indeed, there is no such machine in inventory. After a brief look at bug 620948, I'm not sure how that name got on the list.
Comment on attachment 507973
[deployed] disable temporarily slaves 20 & 36

I'm probably not the right guy to review this patch.
(In reply to comment #6)
> talos-r3-w7-040 doesn't even seem to be in DNS.

There's an empty slot for it on the rack, but I have never seen it. Why we skipped 40 remains a mystery.
Comment on attachment 507973
[deployed] disable temporarily slaves 20 & 36

Passing it to coop :)
Comment on attachment 507973
[deployed] disable temporarily slaves 20 & 36

>+ range(21,36) + range(36,40) + range(41,54)],

That should be range(37,40) to exclude 36. r+ with that change.
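To illustrate the review comment: a minimal sketch of why range(36,40) still includes slave 36, since range() includes its start value and excludes its stop value. Only the three ranges come from the quoted patch line. The helper name w7_slave_names is hypothetical, the name format is inferred from the slave names in this bug, and list() is used so the sketch also runs on Python 3 (the patch itself concatenates Python 2 range lists directly).

```python
# Sketch only: the real slave list lives in buildbot-configs; this helper is illustrative.
def w7_slave_names(ranges):
    """Build talos-r3-w7-NNN names from a list of (start, stop) half-open ranges."""
    numbers = []
    for start, stop in ranges:
        # range() includes `start` and excludes `stop`.
        numbers += list(range(start, stop))
    return ["talos-r3-w7-%03d" % n for n in numbers]

# Ranges as posted in the patch vs. as requested in review.
as_posted = w7_slave_names([(21, 36), (36, 40), (41, 54)])
as_reviewed = w7_slave_names([(21, 36), (37, 40), (41, 54)])

print("talos-r3-w7-036" in as_posted)    # True: 36 is still enabled
print("talos-r3-w7-036" in as_reviewed)  # False: 36 is excluded
```

With the reviewed ranges, 36 and 40 are both absent from the generated names, while 37 through 39 remain.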
bm-xserve21 is refusing NRPE and SSH gives: ssh_exchange_identification: Connection closed by remote host
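For reports like this one, a rough sketch of a first-pass check that a slave accepts TCP connections on the SSH and NRPE ports. The function names are hypothetical, and port 5666 is the NRPE default, which is an assumption about this deployment. Note that a successful TCP connect does not prove the service is healthy: the ssh_exchange_identification error above means the connection was accepted and then closed before the SSH banner exchange.

```python
import socket

SSH_PORT = 22
NRPE_PORT = 5666  # NRPE default port; assumed, not confirmed for these hosts


def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
    sock.close()
    return True


def triage(host):
    """Report TCP reachability of the SSH and NRPE ports for one host."""
    return {
        "ssh": port_open(host, SSH_PORT),
        "nrpe": port_open(host, NRPE_PORT),
    }
```

Calling triage("bm-xserve21") from a machine on the build network would return a dict with one boolean per service, which is enough to separate "refusing connections" from "accepting but misbehaving".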
Comment on attachment 507973
[deployed] disable temporarily slaves 20 & 36

We have to reconfigure the testing masters before slaves talos-r3-w7-020 & talos-r3-w7-036 can come back online.

The change has landed on the "default" branch:
http://hg.mozilla.org/build/buildbot-configs/rev/09e5b421ffe5

I am planning on doing a reconfig in the morning.
Ravi says he has kicked:

bm-xserve06
try-mac-slave42
bm-xserve21
(In reply to comment #15)
> bm-xserve21

bm-xserve21 is failing nagios checks again. Can we get someone from IT to run some diagnostics on it since it's failing repeatedly?
Comment on attachment 507973
[deployed] disable temporarily slaves 20 & 36

Slaves talos-r3-w7-020 and talos-r3-w7-036 can now be brought back online without any worries.
try-mac-slave28 is still offline pending IT investigation (bug 586892).
talos-r3-w7-032 is unpingable.
talos-r3-w7-020: grey screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-w7-032: grey screen -> reboot
talos-r3-fed64-050: date problem -> fsck
The whole list for this bug, according to the slave spreadsheet, is now:

bm-xserve06 - pingable, but nothing else
linux-ix-slave15
linux-ix-slave42
moz2-darwin9-slave05 - very slow to restart
moz2-darwin9-slave40 - connections refused, very slow login
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-047
talos-r3-w7-011
try-mac-slave42
w32-ix-slave03
w32-ix-slave08 - VNC, IPMI don't work
talos-r3-snow-009 didn't come back from a software reboot, so it requires a power cycle.
(In reply to comment #24)
> talos-r3-snow-009 didn't come back from a software reboot, so it requires a
> power cycle.

Never mind, this machine came back on its own eventually.
I managed to resurrect linux-ix-slave42 today, so it no longer needs intervention.
(In reply to comment #26)
> I managed to resurrect linux-ix-slave42 today, so it no longer needs
> intervention.

Hmm, that was on the list that I asked Spencer to reimage, but he told me this morning that he had trouble reimaging it. If it hasn't been reimaged, I'd expect it to fall over again soon.

Spencer: what's the status on that machine?
w32-ix-slave05 - machine is up, but stuck in the pre-login OPSI stuff. IPMI is inaccessible.
talos-r3-xp-042 - stuck in a reboot, with no VNC access and the shutdown command says "A Shutdown Is In Progress"
Disregard comment 31 regarding talos-r3-xp-042 - I forgot to try RDP, which worked.
bkero just reimaged w32-ix-slave03 so no need to reboot it.
I just swept the spreadsheet to catch any boxes that were restored without being annotated here (only a few). The latest list of reboots is:

linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin9-slave51
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-037
talos-r3-fed-047
talos-r3-fed64-004
talos-r3-fed64-011
talos-r3-fed64-013
talos-r3-fed64-036
talos-r3-w7-011
talos-r3-w7-036
w32-ix-slave05
w32-ix-slave08

The IX boxes' IPMI didn't work, at least not by following the OOB IP in inventory.

Note that linux-ix-slave15 came back and failed again on the 7th, so it may require an extra bit of TLC.
talos-r3-fed-003
talos-r3-fed-030
talos-r3-fed64-016
talos-r3-fed64-023
talos-r3-xp-039
talos-r3-xp-039: frozen solid at desktop (6:29AM in the taskbar) -> rebooted normally
talos-r3-w7-011: gray screen -> rebooted normally
talos-r3-w7-036: gray screen -> rebooted normally
talos-r3-fed-003: blank screen -> rebooted normally
talos-r3-fed-024: looked OK, no network. -> rebooted normally
talos-r3-fed-029: looked OK, no network. -> rebooted normally
talos-r3-fed-030: blank screen -> date problem
talos-r3-fed-036: grey screen -> reboot
talos-r3-fed-037: grey screen -> reboot
talos-r3-fed-047: blank screen -> date problem
talos-r3-fed64-004: blank screen -> hang -> reimaged
talos-r3-fed64-011: looked OK, no network. -> rebooted normally
talos-r3-fed64-013: blank screen -> date problem
talos-r3-fed64-016: blank screen -> date problem
talos-r3-fed64-023: blank screen -> date problem
talos-r3-fed64-036: looked OK, no network. -> rebooted normally
Looks like linux-ix-slave15 got reimaged about 5 days ago?
talos-r3-fed-036 is down again; it managed about 20 reboots before failing.
talos-r3-xp-041 is hung at OPSI.
talos-r3-snow-032 is refusing ssh & vnc. Needs a reboot, and possibly a reimage.
(In reply to comment #41)
> talos-r3-xp-041 is hung at OPSI.

Cancel this one, it rebooted on its own and is working normally now.
talos-r3-xp-004: gray screen -> reboot
talos-r3-w7-036: gray screen -> reboot
talos-r3-fed-028: date problem
talos-r3-fed-036: up, but no network lease (this is the "looked OK" state in comment 38)
talos-r3-fed64-014: gray screen -> reboot
talos-r3-fed64-054: gray screen -> reboot
talos-r3-snow-032: blue desktop + pinwheel (looks like a hang while shutting down) -> rebooted a couple of times normally
linux-ix-slave15: reimaged, hostname fixed, in puppetd loop.
moz2-darwin9-slave51: rebooted in bug 634368
moz2-darwin10-slave40: rebooted in bug 634368

Thus endeth this bug. Nothing to carry forward, so please start a new bug for the next interventions.