Closed Bug 629511 Opened 13 years ago Closed 13 years ago

computers that need physical intervention

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: zandr)

References

()

Details

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #620948 +++

The following computers require manual "looking at" to determine why they are offline.

bm-xserve06
linux-ix-slave42
talos-r3-w7-020
talos-r3-w7-036
try-mac-slave42
talos-r3-fed-029
talos-r3-w7-011
moz2-darwin9-slave40 is refusing all SSH and NRPE connections.
(moz2-darwin9-slave40 moved to bug 629763, since it seems quite ill)
(In reply to comment #0)
> talos-r3-w7-020
> talos-r3-w7-036

Can we prevent these two from starting buildbot?
I need them to come up online and install DirectX and the Nvidia driver update.

I can't think of any other solution than temporarily removing them from buildbot. (We could try booting them and making the change very quickly, but that would have to happen before they pick up a change, and that is not easy AFAIK!)
Attachment #507973 - Flags: review?(dustin)
talos-r3-w7-040 doesn't even seem to be in DNS.
Blocks: 624044
Indeed, there is no such machine in inventory.  After a brief look at bug 620948, I'm not sure how that name got on the list.
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36

I'm probably not the right guy to review this patch.
Attachment #507973 - Flags: review?(dustin)
(In reply to comment #6)
> talos-r3-w7-040 doesn't even seem to be in DNS.

There's an empty slot for it on the rack, but I have never seen it. Why we skipped 40 remains a mystery.
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36

Passing it to coop :)
Attachment #507973 - Flags: review?(coop)
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36

>+                                             range(21,36) + range(36,40) + range(41,54)],

That should be range(37,40) to exclude 36. r+ with that change.
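
For illustration, the off-by-one in the quoted line can be checked in isolation. This is a minimal sketch, not the actual buildbot-configs patch: only the quoted fragment of the line is reproduced, the `numbers` and `slave_names` helpers and the zero-padded naming pattern are assumptions made for the example, and the ranges are wrapped in list() so the snippet also runs on Python 3.

```python
# Sketch only: reproduces just the fragment quoted in the review comment.
# Python 3 range objects cannot be joined with "+", hence the list() wrappers.

def numbers(middle_start):
    """Slave numbers from the quoted fragment; 40 is skipped in both versions."""
    return list(range(21, 36)) + list(range(middle_start, 40)) + list(range(41, 54))


def slave_names(nums, prefix="talos-r3-w7-"):
    """Hypothetical helper: zero-padded names in the style used in this bug."""
    return ["%s%03d" % (prefix, n) for n in nums]


as_submitted = numbers(36)  # range(36,40) starts at 36, so 36 is still listed
as_reviewed = numbers(37)   # range(37,40) starts at 37, so 36 is dropped

print("talos-r3-w7-036" in slave_names(as_submitted))
print("talos-r3-w7-036" in slave_names(as_reviewed))
```

Because range() excludes its end value but includes its start value, range(21,36) already stops at 35, and it is the start of the second range that has to move from 36 to 37.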
Attachment #507973 - Flags: review?(coop) → review+
bm-xserve21 is refusing NRPE and SSH gives:
  ssh_exchange_identification: Connection closed by remote host
talos-r3-fed64-050
Blocks: 630309
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36

We have to reconfigure the testing masters before slaves talos-r3-w7-020 & talos-r3-w7-036 can come back online.

The change has landed on the "default" branch:
http://hg.mozilla.org/build/buildbot-configs/rev/09e5b421ffe5

I am planning on doing a reconfig in the morning.
Attachment #507973 - Attachment description: disable temporarily slaves 20 & 36 → [checked in] disable temporarily slaves 20 & 36
No longer blocks: 624044
Ravi says he has kicked:

bm-xserve06
try-mac-slave42
bm-xserve21
(In reply to comment #15)
> bm-xserve21

bm-xserve21 is failing nagios checks again. Can we get someone from IT to run some diagnostics on it since it's failing repeatedly?
Comment on attachment 507973 [details] [diff] [review]
[deployed] disable temporarily slaves 20 & 36

Slaves talos-r3-w7-020 and talos-r3-w7-036 can now be brought back online without any worries.
Attachment #507973 - Attachment description: [checked in] disable temporarily slaves 20 & 36 → [deployed] disable temporarily slaves 20 & 36
try-mac-slave28 is still offline pending IT investigation (bug 586892).
talos-r3-w7-032 is unpingable.
talos-r3-w7-020: grey screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-w7-032: grey screen -> reboot
talos-r3-fed64-050: date problem -> fsck
The whole list for this bug, according to the slave spreadsheet, is now:

bm-xserve06 - pingable, but nothing else
linux-ix-slave15
linux-ix-slave42
moz2-darwin9-slave05 - very slow to restart
moz2-darwin9-slave40 - connections refused, very slow login
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-047
talos-r3-w7-011
try-mac-slave42
w32-ix-slave03
w32-ix-slave08 - VNC, IPMI don't work
talos-r3-fed64-011
moz2-darwin10-slave40
talos-r3-snow-009 didn't come back from a software reboot, so it requires a power cycle.
Blocks: 631587
(In reply to comment #24)
> talos-r3-snow-009 didn't come back from a software reboot, so it requires a
> power cycle.

Never mind, this machine came back on its own eventually.
No longer blocks: 631587
I managed to resurrect linux-ix-slave42 today, so it no longer needs intervention.
talos-r3-fed64-004
talos-r3-fed-037
(In reply to comment #26)
> I managed to resurrect linux-ix-slave42 today, so it no longer needs
> intervention.

Hmm, that was on the list that I asked Spencer to reimage, but he told me this morning that he had trouble reimaging it.

If it hasn't been reimaged, I'd expect it to fall over again soon.

Spencer, what's the status on that machine?
w32-ix-slave05 - machine is up, stuck in the prelogin opsi stuff. ipmi is inaccessible
talos-r3-xp-042 - stuck in a reboot, with no VNC access and the shutdown command says "A Shutdown Is In Progress"
Disregard comment 31 regarding talos-r3-xp-042 - I forgot to try RDP, which worked.
bkero just reimaged w32-ix-slave03 so no need to reboot it.
I just swept the spreadsheet to catch any boxes that were restored without being annotated here (only a few).  The latest list of reboots is:

linux-ix-slave15
moz2-darwin10-slave40
moz2-darwin9-slave51
mv-moz2-linux-ix-slave07
mv-moz2-linux-ix-slave22
talos-r3-fed-024
talos-r3-fed-029
talos-r3-fed-037
talos-r3-fed-047
talos-r3-fed64-004
talos-r3-fed64-011
talos-r3-fed64-013
talos-r3-fed64-036
talos-r3-w7-011
talos-r3-w7-036
w32-ix-slave05
w32-ix-slave08

The IX boxes' IPMI didn't work, at least not by following the OOB IP in inventory.  Note that linux-ix-slave15 came back and failed again on the 7th, so it may require an extra bit of TLC.
talos-r3-fed-044
talos-r3-fed-003
talos-r3-fed-030
talos-r3-fed64-016
talos-r3-fed64-023
talos-r3-xp-039
talos-r3-fed-036
mv-moz2-linux-ix-slave21
talos-r3-xp-039: frozen solid at desktop (6:29AM in the taskbar) -> rebooted normally 

talos-r3-w7-011: gray screen -> rebooted normally
talos-r3-w7-036: gray screen -> rebooted normally

talos-r3-fed-003: blank screen -> rebooted normally
talos-r3-fed-024: looked OK, no network. -> rebooted normally
talos-r3-fed-029: looked OK, no network. -> rebooted normally
talos-r3-fed-030: blank screen -> date problem
talos-r3-fed-036: grey screen -> reboot
talos-r3-fed-037: grey screen -> reboot
talos-r3-fed-047: blank screen -> date problem

talos-r3-fed64-004: blank screen -> hang -> reimaged
talos-r3-fed64-011: looked OK, no network. -> rebooted normally
talos-r3-fed64-013: blank screen -> date problem
talos-r3-fed64-016: blank screen -> date problem
talos-r3-fed64-023: blank screen -> date problem
talos-r3-fed64-036: looked OK, no network. -> rebooted normally
Depends on: 634368
Looks like linux-ix-slave15 got reimaged about 5 days ago?
talos-r3-fed-036 is down again, it managed about 20 reboots before failing.
talos-r3-xp-041 is hung at OPSI.
talos-r3-snow-032 is refusing ssh & vnc. Needs a reboot, and possibly a reimage.
talos-r3-fed64-014
talos-r3-w7-036
(In reply to comment #41)
> talos-r3-xp-041 is hung at OPSI.

Cancel this one, it rebooted on its own and is working normally now.
talos-r3-xp-004
talos-r3-fed-028
talos-r3-fed64-054
talos-r3-xp-004: gray screen -> reboot

talos-r3-w7-036: gray screen -> reboot

talos-r3-fed-028: date problem
talos-r3-fed-036: up, but no network lease (this is the "looked OK" state in comment 38).

talos-r3-fed64-014: gray screen -> reboot
talos-r3-fed64-054: gray screen -> reboot

talos-r3-snow-032: blue desktop + pinwheel (looks like a hang shutting down) rebooted a couple of times normally
linux-ix-slave15: reimaged, hostname fixed, in puppetd loop.
moz2-darwin9-slave51: rebooted in bug 634368
moz2-darwin10-slave40: rebooted in bug 634368

Thus endeth this bug. Nothing to carry forward, so please start a new bug for the next interventions.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Alias: reboots
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations