Closed Bug 620948 Opened 15 years ago Closed 15 years ago

computers that need physical intervention

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bear, Assigned: zandr)

Details

The following computers require manual "looking at" to determine why they are offline
No longer depends on: 620041
Alias: reboots
linux-ix-slave37.build

moz:~ bear$ ssh cltbld@linux-ix-slave37.build.mozilla.org
ssh: connect to host linux-ix-slave37.build.mozilla.org port 22: Operation timed out
moz:~ bear$ ping linux-ix-slave37.build.mozilla.org
PING linux-ix-slave37.build.scl1.mozilla.com (10.12.48.231): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
(In reply to comment #1)
> linux-ix-slave37.build

console says:
puppetd returned non-zero, sleeping for 60...trying again
repeated continuously.

linux-ix-slave08 is saying the same thing, immediately after a reimage. My first thought was that this was due to a hostname issue, but if you can't even ping it, there's no way to get in and fix hostname.
Probably needs its hostkey cleared on the scl and/or mpt puppet masters.
oops, i mean MV, not scl
(In reply to comment #3)
> Probably needs its hostkey cleared on the scl and/or mpt puppet masters.

let me do that now
linux-ix-slave37 has attached to the master and is ok
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
(In reply to comment #7)
> talos-r3-fed64-001

Seriously? This was reimaged *last night*.

Did it do any work overnight?
moz2-darwin10-slave01
(In reply to comment #8)
> (In reply to comment #7)
> > talos-r3-fed64-001
>
> Seriously? This was reimaged *last night*.
>
> Did it do any work overnight?

At least I couldn't reach it and Nagios reports it as unpingable for 19 hours.
Assignee: server-ops → zandr
talos-r3-w7-036
An updated and collated list:
bm-xserve18
linux-ix-slave13
linux-ix-slave14
linux-ix-slave16
moz2-darwin10-slave53
moz2-darwin10-slave54
mv-moz2-linux-ix-slave05
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
talos-r3-fed64-047
talos-r3-fed64-055
talos-r3-leopard-003
talos-r3-w7-036
w32-ix-slave03
w32-ix-slave08
w32-ix-slave41
talos-r3-fed-031: "Couldn't mount root filesystem": reimaged
talos-r3-fed-032:
talos-r3-fed64-001: rebooted: responds to ssh
talos-r3-fed64-011: weird dhcp problem, reimaged
talos-r3-fed64-018: same hang at USB: reimaged
talos-r3-fed64-022: "Couldn't mount root filesystem": reimaged
talos-r3-fed64-047: not in scl1, may not exist, cc :jhford for comment
talos-r3-fed64-055: not in scl1, may not exist, cc :jhford for comment
talos-r3-leopard-003: WFM: responds to ssh, vnc
talos-r3-w7-036:
w32-ix-slave41: presumed drive failure. Imaging at 20MB/*minute* and falling. see bug 615744
Oops, hit save too early.
talos-r3-fed-032: was hung at grey boot screen: rebooted
talos-r3-w7-036: was hung at grey boot screen: rebooted

And that's it for scl1.
linux-ix-slave14 is also MIA. Not present in 650.
That machine was given to IX to investigate the issues in bug 596366 (comment 11)
t-r3-w764-018 does not respond via SSH or VNC to any of the cltbld passwords I know. If a restart doesn't help, either a re-image or a changed VNC and cltbld password would be great.
t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a poking-at.
I managed to guess the VNC password for t-r3-w764-018, but it doesn't match the cltbld account's password. This system probably just needs a re-image.
(In reply to comment #18)
> t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a
> poking-at.

And this machine is apparently feeling better this afternoon - I successfully connected via VNC, so I guess it's OK (??)
t-r3-w764-034 was accidentally shut down rather than rebooted, so it will need to be started up.
linux-ix-slave34 has poor performance: Timing buffered disk reads: 66 MB in 3.02 seconds = 21.84 MB/sec
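The figure above has the shape of an `hdparm -t` result line. As a rough sketch only, a line like that could be parsed and compared against a floor so slow disks are flagged automatically. The function names and the 50 MB/sec threshold are assumptions made for illustration; nothing in this bug defines what counts as "poor".

```python
import re

# Matches the trailing "= 21.84 MB/sec" part of an hdparm -t result line.
_RATE_RE = re.compile(r"=\s*([\d.]+)\s*MB/sec")


def read_rate_mb_per_sec(line):
    """Return the MB/sec figure from an hdparm-style timing line, or None."""
    match = _RATE_RE.search(line)
    return float(match.group(1)) if match else None


def is_slow(line, threshold=50.0):
    """True if the line reports a rate below `threshold` (hypothetical value)."""
    rate = read_rate_mb_per_sec(line)
    return rate is not None and rate < threshold
```

Feeding it the line quoted for linux-ix-slave34 would flag that machine under the assumed threshold; a line with no rate in it is simply not flagged.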
w64-ix-slave27 is not responding on its normal interface, nor IPMI
w64-ix-slave35 also not responding to IPMI
Machines for the next MPT visit:
bm-xserve18
moz2-darwin10-slave53
moz2-darwin10-slave54
Assignee: zandr → phong
Flags: colo-trip+
(In reply to comment #25)
> Machines for the next MPT visit:
>
> bm-xserve18
> moz2-darwin10-slave53
> moz2-darwin10-slave54

rebooted.
Assignee: phong → zandr
moz2-linux-slave03 does not have any of the root passwords I know of. It'd be good to have the password reset -- or the whole box re-imaged, if that's easier.
(In reply to comment #27)
> moz2-linux-slave03 does not have any of the root passwords I know of. It'd be
> good to have the password reset -- or the whole box re-imaged, if that's
> easier.

Repurposed bug 610308 for this.
failing ping checks:
talos-r3-fed-023
talos-r3-fed-025
talos-r3-fed64-009
talos-r3-w7-011
talos-r3-w7-032
talos-r3-w7-036
moz2-darwin10-slave01
moz2-darwin10-slave26
(In reply to comment #30)
> moz2-darwin10-slave26

Disregard, see bug 620594
bm-xserve02 is offline. Not sure why. This blocks bug 578234.
Blocks: 578234
talos-r3-fed64-019
bm-xserve10
bm-xserve20
bm-xserve23
bm-xserve24

All four of these xserves report the following error when I try to connect:

nc: getaddrinfo: Name or service not known

Are these machines in DNS?
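The `nc: getaddrinfo` error above is a name-resolution failure, which happens before any connection is attempted. A minimal sketch of checking a batch of names the same way is below; the helper name `unresolvable` and the injectable `resolve` parameter are illustrative assumptions, not part of any tooling mentioned in this bug.

```python
import socket


def unresolvable(hosts, resolve=socket.getaddrinfo):
    """Return the hosts whose names fail to resolve.

    `resolve` defaults to socket.getaddrinfo; it is a parameter only so the
    check can be exercised without touching real DNS.
    """
    missing = []
    for host in hosts:
        try:
            resolve(host, 22)  # port 22 is arbitrary; only the lookup matters
        except socket.gaierror:
            # Same class of failure nc reports as "Name or service not known".
            missing.append(host)
    return missing
```

Calling `unresolvable([...])` with the four xserve names, fully qualified, would return the ones missing from DNS as seen from the machine running the check.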
talos-r3-fed-003: blank screen -> rebooted -> date problem
talos-r3-fed-023: grey screen -> rebooted
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-042: blank screen -> rebooted -> date problem
talos-r3-w7-011: grey screen -> rebooted
talos-r3-w7-032: grey screen -> rebooted
talos-r3-w7-034: turned on
talos-r3-w7-036: grey screen -> rebooted
talos-r3-fed64-009: hang after USB -> needs reimage
talos-r3-fed64-019: hang after USB -> needs reimage
talos-r3-fed64-029: blank screen -> rebooted -> date problem
w64-ix-slave27-ipmi: fixed
w64-ix-slave35-ipmi: bad cable, needs proper (5') replacement
To clarify, I did not reimage the two fed64 machines (not enough time in the DC), but I did put in a temp patch for w64-ix-slave35.
bm-xserve11 is refusing connections.
(In reply to comment #37)
> bm-xserve11 is refusing connections.

The most common failure mode for our xserves is the RAID array getting corrupted and freezing up under I/O load (i.e. building!). The fix is to re-image. Not saying that's necessary yet, just FYI.
talos-r3-w7-024 needs to come to MV (attn: clyon)
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-048: blank screen -> rebooted -> date problem
talos-r3-fed64-009: reimaged - hostname update pending
talos-r3-fed64-019: reimaged - hostname update pending
(In reply to comment #40)
> talos-r3-fed-025: grey screen -> rebooted

Up and running now

> talos-r3-fed-048: blank screen -> rebooted -> date problem

Up and running. What's the date problem?
(In reply to comment #41)
> (In reply to comment #40)
> > talos-r3-fed-048: blank screen -> rebooted -> date problem

And I should say 'fixed'. That note was for my own tracking.

> Up and running. What's the date problem?

System date had reverted to 1/1/01, fsck complaining about superblock timestamp in the future. bug 561442
Talos boxes that have gone down in the last few days (hwclock fix not helping?):
talos-r3-fed-008
talos-r3-fed-026
talos-r3-fed-050
talos-r3-fed64-017
talos-r3-fed64-024
talos-r3-fed64-033
Did they die running mozmill-all? If so, they may have a crashreporter process running, and that may stop the reboot. I saw quite a few of them Friday (not sure whether I filed a bug; I can't find one).
A hanging mozmill would block OS restart well before the network interface went away, and so wouldn't block PINGs, but I double checked the masters for all 6 of those machines. They don't have any hanging mozmill (or jetpack) jobs, and in almost all cases the last job a machine did before failing to reboot was not a mozmill job. When it was mozmill, it looked like the job ran fine and terminated normally.
moz2-darwin9-slave05 seems to be gone from the network, according to nagios and ssh
talos-r3-fed64-038 - same
talos-r3-fed-037
talos-r3-fed-024
Here's an updated list based on my trip through nagios and the slave-wrangling spreadsheet (see URL):
bm-xserve11 - reboot - comment 37
moz2-darwin9-slave05 - reboot - comment 46
t-r3-w764-018 - try to reset passwords, or reimage - comment 19
talos-r3-fed-008 - reboot
talos-r3-fed-024 - reboot
talos-r3-fed-026 - reboot
talos-r3-fed-037 - reboot
talos-r3-fed-050 - reboot
talos-r3-fed64-017 - reboot
talos-r3-fed64-024 - reboot
talos-r3-fed64-033 - reboot
talos-r3-fed64-038 - reboot
talos-r3-fed64-047 - reboot
talos-r3-fed64-055 - reboot
try-mac-slave28 - reboot
w32-ix-slave03 - reboot - comment 12
w32-ix-slave08 - ssh works but nothing else, spoke, maybe reimage
talos-r3-fed-053 - reboot
The w7 slaves are not easy to detect.
The only check we have is PING.

I see a solution.
An email in the morning with list of w7 machines that have not done a job for a day, the master it was connected to and a tail of twistd.log to help us determine state.

On another note, I kicked a few of the w7 slaves back up:
talos-r3-w7-004 - SIGKILL failed. I VNCed; no CMD running; rebooted - down since 16th - last job jetpack for 3 days
talos-r3-w7-029 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-038 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; did *not* reboot - we should debug - down since 13th - last job run was jetpack for 5 days
talos-r3-w7-049 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-051 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 3 days

Could one of you check talos-r3-w7-004? It did not come back and I can't reach it.

TODO debug talos-r3-w7-038 for more info
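The morning email proposed in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions: how the last job time and the master for each slave would actually be obtained is not described in this bug, so the `last_jobs` mapping, the function names, and the report wording are all hypothetical.

```python
from collections import deque
from datetime import timedelta


def idle_slaves(last_jobs, now, max_idle=timedelta(days=1)):
    """Return (slave, master, last_job_time) rows idle longer than max_idle.

    `last_jobs` maps slave name -> (master name, datetime of last finished
    job). Rows come back oldest first, so the longest-idle slave leads.
    """
    stale = [
        (slave, master, finished)
        for slave, (master, finished) in last_jobs.items()
        if now - finished > max_idle
    ]
    return sorted(stale, key=lambda row: row[2])


def tail(lines, count=20):
    """Return the last `count` lines of any iterable, e.g. an open twistd.log."""
    return list(deque(lines, maxlen=count))


def format_report(rows, now):
    """Render one line per idle slave for the body of the email."""
    out = []
    for slave, master, finished in rows:
        days = (now - finished).days  # whole days since the last job
        out.append("%s (master %s): no job for %d day(s)" % (slave, master, days))
    return "\n".join(out)
```

A real report would presumably append `tail(open(path))` for each slave's twistd.log beneath its line, with `path` depending on where the logs are collected.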
(In reply to comment #52)
> The w7 slaves are not easy to detect.
> The only check we have is PING.
>
> I see a solution.
> An email in the morning with list of w7 machines that have not done a job for a
> day, the master it was connected to and a tail of twistd.log to help us
> determine state.

Really, we should have some way to get better information to nagios. bug 627126.

> TODO debug talos-r3-w7-038 for more info

Can you file a bug for this and update the spreadsheet?
(In reply to comment #53)
> > TODO debug talos-r3-w7-038 for more info
>
> Can you file a bug for this and update the spreadsheet?

It is back to the pool. Nothing else to follow up.
talos-r3-w7-038 - needs a manual reboot as I can't ssh/vnc/rdc to it
> moz2-darwin9-slave05 - reboot - comment 46

This machine took hours to restart yesterday, and today took at least 30m:

13:50 < nagios> [40] moz2-darwin9-slave05.build:PING is CRITICAL: CRITICAL - Host Unreachable (10.2.71.202)
14:10 < nagios> moz2-darwin9-slave05.build:PING is OK: PING OK - Packet loss = 0%, RTA = 0.31 ms
(CDT times = PDT-0200)

So it should probably be restarted manually and its behavior monitored. I've disabled buildslave on it.
talos-r3-fed-047 needs a reboot
bm-xserve11 doesn't any more, got moved today
(In reply to comment #57)
> talos-r3-fed-047 needs a reboot

Actually, it's on a cart in Mountain View - bug 626923.
That's fed64-047 by the looks.
Correct, sorry.
More talos slaves down:
talos-r3-fed64-037
talos-r3-fed64-041
bm-xserve06 - this was moved yesterday, and came up after that, but at this point everything but ping is down on the host.
talos-r3-w7-004: confused by comment 52, as it's up and running tests now
talos-r3-w7-024: reimaged
talos-r3-w7-038: rebooted
t-r3-w764-018: reimaged
talos-r3-fed-008: date problem -> fsck
talos-r3-fed-024: gray screen -> reboot
talos-r3-fed-026: date problem -> fsck
talos-r3-fed-037: gray screen -> reboot
talos-r3-fed-050: date problem -> fsck
talos-r3-fed-053: date problem -> fsck
talos-r3-fed64-017: hang at boot -> reimage - needs hostname
talos-r3-fed64-024: date problem -> fsck
talos-r3-fed64-033: date problem -> fsck
talos-r3-fed64-037: date problem -> fsck
talos-r3-fed64-038: hang at boot -> reimage - needs hostname
talos-r3-fed64-041: date problem
(In reply to comment #63)
> talos-r3-fed64-017: hang at boot -> reimage - needs hostname
> talos-r3-fed64-038: hang at boot -> reimage - needs hostname

Fixed hostname, passwords, puppet, & buildbot.tac. Back in production.
No worries zandr. Here is a pointer for Monday -> bug 624124
talos-r3-fed-036
mv-moz2-linux-ix-slave22
talos-r3-xp-019 seems to be stuck in shutdown. It will need to be rebooted.
talos-r3-xp-001 also needs a reboot.
Reboots:
talos-r3-fed64-016
talos-r3-xp-030
I managed to resurrect both talos-r3-xp-001 and talos-r3-xp-019 by being a little more patient with RDP.
No longer blocks: 620517
linux-ix-slave42 - unpingable, see also 624207
talos-r3-fed64-022
talos-r3-w7-036
talos-r3-fed64-016: date problem -> fsck
talos-r3-fed64-022: looked OK, but wasn't on the network -> reboot
talos-r3-xp-030: blank screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-fed-036: grey screen -> reboot
talos-r3-w7-036 - wasted no time killing itself again. The rest seem to be up.
talos-r3-w7-020
(In reply to comment #34)
> bm-xserve10
> bm-xserve20
> bm-xserve23
> bm-xserve24
>
> All four of these xserves report the following error when I try to connect:
>
> nc: getaddrinfo: Name or service not known
>
> Are these machines in DNS?

Any update on these xserves? It's blocking bug 580346.
Did those work at one time? None are in DNS.
Readded to DNS based on some old entries:
build.mozilla.org.deprecated:bm-xserve20 IN A 10.2.71.158
build.mozilla.org.deprecated:bm-xserve10 IN A 10.2.71.10
71.2.10.in-addr.arpa:42 IN PTR bm-xserve23.build.mozilla.org.
71.2.10.in-addr.arpa:43 IN PTR bm-xserve24.build.mozilla.org.
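If the last two entries above are read as grep output from a reverse zone file (file name, then the record), the PTR owner names `42` and `43` are relative to `71.2.10.in-addr.arpa`, which would put those hosts at 10.2.71.42 and 10.2.71.43. A small sketch of how forward and reverse names relate, using Python's standard `ipaddress` module, is below; `zone_lines` and its output format are illustrative assumptions, not the zone tooling used here.

```python
import ipaddress


def ptr_name(address):
    """Return the reverse-lookup name for an IP address, without trailing dot."""
    return ipaddress.ip_address(address).reverse_pointer


def zone_lines(host, address, domain="build.mozilla.org"):
    """Return a matching (A record, PTR record) pair as zone-file style text."""
    a_record = "%s IN A %s" % (host, address)
    ptr_record = "%s. IN PTR %s.%s." % (ptr_name(address), host, domain)
    return a_record, ptr_record
```

Generating both records from one host and address pair is one way to keep forward and reverse entries from drifting apart when machines are re-added by hand.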
try-mac-slave42
we are still not seeing bm-xserve06 online
zandr: This bug is getting a bit unwieldy. I'm going to create a new one and try to bring forward all the slaves that still require attention.
Alias: reboots
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard