Closed
Bug 620948
Opened 15 years ago
Closed 15 years ago
computers that need physical intervention
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bear, Assigned: zandr)
Details
The following computers require manual "looking at" to determine why they are offline
| Reporter | ||
Updated•15 years ago
|
Alias: reboots
| Reporter | ||
Comment 1•15 years ago
|
||
linux-ix-slave37.build
moz:~ bear$ ssh cltbld@linux-ix-slave37.build.mozilla.org
ssh: connect to host linux-ix-slave37.build.mozilla.org port 22: Operation timed out
moz:~ bear$ ping linux-ix-slave37.build.mozilla.org
PING linux-ix-slave37.build.scl1.mozilla.com (10.12.48.231): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
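The manual ssh and ping checks above could be scripted for a whole list of slaves. This is a minimal sketch, not the tooling used in this bug: the function names `ssh_port_open` and `triage` are hypothetical, and it only tests whether TCP port 22 accepts a connection, which is a narrower check than ICMP ping.

```python
import socket


def ssh_port_open(host, port=22, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


def triage(hosts, check=ssh_port_open):
    """Split hosts into (reachable, unreachable) lists, preserving input order."""
    reachable, unreachable = [], []
    for host in hosts:
        (reachable if check(host) else unreachable).append(host)
    return reachable, unreachable
```

Passing a different `check` callable makes it possible to swap in another probe (or a stub for testing) without touching `triage`.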
Updated•15 years ago
|
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
| Assignee | ||
Comment 2•15 years ago
|
||
(In reply to comment #1)
> linux-ix-slave37.build
console says:
puppetd returned non-zero, sleeping for 60...trying again
repeated continuously. linux-ix-slave08 is saying the same thing, immediately after a reimage.
My first thought was that this was due to a hostname issue, but if you can't even ping it, there's no way to get in and fix hostname.
Comment 3•15 years ago
|
||
Probably needs its hostkey cleared on the scl and/or mpt puppet masters.
Comment 4•15 years ago
|
||
oops, i mean MV, not scl
| Reporter | ||
Comment 5•15 years ago
|
||
(In reply to comment #3)
> Probably needs its hostkey cleared on the scl and/or mpt puppet masters.
let me do that now
| Reporter | ||
Comment 6•15 years ago
|
||
linux-ix-slave37 has attached to the master and is ok
Comment 7•15 years ago
|
||
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
| Assignee | ||
Comment 8•15 years ago
|
||
(In reply to comment #7)
> talos-r3-fed64-001
Seriously? This was reimaged *last night*.
Did it do any work overnight?
Comment 9•15 years ago
|
||
moz2-darwin10-slave01
Comment 10•15 years ago
|
||
(In reply to comment #8)
> (In reply to comment #7)
>
> > talos-r3-fed64-001
>
> Seriously? This was reimaged *last night*.
>
> Did it do any work overnight?
At least I couldn't reach it and Nagios reports it as unpingable for 19 hours.
Updated•15 years ago
|
Assignee: server-ops → zandr
Comment 11•15 years ago
|
||
talos-r3-w7-036
Comment 12•15 years ago
|
||
An updated and collated list:
bm-xserve18
linux-ix-slave13
linux-ix-slave14
linux-ix-slave16
moz2-darwin10-slave53
moz2-darwin10-slave54
mv-moz2-linux-ix-slave05
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
talos-r3-fed64-047
talos-r3-fed64-055
talos-r3-leopard-003
talos-r3-w7-036
w32-ix-slave03
w32-ix-slave08
w32-ix-slave41
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
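A list assembled from several comments can repeat host names, as the one above does. A small sketch of one way to collate such a list follows; the helper name `collate` is hypothetical and the ordering is plain lexicographic sorting, which is an assumption about what is convenient rather than anything the thread specifies.

```python
def collate(hosts):
    """Return host names with duplicates and blank entries removed, sorted."""
    return sorted({h.strip() for h in hosts if h.strip()})
```

For example, `collate(["talos-r3-fed-031", "bm-xserve18", "talos-r3-fed-031"])` keeps a single `talos-r3-fed-031` entry.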
| Assignee | ||
Comment 13•15 years ago
|
||
talos-r3-fed-031: "Couldn't mount root filesystem": reimaged
talos-r3-fed-032:
talos-r3-fed64-001: rebooted: responds to ssh
talos-r3-fed64-011: weird dhcp problem, reimaged
talos-r3-fed64-018: same hang at USB: reimaged
talos-r3-fed64-022: "Couldn't mount root filesystem": reimaged
talos-r3-fed64-047: not in scl1, may not exist, cc :jhford for comment
talos-r3-fed64-055: not in scl1, may not exist, cc :jhford for comment
talos-r3-leopard-003: WFM: responds to ssh, vnc
talos-r3-w7-036:
w32-ix-slave41: presumed drive failure. Imaging at 20MB/*minute* and falling. see bug 615744
| Assignee | ||
Comment 14•15 years ago
|
||
Oops, hit save too early.
talos-r3-fed-032: was hung at grey boot screen: rebooted
talos-r3-w7-036: was hung at grey boot screen: rebooted
And that's it for scl1.
| Assignee | ||
Comment 15•15 years ago
|
||
linux-ix-slave14 is also MIA. Not present in 650.
Comment 16•15 years ago
|
||
That machine was given to IX to investigate the issues in bug 596366 (comment 11)
Comment 17•15 years ago
|
||
t-r3-w764-018 does not respond via SSH or VNC to any of the cltbld passwords I know. If a restart doesn't help, either a re-image or a changed VNC and cltbld password would be great.
Comment 18•15 years ago
|
||
t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a poking-at.
Comment 19•15 years ago
|
||
I managed to guess the VNC password for t-r3-w764-018, but it doesn't match the cltbld account's password. This system probably just needs a re-image.
Comment 20•15 years ago
|
||
(In reply to comment #18)
> t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a
> poking-at.
And this machine is apparently feeling better this afternoon - I successfully connected via VNC, so I guess it's OK (??)
Comment 21•15 years ago
|
||
t-r3-w764-034 was accidentally shut down rather than rebooted, so it will need to be started up.
Comment 22•15 years ago
|
||
linux-ix-slave34 has poor performance:
Timing buffered disk reads: 66 MB in 3.02 seconds = 21.84 MB/sec
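The line above has the shape of `hdparm -t` output. If many slaves were being checked, the rate could be extracted and compared against a floor. This is only a sketch: the function names are hypothetical, and the 50 MB/sec threshold is an arbitrary assumption, not a figure from this bug.

```python
import re

# Matches the summary line that hdparm prints for a buffered read test.
_HDPARM_RE = re.compile(r"Timing buffered disk reads:\s+.*=\s+([\d.]+)\s+MB/sec")


def buffered_read_rate(line):
    """Return the reported MB/sec as a float, or None if the line does not match."""
    match = _HDPARM_RE.search(line)
    return float(match.group(1)) if match else None


def is_slow(line, threshold_mb_s=50.0):
    """True when a rate was parsed and it is below threshold_mb_s (assumed value)."""
    rate = buffered_read_rate(line)
    return rate is not None and rate < threshold_mb_s
```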
| Assignee | ||
Comment 23•15 years ago
|
||
w64-ix-slave27 is not responding on its normal interface, nor IPMI
| Assignee | ||
Comment 24•15 years ago
|
||
w64-ix-slave35 also not responding to IPMI
| Assignee | ||
Comment 25•15 years ago
|
||
Machines for the next MPT visit:
bm-xserve18
moz2-darwin10-slave53
moz2-darwin10-slave54
Updated•15 years ago
|
Assignee: zandr → phong
Flags: colo-trip+
Comment 26•15 years ago
|
||
(In reply to comment #25)
> Machines for the next MPT visit:
>
> bm-xserve18
> moz2-darwin10-slave53
> moz2-darwin10-slave54
rebooted.
Assignee: phong → zandr
Comment 27•15 years ago
|
||
moz2-linux-slave03 does not have any of the root passwords I know of. It'd be good to have the password reset -- or the whole box re-imaged, if that's easier.
Comment 28•15 years ago
|
||
(In reply to comment #27)
> moz2-linux-slave03 does not have any of the root passwords I know of. It'd be
> good to have the password reset -- or the whole box re-imaged, if that's
> easier.
Repurposed bug 610308 for this.
| Assignee | ||
Comment 29•15 years ago
|
||
failing ping checks:
talos-r3-fed-023
talos-r3-fed-025
talos-r3-fed64-009
talos-r3-w7-011
talos-r3-w7-032
talos-r3-w7-036
| Assignee | ||
Comment 30•15 years ago
|
||
moz2-darwin10-slave01
moz2-darwin10-slave26
| Assignee | ||
Updated•15 years ago
|
Blocks: releng-nagios
| Assignee | ||
Comment 31•15 years ago
|
||
Comment 32•15 years ago
|
||
bm-xserve02 is offline. Not sure why. This blocks bug 578234.
Blocks: 578234
| Assignee | ||
Comment 33•15 years ago
|
||
talos-r3-fed64-019
Comment 34•15 years ago
|
||
bm-xserve10
bm-xserve20
bm-xserve23
bm-xserve24
All four of these xserves report the following error when I try to connect:
nc: getaddrinfo: Name or service not known
Are these machines in DNS?
| Assignee | ||
Comment 35•15 years ago
|
||
talos-r3-fed-003: blank screen -> rebooted -> date problem
talos-r3-fed-023: grey screen -> rebooted
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-042: blank screen -> rebooted -> date problem
talos-r3-w7-011: grey screen -> rebooted
talos-r3-w7-032: grey screen -> rebooted
talos-r3-w7-034: turned on
talos-r3-w7-036: grey screen -> rebooted
talos-r3-fed64-009: hang after USB -> needs reimage
talos-r3-fed64-019: hang after USB -> needs reimage
talos-r3-fed64-029: blank screen -> rebooted -> date problem
w64-ix-slave27-ipmi: fixed
w64-ix-slave35-ipmi: bad cable, needs proper (5') replacement
| Assignee | ||
Comment 36•15 years ago
|
||
To clarify, I did not reimage the two fed64 machines (not enough time in the DC), but I did put in a temp patch for w64-ix-slave35.
| Assignee | ||
Comment 37•15 years ago
|
||
bm-xserve11 is refusing connections.
Comment 38•15 years ago
|
||
(In reply to comment #37)
> bm-xserve11 is refusing connections.
The most common failure mode for our xserves is the RAID array getting corrupted and freezing up under I/O load (i.e. building!). The fix is to re-image. Not saying that's necessary yet, just FYI.
| Assignee | ||
Comment 39•15 years ago
|
||
talos-r3-w7-024 needs to come to MV (attn: clyon)
| Assignee | ||
Comment 40•15 years ago
|
||
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-048: blank screen -> rebooted -> date problem
talos-r3-fed64-009: reimaged - hostname update pending
talos-r3-fed64-019: reimaged - hostname update pending
Comment 41•15 years ago
|
||
(In reply to comment #40)
> talos-r3-fed-025: grey screen -> rebooted
Up and running now
> talos-r3-fed-048: blank screen -> rebooted -> date problem
Up and running. What's the date problem?
| Assignee | ||
Comment 42•15 years ago
|
||
(In reply to comment #41)
> (In reply to comment #40)
> > talos-r3-fed-048: blank screen -> rebooted -> date problem
And I should say 'fixed'. That note was for my own tracking.
> Up and running. What's the date problem?
System date had reverted to 1/1/01, fsck complaining about superblock timestamp in the future. bug 561442
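The "date problem" comes down to the system clock reading earlier than a timestamp already recorded on disk. A sketch of that comparison follows, assuming the two timestamps have already been obtained somehow; the function name and the tolerance parameter are hypothetical, and this is not how fsck itself is implemented.

```python
from datetime import datetime, timedelta


def clock_looks_reset(now, superblock_last_write, tolerance=timedelta(0)):
    """True when the clock reads earlier than the filesystem's last write time.

    A last-write time that lies in the future relative to `now` is the
    condition that fsck complains about after the clock has reverted.
    """
    return now + tolerance < superblock_last_write
```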
Comment 43•15 years ago
|
||
Talos boxes that have gone down in the last few days (hwclock fix not helping?):
talos-r3-fed-008
talos-r3-fed-026
talos-r3-fed-050
talos-r3-fed64-017
talos-r3-fed64-024
talos-r3-fed64-033
| Reporter | ||
Comment 44•15 years ago
|
||
did they die running mozmill-all? if so they may have a crashreporter process running and that may stop the reboot.
I saw quite a few of them Friday (not sure if I filed a bug or not (can't find one))
Comment 45•15 years ago
|
||
A hanging mozmill would block OS restart well before the network interface went away, and so wouldn't block PINGs, but I double-checked the masters for all 6 of those machines. They don't have any hanging mozmill (or jetpack) jobs, and the last job each machine ran before it failed to reboot was not a mozmill job in almost all cases. When it was mozmill, it looked like the job ran fine and terminated normally.
Comment 46•15 years ago
|
||
moz2-darwin9-slave05 seems to be gone from the network, according to nagios and ssh
Comment 47•15 years ago
|
||
talos-r3-fed64-038 - same
Comment 48•15 years ago
|
||
talos-r3-fed-037
Comment 49•15 years ago
|
||
talos-r3-fed-024
Comment 50•15 years ago
|
||
Here's an updated list based on my trip through nagios and the slave-wrangling spreadsheet (see URL)
bm-xserve11 - reboot - comment 37
moz2-darwin9-slave05 - reboot - comment 46
t-r3-w764-018 - try to reset passwords, or reimage - comment 19
talos-r3-fed-008 - reboot
talos-r3-fed-024 - reboot
talos-r3-fed-026 - reboot
talos-r3-fed-037 - reboot
talos-r3-fed-050 - reboot
talos-r3-fed64-017 - reboot
talos-r3-fed64-024 - reboot
talos-r3-fed64-033 - reboot
talos-r3-fed64-038 - reboot
talos-r3-fed64-047 - reboot
talos-r3-fed64-055 - reboot
try-mac-slave28 - reboot
w32-ix-slave03 - reboot - comment 12
w32-ix-slave08 - ssh works but nothing else, spoke, maybe reimage
Comment 51•15 years ago
|
||
talos-r3-fed-053 - reboot
Comment 52•15 years ago
|
||
The w7 slaves are not easy to detect.
The only check we have is PING.
I see a solution.
An email in the morning with list of w7 machines that have not done a job for a day, the master it was connected to and a tail of twistd.log to help us determine state.
On another note, I kicked a few of the w7 slaves back up:
talos-r3-w7-004 - SIGKILL failed. I VNCed; no CMD running; rebooted - down since 16th - last job jetpack for 3 days
talos-r3-w7-029 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-038 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; did *not* reboot - we should debug - down since 13th - last job run was jetpack for 5 days
talos-r3-w7-049 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-051 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 3 days
Could one of you check talos-r3-w7-004? It did not come back and I can't reach it.
TODO debug talos-r3-w7-038 for more info
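The morning report proposed in this comment (machines with no job for a day, plus a tail of twistd.log) could be sketched roughly as below. Everything here is an assumption for illustration: the function names are hypothetical, and how the last-job times would be read from the masters is left out entirely.

```python
from collections import deque
from datetime import datetime, timedelta


def stale_slaves(last_job, now, max_idle=timedelta(days=1)):
    """Return sorted slave names whose last job finished more than max_idle ago.

    last_job maps slave name -> datetime of its most recent finished job.
    """
    return sorted(name for name, finished in last_job.items()
                  if now - finished > max_idle)


def tail(lines, n=20):
    """Return the last n items of an iterable of lines, e.g. an open twistd.log."""
    return list(deque(lines, maxlen=n))
```

`tail` accepts any iterable, so an open file object can be passed directly without reading the whole log into memory first.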
Comment 53•15 years ago
|
||
(In reply to comment #52)
> The w7 slaves are not easy to detect.
> The only check we have is PING.
>
> I see a solution.
> An email in the morning with list of w7 machines that have not done a job for a
> day, the master it was connected to and a tail of twistd.log to help us
> determine state.
Really, we should have some way to get better information to nagios. bug 627126.
> TODO debug talos-r3-w7-038 for more info
Can you file a bug for this and update the spreadsheet?
Comment 54•15 years ago
|
||
(In reply to comment #53)
> > TODO debug talos-r3-w7-038 for more info
>
> Can you file a bug for this and update the spreadsheet?
It is back to the pool. Nothing else to follow up.
Comment 55•15 years ago
|
||
talos-r3-w7-038 - needs a manual reboot as I can't ssh/vnc/rdc to it
Comment 56•15 years ago
|
||
> moz2-darwin9-slave05 - reboot - comment 46
This machine took hours to restart yesterday, and today took at least 30m:
13:50 < nagios> [40] moz2-darwin9-slave05.build:PING is CRITICAL: CRITICAL - Host Unreachable (10.2.71.202)
14:10 < nagios> moz2-darwin9-slave05.build:PING is OK: PING OK - Packet loss = 0%, RTA = 0.31 ms
(CDT times = PDT-0200)
So it should probably be restarted manually and its behavior monitored. I've disabled buildslave on it.
Comment 57•15 years ago
|
||
talos-r3-fed-047 needs a reboot
bm-xserve11 doesn't any more, got moved today
Comment 58•15 years ago
|
||
(In reply to comment #57)
> talos-r3-fed-047 needs a reboot
Actually, it's on a cart in Mountain View - bug 626923.
Comment 59•15 years ago
|
||
That's fed64-047 by the looks.
Comment 60•15 years ago
|
||
Correct, sorry.
Comment 61•15 years ago
|
||
More talos slaves down:
talos-r3-fed64-037
talos-r3-fed64-041
Comment 62•15 years ago
|
||
bm-xserve06 - this was moved yesterday, and came up after that, but at this point everything but ping is down on the host.
| Assignee | ||
Comment 63•15 years ago
|
||
talos-r3-w7-004: confused by comment 52, as it's up and running tests now
talos-r3-w7-024: reimaged
talos-r3-w7-038: rebooted
t-r3-w764-018: reimaged
talos-r3-fed-008: date problem -> fsck
talos-r3-fed-024: gray screen -> reboot
talos-r3-fed-026: date problem -> fsck
talos-r3-fed-037: gray screen -> reboot
talos-r3-fed-050: date problem -> fsck
talos-r3-fed-053: date problem -> fsck
talos-r3-fed64-017: hang at boot -> reimage - needs hostname
talos-r3-fed64-024: date problem -> fsck
talos-r3-fed64-033: date problem -> fsck
talos-r3-fed64-037: date problem -> fsck
talos-r3-fed64-038: hang at boot -> reimage - needs hostname
talos-r3-fed64-041: date problem
Comment 64•15 years ago
|
||
(In reply to comment #63)
> talos-r3-fed64-017: hang at boot -> reimage - needs hostname
> talos-r3-fed64-038: hang at boot -> reimage - needs hostname
Fixed hostname, passwords, puppet, & buildbot.tac. Back in production.
Comment 65•15 years ago
|
||
No worries zandr.
Here is a pointer for Monday -> bug 624124
Comment 66•15 years ago
|
||
bm-xserve06 - bug 627230
Comment 67•15 years ago
|
||
talos-r3-fed-036
Comment 68•15 years ago
|
||
mv-moz2-linux-ix-slave22
Comment 69•15 years ago
|
||
talos-r3-xp-019 seems to be stuck in shutdown. It will need to be rebooted.
Comment 70•15 years ago
|
||
talos-r3-xp-001 also needs a reboot.
Comment 71•15 years ago
|
||
Reboots:
talos-r3-fed64-016
talos-r3-xp-030
Comment 72•15 years ago
|
||
I managed to resurrect both talos-r3-xp-001 and talos-r3-xp-019 by being a little more patient with RDP.
Comment 73•15 years ago
|
||
linux-ix-slave42 - unpingable, see also bug 624207
Comment 74•15 years ago
|
||
talos-r3-fed64-022
Comment 75•15 years ago
|
||
talos-r3-w7-036
| Assignee | ||
Comment 76•15 years ago
|
||
talos-r3-fed64-016: date problem -> fsck
talos-r3-fed64-022: looked OK, but wasn't on the network -> reboot
talos-r3-xp-030: blank screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-fed-036: grey screen -> reboot
Comment 77•15 years ago
|
||
talos-r3-w7-036 - wasted no time killing itself again. The rest seem to be up.
Comment 78•15 years ago
|
||
talos-r3-w7-020
Comment 79•15 years ago
|
||
(In reply to comment #34)
> bm-xserve10
> bm-xserve20
> bm-xserve23
> bm-xserve24
>
> All four of these xserves report the following error when I try to connect:
>
> nc: getaddrinfo: Name or service not known
>
> Are these machines in DNS?
Any update on these xserves? It's blocking bug 580346.
Comment 80•15 years ago
|
||
Did those work at one time? None are in DNS.
Comment 81•15 years ago
|
||
Readded to DNS based on some old entries:
build.mozilla.org.deprecated:bm-xserve20 IN A 10.2.71.158
build.mozilla.org.deprecated:bm-xserve10 IN A 10.2.71.10
71.2.10.in-addr.arpa:42 IN PTR bm-xserve23.build.mozilla.org.
71.2.10.in-addr.arpa:43 IN PTR bm-xserve24.build.mozilla.org.
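The PTR lines above live in the reverse zone `71.2.10.in-addr.arpa`, so the labels 42 and 43 correspond to 10.2.71.42 and 10.2.71.43. A short sketch of that mapping using Python's standard `ipaddress` module follows; the helper names `ptr_name` and `zone_label` are hypothetical.

```python
import ipaddress


def ptr_name(ip):
    """Return the full reverse-DNS name for an IP address string."""
    return ipaddress.ip_address(ip).reverse_pointer


def zone_label(ip, zone="71.2.10.in-addr.arpa"):
    """Return the record label for ip relative to zone, or None if outside it."""
    full = ptr_name(ip)
    suffix = "." + zone
    return full[:-len(suffix)] if full.endswith(suffix) else None
```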
Comment 82•15 years ago
|
||
try-mac-slave42
| Reporter | ||
Comment 83•15 years ago
|
||
we are still not seeing bm-xserve06 online
Comment 84•15 years ago
|
||
zandr: This bug is getting a bit unwieldy. I'm going to create a new one and try to bring forward all the slaves that still require attention.
Alias: reboots
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → mozilla.org Graveyard