computers that need physical intervention

RESOLVED FIXED

Status

mozilla.org Graveyard
Server Operations
--
major
RESOLVED FIXED
8 years ago
3 years ago

People

(Reporter: bear, Assigned: zandr)

Tracking

Details

(URL)

(Reporter)

Description

8 years ago
The following computers require manual "looking at" to determine why they are offline.
(Reporter)

Updated

8 years ago
No longer depends on: 620041
(Reporter)

Updated

8 years ago
Alias: reboots
(Reporter)

Comment 1

8 years ago
linux-ix-slave37.build

moz:~ bear$ ssh cltbld@linux-ix-slave37.build.mozilla.org
ssh: connect to host linux-ix-slave37.build.mozilla.org port 22: Operation timed out
moz:~ bear$ ping linux-ix-slave37.build.mozilla.org
PING linux-ix-slave37.build.scl1.mozilla.com (10.12.48.231): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2

Updated

8 years ago
Assignee: nobody → server-ops
Component: Release Engineering → Server Operations
QA Contact: release → mrz
(Assignee)

Comment 2

8 years ago
(In reply to comment #1)
> linux-ix-slave37.build

console says:

puppetd returned non-zero, sleeping for 60...trying again

repeated continuously. linux-ix-slave08 is saying the same thing, immediately after a reimage.

My first thought was that this was due to a hostname issue, but if you can't even ping it, there's no way to get in and fix hostname.
Probably needs its hostkey cleared on the scl and/or mpt puppet masters.
Oops, I mean MV, not scl.
(Reporter)

Comment 5

8 years ago
(In reply to comment #3)
> Probably needs its hostkey cleared on the scl and/or mpt puppet masters.

let me do that now
(Reporter)

Comment 6

8 years ago
linux-ix-slave37 has attached to the master and is ok
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
(Assignee)

Comment 8

8 years ago
(In reply to comment #7)

> talos-r3-fed64-001

Seriously? This was reimaged *last night*.

Did it do any work overnight?
moz2-darwin10-slave01
(In reply to comment #8)
> (In reply to comment #7)
> 
> > talos-r3-fed64-001
> 
> Seriously? This was reimaged *last night*.
> 
> Did it do any work overnight?

At least I couldn't reach it and Nagios reports it as unpingable for 19 hours.

Updated

8 years ago
Assignee: server-ops → zandr
talos-r3-w7-036
An updated and collated list:

bm-xserve18
linux-ix-slave13
linux-ix-slave14
linux-ix-slave16
moz2-darwin10-slave53
moz2-darwin10-slave54
mv-moz2-linux-ix-slave05
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
talos-r3-fed64-047
talos-r3-fed64-055
talos-r3-leopard-003
talos-r3-w7-036
w32-ix-slave03
w32-ix-slave08
w32-ix-slave41
talos-r3-fed-031
talos-r3-fed-032
talos-r3-fed64-001
talos-r3-fed64-011
talos-r3-fed64-018
talos-r3-fed64-022
(Assignee)

Comment 13

8 years ago
talos-r3-fed-031: "Couldn't mount root filesystem": reimaged
talos-r3-fed-032: 

talos-r3-fed64-001: rebooted: responds to ssh
talos-r3-fed64-011: weird dhcp problem, reimaged
talos-r3-fed64-018: same hang at USB: reimaged
talos-r3-fed64-022: "Couldn't mount root filesystem": reimaged
talos-r3-fed64-047: not in scl1, may not exist, cc :jhford for comment
talos-r3-fed64-055: not in scl1, may not exist, cc :jhford for comment

talos-r3-leopard-003: WFM: responds to ssh, vnc

talos-r3-w7-036: 

w32-ix-slave41: presumed drive failure. Imaging at 20MB/*minute* and falling. see bug 615744
(Assignee)

Comment 14

8 years ago
Oops, hit save too early.

talos-r3-fed-032: was hung at grey boot screen: rebooted

talos-r3-w7-036: was hung at grey boot screen: rebooted

And that's it for scl1.
(Assignee)

Comment 15

8 years ago
linux-ix-slave14 is also MIA. Not present in 650.
That machine was given to IX to investigate the issues in bug 596366 (comment 11)
t-r3-w764-018 does not respond via SSH or VNC to any of the cltbld passwords I know.  If a restart doesn't help, either a re-image or a changed VNC and cltbld password would be great.
t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a poking-at.
I managed to guess the VNC password for t-r3-w764-018, but it doesn't match the cltbld account's password.  This system probably just needs a re-image.
(In reply to comment #18)
> t-r3-w764-008 refuses incoming SSH or VNC connections, and thus needs a
> poking-at.

And this machine is apparently feeling better this afternoon - I successfully connected via VNC, so I guess it's OK (??)
t-r3-w764-034 was accidentally shut down rather than rebooted, so it will need to be started up.
linux-ix-slave34 has poor performance:
 Timing buffered disk reads:   66 MB in  3.02 seconds =  21.84 MB/sec
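For comparing boxes, the figure can be pulled out of that line mechanically. This is only a sketch: it assumes the line has the shape shown above (it looks like `hdparm -t` output), and the helper name `buffered_read_rate` is made up for illustration. The recomputed rate can differ slightly from the printed one because the printed MB and seconds are rounded.

```python
import re

# Matches the "N MB in S seconds" part of a buffered-read timing line.
LINE = re.compile(r"Timing buffered disk reads:\s+(\d+) MB in\s+([\d.]+) seconds")

def buffered_read_rate(line):
    """Return MB/sec recomputed from the MB and seconds printed on the line."""
    m = LINE.search(line)
    if m is None:
        raise ValueError("not a buffered-read timing line")
    return int(m.group(1)) / float(m.group(2))
```

Feeding it the line quoted above gives roughly the same 21.8 MB/sec that the tool printed.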
(Assignee)

Comment 23

8 years ago
w64-ix-slave27 is not responding on its normal interface, nor IPMI
(Assignee)

Comment 24

8 years ago
w64-ix-slave35 also not responding to IPMI
(Assignee)

Comment 25

8 years ago
Machines for the next MPT visit:

bm-xserve18
moz2-darwin10-slave53
moz2-darwin10-slave54

Updated

8 years ago
Assignee: zandr → phong
Flags: colo-trip+

Comment 26

8 years ago
(In reply to comment #25)
> Machines for the next MPT visit:
> 
> bm-xserve18
> moz2-darwin10-slave53
> moz2-darwin10-slave54

rebooted.
Assignee: phong → zandr
moz2-linux-slave03 does not have any of the root passwords I know of.  It'd be good to have the password reset -- or the whole box re-imaged, if that's easier.
(In reply to comment #27)
> moz2-linux-slave03 does not have any of the root passwords I know of.  It'd be
> good to have the password reset -- or the whole box re-imaged, if that's
> easier.

Repurposed bug 610308 for this.
(Assignee)

Comment 29

8 years ago
failing ping checks:
talos-r3-fed-023
talos-r3-fed-025
talos-r3-fed64-009
talos-r3-w7-011
talos-r3-w7-032
talos-r3-w7-036
(Assignee)

Comment 30

8 years ago
moz2-darwin10-slave01
moz2-darwin10-slave26
(Assignee)

Updated

8 years ago
Blocks: 603343
(Assignee)

Comment 31

8 years ago
(In reply to comment #30)

> moz2-darwin10-slave26

Disregard, see bug 620594
bm-xserve02 is offline. Not sure why. This blocks bug 578234.
Blocks: 578234
(Assignee)

Comment 33

8 years ago
talos-r3-fed64-019
bm-xserve10
bm-xserve20
bm-xserve23
bm-xserve24

All four of these xserves report the following error when I try to connect:

 nc: getaddrinfo: Name or service not known

Are these machines in DNS?
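
A rough way to check a batch of names in one go, as a sketch only: the helper name `unresolvable` and the injectable `resolve` argument are assumptions for illustration, not an existing tool. By default it uses `socket.getaddrinfo`, the same lookup that nc is complaining about.

```python
import socket

def unresolvable(hosts, resolve=socket.getaddrinfo):
    """Return the hosts whose names fail to resolve.

    `resolve` is injectable so the check can be exercised without a network.
    """
    missing = []
    for host in hosts:
        try:
            resolve(host, 22)  # port 22 is arbitrary; only the name lookup matters
        except socket.gaierror:
            missing.append(host)
    return missing
```

Calling `unresolvable(["bm-xserve10.build.mozilla.org", "bm-xserve20.build.mozilla.org"])` from a build network host would list whichever of those names have no DNS entry.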

Updated

8 years ago
Blocks: 580346
(Assignee)

Comment 35

8 years ago
talos-r3-fed-003: blank screen -> rebooted -> date problem
talos-r3-fed-023: grey screen -> rebooted 
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-042: blank screen -> rebooted -> date problem
talos-r3-w7-011: grey screen -> rebooted
talos-r3-w7-032: grey screen -> rebooted
talos-r3-w7-034: turned on
talos-r3-w7-036: grey screen -> rebooted

talos-r3-fed64-009: hang after USB -> needs reimage
talos-r3-fed64-019: hang after USB -> needs reimage
talos-r3-fed64-029: blank screen -> rebooted -> date problem

w64-ix-slave27-ipmi: fixed
w64-ix-slave35-ipmi: bad cable, needs proper (5') replacement
(Assignee)

Comment 36

8 years ago
To clarify, I did not reimage the two fed64 machines (not enough time in the DC), but I did put in a temp patch for w64-ix-slave35.
(Assignee)

Comment 37

8 years ago
bm-xserve11 is refusing connections.
(In reply to comment #37)
> bm-xserve11 is refusing connections.

The most common failure mode for our xserves is the raid array getting corrupted, and freezing up when under I/O load (i.e. building!). The fix is to re-image. Not saying that's necessary yet, just FYI.
(Assignee)

Comment 39

8 years ago
talos-r3-w7-024 needs to come to MV (attn: clyon)
(Assignee)

Comment 40

8 years ago
talos-r3-fed-025: grey screen -> rebooted
talos-r3-fed-048: blank screen -> rebooted -> date problem
talos-r3-fed64-009: reimaged - hostname update pending
talos-r3-fed64-019: reimaged - hostname update pending
(In reply to comment #40)
> talos-r3-fed-025: grey screen -> rebooted

Up and running now

> talos-r3-fed-048: blank screen -> rebooted -> date problem

Up and running.  What's the date problem?
(Assignee)

Comment 42

8 years ago
(In reply to comment #41)
> (In reply to comment #40)
 
> > talos-r3-fed-048: blank screen -> rebooted -> date problem

And I should say 'fixed'. That note was for my own tracking.
 
> Up and running.  What's the date problem?

System date had reverted to 1/1/01, fsck complaining about superblock timestamp in the future. bug 561442
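
The condition fsck trips over can be written down as a simple comparison. This is a sketch under assumptions: the function name and the `slack` parameter are hypothetical, and how much clock skew fsck actually tolerates is not covered in this bug.

```python
from datetime import datetime, timedelta

def superblock_in_future(system_now, superblock_time, slack=timedelta(0)):
    """True if the filesystem's recorded timestamp is later than the system clock.

    `slack` is an assumed tolerance for small clock skew; zero means any
    timestamp ahead of the clock counts as "in the future".
    """
    return superblock_time > system_now + slack
```

With the clock reset to 1/1/01 (read here as 2001), any superblock written during normal operation since then compares as being in the future, which matches the complaint.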
Talos boxes that have gone down in the last few days (hwclock fix not helping?):

talos-r3-fed-008
talos-r3-fed-026
talos-r3-fed-050
talos-r3-fed64-017
talos-r3-fed64-024
talos-r3-fed64-033
(Reporter)

Comment 44

8 years ago
Did they die running mozmill-all? If so, they may have a crashreporter process running, and that may stop the reboot.

I saw quite a few of them Friday (not sure whether I filed a bug; I can't find one).
A hanging mozmill would block OS restart well before the network interface went away, and so wouldn't block PINGs, but I double checked the masters for all 6 of those machines. They don't have any hanging mozmill (or jetpack) jobs, and the last job each machine did before it failed to reboot was not a mozmill job in almost all cases. When it was mozmill, it looked like the job ran fine and terminated normally.
moz2-darwin9-slave05 seems to be gone from the network, according to nagios and ssh
talos-r3-fed64-038 - same
talos-r3-fed-037
talos-r3-fed-024
Here's an updated list based on my trip through nagios and the slave-wrangling spreadsheet (see URL)

bm-xserve11 - reboot - comment 37
moz2-darwin9-slave05 - reboot - comment 46
t-r3-w764-018 - try to reset passwords, or reimage - comment 19
talos-r3-fed-008 - reboot
talos-r3-fed-024 - reboot
talos-r3-fed-026 - reboot
talos-r3-fed-037 - reboot
talos-r3-fed-050 - reboot
talos-r3-fed64-017 - reboot
talos-r3-fed64-024 - reboot
talos-r3-fed64-033 - reboot
talos-r3-fed64-038 - reboot
talos-r3-fed64-047 - reboot
talos-r3-fed64-055 - reboot
try-mac-slave28 - reboot
w32-ix-slave03 - reboot - comment 12
w32-ix-slave08 - ssh works but nothing else, spoke, maybe reimage
talos-r3-fed-053 - reboot

Comment 52

8 years ago
The w7 slaves are not easy to detect.
The only check we have is PING.

I see a solution.
An email in the morning with list of w7 machines that have not done a job for a day, the master it was connected to and a tail of twistd.log to help us determine state.

On another note, I kicked a few of the w7 slaves back up:
talos-r3-w7-004 - SIGKILL failed. I VNCed; no CMD running; rebooted - down since 16th - last job jetpack for 3 days
talos-r3-w7-029 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-038 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; did *not* reboot - we should debug - down since 13th - last job run was jetpack for 5 days
talos-r3-w7-049 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 5 days
talos-r3-w7-051 - twisted.internet.error.ProcessExitedAlready; task manager open; no CMD running; rebooted - down since 13th - last job run jetpack for 3 days

Could one of you check talos-r3-w7-004? It did not come back and I can't reach it.

TODO debug talos-r3-w7-038 for more info
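
The morning report proposed above could start from something like the following. It is a minimal sketch, not an existing script: the function names and the shape of the `last_job` mapping are assumptions, and collecting the last-job times from the masters and tailing twistd.log are left out.

```python
from datetime import datetime, timedelta

def idle_slaves(last_job, now, threshold=timedelta(days=1)):
    """Return (slave, master, idle_time) rows for slaves idle longer than threshold.

    `last_job` maps slave name -> (master name, datetime of last finished job).
    Rows are sorted with the longest-idle slave first.
    """
    idle = [
        (slave, master, now - finished)
        for slave, (master, finished) in last_job.items()
        if now - finished > threshold
    ]
    return sorted(idle, key=lambda row: row[2], reverse=True)

def format_report(rows):
    """Render the rows as one plain-text line per slave, suitable for an email body."""
    return "\n".join(
        "%s (master %s): idle %d day(s)" % (slave, master, idle.days)
        for slave, master, idle in rows
    )
```

The one-day threshold is the default because that is the cutoff suggested in the comment; it is a parameter so it can be tightened if a day turns out to be too slow.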
(In reply to comment #52)
> The w7 slaves are not easy to detect.
> The only check we have is PING.
> 
> I see a solution.
> An email in the morning with list of w7 machines that have not done a job for a
> day, the master it was connected to and a tail of twistd.log to help us
> determine state.

Really, we should have some way to get better information to nagios.  bug 627126.

> TODO debug talos-r3-w7-038 for more info

Can you file a bug for this and update the spreadsheet?

Comment 54

8 years ago
(In reply to comment #53)
> > TODO debug talos-r3-w7-038 for more info
> 
> Can you file a bug for this and update the spreadsheet?
It is back in the pool. Nothing else to follow up.

Comment 55

8 years ago
talos-r3-w7-038 - needs a manual reboot as I can't ssh/vnc/rdc to it
> moz2-darwin9-slave05 - reboot - comment 46

This machine took hours to restart yesterday, and today took at least 30m:

13:50 < nagios> [40] moz2-darwin9-slave05.build:PING is CRITICAL: CRITICAL - Host Unreachable (10.2.71.202)
14:10 < nagios> moz2-darwin9-slave05.build:PING is OK: PING OK - Packet loss = 0%, RTA = 0.31 ms

(CDT times = PDT-0200)

So it should probably be restarted manually and its behavior monitored.  I've disabled buildslave on it.
talos-r3-fed-047 needs a reboot

bm-xserve11 doesn't any more, got moved today
(In reply to comment #57)
> talos-r3-fed-047 needs a reboot

Actually, it's on a cart in Mountain View - bug 626923.
That's fed64-047 by the looks.
Correct, sorry.
More talos slaves down:

talos-r3-fed64-037
talos-r3-fed64-041
bm-xserve06 - this was moved yesterday, and came up after that, but at this point everything but ping is down on the host.
(Assignee)

Comment 63

8 years ago
talos-r3-w7-004: confused by comment 52, as it's up and running tests now
talos-r3-w7-024: reimaged
talos-r3-w7-038: rebooted
t-r3-w764-018: reimaged 
talos-r3-fed-008: date problem -> fsck
talos-r3-fed-024: gray screen -> reboot
talos-r3-fed-026: date problem -> fsck
talos-r3-fed-037: gray screen -> reboot
talos-r3-fed-050: date problem -> fsck
talos-r3-fed-053: date problem -> fsck
talos-r3-fed64-017: hang at boot -> reimage - needs hostname
talos-r3-fed64-024: date problem -> fsck
talos-r3-fed64-033: date problem -> fsck
talos-r3-fed64-037: date problem -> fsck
talos-r3-fed64-038: hang at boot -> reimage - needs hostname
talos-r3-fed64-041: date problem
(In reply to comment #63)
> talos-r3-fed64-017: hang at boot -> reimage - needs hostname
> talos-r3-fed64-038: hang at boot -> reimage - needs hostname

Fixed hostname, passwords, puppet, & buildbot.tac. Back in production.

Comment 65

8 years ago
No worries zandr.
Here is a pointer for Monday -> bug 624124
talos-r3-fed-036
mv-moz2-linux-ix-slave22
talos-r3-xp-019 seems to be stuck in shutdown. It will need to be rebooted.
talos-r3-xp-001 also needs a reboot.

Updated

8 years ago
Blocks: 620517
Reboots:
talos-r3-fed64-016
talos-r3-xp-030
I managed to resurrect both talos-r3-xp-001 and talos-r3-xp-019 by being a little more patient with RDP.

Updated

8 years ago
No longer blocks: 620517
Blocks: 627230
linux-ix-slave42 - unpingable, see also 624207
talos-r3-fed64-022
talos-r3-w7-036
(Assignee)

Comment 76

8 years ago
talos-r3-fed64-016: date problem -> fsck
talos-r3-fed64-022: looked OK, but wasn't on the network -> reboot

talos-r3-xp-030: blank screen -> reboot
talos-r3-w7-036: grey screen -> reboot
talos-r3-fed-036: grey screen -> reboot
talos-r3-w7-036 - wasted no time killing itself again.  The rest seem to be up.
talos-r3-w7-020
(In reply to comment #34)
> bm-xserve10
> bm-xserve20
> bm-xserve23
> bm-xserve24
> 
> All four of these xserves report the following error when I try to connect:
> 
>  nc: getaddrinfo: Name or service not known
> 
> Are these machines in DNS?

Any update on these xserves? It's blocking bug 580346.
Did those work at one time?  None are in DNS.
Readded to DNS based on some old entries:

build.mozilla.org.deprecated:bm-xserve20        IN A            10.2.71.158
build.mozilla.org.deprecated:bm-xserve10        IN A            10.2.71.10

71.2.10.in-addr.arpa:42      IN PTR  bm-xserve23.build.mozilla.org.
71.2.10.in-addr.arpa:43      IN PTR  bm-xserve24.build.mozilla.org.
try-mac-slave42
(Reporter)

Comment 83

8 years ago
we are still not seeing bm-xserve06 online
zandr: This bug is getting a bit unwieldy. I'm going to create a new one and try to bring forward all the slaves that still require attention.
Alias: reboots
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED

Updated

8 years ago
Blocks: 629511
(Assignee)

Updated

8 years ago
Duplicate of this bug: 604486
Product: mozilla.org → mozilla.org Graveyard