Closed
Bug 652744
Opened 14 years ago
Closed 13 years ago
talos snow and leopard slaves hanging
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: bear, Unassigned)
Details
(Whiteboard: [slaveduty])
Attachments
(1 file)
145.31 KB, text/plain
Description
I was spot checking the pending job queue and saw that it had spiked to more than 450 jobs, with talos snow|leopard being the biggest chunk. Nick posted the output from http://build.mozilla.org/builds/last-job-per-slave.txt (which I'm going to attach); it showed a huge number of talos-r3-snow and talos-r3-leopard slaves with multi-day idle times, so Catlee and I fired up csshX and rebooted all that were responding.
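For anyone repeating this check, something like the following against that file pulls out the mac talos slaves so the long-idle ones stand out (a rough sketch only - I'm assuming the file lists one slave per line with its last-job time, which may not be the exact layout):

  # list just the snow/leopard talos slaves from the last-job report
  grep -E 'talos-r3-(snow|leopard)' last-job-per-slave.txt | sort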
Reporter
Comment 1•14 years ago
Comment 2•14 years ago
Did you get any idea what the problems were?
Updated•14 years ago
Summary: excessive wait time for talos snow and leopard try builds → talos snow and leopard slaves hanging
Reporter
Comment 3•14 years ago
The majority of them were sitting with buildslave running but no activity from the 19th to the 24th. A handful were just plain offline.
After a "sudo reboot" the idle ones picked right up and joined the pool.
Didn't dig any deeper for the others.
Comment 4•14 years ago
The offline machines are probably known - check the spreadsheet to be sure.
I'm surprised to hear that buildslave was running. In all of the leopard/snow failures I've seen recently, buildslave had shut down on a SIGTERM in the count-and-reboot step. In most cases where buildslave is running, it just takes an hour or two for the master to hand it a job - that is the existing high-waittimes bug that armen opened last week (bug 649734). Generally that does not take days, though.
My approach to solving this is usually to open the slaves 10 at a time with csshX, and use 'uptime'. Any slave up for less than 60m is probably running fine, so I just close the window; for the rest, I look at the logfile and reboot unless it's actually running a build.
I'll run a check like this again tomorrow.
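For anyone else on slaveduty, roughly what I paste into each csshX window looks like this (a sketch only - it assumes the slave checkout lives at ~/talos-slave and that twistd logs to twistd.log there; the reboot decision itself stays manual):

  # how long has the box been up, and what was the slave last doing?
  uptime
  tail -n 20 ~/talos-slave/twistd.log
  # if it's been idle for days and isn't mid-build: sudo reboot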
Updated•14 years ago
Whiteboard: [slaveduty]
Comment 5•14 years ago
talos-r3-snow-011 didn't come back after a reboot. In /var/log/system.log there was a complaint from twistd that another twistd was already running. I deleted talos-slave/twistd.pid and did another reboot, which fixed it up. dustin speculated on IRC that when twistd tried to start, some other process was running with the pid recorded in the stale twistd.pid, so twistd concluded another instance was already running. Four other machines were in the same state - talos-r3-snow-023, talos-r3-snow-031, talos-r3-snow-042 and talos-r3-snow-045.
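For the record, the recovery there was roughly the following (a sketch; I'm assuming the checkout is at ~/talos-slave for cltbld - the pid file is the one named in the system.log complaint):

  # which pid does the stale file point at, and what owns that pid now?
  cat ~/talos-slave/twistd.pid
  ps -p "$(cat ~/talos-slave/twistd.pid)"
  # clear the stale pid file so twistd will start cleanly on the next boot
  rm ~/talos-slave/twistd.pid
  sudo reboot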
talos-r3-snow-013 had an issue with Apache barfing on DocumentRoot (/Users/cltbld/talos-slave/talos-data/talos) not existing. If I'm reading statusdb right, this slave last did a talos job on 2011-04-10, then had a three-day break (possibly unrelated).
Comment 6•14 years ago
Isn't this bug a symptom of bug 648665?
Comment 7•14 years ago
Armen - I think it's a different bug, from the descriptions here.
The Apache problem means that the slave isn't talking to puppet, as puppet creates the documentroot.
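A quick way to tell the two cases apart on a suspect slave (a sketch - the documentroot path is the one from comment 5, and I'm assuming puppet activity ends up in system.log on these machines):

  # if puppet has run, the talos documentroot exists and apache can start
  ls -ld /Users/cltbld/talos-slave/talos-data/talos
  # look for recent puppet activity
  grep -i puppet /var/log/system.log | tail -n 5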
Reporter
Updated•14 years ago
Assignee: nobody → bear
Reporter
Updated•14 years ago
Priority: -- → P2
Reporter
Updated•14 years ago
Assignee: bear → nobody
Reporter
Comment 8•13 years ago
I'm closing this as INVALID - it's been a LONG time since this was an active snapshot of hung slaves.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering