Closed
Bug 652744
Opened 14 years ago
Closed 13 years ago
talos snow and leopard slaves hanging
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: bear, Unassigned)
Details
(Whiteboard: [slaveduty])
Attachments
(1 file)
145.31 KB, text/plain
Description
I was spot checking the pending job queue and saw that it had spiked to more than 450 jobs, with talos snow|leopard being the biggest chunk. Nick posted the output from http://build.mozilla.org/builds/last-job-per-slave.txt (which I'm going to attach); it showed a huge number of talos-r3-snow and talos-r3-leopard slaves with multi-day idle times, so Catlee and I fired up csshX and rebooted all that were responding.
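For anyone repeating this check, something like the following against that file pulls out the mac talos slaves so the long-idle ones stand out (a rough sketch only - I'm assuming the file lists one slave per line with its last-job time, which may not be the exact layout):

  # list just the snow/leopard talos slaves from the last-job report
  grep -E 'talos-r3-(snow|leopard)' last-job-per-slave.txt | sort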
Reporter
Comment 1•14 years ago
Comment 2•14 years ago
Did you get any idea what the problems were?
Updated•14 years ago
Summary: excessive wait time for talos snow and leopard try builds → talos snow and leopard slaves hanging
Reporter
Comment 3•14 years ago
The majority of them were sitting with buildslave running but no activity from the 19th to the 24th. A handful were just plain offline.
After a "sudo reboot" the idle ones picked right up and joined the pool.
Didn't dig any deeper for the others.
Comment 4•14 years ago
The offline machines are probably known - check the spreadsheet to be sure.
I'm surprised to hear that buildslave was running. In all of the leopard/snow failures I've seen recently, buildslave had shut down on a SIGTERM in the count-and-reboot step. In most cases where buildslave is running, it just takes an hour or two for the master to hand it a job - that is the existing high-waittimes bug that armen opened last week (bug 649734). Generally that does not take days, though.
My approach to solving this is usually to open the slaves 10 at a time with csshX, and use 'uptime'. Any slave up for less than 60m is probably running fine, so I just close the window; for the rest, I look at the logfile and reboot unless it's actually running a build.
I'll run a check like this again tomorrow.
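For anyone else on slaveduty, roughly what I paste into each csshX window looks like this (a sketch only - it assumes the slave checkout lives at ~/talos-slave and that twistd logs to twistd.log there; the reboot decision itself stays manual):

  # how long has the box been up, and what was the slave last doing?
  uptime
  tail -n 20 ~/talos-slave/twistd.log
  # if it's been idle for days and isn't mid-build: sudo reboot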
Updated•14 years ago
Whiteboard: [slaveduty]
Comment 5•14 years ago
talos-r3-snow-011 didn't come back after a reboot. In /var/log/system.log there was a complaint from twistd that another twistd was already running. I deleted talos-slave/twistd.pid and did another reboot, which fixed it up. dustin speculated on IRC that when twistd tried to start, some other process was running with the pid recorded in the stale twistd.pid, so twistd concluded another instance was already running. Four other machines were in the same state - talos-r3-snow-023, talos-r3-snow-031, talos-r3-snow-042 and talos-r3-snow-045.
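For the record, the recovery there was roughly the following (a sketch; I'm assuming the checkout is at ~/talos-slave for cltbld - the pid file is the one named in the system.log complaint):

  # which pid does the stale file point at, and what owns that pid now?
  cat ~/talos-slave/twistd.pid
  ps -p "$(cat ~/talos-slave/twistd.pid)"
  # clear the stale pid file so twistd will start cleanly on the next boot
  rm ~/talos-slave/twistd.pid
  sudo reboot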
talos-r3-snow-013 had an issue with Apache barfing on DocumentRoot (/Users/cltbld/talos-slave/talos-data/talos) not existing. If I'm reading statusdb right, this slave last did a talos job on 2011-04-10, then had a three-day break (possibly unrelated).
Comment 6•14 years ago
Isn't this bug a symptom of bug 648665?
Comment 7•14 years ago
Armen - I think it's a different bug, from the descriptions here.
The Apache problem means that the slave isn't talking to puppet, as puppet creates the documentroot.
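A quick way to tell the two cases apart on a suspect slave (a sketch - the documentroot path is the one from comment 5, and I'm assuming puppet activity ends up in system.log on these machines):

  # if puppet has run, the talos documentroot exists and apache can start
  ls -ld /Users/cltbld/talos-slave/talos-data/talos
  # look for recent puppet activity
  grep -i puppet /var/log/system.log | tail -n 5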
Reporter
Updated•14 years ago
Assignee: nobody → bear
Reporter
Updated•14 years ago
Priority: -- → P2
Reporter
Updated•14 years ago
Assignee: bear → nobody
Reporter
Comment 8•13 years ago
I'm closing this as INVALID - it's been a LONG time since this was an active snapshot of hung slaves.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering