Closed Bug 858659 Opened 12 years ago Closed 12 years ago

many jobs not starting (or taking a long time to start) on linux test masters

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

Details

We have over 300 fedora jobs pending right now, and tons of idle slaves. The jobs aren't starting though.
13:41 <+catlee> but I think bm24 is so overloaded it's taking hours to process RPC calls
bm18 seems to be in the worst shape - I don't see it handing out any jobs to r3 machines. It has a graceful shutdown in progress, but I'm not sure who started it. I also don't see anything in twistd.log about the graceful shutdown being started... This master is very broken right now.
bm18's slaves haven't run a single job since March 28th, so I'm restarting it the hard way. I'm guessing that someone initiated a graceful shutdown on the 28th, something went wrong, and it never shut down. $5 says something to do with ec2 slaves mucked it up.
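A check along these lines could have flagged bm18 earlier. This is only a rough sketch: `find_stalled_masters`, the `max_idle` threshold, and the timestamps are all made up for illustration and are not taken from the real masters or from any Buildbot API.

```python
from datetime import datetime, timedelta


def find_stalled_masters(last_job_start, now, max_idle=timedelta(hours=6)):
    """Return the names of masters whose latest job start is older than max_idle.

    last_job_start maps a master name to the datetime its slaves last
    started a job. The 6 hour default is an arbitrary assumption.
    """
    return sorted(
        name
        for name, started in last_job_start.items()
        if now - started > max_idle
    )


# Hypothetical timestamps, for illustration only.
now = datetime(2013, 4, 5, 14, 0)
last_job_start = {
    "bm18": datetime(2013, 3, 28, 9, 0),   # idle for days
    "bm24": datetime(2013, 4, 5, 13, 30),  # busy, but still starting jobs
}

stalled = find_stalled_masters(last_job_start, now)
print(stalled)
```

A master that is merely slow (like bm24 here) still starts jobs now and then, so it would not be flagged; only one that has stopped handing out work entirely would be.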
bm18 is back up and fedora pending is down to 73.
14:00 <+catlee> that's the same hung slave issue
14:00 <+catlee> we're only protected against it in a disconnect step
14:00 <+catlee> if the slave dies in other steps, we can still hang
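To illustrate catlee's point, here is a minimal sketch of guarding every step with a timeout instead of only the disconnect step. It is generic Python, not Buildbot code: `run_step`, `run_build`, the step names, and the timeout value are hypothetical, and a real fix would also have to tear down the dead slave connection.

```python
import concurrent.futures
import time


def run_step(step, timeout_s):
    """Run step() in a worker thread and wait at most timeout_s seconds.

    Raises concurrent.futures.TimeoutError if the step does not finish in
    time. The worker thread itself is not killed; we only stop waiting.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(step)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a hung step while shutting the executor down.
        executor.shutdown(wait=False)


def run_build(steps, timeout_s):
    """Run (name, callable) steps in order, stopping at the first timeout."""
    results = []
    for name, step in steps:
        try:
            run_step(step, timeout_s)
            results.append((name, "ok"))
        except concurrent.futures.TimeoutError:
            results.append((name, "timed out"))
            break  # give up on the build instead of hanging forever
    return results


# Hypothetical build: the "test" step stands in for a slave that died mid-step.
steps = [
    ("checkout", lambda: None),
    ("test", lambda: time.sleep(0.5)),
    ("disconnect", lambda: None),
]
outcome = run_build(steps, timeout_s=0.05)
print(outcome)
```

With the timeout applied to every step, the build is marked as failed at the hung step rather than holding the slave (and the master's shutdown) indefinitely.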
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General