Closed
Bug 986445
Opened 10 years ago
Closed 10 years ago
watch pending and misc.py scheduling make different decisions about which slave to use for a retried job
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: catlee)
Details
Attachments
(1 file)
1.79 KB, patch (rail: review+, catlee: checked-in+)
This morning I noticed that there were about 100 pending test jobs from as far back as 8 hours ago. Looking in the watch pending log I see a lot of:

2014-03-21 05:39:48,447 - DEBUG - waiting for spot request
2014-03-21 05:39:53,750 - DEBUG - 39 interfaces in us-east-1c
2014-03-21 05:39:53,751 - DEBUG - using tst-linux32-spot-146.test.releng.use1.mozilla.com
2014-03-21 05:39:53,751 - DEBUG - Spot request for tst-linux32-spot-146.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:39:54,972 - ERROR - 400 Bad Request
2014-03-21 05:39:54,972 - ERROR - <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-27990649' does not exist</Message></Error></Errors><RequestID>98df7227-da42-4dd2-b253-5afb27013cfb</RequestID></Response>
2014-03-21 05:39:54,972 - DEBUG - waiting for spot request
2014-03-21 05:40:00,249 - DEBUG - 38 interfaces in us-east-1c
2014-03-21 05:40:00,249 - DEBUG - using tst-linux32-spot-002.test.releng.use1.mozilla.com
2014-03-21 05:40:00,249 - DEBUG - Spot request for tst-linux32-spot-002.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:40:01,515 - ERROR - 400 Bad Request
2014-03-21 05:40:01,515 - ERROR - <?xml version="1.0" encoding="UTF-8"?> <Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-9586c649' does not exist</Message></Error></Errors><RequestID>91baddbb-2ab1-4a22-b285-ae7ea9d13867</RequestID></Response>

Since the log was rotated two days ago there have been almost 2000 of these messages. The old log (which goes back to December) has 800. Curiously, all of the pending jobs are 32-bit mozilla-inbound tests, which makes me wonder if this is some sort of prioritization issue instead. AFAIK trees are open and we're in fine shape, so I'm not marking this as a blocker.
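The repeated NotFound errors above come from polling a spot request that EC2 says does not exist. A minimal, defensive poll loop might tolerate a bounded number of those responses before giving up. This is an illustrative sketch only: `api.describe`, `NotFoundError`, and `FakeAPI` are hypothetical stand-ins, not the real watch_pending or boto code.

```python
# Illustrative sketch: treat InvalidSpotInstanceRequestID.NotFound as
# possibly transient (a just-created request ID may not be visible yet)
# and retry a bounded number of times instead of erroring forever.
# All names here are invented for illustration.

class NotFoundError(Exception):
    """Stand-in for an InvalidSpotInstanceRequestID.NotFound response."""


def wait_for_spot_request(api, request_id, max_attempts=5):
    """Poll api.describe(request_id), tolerating a few NotFound responses."""
    last_err = None
    for _ in range(max_attempts):
        try:
            return api.describe(request_id)
        except NotFoundError as err:
            last_err = err  # request not visible (yet?); try again
    raise last_err


class FakeAPI:
    """Fake API whose request becomes visible only after a few failures."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def describe(self, request_id):
        self.calls += 1
        if self.calls <= self.failures:
            raise NotFoundError(request_id)
        return {"id": request_id, "state": "active"}
```

In the log above the same request keeps failing, so a loop like this would eventually give up and (presumably) resubmit; whether that is the right recovery depends on why the request ID vanished in the first place.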
Reporter
Comment 1•10 years ago
Also curious: tst-linux32-spot-195 is connecting to buildbot-master67, which sees pending jobs on a builder for that slave, but it's not starting them.
Reporter
Comment 2•10 years ago
I found this in buildbot-master67's log:

2014-03-21 05:46:17-0700 [-] nextAWSSlave: 2 retries for Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc
2014-03-21 05:46:17-0700 [-] nextAWSSlave: No slaves appropriate for retried job - returning None
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: want to start build, but we don't have a remote
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: got assignments: {}

The master seems to think it's a retry...but looking at https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=dbc31b57cbd7 I don't see an original test. Looking even further back, I see pending Cipc tests all the way back to https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=767ce92ebaf1, which _does_ have a retry. So it sounds like something is confused about this being a retry (perhaps because there are pending jobs from newer revs than the one that was retried), or watch pending isn't firing up on-demand builders to deal with the retry.
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Reporter
Comment 3•10 years ago
I don't think we're going to get test coverage for most 32-bit Linux desktop jobs until this is fixed, so I'm raising the priority.
Severity: major → critical
Reporter
Comment 4•10 years ago
Debugging is happening on IRC:

09:10 < rail> looks like it looks for all available brids for a builder
09:11 < rail> builder._getBuildable
09:11 < rail> so if we have a lot of retries that may affect all pending?
09:12 <@catlee-away> watch pending should start up ondemand for those though
09:16 < rail> looks like watch pending distinguishes brids, but misc.py uses only builders, so they have different ideas about this
09:16 <@catlee-away> ah, so you could have 2 builds, each with 1 retry, and watch pending wouldn't start ondemand. but misc.py would see 2 retries?

I've started some on-demand instances by hand to clear the backlog.
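The mismatch rail and catlee describe can be sketched as follows. This is a hypothetical reconstruction, not the actual misc.py or watch_pending code; the function names and request fields are invented for illustration. The point is that aggregating retries per builder and per build request (brid) give different answers for the same pending queue:

```python
# Hypothetical reconstruction of the scheduling mismatch. misc.py-style
# accounting sums retries across all pending build requests of a builder,
# while watch_pending-style accounting keeps retries per brid. With two
# requests on one builder, one retry each, the two views disagree.

def retries_per_builder(pending):
    """misc.py-style: total retries over every pending request of a builder."""
    counts = {}
    for req in pending:
        counts[req["builder"]] = counts.get(req["builder"], 0) + req["retries"]
    return counts


def retries_per_request(pending):
    """watch_pending-style: retries keyed by individual build request (brid)."""
    return {req["brid"]: req["retries"] for req in pending}


# Two pending requests on the same builder, one retry each:
pending = [
    {"brid": 101, "builder": "crashtest-ipc", "retries": 1},
    {"brid": 102, "builder": "crashtest-ipc", "retries": 1},
]

# The per-builder view sees 2 retries (so nextAWSSlave-style logic insists on
# an on-demand slave), while the per-brid view sees no request above 1 retry
# (so watch_pending never starts one). Nothing runs the jobs.
assert retries_per_builder(pending)["crashtest-ipc"] == 2
assert max(retries_per_request(pending).values()) == 1
```

Under this reading, the attached "count retries per buildername" patch would make the two sides agree by using the same aggregation key.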
Updated•10 years ago
Flags: needinfo?(rail)
Reporter
Updated•10 years ago
Summary: watch pending seems to be having trouble bringing up spot instances → watch pending and misc.py scheduling make different decisions about which slave to use for a retried job
Assignee
Updated•10 years ago
Assignee: nobody → catlee
Flags: needinfo?(catlee)
Assignee
Comment 5•10 years ago
Attachment #8396420 -
Flags: review?(rail)
Comment 6•10 years ago
Comment on attachment 8396420 [details] [diff] [review]
count retries per buildername

lgtm
Attachment #8396420 -
Flags: review?(rail) → review+
Assignee
Updated•10 years ago
Attachment #8396420 -
Flags: checked-in+
Assignee
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•4 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard