Closed Bug 986445 Opened 10 years ago Closed 10 years ago

watch pending and misc.py scheduling make different decisions about which slave to use for a retried job

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

Details

Attachments

(1 file)

This morning I noticed that there were about 100 pending test jobs from as far back as 8 hours ago. Looking in the watch pending log I see a lot of:
2014-03-21 05:39:48,447 - DEBUG - waiting for spot request
2014-03-21 05:39:53,750 - DEBUG - 39 interfaces in us-east-1c
2014-03-21 05:39:53,751 - DEBUG - using tst-linux32-spot-146.test.releng.use1.mozilla.com
2014-03-21 05:39:53,751 - DEBUG - Spot request for tst-linux32-spot-146.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:39:54,972 - ERROR - 400 Bad Request
2014-03-21 05:39:54,972 - ERROR - <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-27990649' does not exist</Message></Error></Errors><RequestID>98df7227-da42-4dd2-b253-5afb27013cfb</RequestID></Response>
2014-03-21 05:39:54,972 - DEBUG - waiting for spot request
2014-03-21 05:40:00,249 - DEBUG - 38 interfaces in us-east-1c
2014-03-21 05:40:00,249 - DEBUG - using tst-linux32-spot-002.test.releng.use1.mozilla.com
2014-03-21 05:40:00,249 - DEBUG - Spot request for tst-linux32-spot-002.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:40:01,515 - ERROR - 400 Bad Request
2014-03-21 05:40:01,515 - ERROR - <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-9586c649' does not exist</Message></Error></Errors><RequestID>91baddbb-2ab1-4a22-b285-ae7ea9d13867</RequestID></Response>
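The `InvalidSpotInstanceRequestID.NotFound` responses are consistent with EC2 describe calls being eventually consistent: a freshly submitted spot request id can be invisible to a describe for several seconds. As a hypothetical sketch (not the actual cloud-tools code; `describe` and the exception class are stand-ins), a poll loop that treats NotFound as transient rather than a hard failure might look like:

```python
import time


class SpotRequestNotFound(Exception):
    """Stand-in for the NotFound error code seen in the log above."""


def wait_for_spot_request(describe, request_id, timeout=60, interval=5):
    """Poll describe(request_id) until it returns a result, treating
    SpotRequestNotFound as transient for up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return describe(request_id)
        except SpotRequestNotFound:
            # Request id not visible yet; wait and retry instead of
            # logging an ERROR and moving on to a different hostname.
            time.sleep(interval)
    raise TimeoutError("spot request %s never became visible" % request_id)
```

The log suggests watch pending instead logged the 400 as an error and immediately retried with a different interface/hostname, which would explain the long run of back-to-back failures.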


Since the log was rotated two days ago, there have been almost 2000 of these messages. The old log (which goes back to December) has 800.

Curiously, all of the pending jobs were 32-bit mozilla-inbound tests...which makes me wonder if this is some sort of prioritization issue instead.

AFAIK trees are open and we're in fine shape, so not marking as a blocker.
Also curious: tst-linux32-spot-195 is connecting to buildbot-master67, which sees pending jobs on a builder for that slave, but it's not starting them.
I found this on buildbot-master67's log:
2014-03-21 05:46:17-0700 [-] nextAWSSlave: 2 retries for Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc
2014-03-21 05:46:17-0700 [-] nextAWSSlave: No slaves appropriate for retried job - returning None
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: want to start build, but we don't have a remote
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: got assignments: {}
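The "No slaves appropriate for retried job - returning None" line suggests that nextAWSSlave refuses spot slaves once it sees retries for a job, and gives up if no on-demand slave is connected. A hedged sketch of that decision (the function body and slave dicts are illustrative, not the real misc.py code):

```python
def next_aws_slave(available_slaves, retries):
    """Pick a slave for a build. For retried jobs, avoid spot slaves
    (which may have caused the retry) and insist on on-demand hardware;
    return None if no on-demand slave is connected."""
    if retries > 0:
        ondemand = [s for s in available_slaves if not s.get("is_spot")]
        # No on-demand slave connected -> the job stays pending; this is
        # the "returning None" message in the master log above.
        return ondemand[0] if ondemand else None
    return available_slaves[0] if available_slaves else None


slaves = [{"name": "tst-linux32-spot-195", "is_spot": True}]
print(next_aws_slave(slaves, retries=2))           # None: only a spot slave is up
print(next_aws_slave(slaves, retries=0)["name"])   # tst-linux32-spot-195
```

Under this logic the master will never start the job on the connected spot slave, so it waits indefinitely for on-demand capacity that watch pending never brings up.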


The master seems to think it's a retry...but looking at https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=dbc31b57cbd7 I don't see an original test. Looking even further back, I see pending Cipc tests all the way back to https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=767ce92ebaf1, which _does_ have a retry.

So, it sounds like something is confused about this being a retry (perhaps because there are pending jobs from newer revs than the one that was retried), or watch pending isn't firing up on-demand builders to deal with the retry.
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
I don't think we're going to get test coverage for most 32-bit Linux desktop jobs until this is fixed, raising priority.
Severity: major → critical
Debugging is happening on IRC:

09:10 < rail> looks like it looks for all available brids for a builder
09:11 < rail> builder._getBuildable
09:11 < rail> so if we have a lot of retries that may affect all pending?
09:12 <@catlee-away> watch pending should start up ondemand for those though
09:16 < rail> looks like watch pending distinguishes brids, but misc.py uses only builders, so they have different ideas about this
09:16 <@catlee-away> ah, so you could have 2 builds, each with 1 retry, and watch pending wouldn't start ondemand. but misc.py would see 2 retries?
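The mismatch rail describes can be shown concretely. In a hypothetical sketch (field names are illustrative, not actual releng data structures), watch pending keyed retries by build request id (brid) while misc.py summed them across a builder, so the two reached different conclusions about the same queue; the attached patch aligns them by counting retries per buildername:

```python
# Illustrative data: two build requests on one builder, one retry each.
pending = [
    {"brid": 101, "buildername": "crashtest-ipc", "retries": 1},
    {"brid": 102, "buildername": "crashtest-ipc", "retries": 1},
]


def retries_per_brid(requests):
    # watch pending's view: retries keyed by build request id
    return {r["brid"]: r["retries"] for r in requests}


def retries_per_builder(requests):
    # misc.py's view: retries summed across the whole builder
    totals = {}
    for r in requests:
        totals[r["buildername"]] = totals.get(r["buildername"], 0) + r["retries"]
    return totals


# Per brid, no single request looks heavily retried, so watch pending
# sees no reason to start on-demand capacity...
print(retries_per_brid(pending))      # {101: 1, 102: 1}
# ...but per builder the count is 2, so the master treats the builder as
# retried, refuses spot slaves, and the jobs stay pending.
print(retries_per_builder(pending))   # {'crashtest-ipc': 2}
```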


I've started some on-demand instances by hand to clear the backlog.
Flags: needinfo?(rail)
Summary: watch pending seems to be having trouble bringing up spot instances → watch pending and misc.py scheduling make different decisions about which slave to use for a retried job
Assignee: nobody → catlee
Flags: needinfo?(catlee)
Attachment #8396420 - Flags: review?(rail)
Comment on attachment 8396420 [details] [diff] [review]
count retries per buildername

lgtm
Attachment #8396420 - Flags: review?(rail) → review+
Attachment #8396420 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard