Closed Bug 986445 Opened 10 years ago Closed 10 years ago

watch pending and misc.py scheduling make different decisions about which slave to use for a retried job

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: catlee)

Details

Attachments

(1 file)

This morning I noticed that there were about 100 pending test jobs from as far back as 8 hours ago. Looking in the watch pending log I see a lot of:
2014-03-21 05:39:48,447 - DEBUG - waiting for spot request
2014-03-21 05:39:53,750 - DEBUG - 39 interfaces in us-east-1c
2014-03-21 05:39:53,751 - DEBUG - using tst-linux32-spot-146.test.releng.use1.mozilla.com
2014-03-21 05:39:53,751 - DEBUG - Spot request for tst-linux32-spot-146.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:39:54,972 - ERROR - 400 Bad Request
2014-03-21 05:39:54,972 - ERROR - <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-27990649' does not exist</Message></Error></Errors><RequestID>98df7227-da42-4dd2-b253-5afb27013cfb</RequestID></Response>
2014-03-21 05:39:54,972 - DEBUG - waiting for spot request
2014-03-21 05:40:00,249 - DEBUG - 38 interfaces in us-east-1c
2014-03-21 05:40:00,249 - DEBUG - using tst-linux32-spot-002.test.releng.use1.mozilla.com
2014-03-21 05:40:00,249 - DEBUG - Spot request for tst-linux32-spot-002.test.releng.use1.mozilla.com (0.09)
2014-03-21 05:40:01,515 - ERROR - 400 Bad Request
2014-03-21 05:40:01,515 - ERROR - <?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidSpotInstanceRequestID.NotFound</Code><Message>The spot instance request ID 'sir-9586c649' does not exist</Message></Error></Errors><RequestID>91baddbb-2ab1-4a22-b285-ae7ea9d13867</RequestID></Response>
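The `InvalidSpotInstanceRequestID.NotFound` responses are consistent with EC2 describe calls being eventually consistent: a freshly submitted spot request id can be invisible to a describe for several seconds. As a hypothetical sketch (not the actual cloud-tools code; `describe` and the exception class are stand-ins), a poll loop that treats NotFound as transient rather than a hard failure might look like:

```python
import time


class SpotRequestNotFound(Exception):
    """Stand-in for the NotFound error code seen in the log above."""


def wait_for_spot_request(describe, request_id, timeout=60, interval=5):
    """Poll describe(request_id) until it returns a result, treating
    SpotRequestNotFound as transient for up to `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return describe(request_id)
        except SpotRequestNotFound:
            # Request id not visible yet; wait and retry instead of
            # logging an ERROR and moving on to a different hostname.
            time.sleep(interval)
    raise TimeoutError("spot request %s never became visible" % request_id)
```

The log suggests watch pending instead logged the 400 as an error and immediately retried with a different interface/hostname, which would explain the long run of back-to-back failures.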


Since the log was rotated two days ago, there have been almost 2000 of these messages. The old log (which goes back to December) has 800.

Curiously, all of the pending jobs were 32-bit mozilla-inbound tests...which makes me wonder if this is some sort of prioritization issue instead.

AFAIK trees are open and we're in fine shape, so not marking as a blocker.
Also curious: tst-linux32-spot-195 is connecting to buildbot-master67, which sees pending jobs on a builder for that slave, but it's not starting them.
I found this on buildbot-master67's log:
2014-03-21 05:46:17-0700 [-] nextAWSSlave: 2 retries for Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc
2014-03-21 05:46:17-0700 [-] nextAWSSlave: No slaves appropriate for retried job - returning None
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: want to start build, but we don't have a remote
2014-03-21 05:46:17-0700 [-] <Builder ''Ubuntu VM 12.04 mozilla-inbound opt test crashtest-ipc'' at 203700776>: got assignments: {}
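The "No slaves appropriate for retried job - returning None" line suggests that nextAWSSlave refuses spot slaves once it sees retries for a job, and gives up if no on-demand slave is connected. A hedged sketch of that decision (the function body and slave dicts are illustrative, not the real misc.py code):

```python
def next_aws_slave(available_slaves, retries):
    """Pick a slave for a build. For retried jobs, avoid spot slaves
    (which may have caused the retry) and insist on on-demand hardware;
    return None if no on-demand slave is connected."""
    if retries > 0:
        ondemand = [s for s in available_slaves if not s.get("is_spot")]
        # No on-demand slave connected -> the job stays pending; this is
        # the "returning None" message in the master log above.
        return ondemand[0] if ondemand else None
    return available_slaves[0] if available_slaves else None


slaves = [{"name": "tst-linux32-spot-195", "is_spot": True}]
print(next_aws_slave(slaves, retries=2))           # None: only a spot slave is up
print(next_aws_slave(slaves, retries=0)["name"])   # tst-linux32-spot-195
```

Under this logic the master will never start the job on the connected spot slave, so it waits indefinitely for on-demand capacity that watch pending never brings up.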


The master seems to think it's a retry...but looking at https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=dbc31b57cbd7 I don't see an original test. Looking even further back, I see pending Cipc tests all the way back to https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=767ce92ebaf1, which _does_ have a retry.

So, it sounds like something is confused about this being a retry (perhaps because there are pending jobs from newer revs than the one that was retried), or watch pending isn't firing up on-demand builders to deal with the retry.
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
I don't think we're going to get test coverage for most 32-bit Linux desktop jobs until this is fixed, raising priority.
Severity: major → critical
Debugging is happening on IRC:

09:10 < rail> looks like it looks for all available brids for a builder
09:11 < rail> builder._getBuildable
09:11 < rail> so if we have a lot of retries that may affect all pending?
09:12 <@catlee-away> watch pending should start up ondemand for those though
09:16 < rail> looks like watch pending distinguishes brids, but misc.py uses only builders, so they have different ideas about this
09:16 <@catlee-away> ah, so you could have 2 builds, each with 1 retry, and watch pending wouldn't start ondemand. but misc.py would see 2 retries?
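The mismatch rail describes can be shown concretely. In a hypothetical sketch (field names are illustrative, not actual releng data structures), watch pending keyed retries by build request id (brid) while misc.py summed them across a builder, so the two reached different conclusions about the same queue; the attached patch aligns them by counting retries per buildername:

```python
# Illustrative data: two build requests on one builder, one retry each.
pending = [
    {"brid": 101, "buildername": "crashtest-ipc", "retries": 1},
    {"brid": 102, "buildername": "crashtest-ipc", "retries": 1},
]


def retries_per_brid(requests):
    # watch pending's view: retries keyed by build request id
    return {r["brid"]: r["retries"] for r in requests}


def retries_per_builder(requests):
    # misc.py's view: retries summed across the whole builder
    totals = {}
    for r in requests:
        totals[r["buildername"]] = totals.get(r["buildername"], 0) + r["retries"]
    return totals


# Per brid, no single request looks heavily retried, so watch pending
# sees no reason to start on-demand capacity...
print(retries_per_brid(pending))      # {101: 1, 102: 1}
# ...but per builder the count is 2, so the master treats the builder as
# retried, refuses spot slaves, and the jobs stay pending.
print(retries_per_builder(pending))   # {'crashtest-ipc': 2}
```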


I've started some on-demand instances by hand to clear the backlog.
Flags: needinfo?(rail)
Summary: watch pending seems to be having trouble bringing up spot instances → watch pending and misc.py scheduling make different decisions about which slave to use for a retried job
Assignee: nobody → catlee
Flags: needinfo?(catlee)
Attachment #8396420 - Flags: review?(rail)
Comment on attachment 8396420 [details] [diff] [review]
count retries per buildername

lgtm
Attachment #8396420 - Flags: review?(rail) → review+
Attachment #8396420 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard