The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire

RESOLVED FIXED

Status

Release Engineering
General Automation
--
critical
RESOLVED FIXED
9 months ago
8 months ago

People

(Reporter: Tomcat, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [bbb], URL)

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(5 attachments)

(Reporter)

Description

9 months ago
like in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95 there are canceled jobs that should not be canceled because of valid test runs.

We somehow cancel jobs we need and so reduce somehow our testcoverage and risk regressions are not catched as soon they would appear in this tests.
(Reporter)

Comment 1

9 months ago
catlee, rail: could you take a look at this, it happened also on aurora today
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
(Assignee)

Updated

9 months ago
Flags: needinfo?(catlee)
Whiteboard: [bbb]
(Assignee)

Comment 2

9 months ago
Some notes from looking at https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95&selectedJob=84260421

On 2017-03-16, times in PDT

* 03:24:47 taskId KrkTi61rTYuH-_iE99-wXg is created
* 03:50:54 taskId KrkTi61rTYuH-_iE99-wXg is scheduled
* 03:51:14 build request id 143320688 is created
* 03:59:21 build starts
* 03:59:32 taskId KrkTi61rTYuH-_iE99-wXg is started
* 04:01:35 reflector logs: task KrkTi61rTYuH-_iE99-wXg: buildrequest 143320688: Build hasn't started yet, nothing to do
* 04:10:42 reflector starts processing 921 tasks
* 04:11:36 reflector logs: task KrkTi61rTYuH-_iE99-wXg: processing task (582/921) (Note: doesn't reclaim here)
* 04:12:01 reflector completes processing 921/921
* 04:19:32 taskId KrkTi61rTYuH-_iE99-wXg fails with 'claim-expired'
* 04:21:01 reflector logs: ask KrkTi61rTYuH-_iE99-wXg: run 0: buildrequest 143320688: Claim for BuildRequest will expire in less than 5min, reclaiming
* 04:21:02 TaskclusterRestFailure: Run 0 on task KrkTi61rTYuH-_iE99-wXg is resolved or not running.
* 04:21:02 task KrkTi61rTYuH-_iE99-wXg: run 0: deadline exceeded; cancelling it

So the problem looks like the task in taskcluster expired because BBB didn't renew its claim soon enough.

BBB only reclaims tasks if there are 5 minutes or less until the deadline. The deadline was 04:19:32, so we would have needed to look at the task sometime after 04:14:32.

bm82 started a reflector loop at 04:13:01 and didn't finish until 04:22:27. It doesn't inspect our task until 04:21:02.

Sometimes the reflector loop finishes in a few minutes, and sometimes it takes nearly 10 minutes.
(Assignee)

Comment 3

9 months ago
Just a few ideas:

* Sort the tasks to process by their deadline, so we can reflect the ones closest to expiring first
* re-claim tasks asynchronously. Each HTTP call takes nearly one second. As we have more active tasks, we are taking longer to reflect of all of them back
* schedule task re-claiming. Instead of processing the tasks in a loop, we could simply schedule a reclaim call at the appropriate time.
* reclaim tasks when there are 10 minutes before expiring. this is the simplest short term thing we can do, but won't really fix the problem long-term.
* deleting tasks from bbbdb seems to take a while sometimes:

Mar 16 04:14:22 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.servicebase - buildrequest 143319937: deleting task from bbbdb
Mar 16 04:14:31 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task GH70v936ST-h1A2-wCX1yw: processing task (299/899)

this prevents processing of subsequent tasks.
(Assignee)

Updated

9 months ago
Summary: The web-page 'stop build' button was pressed by 'buildbot-bridge': cancels valid jobs :( → The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire
Created attachment 8848088 [details]
bbb.png

Note sure how useful is this but by checking this cancel job (1) I found in papertrail this part: 

Mar 16 04:17:47 buildbot-master84.bb.releng.scl3.mozilla.com python:  buildbot_bridge_tclistener bbb.services - task Hb9SbBEaQjaSolzVONeJKQ: run 1: Buildername Rev7 MacOSX Yosemite 10.10.5 autoland debug test mochitest-e10s-browser-chrome-3 matches an ignore pattern, doing nothing
 
This means that a condition for ignore pattern is found for this job (2) with more details in Bug 1201861,maybe that's why bbb is letting claims expire.


(1) https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=3810b44d85e5187378fba2ac2adeecd3edf36ce4&filter-resultStatus=usercancel&filter-resultStatus=runnable&selectedJob=84259157

(2) https://github.com/mozilla-releng/buildbot-bridge/blob/master/bbb/services.py#L522
Created attachment 8848092 [details]
bbb_1.png
(Assignee)

Comment 6

9 months ago
Yes, for sure, if these are being ignored by BBB, then it won't be reclaiming them. It shouldn't be scheduling them in the first place though.
(Assignee)

Comment 7

9 months ago
The dev BBB instances only watch alder, jamun, and date branches. It's expected that they ignore autoland.
(Assignee)

Comment 8

9 months ago
Created attachment 8848135 [details] [review]
Add metrics; increase deadline window; sort tasks by deadline
Attachment #8848135 - Flags: review?(rail)
Comment hidden (mozreview-request)
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1

https://reviewboard.mozilla.org/r/121112/#review123040
Attachment #8848139 - Flags: review?(rail) → review+
Flags: needinfo?(rail)
Attachment #8848135 - Flags: review?(rail) → review+
(Assignee)

Updated

9 months ago
Assignee: nobody → catlee
(Assignee)

Comment 11

9 months ago
https://hg.mozilla.org/build/puppet/rev/d05c32f3139860abcad993b9d7b06335ad7885c5
Bug 1347912: Add statsd to BBB virtualenvs r=rail
Comment hidden (mozreview-request)
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1

https://reviewboard.mozilla.org/r/121112/#review123092
(Assignee)

Comment 14

9 months ago
https://hg.mozilla.org/build/puppet/rev/8e577f2b01929062b8f96c7648969d1c5b66362d
Bug 1347912: Update production with new BBB r=rail
(Assignee)

Comment 15

9 months ago
Created attachment 8848494 [details] [review]
Ignore reclaimTask failures when a build is complete
Attachment #8848494 - Flags: review?(rail)
Attachment #8848494 - Flags: review?(rail) → review+
(Assignee)

Updated

9 months ago
Blocks: 1340619
Comment hidden (mozreview-request)
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1

https://reviewboard.mozilla.org/r/121112/#review123452
(Assignee)

Comment 18

9 months ago
https://hg.mozilla.org/build/puppet/rev/817ad4a6ef7aaaf2b6939a3224c7d3c1fde951cc
Bug 1347912: Bump BBB version to 1.6.1 r=rail
(Assignee)

Comment 19

9 months ago
I believe this is fixed now.
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → FIXED
(Reporter)

Comment 20

9 months ago
(In reply to Chris AtLee [:catlee] from comment #19)
> I believe this is fixed now.

it came back :(((( 

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157151
and
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157154

aka: https://treeherder.mozilla.org/logviewer.html#?job_id=86157154&repo=mozilla-aurora&lineNumber=15753
and: https://treeherder.mozilla.org/logviewer.html#?job_id=86157151&repo=mozilla-aurora&lineNumber=14583

seems only happend on aurora so far today.
Status: RESOLVED → REOPENED
Flags: needinfo?(catlee)
Resolution: FIXED → ---
(Assignee)

Comment 21

9 months ago
I took a look at this task:
https://tools.taskcluster.net/task-inspector/#SlO47j0dQnCHn5RarbLA_w which corresponds to this job on TH: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157151

Here are the buildbot-bridge logs:
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: injecting task into bb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - Created buildset_property taskId=["SlO47j0dQnCHn5RarbLA_w", "bbb"]
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: creating task
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: injected into bb
Mar 24 01:50:07 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_bblistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: claiming
Mar 24 01:50:46 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/523)
Mar 24 01:50:47 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/521)
Mar 24 01:51:00 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (42/521)
Mar 24 01:52:12 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:52:13 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:53:19 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (120/515)
Mar 24 01:53:22 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (126/515)
Mar 24 01:53:58 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (131/515)
Mar 24 01:54:46 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (155/505)
Mar 24 01:54:48 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/502)
Mar 24 01:55:52 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/496)
Mar 24 01:56:00 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/495)
Mar 24 01:57:04 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/488)
Mar 24 01:58:41 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (162/486)
Mar 24 01:59:59 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 1: buildrequest 144084934: updating run id
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: Claim for BuildRequest will expire in less than 10min, reclaiming
Mar 24 02:10:18 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: deadline exceeded; cancelling it
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (270/528)
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: BuildRequest is done. BBListener should process it soon, reclaiming in the meantime
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: got 409 when reclaiming task; assuming task is complete 


We were taking a long time reclaiming all the tasks right around 02:00 Pacific time, and weren't able to reclaim this particular one in time.
Flags: needinfo?(catlee)
(Assignee)

Updated

8 months ago
Status: REOPENED → RESOLVED
Last Resolved: 9 months ago8 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.