The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire

RESOLVED FIXED

Status

--
critical
RESOLVED FIXED
2 years ago
9 months ago

People

(Reporter: cbook, Assigned: catlee)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [bbb], URL)

Attachments

(5 attachments)

(Reporter)

Description

2 years ago
like in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95 there are canceled jobs that should not be canceled because of valid test runs.

We somehow cancel jobs we need and so reduce somehow our testcoverage and risk regressions are not catched as soon they would appear in this tests.
(Reporter)

Comment 1

2 years ago
catlee, rail: could you take a look at this, it happened also on aurora today
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
(Assignee)

Updated

2 years ago
Flags: needinfo?(catlee)
Whiteboard: [bbb]
(Assignee)

Comment 2

2 years ago
Some notes from looking at https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95&selectedJob=84260421

On 2017-03-16, times in PDT

* 03:24:47 taskId KrkTi61rTYuH-_iE99-wXg is created
* 03:50:54 taskId KrkTi61rTYuH-_iE99-wXg is scheduled
* 03:51:14 build request id 143320688 is created
* 03:59:21 build starts
* 03:59:32 taskId KrkTi61rTYuH-_iE99-wXg is started
* 04:01:35 reflector logs: task KrkTi61rTYuH-_iE99-wXg: buildrequest 143320688: Build hasn't started yet, nothing to do
* 04:10:42 reflector starts processing 921 tasks
* 04:11:36 reflector logs: task KrkTi61rTYuH-_iE99-wXg: processing task (582/921) (Note: doesn't reclaim here)
* 04:12:01 reflector completes processing 921/921
* 04:19:32 taskId KrkTi61rTYuH-_iE99-wXg fails with 'claim-expired'
* 04:21:01 reflector logs: ask KrkTi61rTYuH-_iE99-wXg: run 0: buildrequest 143320688: Claim for BuildRequest will expire in less than 5min, reclaiming
* 04:21:02 TaskclusterRestFailure: Run 0 on task KrkTi61rTYuH-_iE99-wXg is resolved or not running.
* 04:21:02 task KrkTi61rTYuH-_iE99-wXg: run 0: deadline exceeded; cancelling it

So the problem looks like the task in taskcluster expired because BBB didn't renew its claim soon enough.

BBB only reclaims tasks if there are 5 minutes or less until the deadline. The deadline was 04:19:32, so we would have needed to look at the task sometime after 04:14:32.

bm82 started a reflector loop at 04:13:01 and didn't finish until 04:22:27. It doesn't inspect our task until 04:21:02.

Sometimes the reflector loop finishes in a few minutes, and sometimes it takes nearly 10 minutes.
(Assignee)

Comment 3

2 years ago
Just a few ideas:

* Sort the tasks to process by their deadline, so we can reflect the ones closest to expiring first
* re-claim tasks asynchronously. Each HTTP call takes nearly one second. As we have more active tasks, we are taking longer to reflect of all of them back
* schedule task re-claiming. Instead of processing the tasks in a loop, we could simply schedule a reclaim call at the appropriate time.
* reclaim tasks when there are 10 minutes before expiring. this is the simplest short term thing we can do, but won't really fix the problem long-term.
* deleting tasks from bbbdb seems to take a while sometimes:

Mar 16 04:14:22 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.servicebase - buildrequest 143319937: deleting task from bbbdb
Mar 16 04:14:31 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task GH70v936ST-h1A2-wCX1yw: processing task (299/899)

this prevents processing of subsequent tasks.
(Assignee)

Updated

2 years ago
Summary: The web-page 'stop build' button was pressed by 'buildbot-bridge': cancels valid jobs :( → The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire
Created attachment 8848088 [details]
bbb.png

Note sure how useful is this but by checking this cancel job (1) I found in papertrail this part: 

Mar 16 04:17:47 buildbot-master84.bb.releng.scl3.mozilla.com python:  buildbot_bridge_tclistener bbb.services - task Hb9SbBEaQjaSolzVONeJKQ: run 1: Buildername Rev7 MacOSX Yosemite 10.10.5 autoland debug test mochitest-e10s-browser-chrome-3 matches an ignore pattern, doing nothing
 
This means that a condition for ignore pattern is found for this job (2) with more details in Bug 1201861,maybe that's why bbb is letting claims expire.


(1) https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=3810b44d85e5187378fba2ac2adeecd3edf36ce4&filter-resultStatus=usercancel&filter-resultStatus=runnable&selectedJob=84259157

(2) https://github.com/mozilla-releng/buildbot-bridge/blob/master/bbb/services.py#L522
(Assignee)

Comment 6

2 years ago
Yes, for sure, if these are being ignored by BBB, then it won't be reclaiming them. It shouldn't be scheduling them in the first place though.
(Assignee)

Comment 7

2 years ago
The dev BBB instances only watch alder, jamun, and date branches. It's expected that they ignore autoland.
(Assignee)

Comment 8

2 years ago
Created attachment 8848135 [details] [review]
Add metrics; increase deadline window; sort tasks by deadline
Attachment #8848135 - Flags: review?(rail)
Comment hidden (mozreview-request)

Comment 10

2 years ago
mozreview-review
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1

https://reviewboard.mozilla.org/r/121112/#review123040
Attachment #8848139 - Flags: review?(rail) → review+

Updated

2 years ago
Flags: needinfo?(rail)
Attachment #8848135 - Flags: review?(rail) → review+
(Assignee)

Updated

2 years ago
Assignee: nobody → catlee
Comment hidden (mozreview-request)
(Assignee)

Comment 15

2 years ago
Created attachment 8848494 [details] [review]
Ignore reclaimTask failures when a build is complete
Attachment #8848494 - Flags: review?(rail)

Updated

2 years ago
Attachment #8848494 - Flags: review?(rail) → review+
(Assignee)

Updated

2 years ago
Blocks: 1340619
Comment hidden (mozreview-request)
(Assignee)

Comment 19

2 years ago
I believe this is fixed now.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
(Assignee)

Comment 21

2 years ago
I took a look at this task:
https://tools.taskcluster.net/task-inspector/#SlO47j0dQnCHn5RarbLA_w which corresponds to this job on TH: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157151

Here are the buildbot-bridge logs:
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: injecting task into bb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - Created buildset_property taskId=["SlO47j0dQnCHn5RarbLA_w", "bbb"]
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: creating task
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: injected into bb
Mar 24 01:50:07 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_bblistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: claiming
Mar 24 01:50:46 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/523)
Mar 24 01:50:47 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/521)
Mar 24 01:51:00 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (42/521)
Mar 24 01:52:12 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:52:13 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:53:19 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (120/515)
Mar 24 01:53:22 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (126/515)
Mar 24 01:53:58 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (131/515)
Mar 24 01:54:46 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (155/505)
Mar 24 01:54:48 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/502)
Mar 24 01:55:52 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/496)
Mar 24 01:56:00 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/495)
Mar 24 01:57:04 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/488)
Mar 24 01:58:41 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (162/486)
Mar 24 01:59:59 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 1: buildrequest 144084934: updating run id
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: Claim for BuildRequest will expire in less than 10min, reclaiming
Mar 24 02:10:18 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: deadline exceeded; cancelling it
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (270/528)
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: BuildRequest is done. BBListener should process it soon, reclaiming in the meantime
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: got 409 when reclaiming task; assuming task is complete 


We were taking a long time reclaiming all the tasks right around 02:00 Pacific time, and weren't able to reclaim this particular one in time.
Flags: needinfo?(catlee)
(Assignee)

Updated

2 years ago
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago2 years ago
Resolution: --- → FIXED
Component: General Automation → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.