Closed
Bug 1347912
Opened 8 years ago
Closed 8 years ago
The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: cbook, Assigned: catlee)
References
()
Details
(Whiteboard: [bbb])
Attachments
(5 files)
like in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95 there are canceled jobs that should not be canceled because of valid test runs.
We somehow cancel jobs we need and so reduce somehow our testcoverage and risk regressions are not catched as soon they would appear in this tests.
Reporter | ||
Comment 1•8 years ago
|
||
catlee, rail: could you take a look at this, it happened also on aurora today
Flags: needinfo?(rail)
Flags: needinfo?(catlee)
Assignee | ||
Updated•8 years ago
|
Flags: needinfo?(catlee)
Whiteboard: [bbb]
Assignee | ||
Comment 2•8 years ago
|
||
Some notes from looking at https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=8a1c36e9ba805f20ec6d63793f18268516d69d95&selectedJob=84260421
On 2017-03-16, times in PDT
* 03:24:47 taskId KrkTi61rTYuH-_iE99-wXg is created
* 03:50:54 taskId KrkTi61rTYuH-_iE99-wXg is scheduled
* 03:51:14 build request id 143320688 is created
* 03:59:21 build starts
* 03:59:32 taskId KrkTi61rTYuH-_iE99-wXg is started
* 04:01:35 reflector logs: task KrkTi61rTYuH-_iE99-wXg: buildrequest 143320688: Build hasn't started yet, nothing to do
* 04:10:42 reflector starts processing 921 tasks
* 04:11:36 reflector logs: task KrkTi61rTYuH-_iE99-wXg: processing task (582/921) (Note: doesn't reclaim here)
* 04:12:01 reflector completes processing 921/921
* 04:19:32 taskId KrkTi61rTYuH-_iE99-wXg fails with 'claim-expired'
* 04:21:01 reflector logs: ask KrkTi61rTYuH-_iE99-wXg: run 0: buildrequest 143320688: Claim for BuildRequest will expire in less than 5min, reclaiming
* 04:21:02 TaskclusterRestFailure: Run 0 on task KrkTi61rTYuH-_iE99-wXg is resolved or not running.
* 04:21:02 task KrkTi61rTYuH-_iE99-wXg: run 0: deadline exceeded; cancelling it
So the problem looks like the task in taskcluster expired because BBB didn't renew its claim soon enough.
BBB only reclaims tasks if there are 5 minutes or less until the deadline. The deadline was 04:19:32, so we would have needed to look at the task sometime after 04:14:32.
bm82 started a reflector loop at 04:13:01 and didn't finish until 04:22:27. It doesn't inspect our task until 04:21:02.
Sometimes the reflector loop finishes in a few minutes, and sometimes it takes nearly 10 minutes.
Assignee | ||
Comment 3•8 years ago
|
||
Just a few ideas:
* Sort the tasks to process by their deadline, so we can reflect the ones closest to expiring first
* re-claim tasks asynchronously. Each HTTP call takes nearly one second. As we have more active tasks, we are taking longer to reflect of all of them back
* schedule task re-claiming. Instead of processing the tasks in a loop, we could simply schedule a reclaim call at the appropriate time.
* reclaim tasks when there are 10 minutes before expiring. this is the simplest short term thing we can do, but won't really fix the problem long-term.
* deleting tasks from bbbdb seems to take a while sometimes:
Mar 16 04:14:22 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.servicebase - buildrequest 143319937: deleting task from bbbdb
Mar 16 04:14:31 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task GH70v936ST-h1A2-wCX1yw: processing task (299/899)
this prevents processing of subsequent tasks.
Assignee | ||
Updated•8 years ago
|
Summary: The web-page 'stop build' button was pressed by 'buildbot-bridge': cancels valid jobs :( → The web-page 'stop build' button was pressed by 'buildbot-bridge': bbb is letting claims expire
Comment 4•8 years ago
|
||
Note sure how useful is this but by checking this cancel job (1) I found in papertrail this part:
Mar 16 04:17:47 buildbot-master84.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task Hb9SbBEaQjaSolzVONeJKQ: run 1: Buildername Rev7 MacOSX Yosemite 10.10.5 autoland debug test mochitest-e10s-browser-chrome-3 matches an ignore pattern, doing nothing
This means that a condition for ignore pattern is found for this job (2) with more details in Bug 1201861,maybe that's why bbb is letting claims expire.
(1) https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=3810b44d85e5187378fba2ac2adeecd3edf36ce4&filter-resultStatus=usercancel&filter-resultStatus=runnable&selectedJob=84259157
(2) https://github.com/mozilla-releng/buildbot-bridge/blob/master/bbb/services.py#L522
Comment 5•8 years ago
|
||
Assignee | ||
Comment 6•8 years ago
|
||
Yes, for sure, if these are being ignored by BBB, then it won't be reclaiming them. It shouldn't be scheduling them in the first place though.
Assignee | ||
Comment 7•8 years ago
|
||
The dev BBB instances only watch alder, jamun, and date branches. It's expected that they ignore autoland.
Assignee | ||
Comment 8•8 years ago
|
||
Attachment #8848135 -
Flags: review?(rail)
Comment hidden (mozreview-request) |
Comment 10•8 years ago
|
||
mozreview-review |
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1
https://reviewboard.mozilla.org/r/121112/#review123040
Attachment #8848139 -
Flags: review?(rail) → review+
Updated•8 years ago
|
Flags: needinfo?(rail)
Attachment #8848135 -
Flags: review?(rail) → review+
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → catlee
Assignee | ||
Comment 11•8 years ago
|
||
https://hg.mozilla.org/build/puppet/rev/d05c32f3139860abcad993b9d7b06335ad7885c5
Bug 1347912: Add statsd to BBB virtualenvs r=rail
Comment hidden (mozreview-request) |
Comment 13•8 years ago
|
||
mozreview-review |
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1
https://reviewboard.mozilla.org/r/121112/#review123092
Assignee | ||
Comment 14•8 years ago
|
||
https://hg.mozilla.org/build/puppet/rev/8e577f2b01929062b8f96c7648969d1c5b66362d
Bug 1347912: Update production with new BBB r=rail
Assignee | ||
Comment 15•8 years ago
|
||
Attachment #8848494 -
Flags: review?(rail)
Updated•8 years ago
|
Attachment #8848494 -
Flags: review?(rail) → review+
Comment hidden (mozreview-request) |
Comment 17•8 years ago
|
||
mozreview-review |
Comment on attachment 8848139 [details]
Bug 1347912: Bump BBB version to 1.6.1
https://reviewboard.mozilla.org/r/121112/#review123452
Assignee | ||
Comment 18•8 years ago
|
||
https://hg.mozilla.org/build/puppet/rev/817ad4a6ef7aaaf2b6939a3224c7d3c1fde951cc
Bug 1347912: Bump BBB version to 1.6.1 r=rail
Assignee | ||
Comment 19•8 years ago
|
||
I believe this is fixed now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 20•8 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #19)
> I believe this is fixed now.
it came back :((((
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157151
and
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157154
aka: https://treeherder.mozilla.org/logviewer.html#?job_id=86157154&repo=mozilla-aurora&lineNumber=15753
and: https://treeherder.mozilla.org/logviewer.html#?job_id=86157151&repo=mozilla-aurora&lineNumber=14583
seems only happend on aurora so far today.
Status: RESOLVED → REOPENED
Flags: needinfo?(catlee)
Resolution: FIXED → ---
Assignee | ||
Comment 21•8 years ago
|
||
I took a look at this task:
https://tools.taskcluster.net/task-inspector/#SlO47j0dQnCHn5RarbLA_w which corresponds to this job on TH: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&noautoclassify&selectedJob=86157151
Here are the buildbot-bridge logs:
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: injecting task into bb
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - Created buildset_property taskId=["SlO47j0dQnCHn5RarbLA_w", "bbb"]
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: creating task
Mar 24 01:49:54 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: injected into bb
Mar 24 01:50:07 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_bblistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: claiming
Mar 24 01:50:46 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/523)
Mar 24 01:50:47 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (28/521)
Mar 24 01:51:00 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (42/521)
Mar 24 01:52:12 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:52:13 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (95/507)
Mar 24 01:53:19 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (120/515)
Mar 24 01:53:22 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (126/515)
Mar 24 01:53:58 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (131/515)
Mar 24 01:54:46 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (155/505)
Mar 24 01:54:48 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/502)
Mar 24 01:55:52 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/496)
Mar 24 01:56:00 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (151/495)
Mar 24 01:57:04 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (152/488)
Mar 24 01:58:41 buildbot-master86.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (162/486)
Mar 24 01:59:59 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.servicebase - task SlO47j0dQnCHn5RarbLA_w: fetching task from bbbdb
Mar 24 02:10:10 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_tclistener bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 1: buildrequest 144084934: updating run id
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (189/491)
Mar 24 02:10:17 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: Claim for BuildRequest will expire in less than 10min, reclaiming
Mar 24 02:10:18 buildbot-master72.bb.releng.usw2.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: deadline exceeded; cancelling it
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: processing task (270/528)
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: BuildRequest is done. BBListener should process it soon, reclaiming in the meantime
Mar 24 02:10:50 buildbot-master82.bb.releng.scl3.mozilla.com python: buildbot_bridge_reflector bbb.services - task SlO47j0dQnCHn5RarbLA_w: run 0: buildrequest 144084934: got 409 when reclaiming task; assuming task is complete
We were taking a long time reclaiming all the tasks right around 02:00 Pacific time, and weren't able to reclaim this particular one in time.
Flags: needinfo?(catlee)
Assignee | ||
Updated•8 years ago
|
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•