'Resetting dropped connection: bugbug.herokuapp.com' causes gecko decision task timeout, should fall back to schedule tasks
Categories
(Firefox Build System :: Task Configuration, defect, P1)
Tracking
(Not tracked)
People
(Reporter: aryx, Unassigned)
Details
Attachments
(1 file)
'Resetting dropped connection: bugbug.herokuapp.com' causes gecko decision task timeout, should fall back to schedule tasks according to Marco, but scheduling everything will likely overwhelm CI with the load generated.
https://treeherder.mozilla.org/logviewer?job_id=345914361&repo=autoland
[task 2021-07-21T17:53:40.508Z] Generating optimized task graph
[task 2021-07-21T17:53:45.218Z] Reading file: /builds/worker/checkouts/gecko/moz.build
[task 2021-07-21T17:53:45.283Z] Reading file: /builds/worker/checkouts/gecko/docshell/moz.build
[task 2021-07-21T17:53:45.310Z] Reading file: /builds/worker/checkouts/gecko/docshell/base/moz.build
[task 2021-07-21T17:53:45.318Z] Reading file: /builds/worker/checkouts/gecko/netwerk/moz.build
[task 2021-07-21T17:53:45.321Z] Reading file: /builds/worker/checkouts/gecko/netwerk/base/moz.build
[task 2021-07-21T17:53:45.328Z] Reading file: /builds/worker/checkouts/gecko/netwerk/protocol/moz.build
[task 2021-07-21T17:53:45.331Z] Reading file: /builds/worker/checkouts/gecko/netwerk/protocol/http/moz.build
[task 2021-07-21T17:53:45.351Z] Resetting dropped connection: bugbug.herokuapp.com
[task 2021-07-21T17:53:45.381Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T17:53:55.400Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T17:54:05.415Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
...
[task 2021-07-21T18:09:56.983Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:07.001Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:17.011Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:27.022Z] Resetting dropped connection: bugbug.herokuapp.com
[task 2021-07-21T18:10:27.046Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:37.065Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:47.089Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:57.108Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:07.127Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:17.146Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:27.166Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:37.185Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:47.335Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-emulator-debug
[task 2021-07-21T18:11:47.335Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-emulator-release
[task 2021-07-21T18:11:47.336Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-hw-p2-debug
Comment 1•4 years ago
Decision tasks still take a very long time and sometimes need to be rerun. This can cause tree closures and can even render pushes unusable for releases when the decision task manages to schedule part of the taskgraph but not all of it.
This has a high priority and needs a fix today.
Comment 2•4 years ago
The bugbug issue should have been fixed. We still need a fix for this bug (fallback not working), so that when the service happens to be down we don't stop the world.
Comment 3•4 years ago
The fallback is working, but I guess the duration of Decision tasks has been creeping up to the point that a 9-minute delay (which is how long we wait before switching to the fallback) causes them to go over the 30-minute threshold.
So we need to bump the Decision task timeout and/or wait less time for bugbug to reply. As a short-term solution, I propose we bump the timeout to 60 minutes, then Marco can disable bugbug on autoland for now to let the dynos catch up to the load. Once things are back in a good state we can turn bugbug back on for autoland.
Longer term we should:
- Prioritize Decision task performance to try and get the duration back down to something reasonable.
- Add some logic to bugbug to start dropping pushes on autoland if the load gets too high. Maybe we could even add an endpoint to cancel a request, which taskgraph can call after it has given up and switched to the fallback (so bugbug knows not to bother).
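A minimal sketch of the wait-then-fall-back behaviour discussed above, not the actual taskgraph code. It assumes a hypothetical `fetch` callable that returns a `(status_code, payload)` pair, treats 202 as "not ready yet" (as in the log in the description), polls every 10 seconds as the log timestamps suggest, and gives up after 9 minutes. All function and parameter names here are made up for illustration.

```python
import time


class SchedulesTimeout(Exception):
    """Raised when the service has not produced a result before the deadline."""


def poll_schedules(fetch, timeout_s=9 * 60, interval_s=10,
                   clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it returns 200 or the deadline would be passed.

    fetch is a hypothetical callable returning (status_code, payload).
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        status, payload = fetch()
        if status == 200:
            return payload
        if status != 202:
            raise RuntimeError(f"unexpected status {status}")
        # 202 means the result is still being computed; stop if another
        # wait would take us past the deadline.
        if clock() + interval_s > deadline:
            raise SchedulesTimeout(f"no result after {timeout_s}s")
        sleep(interval_s)


def schedules_or_fallback(fetch, fallback, **kwargs):
    """Return the service's answer, or fallback() if it timed out."""
    try:
        return poll_schedules(fetch, **kwargs)
    except SchedulesTimeout:
        return fallback()


# Usage with a fake service that is ready on the third request.
responses = iter([(202, None), (202, None), (200, {"tasks": {}})])
result = schedules_or_fallback(lambda: next(responses), lambda: "fallback",
                               timeout_s=1, interval_s=0)
print(result)
```

Lowering `timeout_s` is the "wait less time for bugbug" option; the other option in this comment, raising the Decision task's own time limit, lives outside this loop.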
Comment 4•4 years ago
Decision task durations have been creeping over the 20-minute mark, which means we no longer have a lot of leeway when things time out (like waiting for bugbug). Longer term we should focus on reducing Decision task duration, but for now this is needed to re-open autoland.
Comment 6•4 years ago
> Decision task durations have been creeping over the 20 minute mark
That is only true when bugbug is slow. Normal decision tasks take 2 to 3 minutes + hg clone time. See https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&series=autoland,2680290,1,2&timerange=1209600 (the graph doesn't include hg clone time).
Comment 7•4 years ago
Note how the times are essentially x, x+9, x+18 and x+27. (2 minutes < x < 3 minutes)
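To make the pattern concrete, here is a small sketch of the arithmetic, assuming each bugbug timeout adds the 9-minute wait from comment 3 to a base duration x. The function name, the sample value x = 2.5 and the `clone_min` parameter are illustrative assumptions; the thread gives no figure for hg clone time, so it defaults to zero here.

```python
FALLBACK_WAIT_MIN = 9  # wait before switching to the fallback (comment 3)


def decision_duration(base_min, timeouts_hit, clone_min=0.0):
    """Estimated Decision task duration in minutes.

    base_min:     normal duration x, with 2 < x < 3 per this comment
    timeouts_hit: how many times the full bugbug wait was paid
    clone_min:    hg clone time (hypothetical, not given in the thread)
    """
    return base_min + clone_min + FALLBACK_WAIT_MIN * timeouts_hit


for hits in range(4):
    total = decision_duration(2.5, hits)
    print(f"{hits} timeout(s): {total} min, "
          f"over 30 min: {total > 30}, over 60 min: {total > 60}")
```

With three waits the estimate sits just under 30 minutes before any clone time is added, which would explain why such tasks tip over the 30-minute threshold.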
Comment 8•4 years ago
bugherder
Comment 9•4 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
> Note how the times are essentially x, x+9, x+18 and x+27. (2 minutes < x < 3 minutes)
This is fishy. Is it possible we are hitting the fallback timeout multiple times?
We store whether the bugbug HTTP service timed out in a fallback property of the BugBugPushSchedules class. Maybe we should turn it into a global variable to ensure that once it's set to True, it always stays True?
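A rough sketch of that idea, not the real BugBugPushSchedules code: the class below is hypothetical and only illustrates keeping the "timed out" flag at class level, so that every instance sees it and the long wait is paid at most once per process. `fetch` and `fallback` are stand-in callables, and `TimeoutError` stands in for whatever the real timeout signal is.

```python
class PushSchedulesSketch:
    """Hypothetical stand-in for a class that queries the bugbug service."""

    # Class-level flag shared by all instances. Once set to True it is
    # never reset, so later instances skip the service entirely.
    timed_out = False

    def __init__(self, fetch, fallback):
        self._fetch = fetch        # may raise TimeoutError after a long wait
        self._fallback = fallback  # cheap local alternative

    def schedules(self):
        if PushSchedulesSketch.timed_out:
            return self._fallback()
        try:
            return self._fetch()
        except TimeoutError:
            PushSchedulesSketch.timed_out = True
            return self._fallback()


attempts = []


def slow_fetch():
    """Fake service call that always times out."""
    attempts.append(1)
    raise TimeoutError("service did not answer in time")


first = PushSchedulesSketch(slow_fetch, lambda: "fallback")
second = PushSchedulesSketch(slow_fetch, lambda: "fallback")
print(first.schedules(), second.schedules(), len(attempts))
```

If the flag lived on each instance instead, `second` would call `slow_fetch` again and pay a second full wait, which would be consistent with the x+9, x+18, x+27 pattern noted in comment 7.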
Comment 10•4 years ago
The leave-open keyword is there and there has been no activity for 6 months.
:ahal, maybe it's time to close this bug?
Comment 12•2 years ago
I believe this is working now; we can open a new bug for any future related issues.