Closed Bug 1721729 Opened 4 years ago Closed 2 years ago

'Resetting dropped connection: bugbug.herokuapp.com' causes gecko decision task timeout, should fall back to schedule tasks

Categories

(Firefox Build System :: Task Configuration, defect, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: aryx, Unassigned)

Details

Attachments

(1 file)

'Resetting dropped connection: bugbug.herokuapp.com' causes the Gecko decision task to time out. According to Marco, it should fall back to scheduling tasks, but scheduling everything will likely overwhelm CI with the load generated.

https://treeherder.mozilla.org/logviewer?job_id=345914361&repo=autoland

[task 2021-07-21T17:53:40.508Z] Generating optimized task graph
[task 2021-07-21T17:53:45.218Z] Reading file: /builds/worker/checkouts/gecko/moz.build
[task 2021-07-21T17:53:45.283Z] Reading file: /builds/worker/checkouts/gecko/docshell/moz.build
[task 2021-07-21T17:53:45.310Z] Reading file: /builds/worker/checkouts/gecko/docshell/base/moz.build
[task 2021-07-21T17:53:45.318Z] Reading file: /builds/worker/checkouts/gecko/netwerk/moz.build
[task 2021-07-21T17:53:45.321Z] Reading file: /builds/worker/checkouts/gecko/netwerk/base/moz.build
[task 2021-07-21T17:53:45.328Z] Reading file: /builds/worker/checkouts/gecko/netwerk/protocol/moz.build
[task 2021-07-21T17:53:45.331Z] Reading file: /builds/worker/checkouts/gecko/netwerk/protocol/http/moz.build
[task 2021-07-21T17:53:45.351Z] Resetting dropped connection: bugbug.herokuapp.com
[task 2021-07-21T17:53:45.381Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T17:53:55.400Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T17:54:05.415Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16

...

[task 2021-07-21T18:09:56.983Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:07.001Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:17.011Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:27.022Z] Resetting dropped connection: bugbug.herokuapp.com
[task 2021-07-21T18:10:27.046Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:37.065Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:47.089Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:10:57.108Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:07.127Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:17.146Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:27.166Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:37.185Z] https://bugbug.herokuapp.com:443 "GET /push/integration/autoland/698a2f3f2369296af6f1f2145d82a23da5bcb75a/schedules HTTP/1.1" 202 16
[task 2021-07-21T18:11:47.335Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-emulator-debug
[task 2021-07-21T18:11:47.335Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-emulator-release
[task 2021-07-21T18:11:47.336Z] no files found matching a pattern in `skip-unless-changed` for webrender-android-hw-p2-debug

Decision tasks still take a very long time and sometimes need to be rerun. This can cause tree closures and can even render pushes unusable for releases when the decision task manages to schedule only part of the taskgraph.

This has a high priority and needs a fix today.

Severity: -- → S1
Flags: needinfo?(mcastelluccio)
Priority: -- → P1

The bugbug issue should be fixed now. We still need a fix for this bug (fallback not working), so that when the service happens to be down we don't stop the world.

Flags: needinfo?(mcastelluccio)

The fallback is working, but I guess the duration of Decision tasks has been creeping up to the point that a 9-minute delay (which is how long we wait before switching to the fallback) pushes them over the 30-minute threshold.

So we need to bump the Decision task timeout and/or wait less time for bugbug to reply. As a short-term solution, I propose we bump the timeout to 60 minutes; Marco can then disable bugbug on autoland for now to let the dynos catch up with the load. Once things are back in a good state we can turn bugbug back on for autoland.
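As a rough illustration of the trade-off above, the sketch below polls a schedules endpoint while it keeps answering 202 and gives up after a bounded wait so the caller can switch to the fallback. The name `poll_with_deadline`, the `fetch` callable, and the assumption that a finished result arrives as HTTP 200 are illustrative and not taken from taskgraph; the 10-second interval and 9-minute wait mirror the log and the comment above.

```python
import time


def poll_with_deadline(fetch, timeout_s=9 * 60, interval_s=10,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll `fetch` until a result is ready or the wait budget is spent.

    `fetch` is a hypothetical callable returning (status_code, body).
    202 is treated as "still computing"; 200 is ASSUMED to mean "ready".
    Returns the body on success, or None to tell the caller to fall back.
    """
    deadline = clock() + timeout_s
    while True:
        status, body = fetch()
        if status == 200:
            return body
        if status != 202:
            # Unexpected answer: do not keep waiting, use the fallback.
            return None
        if clock() + interval_s > deadline:
            # Another sleep would exceed the budget: give up now.
            return None
        sleep(interval_s)
```

Lowering `timeout_s` is the "wait less" option; raising the task's own time limit is the other. Either way, the worst-case wait must fit inside whatever headroom the Decision task has left.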

Longer term we should:

  1. Prioritize Decision task performance to try and get the duration back down to something reasonable.
  2. Add some logic to bugbug to start dropping pushes on autoland if the load gets too high. Maybe we could even add an endpoint to cancel a request, which taskgraph can call after it has given up and switched to the fallback (so bugbug knows not to bother).

Decision task durations have been creeping over the 20 minute mark, which means
we no longer have a lot of leeway when things time out (like waiting for
bugbug). Longer term we should focus on reducing Decision task duration, but
for now this is needed to re-open autoland.

Assignee: nobody → ahal
Status: NEW → ASSIGNED
Attachment #9232670 - Attachment description: Bug 1721729 - Increate decision task timeout to 60 minutes, r?#taskgraph-reviewers! → Bug 1721729 - Increase decision task timeout to 60 minutes, r?#taskgraph-reviewers!
Pushed by dluca@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9e034dc905dd Increase decision task timeout to 60 minutes, r?#taskgraph-reviewers! CLOSED TREE
Assignee: ahal → nobody
Status: ASSIGNED → NEW

> Decision task durations have been creeping over the 20 minute mark

That is only true when bugbug is slow. Normal decision tasks take 2 to 3 minutes + hg clone time. See https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&series=autoland,2680290,1,2&timerange=1209600 (the graph doesn't include hg clone time).

Note how the times are essentially x, x+9, x+18 and x+27. (2 minutes < x < 3 minutes)
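The pattern is easier to see as arithmetic. The snippet below is only a sketch: it assumes each 9-minute step corresponds to one full wait for bugbug (a hypothesis in this thread, not a confirmed cause) and uses the 30-minute limit mentioned earlier. The constant and function names are made up for illustration.

```python
FALLBACK_WAIT_MIN = 9   # wait before switching to the fallback (from this bug)
TASK_LIMIT_MIN = 30     # Decision task limit before the bump to 60 minutes


def decision_duration(base_min, waits):
    """Duration in minutes if the task sits through `waits` full 9-minute waits."""
    return base_min + FALLBACK_WAIT_MIN * waits


# Base time x is between 2 and 3 minutes, excluding hg clone time.
for waits in range(4):
    low = decision_duration(2, waits)
    high = decision_duration(3, waits)
    headroom = TASK_LIMIT_MIN - high
    print(f"{waits} wait(s): {low}-{high} min, headroom {headroom} min")
```

With three waits the task already takes 29 to 30 minutes before hg clone time is added, which leaves no headroom under a 30-minute limit.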

(In reply to Mike Hommey [:glandium] from comment #7)

> Note how the times are essentially x, x+9, x+18 and x+27. (2 minutes < x < 3 minutes)

This is fishy, is it possible we are hitting the fallback time multiple times?
We store whether the bugbug HTTP service timed out in a fallback property of the BugBugPushSchedules class. Maybe we should turn it into a global variable to ensure that once it's set to True, it always stays True?
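A minimal sketch of the "sticky" flag idea follows. It is not the real `BugBugPushSchedules` code: the class `PushSchedules`, its `fetch` argument, and the use of `TimeoutError` are all hypothetical. It shows only how a class-level attribute, shared by every instance, could guarantee that the wait is paid at most once per process.

```python
class PushSchedules:
    """Hypothetical stand-in for a schedules client with a sticky fallback."""

    # Class-level flag: shared by all instances, so once any of them
    # times out, none of them will try (and wait for) the service again.
    timed_out = False

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable that returns schedule data
        # or raises TimeoutError after the full wait.
        self._fetch = fetch

    def get(self):
        if PushSchedules.timed_out:
            return None  # go straight to the fallback
        try:
            return self._fetch()
        except TimeoutError:
            # Assign on the class, not on self, so the flag stays set
            # for every existing and future instance.
            PushSchedules.timed_out = True
            return None
```

Assigning through `self.timed_out = True` would instead create a per-instance attribute, which is one way a wait could end up being repeated across instances.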

The leave-open keyword is set and there has been no activity for 6 months.
:ahal, maybe it's time to close this bug?

Flags: needinfo?(ahal)
Flags: needinfo?(ahal)
Keywords: leave-open

I believe this is working now. We can open a new bug for any future related issues.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME