Closed Bug 1676170 Opened 4 years ago Closed 4 years ago

The decision task of mozilla-release is busted: it times out on json-automationrelevance

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect, P1)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jlorenzo, Assigned: sheehan)

References

(Regression)

Details

Attachments

(2 files)

Today, I merged mozilla-beta (83) to mozilla-release. As usual, the push contains a large number of changesets[1]. For an unknown reason, json-automationrelevance[2] takes too long to respond to the decision task: every attempt fails after about 30 seconds, and the task gives up after 10 attempts with exponential backoff:

[task 2020-11-09T13:53:34.146Z] Querying version control for metadata: https://hg.mozilla.org/releases/mozilla-release/json-automationrelevance/fee723a73e4a12fbd179e05b54f1f2e5623c90c4
[task 2020-11-09T13:53:34.146Z] attempt 1/10
[task 2020-11-09T13:53:34.146Z] retry: calling get_automationrelevance, attempt #1
[task 2020-11-09T13:54:04.414Z] retry: Caught exception: 
[task 2020-11-09T13:54:04.414Z] sleeping for 10.89s (attempt 1/10)
[task 2020-11-09T13:54:15.309Z] attempt 2/10
[task 2020-11-09T13:54:15.309Z] retry: calling get_automationrelevance, attempt #2
[task 2020-11-09T13:54:45.630Z] retry: Caught exception: 
[task 2020-11-09T13:54:45.630Z] sleeping for 16.43s (attempt 2/10)
[task 2020-11-09T13:55:02.078Z] attempt 3/10
[task 2020-11-09T13:55:02.078Z] retry: calling get_automationrelevance, attempt #3
[task 2020-11-09T13:55:32.321Z] retry: Caught exception: 
[task 2020-11-09T13:55:32.321Z] sleeping for 22.40s (attempt 3/10)
[task 2020-11-09T13:55:54.735Z] attempt 4/10
[task 2020-11-09T13:55:54.735Z] retry: calling get_automationrelevance, attempt #4
[task 2020-11-09T13:56:25.042Z] retry: Caught exception: 
[task 2020-11-09T13:56:25.042Z] sleeping for 35.24s (attempt 4/10)
[task 2020-11-09T13:57:00.286Z] attempt 5/10
[task 2020-11-09T13:57:00.286Z] retry: calling get_automationrelevance, attempt #5
[task 2020-11-09T13:57:30.597Z] retry: Caught exception: 
[task 2020-11-09T13:57:30.597Z] sleeping for 51.25s (attempt 5/10)
[task 2020-11-09T13:58:21.872Z] attempt 6/10
[task 2020-11-09T13:58:21.872Z] retry: calling get_automationrelevance, attempt #6
[task 2020-11-09T13:58:52.212Z] retry: Caught exception: 
[task 2020-11-09T13:58:52.212Z] sleeping for 73.28s (attempt 6/10)
[task 2020-11-09T14:00:05.501Z] attempt 7/10
[task 2020-11-09T14:00:05.502Z] retry: calling get_automationrelevance, attempt #7
[task 2020-11-09T14:00:35.796Z] retry: Caught exception: 
[task 2020-11-09T14:00:35.796Z] sleeping for 106.81s (attempt 7/10)
[task 2020-11-09T14:02:22.616Z] attempt 8/10
[task 2020-11-09T14:02:22.616Z] retry: calling get_automationrelevance, attempt #8
[task 2020-11-09T14:02:52.967Z] retry: Caught exception: 
[task 2020-11-09T14:02:52.968Z] sleeping for 179.35s (attempt 8/10)
[task 2020-11-09T14:05:52.411Z] attempt 9/10
[task 2020-11-09T14:05:52.411Z] retry: calling get_automationrelevance, attempt #9
[task 2020-11-09T14:06:22.757Z] retry: Caught exception: 
[task 2020-11-09T14:06:22.757Z] sleeping for 263.29s (attempt 9/10)
[task 2020-11-09T14:10:46.130Z] attempt 10/10
[task 2020-11-09T14:10:46.130Z] retry: calling get_automationrelevance, attempt #10
[task 2020-11-09T14:11:16.447Z] retry: Caught exception: 
[task 2020-11-09T14:11:16.447Z] retry: Giving up on get_automationrelevance
[task 2020-11-09T14:11:16.447Z] Error loading tasks for kind test:

Rerunning the decision task a second time gave the same result.
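
For reference, the sleep times in the log above look like a base delay of about 10 seconds that grows by roughly 1.5x per attempt, with some random jitter. The sketch below reproduces a schedule of that shape. The function name, the parameter names, and the values (10 s, 1.5x, ±10%) are assumptions fitted to the log, not taken from the decision task's source.

```python
import random


def backoff_delays(attempts=10, sleeptime=10.0, sleepscale=1.5, jitter=0.1, rng=None):
    """Yield the sleep (in seconds) taken after each failed attempt.

    Hypothetical helper: the defaults are guesses that match the log,
    not the real retry configuration.
    """
    rng = rng or random.Random()
    base = sleeptime
    # There is no sleep after the final attempt, hence attempts - 1.
    for _ in range(attempts - 1):
        # Proportional jitter: scale the base delay by a factor in [1 - jitter, 1 + jitter].
        yield base * rng.uniform(1.0 - jitter, 1.0 + jitter)
        base *= sleepscale


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(), start=1):
        print(f"sleeping for {delay:.2f}s (attempt {attempt}/10)")
```

With these assumed values, the nine sleeps add up to roughly 12 to 13 minutes on top of ten 30-second timeouts, which is in line with the 13:53 to 14:11 window in the log.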

This is blocking releases. Sheehan, Zeid, do you own this part of hg.m.o? If so, can one of you have a look at the server logs? If not, could you loop in the right person?

[1] https://treeherder.mozilla.org/jobs?repo=mozilla-release&revision=fee723a73e4a12fbd179e05b54f1f2e5623c90c4&selectedTaskRun=H45WZSEiSReHJWLllbhDaw.0
[2] https://hg.mozilla.org/releases/mozilla-release/json-automationrelevance/fee723a73e4a12fbd179e05b54f1f2e5623c90c4

Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)

Last week, for bug 1673985, we added backout metadata to the json-automationrelevance endpoint. That metadata is known to be expensive to calculate. For pushes to try, autoland and other frequently pushed repos, this change probably only added a few seconds of CPU time. Release pushes, however, usually contain thousands of changesets from weeks of development, so calculating the backouts for a single push is very expensive.

I think the easiest fix here is to gate the expensive check behind a query string parameter, so that calls to json-automationrelevance that want the backouts computed for the relevant changesets would look something like https://hg.mozilla.org/releases/mozilla-release/json-automationrelevance/fee723a73e4a12fbd179e05b54f1f2e5623c90c4?backouts=1. Marco, would this solution work for you? As jlorenzo said, this is blocking releases.
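
A minimal sketch of that gating idea, for illustration only. It is not the hgmo implementation: `wants_backouts`, `automation_relevance`, the `find_backouts` callback, and the `backouts` response key are all hypothetical names. Only the `backouts=1` query parameter comes from the proposal above.

```python
from urllib.parse import parse_qs, urlsplit


def wants_backouts(url):
    """Return True only when the request opts in with ?backouts=1."""
    query = parse_qs(urlsplit(url).query)
    # parse_qs maps each key to a list of values; use the last one given.
    return query.get("backouts", ["0"])[-1] == "1"


def automation_relevance(url, changesets, find_backouts):
    """Build a response dict, running the costly lookup only on request.

    find_backouts is a hypothetical stand-in for the expensive
    per-changeset backout calculation.
    """
    response = {"changesets": [{"node": node} for node in changesets]}
    if wants_backouts(url):
        for entry in response["changesets"]:
            entry["backouts"] = find_backouts(entry["node"])
    return response
```

Callers that never pass the parameter, such as the decision task, skip the expensive path entirely, while callers that need backout data can still ask for it.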

Flags: needinfo?(zeid)
Flags: needinfo?(sheehan)
Flags: needinfo?(mcastelluccio)
Attached image treestatus.png

Thanks for the explanation, Connor! In the meantime, I just closed mozilla-release.

Regressed by: 1673985

Calls to json-automationrelevance on mozilla-release can take up to 30s
since deploying changeset 78a0d7c424fc18d. This commit hides the expensive
backout information behind a backouts query string parameter so only the
relevant calls will perform the expensive calculation.

Assignee: nobody → sheehan
Status: NEW → ASSIGNED

Pushed by cosheehan@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/fe454eae1eb7
hgmo: backout information in json-automationrelevance behind a flag r=zeid

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

I've pushed a fix that will put the expensive calculation on json-automationrelevance behind a flag. I'm deploying now but we should keep this open while we verify the fix.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Decision task is back to green! Thank you very much for this super quick fix, Connor!

For the record, I just reopened mozilla-release.

Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

It works for me. I have a fix for mozci: https://github.com/mozilla/mozci/pull/353.
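
On the client side, a consumer that still needs backout data would have to opt in explicitly. The sketch below only shows how such a URL could be built. It is not the content of the mozci pull request, and `HG_BASE` and `automationrelevance_url` are hypothetical names.

```python
from urllib.parse import urlencode

HG_BASE = "https://hg.mozilla.org"  # assumed base URL, as seen in the links in this bug


def automationrelevance_url(repo_path, revision, backouts=False):
    """Build a json-automationrelevance URL, opting in to backouts if asked."""
    url = f"{HG_BASE}/{repo_path}/json-automationrelevance/{revision}"
    if backouts:
        # Only add the flag when backout data is actually needed.
        url += "?" + urlencode({"backouts": 1})
    return url


if __name__ == "__main__":
    print(automationrelevance_url(
        "releases/mozilla-release",
        "fee723a73e4a12fbd179e05b54f1f2e5623c90c4",
        backouts=True,
    ))
```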

Flags: needinfo?(mcastelluccio)
Status: RESOLVED → VERIFIED