Closed Bug 1805762 Opened 3 years ago Closed 3 years ago

firefox-android: Decision task failing on Android monorepo

Categories

(Release Engineering :: General, defect)

defect

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1788606

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

Since the Focus cut-over, the Decision task on firefox-android has been "struggling":

https://treeherder.mozilla.org/jobs?repo=firefox-android&searchStr=decision&fromchange=facba8dad5495a50c06447eece0932dcabe6270e

Recent timeouts, but also frequent retries before that.

There is a similar, perhaps more persistent issue with level 1 decision tasks.

Fenix seems okay, even though those decision tasks use the same workers.

The nightly decision tasks seem fine.

Typical timeout log: https://firefoxci.taskcluster-artifacts.net/RsjsnetoRLiXPGDvJ12mPg/0/public/logs/live_backing.log

[task 2022-12-14T17:47:39.125Z] 2022-12-14 17:47:39,122 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:56:54.622Z] 2022-12-14 17:56:54,621 - INFO - Looking for dependencies in compileOnly configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:21.311Z] 2022-12-14 17:57:21,311 - INFO - Looking for dependencies in implementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:40.078Z] 2022-12-14 17:57:40,078 - INFO - Looking for dependencies in testImplementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:57.441Z] 2022-12-14 17:57:57,440 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/focus-android

[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2022-12-14 18:16:43.584Z] === Task Finished ===
[taskcluster 2022-12-14 18:16:43.585Z] Unsuccessful task run with exit code: -1 completed in 1839.11 seconds

In terms of the changelog, the actual failures started with https://github.com/mozilla-mobile/firefox-android/commit/3efeffde87d021f37e4c82e9cd373cdf3a9efe89, but the content of that commit seems unrelated, and there were several claim-expired tasks on earlier commits (like https://firefox-ci-tc.services.mozilla.com/tasks/AUo9E6rLQtONaFlMRKEGvg/runs/1). Also, some decision task reruns succeed (like https://firefox-ci-tc.services.mozilla.com/tasks/RsjsnetoRLiXPGDvJ12mPg).

The most recent mobile-3/decision-gcp tasks have been successful. One difference I notice is the successful ones are using a new image:

Status: Downloaded newer image for mozillareleases/taskgraph@sha256:e9f387c9049db80281eaed73c7173b653be195423cdd58300f71d9796d8f1ac7

mobile-1/decision-gcp tasks are still "stuck": they run for a long time, fail claim-expired, then retry. eg https://firefox-ci-tc.services.mozilla.com/tasks/LhAkSXZpTbGwec17HbdIJQ/runs/5

I can run the exact same decision task command locally without any trouble (completes in a few minutes).

taskgraph decision --pushlog-id=0 --pushdate=0 --project=firefox-android --message= --owner=mergify[bot]@users.noreply.github.com --level=1 --base-repository=https://github.com/mozilla-mobile/firefox-android --base-ref=releases_v108 --base-rev=15cb4068f5724eb164812e5969812e48d315e7da --head-repository=https://github.com/mozilla-mobile/firefox-android --head-ref=mergify/bp/releases_v108/pr-279 --head-rev=e8056a7c3b08dc26ad2d3a1c5c3b96169f81815e --repository-type=git --tasks-for=github-pull-request-untrusted
2022-12-14 15:35:45,201 - INFO - Loading graph configuration.
 ...
2022-12-14 15:38:23,064 - INFO - Full task graph contains 842 tasks and 2444 dependencies
 ...
2022-12-14 15:38:25,069 - INFO - Filter filter_target_tasks pruned 708 tasks (134 remain)
 ,,,

There have been some successful firefox-android mobile-1/decision-gcp tasks today, like
https://firefox-ci-tc.services.mozilla.com/tasks/V-FpqA5eRhu_gTa3duSGvA/runs/1
https://firefox-ci-tc.services.mozilla.com/tasks/UBAZVdMYTlWUreLGc5n1Yw/runs/0

Why did these succeed?

I believe the CLAIM_EXPIRED issue is separate from the Decision task timeouts.

[task 2022-12-14T17:47:39.125Z] 2022-12-14 17:47:39,122 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:56:54.622Z] 2022-12-14 17:56:54,621 - INFO - Looking for dependencies in compileOnly configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:21.311Z] 2022-12-14 17:57:21,311 - INFO - Looking for dependencies in implementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:40.078Z] 2022-12-14 17:57:40,078 - INFO - Looking for dependencies in testImplementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:57.441Z] 2022-12-14 17:57:57,440 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/focus-android

[taskcluster:error] Task timeout after 1800 seconds. Force killing container.

From the timestamps there's a 10 min gap between the first and last lines, and the rest of the task log is under a minute. So that means we must be stuck for ~20min somewhere after that last log line. It might be worth bumping max-run-time up to an hour or so here to keep things running while we figure this out (unless retriggers are consistently faster?).

So the Looking for dependencies log is immediately followed by a call out to gradle:
https://github.com/mozilla-mobile/firefox-android/blob/8f2fa0c2189c3aedebaf08a4f701e2878b2eec4e/taskcluster/android_taskgraph/gradle.py#L44

And the Decision task does have a gradle cache defined:
https://github.com/mozilla-mobile/firefox-android/blob/8f2fa0c2189c3aedebaf08a4f701e2878b2eec4e/.taskcluster.yml#L238

So I think this is actually pretty simple.. By integrating focus, the Decision task needs to make more calls out to gradle. On a new worker which doesn't have any cache yet, this is slow enough to go over 30 min. But once a worker has populated its cache (which would partially happen even from a failed task), it is then fast enough to stay under 30 min.

I think this was compounded by the CLAIM_EXPIRED issue which I still am unsure why it happened, but seems likely to be related to the Taskluster 46.1.0 deploy that happened around the same time. (This issue appears to be resolved after I terminated all the workers)

So short term, bumping max-run-time would likely solve this. Longer term, we might want to try to run these in parallel, or use beefier instances or something to speed these gradle calls up.

Johan, needinfo'ing you for visibility as it's too late for me to be making patches :)

Flags: needinfo?(jlorenzo)

Thanks :ahal!

https://github.com/mozilla-mobile/firefox-android/pull/340 should massively improve decision task performance.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(jlorenzo)
Resolution: --- → WORKSFORME

Thanks for looking into this :ahal. I don't know what happened 5 days ago. For sure, the situation improved the next day, even before [1] landed. That said, I agree with :gbrown, not calling gradle anymore improves the runtime of the decision task. We're now back to a minute instead of 15. So I'm marking this bug as a duplicate since having faster decision tasks is the ultimate goal.

[1] https://github.com/mozilla-mobile/firefox-android/pull/340

Duplicate of bug: 1788606
Resolution: WORKSFORME → DUPLICATE
You need to log in before you can comment on or make changes to this bug.