firefox-android: Decision task failing on Android monorepo
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
People
(Reporter: gbrown, Assigned: gbrown)
References
Details
Since the Focus cut-over, the Decision task on firefox-android has been "struggling":
Recent timeouts, but also frequent retries before that.
There is a similar, perhaps more persistent issue with level 1 decision tasks.
![]() |
Assignee | |
Comment 1•3 years ago
|
||
Fenix seems okay, even though those decision tasks use the same workers.
![]() |
Assignee | |
Comment 2•3 years ago
|
||
The nightly decision tasks seem fine.
![]() |
Assignee | |
Comment 3•3 years ago
|
||
Typical timeout log: https://firefoxci.taskcluster-artifacts.net/RsjsnetoRLiXPGDvJ12mPg/0/public/logs/live_backing.log
[task 2022-12-14T17:47:39.125Z] 2022-12-14 17:47:39,122 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:56:54.622Z] 2022-12-14 17:56:54,621 - INFO - Looking for dependencies in compileOnly configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:21.311Z] 2022-12-14 17:57:21,311 - INFO - Looking for dependencies in implementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:40.078Z] 2022-12-14 17:57:40,078 - INFO - Looking for dependencies in testImplementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:57.441Z] 2022-12-14 17:57:57,440 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/focus-android
[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
[taskcluster 2022-12-14 18:16:43.584Z] === Task Finished ===
[taskcluster 2022-12-14 18:16:43.585Z] Unsuccessful task run with exit code: -1 completed in 1839.11 seconds
![]() |
Assignee | |
Comment 4•3 years ago
•
|
||
In terms of the changelog, the actual failures started with https://github.com/mozilla-mobile/firefox-android/commit/3efeffde87d021f37e4c82e9cd373cdf3a9efe89, but the content of that commit seems unrelated, and there were several claim-expired tasks on earlier commits (like https://firefox-ci-tc.services.mozilla.com/tasks/AUo9E6rLQtONaFlMRKEGvg/runs/1). Also, some decision task reruns succeed (like https://firefox-ci-tc.services.mozilla.com/tasks/RsjsnetoRLiXPGDvJ12mPg).
![]() |
Assignee | |
Comment 5•3 years ago
•
|
||
The most recent mobile-3/decision-gcp tasks have been successful. One difference I notice is the successful ones are using a new image:
Status: Downloaded newer image for mozillareleases/taskgraph@sha256:e9f387c9049db80281eaed73c7173b653be195423cdd58300f71d9796d8f1ac7
![]() |
Assignee | |
Comment 6•3 years ago
|
||
mobile-1/decision-gcp tasks are still "stuck": they run for a long time, fail claim-expired, then retry. eg https://firefox-ci-tc.services.mozilla.com/tasks/LhAkSXZpTbGwec17HbdIJQ/runs/5
![]() |
Assignee | |
Comment 7•3 years ago
|
||
I can run the exact same decision task command locally without any trouble (completes in a few minutes).
taskgraph decision --pushlog-id=0 --pushdate=0 --project=firefox-android --message= --owner=mergify[bot]@users.noreply.github.com --level=1 --base-repository=https://github.com/mozilla-mobile/firefox-android --base-ref=releases_v108 --base-rev=15cb4068f5724eb164812e5969812e48d315e7da --head-repository=https://github.com/mozilla-mobile/firefox-android --head-ref=mergify/bp/releases_v108/pr-279 --head-rev=e8056a7c3b08dc26ad2d3a1c5c3b96169f81815e --repository-type=git --tasks-for=github-pull-request-untrusted
2022-12-14 15:35:45,201 - INFO - Loading graph configuration.
...
2022-12-14 15:38:23,064 - INFO - Full task graph contains 842 tasks and 2444 dependencies
...
2022-12-14 15:38:25,069 - INFO - Filter filter_target_tasks pruned 708 tasks (134 remain)
,,,
![]() |
Assignee | |
Comment 8•3 years ago
|
||
There have been some successful firefox-android mobile-1/decision-gcp tasks today, like
https://firefox-ci-tc.services.mozilla.com/tasks/V-FpqA5eRhu_gTa3duSGvA/runs/1
https://firefox-ci-tc.services.mozilla.com/tasks/UBAZVdMYTlWUreLGc5n1Yw/runs/0
Why did these succeed?
Comment 9•3 years ago
•
|
||
I believe the CLAIM_EXPIRED issue is separate from the Decision task timeouts.
[task 2022-12-14T17:47:39.125Z] 2022-12-14 17:47:39,122 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:56:54.622Z] 2022-12-14 17:56:54,621 - INFO - Looking for dependencies in compileOnly configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:21.311Z] 2022-12-14 17:57:21,311 - INFO - Looking for dependencies in implementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:40.078Z] 2022-12-14 17:57:40,078 - INFO - Looking for dependencies in testImplementation configuration in /builds/worker/checkouts/vcs/android-components
[task 2022-12-14T17:57:57.441Z] 2022-12-14 17:57:57,440 - INFO - Looking for dependencies in api configuration in /builds/worker/checkouts/vcs/focus-android[taskcluster:error] Task timeout after 1800 seconds. Force killing container.
From the timestamps there's a 10 min gap between the first and last lines, and the rest of the task log is under a minute. So that means we must be stuck for ~20min somewhere after that last log line. It might be worth bumping max-run-time
up to an hour or so here to keep things running while we figure this out (unless retriggers are consistently faster?).
Comment 10•3 years ago
•
|
||
So the Looking for dependencies
log is immediately followed by a call out to gradle:
https://github.com/mozilla-mobile/firefox-android/blob/8f2fa0c2189c3aedebaf08a4f701e2878b2eec4e/taskcluster/android_taskgraph/gradle.py#L44
And the Decision task does have a gradle cache defined:
https://github.com/mozilla-mobile/firefox-android/blob/8f2fa0c2189c3aedebaf08a4f701e2878b2eec4e/.taskcluster.yml#L238
So I think this is actually pretty simple.. By integrating focus, the Decision task needs to make more calls out to gradle. On a new worker which doesn't have any cache yet, this is slow enough to go over 30 min. But once a worker has populated its cache (which would partially happen even from a failed task), it is then fast enough to stay under 30 min.
I think this was compounded by the CLAIM_EXPIRED issue which I still am unsure why it happened, but seems likely to be related to the Taskluster 46.1.0 deploy that happened around the same time. (This issue appears to be resolved after I terminated all the workers)
So short term, bumping max-run-time
would likely solve this. Longer term, we might want to try to run these in parallel, or use beefier instances or something to speed these gradle calls up.
Johan, needinfo'ing you for visibility as it's too late for me to be making patches :)
![]() |
Assignee | |
Comment 11•3 years ago
|
||
Thanks :ahal!
https://github.com/mozilla-mobile/firefox-android/pull/340 should massively improve decision task performance.
Thanks for looking into this :ahal. I don't know what happened 5 days ago. For sure, the situation improved the next day, even before [1] landed. That said, I agree with :gbrown, not calling gradle anymore improves the runtime of the decision task. We're now back to a minute instead of 15. So I'm marking this bug as a duplicate since having faster decision tasks is the ultimate goal.
[1] https://github.com/mozilla-mobile/firefox-android/pull/340
Description
•