Intermittent OS X 10.14 QuantumRender opt wrench [taskcluster:error] Aborting task... | Task aborted - max run time exceeded
Categories
(Firefox Build System :: General, defect, P5)
Tracking
(firefox75 fixed)
Tracking | Status | |
---|---|---|
firefox75 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: gbrown)
Details
(Keywords: intermittent-failure)
Attachments
(1 file)
Filed by: rmaries [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=291512346&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Q4snmrTGTyGsGx6QoLkY3g/runs/1/artifacts/public/logs/live_backing.log
Reftest URL: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Q4snmrTGTyGsGx6QoLkY3g/runs/1/artifacts/public/logs/live_backing.log&only_show_unexpected=1
[task 2020-03-03T23:12:50.289Z] REFTEST reftests/text/bg-color.yaml == reftests/text/bg-color-ref.yaml
[task 2020-03-03T23:12:51.080Z] REFTEST reftests/text/large-glyphs.yaml != reftests/text/blank.yaml
[task 2020-03-03T23:12:56.999Z] REFTEST reftests/text/snap-text-offset.yaml == reftests/text/snap-text-offset-ref.yaml
[task 2020-03-03T23:12:57.147Z] REFTEST reftests/text/shadow-border.yaml == reftests/text/shadow-solid-ref.yaml
[task 2020-03-03T23:12:57.480Z] REFTEST reftests/text/shadow-image.yaml == reftests/text/shadow-solid-ref.yaml
[task 2020-03-03T23:12:57.828Z] REFTEST reftests/text/snap-clip.yaml == reftests/text/snap-clip-ref.yaml
[task 2020-03-03T23:12:58.020Z] REFTEST reftests/text/raster-space-snap.yaml == reftests/text/raster-space-snap-ref.yaml
[task 2020-03-03T23:12:58.228Z] REFTEST reftests/text/raster_root_C_8192.yaml == reftests/text/raster_root_C_ref.yaml
[taskcluster:error] Aborting task...
[taskcluster:warn 2020-03-03T23:21:29.019Z] no such process
[taskcluster 2020-03-03T23:21:29.020Z] === Task Finished ===
[taskcluster 2020-03-03T23:21:29.020Z] Task Duration: 1h0m0.205056324s
[taskcluster 2020-03-03T23:21:29.020Z] [mounts] Preserving cache: Moving "/Users/cltbld/tasks/task_1583274086/checkouts" to "/Users/cltbld/caches/BkcSqlmJRhG9N7JqCnNR4g"
[taskcluster:error] Task aborted - max run time exceeded
Updated•5 years ago
|
Updated•5 years ago
|
Comment 1•5 years ago
•
|
||
There haven't been any failures since this was filled, might be just an unfortunate case of Bug 1589796. Adding manifests (22:22-22:36 - 14 mins) and files (22:36 - 22:55 - 19 mins) seems to take a while vs one green build https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291559368&repo=autoland&lineNumber=55-57. Geoff, what do you think?
![]() |
Assignee | |
Comment 2•5 years ago
|
||
Actually, I find this concerning. 7 of the last 9 runs of this task on mozilla-central have hit task timeouts.
Most earlier runs were successful, but run times vary widely, between 20 and 55 minutes. Occasional task timeouts have been reported in bug 1589796.
The checkout does appear to be the problem, lately taking 30+ minutes. I see some run-task warnings that might be helpful; I'll try to follow-up on these:
[taskcluster 2020-03-05T10:12:24.277Z] Executing command 1: /usr/local/bin/python3 run-task '--gecko-checkout=./checkouts/gecko' --gecko-sparse-profile=build/sparse-profiles/webrender --task-cwd checkouts/gecko/gfx/wr -- bash -cx 'export WRENCH_HEADLESS_TARGET=$PWD/wrench-macos-headless/ && export WRENCH_BINARY=$PWD/wrench-macos/bin/wrench && ci-scripts/macos-release-tests.sh'
[setup 2020-03-05T10:12:24.370Z] run-task started in /Users/cltbld/tasks/task_1583402896
[vcs 2020-03-05T10:12:24.373Z] WARNING: vcs checkout path (./checkouts/gecko) not in cache or volume; performance will likely suffer
[vcs 2020-03-05T10:12:24.373Z] WARNING: HG_STORE_PATH (./checkouts/hg-shared) not in cache or volume; performance will likely suffer
[vcs 2020-03-05T10:12:24.373Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
[vcs 2020-03-05T10:12:24.377Z] executing ['hg', 'robustcheckout', '--sharebase', '/Users/cltbld/tasks/task_1583402896/checkouts/hg-shared', '--purge', '--upstream', 'https://hg.mozilla.org/mozilla-unified', '--sparseprofile', 'build/sparse-profiles/webrender', '--revision', '2e1a978b09d77c7422e972e9c3bfdeb484f6f608', 'https://hg.mozilla.org/mozilla-central', '/Users/cltbld/tasks/task_1583402896/checkouts/gecko']
![]() |
Assignee | |
Comment 4•5 years ago
|
||
I also notice bug 1572363 added a sparse profile for webrender, but is that working as intended?
[vcs 2020-03-05T10:44:10.323Z] (setting sparse config to profile build/sparse-profiles/webrender)
[vcs 2020-03-05T10:44:10.333Z] 0 files added, 0 files dropped, 0 files conflicting
[vcs 2020-03-05T10:44:10.333Z] (sparse refresh complete)
[vcs 2020-03-05T10:44:12.875Z] warning: sparse profile 'build/sparse-profiles/webrender' not found in rev 000000000000 - ignoring it
[vcs 2020-03-05T10:44:18.051Z]
[vcs 2020-03-05T10:44:19.067Z] updating [ ] 100/25598
[vcs 2020-03-05T10:44:20.070Z] updating [====> ] 2500/25598 10s
[vcs 2020-03-05T10:44:21.077Z] updating [========> ] 4900/25598 09s
[vcs 2020-03-05T10:44:22.145Z] updating [==============> ] 7700/25598 08s
[vcs 2020-03-05T10:44:23.231Z] updating [=====================> ] 10900/25598 06s
[vcs 2020-03-05T10:44:24.305Z] updating [===========================> ] 14100/25598 05s
[vcs 2020-03-05T10:44:25.451Z] updating [==================================> ] 17300/25598 04s
[vcs 2020-03-05T10:44:26.497Z] updating [========================================> ] 20500/25598 02s
[vcs 2020-03-05T10:44:30.940Z] updating [===============================================> ] 23700/25598 01s
[vcs 2020-03-05T10:44:31.484Z]
[vcs 2020-03-05T10:44:31.484Z] 25598 files updated, 0 files merged, 0 files removed, 0 files unresolved
[vcs 2020-03-05T10:44:31.561Z] updated to 2e1a978b09d77c7422e972e9c3bfdeb484f6f608
Comment 5•5 years ago
|
||
The sparse profile is working - you can see we're updating only 25598 files (right side of the progress bar) instead of the entire tree which is closer to 250k.
Since we have a sparse-profile I doubt the issue here is related to the filesystem during working directory checkout. I only see one example log in this bug, but taking a look, it seems we're cloning the whole repo and it's taking quite a while. Optimally, these workers should be using a cached copy of the repo downloaded from an earlier task run. In a log like this for example (also running on our macOS farm), we re-use an existing repo on the worker and simply hg pull
the new changes then create a working directory checkout. It seems these tasks are either misconfigured to find the repo cache, or they have a low hit rate on the cache for some other reason (for example, they're running on workers that frequently restart and clear the cache, or they're running on workers that mostly do tests and don't always have a copy of the repo cached).
The initial download is still going to take a while unfortunately. We have optimizations for AWS/GCP that make this faster, but in our datacentre macOS farm things are slower. However this is a price that should only be paid by each new worker once, and subsequent tasks should be re-using the downloaded repo.
![]() |
Assignee | |
Comment 6•5 years ago
|
||
(In reply to Connor Sheehan [:sheehan] from comment #5)
Thanks :sheehan!
I only see one example log in this bug
If it helps, here are other logs from recent failures (very slow runs) of this task:
https://treeherder.mozilla.org/logviewer.html#?job_id=291778336&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291788973&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291751471&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291756942&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291594603&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291604512&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291618371&repo=mozilla-central
And some recent successes (fast):
https://treeherder.mozilla.org/logviewer.html#?job_id=291959134&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=291806037&repo=mozilla-central
I think all the slow cases have something like:
[vcs 2020-03-04T11:38:55.260Z] ensuring https://hg.mozilla.org/mozilla-central@fd8d4835bad324d33d433a78302f13bbd02985be is available at /Users/cltbld/tasks/task_1583320981/checkouts/gecko
[vcs 2020-03-04T11:38:55.829Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-03-04T11:38:55.874Z] (sharing from new pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
[vcs 2020-03-04T11:38:56.567Z] applying clone bundle from https://hg.cdn.mozilla.net/mozilla-unified/98c0eaba40241f2f7ffce056add7b4d8d5a26946.zstd-max.hg
and the fast ones have:
[vcs 2020-03-05T12:28:10.655Z] ensuring https://hg.mozilla.org/mozilla-central@2e1a978b09d77c7422e972e9c3bfdeb484f6f608 is available at /Users/cltbld/tasks/task_1583410161/checkouts/gecko
[vcs 2020-03-05T12:28:10.658Z] (existing repository shared store: /Users/cltbld/tasks/task_1582222035/checkouts/hg-shared/8ba995b74e18334ab3707f27e9eb8f4e37ba3d29/.hg)
[vcs 2020-03-05T12:28:10.658Z] (shared store does not exist; deleting destination)
[vcs 2020-03-05T12:28:14.510Z] (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
[vcs 2020-03-05T12:28:14.550Z] (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
It seems these tasks are either misconfigured to find the repo cache, or they have a low hit rate on the cache for some other reason (for example, they're running on workers that frequently restart and clear the cache, or they're running on workers that mostly do tests and don't always have a copy of the repo cached).
These wrench tests run on gecko-t-osx-1014, which is also used for all the osx tests, like mochitests, which don't use the repo -- that seems to explain a low cache hit rate.
I'm thinking maybe we should just accept that this task is going to sometimes take a long time getting the repo and increase its max-run-time accordingly.
![]() |
Assignee | |
Comment 7•5 years ago
|
||
Avoid task timeouts by increasing max-run-time, to allow for repo cloning.
Comment 9•5 years ago
|
||
bugherder |
Comment hidden (Intermittent Failures Robot) |
Description
•