[taskcluster:error] Task aborted - max run time exceeded on many Windows source-test tasks
Categories: Release Engineering :: General, defect
Tracking: Not tracked
People: Reporter: glandium, Assigned: markco
Details: Keywords: intermittent-failure, Whiteboard: [stockwell infra]
Attachments: 1 file
Since the migration of jobs to GCP, many Windows jobs fail with a "run time exceeded" error after spending an hour(!) checking out the Mercurial tree.
Recent example on mozilla-central: https://treeherder.mozilla.org/logviewer?job_id=395337573&repo=mozilla-central&lineNumber=2792
This is happening a lot.
Comment 1•2 years ago
Perhaps this has to do with the following:
[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
As we most certainly have mirrors. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org
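To check what a worker actually sees, something like the following could be run on the machine. This is only a minimal sketch: `worker_cloud` is a hypothetical helper, not the worker's real code, and it assumes that `TASKCLUSTER_WORKER_LOCATION`, when set, holds a JSON object with a `cloud` key.

```python
import json
import os


def worker_cloud(environ=None):
    """Return the cloud named in TASKCLUSTER_WORKER_LOCATION, or None.

    Hypothetical helper for illustration only. Assumes the variable, when
    set, holds a JSON object with a "cloud" key.
    """
    if environ is None:
        environ = os.environ
    raw = environ.get("TASKCLUSTER_WORKER_LOCATION")
    if not raw:
        # Same situation as the quoted log line: the variable is not set.
        return None
    try:
        location = json.loads(raw)
    except ValueError:
        # Set, but not parseable as JSON.
        return None
    if not isinstance(location, dict):
        return None
    return location.get("cloud") or None


if __name__ == "__main__":
    cloud = worker_cloud()
    if cloud is None:
        print("worker location unknown")
    else:
        print("worker reports cloud:", cloud)
```

A `None` result corresponds to the case in the log line above, where the checkout falls back to the public hg.mozilla.org service.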
Comment 2•2 years ago
Is that line also happening in tasks that don't fail?
Comment 3•2 years ago
Note that the linked task isn't running in GCP (it's in Azure), so the cause here is something else.
Comment 4•2 years ago
Cloning in Windows source-test tasks has been problematic for months: https://bugzilla.mozilla.org/show_bug.cgi?id=1589796#c490
Assignee
Comment 5•2 years ago
I think there are two things we can do here.
From the -source VM configuration side, we can increase the data disk size and use premium SSDs. This will improve disk performance, which may speed up the HG clone slightly. I will work on getting a test pool up. Which tasks or suites use -source?
However, the more effective but more complicated fix might be getting HG mirrors set up in Azure. I am not sure how to start that process or who to talk to. Any suggestions?
jmaher, are we seeing these long clone times with the win11-64-source testing?
Comment 6•2 years ago
source tests are basically:
./mach try fuzzy -q 'test-windows10 source-test'
I saw the same thing when testing on win11; in fact, I had to retrigger each test 3 times to get at least 1 green, because the tests were taking a very long time to clone the repo. I am not sure where the bottleneck is. I assume we are doing a shallow clone. It could be network transfer (in which case a cached hg repo server in Azure would help) or disk I/O on the machine (in which case the premium SSD would help).
Reporter
Comment 7•2 years ago
(In reply to Michelle Goossens [:masterwayz] from comment #1)
> Perhaps this has to do with the following:
> [vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
> As we most certainly have mirrors. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org
That's not the whole story: only 20 minutes are spent hitting the network. The remaining 40 minutes are local I/O, checking out the tree from the repo.
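For anyone who wants to reproduce this kind of split from a task log, here is a rough sketch. It assumes every relevant line carries the `[vcs <timestamp>Z]` prefix seen in the line quoted in comment 1. The helper names are hypothetical, and the two `PLACEHOLDER` lines are made up for illustration (they are not real hg output); substitute whatever messages mark the phase boundaries in an actual log.

```python
import re
from datetime import datetime, timezone

# Matches prefixes such as "[vcs 2022-11-03T05:40:21.397Z] message".
LINE_RE = re.compile(
    r"^\[(?P<tag>\w+) (?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\] (?P<msg>.*)$"
)


def parse_line(line):
    """Return (timestamp, message) for a tagged log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%f")
    return ts.replace(tzinfo=timezone.utc), m.group("msg")


def minutes_between(lines, start_marker, end_marker):
    """Minutes from the first line containing start_marker to the next line containing end_marker."""
    start = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        if start is None:
            if start_marker in msg:
                start = ts
        elif end_marker in msg:
            return (ts - start).total_seconds() / 60
    return None


# The first line is the one quoted in comment 1. The other two are invented
# placeholders with made-up timestamps spaced 20 and 40 minutes apart.
sample = [
    "[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service",
    "[vcs 2022-11-03T06:00:21.397Z] PLACEHOLDER network phase done",
    "[vcs 2022-11-03T06:40:21.397Z] PLACEHOLDER checkout done",
]

network = minutes_between(sample, "TASKCLUSTER_WORKER_LOCATION", "PLACEHOLDER network phase done")
local_io = minutes_between(sample, "PLACEHOLDER network phase done", "PLACEHOLDER checkout done")
print(f"network: {network} min, local I/O: {local_io} min")
```

With the placeholder timestamps above, this reports 20.0 and 40.0 minutes, which only mirrors the split described here and is not a measurement.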
Assignee
Comment 8•2 years ago
Assignee
Comment 9•2 years ago
(In reply to Mike Hommey [:glandium] from comment #7)
> That's not the whole story, only 20 minutes are spent hitting the network. The 40 remaining minutes are local I/O of checking out the tree from the repo.
The current configuration of the data disk would not handle heavy I/O. On Monday I will test a configuration with a much larger SSD. Hopefully we will see an improvement on that side.
Comment 11•2 years ago
Assignee
Comment 12•2 years ago
Here is a push using the 500 GB premium SSD for the data disk: https://treeherder.mozilla.org/jobs?repo=try&revision=4889b74ebfc21cc64d49c70a8d91d097dfa6865c&selectedTaskRun=bQY7zIBoSIivpH9JHKgSAw.0
Using --rebuild 5, we still had a handful hit the max run time. I will take a look and try to figure out whether anything else can be done to improve I/O performance. Another item we can look at, as we migrate to Win 11, is creating a local cache of the repo when the VM for this worker pool spins up. We would take the time hit at spin-up but could reduce the time spent in each task. However, I am not sure how a task would access the cache while it is running, but having the repo in place before the task starts is doable.
Assignee
Comment 13•2 years ago
We are working on addressing this issue in the Win 11 migration.
Reporter
Comment 20•2 years ago
Another manifestation of this issue is bug 1806182.
Comment 42•2 years ago
Update:
There have been 40 failures within the last 7 days:
- 1 failure on Windows 11 x64 22H2 WebRender debug
- 1 failure on Windows 11 x64 22H2 asan WebRender opt
- 38 failures on windows10-64 opt
Recent log: https://treeherder.mozilla.org/logviewer?job_id=409257091&repo=mozilla-central&lineNumber=2091
Mark, is there any chance you have a bit of time to look over this?
Thank you.
Assignee
Comment 45•2 years ago
Sorry. Unfortunately, we can't do much with the current windows10-64 worker configuration. However, we will be upgrading those workers to a newer build of Win 10 in the near future. Hopefully that will get us some improvement here. We are also looking at adding some HG mirrors to Azure this year.