Closed Bug 1799052 Opened 2 years ago Closed 7 months ago

[taskcluster:error] Task aborted - max run time exceeded on many Windows source-test tasks

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: markco)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

Since the migration of jobs to GCP, many Windows jobs fail with a "run time exceeded" error after spending an hour(!) checking out the Mercurial tree.
Recent example on mozilla-central: https://treeherder.mozilla.org/logviewer?job_id=395337573&repo=mozilla-central&lineNumber=2792

This is happening a lot.

Perhaps this has to do with the following:
[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
We most certainly have mirrors, so that fallback shouldn't be happening. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org.
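For reference, the decision run-task makes here boils down to something like the following (a rough Python sketch; the JSON shape of TASKCLUSTER_WORKER_LOCATION and the mirror hostname are illustrative assumptions, not the actual implementation):

    import json
    import os

    def pick_hg_base():
        # worker-runner normally exports TASKCLUSTER_WORKER_LOCATION as a small
        # JSON blob describing where the worker runs, e.g. {"cloud": "gcp", "region": "..."}.
        location = os.environ.get("TASKCLUSTER_WORKER_LOCATION")
        if not location:
            # No location info: fall back to the public service, i.e. a full
            # clone from hg.mozilla.org over the public internet.
            return "https://hg.mozilla.org"
        info = json.loads(location)
        # With location info, a nearby mirror can be chosen instead
        # (hostname below is purely hypothetical).
        return "https://hg-mirror-%s-%s.example.internal" % (info["cloud"], info["region"])

If that variable isn't set on these workers, the mirror path is never even considered.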

Flags: needinfo?(mcornmesser)

Is that line also happening in tasks that don't fail?

Note that the linked task isn't running in GCP (it's in Azure), so the cause here is something else.

Summary: [taskcluster:error] Task aborted - max run time exceeded on many Windows GCP tasks → [taskcluster:error] Task aborted - max run time exceeded on many Windows source-test tasks

Cloning in Windows source-test tasks has been problematic for months: https://bugzilla.mozilla.org/show_bug.cgi?id=1589796#c490

I think there are 2 things we can do here.

On the -source VM configuration side, we can increase the data disk size and use premium SSDs. This will improve disk performance, which may speed up the HG clone slightly. I will work on getting a test pool up. Which tasks or suites use -source?

However, the more effective but more complicated fix might be getting HG mirrors set up in Azure. I am not sure how to start that process or who to talk to. Any suggestions?

jmaher, are we seeing these long clone times with the win11-64-source testing?

Flags: needinfo?(mcornmesser) → needinfo?(jmaher)
Flags: needinfo?(mcornmesser)

source tests are basically:
./mach try fuzzy -q 'test-windows10 source-test'

I saw the same thing when testing on win11; in fact, I had to retrigger the tests 3 times each to get at least 1 green. The tests were taking a very long time to clone the repo. I assume we are doing a shallow clone, but I am not sure whether the bottleneck is network transfer (in which case a cached hg repo server in Azure would help) or disk I/O on the machine (in which case the premium SSD would help).
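One way to tell those two apart is to time the pull and the working-copy update separately, since the first is network-bound and the second is mostly disk I/O (a quick Python sketch with plain hg commands; mozilla-unified and the central bookmark are just illustrative targets):

    import subprocess
    import time

    def timed(label, cmd):
        start = time.time()
        subprocess.run(cmd, check=True)
        print("%s: %.0fs" % (label, time.time() - start))

    # Network-bound: pull the store only, no working copy is written.
    timed("clone (network)", ["hg", "clone", "--noupdate",
                              "https://hg.mozilla.org/mozilla-unified", "repo"])
    # Disk-bound: materialize the full working directory (~300k files).
    timed("update (disk I/O)", ["hg", "-R", "repo", "update", "-r", "central"])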

Flags: needinfo?(jmaher)

(In reply to Michelle Goossens [:masterwayz] from comment #1)

Perhaps this has to do with the following:
[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
We most certainly have mirrors, so that fallback shouldn't be happening. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org.

That's not the whole story: only 20 minutes are spent hitting the network. The remaining 40 minutes are local I/O, checking out the tree from the repo.

Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED

(In reply to Mike Hommey [:glandium] from comment #7)

That's not the whole story: only 20 minutes are spent hitting the network. The remaining 40 minutes are local I/O, checking out the tree from the repo.

The current configuration of the data disk would not handle heavy I/O. On Monday I will test a configuration with a much larger SSD; hopefully we will see an improvement on that side.

Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/e37647d11720 Add 10-64-source-ssd worker pool. r=releng-reviewers,jcristau

Here is a push using the 500 GB premium SSD for the data disk: https://treeherder.mozilla.org/jobs?repo=try&revision=4889b74ebfc21cc64d49c70a8d91d097dfa6865c&selectedTaskRun=bQY7zIBoSIivpH9JHKgSAw.0

Using --rebuild 5, we still had a handful of tasks hit the max runtime. I will take a look and try to figure out if there is anything else that can be done to improve I/O performance. Another item we can look at, as we migrate to Win 11, is keeping a local cache of the repo that is populated when the VM for this worker pool spins up. We would take the hit on spin-up time but could reduce the time spent in each task. I am not yet sure how the task would access the cache at run time, but having the repo there pre-task is doable; see the sketch below.
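One possible shape for that pre-seeded cache is a shared store that is populated once at spin-up and then reused by every task (a sketch only; it assumes robustcheckout's --sharebase/--revision options, and the paths are made up for illustration):

    import subprocess

    CACHE = r"D:\hg-shared"        # hypothetical persistent data-disk location
    UPSTREAM = "https://hg.mozilla.org/mozilla-unified"

    def seed_cache_at_spinup():
        # Run once while the VM is being provisioned: pull the store into the
        # share base so later checkouts only need an incremental pull.
        subprocess.run(["hg", "robustcheckout", UPSTREAM, r"D:\seed",
                        "--sharebase", CACHE, "--revision", "central"], check=True)

    def checkout_in_task(revision):
        # Run inside the task: reuses the pre-seeded store instead of
        # re-cloning everything from hg.mozilla.org.
        subprocess.run(["hg", "robustcheckout", UPSTREAM, r"C:\checkout",
                        "--sharebase", CACHE, "--revision", revision], check=True)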

We are working on addressing this issue in the Win 11 migration.

Another manifestation of this issue is bug 1806182.

Whiteboard: [stockwell disable-recommended]

Update:

There have been 40 failures within the last 7 days:

  • 1 failure on Windows 11 x64 22H2 WebRender debug
  • 1 failure on Windows 11 x64 22H2 asan WebRender opt
  • 38 failures on windows10-64 opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=409257091&repo=mozilla-central&lineNumber=2091

Mark, is there any chance you have a bit of time to look over this?
Thank you.

Flags: needinfo?(mcornmesser)
Whiteboard: [stockwell needswork:owner]

Sorry, unfortunately we can't do much with the current windows10-64 worker configuration. However, we will be upgrading those workers to a newer build of Win 10 in the near future; hopefully that will get us some improvement here. We are also looking at adding some HG mirrors in Azure this year.

Flags: needinfo?(mcornmesser)
Duplicate of this bug: 1824198
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
See Also: → 1738853
Severity: -- → S3
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
