Closed Bug 1799052 Opened 2 years ago Closed 7 months ago

[taskcluster:error] Task aborted - max run time exceeded on many Windows source-test tasks

Categories

(Release Engineering :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: markco)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

Since the migration of jobs to GCP, many Windows jobs fail with a "run time exceeded" error after spending an hour(!) checking out the Mercurial tree.
Recent example on mozilla-central: https://treeherder.mozilla.org/logviewer?job_id=395337573&repo=mozilla-central&lineNumber=2792

This is happening a lot.

Perhaps this has to do with the following:
[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
We most certainly have mirrors, so that fallback shouldn't be happening. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org.
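For reference, the decision run-task makes here boils down to something like the following (a rough Python sketch; the JSON shape of TASKCLUSTER_WORKER_LOCATION and the mirror hostname are illustrative assumptions, not the actual implementation):

    import json
    import os

    def pick_hg_base():
        # worker-runner normally exports TASKCLUSTER_WORKER_LOCATION as a small
        # JSON blob describing where the worker runs, e.g. {"cloud": "gcp", "region": "..."}.
        location = os.environ.get("TASKCLUSTER_WORKER_LOCATION")
        if not location:
            # No location info: fall back to the public service, i.e. a full
            # clone from hg.mozilla.org over the public internet.
            return "https://hg.mozilla.org"
        info = json.loads(location)
        # With location info, a nearby mirror can be chosen instead
        # (hostname below is purely hypothetical).
        return "https://hg-mirror-%s-%s.example.internal" % (info["cloud"], info["region"])

If that variable isn't set on these workers, the mirror path is never even considered.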

Flags: needinfo?(mcornmesser)

Is that line also happening in tasks that don't fail?

Note that the linked task isn't running in GCP (it's in Azure), so the cause here is something else.

Summary: [taskcluster:error] Task aborted - max run time exceeded on many Windows GCP tasks → [taskcluster:error] Task aborted - max run time exceeded on many Windows source-test tasks

Cloning in Windows source-test tasks has been problematic for months: https://bugzilla.mozilla.org/show_bug.cgi?id=1589796#c490

I think there are 2 things we can do here.

On the -source VM configuration side, we can increase the data disk size and use premium SSDs. This will improve disk performance, which may speed up the HG clone slightly. I will work on getting a test pool up. Which tasks or suites use -source?

However, the more effective but more complicated fix might be getting HG mirrors set up in Azure. I am not sure how to start that process or who to talk to. Any suggestions?

jmaher, are we seeing these long clone times with the win11-64-source testing?

Flags: needinfo?(mcornmesser) → needinfo?(jmaher)
Flags: needinfo?(mcornmesser)

source tests are basically:
./mach try fuzzy -q 'test-windows10 source-test'

I saw the same thing when testing on win11; in fact, I had to retrigger the tests 3 times each to get at least 1 green. The tests were taking a very long time to clone the repo. I assume we are doing a shallow clone, but I am not sure whether the bottleneck is network transfer (in which case a cached hg repo server in Azure would help) or disk I/O on the machine (in which case the premium SSD would help).
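One way to tell those two apart is to time the pull and the working-copy update separately, since the first is network-bound and the second is mostly disk I/O (a quick Python sketch with plain hg commands; mozilla-unified and the central bookmark are just illustrative targets):

    import subprocess
    import time

    def timed(label, cmd):
        start = time.time()
        subprocess.run(cmd, check=True)
        print("%s: %.0fs" % (label, time.time() - start))

    # Network-bound: pull the store only, no working copy is written.
    timed("clone (network)", ["hg", "clone", "--noupdate",
                              "https://hg.mozilla.org/mozilla-unified", "repo"])
    # Disk-bound: materialize the full working directory (~300k files).
    timed("update (disk I/O)", ["hg", "-R", "repo", "update", "-r", "central"])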

Flags: needinfo?(jmaher)

(In reply to Michelle Goossens [:masterwayz] from comment #1)

Perhaps this has to do with the following:
[vcs 2022-11-03T05:40:21.397Z] TASKCLUSTER_WORKER_LOCATION environment variable not set; using public hg.mozilla.org service
We most certainly have mirrors, so that fallback shouldn't be happening. The reason it takes so ridiculously long is that it's doing a full-on clone from hg.mozilla.org.

That's not the whole story: only 20 minutes are spent hitting the network. The remaining 40 minutes are local I/O, checking out the tree from the repo.

Assignee: nobody → mcornmesser
Status: NEW → ASSIGNED

(In reply to Mike Hommey [:glandium] from comment #7)

That's not the whole story: only 20 minutes are spent hitting the network. The remaining 40 minutes are local I/O, checking out the tree from the repo.

The current configuration of the data disk would not handle heavy I/O. On Monday I will test a configuration with a much larger SSD; hopefully we will see an improvement on that side.

Pushed by mcornmesser@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/e37647d11720 Add 10-64-source-ssd worker pool. r=releng-reviewers,jcristau

Here is a push using the 500 GB premium SSD for the data disk: https://treeherder.mozilla.org/jobs?repo=try&revision=4889b74ebfc21cc64d49c70a8d91d097dfa6865c&selectedTaskRun=bQY7zIBoSIivpH9JHKgSAw.0

Using --rebuild 5, we still had a handful of tasks hit the max runtime. I will take a look and try to figure out if there is anything else that can be done to improve I/O performance. Another item we can look at, as we migrate to Win 11, is keeping a local cache of the repo that is populated when the VM for this worker pool spins up. We would take the hit on spin-up time but could reduce the time spent in each task. I am not yet sure how the task would access the cache at run time, but having the repo there pre-task is doable; see the sketch below.
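One possible shape for that pre-seeded cache is a shared store that is populated once at spin-up and then reused by every task (a sketch only; it assumes robustcheckout's --sharebase/--revision options, and the paths are made up for illustration):

    import subprocess

    CACHE = r"D:\hg-shared"        # hypothetical persistent data-disk location
    UPSTREAM = "https://hg.mozilla.org/mozilla-unified"

    def seed_cache_at_spinup():
        # Run once while the VM is being provisioned: pull the store into the
        # share base so later checkouts only need an incremental pull.
        subprocess.run(["hg", "robustcheckout", UPSTREAM, r"D:\seed",
                        "--sharebase", CACHE, "--revision", "central"], check=True)

    def checkout_in_task(revision):
        # Run inside the task: reuses the pre-seeded store instead of
        # re-cloning everything from hg.mozilla.org.
        subprocess.run(["hg", "robustcheckout", UPSTREAM, r"C:\checkout",
                        "--sharebase", CACHE, "--revision", revision], check=True)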

We are working on addressing this issue in the Win 11 migration.

Another manifestation of this issue is bug 1806182.

Whiteboard: [stockwell disable-recommended]

Update:

There have been 40 failures within the last 7 days:

  • 1 failure on Windows 11 x64 22H2 WebRender debug
  • 1 failure on Windows 11 x64 22H2 asan WebRender opt
  • 38 failures on windows10-64 opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=409257091&repo=mozilla-central&lineNumber=2091

Mark, is there any chance you have a bit of time to look over this?
Thank you.

Flags: needinfo?(mcornmesser)
Whiteboard: [stockwell needswork:owner]

Sorry, unfortunately we can't do much with the current windows10-64 worker configuration. However, we will be upgrading those workers to a newer build of Win 10 in the near future; hopefully that will get us some improvement here. We are also looking at adding some HG mirrors in Azure this year.

Flags: needinfo?(mcornmesser)
Duplicate of this bug: 1824198
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
See Also: → 1738853
Severity: -- → S3
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
