Open Bug 1738853 Opened 3 months ago Updated 13 days ago

Windows startup test tasks sometimes take >30 minutes to clone hg repository which delays shipping Nightlies

Categories

(Firefox Build System :: Task Configuration, defect)

Tracking

(firefox-esr91 unaffected, firefox93 unaffected, firefox94 unaffected, firefox95 wontfix, firefox96 wontfix)

ASSIGNED

People

(Reporter: aryx, Assigned: masterwayz)

References

(Regression)

Details

(Keywords: leave-open, regression)

Attachments

(1 file)

The Windows startup tests (Treeherder symbol "SUT") often take only 1 minute to run, but sometimes they run for 37-39 minutes because cloning the mozilla-unified repository is slow. Nightlies ship only after the startup test tasks have succeeded.

Expected: Reliable short task run time. If a full clone is necessary, the clone time should be shorter. (Or could the relevant files for this task get generated as an artifact?)

Example: https://treeherder.mozilla.org/logviewer?job_id=356409035&repo=mozilla-central
You can also select the task in the main view and use the "Similar Jobs" at the bottom to find affected tasks: https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HxTUt0EmSju9h1eH5CtqYA.0&revision=6aff7acd8390f8d83eb15589a0ceb376cedb2416&searchStr=windows%2Cstartup

Is this a new change since the new Windows AMIs, or have we seen it before?

Flags: needinfo?(aryx.bugmail)

We also hit this on Fx 95.0b2.

This is because the -source workers (and I think the regular Azure Win10 workers as well) need to perform a full hg clone before they can run jobs when the worker is brand new and this is the first job it runs.
On AWS I have noticed that workers clone the hg bundle from the CDN. So making that work here is one option; alternatively, we could ship the image with a cached repository.

Mark, we talked about this before I think, any ideas?

Flags: needinfo?(mcornmesser)

(Mark: see the comment above. Bugzilla had a conflict and didn't NI you on it.)

The Mercurial server has a config that prioritizes the stream clonebundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to just always use stream clones.
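
To illustrate the idea of prioritizing stream clonebundles, here is a minimal Python sketch of preference-based bundle selection. It is a simplified model, not Mercurial's or robustcheckout's actual code. The attribute names (`KIND`, `cloud`) and the `example.invalid` URLs are made up for illustration and do not reflect the real manifest served by hg.mozilla.org.

```python
# Simplified model of picking a clone bundle from a manifest.
# NOT Mercurial's implementation; attribute names and URLs are hypothetical.


def parse_manifest(text):
    """Parse lines of '<url> KEY=VALUE ...' into (url, attrs) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        attrs = dict(field.split("=", 1) for field in fields[1:])
        entries.append((fields[0], attrs))
    return entries


def pick_bundle(entries, prefers):
    """Return the entry that best matches an ordered list of (key, value) pairs.

    Earlier preferences win. Ties keep manifest order because sorted() is stable.
    """
    def rank(entry):
        _, attrs = entry
        # One flag per preference: 0 if the entry matches it, 1 otherwise.
        return tuple(0 if attrs.get(key) == value else 1 for key, value in prefers)

    return sorted(entries, key=rank)[0] if entries else None


MANIFEST = """\
https://cdn.example.invalid/repo.zstd.hg KIND=zstd cloud=none
https://cdn.example.invalid/repo.stream.hg KIND=stream cloud=none
https://azure.example.invalid/repo.stream.hg KIND=stream cloud=azure
"""

entries = parse_manifest(MANIFEST)
# No preferences: the first manifest entry is used.
default_choice = pick_bundle(entries, [])
# Client always prefers stream clones (the "change CI" option).
stream_choice = pick_bundle(entries, [("KIND", "stream")])
# Prefer a bundle hosted in the client's own cloud, then stream clones.
azure_choice = pick_bundle(entries, [("cloud", "azure"), ("KIND", "stream")])
```

In this model, the server-side option corresponds to reordering the manifest for Azure clients, while the client-side option corresponds to passing a stream preference on every clone.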

Flags: needinfo?(sheehan)

We'd probably want the equivalent of bug 1585133 for Azure.

(In reply to Mike Hommey [:glandium] from comment #6)

The Mercurial server has a config that prioritizes the stream clonebundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to just always use stream clones.

Once https://github.com/mozilla-releng/OpenCloudConfig/pull/271 has landed and been deployed, we should get stream clone bundles as the default through the robustcheckout update. We should also upload bundles to Azure's S3-equivalent blob storage and prioritize those over the CDN. I think there may be an old patch on Phabricator we can revive that has some of this work completed.

Creating a cached copy of the repo in the image would help here. However, on EC2 I know there are performance issues with caching a copy of the repo in an AMI; I'm not sure whether the same problems affect Azure/Windows/etc.

Alternatively, we could program the workers to start cloning a copy of the repo immediately on startup (i.e. the instance is started from an image without a repo cache, but the cloning completes before the instance takes any tasks). Then we would meet the requirement of "reliable short task run time". Spinning up new workers would appear to take much longer, of course, but from the developer's perspective their tasks would complete in a reasonable amount of time.
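
A rough Python sketch of the "clone on startup, then take tasks" idea follows. It assumes a plain `hg clone` rather than robustcheckout, which the real workers use. The function names, the `accept_tasks` callback, and the cache layout are hypothetical. The `ui.clonebundleprefers=VERSION=packed1` setting is my understanding of how a Mercurial client asks for stream clone bundles and should be verified against the deployed Mercurial version.

```python
import os
import subprocess


def warm_repo_cache(repo_url, cache_dir, run=subprocess.run):
    """Clone repo_url into cache_dir unless a clone already exists there.

    Returns True if a clone was performed, False if the cache was already warm.
    """
    if os.path.isdir(os.path.join(cache_dir, ".hg")):
        return False
    cmd = [
        "hg",
        # Assumption: ask the client to prefer stream clone bundles.
        "--config", "ui.clonebundleprefers=VERSION=packed1",
        # --noupdate skips the working-directory checkout; tasks update later.
        "clone", "--noupdate", repo_url, cache_dir,
    ]
    run(cmd, check=True)  # raises CalledProcessError if the clone fails
    return True


def start_worker(repo_url, cache_dir, accept_tasks, run=subprocess.run):
    """Warm the cache first, and only then begin accepting tasks."""
    warm_repo_cache(repo_url, cache_dir, run=run)
    accept_tasks()
```

A worker boot script could call `start_worker("https://hg.mozilla.org/mozilla-unified", cache_dir, accept_tasks)` with its own cache path and task loop. The slow clone then counts against worker startup time instead of task run time.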

Flags: needinfo?(sheehan)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(michelle)

On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug fix release for Nightly.

The failed tasks got retried automatically and no logs are available.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)

On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug fix release for Nightly.

The failed tasks got retried automatically and no logs are available.

There is no indication in the logs that these tasks are using clonebundles at all. In that case bug 1738924 will certainly help here; it has already been merged and is in the process of being tested and deployed.

(In reply to Connor Sheehan [:sheehan] from comment #10)

There is no indication in the logs that these tasks are using clonebundles at all.

I was wrong: these tasks are using clonebundles, but the output showing we used a clonebundle is in a different location than with Linux/mac workers. We use a zstd bundle here, so stream clones should make a dent in the clone time, and according to masterwayz' test log the time to clone has improved after forcing stream clone bundles via robustcheckout.

It's still pretty slow (~18m instead of ~45m from these failures) but it's getting better. Reviving the Azure bundle hosting patches should improve things more.

Going to land a ci-config patch soon that should improve it to 18 minutes instead of 45 minutes. Still not the best, but it does make SUT pass 3/3 times instead of 1/5 times on try.

Flags: needinfo?(michelle)
Keywords: leave-open
Assignee: nobody → michelle
Attachment #9251705 - Attachment description: WIP: Bug 1738853 - New Azure images using robutcheckout r=#releng-reviewers → Bug 1738853 - New Azure images using robutcheckout r=#releng-reviewers
Status: NEW → ASSIGNED
Pushed by michelle@masterwayz.nl:
https://hg.mozilla.org/ci/ci-configuration/rev/4d0d24c95aad
New Azure images using robutcheckout r=releng-reviewers,jmaher

Has this been resolved with the new images?

Flags: needinfo?(aryx.bugmail)

It's either 1 or ~20 minutes now.

Flags: needinfo?(aryx.bugmail)

Should we keep this bug open?

Flags: needinfo?(aryx.bugmail)

Yes, let's track the task time being <5 minutes in general.
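
One way to track this could be a small check over task run times, sketched below in Python. The task labels are hypothetical and the sample durations are only the rough figures mentioned in this bug (1, ~20 and 37-39 minutes), not real Treeherder data.

```python
from datetime import timedelta

# Target from this bug: startup test tasks should finish in under 5 minutes.
THRESHOLD = timedelta(minutes=5)


def over_threshold(durations, threshold=THRESHOLD):
    """Return the tasks whose run time is at or above the threshold."""
    return {task: d for task, d in durations.items() if d >= threshold}


# Hypothetical labels with durations taken from the figures quoted in this bug.
sample = {
    "run-a": timedelta(minutes=1),
    "run-b": timedelta(minutes=20),
    "run-c": timedelta(minutes=38),
}
slow = over_threshold(sample)
```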

Flags: needinfo?(aryx.bugmail)