Windows startup test tasks sometimes take >30 minutes to clone the hg repository, which delays shipping Nightlies
Categories
(Firefox Build System :: Task Configuration, defect)
Tracking
(firefox-esr91 unaffected, firefox93 unaffected, firefox94 unaffected, firefox95 wontfix, firefox96 wontfix)
| | Tracking | Status |
|---|---|---|
| firefox-esr91 | --- | unaffected |
| firefox93 | --- | unaffected |
| firefox94 | --- | unaffected |
| firefox95 | --- | wontfix |
| firefox96 | --- | wontfix |
People
(Reporter: aryx, Unassigned)
References
(Regression)
Details
(Keywords: leave-open, regression)
Attachments
(1 file)
The Windows startup tests (Treeherder symbol "SUT") often take only 1 minute to run, but sometimes they run for 37-39 minutes because cloning the mozilla-unified repository is slow. Nightlies only ship after the startup test tasks have succeeded.
Expected: a reliably short task run time. If a full clone is necessary, the clone time should be shorter. (Or could the relevant files for this task be generated as an artifact?)
Example: https://treeherder.mozilla.org/logviewer?job_id=356409035&repo=mozilla-central
You can also select the task in the main view and use the "Similar Jobs" at the bottom to find affected tasks: https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HxTUt0EmSju9h1eH5CtqYA.0&revision=6aff7acd8390f8d83eb15589a0ceb376cedb2416&searchStr=windows%2Cstartup
Comment 1•1 year ago
Is this a new change since the new Windows AMIs, or have we seen it before?
Reporter
Comment 2•1 year ago
Thanks. This started after bug 1733684 landed.
Comment 3•1 year ago
We also hit this on Fx 95.0b2.
Comment 4•1 year ago
This is because the -source workers (and, I think, the regular Azure Win10 workers as well) need to perform a full hg clone before they can run jobs, if it's a brand-new worker running its first job.
On AWS I have noticed that the clone fetches the hg bundle from the CDN, so making that work on Azure is one option; alternatively we could ship the image with a cached repository.
Mark, I think we talked about this before; any ideas?
Comment 5•1 year ago
(Mark: see the comment above; Bugzilla had a conflict and didn't NI you on it.)
Comment 6•1 year ago
The Mercurial server has a config that prioritizes the stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to just always use stream clones.
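To make the "priority" mechanism concrete, here is a minimal Python sketch of what ordering a clonebundles manifest in favour of stream ("packed") bundles could look like, whether that re-ordering happens on the server per network or on the client. The manifest lines, attribute names, and BUNDLESPEC values are illustrative assumptions, not the actual hg.mozilla.org manifest.

```python
# Sketch: re-order a clonebundles manifest so stream-clone ("packed") bundles
# are tried first. The manifest format and attribute values are assumptions
# for illustration only.

def parse_manifest(text):
    """Each non-empty line is 'URL key=value key=value ...'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        url, *attrs = line.split()
        entries.append((url, dict(a.split("=", 1) for a in attrs)))
    return entries

def prefer_stream(entries):
    # Stream bundles skip client-side decompression/re-application work, so
    # they clone much faster on well-connected CI workers at the cost of
    # transferring more bytes.
    def is_stream(attrs):
        return "packed" in attrs.get("BUNDLESPEC", "")
    return sorted(entries, key=lambda e: not is_stream(e[1]))

manifest = """\
https://cdn.example/mozilla-unified.zstd.hg BUNDLESPEC=zstd-v2 cdn=true
https://cdn.example/mozilla-unified.packed.hg BUNDLESPEC=none-packed1 cdn=true
"""
for url, attrs in prefer_stream(parse_manifest(manifest)):
    print(url, attrs.get("BUNDLESPEC"))
```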
Comment 7•1 year ago
We'd probably want the equivalent of bug 1585133 for Azure.
Comment 8•1 year ago
(In reply to Mike Hommey [:glandium] from comment #6)
> The Mercurial server has a config that prioritizes the stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to just always use stream clones.
Once https://github.com/mozilla-releng/OpenCloudConfig/pull/271 has landed and been deployed, we should get stream clone bundles as the default through the robustcheckout update. We should also upload bundles to Azure's S3-equivalent blob storage and prioritize those over the CDN. I think there may be an old patch on Phabricator with some of this work completed that we can revive.
Creating a cached copy of the repo in the image would help here; however, I know there are performance issues on EC2 with caching a copy of the repo in an AMI, and I'm not sure whether the same problems affect Azure/Windows/etc.
Alternatively, we could program the workers to start cloning a copy of the repo immediately on startup (i.e. the instance starts from an image without a repo cache, but the clone completes before the instance takes any tasks); then we would meet the requirement of a "reliable short task run time". Spinning up new workers would of course appear to take much longer, but from the developer's perspective tasks would complete in a reasonable amount of time.
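A rough sketch of that "clone on worker startup" idea, assuming the image runs a bootstrap step before the worker registers with the queue; the cache path and the plain `hg clone -U` call are stand-ins for whatever robustcheckout invocation the real workers use.

```python
# Sketch: pre-populate the hg cache before the worker starts taking tasks,
# so the first task sees a local repository instead of a cold full clone.
# Paths and commands are illustrative assumptions, not the actual worker config.
import subprocess
from pathlib import Path

CACHE = Path("D:/hg-shared/mozilla-unified")  # hypothetical cache location

def warm_hg_cache():
    if (CACHE / ".hg").exists():
        return  # already baked into the image or left over from a previous boot
    CACHE.parent.mkdir(parents=True, exist_ok=True)
    # -U: no working copy; tasks update to the revision they need later.
    subprocess.run(
        ["hg", "clone", "-U", "https://hg.mozilla.org/mozilla-unified", str(CACHE)],
        check=True,
    )

if __name__ == "__main__":
    warm_hg_cache()
    # ...only after this returns would the worker register for tasks.
```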
Reporter
Comment 9•1 year ago
On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug-fix release for Nightly.
The failed tasks were retried automatically, and no logs are available.
Comment 10•1 year ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)
> On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug-fix release for Nightly.
> The failed tasks were retried automatically, and no logs are available.
There is no indication in the logs that these tasks are using clonebundles at all. In that case bug 1738924 will certainly help here; it has already been merged and is in the process of being tested and deployed.
Comment 11•1 year ago
(In reply to Connor Sheehan [:sheehan] from comment #10)
> There is no indication in the logs that these tasks are using clonebundles at all.
I was wrong: these tasks are using clonebundles, but the output showing that a clonebundle was used is in a different location than on Linux/Mac workers. We use a zstd bundle here, so stream clones should make a dent in the clone time, and masterwayz's test log shows the clone time has improved after forcing stream clone bundles via robustcheckout.
It's still pretty slow (~18m instead of the ~45m from these failures), but it's getting better. Reviving the Azure bundle hosting patches should improve things further.
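For reference, a minimal sketch of what forcing a stream clone bundle on the client side could look like, using Mercurial's standard `ui.clonebundleprefers` option. Whether robustcheckout wires this up exactly this way, and whether the hg.mozilla.org manifest advertises a `VERSION` attribute, are assumptions here.

```python
# Sketch: ask Mercurial to prefer stream ("packed") clone bundles over
# compressed ones. VERSION=packed1 follows the example in the Mercurial docs
# and may differ from what hg.mozilla.org actually advertises.
import subprocess

subprocess.run(
    [
        "hg",
        "--config", "ui.clonebundleprefers=VERSION=packed1",
        "clone", "-U",
        "https://hg.mozilla.org/mozilla-unified",
        "mozilla-unified",
    ],
    check=True,
)
```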
Comment 12•1 year ago
I'm going to land a ci-config patch soon that should improve it to 18 minutes instead of 45. Still not the best, but it does make SUT pass 3/3 times instead of 1/5 times on try.
Comment 13•1 year ago
Comment 14•1 year ago
Pushed by michelle@masterwayz.nl:
https://hg.mozilla.org/ci/ci-configuration/rev/4d0d24c95aad
New Azure images using robustcheckout r=releng-reviewers,jmaher
Reporter
Comment 18•1 year ago
Yes, let's track the task time being <5 minutes in general.
Comment 19•11 months ago
Is there anything else we could do here, or can we only wait for the TC team? I'm trying to clean up my queue of pending assigned bugs.
Comment 20•11 months ago
Looks like we still want https://bugzilla.mozilla.org/show_bug.cgi?id=1738853#c7. I'm unsure whether there is a bug on file for that or not. I don't see the TC team blocking anything here, though.
Some other thoughts:
- Do sparse profiles improve clone times at all? Maybe we could set one up here.
- Can we optimize the dependency graph at all here to improve parallelization? E.g., if we currently have SUT -> A -> B -> C, can we make it:
  SUT ----> C
  A -> B ---^
  i.e. SUT depends directly on C, and the A -> B chain runs in parallel? (A rough sketch of the scheduling effect follows below.)
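A small sketch of why the second shape helps, with made-up task durations: SUT's earliest start is bounded by the longest dependency chain beneath it, so cutting A and B out of that chain cuts the wait even though the same tasks still run.

```python
# Sketch: SUT's finish time is its own duration plus the longest chain of its
# dependencies. Durations are illustrative; only the graph shapes matter.
import functools

def finish_time(task, deps, durations):
    """Earliest finish time of `task` given a deps[task] -> list-of-deps graph."""
    @functools.cache
    def f(t):
        return durations[t] + max((f(d) for d in deps.get(t, ())), default=0)
    return f(task)

durations = {"A": 10, "B": 10, "C": 10, "SUT": 1}

# Current shape: SUT -> A -> B -> C (SUT depends on A, which depends on B, ...)
serial = {"SUT": ["A"], "A": ["B"], "B": ["C"]}
# Proposed shape: SUT depends only on C; the A -> B chain runs alongside it.
flat = {"SUT": ["C"], "A": ["B"], "B": ["C"]}

print(finish_time("SUT", serial, durations))  # 31: SUT waits for A, B and C
print(finish_time("SUT", flat, durations))    # 11: SUT waits only for C
```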
Comment 21•5 months ago
Unassigning this from myself, unless this is fixed now.