Windows startup test tasks sometimes take >30 minutes to clone the hg repository, which delays shipping Nightlies
Categories
(Firefox Build System :: Task Configuration, defect)
Tracking
(firefox-esr91 unaffected, firefox93 unaffected, firefox94 unaffected, firefox95 wontfix, firefox96 wontfix)
| | Tracking | Status |
|---|---|---|
| firefox-esr91 | --- | unaffected |
| firefox93 | --- | unaffected |
| firefox94 | --- | unaffected |
| firefox95 | --- | wontfix |
| firefox96 | --- | wontfix |
People
(Reporter: aryx, Unassigned)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 file)
The Windows startup tests (Treeherder symbol "SUT") often take only 1 minute to run, but sometimes they run for 37-39 minutes because cloning the mozilla-unified repository is slow. Nightlies only ship after the startup test tasks have succeeded.
Expected: Reliable short task run time. If a full clone is necessary, the clone time should be shorter. (Or could the relevant files for this task get generated as an artifact?)
Example: https://treeherder.mozilla.org/logviewer?job_id=356409035&repo=mozilla-central
You can also select the task in the main view and use the "Similar Jobs" at the bottom to find affected tasks: https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HxTUt0EmSju9h1eH5CtqYA.0&revision=6aff7acd8390f8d83eb15589a0ceb376cedb2416&searchStr=windows%2Cstartup
Comment 1•3 years ago
Is this a new change since the new Windows AMIs, or have we seen it before?
Reporter
Comment 2•3 years ago
Thanks, this started after bug 1733684 landed.
Comment 3•3 years ago
We also hit this on Fx 95.0b2.
Comment 4•3 years ago
This is because the -source workers (and I think the regular Azure Win10 workers as well) need to perform a full hg clone before they can run jobs (if it's a brand-new worker and this is the first job it runs).
On AWS I have noticed that it clones the hg bundle from the CDN, so that is one option to make this work; alternatively, we could ship the image with a cached repository.
Mark, we talked about this before I think, any ideas?
Comment 5•3 years ago
(Mark: see the comment above; Bugzilla hit a conflict and didn't needinfo you on it.)
Comment 6•3 years ago
The Mercurial server has a config that prioritizes stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to always use stream clones.
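A minimal sketch of what "always use stream clones" could look like on the client side, relying on Mercurial's `ui.clonebundleprefers` option. The attribute value `VERSION=packed1` is an assumption about how hg.mozilla.org tags its stream bundles in the clonebundles manifest; this is not the actual robustcheckout change.

```python
# Minimal sketch, not the actual robustcheckout change: ask Mercurial to
# prefer stream ("packed") clone bundles when the server advertises a
# clonebundles manifest. "VERSION=packed1" is an assumed manifest attribute;
# check the manifest served by hg.mozilla.org before relying on it.
import subprocess

def clone_preferring_stream_bundles(upstream_url: str, dest: str) -> None:
    """Clone upstream_url into dest, preferring stream clone bundles."""
    subprocess.run(
        [
            "hg", "clone",
            "--config", "ui.clonebundleprefers=VERSION=packed1",
            upstream_url, dest,
        ],
        check=True,
    )

if __name__ == "__main__":
    clone_preferring_stream_bundles(
        "https://hg.mozilla.org/mozilla-unified", r"C:\hg\mozilla-unified"
    )
```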
Comment 7•3 years ago
We'd probably want the equivalent of bug 1585133 for Azure.
Comment 8•3 years ago
(In reply to Mike Hommey [:glandium] from comment #6)
> The Mercurial server has a config that prioritizes stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to always use stream clones.
Once https://github.com/mozilla-releng/OpenCloudConfig/pull/271 is landed and deployed, we should get stream clone bundles as the default through the robustcheckout update. We should also upload bundles to Azure's S3 equivalent (Blob Storage) and prioritize those over the CDN; I think there may be an old patch on Phabricator with some of this work completed that we can revive.
Creating a cached copy of the repo in the image would also help here. However, on EC2 I know there are performance issues with caching a copy of the repo in an AMI, and I'm not sure whether the same problems affect Azure/Windows/etc.
Alternatively, we could program the workers to immediately start cloning a copy of the repo on startup (i.e., the instance is started from an image without a repo cache, but the cloning completes before the instance takes any tasks); then we would meet the requirement of "reliable short task run time". Spinning up new workers would of course appear to take much longer, but from the developer's perspective tasks would complete in a reasonable amount of time.
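A rough sketch of that startup pre-clone idea follows; the paths and the plain `hg clone --noupdate` are assumptions for illustration, not the real worker bootstrap (which would more likely use robustcheckout's shared store layout).

```python
# Illustrative sketch of the "clone on worker startup" idea: prime a local
# repository cache before the worker accepts any tasks. Paths and the use of
# plain `hg clone` are assumptions, not the real worker bootstrap code.
import os
import subprocess

REPO_URL = "https://hg.mozilla.org/mozilla-unified"  # assumed upstream
CACHE_DIR = r"C:\hg-shared\mozilla-unified"          # hypothetical cache path

def prewarm_repo_cache() -> None:
    """Run once at instance startup, before the worker starts taking tasks."""
    if os.path.isdir(os.path.join(CACHE_DIR, ".hg")):
        return  # cache already present, nothing to do
    # --noupdate: populate the store only; tasks do their own checkout later.
    subprocess.run(["hg", "clone", "--noupdate", REPO_URL, CACHE_DIR], check=True)

if __name__ == "__main__":
    prewarm_repo_cache()
```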
Reporter
Comment 9•3 years ago
On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug-fix release for Nightly.
The failed tasks were retried automatically and no logs are available.
Comment 10•3 years ago
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)
> On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug-fix release for Nightly.
> The failed tasks were retried automatically and no logs are available.
There is no indication in the logs that these tasks are using clonebundles at all. In that case, bug 1738924 will certainly help here; it's already been merged and is in the process of being tested and deployed.
Comment 11•3 years ago
(In reply to Connor Sheehan [:sheehan] from comment #10)
> There is no indication in the logs that these tasks are using clonebundles at all.
I was wrong: these tasks are using clonebundles, but the output showing we used a clonebundle is in a different location than on Linux/Mac workers. We use a zstd bundle here, so stream clones should make a dent in the clone time, and masterwayz's test log shows the time to clone has improved after forcing stream clone bundles via robustcheckout.
It's still pretty slow (~18 min instead of the ~45 min seen in these failures), but it's getting better. Reviving the Azure bundle hosting patches should improve things more.
Comment 12•3 years ago
I'm going to land a ci-config patch soon that should cut the clone time to about 18 minutes instead of 45. Still not the best, but it does make SUT pass 3/3 times instead of 1/5 times on try.
Comment 13•3 years ago
Comment 14•3 years ago
Reporter
Comment 18•3 years ago
Yes, let's track the task time being <5 minutes in general.
Comment 19•3 years ago
Is there anything else we could do here, or do we just need to wait for the TC team? I'm trying to clean up my queue of pending assigned bugs.
Comment 20•3 years ago
Looks like we still want https://bugzilla.mozilla.org/show_bug.cgi?id=1738853#c7. I'm not sure whether there is a bug on file for that. I don't see the TC team blocking anything here, though.
Some other thoughts:
- Do sparse profiles improve clone times at all? Maybe we could set one up here (see the sketch after this list).
- Can we optimize the dependency graph at all here to improve parallelization? E.g., if we currently have
      SUT -> A -> B -> C
  can we make it:
      SUT -----> C
      A -> B ---^
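A sketch of the sparse-profile idea above. Note that sparse profiles mainly shrink the working-copy update rather than the clone itself, which is part of why it's an open question whether they help clone times. The profile path, checkout path, and the `debugsparse --enable-profile` invocation are illustrative assumptions, not the existing task configuration.

```python
# Sketch of enabling an in-tree sparse profile with Mercurial's (experimental)
# sparse extension so the working copy only materializes the files the
# startup test needs. Profile path and checkout path are assumptions.
import subprocess

CHECKOUT = r"C:\builds\mozilla-unified"      # hypothetical task checkout
PROFILE = "build/sparse-profiles/taskgraph"  # example in-tree profile path

def enable_sparse_profile() -> None:
    subprocess.run(
        [
            "hg", "--cwd", CHECKOUT,
            "--config", "extensions.sparse=",  # sparse ships with Mercurial
            "debugsparse", "--enable-profile", PROFILE,
        ],
        check=True,
    )

if __name__ == "__main__":
    enable_sparse_profile()
```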
Comment 21•3 years ago
Unassigned this from myself, unless this is fixed now.
Comment 22•2 years ago
The leave-open keyword is set and there has been no activity for 6 months.
:ahal, maybe it's time to close this bug?
For more information, please visit auto_nag documentation.
Comment 23•2 years ago
This is essentially the same as bug 1799052 AIUI, and some work's been happening there.
Comment 24•8 months ago
Looks like there are still a few cases where the clone is super slow, e.g.:
https://firefox-ci-tc.services.mozilla.com/tasks/O4OrDawrQ9a7nK7qlJuVzA/runs/0/logs/public/logs/live_backing.log
https://firefox-ci-tc.services.mozilla.com/tasks/EiNbqtacTmKrMwXF5XMo3g/runs/0/logs/public/logs/live_backing.log
https://firefox-ci-tc.services.mozilla.com/tasks/MCcY0FD7S4qatnFA5jnuSA/runs/0/logs/public/logs/live_backing.log
Average and max clone times on Windows -source workers look like this for the last ~10 days:

| date | instance_type | average_clone_seconds | max_clone_seconds |
|---|---|---|---|
| 2024-07-16 | Standard_F8s_v2 | 361.0909612708619 | 590.06703519821 |
| 2024-07-17 | Standard_F8s_v2 | 388.26866307406323 | 1596.8691520691 |
| 2024-07-17 | Standard_F32s_v2 | 184.30901334966984 | 392.61391997337 |
| 2024-07-18 | Standard_F8s_v2 | 370.657629318363 | 666.9570684433 |
| 2024-07-19 | Standard_F8s_v2 | 377.46113437509763 | 593.83016681671 |
| 2024-07-19 | Standard_F32s_v2 | 282.63595340328851 | 546.48441195488 |
| 2024-07-20 | Standard_F8s_v2 | 415.73494991864254 | 599.94339489937 |
| 2024-07-20 | Standard_F32s_v2 | 327.97066480973189 | 398.05749821663 |
| 2024-07-21 | Standard_F32s_v2 | 360.31971017519624 | 410.64426183701 |
| 2024-07-21 | Standard_F8s_v2 | 412.77955138377689 | 541.10026526451 |
| 2024-07-22 | Standard_F8s_v2 | 334.8340514398825 | 617.51078820229 |
| 2024-07-22 | Standard_F32s_v2 | 246.55422240840511 | 461.62169647217 |
| 2024-07-23 | Standard_F8s_v2 | 395.9791845713799 | 1793.6068696976 |
| 2024-07-23 | Standard_F32s_v2 | 213.96679587614821 | 601.87324810028 |
| 2024-07-24 | Standard_F8s_v2 | 388.64158589821164 | 653.48181796074 |
| 2024-07-24 | Standard_F32s_v2 | 222.56959745223452 | 458.45049333572 |
| 2024-07-25 | Standard_F32s_v2 | 201.54874094195779 | 568.88640546799 |
| 2024-07-25 | Standard_F8s_v2 | 340.30335808941686 | 622.87601709366 |
| 2024-07-26 | Standard_F8s_v2 | 378.12601283434304 | 581.623919487 |
| 2024-07-26 | Standard_F32s_v2 | 320.04382702006308 | 445.13493967056 |
[edit: fixed numbers, first table was all wrong]
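For reference, a minimal sketch of how the per-day averages and maxima above could be computed, assuming the per-task clone durations have already been extracted from the task logs into (date, instance_type, seconds) records; the log parsing itself is not shown and the sample values are illustrative only.

```python
# Minimal sketch: aggregate per-task clone durations into per-day,
# per-instance-type average/max figures like the table above. Assumes the
# durations were already pulled out of the live_backing.log files.
from collections import defaultdict
from statistics import mean

# (date, instance_type, clone_seconds) records -- illustrative values only.
records = [
    ("2024-07-16", "Standard_F8s_v2", 361.1),
    ("2024-07-16", "Standard_F8s_v2", 590.1),
    ("2024-07-17", "Standard_F32s_v2", 184.3),
]

grouped = defaultdict(list)
for date, instance_type, seconds in records:
    grouped[(date, instance_type)].append(seconds)

print("date\tinstance_type\taverage_clone_seconds\tmax_clone_seconds")
for (date, instance_type), values in sorted(grouped.items()):
    print(f"{date}\t{instance_type}\t{mean(values):.1f}\t{max(values):.1f}")
```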
Comment 25•8 months ago
For the table in comment 24, can we include the region the worker was in? Looking at the instance type, those workers are a combination of -source testers and 2022 builders, so I am wondering what the common factor is.