Open Bug 1738853 Opened 3 years ago Updated 8 months ago

Windows startup test tasks sometimes take >30 minutes to clone hg repository which delays shipping Nightlies

Categories

(Firefox Build System :: Task Configuration, defect)

Tracking

(firefox-esr91 unaffected, firefox93 unaffected, firefox94 unaffected, firefox95 wontfix, firefox96 wontfix)


People

(Reporter: aryx, Unassigned)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file)

The Windows startup tests (Treeherder symbol "SUT") often take only 1 minute to run, but sometimes they run for 37-39 minutes because cloning the mozilla-unified repository is slow. Nightlies only ship after the startup test tasks have succeeded.

Expected: Reliable short task run time. If a full clone is necessary, the clone time should be shorter. (Or could the relevant files for this task get generated as an artifact?)

Example: https://treeherder.mozilla.org/logviewer?job_id=356409035&repo=mozilla-central
You can also select the task in the main view and use the "Similar Jobs" tab at the bottom to find affected tasks: https://treeherder.mozilla.org/jobs?repo=mozilla-central&selectedTaskRun=HxTUt0EmSju9h1eH5CtqYA.0&revision=6aff7acd8390f8d83eb15589a0ceb376cedb2416&searchStr=windows%2Cstartup

Is this new since the new Windows AMIs, or have we seen it before?

Flags: needinfo?(aryx.bugmail)
Has Regression Range: --- → yes

We also hit this on Fx 95.0b2.

This is because the -source workers (and, I think, the regular Azure Win10 workers as well) need to perform a full hg clone before they can run jobs, if it's a brand-new worker running its first job.
On AWS I have noticed that the clone pulls the hg bundle from the CDN, so one option is to make that work here as well; alternatively, we could ship the image with a cached repository.
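For anyone debugging this, the list of bundles the server advertises can be inspected directly; a quick check, assuming hgweb exposes the clonebundles wire command via the usual ?cmd= HTTP interface (worth verifying against the live server):

    curl 'https://hg.mozilla.org/mozilla-unified?cmd=clonebundles'

Each line of the response is a bundle URL followed by attributes (compression, SNI requirement, etc.), and the client takes the first acceptable entry.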

Mark, we talked about this before I think, any ideas?

Flags: needinfo?(mcornmesser)

(Mark: see the comment above; Bugzilla hit a mid-air collision and didn't needinfo you on that comment.)

The Mercurial server has a config that prioritizes stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to always use stream clones.
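For reference, this preference can also be expressed client-side through Mercurial's ui.clonebundleprefers option, which ranks manifest entries by their advertised attributes; a minimal sketch (the exact BUNDLESPEC strings served for stream bundles should be checked against the live manifest, so treat the value below as illustrative):

    [ui]
    # Prefer stream ("packed") clone bundles over compressed ones.
    clonebundleprefers = BUNDLESPEC=none-packed1

robustcheckout could set the equivalent programmatically so every CI clone gets the same behavior without touching per-worker hgrc files.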

Flags: needinfo?(sheehan)

We'd probably want the equivalent of bug 1585133 for Azure.

(In reply to Mike Hommey [:glandium] from comment #6)

The Mercurial server has a config that prioritizes stream clone bundles for clients on AWS. We'd need the same for Azure, or we should change CI (robustcheckout) to always use stream clones.

Once https://github.com/mozilla-releng/OpenCloudConfig/pull/271 is landed and deployed we should get stream clone bundles as the default through the robustcheckout update. We should also upload bundles to Azure's S3-equivalent blob storage and prioritize those over the CDN. I think there may be an old patch on Phabricator we can revive that has some of this work completed.
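To illustrate the ordering idea: the clonebundles manifest is a plain-text list of "URL attribute=value ..." lines, and clients try entries in order, so hosting bundles in Azure blob storage and listing them ahead of the CDN would steer Azure workers there. A hypothetical fragment (the Azure URL, container name, and file names are invented for illustration; only BUNDLESPEC and REQUIRESNI are attributes I'd expect to match the real manifest):

    https://example.blob.core.windows.net/hg-bundles/mozilla-unified.zstd.hg BUNDLESPEC=zstd-v2 REQUIRESNI=true
    https://hg.cdn.mozilla.net/mozilla-unified/<revision>.zstd.hg BUNDLESPEC=zstd-v2 REQUIRESNI=true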

Creating a cached copy of the repo in the image would help here, however on EC2 I know there are performance issues with caching a copy of the repo in an AMI - not sure if the same problems affect Azure/Windows/etc.

Alternatively, we could program the workers to start cloning a copy of the repo immediately on startup (i.e., the instance starts from an image without a repo cache, but the clone completes before the instance takes any tasks); then we would meet the requirement of "reliable short task run time". Spinning up new workers would of course appear to take much longer, but from the developer perspective tasks would complete in a reasonable amount of time.
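A minimal sketch of that pre-warm idea, assuming the image can run a startup hook before the worker agent starts claiming tasks (the cache path and the hook mechanism are placeholders, not the real worker config):

    import subprocess
    import sys

    # Hypothetical cache location; the real worker image would wire this
    # into its checkout/share setup.
    CACHE_DIR = r"C:\hg-cache\mozilla-unified"

    def prewarm_clone() -> int:
        """Clone mozilla-unified before the worker takes any tasks.

        --stream requests a streamed (uncompressed) clone, trading bandwidth
        for much less client-side CPU; --noupdate skips the working-copy
        checkout.
        """
        cmd = [
            "hg", "clone", "--stream", "--noupdate",
            "https://hg.mozilla.org/mozilla-unified",
            CACHE_DIR,
        ]
        return subprocess.run(cmd, check=False).returncode

    if __name__ == "__main__":
        sys.exit(prewarm_clone())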

Flags: needinfo?(sheehan)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(michelle)

On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug fix release for Nightly.

The failed tasks got retried automatically and no logs are available.

(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #9)

On Saturday, the Windows startup tests succeeded only on the fourth attempt, and it took 4h 46min from scheduling until the startup test succeeded. This delayed an urgent bug fix release for Nightly.

The failed tasks got retried automatically and no logs are available.

There is no indication in the logs that these tasks are using clonebundles at all. In that case bug 1738924 will certainly help here; it has already been merged and is in the process of being tested and deployed.

(In reply to Connor Sheehan [:sheehan] from comment #10)

There is no indication in the logs that these tasks are using clonebundles at all.

I was wrong: these tasks are using clonebundles, but the output showing that a clone bundle was used is in a different location than on the Linux/Mac workers. We use a zstd bundle here, so stream clones should make a dent in the clone time, and masterwayz' test log shows the clone time improved after forcing stream clone bundles via robustcheckout.

It's still pretty slow (~18 min instead of the ~45 min seen in these failures), but it's getting better. Reviving the Azure bundle hosting patches should improve things further.

Going to land a ci-config patch soon that should improve it to 18 minutes instead of 45 minutes. Still not the best, but it does make SUT pass 3/3 times instead of 1/5 times on try.

Flags: needinfo?(michelle)
Keywords: leave-open
Assignee: nobody → michelle
Attachment #9251705 - Attachment description: WIP: Bug 1738853 - New Azure images using robutcheckout r=#releng-reviewers → Bug 1738853 - New Azure images using robutcheckout r=#releng-reviewers
Status: NEW → ASSIGNED
Pushed by michelle@masterwayz.nl: https://hg.mozilla.org/ci/ci-configuration/rev/4d0d24c95aad New Azure images using robutcheckout r=releng-reviewers,jmaher

Has this been resolved with the new images?

Flags: needinfo?(aryx.bugmail)

It's either ~1 minute or ~20 minutes now.

Flags: needinfo?(aryx.bugmail)

Should we keep this bug open?

Flags: needinfo?(aryx.bugmail)

Yes, let's use it to track getting the task time below 5 minutes in general.

Flags: needinfo?(aryx.bugmail)

Is there anything else we could do here, or do we just need to wait for the TC team? I'm trying to clean up my queue of pending assigned bugs.

Flags: needinfo?(ahal)

Looks like we still want https://bugzilla.mozilla.org/show_bug.cgi?id=1738853#c7. Unsure whether there is a bug on file for that. I don't see the TC team blocking anything here, though.

Some other thoughts..

  • Do sparse profiles improve clone times at all? Maybe we could set one up here.
  • Can we optimize the dependency graph at all here to improve parallelization? E.g., if we currently have SUT -> A -> B -> C, can we restructure it so that SUT and the A -> B chain become parallel branches attached directly to C instead of one serial chain? Something like (see also the hypothetical sketch after this list):

SUT ------> C
A -> B ----^
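Assuming the arrows mean "depends on" (i.e., SUT currently waits on A, which waits on B, which waits on C), a hypothetical kind.yml-style fragment of the proposed shape (the task names and the startup-test kind layout are invented here; the real definitions would need to be checked):

    # Not the real startup-test definition -- illustration only.
    tasks:
        SUT:
            dependencies:
                # Depend directly on C instead of on A, so SUT is no longer
                # serialized behind the A -> B portion of the chain.
                c: C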

Flags: needinfo?(ahal)
Assignee: mgoossens → nobody
Status: ASSIGNED → NEW

Unassigned this from myself, unless this is fixed now.

The leave-open keyword is set and there has been no activity for 6 months.
:ahal, maybe it's time to close this bug?
For more information, please visit the auto_nag documentation.

Flags: needinfo?(ahal)

This is essentially the same as bug 1799052 AIUI, and some work's been happening there.

See Also: → 1799052
Flags: needinfo?(ahal)
Keywords: leave-open

Looks like there are still a few cases where the clone is super slow, e.g.:
https://firefox-ci-tc.services.mozilla.com/tasks/O4OrDawrQ9a7nK7qlJuVzA/runs/0/logs/public/logs/live_backing.log
https://firefox-ci-tc.services.mozilla.com/tasks/EiNbqtacTmKrMwXF5XMo3g/runs/0/logs/public/logs/live_backing.log
https://firefox-ci-tc.services.mozilla.com/tasks/MCcY0FD7S4qatnFA5jnuSA/runs/0/logs/public/logs/live_backing.log

Average and max clone times on the Windows -source workers look like this for the last ~10 days:

date        instance_type     average_clone_seconds  max_clone_seconds
2024-07-16  Standard_F8s_v2   361.1                  590.1
2024-07-17  Standard_F8s_v2   388.3                  1596.9
2024-07-17  Standard_F32s_v2  184.3                  392.6
2024-07-18  Standard_F8s_v2   370.7                  667.0
2024-07-19  Standard_F8s_v2   377.5                  593.8
2024-07-19  Standard_F32s_v2  282.6                  546.5
2024-07-20  Standard_F8s_v2   415.7                  599.9
2024-07-20  Standard_F32s_v2  328.0                  398.1
2024-07-21  Standard_F32s_v2  360.3                  410.6
2024-07-21  Standard_F8s_v2   412.8                  541.1
2024-07-22  Standard_F8s_v2   334.8                  617.5
2024-07-22  Standard_F32s_v2  246.6                  461.6
2024-07-23  Standard_F8s_v2   396.0                  1793.6
2024-07-23  Standard_F32s_v2  214.0                  601.9
2024-07-24  Standard_F8s_v2   388.6                  653.5
2024-07-24  Standard_F32s_v2  222.6                  458.5
2024-07-25  Standard_F32s_v2  201.5                  568.9
2024-07-25  Standard_F8s_v2   340.3                  622.9
2024-07-26  Standard_F8s_v2   378.1                  581.6
2024-07-26  Standard_F32s_v2  320.0                  445.1

[edit: fixed numbers, first table was all wrong]

Severity: -- → S3

For the table in comment 24, can we include the region the worker was in?

Looking at the instance types, those workers are a combination of -source testers and 2022 builders, so I am wondering what the common factor is.
