Open Bug 1964338 Opened 7 months ago Updated 5 months ago

Very slow startup of Firefox for MacOS 14.7.x workers in CI

Categories

(Testing :: General, defect)


Tracking

(Not tracked)

People

(Reporter: whimboo, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: perf)

Attachments

(1 file)

Over on bug 1960740 I noticed that we have very slow startup times of Firefox these days on the macOS workers. While startup usually took only around 1.5s on the new 14.7 workers (which was already 50% higher than on 10.15), we are now at around 4s, as can be seen in this log:

https://treeherder.mozilla.org/logviewer?job_id=506448689&repo=autoland&lineNumber=30937-30996

WebDriver tests are timing out because they take longer than 25s in CI, while locally they finish in around 5s.

Talos tests also show a slight increase, but also a lot of noise, which indicates that something hasn't been working correctly since the new workers were made available.

Maybe we could get a Gecko profile created while running a single wdspec test in CI by passing the profiler-related environment variables and storing the profile in the upload folder so it can be made available as an artifact?
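As a rough sketch of what I mean (the profile file name is just a placeholder), the test task would only need to export the startup profiler variables and point the output at the artifact upload folder before launching Firefox:

# Hedged sketch: enable startup profiling and write the profile on shutdown.
# Assumes MOZ_UPLOAD_DIR points at the task's artifact upload folder as usual in CI.
export MOZ_PROFILER_STARTUP=1
export MOZ_PROFILER_SHUTDOWN="$MOZ_UPLOAD_DIR/startup-profile.json"

The resulting JSON could then be loaded into https://profiler.firefox.com.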

It looks like the timeouts only happen on the 14.70 machines and I'm not able to see them for 14.7.1, which should also be in use for web-platform tests, or am I wrong?

Flags: needinfo?(jmaher)

Interestingly, on bug 1960740 I see that the failures only occur on 14.7.1.

We might be facing sandbox issues similar to what happened with bug 1893921.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #1)

It looks like the timeouts only happen on the 14.70 machines and I'm not able to see them for 14.7.1, which should also be in use for web-platform tests, or am I wrong?

Sorry, this was wrong information. After discussing with Julian, all these failures are indeed coming from machines running 14.7.1. I used the platform name as given in the intermittent-failures list, but those names can refer to both 14.7 and 14.7.1. This is confusing to me :/

There was a mistake in reimaging some machines from staging -> production, and a few machines picked up 14.7.1 - this is not our normal mode of operation. Historically and by design we don't install minor patch versions, only intentional upgrades. I see how this can be confusing. The fix is to reinstall all the machines to a supported (i.e. signed) version, which right now is 14.7.5 - this will require some testing and deployment, not an overnight task.

The fix for bug 1893921 was in the product, not in our machine imaging - please correct me if I am wrong.

I am confused as to why this issue would be a problem only on the newly installed machines?

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #4)

There was a mistake in reimaging some machines from staging -> production, and a few machines picked up 14.7.1 - this is not our normal mode of operation. Historically and by design we don't install minor patch versions, only intentional upgrades. I see how this can be confusing. The fix is to reinstall all the machines to a supported (i.e. signed) version, which right now is 14.7.5 - this will require some testing and deployment, not an overnight task.

The fix for bug 1893921 was in the product, not in our machine imaging - please correct me if I am wrong.

This was slightly different because the GitHub workers were using a macOS version we are not using in CI, so we had to fix Firefox in order to stop hitting sandbox issues with that specific version of macOS (macOS 12, I think, at the time).

I am confused as to why this issue would be a problem only on the newly installed machines?

Not sure, it's just a guess, but we already had scenarios where a specific macOS version was triggering timeouts and slowness.
Maybe we'll have the same issue with the 14.7.5 version we now want to use.

Joel, is there a plan to at least quarantine the existing 14.7.1 workers? Or are there too many of them to take out of the pool at the moment? If that's the case, what is a rough timeline for when no 14.7.1 workers will be around anymore?

Flags: needinfo?(jmaher)

Currently we are looking at EOD Friday (May 9th). This might get done sooner, but we are saving 1-2 days for errors.

Flags: needinfo?(jmaher)

Thank you, Joel! I think we can wait until then. Please keep us updated once the re-imaging of the affected workers is done.

Blocks: 1650025

As of last night (May 7th PDT), all of the macosx 14.70 workers were updated to the 14.7.5 patch level - we should have the same version for all machines in the pool.

This is good to hear, but I don't think that all machines have been taken care of. I can see a few which still have 14.7.1 installed:

And I suspect there might be more.

Ryan, could you please check? Thanks.

Flags: needinfo?(rcurran)

They look good on my end...

[root@macmini-r8-39.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName:		macOS
ProductVersion:		14.7.5
BuildVersion:		23H527

[root@macmini-r8-46.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName:		macOS
ProductVersion:		14.7.5
BuildVersion:		23H527

[root@macmini-r8-36.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName:		macOS
ProductVersion:		14.7.5
BuildVersion:		23H527
Flags: needinfo?(rcurran)

gecko-t-osx-1400-r8 report

Ran a report of all the mac14 workers in gecko-t-osx-1400-r8 and attached it as gecko-t-osx-1400-r8.csv.

Thanks for checking that, Ryan. That is interesting, and I wonder if they were only recently (meaning today) updated? For macmini-r8-36 the test job started today at 03:35:18 and finished at 04:01:45, and the log clearly states macOS 14.7.1 (23H222).

I'll trigger some more jobs on this particular try push to see if it works now.

Interestingly, the system version as now reported for macmini-r8-284 is "macOS 14.7 (23H124)". Is there a reason why we do not include the patch level? It would help to identify issues like the one covered by this bug in the future. Otherwise we won't be able to see that an update has actually happened.

Also there are still workers running on 14.7.1:

Both are listed in the CSV file as having 14.7.5 installed, so there is clearly a mismatch.

Flags: needinfo?(rcurran)

I am not familiar with the system version string; I only hack up os_version to help standardize our manifest conditions. I believe what you are seeing is coming from the Taskcluster worker, not from any in-tree test scripts. For that specific link, there is a wptreport.json that shows the mozinfo variables and reports 14.7.5.
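For reference, a quick way to read that value back out of the report (just a sketch, assuming the file contains the usual run_info block):

# Hedged sketch: print the run_info block (OS name/version etc.) recorded by the wpt harness
jq '.run_info' wptreport.json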

For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #18)

For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.

Pete, could you maybe have a look?

Otherwise we are still seeing timeouts for the related WebDriver BiDi tests. Whether it's the same underlying issue would have to be checked. I assume we are now definitely staying with 14.7.5.

Flags: needinfo?(rcurran) → needinfo?(pmoore)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #19)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #18)

For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.

Pete, could you maybe have a look?

I believe this comes from https://github.com/search?q=repo%3Amozilla-platform-ops%2Fronin_puppet%20system_version&type=code

Ryan, do you know where this is configured?

Flags: needinfo?(pmoore) → needinfo?(rcurran)

99% of the pool was on 14.7.5 on May 7th 6:07PM EST
100% of the pool was on 14.7.5 on May 8th 9:47AM EST

Flags: needinfo?(rcurran)
See Also: → 1965460

Oh I see. /opt/worker/worker-runner-config.yaml and /opt/worker/generic-worker.conf.yaml aren't refreshing/picking up the updated version.

It looks like the staging pool (updated a day or two prior) did pick up the changes in those config files. Let's give it a couple of days to cook and I'll check on Monday to make sure the configs are up to date.
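For the check, something along these lines should be enough (just a sketch, assuming the OS version string actually shows up in those two config files):

# Hedged spot check on a worker: compare the reported OS version with
# whatever version string the worker configs still carry.
sw_vers
grep -iE 'version' /opt/worker/worker-runner-config.yaml /opt/worker/generic-worker.conf.yaml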

I pushed a try build with the metadata for the wdspec tests updated. A lot of them actually got marked by wptsync as intermittently timing out:

https://treeherder.mozilla.org/jobs?repo=try&revision=9c811e69762aa5d8a1af3fd42a6b2f281a8932d1

Summary: Very slow startup of Firefox for MacOS workers in CI → Very slow startup of Firefox for MacOS 14.7.x workers in CI

It looks like re-puppetizing the hosts speeds this up. I'll start that now and let you know when it's finished.

Puppetizing the hosts updates /opt/worker/worker-runner-config.yaml.

After the next reboot, /opt/worker/generic-worker.conf.yaml updates. I think we're good.

Flags: needinfo?(pmoore)

I was able to update the wptrunner code to create a Gecko profile from the startup of an opt build of Firefox on a macOS 14.7.5 worker. As it shows, it takes nearly 4s until about:blank is loaded in the initial tab. This is because there is an about 1.7s wait during which only the renderer seems to show some activity:

https://share.firefox.dev/4maIqaW

This somewhat looks like a graphics issue to me. Markus, could you maybe take a look?

Flags: needinfo?(mstange.moz)
Keywords: perf

I'm very confused, that profile is from a Linux build? It seems it's compiling shaders and using swrast, which is not great though, and the main thread is blocked on that, which is quite terrible.

Flags: needinfo?(hskupin)

I agree, the provided profile is from Linux, not from macOS.

Flags: needinfo?(mstange.moz)

Yes, sorry, I completely messed this up. Due to the long wait times for macOS test jobs I triggered Linux builds first to check that a profile can be generated at all. Let me trigger a build for macOS now.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #27)

It seems it's compiling shaders and using swrast, which is not great though, and the main thread is blocked on that, which is quite terrible.

Would you mind filing a new bug for that? You seem to have way more insight into this already. Otherwise I'm happy to do it myself.

Flags: needinfo?(hskupin)
No longer blocks: 1965637
See Also: → 1965637
Blocks: 1791951
Blocks: 1825501
Blocks: 1962306
Blocks: 1903965
Blocks: 1897948
Blocks: 1878934
Blocks: 1864081
Blocks: 1858034
Blocks: 1769430
Blocks: 1905655

I recorded profiles for Mac and Windows for the same test as in Henrik's Linux profile. I don't see anything particularly slow, so I'll record one for one of the tests which are timing out now.

Blocks: 1966016

As discussed with Julian, we may have to split this bug into two different ones given that we see delays not only during startup but also for each individual test which triggers a new web content process. As Julian mentioned to me, in the timeout cases there is a ~1000ms delay before the pre-allocated web content process is ready. We agreed that he will do some more checks and file a separate bug.

For the macOS profile provided above I noticed that the browser-startup-idle-tasks-finished notification only gets sent after 6s! Running tests locally I can see this notification coming in after 1.6s when running wpt tests, and after only 800ms when using mach run.

I wonder where these additional 800ms are coming from when running wpt tests.
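To narrow that down, a rough local comparison could look like the following (the output paths and the test selection are placeholders, and it assumes the startup profiler environment variables get propagated to the Firefox instance launched by the harness):

# Hedged sketch: capture a startup profile for both launch paths and compare the
# markers up to the idle-tasks-finished notification in https://profiler.firefox.com
MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=/tmp/profile-mach-run.json ./mach run
MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=/tmp/profile-wpt.json ./mach wpt testing/web-platform/tests/webdriver/tests/...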

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #33)

I wonder where these additional 800ms are coming from when running wpt tests.

Ok, this is actually caused by using an existing profile. When I remove the profile and run mach run again, I still see a 160ms delay just for _updateDatabase from the Add-on Manager. But that's still less than the 380ms for a wpt run.

Both bug 1946247 and bug 1946248, which cover font-loading related delays during startup, are causing issues here as well.

Markus, the profile referenced on bug 1946247 doesn't exist anymore. Is that the same issue as the gfxPlatformFontList::LoadFontInfo time during the startup path? This can also be seen locally (macOS 15.2) during wpt runs, but not with mach run. Maybe you could also check what else causes the delay? See comment 32 for the uploaded macOS profile. Thanks!

Depends on: 1946247, 1946248
Flags: needinfo?(mstange.moz)

The profile on Windows also shows about 1.8s spent during startup just for loading the font info. I assume this should get its own widget bug?

Blocks: 1965465
Blocks: 1865759

Puppetizing the hosts updates /opt/worker/worker-runner-config.yaml.

After the next reboot, /opt/worker/generic-worker.conf.yaml updates. I think we're good.

Hi Ryan, I saw a needinfo for me here - just checking if you still need anything from me? Thanks.

Sorry about that, Pete, I think we are good on my end. Do you need anything else from me re this issue, :whimboo, :pmoore, :mstange?

Flags: needinfo?(pmoore)
Flags: needinfo?(mstange.moz)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #35)

Both bug 1946247 and bug 1946248, which cover font-loading related delays during startup, are causing issues here as well.

Markus, the profile referenced on bug 1946247 doesn't exist anymore. Is that the same issue as the gfxPlatformFontList::LoadFontInfo time during the startup path? This can also be seen locally (macOS 15.2) during wpt runs, but not with mach run. Maybe you could also check what else causes the delay? See comment 32 for the uploaded macOS profile. Thanks!

Note that this question still stands for Markus.

Flags: needinfo?(mstange.moz)

Oh, and as we also detected, most of our tests are failing not necessarily due to the slower startup but because each content process takes an additional 1s before it is ready to be used. Haik will most likely file a new bug for that soon, given that this is related to our sandboxing feature.

Depends on: 1967525
Blocks: 1827331
Blocks: 1960072
No longer blocks: 1827331
No longer blocks: 1960072
See Also: → 1970956
Blocks: 1794745