Very slow startup of Firefox for macOS 14.7.x workers in CI
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
People
(Reporter: whimboo, Unassigned, NeedInfo)
References
(Depends on 1 open bug, Blocks 2 open bugs)
Details
(Keywords: perf)
Attachments
(1 file)
gecko-t-osx-1400-r8.csv (9.41 KB, text/csv)
Over on bug 1960740 I noticed that we currently have very slow startup times of Firefox on the macOS workers. While it usually took only around 1.5s on the new 14.7 workers (which was already 50% higher than on 10.15), we are now at around 4s, as can be seen in this log:
https://treeherder.mozilla.org/logviewer?job_id=506448689&repo=autoland&lineNumber=30937-30996
WebDriver tests are timing out because they take longer than 25s in CI, while locally they finish in around 5s.
Talos tests also show a slight increase, but with a lot of noise, which indicates that something hasn't been working correctly since the new workers were made available.
Maybe we could get a Gecko profile created while running a single wdspec test in CI by passing the profiler-related environment variables, and store it in the upload folder so it can be made available as an artifact?
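Something along these lines should work (just a rough sketch, not the actual wptrunner change; the binary path and profile directory are placeholders, and I'm assuming MOZ_UPLOAD_DIR points at the task's artifact upload folder as usual):

```python
# Rough sketch: enable the Gecko startup profiler via environment variables and
# have the profile written into the upload folder so it gets picked up as a
# task artifact. Binary path and profile directory are placeholders.
import os
import subprocess

firefox = "/path/to/firefox"  # placeholder
upload_dir = os.environ.get("MOZ_UPLOAD_DIR", "/tmp")

env = os.environ.copy()
env["MOZ_PROFILER_STARTUP"] = "1"  # start profiling as early as possible
env["MOZ_PROFILER_SHUTDOWN"] = os.path.join(upload_dir, "startup-profile.json")

# Launch Firefox (or the single wdspec test) with the profiler active; the
# profile is written out when the browser shuts down.
subprocess.run([firefox, "--profile", "/tmp/wd-profile", "about:blank"], env=env)
```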
Reporter
Comment 1•7 months ago
It looks like the timeouts only happen on the 14.70 machines, and I'm not able to see them for 14.7.1, which should also be in use for web-platform tests - or am I wrong?
Comment 2•7 months ago
Interestingly on bug 1960740 I see that the failures only occur on 14.7.1.
We might be facing sandbox issues similar to what happened with bug 1893921.
Reporter
Comment 3•7 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #1)
It looks like the timeouts only happen on the 14.70 machines, and I'm not able to see them for 14.7.1, which should also be in use for web-platform tests - or am I wrong?
Sorry, this was wrong information. After discussing with Julian, all these failures are indeed coming from machines running 14.7.1. I used the platform name as given in the intermittent-failures list, but it can refer to both 14.7 and 14.7.1. This is causing confusion for me :/
Comment 4•7 months ago
There was a mistake in reimaging some machines from staging -> production, and a few machines picked up 14.7.1 - this is not our normal mode of operations. Historically and by design we don't install minor patch versions, only intentional upgrades. I see how this can be confusing. The fix is to reinstall all the machines with a supported (i.e. signed) version, which right now is 14.7.5 - this will require some testing and deployment, not an overnight task.
The fix for bug 1893921 was in the product, not in our machine imaging; please correct me if I am wrong.
I am confused why this issue would be a problem only on the newly installed machines?
Comment 5•7 months ago
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #4)
There was a mistake in reimaging some machines from staging -> production, and a few machines picked up 14.7.1 - this is not our normal mode of operations. Historically and by design we don't install minor patch versions, only intentional upgrades. I see how this can be confusing. The fix is to reinstall all the machines with a supported (i.e. signed) version, which right now is 14.7.5 - this will require some testing and deployment, not an overnight task.
The fix for bug 1893921 was in the product, not in our machine imaging; please correct me if I am wrong.
This was slightly different because the GitHub workers were using a macOS version we are not using in CI, so we had to fix Firefox in order to stop hitting sandbox issues with that specific version of macOS (macOS 12, I think, at that time).
I am confused why this issue would be a problem only on the newly installed machines?
Not sure, it's just a guess, but we already had scenarios where a specific macOS version was triggering timeouts and slowness.
Maybe we'll have the same issue with the 14.7.5 we now want to use.
Reporter
Comment 6•7 months ago
Joel, is there a plan to at least quarantine those existing 14.7.1 workers? Or are there too many of them for us to do without them at the moment? If that's the case, what is a rough timeline for when no 14.7.1 workers will be around anymore?
Comment 7•7 months ago
Currently we are looking at EOD Friday (May 9th). This might get done sooner, but we are reserving 1-2 days for errors.
Reporter
Comment 8•7 months ago
Thank you, Joel! I think we can wait until then. Please keep us updated once the re-imaging of the affected workers is done.
Comment 9•7 months ago
As of last night (May 7th PDT), all of the macOS 14.70 workers were updated to the 14.7.5 patch level - we should now have the same version on all machines in the pool.
Reporter
Comment 10•7 months ago
This is good to hear, but I don't think that all machines have been taken care of. I can see a few that still have 14.7.1 installed:
- https://treeherder.mozilla.org/logviewer?job_id=507228558&repo=try (macmini-r8-36)
- https://treeherder.mozilla.org/logviewer?job_id=507228565&repo=try (macmini-r8-39)
- https://treeherder.mozilla.org/logviewer?job_id=507228555&repo=try (macmini-r8-46)
And I feel that there might be more.
Ryan, could you please check? Thanks.
Comment 11•7 months ago
They look good on my end...
[root@macmini-r8-39.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName: macOS
ProductVersion: 14.7.5
BuildVersion: 23H527
[root@macmini-r8-46.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName: macOS
ProductVersion: 14.7.5
BuildVersion: 23H527
[root@macmini-r8-36.test.releng.mdc1.mozilla.com ~]# sw_vers
ProductName: macOS
ProductVersion: 14.7.5
BuildVersion: 23H527
Comment 12•7 months ago
gecko-t-osx-1400-r8 report
Comment 13•7 months ago
Ran a report of all the mac14 workers in gecko-t-osx-1400-r8 and attached it as gecko-t-osx-1400-r8.csv.
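For reference, such a report can be generated with a small script along these lines (a sketch only: the host list and passwordless SSH access are assumptions; sw_vers -productVersion is the actual command run on the workers):

```python
# Rough sketch of generating a version report for the pool. Hosts and SSH
# access are assumptions; `sw_vers -productVersion` is a real macOS command.
import csv
import subprocess

hosts = [
    "macmini-r8-36.test.releng.mdc1.mozilla.com",
    "macmini-r8-39.test.releng.mdc1.mozilla.com",
    # ... rest of the gecko-t-osx-1400-r8 pool
]

with open("gecko-t-osx-1400-r8.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["host", "product_version"])
    for host in hosts:
        # Query the installed macOS version (including patch level) per host.
        version = subprocess.run(
            ["ssh", host, "sw_vers", "-productVersion"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        writer.writerow([host, version])
```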
Reporter
Comment 14•7 months ago
Thanks for checking that, Ryan. That is interesting, and I wonder if they were updated only recently (meaning today)? For macmini-r8-36, for example, the test job started today at 03:35:18 and finished at 04:01:45, and the log clearly states macOS 14.7.1 (23H222).
I'll trigger some more jobs on this particular try push to see if it works now.
Reporter
Comment 15•7 months ago
Interesting: the system version as reported now for macmini-r8-284 is "macOS 14.7 (23H124)". Is there a reason why we do not include the patch level? It would help to identify issues like the one covered by this bug in the future. Otherwise we won't be able to see that an update has actually happened.
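As a side note, Python itself reports the full patch level on these machines, so the information should be available to the worker if it wants to include it (a quick generic check, nothing specific to our tooling):

```python
# Quick check: platform.mac_ver() returns the full macOS version string,
# e.g. "14.7.5" on an up-to-date worker, including the patch level.
import platform

print(platform.mac_ver()[0])
```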
Reporter
Comment 16•7 months ago
Also there are still workers running on 14.7.1:
- https://treeherder.mozilla.org/logviewer?job_id=507367597&repo=try&lineNumber=19066 (macmini-r8-37)
- https://treeherder.mozilla.org/logviewer?job_id=507367684&repo=try&lineNumber=28 (macmini-r8-31)
Both are listed in the CSV file as having 14.7.5 installed, so there is clearly a mismatch.
Comment 17•7 months ago
I am not familiar with the system version; I only hack up os_version to help standardize our manifest conditions. I believe what you are seeing is coming from the Taskcluster worker, not from any in-tree test scripts. For that specific link there is a wptreport.json that shows the mozinfo variables, and it has 14.7.5.
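For example, the version that mozinfo recorded for a run can be pulled out of that artifact like this (a sketch; it assumes the usual top-level run_info object in wptreport.json):

```python
# Sketch: read the OS info that mozinfo recorded for a wpt run from the
# wptreport.json artifact. Assumes the usual top-level "run_info" object.
import json

with open("wptreport.json") as f:
    report = json.load(f)

run_info = report.get("run_info", {})
print(run_info.get("os"), run_info.get("os_version"))
```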
Comment 18•7 months ago
For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.
Reporter
Comment 19•7 months ago
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #18)
For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.
Pete, could you maybe have a look?
Otherwise, we are still seeing timeouts for the related WebDriver BiDi tests. Whether it's the same underlying issue would have to be checked. I assume we will now clearly stay on 14.7.5.
Comment 20•7 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #19)
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #18)
For those two, the wptreport.json shows 14.7.5; I suspect there is a bug in the Taskcluster worker.
Pete, could you maybe have a look?
I believe this comes from https://github.com/search?q=repo%3Amozilla-platform-ops%2Fronin_puppet%20system_version&type=code
Ryan, do you know where this is configured?
Comment 21•7 months ago
99% of the pool was on 14.7.5 on May 7th 6:07PM EST
100% of the pool was on 14.7.5 on May 8th 9:47AM EST
Comment 22•7 months ago
Oh, I see. /opt/worker/worker-runner-config.yaml and /opt/worker/generic-worker.conf.yaml aren't refreshing / picking up the updated version.
It looks like the staging pool (updated a day or two prior) did pick up the changes in those config files. Let's give it a couple of days to cook; I'll check on Monday and make sure the configs are up to date.
Reporter
Comment 23•7 months ago
I pushed a try build with the metadata for wdspec tests updated. A lot of them actually got marked by wptsync as intermittently timing out:
https://treeherder.mozilla.org/jobs?repo=try&revision=9c811e69762aa5d8a1af3fd42a6b2f281a8932d1
Comment 24•7 months ago
It looks like re-puppetizing the hosts speeds this up. I'll start that now and let you know when it's finished.
Comment 25•7 months ago
Puppetizing the hosts updates /opt/worker/worker-runner-config.yaml.
After the next reboot, /opt/worker/generic-worker.conf.yaml gets updated as well. I think we're good.
Reporter
Comment 26•7 months ago
I was able to update the wptrunner code to create a Gecko profile of the startup of an opt build of Firefox on a macOS 14.7.5 worker. As it shows, it takes nearly 4s until about:blank is loaded in the initial tab. This is because there is an about 1.7s wait during which only the renderer seems to show some activity:
https://share.firefox.dev/4maIqaW
This looks somewhat like a graphics issue to me. Markus, could you maybe take a look?
Comment 27•7 months ago
I'm very confused - that profile is from a Linux build? It seems it's compiling shaders and using swrast, which is not great though, and the main thread is blocked on that, which is quite terrible.
Comment 28•7 months ago
I agree, the provided profile is from Linux, not from macOS.
Reporter
Comment 29•7 months ago
Yes, sorry, I completely messed this up. Due to the long wait times for test jobs on macOS I triggered Linux builds first to check that a profile can be generated. Let me trigger a build for macOS now.
(In reply to Emilio Cobos Álvarez (:emilio) from comment #27)
It seems it's compiling shaders and using swrast, which is not great though, and the main thread is blocked on that, which is quite terrible.
Would you mind filing a new bug for that? You seem to have way more insight into this already. Otherwise I'm happy to do it as well.
Updated•7 months ago
Comment 30•7 months ago
Sure, filed bug 1965637
Comment hidden (Intermittent Failures Robot)
Comment 32•7 months ago
Reporter
Comment 33•7 months ago
As discussed with Julian, we may have to split this bug into two different ones, given that we see delays not only during startup but also for each single test that triggers a new web content process. As Julian mentioned to me, in the case of timeouts there is a ~1000ms delay before the pre-allocated web content process is ready. We agreed that he will do some more checks and file a separate bug.
For the profile provided above for macOS I noticed that the browser-startup-idle-tasks-finished notification only gets sent after 6s! Running tests locally I can see this notification coming in after 1.6s when running wpt tests, and after only 800ms when using mach run.
I wonder where these additional 800ms are coming from when running wpt tests.
Reporter
Comment 34•7 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #33)
I wonder where these additional 800ms are coming from when running wpt tests.
Ok, this is actually caused by using an existing profile. When I remove the profile and run mach run again, I can still see a 160ms delay just for _updateDatabase from the Add-on Manager. But that's still less than the 380ms for a wpt run.
Reporter
Comment 35•7 months ago
Both bug 1946247 and bug 1946248, which are about font-loading-related delays during startup, are causing issues here as well.
Markus, the profile referenced on bug 1946247 doesn't exist anymore. Is that the same as the gfxPlatformFontList::LoadFontInfo delay during the startup path? It can also be seen locally (macOS 15.2) during wpt runs, but not with mach run. Maybe you could also check what else causes the delay? See comment 32 for the uploaded profile for macOS. Thanks!
Reporter
Comment 36•7 months ago
The profile on Windows also shows about 1.8s spent during startup just for loading the font info. I assume this should be filed as its own widget bug?
Comment 37•7 months ago
Puppetizing the hosts updates /opt/worker/worker-runner-config.yaml.
After the next reboot, /opt/worker/generic-worker.conf.yaml gets updated as well. I think we're good.
Hi Ryan, I saw a needinfo from me here - just checking if you still need anything from me? Thanks.
Comment 38•7 months ago
Sorry about that, Pete, I think we are good on my end. Do you need anything else from me regarding this issue, :whimboo, :pmoore, :mstange?
Reporter
Comment 39•7 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #35)
Both bug 1946247 and bug 1946248, which are about font-loading-related delays during startup, are causing issues here as well.
Markus, the profile referenced on bug 1946247 doesn't exist anymore. Is that the same as the gfxPlatformFontList::LoadFontInfo delay during the startup path? It can also be seen locally (macOS 15.2) during wpt runs, but not with mach run. Maybe you could also check what else causes the delay? See comment 32 for the uploaded profile for macOS. Thanks!
Note that this question still stands for Markus.
Reporter
Comment 40•7 months ago
Oh, and as we also detected, most of our tests are failing not necessarily due to this slower startup, but because each content process takes an additional 1s before it is ready to be used. Haik will most likely file a new bug for that soon, given that this is related to our sandboxing feature.
Reporter
Updated•6 months ago