Open Bug 1927416 Opened 1 month ago Updated 2 hours ago

Slow startup times of 4 to 6s for Firefox for all test jobs on MacOS 10.15 workers (delayed "final-ui-startup" notification)

Categories

(Testing :: General, defect, P2)

Points:
5

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: whimboo, Assigned: whimboo, NeedInfo)

References

(Blocks 8 open bugs)

Details

(Whiteboard: [webdriver:m14])

Attachments

(1 file)

We are experiencing slower startup times of Firefox for at least the WebDriver-related test suites on macOS workers. Only headful runs are affected; the headless tests do not seem to show the issue. Furthermore, the problem does not only happen for the very first start of Firefox but also for any subsequent restarts / starts.

I did an initial investigation over on bug 1924507 comment 3 because the test /webdriver/tests/classic/new_session/unhandled_prompt_behavior.py initially showed most of the failures. The test durations vary between 129s and 163s. This is all still present today and can also be seen on older jobs pre-dating October.

Why we see such huge differences in execution time is not clear, but it looks like it can all be tracked down to startup delays for Firefox, so that even shippable builds take more than 4s for even a warm start. As Joel also mentioned over on that other bug, we run multiple jobs on macOS workers, which could mean that higher CPU utilization from other jobs leaves ours without enough cycles? It is certainly interesting that headless is not affected, so maybe some graphics-related issue triggers this behavior in shared environments?

Markus, do you have an idea? Maybe just some kind of logging that we could add to try builds, where we hopefully can catch this issue as well? Basically all the wpt tests that in general already have a longer runtime are affected and are now timing out.

Flags: needinfo?(mstange.moz)

I actually forgot to add an example. Here is one from a wdspec test:
https://treeherder.mozilla.org/logviewer?job_id=478095037&repo=autoland&lineNumber=117985-118040

[task 2024-10-14T00:22:32.161Z] 00:22:32     INFO - PID 2560 | 1728865352160	mozrunner::runner	INFO	Running command: MOZ_CRASHREPORTER="1" MOZ_CRASHREPORTER_NO_REPORT="1" MOZ_CRASHREPORTER_SHUTDOWN="1" "/opt/worker/tasks/task ... --marionette" "--profile" "/var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner" "-foreground" "-no-remote"
[task 2024-10-14T00:22:32.162Z] 00:22:32     INFO - PID 2560 | 1728865352161	geckodriver::marionette	DEBUG	Waiting 60s to connect to browser on 127.0.0.1
[task 2024-10-14T00:22:32.162Z] 00:22:32     INFO - PID 2560 | 1728865352161	geckodriver::browser	TRACE	Failed to open /var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner/MarionetteActivePort
[task 2024-10-14T00:22:32.163Z] 00:22:32     INFO - PID 2560 | 1728865352161	geckodriver::marionette	TRACE	Retrying in 100ms
[..]
[task 2024-10-14T00:22:35.436Z] 00:22:35     INFO - PID 2560 | 1728865355435	geckodriver::browser	TRACE	Failed to open /var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner/MarionetteActivePort
[task 2024-10-14T00:22:35.437Z] 00:22:35     INFO - PID 2560 | 1728865355435	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-14T00:22:35.479Z] 00:22:35     INFO - PID 2560 | 1728865355478	Marionette	TRACE	Received observer notification final-ui-startup
[task 2024-10-14T00:22:35.512Z] 00:22:35     INFO - PID 2560 | 1728865355511	Marionette	INFO	Listening on port 50822
[task 2024-10-14T00:22:35.513Z] 00:22:35     INFO - PID 2560 | 1728865355511	Marionette	DEBUG	Marionette is listening
[task 2024-10-14T00:22:35.665Z] 00:22:35     INFO - PID 2560 | Read port: 50822
[task 2024-10-14T00:22:36.040Z] 00:22:36     INFO - PID 2560 | 1728865356039	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:50823
[task 2024-10-14T00:22:36.109Z] 00:22:36     INFO - PID 2560 | 1728865356108	geckodriver::marionette	DEBUG	Connection to Marionette established on 127.0.0.1:50822.
Blocks: 1924507
Blocks: 1926427
Blocks: 1769749
Blocks: 1877539

One clarification: we don't run tasks in parallel, we just share the pool of machines for all types of tasks, so there isn't one task sharing resources with another task, etc.

Oh ok, that wasn't clear. Thank you for the clarification! That makes this very interesting.

Joel, is there a way to have a one-click loaner for such a machine? It might be good to get the tests running and to create a Gecko profile. That might be more helpful than any other logging.

Flags: needinfo?(jmaher)

macosx is hardware; you need to VPN into the datacenter and remote onto the machine. Adding rcurran as he does most of the macosx work.

Ryan, can you answer :whimboo's question about what it would take to get a macosx 10.15 loaner?

Flags: needinfo?(jmaher) → needinfo?(rcurran)

pick a worker. we'll quarantine it and provide you ssh/vnc access as needed. thanks

Flags: needinfo?(rcurran)

I have taken a look at existing failures, but no specific worker stood out as producing more failures than usual. So a random one should probably be fine. Thanks!

Assignee: nobody → hskupin
Severity: -- → S3
Status: NEW → ASSIGNED
Priority: -- → P3
Whiteboard: [webdriver:m13]

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #0)

Markus, do you have an idea? Maybe just some kind of logging that we could add to try builds, where we hopefully can catch this issue as well? Basically all the wpt tests that in general already have a longer runtime are affected and are now timing out.

One thing you could try is to set MOZ_PROFILER_STARTUP=1 and MOZ_PROFILER_SHUTDOWN=/tmp/startup_profile.json and see if that shows the slow startup. But if the slowness happens before our code starts running, for example because macOS does some signature check, then this won't show it.
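
For reference, a minimal sketch of how those variables could be set when launching Firefox manually on a loaner machine (the application path and output location are placeholders, not the exact CI setup):

# Hypothetical manual run on a loaner; paths are placeholders.
# MOZ_PROFILER_STARTUP starts the Gecko profiler as early as possible,
# MOZ_PROFILER_SHUTDOWN writes the captured profile when Firefox exits.
export MOZ_PROFILER_STARTUP=1
export MOZ_PROFILER_SHUTDOWN=/tmp/startup_profile.json
"/Applications/Firefox Nightly.app/Contents/MacOS/firefox" \
    -foreground -no-remote -profile "$(mktemp -d)"
# Afterwards, load /tmp/startup_profile.json into https://profiler.firefox.com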

Flags: needinfo?(mstange.moz)

I tried to reproduce this on the Mac Mini loaner but I'm not able to see slow startup times. It all looks normal and we see a very quick startup of Firefox:

1730884412414	Marionette	DEBUG	Closed connection 1
Application command: Firefox-opt.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/1k/73174z2506d5hbnftcjr17rm0000gp/T/tmpkfslmxsr.mozrunner
1730884412713	Marionette	INFO	Marionette enabled
1730884412863	Marionette	TRACE	Received observer notification final-ui-startup

Given that I can replicate it with pure Marionette tests as well, I might just push a job that runs a single test and uses the Gecko profiler settings mentioned above. Mochitests are affected as well, and surprisingly Talos too:

[task 2024-11-05T00:38:06.086Z] 00:38:06     INFO -  TEST-INFO | started process 728 (/opt/worker/tasks/task_173076700310462/build/application/Firefox Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/9m/8nfcw1_j2gn7r45bdvc36wrm000014/T/tmpnr09v840/profile)
[task 2024-11-05T00:38:10.514Z] 00:38:10     INFO -  PID 728 | [ERROR error_support::handling] logins-unexpected: Error parsing URL: relative URL with a cannot-be-a-base base

Don't we have Talos startup performance tests that should have caught such a long delay?

I just triggered a random Talos job with the profiler enabled. Let's see if that already gives a hint so that I don't have to prepare a custom Marionette job for it.

https://treeherder.mozilla.org/jobs?repo=mozilla-central&searchStr=macos%2Copt%2Ctalos&revision=33aa55b02fe2215f1ad827880de20c775186430f&selectedTaskRun=QPkIuBWRRVeZZJptMNSQvw.0

Here some timings of Mn jobs for comparison:

Sep 4th: 0.9s

[task 2024-09-04T07:13:14.877Z] 07:13:14     INFO - Application command: /opt/worker/tasks/task_172543263043635/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/hh/22h03g290_3fzy6fg6z99z_c000014/T/tmpjuz71b44.mozrunner
[task 2024-09-04T07:13:15.135Z] 07:13:15     INFO -  1725433995135	Marionette	INFO	Marionette enabled
[task 2024-09-04T07:13:15.748Z] 07:13:15     INFO -  1725433995748	Marionette	TRACE	Received observer notification final-ui-startup

Oct 1st: 2.3s

[task 2024-10-01T01:34:29.918Z] 01:34:29     INFO - Application command: /opt/worker/tasks/task_172774535893276/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/9r/cc81p_zj7_bf6051yy4xhjt0000014/T/tmpaztnj34j.mozrunner
[task 2024-10-01T01:34:30.139Z] 01:34:30     INFO -  1727746470139	Marionette	INFO	Marionette enabled
[task 2024-10-01T01:34:32.239Z] 01:34:32     INFO -  1727746472239	Marionette	TRACE	Received observer notification final-ui-startup

Nov 6th: 4.6s

[task 2024-11-06T07:56:59.670Z] 07:56:59     INFO - Application command: /opt/worker/tasks/task_173087905802411/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/0y/nndy_9kd66lg9_qw2_gl47hw000014/T/tmpc1sc6pgy.mozrunner
[task 2024-11-06T07:56:59.914Z] 07:56:59     INFO -  1730879819914	Marionette	INFO	Marionette enabled
[task 2024-11-06T07:57:04.287Z] 07:57:04     INFO -  1730879824286	Marionette	TRACE	Received observer notification final-ui-startup

Not sure if this graph from Perfherder shows the dramatic increase in startup time as well, but if it does we should really investigate and fix this performance regression.

It looks like the T(o) jobs actually contain this measurement. I've triggered a new job with the Gecko profiler enabled. I hope that this time we will get a profile that contains the startup phase:

https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&revision=b26d51f8c9177d0b4274407c337393018f1a7e33&searchStr=Talos%2Cother%2Cmac%2Cprofiling

Here is the actual recorded Gecko profile which shows the delay of about 4s: https://share.firefox.dev/40Agnct

Flags: needinfo?(mstange.moz)

Both bug 1928656 and bug 1928653 show regressions for Talos tests as well on MacOS.

Here is a try build for a T(o) job with profiling enabled from a checkout of Sep 4th. Let's see if the numbers are still bad or are back to around 1s.

https://treeherder.mozilla.org/jobs?repo=try&revision=d26f6a27e88f343b746ec1ef98e27b602242401d

Blocks: 1928656, 1928653

Let's move this bug to Testing :: General given that it affects certain test suites. Individual fixes for Firefox or the worker should be filed as dependencies.

Component: Agent → General
Product: Remote Protocol → Testing

We're stuck in native mac code doing menubar shenanigans. Something that matches the timing, code, and such would be bug 1855346, though it went through various back outs? Not sure if that shows up in the graph.

See Also: → 1855346

Ryan, would you know what's changed on the macOS 10.15 workers since around October 1st?

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #15)

We also re-triggered the T(o) job for a build from Aug 26th. The results will be shown in this graph on Perfherder.

The graph here suggests that it's an issue related to the workers/environment and not a regression in the browser.

Flags: needinfo?(rcurran)

The profile shows that we're blocked waiting for something that's related to the GPU or the window server. I'd blame some ambient system state; maybe there are too many windows open or something is running in the background hogging the GPU, or maybe something is weird with the monitor.

Flags: needinfo?(mstange.moz)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #15)

We also re-triggered the T(o) job for a build from Aug 26th. The results will be shown in this graph on Perfherder.

Those results show the same bad numbers as we currently have on mozilla-central. So, as Greg mentioned, it's most likely related to the environment of the macOS workers.

One thing :kash noticed is that some of the perf machines have a 24Hz refresh rate and not the target 60Hz (bug 1917594). No idea if that would contribute to this, but I thought I would mention it. Once we figure out the root cause, we will work on a pre-flight setup step to ensure we have 60Hz.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #16)

We're stuck in native mac code doing menubar shenanigans. Something that matches the timing, code, and such would be bug 1855346, though it went through various back outs? Not sure if that shows up in the graph.

I think that we can rule this out given that a build from August shows the same symptoms these days.

(In reply to Markus Stange [:mstange] from comment #18)

The profile shows that we're blocked waiting for something that's related to the GPU or the window server. I'd blame some ambient system state; maybe there are too many windows open or something is running in the background hogging the GPU, or maybe something is weird with the monitor.

Ok, that is interesting! I was thinking about that a bit more and have a hypothesis. Could this actually be related to the new macOS session resume support that landed for Firefox via bug 1827651? It had several landings and backouts, and since mid-September it has been stable.

Note that after each job the worker gets restarted! So if Firefox is not closed correctly at the end of a job, the system-triggered restart will most likely activate the session resume code and start Firefox again. This would also explain why older builds are affected (the worker in use may already have a couple of Firefox processes running) and why we see such noise in the Talos data.

Regarding the worker environment, none of the recently landed changes seem to be related or able to cause this.

Stephen, could you please check my last comment? Is that reasonable?

Flags: needinfo?(rcurran) → needinfo?(spohl.mozilla.bugs)

Maybe the worker should kill all processes initiated by the cltbld user before the restart. Given that jobs that run into a task timeout won't be able to clean up the machine correctly, I see that as the only place to do it.
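
A minimal sketch of what such a cleanup step could look like (hypothetical; the actual hook, user name, and process selection would have to be confirmed by the worker owners):

# Hypothetical pre-restart cleanup, run as root before the reboot:
# kill leftover Firefox processes started by the cltbld test user so that
# a timed-out task cannot leave instances running into the next job.
sudo pkill -U cltbld -f firefox || true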

See Also: → 1929101

With this job we have a startup time of 6s.

Ryan, could you please check how many Firefox processes are running on the macmini-r8-152 worker? Thanks!

Flags: needinfo?(rcurran)

as a note any given mac mini in the general pool will be reused almost immediately, the machine is restarted between tasks

[relops@macmini-r8-152.test.releng.mdc1.mozilla.com ~]$ sudo ps aux | grep '[f]irefox' | wc -l
       1

[relops@macmini-r8-152.test.releng.mdc1.mozilla.com ~]$ sudo ps aux | grep '[f]irefox'
cltbld           1410   0.0  0.3  4393200  54916   ??  S     2:40PM   0:02.53 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -u /opt/worker/tasks/task_173090282395218/build/tests/mochitest/runtests.py browser/base/content/test/fullscreen/browser.toml browser/components/aboutlogins/tests/browser/browser.toml browser/components/reportbrokensite/test/browser/browser.toml browser/components/tests/browser/browser.toml dom/base/test/browser.toml dom/serviceworkers/test/browser.toml toolkit/components/pdfjs/test/browser.toml --setpref=layout.css.stylo-threads=4 --setpref=network.process.enabled=true --setpref=network.http.network_access_on_socket_process.enabled=true --setpref=layers.d3d11.enable-blacklist=false --appname=/opt/worker/tasks/task_173090282395218/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --utility-path=tests/bin --extra-profile-file=tests/bin/plugins --symbols-path=/opt/worker/tasks/task_173090282395218/build/symbols --certificate-path=tests/certs --quiet --log-errorsummary=/opt/worker/tasks/task_173090282395218/build/blobber_upload_dir/mochitest-browser-chrome_errorsummary.log --screenshot-on-fail --cleanup-crashes --marionette-startup-timeout=180 --sandbox-read-whitelist=/opt/worker/tasks/task_173090282395218/build --log-raw=- --flavor=browser --chunk-by-runtime
cltbld           2050   0.0  0.0        0      0   ??  Z     2:56PM   0:00.00 (firefox)
Flags: needinfo?(rcurran)

By request

[root@macmini-r8-152.test.releng.mdc1.mozilla.com ~]# sudo ps aux | grep '[f]irefox' | wc -l
       2
[root@macmini-r8-152.test.releng.mdc1.mozilla.com ~]# sudo ps aux | grep '[f]irefox'
cltbld           1243   0.6  0.4  6963668  70204   ??  Ss    3:27PM   0:01.22 /opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/MacOS/firefox -unittest --gtest_death_test_style=threadsafe
cltbld           1239   0.0  0.2  4365924  27968   ??  S     3:27PM   0:00.18 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -u /opt/worker/tasks/task_173090606349572/build/tests/gtest/rungtests.py --xre-path=/opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/Resources --cwd=/opt/worker/tasks/task_173090606349572/build/tests/gtest --symbols-path=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Jd7cBQ9lT7-_aytl8fgrlA/artifacts/public/build/target.crashreporter-symbols.zip --utility-path=tests/bin /opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/MacOS/firefox

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #26)

as a note any given mac mini in the general pool will be reused almost immediately, the machine is restarted between tasks

Please see the description of the new session resume feature, which is an OS-level thing. But I do not know how macOS decides which profile to run, given that we delete it on restart. Maybe there is a reference in profiles.ini? Maybe Stephen can give more details.

I just checked several Mochitest jobs run on macOS 11.2.3 (aarch64) and they all do NOT show this startup delay:

[task 2024-11-05T23:51:15.618Z] 23:51:15     INFO - Application command: /opt/worker/tasks/task_173085033658563/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/cr/5_z7mtts3jb6l848j8glx824000014/T/tmp4rxvfiry.mozrunner
[task 2024-11-05T23:51:15.620Z] 23:51:15     INFO - runtests.py | Application pid: 2097
[task 2024-11-05T23:51:15.620Z] 23:51:15     INFO - TEST-INFO | started process GECKO(2097)
[task 2024-11-05T23:51:15.803Z] 23:51:15     INFO - GECKO(2097) | 1730850675802	Marionette	INFO	Marionette enabled
[task 2024-11-05T23:51:15.899Z] 23:51:15     INFO - GECKO(2097) | 1730850675899	Marionette	TRACE	Received observer notification final-ui-startup

The 11.2.3 machines are hosted at MacStadium, and we haven't seen refresh rate issues with our macOS 14.x machines hosted at MacStadium either.

The Macs in MacStadium have this HDMI dongle attached: https://fit-iot.com/web/product/fit-headless-4k/
I'm not sure what we install in MDC, but it may well (probably?) be different. I can find out.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #23)

Stephen, could you please check my last comment? Is that reasonable?

I do not believe that this has to do with the new macOS session resume support. The main reason is because this support isn't technically "new". We have simply expanded it. Before, only Firefox instances that were never restarted due to an update were opted into session resume. Bug 1827651 expanded this support to also include Firefox instances that were restarted after an update. Unless there is some kind of interaction with update tests, this shouldn't have made a difference.

Flags: needinfo?(spohl.mozilla.bugs)

(In reply to Stephen A Pohl [:spohl] from comment #33)

I do not believe that this has to do with the new macOS session resume support.

Thanks for your reply. Then let's rule this out, especially given that only a single instance of Firefox runs on a restarted worker.

Ryan, while you are trying to figure out the connected dongles in MDC, I wonder if the machine that I got from the staging pool has the same dongle attached as the production machines. If not, I would appreciate it if I could get a different machine. Thanks.

Flags: needinfo?(rcurran)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #34)

Ryan, while you are trying to figure out the connected dongles in MDC, I wonder if the machine that I got from the staging pool has the same dongle attached as the production machines. If not, I would appreciate it if I could get a different machine. Thanks.

Hello Henrik,

I just checked the staging host macmini-r8-125.test.releng.mdc1.mozilla.com and it has the same dongle as the production r8s. 28E850

Flags: needinfo?(rcurran)

(In reply to Ryan Curran from comment #35)

I just checked the staging host macmini-r8-125.test.releng.mdc1.mozilla.com and it has the same dongle as the production r8s. 28E850

Thanks for the confirmation. Would it still be possible to hand a machine over to me at a time when we see such a delay? Once it is disconnected from Taskcluster it should not restart automatically after the 30 minutes of inactivity, which should keep all the files and processes around.

Flags: needinfo?(rcurran)

Sure just let me know

Flags: needinfo?(rcurran)

Quick status update. Yesterday we tried to disconnect a worker from the production set while it was running a wdspec test that clearly showed the longer startup times. Due to the extra work necessary to make it available for me to log in, and timezone differences, there was a delay, so I was only able to log in once the actual task had finished running on the worker. That also means that any files related to the test job had already been removed from the file system under /opt/worker/. Joel, do you know if there is a way to prevent that?

Also interesting is that a disconnected worker can still upload the final result to Taskcluster / Treeherder, as happened for this job. I would have expected it to be marked as an error or something similar. Not sure if that is a bug we should fix or not.

We will try again today with better synchronization between Ryan and myself so that I can log in while the tests are running. But until then it would still be good to know how to prevent the job from deleting all the files.

Flags: needinfo?(jmaher)
Blocks: 1785895

When the worker finishes a task and reboots, this worker-runner script executes before generic-worker starts running, and it looks like it clears /opt/worker/tasks/task_*

To test it would probably be easy to just comment that line out of the script. The script is at /usr/local/bin/worker-runner.sh

(In reply to Ryan Curran from comment #39)

To test it would probably be easy to just comment that line out of the script. The script is at /usr/local/bin/worker-runner.sh

Is that still something that can be done while the job is running and you are disconnecting the machine? If so, maybe you could make the whole script a no-op when preparing the machine for hand-over to me?

it is unrealistic to get a live machine and then quickly quarantine it and preserve it before it goes onto other jobs.

could you not take a machine or two that have historically been prone to the problem, put them in your own pool and run tasks on them, or remote desktop to them and run the tasks manually?

Flags: needinfo?(jmaher)

Well, it's not that unrealistic. It's just that we have to be quick before the job finishes. So I re-triggered a wd job, which caused this task to be created on the macmini-r8-170 worker.

I now have access via VNC and was able to see the delay both visually and in the logs created by the running wd job. There is no other Firefox running on the system, and no CPU spikes are visible in top or Activity Monitor. Now that the job is done I'll try to re-run the tests to see if it is still reproducible. If so, I will try to install Xcode so I can create a system profile with Instruments.

if you are able to schedule a task on a specific machine, then you have a private pool :)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #43)

if you are able to schedule a task on a specific machine, then you have a private pool :)

I'll most likely try that. Is there some documentation on how to get this done? I assume Ryan can place the machine in such a pool, but what do I have to change for e.g. Marionette tests to use that specific pool?

Flags: needinfo?(jmaher)

Ryan can make a worker pool for you. it might end up being macosx-<...>-whimboo or some different worker pool name. If it just has a suffix (e.g. -whimboo), then you can do ./mach try fuzzy -q 'test-macosx marionette ...' --worker-suffix=-whimboo

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #45)

Ryan can make a worker pool for you. it might end up being macosx-<...>-whimboo or some different worker pool name. If it just has a suffix (e.g. -whimboo), then you can do ./mach try fuzzy -q 'test-macosx marionette ...' --worker-suffix=-whimboo

Ryan, could you please help me in setting this up for the machine that I already have access to since yesterday evening? Thanks!

Flags: needinfo?(rcurran)

Oh, and I would also need Xcode to be installed to be able to use Instruments. Sadly I cannot use the App Store for it because it only offers the latest version, which requires at least macOS 14.5. And because I don't have a developer account I cannot download it directly. Could you maybe help with that as well?

Installed Xcode 12.4 and cut a new PR for the pool

TC key rotated and stored in 1Password for new pool
TC key and new pool ID added to Vault
PR draft cut for ronin_puppet with change
Machine has been re-puppetized with the new pool configuration

gecko-t-osx-bug1927416 should be functional now

Usually running something like this:
./mach try fuzzy --worker-override t-osx-1015-r8=releng-hardware/gecko-t-osx-1015-r8-staging

Tends to work for me. In your case it might be something like:
./mach try fuzzy --worker-override t-osx-1015-r8=releng-hardware/gecko-t-osx-bug1927416

:whimboo just a reminder that this machine is back in its production configuration. thank you

Flags: needinfo?(rcurran)
Attached file Firefox.trace.zip

Markus, I have created a time profile with Instruments and I'm sure that I caught this delay. See the attached trace log. Could it be that this is related to Metal?

Flags: needinfo?(mstange.moz)

I've taken another look at this trace and realized that I should have also created a Metal System Trace. After doing so, I noticed that a significant amount of time (similar to the time profile) is spent in MTLCompilerFSCache:GetElement, suggesting that we’re frequently trying to load already-compiled Metal shaders from the Metal Compiler File System Cache.

After some investigation, I found that this cache is located under /private/var/folders/. Checking the disk usage of individual folders, I noticed that Mozilla-related folders are occupying a large amount of space:

950MB	/private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightlydebug/com.apple.metal
758MB	/private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal/Intel(R) UHD Graphics 630
...
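
(For illustration, a listing like the above can be produced roughly as follows; the hashed per-user directory under /private/var/folders/ is machine-specific, so the glob is only a sketch:)

# Sketch: report the size of the Metal shader caches per Mozilla bundle id.
sudo du -sh /private/var/folders/*/*/C/org.mozilla.*/com.apple.metal 2>/dev/null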

Clearing the /private/var/folders/tv/ folder resolved the issue, as shown in this try build:

[task 2024-11-15T08:22:51.764Z] 08:22:51 INFO - Application command: /opt/worker/tasks/task_173165853969153/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -remote-debugging-port -foreground -profile /var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/T/tmp3meaf4ll.mozrunner
[task 2024-11-15T08:22:51.944Z] 08:22:51 INFO - RemoteAgent DEBUG WebDriver BiDi enabled
[task 2024-11-15T08:22:51.944Z] 08:22:51 INFO - Marionette INFO Marionette enabled
[task 2024-11-15T08:22:52.180Z] 08:22:52 INFO - Marionette TRACE Received observer notification final-ui-startup

This eliminated the previous 4-6 second delay before final-ui-startup, allowing it to be reached almost instantly. As a result, the Mn(unit) job runtime dropped from 21 minutes to just 9 minutes. After running this new try job the tv folder contains the following Mozilla-specific items:

 68K	/private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal/Intel(R) UHD Graphics 630
136K	/private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal

This suggests that some recent Metal-related features in Firefox (introduced around June) are causing data to leak into the Metal Compiler File System Cache. While this may not significantly impact most Mac users who don’t restart Firefox frequently, it could still introduce minor delays.

Also, caches like the MTLCompilerFSCache, which are meant to improve performance, may persist across reboots to avoid reinitialization delays. This indicates a need for a solution to prevent Metal data from accumulating excessively over time.

Joel, I wonder if we should re-puppetize the macOS 10.15 workers (to clean up those folders), given that we see an exponential increase in startup time, reaching close to 6s for a single startup of Firefox.

Flags: needinfo?(jmaher)

To reflect that this is not only affecting the WebDriver tests but all tests that have to launch Firefox, I'm going to update the bug's summary. We should file separate bugs (like for graphics and infra work) that should block this bug.

Summary: Slow startup times of Firefox for webdriver tests on MacOS (delayed "final-ui-startup" notification) → Slow startup times of 4 to 6s for Firefox for all test jobs on MacOS 10.15 workers (delayed "final-ui-startup" notification)

That's a great find! I haven't had a chance to understand what this means yet. I wonder if we only add to the cache during webgpu tests or if it is something that happens in regular Firefox usage too (where webgpu is still preffed off by default).

When I check about:config in a Nightly build I can see that dom.webgpu.enabled is set to true by default, and that is reflected in the static prefs list. So all the tests we run for Nightly builds (central, autoland, ...) actually contribute to growing that cache. The caches for DevEdition and release builds of Firefox were much smaller, so maybe only specific WebGPU tests are causing the cache to grow.

Here is a link to a spreadsheet listing all the files with their full path under /private/var/folders/.

As requested by Markus, moving the needinfo to Erich, who knows more about the development of WebGPU.

Flags: needinfo?(mstange.moz) → needinfo?(egubler)

Do you want us to try a find /private/var/folders -type d -name "org.mozilla.nightlydebug" -exec rm -rf {} + 2>/dev/null before generic-worker starts after a fresh reboot?

As discussed with Ryan, we probably want to remove all entries related to org.mozilla.* to ensure that we get rid of all the cache entries. I would hand over the decision on this approach to someone who knows more about the Mac workers, but it sounds like a good compromise for now.

Points: --- → 5

just to confirm:

we want to try adding:

find /private/var/folders/ -path '*/C/org.mozilla.*' -type d -prune -exec rm -rf {} +

in our worker-runner.sh script, which will execute before the worker starts, correct?

PR

If landed, will deploy when I return from work week next week (return 11/25)

Flags: needinfo?(jmaher)

My understanding is that the files under $(getconf DARWIN_USER_CACHE_DIR)/com.apple.metal are created implicitly by the Metal Shading Language compiler, so any use of Metal shaders could result in files being created there. Certainly running the WPT WebGPU CTS will create many, many shaders.

It would be cool if someone with a mac could visit https://webgpu.github.io/webgpu-samples/ in Nightly, run some demos, and see if a bunch of junk gets created in that directory.
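
A rough way to check that by hand could look like the following (sketch only; the bundle-id subdirectory depends on the channel, org.mozilla.nightly is assumed here):

# Measure the Metal shader cache for Nightly before and after running demos.
# DARWIN_USER_CACHE_DIR resolves to the per-user dir under /private/var/folders/.
CACHE="$(getconf DARWIN_USER_CACHE_DIR)/org.mozilla.nightly/com.apple.metal"
du -sh "$CACHE" 2>/dev/null                 # before
# ...run a few https://webgpu.github.io/webgpu-samples/ demos in Nightly...
du -sh "$CACHE" 2>/dev/null                 # after
find "$CACHE" -type f 2>/dev/null | wc -l   # number of cache files created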

This isn't something that our WebGPU implementation explicitly requests; it seems to be the default behavior of the Metal Shading Language compiler. It's too bad that Apple designed their cache to take so much CPU, even before one needs its contents.

Flags: needinfo?(egubler)

thanks :jimb

I logged into a 10.15 staging r8 node to test this. How long would you recommend letting each test run before doing a diff on the directory size? Thanks.

Every time any part of Firefox wants to run its own code on the GPU, it has to compile a shader. Running the WPT WebGPU CTS even just for a little bit should create plenty of files.

But if I'm understanding the bug correctly, all Firefox startup times are affected - Firefox doesn't need to run the WebGPU CTS for its startup to get slowed down, right?

Oho, this looks interesting:

In https://crbug.com/angleproject/5354 a workaround was added for ANGLE's Metal backend tests which eliminated intermittent several-second hangs during startup - prevent Metal from accessing the per-user shader cache. It seems that the application was hanging on access to this directory.

As pointed out by rsesek@ in http://crbug.com/974219#c91 (restricted view, sorry) this dyld interposition technique won't work in Chrome, but Chrome's sandbox can be used to deny Metal's runtime access to this directory. We should try this as a potential workaround for the Metal shader compilation issues seen in Chrome.

comment:

Metal on Mac has some issues with shader caching inside "/private/var/.../com.apple.metal" folder. After some point, the cache becomes huge and the compilation of Metal shaders would take a very long time to finish. That would be the reason of intermittent timeouts here.
It happened a lot on my local machines when I ran the dEQP tests (sometimes it took 10' to finish loading the first shader). When using debugger to pause the app, it pointed to the internal library function where the Metal shader cache lookup happens.
I usually deleted the cache folder above after some point and the shader compilation became fast again.

Metal is the only officially supported graphics API on macOS these days, although OpenGL is still there and works fine. But this makes me suspect that probably anything that does graphics at all must, at some point, end up trying to run Metal shaders, even if they're just implementation details of some library or translated from some other language. But any use of Metal would probably be enough to cause an attempt to access the cache.

Whiteboard: [webdriver:m13] → [webdriver:m14]

Jim, would it maybe be helpful to get a listing of the entries in this cache, to help identify the cause of its massive size? We most likely have to do that before the above script, which will remove the cache on each restart, gets distributed to the machines.

Flags: needinfo?(jimb)

Sure, I'd love to see a listing. If the files are textual, and you could provide the contents of two or three selected at random, that might serve as a smoking gun.

But more broadly, there are two issues here:

  1. We have slow Firefox startup times in our CI. This can be fixed with a script.

  2. I imagine that, over time, less unusual cases than the WPT WebGPU CTS could also cause an accumulation of files in the Metal Shading Language compiler cache, leaving people with slow Firefox start times. This is an actual product problem, not just a CI infra issue.

Flags: needinfo?(jimb)
Priority: P3 → P2

(In reply to Jim Blandy :jimb from comment #69)

Sure, I'd love to see a listing. If the files are textual, and you could provide the contents of two or three selected at random, that might serve as a smoking gun.

Ryan, is that something you could help with when you are back from the work week? We definitely have to do that before the clean-up changes are pushed to production; if we miss that window, they would remove all the data.

  1. We have slow Firefox startup times in our CI. This can be fixed with a script.

Yes, that is correct. Hopefully we can fix it by next week when Ryan is back. I would also like to close this specific bug once the changes have been shipped to production.

  2. I imagine that, over time, less unusual cases than the WPT WebGPU CTS could also cause an accumulation of files in the Metal Shading Language compiler cache, leaving people with slow Firefox start times. This is an actual product problem, not just a CI infra issue.

Indeed, and that is something I already asked about in comment 53. Should I file a graphics-specific bug already, or should we wait until the data you requested is available?

Flags: needinfo?(rcurran)
Flags: needinfo?(jimb)

Also note that there is an ongoing process to move test jobs from macOS 10.15 to macOS 14.70 Intel (see bug 1922355). Given that all those machines are newly installed via Puppet, this specific slowdown will no longer be visible there either.

I have a theory that the shader cache is using a key based on the path of the executable, so the fact that CI runs use builds installed at effectively random paths leads to a massive buildup of redundant cache entries, and the OS may be preloading the entire cache index at startup so that lookups have low latency (otherwise searching for each shader used in a game would stall as it reads in a bit more of the index on the first several requests - games tend to have a lot of distinct shaders).

I'd further suppose that it is probably a SQLite database as that is popular these days, but immaterial to the theory.

The theory can be tested by installing Firefox in multiple locations, deleting the shader cache, and then running each copy and checking the size of the shader cache between runs.
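
A rough sketch of that experiment (hypothetical paths; assumes a local copy of a Nightly .app and the org.mozilla.nightly cache location):

# Run the same build from several install paths and check whether the Metal
# cache grows per path. All paths below are placeholders.
CACHE="$(getconf DARWIN_USER_CACHE_DIR)/org.mozilla.nightly/com.apple.metal"
rm -rf "$CACHE"
for dir in /tmp/fx1 /tmp/fx2 /tmp/fx3; do
    mkdir -p "$dir"
    cp -R "/Applications/Firefox Nightly.app" "$dir/"
    "$dir/Firefox Nightly.app/Contents/MacOS/firefox" -no-remote \
        -profile "$(mktemp -d)" &
    FX_PID=$!
    sleep 30                          # give it time to start up and compile shaders
    kill "$FX_PID"; wait "$FX_PID" 2>/dev/null
    du -sh "$CACHE" 2>/dev/null       # did the cache grow for this copy?
done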

Here is a copy of the com.apple.metal directory from an affected machine: https://drive.google.com/file/d/1wHiv6UgYtUDhsQ2GzA6M0KWj0kn5ImqG/view?usp=sharing

Flags: needinfo?(rcurran)

(In reply to Ryan Curran from comment #74)

Here is a copy of the com.apple.metal directory from an affected machine: https://drive.google.com/file/d/1wHiv6UgYtUDhsQ2GzA6M0KWj0kn5ImqG/view?usp=sharing

Those files are not SQLite-based databases but a custom format that is not documented by Apple. I had a quick look into them but was not able to find any specific evidence of what's going wrong. Jim, maybe you can have a look at it?

I also tried the following variants with https://webgpu.github.io/webgpu-samples (note that a lot of the examples fail due to a shader creation error) but I was not able to force the cache to grow to high numbers. The size stayed at around 2.5MB for all the related files.

  1. Variations of the installation path of Firefox
  2. Different versions of the same kind of Firefox build (only increased the size once for the 2nd build)
  3. Using different Firefox profile locations

So I have no idea what might cause the cache to grow over time. But maybe there is an issue with this older macOS version that triggers such a leak in the cache? In that case we should observe the new macOS 14.70 workers and regularly check their startup performance in Perfherder. I don't think we should invest the time right now to do the above checks on such an old system.

Note that I had a look at Perfherder for the macOS 14.70 workers and the values they report for startup times. They are currently stable at around 1500ms, but that is still 50% higher than the values we had in June this year on macOS 10.15 (see comment 10).
