Slow startup times of 4 to 6s for Firefox for all test jobs on MacOS 10.15 workers (delayed "final-ui-startup" notification)
Categories
(Testing :: General, defect, P2)
Tracking
(Not tracked)
People
(Reporter: whimboo, Assigned: whimboo, NeedInfo)
References
(Blocks 8 open bugs)
Details
(Whiteboard: [webdriver:m14])
Attachments
(1 file)
7.61 MB, application/zip
We are experiencing slower startup times of Firefox for at least the WebDriver-related test suites on macOS workers. Only headful mode is affected; the headless tests do not seem to show the issue. Furthermore, the problem does not only occur for the very first start of Firefox but also for any subsequent restarts.
I did an initial investigation over on bug 1924507 comment 3 because the test /webdriver/tests/classic/new_session/unhandled_prompt_behavior.py initially showed most of the failures. The test durations vary between 129s and 163s. This is all present today and can also be seen on older jobs pre-dating October.
Why we have such huge differences in execution time is not clear, but it looks like it can all be tracked down to startup delays of Firefox, so that even shippable builds take more than 4s for a warm start. As Joel mentioned over on that other bug, we run multiple jobs on macOS workers, which could mean that our job isn't getting enough cycles due to higher CPU utilization from other jobs. It is certainly interesting that headless is not affected, so maybe some graphics-related issue could trigger this behavior in shared environments?
Markus, do you have an idea? Maybe there is some kind of logging that we could do on try builds, where we can hopefully catch this issue as well? Basically all the wpt tests that already have a longer runtime in general are affected and are now timing out.
Assignee
Comment 1•1 month ago
I actually forgot to add an example. Here is one from a wdspec test:
https://treeherder.mozilla.org/logviewer?job_id=478095037&repo=autoland&lineNumber=117985-118040
[task 2024-10-14T00:22:32.161Z] 00:22:32 INFO - PID 2560 | 1728865352160 mozrunner::runner INFO Running command: MOZ_CRASHREPORTER="1" MOZ_CRASHREPORTER_NO_REPORT="1" MOZ_CRASHREPORTER_SHUTDOWN="1" "/opt/worker/tasks/task ... --marionette" "--profile" "/var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner" "-foreground" "-no-remote"
[task 2024-10-14T00:22:32.162Z] 00:22:32 INFO - PID 2560 | 1728865352161 geckodriver::marionette DEBUG Waiting 60s to connect to browser on 127.0.0.1
[task 2024-10-14T00:22:32.162Z] 00:22:32 INFO - PID 2560 | 1728865352161 geckodriver::browser TRACE Failed to open /var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner/MarionetteActivePort
[task 2024-10-14T00:22:32.163Z] 00:22:32 INFO - PID 2560 | 1728865352161 geckodriver::marionette TRACE Retrying in 100ms
[..]
[task 2024-10-14T00:22:35.436Z] 00:22:35 INFO - PID 2560 | 1728865355435 geckodriver::browser TRACE Failed to open /var/folders/8l/7mfbjqr91rd81vc3rvs2xst0000014/T/tmp7isizu1z.mozrunner/MarionetteActivePort
[task 2024-10-14T00:22:35.437Z] 00:22:35 INFO - PID 2560 | 1728865355435 geckodriver::marionette TRACE Retrying in 100ms
[task 2024-10-14T00:22:35.479Z] 00:22:35 INFO - PID 2560 | 1728865355478 Marionette TRACE Received observer notification final-ui-startup
[task 2024-10-14T00:22:35.512Z] 00:22:35 INFO - PID 2560 | 1728865355511 Marionette INFO Listening on port 50822
[task 2024-10-14T00:22:35.513Z] 00:22:35 INFO - PID 2560 | 1728865355511 Marionette DEBUG Marionette is listening
[task 2024-10-14T00:22:35.665Z] 00:22:35 INFO - PID 2560 | Read port: 50822
[task 2024-10-14T00:22:36.040Z] 00:22:36 INFO - PID 2560 | 1728865356039 Marionette DEBUG Accepted connection 0 from 127.0.0.1:50823
[task 2024-10-14T00:22:36.109Z] 00:22:36 INFO - PID 2560 | 1728865356108 geckodriver::marionette DEBUG Connection to Marionette established on 127.0.0.1:50822.
Comment 2•1 month ago
One clarification: we don't run tasks in parallel, we just share the pool of machines across all types of tasks, so there isn't one task sharing resources with another task, etc.
Assignee
Comment 3•1 month ago
Oh, OK. That wasn't clear; thank you for the clarification! That makes this very interesting.
Joel, is there a way to get a one-click loaner for such a machine? It might be good to get the tests running and to create a Gecko profile. That might be more helpful than any other logging.
Comment 4•1 month ago
macosx is hardware; you need to VPN into the datacenter and remote onto the machine. Adding rcurran as he does most of the macosx work.
Ryan, can you answer :whimboo's question about what it would take to get a macosx 10.15 loaner?
Comment 5•1 month ago
Pick a worker. We'll quarantine it and provide you SSH/VNC access as needed. Thanks.
Assignee
Comment 6•1 month ago
I have taken a look at the existing failures, but no specific worker stood out as producing more failures than usual. So a random one should probably be fine. Thanks!
Comment 7•29 days ago
:whimboo see https://mozilla-hub.atlassian.net/browse/RELOPS-1132?focusedCommentId=956462 for the latest
Assignee
Updated•23 days ago
Comment 8•23 days ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #0)
Markus, do you have an idea? Maybe there is some kind of logging that we could do on try builds, where we can hopefully catch this issue as well? Basically all the wpt tests that already have a longer runtime in general are affected and are now timing out.
One thing you could try is to set MOZ_PROFILER_STARTUP=1 and MOZ_PROFILER_SHUTDOWN=/tmp/startup_profile.json and see if that shows the slow startup. But if the slowness happens before our code starts running, for example because macOS does some signature check, then this won't show it.
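A minimal sketch of such an invocation (the binary and profile paths below are placeholders):

# Launch Firefox with the startup profiler enabled; the profile is written when Firefox shuts down.
mkdir -p /tmp/test-profile
MOZ_PROFILER_STARTUP=1 \
MOZ_PROFILER_SHUTDOWN=/tmp/startup_profile.json \
"/Applications/Firefox Nightly.app/Contents/MacOS/firefox" -profile /tmp/test-profile -no-remote
# Afterwards, load /tmp/startup_profile.json at https://profiler.firefox.com to inspect the startup.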
Assignee
Comment 9•21 days ago
I tried to reproduce this on the Mac Mini loaner but I'm not able to see slow startup times. Everything looks normal and we see a very quick startup of Firefox:
1730884412414 Marionette DEBUG Closed connection 1
Application command: Firefox-opt.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/1k/73174z2506d5hbnftcjr17rm0000gp/T/tmpkfslmxsr.mozrunner
1730884412713 Marionette INFO Marionette enabled
1730884412863 Marionette TRACE Received observer notification final-ui-startup
Given that I can replicate it with pure Marionette tests as well, I might just push a job with a single test to run and use the Gecko profiler settings as mentioned above. Mochitests are affected as well, and surprisingly Talos too:
[task 2024-11-05T00:38:06.086Z] 00:38:06 INFO - TEST-INFO | started process 728 (/opt/worker/tasks/task_173076700310462/build/application/Firefox Nightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/9m/8nfcw1_j2gn7r45bdvc36wrm000014/T/tmpnr09v840/profile)
[task 2024-11-05T00:38:10.514Z] 00:38:10 INFO - PID 728 | [ERROR error_support::handling] logins-unexpected: Error parsing URL: relative URL with a cannot-be-a-base base
Don't we have Talos startup performance tests which should have caught that long delay?
I just triggered a random Talos job with the profiler enabled. Let's see if that already gives a hint so that I don't have to prepare a custom Marionette job for it.
Assignee
Comment 10•21 days ago
Here are some timings of Mn jobs for comparison:
Sep 4th: 0.9s
[task 2024-09-04T07:13:14.877Z] 07:13:14 INFO - Application command: /opt/worker/tasks/task_172543263043635/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/hh/22h03g290_3fzy6fg6z99z_c000014/T/tmpjuz71b44.mozrunner
[task 2024-09-04T07:13:15.135Z] 07:13:15 INFO - 1725433995135 Marionette INFO Marionette enabled
[task 2024-09-04T07:13:15.748Z] 07:13:15 INFO - 1725433995748 Marionette TRACE Received observer notification final-ui-startup
Oct 1st: 2.3s
[task 2024-10-01T01:34:29.918Z] 01:34:29 INFO - Application command: /opt/worker/tasks/task_172774535893276/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/9r/cc81p_zj7_bf6051yy4xhjt0000014/T/tmpaztnj34j.mozrunner
[task 2024-10-01T01:34:30.139Z] 01:34:30 INFO - 1727746470139 Marionette INFO Marionette enabled
[task 2024-10-01T01:34:32.239Z] 01:34:32 INFO - 1727746472239 Marionette TRACE Received observer notification final-ui-startup
Nov 6th: 4.6s
[task 2024-11-06T07:56:59.670Z] 07:56:59 INFO - Application command: /opt/worker/tasks/task_173087905802411/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/0y/nndy_9kd66lg9_qw2_gl47hw000014/T/tmpc1sc6pgy.mozrunner
[task 2024-11-06T07:56:59.914Z] 07:56:59 INFO - 1730879819914 Marionette INFO Marionette enabled
[task 2024-11-06T07:57:04.287Z] 07:57:04 INFO - 1730879824286 Marionette TRACE Received observer notification final-ui-startup
Not sure if this graph from Perfherder shows the dramatic increase in startup time as well, but if it does we should really investigate and fix this performance regression.
Assignee
Comment 11•21 days ago
It looks like the T(o) jobs actually contain this measurement. I've triggered a new job with the Gecko profiler enabled. I hope that this time we will get a profile that contains the startup phase:
Assignee
Comment 12•21 days ago
Here is the actual recorded Gecko profile which shows the delay of about 4s: https://share.firefox.dev/40Agnct
Assignee
Comment 13•21 days ago
Both bug 1928656 and bug 1928653 show regressions for Talos tests as well on MacOS.
Here is a try build for a T(o) job with profiling enabled, from a checkout of Sep 4th. Let's see if the numbers are still bad or back to around 1s.
https://treeherder.mozilla.org/jobs?repo=try&revision=d26f6a27e88f343b746ec1ef98e27b602242401d
Assignee
Comment 14•21 days ago
Let's move this bug to Testing :: General, given that it affects several test suites. Individual fixes for Firefox or the worker should be filed as dependencies.
Assignee
Comment 15•21 days ago
We also re-triggered the T(o) job for a build from Aug 26th. The results will be shown in this graph on Perfherder.
Comment 16•21 days ago
We're stuck in native mac code doing menubar shenanigans. Something that matches the timing, code, and such would be bug 1855346, though it went through various back outs? Not sure if that shows up in the graph.
Comment 17•21 days ago
Ryan, would you know what's changed on the macosx 10.15 workers since around October 1st?
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #15)
We also re-triggered the T(o) job for a build from Aug 26th. The results will be shown in this graph on Perfherder.
The graph here suggests that it's an issue related to the workers/environment and not a regression in the browser.
Comment 18•21 days ago
The profile shows that we're blocked waiting for something that's related to the GPU or the window server. I'd blame some ambient system state; maybe there are too many windows open or something is running in the background hogging the GPU, or maybe something is weird with the monitor.
Assignee
Comment 19•21 days ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #15)
We also re-triggered the T(o) job for a build from Aug 26th. The results will be shown in this graph on Perfherder.
Those results show the same bad numbers as what we currently have on mozilla-central. So, as Greg mentioned, it's most likely related to the environment of the macOS workers.
Comment 20•21 days ago
One thing :kash noticed is that some of the perf machines have a 24Hz refresh rate and not the target 60Hz (bug 1917594). No idea whether that would contribute to this, but I thought I would mention it. Once we figure out the root cause, we will work on a pre-flight setup to ensure we have 60Hz.
Assignee
Comment 21•21 days ago
(In reply to Emilio Cobos Álvarez (:emilio) from comment #16)
We're stuck in native mac code doing menubar shenanigans. Something that matches the timing, code, and such would be bug 1855346, though it went through various back outs? Not sure if that shows up in the graph.
I think that we can rule this out given that a build from August shows the same symptoms these days.
Assignee
Comment 22•21 days ago
(In reply to Markus Stange [:mstange] from comment #18)
The profile shows that we're blocked waiting for something that's related to the GPU or the window server. I'd blame some ambient system state; maybe there are too many windows open or something is running in the background hogging the GPU, or maybe something is weird with the monitor.
OK, that is interesting! I was thinking about this a bit more and have a hypothesis. Could this actually be related to the new Mac session resume support that landed for Firefox via bug 1827651? It had several landings and backouts and has been stable since mid-September.
Note that after each job the worker gets restarted! So if Firefox is not closed correctly at the end of a job, the system-triggered restart will most likely activate the session resume code and start Firefox again. This would also explain why older builds are affected (the worker in use already has a couple of Firefox processes running) and why we see such noise in the Talos data.
Regarding the worker environment, none of the recently landed changes seem related or likely to cause this.
Assignee
Comment 23•21 days ago
Stephen, could you please check my last comment? Is that reasonable?
Assignee
Comment 24•21 days ago
Maybe the worker should kill all processes started by the cltbld user before the restart. Given that jobs that run into a task timeout won't be able to clean up the machine correctly, I see this as the only place to do it.
Assignee
Comment 25•21 days ago
With this job we have a startup time of 6s.
Ryan, could you please check how many Firefox processes are running on the macmini-r8-152 worker? Thanks!
Comment 26•21 days ago
as a note any given mac mini in the general pool will be reused almost immediately, the machine is restarted between tasks
Comment 27•21 days ago
[relops@macmini-r8-152.test.releng.mdc1.mozilla.com ~]$ sudo ps aux | grep '[f]irefox' | wc -l
1
[relops@macmini-r8-152.test.releng.mdc1.mozilla.com ~]$ sudo ps aux | grep '[f]irefox'
cltbld 1410 0.0 0.3 4393200 54916 ?? S 2:40PM 0:02.53 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -u /opt/worker/tasks/task_173090282395218/build/tests/mochitest/runtests.py browser/base/content/test/fullscreen/browser.toml browser/components/aboutlogins/tests/browser/browser.toml browser/components/reportbrokensite/test/browser/browser.toml browser/components/tests/browser/browser.toml dom/base/test/browser.toml dom/serviceworkers/test/browser.toml toolkit/components/pdfjs/test/browser.toml --setpref=layout.css.stylo-threads=4 --setpref=network.process.enabled=true --setpref=network.http.network_access_on_socket_process.enabled=true --setpref=layers.d3d11.enable-blacklist=false --appname=/opt/worker/tasks/task_173090282395218/build/application/Firefox NightlyDebug.app/Contents/MacOS/firefox --utility-path=tests/bin --extra-profile-file=tests/bin/plugins --symbols-path=/opt/worker/tasks/task_173090282395218/build/symbols --certificate-path=tests/certs --quiet --log-errorsummary=/opt/worker/tasks/task_173090282395218/build/blobber_upload_dir/mochitest-browser-chrome_errorsummary.log --screenshot-on-fail --cleanup-crashes --marionette-startup-timeout=180 --sandbox-read-whitelist=/opt/worker/tasks/task_173090282395218/build --log-raw=- --flavor=browser --chunk-by-runtime
cltbld 2050 0.0 0.0 0 0 ?? Z 2:56PM 0:00.00 (firefox)
Comment 28•21 days ago
By request
[root@macmini-r8-152.test.releng.mdc1.mozilla.com ~]# sudo ps aux | grep '[f]irefox' | wc -l
2
[root@macmini-r8-152.test.releng.mdc1.mozilla.com ~]# sudo ps aux | grep '[f]irefox'
cltbld 1243 0.6 0.4 6963668 70204 ?? Ss 3:27PM 0:01.22 /opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/MacOS/firefox -unittest --gtest_death_test_style=threadsafe
cltbld 1239 0.0 0.2 4365924 27968 ?? S 3:27PM 0:00.18 /Library/Frameworks/Python.framework/Versions/3.11/Resources/Python.app/Contents/MacOS/Python -u /opt/worker/tasks/task_173090606349572/build/tests/gtest/rungtests.py --xre-path=/opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/Resources --cwd=/opt/worker/tasks/task_173090606349572/build/tests/gtest --symbols-path=https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Jd7cBQ9lT7-_aytl8fgrlA/artifacts/public/build/target.crashreporter-symbols.zip --utility-path=tests/bin /opt/worker/tasks/task_173090606349572/build/application/Firefox Nightly.app/Contents/MacOS/firefox
Assignee
Comment 29•21 days ago
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #26)
as a note any given mac mini in the general pool will be reused almost immediately, the machine is restarted between tasks
Please see the description of the new session resume feature, which is an OS-level thing. But I do not know how macOS decides which profile to run, given that we delete it with the restart. Maybe there is a reference in profiles.ini? Maybe Stephen can give more details.
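If it helps, a quick check on a worker might look like this (a sketch; the cltbld home path is an assumption):

# List the profiles Firefox has registered for the test user; a leftover entry here
# would be a candidate for what macOS session resume could try to relaunch.
sudo cat "/Users/cltbld/Library/Application Support/Firefox/profiles.ini"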
Assignee
Comment 30•21 days ago
I just checked several Mochitest jobs run on macOS 11.2.3 (aarch64) and they all do NOT show this startup delay:
[task 2024-11-05T23:51:15.618Z] 23:51:15 INFO - Application command: /opt/worker/tasks/task_173085033658563/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/cr/5_z7mtts3jb6l848j8glx824000014/T/tmp4rxvfiry.mozrunner
[task 2024-11-05T23:51:15.620Z] 23:51:15 INFO - runtests.py | Application pid: 2097
[task 2024-11-05T23:51:15.620Z] 23:51:15 INFO - TEST-INFO | started process GECKO(2097)
[task 2024-11-05T23:51:15.803Z] 23:51:15 INFO - GECKO(2097) | 1730850675802 Marionette INFO Marionette enabled
[task 2024-11-05T23:51:15.899Z] 23:51:15 INFO - GECKO(2097) | 1730850675899 Marionette TRACE Received observer notification final-ui-startup
Comment 31•21 days ago
11.2.3 is hosted at macstadium and we haven't seen refresh rate issues with our macosx 14.x machines hosted at macstadium.
Comment 32•21 days ago
The Macs in MacStadium have this HDMI dongle attached: https://fit-iot.com/web/product/fit-headless-4k/
I'm not sure what we install in MDC, but it may (probably?) be different. I can find out.
Comment 33•21 days ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #23)
Stephen, could you please check my last comment? Is that reasonable?
I do not believe that this has to do with the new macOS session resume support. The main reason is because this support isn't technically "new". We have simply expanded it. Before, only Firefox instances that were never restarted due to an update were opted into session resume. Bug 1827651 expanded this support to also include Firefox instances that were restarted after an update. Unless there is some kind of interaction with update tests, this shouldn't have made a difference.
Assignee
Comment 34•19 days ago
(In reply to Stephen A Pohl [:spohl] from comment #33)
I do not believe that this has to do with the new macOS session resume support.
Thanks for your reply. Then let's rule this out, especially given that only a single instance of Firefox runs on a restarted worker.
Ryan, while you are trying to figure out the connected dongles in MDC I wonder if the machine that I got from the staging pool has the same dongle attached to it as the production machines. If not I would appreciate if I could get a different machine. Thanks.
Comment 35•19 days ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #34)
(In reply to Stephen A Pohl [:spohl] from comment #33)
I do not believe that this has to do with the new macOS session resume support.
Thanks for your reply. Then let's rule this out, especially given that only a single instance of Firefox runs on a restarted worker.
Ryan, while you are trying to figure out the connected dongles in MDC I wonder if the machine that I got from the staging pool has the same dongle attached to it as the production machines. If not I would appreciate if I could get a different machine. Thanks.
Hello Henrik,
I just checked the staging host macmini-r8-125.test.releng.mdc1.mozilla.com and it has the same dongle as the production r8s. 28E850
Assignee
Comment 36•16 days ago
(In reply to Ryan Curran from comment #35)
I just checked the staging host macmini-r8-125.test.releng.mdc1.mozilla.com and it has the same dongle as the production r8s. 28E850
Thanks for the confirmation. Would it still be possible to hand a machine over to me at a time when we see such a delay? When it gets disconnected from Taskcluster it should not restart automatically after the 30 minutes of inactivity, which should keep all the files and processes around.
Assignee
Comment 38•14 days ago
Quick status update. Yesterday we tried to disconnect a worker from the production set while it was running a wdspec test that clearly shows the longer startup times. Due to the extra work necessary to make it available for me to log in, and timezone differences, there was a delay, so I was only able to log in once the actual task had finished running on the worker. That also means that any files related to the test job had already been removed from the file system under /opt/worker/. Joel, do you know if there is a way to prevent that?
Also interesting is that a disconnected worker can still upload the final result to Taskcluster / Treeherder, as happened for this job. I would have expected it to be marked as an error or something similar. Not sure whether that is a bug we should fix or not.
We will try again today with better coordination between Ryan and myself so that I can log in while the tests are running. But until then it would still be good to know how to prevent the job from deleting all the files.
Comment 39•14 days ago
When the worker finishes a task and reboots, this worker-runner script executes before generic-worker starts running, and it looks like it clears /opt/worker/tasks/task_*.
To test it would probably be easy to just comment that line out of the script. The script is at /usr/local/bin/worker-runner.sh
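For illustration, the cleanup step presumably looks something like this (a hypothetical sketch; the actual script content may differ):

# Hypothetical sketch of the task cleanup in /usr/local/bin/worker-runner.sh;
# commenting it out would preserve the previous task's files for inspection.
rm -rf /opt/worker/tasks/task_*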
Assignee
Comment 40•14 days ago
(In reply to Ryan Curran from comment #39)
To test it would probably be easy to just comment that line out of the script. The script is at /usr/local/bin/worker-runner.sh
Is that still something that can be done while the job is running and you are disconnecting the machine? If so, maybe you could make the whole script a no-op when preparing the machine for hand-over to me?
Comment 41•14 days ago
It is unrealistic to get a live machine and then quickly quarantine and preserve it before it moves on to other jobs.
Could you not take a machine or two that have historically been prone to the problem, put them in your own pool, and run tasks on them, or remote-desktop to them and run the tests manually?
Assignee
Comment 42•14 days ago
Well, it's not that unrealistic. It's just that we have to be quick before the job finishes. So I re-triggered a wd job, which caused this task to be created on the macmini-r8-170 worker.
I now have access via VNC and was able to see the delay both visually and in the logs created by the running wd job. There is no other Firefox running on the system, and no CPU spikes are visible in top or the Activity Monitor. Now that the job is done I'll try to re-run the tests to see if it is still reproducible. If so, I will try to install Xcode to be able to create a system profile using Instruments.
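Once Xcode is available, a command-line capture could also work as an alternative to the Instruments UI; a sketch (the application, profile, and output paths are placeholders):

# Record a Time Profiler trace of a single Firefox startup using xctrace.
mkdir -p /tmp/test-profile
xcrun xctrace record --template "Time Profiler" --output firefox-startup.trace \
  --launch -- "/Applications/Firefox Nightly.app/Contents/MacOS/firefox" \
  -profile /tmp/test-profile -no-remote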
Comment 43•14 days ago
if you are able to schedule a task on a specific machine, then you have a private pool :)
Assignee
Comment 44•14 days ago
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #43)
if you are able to schedule a task on a specific machine, then you have a private pool :)
I'll most likely try that. Is there some documentation on how to get this done? I assume Ryan can place the machine in such a pool, but what do I have to change for, e.g., the Marionette tests to use that specific pool?
Comment 45•14 days ago
Ryan can make a worker pool for you. It might end up being macosx-<...>-whimboo or some different worker pool name. If it just has a suffix (e.g. -whimboo), then you can do ./mach try fuzzy -q 'test-macosx marionette ...' --worker-suffix=-whimboo
Assignee
Comment 46•13 days ago
(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #45)
Ryan can make a worker pool for you. It might end up being macosx-<...>-whimboo or some different worker pool name. If it just has a suffix (e.g. -whimboo), then you can do ./mach try fuzzy -q 'test-macosx marionette ...' --worker-suffix=-whimboo
Ryan, could you please help me set this up for the machine that I have had access to since yesterday evening? Thanks!
Assignee
Comment 47•13 days ago
Oh, and I would also need Xcode to be installed to be able to use Instruments. But sadly I cannot use the App Store for that because it only offers the latest version, which requires at minimum macOS 14.5. And because I don't have a developer account I cannot download an older version. Could you maybe help with that as well?
Comment 48•13 days ago
Installed Xcode 12.4 and cut a new PR for the pool
Comment 49•13 days ago
TC key rotated and stored in 1Password for new pool
TC key and new pool ID added to Vault
PR draft cut for ronin_puppet with change
Machine has been re-puppetized with the new pool configuration
gecko-t-osx-bug1927416 should be functional now
Usually running something like this tends to work for me:
./mach try fuzzy --worker-override t-osx-1015-r8=releng-hardware/gecko-t-osx-1015-r8-staging
In your case it might be something like:
./mach try fuzzy --worker-override t-osx-1015-r8=releng-hardware/gecko-t-osx-bug1927416
:whimboo just a reminder that this machine is back in its production state configuration wise. thank you
Assignee
Comment 50•13 days ago
Markus, I have created a time profile with Instruments and I'm sure that I caught this delay. See the attached trace log. Could it be that this is related to Metal?
Assignee
Comment 51•12 days ago
I've taken another look at this trace and realized that I should have also created a Metal System Trace. After doing so, I noticed that a significant amount of time (similar to the time profile) is spent in MTLCompilerFSCache:GetElement, suggesting that we're frequently trying to load already-compiled Metal shaders from the Metal Compiler File System Cache.
After some investigation, I found that this cache is located under /private/var/folders/. Checking the disk usage of individual folders, I noticed that Mozilla-related folders are occupying a large amount of space:
950MB /private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightlydebug/com.apple.metal
758MB /private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal/Intel(R) UHD Graphics 630
...
Clearing the /private/var/folders/tv/ folder resolved the issue, as shown in this try build:
[task 2024-11-15T08:22:51.764Z] 08:22:51 INFO - Application command: /opt/worker/tasks/task_173165853969153/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -remote-debugging-port -foreground -profile /var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/T/tmp3meaf4ll.mozrunner
[task 2024-11-15T08:22:51.944Z] 08:22:51 INFO - RemoteAgent DEBUG WebDriver BiDi enabled
[task 2024-11-15T08:22:51.944Z] 08:22:51 INFO - Marionette INFO Marionette enabled
[task 2024-11-15T08:22:52.180Z] 08:22:52 INFO - Marionette TRACE Received observer notification final-ui-startup
This eliminated the previous 4-6 second delay before final-ui-startup, allowing it to be reached almost instantly. As a result, the Mn(unit) job runtime dropped from 21 minutes to just 9 minutes. After running this new try job the tv folder contains the following Mozilla-specific items:
68K /private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal/Intel(R) UHD Graphics 630
136K /private/var/folders/tv/jtbd48bn6m9gybm_1g3q9l_m000014/C/org.mozilla.nightly/com.apple.metal
This suggests that some recent Metal-related features in Firefox (introduced around June) are causing data to leak into the Metal Compiler File System Cache. While this may not significantly impact most Mac users who don’t restart Firefox frequently, it could still introduce minor delays.
Also, caches like the MTLCompilerFSCache, which are meant to improve performance, may persist across reboots to avoid reinitialization delays. This indicates a need for a solution that prevents Metal data from accumulating excessively over time.
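For reference, a sketch of how the per-build cache sizes can be checked on a worker (run as root; the user hash directories differ per machine):

# Report the size of each Firefox-related Metal compiler cache on the machine.
sudo find /private/var/folders -type d -path '*/C/org.mozilla.*/com.apple.metal' -prune \
  -exec du -sh {} + 2>/dev/null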
Assignee
Comment 52•12 days ago
Joel, I wonder if we should re-puppetize the macOS 10.15 workers (to clean up those folders), given that we see an exponential increase in startup time, now reaching close to 6s for a single startup of Firefox.
Assignee
Comment 53•12 days ago
To reflect that this is not only affecting the WebDriver tests but all tests that have to launch Firefox, I'm going to update the bug's summary. We should file separate bugs (e.g. for graphics and infra work) that block this bug.
Comment 54•12 days ago
That's a great find! I haven't had a chance to understand what this means yet. I wonder if we only add to the cache during webgpu tests or if it is something that happens in regular Firefox usage too (where webgpu is still preffed off by default).
Assignee
Comment 55•12 days ago
When I check about:config in a Nightly build I can see that dom.webgpu.enabled is set to true by default, and that is reflected in the static prefs list. So that means that all the tests we run for Nightly builds (central, autoland, ...) actually contribute to growing that cache. The caches for devedition and release builds of Firefox were way smaller, so maybe only specific WebGPU tests are causing the cache to grow.
Assignee
Comment 56•12 days ago
Here is a link to a spreadsheet listing all the files with their full paths under /private/var/folders/.
Assignee
Comment 57•12 days ago
As requested by Markus, moving the needinfo to Erich, who knows more about the development of WebGPU.
Comment 58•12 days ago
Do you want us to try a find /private/var/folders -type d -name "org.mozilla.nightlydebug" -exec rm -rf {} + 2>/dev/null before generic-worker starts after a fresh reboot?
Assignee
Comment 59•12 days ago
As discussed with Ryan, we probably want to remove all entries related to org.mozilla.* to ensure that we get rid of all the cache entries. I would hand the decision on this approach over to someone who knows more about the Mac workers, but it sounds like a good compromise for now.
Comment 60•12 days ago
Just to confirm, we want to try adding:
find /private/var/folders/ -path '*/C/org.mozilla.*' -type d -o -path '*/C/org.mozilla.*/*' -type d -exec rm -rf {} +
in our worker-runner.sh script, which executes before the worker starts, correct?
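One note on the expression: without parentheses the -exec only applies to the right-hand side of the -o, so the org.mozilla.* directories themselves would not be removed. A slightly simpler sketch (untested on the workers) could be:

# Remove the per-build Metal cache directories; -prune keeps find from descending
# into directories that rm has already deleted.
find /private/var/folders -type d -path '*/C/org.mozilla.*' -prune -exec rm -rf {} + 2>/dev/null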
Comment 61•12 days ago
If landed, will deploy when I return from work week next week (return 11/25)
Updated•12 days ago
Comment 62•12 days ago
My understanding is that the files under $(getconf DARWIN_USER_CACHE_DIR)/com.apple.metal are created implicitly by the Metal Shading Language compiler, so any use of Metal shaders could result in files being created there. Certainly running the WPT WebGPU CTS will create many, many shaders.
It would be cool if someone with a mac could visit https://webgpu.github.io/webgpu-samples/ in Nightly, run some demos, and see if a bunch of junk gets created in that directory.
This isn't something that our WebGPU implementation explicitly requests; it seems to be the default behavior of the Metal Shading Language compiler. It's too bad that Apple designed their cache to take so much CPU, even before one needs its contents.
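For the check above, a quick before/after measurement might be (a sketch; the org.mozilla.* subdirectory is assumed from the paths seen on the workers):

# Size of the current user's Metal shader caches for Firefox builds.
du -sh "$(getconf DARWIN_USER_CACHE_DIR)"/org.mozilla.*/com.apple.metal 2>/dev/null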
Updated•12 days ago
Comment 63•12 days ago
Thanks :jimb.
I logged into a 10.15 staging r8 node to test this. How long would you recommend letting each test run before doing a diff on the directory size? Thanks.
Assignee
Updated•12 days ago
Comment 64•12 days ago
Every time any part of Firefox wants to run its own code on the GPU, it has to compile a shader. Running the WPT WebGPU CTS even just for a little bit should create plenty of files.
Comment 65•12 days ago
But if I'm understanding the bug correctly, all Firefox startup times are affected - Firefox doesn't need to run the WebGPU CTS for its startup to get slowed down, right?
Oho, this looks interesting:
In https://crbug.com/angleproject/5354 a workaround was added for ANGLE's Metal backend tests which eliminated intermittent several-second hangs during startup - prevent Metal from accessing the per-user shader cache. It seems that the application was hanging on access to this directory.
As pointed out by rsesek@ in http://crbug.com/974219#c91 (restricted view, sorry) this dyld interposition technique won't work in Chrome, but Chrome's sandbox can be used to deny Metal's runtime access to this directory. We should try this as a potential workaround for the Metal shader compilation issues seen in Chrome.
Comment 66•12 days ago
Metal on Mac has some issues with shader caching inside the "/private/var/.../com.apple.metal" folder. After some point the cache becomes huge and the compilation of Metal shaders takes a very long time to finish. That would be the reason for the intermittent timeouts here.
It happened a lot on my local machines when I ran the dEQP tests (sometimes it took 10 minutes to finish loading the first shader). When using a debugger to pause the app, it pointed to the internal library function where the Metal shader cache lookup happens.
I usually deleted the cache folder above after some point and the shader compilation became fast again.
Comment 67•12 days ago
Metal is the only officially supported graphics API on macOS these days, although OpenGL is still there and works fine. But this makes me suspect that probably anything that does graphics at all must, at some point, end up trying to run Metal shaders, even if they're just implementation details of some library or translated from some other language. But any use of Metal would probably be enough to cause an attempt to access the cache.
Updated•10 days ago
Assignee
Comment 68•9 days ago
Jim, would it be helpful to get a list of the entries of this cache, to help identify the cause of its massive size? We would most likely have to do that before the above script gets distributed to the machines, since it will remove the cache on each restart.
Comment 69•9 days ago
Sure, I'd love to see a listing. If the files are textual, and you could provide the contents of two or three selected at random, that might serve as a smoking gun.
But more broadly, there are two issues here:
- We have slow Firefox startup times in our CI. This can be fixed with a script.
- I imagine that, over time, less unusual cases than the WPT WebGPU CTS could also cause an accumulation of files in the Metal Shading Language compiler cache, leaving people with slow Firefox start times. This is an actual product problem, not just a CI infra issue.
Assignee
Updated•8 days ago
Assignee
Comment 70•7 days ago
(In reply to Jim Blandy :jimb from comment #69)
Sure, I'd love to see a listing. If the files are textual, and you could provide the contents of two or three selected at random, that might serve as a smoking gun.
Ryan, is that something you could help with when you are back from the work week? We definitely have to do that before the clean-up changes are pushed to production; if we miss doing it, all the data would be removed.
- We have slow Firefox startup times in our CI. This can be fixed with a script.
Yes, that is correct. Hopefully we can fix it by next week when Ryan is back. I would also like to close this specific bug once the changes have been shipped to production.
- I imagine that, over time, less unusual cases than the WPT WebGPU CTS could also cause an accumulation of files in the Metal Shading Language compiler cache, leaving people with slow Firefox start times. This is an actual product problem, not just a CI infra issue.
Indeed, and that is something I already asked about in comment 53. Should I already file a graphics-specific bug, or should we wait until the data you requested is available?
Assignee
Comment 71•7 days ago
Also note that there is an ongoing effort to move test jobs from macOS 10.15 to macOS 14.70 Intel (see bug 1922355). Given that all of those machines are newly installed via Puppet, this specific slowdown will also no longer be visible there.
Comment hidden (Intermittent Failures Robot)
Comment 73•2 days ago
I have a theory that the shader cache is using a key based on the path of the executable, so the fact that CI runs use builds installed at effectively random paths leads to a massive buildup of redundant cache entries, and the OS may be preloading the entire cache index at startup so that lookups have low latency (otherwise searching for each shader used in a game would stall as it reads in a bit more of the index on the first several requests - games tend to have a lot of distinct shaders).
I'd further suppose that it is probably a SQLite database, as that is popular these days, but that is immaterial to the theory.
The theory can be tested by deleting the shader cache, installing Firefox at multiple locations, running each copy, and checking the size of the shader cache between runs.
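A rough sketch of such a check (the install paths, profile paths, and the bundle-id cache subdirectory are assumptions):

# Launch the same build from several distinct install paths and check whether the
# per-user Metal shader cache grows with each new path.
CACHE="$(getconf DARWIN_USER_CACHE_DIR)/org.mozilla.nightly/com.apple.metal"
rm -rf "$CACHE"
for i in 1 2 3; do
  cp -R "/Applications/Firefox Nightly.app" "/tmp/fx-copy-$i.app"
  mkdir -p "/tmp/test-profile-$i"
  "/tmp/fx-copy-$i.app/Contents/MacOS/firefox" -profile "/tmp/test-profile-$i" -no-remote &
  pid=$!
  sleep 30
  kill "$pid"
  wait "$pid" 2>/dev/null
  du -sh "$CACHE" 2>/dev/null
done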
Comment 74•20 hours ago
Here is a copy of the com.apple.metal directory from an affected machine: https://drive.google.com/file/d/1wHiv6UgYtUDhsQ2GzA6M0KWj0kn5ImqG/view?usp=sharing
Assignee
Comment 75•2 hours ago
(In reply to Ryan Curran from comment #74)
Here is a copy of the com.apple.metal directory from an affected machine: https://drive.google.com/file/d/1wHiv6UgYtUDhsQ2GzA6M0KWj0kn5ImqG/view?usp=sharing
Those files are not SQLite-based databases but a custom format that is not documented by Apple. I had a quick look into them but was not able to find any specific evidence of what's going wrong. Jim, maybe you can have a look at them?
I also tried the following variations with https://webgpu.github.io/webgpu-samples (note that a lot of the examples fail due to a shader creation error) but was not able to force the cache to grow to high numbers. The size stayed around 2.5MB for all the related files.
- Variations of the installation path of Firefox
- Different versions of the same kind of Firefox build (only increased the size once for the 2nd build)
- Using different Firefox profile locations
So I have no idea what causes the cache to grow over time. But maybe there was an issue with this older macOS version that triggered such a leak in the cache? In that case we should observe the new macOS 14.70 workers and regularly check their startup performance in Perfherder. I don't think we should invest the time right now to do the above checks on such an old system.
Assignee
Comment 76•2 hours ago
Note that I had a look at Perfherder for the macOS 14.70 workers and the values they report for startup times. They are currently stable at around 1500ms, but that is still 50% higher than the values we had in June this year on macOS 10.15 (see comment 10).