Creation of content processes is slow on macOS arm64 runners on GitHub Actions
Categories
(Core :: Security: Process Sandboxing, defect, P3)
Tracking
firefox129: fixed
People
(Reporter: alexrudenko, Assigned: haik)
References
(Blocks 1 open bug)
Details
(Whiteboard: [webdriver:m11])
Attachments
(6 files, 1 obsolete file)
- 52.00 KB, application/octet-stream
- 80.56 KB, text/plain
- 8.42 MB, text/plain
- 5.38 MB, application/zip
- 7.83 MB, text/plain
- Bug 1893921 - Creation of content processes is slow on macos arm64 runners on GitHub actions r=spohl (48 bytes, text/x-phabricator-request)
See https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24242835681?pr=12318 for a test run on the new macOS arm64 runners on GitHub Actions (https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers).
Several tests time out and the run is generally very slow: 21 minutes for one test shard that usually takes up to 5 minutes. When run with the trace option, the logs indicate that browsingContext.create always blocks for 5 seconds: https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010
Comment 1•7 months ago
I've triggered a new job with debug enabled. Let's see if this gives some more details:
https://github.com/puppeteer/puppeteer/actions/runs/8829855155/job/24363405593?pr=12318
Comment 2•7 months ago
And here is already some output:
1714376280846 RemoteAgent TRACE [12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714376280846 RemoteAgent TRACE [12] ProgressListener Setting unload timer (5000ms)
1714376285849 RemoteAgent DEBUG WebDriverBiDiConnection 7462f260-895e-4517-8a97-128c1d17c323 <- {"type":"event","method":"script.realmCreated","params":{"realm":"8f9e717e-c133-42cb-9051-61e1a4366967","origin":"null","context":"0c801b4e-d1d1-455f-b76b-495f31e4e308","type":"window"}}
It looks like the hang comes from waiting for about:blank to be loaded. Sadly I do not see any update of the web progress listener status and wonder if we somehow miss loading this page.
Alex, is it easy to run a job with the MOZ_LOG="BCWebProgress:5" environment variable set?
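For context, MOZ_LOG can be exported like any other environment variable in the job step before the test command runs. A minimal sketch of such a shell step, assuming a hypothetical test command and log path (neither is taken from the Puppeteer setup):

export MOZ_LOG="BCWebProgress:5"
export MOZ_LOG_FILE="$RUNNER_TEMP/gecko.log"   # optional: also write the log to a file that can be uploaded as an artifact
npm test                                       # placeholder for whatever command launches the Firefox-based suite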
Reporter
Comment 3•7 months ago
I believe the same log was already in the linked gist https://gist.github.com/OrKoN/1bcde28b31e2dba4ba50909deecdd010
To set extra env variables we can use the PR here https://github.com/puppeteer/puppeteer/pull/12328
Reporter
Comment 4•7 months ago
I have triggered a run with the env var here https://github.com/puppeteer/puppeteer/actions/runs/8875780005?pr=12328
Reporter
Comment 5•7 months ago
Comment 6•7 months ago
Here is an example:
https://github.com/puppeteer/puppeteer/actions/runs/8875780005/job/24365989726?pr=12328#step:9:350
1714380471274 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 -> {"id":8,"method":"browsingContext.create","params":{"type":"tab","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274"}}
1714380471274 RemoteAgent TRACE Module root/browsingContext.sys.mjs found for ROOT
1714380471274 RemoteAgent TRACE Received command browsingContext.create for destination ROOT
1714380471274 RemoteAgent TRACE Module root/browsingContext.sys.mjs found for ROOT
1714380471275 RemoteAgent TRACE [10] NavigationListener onStateChange, stateFlags: 196612, status: 0, isStart: false, isStop: false, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471275 RemoteAgent TRACE Received command browsingContext._awaitVisibilityState for destination WINDOW_GLOBAL
1714380471275 RemoteAgent TRACE Module windowglobal/browsingContext.sys.mjs found for WINDOW_GLOBAL
1714380471277 RemoteAgent TRACE [10] NavigationListener onStateChange, stateFlags: 131088, status: 0, isStart: false, isStop: true, isNetwork: false, isBindingAborted: false, targetURI: about:blank
1714380471314 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}
2024-04-29T08:47:51.315Z puppeteer:webDriverBiDi:RECV ◀ [
'{"type":"event","method":"browsingContext.contextCreated","params":{"children":null,"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f","url":"about:blank","userContext":"3da5f8e5-eb20-4d60-b6a2-3b6706ee4274","parent":null}}'
]
1714380471330 RemoteAgent TRACE Received DOM event TabSelect for [object XULElement]
[Parent 5860: Main Thread]: I/BCWebProgress OnLocationChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, about:blank, 0) on {top:1, id:a, url:about:blank}
1714380471336 RemoteAgent TRACE Module windowglobal-in-root/realm-created.sys.mjs not found for WINDOW_GLOBAL
2024-04-29T08:47:51.337Z puppeteer:webDriverBiDi:RECV ◀ [
1714380471336 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}
'{"type":"event","method":"script.realmCreated","params":{"realm":"9279068c-35a3-483d-854f-cba368b3ed6a","origin":"null","context":"306dcc2d-3624-41f7-b432-d87b10d858bd","type":"window"}}'
1714380471336 RemoteAgent TRACE Module windowglobal-in-root/window-global-handler-created.sys.mjs not found for WINDOW_GLOBAL
]
1714380471337 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
[Parent 5860: Main Thread]: I/BCWebProgress OnSecurityChange(<null>, <null>, 4) on {top:1, id:a, url:about:blank}
1714380471337 RemoteAgent TRACE [306dcc2d-3624-41f7-b432-d87b10d858bd] No navigation found to stop for url: about:blank
[Parent 5860: Main Thread]: I/BCWebProgress OnStateChange({isTopLevel:1, isLoadingDocument:1}, {URI:about:blank, originalURI:about:blank}, STATE_STOP|STATE_IS_NETWORK|STATE_IS_WINDOW, NS_OK) on {top:1, id:a, url:about:blank}
1714380471337 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
'{"type":"event","method":"browsingContext.domContentLoaded","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
2024-04-29T08:47:51.338Z puppeteer:webDriverBiDi:RECV ◀ [
'{"type":"event","method":"browsingContext.load","params":{"context":"306dcc2d-3624-41f7-b432-d87b10d858bd","timestamp":1714380471277,"url":"about:blank","navigation":null}}'
]
1714380471407 RemoteAgent TRACE [12] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=5000 waitForExplicitStart=false
1714380471407 RemoteAgent TRACE [12] ProgressListener Setting unload timer (5000ms)
2024-04-29T08:47:56.506Z puppeteer:webDriverBiDi:RECV ◀ [
1714380476506 RemoteAgent TRACE [12] ProgressListener No navigation detected: about:blank
'{"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}'
1714380476506 RemoteAgent TRACE [12] ProgressListener Stop: has error=false
]
1714380476506 RemoteAgent DEBUG WebDriverBiDiConnection adc3c975-c114-4598-bfaa-f6881514cfc9 <- {"type":"success","id":8,"result":{"context":"4e53e36f-31c1-4b45-ba2f-f6dc0adf121f"}}
We seem to start the progress listener after the page about:blank has already been loaded? Also I see the browsingContext.load event emitted twice.
Updated•7 months ago
Comment 7•7 months ago
We discussed this in triage yesterday.
First observation is that browsingContext.create waits for the visibilityState of the previous tab to change (here) before we start waiting for the navigation (here).
It would probably be safer to start waiting for the initial navigation right away. However, our waitForInitialNavigationCompleted helper is supposed to detect situations where the initial page has already been loaded and resolve early (here).
To verify that, I locally delayed the call to waitForInitialNavigationCompleted artificially by 500ms, 1000ms, and 2000ms, and in all cases the helper resumed early, without waiting for the 5s timeout of the ProgressListener. Based on this test, I think that calling the helper "late" works fine.
Therefore I tried to delay the call to waitForInitialNavigationCompleted for Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to GitHub CI. This should force waitForInitialNavigationCompleted to start late as well and give the initial navigation enough time to complete, but it still seems to fail with macos-latest.
Comment 8•7 months ago
(In reply to Julian Descottes [:jdescottes] from comment #7)
> Therefore I tried to delay the call to waitForInitialNavigationCompleted for Puppeteer as well, by setting browser.tabs.remote.unloadDelayMs to a few hundred ms and pushing to GitHub CI. This should force waitForInitialNavigationCompleted to start late as well and give the initial navigation enough time to complete, but it still seems to fail with macos-latest.

Note that changing this preference delays the visibilitychange event even further, so I wouldn't expect anything to change for Puppeteer in GitHub CI. Based on my comment, did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout while waiting for the initial navigation?
Comment 9•7 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #8)
> Note that changing this preference delays the visibilitychange event even further, so I wouldn't expect anything to change for Puppeteer in GitHub CI. Based on my comment, did you try it out locally by setting browser.tabs.remote.unloadDelayMs to some higher values? Were you able to reproduce the timeout while waiting for the initial navigation?

Yes, that's what I meant in the previous paragraph: I tried locally with 500ms, 1000ms, and 2000ms and it didn't reproduce the issue. The helper successfully detects that the navigation is already done and resolves immediately.

> Note that changing this preference delays the visibilitychange event even further, so I wouldn't expect anything to change for Puppeteer in GitHub CI.

I think it should have an impact, and it's worrying that it doesn't.
Basically, by increasing this value we widen the gap between the call to addTab and the call to waitForInitialNavigationCompleted. For instance, if we set the value to 1000 ms, it should give the new tab enough time to do its initial navigation, and waitForInitialNavigationCompleted should then detect that the initial navigation is over and resolve. As I said, I verified locally that the helper correctly returns early in those cases, so I don't really understand why it doesn't happen on GitHub CI.
Comment 10•7 months ago
I assume that with bug 1894305 now fixed, we should trigger some new debug jobs in GitHub CI for Puppeteer (with the Firefox cache removed first)?
Comment 11•7 months ago
I retriggered a run at https://github.com/puppeteer/puppeteer/actions/runs/9059188428/job/24886294787?pr=12336
My previous push had a typo in MOZ_LOG so it didn't include the Process logs.
Comment 12•7 months ago
I've re-triggered the failing job with debug logs enabled:
https://github.com/puppeteer/puppeteer/actions/runs/9059188428/job/24888261056?pr=12336
Comment 13•7 months ago
Thanks! So it seems it takes a long time to create a new process when we try to create new tabs:
2024-05-13T08:38:30.1111650Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.1211520Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.1313910Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:30.2590990Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:30.2595100Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:30.2596060Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.2271950Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.2373640Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.2475650Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:32.4686340Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:32.4725860Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:32.4768310Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.2278670Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.2283020Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.2287190Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:33.4980610Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:33.4981200Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:33.4981830Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:34.3286360Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:34.3287130Z 1715589514319 RemoteAgent TRACE [12] ProgressListener No navigation detected: about:blank
2024-05-13T08:38:34.3307020Z 2024-05-13T08:38:34.320Z puppeteer:webDriverBiDi:RECV ◀ [
2024-05-13T08:38:34.3307680Z 1715589514319 RemoteAgent TRACE [12] ProgressListener Stop: has error=false
2024-05-13T08:38:34.3369870Z '{"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}'
2024-05-13T08:38:34.3371320Z 1715589514320 RemoteAgent DEBUG WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 <- {"type":"success","id":8,"result":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226"}}
2024-05-13T08:38:34.3475890Z ]
2024-05-13T08:38:34.3478510Z 1715589514322 RemoteAgent DEBUG WebDriverBiDiConnection 33fdce95-6f81-4977-8679-0cec0e867265 -> {"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}
2024-05-13T08:38:34.3570390Z 2024-05-13T08:38:34.321Z puppeteer:webDriverBiDi:SEND ► [
2024-05-13T08:38:34.3571080Z 1715589514322 RemoteAgent TRACE Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3616340Z '{"id":9,"method":"browsingContext.setViewport","params":{"context":"5e1f91fa-3853-493c-9146-eb0e5c25d226","viewport":{"width":800,"height":600},"devicePixelRatio":null}}'
2024-05-13T08:38:34.3620350Z 1715589514322 RemoteAgent TRACE Received command browsingContext.setViewport for destination ROOT
2024-05-13T08:38:34.3653510Z ]
2024-05-13T08:38:34.3654270Z 1715589514322 RemoteAgent TRACE Module root/browsingContext.sys.mjs found for ROOT
2024-05-13T08:38:34.3655360Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:34.3656560Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
We try to start a process, but we keep getting Process Starting delayed process start, delay=1000 for 5 seconds.
Olli, do you know what could delay the creation of new processes? This is happening on the macOS ARM workers on GitHub.
Comment 14•7 months ago
If I remember correctly, Randell worked on some process-related changes in the past, so maybe he is a better person to ask. Randell, if that is not the case please let us know, and maybe point us to a better person to ask. Thanks!
Comment 15•7 months ago
(I don't know what "macos arm workers on Github" means).
I assume this is on opt/shippable builds?
Do we have performance profiles for this?
Comment 16•7 months ago
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #15)
> (I don't know what "macos arm workers on Github" means).

These are the workers provided by GitHub to run GitHub Actions, see https://github.com/actions/runner-images/blob/macos-14-arm64/20240422.3/images/macos/macos-14-arm64-Readme.md#browsers

> I assume this is on opt/shippable builds?

Yes.

> Do we have performance profiles for this?

Not yet, I'll try to get one.
Comment 17•7 months ago
I just saw that there are also macOS 13 ARM images:
https://github.com/actions/runner-images/blob/main/images/macos/macos-13-arm64-Readme.md
I wonder whether those show the same problem. The current Puppeteer checks all run on macOS 13, so maybe it is a macOS 14 issue.
Comment 18•6 months ago
So this is the cause of the 1s delays:
# Process launch delay (in milliseconds).
- name: dom.ipc.processPrelaunch.delayMs
type: uint32_t
# This number is fairly arbitrary ... the intention is to put off
# launching another app process until the last one has finished
# loading its content, to reduce CPU/memory/IO contention.
value: 1000
mirror: always
You can modify this value if you need to spawn a lot of processes in a short time for tests/etc. For browsing, this avoids slowing down page load in order to spawn processes we may need in the future.
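As a sketch of what that override could look like, assuming a harness that launches Firefox with a dedicated profile (the $PROFILE_DIR path is hypothetical), the delay could be lowered through a user.js preference before the run:

# Lower the prelaunch delay for the test profile; 0 disables the delay entirely.
echo 'user_pref("dom.ipc.processPrelaunch.delayMs", 0);' >> "$PROFILE_DIR/user.js"

(Whether test suites should mask the delay this way is questioned in comment 19.)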
Comment 19•6 months ago
Thank you Randell. What's interesting is that no other test suite actually modifies this preference, and we also didn't notice a problem with this delay ourselves before; it's just that the GitHub macOS ARM workers are so slow. Do you have any idea, or might there already be another bug about such visible delays? Most likely users on such machines would face this problem as well, so maybe there is some inconsistency. Is there anything in the logs we provided that might explain why there is such a delay, or do we need the requested performance profile for that?
Further, I'm not sure we should actually change this value for tests; it would clearly prevent us from finding such situations.
Comment 20•6 months ago
It's rare to need to spawn new processes quickly on the web. In normal operation, we keep 3 processes in a cache available; we refill the cache whenever we get Idle time during the load. Even with sites like CNN or fandom which have 12-20+ processes to render (lots of iframes) I found that even a slow old reference laptop never or rarely ran out of preallocated processes -- because sites don't need those 12+ processes all at the same moment.
We only hit the 1s delay if we're calling to create a new process and we have a process in the action of starting that hasn't finished getting ready yet - starting another process while we're waiting for the previous process to finish starting is generally a good way to delay the already-started process.
That said, if this really is happening in non-test situations, we could either reduce the delay, or we could remember we wanted to spawn a process and start that when (or sooner after) the previous process finishes starting.
If you can figure out why we're spawning processes fast enough to hit this, or get more detailed logs or an rr trace, I could look at this more. If we have any evidence this is happening in the field in any quantity, we certainly could try to minimize it. Perhaps add telemetry for the delay getting hit? I'm not sure we'd want a count, since the assumption is that most machines would never hit it, but some (slow) machines might hit it more often, and a count would help there.
Comment 21•6 months ago
Thanks for all the feedback!
The issue we are having with this test suite is not only a 1s delay: it seems to take 4 to 5 seconds to create the process for a new tab.
In the log I pasted in comment 13, we can see the following 6 lines repeated 4 times (each time with a 1-second gap):
2024-05-13T08:38:31.1948660Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.1949320Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.1950120Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
2024-05-13T08:38:31.3429020Z [Parent 5255: Main Thread]: D/Process Starting process allocate on idle
2024-05-13T08:38:31.3457780Z [Parent 5255: Main Thread]: D/Process Trying to start process now
2024-05-13T08:38:31.3458420Z [Parent 5255: Main Thread]: D/Process Starting delayed process start, delay=1000
I can try to reduce the delay, but I suspect the issue is that we can't start the process and so we keep waiting for 1 second.
You mentioned that we could hit the 1s delay if we are already waiting for a process to finish starting, so I'm guessing we have a process on those machines which is really slow to start and which makes subsequent process creation slow?

> get more detailed logs

Do you know what kind of logs would be helpful here?
Comment 22•6 months ago
Bug 1894695 reports a similar issue and that it started between 124.0a1 and 127.0a1. But I'm not sure if they were really using an ARM worker at the time when it started. So I've requested more information.
Given that this doesn't seem to be WebDriver related but a core issue with process creation I'm going to move this bug to the DOM content processes component.
Comment 24•6 months ago
This issue might actually hit more folks than we thought. As bug 1894695 comment 7 states, GitHub updated the macos-latest definition and it points to the ARM machines now. If users don't hard-code a specific version like macos-12, they will all notice this delay in process creation.
Julian, were you already able to create a Gecko profile of a single affected Puppeteer test? If you run into trouble please let me know and I can try.
Comment 25•6 months ago
Note also that there are apparently no Intel-based macos-14 machines:
- https://www.reddit.com/r/github/comments/1cjiat4/is_github_actions_macoslatest_now_only_armbased/
- https://github.com/actions/runner/issues/3256
So not only do you have to hardcode, you have to hardcode an older version :-(
Comment 26•6 months ago
Barry, if you have an easy way to reproduce this and could run a single test with the following environment variables set for Firefox, I would appreciate it.
- MOZ_PROFILER_STARTUP_FILTERS="*"
- MOZ_PROFILER_STARTUP="1"
- MOZ_PROFILER_SHUTDOWN="%path_to_json_file%"
Please note that the Firefox process needs to be shut down normally, which means you cannot kill it; otherwise the profile will not be written to the given path. Thanks!
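A sketch of a shell step that exports these variables before the test command; the JSON path and test command are placeholders, and the profile is only written if Firefox exits cleanly:

export MOZ_PROFILER_STARTUP="1"
export MOZ_PROFILER_STARTUP_FILTERS="*"
export MOZ_PROFILER_SHUTDOWN="$RUNNER_TEMP/firefox-profile.json"   # only written on a clean shutdown
npm test   # placeholder for the command that runs the Firefox test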
Comment 27•6 months ago
Weirdly I can't repro this on my M1 Mac even though it's ARM64 too!
The only way I've got to reproduce it is with this workflow:
https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.
Comment 28•6 months ago
(In reply to Barry Pollard from comment #27)
> The only way I've got to reproduce it is with this workflow:
> https://github.com/tunetheweb/github-action-test/blob/main/.github/workflows/tests.yml
> You can fork that repo and try it yourself. It runs on every push and saves the profile to the action page ready for you to download. I could attach an example here but it's 49.7 MB.

Thanks a lot for creating such a workflow and the tooling to offer such a profile for download! I've picked up the generated JSON file from this workflow and uploaded it to https://profiler.firefox.com/.
Here is the shared profile: https://share.firefox.dev/4bFO4vT
Randell or Olli, do you see something strange going on in there? Here is a filtered view for process markers where, at least at the beginning, two Blocked preallocation markers can be seen, each lasting 5s, but that's probably not the only cause given that further hangs happen afterwards.
Comment 29•6 months ago
What was done to create that profile? Why are there so many content processes? And why aren't those processes really doing anything?
They are sampled but don't seem to execute much code.
Is the testing framework possibly not sending the urls to load soon enough or what?
I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.
Comment 30•6 months ago
I took a quick peek at the profile and noticed 4619ms of Recv Thread Latency for IPC message, PContent::Msg_FlushFOGData
https://share.firefox.dev/4bKHxjD
But it's not clear where the latency is coming from.
Comment 31•6 months ago
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #29)
> What was done to create that profile?

As usual, the normal profiler environment variables were set.

> Why are there so many content processes? And why aren't those processes really doing anything? They are sampled but don't seem to execute much code.

No idea. Maybe that's the reason why it all runs so slowly? Would some extra logs help? I assume setting MOZ_LOG="Process:5" to enable process logging may give further details as markers.

> Is the testing framework possibly not sending the urls to load soon enough or what?

As we said above, the issue is not reproducible locally but only in GitHub Actions for macOS ARM runners. The only delay the test has is after the navigation, but for testing the overall issue it could potentially be removed.

> I wonder if enabling fission.processSiteNames would make performance profiling a bit more useful.

That could be done by setting it as a Firefox preference when starting the WebDriver session. Barry, would it be possible for you to make these small modifications and also add the other env variable that I pointed out above? Thanks!
Comment 32•6 months ago
I presume this does what you need: https://github.com/tunetheweb/github-action-test/commit/b9a0ee0ae5e43870bfdefa66a985ee11a554925f
In which case the results are here: https://github.com/tunetheweb/github-action-test/actions/runs/9363622641
Comment 33•6 months ago
(In reply to Barry Pollard from comment #32)
> I presume this does what you need: https://github.com/tunetheweb/github-action-test/commit/b9a0ee0ae5e43870bfdefa66a985ee11a554925f

There is a typo in the if condition (browserName === 'fiefox') which caused the required preference not to be added. If you could do a quick follow-up push I can check the results and share the profile. Thanks.
Comment 34•6 months ago
Oops, fixed now. BTW you can fork this repo so you're not stuck waiting for me each time.
https://github.com/tunetheweb/github-action-test/actions/runs/9365810252
Comment 35•6 months ago
Sorry, I was kinda busy this morning, but if needed I could do that, right. Thanks for jumping in and helping! It's appreciated.
Here is one example of a navigation that takes more than 5s because a Blocked preallocation is happening: https://share.firefox.dev/4bJpVop
Looking at the profile overview filtered by the Process marker, it looks like processes launched as immediate get blocked, while normal process launches are fine?
Olli, is that additional information helpful for further investigating the issue?
Comment 36•6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #35)
> Looking at the profile overview filtered by the Process marker, it looks like processes launched as immediate get blocked, while normal process launches are fine?

The content processes that take ~5s to launch are stuck inside nsAppShell::Init() (see https://share.firefox.dev/4bH2rAn) doing "mac things" related to NSColorGetUserAccentColor, so the parent process only gets the notification that the process launched successfully very late. There is a chance that nsAppShell::Init() is doing too many things synchronously that could better be done later?
Updated•6 months ago
Comment 37•6 months ago
Thanks Jens! That opens up the question of whether there is any way to figure out what exactly is causing this delay in NSColorGetUserAccentColor. Stephen, is there any way with normal MOZ_LOG logging to get more details here? I assume that widget:5 might not help, or?
Also, regarding prioritization, I want to note that this issue is seen by everyone using the new macOS ARM runners for GitHub Actions. While for now there is a chance to fall back to older macOS releases (the most recent being macOS 13), that might no longer be possible once those Intel-based runners are no longer supported. Each and every test that uses Firefox will see a drastic increase in test duration. IMHO we should find the underlying issue sooner rather than later to keep giving users a good testing experience; otherwise we might lose test coverage or, as the FCP example shows, get completely wrong test results.
Comment 38•6 months ago
So we got a reply: https://github.com/actions/runner-images/issues/9974#issuecomment-2150347670
In summary:

> Regarding configuration I may say that it's not a private info: runner is a virtual machine based on Veertu Anka with the following set of resources

If we can tell them what might cause this, they offered to work around it in their environment.
Comment 39•6 months ago
I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and then figure out how to get that out of the GitHub worker as an artifact or something.
Comment 40•6 months ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #39)
> I'd recommend trying to get an Instruments profile of this happening on the GitHub worker. You should be able to get this with something like xctrace record --template 'Time Profiler' --all-processes --time-limit 5s and then figure out how to get that out of the GitHub worker as an artifact or something.

I tried to run xctrace as a background task during the job, but it fails to record the trace when Firefox starts to run while the trace is still recording.
https://github.com/juliandescottes/github-action-test/blob/main/.github/workflows/tests.yml#L37
My background task is:
sleep 50 && xctrace record --template 'Time Profiler' --all-processes --time-limit 5s --output /tmp/xctrace.trace &
I play with the sleep delay and time-limit to try to match the window when Firefox runs the tests. If the trace finishes recording before Firefox starts, I manage to record a valid trace (at least judging by its size). But if it overlaps with Firefox running (which is the goal), then the trace fails and only contains a RunIssues.storedata file.
I can attach a failed trace here in case it helps, but maybe I'm just doing it wrong.
Comment 41•6 months ago
Updated•6 months ago
Comment 42•6 months ago
I got an xctrace profile here: https://github.com/jrmuizel/github-action-test/actions/runs/9421839252/
It looks like it wasn't working because xctrace was still writing out the profile when we uploaded it.
I used while pgrep -x "xctrace" > /dev/null; do sleep 2; echo waiting for trace; done to wait for the trace, and killall -SIGINT xctrace to stop xctrace.
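Combining the pieces from comment 40 and this comment, the recording step could look roughly like the sketch below; the sleep value, time limit, and output path are guesses that have to be tuned to the job, as discussed above:

# Start recording in the background around the time Firefox is expected to be running.
(sleep 50 && xctrace record --template 'Time Profiler' --all-processes --time-limit 5s --output /tmp/xctrace.trace) &

# ... run the tests ...

# Stop xctrace if it is still recording, then wait for it to finish writing the trace
# before uploading the .trace bundle as an artifact.
killall -SIGINT xctrace 2>/dev/null || true
while pgrep -x "xctrace" > /dev/null; do sleep 2; echo "waiting for trace"; done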
Comment 43•6 months ago
Markus and I looked at the xctrace profile and nothing stood out as incriminating.
The content process is always waiting 5 seconds, so I think it's more likely that something is timing out rather than waiting for CPU-intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does, we can turn on sandbox logging to see if there's something that looks incriminating.
Updated•6 months ago
Comment 44•6 months ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #43)
> The content process is always waiting 5 seconds, so I think it's more likely that something is timing out rather than waiting for CPU-intensive work to happen. I'd recommend disabling the sandbox to see if that helps. If it does, we can turn on sandbox logging to see if there's something that looks incriminating.

Julian, could you try this suggestion? It would be good to know whether it is related to sandboxing and, if so, what exactly the log output shows us.
Comment 45•6 months ago
Comment 46•6 months ago
Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true? (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)
Comment 47•6 months ago
I wonder if it would be helpful to check the different supported sandbox levels on macOS to see which one actually triggers this delay:
https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468
Comment 48•6 months ago
(In reply to Julian Descottes [:jdescottes] from comment #46)
> Jeff, is there anything else I should do to enable sandbox logging, beyond flipping security.sandbox.logging.enabled to true? (I re-enabled the sandbox when I captured the logs, maybe I should have disabled it again?)

Ah sorry, I meant the macOS logging using the log program, i.e. with sandboxing turned on, using log show.
Comment 49•6 months ago
(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #47)
> I wonder if it would be helpful to check the different supported sandbox levels on macOS to see which one actually triggers this delay:
> https://searchfox.org/mozilla-central/rev/a26e972e97fbec860400d80df625193f4c88d64e/browser/app/profile/firefox.js#1453-1468

Good idea. I tried levels 1 and 2, and it still fails. Only fully disabling the sandbox helps.
Comment 50•6 months ago
This is the output of log show --last 5m. The test starts around 09:01:56, so ~line 24000.
Anything interesting in there?
Comment 51•6 months ago
Haik, does anything in the log look suspicious? Have you seen a 5 second timeout during startup before?
Comment 52•6 months ago
Grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.
I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much of the stuff from the logs as possible to see if that fixes it. If it does, you should be able to narrow down what exactly is being blocked that's causing the problem.
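A sketch of pulling those violations out of the unified log on the runner, combining the log show invocation from comment 50 with the pattern above (the 5-minute window is arbitrary):

# Keep only the sandbox violations reported for Firefox's plugin-container processes.
log show --last 5m | grep -E 'kernel: \(Sandbox.*plugin' > sandbox-denials.txt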
Comment 53•6 months ago
(Here is the Firefox profile corresponding to those logs: https://share.firefox.dev/4be0Twu)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #52)
> Grepping for "kernel: \(Sandbox.*plugin" shows all the violations we're making.
> I'd recommend adjusting https://searchfox.org/mozilla-central/source/security/sandbox/mac/SandboxPolicyContent.h#14 to allow as much of the stuff from the logs as possible to see if that fixes it. If it does, you should be able to narrow down what exactly is being blocked that's causing the problem.

Thanks. I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?
Comment 54•6 months ago
(In reply to Julian Descottes [:jdescottes] from comment #53)
> Thanks. I assume we need to build a custom Firefox for this? Or is this something that can be driven via prefs/env variables?

I think you'll need to build a custom Firefox, but maybe Haik knows better.
Assignee
Comment 55•5 months ago
Yes, you need to rebuild to make sandbox policy changes, with the exception of a few things that are parameterized.
One thing that can be set via prefs is the pref security.sandbox.content.mac.disconnect-windowserver: setting it to false with the sandbox enabled would tell us if this is related to not having a WindowServer connection. Edit: looking at the profile, the nsAppShell::Init() call is stuck in NSColorGetUserAccentColor, which might be trying to connect to the WindowServer or do something display-related.
Since setting security.sandbox.content.level to 1 did not avoid the problem, it's probably not caused by missing permission to read the filesystem. I suspect it's related to trying to connect to a mach service the process doesn't normally connect to in a non-virtualized environment, or one that has a timeout in this environment. These are logged with messages like
plugin-container(3912) deny(1) mach-lookup com.apple.lsd.modifydb
After the WindowServer test, the next step I would take is to look for the messages about denied connections to a service (com.apple.lsd.modifydb in the above example) and allow that service in the content process policy with an allow rule by adding it to SandboxPolicyContent.h here, making sure to get the parentheses right. We might be able to find a single service connection attempt that is causing the delay.
Do we have a method for running this VM type on a developer Mac?
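Building on the grep from comment 52, a hedged sketch for listing which mach services are being denied (so that candidate allow rules can be tried in SandboxPolicyContent.h); it assumes the service name is the last token of each deny line, as in the example above:

# Count the distinct mach services that plugin-container was denied access to.
log show --last 5m | grep -E 'plugin-container.*deny\(1\) mach-lookup' \
  | awk '{print $NF}' | sort | uniq -c | sort -rn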
Comment 56•5 months ago
Same issue with security.sandbox.content.mac.disconnect-windowserver set to false: https://share.firefox.dev/3Vtc5zE
I will attach the logs.

> Do we have a method for running this VM type on a developer Mac?

I didn't find the images for the VMs, but I asked on https://github.com/actions/runner-images/issues/9974 (maybe I should file a new issue since this one has been closed for some time).
Comment 57•5 months ago
Assignee
Comment 58•5 months ago
I was able to reproduce locally on a macOS (12 and also 15 Beta) VM. I'm working on debugging the delay.
Assignee
Comment 59•5 months ago
After doing some trial-and-error testing on the macOS 15 Beta VM, I found that adding allowances in the content process sandbox for the following iokit properties avoids the 5-second delay in nsAppShell::Init(). I'll get a patch posted to add these after doing some more testing. I'll try to find out more about the security impact of these properties.
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3756) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:product-id
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:housing-color
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:IORegistryEntryPropertyKeys
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOPlatformDevice property:ean-storage-present
plugin-container(3757) deny(1) iokit-get-properties iokit-class:IOService property:syscfg-v2-data
Assignee
Comment 60•5 months ago
Add access to additional iokit properties in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.
Assignee
Comment 61•5 months ago
After some more debugging, I've found that all the properties listed in comment 59 (product-id, housing-color, IORegistryEntryPropertyKeys, ean-storage-present, and syscfg-v2-data) are read within the call to [NSApplication init] from nsAppShell::Init(). AppKit'__NSUserAccentColorGetHardwareAccentColorName_block_invoke() is the lowest AppKit function on the stack; I set breakpoints on IOKit'IORegistryEntryCreateCFProperty to catch this. In the VM, the only properties actually present (checked by running ioreg) are product-id and ean-storage-present. The Safari renderer sandbox also allows reads of all of these properties.
Updated•5 months ago
Comment 62•5 months ago
Comment 63•5 months ago
bugherder
Comment 64•5 months ago
Julian, would you mind testing in GitHub CI if the issue is fixed for Puppeteer and the other test repository? Thanks.
Comment 65•5 months ago
Unfortunately, it doesn't seem to help on the puppeteer issue https://github.com/puppeteer/puppeteer/actions/runs/9640423521/job/26584204278?pr=12328
For the github-action-test example, I'm trying to make sure that we use the latest Nightly, but it doesn't seem to help either.
We still seem blocked on the same call to __NSUserAccentColorGetHardwareAccentColorName_block_invoke: https://share.firefox.dev/4eCgfOf
Updated•5 months ago
Assignee
Comment 66•5 months ago
@Julian, would you be able to get me logs again like in comment 50? Could you also report the build ID so we can make sure the fix is there?
I'll retest on the VM I'm using. I was testing with a QEMU-based VM using the macOS 15 beta.
Comment 67•5 months ago
Updated•5 months ago
Assignee
Comment 68•5 months ago
@Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?
Assignee
Comment 69•5 months ago
After switching to a macOS 14.5 VM, I can reproduce the delay even with the first fix applied. I'll request a backout and re-land a complete fix.
Comment 70•5 months ago
Backed out at request of @haik
Backout link: https://hg.mozilla.org/integration/autoland/rev/54b564576bfecdde005d9757fac990ba4a9813b6
Comment 71•5 months ago
(In reply to Haik Aftandilian [:haik] from comment #68)
> @Julian, thanks. Would you be able to test a build if I provide a link to a .dmg?

I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce it now, I guess we will reland the complete fix and test with Nightly?
Assignee
Comment 72•5 months ago
Add access to additional iokit properties and sysctls in the content process sandbox to avoid a 5 second delay in [NSApplication init] when running in VMs.
Comment 73•5 months ago
Comment 74•5 months ago
bugherder
Updated•5 months ago
Assignee
Comment 75•5 months ago
(In reply to Julian Descottes [:jdescottes] from comment #71)
> I don't have anything set up to do that for now, but I imagine it's doable? Although since you can reproduce it now, I guess we will reland the complete fix and test with Nightly?

Ideally we would verify the fix on the GitHub runner first, but given that this seems difficult and the behavior of my 14.5 VM appears to match the logs from the GitHub runner, I landed the updated fix.
Comment 76•5 months ago
And I can confirm the issue is fixed on my test repository.
Thanks a lot for the help here Haik!
Assignee
Comment 77•5 months ago
(In reply to Julian Descottes [:jdescottes] from comment #76)
> And I can confirm the issue is fixed on my test repository.
> Thanks a lot for the help here Haik!

Awesome, thanks! Kudos to everyone who helped debug this down to the sandboxing level.
Assignee
Comment 78•5 months ago
I have not been able to reproduce the problem with other process sandboxes (socket, RDD, GMP, and utility) in the 14.5 VM.