Closed Bug 1967525 Opened 5 months ago Closed 2 months ago

Mac workers (since macOS 14.7.1) show a 500ms to 1.5s delay for web content process creation due to sandbox restrictions

Categories

(Core :: Security: Process Sandboxing, defect, P1)

Tracking


RESOLVED FIXED

People

(Reporter: whimboo, Assigned: haik)

References

(Depends on 1 open bug, Blocks 11 open bugs)

Details

(Keywords: perf)

Attachments

(3 files)

While investigating test timeouts for WebDriver tests, we noticed that each and every command which causes a new content process to be created shows a delay of about 1s before the content process is ready. Here are two profiles as recorded by Julian:

(In reply to Haik Aftandilian [:haik] from bug 1960740 comment #12)

For the profile with the timeout, I can see content processes taking ~1 second in nsAppShell::Init in AppKit code. I'll try and reproduce this on a macOS 14 system to investigate.

If it is caused by the sandbox (which seems likely), disabling the content sandbox by setting the pref security.sandbox.content.level=0 should avoid the problem.

I disabled the sandbox to test that proposal, and it indeed fixed the issue:
https://treeherder.mozilla.org/jobs?repo=try&revision=eb9ff40260bff73150407731a026e70e0e6fe682
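For reference, here is a minimal sketch of how the content sandbox can be disabled for a local test profile; the profile path is hypothetical, and in the try push above the pref was set through the harness configuration instead:

  # Append the pref to the test profile's user.js (hypothetical path).
  echo 'user_pref("security.sandbox.content.level", 0);' >> /path/to/test-profile/user.js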

So there is something wrong with the macOS workers in our CI since they were updated from 14.7.0 to 14.7.1 (we are now on 14.7.5).

I had a look at the release notes for this security release and there are a couple of sandbox-related fixes for several applications, but I'm not sure whether this might cause the trouble for us, in case there were changes to the underlying core code as well: https://support.apple.com/en-us/121570

There is also bug 1964338 which covers the very slow startup of those workers which now takes ~4.5s instead of ~1.5s each time we start Firefox. Parts of it might be related to this bug - especially the creation of the very first web content process.

Haik is actually taking a look at it.

Flags: needinfo?(haftandilian)

In testing Ubuntu 24.04 there are a lot of timeout-on-startup failures, and disabling the sandbox seemed to cut them in half.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #2)

In testing Ubuntu 24.04 there are a lot of timeout-on-startup failures, and disabling the sandbox seemed to cut them in half.

Joel, could you file a separate bug for that? The Mac and Linux sandboxes are different implementations.

Flags: needinfo?(jmaher)
Assignee: nobody → haftandilian
Flags: needinfo?(haftandilian)

Filing this as S3 on the assumption it's not a general, overall performance regression (which I'd hope Telemetry or other tests would've picked up).

Severity: -- → S3
Blocks: 1968471

Hi Ryan, we would need to investigate this issue live on a current mac worker running a 14.7.5 release of macOS. Could you disconnect one from production and make it available for me and Haik? I would have to set up the machine so that Haik can easily use it for reproducing this particular slowdown. I hope it is possible to share it with both of us. Please also ensure that the automatic restart is disabled (AFAIR it happened via crontab?). Thanks.

Flags: needinfo?(rcurran)

https://mozilla-hub.atlassian.net/browse/RELOPS-1547e

Is there a specific worker you want me to target in that prod pool, :whimboo, or will any (that meets the specs) work?

Thanks

Flags: needinfo?(rcurran)
Blocks: 1968866
Flags: needinfo?(hskupin)

Sorry for not replying earlier, but you edited your comment after submission so I didn't see the extra question in the email that I got from Bugzilla. I think it is fine to pick any of the workers. Please note that I'm off for the next few days and will be back on Monday.

Flags: needinfo?(hskupin) → needinfo?(rcurran)
Flags: needinfo?(rcurran)
Blocks: 1827331
Blocks: 1960072

Note that I'm not able to prepare the machine correctly because there is still some kind of automatic restart every 6 hours in place. More details can be found in the JIRA ticket. Ryan, can you please take a look?

Flags: needinfo?(rcurran)

Also something more that I noticed:

  • ~/Library/LaunchAgents/ has a lot of *.moz-taskchd-test-* entries left over from previously run tests. These files contain an interval of 1800s for firefox. Not sure what they are used for. Maybe we should clean up those files as well (a quick way to inspect and remove them is sketched below)?
  • In /Library/LaunchDaemons/ we keep the PuppetAgent alive. Could this maybe cause the machine to get rebooted? Maybe we need to stop that daemon when disconnecting a machine to be used for investigations?
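A rough sketch of how those leftover agents could be inspected and removed from a loaner; the glob is based on the naming seen above, and the unload step assumes the agents are still loaded for the cltbld user, so double-check what it matches before deleting anything:

  # List the leftover per-task launch agents.
  ls ~/Library/LaunchAgents/*.moz-taskchd-test-*

  # Unload and remove them for the current user.
  for plist in ~/Library/LaunchAgents/*.moz-taskchd-test-*; do
    launchctl unload "$plist" 2>/dev/null
    rm "$plist"
  done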

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #9)

Note that I'm not able to prepare the machine correctly because there is still some kind of automatic restart every 6 hours in place. More details can be found in the JIRA ticket. Ryan, can you please take a look?

I was able to solve the problem by removing the reboot call from /usr/local/bin/worker-runner.sh, which seems to still be running even though the crontab job has been removed.

Let me now try to see if I can reproduce the issue.

I haven't been able to reproduce this locally on macOS 14.6.7 or macOS 15.5 so I suspect I'll need access to the machine in CI to narrow down what is causing the slow content process startup.

Priority: -- → P1

So far I was also not able to see it on that loaner. Maybe because I had VNC open? I then wanted to try running the tests and disconnecting from VNC immediately, but when I logged in I saw Mochitests running. Also, the machine had been restarted 4 minutes earlier.

Ryan, could it be that the machine was moved out of quarantine?

It looks like there is a Talos test which clearly shows the regression in content process startup times.

What's interesting is that there seem to be two different lines. The baseline with most of the data shows a slow climb since around the end of April, while some data is quite far off and shows the ~1s startup time for the content process. Not sure where those values actually come from and why they differ that much.

I un-quarantined it by mistake yesterday. I have quarantined the worker again and will reach out to :whimboo once the machine is ready to test. Sorry about that :)

Flags: needinfo?(rcurran)

This Talos graph, which also includes the previous macOS 10.15 data, shows it even better.

Joel, do you remember or know which changes we made to 14.7.x around January 8th and also around the end of April? It would be good to know what actually caused these two lines.

FYI I'm also concerned about the cold-page-load browsertime test results, which are growing as well, as this dashboard shows:
https://arewefastyet.com/mac1470/cold-page-load/overview?numDays=200

Maybe these regressions are related to the slow content process startup and will hopefully go away once we are able to fix this bug.

There were no specific changes to 14.70 in January, but we did finish migrating in early January. My theory is that the OS cache etc. grows over time, so after a few weeks of running we started to see a difference.

End of April - was this the 14.7.5 stuff? That would have been an upgrade, but we also started working in the DC then, specifically to put machines on new KVM cables. This allowed us to turn on the checks that require proper resolution/refresh_rate. A handful of machines were fully reimaged as well (instead of upgraded).

Flags: needinfo?(jmaher)

Is there an easy way to query for the Talos data that spikes out for these tests? Filtering for content process startup taking more than 600ms might be a good starting point. Then we would know which machines are actually affected and could draw conclusions.

Flags: needinfo?(jmaher)

Given that a lot of machines only show a small increase (but still 50%!), it would be good to investigate the issue on a machine that had a huge spike recently. Ryan, it would be great if you could quarantine and prepare macmini-r8-62 for us. Thanks.

Flags: needinfo?(rcurran)

Maybe a raw query on Redash, but I am not too familiar with the current tests/variants and DB structure. Maybe :sparky knows a way.

Flags: needinfo?(jmaher) → needinfo?(gmierz2)

Here's a query of all the machines that have produced bad data (>600 value) in the last 3 weeks: https://sql.telemetry.mozilla.org/queries/108617/source

Here's a more specific query for the machines per repo: https://sql.telemetry.mozilla.org/queries/108616/source
I used this to sanity check if the machines have ever produced values under 600 before. I did a small sample of them, and all of them only produced values that were greater than 1000 (normal is around 300).

Here's the list of the problematic machines:

macmini-r8-25
macmini-r8-27
macmini-r8-28
macmini-r8-29
macmini-r8-30
macmini-r8-31
macmini-r8-32
macmini-r8-33
macmini-r8-34
macmini-r8-35
macmini-r8-36
macmini-r8-37
macmini-r8-38
macmini-r8-39
macmini-r8-40
macmini-r8-44
macmini-r8-45
macmini-r8-46
macmini-r8-48
macmini-r8-49
macmini-r8-51
macmini-r8-59
macmini-r8-61
macmini-r8-62
macmini-r8-64
macmini-r8-66
macmini-r8-67
macmini-r8-68
macmini-r8-69
Flags: needinfo?(gmierz2)
See Also: → 1970956

Thanks Greg! Joel, are those maybe the machines that have the new KVM switches? I assume they don't have a different name and as such cannot be identified from the task details?

Flags: needinfo?(jmaher)

redirect to :rcurran as he had the master list of work

Flags: needinfo?(jmaher)

All of the machines in :sparky's list are KVM-connected.

Flags: needinfo?(rcurran)

FYI I'm not able to reproduce the issue on macmini-r8-62, which stood out as affected in the above list. There is no delay in content process creation that would cause tests to time out.

Haik, maybe you can give exact steps for what to do to get a sandbox log? I could try to modify the wdspec jobs to create all the necessary data and upload it as artifacts. Maybe that way we can get it reproduced.

Flags: needinfo?(haftandilian)

When I check the profiles from comment 0 again, I can see that for the working one we have only about 180ms of IPC in/out when the process gets created, while for the affected cases it takes nearly 800ms.

@Henrik,

I'm going to attach a patch that logs how long it takes to initialize nsAppShell, which is where we've seen the delay. This might help recognize from the terminal output when the problem is occurring while running the test.

Set the pref security.sandbox.logging.enabled to true to trigger sandbox violation logging. This does not cause logging to go to stdout or stderr. It causes sandbox violation logging to go to the system log.

To capture the logs, you can use the macOS Console app (/System/Applications/Utilities/Console.app) or the log command in the terminal. Start streaming before running the test and then stop it afterwards. Here's a log command I used recently to log sandbox violations. I've seen cases where violations trigger log entries not conforming to this pattern so it is not guaranteed, but works most of the time. Look for violations triggered by plugin-container.

  • log stream --level debug --predicate "process=='kernel' && eventMessage contains 'deny'" &> log-output.txt

To capture everything with the log command (which will generate about ~10MB a minute), use this and then grep through it.

  • $ log stream --level debug &> log-output.txt.
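As a follow-up, the full capture can then be filtered after the fact; this is just a sketch and the pattern may need adjusting to the actual violation messages:

  # Pull out denial messages and focus on the content process executable.
  grep -i 'deny' log-output.txt | grep 'plugin-container' > violations.txt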
Flags: needinfo?(haftandilian)

Thanks Haik! As a first step I've triggered a try build with your patch so that we can easily see which test is still affected the most. That way I could limit the next try builds to just this sub-folder of wdspec tests and hope it reproduces, which would allow us to not log that much sandbox data and to upload the sandbox log as an artifact.

https://treeherder.mozilla.org/jobs?repo=try&revision=b445cb2f2319c0e75b3666d70cdfb5369d3fbd77

Finally the tests ran on macOS, and they show that the time needed for creating the appshell for a content process varies a lot! See:
https://treeherder.mozilla.org/logviewer?job_id=512973024&repo=try&lineNumber=9443 and filter by nsAppShell init time:

[task 2025-06-13T19:51:22.168+00:00] 19:51:22     INFO - PID 1026 | XXX pid: 1030, nsAppShell init time: 3 milliseconds
[task 2025-06-13T19:51:22.678+00:00] 19:51:22     INFO - PID 1026 | XXX pid: 1042, nsAppShell init time: 0 milliseconds
[task 2025-06-13T19:51:23.367+00:00] 19:51:23     INFO - PID 1026 | XXX pid: 1044, nsAppShell init time: 0 milliseconds
[task 2025-06-13T19:51:24.480+00:00] 19:51:24     INFO - PID 1026 | XXX pid: 1043, nsAppShell init time: 1141 milliseconds
[task 2025-06-13T19:51:24.731+00:00] 19:51:24     INFO - PID 1026 | XXX pid: 1045, nsAppShell init time: 1183 milliseconds
[task 2025-06-13T19:51:26.447+00:00] 19:51:26     INFO - PID 1026 | XXX pid: 1053, nsAppShell init time: 118 milliseconds
[task 2025-06-13T19:51:27.471+00:00] 19:51:27     INFO - PID 1026 | XXX pid: 1054, nsAppShell init time: 1106 milliseconds
[task 2025-06-13T19:51:27.906+00:00] 19:51:27     INFO - PID 1026 | XXX pid: 1055, nsAppShell init time: 1148 milliseconds
[task 2025-06-13T19:51:27.907+00:00] 19:51:27     INFO - PID 1026 | XXX pid: 1056, nsAppShell init time: 1144 milliseconds
[task 2025-06-13T19:51:29.143+00:00] 19:51:29     INFO - PID 1026 | XXX pid: 1057, nsAppShell init time: 1026 milliseconds
[task 2025-06-13T19:51:31.709+00:00] 19:51:31     INFO - PID 1058 | XXX pid: 1062, nsAppShell init time: 5 milliseconds
[task 2025-06-13T19:51:31.986+00:00] 19:51:31     INFO - PID 1058 | XXX pid: 1065, nsAppShell init time: 0 milliseconds
[task 2025-06-13T19:51:32.790+00:00] 19:51:32     INFO - PID 1058 | XXX pid: 1067, nsAppShell init time: 0 milliseconds
[task 2025-06-13T19:51:33.829+00:00] 19:51:33     INFO - PID 1058 | XXX pid: 1066, nsAppShell init time: 1067 milliseconds
[task 2025-06-13T19:51:34.859+00:00] 19:51:34     INFO - PID 1058 | XXX pid: 1069, nsAppShell init time: 1044 milliseconds
[task 2025-06-13T19:51:35.091+00:00] 19:51:35     INFO - PID 1058 | XXX pid: 1071, nsAppShell init time: 129 milliseconds

Interesting here is that after each restart of Firefox the values for the first 3 created content processes are always low, while afterward they sit at ~1s for a while; in between they also fall back to smaller values for 3 processes in a row before rising again.

I assume this is related to dom.ipc.processPrelaunch.fission.number, which is set to 3? And there is also dom.ipc.processPrelaunch.delayMs with 1000ms. Aren't we able to pre-launch fast enough to keep up with tests that open new tabs/windows in very quick succession?
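To rule out prelaunch exhaustion, here is a minimal sketch of prefs one could add to the test profile; the values and the profile path are hypothetical and only meant to make the prelaunch pool larger and start immediately:

  # Hypothetical values: keep more processes prelaunched and start prelaunching right away.
  echo 'user_pref("dom.ipc.processPrelaunch.fission.number", 10);' >> /path/to/test-profile/user.js
  echo 'user_pref("dom.ipc.processPrelaunch.delayMs", 0);' >> /path/to/test-profile/user.js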

Haik, would it still be important to check the sandbox log, or may this be a red herring with the issue being somewhere else?

Flags: needinfo?(haftandilian)

@Henrik,

Haik, would it still be important to check the sandbox log, or may this be a red herring with the issue being somewhere else?

We need more information to be able to tell.

Let's adjust the patch to print out the process type. For example, content processes and utility processes both use the plugin-container executable, but they have different sandbox policies. I'm assuming none of the other logging we have collects command line arguments. I'm wondering what the slow processes have in common, such as whether they are a specific process type. Would you like me to update the patch?

Another test would be to disable sandboxing for all process types and see if the slow init time still occurs.

Did you have any luck running the test on the machine that is provisioned to us?

Flags: needinfo?(haftandilian) → needinfo?(hskupin)

(In reply to Haik Aftandilian [:haik] from comment #32)

Let's adjust the patch to print out the process type. For example, content processes and utility processes both use the plugin-container executable, but they have different sandbox policies. I'm assuming none of the other logging we have collects command line arguments. I'm wondering what the slow processes have in common, such as whether they are a specific process type. Would you like me to update the patch?

I've updated the patch and pushed again to try:
https://treeherder.mozilla.org/jobs?repo=try&revision=0eeaaf0c59fb5119639f54527994a60b038f237e

It only prints the numeric value but not the process name (not sure how to convert it), so here is the reference for process types:
https://searchfox.org/mozilla-central/source/__GENERATED__/xpcom/build/GeckoProcessTypes.h

Another test would be to disable sandboxing for all process types and see if the slow init time still occurs.

That is what I initially did and it solved the problem (see my initial comment and the referenced try link).

Did you have any luck running the test on the machine that is provisioned to us?

The tests are running, but even though process creation was slow in CI I do not see this issue when running the tests via VNC. Maybe some cache is affecting it, which was wiped out after the tests were run. Nevertheless I've now started another run on the machine with the improved log lines. Let's see if something stands out.

Flags: needinfo?(hskupin)

Haik, so it's clearly the web content process creation that is affected here:
https://treeherder.mozilla.org/logviewer?job_id=513332833&repo=try&lineNumber=309067

[task 2025-06-16T09:12:33.585+00:00] 09:12:33     INFO - PID 1057 | XXX pid: 1061, type: 0, nsAppShell init time: 5 milliseconds
[task 2025-06-16T09:12:33.957+00:00] 09:12:33     INFO - PID 1057 | XXX pid: 1072, type: 8, nsAppShell init time: 0 milliseconds
[task 2025-06-16T09:12:34.595+00:00] 09:12:34     INFO - PID 1057 | XXX pid: 1074, type: 7, nsAppShell init time: 0 milliseconds
[task 2025-06-16T09:12:34.975+00:00] 09:12:34     INFO - PID 1057 | XXX pid: 1073, type: 2, nsAppShell init time: 410 milliseconds
[task 2025-06-16T09:12:35.236+00:00] 09:12:35     INFO - PID 1057 | XXX pid: 1075, type: 2, nsAppShell init time: 437 milliseconds
[task 2025-06-16T09:12:36.902+00:00] 09:12:36     INFO - PID 1057 | XXX pid: 1079, type: 11, nsAppShell init time: 151 milliseconds
[task 2025-06-16T09:12:37.461+00:00] 09:12:37     INFO - PID 1057 | XXX pid: 1080, type: 2, nsAppShell init time: 461 milliseconds
[task 2025-06-16T09:12:37.462+00:00] 09:12:37     INFO - PID 1057 | XXX pid: 1081, type: 2, nsAppShell init time: 501 milliseconds
[task 2025-06-16T09:12:37.462+00:00] 09:12:37     INFO - PID 1057 | XXX pid: 1082, type: 2, nsAppShell init time: 492 milliseconds
[task 2025-06-16T09:12:38.898+00:00] 09:12:38     INFO - PID 1057 | XXX pid: 1083, type: 0, nsAppShell init time: 5 milliseconds
[task 2025-06-16T09:12:39.148+00:00] 09:12:39     INFO - PID 1057 | XXX pid: 1085, type: 8, nsAppShell init time: 0 milliseconds
[task 2025-06-16T09:12:39.848+00:00] 09:12:39     INFO - PID 1057 | XXX pid: 1092, type: 7, nsAppShell init time: 0 milliseconds
[task 2025-06-16T09:12:40.201+00:00] 09:12:40     INFO - PID 1057 | XXX pid: 1087, type: 2, nsAppShell init time: 381 milliseconds
[task 2025-06-16T09:12:40.871+00:00] 09:12:40     INFO - PID 1057 | XXX pid: 1094, type: 2, nsAppShell init time: 371 milliseconds
[task 2025-06-16T09:12:40.927+00:00] 09:12:40     INFO - PID 1057 | XXX pid: 1095, type: 2, nsAppShell init time: 369 milliseconds
[task 2025-06-16T09:12:41.270+00:00] 09:12:41     INFO - PID 1057 | XXX pid: 1096, type: 11, nsAppShell init time: 119 milliseconds

The utility process (11) also takes a bit longer to create, but not as long as the web content process (2). The above numbers are from a machine that is only partially affected.

I think what we have to do here now is to get a machine disconnected immediately while tests are running and long delays can be observed. Then any restart logic needs to be removed so that we can be sure to have it in a state where no clean-up has been done. Given that it is kinda hard to get a mac worker for tests these days (long wait times), that's not easy to do.

Joel and Ryan, what best could we do to get this accomplished?

Flags: needinfo?(rcurran)
Flags: needinfo?(jmaher)

For now I pushed 3 try builds, with sandbox violation logging enabled for all of them:

  • Sandbox content level 2 (try)
  • Sandbox content level 1 (try)
  • Sandbox content level 3 (try)

Let's see if we can spot a pattern across those different values.

Otherwise I hope that once we have the entries in the system log we do not clean them up when restarting the machine, so that we are able to look at them even after the machine was used for other jobs.

I am not aware of any trickery to make this work. I believe if we could do this it would require:

  • detecting when this happens (in close to real time)
  • Ryan logging in and killing the taskcluster worker (and probably other things to prevent restart)

The tricky part is detecting when this happens. Possibly one could hack up a task to detect this and then "sleep for 4 hours"; in addition, that same hack would need to adjust the maximum task runtimes and harness runtimes. Finally we would need some method to detect this and alert Ryan.

There might be other complications with getting access to the machine (credentials, etc.); I am not sure if those require a reboot.


Alternatively, you could have a couple of machines in a pool that you have access to and run the jobs as you would in the above-mentioned try push; then you could monitor your try push and remote desktop to the machine if you determine it is in this bad state.

Flags: needinfo?(jmaher)

Ok, thank you for the feedback. Then I would suggest that we first check whether we keep the system log entries that my currently still-waiting try test jobs will produce on those machines. If that works, we could safely ignore all the complexity that you laid out in your last comment.

Both sandbox levels 1 and 2 also show slow web content process creation times. So the only solution would be to turn the sandbox off completely. With the jobs running for level 3 I identified macmini-r8-49 as a very slow worker with ~1s process creation time.

Ryan, can you please give me access to that machine and, if possible, make sure that no restart or further cleanup will happen? I would like to check if I can get the requested sandbox logs from the syslog. Thanks!

There are no logs anymore on macmini-r8-49, so I handed it back. Currently I don't think it is possible to get access to such a worker while the tests are running, given the unknown wait time until mac workers finally grab such a job and the need for someone to be promptly available to hand out access for me on that machine.

As such I'll first try running the log stream command Haik mentioned above while some wdspec tests are running and store the output in an artifact file to get uploaded. Let's see if I can get something tomorrow.

Flags: needinfo?(rcurran)
Blocks: 1973368

(In reply to Haik Aftandilian [:haik] from comment #28)

  • log stream --level debug --predicate "process=='kernel' && eventMessage contains 'deny'" &> log-output.txt

To capture everything with the log command (which will generate about ~10MB a minute), use this and then grep through it.

  • $ log stream --level debug &> log-output.txt.

Haik, streaming is not that easy. It would be better for me to actually post-process the logs. Do you know / see any side-effects or maybe missed entries with the following command?

log show --predicate "process == 'kernel' && eventMessage contains 'deny'" \
         --info --debug \
         --start <start_time> \
         --end <end_time> \
         > log-output.txt

If not I would finally try to add this command at the end of the wdspec job to upload an artifact.
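A rough sketch of what that post-processing step at the end of the job could look like; the artifact path and the way the start time is captured are assumptions:

  # At the start of the job, remember when we began.
  START_TIME="$(date '+%Y-%m-%d %H:%M:%S')"

  # ... run the wdspec tests ...

  # At the end of the job, dump kernel sandbox denials for the test window into an artifact.
  log show --predicate "process == 'kernel' && eventMessage contains 'deny'" \
           --info --debug \
           --start "$START_TIME" \
           --end "$(date '+%Y-%m-%d %H:%M:%S')" \
           > artifacts/sandbox-log.txt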

Flags: needinfo?(haftandilian)

Let's see if the following try build reveals something - we most likely have to wait until tomorrow to see any test result.

https://treeherder.mozilla.org/jobs?repo=try&revision=e878a150895af488d36e258b6ba19019f24a5fdd

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #40)

(In reply to Haik Aftandilian [:haik] from comment #28)

  • log stream --level debug --predicate "process=='kernel' && eventMessage contains 'deny'" &> log-output.txt

To capture everything with the log command (which will generate about ~10MB a minute), use this and then grep through it.

  • $ log stream --level debug &> log-output.txt.

Haik, streaming is not that easy. It would be better for me to actually post-process the logs. Do you know / see any side-effects or maybe missed entries with the following command?

log show --predicate "process == 'kernel' && eventMessage contains 'deny'" \
         --info --debug \
         --start <start_time> \
         --end <end_time> \
         > log-output.txt

If not I would finally try to add this command at the end of the wdspec job to upload an artifact.

I don't know for sure, but I would try with that log command.

Based on your comment that you see the problem even with level 1 of the sandbox, that's an indication that the problem is not filesystem related. It could be related to a service that is denied in the sandbox, which usually means a mach-lookup operation that is denied.

It could also be caused by the sandbox denying access to a sysctl. We can test that by keeping the sandbox level as the default and adding the line (allow sysctl-read sysctl-write) to the policy.

Flags: needinfo?(haftandilian)

Good idea. Let's do that. The other try build from above isn't done yet because we have been waiting for a worker for nearly 24h! As such let's push a new try build with those additional policy items so we can see it sooner rather than later.

https://treeherder.mozilla.org/jobs?repo=try&revision=a5061d25aa56c2d2fee119f0ccf93612a0fc3ca2

Oh, and note that since the end of May the mean value for our content process creation time has increased by another 200ms to 600ms now, with some entries reaching a maximum of 1.6s.

Considering how many tests we run and how many tabs are created or pages navigated that might cause new content processes, this could drastically increase the test runtime - and maybe be causing this huge backlog on mac workers for try in CI (see bug 1974346).

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #41)

Lets see if the following try build reveals something - we most likely have to wait until tomorrow to see any test result.

https://treeherder.mozilla.org/jobs?repo=try&revision=e878a150895af488d36e258b6ba19019f24a5fdd

Sadly the sandbox log is empty. Most likely this happened because the cltbld user that we are running the tests as doesn't have the required privileges to actually access the logs. Let's wait for the other try build with the sysctl policy addition to see if that shows a difference.

Haik, I assume that we would still need the sandbox logs to exactly determine where it is coming from?

Ryan, on a default worker without any further changes to user permissions, would I be able to access the system logs with the cltbld user by using sudo? I assume probably not.

Flags: needinfo?(rcurran)
Flags: needinfo?(haftandilian)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #45)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #41)

Lets see if the following try build reveals something - we most likely have to wait until tomorrow to see any test result.

https://treeherder.mozilla.org/jobs?repo=try&revision=e878a150895af488d36e258b6ba19019f24a5fdd

Sadly the sandbox log is empty. Most likely this happened because the cltbld user that we are running the tests as doesn't have the required privileges to actually access the logs. Let's wait for the other try build with the sysctl policy addition to see if that shows a difference.

Haik, I assume that we would still need the sandbox logs to exactly determine where it is coming from?

If adding (allow sysctl-read sysctl-write) to the content sandbox avoids the problem for content processes, that narrows down the problem significantly. Yes, then we would want to revert that change and look at the sandbox logs focusing on sysctl-related errors.

Flags: needinfo?(haftandilian)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #43)

Good idea. Lets do that. The other try build from above isn't done yet because we are still waiting for a worker nearly 24h! As such lets push a new try build with those additional policy items so we can see it sooner than later.

https://treeherder.mozilla.org/jobs?repo=try&revision=a5061d25aa56c2d2fee119f0ccf93612a0fc3ca2

This try build is actually invalid because I accidentally pushed it by using an artifact build. I've triggered a new try build with the sysctl-read and sysctl-write addition to see if that makes a difference:

https://treeherder.mozilla.org/jobs?repo=try&revision=91e1eba7cb504f536b22fee01990af122062516b

:whimboo as-is, the cltbld account is a non-admin user, so sudo'ing from that account will not work.

Flags: needinfo?(rcurran)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #47)

This try build is actually invalid because I accidentally pushed it by using an artifact build. I've triggered a new try build with the sysctl-read and sysctl-write addition to see if that makes a difference:

https://treeherder.mozilla.org/jobs?repo=try&revision=91e1eba7cb504f536b22fee01990af122062516b

Unless I did something wrong with the sysctl patch, I do not see any improvement. The content process creation times are still around 500ms for those jobs - we probably were lucky not to hit a machine that is super slow.

(In reply to Ryan Curran from comment #48)

:whimboo as-is, the cltbld account is a non-admin user, so sudo'ing from that account will not work.

This is sad. So we are probably blocked on getting a worker handed out for investigation. I have now filed a JIRA ticket to handle that request.

Flags: needinfo?(haftandilian)

The wdspec jobs only run for about 28 minutes, which makes it hard to prepare the machine in time, especially when we do not know when the job actually gets started due to the huge backlog on 14.70. As such I've pushed a new try build which uses a single chunk only and will therefore run 5 times as long. Further, I have bumped the job priority to very-high so that it gets processed early compared to other try builds already waiting in the queue.

https://treeherder.mozilla.org/jobs?repo=try&revision=4f9ac597cbcacfe1fe72c5634bd50e35dafaed0c

Priorities higher than low actually cannot be used due to permission restrictions. As such I pushed again, keeping low priority and a single wdspec job:

https://treeherder.mozilla.org/jobs?repo=try&revision=1ec2c5851aa5711b6177d017bd9bc16ef4be4597

Note that the situation gets worse week by week and we need to make progress on it.

Here is one example from two Wd jobs run on the same try push from yesterday:
https://treeherder.mozilla.org/jobs?repo=try&revision=1ec2c5851aa5711b6177d017bd9bc16ef4be4597&selectedTaskRun=IGyTnXd_SYa1DWml81A2zA.0

The selected job ran 166 minutes while the other ones are around 100 minutes. This is because the delay for content process creation in the first job is around 1.5s for each and every process, while for the others it's only around 500ms, which is still way higher than the usual 80 - 100ms that I can see locally on my machine.

Overall this means the mean duration, as visible in the smaller increasing line in the Perfherder graph, is now 5x longer than it should be, while the extreme numbers are even 15x larger! When a 3x larger content process creation time already translates into a 66% longer job duration between those two growing graphs, you can imagine what that means compared to the actually expected 80-100ms process creation time. This means it definitely has a HUGE impact on all the test jobs that we run in CI on those macOS 14.70 workers.

We should investigate and fix the situation as quickly as possible because it's getting worse day by day and the backlog will grow even more, so that devs might have to wait up to 2 days to get a try job run.

What I still need is a way to immediately get a machine quarantined and correctly prepared so that I get access to it and any reboot logic is stopped, because rebooting means the problem goes away. On top of that, due to the extremely long queues on try and other jobs (like for releases, uplifts, landing) it's not predictable when the job will run. Given that we do not know what exactly triggers it, it probably doesn't yet make sense to have a separate pool of workers, unless we trigger random jobs on it in the hope that we can reproduce the situation. But maybe we should try that out?

Joel, what do you think?

Flags: needinfo?(jmaher)
Summary: Mac workers (since macOS 14.7.1) show a 1s delay for web content process creation due to sandbox restrictions → Mac workers (since macOS 14.7.1) show a 500ms to 1.5s delay for web content process creation due to sandbox restrictions

Haik, do we actually know if this issue is also impacting our user base on macOS? Do we have a way to get the content process creation time out of Telemetry?

Joel, currently I have access to macmini-r8-137 and macmini-r8-66. Maybe we could try out the extra pool, created with just those two workers. I could then try to push various test jobs for builds with Haik's patch, and hopefully get one of these two machines into that state.

I don't know how many content processes we create during a full set of tests, but I imagine it is in the range of 5000-10000; Assuming a 1 second delay for this, that could be up to 3 hours of wasted CPU for a full test run- probably equal to about 40 hours of CPU time or 2 machines missing from the pool (which ironically is the same impact on the worker pool as the 2 workers you have access to).

If this has been determined to be sandboxing, can we just turn that off for CI until we figure this out? I don't think it will fix our backlog problems, but it won't make it worse.

it is probably best to move forward with putting these two workers you have access to into a separate pool, then you can schedule jobs on that pool as you see fit.

I assume we have pool t-osx-1400-r8, you would create a pool t-osx-1400-r8-whimboo, then ./mach try fuzzy -q 'test-macosx14 ...' --worker-suffix='-whimboo'

if Ryan is on PTO, I believe :aerickson can help out (keep in mind all of relops is on public holiday today (Friday July 4th))

Flags: needinfo?(jmaher) → needinfo?(aerickson)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #55)

I don't know how many content processes we create during a full set of tests, but I imagine it is in the range of 5000-10000; Assuming a 1 second delay for this, that could be up to 3 hours of wasted CPU for a full test run- probably equal to about 40 hours of CPU time or 2 machines missing from the pool (which ironically is the same impact on the worker pool as the 2 workers you have access to).

I'm now running all the wpt tests from the upstream repository on one of those macOS workers, and it's currently running tests under the editing folder. Since the beginning, roughly 42000 content processes should already have been created. Keep in mind that each new tab, added iframe, or cross-origin navigation can cause new content processes to be created. So the number is much, much higher.

If this has been determined to be sandboxing, can we just turn that off for CI until we figure this out? I don't think it will fix our backlog problems, but it won't make it worse.

I don't think so given that we have to test what we actually ship. And that wouldn't be the case anymore.

it is probably best to move forward with putting these two workers you have access to into a separate pool, then you can schedule jobs on that pool as you see fit.

I assume we have pool t-osx-1400-r8, you would create a pool t-osx-1400-r8-whimboo, then ./mach try fuzzy -q 'test-macosx14 ...' --worker-suffix='-whimboo'

if Ryan is on PTO, I believe :aerickson can help out (keep in mind all of relops is on public holiday today (Friday July 4th))

That would be great, thanks! I'm still around next week but then off for 3 weeks. So we should make progress or someone else has to pick up that work.

Here's the glam view, macOS only, for
content_process_launch_total_ms: Total time elapsed during asynchronous content process launch, until the process is usable for loading content.

And, glam view for
content_process_sync_launch_ms, Time elapsed during synchronous content process launch until the process is usable for loading content.

From a quick view they don't seem to be too problematic, but I have very little experience in this particular area.

https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-t-osx-1400-r8-whimboo has been created and both devices (66 and 137) are in it. I've verified jobs are working. Reboots are disabled and they will run an infinite number of tasks (numberOfTasksToRun: 0).

Flags: needinfo?(aerickson)

It looks like I was able to hit this very long content process creation case (~1400ms), and the sandbox violation log shows the mach-lookup for com.apple.lsd.modifydb being denied a couple of times, which also seems to take up a lot of time?

Haik, please take a look at the log. Thanks

To actually submit job requests to the pool I was using:

mach try fuzzy --no-artifact --show-chunk-numbers --worker-suffix="t-osx-1400-r8=-whimboo"

We can trigger new test jobs directly from Treeherder if needed:
https://treeherder.mozilla.org/jobs?repo=try&revision=ededafbfa1fc81aeeb26b91f3a2dd54f6c084953&selectedTaskRun=BOqq26ULSE-H2dJ1YZGH6g.0

Note that the script that checks the refresh rate currently fails on macmini-r8-137 because it gets an empty value returned. Andrew, I think you can re-puppetize that machine (I've made some global changes) and add it back to the default pool. It's enough to have macmini-r8-66 now. Thanks!

Flags: needinfo?(aerickson)

I've submitted a new try build to check if allowing modifydb lookups within the sandbox will actually get us away from those long delays:
https://treeherder.mozilla.org/jobs?repo=try&revision=d43475acb9e4b0846dc02d555002716cfda2a531

Joel, the current screen refresh rate and resolution checks fail as mentioned above. I tried out some variations, and the screenresolution tool from jhford seems to work fine:

$ screenresolution get
2025-07-08 10:01:21.586 screenresolution[50066:635904] starting screenresolution argv=/usr/local/Cellar/screenresolution/1.6/bin/screenresolution get
2025-07-08 10:01:21.602 screenresolution[50066:635904] Display 0: 1920x1080x32@60

while

system_profiler SPDisplaysDataType
Graphics/Displays:

    Intel UHD Graphics 630:

      Chipset Model: Intel UHD Graphics 630
      Type: GPU
      Bus: Built-In
      VRAM (Dynamic, Max): 1536 MB
      Vendor: Intel
      Device ID: 0x3e9b
      Revision ID: 0x0000
      Metal Support: Metal 3

Maybe we should make it more stable? (There is also displayplacer)

Flags: needinfo?(jmaher)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #59)

Created attachment 9499137 [details]
sandbox violation logs for a single web content process creation

It looks like I was able to hit this very long content process creation case (~1400ms), and the sandbox violation log shows the mach-lookup for com.apple.lsd.modifydb being denied a couple of times, which also seems to take up a lot of time?

Haik, please take a look at the log. Thanks

So allowing modifydb actually helped, but note that it's probably not the whole story given that the durations still vary quite a bit:

[task 2025-07-08T12:03:09.344+00:00] 12:03:09     INFO - PID 59083 | XXX pid: 59096, type: 2, nsAppShell init time: 233 milliseconds
[task 2025-07-08T12:03:09.465+00:00] 12:03:09     INFO - PID 59083 | XXX pid: 59098, type: 2, nsAppShell init time: 87 milliseconds
[task 2025-07-08T12:03:11.703+00:00] 12:03:11     INFO - PID 59083 | XXX pid: 59106, type: 2, nsAppShell init time: 66 milliseconds

In the logs I can also see a lot of user-preference-read kcfpreferencesanyapplication entries, which might also have an influence here. Checking our code base I can only find a single instance of this, in the resistfingerprinting component, which tries to read the AppleFontSmoothing preference. Not sure if that is something we can allow or not - similar to modifydb - but we probably should not repeatedly try to read this preference.

The very first content process creation still takes longer, around 200ms compared to the ~80ms for all the following ones. Interestingly, when Firefox gets restarted we do not see such a high value again. So maybe it's some OS caching logic that we are seeing here and something we do not have to care that much about?

[task 2025-07-08T16:17:51.921+00:00] 16:17:51     INFO - PID 96282 | XXX pid: 96301, type: 2, nsAppShell init time: 225 milliseconds
[task 2025-07-08T16:17:52.152+00:00] 16:17:52     INFO - PID 96282 | XXX pid: 96305, type: 2, nsAppShell init time: 62 milliseconds
[..]
[task 2025-07-08T16:18:29.711+00:00] 16:18:29     INFO - PID 96387 | XXX pid: 96395, type: 2, nsAppShell init time: 71 milliseconds
[task 2025-07-08T16:18:30.659+00:00] 16:18:30     INFO - PID 96387 | XXX pid: 96405, type: 2, nsAppShell init time: 70 milliseconds
[task 2025-07-08T16:18:30.727+00:00] 16:18:30     INFO - PID 96387 | XXX pid: 96406, type: 2, nsAppShell init time: 68 milliseconds

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #60)

Note that the script that checks the refresh rate current fails on macmini-r8-137 because it gets an empty value returned. Andrew, I think that you can re-puppetize that machine (I've made some global changes) and add it back to the default pool. It's enough to have macmini-r8-66 now. Thanks!

r8-137 has been reimaged and removed from quarantine.

Flags: needinfo?(aerickson)

I talked with Joel and we should definitely have some documentation on firefox-source-docs that explains how to troubleshoot issues on workers in CI. As suggested we can have a new top-level item named Troubleshooting issues with CI, and then a sub-item Debugging with loaners. I can try to write up some steps based on my experience with this bug, which we can then enhance step by step.

My current hypothesis is that there is a list of Application bundles somewhere in the system, and it's being traversed here; we're spending a lot of time in _LSBundleMatchesNode.

The call to CFURLCopyResourcePropertyForKey inside -[NSApplication(NSSafeAperture) _initializeSafeAperture] probably happens for everyone, it just takes longer on these machines. So we should be able to find out locally what URL and key it's called for. This might even allow us to make a reduced testcase.

I'm not sure why it would be faster without sandboxing. Maybe it tries to get some information from a different process, sandboxing blocks that, and then it falls back to a path that computes the information within the current process.

(In reply to Markus Stange [:mstange] from comment #67)

The call to CFURLCopyResourcePropertyForKey inside -[NSApplication(NSSafeAperture) _initializeSafeAperture] probably happens for everyone, it just takes longer on these machines. So we should be able to find out locally what URL and key it's called for. This might even allow us to make a reduced testcase.

So how could we find this out locally? Which command(s) need to be run? I had issues reproducing this when I cloned the upstream wpt repository and ran the tests directly. I need to find out how to run the wpt tests through mozharness on the box.

Flags: needinfo?(mstange.moz)
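For reference, a minimal lldb sketch of how the URL and key could be inspected locally; the build path is an assumption, and the $x0/$x1 registers hold the first two arguments (the CFURL and the resource key) on arm64 only - on the Intel minis they would be $rdi/$rsi instead:

  $ lldb -- obj-opt/dist/Nightly.app/Contents/MacOS/firefox
  (lldb) breakpoint set -n CFURLCopyResourcePropertyForKey
  (lldb) run
  # once the breakpoint is hit during startup:
  (lldb) bt
  (lldb) po $x0
  (lldb) po $x1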

Here's what I see locally:

(lldb) bt
* thread #1, name = 'MainThread', queue = 'com.apple.main-thread', stop reason = breakpoint 3.1
  * frame #0: 0x000000019fb85c28 LaunchServices`_LSBundleFindWithNode
    frame #1: 0x000000019fb8535c LaunchServices`_LSFindOrRegisterBundleNode + 216
    frame #2: 0x000000019fd602c4 LaunchServices`LaunchServices::URLPropertyProvider::capabilityEffectiveNodeForNode(LaunchServices::Database::Context&, FSNode*) + 260
    frame #3: 0x000000019fd5ed94 LaunchServices`LaunchServices::URLPropertyProvider::prepareApplicationCapabilityValue(LaunchServices::Database::Context&, id<FSNodePropertyProviding>, __FileCache*, __CFString const*, LaunchServices::URLPropertyProvider::State*, NSError* __autoreleasing*) + 196
    frame #4: 0x000000019fb818c4 LaunchServices`LaunchServices::URLPropertyProvider::prepareValues(__CFURL const*, __FileCache*, __CFString const* const*, void const**, long, void const*, __CFError**) + 576
    frame #5: 0x00000001a33ce914 CoreServicesInternal`prepareValuesForBitmap(__CFURL const*, __FileCache*, _FilePropertyBitmap*, __CFError**) + 460
    frame #6: 0x00000001a33cd010 CoreServicesInternal`_FSURLCopyResourcePropertyForKeyInternal(__CFURL const*, __CFString const*, void*, void*, __CFError**, unsigned char) + 232
    frame #7: 0x000000019f6839dc CoreFoundation`CFURLCopyResourcePropertyForKey + 108
    frame #8: 0x000000019f69a644 CoreFoundation`-[NSURL getResourceValue:forKey:error:] + 104
    frame #9: 0x00000001a40672c4 AppKit`-[NSApplication(NSSafeAperture) getValue:forSafeApertureCompatibilityURLResourceKey:] + 124
    frame #10: 0x00000001a4066c08 AppKit`+[_NSSafeApertureCompatibilityManager _updateSafeApertureCompatibilityMode:withURLResourceKey:forMode:fromDataSource:] + 120
    frame #11: 0x00000001a4066ddc AppKit`+[_NSSafeApertureCompatibilityManager compatibilityModeForDataSource:] + 116
    frame #12: 0x00000001a40675d0 AppKit`-[NSApplication(NSSafeAperture) _initializeSafeAperture] + 40
    frame #13: 0x00000001a35a2ff0 AppKit`-[NSApplication init] + 1376
    frame #14: 0x00000001a35a28c4 AppKit`+[NSApplication sharedApplication] + 128
    frame #15: 0x00000001176733a8 XUL`SetupMacApplicationDelegate(gRestartedByOS=0x000000011c215278) at MacApplicationDelegate.mm:113:3 [opt]
    frame #16: 0x00000001176614b0 XUL`XREMain::XRE_mainInit(this=0x000000016fdfe0e0, aExitFlag=0x000000016fdfe058) at nsAppRunner.cpp:4340:3 [opt]
    frame #17: 0x00000001176672c4 XUL`XREMain::XRE_main(this=0x000000016fdfe0e0, argc=4, argv=0x000000016fdfecf0, aConfig=<unavailable>) at nsAppRunner.cpp:6099:16 [opt]
    frame #18: 0x00000001176678f4 XUL`XRE_main(argc=<unavailable>, argv=<unavailable>, aConfig=<unavailable>) at nsAppRunner.cpp:6211:21 [opt]
    frame #19: 0x0000000100000be0 firefox`main [inlined] do_main(argc=4, argv=0x000000016fdfecf0, envp=<unavailable>) at nsBrowserApp.cpp:232:22 [opt]
    frame #20: 0x0000000100000a1c firefox`main(argc=<unavailable>, argv=<unavailable>, envp=<unavailable>) at nsBrowserApp.cpp:464:16 [opt]
    frame #21: 0x000000019f226b98 dyld`start + 6076
(lldb) po $x1
<FSNode 0x1029c46e0> { isDir = ?, path = '/Users/mstange/code/obj-m-opt/dist/Nightly.app' }

I think the next step would be to read the code of _LSBundleFindWithNode and find out where the list comes from that it checks against.

Flags: needinfo?(haftandilian)
See Also: → 1504554

There are some undocumented functions used by WebKit and Blink that relate to accessing LaunchServices in a sandboxed process: SetApplicationIsDaemon and _LSSetApplicationLaunchServicesServerConnectionStatus. These might short circuit entries into LaunchServices code and help us out here.

Here's my current guess: The LaunchServices database may have grown very big on these machines. In bug 1827651 we introduced calls to LSRegisterURL; my guess is that we accumulate more and more registered app bundles in the LaunchServices database and never clean them out.

You can explore the database with the lsregister command line utility.

Henrik, can you extract the list of bundles from one of these machines? /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump Bundle | grep 'path:' > bundles.txt

On my development machine, this creates a bundles.txt file with 1381 lines.
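As a small extension of that, a sketch for summarising the dump on a CI machine, assuming the task folders live under /opt/worker/tasks/ as on the workers discussed here:

  LSREGISTER=/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister

  # Total number of registered bundles.
  "$LSREGISTER" -dump Bundle | grep -c 'path:'

  # How many of them point into (old) CI task folders.
  "$LSREGISTER" -dump Bundle | grep 'path:' | grep -c '/opt/worker/tasks/'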

Flags: needinfo?(mstange.moz) → needinfo?(hskupin)

I can surely do that soon. In the meantime, one interesting fact - which might speak against this theory - is that I was able to get back to normal values when I duplicated the task folder on the machine and then moved this duplicated folder to cltbld's home folder. The web content process creation times dropped and now also stay at this level with an unpatched version of Firefox:

[task 2025-07-11T05:43:01.068+00:00] 05:43:01     INFO - PID 61959 | XXX pid: 62008, type: 2, nsAppShell init time: 1316 milliseconds
[task 2025-07-11T05:43:02.896+00:00] 05:43:02     INFO - PID 61959 | XXX pid: 62010, type: 2, nsAppShell init time: 1465 milliseconds
[task 2025-07-11T05:43:04.554+00:00] 05:43:04     INFO - PID 61959 | XXX pid: 62011, type: 2, nsAppShell init time: 1436 milliseconds
[task 2025-07-11T05:43:05.271+00:00] 05:43:05     INFO - PID 61959 | XXX pid: 62016, type: 2, nsAppShell init time: 494 milliseconds
[task 2025-07-11T05:43:05.579+00:00] 05:43:05     INFO - PID 61959 | XXX pid: 62017, type: 2, nsAppShell init time: 92 milliseconds
[task 2025-07-11T05:43:05.680+00:00] 05:43:05     INFO - PID 61959 | XXX pid: 62018, type: 2, nsAppShell init time: 112 milliseconds
[..]
[task 2025-07-11T05:48:18.662+00:00] 05:48:18     INFO - PID 62936 | XXX pid: 63031, type: 2, nsAppShell init time: 79 milliseconds

Not sure what that actually means and what specifically might have been triggered on the OS side.

(In reply to Markus Stange [:mstange] from comment #71)

On my development machine, this creates a bundles.txt file with 1381 lines.

Ok, so on the machine in CI this has 9500 lines and is about 1.2MB in size! About 1700 lines are actually related to task-specific folders for xpcshell tests that most likely ran update tests:

path:                       /opt/worker/tasks/task_174254813620657/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marStageSuccessPartial/dir.app/Contents/MacOS/callback_app.app (0x10bc4)
path:                       /opt/worker/tasks/task_174254813620657/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marWrongChannel/dir.app/Contents/MacOS/callback_app.app (0x10bc8)
path:                       /opt/worker/tasks/task_174254813620657/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppApplyUpdateSuccess/dir.app (0x10bd0)
path:                       /opt/worker/tasks/task_174254813620657/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marStageSuccessComplete/dir.app/Contents/MacOS/callback_app.app (0x10bdc)

Then we have about 7250 entries for the registered Firefox application, coming from the individual tasks run on the machine:

path:                       /opt/worker/tasks/task_174254628261469/build/application/Firefox NightlyDebug.app (0x10b38)
path:                       /opt/worker/tasks/task_174550149473054/build/application/Firefox Developer Edition.app (0x18b3c)
path:                       /opt/worker/tasks/task_174254813620657/build/application/Firefox Nightly.app (0x10b40)
path:                       /opt/worker/tasks/task_174550308047905/build/application/Firefox NightlyDebug.app (0x18b44)
path:                       /opt/worker/tasks/task_174254813620657/build/application/Firefox Nightly.app/Contents/MacOS/plugin-container.app (0x10b44)
path:                       /opt/worker/tasks/task_174550848343084/build/application/Firefox Nightly.app (0x18b4c)

Other bundles, like plugin-container.app, have only 253 entries.

I tried again what happens when I duplicate the task folder and I was able to see the same behavior:

[task 2025-07-11T06:16:21.107+00:00] 06:16:21     INFO - PID 69632 | XXX pid: 69697, type: 2, nsAppShell init time: 1890 milliseconds
[task 2025-07-11T06:16:21.203+00:00] 06:16:21     INFO - PID 69632 | XXX pid: 69698, type: 2, nsAppShell init time: 1925 milliseconds
[task 2025-07-11T06:16:24.875+00:00] 06:16:24     INFO - PID 69699 | XXX pid: 69712, type: 2, nsAppShell init time: 84 milliseconds
[task 2025-07-11T06:16:26.198+00:00] 06:16:26     INFO - PID 69699 | XXX pid: 69715, type: 2, nsAppShell init time: 93 milliseconds

Not sure what that means, but if bug 1827651 caused it, why did it start that late?

Flags: needinfo?(hskupin) → needinfo?(mstange.moz)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #73)

path: /opt/worker/tasks/task_174254628261469/build/application/Firefox NightlyDebug.app (0x10b38)

Interesting for all those entries is the timestamp: they somehow date back to 1975.

Running an xpcshell test on the machine now gives the correct timestamp:

Copying /opt/worker/downloads/Fm2esJ6TQuCnpSPJPxFyjg to /opt/worker/tasks/task_175221916270523/fetch-content

Also, after running the 2 xpcshell test jobs available for macOS opt builds, 21 more Bundle entries got added:

X1:

path:                       /opt/worker/tasks/task_175221916270523/build/application/Firefox Nightly.app (0x25ac8)
path:                       /opt/worker/tasks/task_175221916270523/build/application/Firefox Nightly.app/Contents/MacOS/plugin-container.app (0x25acc)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppApplyUpdateStageOldVersionFailure/dir.app (0x25ad0)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppApplyUpdateStageSuccess/dir.app (0x25b10)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppApplyUpdateSuccess/dir.app (0x25b14)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marVersionDowngrade/dir.app/Contents/MacOS/callback_app.app (0x25b1c)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marSuccessPartial/dir.app/Contents/MacOS/callback_app.app (0x25b20)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marSuccessComplete/dir.app/Contents/MacOS/callback_app.app (0x25b24)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marStageSuccessPartial/dir.app/Contents/MacOS/callback_app.app (0x25b28)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppInUseSuccessComplete/dir.app/Contents/MacOS/callback_app.app (0x25b2c)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marAppInUseStageSuccessComplete_unix/dir.app/Contents/MacOS/callback_app.app (0x25b30)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marCallbackUmask_unix/dir.app/Contents/MacOS/callback_app.app (0x25b34)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marFailurePartial/dir.app/Contents/MacOS/callback_app.app (0x25b38)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marMissingUpdateSettings/dir.app/Contents/MacOS/callback_app.app (0x25b3c)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marWrongChannel/dir.app/Contents/MacOS/callback_app.app (0x25b40)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marPostUpdateEnvironment/dir.app/Contents/MacOS/callback_app.app (0x25b44)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marStageSuccessComplete/dir.app/Contents/MacOS/callback_app.app (0x25b4c)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marSuccessPartialWhileBackgroundTaskRunning/dir.app/Contents/MacOS/callback_app.app (0x25b58)
path:                       /opt/worker/tasks/task_175221916270523/build/tests/xpcshell/tests/toolkit/mozapps/update/tests/unit_update_binary/marSuccessPartialWhileBackgroundTaskRunning/dir.app (0x25b5c)

X2:

path:                       /opt/worker/tasks/task_175221559748829/build/application/Firefox Nightly.app (0x25abc)
path:                       /opt/worker/tasks/task_175221559748829/build/application/Firefox Nightly.app/Contents/MacOS/plugin-container.app (0x25ac0)

CC'ing Robin for information given that this might be related to the application update tests as well.

I finally managed to keep the task folder from being deleted. With that I can now run a single Marionette test to reproduce the problem. Here are the steps:

  1. Request access for macmini-r8-66 (aerickson can help you with that)
  2. Use a VNC viewer to connect to it and login with cltbld
  3. Open a terminal and change the cwd to /opt/worker/tasks/task_175224278486602
  4. Run Marionette: build/venv/bin/python -u build/tests/marionette/harness/marionette_harness/runtests.py --gecko-log=- -vv --binary=build/application/Firefox\ Nightly.app/Contents/MacOS/firefox --address=127.0.0.1:2828 build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_checkbox.py
  5. Search for type: 2 in the logging output (see the grep sketch below).
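
For step 5, a filter roughly like the following can be used (just a convenience sketch; it reuses the command from step 4 and greps for the type: 2 log lines):

  build/venv/bin/python -u build/tests/marionette/harness/marionette_harness/runtests.py \
    --gecko-log=- -vv \
    --binary="build/application/Firefox Nightly.app/Contents/MacOS/firefox" \
    --address=127.0.0.1:2828 \
    build/tests/marionette/tests/testing/marionette/harness/marionette_harness/tests/unit/test_checkbox.py \
    2>&1 | grep "type: 2"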

It is very strange. The following steps also brought the duration back to normal values, and now I cannot see those huge times on the machine anymore:

  1. Manually start Firefox with the default profile
  2. I still observed slow content process creation times of 1400ms
  3. Open some more tabs and navigate and observe the values are the same
  4. Open about:config and change the security.sandbox.content.level to 0.
  5. Continue opening tabs and navigate to certain pages -> values dropped to normal
  6. Change security.sandbox.content.level back to 3
  7. Continue opening tabs and navigate to certain pages -> values stayed at normal

Now I'm not able to reproduce the high value anymore on the machine. When restarting Firefox the value stays low, and cleaning the Firefox folder under Application Support or the Cache doesn't change anything either. Also the Marionette test now reports normal values.

Is there any other location on the Mac where we store a cache? Note that the number of stored bundles hasn't changed at all, so that shouldn't be the problem.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #76)

CC'ing Robin for information given that this might be related to the application update tests as well.

Stephen has worked on updater tests and might have some insight here. See comment 73 and later.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #78)

Now I'm not able to reproduce the high value anymore on the machine. When restarting Firefox the value stays low, and cleaning the Firefox folder under Application Support or the Cache doesn't change anything either. Also the Marionette test now reports normal values.

I pushed another try job which basically adds a new task folder and puts all files in a different location. With that the content process creation time is again ~1400ms (live log).

So it may have something to do with the location of the Firefox binary. Could it be that writing to the LaunchServices database is the problem here? We probably do it only once, and once the path has been added we no longer have to do it again? Just guessing...

Filed Bug 1976984 to split out the possibility of changing our screen resolution code.

Flags: needinfo?(jmaher)

I was using the following command to wipe all the entries from the LaunchService database:

sudo /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -kill -r -domain local -domain system -domain user

But sadly it didn't change anything. The creation time is still slow when running a Marionette test afterward. Checking the remaining entries revealed these two:

/opt/worker/tasks/task_175248934257528/build/application/Firefox Nightly.app
/opt/worker/tasks/task_175248934257528/build/application/Firefox Nightly.app/Contents/MacOS/plugin-container.app
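
For reference, I'm checking what is still registered with something along these lines (a sketch; it assumes the lsregister dump uses the path: entries listed above):

  /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -dump | grep "path:.*/opt/worker/tasks"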

I wonder if we should actually register plugin-container.app at all? It's just an internal helper application that the user never interacts with directly.

I filed https://github.com/mozilla-platform-ops/ronin_puppet/issues/903 for the launch services db cleanup after a task has finished running.

Given that I can easily reproduce on the machine in CI I now created 3 different profiles:

  1. Starting Firefox without any modifications (sandbox level 3): https://share.firefox.dev/4kzbJ4R
  2. Starting Firefox with the sandbox level set to 0: https://share.firefox.dev/451lYdz
  3. Starting Firefox with the sandbox level set back to 3: https://share.firefox.dev/46fRTZ7

As you can see there is one big difference here. In both cases (1 and 2) XRE_RunAppShell calls nsGetServiceByCID::operator(), but with sandbox level 3 this takes around 2037ms (which matches the XXX pid: 75313, type: 2, nsAppShell init time: 2056 milliseconds log line), while with sandbox level 0 the same call only takes 181ms (XXX pid: 75862, type: 2, nsAppShell init time: 181 milliseconds). Interestingly, when I turn sandboxing back on, nsGetServiceByCID::operator() is no longer in the stack, which would explain why we no longer see the delay caused by the slow call to _LSBundleFindWithNode (what Markus mentioned in comment 69).

Another thing I noticed today are entries like the following when streaming the system logs while Firefox creates new content processes:

2025-07-15 12:03:54.919640+0000  localhost cfprefsd[407]: (CoreFoundation) [com.apple.defaults:cfprefsd] rejecting read of { kCFPreferencesAnyApplication, cltbld, kCFPreferencesCurrentHost, /Library/Managed Preferences/cltbld/.GlobalPreferences.plist, managed: 1 } from process 79092 (plugin-container) because accessing these preferences requires user-preference-read or file-read-data sandbox access
[..]
2025-07-15 12:03:56.055331+0000  localhost cfprefsd[407]: (CoreFoundation) [com.apple.defaults:cfprefsd] rejecting read of { kCFPreferencesAnyApplication, cltbld, kCFPreferencesCurrentHost, /Library/Managed Preferences/cltbld/.GlobalPreferences.plist, managed: 1 } from process 79092 (plugin-container) because accessing these preferences requires user-preference-read or file-read-data sandbox access

There are a lot of these entries, and based on the timestamps it takes a little more than 1s until the system presumably gives up trying to read the file. And this maps to the log output:

1752581036049 XXX pid: 79092, type: 2, nsAppShell init time: 1091 milliseconds

I wanted to know if our Puppet configuration actually sets this managed preference, and indeed I found this entry, which sets the scrollbar visibility:

https://github.com/mozilla-platform-ops/ronin_puppet/blob/03893e81569eddf25f68fd37f4ac35e3892417bc/modules/macos_utils/manifests/show_scroll_bars.pp#L8

I'm not sure why Puppet's config lists the file path as /Library/Preferences/.GlobalPreferences.plist while the actual file on disk is at /Library/Managed Preferences/cltbld/.GlobalPreferences.plist. It contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>AppleShowScrollBars</key>
	<string>Always</string>
	<key>NSDocumentSaveNewDocumentsToCloud</key>
	<false/>
	<key>PayloadUUID</key>
	<string>090885A8-E60E-458C-B133-D5CD18308976</string>
</dict>
</plist>

But it looks like it makes no difference whether this file exists or not; the startup times are still slow. A Firefox version that doesn't show the issue also tries to read this file, but those entries appear far less often in the log and the process is created much faster. Maybe this is a red herring?

Haik, I now have a debug build ready that a Mac developer could use to get more information and try out a potential fix. Who will be able to continue the work on this bug (and grant access to that machine)? It looks like I can't get any further at the moment. Thanks.

Flags: needinfo?(haftandilian)

Hi :whimboo

The scroll bar setting is enforced both by Puppet and by MDM:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>PayloadContent</key>
	<array>
		<dict>
			<key>AppleShowScrollBars</key>
			<string>Always</string>
			<key>NSDocumentSaveNewDocumentsToCloud</key>
			<false/>
			<key>PayloadDisplayName</key>
			<string>Global Preferences</string>
			<key>PayloadIdentifier</key>
			<string>.GlobalPreferences.090885A8-E60E-458C-B133-D5CD18308976</string>
			<key>PayloadType</key>
			<string>.GlobalPreferences</string>
			<key>PayloadUUID</key>
			<string>090885A8-E60E-458C-B133-D5CD18308976</string>
			<key>PayloadVersion</key>
			<integer>1</integer>
		</dict>
	</array>
	<key>PayloadDisplayName</key>
	<string>Always Show Scrollbars</string>
	<key>PayloadIdentifier</key>
	<string>com.mozilla.always_show_scrollbars</string>
	<key>PayloadScope</key>
	<string>System</string>
	<key>PayloadType</key>
	<string>Configuration</string>
	<key>PayloadUUID</key>
	<string>B2DBC23F-AC74-450A-9C10-EDD84FDF83B3</string>
	<key>PayloadVersion</key>
	<integer>1</integer>
</dict>
</plist>

Do you suspect the issue is caused by the Puppet config or the MDM Configuration Profile above? Thank you

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #86)

Haik, I now have a debug build ready that could be used by a mac developer to actually get more information and try out a potential fix. Who will be able to continue the work on this bug (and to give access to that machine)? Looks like I don't get further at the moment. Thanks.

I'll continue with it. Great work to narrow the problem down to this point. Let's discuss on chat how I can use the test machine.

Flags: needinfo?(haftandilian)

(In reply to Ryan Curran from comment #87)

Do you suspect the issue is caused by the Puppet config or the MDM Configuration Profile above? Thank you

I'm not 100% sure, but we at least do not allow content processes to read from /Library/Managed Preferences/cltbld/.GlobalPreferences.plist and /Library/Managed Preferences/.GlobalPreferences.plist. Is it expected that we have multiple files for MDM, and then one more with the exact same settings under /Library/Preferences/.GlobalPreferences.plist for Puppet? Do we need the settings for MDM, or could those be turned off? Is there a way to temporarily turn those off for testing purposes? I assume that Haik might be interested in that.

Flags: needinfo?(rcurran)

:whimboo I turned the MDM profile off on your loaner device and it has been removed. please test and lmk!

Flags: needinfo?(rcurran)

(In reply to Ryan Curran from comment #90)

:whimboo I turned the MDM profile off on your loaner device and it has been removed. please test and lmk!

I checked and while those read-pref entries are no longer visible it didn't make any difference. The delays are still there. That means we can rule out MDM. Thanks.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] (away 07/21 - 08/08) from comment #82)

I was using the following command to wipe all the entries from the LaunchService database:

sudo /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -kill -r -domain local -domain system -domain user

But sadly it didn't change anything.

If you dump the database again, is it actually empty? If it's still slow with an empty database I'd love to see a new startup profile.

Flags: needinfo?(mstange.moz)

(In reply to Markus Stange [:mstange] from comment #92)

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] (away 07/21 - 08/08) from comment #82)

I was using the following command to wipe all the entries from the LaunchService database:

sudo /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -kill -r -domain local -domain system -domain user

But sadly it didn't change anything.

If you dump the database again, is it actually empty? If it's still slow with an empty database I'd love to see a new startup profile.

A reboot might be required for the changes to take effect too, but we're avoiding rebooting the machine for now. I've tried adding an additional 10,000 apps on a local system, but haven't been able to reproduce a delay yet.

(In reply to Haik Aftandilian [:haik] from comment #93)

A reboot might be required for the changes to take effect too, but we're avoiding rebooting the machine for now. I've tried adding an additional 10,000 apps on a local system, but haven't been able to reproduce a delay yet.

We could check with a reboot. In case the issue is no longer reproducible afterward, we would have to request a new machine and set everything up again. Alternatively I would suggest not doing it on the machine we have right now, but checking it on a machine that runs actual jobs instead. That way we could see how it works. I'm only worried that this command might remove too much data from the launch services db.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] (away 07/21 - 08/08) from comment #84)

  1. Starting Firefox without any modifications (sandbox level 3): https://share.firefox.dev/4kzbJ4R
  2. Starting Firefox with the sandbox level set to 0: https://share.firefox.dev/451lYdz

Haik, I wonder if we could compare these two profiles and check what is different in the 2nd profile - like which API is successfully called now and does some extra work that makes the issue go away for subsequent content process creations. Maybe something jumps out of it.

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] (away 07/21 - 08/08) from comment #94)

(In reply to Haik Aftandilian [:haik] from comment #93)

A reboot might be required for the changes to take effect too, but we're avoiding rebooting the machine for now. I've tried adding an additional 10,000 apps on a local system, but haven't been able to reproduce a delay yet.

We could check with a reboot.

Maybe killall Finder, as suggested by the help text, might do the trick?

I'm only worried that this command might remove too much data from the launch services db.

I agree. I think we should write a script which dumps all the bundle paths, and then manually unregisters each one for which the file no longer exists. Maybe there's a way to run this script before every reboot.

(In reply to Markus Stange [:mstange] from comment #96)

I'm only worried that this command might remove too much data from the launch services db.

I agree. I think we should write a script which dumps all the bundle paths, and then manually unregisters each one for which the file no longer exists. Maybe there's a way to run this script before every reboot.

The following file controls the cleanup steps after the task run and finally reboots the machine, so we can implement it there. I've already filed that as an issue on the Ronin repository.

https://github.com/mozilla-platform-ops/ronin_puppet/blob/3f0e39a7dbe54812aca13411bd2a91c6b5a3e51e/modules/worker_runner/templates/worker-runner.sh.erb#L7

We have now reproduced the problem on a local machine not part of CI by creating 10,000 copies of an app and registering each one with the lsregister command. That results in a similarly slow launch of content processes, with time being spent in the same stack as in the profiles collected from CI.
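
In shell terms the reproduction amounts to something like this (a sketch only; the source bundle and target paths are placeholders, and the exact lsregister flags used for registration may differ):

  LSREGISTER="/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister"
  SRC="/path/to/SomeSmall.app"   # placeholder; any small .app bundle keeps disk usage manageable

  for i in $(seq 1 10000); do
    dest="$HOME/ls-stress/copy_$i/$(basename "$SRC")"
    mkdir -p "$(dirname "$dest")"
    cp -R "$SRC" "$dest"
    "$LSREGISTER" -f "$dest"    # register the copy so the LS database keeps growing
  done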

:haik, if there is anything we need to do on setup/cleanup to reset OS caches/registries/etc. that is something that can be figured out. I suspect this is one of those "side effects of 24x7 CI".

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #99)

:haik, if there is anything we need to do on setup/cleanup to reset OS caches/registries/etc. that is something that can be figured out. I suspect this is one of those "side effects of 24x7 CI"

:jmaher, yes, we suspect we can use lsregister to unregister the old apps in the Launch Services DB, but we're not sure of the exact command yet. :whimboo filed https://github.com/mozilla-platform-ops/ronin_puppet/issues/903#issuecomment-3089678790 for this. I've been focusing on understanding the problematic stack, but am going to start experimenting with the lsregister command now.

In my testing, I've found that the .app bundles that get created for update tests and then removed can be deleted from the LS database with lsregister -u <path>. A script that dumps the database and removes entries that no longer exist on the filesystem works well so far.
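
A minimal sketch of that script (not the exact version tracked in the ronin_puppet issue; it assumes dump lines of the form "path: ... (0x...)" as listed earlier in this bug):

  #!/bin/bash
  # Unregister every Launch Services entry whose bundle no longer exists on disk.
  LSREGISTER="/System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister"

  "$LSREGISTER" -dump \
    | sed -n 's/^path: *//p' \
    | sed 's/ (0x[0-9a-fA-F]*)$//' \
    | while IFS= read -r bundle; do
        if [ ! -e "$bundle" ]; then
          echo "Unregistering stale entry: $bundle"
          "$LSREGISTER" -u "$bundle"
        fi
      done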

That leaves the large number of Nightly.app bundles in the database from the tasks directories such as /opt/worker/tasks/task_174254628261469 that :whimboo listed in comment 73. If these task_<ID> directories are present then a script that only removes entries no longer on the filesystem won't be sufficient.

I've found that lsregister -reseed appears to revert the DB back to a valid state with the expected /Applications and system apps and extensions. It removes the entries from /opt/worker/tasks regardless of whether or not the entries are still on the filesystem.

If it's expected that we'll accumulate the tasks directories, then we could experiment with running lsregister -reseed to reset the database to a valid default state without requiring a reboot.
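
For reference, that would be a plain invocation like the following (same lsregister binary as above; whether it needs to run as root on the workers would still have to be verified):

  /System/Library/Frameworks/CoreServices.framework/Versions/A/Frameworks/LaunchServices.framework/Versions/A/Support/lsregister -reseed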

Those directories should be removed, otherwise we would have thousands of /opt/worker/tasks/* directories. :rcurran can you confirm we delete the old /opt/worker/tasks/task_* directory when the task is completed? Are there edge cases when we do not?

Flags: needinfo?(rcurran)

Because this is Simple GW, these machines reboot in between tasks. At each startup, we do this. So the short answer is yes :)

Flags: needinfo?(rcurran)

:jmaher or Ryan, could we try adding the script from https://github.com/mozilla-platform-ops/ronin_puppet/issues/903#issuecomment-3130065646 to the cleanup steps?

Flags: needinfo?(rcurran)
Flags: needinfo?(jmaher)

landed the potential change on gecko-t-osx-1400-r8-staging for testing

Flags: needinfo?(rcurran)
Flags: needinfo?(jmaher)

It looks like the changes have also landed on production, and as Perfherder shows the numbers are now back at around 200ms. This is great to see!

Haik, is there anything else that needs to be done?

Status: NEW → ASSIGNED
Flags: needinfo?(haftandilian)

Haik, one thing I noticed is that on our Mac minis in CI we still have a higher content process creation time of 200ms compared to the 60ms locally on my M4. Shall I file a new bug for that so that we can further investigate? It's unrelated to the issue on this bug.

This fix was deployed last Friday, which means anything from the last 48 hours should reflect what to expect. If those values are still slower, then we need to investigate that as a new item or accept it. I assume we have a noticeable win thanks to the cleanup.

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #108)

I assume we have a noticeable win thanks to the cleanup.

Yes we have. As a result, the runtime for many macOS CI jobs has dropped by 2x - 4x! E.g. wpt1 has dropped from 60min-120min down to ~27min.

Blocks: 1978169
Blocks: 1978308

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #106)

Haik, is there anything else that needs to be done?

Haik replied on Matrix and mentioned the following follow-ups he wants to work on:

From looking at the sandboxing interaction, there are a couple of follow-up fixes I want to work on. Change the content process launch code to not use this

https://searchfox.org/mozilla-central/rev/6b4cb595d05ac38e2cfc493e3b81fe4c97a71f12/widget/cocoa/nsAppShell.mm#380-389

And

Bug 1504554 - Remove access to launchservicesd from the content sandbox
Bug 1529390 - Rename content process executable to something user friendly for Activity Monitor listing

Haik, mind filing the bug for the first item?

For the 200ms process creation time in CI I'm going to file a new bug soon and will use the current loaner to do some investigation.

Also, I still have to create some documentation, but let's get this done in a follow-up bug as well.

That means that we should mark this bug as fixed given that it is basically done.

Status: ASSIGNED → RESOLVED
Closed: 2 months ago
Keywords: perf
Resolution: --- → FIXED
Blocks: 1982857
See Also: → 1983178

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #110)

Haik, mind filing the bug for the first item?

Sure, it's bug 1983178 - Skip loadNibNamed called during child process startup

Flags: needinfo?(haftandilian)