Utility process has added `ld.so.conf*` mainthread read/close IO due to ORB JS validator instantiating the utility process before main process goes idle
Categories
(Core :: IPC, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox113 | --- | disabled |
firefox114 | --- | disabled |
firefox115 | --- | fixed |
People
(Reporter: Gijs, Assigned: gerard-majax)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 file, 2 obsolete files)
In bug 1831406 we're trying to re-enable browser_startup_mainthreadio.js which accidentally got disabled everywhere.
This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on /etc/ld.so.conf
and various files under that location. Here's an example trypush.
We can investigate this failure with the profiler. Here's a permalink to a view of the marker chart in the main process I found helpful, which produces this stack:
(root) []
XREMain::XRE_main []
Task UtilityProcessHost::OnChannelConnected []
0x7f2b3f207a3e [libxul.so]
This isn't super informative, unfortunately. :-(
Note that there is earlier IO for the same file triggered by ContentParent::LaunchSubprocess
. This is tracked under the before first paint / before first browser window exclusions in the test already.
In the past few months a bunch of bugs relating to the utility process were fixed... :lissyx, any idea what might have tripped this?
Assignee | ||
Comment 1•1 year ago
|
||
I'm not even sure I understand the problem here. Main thread IO when launching a new child process and we read /etc/ld.so.conf
? Given we perform launching new processes from the main thread, I dont see how we could avoid that ?
Assignee | ||
Comment 2•1 year ago
|
||
Looks like ContentParent::LaunchSubprocess
ends up calling GeckoChildProcessHost::AsyncLaunch
https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/dom/ipc/ContentParent.cpp#2702 which we also use https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessHost.cpp#96 so that might be a lead ?
Assignee | ||
Comment 3•1 year ago
|
||
Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.
Assignee | ||
Comment 4•1 year ago
|
||
On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303
Assignee | ||
Comment 5•1 year ago
|
||
Could you verify if it's just the generic utility started for the JS Oracle
IPC actor? With proper IPC profiling you should see it for sure. If this is the one generating those IOs, I'm afraid there's not much we can do so far.
Reporter | ||
Comment 6•1 year ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #4)
On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303
In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?
(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).
(In reply to Alexandre LISSY :gerard-majax from comment #3)
Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.
That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator
turned off and see if that makes any difference.
(In reply to Alexandre LISSY :gerard-majax from comment #5)
Could you verify if it's just the generic utility started for the
JS Oracle
IPC actor? With proper IPC profiling you should see it for sure.
I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.
If this is the one generating those IOs, I'm afraid there's not much we can do so far.
Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.
In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?
(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)
Assignee | ||
Comment 7•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #6)
(In reply to Alexandre LISSY :gerard-majax from comment #4)
On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303
In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?
(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).
ipc
filtering should be enough
(In reply to Alexandre LISSY :gerard-majax from comment #3)
Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.
That seems a likely culprit. I'll try a trypush with
browser.opaqueResponseBlocking.javascriptValidator
turned off and see if that makes any difference.
I dont know if that is enough to disable spawing the process or if the pref is check once it is launched
(In reply to Alexandre LISSY :gerard-majax from comment #5)
Could you verify if it's just the generic utility started for the
JS Oracle
IPC actor? With proper IPC profiling you should see it for sure.I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.
Same, you should see IPC messages profiling showing this
If this is the one generating those IOs, I'm afraid there's not much we can do so far.
Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.
In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?
Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.
(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)
Assignee | ||
Comment 8•1 year ago
|
||
I guess you could just re-run this under pernosco and see what triggers precisely?
Assignee | ||
Updated•1 year ago
|
Reporter | ||
Comment 9•1 year ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #7)
(In reply to :Gijs (he/him) from comment #6)
(In reply to Alexandre LISSY :gerard-majax from comment #4)
On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303
In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?
(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).
ipc
filtering should be enough
OK, well, I see StartUtility
in https://share.firefox.dev/3pEfrDI , so that seems OK then.
(In reply to Alexandre LISSY :gerard-majax from comment #3)
Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.
That seems a likely culprit. I'll try a trypush with
browser.opaqueResponseBlocking.javascriptValidator
turned off and see if that makes any difference.I dont know if that is enough to disable spawing the process or if the pref is check once it is launched
Flipping the pref made the test go green on Linux, so I'll add exceptions based on the pref (as it's not true on release yet).
(In reply to Alexandre LISSY :gerard-majax from comment #5)
Could you verify if it's just the generic utility started for the
JS Oracle
IPC actor? With proper IPC profiling you should see it for sure.I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.
Same, you should see IPC messages profiling showing this
Does the link above answer this question or did you want to dig into this more?
In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?
Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.
OK. Maybe :sefeng has ideas.
Reporter | ||
Updated•1 year ago
|
Assignee | ||
Comment 10•1 year ago
|
||
We clearly have the child hitting the Start JS Oracle marker: https://share.firefox.dev/3o4QmBa
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 11•1 year ago
|
||
Moreover, if you see the same from ContentParent::LaunchSubprocess
I would suspect this is related to something both are using, and not specific to Utility itself.
Comment 12•1 year ago
|
||
Set release status flags based on info from the regressing bug 1532644
Reporter | ||
Updated•1 year ago
|
Assignee | ||
Comment 13•1 year ago
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #8)
I guess you could just re-run this under pernosco and see what triggers precisely?
Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).
And a local rr
submit to pernosco seems to reveal bugs?
Reporter | ||
Comment 14•1 year ago
•
|
||
(In reply to Alexandre LISSY :gerard-majax from comment #13)
(In reply to Alexandre LISSY :gerard-majax from comment #8)
I guess you could just re-run this under pernosco and see what triggers precisely?
Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).
The test does not run on debug on infra, bug 1549723.
And a local
rr
submit to pernosco seems to reveal bugs?
I don't quite know what this means. :-)
Comment 15•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #0)
This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on
/etc/ld.so.conf
and various files under that location.
Could that possibly be the ld.so.conf
parser in the sandbox code? For content processes, that (and a lot of other filesystem policy construction) is cached, but that doesn't appear to be the case for other process types. If this is a problem, we'd probably need to refactor the ldconfig parser to collect a list of paths instead of side-effecting a policy, so that the work can be shared among different process types.
If this can be profiled with native stacks, or maybe just instrumented with more markers, that should confirm (or deny) my guess.
Assignee | ||
Comment 16•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #14)
(In reply to Alexandre LISSY :gerard-majax from comment #13)
(In reply to Alexandre LISSY :gerard-majax from comment #8)
I guess you could just re-run this under pernosco and see what triggers precisely?
Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).
The test does not run on debug on infra, bug 1549723.
And a local
rr
submit to pernosco seems to reveal bugs?I don't quite know what this means. :-)
There was a problem on pernosco, it's fixed and I have access to my session now: https://pernos.co/debug/qXHWQomWpNbfAu9lACTMtQ/index.html#f{m[Awu5,ArFX_,t[AQ,J6e5_,f{e[Awu5,ArAH_,s{af9stDZAA,bTQ,uAcbD,oAtkp___/
It's as jed said, sandboxing code.
Comment 17•1 year ago
|
||
:sefeng, since you are the author of the regressor, bug 1532644, could you take a look? Also, could you set the severity field?
For more information, please visit BugBot documentation.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 18•1 year ago
|
||
Updated•1 year ago
|
Assignee | ||
Comment 19•1 year ago
|
||
Updated•1 year ago
|
Assignee | ||
Comment 20•1 year ago
|
||
Updated•1 year ago
|
Reporter | ||
Comment 21•1 year ago
|
||
Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in browser_startup_mainthreadio.js
for *ld.so.conf*
"before becoming idle". The job won't go orange either way because the entry has ignoreIfUnused
set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.
Updated•1 year ago
|
Assignee | ||
Comment 22•1 year ago
|
||
(In reply to :Gijs (he/him) from comment #21)
Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in
browser_startup_mainthreadio.js
for*ld.so.conf*
"before becoming idle". The job won't go orange either way because the entry hasignoreIfUnused
set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.
i've updated the patch to remove it, please r+?
Comment 23•1 year ago
|
||
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/893c91960f17 Cache ldconfig to limit main thread io r=jld,Gijs
Comment 24•1 year ago
|
||
Backed out changeset 893c91960f17 (Bug 1833110) for causing leakcheck failures CLOSED TREE
Log: https://treeherder.mozilla.org/logviewer?job_id=416230193&repo=autoland&lineNumber=2806
Backout: https://hg.mozilla.org/integration/autoland/rev/9e33255964f5bf48254eaa27f9f0042034b4a1f0
Assignee | ||
Updated•1 year ago
|
Comment 25•1 year ago
|
||
Pushed by alissy@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e4261c1b9c83 Cache ldconfig to limit main thread io r=jld,Gijs
Comment 26•1 year ago
|
||
bugherder |
Description
•