Closed Bug 1833110 Opened 1 year ago Closed 1 year ago

Utility process has added `ld.so.conf*` mainthread read/close IO due to ORB JS validator instantiating the utility process before main process goes idle

Categories

(Core :: IPC, defect)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
115 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox113 --- disabled
firefox114 --- disabled
firefox115 --- fixed

People

(Reporter: Gijs, Assigned: gerard-majax)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file, 2 obsolete files)

In bug 1831406 we're trying to re-enable browser_startup_mainthreadio.js which accidentally got disabled everywhere.

This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on /etc/ld.so.conf and various files under that location. Here's an example trypush.

We can investigate this failure with the profiler. Here's a permalink to a view of the marker chart in the main process I found helpful, which produces this stack:

(root) []
XREMain::XRE_main []
Task UtilityProcessHost::OnChannelConnected []
0x7f2b3f207a3e [libxul.so]

This isn't super informative, unfortunately. :-(

Note that there is earlier IO for the same file triggered by ContentParent::LaunchSubprocess. This is tracked under the before first paint / before first browser window exclusions in the test already.

In the past few months a bunch of bugs relating to the utility process were fixed... :lissyx, any idea what might have tripped this?

Flags: needinfo?(lissyx+mozillians)

I'm not even sure I understand the problem here. Main thread IO when launching a new child process and we read /etc/ld.so.conf ? Given we perform launching new processes from the main thread, I dont see how we could avoid that ?

Flags: needinfo?(lissyx+mozillians)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure. If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(lissyx+mozillians)

(In reply to :Gijs (he/him) from comment #6)

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

ipc filtering should be enough

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

I dont know if that is enough to disable spawing the process or if the pref is check once it is launched

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

Same, you should see IPC messages profiling showing this

If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.

(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)

Flags: needinfo?(lissyx+mozillians)

I guess you could just re-run this under pernosco and see what triggers precisely?

Flags: needinfo?(sefeng)
Flags: needinfo?(afarre)

(In reply to Alexandre LISSY :gerard-majax from comment #7)

(In reply to :Gijs (he/him) from comment #6)

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

ipc filtering should be enough

OK, well, I see StartUtility in https://share.firefox.dev/3pEfrDI , so that seems OK then.

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

I dont know if that is enough to disable spawing the process or if the pref is check once it is launched

Flipping the pref made the test go green on Linux, so I'll add exceptions based on the pref (as it's not true on release yet).

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

Same, you should see IPC messages profiling showing this

Does the link above answer this question or did you want to dig into this more?

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.

OK. Maybe :sefeng has ideas.

Flags: needinfo?(lissyx+mozillians)
Keywords: regression
Regressed by: 1532644
OS: Unspecified → Linux
Hardware: Unspecified → Desktop
Summary: Utility process has added `ld.so.conf*` mainthread read/close IO in the past 6 months → Utility process has added `ld.so.conf*` mainthread read/close IO due to ORB JS validator instantiating the utility process before main process goes idle

We clearly have the child hitting the Start JS Oracle marker: https://share.firefox.dev/3o4QmBa

Flags: needinfo?(lissyx+mozillians)

Moreover, if you see the same from ContentParent::LaunchSubprocess I would suspect this is related to something both are using, and not specific to Utility itself.

Set release status flags based on info from the regressing bug 1532644

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

And a local rr submit to pernosco seems to reveal bugs?

(In reply to Alexandre LISSY :gerard-majax from comment #13)

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

The test does not run on debug on infra, bug 1549723.

And a local rr submit to pernosco seems to reveal bugs?

I don't quite know what this means. :-)

(In reply to :Gijs (he/him) from comment #0)

This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on /etc/ld.so.conf and various files under that location.

Could that possibly be the ld.so.conf parser in the sandbox code? For content processes, that (and a lot of other filesystem policy construction) is cached, but that doesn't appear to be the case for other process types. If this is a problem, we'd probably need to refactor the ldconfig parser to collect a list of paths instead of side-effecting a policy, so that the work can be shared among different process types.

If this can be profiled with native stacks, or maybe just instrumented with more markers, that should confirm (or deny) my guess.

(In reply to :Gijs (he/him) from comment #14)

(In reply to Alexandre LISSY :gerard-majax from comment #13)

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

The test does not run on debug on infra, bug 1549723.

And a local rr submit to pernosco seems to reveal bugs?

I don't quite know what this means. :-)

There was a problem on pernosco, it's fixed and I have access to my session now: https://pernos.co/debug/qXHWQomWpNbfAu9lACTMtQ/index.html#f{m[Awu5,ArFX_,t[AQ,J6e5_,f{e[Awu5,ArAH_,s{af9stDZAA,bTQ,uAcbD,oAtkp___/

It's as jed said, sandboxing code.

Flags: needinfo?(sefeng)
Flags: needinfo?(afarre)

:sefeng, since you are the author of the regressor, bug 1532644, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(sefeng)
Flags: needinfo?(sefeng)
Assignee: nobody → lissyx+mozillians
Status: NEW → ASSIGNED
Attachment #9333842 - Attachment is obsolete: true
Attachment #9333844 - Attachment is obsolete: true

Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in browser_startup_mainthreadio.js for *ld.so.conf* "before becoming idle". The job won't go orange either way because the entry has ignoreIfUnused set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.

Attachment #9333845 - Attachment description: Bug 1833110 - Cache ldconfig to limit main thread io r?jld! → Bug 1833110 - Cache ldconfig to limit main thread io r?jld!,Gijs!

(In reply to :Gijs (he/him) from comment #21)

Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in browser_startup_mainthreadio.js for *ld.so.conf* "before becoming idle". The job won't go orange either way because the entry has ignoreIfUnused set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.

i've updated the patch to remove it, please r+?

Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/893c91960f17
Cache ldconfig to limit main thread io r=jld,Gijs
Flags: needinfo?(lissyx+mozillians)
Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e4261c1b9c83
Cache ldconfig to limit main thread io r=jld,Gijs
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 115 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: