Closed Bug 1833110 Opened 1 year ago Closed 1 year ago

Utility process has added `ld.so.conf*` mainthread read/close IO due to ORB JS validator instantiating the utility process before main process goes idle

Categories

(Core :: IPC, defect)

Product:

Component:

Platform:

Desktop

Linux

Type:

defect

Priority:

Not set

Severity:

--

Tracking

()

Status:

RESOLVED FIXED

Milestone:

115 Branch

Tracking Flags:

Tracking

Status

firefox-esr102

---

unaffected

firefox113

---

disabled

firefox114

---

disabled

firefox115

---

fixed

People

(Reporter: Gijs, Assigned: gerard-majax)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 file, 2 obsolete files)

WIP: Bug 1833110 - Cache ldconfig to limit main thread io 1 year ago :gerard-majax 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1833110 - Cache ldconfig to limit main thread io r?jld! 1 year ago :gerard-majax 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1833110 - Cache ldconfig to limit main thread io r?jld!,Gijs! 1 year ago :gerard-majax 48 bytes, text/x-phabricator-request		Details \| Review

Reporter

Description

•

1 year ago

In bug 1831406 we're trying to re-enable browser_startup_mainthreadio.js which accidentally got disabled everywhere.

This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on /etc/ld.so.conf and various files under that location. Here's an example trypush.

We can investigate this failure with the profiler. Here's a permalink to a view of the marker chart in the main process I found helpful, which produces this stack:

(root) []
XREMain::XRE_main []
Task UtilityProcessHost::OnChannelConnected []
0x7f2b3f207a3e [libxul.so]

This isn't super informative, unfortunately. :-(

Note that there is earlier IO for the same file triggered by ContentParent::LaunchSubprocess. This is tracked under the before first paint / before first browser window exclusions in the test already.

In the past few months a bunch of bugs relating to the utility process were fixed... :lissyx, any idea what might have tripped this?

Flags: needinfo?(lissyx+mozillians)

Assignee

Comment 1

•

1 year ago

I'm not even sure I understand the problem here. Main thread IO when launching a new child process and we read /etc/ld.so.conf ? Given we perform launching new processes from the main thread, I dont see how we could avoid that ?

Flags: needinfo?(lissyx+mozillians)

Assignee

Comment 2

•

1 year ago

Looks like ContentParent::LaunchSubprocess ends up calling GeckoChildProcessHost::AsyncLaunch https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/dom/ipc/ContentParent.cpp#2702 which we also use https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessHost.cpp#96 so that might be a lead ?

Assignee

Comment 3

•

1 year ago

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

Assignee

Comment 4

•

1 year ago

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

Assignee

Comment 5

•

1 year ago

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure. If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Flags: needinfo?(gijskruitbosch+bugs)

Reporter

Comment 6

•

1 year ago

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(lissyx+mozillians)

Assignee

Comment 7

•

1 year ago

(In reply to :Gijs (he/him) from comment #6)

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

ipc filtering should be enough

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

I dont know if that is enough to disable spawing the process or if the pref is check once it is launched

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

Same, you should see IPC messages profiling showing this

If this is the one generating those IOs, I'm afraid there's not much we can do so far.

Well, at a minimum we'll want to understand the conditions where this happens and add the appropriate allowlist entry in the test, at least for now, so we can re-enable the test. But doing that requires at least some understanding of the issue.

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.

(there's likely substantial overlap with the IO for content process launches. I'm very much not an expert on this part of the codebase but IIRC the team did substantial work there wrt prealloc'd processes etc. to try to avoid some of the cost of the process launch in common codepaths.)

Flags: needinfo?(lissyx+mozillians)

Assignee

Comment 8

•

1 year ago

I guess you could just re-run this under pernosco and see what triggers precisely?

Assignee

Updated

•

1 year ago

Flags: needinfo?(sefeng)

Flags: needinfo?(afarre)

Reporter

Comment 9

•

1 year ago

(In reply to Alexandre LISSY :gerard-majax from comment #7)

(In reply to :Gijs (he/him) from comment #6)

(In reply to Alexandre LISSY :gerard-majax from comment #4)

On the try push profile there's something I dont get since we should be seeing those markers https://searchfox.org/mozilla-central/rev/eb07633057d66ab25f9db4c5900eeb6913da7579/ipc/glue/UtilityProcessManager.cpp#250-254,289-292,300-303

In what process and on what thread? And are the profiler settings for this task definitely including this marker category / thread / process?

(I'm very much not an expert on the profiler but those would be the first things I'd check, anyway. Otherwise #perf-help might be able to help clarify what's going on there).

ipc filtering should be enough

OK, well, I see StartUtility in https://share.firefox.dev/3pEfrDI , so that seems OK then.

(In reply to Alexandre LISSY :gerard-majax from comment #3)

Oh that being said, over the last six months there has been some change that makes utility being started quite much earlier than in the past: bug 1532644 which landed 4 months ago.

That seems a likely culprit. I'll try a trypush with browser.opaqueResponseBlocking.javascriptValidator turned off and see if that makes any difference.

I dont know if that is enough to disable spawing the process or if the pref is check once it is launched

Flipping the pref made the test go green on Linux, so I'll add exceptions based on the pref (as it's not true on release yet).

(In reply to Alexandre LISSY :gerard-majax from comment #5)

Could you verify if it's just the generic utility started for the JS Oracle IPC actor? With proper IPC profiling you should see it for sure.

I don't quite know what you're asking here - how would I verify that? I'll add that I don't have a full Linux build environment handy at the moment.

Same, you should see IPC messages profiling showing this

Does the link above answer this question or did you want to dig into this more?

In terms of a "real" fix - could the IO not be postponed until after the browser is idle? Why is it happening eagerly? And/or, could it not happen on a different thread?

Starting from non main thread was not possible. For the rest, it's specific to the JS Oracle, I dont know.

OK. Maybe :sefeng has ideas.

Flags: needinfo?(lissyx+mozillians)

Keywords: regression

Regressed by: 1532644

Reporter

Updated

•

1 year ago

OS: Unspecified → Linux

Hardware: Unspecified → Desktop

Summary: Utility process has added `ld.so.conf*` mainthread read/close IO in the past 6 months → Utility process has added `ld.so.conf*` mainthread read/close IO due to ORB JS validator instantiating the utility process before main process goes idle

Assignee

Comment 10

•

1 year ago

We clearly have the child hitting the Start JS Oracle marker: https://share.firefox.dev/3o4QmBa

Assignee

Updated

•

1 year ago

Flags: needinfo?(lissyx+mozillians)

Assignee

Comment 11

•

1 year ago

Moreover, if you see the same from ContentParent::LaunchSubprocess I would suspect this is related to something both are using, and not specific to Utility itself.

BugBot [:suhaib / :marco/ :calixte]

Comment 12

•

1 year ago

Set release status flags based on info from the regressing bug 1532644

status-firefox113: --- → affected

status-firefox114: --- → affected

status-firefox-esr102: --- → unaffected

Reporter

Updated

•

1 year ago

status-firefox113: affected → disabled

status-firefox114: affected → disabled

Assignee

Comment 13

•

1 year ago

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

And a local rr submit to pernosco seems to reveal bugs?

Reporter

Comment 14

•

1 year ago

•

(In reply to Alexandre LISSY :gerard-majax from comment #13)

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

The test does not run on debug on infra, bug 1549723.

And a local rr submit to pernosco seems to reveal bugs?

I don't quite know what this means. :-)

Jed Davis [:jld] (on leave until 08-31)

Comment 15

•

1 year ago

(In reply to :Gijs (he/him) from comment #0)

This has turned up that sometime since Dec 2 last year and May 4, we did something that tripped extra reads/stats on /etc/ld.so.conf and various files under that location.

Could that possibly be the ld.so.conf parser in the sandbox code? For content processes, that (and a lot of other filesystem policy construction) is cached, but that doesn't appear to be the case for other process types. If this is a problem, we'd probably need to refactor the ldconfig parser to collect a list of paths instead of side-effecting a policy, so that the work can be shared among different process types.

If this can be profiled with native stacks, or maybe just instrumented with more markers, that should confirm (or deny) my guess.

Assignee

Comment 16

•

1 year ago

(In reply to :Gijs (he/him) from comment #14)

(In reply to Alexandre LISSY :gerard-majax from comment #13)

(In reply to Alexandre LISSY :gerard-majax from comment #8)

I guess you could just re-run this under pernosco and see what triggers precisely?

Looks like pernosco self-service could not repro (likely because the failure is reported on opt and not debug, which makes me wonder why it is not ran or failing on debug ? locally at least it is reproducing on debug).

The test does not run on debug on infra, bug 1549723.

And a local rr submit to pernosco seems to reveal bugs?

I don't quite know what this means. :-)

There was a problem on pernosco, it's fixed and I have access to my session now: https://pernos.co/debug/qXHWQomWpNbfAu9lACTMtQ/index.html#f{m[Awu5,ArFX_,t[AQ,J6e5_,f{e[Awu5,ArAH_,s{af9stDZAA,bTQ,uAcbD,oAtkp___/

It's as jed said, sandboxing code.

Flags: needinfo?(sefeng)

Flags: needinfo?(afarre)

BugBot [:suhaib / :marco/ :calixte]

Comment 17

•

1 year ago

:sefeng, since you are the author of the regressor, bug 1532644, could you take a look? Also, could you set the severity field?

For more information, please visit BugBot documentation.

Flags: needinfo?(sefeng)

Assignee

Updated

•

1 year ago

Flags: needinfo?(sefeng)

Assignee

Comment 18

•

1 year ago

Attached file WIP: Bug 1833110 - Cache ldconfig to limit main thread io (obsolete) — Details

Phabricator Automation

Updated

•

1 year ago

Assignee: nobody → lissyx+mozillians

Status: NEW → ASSIGNED

Assignee

Comment 19

•

1 year ago

Attached file Bug 1833110 - Cache ldconfig to limit main thread io r?jld! (obsolete) — Details

Phabricator Automation

Updated

•

1 year ago

Attachment #9333842 - Attachment is obsolete: true

Assignee

Comment 20

•

1 year ago

Attached file Bug 1833110 - Cache ldconfig to limit main thread io r?jld!,Gijs! — Details

Phabricator Automation

Updated

•

1 year ago

Attachment #9333844 - Attachment is obsolete: true

Reporter

Comment 21

•

1 year ago

Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in browser_startup_mainthreadio.js for *ld.so.conf* "before becoming idle". The job won't go orange either way because the entry has ignoreIfUnused set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.

Phabricator Automation

Updated

•

1 year ago

Attachment #9333845 - Attachment description: Bug 1833110 - Cache ldconfig to limit main thread io r?jld! → Bug 1833110 - Cache ldconfig to limit main thread io r?jld!,Gijs!

Assignee

Comment 22

•

1 year ago

(In reply to :Gijs (he/him) from comment #21)

Heads up that bug 1831406 landed so to land the patches here it would be great if we could update the allowlist entries in browser_startup_mainthreadio.js for *ld.so.conf* "before becoming idle". The job won't go orange either way because the entry has ignoreIfUnused set, because IME this IO sometimes happens before, sometimes after, the browser going idle. I imagine that has to do with whatever logic is being used to kick off the ORB JS work.

i've updated the patch to remove it, please r+?

Comment 23

•

1 year ago

Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/893c91960f17
Cache ldconfig to limit main thread io r=jld,Gijs

Noemi Erli[:noemi_erli]

Comment 24

•

1 year ago

Backed out changeset 893c91960f17 (Bug 1833110) for causing leakcheck failures CLOSED TREE

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&collapsedPushes=842382&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=893c91960f1791bd7bf17631a5de43e765dcaa99&selectedTaskRun=CZQ7QY7RRHKMLqm_Hw-Wmg.0

Log: https://treeherder.mozilla.org/logviewer?job_id=416230193&repo=autoland&lineNumber=2806

Backout: https://hg.mozilla.org/integration/autoland/rev/9e33255964f5bf48254eaa27f9f0042034b4a1f0

Flags: needinfo?(lissyx+mozillians)

Assignee

Updated

•

1 year ago

Flags: needinfo?(lissyx+mozillians)

Comment 25

•

1 year ago

Pushed by alissy@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e4261c1b9c83
Cache ldconfig to limit main thread io r=jld,Gijs

Natalia Csoregi [:nataliaCs]

Comment 26

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/e4261c1b9c83

Status: ASSIGNED → RESOLVED

Closed: 1 year ago

status-firefox115: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 115 Branch

You need to log in before you can comment on or make changes to this bug.