Closed Bug 1763625 Opened 2 years ago Closed 1 year ago

Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:52:11 in OpenProcessHandle

Categories

(Core :: IPC, defect, P1)

Hardware: All
OS: Linux
Type: defect

Tracking


Status: RESOLVED FIXED
Target Milestone: 112 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox107 --- unaffected
firefox108 --- unaffected
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 --- wontfix
firefox112 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jld)

References

Details

(Keywords: csectype-race, intermittent-failure, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main112+r])

Attachments

(1 file)

This looks familiar.

Group: dom-core-security
Component: DOM: Content Processes → IPC

Based on the stack, the spike in recent failures seems to come from bug 1787718.

Flags: needinfo?(hskupin)
Flags: needinfo?(hskupin)

Hm, ContentParent::Pid()/::GetOsPid() are racy. For example, the stack trace in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS can reach GetOsPid(), which just calls Pid(), and apparently that isn't OK.

jld, could we somehow fix ContentParent::Pid()?
We can back out bug 1787718 for now, but it isn't really the cause; it just makes us call Pid() more often.
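
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the kind of pattern ThreadSanitizer flags here. The class and member names are made up for illustration and are not the actual Gecko code: one thread publishes the child process handle while another reads it with no synchronization.

  // Illustrative only: an unsynchronized read of a process "handle" racing
  // with the write performed on a launcher thread. Compiling with
  // -fsanitize=thread -pthread typically produces a data-race report similar
  // to the one in the log.
  #include <cstdio>
  #include <thread>

  struct FakeProcessHost {
    int mChildProcessHandle = 0;  // written on the launcher thread

    // Analogous to the racy Pid()/GetOsPid() path: a plain read, no lock.
    int Pid() const { return mChildProcessHandle; }
  };

  int main() {
    FakeProcessHost host;
    std::thread launcher([&] { host.mChildProcessHandle = 1234; });  // write
    std::printf("pid = %d\n", host.Pid());                           // racy read
    launcher.join();
    return 0;
  }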

Flags: needinfo?(jld)

Given that there is nothing I can do here, I would suggest that we back out the patch. Aryx, could you please take care of it?

Flags: needinfo?(aryx.bugmail)
Blocks: 1787718
See Also: 1787718

(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #3)

Hm, ContentParent::Pid()/::GetOsPid() are racy. For example, the stack trace in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS can reach GetOsPid(), which just calls Pid(), and apparently that isn't OK.

jld, could we somehow fix ContentParent::Pid()?
We can back out bug 1787718 for now, but it isn't really the cause; it just makes us call Pid() more often.

Jed, do you have some feedback for us?

My idea here is to protect mChildProcessHandle with a lock, and then get the pid under that lock (so, maybe add GeckoChildProcessHost::GetChildProcessId). That will avoid racing on the handle or doing a use-after-close (on Windows; on Unix a “handle” is a pid), but the pid might not refer to the process after the lock is released (if the process exits). If the pid is used only for logging and not to interact with the process, that should be an acceptable risk.
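
A rough sketch of that idea, with hypothetical names (GetChildProcessId and mHandleLock are illustrative, not existing APIs, and the real GeckoChildProcessHost is more involved): guard the handle with a mutex and only read the pid while holding it.

  // Sketch of a lock-protected pid accessor; on Unix the "handle" is just a
  // pid, which keeps the example simple.
  #include <mutex>
  #include <sys/types.h>

  class ChildProcessHostSketch {
   public:
    void SetHandle(pid_t aHandle) {
      std::lock_guard<std::mutex> lock(mHandleLock);
      mChildProcessHandle = aHandle;
    }

    // Safe to call from any thread, but the returned pid may be stale if the
    // child has already exited, so callers should only use it for logging.
    pid_t GetChildProcessId() {
      std::lock_guard<std::mutex> lock(mHandleLock);
      return mChildProcessHandle;
    }

   private:
    std::mutex mHandleLock;
    pid_t mChildProcessHandle = 0;
  };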

Flags: needinfo?(jld)

Interestingly, we have been seeing these assertions/crashes even on mozilla-central since November 29th, and it's always the test browser_test_powerMetrics.js that triggers them.

https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234

Florian recently made some changes in this area in bug 1802361, so those probably triggered this particular crash.

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)

Interestingly, we have been seeing these assertions/crashes even on mozilla-central since November 29th, and it's always the test browser_test_powerMetrics.js that triggers them.

https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234

If I understand these stacks correctly, they just mean that the "IPC I/O Parent" thread called GeckoChildProcessHost::AsyncLaunch after the main thread called ChromeUtils::RequestProcInfo (which called GeckoChildProcessHost::GetAll).

I don't think these calls themselves should be a problem, so it's probably at the GeckoChildProcessHost level that something needs fixing. That seems aligned with what Jed said in comment 7.

Florian recently made some changes in this area in bug 1802361, so those probably triggered this particular crash.

The only explanation I can see is that maybe the test now takes longer to complete.

Flags: needinfo?(florian)
OS: Unspecified → Linux
Hardware: Unspecified → All

browser_test_powerMetrics.js is causing this crash pretty frequently on Linux. Florian, is this something you could at least investigate?

Flags: needinfo?(florian)

I don't think I can do more than what I wrote in comment 9.
If we can't fix this and the failure rate becomes a problem, we might need to just disable the test on TSan builds.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #7)

My idea here is to protect mChildProcessHandle with a lock, and then get the pid under that lock

Is this something that you could easily do, or was there a problem with this idea?

Flags: needinfo?(florian) → needinfo?(jld)
Flags: needinfo?(jld)

Jed said he had a partial patch for this issue, so let's take it.

Assignee: nobody → jld
Priority: P5 → P1

Silly mistake on my part: the old version of ContentParent::Pid had this:

  if (!mSubprocess || !mSubprocess->GetChildProcessHandle()) {
    return -1;
  }

I accidentally dropped the !mSubprocess check, so the new version dereferences a null pointer when mSubprocess is null. Should be easy enough to fix.
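
For illustration, a sketch of the shape the corrected check could take, with the surrounding types stubbed out (the actual patch may differ):

  // Restore the !mSubprocess test before dereferencing the pointer; -1 stays
  // the "no process" sentinel, matching the old snippet quoted above.
  #include <sys/types.h>

  struct SubprocessStub {
    pid_t GetChildProcessId() { return mPid; }  // lock-protected in the real fix
    pid_t mPid = 0;
  };

  struct ContentParentSketch {
    SubprocessStub* mSubprocess = nullptr;

    pid_t Pid() const {
      if (!mSubprocess) {
        return -1;
      }
      pid_t pid = mSubprocess->GetChildProcessId();
      return pid ? pid : -1;
    }
  };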

Flags: needinfo?(jld)
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

The patch landed in nightly and beta is affected.
:jld, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox111 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]
Flags: needinfo?(jld)

It's already ridden the train to beta at this point, but for context: the severity is relatively minor and the patch isn't trivial, so I wasn't in favor of uplift.

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main112+r]
Group: core-security-release