Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/ipc/chromium/src/base/process_util_posix.cc:52:11 in OpenProcessHandle
Categories
(Core :: IPC, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr102 | --- | unaffected |
firefox107 | --- | unaffected |
firefox108 | --- | unaffected |
firefox109 | --- | wontfix |
firefox110 | --- | wontfix |
firefox111 | --- | wontfix |
firefox112 | --- | fixed |
People
(Reporter: intermittent-bug-filer, Assigned: jld)
References
Details
(Keywords: csectype-race, intermittent-failure, sec-moderate, Whiteboard: [post-critsmash-triage][adv-main112+r])
Attachments
(1 file)
Filed by: alissy [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=373605066&repo=try
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Pr8csIIpSu6wLQQPTe7F9Q/runs/0/artifacts/public/logs/live_backing.log
Comment 1•2 years ago
|
||
This looks familiar.
Updated•2 years ago
|
Comment 2•2 years ago
|
||
The spike in recent failures seems to be from bug 1787718 based on the stack.
Updated•2 years ago
|
Comment 3•2 years ago
|
||
Hm, ContentParent::Pid()/::GetOsPid() are racy. The stack trace for example in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS could access GetOsPid() which then just calls Pid() and apparently that isn't ok.
jld, could we somehow fix ContentParent::Pid()?
We can backout bug 1787718 for now, but that isn't really the cause, it just makes us call Pid() more often.
Comment 4•2 years ago
|
||
Given that there is nothing that I can do here, I would suggest that we backout the patch. Aryx, could you please take care of it?
Comment 5•2 years ago
|
||
The patch on bug 1787718 got backed out:
https://hg.mozilla.org/integration/autoland/rev/972d0006abb1b5160fa87e1754f2f1d23975c3d9
Updated•2 years ago
|
Comment 6•2 years ago
|
||
(In reply to Olli Pettay [:smaug][bugs@pettay.fi] from comment #3)
Hm, ContentParent::Pid()/::GetOsPid() are racy. The stack trace for example in
https://treeherder.mozilla.org/logviewer?job_id=389272588&repo=autoland&lineNumber=2404 shows a code path where any JS could access GetOsPid() which then just calls Pid() and apparently that isn't ok.jld, could we somehow fix ContentParent::Pid()?
We can backout bug 1787718 for now, but that isn't really the cause, it just makes us call Pid() more often.
Jed, do you have some feedback for us?
Assignee | ||
Comment 7•2 years ago
|
||
My idea here is to protect mChildProcessHandle
with a lock, and then get the pid under that lock (so, maybe add GeckoChildProcessHost::GetChildProcessId
). That will avoid racing on the handle or doing a use-after-close (on Windows; on Unix a “handle” is a pid), but the pid might not refer to the process after the lock is released (if the process exits). If the pid is used only for logging and not to interact with the process, that should be an acceptable risk.
Comment 8•1 year ago
|
||
Interestingly we can see those assertions / crashes now even on mozilla-central since November 29th and it's always the test browser_test_powerMetrics.js which is causing it.
https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234
Florian made some changes here recently on bug 1802361, so probably those triggered this particular crash.
Comment 9•1 year ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #8)
Interestingly we can see those assertions / crashes now even on mozilla-central since November 29th and it's always the test browser_test_powerMetrics.js which is causing it.
https://treeherder.mozilla.org/logviewer?job_id=398502658&repo=mozilla-central&lineNumber=4234
If I understand these stacks, they just mean the "IPC I/O Parent" thread called GeckoChildProcessHost::AsyncLaunch
after the main thread called ChromeUtils::RequestProcInfo
(which called GeckoChildProcessHost::GetAll
).
I don't think these calls should be a problem, so it's probably at the GeckoChildProcessHost
level that something needs fixing. And that seems aligned with what Jed said in comment 7.
Florian made some changes here recently on bug 1802361, so probably those triggered this particular crash.
The only explanation I can see is that maybe the test now takes longer to complete.
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Comment 10•1 year ago
|
||
browser_test_powerMetrics.js is causing this crash pretty frequently on Linux. Florian, would it be something you could at least investigate?
Comment 11•1 year ago
|
||
I don't think I can do more than what I wrote in comment 9.
If we can't fix this and the failure rate becomes a problem, we might need to just disable the test on tsan.
(In reply to Jed Davis [:jld] ⟨⏰|UTC-8⟩ ⟦he/him⟧ from comment #7)
My idea here is to protect
mChildProcessHandle
with a lock, and then get the pid under that lock
Is this something that you could easily do, or was there a problem with this idea?
Updated•1 year ago
|
Comment 12•1 year ago
|
||
Jed said he had a partial patch for this issue, so let's take it.
Assignee | ||
Comment 13•1 year ago
|
||
Comment 14•1 year ago
|
||
Landed: https://hg.mozilla.org/integration/autoland/rev/dc92467cea22653feff3eb52e706d664b84f3394
Backed out for causing memory-related crashes and failures
backout: https://hg.mozilla.org/integration/autoland/rev/73dfb11f4338225d2e930e583ca37a3ad01f940c
push: https://treeherder.mozilla.org/jobs?repo=autoland&group_state=expanded&revision=dc92467cea22653feff3eb52e706d664b84f3394
failure log: https://treeherder.mozilla.org/logviewer?job_id=407189999&repo=autoland&lineNumber=4099
Assignee | ||
Comment 15•1 year ago
|
||
Silly mistake on my part: the old version of ContentParent::Pid
had this:
if (!mSubprocess || !mSubprocess->GetChildProcessHandle()) {
return -1;
}
And I accidentally dropped the !mSubprocess
check, so it dereferences a null pointer in that case. Should be easy enough to fix.
Comment 16•1 year ago
|
||
Btw this got re-landed as: https://hg.mozilla.org/integration/autoland/rev/72094220e38c3726f8a24a6306429c17882399c7
Comment 17•1 year ago
|
||
Comment 18•1 year ago
|
||
The patch landed in nightly and beta is affected.
:jld, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.
- If no, please set
status-firefox111
towontfix
.
For more information, please visit auto_nag documentation.
Updated•1 year ago
|
Updated•1 year ago
|
Assignee | ||
Comment 19•1 year ago
|
||
It's already ridden the train to beta at this point, but for context: The severity is relatively minor and the patch isn't trivial, so I wasn't in favor of uplift.
Updated•1 year ago
|
Updated•7 months ago
|
Description
•