TakeMinidumpForChild() and OnChildProcessDumpRequested() can race
Categories
(Toolkit :: Crash Reporting, defect, P2)
Tracking
()
People
(Reporter: gsvelto, Assigned: gsvelto)
References
Details
Attachments
(2 files)
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
18.66 KB,
text/plain
|
RyanVM
:
approval-mozilla-esr78+
|
Details |
The code dealing with content process crash reports assumes that OnChildProcessDumpRequested()
always runs before TakeMinidumpForChild()
. I also assumed this was the case but apparently it's not.
Using rr chaos mode I could trick the main process into trying to take a minidump for which OnChildProcessDumpRequested()
had not been called yet leading to an obscure test failure. It seems that my patch for bug 1614933 caused the main process to act quickly enough for this to happen in automation.
Fixing this requires putting some sort of synchronization between the two - besides the lock they already use - but I'm not entirely sure how to handle it because the minidump generation may fail and so the main process cannot simply assume there's always going to be a minidump to be taken.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 1•5 years ago
|
||
Taking this because it needs to be fixed, but after staring at the relevant Breakpad code for a while I still cannot possibly fathom how this can happen.
Assignee | ||
Comment 2•5 years ago
|
||
It turns out this is a Linux-specific issue and not what's causing the failure in tests in bug 1614933 so I'm leaving this for now. Hopefully this code will be removed soon and we'll just close this one.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 3•4 years ago
|
||
This affects both macOS and Linux, but not Windows.
Assignee | ||
Comment 4•4 years ago
|
||
This patch provides fixes for two different issues that would often interact
leading to larger problems. The first one is a problem with how exception-time
crash annotations are sent to the main process. We did a raw call to write()
on a pipe assuming that it was blocking and it would only return after all the
data was written. It turns out the pipe is non-blocking and we could
accidentally send a truncated annotation because we weren't checking write()'s
return value.
Solving this issue highlighted a second problem: in bug 1614933 we discovered
a potential deadlock in Windows-specific code that could happen if we would
block writing crash annotations in the exception handler. This flaw is also
present in the Linux and macOS paths but was not visible because of the
non-blocking nature of the writes. Since the child process would not stop it
couldn't possible deadlock. However, the relevant code is racy: if the child
process' pipe would start rejecting writes early the process could race past
the main process leading to a crash. The sequence of events in the racy case
would be the following:
- We hit an exception in the child process, we enter the exception handler and
signal the main process to write a minidump - The crash generation thread in the main process is woken up, writes the
minidump then signals the child process it can continue - The child process writes out the crash annotations then exits
- At this stage the crash generation thread in the main process should have
picked up the annotations and stored the crash report in the pid-to-minidump
table. But let's assume the scheduler doesn't wake up this thread for now. - The main thread in the main process wakes up in response to the child
process shutdown, it will try to grab the minidump - The minidump is not available yet because the crash generation client hasn't
run yet and there's nothing preventing the main thread to race past it: the
main process crashes trying to access the NULL pointer
To solve this second issue the act of transfering the exception-time
annotations is decoupled from the act of writing the minidump in both the main
process and the child process. This way when the child process is forced to
wait for the main process to act on the minidump before the crash annotations
are written out, and by the time the child process quits the main process has
reliably acquired the lock to the pid-to-minidump table so that the race can't
happen anymore.
Note: to implement the second change the child process exception handler
should execute the minidump callback after it has request the creation of a
minidump. For some reason this was implemented in breakpad only for the
in-process crash case, not the out-of-process one. I modified the relevant
code in the exception handler to invoke the callback in both cases.
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 6•4 years ago
|
||
bugherder |
Assignee | ||
Comment 7•4 years ago
|
||
Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
Beta/Release Uplift Approval Request
- User impact if declined: When one of the content process crashes - even in the background - the whole browser may crash. This affects both Linux and macOS users but it impacts ARM64/macOS a lot more than others.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1666383
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): The change impacts a very delicate asynchronous code path that spans three different threads in two different processes and part of it runs in an exception handler. While this is covered by both unit and integration tests we can't be 100% sure that it won't cause regressions. That being said since it fixes a major crasher the risk is mitigated by solving an issue we're already aware of. Also a similar change was already done to the corresponding code in Windows and we didn't run into issues there.
- String changes made/needed: none
Updated•4 years ago
|
Comment 8•4 years ago
|
||
Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
A bit nerve wracking to take this late in the cycle, but I think the Pros outweigh the Cons here. Approved for 84.0b8.
On a related note - is this something we should eventually (i.e. not this cycle) consider taking on ESR78 also?
Assignee | ||
Comment 9•4 years ago
|
||
Yes, this should go to ESR together with bug 1666383. The crash volume isn't huge there but it's significant nonetheless.
Comment 10•4 years ago
|
||
Thanks, we can revisit that next cycle.
Comment 11•4 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 12•4 years ago
|
||
Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug can lead to full browser crashes and it seem to have an oversized impact on macOS.
- User impact if declined: The browser may crash right after a content process has crashed.
- Fix Landed on Version: 84
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Same as in comment 7, but the patch has already shipped to the release channel where we didn't seem to encounter regressions.
- String or UUID changes made by this patch: none
Comment 13•4 years ago
|
||
Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
This needs rebasing for ESR78. Please attach a rebased patch and request approval on that when ready.
Assignee | ||
Comment 14•4 years ago
|
||
Here's a rebased patch suitable for ESR78. See comment 12 for the approval request.
Comment 15•4 years ago
|
||
Comment on attachment 9194382 [details]
[PATCH esr78] Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
macOS crash reporter reliability improvement. Approved for 78.7esr.
Comment 16•4 years ago
|
||
bugherder uplift |
Description
•