Closed Bug 1624467 Opened 4 years ago Closed 3 years ago

TakeMinidumpForChild() and OnChildProcessDumpRequested() can race

Categories

(Toolkit :: Crash Reporting, defect, P2)

Unspecified
Other
defect

Tracking

()

RESOLVED FIXED
85 Branch
Tracking Status
firefox-esr78 85+ fixed
firefox83 --- wontfix
firefox84 + fixed
firefox85 + fixed

People

(Reporter: gsvelto, Assigned: gsvelto)

References

Details

Attachments

(2 files)

The code dealing with content process crash reports assumes that OnChildProcessDumpRequested() always runs before TakeMinidumpForChild(). I also assumed this was the case but apparently it's not.

Using rr chaos mode I could trick the main process into trying to take a minidump for which OnChildProcessDumpRequested() had not been called yet leading to an obscure test failure. It seems that my patch for bug 1614933 caused the main process to act quickly enough for this to happen in automation.

Fixing this requires putting some sort of synchronization between the two - besides the lock they already use - but I'm not entirely sure how to handle it because the minidump generation may fail and so the main process cannot simply assume there's always going to be a minidump to be taken.

Priority: -- → P1

Taking this because it needs to be fixed, but after staring at the relevant Breakpad code for a while I still cannot possibly fathom how this can happen.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

It turns out this is a Linux-specific issue and not what's causing the failure in tests in bug 1614933 so I'm leaving this for now. Hopefully this code will be removed soon and we'll just close this one.

Assignee: gsvelto → nobody
Status: ASSIGNED → NEW
OS: Unspecified → Linux
Priority: P1 → P4
No longer blocks: 1614933
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED

This affects both macOS and Linux, but not Windows.

OS: Linux → Other

This patch provides fixes for two different issues that would often interact
leading to larger problems. The first one is a problem with how exception-time
crash annotations are sent to the main process. We did a raw call to write()
on a pipe assuming that it was blocking and it would only return after all the
data was written. It turns out the pipe is non-blocking and we could
accidentally send a truncated annotation because we weren't checking write()'s
return value.

Solving this issue highlighted a second problem: in bug 1614933 we discovered
a potential deadlock in Windows-specific code that could happen if we would
block writing crash annotations in the exception handler. This flaw is also
present in the Linux and macOS paths but was not visible because of the
non-blocking nature of the writes. Since the child process would not stop it
couldn't possible deadlock. However, the relevant code is racy: if the child
process' pipe would start rejecting writes early the process could race past
the main process leading to a crash. The sequence of events in the racy case
would be the following:

  • We hit an exception in the child process, we enter the exception handler and
    signal the main process to write a minidump
  • The crash generation thread in the main process is woken up, writes the
    minidump then signals the child process it can continue
  • The child process writes out the crash annotations then exits
  • At this stage the crash generation thread in the main process should have
    picked up the annotations and stored the crash report in the pid-to-minidump
    table. But let's assume the scheduler doesn't wake up this thread for now.
  • The main thread in the main process wakes up in response to the child
    process shutdown, it will try to grab the minidump
  • The minidump is not available yet because the crash generation client hasn't
    run yet and there's nothing preventing the main thread to race past it: the
    main process crashes trying to access the NULL pointer

To solve this second issue the act of transfering the exception-time
annotations is decoupled from the act of writing the minidump in both the main
process and the child process. This way when the child process is forced to
wait for the main process to act on the minidump before the crash annotations
are written out, and by the time the child process quits the main process has
reliably acquired the lock to the pid-to-minidump table so that the race can't
happen anymore.

Note: to implement the second change the child process exception handler
should execute the minidump callback after it has request the creation of a
minidump. For some reason this was implemented in breakpad only for the
in-process crash case, not the out-of-process one. I modified the relevant
code in the exception handler to invoke the callback in both cases.

Attachment #9189728 - Attachment description: Bug 1624467 - Fix an issue in crash report generation that could lead to truncated annotations or a browser crash r=KrisWright → Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
Priority: P4 → P2
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/70a5d9defc6b
Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 85 Branch

Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright

Beta/Release Uplift Approval Request

  • User impact if declined: When one of the content process crashes - even in the background - the whole browser may crash. This affects both Linux and macOS users but it impacts ARM64/macOS a lot more than others.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1666383
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): The change impacts a very delicate asynchronous code path that spans three different threads in two different processes and part of it runs in an exception handler. While this is covered by both unit and integration tests we can't be 100% sure that it won't cause regressions. That being said since it fixes a major crasher the risk is mitigated by solving an issue we're already aware of. Also a similar change was already done to the corresponding code in Windows and we didn't run into issues there.
  • String changes made/needed: none
Attachment #9189728 - Flags: approval-mozilla-beta?

Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright

A bit nerve wracking to take this late in the cycle, but I think the Pros outweigh the Cons here. Approved for 84.0b8.

On a related note - is this something we should eventually (i.e. not this cycle) consider taking on ESR78 also?

Flags: needinfo?(gsvelto)
Attachment #9189728 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Yes, this should go to ESR together with bug 1666383. The crash volume isn't huge there but it's significant nonetheless.

Flags: needinfo?(gsvelto)

Thanks, we can revisit that next cycle.

Blocks: 1663088

Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This bug can lead to full browser crashes and it seem to have an oversized impact on macOS.
  • User impact if declined: The browser may crash right after a content process has crashed.
  • Fix Landed on Version: 84
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Same as in comment 7, but the patch has already shipped to the release channel where we didn't seem to encounter regressions.
  • String or UUID changes made by this patch: none
Attachment #9189728 - Flags: approval-mozilla-esr78?
See Also: → 1683579

Comment on attachment 9189728 [details]
Bug 1624467 - Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright

This needs rebasing for ESR78. Please attach a rebased patch and request approval on that when ready.

Flags: needinfo?(gsvelto)
Attachment #9189728 - Flags: approval-mozilla-esr78?

Here's a rebased patch suitable for ESR78. See comment 12 for the approval request.

Flags: needinfo?(gsvelto)
Attachment #9194382 - Flags: approval-mozilla-esr78?

Comment on attachment 9194382 [details]
[PATCH esr78] Fix a race in child process crash generation which could lead to a full browser crash r=KrisWright

macOS crash reporter reliability improvement. Approved for 78.7esr.

Attachment #9194382 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: