Bug 1555123 (Open) Opened 2 years ago, Updated 1 year ago

Missing stack traces from content process crashes

Categories

(Toolkit :: Crash Reporting, defect, P3)

People

(Reporter: ahal, Unassigned)

References

(Blocks 2 open bugs)

Details

There have been a few bugs recently around content crashes not reporting stack traces (see the blocks list). I figured I'd file a parent bug to collect them all in one place. There's also a similarly titled bug in the "see also" list, but there hasn't been any activity on it in 2 years, so I'm not sure the two causes are related.

In bug 1539449 I modified mozharness such that the string "A content process crashed" causes the task to turn orange no matter what (even if a stack trace is missing). This should at least stop us from accidentally letting content crashes slip through CI. But we still need to figure out why the stacks are missing.
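
For illustration, the mozharness check is essentially a log-string match; a rough sketch (not the actual code from bug 1539449, and the names here are made up):

    # Simplified sketch (not the real mozharness change): force an orange
    # (testfailed) result whenever the content-crash marker appears in the
    # log, even if no symbolicated stack trace was ever emitted.
    CONTENT_CRASH_MARKER = "A content process crashed"

    def worst_status(log_lines, current_status="success"):
        for line in log_lines:
            if CONTENT_CRASH_MARKER in line:
                return "testfailed"
        return current_status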

Here are some STR on try:

$ hg up 5edbe9b1b822                                                          # update to this revision
$ curl https://bug1539449.bmoattachments.org/attachment.cgi?id=9065272 | hg import -   # import the patch attached to bug 1539449
$ ./mach try fuzzy -q "win64asan mochimedia !spi"                             # push the affected win64 ASAN mochitest-media jobs to try

Also from bug 1545856 comment 4:

I'm assuming that the child process has crashed before we set up the crash reporter host. In that case the property bag received by the ipc:content-shutdown notification will contain the "abnormal" field but not the "dumpID" one because no minidump has been generated. That would also explain why there's no stack trace and why there's that failure in reftest.jsm. That should certainly be fixed.

There's also another, unrelated problem: I don't know exactly how reftests work, but there have been issues with other test harnesses in the past because they didn't wait for the crash manager to do its job before removing the expected crash dumps. That would usually lead to hard-to-diagnose races. I'm pretty sure this is not the case here, but in comment 1 Andrew mentioned other cases where he's not getting backtraces, and that might be because the minidump was wiped out too early by the test harness.

Is there some kind of event test harnesses should be waiting for before we start tests? Since this behaviour has been observed in reftest, mochitest and wpt, I'd tend to rule out issues related to the test-harness prematurely cleaning up minidumps.

(In reply to Andrew Halberstadt [:ahal] from comment #0)

Is there some kind of event test harnesses should be waiting for before we start tests? Since this behaviour has been observed in reftest, mochitest and wpt, I'd tend to rule out issues related to the test-harness prematurely cleaning up minidumps.

Services.crashmanager.ensureCrashIsPresent(<dump_id>) returns a promise that is resolved once a crash has been fully processed and is safe to delete. It's only available in the main process though. You can see how I used it to fix a race in mochitests in bug 1393800.

(In reply to Gabriele Svelto [:gsvelto] from comment #1)

Services.crashmanager.ensureCrashIsPresent(<dump_id>) returns a promise that is resolved once a crash has been fully processed and is safe to delete. It's only available in the main process though. You can see how I used it to fix a race in mochitests in bug 1393800.

Sorry, I was more referring to:

I'm assuming that the child process has crashed before we set up the crash reporter host.

I.e., maybe we can delay the start of the tests until after the crash reporter is set up. Though the ensureCrashIsPresent tip is also useful :).

No longer blocks: 1554852
Priority: -- → P3

NI to triage owner: Is it possible to prioritize this bug further? I'm concerned that the discussion of the crashes happening early in the setup process means that the crashes could correlate in such a way that we have a blind spot. Like we'll never see startup crashes or something. I'm also concerned that this bug's importance may be obscured by the fact that 1) most people don't know about this bug or bug 1494799 or its friends so they probably just ignore intermittents without (symbolicated) stacks, and 2) like all the bugs that depend on bug 1494799, the failures just eventually get marked invalid by the https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup heuristics.

Flags: needinfo?(gsvelto)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #3)

NI to triage owner: Is it possible to prioritize this bug further? I'm concerned that the discussion of the crashes happening early in the setup process means that the crashes could correlate in such a way that we have a blind spot. Like we'll never see startup crashes or something. I'm also concerned that this bug's importance may be obscured by the fact that 1) most people don't know about this bug or bug 1494799 or its friends so they probably just ignore intermittents without (symbolicated) stacks, and 2) like all the bugs that depend on bug 1494799, the failures just eventually get marked invalid by the https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup heuristics.

Some things have changed since we first discussed this. The biggest change being the fix for bug 1282776 that ensures crash dumps are generated even for crashes happening very early during content process startup. If we still have cases where we have crashes but no minidumps then the most likely cause is that the harness is not finding the minidumps. There was a recent case of this in bug 1593465 where minidumps were being generated but the harness didn't know where to look for them.

I'm happy to tackle this next week, but I want to be sure I'm looking at the real problem: is something like bug 1593465 a good example of this behavior? Looking at a log I can see a crash being reported but, obviously, no minidump and thus no stack trace.

Flags: needinfo?(gsvelto)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #5)

Bug 1596750 is a WPT example that happened today. Parsed log at https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=276411601&repo=autoland&lineNumber=4727.

ASAN disables the crash reporter; I assume that also disables minidumps?

I've opened about a dozen bugs blocking bug 1494799 and I can find two cases: the first is ASAN builds which, as Eric pointed out, simply have the crash reporter disabled (at compile time) so they can't generate minidumps. The other case, which seems more common, is a crash being reported by Marionette; here's an example:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=272415294&repo=autoland&lineNumber=3755-3756

And here's another:

https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=273266650&repo=autoland&lineNumber=7233-7234

All these failures are happening here:

https://searchfox.org/mozilla-central/rev/b58e44b74ef2b4a44bdfb4140c2565ac852504be/testing/web-platform/tests/tools/wptrunner/wptrunner/executors/executormarionette.py#632-636

Note how Marionette sets the status of the test to CRASH, but it doesn't check whether a crash really happened. I don't know Marionette well enough to be sure, but from what I can tell it's losing the connection with Firefox (which throws the IOError) and just assumes that Firefox crashed. Looking at the logs I don't see a crash happening; in case of a crash you'd normally see some IPC failures and breakpad callbacks, but there aren't any in these logs.
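
For reference, the behaviour I'm describing boils down to something like this (a simplified sketch, not the actual executormarionette.py code; the names are made up):

    # Simplified sketch of the pattern described above, not the real
    # wptrunner code: a lost connection is reported as CRASH without
    # checking whether a minidump (or any crash at all) actually exists.
    def do_test(protocol, test):
        try:
            return protocol.run_test(test)
        except IOError:
            # The connection to Firefox went away; this is assumed to be a
            # crash, with no attempt to verify it against crash dumps.
            return {"status": "CRASH", "message": "Connection to browser lost"}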

Please note that the above references aren't related to Marionette itself; rather, the wpt runner uses Marionette as the executor for wpt tests in Firefox. If no crash handling has been added to executormarionette yet, that could be a very good explanation for this.

James should know more about it.

Flags: needinfo?(james)

At some point we tried checking for actual crash dumps and using that as the basis for setting the status to crash. But we ended up with a lot of intermittents particularly relating to shutdown crashes, to the extent that the change didn't stick. I think there was perhaps a race between the dump being generated and the harness looking for the dump file. It's also unclear how to rely on dump files existing if some configurations don't generate them.
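
A simplified sketch of that kind of dump check (not the change we actually tried; it assumes the harness can use mozcrash's check_for_crashes helper and knows the minidump directory, which may not match how wptrunner is wired up):

    import mozcrash  # assumes the harness already depends on mozbase

    def status_after_disconnect(dump_directory, symbols_path):
        # check_for_crashes symbolicates and logs any dumps it finds and
        # returns a truthy value when at least one dump was present.
        if mozcrash.check_for_crashes(dump_directory, symbols_path=symbols_path):
            return "CRASH"
        # No dump: either no real crash happened, the build can't write dumps
        # (e.g. ASAN), or we raced the dump being written to disk.
        return "CRASH-NO-DUMP"  # hypothetical status name, for illustration only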

I'm very happy to try to improve the wptrunner behaviour here, but I don't know what's both workable and helpful to developers. So suggestions are very welcome.

Flags: needinfo?(james)

(In reply to James Graham [:jgraham] from comment #9)

At some point we tried checking for actual crash dumps and using that as the basis for setting the status to crash. But we ended up with a lot of intermittents particularly relating to shutdown crashes, to the extent that the change didn't stick.

I'm not sure I parse what you're saying here. Do you mean that your attempts to improve the reporting of errors looked, to the sheriffs, like they were responsible for the errors themselves, and so the change got backed out?

I'm very happy to try to improve the wptrunner behaviour here, but I don't know what's both workable and helpful to developers. So suggestions are very welcome.

Not just a WPT thing, but: I think it would be an improvement to normalize it so that TEST-UNEXPECTED-CRASH means there's a stack you can read in the test log and there may also be a minidump available as an artifact.

If we have ASAN builds that are actually crashing without dumping stacks (it's not clear whether comment 7 implies ASAN is flawless and it's Marionette incorrectly labeling IOErrors as crashes), it would help to report a different error, like TEST-UNEXPECTED-ASAN-STACKLESS-CRASH and TEST-UNEXPECTED-STACKLESS-CRASH, or anything that lets people (a rough sketch follows the list below):

  • Easily locate bugs that are indicative of ASAN problems that we can look into and correct without having to look at every crash bug. (It seems reasonable that ASAN really shouldn't have mysterious crashes, and if it does, we should try and address that.)
  • Quickly determine how effective looking at a given bug and its log is going to be, and whether they should bother looking for a crash.
  • Improve the perception of the signal-to-noise ratio of intermittent bugs. Feeling like a high percentage of intermittent bugs are unactionable is a net loss for everyone.
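
Roughly what I have in mind, as a sketch only (the label strings are the proposals from above; the build/stack detection is hand-waved and the function name is made up):

    def crash_label(is_asan_build, have_stack_in_log, have_minidump):
        # Proposal sketch: tell readers up front whether a readable stack exists.
        if have_stack_in_log or have_minidump:
            return "TEST-UNEXPECTED-CRASH"
        if is_asan_build:
            # ASAN builds have the crash reporter disabled, so no minidump is
            # expected, but ASAN itself should normally print a stack.
            return "TEST-UNEXPECTED-ASAN-STACKLESS-CRASH"
        return "TEST-UNEXPECTED-STACKLESS-CRASH"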

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #10)

Not just a WPT thing, but: I think it would be an improvement to normalize it so that TEST-UNEXPECTED-CRASH means there's a stack you can read in the test log and there may also be a minidump available as an artifact.

IIRC there are two ways to get a stack in the log: one is if we hit an assertion or somesuch on debug builds, in which case we print it out directly from Gecko; the other is by parsing a minidump that's been left behind after the test finished.

If we have ASAN builds that are actually crashing without dumping stacks (it's not clear whether comment 7 implies ASAN is flawless and it's Marionette incorrectly labeling IOErrors as crashes)

ASAN builds have the crash reporter disabled so they are unable to write out minidumps.

(In reply to Gabriele Svelto [:gsvelto] from comment #11)

ASAN builds have the crash reporter disabled so they are unable to write out minidumps.

Right, understood.

To clarify, I usually expect ASAN to print a report to stdout/stderr when a crash happens, though. I'm fine with there being no minidump in that case. My question is whether there are actually situations where ASAN crashes and does not print to stdout/stderr. If there are situations where it does not print out stacks, that seems like a problem we should seek to address by fixing bugs in ASAN or in how we build/instrument our code for ASAN.

It wasn't clear to me whether you were saying you've seen ASAN do this, or whether you've only seen Marionette claim things are crashes that aren't.
