Open Bug 1323979 Opened 3 years ago Updated 9 months ago

Re-engineer the crash handling code to make the crash dump lifecycle robust


(Firefox :: General, defect)

Not set




(Reporter: gsvelto, Unassigned)



(Whiteboard: [fce-active-legacy])


(1 file, 3 obsolete files)

While working on bug 1293656 I've realized that our crash-handling code suffers from a number of issues related to how minidump files are handled and how crash dumps are tracked.

In short the code touching the crash dump files does not have real synchronization points which can cause both races and inconsistencies when handling crashes. Here's a few practical examples of how this manifests itself:

- The code that records crashes with the CrashManager and the code that displays the "tab crashed" page are triggered by different events and are not synchronized. This could lead to code in the "tab crashed" page acting on crashes that hadn't yet been recorded in the CrashManager.

- Deletion of content process minidumps isn't really handled anywhere if the crash is not submitted. We have some code that cleaned up old content crashes but it's been activated only in B2G so I suspect that unreported content crashes are never deleted from client machines.

- Most code dealing with crashes is handed the crash id, it then proceed to access the related files on its own. Since we now have multiple places where the crash dump files are touched this can lead to issues that I've seen most commonly in tests, where a test might try to delete a minidump file that's still been acted upon by other parts of the codebase.

This issues weren't apparent before for a number of reasons: the biggest one is that content crashes are "new" in Firefox. Most crash-management code was written with browser crashes in mind; and many content-crash-specific changes were done for B2G but the related code was never enabled in Firefox.

I've deployed some ad-hoc fixes in bug 1293656 to fix the most glaring issues but a re-design of this part is required to make the entire process robust.

My idea would be to make the CrashManager the sole source of truth regarding which crashes are in the system, etc... And having all consumers listen to custom events fired by the CrashManager (crash-added, crash-removed, etc...) instead of the mish-mash of events we now use (which include manually injecting code triggered by CrashReporterHost::NotifyCrashService(), listening to the "oop-browser-crashed", "oop-frameloader-crashed", "ipc:content-shutdown", or "ipc:browser-destroyed", etc...).

The crash dump files themselves could be abstracted within an object that knows how to find the related files (we've got plenty of ad-hoc, incoherent code for that too) and can be used to make the files' life-cycle predictable WRT file access and deletion.
Whiteboard: [fce-active]
I've started to address some of the issues I mentioned above in bug 1322611 where I've added two of the methods we have to access minidump and extra files to the nsICrashReporter interface. My overall goal for this is roughly the following:

- Augment the crash manager so that it has the necessary functionality to track all crashes in the ssytem as well as send events about crashes being added to it.

- Change existing code to use the crash manager to track crashes (wait for them, retrieve the crash list, etc...)

- Change existing code to use the methods provided from the nsICrashReporter to access the actual files. This will ensure that all the code that manipulates crash reports accesses a crash file through the same code and thus using the same path.

I still haven't figured out how to deal with cleaning up crash files. I intend to enable the mechanism we had in B2G to remove old unreported content crashes (otherwise they stick around forever) but it's current implementation is far from optimal and I would like to have something that integrates well with the rest of the crash report lifecycle.
I've resumed working on this and I've discovered yet another issue I hadn't encountered yet: we have two distinct code paths to deal with regular content crashes and plugin crashes that overlap.

When a plugin crashes we use the regular code for informing the crash manager about the crash (see [1] and [2]).

At the same time however we send a separate "PluginCrashed" DOM event which is triggered either by PluginModuleParent::ActorDestroy() [3] or GMPParent::ActorDestroy()[4] . Before sending the "PluginCrashed" DOM event we also send a regular "plugin-crashed" event from [5] in the case of NPAPI plugins and a "gmp-plugin-crashed" event in the case of GMP plugins [6]. ContentCrashHandlers.jsm which contains the logic to submit plugin crash reports listens to both the "plugin-crashed" and "gmp-plugin-crashed" events and records the crash ID they carry. The code that does report the crash - in PluginCrashReporter.submitCrashReport() - is invoked via the crash submission form which is opened in response to a "PluginCrashed" event within the PluginContent object. Since the two chains of event depend on different events that are launched in parallel they can cause a race and the code in [7] seems to account for this possibility.

To sum it up: this is complicated. I'd very much like for all the code that cares about crashes to go through a single chain of events instead of three different ones but since the different events carry slightly different information and have slightly different semantics this requires quite a bit of thought.

Assignee: nobody → gsvelto
Depends on: 1362041
Attached patch [PATCH] WIP (obsolete) — Splinter Review
Here's a WIP of what I'm trying to do. In a nutshell:

- I'm providing a "process-crashed" event that covers *all* crashes and is sent only after the crash manager has fully recorded the occurrence. This can be used by the current "ipc:content-shutdown" and "plugin-crashed" listeners to gather information about the crash. The information available is a lot richer than with the previous events since the stack trace can be inspected, etc... This paves the way for future user-facing work like identifying known crash signatures and informing the user about them.

- I'm re-working some of the nsICrashReporter interfaces so access to the minidump files is always mediated by the nsICrashReporter implementer. This will allow me to get rid of tons of copy-pasted code to handle minidumps & friends.

- I'm also refactoring some of the code using the crash reporting services to make it more succinct and readable.

This patch is still incomplete and breaks a bunch of tests (but there's a few that already work with the new interfaces).
Attached patch [PATCH] WIP v2 (obsolete) — Splinter Review
Far more refactoring going on in this WIP, with the CrashService becoming the point of synchronization between the various bits of code that need to deal with crashes. I've split all the code that didn't deal with HTTP submission out of CrashSubmit and moved it in CrashService and I'm progressively removing all the code that handles crash files on its own. This way the crash lifecycle will be handled all in one place. It's not fully working yet though, quite a few tests are broken.
Attachment #8876718 - Attachment is obsolete: true
Attached patch [PATCH] WIP v3Splinter Review
Tests are now passing locally, I've started running this on try and once I've ironed out all the remaining issues I'll split the patch up so that it'll be easier to review it.

This ended up being a rather large patch, fortunately it removes quite a bit more code than it adds.
Attachment #8882826 - Attachment is obsolete: true
This is becoming a *very* large patch so it's better to start splitting it up since it covers a bunch of different things which can be reviewed and landed separately (also reducing the risks involved).
Depends on: 1393435
Whoops, I've put the wrong bug in the commit name :-/
Attachment #8900737 - Attachment is obsolete: true
Attachment #8900737 - Flags: review?(mconley)
Depends on: 1393800
Whiteboard: [fce-active] → [fce-active-legacy]
Assignee: gsvelto → nobody
You need to log in before you can comment on or make changes to this bug.