Open Bug 1911301 Opened 4 months ago Updated 2 months ago

[meta] Crash reporting life-cycle refactoring

Categories

(Toolkit :: Crash Reporting, task)

People

(Reporter: gsvelto, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: meta)

Attachments

(1 file)

The code that deals with child process crashes and existing crashes has been written mostly in ad-hoc ways and has now accumulated a significant amount of technical debt which we should start shedding. Notably:

  • When a child process crashes we pass the crash ID around in various events, but the code receiving it often has no way to access the crash data itself (like the annotations). As a result, in several places in the code we materialize the path of a crash from its ID and then parse the data we need. This is redundant and error-prone.
  • We have no way of telling if a crash belongs to a certain profile or a certain installation. We dump all crashes in the Crash Reports/pending folder - including ignored ones - and we pick them up from there. This is confusing for users and requires extra ad-hoc code to deal with them. The pending folder was originally intended only for crashes that were about to be submitted, not for storing crashes permanently if they weren't.
  • Error-handling is very poor. When crash submission fails the user is left with little recourse to get rid of the old files. We do have code to delete obsolete crashes that have not been submitted but it's disabled IIRC (it used to be enabled on Firefox OS where storage on the system partition was extremely limited).
  • If the crash machinery wants to add or remove a crash annotation it needs to do so manually, by parsing the .extra file then writing it back. Even inspecting annotations requires parsing past a certain point.
  • A crash report can be accompanied by other files (a memory report or additional minidumps); the code handling this is also completely ad-hoc and fragile.
  • Because of the above, our crash tooling requires its own ad-hoc code to deal with crashes. This has been a pain point for a long time, with every test harness being forced to maintain its own way of tracking and parsing crashes.

These are structural issues that require a better approach to the entire crash report lifecycle. This is a metabug so I'm only listing what I think should happen and we'll put the actual implementation bits into concrete bugs. Here's a list off the top of my head:

  • For child process crashes happening at runtime, create an abstract object that represents the entire crash report. All of the manipulation of the crash report should happen through this object and not via ad-hoc code sprinkled over the sources. This object would own all the associated files, contain the annotations for inspection and possibly provide functionality to alter the crash report before submission. Ideally this should have both C++ and JavaScript bindings, bonus points if we can have Rust too.
  • Create a different abstraction for crash reports that have been left behind and have not been generated by the current session. Or flag the objects so that it's clear that they were pre-existing and haven't been generated right away. We probably want to handle them differently anyway.
  • Update our automation tooling to use these abstractions instead of manually looking for crash reports.
  • Get rid of the CrashManager. It's an additional database we don't really use.
  • Revisit the file movement of a crash report; the pending directory is an artifact of the past. For regular users we can generate crash reports in a temporary folder, then move them to the profile. If we're too early during startup and the profile isn't there yet we can submit them directly from the crash reporter client. Rethink how automation finds the crash reports: we could do this via a proper interface instead of forcing the automation code to poke the paths where they might be stored.
  • Overhaul about:crashes so that it shows crashes for the current profile/installation, and not all the crash reports. Or if it does show all the crash reports make it clear that some do not apply to the current profile/installation.
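
To make the first two items and the file-movement item a bit more concrete, here is a rough sketch of what such an object could look like. It is written in Python purely for illustration (the real thing would be C++/JavaScript/Rust), and every name in it (CrashReport, Origin, load, annotate, move_to) is hypothetical. It also assumes a report is an <id>.dmp minidump next to an <id>.extra file holding the annotations as a JSON object.

```python
import json
import shutil
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path


class Origin(Enum):
    """Whether the report was generated by this session or found on disk."""
    CURRENT_SESSION = "current-session"
    PRE_EXISTING = "pre-existing"


@dataclass
class CrashReport:
    """Hypothetical owner of every file that makes up one crash report."""
    crash_id: str
    minidump: Path
    extra: Path
    origin: Origin = Origin.CURRENT_SESSION
    additional_files: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)

    @classmethod
    def load(cls, directory, crash_id, origin=Origin.CURRENT_SESSION):
        # Assumption: <id>.dmp sits next to <id>.extra, and .extra is JSON.
        extra = Path(directory) / f"{crash_id}.extra"
        report = cls(crash_id, Path(directory) / f"{crash_id}.dmp", extra, origin)
        report.annotations = json.loads(extra.read_text(encoding="utf-8"))
        return report

    def annotate(self, key, value):
        """Add or replace an annotation and persist it to the .extra file."""
        self.annotations[key] = value
        self._save()

    def remove_annotation(self, key):
        """Drop an annotation (if present) and persist the change."""
        self.annotations.pop(key, None)
        self._save()

    def _save(self):
        self.extra.write_text(json.dumps(self.annotations), encoding="utf-8")

    def files(self):
        """All files owned by this report, minidump and .extra first."""
        return [self.minidump, self.extra, *self.additional_files]

    def move_to(self, directory):
        """Move every owned file, e.g. from a temporary folder to the profile."""
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        moved = [Path(shutil.move(str(p), str(directory / p.name)))
                 for p in self.files()]
        self.minidump, self.extra = moved[0], moved[1]
        self.additional_files = moved[2:]

    def delete(self):
        """Remove every owned file; files that are already gone are ignored."""
        for path in self.files():
            path.unlink(missing_ok=True)
```

With something like this, code that receives a crash event would get the object instead of a bare ID, and automation could call load(), files() or delete() rather than poking at paths. Reports left behind by an earlier session would be loaded with origin=Origin.PRE_EXISTING so callers can treat them differently.
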
Depends on: 1884947

RE: about:crashes, the overhaul should also include some improvements like:

  • displaying some basic properties like the process type of the crash and how it was submitted (automatic vs user),
  • displaying whether there was a prior submission which failed (and the failure reason!),
  • a delete button for unsubmitted crashes.
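
As a rough illustration of the data about:crashes would need per entry to show the properties above, here is a small Python sketch. The CrashEntry record, its field names and the describe() helper are all hypothetical and not existing code. It also shows the "list everything but mark crashes from other profiles" option from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashEntry:
    """Hypothetical per-crash record backing one row of about:crashes."""
    crash_id: str
    process_type: str                        # e.g. "main", "content", "gpu"
    profile_id: str                          # profile that produced the crash
    submitted: bool = False
    submission_method: Optional[str] = None  # "automatic" or "user"
    last_failure: Optional[str] = None       # reason the last attempt failed

    @property
    def can_delete(self):
        # Only unsubmitted crashes get a delete button.
        return not self.submitted


def describe(entry, current_profile_id):
    """Render one row, marking crashes that belong to another profile."""
    if entry.submitted:
        status = f"submitted ({entry.submission_method})"
    elif entry.last_failure:
        status = f"submission failed: {entry.last_failure}"
    else:
        status = "not submitted"
    scope = "" if entry.profile_id == current_profile_id else " [other profile]"
    return f"{entry.crash_id} {entry.process_type} {status}{scope}"
```
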

I made a diagram of an improved crash report lifecycle flow at a high level.

This sounds like a great project.

I haven't thoroughly thought this through (that's a lot of "ough"es), but I often want to see all my crash reports in one spot. And if we have a user whom we have asked to do some testing, and they might have multiple profiles, it is again advantageous to see all the crash reports in one list. A user frustrated with a problem might also intentionally create new profiles or install in new program directories (and therefore get a new profile).

So I think I'd like to have "show all the crash reports make it clear that some do not apply to the current profile/installation."
