Closed Bug 1487410 Opened 6 years ago Closed 5 months ago

Use minidump-stackwalk's JSON output for mozcrash.py

Categories

(Toolkit :: Crash Reporting, enhancement)

enhancement

Tracking

()

RESOLVED DUPLICATE of bug 1809528

People

(Reporter: ted, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

When we encounter Firefox crashes in CI we currently process them using a minidump_stackwalk that we fetch from tooltool, built from https://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/ . This mostly works but the binaries get out of date. I filed bug 1391408 on that. Another unfortunate thing is that minidump_stackwalk just outputs the same human-readable format that the tool in upstream Breakpad does, so we have code in mozcrash that tries to generate a Socorro-style signature and it resorts to parsing the output with regexes:
https://dxr.mozilla.org/mozilla-central/rev/c2e3be6a1dd352b969a45f0b85e87674e24ad284/testing/mozbase/mozcrash/mozcrash/mozcrash.py#264

Additionally, that format loses a bunch of data which is one of the reasons we switched to JSON output in Socorro.

We've got a minidump-analyzer binary in the tree now that we build for use in client-side stackwalking:
https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/minidump-analyzer

It does basically the same work, and outputs JSON that's virtually identical to Socorro's JSON format. I think it would be worthwhile to look into using it instead of minidump_stackwalk for processing crashes in CI.

Here are some things I know for sure we'd need to do:
1) Give minidump-analyzer a "just output JSON" option. Currently it expects to stuff the JSON into the .extra file.
2) Hook up the existing Breakpad .sym file locating+parsing code in minidump-analyzer. It currently expects to do its work without symbols.
3) Import the `HTTPSymbolSupplier` code I wrote for stackwalk-http and hook it up in minidump-analyzer:
https://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/file/51e4725ffad4/http_symbol_supplier.cc/tip/http_symbol_supplier.cc

Additionally, the minidump_stackwalk binaries we currently use default to fetching symbols from our symbol server S3 bucket (primarily to get operating system symbols), so we'd want that as well for CI:
https://hg.mozilla.org/users/tmielczarek_mozilla.com/stackwalk-http/file/51e4725ffad4/stackwalk.cc#l111 (we could change this to point at symbols.mo instead of S3 directly nowadays).

4) Modify mozcrash to handle running minidump-analyzer instead of minidump_stackwalk, parse the JSON output, and produce similar human-readable output. We should be able to gracefully handle both binaries by just checking if the output looks like JSON and handling it as such.

Potential upsides:
* Since minidump-analyzer is built from in-tree source on every build it's much easier to keep it up-to-date with the rest of our Breakpad code.
* We could upload the JSON output as an artifact from test runs, so developers could get more information out of crashes in CI without having to download the minidump and process it locally.
* We could hook up willkg's siggen Python module: http://bluesock.org/~willkg/blog/mozilla/siggen_0_2_0.html (discussed in bug 828452) to produce actual Socorro-compatible signatures.
* It would be easier for developers to make changes to the tool since it lives in-tree.
* We'd be using the same stackwalker that we use for client-side stackwalking, instead of stackwalk-http which isn't used anywhere but CI.
* If we get to a place where we can share stackwalking code between Socorro and client-side stackwalking, doing this work ahead of time will let us easily use that same code for crashes in CI.

We're replacing minidump_stackwalk with the oxidized version soon. After we've done that we could switch the output to JSON and we should be able to easily implement what Ted suggested in comment 0.

Depends on: 1741205

🙌 I'm obviously not doing any of the work so you don't have to listen to me, but if you do that it might be useful to tweak the Rust minidump-stackwalk so you can ask it to spit out both the human-readable and JSON output, maybe to separate files or something (IDK, whatever) so that you could have mozcrash parse the JSON to generate a signature (ideally just reusing the existing siggen module) and also put the human-readable output somewhere in the CI logs for humans to read, without having to write code to replicate that output from the JSON format.

Yes last night i was already considering adding a --cyborg=path/to/write/json option

Summary: Investigate using minidump-analyzer as a replacement for minidump_stackwalk for crash stacks in CI → Use minidump-stackwalk's JSON output for mozcrash.py

Repurposing this bug:

It sucks that mozcrash.py parses minidump-stackwalk's human readable output instead of just requesting the machine-readable JSON (which also contains way more details!). It causes problems.

The only thing mildly blocking making this change is that the human-readable output is genuinely nice to have for humans. There are 4 options to resolve this:

  • Ideally: implement cyborg output: https://github.com/luser/rust-minidump/issues/296
  • Lazily: just run minidump-stackwalk twice, once for each mode (wasteful, but at least the disk cache reduces a lot of the waste)
  • Rudely: pass the --pretty flag so the JSON is vaguely human-readable and just dump that in the logs
  • Chaotically: have mozcrash.py synthesize its own human-friendly format from the json (bad allocation of human effort, but viable)
See Also: → 1742712

Having Socorro-compatible JSON here paves the way to finally fix bug 867571 the importance of which cannot be underestimated: fixing it would finally allow us to automatically tie the crashes we see in automation with those we see in the wild.

Implemented --cyborg output in https://github.com/luser/rust-minidump/pull/316

Now just need to implement the mozcrash.py infra (and update the toolchain)

Note: the schema for the json output is """documented""" here: https://github.com/luser/rust-minidump/blob/a224c912df9ddde3a956f9796176b3aacde69b1a/minidump-processor/src/process_state.rs#L590

Everything we need should be in the top-level crashing_thread field.

Assignee: nobody → a.beingessner

This gets us the crash report in a stable machine-readable format which is much
more reliable to parse and analyze.

Also bumps the version of minidump-stackwalk to get the cyborg feature and
some CLI fixes.

This moves signature generation as a responsibility from rust-minidump
to mozcrash.py. In the long-term this is desirable, as it would allow us to
keep the implementation in sync with socorro. But in the short-term the output
is worse until we implement that functionality.

The above patch is semi-WIP. It works, but the signature generation is garbage. This is because the JSON output doesn't give you a signature, it just gives you the parts to synthesize it yourself.

We need to do one of the following to make it decent:

  1. Hand-roll signature generation in mozcrash.py (yuck).
  2. Make minidump-stackwalk include a synthesized signature to the json output (kinda weird, but the machinery is already there for human output. It just wouldn't be in sync with socorro, maybe the best cost+benefit in the short term?).
  3. Use https://github.com/willkg/socorro-siggen to generate signatures (Will has concerns about maintenance/vendoring).
  4. Implement Bug 828452 (add a signature generation web api to socorro) (arguably ideal, since the end-goal is to be in sync with socorro, but a lot of upfront work).

Just before the end of the year, I threw together a half-baked signature generation web api for Socorro. I need to finish and document it. I think I'll get to it in 2022h1. Maybe June.

Is this bug really impactful such that I should shove some things around in my queue and get to it earlier?

No, we lived without it for ages so it's not urgent.

Severity: normal → S3

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: a.beingessner → nobody

This was done in bug 1809528

Status: NEW → RESOLVED
Closed: 5 months ago
Duplicate of bug: 1809528
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: