Closed Bug 1687987 Opened 2 years ago Closed 4 months ago

restructure raw crash

Categories

(Socorro :: Antenna, enhancement, P2)

enhancement

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

When looking at the data in a raw crash, it's unclear which bits come from crash annotations and which bits are added by the collector during crash ingestion.

Telemetry ingestion puts the ingestion-bits at the root of the item and the raw contents of what it received in payload. If we were starting fresh, I would do something like that. However, we're long past that point and there's probably tooling around the current shape of raw crash reports.

Instead, I think it's ok if we prefix Socorro-added things. Nothing should be changing the raw crash except for the collector, so maybe collector_ or socorro_ is ok. The former is clearer from the crash ingestion point of vies. The latter might be clearer to people who don't work on crash ingestion.

This bug covers figuring out which way to go and then figuring out how to implement it.

We should also document what the collector is adding somewhere. Gabriele suggests adding it to CrashAnnotations.yaml in-tree, but I don't think that's the right place.

Here's a non-exhaustive list of things the collector currently adds:

  • payload: either "json" or "multipart" -- denotes how the crash annotations were encoded
  • payload_compressed: either "0" or "1" -- denotes whether the crash report payload was compressed or not
  • legacy_processing: the result of throttling (not sure if this is actually used anymore)
  • throttle_rate: the throttle rate for the rule that handled this crash report
  • collector_notes: list of strings capturing what the collector did to this crash report
  • submitted_timestamp: the timestamp in UTC the collector collected this crash report; in isoformat
  • timestamp: another timestamp for when the collector collected this crash report; in seconds since epoch -- used to figure out how long it took in wall time to collect and save this crash report
  • dump_checksums: map of dump_name -> sha256 hash in hex for dumps
  • MinidumpSha256Hash: hash of upload_file_minidump dump -- this should match the hash in the crash pings in Telemetry
  • uuid: the crash id generated by the collector
  • type_tag: the dump id prefix which is bp -- this is configured in the collector; we've never used anything else and i don't think anything uses this, so maybe we can nix it

I suggest the first pass goes like this:

  • We look into removing type_tag. I don't think it's helping.
  • We look into removing legacy_processing. I don't think this is helpful anymore since we either accept for processing or reject now.
  • We look into removing timestamp and adjusting the code to base the "how long did it take to collect this crash?" metric on submitted_timestamp.
  • We look into removing throttle_rate. I don't think that's interesting.

After that, we prefix everything except MinidumpSha256Hash and uuid with collector_. I don't think I want to prefix those two fields because I know they get used elsewhere and that will likely get messy.

After that, we write up a document with all these additional bits.

Grabbing this to do some easy changes I can do now.

type_tag isn't used anywhere. I'm removing it. It's the only thing that uses the dump_id_prefix configuration variable, so we should nix that, too.

legacy_processing was useful when we needed to distinguish between ACCEPT and DEFER, but we no longer DEFER crash reports, so we can nix this field.

throttle_rate isn't used anywhere, so I'm going to remove that, too.

timestamp is only used in the collector, so I'm going to rework that to reuse submitted_timestamp.

For all of these, we need to remove saving the data from antenna and then we need to remove display and search bits in socorro. So that's two steps.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

This is the Antenna side of the changes discussed in comment #2:

willkg merged PR #669: "bug #1687987: remove unhelpful added fields from raw crash" in 78da869.

This is the Socorro side of the changes discussed in comment #2.

willkg merged PR #5658: "bug 1687987: stop using removed fields" in 389dde4.

Next step is to prefix everything except MinidumpSha256Hash and uuid with collector_. I don't think I want to prefix those two fields because I know they get used elsewhere and that will likely get messy. Also, I need to spend some time looking into whether we can just switch to the new field names or whether we have to do 6 months of having both versions of the field, then finish cutting over when the data has expired.

The first pass was pushed out in antenna in bug #1688654 and socorro in bug #1688681.

Bumping bugs off my queue because I'm not going to get to them any time soon.

Assignee: willkg → nobody
Status: ASSIGNED → NEW

Some additional things I was thinking about related to this:

  1. It's nice that the raw crash is almost the same data as the crash annotations sent in the payload. Taking a raw crash and re-submitting it is straightforward now. It should continue to be straight-forward.
  2. Earlier, I was thinking MinidumpSha256Hash should be part of the annotations even though it's computed by the collector. I don't think we should do that. I think everything the collector adds should be segregated.
  3. Restructuring the raw crash is worth writing a proposal for. So I'm going to fix the summary of this bug and write a proposal.
Summary: change collector-added bits to raw_crash to have common prefix → restructure raw crash
Assignee: nobody → willkg
Status: NEW → ASSIGNED

I wrote a proposal and shopped it around. One of the ideas behind it was to conform to existing conventions.

Glean has this schema for it's payload:

https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/main/schemas/glean/glean/glean.1.schema.json

Looking at org_mozilla_firefox.baseline as an example, the items in the Glean ping are at the top level, there's a "metadata", and some other things added at ingestion. Data engineers don't care where things come from in the way I thought they might.

Given that, I think I want to keep the raw crash mostly as is, except that I do want to:

  1. move collector_notes, dump_checksums, payload, and payload_compressed into a metadata section
  2. remove MinidumpSha256Hash entirely--it's in dump_checksums and the processor can pull it out from there
  3. add a version field indicating the raw crash structure version in case we ever want to change it in a backwards-incompatible way in the future

So we'd end up with something like this (as an example):

{
  "BuildID": "20220902183841",
  "Comments": "willkg crash test with inline functions 20220902_1544",
  "CrashTime": "1662147837",
  "InstallTime": "1662147786",
  "MozCrashReason": "MOZ_CRASH(Crash via about:crashparent)",
  "Notes": "Ubuntu 22.04.1 LTSFP(D00-L1000-W00000000-T010) WR? WR+ EGL? EGL- GL Context? GL Context+ ",
  "ProductID": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
  "ProductName": "Firefox",
  "ReleaseChannel": "default",
  "StartupTime": "1662147817",
  "SubmittedFrom": "Client",
  "useragent_locale": "en-US",
  "metadata": {
    "collector_notes": [],
    "dump_checksums": {
      "upload_file_minidump": "639febfeaca3579d58f098a5fa5e8d6276323cf88a8026060be920691dd612f0"
    },
    "payload": "json",
    "payload_compressed": "0",
  },
  "submitted_timestamp": "2022-09-02T19:44:32.762212+00:00",
  "uuid": "20aec939-4154-4520-ba3c-024930220902",
  "version": 2
}

Going to work on adding support for that now.

  1. update socorro to support new structure; add migration code in crashstorage to convert old structure to new structure
  2. update socorro-submitter to support new and old structure
  3. update antenna to switch to new structure
  4. write up bug for socorro and socorro-submitter to remove handling for old structure to be done in 6 months

willkg merged PR #6200: "bug 1687987: support raw crash version 2" in 8051738.

That adds support for version 2 of the raw crash to socorro.

willkg merged PR #46: "bug 1687987: support version 2 raw crash" in 5b0b3a1.

That adds support for version 2 of the raw crash to the socorro-submitter.

willkg merged PR #871: "bug 1687987: switch to raw crash version 2" in 3f6502f.

This switches antenna to save raw crash using the version 2 structure.

Next steps:

  1. verify everything is working in stage between antenna and socorro
  2. deploy socorro to prod so it can handle both version 1 and version 2 of the raw crash
    1. verify that processing and the report view work correctly
  3. deploy socorro-submitter to prod so it can handle both version 1 and version 2 of the raw crash
    1. verify crash reports are getting submitted to stage
  4. deploy antenna to prod so now only version 2 raw crash data is being generated
  5. write up bug for socorro and socorro-submitter to remove handling for old structure to be done in 6 months

However, I don't want to do anything risky next week. Given that, I want to do steps 1, 2, and 3 by end of Monday and then push everything else off until October.

We deployed the Socorro changes just now in bug #1791381. Everything looks ok.

The "Raw annotations" tab now shows the "metadata" section. We should change the template to stop showing "metadata" in the "Raw annotations" tab since it's not annotations. It also shows up in the "Debug" tab which makes more sense for that information.

We deployed the socorro-submitter changes in bug #1794816 today.

We deployed the antenna changes in bug #1794832 just now. We're now on raw crash version 2.

The last step here is to write up a bug for socorro and socorro-submitter to remove handling for the old raw crash structure to be done in 6 months.

I wrote up bug #1794837 (socorro-submitter) and bug #1794838 (socorro) to cover removing the migration code.

Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.