Closed Bug 1408041 Opened 7 years ago Closed 6 years ago

[tracker] expose MinidumpSha256Hash

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lonnen, Unassigned)

References

(Blocks 1 open bug)

Details

Bug 1322611 introduced MinidumpSha256Hash to the breakpad crash. Socorro should use this in a few places:

1. verify the client generated value and, if mismatched, alert and abort before it ends up in a public dataset. a processor rule seems like the right place for this.

2. send this field with other publicly available data to telemetry so we can join data downstream

3. expose this as a field in supersearch for easier cross-system lookups between telemetry and socorro



uncertain of this last one:

4a. the search box in the header should, when provided a sha256 hash, load the most recently submitted report with this sha 

-or-

4b. the search box in the header should, when provided a sha256 hash, load the search page for this sha iff more than one crash has this exact sha, or load the report directly if a single crash has this sha
Slight correction here: bug 1322611 added that field to the telemetry crash ping only, not the data that gets submitted to Socorro, so step 1 will need to be "calculate the SHA-256 hash of the minidump file". We already have md5sums of minidump files in Socorro (as `dump_checksums`) so presumably there's already a straightforward place to shoehorn this.

I'm not sure if either bit of step 4 is needed, but I don't think we have enough experience with the crash ping to know what the workloads there will look like. As long as there's some way to take the hash from a crash ping and look up a corresponding crash in Socorro that should be sufficient to start. Having the data in the telemetry pipeline is probably the most important thing, so that we'll be able to write an analysis over crash ping data and make it join to Socorro crashes when available.
I assumed that we were only going to go the other way (from Socorro -> ping) because that field was intended to only be available to logged-in users, and that workflow would enforce that.
It doesn't seem particularly harmful to make it public, in any event. It's not PII in and of itself, and it won't allow you to get any PII you didn't already have access to.
I vaguely remember Benjamin mentioning that the problem with making the hash public was about enabling an external onlooker to correlate the two and recover data that way. I can't find on which bug that was though.
Unless you can source that, I'm unclear why correlating two sets of publicly available information (both of which are basically the same thing in slightly different forms) is any more of a risk to user privacy.
Crash reports can have multiple minidump files. Are we explicitly talking about the one named "upload_file_minidump"? If there isn't one (pretty sure I've seen crashes like this), then what value should the MinidumpSha256Hash field have? Empty string? Null/None?
Exposing just the one from upload_file_minidump is probably sufficient. If not we can figure out how to expose the others in a followup. Are there really crashes that are completely missing a minidump? I know we have plenty of examples where the minidump is zero bytes. In any event, having a Null/None for that case would be fine.
I don't know if they're "valid crashes", but when I was working on Antenna I saw some crashes with no "upload_file_minidump" and it's one of the scenarios I was testing with.

If we're doing sha256 hashes of multiple dumps, then we probably want to make this field a dict of dump name -> sha256 hash. Does that sound right?

If that's true, then it seems like we might be better off replacing dump_checksums since we'll effectively have the same data in two places. I don't honestly know what uses dump_checksums. I see it being produced in Socorro. There are a couple of places where we use the keys to figure out what dumps are associated with the raw crash. Otherwise, I don't see it being used anywhere.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #8)
> If we're doing sha256 hashes of multiple dumps, then we probably want to
> make this field a dict of dump name -> sha256 hash. Does that sound right?

We could do that, as long as it's straightforward to query for the data for upload_file_minidump, since that's all we're going to care about for the purposes of matching crashes between Socorro and the crash ping for now.

> If that's true, then it seems like we might be better off replacing
> dump_checksums since we'll effectively have the same data in two places. I
> don't honestly know what uses dump_checksums. I see it being produced in
> Socorro. There are a couple of places where we use the keys to figure out
> what dumps are associated with the raw crash. Otherwise, I don't see it
> being used anywhere.

I'm not aware of anything that uses dump_checksums. I think it was probably part of the work to filter out duplicate crashes that happened a while back. If so, I don't think swapping a md5 hash for a sha-256 hash is harmful (and is actually better in practice).
When working on analysis for duplicate crash reports, I looked at 14,000 Firefox crash reports. Of those 14,000, 418 had 0-byte upload_file_minidump files.

We need to keep that in mind here, too. Maybe we ignore 0-byte upload_file_minidump files.
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #10)
> We need to keep that in mind here, too. Maybe we ignore 0-byte
> upload_file_minidump files.

It's not super important what exactly we do with those since they don't have any actionable data in the dump anyway, so there won't be any real use in correlating a crash ping to a crash in Socorro for those.
Ted: Is that true even if they have other dumps that aren't 0 bytes? For example, I was looking at this one when I was looking at 14,000 crashes earlier:

https://crash-stats.mozilla.com/report/index/6ba7b1b2-1e66-4b2c-93e9-e63540171011

That has an upload_file_minidump_browser.
Yes. The other dumps aren't crashes, they're just showing the state of other processes at the time of that crash. In this case we killed a content process because of an IPC error, so it's useful to know what the browser process was doing at that time. However, if we don't have an actual minidump for the content process then we don't have a stack, and we won't have anything to put in the crash ping anyway.
I landed code in the collector to calculate the MinidumpSha256Hash and code in the processor to carry that along and make sure it ends up in the crash data Socorro sends to Telemetry. Further, minidump_sha256_hash is a searchable field in Super Search. That was done under bug #1409427. It'll get pushed to prod next week.

The only thing that's not done is item 4. I'm voting we push that off to a separate bug when someone asks for it.
It's in prod! Marking this as FIXED.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.