Closed Bug 1660542 Opened 4 years ago Closed 3 years ago

Suspicious distribution of crash pings in STMO, with up to 45 pings per minidump_sha256_hash

Categories

(Toolkit :: Crash Reporting, defect, P2)

RESOLVED WORKSFORME

People

(Reporter: kats, Unassigned)

Details

(Whiteboard: [dataquality])

I was working on symbolicating crash pings and noticed that on some days the number of crash pings on the beta channel was unusually high. :gsvelto said that there is a known issue where crash pings sometimes get sent repeatedly, and that the minidump_sha256_hash field can be used to detect such duplicates.

I wrote a query to see if this was the case here, and indeed it is: https://sql.telemetry.mozilla.org/queries/74132/source has a query which processes (as of this writing) about 165651 crash pings, and finds that there are only 26637 distinct minidump_sha256_hash values. Sorting by frequency of duplication, it seems that a large number of pings get repeated exactly 45 times.
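For readers without STMO access, the duplicate check described above can be sketched in Python. This is a minimal illustration and not the linked query: it assumes the pings are already loaded as a list of dicts with a `minidump_sha256_hash` key, and the function name `duplication_histogram` is hypothetical.

```python
from collections import Counter


def duplication_histogram(pings):
    """Summarize how often each minidump hash repeats.

    `pings` is assumed to be a list of dicts, each carrying a
    "minidump_sha256_hash" key (an assumption made for this sketch).
    Returns (total pings, distinct hashes, histogram), where the
    histogram maps "copies per hash" to "number of hashes with that
    many copies".
    """
    per_hash = Counter(p["minidump_sha256_hash"] for p in pings)
    histogram = Counter(per_hash.values())
    return len(pings), len(per_hash), histogram


if __name__ == "__main__":
    sample = [{"minidump_sha256_hash": "aaa"}] * 3 + [{"minidump_sha256_hash": "bbb"}]
    total, distinct, histogram = duplication_histogram(sample)
    # Sort by copy count, highest first, to surface the most-repeated pings.
    for copies, n_hashes in sorted(histogram.items(), reverse=True):
        print(f"{n_hashes} hash(es) seen {copies} time(s)")
```

A spike at one copy count in the histogram (such as 45 in the sample above) is what would point to a systematic resend rather than random duplication.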

All the repeated pings seem to have payload.process_type set to main, and normalized_os seems to always be Windows.

I should note that it's not always exactly 45 pings. If I go far enough down the table I see entries with smaller numbers of duplicate pings. For the sample I'm looking at (a selection of pings from the beta channel on 2020-08-14), almost all of the duplicated pings have the nsWidgetWindowsModuleCtor signature (bug 1571516). If I strip out the ones with nsWidgetWindowsModuleCtor as the signature, I see only 5 rows with duplicated pings: one is a 25-count duplication with an empty minidump_sha256_hash value and signature nsNSSComponent::nsNSSComponent, and the remaining 4 rows are all 2-count duplications with assorted signatures.

https://sql.telemetry.mozilla.org/queries/74133/source
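The filtering step above (dropping one dominant signature and listing whatever duplicates remain) could look roughly like the following. This is a sketch under assumptions, not the linked query: each ping is assumed to be a dict with `minidump_sha256_hash` and an already-symbolicated `signature` key, and `duplicated_hashes` is a hypothetical helper name.

```python
from collections import Counter


def duplicated_hashes(pings, excluded_signature):
    """List (hash, signature, count) rows for hashes seen more than once.

    Pings whose signature equals `excluded_signature` are ignored.
    The "signature" key is an assumption of this sketch; it stands in
    for whatever symbolication step produced the signature.
    """
    counts = Counter(
        (p["minidump_sha256_hash"], p["signature"])
        for p in pings
        if p["signature"] != excluded_signature
    )
    rows = [(h, sig, n) for (h, sig), n in counts.items() if n > 1]
    # Most-duplicated rows first.
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

Note that pings with an empty hash are grouped together under the empty string here, so a row like the 25-count one mentioned above may be several unrelated crashes that simply lack a hash, rather than one crash sent 25 times.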

Summary: Suspicious distribution of crash pings in STMO, with 45 pings per minidump_sha256_hash → Suspicious distribution of crash pings in STMO, with up to 45 pings per minidump_sha256_hash

The issue is most likely in the pingsender Windows-specific code. Is 45 the maximum number of duplications we see? I wonder if we're mishandling some error and repeating the HTTP POST operation up to a predefined limit.

In that sample, 45 was the maximum. If I widen my search to all of the August beta-channel pings, I see up to 141 duplications per minidump_sha256_hash. https://sql.telemetry.mozilla.org/queries/74135/source

Whiteboard: [data-quality]

The severity field is not set for this bug.
:gsvelto, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)
Severity: -- → S3
Flags: needinfo?(gsvelto)

Gabriele, could you set the priority on this as well? It keeps on coming up in our data quality triage meeting (when we do bug triage, not saying the issue has gotten worse).

Flags: needinfo?(gsvelto)

I completely forgot about this, sorry! Giving it P2 because I think it's worth investigating. I manually fetched a few results from the pings that have several entries and something immediately stands out: the cause of the crash is always EXCEPTION_BREAKPOINT. I can't be sure that's the case for all crashes because I'd have to cook up a query for that, but it's suspect nonetheless. EXCEPTION_BREAKPOINT crashes have hit some kind of trap instruction so the crashed process might have been suspended, and thus it might be that we take multiple minidumps that are all identical. But even then it's rather odd because we have code in place to make sure we only take one minidump per crashed process so ¯\_(ツ)_/¯.

Anyway I'll look into this.
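The check I haven't written a query for yet, whether every duplicated ping has EXCEPTION_BREAKPOINT as its crash reason, could be sketched like this. It is only an illustration: the `crash_reason` key and both function names are hypothetical, and the pings are assumed to be a list of dicts that also carry `minidump_sha256_hash`.

```python
from collections import defaultdict


def reasons_for_duplicates(pings):
    """Map each duplicated hash to the set of crash reasons seen for it.

    "crash_reason" is an assumed key name for this sketch. Hashes that
    occur only once are left out.
    """
    by_hash = defaultdict(list)
    for p in pings:
        by_hash[p["minidump_sha256_hash"]].append(p["crash_reason"])
    return {h: set(reasons) for h, reasons in by_hash.items() if len(reasons) > 1}


def all_duplicates_are_breakpoints(pings):
    """True if every duplicated hash has EXCEPTION_BREAKPOINT as its only reason.

    Also returns True when there are no duplicated hashes at all.
    """
    return all(
        reasons == {"EXCEPTION_BREAKPOINT"}
        for reasons in reasons_for_duplicates(pings).values()
    )
```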

Flags: needinfo?(gsvelto)
Priority: -- → P2

This continues to be on the data team's radar due to the [data-quality] whiteboard tag. A more recent query suggests that there have been at most 2 instances of any single minidump hash in recent pings:

https://sql.telemetry.mozilla.org/queries/82991/source

Feels like maybe we can resolve this?

Flags: needinfo?(gsvelto)

Yes, it does look like it's solved. One of the changes that landed is bug 1734262. If the issue was caused by the network machinery like I speculated in comment 3, then it would make sense. Note that bug 1734262 does not apply to main process crashes, so we'll still see a few duplicate entries if pingsender was the culprit here.

Flags: needinfo?(gsvelto)

Can we resolve this then?

Flags: needinfo?(gsvelto)

Yeah, closing as WFM.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(gsvelto)
Resolution: --- → WORKSFORME
Whiteboard: [data-quality] → [dataquality]