Suspicious distribution of crash pings in STMO, with up to 45 pings per minidump_sha256_hash
Categories
(Toolkit :: Crash Reporting, defect, P2)
People
(Reporter: kats, Unassigned)
Details
(Whiteboard: [dataquality])
I was working on symbolicating crash pings and noticed that on some days the number of crash pings on the beta channel was super high. :gsvelto said that there was a known issue where crash pings would sometimes get sent repeatedly, and that the minidump_sha256_hash field could be used to detect such duplicates.
I wrote a query to see if this was the case here, and indeed it is: https://sql.telemetry.mozilla.org/queries/74132/source has a query which processes (as of this writing) about 165651 crash pings, and finds that there are only 26637 distinct minidump_sha256_hash values. Sorting by frequency of duplication, it seems that a large number of pings get repeated exactly 45 times.
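The duplicate check described above can be sketched in a few lines of Python. This is only an illustration, not the STMO query itself: it assumes each ping is available as a dict with a minidump_sha256_hash key, and the sample data and the helper name duplication_histogram are made up for the example.

```python
from collections import Counter


def duplication_histogram(pings):
    """Return a Counter mapping 'copies per hash' to 'number of hashes'.

    `pings` is assumed to be an iterable of dicts that each carry a
    "minidump_sha256_hash" key (hypothetical in-memory representation).
    """
    # First count how many pings share each hash...
    per_hash = Counter(p["minidump_sha256_hash"] for p in pings)
    # ...then count how often each duplication level occurs.
    return Counter(per_hash.values())


# Made-up sample: two hashes sent 45 times, one sent twice, one sent once.
pings = (
    [{"minidump_sha256_hash": "aaa"}] * 45
    + [{"minidump_sha256_hash": "bbb"}] * 45
    + [{"minidump_sha256_hash": "ccc"}] * 2
    + [{"minidump_sha256_hash": "ddd"}]
)

histogram = duplication_histogram(pings)
distinct_hashes = sum(histogram.values())
total_pings = len(pings)

# List the duplication levels, most copies first.
for copies, hashes in sorted(histogram.items(), reverse=True):
    print(f"{hashes} hash(es) seen {copies} time(s)")
```

Comparing total_pings with distinct_hashes gives the same kind of signal as the 165651 versus 26637 figures above, and a spike at one duplication level in the histogram corresponds to the "exactly 45 times" pattern.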
Comment 1•4 years ago (Reporter)
All the repeated pings seem to have payload.process_type as main. And normalized_os seems to always be Windows.
Comment 2•4 years ago (Reporter)
I should note that it's not always exactly 45 pings. If I go far enough down the table I see entries with smaller numbers of duplicate pings. For the sample I'm looking at (a selection of pings from the beta channel on 2020-08-14), almost all of the duplicated pings have the nsWidgetWindowsModuleCtor signature (bug 1571516). If I strip out the ones with nsWidgetWindowsModuleCtor as the signature, I see only 5 rows with duplicated pings: one is a 25-count duplication with an empty minidump_sha256_hash value and signature nsNSSComponent::nsNSSComponent, and the remaining 4 rows are all 2-count duplications with assorted signatures.
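The filtering step in this comment can be sketched roughly as follows. It is a minimal illustration under stated assumptions: pings are dicts with minidump_sha256_hash and signature keys (the signature would come from a separate symbolication step, so that key is hypothetical), and the helper name and sample rows are invented for the example.

```python
from collections import defaultdict


def duplicated_rows(pings, excluded_signature):
    """Group pings by hash, skipping one signature, and keep only duplicates.

    Returns a list of (hash, count, signatures) tuples sorted by count,
    largest first. The "signature" key is an assumed, already-symbolicated
    field, not something guaranteed to exist on a raw crash ping.
    """
    groups = defaultdict(list)
    for ping in pings:
        if ping["signature"] == excluded_signature:
            continue
        groups[ping["minidump_sha256_hash"]].append(ping["signature"])

    rows = [
        (digest, len(sigs), sorted(set(sigs)))
        for digest, sigs in groups.items()
        if len(sigs) > 1
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Made-up sample loosely mirroring the shape of the results described above.
pings = (
    [{"minidump_sha256_hash": "h1", "signature": "nsWidgetWindowsModuleCtor"}] * 3
    + [{"minidump_sha256_hash": "", "signature": "nsNSSComponent::nsNSSComponent"}] * 25
    + [{"minidump_sha256_hash": "h2", "signature": "sigA"}] * 2
    + [{"minidump_sha256_hash": "h3", "signature": "sigB"}]
)

rows = duplicated_rows(pings, "nsWidgetWindowsModuleCtor")
for digest, count, signatures in rows:
    label = digest if digest else "<empty hash>"
    print(label, count, signatures)
```

One caveat with this kind of grouping: pings with an empty minidump_sha256_hash all land in the same group, so a row like the 25-count one may be several unrelated crashes without a hash rather than one ping sent 25 times.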
Comment 3•4 years ago
The issue is most likely in the pingsender Windows-specific code. Is 45 the maximum number of duplications we see? I wonder if we're mishandling some error and repeating the HTTP POST operation up to a predefined limit.
Comment 4•4 years ago (Reporter)
In that sample 45 was the maximum. If I widen my search to all of the August beta-channel pings I see up to 141 duplications per minidump_sha256_hash. https://sql.telemetry.mozilla.org/queries/74135/source
Comment 5•4 years ago
The severity field is not set for this bug.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 6•3 years ago
Gabriele, could you set the priority on this as well? It keeps on coming up in our data quality triage meeting (when we do bug triage, not saying the issue has gotten worse).
Comment 7•3 years ago
I completely forgot about this, sorry! Giving it P2 because I think it's worth investigating. I manually fetched a few results from the pings that have several entries and something immediately stands out: the cause of the crash is always EXCEPTION_BREAKPOINT. I can't be sure that's the case for all crashes because I'd have to cook up a query for that, but it's suspect nonetheless. EXCEPTION_BREAKPOINT crashes have hit some kind of trap instruction so the crashed process might have been suspended, and thus it might be that we take multiple minidumps that are all identical. But even then it's rather odd because we have code in place to make sure we only take one minidump per crashed process so ¯\_(ツ)_/¯.
Anyway I'll look into this.
Comment 8•3 years ago
This continues to be on the data team's radar due to the [dataquality] whiteboard tag. A more recent query suggests that there have been at most 2 instances of any single minidump_sha256_hash value in recent pings:
https://sql.telemetry.mozilla.org/queries/82991/source
Feels like maybe we can resolve this?
Comment 9•3 years ago
Yes, it does look like it's solved. One of the changes that landed is bug 1734262. If the issue was caused by the network machinery, as I speculated in comment 3, then that would make sense. Note that bug 1734262 does not apply to main process crashes, so we'll still see a few duplicate entries if pingsender was the culprit here.
Comment 11•3 years ago
Yeah, closing as WFM.