Perform volumetric validation of Glean crash pings in comparison with Telemetry
Categories
(Toolkit :: Crash Reporting, task)
People
(Reporter: afranchuk, Assigned: afranchuk)
References
(Blocks 1 open bug)
Details
We should verify that statistics derived from the new Glean-based crash pings match those from Telemetry pings. Given they are in the same code path and use the same source information, they should exactly match.
Assignee
Comment 1•2 years ago
There are notable discrepancies between crash ping counts from Glean and Telemetry. We need to further investigate the cause of these differences.
In some cases, with all other filters being equal, Glean receives 2-3 orders of magnitude more pings than Telemetry for a single day and content type. These thousands of pings all come from one client, yet Telemetry does not appear to have received a single ping from that client. This needs to be investigated.
Assignee
Comment 2•2 years ago
:chutten made a useful ReDash query comparing the crash pings between Glean and Telemetry. The main process type is expected to be under-reported in Glean because the crash reporter uses pingsender to send pings to only Telemetry. Other than main, all of the other process types seem to match quite closely.
WRT the discrepancies noted in the prior comment, it turns out that the Ten Percent Sample field, when set to no, only includes the other 90% of samples (I think it may actually be based on clients?), so it was missing data in the comparison. The solution is to set that field to allow any value. I have updated the Looker dashboard to set the field correctly. Now, if you disable the main category (and optionally disable a few of the categories which the Telemetry data is missing), the graphs look more alike and show similar numbers. I've also added graphs of crash counts by process type, which make it even more obvious (I could add a merged graph but it's a little cumbersome).
Comment 3•2 years ago
(In reply to Alex Franchuk from comment #2)
...only includes the other 90% of samples (I think it may actually be based on clients?)...
Correct. The default sampling in the Data Platform (i.e., sample_id) is done by client_id to aim for representative samples across the client population: https://docs.telemetry.mozilla.org/concepts/sample_id.html
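To illustrate why a client-based sample can hide thousands of pings at once, here is a minimal Python sketch. It assumes, following the linked documentation, that sample_id is derived as crc32(client_id) % 100. The client IDs, the ping counts, and the choice of buckets 0-9 as the "ten percent" are all made up for illustration and are not taken from the dashboard.

```python
import zlib


def sample_id(client_id):
    """Bucket a client into 0..99 (assumed: crc32(client_id) % 100)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100


def split_by_sample(ping_counts, buckets):
    """Return (pings inside the sampled buckets, pings outside them)."""
    inside = sum(n for cid, n in ping_counts.items() if sample_id(cid) in buckets)
    total = sum(ping_counts.values())
    return inside, total - inside


# Hypothetical data: one client sends thousands of crash pings in a day.
ping_counts = {"client-a": 3000, "client-b": 2, "client-c": 1}

# Hypothetical definition of the ten percent sample: buckets 0 through 9.
inside, outside = split_by_sample(ping_counts, range(10))
print(inside, outside)
```

Because the bucket depends only on the client, all 3000 pings from the heavy client land on the same side of the filter. Comparing a sampled view against an unsampled one can therefore be off by that client's whole volume.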
Assignee
Comment 4•2 years ago
Ignoring the main process type, we're getting 2.08% more crash pings from Glean overall. I think this volume (and the lesser volume for main) is what we expect.
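One plausible way to compute such a figure is sketched below in Python. The function name, the rule of comparing only process types present in both sources, and the counts are all hypothetical. The numbers do not reproduce the 2.08% reported here, and the actual query may differ.

```python
def relative_excess(glean, telemetry, exclude=("main",)):
    """Percent by which Glean exceeds Telemetry over shared process types."""
    # Compare only process types both sources report, minus the excluded ones.
    shared = [p for p in glean if p in telemetry and p not in exclude]
    glean_total = sum(glean[p] for p in shared)
    telemetry_total = sum(telemetry[p] for p in shared)
    if telemetry_total == 0:
        raise ValueError("no Telemetry pings to compare against")
    return 100.0 * (glean_total - telemetry_total) / telemetry_total


# Made-up daily counts; "utility" stands in for a type Telemetry is missing.
glean = {"main": 400, "content": 10200, "gpu": 510, "utility": 30}
telemetry = {"main": 5000, "content": 10000, "gpu": 500}

excess = relative_excess(glean, telemetry)
print(f"{excess:.2f}% more pings from Glean")
```

Excluding main and any type missing from one side keeps the comparison limited to pings that both pipelines are expected to receive.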
Comment 5•2 years ago
While talking with :gcp this morning I figured out why we're seeing such a big discrepancy in main process crashes. Let's say Firefox crashes: the crash reporter client picks up the crash and sends a ping; when it's done, it leaves a CrashUUID annotation in the event file. When Firefox restarts, it finds the event file, checks for the annotation, and if it's present it won't send another ping. The problem is that the ping it skips is sent via _sendCrashPing(), which is also what we use for sending Glean pings. So right now Glean main process crash pings correspond only to the legacy pings that the crash reporter client couldn't send on its own. I'll open a bug to fix that.
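The control flow described above can be modelled with a small Python sketch. Only the CrashUUID annotation and the idea of a single combined sender come from this comment. The function names, the event structure, and the "decoupled" variant are hypothetical, and the real code is not Python.

```python
def send_crash_ping(event, sent):
    """Stand-in for the combined sender: submits legacy and Glean pings."""
    sent.append(("telemetry", event["id"]))
    sent.append(("glean", event["id"]))


def process_event_current(event, sent):
    """Behaviour as described: the CrashUUID check skips the whole sender."""
    if "CrashUUID" in event["annotations"]:
        return  # the crash reporter client already sent the legacy ping
    send_crash_ping(event, sent)


def process_event_decoupled(event, sent):
    """One hypothetical fix: always send Glean, dedupe only the legacy ping."""
    sent.append(("glean", event["id"]))
    if "CrashUUID" not in event["annotations"]:
        sent.append(("telemetry", event["id"]))


# Two main process crashes: the client managed to send a ping for the first only.
events = [
    {"id": "crash-1", "annotations": {"CrashUUID": "hypothetical-uuid"}},
    {"id": "crash-2", "annotations": {}},
]

current, decoupled = [], []
for event in events:
    process_event_current(event, current)
    process_event_decoupled(event, decoupled)

print(sum(kind == "glean" for kind, _ in current))    # Glean pings, current flow
print(sum(kind == "glean" for kind, _ in decoupled))  # Glean pings, decoupled flow
```

In the current flow, Glean only sees the crash whose legacy ping the client failed to send, which matches the under-reporting of main observed in the earlier comments.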