Closed Bug 1819424 Opened 2 years ago Closed 2 years ago

Perform volumetric validation of Glean crash pings in comparison with Telemetry

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: afranchuk, Assigned: afranchuk)

References

(Blocks 1 open bug)

Details

Alex Franchuk [:afranchuk]

Assignee

Description

•

2 years ago

We should verify that statistics derived from the new Glean-based crash pings match those from Telemetry pings. Given they are in the same code path and use the same source information, they should exactly match.

Alex Franchuk [:afranchuk]

Assignee

Comment 1

•

2 years ago

There are notable discrepancies between crash ping counts from Glean and Telemetry. We need to further investigate the cause of these differences.

Looker

In some cases, Glean receives 2-3 orders of magnitude more pings than Telemetry in a single day and content type, with all other filters being equal. These thousands of pings all come from one client, yet Telemetry appears not to have received a single ping from that client, which needs to be investigated.

Alex Franchuk [:afranchuk]

Assignee

Comment 2

•

2 years ago

:chutten made a useful ReDash query comparing the crash pings between Glean and Telemetry. The main process type is expected to be under-reported in Glean because the crash reporter uses pingsender to send pings only to Telemetry. Apart from main, all of the other process types seem to match quite closely.

WRT the discrepancies noted in the prior comment: it turns out that the Ten Percent Sample field, when set to no, only includes the other 90% of samples (I think it may actually be based on clients?), so the comparison was missing data. The solution is to set that field to allow any value, and I have updated the Looker dashboard accordingly. Now, if you disable the main category (and optionally a few of the categories that the Telemetry data is missing), the graphs look more alike and show similar numbers. I've also added graphs of crash counts by process type, which make this even more obvious (I could add a merged graph, but it's a little cumbersome).

Chris H-C :chutten

Comment 3

•

2 years ago

(In reply to Alex Franchuk from comment #2)

...only includes the other 90% of samples (I think it may actually be based on clients?)...

Correct. The default sampling in the Data Platform (ie, sample_id) is done by client_id to aim for representative samples across the client population: https://docs.telemetry.mozilla.org/concepts/sample_id.html
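To illustrate why a filter on the Ten Percent Sample field drops whole clients rather than individual pings, here is a minimal sketch of client-based bucketing. It assumes `sample_id` is a CRC32 hash of `client_id` modulo 100, and that the ten percent sample corresponds to buckets 0 through 9; both are assumptions made for illustration, and the function names are hypothetical rather than part of any Data Platform API.

```python
import zlib


def sample_id(client_id: str) -> int:
    """Map a client to a bucket in 0..99 (assumed: CRC32 of client_id, mod 100)."""
    return zlib.crc32(client_id.encode("utf-8")) % 100


def in_ten_percent_sample(client_id: str) -> bool:
    # Assumption for illustration: the ten percent sample is buckets 0-9.
    return sample_id(client_id) < 10


def filter_pings(pings, ten_percent_sample=None):
    """Filter pings the way a yes/no/any dashboard field would.

    pings: list of dicts with a 'client_id' key.
    ten_percent_sample: True ("yes"), False ("no"), or None ("any value").
    """
    if ten_percent_sample is None:
        return list(pings)
    return [
        p for p in pings
        if in_ten_percent_sample(p["client_id"]) == ten_percent_sample
    ]
```

Because the bucket depends only on `client_id`, every ping from a given client falls on the same side of the filter, so a "no" setting silently excludes all pings from roughly one client in ten.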

Alex Franchuk [:afranchuk]

Assignee

Comment 4

•

2 years ago

Ignoring the main process type, we're getting 2.08% more crash pings from Glean overall. I think this volume (and the lesser volume for main) is what we expect.
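A figure like the one above can be derived from per-process-type counts roughly as follows. This is a sketch only: the function name is hypothetical, and it assumes that process types missing from either source are left out of the comparison, as described for the dashboard in comment 2.

```python
def percent_more_from_glean(glean_counts, telemetry_counts, exclude=("main",)):
    """Percentage by which Glean crash pings exceed Telemetry crash pings.

    Both arguments map process type -> ping count. Only process types present
    in both sources, and not listed in `exclude`, are compared.
    """
    shared = (set(glean_counts) & set(telemetry_counts)) - set(exclude)
    glean_total = sum(glean_counts[p] for p in shared)
    telemetry_total = sum(telemetry_counts[p] for p in shared)
    if telemetry_total == 0:
        raise ValueError("no Telemetry pings to compare against")
    return 100.0 * (glean_total - telemetry_total) / telemetry_total
```

A positive result means Glean received more pings than Telemetry for the compared process types; a negative result means fewer.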

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED

Alex Franchuk [:afranchuk]

Assignee

Updated

•

2 years ago

Comment 5

•

2 years ago

While talking with :gcp this morning I figured out why we're seeing such a big discrepancy in main process crashes. Let's say Firefox crashes: the crash reporter client picks up the crash and sends a ping, and when it's done it leaves a CrashUUID annotation in the event file. When Firefox restarts it finds the event file, checks for the annotation, and if it's present it won't send another ping. The problem is that we do so via _sendCrashPing(), which is also what we use for sending Glean pings. So right now Glean main process crash pings correspond to the legacy pings that the crash reporter client couldn't send on its own. I'll open a bug to fix that.
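The flow described in this comment can be modelled with a small sketch. It is a simplified Python model for counting purposes, not the actual Firefox code: the function name, the dictionary standing in for the event file, and the placeholder UUID value are all hypothetical.

```python
def pings_for_main_crash(client_sent_ping: bool) -> dict:
    """Count pings produced by one main process crash under the described flow."""
    event_file = {}  # stand-in for the annotations stored in the event file
    counts = {"telemetry": 0, "glean": 0}

    # Step 1: the crash reporter client may send the legacy Telemetry ping.
    if client_sent_ping:
        counts["telemetry"] += 1
        event_file["CrashUUID"] = "placeholder-uuid"  # hypothetical value

    # Step 2: on restart, a ping is only sent when the annotation is absent.
    # The same gated path sends both the Telemetry and the Glean ping.
    if "CrashUUID" not in event_file:
        counts["telemetry"] += 1
        counts["glean"] += 1

    return counts
```

Under this model every main process crash yields exactly one Telemetry ping, while a Glean ping is produced only for crashes where the crash reporter client failed to send its own, which matches the under-reporting seen for the main process type.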

Categories

(Toolkit :: Crash Reporting, task)
