1204436 - Investigate why Telemetry v4 reports many more crashes than FHR v2

Reporter

Description

•

9 years ago

spenrose mentioned that FHRv2 reports many more crashes than the number of crash-pings reported by Telemetry v4. We should investigate and explain the issue.

Alessio Placitelli [:Dexter]

Reporter

Updated

•

9 years ago

Blocks: 1122482

Whiteboard: [unifiedTelemetry]

Alessio Placitelli [:Dexter]

Reporter

Updated

•

9 years ago

Assignee: nobody → alessio.placitelli

Alessio Placitelli [:Dexter]

Reporter

Comment 1

•

9 years ago

Sam, is there any other detail you can provide?

Flags: needinfo?(spenrose)

Alessio Placitelli [:Dexter]

Reporter

Updated

•

9 years ago

Summary: Investigate why FHR v2 reports many more crashes than Telemetry v4 → Investigate why Telemetry v4 reports many more crashes than FHR v2

Alessio Placitelli [:Dexter]

Reporter

Comment 2

•

9 years ago

My mistake, it's the other way around: Telemetry v4 reports many more crashes than FHR v2.

Benjamin Smedberg

Comment 3

•

9 years ago

Do you have links to each of the rollup scripts? This is my area of expertise, and I'll wager we're just counting different things.

Georg Fritzsche [:gfritzsche]

Comment 4

•

9 years ago

(In reply to Alessio Placitelli [:Dexter] from comment #2)
> My mistake, it's the other way around: Telemetry v4 reports many more
> crashes than FHR v2.

If it is that way around, we should be able to:
* match most v4 aborted sessions in v2 (v2's abortedTotalTime entries allow us to match aborted sessions pretty directly, although v4's equivalent is basically always much greater than the v2 values)
* show that these additional aborted sessions in v4 are reasonable sessions, not just duplicates etc. (e.g. by sampling some client histories, showing that there is no overlap with other sessions in client histories, ...)

Georg Fritzsche [:gfritzsche]

Comment 5

•

9 years ago

Also: We might submit an aborted-session for clean shutdowns if we failed to remove the temporary aborted-session ping on shutdown (see bug 1196852). We would see that as a shutdown and an aborted-session ping with the same session id.
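
A minimal sketch of how the case described above could be looked for in collected pings. It assumes each ping has already been flattened to a dict with hypothetical `session_id` and `reason` keys; those names and the sample data are illustrative and are not taken from any real export.

```python
from collections import defaultdict


def sessions_with_spurious_abort(pings):
    """Return session IDs that have both a 'shutdown' and an 'aborted-session' ping.

    `pings` is an iterable of dicts with hypothetical keys
    'session_id' and 'reason' (an assumption for this sketch).
    """
    reasons_by_session = defaultdict(set)
    for ping in pings:
        reasons_by_session[ping["session_id"]].add(ping["reason"])
    # A session that shut down cleanly should not also report an abort.
    return sorted(
        sid
        for sid, reasons in reasons_by_session.items()
        if {"shutdown", "aborted-session"} <= reasons
    )


# Invented sample data for illustration only.
sample_pings = [
    {"session_id": "a", "reason": "shutdown"},
    {"session_id": "a", "reason": "aborted-session"},
    {"session_id": "b", "reason": "aborted-session"},
    {"session_id": "c", "reason": "shutdown"},
]
suspicious = sessions_with_spurious_abort(sample_pings)
```

Sessions returned by this check would be candidates for the leftover aborted-session ping scenario rather than real crashes.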

Katie Parlante

Updated

•

9 years ago

Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][data-validation]

Georg Fritzsche [:gfritzsche]

Comment 6

•

9 years ago

This is not actionable yet, unassigning.

Assignee: alessio.placitelli → nobody

Benjamin Smedberg

Comment 7

•

9 years ago

It is actionable, but not at the client!

Assignee: nobody → spenrose

Sam Penrose

Assignee

Comment 8

•

9 years ago

I am developing the data set that will allow me to analyze this. I expect to have something around the end of the week.

Flags: needinfo?(spenrose)

Thomas Huelbert

Updated

•

9 years ago

Points: --- → 1

Priority: -- → P2

Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry][data-validation][measurement:client]

Sam Penrose

Assignee

Comment 9

•

9 years ago

Details are here: https://docs.google.com/document/d/1rvbCyVSzBsexer-0G7HdR_l1DczKAJdjzChxv0MjYco/edit#

Specifically, using the "crashes per 1000 clientID" metric from the Executive Dashboard, 41 Beta v2 records show a 2.7x INCREASE compared to Release in May, and 41 Beta v4 shows a 26% DECREASE.

If we have not seen this discrepancy in any previous analysis, then my data collection method should be considered guilty until proven innocent. Particularly WRT the 2.7x increase. More soon.
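
For readers unfamiliar with the metric, here is a rough sketch of the arithmetic. It assumes "crashes per 1000 clientID" means total crashes divided by the number of distinct client IDs, scaled by 1000; the exact Executive Dashboard definition is not given in this bug. The counts below are invented purely to reproduce the quoted ratios and are not real data.

```python
def crashes_per_1000_clients(crash_count, client_count):
    """Crashes per 1000 distinct clients (assumed definition for this sketch)."""
    if client_count <= 0:
        raise ValueError("client_count must be positive")
    return 1000.0 * crash_count / client_count


def ratio_to_baseline(value, baseline):
    """Ratio of a rate to a baseline rate; 2.7 means a 2.7x increase."""
    return value / baseline


# Invented counts, chosen only so the ratios match the ones quoted above.
release_rate = crashes_per_1000_clients(50, 10000)
beta_v2_rate = crashes_per_1000_clients(135, 10000)
beta_v4_rate = crashes_per_1000_clients(37, 10000)

v2_ratio = ratio_to_baseline(beta_v2_rate, release_rate)  # about 2.7
v4_ratio = ratio_to_baseline(beta_v4_rate, release_rate)  # about 0.74, a 26% decrease
```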

Georg Fritzsche [:gfritzsche]

Comment 10

•

9 years ago

Do you have any pointers to the current export & comparison/rollup code that gets us the crash data?

Sam Penrose

Assignee

Comment 11

•

9 years ago

Here are the key extraction details:

    V2_CRASH_KEYS = set(['main-crash', 'main-hang', 'plugin-crash', 'plugin-hang'])
    crash_dict = d.get('org.mozilla.crashes.crashes', {})
    crash_count = sum([v for k, v in crash_dict.items() if k in V2_CRASH_KEYS])

It's also possible I somehow doubled or (more likely) tripled the values during the many-steps process of creating the tables in question. If I included too many keys then great, we're done. Otherwise I can re-execute the logic and see if the ratio holds.

Georg Fritzsche [:gfritzsche]

Comment 12

•

9 years ago

How about the v4 data this compares to?

Sam Penrose

Assignee

Comment 13

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> How about the v4 data this compares to?

It's just the number of pings of doctype "crash" in the executive summary stream as exposed by moztelemetry.get_records().
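
The v4 side of the comparison can be pictured roughly as follows. This sketch does not call moztelemetry; it assumes the records have already been fetched into an iterable of dicts and that the document type sits under a hypothetical `docType` key, which may not match the real field name in the executive summary stream.

```python
def count_crash_pings(records, doc_type_key="docType"):
    """Count records whose document type is 'crash'.

    `doc_type_key` is a hypothetical field name used for illustration.
    """
    return sum(1 for record in records if record.get(doc_type_key) == "crash")


# Invented records for illustration only.
sample_records = [
    {"docType": "crash", "clientId": "c1"},
    {"docType": "main", "clientId": "c1"},
    {"docType": "crash", "clientId": "c2"},
]
v4_crash_count = count_crash_pings(sample_records)
```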

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Flags: needinfo?(benjamin)

Benjamin Smedberg

Comment 14

•

9 years ago

The v2 rollup should be main-crash only and should not include main-hang (those shouldn't actually exist at all), plugin-crash or plugin-hang.

Flags: needinfo?(benjamin)
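
Applying the correction above to the extraction from comment 11 could look like the sketch below. The `org.mozilla.crashes.crashes` lookup and the key names come from that comment; the function name and the sample day are invented for illustration.

```python
# Key set from comment 11, which over-counts by including hangs and plugin crashes.
V2_CRASH_KEYS_ORIGINAL = {"main-crash", "main-hang", "plugin-crash", "plugin-hang"}
# Key set per the correction above: main-crash only.
V2_CRASH_KEYS_CORRECTED = {"main-crash"}


def v2_crash_count(d, keys):
    """Sum the crash counters in one FHR v2 day record, restricted to `keys`."""
    crash_dict = d.get("org.mozilla.crashes.crashes", {})
    return sum(v for k, v in crash_dict.items() if k in keys)


# Invented sample day, for illustration only.
sample_day = {
    "org.mozilla.crashes.crashes": {
        "main-crash": 1,
        "plugin-crash": 2,
        "plugin-hang": 1,
    }
}

original_count = v2_crash_count(sample_day, V2_CRASH_KEYS_ORIGINAL)
corrected_count = v2_crash_count(sample_day, V2_CRASH_KEYS_CORRECTED)
```

On a day with plugin crashes or hangs, the original key set yields a larger number than the corrected one, which is the kind of inflation that would be expected to make v2 look much crashier than the v4 crash-ping count.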

Sam Penrose

Assignee

Comment 15

•

9 years ago

\o/! Way to go Penrose, making shallow transparent mistakes! I will run a new study tomorrow morning and update the doc.

Thomas Huelbert

Comment 16

•

9 years ago

getting out of client sprint

Whiteboard: [unifiedTelemetry][data-validation][measurement:client] → [unifiedTelemetry][data-validation]

Thomas Huelbert

Updated

•

9 years ago

Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry]

Sam Penrose

Assignee

Comment 17

•

9 years ago

Confirmed caused by bad query.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Whiteboard: [unifiedTelemetry]


Categories

(Toolkit :: Telemetry, defect, P2)

Tracking

()

People

(Reporter: Dexter, Assigned: spenrose)


Security

(public)

User Story
