Closed Bug 1204436 Opened 9 years ago Closed 9 years ago

Investigate why Telemetry v4 reports many more crashes than FHR v2

Categories

(Toolkit :: Telemetry, defect, P2)

defect
Points:
1

RESOLVED FIXED
Tracking Status
firefox43 --- affected

People

(Reporter: Dexter, Assigned: spenrose)

spenrose mentioned that FHRv2 reports many more crashes than the number of crash-pings reported by Telemetry v4. We should investigate and explain the issue.
Blocks: 1122482
Whiteboard: [unifiedTelemetry]
Assignee: nobody → alessio.placitelli
Sam, is there any other detail you can provide?
Flags: needinfo?(spenrose)
Summary: Investigate why FHR v2 reports many more crashes than Telemetry v4 → Investigate why Telemetry v4 reports many more crashes than FHR v2
My mistake, it's the other way around: Telemetry v4 reports many more crashes than FHR v2.
Do you have links to each of the rollup scripts? This is my area of expertise, and I'll wager we're just counting different things.
(In reply to Alessio Placitelli [:Dexter] from comment #2)
> My mistake, it's the other way around: Telemetry v4 reports many more
> crashes than FHR v2.

If it is that way around, we should be able to:
* match most v4 aborted sessions in v2 (v2's abortedTotalTime entries allow us to match aborted sessions pretty directly, although v4's equivalent is basically always much greater than the v2 values)
* show that these additional aborted sessions in v4 are reasonable sessions, not just duplicates etc. (e.g. by sampling some client histories, showing that there is no overlap with other sessions in client histories, ...)
Also: We might submit an aborted-session for clean shutdowns if we failed to remove the temporary aborted-session ping on shutdown (see bug 1196852). We would see that as a shutdown and an aborted-session ping with the same session id.
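A minimal sketch of how that case could be spotted in exported data, assuming each ping has been flattened to a dict with hypothetical `sessionId` and `reason` fields (the real export format is not shown in this bug):

```python
from collections import defaultdict

def sessions_with_spurious_abort(pings):
    """Return session ids that have both a shutdown and an aborted-session ping.

    `pings` is an iterable of dicts with hypothetical keys "sessionId" and
    "reason"; adapt the lookups to the actual ping layout.
    """
    reasons_by_session = defaultdict(set)
    for ping in pings:
        reasons_by_session[ping["sessionId"]].add(ping["reason"])
    wanted = {"shutdown", "aborted-session"}
    return sorted(sid for sid, reasons in reasons_by_session.items()
                  if wanted <= reasons)

# Made-up sample data for illustration only.
sample_pings = [
    {"sessionId": "s1", "reason": "shutdown"},
    {"sessionId": "s1", "reason": "aborted-session"},
    {"sessionId": "s2", "reason": "aborted-session"},
    {"sessionId": "s3", "reason": "shutdown"},
]
suspect = sessions_with_spurious_abort(sample_pings)
```

Sessions returned here would be candidates for the leftover-ping problem from bug 1196852 rather than real crashes.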
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][data-validation]
This is not actionable yet, unassigning.
Assignee: alessio.placitelli → nobody
It is actionable, but not at the client!
Assignee: nobody → spenrose
I am developing the data set that will allow me to analyze this. I expect to have something around the end of the week.
Flags: needinfo?(spenrose)
Points: --- → 1
Priority: -- → P2
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry][data-validation][measurement:client]
Details are here:
https://docs.google.com/document/d/1rvbCyVSzBsexer-0G7HdR_l1DczKAJdjzChxv0MjYco/edit#

Specifically, using the "crashes per 1000 clientID" metric from the Executive Dashboard, 41 Beta v2 records show a 2.7x INCREASE compared to Release in May, and 41 Beta v4 shows a 26% DECREASE. If we have not seen this discrepancy in any previous analysis, then my data collection method should be considered guilty until proven innocent. Particularly WRT the 2.7x increase. More soon.
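For reference, one plausible reading of the "crashes per 1000 clientID" metric is total crashes divided by distinct client ids, times 1000. The sketch below assumes that definition and a hypothetical input of `(client_id, crash_count)` pairs; the Executive Dashboard's exact formula is not stated in this bug.

```python
def crashes_per_1000_clients(records):
    """Total crashes per 1000 distinct clients.

    `records` is an iterable of (client_id, crash_count) pairs; this shape
    and the metric definition itself are assumptions.
    """
    clients = set()
    crashes = 0
    for client_id, crash_count in records:
        clients.add(client_id)
        crashes += crash_count
    if not clients:
        return 0.0
    return 1000.0 * crashes / len(clients)

# Made-up sample: 3 crashes across 3 distinct clients.
sample = [("a", 0), ("b", 2), ("a", 1), ("c", 0)]
rate = crashes_per_1000_clients(sample)  # 1000.0
```

Under this definition, overcounting crash keys on the v2 side inflates the numerator only, which would show up as exactly the kind of multiplicative gap described above.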
Do you have any pointers to the current export & comparison/rollup code that gets us the crash data?
Here are the key extraction details:

V2_CRASH_KEYS = set(['main-crash', 'main-hang', 'plugin-crash', 'plugin-hang'])
crash_dict = d.get('org.mozilla.crashes.crashes', {})
crash_count = sum([v for k, v in crash_dict.items() if k in V2_CRASH_KEYS])

It's also possible I somehow doubled or (more likely) tripled the values during the multi-step process of creating the tables in question. If I included too many keys then great, we're done. Otherwise I can re-execute the logic and see if the ratio holds.
How about the v4 data this compares to?
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> How about the v4 data this compares to?

It's just the number of pings of doctype "crash" in the executive summary stream as exposed by moztelemetry.get_records().
Flags: needinfo?(benjamin)
The v2 rollup should be main-crash only and should not include main-hang (those shouldn't actually exist at all), plugin-crash or plugin-hang.
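A minimal sketch of the rollup restricted to main-crash, assuming the same payload shape as the extraction snippet quoted earlier (a dict `d` whose `org.mozilla.crashes.crashes` entry maps crash type to count). The function name and sample payload are hypothetical.

```python
# Only main-process crashes count toward the v2 rollup.
V2_CRASH_KEYS = frozenset(["main-crash"])

def v2_crash_count(d):
    """Sum the main-crash counts from one FHR v2 payload (shape assumed)."""
    crash_dict = d.get("org.mozilla.crashes.crashes", {})
    return sum(v for k, v in crash_dict.items() if k in V2_CRASH_KEYS)

# Made-up payload for illustration only.
sample = {
    "org.mozilla.crashes.crashes": {
        "main-crash": 2,
        "plugin-crash": 3,
        "plugin-hang": 1,
    }
}

# The earlier key set, for comparison with the restricted one.
old_keys = {"main-crash", "main-hang", "plugin-crash", "plugin-hang"}
old_count = sum(v for k, v in sample["org.mozilla.crashes.crashes"].items()
                if k in old_keys)
new_count = v2_crash_count(sample)
```

On this sample the earlier key set yields 6 and the restricted one yields 2, which illustrates how including plugin keys can inflate the v2 number by a multiple.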
Flags: needinfo?(benjamin)
\o/! Way to go Penrose, making shallow transparent mistakes! I will run a new study tomorrow morning and update the doc.
getting out of client sprint
Whiteboard: [unifiedTelemetry][data-validation][measurement:client] → [unifiedTelemetry][data-validation]
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry]
Confirmed caused by bad query.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [unifiedTelemetry]