Closed
Bug 1204436
Opened 9 years ago
Closed 9 years ago
Investigate why Telemetry v4 reports many more crashes than FHR v2
Categories
(Toolkit :: Telemetry, defect, P2)
Tracking
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
firefox43 | --- | affected |
People
(Reporter: Dexter, Assigned: spenrose)
spenrose mentioned that FHRv2 reports many more crashes than the number of crash-pings reported by Telemetry v4.
We should investigate and explain the issue.
Reporter | ||
Updated•9 years ago
Assignee: nobody → alessio.placitelli
Reporter | ||
Comment 1•9 years ago
Sam, is there any other detail you can provide?
Flags: needinfo?(spenrose)
Reporter | ||
Updated•9 years ago
Summary: Investigate why FHR v2 reports many more crashes than Telemetry v4 → Investigate why Telemetry v4 reports many more crashes than FHR v2
Reporter | ||
Comment 2•9 years ago
My mistake, it's the other way around: Telemetry v4 reports many more crashes than FHR v2.
Comment 3•9 years ago
Do you have links to each of the rollup scripts? This is my area of expertise, and I'll wager we're just counting different things.
Comment 4•9 years ago
(In reply to Alessio Placitelli [:Dexter] from comment #2)
> My mistake, it's the other way around: Telemetry v4 reports many more
> crashes than FHR v2.
If it is that way around, we should be able to:
* match most v4 aborted sessions in v2
(v2's abortedTotalTime entries allow us to match aborted sessions pretty directly, although v4's equivalent is basically always much greater than the v2 values)
* show that these additional aborted sessions in v4 are reasonable sessions, not just duplicates etc.
(e.g. by sampling some client histories, showing that there is no overlap with other sessions in client histories, ...)
Comment 5•9 years ago
Also:
We might submit an aborted-session for clean shutdowns if we failed to remove the temporary aborted-session ping on shutdown (see bug 1196852).
We would see that as a shutdown and an aborted-session ping with the same session id.
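One way to check for this case is to group pings by session id and flag any session that produced both reasons. A minimal sketch, assuming each ping has been flattened into a dict with hypothetical `session_id` and `reason` fields (these names are illustrative, not the actual ping schema):

```python
from collections import defaultdict


def sessions_with_both_reasons(pings):
    """Return session ids that have both a shutdown and an aborted-session ping.

    `pings` is an iterable of dicts with hypothetical 'session_id' and
    'reason' keys; the real ping layout may differ.
    """
    reasons_by_session = defaultdict(set)
    for ping in pings:
        reasons_by_session[ping['session_id']].add(ping['reason'])
    wanted = {'shutdown', 'aborted-session'}
    # A session is suspect when its set of reasons contains both values.
    return sorted(sid for sid, reasons in reasons_by_session.items()
                  if wanted <= reasons)


# Made-up sample data for illustration only.
sample_pings = [
    {'session_id': 'a', 'reason': 'shutdown'},
    {'session_id': 'a', 'reason': 'aborted-session'},
    {'session_id': 'b', 'reason': 'aborted-session'},
    {'session_id': 'c', 'reason': 'shutdown'},
]
suspect_sessions = sessions_with_both_reasons(sample_pings)
```

Sessions flagged this way would be candidates for a stale aborted-session ping rather than a real abort.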
Updated•9 years ago
Whiteboard: [unifiedTelemetry] → [unifiedTelemetry][data-validation]
Comment 6•9 years ago
This is not actionable yet, unassigning.
Assignee: alessio.placitelli → nobody
Assignee | ||
Comment 8•9 years ago
I am developing the data set that will allow me to analyze this. I expect to have something around the end of the week.
Flags: needinfo?(spenrose)
Updated•9 years ago
Points: --- → 1
Priority: -- → P2
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry][data-validation][measurement:client]
Assignee | ||
Comment 9•9 years ago
Details are here: https://docs.google.com/document/d/1rvbCyVSzBsexer-0G7HdR_l1DczKAJdjzChxv0MjYco/edit#
Specifically, using the "crashes per 1000 clientID" metric from the Executive Dashboard, 41 Beta v2 records show a 2.7x INCREASE compared to Release in May, and 41 Beta v4 shows a 26% DECREASE.
If we have not seen this discrepancy in any previous analysis, then my data collection method should be considered guilty until proven innocent, particularly with respect to the 2.7x increase. More soon.
Comment 10•9 years ago
Do you have any pointers to the current export & comparison/rollup code that gets us the crash data?
Assignee | ||
Comment 11•9 years ago
Here are the key extraction details:
V2_CRASH_KEYS = set(['main-crash', 'main-hang', 'plugin-crash', 'plugin-hang'])
crash_dict = d.get('org.mozilla.crashes.crashes', {})
crash_count = sum([v for k, v in crash_dict.items() if k in V2_CRASH_KEYS])
It's also possible I somehow doubled or (more likely) tripled the values during the multi-step process of creating the tables in question. If I included too many keys then great, we're done. Otherwise I can re-execute the logic and see if the ratio holds.
Comment 12•9 years ago
How about the v4 data this compares to?
Assignee | ||
Comment 13•9 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> How about the v4 data this compares to?
It's just the number of pings of doctype "crash" in the executive summary stream as exposed by moztelemetry.get_records().
Updated•9 years ago
Flags: needinfo?(benjamin)
Comment 14•9 years ago
The v2 rollup should be main-crash only and should not include main-hang (those shouldn't actually exist at all), plugin-crash or plugin-hang.
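For illustration, a minimal sketch of the narrower rollup, reusing the lookup from comment 11. The `count_v2_crashes` helper and the sample record are hypothetical, and the record layout is assumed from that comment:

```python
# Per the note above: count main-process crashes only.
V2_CRASH_KEYS = {'main-crash'}

# The broader key set used in comment 11, kept here only for comparison.
V2_CRASH_KEYS_BROAD = {'main-crash', 'main-hang', 'plugin-crash', 'plugin-hang'}


def count_v2_crashes(day, keys=V2_CRASH_KEYS):
    """Sum the crash counts in one day record, restricted to `keys`."""
    crash_dict = day.get('org.mozilla.crashes.crashes', {})
    return sum(v for k, v in crash_dict.items() if k in keys)


# Made-up sample record for illustration only.
sample_day = {
    'org.mozilla.crashes.crashes': {
        'main-crash': 2,
        'plugin-crash': 3,
        'plugin-hang': 1,
    }
}
narrow_count = count_v2_crashes(sample_day)
broad_count = count_v2_crashes(sample_day, V2_CRASH_KEYS_BROAD)
```

On this made-up record the broad key set counts three times as many crashes as the narrow one, which shows how including plugin keys can inflate the v2 total.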
Flags: needinfo?(benjamin)
Assignee | ||
Comment 15•9 years ago
\o/! Way to go Penrose, making shallow transparent mistakes! I will run a new study tomorrow morning and update the doc.
Comment 16•9 years ago
getting out of client sprint
Whiteboard: [unifiedTelemetry][data-validation][measurement:client] → [unifiedTelemetry][data-validation]
Updated•9 years ago
Whiteboard: [unifiedTelemetry][data-validation] → [unifiedTelemetry]
Assignee | ||
Comment 17•9 years ago
Confirmed caused by bad query.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [unifiedTelemetry]