Closed
Bug 1348945
Opened 8 years ago
Closed 8 years ago
huge increase in beta weekly-active-user crash rates
Categories
(Core :: General, defect, P1)
RESOLVED
FIXED
Release | Tracking | Status
---|---|---
firefox53 | - | fix-optional
People
(Reporter: bkelly, Assigned: chutten)
Open:
https://metrics.services.mozilla.com/firefox-dashboard/
Change the combo boxes to "weekly" and the "beta" release channel. Observe that crashes spike by about 250% in the last week. It's possible this is due to the FF53 merge to beta.
If you select the aurora channel, you can see it had increased crash rates on FF53 as well, and it's gotten worse with FF54.
Just filing this bug in case the issue is not on anyone's radar yet. Benjamin, is this known?
Flags: needinfo?(benjamin)
Reporter | ||
Comment 1•8 years ago
We also had a noticeable decrease in WAU last week. Perhaps this is just a data quality issue if the denominator in (crashes / # profiles) was reported abnormally low for some reason?
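To make the denominator hypothesis concrete, here is a minimal sketch using made-up numbers (not dashboard data). The `crash_rate` helper is hypothetical and is not the dashboard's code; it only shows how an under-reported profile count inflates the rate while the crash count stays the same.

```python
def crash_rate(crashes: int, active_profiles: int) -> float:
    """Crashes per weekly active profile."""
    if active_profiles <= 0:
        raise ValueError("active_profiles must be positive")
    return crashes / active_profiles


# Hypothetical figures, chosen only to show the effect.
baseline = crash_rate(crashes=1000, active_profiles=100_000)
undercounted = crash_rate(crashes=1000, active_profiles=50_000)

# Same crash count, half the reported profiles: the rate doubles.
relative_change = undercounted / baseline - 1
print(f"{relative_change:.0%}")  # prints "100%"
```

A dashboard whose denominator is not profile-based would not show this effect, so comparing the two is one way to test the hypothesis.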
Updated•8 years ago
Flags: needinfo?(benjamin) → needinfo?(mcastelluccio)
Comment 2•8 years ago
Or maybe Mauro. But definitely not me at this point!
Flags: needinfo?(mdoglio)
Comment 3•8 years ago
The only issue I'm aware of regarding crashes is bug 1345153. This should only affect Firefox 54+ (nightly & aurora) and according to :chutten's analysis should account for a 39% increase of main crashes. I'll ni :mreid (who owns the dataset) to see what's going on.
Flags: needinfo?(mdoglio) → needinfo?(mreid)
Assignee | ||
Comment 4•8 years ago
I'm not seeing much on crashdash[1], which is consistent with a "low WAU" hypothesis, as it uses "kilo usage hours" as a denominator. I confirm that pingSender's dupes should not yet have reached beta.
[1]: https://telemetry.mozilla.org/crashes/
Reporter | ||
Comment 5•8 years ago
Note that there is a dip in WAU numbers for release channel as well, but it doesn't show the spike in crash rates. Not sure if that shoots a hole in that theory.
Comment 6•8 years ago
The crash_summary dataset shows more crashes reported on beta in the past couple of weeks.
https://sql.telemetry.mozilla.org/queries/3761/source
I checked the raw data to see if it was a problem with duplicate document ids, and, while there are a few (less than 2%), there are not enough dupes to explain this increase.
Flags: needinfo?(mreid)
Reporter | ||
Comment 7•8 years ago
[Tracking Requested - why for this release]:
Beta merges to release in 2.5 weeks. It feels like we need to understand this large stability regression before that merge can happen.
tracking-firefox53:
--- → ?
Comment 8•8 years ago
-> chutten for diagnosis, since marco is on PTO
Assignee: nobody → chutten
Priority: -- → P1
Assignee | ||
Comment 9•8 years ago
Can someone point me to the code for the dashboard? The link at the bottom is broken, and I'd love to see how it counts crashes.
As for :mreid's analysis, a couple of things happened in March. The first, and most relevant, is that Beta 53 was released and, with it, sending "crash" pings for content crashes.
So if we're just counting crash pings, we should expect an explosion since merge day. This is why crash_aggregates' numbers are not quite as upset: it currently counts content crashes using "main" pings (for legacy reasons).
Here's a query to illustrate how the different processTypes add up: https://sql.telemetry.mozilla.org/queries/3908/source
All of the pings with a non-NULL processType are from 53, and the "content" ones are completely new. Consistent with my hypothesis, the sum of NULL+main crash pings is roughly constant (actually dropping slightly) across time.
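The breakdown described above can be sketched as a toy tally. This uses invented ping records, not the linked query or real telemetry; the record layout and the helper names (`tally_by_process_type`, `browser_crashes`, `all_crash_pings`) are assumptions for illustration only.

```python
from collections import Counter


def tally_by_process_type(pings):
    """Count crash pings per (week, processType); None means the field is absent."""
    counts = Counter()
    for ping in pings:
        counts[(ping["week"], ping.get("processType"))] += 1
    return counts


# Invented sample: week 1 is before the 53 merge to beta, week 2 is after.
pings = (
    [{"week": 1}] * 10                              # pre-53 pings carry no processType
    + [{"week": 2}] * 4                             # clients not yet on 53
    + [{"week": 2, "processType": "main"}] * 5
    + [{"week": 2, "processType": "content"}] * 12  # new with 53
)

counts = tally_by_process_type(pings)


def browser_crashes(week):
    # NULL + main: the series that is comparable across the merge.
    return counts[(week, None)] + counts[(week, "main")]


def all_crash_pings(week):
    # A naive count of every crash ping, regardless of process type.
    return sum(n for (w, _), n in counts.items() if w == week)


print(browser_crashes(1), browser_crashes(2))  # prints "10 9"
print(all_crash_pings(1), all_crash_pings(2))  # prints "10 21"
```

In this toy data the NULL+main series stays roughly flat while the naive total more than doubles, which is the pattern a dashboard would show if it simply counted crash pings.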
(( Oh, and if you notice an inflation in Aurora in the same timeframe, that's likely bug 1345153 ))
Next steps: Check how those numbers on the metrics dashboards are being tallied. If it's a simple count of crash pings, this is a big ol' nothingburger (as :ddurst likes to call it). If it isn't, further investigation is required.
Flags: needinfo?(mcastelluccio)
Comment 10•8 years ago
I don't know who operates https://metrics.services.mozilla.com/firefox-dashboard/ but I think rweiss should.
Flags: needinfo?(rweiss)
Comment 11•8 years ago
I believe this is now managed by IT. NI'ing hcrince.
Flags: needinfo?(rweiss) → needinfo?(hcrince)
Comment 12•8 years ago
Pretty sure this is content crashes but let's make sure.
status-firefox53:
--- → affected
Comment 13•8 years ago
Content/shutdown crashes that we exclude from the stats, that is.
Comment 14•8 years ago
Liz asked me to comment in the bug regarding what we are using for our RelMan criteria. You can see what was reported at the last Channel meeting here: https://wiki.mozilla.org/Firefox/Channels/Meetings/2017-03-28#Beta
awsy rate: .84 (browser .55, content .29)
telemetry (m+c-s) from friday: 4.62 (was 4.07 the week before, 6.47 this time last cycle)
Assignee | ||
Comment 15•8 years ago
Okidoki, so the code generating the data for the dashboard is here: https://github.com/mozilla-services/data-pipeline/blob/e5c29541794325388336a210746029dce998b9e5/reports/executive_summary/run_executive_report.py#L117-L118 (thank you :mreid for the pointer)
It just counts the number of crash pings received. So, as previously noted, with the 53 train hitting beta, the introduction of content-process crash pings entirely explains the increase seen.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 16•8 years ago
Chris, this bug isn't fixed yet because the executive dashboard is still incorrect (and IIRC it is supposed to be the source of truth for board meetings).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 17•8 years ago
:mreid filed bug 1352443 for the development effort. I thought this was for investigation.
Comment 18•8 years ago
No, I don't think this bug can be called fixed until the dashboard there shows our official source of crash-rate truth.
Comment 19•8 years ago
I don't think I need to keep tracking this for 53 as there is nothing in-product that would affect the release. I commented in bug 1352443.
Assignee | ||
Comment 20•8 years ago
Thanks to :mreid's efforts in bug 1352443, the dashboard now links to arewestableyet.com and https://telemetry.mozilla.org/crashes instead of displaying incorrect crash counts.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Flags: needinfo?(hcrince)
Resolution: --- → FIXED