Closed Bug 1348945 Opened 8 years ago Closed 8 years ago

huge increase in beta weekly-active-user crash rates

Categories

(Core :: General, defect, P1)

RESOLVED FIXED
Tracking Status
firefox53 - fix-optional

People

(Reporter: bkelly, Assigned: chutten)

Open: https://metrics.services.mozilla.com/firefox-dashboard/ Change the combo boxes to "weekly" and "beta" release channel. Observe that crashes spike by about 250% in the last week. It's possible this is due to the FF53 merge to beta. If you select the aurora channel you can see it had increased crash rates on FF53 as well and it's gotten worse with FF54. Just filing this bug in case the issue is not on anyone's radar yet. Benjamin, is this known?
Flags: needinfo?(benjamin)
We also had a noticeable decrease in WAU last week. Perhaps this is just a data quality issue if the denominator in (crashes / # profiles) was reported abnormally low for some reason?
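The denominator hypothesis above can be sanity-checked with a little arithmetic. This is only a sketch, assuming the dashboard rate really is crashes divided by weekly active profiles as described in the comment; the function name and the 3.5x figure (one reading of "spike by about 250%") are illustrative assumptions, not values taken from the dashboard.

```python
def wau_fraction_needed(rate_multiplier: float) -> float:
    """Fraction of the previous WAU required to explain a given crash-rate
    multiplier if the crash count itself did not change.

    rate = crashes / wau, so new_rate / old_rate = old_wau / new_wau.
    """
    if rate_multiplier <= 0:
        raise ValueError("rate_multiplier must be positive")
    return 1.0 / rate_multiplier


# Assumption: "spike by about 250%" is read as a 250% increase (3.5x).
needed = wau_fraction_needed(3.5)
print(f"WAU would have to fall to {needed:.0%} of its previous value")
```

Under that reading, WAU alone would have to drop to well under a third of its previous value, so a merely "noticeable" decrease would probably not account for the whole spike by itself.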
Flags: needinfo?(benjamin) → needinfo?(mcastelluccio)
Or maybe Mauro. But definitely not me at this point!
Flags: needinfo?(mdoglio)
The only issue I'm aware of regarding crashes is bug 1345153. This should only affect Firefox 54+ (nightly & aurora) and according to :chutten's analysis should account for a 39% increase of main crashes. I'll ni :mreid (who owns the dataset) to see what's going on.
Flags: needinfo?(mdoglio) → needinfo?(mreid)
I'm not seeing much on crashdash[1], which is consistent with a "low WAU" hypothesis, as it uses "kilo usage hours" as a denominator. I confirm that pingSender's dupes should not yet have reached beta. [1]: https://telemetry.mozilla.org/crashes/
Note that there is a dip in WAU numbers for release channel as well, but it doesn't show the spike in crash rates. Not sure if that shoots a hole in that theory.
The crash_summary dataset shows more crashes reported on beta in the past couple of weeks. https://sql.telemetry.mozilla.org/queries/3761/source I checked the raw data to see if it was a problem with duplicate document ids, and, while there are a few (less than 2%), there are not enough dupes to explain this increase.
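The duplicate check described above could look roughly like the following. It is a minimal sketch that assumes each raw ping carries a document id string; the function name and the sample ids are hypothetical, and the actual check was run against the raw telemetry data rather than an in-memory list.

```python
from collections import Counter
from typing import Iterable


def duplicate_fraction(document_ids: Iterable[str]) -> float:
    """Share of pings that repeat an already-seen document id."""
    counts = Counter(document_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of each id is a duplicate.
    duplicates = total - len(counts)
    return duplicates / total


# Hypothetical sample: five pings, one of which repeats an earlier id.
sample_ids = ["a1", "b2", "a1", "c3", "d4"]
print(duplicate_fraction(sample_ids))
```

A result below 0.02 would match the "less than 2%" observation and would be too small to explain an increase of the size seen on the dashboard.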
Flags: needinfo?(mreid)
[Tracking Requested - why for this release]: Beta merges to release in 2.5 weeks. It feels like we need to understand this large stability regression before that merge can happen.
-> chutten for diagnosis, since marco is on PTO
Assignee: nobody → chutten
Priority: -- → P1
Can someone point me to the code for the dashboard? The link at the bottom is broken, and I'd love to see how it counts crashes. As for :mreid's analysis, a couple of things happened in March. The first, and most relevant, is that Beta 53 was released and, with it, sending "crash" pings for content crashes. So if we're just counting crash pings, we should expect an explosion since merge day. This is why crash_aggregates' numbers are not quite as upset: it currently counts content crashes using "main" pings (for legacy reasons). Here's a query to illustrate how the different processTypes add up: https://sql.telemetry.mozilla.org/queries/3908/source All of the pings with a non-NULL processType are from 53, and the "content" ones are completely new. Consistent with my hypothesis, the sum of NULL+main crash pings is roughly constant (actually dropping slightly) across time. (( Oh, and if you notice an inflation in Aurora in the same timeframe, that's likely bug 1345153 )) Next steps: Check how those numbers on the metrics dashboards are being tallied. If it's a simple count of crash pings, this is a big ol' nothingburger (as :ddurst likes to call it). If it isn't, further investigation is required.
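To make the processType argument concrete, here is a small sketch of the tally the linked query is described as producing. The record shape, function names and sample pings are assumptions for illustration only; the real numbers come from the crash_summary dataset.

```python
from collections import Counter
from typing import Iterable, Mapping, Optional


def tally_by_process_type(
    pings: Iterable[Mapping[str, Optional[str]]]
) -> Counter:
    """Count crash pings per processType; a missing value is kept as None."""
    return Counter(ping.get("processType") for ping in pings)


def main_process_count(tally: Counter) -> int:
    """Pings comparable to pre-53 counts: NULL (older builds) plus "main"."""
    return tally[None] + tally["main"]


# Hypothetical week of beta crash pings after the 53 merge.
pings = [
    {"processType": None},
    {"processType": "main"},
    {"processType": "main"},
    {"processType": "content"},
    {"processType": "content"},
    {"processType": "content"},
]
tally = tally_by_process_type(pings)
print("all crash pings:", sum(tally.values()))
print("NULL + main only:", main_process_count(tally))
```

A dashboard that simply counts every crash ping would report the first number, while a count comparable with earlier betas is closer to the second, which is why the new "content" pings can look like a stability regression.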
Flags: needinfo?(mcastelluccio)
I don't know who operates https://metrics.services.mozilla.com/firefox-dashboard/ but I think rweiss should.
Flags: needinfo?(rweiss)
I believe this is now managed by IT. NI'ing hcrince.
Flags: needinfo?(rweiss) → needinfo?(hcrince)
Pretty sure this is content crashes but let's make sure.
Content/shutdown crashes that we exclude from the stats, that is.
liz asked me to comment in the bug regarding what we are using for our RelMan criteria - you can see what was reported at the last Channel meeting here: https://wiki.mozilla.org/Firefox/Channels/Meetings/2017-03-28#Beta
awsy rate: .84 (browser .55, content .29)
telemetry (m+c-s) from friday: 4.62 (was 4.07 the week before, 6.47 this time last cycle)
Okidoki, so the code generating the data for the dashboard is here: https://github.com/mozilla-services/data-pipeline/blob/e5c29541794325388336a210746029dce998b9e5/reports/executive_summary/run_executive_report.py#L117-L118 (thank you :mreid for the pointer) It just counts the number of crash pings received. So, as previously noted, with the 53 train hitting beta, the introduction of content-process crash pings explains entirely the increase seen.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
See Also: → 1352443
Chris, this bug isn't fixed yet because the executive dashboard is still incorrect (which is IIRC supposed to be the source of truth for board meetings).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
:mreid filed bug 1352443 for the development effort. I thought this was for investigation.
No, I don't think this bug can be called fixed until the dashboard there shows our official source of crash-rate truth.
Depends on: 1352443
I don't think I need to keep tracking this for 53 as there is nothing in-product that would affect the release. I commented in bug 1352443.
Thanks to :mreid's efforts in bug 1352443, the dashboard now links to arewestableyet.com and https://telemetry.mozilla.org/crashes instead of displaying incorrect crash counts.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(hcrince)
Resolution: --- → FIXED