Closed Bug 1528996 Opened 10 months ago Closed 4 months ago

Nightly Fennec MC rate doubled since last week

Categories

(Cloud Services :: Mission Control, defect)

Unspecified
Android
defect
Not set

Tracking

(firefox67 affected)

RESOLVED WORKSFORME

People

(Reporter: marcia, Unassigned)

Details

(Whiteboard: [geckoview]?)

At last week's Tuesday/Thursday Channel meetings and Wednesday Cross Functional meeting, MC was reporting a rate in the mid-20s. Today it is reporting almost double that. Did something happen regarding usage hours? We merged 2019-01-28, so we are now a few weeks into the nightly 67 cycle.

https://wiki.mozilla.org/Firefox/Channels/Meetings/2019-02-07 - Unfortunately data was missing that day since I was out

https://wiki.mozilla.org/Firefox/Channels/Meetings/2019-02-12#Mobile - Nightly score was 25.39

https://wiki.mozilla.org/Firefox/Channels/Meetings/2019-02-14#Mobile - Nightly score was 24.89

https://public.etherpad-mozilla.org/p/channel-meeting - Today's Nightly rate is 45.73

Could that be related to the fact that we now offer x86_64 builds? (bug 1505538)

Did a quick redash query of the telemetry ping counts -- looks like the counts are pretty evenly divided between aarch64 and arm:

https://sql.telemetry.mozilla.org/queries/61536#158420

Whatever the problem was, it seems to have been decreasing in volume.

Socorro reports only 1 crash report from x86_64 Fennec and only 21 from 32-bit x86 Fennec, which is in line with the x86 Fennec crash rates before we started publishing x86_64 Fennec.

https://crash-stats.mozilla.com/search/?product=FennecAndroid&version=67.0a1&date=%3E%3D2019-02-13T07%3A42%3A00.000Z&date=%3C2019-02-20T07%3A42%3A00.000Z&_facets=signature&_facets=cpu_arch&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_arch

x86_64 Fennec was in the Google Play Store on February 13, three days before the crash rate started increasing. Maybe the increase was related to weekend usage? The crash rate appears to be decreasing now, though perhaps that is just telemetry processing lag?

OS: Unspecified → Android
Whiteboard: [geckoview:p1]

The current ARM64 Fennec Nightly builds only have the Baseline JS JIT. The incomplete IonMonkey JS JIT may have been accidentally enabled (by bug 1523015) on February 13, causing crashes like bug 1528621. That might explain the increase in ARM64 Fennec crashes, but not the 32-bit ARM Fennec crashes.

https://crash-stats.mozilla.com/signature/?signature=js%3A%3Ajit%3A%3APatchJump shows 57 crashes across 21 installs for that signature, which is really the only newish high-volume crash currently on nightly besides the existing Bug 1521158.

As Will notes in Comment 4, the issue seems to have come and gone in a matter of days.

I broke things down on the 17th (the crashiest day) by build and client id:

https://sql.telemetry.mozilla.org/queries/61563/

It looks like a single client running build 20190216093716 is responsible for 15% (191) of the crashes, which would explain some of the distortion, although the remaining crashes seem reasonably well distributed at first glance. It would really be nice to know what exactly crashed and how: as we've mentioned before, we get a bunch of crashes in telemetry that don't make it to Socorro.
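For reference, the per-client breakdown described above can be approximated offline once crash pings are exported. This is a minimal sketch, not the query behind the redash link: the `(client_id, build_id)` tuple shape, the function name, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def top_client_share(pings):
    """Return (client_id, crash_count, share) for the noisiest client.

    `pings` is an iterable of (client_id, build_id) tuples, one per crash
    ping. This record shape is hypothetical, not the real telemetry schema.
    """
    counts = Counter(client_id for client_id, _build_id in pings)
    total = sum(counts.values())
    if total == 0:
        return None, 0, 0.0
    client_id, crashes = counts.most_common(1)[0]
    return client_id, crashes, crashes / total


# Made-up sample: one client sends 3 of 20 crash pings.
sample = [("client-a", "build-1")] * 3 + [
    (f"client-{i}", "build-1") for i in range(17)
]
client, crashes, share = top_client_share(sample)
print(f"{client}: {crashes} crashes, {share:.0%} of total")
```

If one client's share is large, recomputing the daily total without that client gives a quick sense of how much of the spike it accounts for.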

As it is though, I'm not sure if I can justify the effort involved in extracting the pings and running an analysis, given that this is a transient problem. We should be symbolicating these pings and generating automatic reports later in 2019.

Adding :chutten here in case he has anything to add.

Taking a look at the MozCrashReason, the most common reason for the crash is NULL: https://sql.telemetry.mozilla.org/queries/61570/source

I don't know enough about the character of Fennec crashes to guess what causes a NULL reason.

Today's Fennec rate also zoomed up to 199.30, with main crashes showing an increase of 438%. Is there another spike that is showing on the Telemetry side for this increase?

Flags: needinfo?(wlachance)

(In reply to Marcia Knous [:marcia - needinfo? me] from comment #10)

> Today's Fennec rate also zoomed up to 199.30, with main crashes showing an increase of 438%. Is there another spike that is showing on the Telemetry side for this increase?

This is another case where the way mission control calculates things can throw you off. In this case we stopped incorporating any data from 67.0 into the nightly rate, which meant we only had the 68 data (which has fewer usage hours associated with it). You can see there's nothing remarkable happening by zooming in on the data in the graph:

https://data-missioncontrol.dev.mozaws.net/#/nightly/android/main_crashes?aggregateLength=1&timeInterval=604800&relative=0&percentile=99&normalized=0&disabledVersions=&versionGrouping=version

As you can see the overall number of crashes has remained relatively constant over the last week. I would expect the nightly rate calculation to settle down soon.
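To illustrate why a small usage-hours denominator can inflate the number, here is a rough sketch. It assumes the rate is crashes per 1,000 usage hours pooled across the included versions; the thread does not give Mission Control's actual formula, and the function names and all figures below are hypothetical.

```python
def crash_rate(crashes, usage_hours, per=1000.0):
    """Crashes per `per` usage hours (an assumed metric, not MC's own)."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return crashes * per / usage_hours


def combined_rate(versions, included):
    """Pool crashes and usage hours over the included versions."""
    crashes = sum(versions[v]["crashes"] for v in included)
    hours = sum(versions[v]["usage_hours"] for v in included)
    return crash_rate(crashes, hours)


# Hypothetical numbers, chosen only to show the effect.
versions = {
    "67": {"crashes": 500, "usage_hours": 20000.0},
    "68": {"crashes": 40, "usage_hours": 200.0},
}
print(combined_rate(versions, ["67", "68"]))  # both versions pooled
print(combined_rate(versions, ["68"]))        # new version only
```

With these made-up inputs the 68-only rate is several times the pooled rate even though the new version contributes few crashes, so a version with little usage can dominate the headline number until more hours accumulate.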

Flags: needinfo?(wlachance)

P.S. In the future, could you please file a new bug for each issue that you see, rather than piggy-backing on top of old issues like this? Having a bunch of unrelated problems attached to a single bug report makes it difficult to track/understand how many issues we're seeing over time. If you think the issue might be related, feel free to link to other bugs in a new report. It's easy to mark reports as duplicate after the fact.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #12)

> P.S. In the future, could you please file a new bug for each issue that you see, rather than piggy-backing on top of old issues like this? Having a bunch of unrelated problems attached to a single bug report makes it difficult to track/understand how many issues we're seeing over time. If you think the issue might be related, feel free to link to other bugs in a new report. It's easy to mark reports as duplicate after the fact.

Yes, sorry about that.

Marcia, is this bug about Fennec 67 Nightly's MC rate still relevant?

Flags: needinfo?(mozillamarcia.knous)
Whiteboard: [geckoview:p1] → [geckoview]?

Resolving as WFM, as Comment 8 explains what happened in this case.

Status: NEW → RESOLVED
Closed: 4 months ago
Flags: needinfo?(mozillamarcia.knous)
Resolution: --- → WORKSFORME