Open Bug 1515439 Opened 5 years ago Updated 5 years ago

On linux, content_crash rate seems to be getting worse release over release

Categories

(Cloud Services :: Mission Control, defect)

defect
Not set
normal

Tracking

(firefox64 affected)

REOPENED
Tracking Status
firefox64 --- affected

People

(Reporter: ritu, Unassigned)

Details

Attachments

(6 files)

Go to https://missioncontrol.telemetry.mozilla.org/#/release/linux and observe the content_crashes rate; it seems to be steadily going up.

Conversely, the main_crashes rate is going down.

We should try to investigate the cause of this trend.

Hi Dolske, this sounds like an important concern. Who should be involved in investigating it?

Flags: needinfo?(dolske)

I would assume crash issues should start out in platform, so over to... Selena I guess?

Flags: needinfo?(dolske) → needinfo?(sdeckelmann)
Flags: needinfo?(sdeckelmann)

Looking at it now, the numbers are different, and it looks like the crashes are going down. Maybe that was a temporary blip?

Flags: needinfo?(rkothari)

Hi Selena, since we just pushed out 64.0.2, let's wait a few days before we review the crash rates on linux on release channel.

Hi Will, Lonnen, I am surprised that the two screenshots (mine and Selena's) have such different crash rates (mainly looking at main and content) for 61, 62. If the numbers fluctuate so much, how can we reliably use these for comparisons?

Flags: needinfo?(wlachance)
Flags: needinfo?(chris.lonnen)

(In reply to Ritu Kothari (:ritu) from comment #7)

Hi Will, Lonnen, I am surprised that the two screenshots (mine and Selena's) have such different crash rates (mainly looking at main and content) for 61, 62. If the numbers fluctuate so much, how can we reliably use these for comparisons?

Mission Control shouldn't be updating the rate for a release after a newer one has gone out the door-- so that's a bug if it's happening. I'll investigate.

That said, Mission Control only measures usage hours for "official" versions of Firefox. On Linux, a relatively small proportion of our users are on the official versions (most are on derived editions from Ubuntu and other vendors). When the population is so small, a relatively small number of users experiencing (or not experiencing!) a crash can have a big impact on the rate, since the rate is # of crashes per thousand hours of use across the entire population.
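
To illustrate the small-population effect described above, here is a minimal sketch. The function name and all numbers are hypothetical and chosen only for illustration; this is not Mission Control's actual code.

```python
def crash_rate(crashes: int, usage_hours: float) -> float:
    """Return crashes per 1,000 usage hours."""
    if usage_hours <= 0:
        raise ValueError("usage_hours must be positive")
    return crashes / (usage_hours / 1000.0)


# Hypothetical populations with the same baseline rate but very different sizes.
small_base = crash_rate(crashes=20, usage_hours=10_000)
large_base = crash_rate(crashes=2_000, usage_hours=1_000_000)

# Add the same ten extra crashes to each population.
small_bumped = crash_rate(crashes=30, usage_hours=10_000)
large_bumped = crash_rate(crashes=2_010, usage_hours=1_000_000)

print(f"small population: {small_base:.2f} -> {small_bumped:.2f}")
print(f"large population: {large_base:.2f} -> {large_bumped:.2f}")
```

With these made-up numbers, ten extra crashes move the small population's rate by a full point, while the large population's rate barely changes.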

So tl;dr: I'd be surprised if there was any concerning trend here. I will double check though. I'll clear the flags for lonnen/ritu and re-escalate this bug if I find anything.

Flags: needinfo?(rkothari)
Flags: needinfo?(chris.lonnen)
Attached image "all" numbers

(In reply to William Lachance (:wlach) (use needinfo!) from comment #8)

So tl;dr: I'd be surprised if there was any concerning trend here. I will double check though. I'll clear the flags for lonnen/ritu and re-escalate this bug if I find anything.

Just realized what was going on. By default mission control shows an "adjusted" view of the crash rate, which tries to account for the fact that releases are crashier for the first week or so that they're out, and thus tries to show the crash rate for previous releases in a similar interval so you can compare like-versus-like. This view is expected to evolve over the course of a release's life cycle.

If you want a view of how each release performed overall, be sure to use the overall selection in this view. I'll try to look into how to display/explain this better in the future.

From that POV, it does look like the main crash rate in 64 is about half what it was in previous versions, while the content crash rate is pretty much the same.

I'm not sure how to explain the drop in main crash rate, actually, though it seems real by the graphs (https://data-missioncontrol.dev.mozaws.net/#/release/linux/main_crashes?aggregateLength=24&timeInterval=604800&relative=1&percentile=99&normalized=1&disabledVersions=&versionGrouping=version -- may need to reload a few times before it loads).

The top crashers for 64 vs 63 aren't showing any obvious culprits, but note these charts only count submitted crashes, not the crash pings we use to count crashes in mission control:

https://crash-stats.mozilla.com/search/?product=Firefox&version=63.0.1&version=63.0&version=63.0.3&platform=linux&process_type=browser&date=%3E%3D2018-07-09T18%3A40%3A23.000Z&date=%3C2019-01-09T17%3A40%3A23.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

https://crash-stats.mozilla.com/search/?product=Firefox&version=64.0&version=64.0.2&platform=linux&process_type=browser&date=%3E%3D2018-12-26T17%3A40%3A38.000Z&date=%3C2019-01-09T17%3A40%3A38.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

The next step in investigating this would be looking at the crash pings themselves to see if there are any trends which weren't obvious in the above charts. This is hard and I'm not sure if it is worth the effort; maybe we can discuss at an upcoming channel meeting (I can't make Thursday's, but maybe next week).

Flags: needinfo?(wlachance)

Actually, on further inspection we're noticing a corresponding decrease in the main crash rate with 64 on Windows (-39%) and Mac (-25%). You can see this on the main mission control page:

https://missioncontrol.telemetry.mozilla.org/#/

Still don't know exactly why (the top crash lists for Windows are just as opaque to me as the Linux ones), but this indicates either an improvement in quality or a change in usage pattern.

I am not sure what to believe. Another anomaly I noticed is the crazy drop in content_shutdown_crashes in 64 as compared to 62, 63 for windows, linux and mac (not as prominent as the other two).

Are we certain there isn't a problem in telemetry data or some miscalculation somewhere?

(In reply to Ritu Kothari (:ritu) from comment #11)

I am not sure what to believe. Another anomaly I noticed is the crazy drop in content_shutdown_crashes in 64 as compared to 62, 63 for windows, linux and mac (not as prominent as the other two).

The decrease in content_shutdown_crashes was investigated in bug 1515664; it looks like the root cause was bug 1498942 being fixed.

Are we certain there isn't a problem in telemetry data or some miscalculation somewhere?

Nothing is ever certain, but I don't see any reason to believe there is a problem a priori. Mission Control seems to be correctly identifying large-scale changes in crash quantity on the nightly channel, for example.

Gian-Carlo, any thoughts here?

Flags: needinfo?(gpascutto)

From that POV, it does look like the main crash rate in 64 is about half what it was in previous versions, while the content crash rate is pretty much the same.

So, we're not looking for a regression, we're looking for a fix in the main process, correct?

The top crashers for 64 vs 63 aren't showing any obvious culprits, but note these charts only count submitted crashes, not the crash pings we use to count crashes in mission control:

If this is indeed across all platforms, either we fixed a (few) big crasher(s), or the ping that is sending this data got broken. If we can't find any crashers in crash-stats that dropped significantly, I guess we should be looking at the telemetry code that sends this ping?

wlach, ritu, does my understanding sound correct to you?

Flags: needinfo?(wlachance)
Flags: needinfo?(rkothari)

sounds like we can close this out as invalid.

(In reply to Jim Mathies [:jimm] from comment #19)

sounds like we can close this out as invalid.

Not until we're confident that we didn't break crash telemetry.

Discussed this with chutten a bit more on #missioncontrol:

https://mozilla.logbot.info/missioncontrol/20190128#c15888874

tl;dr - we don't think there is any strong evidence that anything bad/wrong is happening. The raw number of main and content crashes reported for 64 is about in line with that which you see for previous versions. You can see this by going to the count view on the summary pages:

https://data-missioncontrol.dev.mozaws.net/#/release/windows?window=adjusted&type=count

Remember there are two components to a crash rate: # of crashes (the numerator) and number of k usage hours (the denominator). I think at least some of the variability comes from the fact that we are using subsession lengths to count usage hours on desktop and this is known to be unreliable (bug 1514392).
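
As a rough sketch of why an unreliable denominator matters, the snippet below holds the crash count fixed and varies only the measured usage hours. The 20% measurement error and all figures are assumptions made up for illustration, not values taken from Mission Control or bug 1514392.

```python
def rate_per_khour(crashes: int, usage_khours: float) -> float:
    """Crash rate: crashes (numerator) over thousands of usage hours (denominator)."""
    return crashes / usage_khours


crashes = 1_000          # hypothetical crash count, held constant
true_khours = 400.0      # hypothetical "true" usage, in thousands of hours

# Hypothetical measurement error in usage hours of 20% either way.
undercounted = rate_per_khour(crashes, true_khours * 0.8)
accurate = rate_per_khour(crashes, true_khours)
overcounted = rate_per_khour(crashes, true_khours * 1.2)

print(f"usage undercounted: {undercounted:.3f}")
print(f"usage accurate:     {accurate:.3f}")
print(f"usage overcounted:  {overcounted:.3f}")
```

Under these assumptions the reported rate ranges from roughly 2.08 to 3.13 even though the number of crashes never changes.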

I know this explanation is somewhat unsatisfying and handwavey, but I don't think a more detailed investigation (which would take at least a few days) is warranted at this point, given the lack of evidence that there is an actual problem. Some of the tools we're working on in 2019 will allow deeper investigation and better explanation for how/why things are changing. Stay tuned.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(wlachance)
Flags: needinfo?(rkothari)
Resolution: --- → INVALID

Hi Will, I see the content crash rate go up again on release65 as compared to previous releases on linux. Attaching the latest screenshot.

Flags: needinfo?(wlachance)

(In reply to Ritu Kothari (:ritu) from comment #23)

Hi Will, I see the content crash rate go up again on release65 as compared to previous releases on linux. Attaching the latest screenshot.

65 just went out so I'd wait another little while before making any judgements release-over-release. You'll note that the "adjusted" rates are very similar (currently I see 2.97 for 65, 2.72 for 64, and 2.55 for 63):

https://data-missioncontrol.dev.mozaws.net/#/release/linux?window=adjusted&type=rate

The adjusted rate compares historical data only over the same time period that the latest release is out. So that would be:

  • The first ~2 weeks of 64
  • The first ~2 weeks of 63
    ...

This is a fairer comparison initially.
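
The like-versus-like idea above can be sketched as follows. This is only an illustration of the windowing concept, assuming per-day (crashes, usage hours) samples counted from each release date; the data layout, function names, and numbers are hypothetical and do not reflect how Mission Control is actually implemented.

```python
from typing import Dict, List, Tuple

# One sample per day since release: (crashes, usage hours). Hypothetical layout.
DailySample = Tuple[int, float]


def rate_over_window(samples: List[DailySample], days: int) -> float:
    """Crashes per 1,000 usage hours over the first `days` days of a release."""
    window = samples[:days]
    crashes = sum(c for c, _ in window)
    hours = sum(h for _, h in window)
    return crashes / (hours / 1000.0)


def adjusted_rates(history: Dict[str, List[DailySample]], latest: str) -> Dict[str, float]:
    """Compare every release over the same interval the latest release has been out."""
    days_out = len(history[latest])
    return {v: rate_over_window(s, days_out) for v, s in history.items()}


# Made-up data: older releases have more days of history than the latest one.
history = {
    "63": [(30, 10_000.0), (25, 10_000.0), (20, 10_000.0), (20, 10_000.0)],
    "64": [(32, 10_000.0), (28, 10_000.0), (21, 10_000.0)],
    "65": [(33, 10_000.0), (27, 10_000.0)],
}

adjusted = adjusted_rates(history, latest="65")
overall_63 = rate_over_window(history["63"], len(history["63"]))

print(adjusted)
print(f"63 overall: {overall_63:.3f}")
```

In this toy data, 63 looks better "overall" than in the adjusted window because its later, calmer days are included only in the overall figure.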

Flags: needinfo?(wlachance)

Can you quantify another little while? Updates for linux on 65 have been unthrottled (at 100%) for a week now.

(In reply to Ritu Kothari (:ritu) from comment #25)

Can you quantify another little while? Updates for linux on 65 have been unthrottled (at 100%) for a week now.

On 64.0.2, it looks like it took about two weeks before the numbers started to level off:

https://data-missioncontrol.dev.mozaws.net/#/release/linux/content_crashes?aggregateLength=24&timeInterval=1295940&relative=0&percentile=99&normalized=1&disabledVersions=63.0,63.0.1,63.0.3,64.0&versionGrouping=version&startTime=1546992000

This is pretty much the same story as always: there aren't as many usage hours accumulated when something is first released (so the usage khours denominator is lower), "crashier" users tend to update quicker (because they restart their browser), etc.

(In reply to William Lachance (:wlach) (use needinfo!) from comment #24)

(In reply to Ritu Kothari (:ritu) from comment #23)

65 just went out so I'd wait another little while before making any
judgements release-over-release. You'll note that the "adjusted" rates are
very similar (currently I see 2.97 for 65, 2.72 for 64, and 2.55 for 63):

Even this is a worsening trend IMO, going from 2.55 to 2.97 for content crashes. For a number like 2, a 0.5 difference should trigger investigation into what the cause might be. What good are these crash-rate #s if we don't do anything about them but reason them out with "expected"?

I see the same content crash rate worsening trend on release channel on MC. See the attached screenshot.

Status: RESOLVED → REOPENED
Resolution: INVALID → ---

(In reply to Ritu Kothari (:ritu) from comment #29)

I see the same content crash rate worsening trend on release channel on MC. See the attached screenshot.

Again, this seems more or less like expected variations, in my judgement. The overall numbers for 65 seem to be in line with previous releases if you adjust for the time of release: 63 had an overall crash rate more or less exactly the same as 65 (~2.28). I don't feel like further investigation here is likely to produce any actionable results.

Ritu, who should own this bug now?

Flags: needinfo?(rkothari)

Hi Matt, I think we should move it to Mission Control component since there are discussions around addressing these kinds of false-positives (maybe it is a real problem?) and how to prevent them from showing up on Mission Control.

Component: General → Mission Control
Flags: needinfo?(rkothari)
Product: Firefox → Cloud Services
Version: 64 Branch → other