Open Bug 988055 Opened 11 years ago Updated 5 months ago

Use Telemetry to measure how often recovery from OOM happens in the wild

Categories

(Core :: JavaScript Engine, task, P5)

People

(Reporter: jorendorff, Assigned: iain)

References

(Blocks 1 open bug)

We can assume that if the exception is cleared and we survive a GC, we recovered.
Assignee: general → nobody
Iain, would this sound like something useful to improve our OOM story going forward?
Flags: needinfo?(iireland)
Huh. I guess there is nothing new under the sun. This is exactly what we decided to do at All Hands, except that Jason's heuristic in comment #1 is probably better than the timer idea we were talking about. I'm going to claim this bug for the telemetry work we're planning.
Assignee: nobody → iireland
Flags: needinfo?(iireland)

I've been discussing this over email with the telemetry team. Summarizing here:

It turns out that we've already done a lot of the work here. We already track OOM state in crash annotations:

https://searchfox.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.h#183-213
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml#466-480

In addition to distinguishing between small and large allocations, we distinguish between Recovered (successfully completed a GC), Reported (crashed prior to completing a GC), and Reporting (crashed while trying to handle the OOM). This isn't exactly what we want, because we only see "Recovered" if the browser crashes for a different reason, but it's a pretty good start.
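To make the three states concrete, here is a minimal sketch of the state tracking as I understand it from the description above. It is not the actual code in CycleCollectedJSRuntime.h: the class and method names (`OOMTracker`, `begin_oom_report`, `end_oom_report`, `gc_finished`, `crash_annotation`) are hypothetical, and the real implementation tracks small and large allocations separately.

```python
from enum import Enum


class OOMState(Enum):
    # State names follow the description above; "OK" (no OOM seen yet)
    # is an assumption added so the sketch has a starting state.
    OK = "OK"
    REPORTING = "Reporting"  # currently handling the OOM
    REPORTED = "Reported"    # OOM handled, no GC has completed since
    RECOVERED = "Recovered"  # a GC completed after the OOM


class OOMTracker:
    """Hypothetical tracker for one allocation size class."""

    def __init__(self):
        self.state = OOMState.OK

    def begin_oom_report(self):
        # Called when an allocation fails and OOM handling starts.
        self.state = OOMState.REPORTING

    def end_oom_report(self):
        # Called once the OOM has been reported without crashing.
        self.state = OOMState.REPORTED

    def gc_finished(self):
        # Surviving a full GC after a reported OOM counts as recovery.
        if self.state is OOMState.REPORTED:
            self.state = OOMState.RECOVERED

    def crash_annotation(self):
        # Value that would be attached to a crash report; None if no OOM.
        return None if self.state is OOMState.OK else self.state.value
```

The limitation mentioned above shows up directly in the sketch: `crash_annotation()` is only ever read when the process crashes, so a session that reaches `RECOVERED` and then exits normally leaves no trace.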

Here are a couple of telemetry queries:

Small allocations: https://sql.telemetry.mozilla.org/queries/60789
Large allocations: https://sql.telemetry.mozilla.org/queries/60792

(I arbitrarily picked Jan 1, 2018 as the start date.)

Things I found interesting:

  1. Very few crashes have any large allocation failures, and they almost always recover. We literally had 1 crash last year where we didn't complete a GC after failing a large allocation. It's quite likely that was a pure coincidence and we crashed for an unrelated reason.

  2. A surprisingly high percentage of JSOutOfMemory crash annotations for small allocations are recoveries. It's about a 2:1 ratio (reported:recovered), and that doesn't include cases where we recover but don't subsequently crash.

  3. Notwithstanding 2, the total numbers for all of these OOMs are tiny. Out of 5B crash reports from 2018, fewer than 100K had a JSOutOfMemory crash annotation. Jason pointed out on IRC that we can calculate a reasonable upper bound for the total number of successful recoveries, using the reasoning "if it were happening more than X often, then we would see way more crashes with OOM annotations". At first glance, the math seems to imply that JS OOMs are not a meaningful contributor to overall Firefox crash stats.
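One way to read the upper-bound reasoning in item 3, as a rough sketch rather than a worked result: if a session that recovered from an OOM later crashes with probability p, then the number of crash reports annotated "Recovered" is about p times the total number of recoveries, so dividing the observed count by p estimates the total. The function names below are hypothetical, and p (`crash_probability`) is an assumed input; it is not something the queries above provide.

```python
def annotated_fraction(annotated_crashes, total_crashes):
    """Share of crash reports that carry a JSOutOfMemory annotation."""
    return annotated_crashes / total_crashes


def estimate_total_recoveries(recovered_crashes, crash_probability):
    """Estimate all recoveries from the ones visible in crash reports.

    recovered_crashes: crash reports annotated "Recovered".
    crash_probability: assumed chance that a session which recovered
        from an OOM later crashes for some other reason (hypothetical
        input, must be in (0, 1]).
    """
    if not 0 < crash_probability <= 1:
        raise ValueError("crash_probability must be in (0, 1]")
    return recovered_crashes / crash_probability


# Figures from item 3: under 100K annotated reports out of 5B.
print(f"{annotated_fraction(100_000, 5_000_000_000):.4%}")  # 0.0020%
```

If sessions under memory pressure are more likely to crash than average, plugging in the overall crash rate for p would overestimate the number of recoveries, which is the right direction for an upper bound.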

The next step is to look at the data more closely to see whether these conclusions still hold when we focus on specific populations (mobile/32-bit/...).

Severity: normal → S3
Blocks: sm-api
Severity: S3 → N/A
Type: defect → task
Priority: -- → P5