Use Telemetry to measure how often recovery from OOM happens in the wild
Categories
(Core :: JavaScript Engine, task, P5)
Tracking
()
People
(Reporter: jorendorff, Assigned: iain)
References
(Blocks 1 open bug)
Details
Updated•10 years ago
Assignee
Comment 2•6 years ago
Assignee
Comment 3•6 years ago
I exchanged emails with the telemetry team. Summarizing here:
It turns out that we've already done a lot of the work here. We already track OOM state in crash annotations:
https://searchfox.org/mozilla-central/source/xpcom/base/CycleCollectedJSRuntime.h#183-213
https://searchfox.org/mozilla-central/source/toolkit/crashreporter/CrashAnnotations.yaml#466-480
In addition to distinguishing between small and large allocations, we distinguish between Recovered (successfully completed a GC), Reported (crashed prior to completing a GC), and Reporting (crashed while trying to handle the OOM). This isn't exactly what we want, because we only see "Recovered" if the browser crashes for a different reason, but it's a pretty good start.
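To make the three annotation values concrete, here is a minimal Python sketch of the state tracking described above. It is a simplified model, not the code in CycleCollectedJSRuntime.h: the state names come from the paragraph above, while the `OOMTracker` class, its method names, and the exact transition rules are assumptions made for illustration. One tracker would exist per allocation class (small or large).

```python
from enum import Enum


class OOMState(Enum):
    OK = "OK"                # no OOM seen (state name assumed for illustration)
    REPORTING = "Reporting"  # currently handling an OOM
    REPORTED = "Reported"    # OOM handled, no GC completed since
    RECOVERED = "Recovered"  # a GC completed after the OOM


class OOMTracker:
    """Hypothetical model of the OOM state for one allocation class."""

    def __init__(self) -> None:
        self.state = OOMState.OK

    def begin_report(self) -> None:
        # An allocation failed and we started handling it.
        self.state = OOMState.REPORTING

    def end_report(self) -> None:
        # Handling finished; we are waiting to see whether a GC completes.
        self.state = OOMState.REPORTED

    def gc_completed(self) -> None:
        # Assumed rule: only a GC that follows a reported OOM counts as recovery.
        if self.state is OOMState.REPORTED:
            self.state = OOMState.RECOVERED

    def crash_annotation(self):
        # What a crash report would carry if the process crashed right now.
        return None if self.state is OOMState.OK else self.state.value


tracker = OOMTracker()
tracker.begin_report()
tracker.end_report()
tracker.gc_completed()
print(tracker.crash_annotation())
```

The sketch also shows why "Recovered" is only visible indirectly: the value is read only when a crash report is generated, so a session that recovers and never crashes leaves no trace.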
Here are a couple of telemetry queries:
Small allocations: https://sql.telemetry.mozilla.org/queries/60789
Large allocations: https://sql.telemetry.mozilla.org/queries/60792
(I arbitrarily picked Jan 1, 2018 as the start date.)
Things I found interesting:
1. Very few crashes have any large allocation failures, and those that do almost always recover. We had exactly one crash last year where we didn't complete a GC after failing a large allocation. It's quite likely that was a pure coincidence and we crashed for an unrelated reason.
2. A surprisingly high percentage of JSOutOfMemory crash annotations for small allocations are recoveries. It's about a 2:1 ratio (reported:recovered), and that doesn't include cases where we recover but don't subsequently crash.
3. Notwithstanding 2, the total numbers for all of these OOMs are tiny. Out of 5B crash reports from 2018, fewer than 100K had a JSOutOfMemory crash annotation. Jason pointed out on IRC that we can calculate a reasonable upper bound for the total number of successful recoveries, using the reasoning "if it were happening more than X often, then we would see way more crashes with OOM annotations". At first glance, the math seems to imply that JS OOMs are not a meaningful contributor to overall Firefox crash stats.
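The arithmetic behind that upper bound can be sketched roughly as follows. This is only an illustration in Python, not the calculation Jason had in mind: it assumes that crashes unrelated to OOM hit sessions independently of whether an earlier recovery happened, so the share of crash reports carrying an annotation approximates the share of sessions that hit an OOM. The function name is hypothetical, and the inputs are the rounded bounds quoted above (5B reports, at most 100K annotated).

```python
def annotated_share(annotated_crashes: int, total_crashes: int) -> float:
    """Fraction of crash reports that carry a JSOutOfMemory annotation.

    Under the independence assumption stated in the lead-in, this is also a
    rough ceiling on the fraction of sessions with a successful recovery,
    because recovered crashes are a subset of annotated crashes.
    """
    if total_crashes <= 0:
        raise ValueError("total_crashes must be positive")
    return annotated_crashes / total_crashes


# Rounded figures from the comment: fewer than 100K annotated out of 5B.
share = annotated_share(100_000, 5_000_000_000)
print(f"{share:.4%}")  # → 0.0020%
```

If the independence assumption is badly wrong for some population (for example, memory-constrained devices), the bound would not hold there, which is one reason to repeat the calculation per platform.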
The next step is to look at the data more closely to see whether these conclusions still hold when we focus on specific populations (mobile/32-bit/...).
Updated•2 years ago
Updated•5 months ago