Integrate nsThread::SaveMemoryReportNearOOM() with the available memory tracker

RESOLVED FIXED in Firefox 63

Status: RESOLVED FIXED (enhancement)

People: (Reporter: gsvelto, Assigned: gsvelto)

Tracking: (Blocks 1 bug); Platform: x86_64 Windows; Target Milestone: mozilla63; Firefox Tracking Flags: (firefox63 fixed)

Attachments: (1 attachment)

+++ This bug was initially created as a clone of Bug #1451005 +++

As :dmajor pointed out in bug 1451005 comment 11, we periodically check for low-memory scenarios in the main event loop machinery so we can grab memory reports before we crash. This code is ripe for integration with the available memory tracker, since both periodically check for available memory.

Integrating this code would solve multiple issues:

- Checking for available memory would happen in a single place, using shared thresholds. Currently this code uses a different threshold than the available memory tracker and only checks for low virtual-memory scenarios [1]. On 64-bit Windows builds, where we'll never run out of virtual memory, this code is useless; checking for low physical memory and low commit space would be useful.

- The existing threshold is too low [2]. Looking at OOM crash pings, most crashes happen when there's more free memory than the threshold specifies, so the chances of this code actually running are slim. To make matters worse, the available memory tracker has a higher threshold at which it starts sending memory-pressure notifications, which lower memory usage, making it even more unlikely for a report to ever be grabbed.

- Because this code also runs in child processes, it needs to do a magic IPC dance through the ContentParent object [3] to inform the main process that it needs to grab a memory report. This is probably useless too: if we're low on memory, the main process is likely to realize it quite a bit sooner than a child process. Telemetry data seems to confirm this, as I can't find recorded instances of the DOM_CONTENTPROCESS_TROUBLED_DUE_TO_MEMORY scalar. This code can probably be removed.

- Last but not least, this is called every time we spin the main thread, in every process, which is a horrible waste of CPU cycles in my eyes (even though it's only a handful, think of it multiplied by every event-loop spin for every Firefox user).

[1] https://searchfox.org/mozilla-central/rev/c0d81882c7941c4ff13a50603e37095cdab0d1ea/xpcom/threads/nsThread.cpp#514
[2] https://searchfox.org/mozilla-central/rev/c0d81882c7941c4ff13a50603e37095cdab0d1ea/xpcom/threads/nsThread.cpp#510
[3] https://searchfox.org/mozilla-central/rev/c0d81882c7941c4ff13a50603e37095cdab0d1ea/dom/ipc/ContentParent.cpp#5105
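The single-place check proposed above could be sketched roughly as follows. The struct, function names, and the 384MiB threshold used in the example are illustrative, not Firefox's actual code; on Windows the three inputs would come from GlobalMemoryStatusEx(), whose ullAvailVirtual, ullAvailPhys and ullAvailPageFile fields report available virtual memory, physical memory and commit space.

```cpp
#include <cstdint>

// Hypothetical snapshot of the three quantities we care about, all in
// MiB. On Windows these map to GlobalMemoryStatusEx()'s
// ullAvailVirtual, ullAvailPhys and ullAvailPageFile.
struct MemoryStatus {
  uint64_t availVirtualMiB;
  uint64_t availPhysicalMiB;
  uint64_t availCommitMiB;
};

// Check all three low-memory conditions in one place with a shared
// threshold. On 64-bit builds the virtual-memory check never fires in
// practice, so physical memory and commit space carry the signal there.
bool IsMemoryLow(const MemoryStatus& s, uint64_t thresholdMiB, bool is64Bit) {
  if (!is64Bit && s.availVirtualMiB < thresholdMiB) {
    return true;
  }
  return s.availPhysicalMiB < thresholdMiB || s.availCommitMiB < thresholdMiB;
}
```
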
(In reply to Gabriele Svelto [:gsvelto] from comment #0)
> - The existing threshold is too low [2]. Looking at OOM crash pings most
> crashes happen when there's more free memory than specified so the chances
> of this code actually running are slim.

It may be helpful if I give some explanation (as the author of the code in question). This was written circa 2014 when Win64 Firefox wasn't really a thing, and address-space-OOMs were a huge source of crashes on Win32. We were trying to collect more data on hopelessly-low-address-space machines. The idea was not "if you hit an OOM crash, then you should have collected a memory report" but the converse: "if you are collecting a memory report, that means you are so low on memory that you're pretty much guaranteed to crash very soon." 200MiB was a pretty good number for that. I think part of the reason was to avoid janking users who are in survivable memory situations. Also, engineers can only look at so many reports, so focusing on the most extreme cases helps the problems stand out more.

Another factor here is that we release the gBreakpadReservedVM ballast before collecting the GlobalMemoryStatusEx for the crash report. So let's say VM drops to exactly 200MiB and we collect a memory report, and immediately crash; then the crash report is likely to report some value between 200 and 280MiB.
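The over-reporting described here can be put in numbers. Note the ~80MiB ballast size below is inferred from the 200-280MiB range in the comment, not a value quoted from the code:

```cpp
#include <cstdint>

// Inferred from the comment above: the approximate size of the
// gBreakpadReservedVM ballast (the 200-280MiB range implies roughly
// 80MiB; treat this as an assumption, not a measured value).
constexpr uint64_t kBallastMiB = 80;

// Free virtual memory as seen by GlobalMemoryStatusEx() after the
// ballast has been released: the value at trigger time plus up to the
// ballast size.
uint64_t MaxReportedFreeVMMiB(uint64_t freeAtTriggerMiB) {
  return freeAtTriggerMiB + kBallastMiB;
}
```
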
(In reply to David Major [:dmajor] from comment #1)
> It may be helpful if I give some explanation (as the author of the code in
> question). This was written circa 2014 when Win64 Firefox wasn't really a
> thing, and address-space-OOMs were a huge source of crashes on Win32. We
> were trying to collect more data on hopelessly-low-address-space machines.
> The idea was not "if you hit an OOM crash, then you should have collected a
> memory report" but the converse: "if you are collecting a memory report,
> that means you are so low on memory that you're pretty much guaranteed to
> crash very soon." 200MiB was a pretty good number for that. I think part of
> the reason was to avoid janking users who are in survivable memory
> situations. Also, engineers can only look at so many reports, so focusing on
> the most extreme cases helps the problems stand out more.

Makes sense.

> Another factor here is that we release the gBreakpadReservedVM ballast
> before collecting the GlobalMemoryStatusEx for the crash report. So let's
> say VM drops to exactly 200MiB and we collect a memory report, and
> immediately crash; then the crash report is likely to report some value
> between 200 and 280MiB.

I see, thanks for the explanation. Until this week I would have thought that both the thresholds we used in the available memory tracker and this one were sound, but spending a couple of days on our telemetry data showed me that my gut feeling about them was wrong. OOM crash pings for 32-bit Windows show >70% of OOM crashes happening with more than 256MiB of virtual memory available so we'll have to revise both the available memory tracker threshold and this one to make them effective.

I must admit I haven't looked at older data, so it's possible that a few years ago these thresholds did work well. After all, on early Firefox OS devices we had low-memory detection set to a ridiculously low amount, something like 20MiB, and it worked quite well.
This moves the code that detects very-low-memory scenarios and grabs memory
reports from the main-thread event loop to the available memory tracker.
Besides removing the overhead of the check from the event-loop code, this
increases the likelihood of the reports being gathered, by sampling at a
higher frequency but only once we have already detected a low-memory
scenario. Last but not least, this adds checks for low commit space
alongside low virtual memory.
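The two-rate polling scheme the patch describes can be sketched roughly as follows; the function name, intervals and thresholds are illustrative, not the real available memory tracker's:

```cpp
#include <cstdint>

// Illustrative intervals: poll slowly in normal conditions, faster once
// a low-memory condition is seen, so a near-OOM report is more likely
// to be captured before the crash. Not the actual tracker's values.
constexpr uint32_t kNormalPollMs = 30000;
constexpr uint32_t kLowMemoryPollMs = 1000;

// Returns the delay until the next memory check, and sets *aSaveReport
// when available memory is so low that a memory report should be saved
// now for use in a potential crash report.
uint32_t NextPollDelayMs(uint64_t availMiB, uint64_t lowThresholdMiB,
                         uint64_t criticalMiB, bool* aSaveReport) {
  // Near-OOM: grab a memory report before we most likely crash.
  *aSaveReport = availMiB < criticalMiB;
  // Already under memory pressure: sample faster to catch the report.
  return availMiB < lowThresholdMiB ? kLowMemoryPollMs : kNormalPollMs;
}
```
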
Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Comment on attachment 9002130 [details]
Bug 1459212 - Save memory reports for use in crash reports when low on memory

David Major [:dmajor] has approved the revision.
Attachment #9002130 - Flags: review+
Comment on attachment 9002130 [details]
Bug 1459212 - Save memory reports for use in crash reports when low on memory

Andrew McCreight [:mccr8] has approved the revision.
Attachment #9002130 - Flags: review+
To add another datapoint, right now only 20-25% of OOM crashes come with a memory report. My hope is to raise that to at least 50%.

https://sql.telemetry.mozilla.org/queries/56039
On crash-stats the percentage is also much, much lower (like less than 1%). I'm not sure why though.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c2422757c912
Save memory reports for use in crash reports when low on memory r=dmajor,mccr8
https://hg.mozilla.org/mozilla-central/rev/c2422757c912
Status: ASSIGNED → RESOLVED
Closed: 11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Since we landed this the number of OOM crashes that include a memory report has more than doubled:

https://sql.telemetry.mozilla.org/queries/56039#151881

Before, only 10-15% of the OOM crashes had a memory report attached; now we're up at 30-35%. In particular, Win64 builds never had memory reports attached, while now they include one almost half of the time. This is calculated using crash pings; it would be interesting to see how this is reflected on crash-stats.
On crash-stats the difference is night and day: before we had one memory report every ~50 OOM|small crashes, now we have more than two out of five.