[meta] OOM while evicting the nursery

Status: NEW (Unassigned)
Type: defect
Priority: P3
Severity: critical
Opened: Last year
Modified: 26 days ago

People

(Reporter: lizzard, Unassigned)

Tracking

(Depends on 4 bugs, Blocks 3 bugs, 4 keywords)

Version: Trunk
Hardware: Unspecified
OS: Windows 7
Points: ---
Dependency tree / graph

Firefox Tracking Flags

(firefox-esr52 unaffected, firefox-esr60 affected, firefox-esr68 affected, firefox62 wontfix, firefox63 wontfix, firefox64 wontfix, firefox68 affected, firefox69 affected, firefox70 affected)

Details

(Whiteboard: [MemShrink:P2], crash signature)

This bug was filed from the Socorro interface and is
report bp-06fa5db6-f403-4285-a1d3-d22ee0180628.
=============================================================

This crash is spiking in 62.0b3. There are some crashes from 60.0.3, but none for the 61 release or for 62 nightly/dev edition.


Top 10 frames of crashing thread:

0 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1577
1 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1591
2 xul.dll js::TenuringTracer::allocTenured<js::PlainObject> js/src/gc/Marking.cpp:2927
3 xul.dll js::TenuringTracer::movePlainObjectToTenured js/src/gc/Marking.cpp:3018
4 xul.dll js::TenuringTracer::traverse<JSObject> js/src/gc/Marking.cpp:2634
5 xul.dll js::gc::StoreBuffer::MonoTypeBuffer<js::gc::StoreBuffer::CellPtrEdge>::trace js/src/gc/Marking.cpp:2679
6 xul.dll js::Nursery::doCollection js/src/gc/Nursery.cpp:877
7 xul.dll js::Nursery::collect js/src/gc/Nursery.cpp:738
8 xul.dll js::gc::GCRuntime::minorGC js/src/gc/GC.cpp:7886
9 xul.dll js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7483

=============================================================
For non-GC folk: the GC has decided it needs to collect some memory so that the program can continue, but to do that it needs memory to copy objects into as they're moved. Usually some is available, but in these crashes there isn't.
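A toy model may make this clearer. The sketch below is purely illustrative Python, not SpiderMonkey's actual C++; the names are invented. It shows why a copying (nursery) collection can itself hit OOM: every surviving object needs fresh tenured memory before the nursery can be reclaimed.

```python
# Illustrative sketch only: models why a copying (nursery) collection can
# itself run out of memory. Names are hypothetical, not SpiderMonkey APIs.

class OOMDuringCollection(Exception):
    """Raised when tenuring fails; analogous to AutoEnterOOMUnsafeRegion::crash."""

def collect_nursery(nursery_live, tenured_heap, tenured_capacity):
    """Evict live nursery objects by copying them into the tenured heap.

    Each survivor needs fresh tenured memory; if none is available the
    collection cannot make progress and must abort (crash, in the real GC).
    """
    for obj_size in nursery_live:
        if sum(tenured_heap) + obj_size > tenured_capacity:
            # The GC needed memory in order to free memory, and there was none.
            raise OOMDuringCollection(f"cannot tenure {obj_size}-byte object")
        tenured_heap.append(obj_size)  # "move" the object out of the nursery
    return tenured_heap

# A tenured heap that is almost full: evicting the nursery pushes it over.
tenured = [900]                      # 900 of 1000 bytes already used
try:
    collect_nursery([64, 64], tenured, tenured_capacity=1000)
except OOMDuringCollection as e:
    print("OOM while evicting the nursery:", e)
```

Note that the first object tenures fine; the crash happens partway through the collection, which is why recovery is awkward (some objects have already moved).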

Ideas:

Anything we do to avoid this is likely to either make things slower or use more memory.

We could adjust tuning so we're collecting earlier, which could be good to reduce memory usage anyway.

A more radical idea: when we detect that we can't move an object because there's no free memory, continue the collection without evicting the rest of the nursery.  Then, after sweeping frees some memory, we can continue with the nursery collection.  It's probably best to do this non-incrementally, but it could jank the browser a bit.  I'm just worried that if allocation fails like this (which is where we're currently crashing) things are pretty dire anyway.
This crash went away completely in beta 8 but is now back with beta 9.
Duplicate of this bug: 1477545
Hi Liz, I think this crash may change signatures depending upon the build.

On the two signatures I've seen: mostly crashes on x86, but still some on amd64.  Almost entirely on Windows.

Some of the crashes show what seems like "enough" physical memory available.  I wondered if virtual memory, or fragmentation of the virtual address space, might be the problem.  But that seems unlikely since 1) AFAIK we don't limit 32-bit address spaces due to NaN-boxing, and 2) on x86-64 (and there are crashes there) AIUI NaN-boxing isn't limiting the address space on Linux or Windows.

This crash is weird: https://crash-stats.mozilla.com/report/index/faf7fffe-ca43-4f98-849b-d81720180719  It failed to allocate 1MB when the OS says there's 5GB free.  x86-64 running Windows 10; the process uptime is only 6 seconds.

Possible action: look at some full crash reports and try to determine if address space is limited or fragmented.
Crash Signature: [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>] → [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>] [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC ]
Depends on: GCCrashes
Hardware: x86 → Unspecified
Whiteboard: [MemShrink]
Available page file is only 6MB in that crash you linked. I feel like I've seen a number of reports before like that, where there's lots of free memory but not much in the way of page file.

We do record some kind of information about the max contiguous block of memory in crash reports on Windows, but I'm not seeing it in that one.
Whiteboard: [MemShrink] → [MemShrink:P2]
One of the top 2 signatures in this bug is rising as we go through betas. Comments are not really useful. 427 crashes in b16 so far. The second signature in the bug appears to be the nightly signature, with 68 crashes in the last 7 days.
Version: 61 Branch → Trunk
These crashes occur when Firefox is running out of memory and is in the process of doing garbage collection; unfortunately, it needs to allocate a little more memory to complete the garbage collection.  There's nothing easy we can do to fix this directly.  I'd still like to track it as an open bug, in case we find some new way to handle the situation.

There is potentially a not-so-simple change to avoid the crash.  But it may not be worth it if the system is so low on memory and may crash anyway.  It would jank for a few seconds while we try to free memory.  NI for myself to think about this and come back to it later.
Flags: needinfo?(pbone)
Is there any telemetry on how many installations are 32 bit vs 64 bit? This is just anecdotal, but on my home Windows machine, I noticed I started getting OOM crashes in Nightly a month or so ago, and somehow I was on a 32 bit build, when I thought I'd installed 64 bit.
(In reply to Andrew McCreight [:mccr8] from comment #8)
> Is there any telemetry on how many installations are 32 bit vs 64 bit? This
> is just anecdotal, but on my home Windows machine, I noticed I started
> getting OOM crashes in Nightly a month or so ago, and somehow I was on a 32
> bit build, when I thought I'd installed 64 bit.

Just under 70%.
https://hardware.metrics.mozilla.com/

(81% have 64bit OSs and AFAIK the vast majority have 64bit processors.)

So just over 30% have 32-bit installations, and this hits 32-bit users 48% of the time, so it is proportionally higher there.

It's also not isolated to systems with <= 4GB physical memory, but it seems to hit them fairly often. That said, quite a lot of the Firefox population have 4GB systems.
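As a quick arithmetic sketch of the over-representation claim above (the 30% and 48% figures come straight from this comment; the calculation is just their ratio):

```python
# Rough over-representation check, using the figures quoted above:
# ~30% of installations are 32-bit, yet they account for ~48% of these crashes.
share_of_installs_32bit = 0.30
share_of_crashes_32bit = 0.48

# Ratio of crash share to install share; a value above 1 means 32-bit
# builds are hit disproportionately often.
over_representation = share_of_crashes_32bit / share_of_installs_32bit
print(f"32-bit builds are ~{over_representation:.1f}x over-represented")  # ~1.6x
```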
Jon, Steve.

I want to run an idea by you. I think we could try to recover from this if:

 + We stop collecting the nursery when this happens.
 + We _do_ handle updating the pointers for the evicted objects, but this also now includes pointers _in_ the nursery to evicted objects.
 + We probably cannot recover memory from the nursery yet.  AFAIK there are still live objects on every chunk.
 + We _do_ trace through the rest of the nursery, but now without eviction, we do this to mark objects in the tenured heap.
 + We proceed to collect the tenured heap.  It must be non-incremental and must not do compaction.
 + Once that completes and sweeping recovers some memory, we can proceed with evicting the nursery, and potentially compact the tenured heap after that (since memory usage is dire we should do this soon, but it doesn't have to happen in this collection).
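The staged recovery above could be sketched as follows. This is purely illustrative Python: the GC object and every method name are invented for this sketch, and do not correspond to real SpiderMonkey entry points; the stub only records the order of the steps.

```python
# Hypothetical sketch of the staged recovery proposed above; the GC class
# and all method names are invented for illustration, not real
# SpiderMonkey APIs. The stub records which step runs when.

class StubGC:
    def __init__(self):
        self.steps = []
    def __getattr__(self, name):
        # Any GC operation just logs its name; keyword args are ignored.
        def step(**kwargs):
            self.steps.append(name)
        return step

def recover_from_tenuring_oom(gc):
    # 1. Stop evicting the nursery as soon as tenuring fails.
    gc.stop_nursery_eviction()
    # 2. Fix up pointers to already-evicted objects, including pointers
    #    *inside* the nursery to those evicted objects.
    gc.update_pointers_to_evicted(include_nursery_sources=True)
    # 3. The nursery itself can't be reclaimed yet: live objects remain
    #    on every chunk, so its memory stays pinned for now.
    # 4. Trace the rest of the nursery without moving anything, purely to
    #    mark reachable objects in the tenured heap.
    gc.trace_nursery(moving=False)
    # 5. Collect the tenured heap non-incrementally and without compaction
    #    (compaction would itself need scratch memory).
    gc.collect_tenured(incremental=False, compact=False)
    # 6. Sweeping should now have freed memory: finish evicting the
    #    nursery, and schedule compaction soon since memory is dire.
    gc.resume_nursery_eviction()
    gc.schedule_compaction()

gc = StubGC()
recover_from_tenuring_oom(gc)
print(gc.steps)
```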

Is this plausible?  Is it worthwhile to avoid these crashes?

We could potentially add some tuning (like we do for mobile) for desktops with small amounts of memory.  But I'd prefer to do Bug 1433007 because it'll affect any tuning decisions.
Flags: needinfo?(pbone)
Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
(In reply to Paul Bone [:pbone] from comment #10)
I think it would be complicated to make this work, and doesn't guarantee that we can proceed because we may not recover enough memory.

I don't think it's worthwhile.  At some point you just run out of memory.
Flags: needinfo?(jcoppeard)
I'm going to move that idea to a new bug and make it P5.  We could always come back to it later.
Priority: -- → P5
Summary: Crash in OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T> → OOM while evicting the nursery
Depends on: 1484903
Top Crashers for Firefox 60.0.2
31 	0.35% 	0.01% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>	10 	10 	0 	0 	9 	0 	2017-10-18

Top Crashers for Firefox 63.0a1
44 	0.31% 	-0.11% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC	11 	11 	0 	0 	10 	0 	2018-07-07

Top Crashers for Firefox 60.0.1
45 	0.28% 	0.1% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>	4 	4 	0 	0 	4 	0 	2017-10-18 

Top Crashers for Firefox 62.0b15
70    	0.2% 	-0.01% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>	3 	3 	0 	0 	3 	0 	2017-10-18

Top Crashers for Firefox 62.0b16
75   	0.17% 	0.16% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>	12 	12 	0 	0 	12 	0 	2017-10-18

Top Crashers for Firefox 60.1.0esr
109     0.14% 	-0.05% 	OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>	5 	5 	0 	0 	5 	0 	2017-10-18	

7 days ago.
Blocks: GCCrashes
No longer depends on: GCCrashes
(In reply to Paul Bone [:pbone] from comment #9)
> Just under 70%.
> https://hardware.metrics.mozilla.com/

Thanks for looking. What I actually meant, but failed to actually say, is that I wonder if the number of 32 bit systems has been increasing. "Browsers by architecture" on your link seems to indicate that it has not been, so never mind.
So we could try to make the GC collect earlier; hopefully by doing so we can complete the collection before memory is exhausted.  We may move some of these crashes to a different signature, but hopefully it reduces the number of OOM crashes overall.

I think we also want to have a different maximum for nursery size on systems with low memory / low virtual memory.
Depends on: 1495355
This is a top crasher on beta spiking +400 crashes a day in the 63 beta, yet this is marked as P5 since a follow up bug (bug 1484903) was opened and it is also marked as a P5. Jon, could you reprioritize this bug? 

I would like to know if there is something we can do to mitigate the volume of crashes before we ship 63, the number of crashes is way higher than what it was in 62 beta and I wouldn't want to see it explode when it hits release.

Thanks
Flags: needinfo?(jcoppeard)
(In reply to Pascal Chevrel:pascalc from comment #16)
These are OOM crashes.  Do we know whether the overall OOM crash rate has increased or the crashes have just moved around?

There haven't been many changes to the GC recently and nothing springs to mind as a likely culprit (bug 1407143 is in this area but I don't see why that would cause this).

It's possible that something else in the system is causing us to tenure more nursery things.  One possibility is increased use of promises as our implementation does a lot of allocation for these at the moment.
Flags: needinfo?(jcoppeard)
yes, the overall share of oom crashes seems to have gone up during the 63 cycle: https://bugzilla.mozilla.org/show_bug.cgi?id=1488164#c29
Is there a way to see if tracking protection is enabled? I'm hypothesizing that this could be a general spike in OOMs due to Facebook spamming out that large file repeatedly. If that only started happening recently, it would boost general OOMs, and if OOMs show up semi-randomly in our various unhandleable OOM buckets, then you would expect this to increase. But I don't have any data to support that hypothesis; I don't even know if FB started doing that recently?
I've made this a metabug (Bug 1488164 comment 41) to track the various things we might be able to do about these crashes.
Keywords: meta
QA Contact: jcoppeard
Summary: OOM while evicting the nursery → [meta] OOM while evicting the nursery
QA Contact: jcoppeard
(We usually mark meta bugs as P3 so I've done that here)
Priority: P5 → P3
(In reply to Steve Fink [:sfink] [:s:] from comment #19)
> Is there a way to see if tracking protection is enabled? I'm hypothesizing
> that this could be a general spike in OOMs due to Facebook spamming out that
> large file repeatedly. If that only started happening recently, it would
> boost general OOMs, and if OOMs show up semi-randomly in our various
> unhandleable OOM buckets, then you would expect this to increase. But I
> don't have any data to support that hypothesis; I don't even know if FB
> started doing that recently?

overholt shot down this hypothesis: beta doesn't have any tracking protection on by default.
Flags: needinfo?(sphink)
So I'm not that great at using crash-stats, but I wanted to answer the question of whether there's evidence for this particular OOM crash becoming more likely, rather than OOM crashes in general. I couldn't come up with an easy way (advice welcomed!).

I first looked at 'OOM | large' crashes to see if I could spot the spike. But I found the absolute numbers to be fairly useless; I really want crashes per daily user or something. But as far as I know, we don't have that information in crash-stats, so I used total crashes (with any signature) as a proxy. (It's not a great proxy, since any other big crasher is going to skew the baseline crash rate and make total crashes no longer proportional to ADUs.)

Then I bucketed together the OOM-large-during-tenuring signatures I could find, to see if they were rising as a proportion of OOM|large crashes. The result for build_ids after 20180801:

62.0b14: total=30892 oom_large=1289 (4.2%) nursery= 911 (70.7%)
62.0b15: total=24618 oom_large=1105 (4.5%) nursery= 778 (70.4%)
62.0b16: total=30975 oom_large=1350 (4.4%) nursery= 940 (69.6%)
62.0b17: total=23023 oom_large=1000 (4.3%) nursery= 696 (69.6%)
62.0b18: total=28436 oom_large=1244 (4.4%) nursery= 870 (69.9%)
62.0b19: total=22201 oom_large= 931 (4.2%) nursery= 661 (71.0%)
62.0b20: total=30047 oom_large=1191 (4.0%) nursery= 821 (68.9%)
62.0b99: total=60700 oom_large=2909 (4.8%) nursery=2000 (68.8%)
 63.0b3: total=18837 oom_large=1100 (5.8%) nursery= 774 (70.4%)
 63.0b4: total=24759 oom_large=1498 (6.1%) nursery=1087 (72.6%)
 63.0b5: total=27435 oom_large=1718 (6.3%) nursery=1225 (71.3%)
 63.0b6: total=31822 oom_large=1976 (6.2%) nursery=1413 (71.5%)
 63.0b7: total=26243 oom_large=1599 (6.1%) nursery=1146 (71.7%)
 63.0b8: total=28499 oom_large=1848 (6.5%) nursery=1373 (74.3%)
 63.0b9: total=23208 oom_large=1556 (6.7%) nursery=1105 (71.0%)
63.0b10: total=23904 oom_large=1502 (6.3%) nursery=1125 (74.9%)
63.0b11: total= 3984 oom_large= 219 (5.5%) nursery= 139 (63.5%)

From this, I see that we have a distinct jump from beta62 to beta63 (4.x% to 6.x%), but at most a tiny change in the percentage of those crashes that hit while evicting the nursery (the topic of this bug). From this I conclude that tenuring does not appear to be the problem; OOM is. We're getting a *lot* more OOM crashes in beta63 than we did in beta62. (Well, to be pedantically precise, a lot more of our crashes are OOM|large crashes.) We're seeing a spike in nursery OOM crashes because we run out of memory, then crash in whatever happens to try to allocate next. If we broke something to do with tenuring (eg we tenure into too-large cells or something), then I would expect the percentage of OOM crashes that are in the nursery allocation paths to increase, not stay the same.

The "nursery" field above is either the signature 'OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC' or any signature that starts with 'OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer'. "oom_large" is signatures starting with 'OOM | large'. "total" is when I don't restrict by signature at all.

In all cases, I am only looking at product=Firefox, release_channel=beta, platform=Windows, and date from 2018-04-04 to just before 2018-10-04, and am only looking at build_id >= 20180801000000.
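The bucketing described above can be reproduced roughly like this. This is an illustrative sketch only: it classifies a local list of signature strings rather than querying the crash-stats API, and the sample crashes at the bottom are invented, not real data.

```python
# Illustrative sketch of the signature bucketing used for the table above.
# The signature prefixes are the real ones from this bug; the sample
# crash list is invented, not real crash-stats data.

TENURING_PREFIX = ("OOM | large | js::AutoEnterOOMUnsafeRegion::crash | "
                   "js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer")
ALLOC_IN_GC = ("OOM | large | js::AutoEnterOOMUnsafeRegion::crash | "
               "js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC")

def bucket(signatures):
    """Return (total, oom_large, nursery) counts for one build's crashes."""
    total = len(signatures)
    oom_large = sum(1 for s in signatures if s.startswith("OOM | large"))
    nursery = sum(1 for s in signatures
                  if s == ALLOC_IN_GC or s.startswith(TENURING_PREFIX))
    return total, oom_large, nursery

# Invented sample: 2 nursery OOMs, 1 other large OOM, 1 unrelated crash.
sample = [
    TENURING_PREFIX + "::allocTenured<T>",
    ALLOC_IN_GC,
    "OOM | large | mozalloc_abort",
    "EMPTY: no crashing thread identified",
]
total, oom_large, nursery = bucket(sample)
print(f"total={total} oom_large={oom_large} ({oom_large/total:.1%}) "
      f"nursery={nursery} ({nursery/oom_large:.1%})")
```

The per-build rows in the table are this calculation applied to each build's full crash list, with "nursery" reported as a share of "oom_large".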
For whatever it's worth, here's the result for OOM | small:

   62.0: total= 3526 oom_small= 216 (6.1%)
62.0b14: total=30892 oom_small=3655 (11.8%)
62.0b15: total=24618 oom_small=2953 (12.0%)
62.0b16: total=30975 oom_small=3684 (11.9%)
62.0b17: total=23023 oom_small=2779 (12.1%)
62.0b18: total=28436 oom_small=3402 (12.0%)
62.0b19: total=22201 oom_small=2749 (12.4%)
62.0b20: total=30047 oom_small=3504 (11.7%)
62.0b99: total=60700 oom_small=7827 (12.9%)
 63.0b3: total=18837 oom_small=2204 (11.7%)
 63.0b4: total=24759 oom_small=2975 (12.0%)
 63.0b5: total=27435 oom_small=3344 (12.2%)
 63.0b6: total=31822 oom_small=4088 (12.8%)
 63.0b7: total=26243 oom_small=3405 (13.0%)
 63.0b8: total=28499 oom_small=3944 (13.8%)
 63.0b9: total=23208 oom_small=3216 (13.9%)
63.0b10: total=23904 oom_small=3420 (14.3%)
63.0b11: total= 3984 oom_small= 460 (11.5%)

So small OOMs ("large" vs "small" refers to the size of the allocation that we failed) are climbing too, but not as quickly as large OOMs. I'm not really sure what to make of that.
copying over some information from the "see also" that might be relevant here as well:

according to crash data from 63.0a1 these signatures likely regressed in build 20180706224413 - these would be the changes going into that build: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=fa376b&tochange=9849ea3937e2e7c97f79b1d2b601f956181b4629
One thing that stands out is:

  Bug 1468207 Use the new timer-based available memory tracker for Win32 builds

I don't exactly understand what this does, but is it possible that this has impaired our handling of low memory situations?

Also in that build was my changes to the atoms table in bug 1434598 which was quite a large change, but shouldn't cause us to use much more memory.
Flags: needinfo?(gsvelto)
Bug 1468207 made detecting low-memory scenarios work properly on Win64 and significantly improved the existing detection on Win32. What it does in practice is send a "memory-pressure" event when it enters a low-memory state and it sends it again every 30 seconds if the low-memory state persists. The first event comes with a "low-memory" payload, the following use "low-memory-ongoing". This triggers - among other things - a GC/CC cycle [1]. Is it possible this might be causing the issue here?

[1] https://searchfox.org/mozilla-central/rev/1ce4e8a5601da8e744ca6eda69e782318afab54d/dom/base/nsJSEnvironment.cpp#339
Flags: needinfo?(gsvelto)
There's another side-effect of bug 1468207 which I failed to mention and it might also be the cause of what you're seeing here: since low-memory detection works properly, Firefox is taking longer to go OOM because the steady flow of memory-pressure events tends to "trim" it down for a while. If the user keeps opening new tabs it will eventually go OOM anyway but it will take longer.

Looking at some of the crash reports for these signatures this appears to be true: since this affects 32-bit builds I found a lot of reports where available virtual memory was very, very low (100MiB or less). In the past most virtual-memory related OOMs would start appearing around the 200-300 MiB mark, going all the way down to 100 MiB free virtual memory means we're really keeping Firefox alive as long as possible. Some of the crashes are running out of commit space (i.e. physical memory + swap space) and those have even smaller amounts available (less than 10MiB!). Those reports have a count of the number of low-memory events that have been detected (LowCommitSpaceEvents annotation) which is often above 10, in some cases above 100. Each sample is taken at a 30s interval so that means Firefox had been in a low-memory situation for a few minutes but sometimes even for an hour or more.

Either way these crashes are happening in situations which are almost hopeless: we won't be able to save Firefox with that little memory available until we implement some more radical measures such as tab unloading under memory pressure.
Untracking for 63 as it has become a meta bug and we are close to the release, if actionable dependencies are open and fixed for 63 or a potential 63 dot release we will track those instead.
Depends on: 1505094
See Also: → 1518138

I spoke with Jon and we will try to reduce the number of these crashes through work on scheduling. Two ideas are:

Bug 1537649 - investigate fragmentation in other zones that causes the current zone to OOM.
Bug 1537654 - Investigate fragmentation across arenas of different AllocKinds.

Depends on: 1540005
Depends on: 1540161
Duplicate of this bug: 1568057
Crash Signature: [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>] [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC ] → [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::TenuringTracer::allocTenured<T>] [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC ] [@ …

A report on reddit of this crash comes from a user with 32GB of memory but very little free swap space. It seems that Windows is denying these allocations because it won't overcommit swap: https://www.reddit.com/r/firefox/comments/cgeb9i/whats_up_with_the_memory_usage/euhv0c0/

Crap, sorry, I was trying to add CC and was looking at those, didn't mean to change them.

Don't worry. You set them correctly so I'll just leave them. Welcome to contributing to Firefox (whether it's just good bug reports or code, it's still contributing).

okay, it took me two tries to get it right.
