Closed Bug 1072151 Opened 10 years ago Closed 6 years ago

crash in OOM | unknown | js::CrashAtUnhandlableOOM(char const*) | js::Nursery::moveToTenured(js::gc::MinorCollectionTracer*, JSObject*)

Categories

(Core :: JavaScript: GC, defect)

x86
Windows 7
defect
Not set
critical

Tracking


RESOLVED WORKSFORME
Tracking Status
firefox33 --- wontfix
firefox34 + wontfix
firefox35 + wontfix
firefox36 + wontfix
firefox37 - affected
firefox38 --- affected
firefox39 --- affected
firefox40 --- ?
firefox41 --- ?

People

(Reporter: JasnaPaka, Unassigned)

References

Details

(Keywords: crash, topcrash, Whiteboard: [tbird crash])

Crash Data

This bug was filed from the Socorro interface and is report bp-86285ca1-1931-4a1f-b4c7-6b2a02140924.
=============================================================

I was scrolling on Facebook. Random crash.
I reproduced the crash on Firefox 34 Beta 2 on Windows 7 32-bit. Here is the crash report: https://crash-stats.mozilla.com/report/index/bp-0adaedc8-6a94-4910-b6e6-4d0b72141021.

I don't have proper STR. I had many sites open (Facebook, YouTube, Yahoo Mail, Pinterest, Google Maps) and was navigating between them.

In the last week, 2259 crashes occurred with this signature (on Windows 7).
This is the #10 topcrash in Firefox 33, and it is also showing up significantly in 34.0b1. The top URLs for this crash signature are for Facebook.

Crashing thread: 

0 	mozjs.dll 	js::CrashAtUnhandlableOOM(char const*) 	js/src/jscntxt.cpp
1 	mozjs.dll 	js::Nursery::moveToTenured(js::gc::MinorCollectionTracer*, JSObject*) 	js/src/gc/Nursery.cpp
2 	mozjs.dll 	js::Nursery::collectToFixedPoint(js::gc::MinorCollectionTracer*, js::Nursery::TenureCountCache&) 	js/src/gc/Nursery.cpp
3 	mozjs.dll 	js::Nursery::collect(JSRuntime*, JS::gcreason::Reason, js::Vector<js::types::TypeObject*, 0, js::SystemAllocPolicy>*) 	js/src/gc/Nursery.cpp
4 	mozjs.dll 	js::gc::GCRuntime::gcCycle(bool, __int64, js::JSGCInvocationKind, JS::gcreason::Reason) 	js/src/jsgc.cpp
5 	mozjs.dll 	js::gc::GCRuntime::collect(bool, __int64, js::JSGCInvocationKind, JS::gcreason::Reason) 	js/src/jsgc.cpp
6 	mozjs.dll 	RunLastDitchGC 	js/src/jsgc.cpp
7 	mozjs.dll 	js::gc::ArenaLists::refillFreeList<1>(js::ThreadSafeContext*, js::gc::AllocKind) 	js/src/jsgc.cpp
8 	mozjs.dll 	js::gc::AllocateNonObject<JSFatInlineString, 1>(js::ThreadSafeContext*) 	js/src/jsgcinlines.h
9 	mozjs.dll 	js::ConcatStrings<1>(js::ThreadSafeContext*, JS::Handle<JSString*>, JS::Handle<JSString*>) 	js/src/vm/String.cpp
10 	libGLESv2.dll 	gl::ResourceManager::getTexture(unsigned int) 	gfx/angle/src/libglesv2/ResourceManager.cpp
11 	libGLESv2.dll 	gl::GetCurrentData() 	gfx/angle/src/libglesv2/main.cpp
12 	libGLESv2.dll 	glActiveTexture 	gfx/angle/src/libglesv2/libGLESv2.cpp
13 	xul.dll 	mozilla::WebGLContext::UnbindFakeBlackTextures() 	dom/canvas/WebGLContextDraw.cpp
14 	xul.dll 	mozilla::WebGLContext::DrawElements(unsigned int, int, unsigned int, __int64) 	dom/canvas/WebGLContextDraw.cpp
15 	xul.dll 	mozilla::dom::WebGLRenderingContextBinding::drawElements 	obj-firefox/dom/bindings/WebGLRenderingContextBinding.cpp
16 		@0x52c837bf
Component: General → JavaScript: GC
Is there anything that can be done about this, Terrence?
Flags: needinfo?(terrence)
Written in parallel with David's comment 4, so not taking advantage of the new data there:

We generally shouldn't be getting LastDitchGC, as it indicates that our GC heap limit tripped before our malloc trigger: we never want this to happen, because last-ditch GCs are non-incremental. It's probably something particular to FB's workload; it would be nice to know what part of the fast-heap-growth curve FB is in when this happens.

In the long term, we need to find a way to cope better when we're near the limit; however, if FB is tripping this right now, we need to do something in the short term as well. We could either scale down our heap-growth triggers so that we GC sooner while there is still memory available, or we could keep a larger ballast around. Of course, if we're near the heap limit anyway, neither of these is going to help much, and it's going to hurt performance elsewhere at the same time.
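
To make that concrete, here is a minimal sketch of the heap-growth-trigger idea (the helper and the factor values are illustrative assumptions, not the actual GCRuntime tuning code): shrinking the growth factor pulls the next GC earlier, while memory is still available.

#include <cstddef>
#include <cstdio>

// Hypothetical helper: after a GC, the next GC threshold is the live heap
// size times a growth factor.  Values here are illustrative only.
static size_t NextGCTriggerBytes(size_t liveBytes, double growthFactor) {
    return static_cast<size_t>(liveBytes * growthFactor);
}

int main() {
    const size_t liveBytes = 200u << 20;  // assume 200 MiB live after the last GC
    printf("growth factor 3.0 -> next GC at %zu MiB\n",
           NextGCTriggerBytes(liveBytes, 3.0) >> 20);   // ~600 MiB
    printf("growth factor 1.5 -> next GC at %zu MiB\n",
           NextGCTriggerBytes(liveBytes, 1.5) >> 20);   // ~300 MiB, so we GC sooner
    return 0;
}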
Flags: needinfo?(terrence)
(In reply to David Major [:dmajor] (UTC+13) from comment #4)
> 
> Also, any chance for a size annotation on these aborts?

1MiB, via VirtualAlloc, 1MiB aligned. The alignment requirement might be killing us, although :ehoogeveen did a ton to help mitigate that issue a few months ago.
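
For context, here is a sketch of the usual over-reserve/release/retry dance needed to get a 1 MiB chunk aligned to 1 MiB out of VirtualAlloc (the helper name is made up; this is not the actual SpiderMonkey allocation code). It shows why the alignment requirement hurts when the address space is fragmented: we temporarily need size + alignment of contiguous address space.

#include <windows.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, for illustration only.
static void* AllocateAlignedChunk(size_t size, size_t alignment) {
    // Over-reserve so that an aligned sub-range is guaranteed to exist.
    void* region = VirtualAlloc(nullptr, size + alignment, MEM_RESERVE, PAGE_NOACCESS);
    if (!region)
        return nullptr;  // not even size + alignment of contiguous address space left

    uintptr_t base = reinterpret_cast<uintptr_t>(region);
    uintptr_t aligned = (base + alignment - 1) & ~(alignment - 1);

    // Release the oversized reservation and re-allocate at the aligned address.
    // Another thread can grab the range in between, so real code loops and retries.
    VirtualFree(region, 0, MEM_RELEASE);
    return VirtualAlloc(reinterpret_cast<void*>(aligned), size,
                        MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}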
(In reply to Terrence Cole [:terrence] from comment #5)
> In the long term, we need to find a way to cope better when we're near the
> limit; however, if FB is tripping this right now, we need to do something
> in the short term as well.

Still a top crash on 34 beta, and still mostly on FB. Is there any hope for a fix before the train leaves?
Flags: needinfo?(terrence)
[Tracking Requested - why for this release]:

This wasn't marked topcrash (until now) and somehow wasn't marked for tracking for 34. It is the #10 topcrash for 34.0b10, but not at a super high volume.
   
I'm tagging it now, but this may not make it into 34.
(In reply to David Major [:dmajor] (UTC+13) from comment #7)
> (In reply to Terrence Cole [:terrence] from comment #5)
> > In the long term, we need to find a way to cope better when we're near the
> > limit; however, if FB is tripping this right now, we need to do something
> > in the short term as well.
> 
> Still a top crash on 34 beta, and still mostly on FB. Is there any hope for
> a fix before the train leaves?

Not really. The short-term solution would have been bug 1095620, but the blocking bug 1074961 took almost two months more to complete than expected due to existing wrongness. I think we're going to have to let this ride for another release. :-(
Flags: needinfo?(terrence)
Thanks, Terrence! I will go ahead and mark it wontfix for 34 then. Our overall crash rate for 34 is looking pretty good, actually!

I feel that you should have this animated gif of a kitten with a butterfly:
http://image.blingee.com/images19/content/output/000/000/000/7a9/785756861_1112924.gif
Our fork of jemalloc now caches up to 128 chunks' worth of memory (bug 1073662), and may be getting a variant of the GC allocation logic (bug 1005844). Once we have both, it might be a good idea to see if we can rip out the GC allocation logic in favor of making jemalloc do the heavy lifting, so the GC can benefit from the recycled chunks.

Unfortunately, the chunk recycling logic fundamentally cannot handle chunks of different sizes on Windows, so it is limited to chunk-sized allocations there (it won't help with, say, allocating the nursery itself). I'm also not sure if jemalloc actually *exposes* a way to choose the desired alignment, but I'm sure we can make it do so.

Also note that if the patch in bug 1005844 is rejected, I do not think we should unify the logic, as it would be a step backward for the GC (just caching chunks does not help us use all available chunks in high-fragmentation situations).
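
To illustrate the chunk-recycling idea above, here is a toy sketch with made-up names (not the actual jemalloc or GC code): freed chunk-sized mappings go into a small cache and are reused before asking the OS again, which is also why this only helps for one fixed chunk size.

#include <windows.h>
#include <cstddef>
#include <vector>

static const size_t kChunkSize = size_t(1) << 20;    // 1 MiB GC chunks
static std::vector<void*> gChunkCache;               // recycled, still-mapped chunks

// Hypothetical helpers, for illustration only.
void* AllocChunk() {
    if (!gChunkCache.empty()) {                       // reuse before hitting the OS
        void* chunk = gChunkCache.back();
        gChunkCache.pop_back();
        return chunk;
    }
    return VirtualAlloc(nullptr, kChunkSize, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
}

void FreeChunk(void* chunk) {
    if (gChunkCache.size() < 128) {                   // cap comparable to bug 1073662
        gChunkCache.push_back(chunk);                 // keep the mapping alive for reuse
        return;
    }
    VirtualFree(chunk, 0, MEM_RELEASE);
}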
This is at #19 for Firefox 35; marking wontfix given we have nothing new to try. Bug 1074961 is resolved on 36, so we should be OK shipping with this, and we should see reduced or no volume in the 36 release.
Yep, still exists in 35.0.1.

Firefox 35.0.1 Crash Report [@ OOM | unknown | js::CrashAtUnhandlableOOM(char const*) | js::Nursery::moveToTenured(js::gc::MinorCollectionTracer*, JSObject*) ]
https://crash-stats.mozilla.com/report/index/f2bfa16c-6386-4258-8219-0d9c82150129

But if it's fixed for 36, let's let it rest.
I initially wanted to leave this thread in peace, but my system fails with this several times a day...
The last two:
https://crash-stats.mozilla.com/report/index/75fb60a5-842b-4e4d-b52a-ceadf2150204
https://crash-stats.mozilla.com/report/index/800f9df1-bd6d-4ac5-8c6c-447912150204

I will have to stop watching YouTube playlists...
This is the #15 topcrash on Firefox 36.0b, with a high number of crashes still in 36.0b5. Not in the top 10, but still a significant volume.
This may be fixed in 36.0b6, actually. KaiRo, is there enough data by now to judge, or should it wait another day?
Flags: needinfo?(kairo)
(In reply to Liz Henry :lizzard from comment #16)
> This may be fixed in 36.0b6, actually. KaiRo, is there enough data by now to
> judge, or should it wait another day?

It's still #8, but it's in a similar position in 35, and we've always had GC crashes around there, so I'm not that concerned about the signature itself. If we find reproducible cases, we surely want to look into them, and the same goes if the JS team has a good idea of what's going on here, but otherwise it doesn't sound very actionable to me.
Flags: needinfo?(kairo)
Ugh, so bug 1074961 was supposed to make it so that we could easily do something like bug 1073662 and keep more memory live to use as a buffer. But it turns out that to do that safely (e.g. without causing OOM elsewhere), we really need to be able to estimate when our GC triggers are going to fire. Currently our GC triggers are a catastrophe: 5+ years of ad-hoc additions, each its own special snowflake. You can read about the current situation at [1], and the work to fix them is ongoing at [2]. We can finally see the light at the end of the tunnel, but I'm afraid none of this is going to be suitable for uplift. In the meantime, I'll keep trying to think of a safe shorter-term solution as I continue getting more context on the problem.

[1] https://dxr.mozilla.org/mozilla-central/source/js/src/gc/GCRuntime.h?from=GCRuntime.h&case=true#198
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1130211
Terrence or KaiRo, do you have any reason to believe that this crash is specifically related to JS memory or the GC? I suspect that this is just a symptom of running out of memory, and that it is primarily related to the OOM issues with YouTube video in 36 (as in comment 14).

We can confirm what is actually using memory by combining the memory mappings in the minidump with the about:memory data, for those crashes that have it. Here's a supersearch link to 36.0b6 crashes with this signature that come with about:memory data:

https://crash-stats.mozilla.com/search/?signature=%3DOOM+|+unknown+|+js%3A%3ACrashAtUnhandlableOOM%28char+const*%29+|+js%3A%3ANursery%3A%3AmoveToTenured%28js%3A%3Agc%3A%3AMinorCollectionTracer*%2C+JSObject*%29&contains_memory_report=!__null__&version=36.0&build_id=20150202183609&_facets=build_id&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports

You'll need to use the API to get access to the about:memory data, since it's not exposed in the UI anywhere. It's not public, so you might need extra permissions.

For example, I loaded the memory report from https://crash-stats.mozilla.com/report/index/0731731b-62f3-4903-8c19-97c882150203, and this is clearly not JS-related:

Explicit Allocations
286.06 MB (100.0%) ++ explicit

  423.11 MB ── private
1,109.52 MB ── resident
3,944.82 MB ── vsize
    1.08 MB ── vsize-max-contiguous

So we're running out of virtual memory in this case.

I think that dmajor has a way of categorizing groups of crashes based on this data, but I'm not 100% sure.
Flags: needinfo?(dmajor)
Odds are the JS engine is trying to allocate a new 1 MiB chunk to allocate an object in. It wouldn't be too surprising that this is one of the first things to fail in a low-memory situation.
Yeah, this is just low memory. JS isn't really at the heart of the problem.

There's a supersearch field 'write_combine_size' that shows how much memory is going to the gfx stack. Between that field and the YouTube URLs, a lot of these are pointing to the recent video OOM issues.
Flags: needinfo?(dmajor)
Seems it is wontfix for 36.
Tracking for 37 since it is in the top 10 (even if we have been tracking this bug for a while now).
Is this bug actionable or do we expect that it will be actionable at some point? Should we simply resolve this as wontfix?
For any "OOM | unknown" crash, one action item is to make the size known. I've been grumbling about js::CrashAtUnhandlableOOM for a long time, but given the JS code patterns it may not be easy to annotate. Not sure if it should be dealt with here or in a more general bug.
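
As a rough sketch of what "make the size known" could look like (the size-carrying abort and the annotation global below are hypothetical, not an existing SpiderMonkey or crash-reporter API):

#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for a crash-report annotation; a real implementation
// would write into the crash metadata rather than a plain global.
static size_t gOOMAllocationSize = 0;

[[noreturn]] static void CrashAtUnhandlableOOMWithSize(const char* reason, size_t bytes) {
    gOOMAllocationSize = bytes;  // would show up alongside the signature in the report
    fprintf(stderr, "unhandlable OOM: %s (%zu bytes)\n", reason, bytes);
    abort();
}

int main() {
    // Example call site: tenuring a nursery object needed a fresh 1 MiB chunk.
    CrashAtUnhandlableOOMWithSize("js::Nursery::moveToTenured", size_t(1) << 20);
}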

IMO wontfix seems harsh, I'd prefer that you just untrack or call it incomplete, but I guess I don't really care that much.
(In reply to David Major [:dmajor] (UTC+13) from comment #24)
> IMO wontfix seems harsh, I'd prefer that you just untrack or call it
> incomplete, but I guess I don't really care that much.

Given that this is still an active crash, I think you're right about this. I'm going to untrack, as I don't see the value in following up here until we come up with a way to obtain additional information to help us debug this. I also see instances of the bug on 38 and 39, so I have marked both releases as affected.
Hello, we are getting a report that this is happening again on SUMO.
[Tracking Requested - why for this release]:
Crash Signature: [@ OOM | unknown | js::CrashAtUnhandlableOOM(char const*) | js::Nursery::moveToTenured(js::gc::MinorCollectionTracer*, JSObject*)] → [@ OOM | unknown | js::CrashAtUnhandlableOOM(char const*) | js::Nursery::moveToTenured(js::gc::MinorCollectionTracer*, JSObject*)] [@ OOM | unknown | js::CrashAtUnhandlableOOM | js::Nursery::moveToTenured]
(In reply to Christian Riechers from comment #28)
> Another crash reported on SUMO with TB 38.4.0:
> bp-0fa034eb-2833-4185-bccf-0acd32151207
> 
> See https://support.mozilla.org/en-US/questions/1097870

#80 crash signature for Thunderbird 38.4.0, but many are multiple reports by the same users. And, as bsmedberg suggests, these look like straight-up OOM, and the crash signature is of no help.
Whiteboard: [tbird crash]
This signature ends after Firefox 39 (for any signature containing js::Nursery::moveToTenured).
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME