Closed Bug 1490042 Opened 4 years ago Closed 4 years ago
Infinite recursion inside type inference sweeping
This bug was filed from the Socorro interface and is report bp-316db272-008e-41c2-a87f-019280180907. ============================================================= Top 10 frames of crashing thread: 0 kernelbase.dll VirtualAlloc 1 mozglue.dll ?chunk_alloc@@YAPAXII_NPA_N@Z.llvm.16719441873306329850 memory/build/mozjemalloc.cpp:2017 2 mozglue.dll arena_t::AllocRun memory/build/mozjemalloc.cpp:2561 3 mozglue.dll BaseAllocator::malloc memory/build/mozjemalloc.cpp:4163 4 mozglue.dll moz_arena_malloc memory/build/malloc_decls.h:115 5 xul.dll js::LifoAlloc::getOrCreateChunk js/src/ds/LifoAlloc.cpp:269 6 xul.dll js::ObjectGroup::sweep js/src/vm/TypeInference.cpp:4509 7 xul.dll js::ObjectGroup::maybeClearNewScriptOnOOM js/src/vm/TypeInference.cpp:3113 8 xul.dll js::TypeZone::clearAllNewScriptsOnOOM js/src/vm/TypeInference.cpp:4701 9 xul.dll js::AutoClearTypeInferenceStateOnOOM::~AutoClearTypeInferenceStateOnOOM js/src/vm/TypeInference.cpp:4722 ============================================================= While looking at bug 1489957, I came across a handful of what looks like an infinite recursion crash inside type inference sweeping code. The sweep, maybeClearNewScriptOnOOM, clearAllNewScriptsOnOOM, ~AutoClearTypeInferenceStateOnOOM repeat over and over again.
Jon, any ideas? There aren't that many crashes, so it might not be worth investing too much time, but maybe there's an easy fix.
I guess it is more common than I thought at first. I did a search for proto signatures containing clearAllNewScriptsOnOOM, and that turned up almost 1000 crashes, and the half dozen crash reports I looked at also looked like instances of this crash, such as: bp-3a5fe8a8-58f3-4867-9f5e-17f190180910 https://crash-stats.mozilla.com/search/?proto_signature=~clearAllNewScriptsOnOOM&date=%3E%3D2018-09-03T08%3A33%3A05.000Z&date=%3C2018-09-10T08%3A33%3A05.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
TypeInference data is swept by copy and compacting it to a new area on every collection. Sweeping is incremental and there is code to ensure data is swept if necessary before it is accessed (AutoSweepObjectGroup). In the OOM failure path we are touching unswept data and sweeping it, which causes us to try to allocate again resulting in an infinite loop. sfink suggested that we should be throwing away information on OOM and marking types as unknown, which would mean we don't need to allocate. I think this is the right approach.
As we see from the stack trace it's possible for the type inference OOM handling to sweep an ObjectGroup and hit OOM and loop. TI OOM handling works by setting a flag on an AutoClearTypeInferenceStateOnOOM and having that run cleanup actions in its destructor. The problem here is that we can nest AutoClearTypeInferenceStateOnOOM instances. I think that if we disallow this nesting then it won't be possible for this problem to happen, and OOM won't affect the ongoing cleanup. I've spent a while today trying to make a test case to reproduce this but I haven't been able to find anything that works.
Assignee: nobody → jcoppeard
This patch changes the destructor of ~AutoClearTypeInferenceStateOnOOM to call |zone->types.setSweepingTypes(false)| at end, after cleanup has happened. This makes us assert (in release) if the cleanup tries to use any nested AutoClearTypeInferenceStateOnOOM. ObjectGroup::sweep() changes to take a reference to one of these cleanup objects and the option to pass a null pointer and have it create a local one is removed. Finally the AutoSweepObjectGroup and AutoSweepTypeScript classes get two constructors: one that takes a reference to an existing cleanup object, and one that creates one for you if needed. All this means that OOM during OOM cleanup will just set the flag in the OOM cleanup object which is already true and not cause us to loop.
Attachment #9008409 - Flags: review?(jdemooij)
Here's a patch for an unsetgczeal() shell function which turns off a single zeal mode and doesn't finish the current GC like gczeal(0) does. This was necessary to make a test case for this.
Attachment #9008411 - Flags: review?(sphink)
Test case to cause OOM during sweeping of type information. This was quite tricky to get right. I checked that this reproduces the problem without the fix. I don't think we get an infinite loop but recursion of depth proportional to the number of unswept object groups. It also depends on incremental GC yielding during TI sweeping after an OOM has occured.
Attachment #9008415 - Flags: review?(jdemooij)
Comment on attachment 9008411 [details] [diff] [review] bug1490042-unset-zeal Review of attachment 9008411 [details] [diff] [review]: ----------------------------------------------------------------- Wow. That's quite comprehensive. Looks good to me, though I'm not sure I would spot some necessary missing cleanup.
Attachment #9008411 - Flags: review?(sphink) → review+
Comment on attachment 9008409 [details] [diff] [review] bug1490042-fix Review of attachment 9008409 [details] [diff] [review]: ----------------------------------------------------------------- Nice! ::: js/src/gc/GC.cpp @@ +6238,5 @@ > } > } > > static void > SweepThing(JSScript* script, AutoClearTypeInferenceStateOnOOM* oom) Nit: I think this |oom| could be a reference too, because we dereference it unconditionally? Also the SweepThing overload below. ::: js/src/vm/TypeInference.cpp @@ -4740,5 @@ > propertySet = nullptr; > } > > -static void > -EnsureHasAutoClearTypeInferenceStateOnOOM(AutoClearTypeInferenceStateOnOOM*& oom, Zone* zone, Very cool we can get rid of this.
Attachment #9008409 - Flags: review?(jdemooij) → review+
Comment on attachment 9008415 [details] [diff] [review] bug1490042-testcase Review of attachment 9008415 [details] [diff] [review]: ----------------------------------------------------------------- That's pretty heroic.
Attachment #9008415 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #9) > Nit: I think this |oom| could be a reference too, because we dereference it > unconditionally? Also the SweepThing overload below. Good call. I also need to fix argument forwarding and then this works.
Pushed by email@example.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/aa3c5d257b1e Only allow a single AutoClearTypeInferenceStateOnOOM to be active at once r=jandem r=sfink
Is this something you'd consider safe for a Beta uplift?
Backed out for causing topcrash bug 1491530. https://hg.mozilla.org/integration/autoland/rev/2b3acad1d831076f4936382ba4216c1c9d43ff63
The previous patch failed because of a path taken when converting unboxed objects to native uses AutoSweepObjectGroup while there's an AutoClearTypeInferenceStateOnOOM further up the stack. I couldn't see a way to thread a reference to that through all the calls without it infecting everything. Here's an alternative, which is to not pass a reference to the AutoClearTypeInferenceStateOnOOM at all and keep the state in the TypeZone. I don't like this as much but it seems to work. Another option is to wait until unboxed object support is removed which I think will make the original patch viable again.
Comment on attachment 9009933 [details] [diff] [review] bug1490042-fix v2 Review of attachment 9009933 [details] [diff] [review]: ----------------------------------------------------------------- This is still nicer than what we're currently doing.
Attachment #9009933 - Flags: review?(jdemooij) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/274c743b3d55 Only allow one AutoClearTypeInferenceStateOnOOM to be live at a time to fix recursive type sweeping r=jandem
Jon, do you consider this patch safe for a Beta uplift? Thanks.
The previous patch caused problems and had to be backed out so I'd prefer to let this one ride if this issue is low enough volume that we're happy to do that.
You need to log in before you can comment on or make changes to this bug.