Closed Bug 1615307 Opened 4 years ago Closed 2 months ago

Crash in [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::GCRuntime::mergeRealms]

Categories

(Core :: JavaScript Engine, defect, P2)

x86
Windows 10
defect

Tracking

RESOLVED WORKSFORME
Tracking Status
firefox73 --- affected
firefox74 --- affected

People

(Reporter: pascalc, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

This bug is for crash report bp-e93e4c41-8af6-4544-abeb-8c5f30200213.

Top 10 frames of crashing thread:

0 xul.dll js::AutoEnterOOMUnsafeRegion::crash js/src/vm/JSContext.cpp:1501
1 xul.dll js::gc::GCRuntime::mergeRealms js/src/gc/GC.cpp:7658
2 xul.dll js::gc::MergeRealms js/src/gc/GC.cpp:7556
3 xul.dll js::GlobalHelperThreadState::finishParseTaskCommon js/src/vm/HelperThreads.cpp:1910
4 xul.dll js::GlobalHelperThreadState::finishMultiParseTask js/src/vm/HelperThreads.cpp:2017
5 xul.dll mozilla::ScriptPreloader::MaybeFinishOffThreadDecode js/xpconnect/loader/ScriptPreloader.cpp:989
6 xul.dll mozilla::ScriptPreloader::WaitForCachedScript js/xpconnect/loader/ScriptPreloader.cpp:882
7 xul.dll mozilla::ScriptPreloader::GetCachedScript js/xpconnect/loader/ScriptPreloader.cpp:858
8 xul.dll mozJSComponentLoader::ObjectForLocation js/xpconnect/loader/mozJSComponentLoader.cpp:819
9 xul.dll mozJSComponentLoader::Import js/xpconnect/loader/mozJSComponentLoader.cpp:1353

There is apparently a very large number of startup crashes in 72.0.2 and 73.0.

I do not understand this bug. Some of the crashes seem to happen while plenty of memory remains (70% of the crashes happen with less than 90% system memory usage).
While many others are justified OOMs, I think it is worth investigating whether some other kind of error is being incorrectly reported as an OOM while merging realms.

Steve, any idea what could be going wrong that is misreported as an OOM?

Flags: needinfo?(sphink)
Priority: -- → P2

There are two calls to oomUnsafe.crash() directly in the body of GCRuntime::mergeRealms, both guarded by coverage::IsLCovEnabled(). Neither of them is the crash in comment 0:

MOZ_CRASH Reason (Sanitized): [unhandlable oom] failed to transfer unique ids from off-thread

https://crash-stats.mozilla.org/search/?release_channel=release&signature=%3DOOM%20%7C%20unknown%20%7C%20js%3A%3AAutoEnterOOMUnsafeRegion%3A%3Acrash%20%7C%20js%3A%3Agc%3A%3AGCRuntime%3A%3AmergeRealms&product=Firefox&version=72.0&version=72.0.1&version=72.0.2&version=73.0&date=%3E%3D2020-01-03T00%3A00%3A00.000Z&date=%3C2020-02-14T00%3A33%3A00.000Z&_facets=install_time&_facets=version&_facets=address&_facets=moz_crash_reason&_facets=reason&_facets=build_id&_facets=platform_pretty_version&_facets=signature&_facets=useragent_locale&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-moz_crash_reason shows that all 2015 of 2015 crashes are on this same line.

That is from an inlined call to JS::Zone::adoptUniqueIds, after a put() fails: uniqueIds().put(e.front().key(), e.front().value()). That map is a js::ZoneOrGCTaskData<js::gc::UniqueIdMap> where UniqueIdMap is really GCHashMap<Cell*, uint64_t, PointerHasher<Cell*>, SystemAllocPolicy, UniqueIdGCPolicy>. So we're looking at a SystemAllocPolicy failure here.
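To make the failure shape concrete, here is a minimal sketch of the "copy every entry, give up on the first failed put()" pattern described above. It is not the SpiderMonkey code: FallibleIdMap, its maxEntries limit, and adoptIds are hypothetical stand-ins, and the entry limit simply models "put() returned false for whatever reason".

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for a hash map whose put() reports failure by
// returning false. maxEntries models any reason put() might fail.
class FallibleIdMap {
 public:
  explicit FallibleIdMap(std::size_t maxEntries) : maxEntries_(maxEntries) {}

  // Stores (cell -> uid); returns false if a new entry would exceed the limit.
  bool put(const void* cell, std::uint64_t uid) {
    if (map_.count(cell) == 0 && map_.size() >= maxEntries_) {
      return false;
    }
    map_[cell] = uid;
    return true;
  }

  std::size_t count() const { return map_.size(); }

  const std::unordered_map<const void*, std::uint64_t>& entries() const {
    return map_;
  }

 private:
  std::size_t maxEntries_;
  std::unordered_map<const void*, std::uint64_t> map_;
};

// Copies every id from source into target. Returns false at the first failed
// put(); in the real code that is the point where the unhandlable-OOM crash
// is raised, whatever the underlying reason for the failure was.
bool adoptIds(FallibleIdMap& target, const FallibleIdMap& source) {
  for (const auto& entry : source.entries()) {
    if (!target.put(entry.first, entry.second)) {
      return false;
    }
  }
  return true;
}
```

The caller in this sketch cannot tell a genuine allocation failure from any other reason put() returned false, which is the ambiguity this comment is trying to untangle.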

put() calls add() which claims to only fail on OOM, though there are quite a few different ways for it to return false: https://searchfox.org/mozilla-central/source/mfbt/HashTable.h#2097,2111,2119,2128,2131

The ways to fail that I see are (1) checkSimulatedOOM(), (2) a failure in ensureHash that is only detected at this point, or (3) a failed rehash. I thought it would be easy to see that checkSimulatedOOM would never fail in a non-DEBUG build, but it's a little trickier than I suspected: targetThread_ && targetThread_ == js::oom::GetThreadType() is always false there, because GetThreadType() is hardcoded to return zero in a non-DEBUG build. A zero targetThread_ fails the first operand, and a nonzero targetThread_ can never equal zero.
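The argument about that expression can be checked with a small sketch. GetThreadTypeStub and wouldSimulateOOM are hypothetical names invented for illustration; the only thing taken from the comment is the shape of the expression and the assumption that the thread-type lookup returns zero in a non-DEBUG build.

```cpp
#include <cstdint>

// Hypothetical model of a non-DEBUG build, where the thread-type lookup is
// assumed to be hardcoded to return zero.
constexpr std::uint32_t GetThreadTypeStub() { return 0; }

// Same shape as the expression quoted above:
//   targetThread_ && targetThread_ == GetThreadType()
constexpr bool wouldSimulateOOM(std::uint32_t targetThread) {
  return targetThread && targetThread == GetThreadTypeStub();
}

// Zero fails the left operand; any nonzero value fails the comparison.
static_assert(!wouldSimulateOOM(0), "zero target never triggers");
static_assert(!wouldSimulateOOM(1), "nonzero target never equals zero");
static_assert(!wouldSimulateOOM(UINT32_MAX), "holds for the largest value too");
```

Under that assumption the simulated-OOM path can be ruled out as a source of the failed put().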

I'll skip ensureHash for now. Rehashing can fail in one interesting way that doesn't involve OOM: if (MOZ_UNLIKELY(newCapacity > sMaxCapacity)). sMaxCapacity is 2**30, or 1GB. It looks like newCapacity could result from rawCapacity() being half that, or 512MB. I don't really follow all the math, but my local build says entries are 16 bytes, so maybe that translates to 32 million entries. Which doesn't seem impossible?
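The back-of-the-envelope figure can be reproduced as below. This only restates the arithmetic in the comment under its own assumptions (a 512MB table and 16-byte entries, both taken from the text above); the constant and function names are made up for illustration, and it says nothing about how the hash table actually computes its capacity.

```cpp
#include <cstdint>

// Assumptions copied from the comment, not from the hash table source:
// a table of 512MB and entries of 16 bytes each.
constexpr std::uint64_t kAssumedTableBytes = 512ull * 1024 * 1024;
constexpr std::uint64_t kAssumedEntryBytes = 16;

// Number of whole entries that fit in the given number of bytes.
constexpr std::uint64_t entriesThatFit(std::uint64_t bytes,
                                       std::uint64_t entryBytes) {
  return bytes / entryBytes;
}

// 2^29 bytes / 2^4 bytes per entry = 2^25 entries, roughly 32 million.
static_assert(entriesThatFit(kAssumedTableBytes, kAssumedEntryBytes) ==
                  32ull * 1024 * 1024,
              "512MB of 16-byte entries is 32Mi entries");
```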

I'm not really sure what might go wrong to get 32 million objects with unique ids. To get a unique ID, you mostly just need to use the object as a key in a map somewhere. I could imagine that being pretty common. It makes me wonder if we should stick the current unique ID (it's a forever-incrementing 64-bit value) into crashdumps.

nbp, does that spark any ideas?

The other option is for ensureHash to fail. I'm pretty confused about how that stuff works, but I think it'll end up at https://searchfox.org/mozilla-central/source/js/src/gc/Zone-inl.h#38 which has a couple of ways to fail. I didn't look all that hard, but they all seemed to be normal OOM with a SystemAllocPolicy (even in the funky nursery case, where it appends to a vector).

Flags: needinfo?(sphink) → needinfo?(nicolas.b.pierron)

Probably worth mentioning that mergeRealms can be removed after the Stencil work lands.

Severity: normal → S3

Closing because no crashes reported for 12 weeks.

Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → WORKSFORME