Closed Bug 1260785 Opened 8 years ago Closed 8 years ago

crash in OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::MovableCellHasher<T>::hash

Categories

(Core :: JavaScript: GC, defect)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox45 --- wontfix
firefox46 --- wontfix
firefox47 --- wontfix
firefox48 + fixed
firefox49 + fixed
firefox-esr45 --- wontfix
firefox50 + fixed

People

(Reporter: calixte, Assigned: jonco)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(3 files)

This bug was filed from the Socorro interface and is 
report bp-53c551a3-4604-408b-89d2-04ad62160330.
=============================================================

This signature is #35 in topcrash for 45.0.1, #25 for 46.0b5 and #190 for 47.0a2.
In three cases, it appeared the 2016-03-24.

Stack:

js::AutoEnterOOMUnsafeRegion::crash(char const*)
js::MovableCellHasher<JSObject*>::hash(JSObject* const&)
js::detail::HashTable<js::HashMapEntry<JSObject*, mozilla::Vector<js::ArrayBufferViewObject*, 1, js::SystemAllocPolicy> >, js::HashMap<JSObject*, mozilla::Vector<js::ArrayBufferViewObject*, 1, js::SystemAllocPolicy>, js::MovableCellHasher<JSObject*>, js::SystemAllocPolicy>::MapHashPolicy, js::SystemAllocPolicy>::prepareHash(JSObject* const&)
js::AutoCycleDetector::init()
ArrayJoin<0>(JSContext*, JS::CallArgs&)
js::array_join(JSContext*, unsigned int, JS::Value*)
@0x7b5d877f
This looks like another case where we try to grow some GC-related hash table and we run out of memory.
In release, the crash is spiking since 2016-05-04 (increased by ~127% until 2016-05-08) and is #14 in top-crashes for 46.0.1 and #45 for 47.0b3.
(In reply to Andrew McCreight [:mccr8] from comment #1)
> This looks like another case where we try to grow some GC-related hash table
> and we run out of memory.

This isn't GC, it's JS object cycle detection [1]. What's weird is |AutoCycleDetector::init| [2] is fallible, but I guess the we're ooming somewhere in the lookup? The stack seems a bit sketchy.

[1] https://dxr.mozilla.org/mozilla-central/rev/043082cb7bd8490c60815f67fbd1f33323ad7663/js/src/jscntxt.h#35
[2] https://dxr.mozilla.org/mozilla-central/rev/043082cb7bd8490c60815f67fbd1f33323ad7663/js/src/jscntxt.cpp#60-71
Err, yeah I should have said it's growing a GC-related thing, but not *during* GC.

The implementation has changed a bit in trunk but essentially we're ooming when hashing a cell because MovableCellHasher uses |js::Zone::getUniqueId| which maintains a hashtable mapping Cells to unique ids which in this case fails to insert the mapping. Confusing, right?

I'm not sure there's much we can do here, it seems we made adding the unique id mapping infallible for a reason, so the standard make it fallible approach won't work. Terrence this seems to be your domain, any ideas?
Flags: needinfo?(terrence)
Crash Signature: [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::MovableCellHasher<T>::hash] → [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::MovableCellHasher<T>::hash] [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::MovableCellHasher<T>::hash] [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | JS::Zone::getUniqueI…
This is probably OOM:small. I guess the allocation might be large. I need to land the size reporting stuff and get this hooked up there.
Flags: needinfo?(terrence)
Depends on: 1257387
There's also the signature "OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::gc::StoreBuffer::putGeneric<T>" which might be related (https://crash-stats.mozilla.com/report/index/1f99d931-6dfb-4b9d-ae4f-190de2160517).
Flags: needinfo?(terrence)
Sorry, I didn't mean to needinfo you.
Flags: needinfo?(terrence)
See Also: → 1273492
Noting that OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | JS::Zone::getUniqueIdInfallible is now the #4 topcrash in beta, with about 700 crashes per week on beta 3, 4, and 5. Is there anything else we can do here for 48?
It might be possible to do something about this.

This OOM occurs when we try to lookup or add and entry to a hashtable keyed on a movable GC pointer, which we handle by assigning the object a fixed unique ID in a separate table.  At the moment we crash if that ID generation hits OOM.

In the case of looking up a movable pointer, if there is no ID already present for that pointer then we know the pointer can't be in the hash table.

In the case of adding a pointer, if ID generation fails then we already have a path by which to report OOM since hash table insertion is fallible.

Making that work could be tricky though.
Track this as there is a high volume in 48/49.
Here's a patch to implement the approach outlined in comment 9.

This adds a FallibleHashMethods traits class templated on the hash policy and implements it for MovableCellHasher.  The use of a separate class to the hash policy lets most hash policies ignore it rather that forcing all of them to implement the extra methods.

Hash table Ptr and AddPtr classes get an extra 'error' state where the entry pointer is null to indicate a failed hash operation during lookup.

One drawback of this approach is that the error reporting strategy of anything using MovableCellHasher must match that of the unique ID table, i.e. it must not report errors.  I had to change a couple of use from TempAllocPolicy to SystemAllocPolicy.  Most uses were fine however.  

I'm not sure how I would fix this property without adding another method to the alloc policy to maybe report an out of memory error but that's going to add yet more baggage to the alloc policy.

These changes are exercised by our existing OOM tests.
Assignee: nobody → jcoppeard
Attachment #8770066 - Flags: review?(terrence)
Comment on attachment 8770066 [details] [diff] [review]
bug1260785-unique-hash-code-oom

Review of attachment 8770066 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for taking this!

::: js/public/HashTable.h
@@ +894,4 @@
>  #endif
> +        {}
> +
> +        bool error() const {

Error is pretty generic and could even be a verb. Could we call this something more specific? Maybe isValid() or isInitialized().

::: js/src/gc/Barrier.cpp
@@ +110,5 @@
> +/* static */ bool
> +MovableCellHasher<T>::hasHash(const Lookup& l)
> +{
> +    if (!l)
> +        return true;

If I remember correctly, we never store nullptr in a table, but do need to check if nullptr is in a table occasionally. Note for myself: how hard would it be to assert !null here and fix up any callers?

::: js/src/gc/Zone.h
@@ +428,5 @@
>          // If the cell was in the nursery, hopefully unlikely, then we need to
>          // tell the nursery about it so that it can sweep the uid if the thing
>          // does not get tenured.
> +        if (!runtimeFromAnyThread()->gc.nursery.addedUniqueIdToCell(cell)) {
> +            uniqueIds_.remove(cell);

Oh, eww, that's some pretty ugly action at a distance. I guess it's only relevant when you have a fallable HashPolicy.
Attachment #8770066 - Flags: review?(terrence) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2cfa9ffe77a4
Make hashcode generation fallible for cell pointers that can be moved by GC r=terrence
https://hg.mozilla.org/mozilla-central/rev/2cfa9ffe77a4
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Looks promising, do you think we should try uplift? Maybe a bit risky...
Flags: needinfo?(jcoppeard)
Approval Request Comment
[Feature/regressing bug #]: The change to stable hashing closest to the date this appeared is bug 1224050, so I'm going to blame that.  It's probably the cumulative expansion of stable hashing that triggered it though.
[User impact if declined]: Possible crash on OOM.
[Describe test coverage new/current, TreeHerder]: On m-c since 13th July.
[Risks and why]: Low risk, it's covered by our test code and it's been on m-c for over a week without issues.
[String/UUID change made/needed]: None
Flags: needinfo?(jcoppeard)
Attachment #8773267 - Flags: review+
Attachment #8773267 - Flags: approval-mozilla-aurora?
Attached patch bug1260785-betaSplinter Review
Approval Request Comment
[Feature/regressing bug #]: Bug 1224050 with caveats as above.
[User impact if declined]: Possible crash on OOM.
[Describe test coverage new/current, TreeHerder]: On m-c since 13th July.
[Risks and why]: Low risk, it's covered by our test code and it's been on m-c for over a week without issues.
[String/UUID change made/needed]: None
Attachment #8773270 - Flags: review+
Attachment #8773270 - Flags: approval-mozilla-beta?
the crashes on nightly seem to have ceased after the 20160713030216 build: http://bit.ly/2axJp1y
the signature seems to be at about 0.2% of all crashes on 47 and 49.0a2 currently, but close to 2% on 48 beta.
Comment on attachment 8773270 [details] [diff] [review]
bug1260785-beta

Review of attachment 8773270 [details] [diff] [review]:
-----------------------------------------------------------------

We have this bug for a while (at least from 45) so we should not be too worried. However, from the crash reports, the stability number of 48 beta 9 has more crash sign than the release.  It's likely that we will have a big spike on this crash on release. Although this patch is pretty big, we can take it for 48 beta RC. If we have critical fallout, we can always do a RC 2.
Attachment #8773270 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment on attachment 8773267 [details] [diff] [review]
bug1260785-aurora

Review of attachment 8773267 [details] [diff] [review]:
-----------------------------------------------------------------

Per comment #20, take it for aurora.
Attachment #8773267 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
does not apply cleanly to aurora

renamed 1260785 -> bug1260785.patch
applying bug1260785.patch
patching file js/public/HashTable.h
Hunk #1 FAILED at 56
Hunk #2 FAILED at 310
Hunk #3 FAILED at 669
Hunk #4 FAILED at 846
Hunk #5 FAILED at 878
Hunk #6 FAILED at 909
Hunk #7 FAILED at 1676
Hunk #8 FAILED at 1750
8 out of 8 hunks FAILED -- saving rejects to file js/public/HashTable.h.rej
patching file js/public/Id.h
Hunk #1 FAILED at 21
1 out of 1 hunks FAILED -- saving rejects to file js/public/Id.h.rej
patching file js/public/RootingAPI.h
Hunk #1 FAILED at 558
1 out of 1 hunks FAILED -- saving rejects to file js/public/RootingAPI.h.rej
patching file js/src/gc/Barrier.cpp
Hunk #1 FAILED at 101
1 out of 1 hunks FAILED -- saving rejects to file js/src/gc/Barrier.cpp.rej
patching file js/src/gc/Barrier.h
Hunk #1 FAILED at 795
1 out of 1 hunks FAILED -- saving rejects to file js/src/gc/Barrier.h.rej
patching file js/src/gc/Zone.h
Hunk #1 FAILED at 422
1 out of 1 hunks FAILED -- saving rejects to file js/src/gc/Zone.h.rej
patching file js/src/jsapi-tests/testGCWeakCache.cpp
Hunk #1 FAILED at 18
1 out of 1 hunks FAILED -- saving rejects to file js/src/jsapi-tests/testGCWeakCache.cpp.rej
patching file js/src/jscntxt.cpp
Hunk #1 FAILED at 56
Hunk #2 FAILED at 962
2 out of 2 hunks FAILED -- saving rejects to file js/src/jscntxt.cpp.rej
patching file js/src/jscntxt.h
Hunk #1 FAILED at 29
1 out of 1 hunks FAILED -- saving rejects to file js/src/jscntxt.h.rej
patching file js/src/json.cpp
Hunk #1 FAILED at 122
Hunk #2 FAILED at 305
2 out of 2 hunks FAILED -- saving rejects to file js/src/json.cpp.rej
patching file js/src/vm/StructuredClone.cpp
Hunk #1 FAILED at 271
Hunk #2 FAILED at 333
Hunk #3 FAILED at 960
Hunk #4 FAILED at 1260
4 out of 4 hunks FAILED -- saving rejects to file js/src/vm/StructuredClone.cpp.rej
patch failed, unable to continue (try -v)
patch failed, rejects left in working directory
errors during apply, please fix and qrefresh bug1260785.patch
Flags: needinfo?(jcoppeard)
I just landed this.  Pulsebot doesn't seem to have posted the changes yet.
Flags: needinfo?(jcoppeard)
thnaks, no wonder this didn't applied, setting flags
Crash volume for signature 'OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::MovableCellHasher<T>::hash':
 - nightly (version 50): 12 crashes from 2016-06-06.
 - aurora  (version 49): 1 crash from 2016-06-07.
 - beta    (version 48): 7 crashes from 2016-06-06.
 - release (version 47): 6910 crashes from 2016-05-31.
 - esr     (version 45): 2050 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          5          4          0          0          1          2          0
 - aurora           0          0          1          0          0          0          0
 - beta             4          0          0          0          0          2          1
 - release       1228       1226       1020        997        899        811        307
 - esr            296        277        237        242        200        191        100

Affected platforms: Windows, Mac OS X, Linux
Depends on: 1341367
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: