Crash in [@ mozilla::HashMapEntry<T>::~HashMapEntry]
Categories
(Core :: XPConnect, defect)
Tracking
| Release | Tracking | Status |
|---|---|---|
| thunderbird_esr60 | --- | unaffected |
| firefox-esr60 | --- | unaffected |
| firefox65 | --- | unaffected |
| firefox66 | --- | unaffected |
| firefox67 | + | fixed |
People
(Reporter: marcia, Assigned: jandem)
References
Details
(4 keywords, Whiteboard: [post-critsmash-triage])
Crash Data
Attachments
(1 file)
This bug is for crash report bp-3ac7e79e-d7c6-4000-b787-f3fb10190223.
Seen while looking at Nightly crash stats. Windows and Mac crashes started with build 20190223041557: https://bit.ly/2H3vVwv. 5 crashes/6 installs so far.
Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=98c743692a8494dbd609abe2f96b8297f777cad6&tochange=492e4409f468f4841def795627356148d0bb7347
Top 10 frames of crashing thread:
0 XUL mozilla::HashMapEntry<js::HeapPtr<JSObject*>, js::HeapPtr<JSObject*> >::~HashMapEntry js/public/HeapAPI.h:155
1 XUL <name omitted> mfbt/HashTable.h:100
2 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/public/Utility.h:537
3 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/xpconnect/src/XPCWrappedNativeScope.cpp:271
4 XUL XPCJSRuntime::FinalizeCallback js/xpconnect/src/XPCWrappedNativeScope.cpp:416
5 XUL js::gc::GCRuntime::endSweepingSweepGroup js/src/gc/GC.cpp:1808
6 XUL sweepaction::SweepActionSequence<js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6271
7 XUL sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter, JSRuntime*, js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6331
8 XUL js::gc::GCRuntime::incrementalSlice js/src/gc/GC.cpp:6503
9 XUL js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7373
Comment 1 (Reporter)•7 years ago
OS X 10.14 has the largest number of crashes so far (13). This also affects Fennec. One comment reads: "Don't remember what was here. Just noticed this, perhaps happened while idle overnight?"
Comment 2•7 years ago
It looks like this is a bunch of null derefs while calling WeakMapPtr::destroy() from ~XPCWrappedNativeScope. Does this look like the other issue with null keys in weak maps, Jon?
Comment 3•7 years ago
This is also not a very good signature. This should be added to the prefix list.
Comment 4 (Reporter)•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #3)
> This is also not a very good signature. This should be added to the prefix list.
I filed Bug 1533011 for this. This crash continues to happen regularly on 67 nightly, with a range of 2-8 crashes per build.
Comment 5•7 years ago
Bug 1531035 is what I was thinking of in comment 2, but that has landed and it doesn't look like it fixed anything.
Comment 6•7 years ago
Boris, maybe this is a regression from bug 1523843? That's in the regression range in comment 0. Bug 1528146 talks about web extensions, and I see what feels like an unusually large number of moz-extension:// URLs in the crashes.
This crash seems to have a decent stack: bp-1862e857-b1be-412d-aaa7-d03430190305
We're getting into this hash table destroy from a call to void JS::WeakMapPtr<JSObject*, JSObject*>::destroy(), so maybe this is about mXrayExpandos?
Nominating for tracking because this is a newly introduced null deref crash that is happening regularly, though it doesn't look like a top crash.
Comment 7•7 years ago
> Boris, maybe this is a regression from bug 1523843?
It's possible, yes.
> Bug 1528146 talks about web extensions
Note that bug 1528146 is unrelated to bug 1523843.
> so maybe this is about mXrayExpandos?
Maybe.
So all the crashes I see are on 64-bit (including all the Windows crashes, which is a little odd), and all at 0x10. That suggests to me that the HashMapEntry being destroyed is actually null. Which suggests that in destroyTable slot.isLive() is wrong or something.
One thing we could try doing is temporarily making the test at https://searchfox.org/mozilla-central/rev/3e0f1d95fcf8832413457e3bec802113bdd1f8e8/dom/base/nsGlobalWindowOuter.cpp#1850 always test false on nightly and seeing whether that makes this crash go away in the builds with that change. That would pretty definitively tell us whether it's related to the compartment sharing bits.
Comment 8•7 years ago
I'm going to do the disabling in bug 1533103 and we will see how that affects this bug.
Comment 9 (Assignee)•7 years ago
OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.
The question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.
I wonder if we should be clearing the map here:
Comment 10•7 years ago
That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...
Comment 11 (Assignee)•7 years ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #10)
> That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...
Bug 1533103 is probably worth trying anyway, but note that you will run into bug 1530608 test failures :/
Comment 12•7 years ago
Oh. Ugh, right. :(
Comment 13 (Assignee)•7 years ago
Jon, any thoughts on comment 9? Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?
XPCWrappedNativeScope is moved to the "dying scopes" list when there are no live globals in the compartment the scope belongs to. But (why) is there still an entry in mXrayExpandos when we destroy the dying scopes (at the end of GC)?
One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.
Comment 14 (Assignee)•7 years ago
(In reply to Jan de Mooij [:jandem] from comment #13)
> One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.
That's probably fine because it will get swapped with a CCW (cross-compartment wrapper). Here are some things we could release-assert:
- mXrayExpandos is empty when we destroy the wrapped-native-scope
- mXrayExpandos contains only same-compartment (with the scope) keys and values
- mXrayExpandos does not contain any live objects when we move the scope to the dying scopes list
I can write a patch tomorrow...
Comment 15 (Assignee)•7 years ago
I filed bug 1533302 to simplify this code some. I'm hopeful that will fix this bug or at least make it a lot easier to reason about.
Comment 16•7 years ago
(In reply to Jan de Mooij [:jandem] from comment #9)
> OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.
The arena's zone pointer is set to nullptr when the arena is freed, so this is probably a use-after-free.
> Question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.
> I wonder if we should be clearing the map here:
That sounds like a good idea to me.
(In reply to Jan de Mooij [:jandem] from comment #13)
> Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?
A WeakMap is associated with a Zone and is swept when that Zone is swept. Not sure if that answers your question...
Comment 18•7 years ago
Boris pointed out that the crash seems to have gone away in the last two Nightlies, even before his backout landed. Very peculiar.
Comment 19•7 years ago
The last one I experienced was: https://crash-stats.mozilla.org/report/index/10b6980b-b7f6-4bc3-9163-9cb4e0190306
That's from 20190306095759.
Comment 20•7 years ago
So this crash does seem to have gone away before I checked in bug 1533103. The crash in bug 1532090, on the other hand, went away after the commit for bug 1533103. I've landed a backout of bug 1533103 in bug 1533105 and we'll see how that affects the stats....
Comment 21 (Assignee)•7 years ago
I'm pretty sure bug 1533302 will improve or affect this: we now destroy the map in UpdateWeakPointersAfterGC when there are no live globals and we release-assert in XPCWrappedNativeScope's destructor that the map is destroyed (not initialized).
Comment 23•7 years ago
UncheckedUnwrapWithoutExpose is clearly another variant of this crash: lots of moz-extension URLs, TraceXPCGlobal appears in the stack.
I did a search for those two things, and it is more common than the DoMarking variant: https://crash-stats.mozilla.com/search/?proto_signature=~TraceXPCGlobal&url=%5Emoz-extension&product=Firefox&version=67.0a1&date=%3E%3D2019-03-04T18%3A18%3A00.000Z&date=%3C2019-03-11T18%3A18%3A00.000Z&_facets=signature&_sort=-build_id&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
[@ js::WeakMap<T>::markIteratively ] looks like it is also another variant.
Comment 24•7 years ago
So I'm trying to figure out where we stand.
This bug as filed stopped reproducing for no reason we can see, right?
Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?
Bug 1532090 predates all this stuff, so not much we can say there based on frequency.
Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.
Is that all correct?
Comment 25•7 years ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #24)
> This bug as filed stopped reproducing for no reason we can see, right?
Correct.
> Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?
Yes, I think that's right.
> Bug 1532090 predates all this stuff, so not much we can say there based on frequency.
I looked at all of the crashes with this signature that we have, and the ones with TraceXPCGlobal in the stacks seem to follow the pattern of the signature from bug 1530672. In other words, they went away when it was disabled, and they are still happening on builds with bug 1533302, for instance: bp-aa313f6d-a38b-45ed-a79c-014b50190311
> Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.
Again, if you look at only the signatures with TraceXPCGlobal in the proto stack, it seems to be consistent with the other signatures.
> Is that all correct?
Yes.
Looking at my query from comment 11, the first crash is in the 20190223213931 build, which is one build after bug 1523843 landed. There were 0 crashes on the 3-7 builds, which I think is when compartment sharing was temporarily disabled. And yes, bug 1533302 doesn't seem to have fixed that.
Comment 26 (Assignee)•7 years ago
For the other bug, I wonder about this case:
- Compartment has a global A.
- We attempt to create a new global B in A's compartment.
- We start a GC as part of creating B, collecting A.
Then (a) we didn't call TraceXPCGlobal because A is dead and we're still creating B, so we don't trace the expando map and (b) in XPCWrappedNativeScope::UpdateWeakPointersAfterGC we could see a live global at this point (if we're doing an incremental GC for example), so we don't destroy the map.
I'm not sure what the best fix is to prevent this.
Comment 27 (Assignee)•7 years ago
One option is:
- Model mXrayExpandos after mWaiverWrapperMap; then we only have to call Sweep on it in UpdateWeakPointersAfterGC, which is straightforward. The mXrayExpandos tracing code in TraceXPCGlobal goes away.
- Make mIDProto/mIIDProto/mCIDProto true weak pointers.
- Remove XPCWrappedNativeScope::TraceInside, avoiding the issue in comment 26.
Andrew, does that sound reasonable to you? I think it makes sense to use the same machinery for mXrayExpandos and mWaiverWrapperMap. mXrayExpandos is the more complicated (and buggy) one because it requires that explicit trace call.
Comment 28 (Assignee)•7 years ago
Another option is to add a TraceCompartment callback (and then call that where we call Realm::traceRoots). Then Gecko would use that to call XPCWrappedNativeScope::TraceInside instead of in TraceXPCGlobal.
Then we need to be careful to not leak compartments by marking things unconditionally in TraceCompartment...
Comment 29•7 years ago
I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B? For instance, it looks like SandboxOptions::sameZoneAs roots A. Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.
Comment 30 (Assignee)•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #29)
> I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B?
I was thinking about the code here: https://searchfox.org/mozilla-central/rev/8ff2cd0a27e3764d9540abdc5a66b2fb1e4e9644/dom/base/nsGlobalWindowOuter.cpp#1831
We look for a JS::Compartment* there but we don't keep one of its globals alive.
> Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.
I think the problem is that we forget to trace the WeakMap and then the next GC crashes because we now have garbage in it.
Comment 31•7 years ago
> but we don't keep one of its globals alive
Maybe we should?
Comment 32 (Assignee)•7 years ago
(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #31)
> > but we don't keep one of its globals alive
> Maybe we should?
Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now..)
Comment 34 (Assignee)•7 years ago
(In reply to Jan de Mooij [:jandem] from comment #32)
> Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now..)
Hm, we should probably do that at the JS engine level in JS_NewGlobalObject so it's guaranteed to work like that everywhere.
Comment 35•7 years ago
I'm all for making XPConnect tracing simpler, but failing to realize that a compartment is alive when we have a global for it sounds like it could break in other ways. I guess this is just a failing of the particular way XPC tracing thinks the compartment is alive?
Comment 36•7 years ago
On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?
Comment 37 (Assignee)•7 years ago
(In reply to Andrew McCreight [:mccr8] from comment #36)
> On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?
I think there's only a problem with the new-global code: if a compartment has no live globals we don't reuse it anymore, but while we are creating the new global and GC, we're between that decision and having a live global again.
I'll post a patch to fix this in the new-global code. If it helps we can simplify the tracing code separately.
Comment 38 (Assignee)•7 years ago
Comment 39 (Assignee)•7 years ago
Pushed because this is a recent regression: https://hg.mozilla.org/integration/autoland/rev/7f3111601872c9d6e2a49a52cc57f7c7a25210e7
Leaving this open until we know more.
Comment 40•7 years ago
Comment 41•7 years ago
There's only been one crash with TraceXPCGlobal in the proto signature since Jan's patch landed, and it is a crash inside TraceProtoAndIfaceCache and not the weak map stuff, so that might be a different issue. The crash is also in a web content process, and not an extension process, which also suggests it might be different. That seems like a good sign.
bp-b295444f-6462-4542-9ca0-a01710190314
Comment 42 (Assignee)•7 years ago
I also don't see the UncheckedUnwrapWithoutExpose signature anymore with recent Nightlies.
Comment 43•6 years ago
Can we land a test for this?
Comment 44 (Assignee)•6 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #43)
> Can we land a test for this?
I don't have a test for this. It's probably also not very easy to write one because you have to trigger a GC in exactly the right place.
Comment 45 (Reporter)•6 years ago
Hello Jan - one of the Firefox signatures in this bug seems to still be occurring in the recent betas - see https://bit.ly/2Z5osnl. Should I spin off a new bug for this? thanks.
Comment 46•6 years ago
DoMarking is just a generic GC crash. We already have bug 1280186 on file for it.