Closed Bug 1530364 Opened 7 years ago Closed 7 years ago

Crash in [@ mozilla::HashMapEntry<T>::~HashMapEntry]

Categories

(Core :: XPConnect, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla67
Tracking Status
thunderbird_esr60 --- unaffected
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 --- unaffected
firefox67 + fixed

People

(Reporter: marcia, Assigned: jandem)

References

Details

(4 keywords, Whiteboard: [post-critsmash-triage])

Crash Data

Attachments

(1 file)

This bug is for crash report bp-3ac7e79e-d7c6-4000-b787-f3fb10190223.

Seen while looking at nightly crash stats: Windows and Mac crashes started with build 20190223041557: https://bit.ly/2H3vVwv. 5 crashes/6 installs so far.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=98c743692a8494dbd609abe2f96b8297f777cad6&tochange=492e4409f468f4841def795627356148d0bb7347

Top 10 frames of crashing thread:

0 XUL mozilla::HashMapEntry<js::HeapPtr<JSObject*>, js::HeapPtr<JSObject*> >::~HashMapEntry js/public/HeapAPI.h:155
1 XUL <name omitted> mfbt/HashTable.h:100
2 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/public/Utility.h:537
3 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/xpconnect/src/XPCWrappedNativeScope.cpp:271
4 XUL XPCJSRuntime::FinalizeCallback js/xpconnect/src/XPCWrappedNativeScope.cpp:416
5 XUL js::gc::GCRuntime::endSweepingSweepGroup js/src/gc/GC.cpp:1808
6 XUL sweepaction::SweepActionSequence<js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6271
7 XUL sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter, JSRuntime*, js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6331
8 XUL js::gc::GCRuntime::incrementalSlice js/src/gc/GC.cpp:6503
9 XUL js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7373

OS X 10.14 has the largest number of crashes so far (13). This also affects Fennec. One comment reads: "Don't remember what was here. Just noticed this, perhaps happened while idle overnight?"

It looks like this is a bunch of null derefs while calling WeakMapPtr::destroy() from ~XPCWrappedNativeScope. Does this look like the other issue with null keys in weak maps, Jon?

Flags: needinfo?(jcoppeard)

This is also not a very good signature. This should be added to the prefix list.

(In reply to Andrew McCreight [:mccr8] from comment #3)

This is also not a very good signature. This should be added to the prefix list.

I filed Bug 1533011 for this. This crash continues to happen regularly on 67 nightly, with a range of 2-8 crashes per build.

Bug 1531035 is what I was thinking of in comment 2, but that has landed and it doesn't look like it fixed anything.

Boris, maybe this is a regression from bug 1523843? That's in the regression range in comment 0. Bug 1528146 talks about web extensions, and I see what feels like an unusually large number of moz-extension:// URLs in the crashes.

This crash seems to have a decent stack: bp-1862e857-b1be-412d-aaa7-d03430190305

We're reaching this hash table destruction via a call to JS::WeakMapPtr<JSObject*, JSObject*>::destroy(), so maybe this is about mXrayExpandos?

Nominating for tracking because this is a newly introduced null deref crash that is happening regularly, though it doesn't look like a top crash.

Flags: needinfo?(jcoppeard) → needinfo?(bzbarsky)
Component: JavaScript Engine → XPConnect

Boris, maybe this is a regression from bug 1523843?

It's possible, yes.

Bug 1528146 talks about web extensions

Note that bug 1528146 is unrelated to bug 1523843.

so maybe this is about mXrayExpandos?

Maybe.

So all the crashes I see are on 64-bit (including all the Windows crashes, which is a little odd), and all at 0x10. That suggests to me that the HashMapEntry being destroyed is actually null. Which suggests that in destroyTable slot.isLive() is wrong or something.

One thing we could try doing is temporarily making the test at https://searchfox.org/mozilla-central/rev/3e0f1d95fcf8832413457e3bec802113bdd1f8e8/dom/base/nsGlobalWindowOuter.cpp#1850 always test false on nightly and seeing whether that makes this crash go away in the builds with that change. That would pretty definitively tell us whether it's related to the compartment sharing bits.

Flags: needinfo?(bzbarsky)

I'm going to do the disabling in bug 1533103 and we will see how that affects this bug.

Depends on: 1533103
Flags: needinfo?(bzbarsky)

OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.

Question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.

I wonder if we should be clearing the map here:

https://searchfox.org/mozilla-central/rev/fbb251448feb7276f9b1d0a88f9c0cb1cd144ce4/js/xpconnect/src/XPCWrappedNativeScope.cpp#364-366

Group: javascript-core-security

That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...

Group: javascript-core-security → dom-core-security

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #10)

That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...

Bug 1533103 is probably worth trying anyway, but note that you will run into bug 1530608 test failures :/

Oh. Ugh, right. :(

Jon, any thoughts on comment 9? Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?

XPCWrappedNativeScope is moved to the "dying scopes" list when there are no live globals in the compartment the scope belongs to. But (why) is there still an entry in mXrayExpandos when we destroy the dying scopes (at the end of GC)?

One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.

Flags: needinfo?(jcoppeard)

(In reply to Jan de Mooij [:jandem] from comment #13)

One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.

That's probably fine because it will get swapped with a CCW. Here are some things we could release-assert:

  1. mXrayExpandos is empty when we destroy the wrapped-native-scope
  2. mXrayExpandos contains only same-compartment (with the scope) keys and values
  3. mXrayExpandos does not contain any live objects when we move the scope to the dying scopes list

I can write a patch tomorrow...

I filed bug 1533302 to simplify this code some. I'm hopeful that will fix this bug or at least make it a lot easier to reason about.

Flags: needinfo?(jcoppeard)

(In reply to Jan de Mooij [:jandem] from comment #9)

OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.

The arena's zone pointer is set to nullptr when the arena is freed, so this is probably UAF.

Question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.

I wonder if we should be clearing the map here:

https://searchfox.org/mozilla-central/rev/fbb251448feb7276f9b1d0a88f9c0cb1cd144ce4/js/xpconnect/src/XPCWrappedNativeScope.cpp#364-366

That sounds like a good idea to me.

(In reply to Jan de Mooij [:jandem] from comment #13)

Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?

A WeakMap is associated with a Zone and is swept when that Zone is swept. Not sure if that answers your question...

See Also: → 1532090
Assignee: nobody → bzbarsky
See Also: 1532090

Boris pointed out that the crash seems to have gone away in the last two Nightlies, even before his backout landed. Very peculiar.

So this crash does seem to have gone away before I checked in bug 1533103. The crash in bug 1532090, on the other hand, went away after the commit for bug 1533103. I've landed a backout of bug 1533103 in bug 1533105 and we'll see how that affects the stats...

I'm pretty sure bug 1533302 will improve or affect this: we now destroy the map in UpdateWeakPointersAfterGC when there are no live globals and we release-assert in XPCWrappedNativeScope's destructor that the map is destroyed (not initialized).

UncheckedUnwrapWithoutExpose is clearly another variant of this crash: lots of moz-extension URLs, TraceXPCGlobal appears in the stack.

I did a search for those two things, and it is more common than the DoMarking variant: https://crash-stats.mozilla.com/search/?proto_signature=~TraceXPCGlobal&url=%5Emoz-extension&product=Firefox&version=67.0a1&date=%3E%3D2019-03-04T18%3A18%3A00.000Z&date=%3C2019-03-11T18%3A18%3A00.000Z&_facets=signature&_sort=-build_id&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

[@ js::WeakMap<T>::markIteratively ] looks like it is also another variant.

Crash Signature: [@ mozilla::HashMapEntry<T>::~HashMapEntry] → [@ mozilla::HashMapEntry<T>::~HashMapEntry] [@ js::UncheckedUnwrapWithoutExpose] [@ DoMarking<T> ] [@ js::WeakMap<T>::markIteratively ]

So I'm trying to figure out where we stand.

This bug as filed stopped reproducing for no reason we can see, right?

Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?

Bug 1532090 predates all this stuff, so not much we can say there based on frequency.

Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.

Is that all correct?

Flags: needinfo?(bzbarsky) → needinfo?(continuation)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #24)

This bug as filed stopped reproducing for no reason we can see, right?

Correct.

Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?

Yes, I think that's right.

Bug 1532090 predates all this stuff, so not much we can say there based on frequency.

I looked at all of the crashes with this signature that we have, and the ones with TraceXPCGlobal in the stacks seem to follow the pattern of the signature from bug 1530672. In other words, they went away when it was disabled, and they are still happening on builds with bug 1533302, for instance: bp-aa313f6d-a38b-45ed-a79c-014b50190311

Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.

Again, if you look at only the signatures with TraceXPCGlobal in the proto stack, it seems to be consistent with the other signatures.

Is that all correct?

Yes.

Looking at my query from comment 11, the first crash is in the 20190223213931 build, which is one build after bug 1523843 landed. There were 0 crashes on the 3-7 builds, which I think is when compartment sharing was temporarily disabled. And yes, bug 1533302 doesn't seem to have fixed that.

Flags: needinfo?(continuation)

For the other bug, I wonder about this case:

  1. Compartment has a global A.
  2. We attempt to create a new global B in A's compartment.
  3. We start a GC as part of creating B, collecting A.

Then (a) we didn't call TraceXPCGlobal, because A is dead and we're still creating B, so we don't trace the expando map; and (b) in XPCWrappedNativeScope::UpdateWeakPointersAfterGC we could see a live global at this point (if we're doing an incremental GC, for example), so we don't destroy the map.

I'm not sure what the best fix is to prevent this.

One option is:

  1. Model mXrayExpandos after mWaiverWrapperMap; then we only have to call Sweep on it in UpdateWeakPointersAfterGC, and that's straightforward. The mXrayExpandos tracing code in TraceXPCGlobal goes away.

  2. Make mIDProto/mIIDProto/mCIDProto true weak pointers.

  3. Remove XPCWrappedNativeScope::TraceInside, avoiding the issue in comment 26.

Andrew, does that sound reasonable to you? I think it makes sense to use the same machinery for mXrayExpandos and mWaiverWrapperMap. mXrayExpandos is the more complicated (and buggy) one because it requires that explicit trace call.

Flags: needinfo?(continuation)

Another option is to add a TraceCompartment callback (and then call that where we call Realm::traceRoots). Then Gecko would use that to call XPCWrappedNativeScope::TraceInside instead of in TraceXPCGlobal.

Then we need to be careful to not leak compartments by marking things unconditionally in TraceCompartment...

I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B? For instance, it looks like SandboxOptions::sameZoneAs roots A. Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.

(In reply to Andrew McCreight [:mccr8] from comment #29)

I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B?

I was thinking about the code here: https://searchfox.org/mozilla-central/rev/8ff2cd0a27e3764d9540abdc5a66b2fb1e4e9644/dom/base/nsGlobalWindowOuter.cpp#1831

We look for a JS::Compartment* there but we don't keep one of its globals alive.

Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.

I think the problem is that we forget to trace the WeakMap and then the next GC crashes because we now have garbage in it.

but we don't keep one of its globals alive

Maybe we should?

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #31)

but we don't keep one of its globals alive

Maybe we should?

Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now.)

I can write the patch.

Flags: needinfo?(jdemooij)

(In reply to Jan de Mooij [:jandem] from comment #32)

Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now.)

Hm we should probably do that at the JS engine level in JS_NewGlobalObject so it's guaranteed to work like that everywhere.

I'm all for making XPConnect tracing simpler, but failing to realize that a compartment is alive when we have a global for it sounds like it could break in other ways. I guess this is just a failing of the particular way XPC tracing thinks the compartment is alive?

Flags: needinfo?(continuation)

On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?

(In reply to Andrew McCreight [:mccr8] from comment #36)

On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?

I think there's only a problem with the new-global code, because if a compartment has no live globals we don't reuse it anymore but when we are creating the new global and GC we're between that decision and having a live global again.

I'll post a patch to fix this in the new-global code. If it helps we can simplify the tracing code separately.

Flags: needinfo?(jdemooij)

Pushed because recent regression: https://hg.mozilla.org/integration/autoland/rev/7f3111601872c9d6e2a49a52cc57f7c7a25210e7

Leaving this open until we know more.

Keywords: leave-open
Assignee: bzbarsky → jdemooij

There's only been one crash with TraceXPCGlobal in the proto signature since Jan's patch landed, and it is a crash inside TraceProtoAndIfaceCache and not the weak map stuff, so that might be a different issue. The crash is also in a web content process, and not an extension process, which also suggests it might be different. That seems like a good sign.
bp-b295444f-6462-4542-9ca0-a01710190314

I also don't see the UncheckedUnwrapWithoutExpose signature anymore with recent Nightlies.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

Can we land a test for this?

Group: dom-core-security → core-security-release
Flags: needinfo?(jdemooij)
Flags: in-testsuite?
Keywords: leave-open
Target Milestone: --- → mozilla67

(In reply to Ryan VanderMeulen [:RyanVM] from comment #43)

Can we land a test for this?

I don't have a test for this. It's probably also not very easy to write one because you have to trigger a GC in exactly the right place.

Flags: needinfo?(jdemooij)
Flags: in-testsuite? → in-testsuite-
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

Hello Jan - one of the Firefox signatures in this bug seems to still be occurring in the recent betas - see https://bit.ly/2Z5osnl. Should I spin off a new bug for this? thanks.

Flags: needinfo?(jdemooij)

DoMarking is just a generic GC crash. We already have bug 1280186 on file for it.

Clearing NI per comment 46.

Flags: needinfo?(jdemooij)
Group: core-security-release