Closed Bug 1530364 Opened 7 years ago Closed 7 years ago

Crash in [@ mozilla::HashMapEntry<T>::~HashMapEntry]

Categories

(Core :: XPConnect, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla67
Tracking Status
thunderbird_esr60 --- unaffected
firefox-esr60 --- unaffected
firefox65 --- unaffected
firefox66 --- unaffected
firefox67 + fixed

People

(Reporter: marcia, Assigned: jandem)

References

Details

(4 keywords, Whiteboard: [post-critsmash-triage])

Crash Data

Attachments

(1 file)

This bug is for crash report bp-3ac7e79e-d7c6-4000-b787-f3fb10190223.

Seen while looking at nightly crash stats: Windows and Mac crashes started with build 20190223041557: https://bit.ly/2H3vVwv. 5 crashes/6 installs so far.

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=98c743692a8494dbd609abe2f96b8297f777cad6&tochange=492e4409f468f4841def795627356148d0bb7347

Top 10 frames of crashing thread:

0 XUL mozilla::HashMapEntry<js::HeapPtr<JSObject*>, js::HeapPtr<JSObject*> >::~HashMapEntry js/public/HeapAPI.h:155
1 XUL <name omitted> mfbt/HashTable.h:100
2 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/public/Utility.h:537
3 XUL XPCWrappedNativeScope::~XPCWrappedNativeScope js/xpconnect/src/XPCWrappedNativeScope.cpp:271
4 XUL XPCJSRuntime::FinalizeCallback js/xpconnect/src/XPCWrappedNativeScope.cpp:416
5 XUL js::gc::GCRuntime::endSweepingSweepGroup js/src/gc/GC.cpp:1808
6 XUL sweepaction::SweepActionSequence<js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6271
7 XUL sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter, JSRuntime*, js::gc::GCRuntime*, js::FreeOp*, js::SliceBudget&>::run js/src/gc/GC.cpp:6331
8 XUL js::gc::GCRuntime::incrementalSlice js/src/gc/GC.cpp:6503
9 XUL js::gc::GCRuntime::gcCycle js/src/gc/GC.cpp:7373

OS X 10.14 has the largest number of crashes so far (13). This also affects Fennec. One comment reads: "Don't remember what was here. Just noticed this, perhaps happened while idle overnight?"

It looks like this is a bunch of null derefs while calling WeakMapPtr::destroy() from ~XPCWrappedNativeScope. Does this look like the other issue with null keys in weak maps, Jon?

Flags: needinfo?(jcoppeard)

This is also not a very good signature. This should be added to the prefix list.

(In reply to Andrew McCreight [:mccr8] from comment #3)

This is also not a very good signature. This should be added to the prefix list.

I filed Bug 1533011 for this. This crash continues to happen regularly on 67 nightly, with a range of 2-8 crashes per build.

Bug 1531035 is what I was thinking of in comment 2, but that has landed and it doesn't look like it fixed anything.

Boris, maybe this is a regression from bug 1523843? That's in the regression range in comment 0. Bug 1528146 talks about web extensions, and I see what feels like an unusually large number of moz-extension:// URLs in the crashes.

This crash seems to have a decent stack: bp-1862e857-b1be-412d-aaa7-d03430190305

We're reaching this hash table destruction via a call to JS::WeakMapPtr<JSObject*, JSObject*>::destroy(), so maybe this is about mXrayExpandos?

Nominating for tracking because this is a newly introduced null deref crash that is happening regularly, though it doesn't look like a top crash.

Flags: needinfo?(jcoppeard) → needinfo?(bzbarsky)
Component: JavaScript Engine → XPConnect

Boris, maybe this is a regression from bug 1523843?

It's possible, yes.

Bug 1528146 talks about web extensions

Note that bug 1528146 is unrelated to bug 1523843.

so maybe this is about mXrayExpandos?

Maybe.

So all the crashes I see are on 64-bit (including all the Windows crashes, which is a little odd), and all at 0x10. That suggests to me that the HashMapEntry being destroyed is actually null. Which suggests that in destroyTable slot.isLive() is wrong or something.

One thing we could try doing is temporarily making the test at https://searchfox.org/mozilla-central/rev/3e0f1d95fcf8832413457e3bec802113bdd1f8e8/dom/base/nsGlobalWindowOuter.cpp#1850 always test false on nightly and seeing whether that makes this crash go away in the builds with that change. That would pretty definitively tell us whether it's related to the compartment sharing bits.

Flags: needinfo?(bzbarsky)

I'm going to do the disabling in bug 1533103 and we will see how that affects this bug.

Depends on: 1533103
Flags: needinfo?(bzbarsky)

OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.

Question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.

I wonder if we should be clearing the map here:

https://searchfox.org/mozilla-central/rev/fbb251448feb7276f9b1d0a88f9c0cb1cd144ce4/js/xpconnect/src/XPCWrappedNativeScope.cpp#364-366

Group: javascript-core-security

That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...

Group: javascript-core-security → dom-core-security

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #10)

That seems worth trying... I've already scheduled bug 1533103 to autoland, but we could either try to back that out immediately and try the idea from comment 9, or leave it in for a few days, then back out and try comment 9...

Bug 1533103 is probably worth trying anyway, but note that you will run into bug 1530608 test failures :/

Oh. Ugh, right. :(

Jon, any thoughts on comment 9? Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?

XPCWrappedNativeScope is moved to the "dying scopes" list when there are no live globals in the compartment the scope belongs to. But (why) is there still an entry in mXrayExpandos when we destroy the dying scopes (at the end of GC)?

One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.

Flags: needinfo?(jcoppeard)

(In reply to Jan de Mooij [:jandem] from comment #13)

One possibility is transplanted objects? I think TransplantObjectRetainingXrayExpandos does the right thing but there's also the TransplantObject call in SetNewDocument.

That's probably fine because it will get swapped with a CCW. Here are some things we could release-assert:

  1. mXrayExpandos is empty when we destroy the wrapped-native-scope
  2. mXrayExpandos contains only same-compartment (with the scope) keys and values
  3. mXrayExpandos does not contain any live objects when we move the scope to the dying scopes list

I can write a patch tomorrow...

I filed bug 1533302 to simplify this code some. I'm hopeful that will fix this bug or at least make it a lot easier to reason about.

Flags: needinfo?(jcoppeard)

(In reply to Jan de Mooij [:jandem] from comment #9)

OK so it looks like we have a value (or key?) in mXrayExpandos with a nullptr zone. We crash when checking if the zone needs an incremental barrier, likely when destroying the js::HeapPtr<>.

The arena's zone pointer is set to nullptr when the arena is freed, so this is probably UAF.

Question then is how this can happen. We trace mXrayExpandos in xpc::TraceXPCGlobal => XPCWrappedNativeScope::TraceInside if the global's compartment has a scope.

I wonder if we should be clearing the map here:

https://searchfox.org/mozilla-central/rev/fbb251448feb7276f9b1d0a88f9c0cb1cd144ce4/js/xpconnect/src/XPCWrappedNativeScope.cpp#364-366

That sounds like a good idea to me.

(In reply to Jan de Mooij [:jandem] from comment #13)

Also how do WeakMaps and Zones relate exactly in this case, especially when it comes to sweeping?

A WeakMap is associated with a Zone and is swept when that Zone is swept. Not sure if that answers your question...

See Also: → 1532090
Assignee: nobody → bzbarsky
See Also: 1532090

Boris pointed out that the crash seems to have gone away in the last two Nightlies, even before his backout landed. Very peculiar.

So this crash does seem to have gone away before I checked in bug 1533103. The crash in bug 1532090, on the other hand, went away after the commit for bug 1533103. I've landed a backout of bug 1533103 in bug 1533105 and we'll see how that affects the stats...

I'm pretty sure bug 1533302 will improve or affect this: we now destroy the map in UpdateWeakPointersAfterGC when there are no live globals and we release-assert in XPCWrappedNativeScope's destructor that the map is destroyed (not initialized).

UncheckedUnwrapWithoutExpose is clearly another variant of this crash: lots of moz-extension URLs, TraceXPCGlobal appears in the stack.

I did a search for those two things, and it is more common than the DoMarking variant: https://crash-stats.mozilla.com/search/?proto_signature=~TraceXPCGlobal&url=%5Emoz-extension&product=Firefox&version=67.0a1&date=%3E%3D2019-03-04T18%3A18%3A00.000Z&date=%3C2019-03-11T18%3A18%3A00.000Z&_facets=signature&_sort=-build_id&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

[@ js::WeakMap<T>::markIteratively ] looks like it is also another variant.

Crash Signature: [@ mozilla::HashMapEntry<T>::~HashMapEntry] → [@ mozilla::HashMapEntry<T>::~HashMapEntry] [@ js::UncheckedUnwrapWithoutExpose] [@ DoMarking<T> ] [@ js::WeakMap<T>::markIteratively ]

So I'm trying to figure out where we stand.

This bug as filed stopped reproducing for no reason we can see, right?

Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?

Bug 1532090 predates all this stuff, so not much we can say there based on frequency.

Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.

Is that all correct?

Flags: needinfo?(bzbarsky) → needinfo?(continuation)

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #24)

This bug as filed stopped reproducing for no reason we can see, right?

Correct.

Bug 1530672 stopped reproducing after we disabled compartment sharing, then started reproducing when we re-enabled it, and bug 1533302 did not help it, correct?

Yes, I think that's right.

Bug 1532090 predates all this stuff, so not much we can say there based on frequency.

I looked at all of the crashes with this signature that we have, and the ones with TraceXPCGlobal in the stacks seem to follow the pattern of the signature from bug 1530672. In other words, they went away when it was disabled, and they are still happening on builds with bug 1533302, for instance: bp-aa313f6d-a38b-45ed-a79c-014b50190311

Bug 1530672 predates this stuff too, though may have gotten more frequent with 67, maybe.

Again, if you look at only the signatures with TraceXPCGlobal in the proto stack, it seems to be consistent with the other signatures.

Is that all correct?

Yes.

Looking at my query from comment 11, the first crash is in the 20190223213931 build, which is one build after bug 1523843 landed. There were 0 crashes on the 3-7 builds, which I think is when compartment sharing was temporarily disabled. And yes, bug 1533302 doesn't seem to have fixed that.

Flags: needinfo?(continuation)

For the other bug, I wonder about this case:

  1. Compartment has a global A.
  2. We attempt to create a new global B in A's compartment.
  3. We start a GC as part of creating B, collecting A.

Then (a) we didn't call TraceXPCGlobal, because A is dead and we're still creating B, so we don't trace the expando map; and (b) in XPCWrappedNativeScope::UpdateWeakPointersAfterGC we could see a live global at this point (if we're doing an incremental GC, for example), so we don't destroy the map.

I'm not sure what the best fix is to prevent this.

One option is:

  1. Model mXrayExpandos after mWaiverWrapperMap; then we only have to call Sweep on it in UpdateWeakPointersAfterGC, and that's straightforward. The mXrayExpandos tracing code in TraceXPCGlobal goes away.

  2. Make mIDProto/mIIDProto/mCIDProto true weak pointers.

  3. Remove XPCWrappedNativeScope::TraceInside, avoiding the issue in comment 26.

Andrew, does that sound reasonable to you? I think it makes sense to use the same machinery for mXrayExpandos and mWaiverWrapperMap. mXrayExpandos is the more complicated (and buggy) one because it requires that explicit trace call.

Flags: needinfo?(continuation)

Another option is to add a TraceCompartment callback (and then call that where we call Realm::traceRoots). Then Gecko would use that to call XPCWrappedNativeScope::TraceInside instead of in TraceXPCGlobal.

Then we need to be careful to not leak compartments by marking things unconditionally in TraceCompartment...

I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B? For instance, it looks like SandboxOptions::sameZoneAs roots A. Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.

(In reply to Andrew McCreight [:mccr8] from comment #29)

I'm not too familiar with the code involved here, but the situation in comment 26 sounds like it would be a failure of the rooting analysis. Shouldn't the global of A be held alive while we're creating global B?

I was thinking about the code here: https://searchfox.org/mozilla-central/rev/8ff2cd0a27e3764d9540abdc5a66b2fb1e4e9644/dom/base/nsGlobalWindowOuter.cpp#1831

We look for a JS::Compartment* there but we don't keep one of its globals alive.

Also, for the crashes I've seen the GC is running off an idle callback, so a GC running in the middle of creating a global maybe can't be an issue? It is possible I'm missing something here.

I think the problem is that we forget to trace the WeakMap and then the next GC crashes because we now have garbage in it.

but we don't keep one of its globals alive

Maybe we should?

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #31)

but we don't keep one of its globals alive

Maybe we should?

Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now.)

I can write the patch.

Flags: needinfo?(jdemooij)

(In reply to Jan de Mooij [:jandem] from comment #32)

Adding a RootedObject with a GetFirstGlobalInCompartment call should be pretty easy and might be nice to get crash-stats feedback. (I didn't suggest it before because it felt like a workaround, but maybe it is the right thing to do for now.)

Hm we should probably do that at the JS engine level in JS_NewGlobalObject so it's guaranteed to work like that everywhere.

I'm all for making XPConnect tracing simpler, but failing to realize that a compartment is alive when we have a global for it sounds like it could break in other ways. I guess this is just a failing of the particular way XPC tracing thinks the compartment is alive?

Flags: needinfo?(continuation)

On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?

(In reply to Andrew McCreight [:mccr8] from comment #36)

On the other hand, I'm not sure exactly what sort of stack rooting analysis we could use to figure out that every compartment we refer to on the stack has an object pointer somewhere that is rooted, so maybe avoiding requiring that is better if we can?

I think there's only a problem with the new-global code, because if a compartment has no live globals we don't reuse it anymore but when we are creating the new global and GC we're between that decision and having a live global again.

I'll post a patch to fix this in the new-global code. If it helps we can simplify the tracing code separately.

Flags: needinfo?(jdemooij)

Pushed because recent regression: https://hg.mozilla.org/integration/autoland/rev/7f3111601872c9d6e2a49a52cc57f7c7a25210e7

Leaving this open until we know more.

Keywords: leave-open
Assignee: bzbarsky → jdemooij

There's only been one crash with TraceXPCGlobal in the proto signature since Jan's patch landed, and it is a crash inside TraceProtoAndIfaceCache and not the weak map stuff, so that might be a different issue. The crash is also in a web content process, and not an extension process, which also suggests it might be different. That seems like a good sign.
bp-b295444f-6462-4542-9ca0-a01710190314

I also don't see the UncheckedUnwrapWithoutExpose signature anymore with recent Nightlies.

Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED

Can we land a test for this?

Group: dom-core-security → core-security-release
Flags: needinfo?(jdemooij)
Flags: in-testsuite?
Keywords: leave-open
Target Milestone: --- → mozilla67

(In reply to Ryan VanderMeulen [:RyanVM] from comment #43)

Can we land a test for this?

I don't have a test for this. It's probably also not very easy to write one because you have to trigger a GC in exactly the right place.

Flags: needinfo?(jdemooij)
Flags: in-testsuite? → in-testsuite-
Flags: qe-verify-
Whiteboard: [post-critsmash-triage]

Hello Jan - one of the Firefox signatures in this bug seems to still be occurring in the recent betas - see https://bit.ly/2Z5osnl. Should I spin off a new bug for this? thanks.

Flags: needinfo?(jdemooij)

DoMarking is just a generic GC crash. We already have bug 1280186 on file for it.

Clearing NI per comment 46.

Flags: needinfo?(jdemooij)
Group: core-security-release