Closed Bug 752764 Opened 12 years ago Closed 12 years ago

document.write after page with <video> causes a CC crash [@ nsXPConnect::Traverse]

Categories

(Core :: XPConnect, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla17
Tracking Status
firefox15 --- verified
firefox16 --- verified
firefox17 --- verified
firefox-esr10 --- unaffected

People

(Reporter: jruderman, Assigned: mccr8)

References

Details

(Keywords: crash, sec-moderate, testcase, Whiteboard: [advisory-tracking-])

Crash Data

Attachments

(4 files, 4 obsolete files)

      No description provided.
Attached file stacks (breakpad, gdb)
(Breakpad opt: bp-2a98b4e2-3875-4377-b2ae-b61ae2120506)
Somehow the GetJSPrivate slot of a wrapper is null.
On Windows 7: bp-1fdd18ff-448a-4267-8704-e5aed2120508.
Crash Signature: [@ nsXPConnect::Traverse] → [@ nsXPConnect::Traverse] [@ nsXPConnect::Traverse(void*, nsCycleCollectionTraversalCallback&)]
OS: Mac OS X → All
Hardware: x86_64 → All
Blocks: 500103
No longer blocks: 752324
With the test case in comment 4, I had trouble reproducing it unless I opened and closed empty tabs.  One odd thing is that when I attempted to save the file, I ended up with this crash instead, inside nsWebBrowserPersist::SaveDocumentInternal:
https://crash-stats.mozilla.com/report/index/bp-78647826-b87a-4f42-97e6-3fce92120523
I can poke at this a little.
Assignee: nobody → continuation
Group: core-security
Depends on: 758415
I investigated a bit, and then went through my findings a bit with bholley.  He said we should mark it s-s just in case, because "this is kind of a rabbit hole".

What I found by looking at JS heap dumps is that there were a bunch of XULElement wrappers with null privates.  They have things looking like this:

// legit-looking reflector, where the underlying XPCWN's wrapper is a proxy
// (instead of to itself)
0x127a4ca00 G XULElement 11a8f1e10
> 0x127a48640 G type
> 0x127a67fb0 G shape
> 0x127a4c400 B XPCWrappedNativeProto::mJSProtoObject
> 0x127a50d40 G XPCWrappedNative::mWrapper
> 0x127a4b060 B XPCWrappedNativeScope::mGlobalJSObject
> 0x127a4d020 B XPCWrappedNativeScope::mPrototypeJSObject

// the proxy in question, which has a target object of the broken XUL element
0x127a50d40 G Proxy <no private>
> 0x127a487c0 G type
> 0x127a67830 G shape
> 0x118416a60 G targetObject
> 0x118416a60 G private

// the XUL element reflector with private 0
0x118416a60 G XULElement 0
> 0x127a1d700 G type
> 0x127a36998 G shape

Bobby looked into the code a bit.  Here's my summary.  Blame any errors on me.  This XUL stuff is a video control, and is thus some kind of XBL thing, subject to same-compartment security checks ("SOW").  This means it is subject to the "special object-plus-wrapper transplant".

Here's the code from XPCWrappedNative::ReparentWrapperIfFound:

                JSObject *wwsaved = ww;
                wrapper->SetWrapper(newwrapper);
                ww = js_TransplantObjectWithWrapper(ccx, flat, ww, newobj,
                                                    newwrapper);

What is probably happening is that the wrapper field of the new copy of the element XPCWN (0x127a4ca00 in the example above) gets set to a wrapper for the element (0x127a50d40), that is created for use as a fallback in the transplant (0x118416a60).  But the transplant finds an existing wrapper it can use instead, so the fallback 0x118416a60 is left behind in a zombie state.

Because it is still reachable, it remains alive, then the CC finds it and blows up.
comment 7 is exactly right. In plainer terms, js_TransplantObjectWithWrapper doesn't promise to use the object you give it - you have to check the return value, which we don't.

We don't because we're already holding our breath with respect to the comment preceding that bit of code. The real answer here is bug 758415.
This still crashes on a recent-ish tree with the patches in bug 758415 and bug 751995 applied.
(In reply to Andrew McCreight [:mccr8] from comment #9)
> This still crashes on a recent-ish tree with the patches in bug 758415 and
> bug 751995 applied.

Huh, really? It seemed like we'd pretty decisively figured out the problem, and the backout of the 'pretty serious brain surgery' code should have fixed it. What seems to be happening? Is it still Location?
This wasn't Location before, it was XBL SOW gunk related to video controls.  But yeah, same part of the code seemed to be the problem.  I'll look into it.
The good news is that this is better than before.  Previously on this example there were 3 or 4 XULElement wrappers with null privates, but now there's only one.

0x1188456d0 G XULElement 0
> 0x118841880 G type
> 0x1188b0d30 G shape
> 0x118895880 G Utils

The zombie XULElement is still created in XPCWrappedNative::ReparentWrapperIfFound.

However, it isn't being held alive by a proxy any more.  So I guess that is progress, potentially.

Instead, it is the fun_callscope of a ton of functions that appear to be implementing various video control stuff, and the parent of a bunch of other functions.  It is also the "item" field of one object, the "videocontrols" field of another.

The object that has the videocontrols field is a root in the GC graph:
0x118895880 G nsXPCWrappedJS[nsIDOMEventListener,0x1263abb80:0x125dbee80].mJSObj

So it looks like a JS object implementing a listener is keeping alive the zombie object.  (Possibly other things are, too.)
Attached file XULElement roots
Here's what find_roots shows is keeping alive the zombie XULElement.  The roots are all various nsXPCWrappedJS[nsIDOMEventListener].

There's this one:

via nsXPCWrappedJS[nsIDOMEventListener,0x1263abb80:0x125dbee80].mJSObj :
0x118895880 [Object <no private>]
    --[onVolumeMouseInOut]-> 0x118891a40 [Function]
    --[fun_callscope]-> 0x1188456d0 [XULElement 0]

...then a bunch of mostly parent chain things through shapes.
I accidentally chopped the last line off the file:
  --[parent]-> 0x1188456d0 [XULElement 0]
but it isn't very interesting...
So, there are live functions sitting around with an environment that is a transplanted XUL object.  It seems a little weird to me, knowing little about JS functions, that a XULElement wrapper would be an environment.  But maybe that's normal?

Assuming that is okay, then I have two guesses as to what is happening.  Luke told me that the environment for a function should be in the same compartment as it.

Thus, I see two possibilities:

a) The function has been transplanted already, but something should have removed the original copy from mWrappedJSRoots.

b) Something should have updated the environment of the function, but didn't.

I don't see anything that look like copies of either the XULElement or the the function, judging by field names, so I'm not sure what may be the case.  I can try to log the heap contents in a more fine grained way.
If the function is using the XULElement as the environment, and we transplant the XULElement, it seems likely that the function is getting left behind with the cross-compartment wrapper as its parent.

I know nothing about environments though. How do they relate to parents? Are they only for scripted functions? I know that we parent all XUL elements directly to their XUL parent, so if that also determines the environment then this would make sense...
I think the problem must be that there are direct references to the XPCWN wrapper, via the parent and environment of the functions, but js_TransplantObjectWithWrapper assumes there are no direct references ("Because there are no direct references to the location object itself, we don't want the old obj (|origobj| here) to become the new wrapper but the wrapper itself instead.").  We swap out the guts of the wrapper wrapper to turn it into a CCW, but the old references to the XPCWN wrapper are still there.

One solution to this would be to somehow make sure that the environment and parent we use ends up with the wrapper wrapper.  I don't know if making the function go through that extra layer of indirection is a problem or not.

Another solution would be to turn the XPCWN wrapper into some kind of pass-forward wrapper that just says "hey! go to the CCW over here!" but I'm not sure if that's a thing or not.
(In reply to Andrew McCreight [:mccr8] from comment #17)
> One solution to this would be to somehow make sure that the environment and
> parent we use ends up with the wrapper wrapper.  I don't know if making the
> function go through that extra layer of indirection is a problem or not.

Err, this is somewhat unclear.  What I mean is that the environment/parent would point to the wrapper wrapper from the very beginning, so you wouldn't have a dead reference later.  This would involve tracking down where those env/parents are set, and adding some kind of awful check like "Hey, are you a wrappered wrapper?  If so, use the wrapper wrapper instead!".
Ugh yeah. I _knew_ the assumption that there should be no intra-compartment references to mFlatJSObject sounded too good to be true.

I think the easiest solution is to update js_TransplantObjectWithWrapper to swap in a transparent same-compartment wrapper into |origObj| that points to |origWrapper| (which becomes the new intra-compartment identity for the object after the transplant). Then, we'll probably need to fix up GetWrappedNativeOfJSObject (and anywhere else we call GetObjectParent), and uses of environment in the JS engine, to make sure they handle that case properly. This might also fix bug 760887.
Yeah, that's what I was thinking with my second suggestion there.  It does seem like a better approach, in that we won't have to chase down ever possible way something might refer to the wrapper.

How do you make a "transparent same-compartment wrapper"?
(In reply to Andrew McCreight [:mccr8] from comment #20)
> How do you make a "transparent same-compartment wrapper"?

Just do Wrapper::New and use Wrapper::singleton as the handler. See WrapperFactory::WrapSOWObject for an example of creating a wrapper.
> I know nothing about environments though. How do they relate to parents?

Same thing.
Attached patch WIP (obsolete) — Splinter Review
Doesn't crash with test case, and I'm able to browse around a little without anything too obviously awful happening.

This just implements the first part of Bobby's suggestion, which is to turn the reflector into a DirectWrapper around the new CCW.  I'll look into the GetObjectParent stuff now.
(In reply to Bobby Holley (:bholley) from comment #19)
> Then, we'll probably need to fix up GetWrappedNativeOfJSObject (and anywhere else 
> we call GetObjectParent), and uses of environment in the JS engine, to make sure
> they handle that case properly.

At least one use in GWNOJSO looks okay to me, as it immediately unwraps what it gets:
  JSObject* funObjParent = js::UnwrapObject(js::GetObjectParent(funobj));
I'll try looking around at other uses and see how they seem.
dbaron appears to have hit this in the wild, in bug 766624.
Crash Signature: [@ nsXPConnect::Traverse] [@ nsXPConnect::Traverse(void*, nsCycleCollectionTraversalCallback&)] → [@ nsXPConnect::Traverse] [@ nsXPConnect::Traverse(void*, nsCycleCollectionTraversalCallback&)] [@ nsXPConnectParticipant::TraverseImpl]
The crash looks slightly different now:
bp-e38552ec-b8f0-4269-a867-a03bb2120621
So, my patch in bug 763773 gets rid of this crash, but only because it rips out the check that is actually crashing in this test case.  The zombie objects must still be around, so that's a little disturbing.  I'm going to wait to land that until this is fixed, but it would still be nice to have a way to check for this.

Ideally we'd have some way to check that all of these objects we expect to die when transplanting actually go away at the next GC.  My concern is that if somehow one of these things is kept alive for a few CCs due to being held onto by something else that we could get false positives, but maybe that shouldn't happen anyways?
It is possible that with the patch in bug 763773 this isn't actually a problem.  When I run the test case locally, the bogus XULElement gets cleaned up in the first CC, which suggests that it is garbage, as it should be, but is being held alive by an XPConnect root, so it can only be cleaned up with the CC.

It looks like what is happening is that a number of nsXPCWrappedJS(nsIDOMEventListener) are keeping alive the zombie XULElement.  The listeners get cleaned up in the next CC.  I don't know enough about how they work to know that this is guaranteed to happen, or if that's just particular to this test case.
Bobby is going to figure out how transplanting of XBL and plugins and other stuff is supposed to work, so I'm going to put this on hold until then.

One hacky fix (that is slightly less hacky than just adding a null check to the CC...) would be to nuke the reflector object with one of khuey's proxies that throw when you try to do things with them.
Attached patch nuke orig object (obsolete) — Splinter Review
Here's a simpler approach I discussed with Bobby.  We create a DeadObjectProxy, then swap it in for the wrapper.  That fixes the crash, and will hopefully result in a more controlled situation if we end up exposing the wrapper or holding onto it somehow.

(Bill suggested a plain object, but unfortunately it has the wrong background finalization kind, so it fails an assertion on the swap...)
Attachment #632422 - Attachment is obsolete: true
The previous version of the patch passed a try run on Linux debug.  I'll do a full try run before landing.  This is just cleaned up a bit.
Attachment #647369 - Attachment is obsolete: true
Attachment #647631 - Flags: review?(bobbyholley+bmo)
Attachment #647631 - Flags: feedback?(wmccloskey)
Comment on attachment 647631 [details] [diff] [review]
nuke wrapped reflector during transplant

Review of attachment 647631 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsapi.cpp
@@ +1671,5 @@
>  /*
> + * The location object and SOWs are special. There is the reflector object |origobj|
> + * and the reflector wrapper |origwrapper|. We turn the reflector wrapper into the
> + * new CCW because there are (probably) not any live references to the reflector.
> + * This leads to subtle differences with JS_TransplantObject.

Could you maybe expand on this comment? Ideally someone without knowledge of how xpconnect works should be able to understand it. Lots of people who read this won't know what an SOW is, or even how location objects differ from normal objects. I think you could explain that some C++ objects have an xpconnect reflector as well as a security wrapper, and since the reflector isn't expected to have any references, we turn the security wrapper into a cross-compartment wrapper.
Attachment #647631 - Flags: feedback?(wmccloskey) → feedback+
Comment on attachment 647631 [details] [diff] [review]
nuke wrapped reflector during transplant

>     {
>         AutoCompartment ac(cx, origobj);
>-        JSObject *wrapperGuts = targetobj;
>-        if (!ac.enter() || !JS_WrapObject(cx, &wrapperGuts))
>+        if (!ac.enter())
>+            return NULL;
>+
>+        // Render the reflector inert, in case something gets ahold of it.

Please explain this a little bit more.

>+        JSObject *wrapperGuts = NewDeadProxyObject(cx, JS_GetGlobalForObject(cx, origobj));
>+        if (!wrapperGuts || !origobj->swap(cx, wrapperGuts))
>+            return NULL;
>+
>+        // Turn origwrapper into a CCW to the new object.
>+        wrapperGuts = targetobj;

I might prefer to use a different variable for these two operations to avoid confusion while reading.

Looks great! r=bholley
Attachment #647631 - Flags: review?(bobbyholley+bmo) → review+
Carrying forward bholley's review. Thanks for the feedback!  I updated the comments and split out the initial usage of wrapperGuts into a new variable reflectorGuts.

If either of you have further suggestions for the comments, let me know. I'm going to do a try run here before I land it, as I don't remember if I did a full one before or not.

https://tbpl.mozilla.org/?tree=Try&rev=c39f48fa3305
Attachment #647631 - Attachment is obsolete: true
Attachment #648754 - Flags: review+
Hopefully I have not traded clarity for wit with this comment: "After the swap we have a possibly-live object that isn't dangerous, and a possibly-dangerous object that isn't live."
Blocks: 763773
I updated the patch a little to root reflectorGuts.

https://hg.mozilla.org/integration/mozilla-inbound/rev/61ad814fab44
Backed out as part of the mass tree revert due to bustage caused by other landings:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c801b99d726f

Once the tree is open again, this can reland :-)
https://hg.mozilla.org/mozilla-central/rev/0a48f2d0e2a0

Should this have a test?
Status: NEW → RESOLVED
Closed: 12 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
Target Milestone: --- → mozilla17
There are a handful of null deref crashes on Nightly in WrapperIsNotMainThreadOnly(), just like the test case here, so it seems like this could be an actual problem in the wild. Of course, I have deleted this code in 17 anyways in bug 763773, so the crash at least is fixed either way.
In the last 28 days, there were about 10 of these WrapperIsNotMainThreadOnly() crashes, but none since 8/4, which is the build before this patch landed, so it seems likely this patch fixed those crashes. Before my other patch fixed it permanently. ;)
Comment on attachment 648754 [details] [diff] [review]
nuke wrapped reflector during transplant

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 650353
User impact if declined: some crashes, possible security problems
Testing completed (on m-c, etc.): it has been on m-c for a few days
Risk to taking this patch (and alternatives if risky): should be pretty low. We replace an object in a weird scary state with one in a known clear state, so we're probably only improving things. The crashes could be fixed instead by adding just a null check, but that wouldn't fix any lurking security issues we're not sure about.
String or UUID changes made by this patch: none
Attachment #648754 - Flags: approval-mozilla-beta?
Attachment #648754 - Flags: approval-mozilla-aurora?
I'm going to put this as sec-moderate.  We don't know of anything particularly bad that results from this aside from crashes, but there could be.
Comment on attachment 648754 [details] [diff] [review]
nuke wrapped reflector during transplant

Not a high volume of crashes, but seems low risk enough that we can take this on Beta.
Attachment #648754 - Flags: approval-mozilla-beta?
Attachment #648754 - Flags: approval-mozilla-beta+
Attachment #648754 - Flags: approval-mozilla-aurora?
Attachment #648754 - Flags: approval-mozilla-aurora+
Keywords: verifyme
Whiteboard: [advisory-tracking-]
I'm unable to reproduce this crash with the attached testcase using 2012-05-07 mozilla-central builds (tried opt and debug). Can someone give me some tips as to how I might reproduce this?
Keywords: verifyme
Whiteboard: [advisory-tracking-] → [advisory-tracking-][qa?]
I was able to reproduce it on the 5-7 build on OSX by reloading the page three or four times.
Thanks Andrews, I got it that time. Confirmed reproducible with Firefox 15.0a1 2012-05-06.

Verified fixed with:
 * 2012-08-24 Firefox 15 Beta Debug
 * 2012-08-24 Firefox 16 Aurora Debug
 * 2012-08-24 Firefox 17 Nightly Debug
Status: RESOLVED → VERIFIED
QA Contact: anthony.s.hughes
Whiteboard: [advisory-tracking-][qa?] → [advisory-tracking-]
Group: core-security
(In reply to Bobby Holley (:bholley) from comment #19)
> Ugh yeah. I _knew_ the assumption that there should be no intra-compartment
> references to mFlatJSObject sounded too good to be true.

FWIW, this entire situation is now much clearer to me. we put SOWs (same-compartment security wrappers) around native anonymous content, which generally involve XUL elements in a content HTML document. XUL elements are parented directly to their parent element (as opposed to their document, as is the case for HTML), so the parent chain of the anonymous content eventually hits the SOW-ed object, which violates the (faulty) assumption that there are no direct intra-compartment references to SCSW-ed objects.
So the issue here is that JS_DescribeScriptedCaller returns false when it finds
no JS code on the stack, which we were incorrectly interpretting as an
exception.

The interesting thing here though is that there _should_ be JS code on the
stack. What's actually happening is that GetContextFromStack skips over the
top JS context (the sandbox context) because it isn't associated with a DOM
window. This seems totally wrong IMO, and I'm going to file another bug to
fix it.
Attachment #671423 - Flags: review?(bzbarsky)
Comment on attachment 671423 [details] [diff] [review]
Properly handle |false| return false from JS_DescribeScriptedCaller. v1

Arg, wrong bug number. Please ignore.
Attachment #671423 - Attachment is obsolete: true
Attachment #671423 - Flags: review?(bzbarsky)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: