If nothing else comes up, this appears to be the very last cpg blocker.
This crash happens across the board on native android only. I have no idea how to reproduce it other than pushing to try, so I'm really hoping we can get an idea of what's going on by inspection.
The crash stack is here: http://pastebin.mozilla.org/1588494
The component loader does some script caching, so I was really excited for a minute that this was just the same thing as bug 743034, and that JS_ExecuteScriptVersion was bypassing the fix from that bug. But it looks like JS_ExecuteScriptVersion just calls through to JS_ExecuteScript (indeed, it's in the stack). Darn. :-(
Still, I wouldn't be surprised if it had something to do with the script caching. I'm going to do a try push with the caching disabled in order to test my theory.
Luke, any guesses from the stack?
Here's an experimental push with the script caching disabled: https://tbpl.mozilla.org/?tree=Try&rev=7eb7008dd29d
(In reply to Bobby Holley (:bholley) from comment #1)
> Here's an experimental push with the script caching disabled:
Hm, didn't help. So I guess it's not the script caching. Thoughts, Luke?
It's a compartment mismatch. The crash:
Crash reason: SIGSEGV
Crash address: 0x30
comes from http://mxr.mozilla.org/mozilla-central/source/js/src/vm/Stack-inl.h#638 (which is inlined via GetNameFromBytecode into PropertyCache::fullTest.
When we've seen these in the past, it was simply that the call site had not entered the correct compartment. However, this is called from mozJSComponentLoader::GlobalForLocation which I assume you've scoured.
The crash is surely not specific to ARM/Android; perhaps you could ask the mobile people if there is a way to run the test harness on your desktop?
Yeah, I've scoured it.
Given that the android reftests are run in release mode, I realize now that the normal compartment-checking machinery isn't at work here. So I've sprinkled some release-mode asserts down the ExecuteScript pipe to see if we can catch where things go wrong: https://tbpl.mozilla.org/?tree=Try&rev=28b661be33da
Build bustage on the last one - updated one is here: https://tbpl.mozilla.org/?tree=Try&rev=8f406df56d56
This seems to indicate that the issue is that |cx| and |obj| (|global| in mozJSComponent loader) are not in the same compartment, which is bizarre since the calling function enters the compartment of |global| a bit above.
The only thing I can think of here is that one of the functions called between ac.enter(..) and JS_ExecuteScriptVersion(..) leaves cx in a different compartment than it started.
So I'm going to sprinkle some more release-mode asserts through there to see if I can pin it down.
Ok, while sprinkling in these asserts, I found something interesting - JS_ExecuteScriptVersion leaves the content in a different compartment than it started (and, in particular, a different compartment than the global). This appears to be an interaction between js::AutoPreserveCompartment (only used once in gecko, in GlobalForLocation), and recursive importing. Here's how it appears to go down.
We call GlobalForLocation with cx->compartment == NULL. We then instantiate a preserver, stashing NULL. We then create a global A_g, and enter associated compartment A_c. We then call JS_ExecuteScriptVersion.
JS_ExecuteScriptVersion re-enters. Compartment is A_c, so we stash that with the preserver. We then create a global B_g, and enter its compartment B_c. We then call JS_ExecuteScriptVersion. Before and after the call, we're in B_c.
We then return from GlobalForLocation. The compartment preserver has A_c stashed, so it applies compartment A_c.
We then return our way up through XPConnect and JS. Finally, we end up in the tail end of js::ExecuteKernel, just before we return from our JS_ExecuteScriptVersion call. This triggers the destructor of our ExecuteFrameGuard, which calls ContextStack::popFrame. This ends up doing cx_->resetCompartment(), which for some reason decides to apply B_c.
So we leave the outer call to JS_ExecuteScriptVersion with our cx reset to the compartment of the inner one. This seems bad.
It's not really clear to me how cx_->resetCompartment() works, but it's almost certainly ending up with the wrong answer. Given that this is the only place in gecko where we use js::AutoCompartmentPreserver, I'd imagine it's got at least something to do with it.
Luke, can you drop some knowledge here?
resetCompartment() just makes sure that, when you pop a frame, cx->compartment is left matching the compartment of the "current" scope chain. What is the current scope chain? If there is a frame pushed, it's just fp->scopeChain. The weird behavior I believe we are hitting here is that, if no frames are pushed, the current scope chain is cx->globalObject. (I just noticed resetCompartment() is called redundantly by AutoCompartment::leave() (we can probably remove that).)
So what can happen (and this has bit Blake and I before) is that you start with cx->globalObject = G1, call into JS, cx->globalObject changes to G2 in some new compartment, and then when JS returns, cx->compartment is (rightly, but bizarrely) left in G2->compartment.
I think that is exactly what is happening here: see http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/loader/mozJSComponentLoader.cpp#673. This means that, assuming the reentrance happens on the same 'cx', the inner GlobalForLocation is clobbering cx->globalObject, changing A_g to B_g.
So I think the bug is that doing JS_SetContext(cx, nsnull) in mozJSComponentLoader is wrong.
Actually, my conclusion in comment 7 disagrees with the preceding paragraphs. Basically, cx->globalObject has no meaning; you can never assume you know what compartment it is in. That is why we must kill it (bug 604813 which, you may noticed, is transitively blocked by cpg ;). Thus, we must *expect* cx->globalObject to change at any time.
So, I think you were right, AutoPreserveCompartment is at fault: it blindly changes cx->compartment, completely ignoring the fact that cx->compartment must match the current scope which, because of the cx->globalObject crap, can have changed in a non-LIFO manner!
There is probably a reason there is only one use left. Let's kill this foot-bazooka.
Hm, so removing the AutoPreserveCompartment didn't fix my (local) mismatch, but removing JS_SetGlobalObject(cx, nsnull), presumably what luke meant by JS_SetContext(cx, nsnull) in comment 7, did.
So I've pushed the latter to try to see if it fixes the android issue: https://tbpl.mozilla.org/?tree=Try&rev=98484de4ff73
Created attachment 618628 [details] [diff] [review]
Remove call to JS_SetGlobalObject in mozJSComponentLoader. v1
It does! Attaching a patch, flagging luke for review.
Comment on attachment 618628 [details] [diff] [review]
Remove call to JS_SetGlobalObject in mozJSComponentLoader. v1
Less JS_SetGlobalObject is better so r+ regardless, but I still think AutoPreserveCompartment needs to die because it is still a hazard if cx->globalObject changes inside JS_ExecuteScriptVersion. If you wouldn't mind, could you push a patch to try (on top of this one) that also removes AutoPreserveCompartment? Otherwise, iiuc, this is a timebomb (well, until bug 604813 :).
I'm hell-bent on landing CPG and not taking any unnecessary risks, so if the AutoPreserveCompartment stuff doesn't block the landing I'll do it afterwards (but right afterwards!). Split this off into bug 749173.
Pushed this (on its own) to try here: https://tbpl.mozilla.org/?tree=Try&rev=1370c67d294d
This looks good on try on its own, and fixes the android crashes with cpg. Pushed to inbound: