Closed Bug 883450 Opened 11 years ago Closed 11 years ago

Conservative Stack Scanner doesn't work when not in a request

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla24
Tracking Status
firefox24 + fixed
firefox-esr17 --- wontfix
b2g18 --- wontfix
b2g-v1.1hd --- wontfix
b2g-v1.2 --- unaffected

People

(Reporter: bholley, Assigned: bholley)

References

Details

(Keywords: sec-critical, Whiteboard: [Embargo until ESR17 EOL][adv-main24+])

Attachments

(6 files)

I've been debugging bug 879649 for a good day and a half now, and finally figured out what's going on.

In bug MarkConservativeStackRoots, we bail out of stack scanning if !cgcd->hasStackToScan():

http://mxr.mozilla.org/mozilla-central/source/js/src/gc/RootMarking.cpp#316

This predicate just evaluates to !!nativeStackTop - that is to say, whether we have a usable pointer to the top of the stack. This pointer is generated in RecordNativeStackTopForGC, but is then _cleared_ in updateForRequestEnd(), which gets called when our request count drops to zero:

http://mxr.mozilla.org/mozilla-central/source/js/src/jsapi.cpp#1230

And though RecordNativeStackTopForGC is called at the beginning of Collect(), it no-ops because the request depth is zero:

http://mxr.mozilla.org/mozilla-central/source/js/src/jsgc.cpp#963

This means that _anything_ on the stack outside of a request isn't rooted at all. I wouldn't be surprised if there were dozens of such cases in the browser - especially from consumers of nsIScriptContext functions, who don't necessarily have a need to enter a request themselves.

Marking sec-critical, because the surface is wide enough that there's almost certainly an exploitable case somewhere.
I think we do sort of depend on being able to skip stack scanning during timer GCs. The stack scanner has a pretty high false positive rate on 32-bit platforms, and I'm afraid we'd leak a lot if we always scanned.

However, we could probably find a better way to decide whether to skip stack scanning. Rather than "are we in a request", maybe we could skip it if we're invoked from a timer GC when we're not in a nested event loop. We also have to watch out for schedulePreciseGC.

One question I have is how this is happening. My assumption was always that:
1. We have to be doing some JSAPI stuff in order to GC.
2. Almost all JSAPI calls assert that you're in a request.
Which one of these is broken in this case?
Since when do we require a stack scan when we are not in a request?
Only <20% of all our GCs do a stack scan last time I measured.
(In reply to Gregor Wagner [:gwagner] from comment #2)
> Since when do we require a stack scan when we are not in a request?
> Only <20% of all our GCs do a stack scan last time I measured.

We don't scan the stack when we're not in a request. Bobby filed the bug because he found a case where that caused us to fail to scan something that needed to be scanned. We could just put a request there, but he's worried that it's a widespread problem.
I think the particular case was this:
  http://mxr.mozilla.org/mozilla-central/source/content/xul/content/src/nsXULElement.cpp#2575

This method has a JSScript pointer that is null, which calls into  nsJSContext::CompileScript, which does an AutoRequest, sets the script, then exits the request, so we're in danger, returns the script, then roots the script on the heap. The particular problem he was seeing was that a Pop(), which happened in between the end of the AutoRequest and the return from CompileScript, was triggering a GC.
I guess we could dynamically detect this by adding an assertion to JS::Rooted that we're in a request?  Then, if we're properly precisely rooted, we should be properly dynamically rooted, for paths we test.  (Of course, there is a certain irony here.)
Bobby, do you have a stack or anything? I don't see where Pop() can GC outside of a request, but maybe I got lost following all the indirections.
nsCxPusher embeds a JSAutoRequest, but destroys it (manually) before calling xpc::PopContext:

http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/nsCxPusher.cpp#168

xpc::PopContext invokes JS_RestoreFrameChain, which can GC (both according to evidence and to mrgiggles).

I also don't think that adding more JSAutoRequests is the right solution to this problem, especially since the Request API is on its way out anyway.
(note that the issue in bug 879649 was that _adding_ an AutoPushJSContext (which wraps an nsCxPusher) to nsJSContext::CompileScript caused a ton of crashes in the wild, even though we're incorrect without it).
> xpc::PopContext invokes JS_RestoreFrameChain, which can GC (both according to evidence
> and to mrgiggles).

Are you talking about this call?
  http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCJSContextStack.cpp#52
What's confusing is that the line before it opens a request. So the GC should happen inside a request and we should scan the stack, right?
Also, JS_RestoreFrameChain does assert that we're in a request.
(In reply to Bill McCloskey (:billm) from comment #9)
> > xpc::PopContext invokes JS_RestoreFrameChain, which can GC (both according to evidence
> > and to mrgiggles).
> 
> Are you talking about this call?
>  
> http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/
> XPCJSContextStack.cpp#52
> What's confusing is that the line before it opens a request. So the GC
> should happen inside a request and we should scan the stack, right?

(In reply to Bill McCloskey (:billm) from comment #10)
> Also, JS_RestoreFrameChain does assert that we're in a request.

Oh, good point. However, shortly after calling PopRequest, we call ScriptEvaluted:

http://mxr.mozilla.org/mozilla-central/source/js/xpconnect/src/nsCxPusher.cpp#186

Whose first action in office is to call JS_MaybeGC():

http://mxr.mozilla.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#2333

So that's probably the cause of the GC we're seeing here.
OK. Direct calls to GC are one of the cases where we don't assert that you're in a request, so that makes sense.

I'm kind of tempted to say that we should just spot-fix this. This problem will go away when exact rooting lands. My sense is that that will be ready roughly at the same time as getting rid of requests. The alternative is to do something like comment 1 suggests.
(In reply to Bill McCloskey (:billm) from comment #12) 
> I'm kind of tempted to say that we should just spot-fix this. This problem
> will go away when exact rooting lands.

Do you mean spot fix this one case we found? If so, I'm not really comfortable with that. If you mean "sprinkle in JSAutoRequests in the right places," I think that's the correct solution, assuming we can implement strong dynamic enforcement.

I think the right approach is to assert that we're in a request every time we create a Rooted<T>. The static analysis has pointed us to all the places where we want to keep a GCThing alive across GC potential GC calls. For most of those, we should already be in a request. In theory, the only failures that crop up should be the hazards we're trying to fix here. I'll throw in some assertions and see what happens.
Assignee: general → bobbyholley+bmo
Blocks: 884109
Ah, so... CallbackObject _used_ to have a request there until it got removed in bug 868130.  Note that the comments were not updated in the process; please update them here?
Depends on: 884362
Green! Uploading patches and flagging for review.
There's not a lot we can do without it in the browser these days.
Attachment #765108 - Flags: review?(gkrizsanits)
Attachment #765112 - Flags: review?(gkrizsanits)
Comment on attachment 765113 [details] [diff] [review]
Part 6 - Assert that we're in a request whenever we create a Rooted<T>. v1

Er, this.
Attachment #765113 - Flags: review?(gkrizsanits) → review?(terrence)
Comment on attachment 765111 [details] [diff] [review]
Part 4 - Re-order stuff in CallSetup so that we construct the RootedObject after the Push. v1

>     // We construct our JS::Rooted right after our JSAutoRequest; let's just

Please do fix the comment; see comment 17.

r=me with that.
Attachment #765111 - Flags: review?(bzbarsky) → review+
(In reply to Bobby Holley (:bholley) from comment #0)

> especially from consumers of nsIScriptContext functions

Or mCx, or GetSafeJSContext... Based on some mxr search we have quite some cases to worry about. So the idea here is we assert and rely on our test coverage, and fix only the cases that pops up on try with this assert, right? I have to say that it's not a very relaxing approach. Anyway, I'll finish up the review of the patches soon, just was wondering if it makes sense to try and do some major cleaning up on this department (by manually cherry-picking the suspicious looking cx usages) or this whole thing is going away soon, and this is a good enough solution for the time being?
Comment on attachment 765108 [details] [diff] [review]
Part 1 - MOZ_CRASH if we fail to spin up the SafeJSContext, and remove error handling. v1

Review of attachment 765108 [details] [diff] [review]:
-----------------------------------------------------------------

Shouldn't we rename it to SafeJSContext now that it cannot fail? (at some point in the future...)
Attachment #765108 - Flags: review?(gkrizsanits) → review+
Attachment #765109 - Flags: review?(gkrizsanits) → review+
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #29)
> Or mCx, or GetSafeJSContext... Based on some mxr search we have quite some
> cases to worry about. So the idea here is we assert and rely on our test
> coverage, and fix only the cases that pops up on try with this assert right?

Well, it's better than that. We're (more or less) green on the sixgill exact rooting static analysis. The means we can statically prove that we use Rooted/Handle for every stack value that we need to keep alive across calls that can GC. So as long as we at least call through these functions once, the assertions will tell us the right answer.
Attachment #765110 - Flags: review?(gkrizsanits) → review+
(In reply to Bobby Holley (:bholley) from comment #31)
> Well, it's better than that. We're (more or less) green on the sixgill exact
> rooting static analysis. The means we can statically prove that we use
> Rooted/Handle for every stack value that we need to keep alive across calls
> that can GC. So as long as we at least call through these functions once,
> the assertions will tell us the right answer.

What ever the sixgill exact rooting static analysis exactly does under the hood, I already feel a bit more relaxed :)
Comment on attachment 765112 [details] [diff] [review]
Part 5 - Miscellaneous requests. v1

Review of attachment 765112 [details] [diff] [review]:
-----------------------------------------------------------------

AutoJSContext cx;
     nsresult rv;
+    JSAutoRequest ar(cx);

why?
Attachment #765112 - Flags: review?(gkrizsanits) → review+
Comment on attachment 765113 [details] [diff] [review]
Part 6 - Assert that we're in a request whenever we create a Rooted<T>. v1

Review of attachment 765113 [details] [diff] [review]:
-----------------------------------------------------------------

r=me
Attachment #765113 - Flags: review?(terrence) → review+
Bobby, what's the backporting story this these patches?
(In reply to Andrew McCreight [:mccr8] from comment #37)
> Bobby, what's the backporting story this these patches?

We're unlikely to be able to fix this solidly on all branches, because the assertions aren't backportable (they're only meaningful on branches where we're more or less exactly rooted, which is just trunk at the moment).

There are also a lot of other things that have changed in this department (bug 834732, bug 868130, etc), that will make any kind of meaningful backport hard.

I recommend backporting Part 5 and embargoing this bug until ESR17 EOL.
Whiteboard: [Embargo until ESR17 EOL]
Whiteboard: [Embargo until ESR17 EOL] → [Embargo until ESR17 EOL][adv-main24+]
(In reply to Bobby Holley (:bholley) from comment #38)
> I recommend backporting Part 5 and embargoing this bug until ESR17 EOL.

Still recommended for b2g18? We still want to fix sec-crits there if possible.
Flags: needinfo?(bobbyholley+bmo)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #39)
> (In reply to Bobby Holley (:bholley) from comment #38)
> > I recommend backporting Part 5 and embargoing this bug until ESR17 EOL.
> 
> Still recommended for b2g18? We still want to fix sec-crits there if
> possible.

Yeah, we should probably at least try to backport Part 5.
Flags: needinfo?(bobbyholley+bmo)
Actually, I realize that backporting this pixie dust is totally useless without bug 834732.
Flags: in-testsuite?
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: