Closed Bug 1249686 Opened 8 years ago Closed 5 years ago

Extend GC rooting hazard analysis to cover DOM nodes

Categories

(Core :: DOM: Core & HTML, defect)

Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

(Keywords: sec-want, Whiteboard: dom-triaged btpp-active)

Attachments

(1 file)

For years and years now, we keep having problems like bug 1249377 (and maybe bug 1249685): we hold a raw pointer on the stack to some DOM node, then we perform an operation that can call back into arbitrary content JS, which can trash the DOM node, and then we use the stack variable again, causing a use-after-free.
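
A minimal sketch of the pattern (the function names here are hypothetical, not code from those bugs):

  // Hypothetical illustration of the hazard described above.
  void Example(nsINode* aParent) {
    nsIContent* child = aParent->GetFirstChild();  // raw pointer, no strong reference
    DispatchSomeEvent(aParent);  // hypothetical call that can run content JS and free child
    UseChild(child);             // potential use-after-free
  }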

If I understand correctly, the JS GC rooting hazard analysis has two phases. First it calculates the set of functions that might GC. Then it checks whether any JS GC thing pointers are live across call sites that might GC.

It seems to me that "can GC" is going to include every place we could call back into content JS to run mutation observers, so we "just" need to modify the second part to care about the DOM stuff we want rooted, and to make it understand nsCOMPtr<> as a form of rooting.

(I thought I'd filed a bug on this before, but I can't find it.)
Steve, what do you think about this? How hard would it be for me to get the static hazard analysis running locally? Presumably if I can get this running there will be many places to fix before we enable this in automation.
Flags: needinfo?(sphink)
Assignee: nobody → continuation
Whiteboard: dom-triaged btpp-active
(In reply to Andrew McCreight [:mccr8] from comment #1)
> Steve, what do you think about this?

I think this is a great idea. You are correct that running script implies canGC, so the existing set of functions that can GC is a conservative superset of what you care about. If it is too conservative, it wouldn't be hard to add an additional (or alternative) pass over the callgraph to compute a can-run-script property for all functions.

The "fun" part is in dealing with all of the function pointers that are assumed to be able to GC. Right now, we only care about the ones that might be called when any GC pointer is held live. Presumably you'd run into a larger set of them if you start looking at all of the ranges of code that hold a DOM node pointer live.

> How hard would it be for me to get the static hazard analysis running locally? Presumably if I can get this running
> there will be many places to fix before we enable this in automation.

Hopefully those will almost all be false positives, but yes. It would be helpful to have a private try cluster for things like this, though the part you'll end up wanting to run and re-run is only the JS-implemented analysis that uses the information gathered during compilation, and the turnaround time on that isn't too awful, so perhaps your own single local machine is good enough.

But anyway -- the current analysis on mozilla-inbound, implemented via a mozharness script, is a bit of a pain to run locally. It's doable, and a couple of people have done it, but it always requires some amount of futzing and I know I still have some unlanded fixes to clean up some (but not all) of the problems.

The good news is that I'm switching everything over to taskcluster, and that is *much* easier to run locally. I have not yet landed any of those jobs, though I just landed most of the prerequisite work and the mulet version is reviewed so I'll probably be landing it shortly.

We should probably talk on IRC, but the main prerequisite you'll need is Docker running on a linux64 host.

Alternatively, you can grab the .xdb files from an existing job and just run the analysis portion. I *think* that should be doable platform-independently, though it requires some custom setup. The existing jobs do not upload the xdb files (they're huge), but I have a patch that does, and the new try jobs will do it if you push with an --upload-xdbs flag in your "try syntax" (which would be my recommended approach for re-running with code changes).

Current xdb files can be found at https://tools.taskcluster.net/task-inspector/#ZU62j7UKSi-se5yRWzflEQ/0

You'd want to grab *.xdb.bz2 and bunzip2 them locally, then I can walk you through the analysis setup.

Oh, and it probably doesn't help yet, but if I can get gcc 4.9.3 deployed, then I'll have a new version of the whole setup that allows annotating arbitrary types in the source code as GC things or GC pointers. With that, you could do all of this with zero changes to the analysis by sprinkling __attribute__((tag("GC Pointer"))) on various types (bug 1246804).
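
As a rough sketch of how that might look (the macro name and build guard are hypothetical; the attribute spelling is taken from the comment above):

  // Hypothetical wrapper macro: only meaningful to the hazard-analysis compiler
  // plugin, and a no-op in normal builds.
  #ifdef MOZ_HAZARD_ANALYSIS_BUILD
  #  define DOM_GC_POINTER __attribute__((tag("GC Pointer")))
  #else
  #  define DOM_GC_POINTER
  #endif

  class DOM_GC_POINTER nsINode {
    // ...
  };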
Flags: needinfo?(sphink)
With sfink's help, I've started looking at this, checking for nsINode (and subclasses) being held alive across calls that might GC. It finds around 5000 hazards. (This includes bindings-generated code.)

What is likely a large source of false positives is that it does not know anything about the informal convention that if you pass an nsINode* to a function, the caller should be keeping it alive. (Of course, nothing actually enforces that, so we could have real hazards related to that.) For now, I'll try to mass-ignore hazards relating to function arguments.
Does the analysis understand the JSObject -> nsINode relationship? That is, if you have a wrapper for an nsINode and the wrapper is kept alive, it is effectively a kungFuDeathGrip for the nsINode too.
(In reply to Olli Pettay [:smaug] (HIGH REVIEW LOAD) from comment #4)
> That is, if you have a wrapper for an nsINode and the wrapper is kept alive,
> it is effectively a kungFuDeathGrip for the nsINode too.

I doubt it. Even if it did, do we often root the wrapper for a C++ object within a function? IIUC the actual final rooting analysis only looks at one method at a time.
I mean a case where JS is passing some JS object to a method, the bindings then do the JS->C++ conversion, and on the C++ side it looks like:
  foo->Method(ptr_to_bar);
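
Roughly speaking (a hypothetical, simplified sketch; the unwrap helper name is made up and this is not the actual generated code):

  // The JSObject* wrapper passed by the caller stays rooted for the duration of
  // the call, which transitively keeps the underlying nsINode alive.
  nsINode* ptr_to_bar = UnwrapWrapperToNode(aWrapperObject);  // hypothetical helper
  foo->Method(ptr_to_bar);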
I started looking into this again. Edge apparently has a thing called "MemGC" which is like Snow White (delayed batched destruction of refcounted objects) plus a conservative stack and heap scan to prevent destroying objects which might still be alive. We should be able to get the benefits of the stack part of that without the problems of conservative analysis.
I chatted with mccr8 about this at the All Hands, and wanted to summarize what we discussed:

When this issue was originally filed, it was thought that the rooting analysis used by the JS engine would be straightforward to apply to the DOM, for the same reason: JS references need to have a strong root whenever the GC might run, and DOM node references need to have a strong reference (beyond the references in the DOM tree) whenever a JavaScript callback could run and mutate the DOM. So it seemed like there should be an isomorphism.

Unfortunately, the way we use types in JS vs. the DOM is sufficiently different that the analysis doesn't transfer over cleanly. In JS we use js::Rooted<T> types, which indicate that an argument to a function is rooted somewhere up the stack. In the DOM, by contrast, we often pass a Node*, which can be either borrowed from an nsCOMPtr<Node> on the stack (which holds a strong reference) or borrowed directly from the DOM tree.
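
To illustrate the contrast (simplified, hypothetical signatures, not taken from the tree):

  // JS engine style: the parameter type itself promises the value is rooted by a caller.
  void JsStyle(JS::Handle<JSObject*> aObj);

  // DOM style: a raw pointer says nothing about who, if anyone, keeps the node alive.
  void DomStyle(nsINode* aNode);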

The lack of precise types in DOM basically makes it impossible to do function-local analysis, in the way the JS rooting analysis works.

Given that, we briefly contemplated how we could proceed. One option is introducing a set of types similar to the JS ones. The concern there is that it would be a lot of churn! dom/ is something like a million lines of C++. Maybe the conversion could be scripted, but we were a bit doubtful it could be done without writing the equivalent of the entire analysis this bug is about!

In light of all of this, I don't think there's really a straightforward analysis to write here in the way we'd hoped.
I think you can still do a local analysis, given the existence of the global analysis that the existing rooting analysis computes ("can this function GC?"). You'd need a rule like "if calling a function that can GC, all node arguments must be rooted". Then, inside a function, if the function itself can GC, you can assume that all of the node arguments are rooted.

The problem I was hitting is that doing the analysis at the call site, at the compiler IR level, is not easy. Imagine the call site for a function ThisFunctionCallsScript(nsINode*):
  nsCOMPtr<nsINode> node(...);
  ThisFunctionCallsScript(node);

The problem here is that node is implicitly being converted to a raw pointer, so the code the static analysis looks at is more like
  nsCOMPtr<nsINode> node(...);
  nsINode* rawPointer = node.get();
  ThisFunctionCallsScript(rawPointer);

To figure out that rawPointer is actually rooted, you have to do some kind of analysis to figure out that rawPointer is equal to node, that node is rooted, and that node hasn't changed since we assigned from it. That's all doable, but my impression is that this is more sophisticated than anything that the rooting analysis needs to do right now, so there would be more work required.
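
For illustration (hypothetical code), the reason that alias tracking matters is that the raw pointer can outlive the strong reference it was taken from:

  nsCOMPtr<nsINode> node(...);
  nsINode* rawPointer = node.get();
  node = nullptr;                       // or node is reassigned to a different node
  ThisFunctionCallsScript(rawPointer);  // rawPointer is no longer being kept alive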

Another thing I forgot about, but which I saw just now, is that we already have some kind of analysis, based around tagging things with MOZ_CAN_RUN_SCRIPT. I don't know exactly what it does, but if anybody is interested in this topic, they should look into that. Maybe that is substituting for the global part of the analysis and the LLVM static analysis handles the call site stuff better?
Skimming bug 1380423, it sounds like the MOZ_CAN_RUN_SCRIPT annotation can deal with the call site issue, so really you are adding the annotations because the LLVM static analysis has no global phase. Hmm, I wonder how hard it would be to use the "can this function GC?" database created by the GC rooting analysis to automatically annotate functions. You probably wouldn't want to land that, but maybe it could turn up something, or maybe that database could be plugged into the LLVM analysis somehow.
Producing a list of all "can GC" functions, not in js/, that aren't marked MOZ_CAN_RUN_SCRIPT seems like a good tactical goal. I think we could do this analysis once, and then annotate the appropriate leaf functions with MOZ_CAN_RUN_SCRIPT. At that point the existing analysis could take over.
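
For reference, a simplified sketch of how the annotation is used (the class and caller here are hypothetical, not actual Gecko code):

  class HypotheticalTarget {
  public:
    // Callers of RunListeners() must themselves be MOZ_CAN_RUN_SCRIPT (or explicitly
    // mark a boundary), so the property propagates up the call graph.
    MOZ_CAN_RUN_SCRIPT void RunListeners();
  };

  MOZ_CAN_RUN_SCRIPT void Caller(HypotheticalTarget* aTarget) {
    RefPtr<HypotheticalTarget> target = aTarget;  // hold a strong reference before running script
    target->RunListeners();
  }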


In related news, I filed bug 1471344 which points out a limitation in MOZ_CAN_RUN_SCRIPT.
I'm concerned that can-GC may be broader than we initially considered, and therefore not a close enough approximation of "can run user code", because it would implicate far too much code that isn't exploitable.

Specifically, from looking at some of the intermediate artifacts of the GC rooting analysis, it looks like any usage of an XPCOM API (or at least one that's scriptable and not builtinclass, I guess?) can invoke GC; more generally, any time we run chrome JS code we can GC, but unless that JS code were to invoke a user-script callback, we can't actually cause arbitrary changes to the DOM tree (unlinking nodes and leading to UAF).

Considering any usage of an XPCOM API as a relevant call-site significantly bloats the amount of code we'd have to consider. As an example (if I'm reading this output correctly, which I don't take for granted), nsIURI::GetSpec is considered a can-GC function.

Is there a way to mitigate this? Do JS implementations of XPCOM interfaces ever call back into arbitrary user scripts (meaning it would be _correct_ to consider them MOZ_CAN_RUN_SCRIPT)?
Oh, that's an interesting point. I'll have to think about that.

nsIURI is actually builtinclass, so the rooting analysis must not be taking that into account. That makes sense, because it is analyzing the C++ code, so I'm not sure how it could tell that the virtual method can never be implemented by JS. IIRC, there's basically a whitelist of nsISupports classes (or maybe even just methods?) that are explicitly declared to not be GCing.
Historically, JS implementations of XPCOM interfaces weren't even under our control (because of extensions), so we had no guarantees about them not doing weird stuff.

Nowadays we at least know that it's all our code, but there are no guarantees that they don't cause user code to run, unfortunately.  From a threat model point of view, I think there are a few possible approaches.

1) Treat calls into script-implementable XPCOM interfaces as potentially blowing up the world, and work on reducing the number of such in our codebase by marking things noscript or builtinclass as needed.  We should do the latter anyway.  Maybe have a way of annotating certain things as "safe".

2) Treat calls into script-implementable XPCOM interfaces as safe-by-default but have a way to annotate them otherwise.

3) Extend our analysis somehow into script-implemented XPCOM interfaces so we can tell whether they're safe, like we can for C++ code... (sounds hard).
Unfortunately directly using the GC rooting analysis is probably not possible -- as I've dug in, it's clear that there's tons of stuff that's CanGC, but not CanRunUserScript -- basically every "create JS object" function, and there's a lot of them.

In light of that, I'm looking to see if it's possible to reuse the callgraph and do our own light-weight analysis.
Component: DOM → DOM: Core & HTML

I think this is WONTFIX. MOZ_CAN_RUN_SCRIPT seems like a more practical way to incrementally detect these kinds of issues. See bug 1415230.

Group: dom-core-security
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
See Also: → 1415230