Closed Bug 970747 Opened 10 years ago Closed 10 years ago

ContainerLayer::InsertAfter, RemoveChild, and other tree updates need graceful, runtime error handling, even in release builds

Categories

(Core :: Graphics, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla30
Tracking Status
firefox27 --- disabled
firefox28 --- disabled
firefox29 --- disabled
firefox30 + fixed
firefox-esr24 --- disabled
b2g18 --- wontfix
b2g-v1.1hd --- wontfix
b2g-v1.2 --- wontfix
b2g-v1.3 --- wontfix
b2g-v1.3T --- wontfix
b2g-v1.4 --- fixed

People

(Reporter: bjacob, Assigned: bjacob)

References

Details

(Keywords: sec-critical, Whiteboard: [qa-])

Attachments

(7 files, 1 obsolete file)

8.53 KB, patch
mattwoodrow
: review+
nical
: review+
dveditz
: sec-approval+
Details | Diff | Splinter Review
1.64 KB, patch
mattwoodrow
: review+
nical
: review+
Details | Diff | Splinter Review
16.41 KB, text/plain
Details
6.87 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.07 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
8.34 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
1.77 KB, patch
mattwoodrow
: review+
Details | Diff | Splinter Review
See dependent bugs: bug 967989 (which is a hang of the parent process), bug 970699

Currently the only error handling done by ContainerLayer::InsertAfter is in the form of NS_ASSERTION's, so that doesn't take effect on release builds. But the checks performed in these NS_ASSERTION's are very much needed in release builds to prevent bad client processes from corrupting the parent-side layer tree.

To stick as close as possible to the existing behavior, this patch calls NS_ERROR in the failure cases, and returns false (InsertAfter now returns bool).

The next patch on this bug will use that return value in LayerTransactionParent::RecvUpdate.

There is also a change in ClientContainerLayer::InsertAfter making it avoid to send any message to the parent if an error is detected already on the client side, thus keeping client and parent in sync.
Attachment #8373805 - Flags: review?(roc)
Attachment #8373805 - Flags: review?(nical.bugzilla)
Blocks: 967989, 970699
Note that returning false here means that not only we're bailing out of this parent layer tree update, but we are also generating an IPC error with the usual consequences. I believe that this will typically terminate the client process. I think that that's wanted here, but I wanted to call that out, not sneak it behind your back ;-)

Here's why I think that this is wanted behavior. The corresponding check should already have been performed on the client side (see previous patch, ClientContainerLayer::InsertAfter, as explained above it will not even send the message to the parent if it finds an error), so it should not be possible to hit this. However, a _malicious_ child could easily trigger this, as evidenced by the dependent bugs triggered by faulty. If we allow them to continue, then we do so with a parent-side layer tree that doesn't correspond anymore to the client-side layer tree. That doesn't seem particularly useful (?)

However, I don't have a super strong opinion here. If you feel strongly that we should not generate an IPC error there, to avoid crashing the faulty client process, then that's OK, we can simply silently bail out from RecvUpdate by returning true there. But then, again, we are leaving the compositor with a layer tree that corresponds to nothing on the client side, which scares me a little.
Attachment #8373810 - Flags: review?(roc)
Attachment #8373810 - Flags: review?(nical.bugzilla)
Matt is a good reviewer for these patches, too ---I don't know whom exactly to get a review from, for these patches.
I think it makes sense that we should verify the tree structure of the shadow tree. (I'm not sure I agree this is a high enough priority to work on now, but I'll accept that someone has determined that.)

I think doing these piecemeal checks is not the right way to go; I don't think it catches all possible errors, and even if we fix that, it'll complex to prove that it does. Instead in LayerTransactionParent::RecvUpdate, after processing all the edits, but before calling EndTransaction, we should just verify that the shadow tree is actually a tree. (Maybe skip this verification if there were no edits that affect layer tree structure.) You could do this with an pre-order traversal of the tree, keeping track of which layers you've already seen and checking you never see the same layer twice. You should also traverse MaskLayers when you do this.

If you detect an error, we should terminate the child process, preferably with a crash report for it. It sounds like your IPC error does that?
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #3)
> I think it makes sense that we should verify the tree structure of the
> shadow tree. (I'm not sure I agree this is a high enough priority to work on
> now, but I'll accept that someone has determined that.)

Sorry, I should have detailed the context for this. I'm fuzzing Gfx IPC protocols. See
  https://intranet.mozilla.org/User:Bjacob@mozilla.com/Gfx_IPC_fuzzing
(this page is private because it contains information about security bugs).

As part of that, I have to fix the bugs that I find, otherwise I can't continue fuzzing (I keep hitting over and over again the bugs that I haven't fixed).
 
> I think doing these piecemeal checks is not the right way to go;

Note that the first patch here is just in-place changing NS_ASSERTION(cond, msg) to if(!cond) { NS_ERROR(msg); return false; } --- the piecemeal nature of the checks is unchanged.

> I don't
> think it catches all possible errors, and even if we fix that, it'll complex
> to prove that it does. Instead in LayerTransactionParent::RecvUpdate, after
> processing all the edits, but before calling EndTransaction, we should just
> verify that the shadow tree is actually a tree.

That sounds like a very good idea: a complete verification of state at a given time is indeed a lot easier to reason about than a collection of piecemeal checks of incremental state changes.

One thing worries me though: not validating incremental changes immediately in InsertAfter means allowing a bad client to temporarily corrupt parent-side state. Temporarily here means: between the time of the bad InsertAfter call and the time when the parent does post-child-termination cleanup. Now we need to reason about why this temporary bad state might not last long enough to be harmful. In particular, how confident are we that the compositing thread, which is a different thread than the IPC thread, won't try to composite during the lapse of time where the parent-side layers tree is bad, then possibly hitting e.g. bug 967989 again?

It seems, then, that we have to stick to always validating client-issued layer tree updates before executing them on the parent side, right? But I totally agree that full non-incremental state verification as you describe is a lot more provable and we should probably do that. Should we do both kinds of checks, then?
Note that implicit in this conversation is that once we agree about what to do about these InsertAfter operations, we'll apply the same solution to other layer tree updates (RemoveChild and Reposition...).
Just hit this. This is with the present patches applied. I still crash in a InsertAfter operation, accessing an already-destroyed Layer. This Layer was just destroyed by a RemoveChild operation.

A single RecvUpdate can proceed an entire "changeset" which can contain multiple Edits. We loop over all Edits and handle them, before we get to calling EndTransaction. A bad RemoveChild Edit could be followed by a InsertAfter Edit within the same Changeset. That could be what happened there (what I don't know is if here the bad RemoveChild was part of the same changeset as the InsertAfter that is crashing). To prevent such crashes, we have to validate each Edit before processing the next one.
Attachment #8373996 - Attachment description: Faulty session showing why we can't do without immediate validation of layer tree updates → Faulty session showing why we can't do without immediate validation of each Edit
And, oops, use-after-free in this Faulty/ASan log makes this security sensitive.
Group: core-security
Summary: ContainerLayer::InsertAfter needs graceful, runtime error handling, even in release builds → ContainerLayer::InsertAfter, RemoveChild, and other tree updates need graceful, runtime error handling, even in release builds
Bug 969517 is possibly another flavor of this.
Blocks: 969517
Attachment #8373805 - Flags: review?(nical.bugzilla) → review+
Attachment #8373810 - Flags: review?(nical.bugzilla) → review+
While I agree with roc that it would be good to validate the state at the end of the transaction, I also really like it when the error is caught as early as possible, because it makes debugging much easier, and also because, as bjacob mentioned, it can be hard to survive the entire transaction with a messed up state. We could do both types of checks, but I would like to have at least the early ones.
(In reply to Benoit Jacob [:bjacob] from comment #4)
> One thing worries me though: not validating incremental changes immediately
> in InsertAfter means allowing a bad client to temporarily corrupt
> parent-side state. Temporarily here means: between the time of the bad
> InsertAfter call and the time when the parent does post-child-termination
> cleanup. Now we need to reason about why this temporary bad state might not
> last long enough to be harmful. In particular, how confident are we that the
> compositing thread, which is a different thread than the IPC thread, won't
> try to composite during the lapse of time where the parent-side layers tree
> is bad, then possibly hitting e.g. bug 967989 again?

I don't think it makes sense to have only partially checked the layer subtree for validity before allowing the compositor to see the subtree.

So, in LayerTransactionParent::RecvUpdate, after processing the list of edits, let's check the subtree for validity. (I think the edit-processing loop itself can tolerate non-tree-ness of the intermediate states.) If it's invalid, let's disconnect it before calling EndTransaction and ensure that the compositor never sees it (i.e. that reflayer resolution fails to find the bad subtree).
(In reply to Nicolas Silva [:nical] from comment #9)
> While I agree with roc that it would be good to validate the state at the
> end of the transaction, I also really like it when the error is caught as
> early as possible, because it makes debugging much easier, and also because,
> as bjacob mentioned, it can be hard to survive the entire transaction with a
> messed up state. We could do both types of checks, but I would like to have
> at least the early ones.

Assertions would make sense, but I don't think we should actually exit early. That would make the full checks harder to test.
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10)
> (In reply to Benoit Jacob [:bjacob] from comment #4)
> > One thing worries me though: not validating incremental changes immediately
> > in InsertAfter means allowing a bad client to temporarily corrupt
> > parent-side state. Temporarily here means: between the time of the bad
> > InsertAfter call and the time when the parent does post-child-termination
> > cleanup. Now we need to reason about why this temporary bad state might not
> > last long enough to be harmful. In particular, how confident are we that the
> > compositing thread, which is a different thread than the IPC thread, won't
> > try to composite during the lapse of time where the parent-side layers tree
> > is bad, then possibly hitting e.g. bug 967989 again?
> 
> I don't think it makes sense to have only partially checked the layer
> subtree for validity before allowing the compositor to see the subtree.
> 
> So, in LayerTransactionParent::RecvUpdate, after processing the list of
> edits, let's check the subtree for validity. (I think the edit-processing
> loop itself can tolerate non-tree-ness of the intermediate states.)

Comment 6 found otherwise!
I am holding off until we come to consensus here, before requesting review on these patches. But this is otherwise ready for review, and as explained above, I still don't see an alternative to doing these immediate checks on each Edit; but yes, it would be very nice to also have whole-tree checks in addition to that.
(In reply to Benoit Jacob [:bjacob] from comment #6)
> Created attachment 8373996 [details]
> Faulty session showing why we can't do without immediate validation of each
> Edit
> 
> Just hit this. This is with the present patches applied. I still crash in a
> InsertAfter operation, accessing an already-destroyed Layer. This Layer was
> just destroyed by a RemoveChild operation.
> 
> A single RecvUpdate can proceed an entire "changeset" which can contain
> multiple Edits. We loop over all Edits and handle them, before we get to
> calling EndTransaction. A bad RemoveChild Edit could be followed by a
> InsertAfter Edit within the same Changeset. That could be what happened
> there (what I don't know is if here the bad RemoveChild was part of the same
> changeset as the InsertAfter that is crashing). To prevent such crashes, we
> have to validate each Edit before processing the next one.

Can't we fix this by putting removed layers into a cleanup list, so that they still alive until we finish processing all edits?
(In reply to Matt Woodrow (:mattwoodrow) from comment #16)
> Can't we fix this by putting removed layers into a cleanup list, so that
> they still alive until we finish processing all edits?

Isn't that much more complicated than just validating untrusted inputs immediately before updating our state?
Whatever the outcome of the debate is, I am happy if I have at least assertions to catch bugs as soon as we see them when there is still a meaningful stack trace.
I don't think there is harm in doing at least part of the validation of the inputs as soon as we have them. If it's not enough let's do both types of validations?
(In reply to Nicolas Silva [:nical] from comment #19)
> Whatever the outcome of the debate is, I am happy if I have at least
> assertions to catch bugs as soon as we see them when there is still a
> meaningful stack trace.

Neither the current mozilla-central code, nor the proposed patches, give you that. The current mozilla-central code only uses NS_ASSERTION; the first half of the proposed patches turn that into NS_ERROR (which is the same) and returning false instead of void; the second half of proposed patches turned these false retvals into IPC errors which would typically terminate child processes, but even if you attach GDB to them that wouldn't tell you anything about these errors, because of asynchronousness.
I still don't see another preferable alternative. Do you?
Flags: needinfo?(roc)
Flags: needinfo?(matt.woodrow)
I think we should MOZ_ASSERT as well as returning the ipc error.

We should never hit these errors when handling messages from a well behaved content process, so it must be a bug in our code and we should just die and make it easy to debug.

In the release case the MOZ_ASSERT won't do anything, which is good because we could be dealing (potentially) with a rogue child process, and we don't want to let it take down the parent.

If you manage to get owned while running a debug build, then I guess you lose. I don't think that's an issue.
Flags: needinfo?(matt.woodrow)
The problem with MOZ_ASSERT here is that that would prevent fuzzing debug builds, because in debug builds, under fuzzing, the parent process would abort. It is useful to be able to fuzz debug builds, not just release builds, because the crashes one gets from fuzzing debug builds are typically easier to understand and fix (it made my work a lot easier while fuzzing gfx protocols).
Also, notice that NS_ERROR will cause test harnesses to record test failures, so we will have some chance of noticing.
Ok, that's reasonable.

It would be nice if we had another option, something like MOZ_ASSERT_UNLESS_THE_FUZZER_IS_RUNNING.
Yes, and that idea has come up in similar discussions on other fuzzing-related bugs. It might well happen.

Meanwhile, NS_ERROR here is closest to the current NS_ASSERTION so it should be easiest to r+ ;-)

Decide whether you want / can steal the reviews here ;-)

Note that all 6 patches here are up for review, I just didn't bother flagging them all until we have some agreement on the general approach.
Attachment #8373805 - Flags: review?(roc) → review?(matt.woodrow)
Attachment #8373810 - Flags: review?(roc) → review?(matt.woodrow)
Attachment #8373805 - Attachment description: Make ContainerLayer::InsertAfter always perform checks and return bool → Part 1: Make ContainerLayer::InsertAfter always perform checks and return bool
Attachment #8373810 - Attachment description: Make LayerTransactionParent::RecvUpdate error out if InsertAfter fails → Part 2: Make LayerTransactionParent::RecvUpdate error out if InsertAfter fails
Attachment #8375963 - Attachment description: Make ContainerLayer::RemoveChild always perform checks and return bool → Part 3: Make ContainerLayer::RemoveChild always perform checks and return bool
Attachment #8375963 - Flags: review?(matt.woodrow)
Attachment #8375965 - Attachment description: Make LayerTransactionParent::RecvUpdate error out if RemoveChild fails → Part 4: Make LayerTransactionParent::RecvUpdate error out if RemoveChild fails
Attachment #8375965 - Flags: review?(matt.woodrow)
Attachment #8375966 - Attachment is obsolete: true
Attachment #8375968 - Attachment description: Make ContainerLayer::RepositionChild always perform checks and return bool → Part 5: Make ContainerLayer::RepositionChild always perform checks and return bool
Attachment #8375968 - Flags: review?(matt.woodrow)
Attachment #8373805 - Flags: review?(matt.woodrow) → review+
How far back does this go and does this only matter in b2g? I'm trying to determine how far back we'll need to take patches. ESR24?
While the bug exists everywhere, it only is a security bug where we have separate untrusted client processes. So far, that is only B2G.
Attachment #8373810 - Flags: review?(matt.woodrow) → review+
Attachment #8375963 - Flags: review?(matt.woodrow) → review+
Attachment #8375965 - Flags: review?(matt.woodrow) → review+
Attachment #8375968 - Flags: review?(matt.woodrow) → review+
Attachment #8379071 - Flags: review?(matt.woodrow) → review+
Flags: needinfo?(roc)
https://tbpl.mozilla.org/?tree=Try&rev=6d755127af30

The crashes on the previous try pushes were an... interesting bug in the Part 3 patch here.

Before my patches, we have:

  virtual void RemoveChild(Layer* aChild) MOZ_OVERRIDE
  { 
    NS_ASSERTION(ClientManager()->InConstruction(),
                 "Can only set properties in construction phase");
    ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this),
                                                      ClientManager()->Hold(aChild));
    ContainerLayer::RemoveChild(aChild);
  }

My part 3 patch here changes this as follows. Notice in particular how we now do the client-side RemoveChild operation _before_ we send a message to the parent. This is intentional: we don't want to propagate a bogus operation to the parent if we can detect already on the client side that it's bad.

  virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE
  { 
    if (!ClientManager()->InConstruction()) {
      NS_ERROR("Can only set properties in construction phase");
      return false;
    }
    if (!ContainerLayer::RemoveChild(aChild)) {
      return false;
    }
    ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this),
                                                      ClientManager()->Hold(aChild));
    return true;
  }

The problem though is that the client-side operation, ContainerLayer::RemoveChild(aChild), can result in the immediate destruction of aChild, since it's managed by reference counting. So the crashes on the previous tryserver run were use-after-free's right there, as that operation had destroyed aChild and we were in the next line,

    ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this),
                                                      ClientManager()->Hold(aChild));

The simple 1-line fix is to hold a temporary nsRefPtr to aChild in this function:

+    nsRefPtr<Layer> holdChild = aChild

So the updated code is:

  virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE
  { 
    if (!ClientManager()->InConstruction()) {
      NS_ERROR("Can only set properties in construction phase");
      return false;
    }
    // Hold a strong reference to aChild to prevent it from being destroyed by
    // the client-side removal
    nsRefPtr<Layer> holdChild = aChild;
    // Remove the child on the client side, check if that worked before propagating
    // this change over IPC
    if (!ContainerLayer::RemoveChild(aChild)) {
      return false;
    }
    // OK, send IPC message.
    ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this),
                                                      ClientManager()->Hold(aChild));
    return true;
  }

I don't suppose, by default, that this change of 1 line plus some comments requires another review.
Oh, wait... is that exactly what this Hold() method is doing ? :-)
So, this is what RemoveChild looks like in the current version of my patch, which I'm about to land.

  virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE
  {
    if (!ClientManager()->InConstruction()) {
      NS_ERROR("Can only set properties in construction phase");
      return false;
    }
    // hold on to aChild before we remove it!
    ShadowableLayer *heldChild = ClientManager()->Hold(aChild);
    if (!ContainerLayer::RemoveChild(aChild)) {
      return false;
    }
    ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), heldChild);
    return true;
  }
Comment on attachment 8373805 [details] [diff] [review]
Part 1: Make ContainerLayer::InsertAfter always perform checks and return bool

[Security approval request comment]
How easily could an exploit be constructed based on the patch?

Not easily at all. Sure, it's not too hard to understand from the patch that this is fixing a potential use-after-free. But then, these patches doesn't start telling anything about how this particular use-after-free is exploitable.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?

No.

Which older supported branches are affected by this flaw?

All of them.

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?

No, and I don't think we want to backport anything, for the same reasons as on all these fuzzing-discovered bugs: most of our IPC protocols are wildly insecure at the moment, so we are and remain very insecure anyway, so it will only make sense backporting such changes once the majority of IPC protocols are fixed.

How likely is this patch to cause regressions; how much testing does it need?

Medium-risk. There's 6 patches here, it's not tiny.
Attachment #8373805 - Flags: sec-approval?
We can call this "disabled" in Fx-desktop, so sec-approval+ to land on b2g without worrying what old bugs this might reveal in shipping releases.  I can't see us backporting this to b2g18 so I'll just wontfix that now. b2g1.2 seems unlikely but I'll leave that one open for now.
Attachment #8373805 - Flags: sec-approval? → sec-approval+
If we were going to backport these to 1.2 and 1.3, that time was a long time ago.
Marking [qa-] for desktop QA verification. FxOS QA may choose to verify at a later date.
Whiteboard: [qa-]
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: