Closed
Bug 970747
Opened 10 years ago
Closed 10 years ago
ContainerLayer::InsertAfter, RemoveChild, and other tree updates need graceful, runtime error handling, even in release builds
Categories
(Core :: Graphics, defect)
Tracking
()
RESOLVED
FIXED
mozilla30
People
(Reporter: bjacob, Assigned: bjacob)
References
Details
(Keywords: sec-critical, Whiteboard: [qa-])
Attachments
(7 files, 1 obsolete file)
8.53 KB,
patch
|
mattwoodrow
:
review+
nical
:
review+
dveditz
:
sec-approval+
|
Details | Diff | Splinter Review |
1.64 KB,
patch
|
mattwoodrow
:
review+
nical
:
review+
|
Details | Diff | Splinter Review |
16.41 KB,
text/plain
|
Details | |
6.87 KB,
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
1.07 KB,
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
8.34 KB,
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
1.77 KB,
patch
|
mattwoodrow
:
review+
|
Details | Diff | Splinter Review |
See dependent bugs: bug 967989 (which is a hang of the parent process), bug 970699 Currently the only error handling done by ContainerLayer::InsertAfter is in the form of NS_ASSERTION's, so that doesn't take effect on release builds. But the checks performed in these NS_ASSERTION's are very much needed in release builds to prevent bad client processes from corrupting the parent-side layer tree. To stick as close as possible to the existing behavior, this patch calls NS_ERROR in the failure cases, and returns false (InsertAfter now returns bool). The next patch on this bug will use that return value in LayerTransactionParent::RecvUpdate. There is also a change in ClientContainerLayer::InsertAfter making it avoid to send any message to the parent if an error is detected already on the client side, thus keeping client and parent in sync.
Attachment #8373805 -
Flags: review?(roc)
Attachment #8373805 -
Flags: review?(nical.bugzilla)
Assignee | ||
Updated•10 years ago
|
Assignee | ||
Comment 1•10 years ago
|
||
Note that returning false here means that not only we're bailing out of this parent layer tree update, but we are also generating an IPC error with the usual consequences. I believe that this will typically terminate the client process. I think that that's wanted here, but I wanted to call that out, not sneak it behind your back ;-) Here's why I think that this is wanted behavior. The corresponding check should already have been performed on the client side (see previous patch, ClientContainerLayer::InsertAfter, as explained above it will not even send the message to the parent if it finds an error), so it should not be possible to hit this. However, a _malicious_ child could easily trigger this, as evidenced by the dependent bugs triggered by faulty. If we allow them to continue, then we do so with a parent-side layer tree that doesn't correspond anymore to the client-side layer tree. That doesn't seem particularly useful (?) However, I don't have a super strong opinion here. If you feel strongly that we should not generate an IPC error there, to avoid crashing the faulty client process, then that's OK, we can simply silently bail out from RecvUpdate by returning true there. But then, again, we are leaving the compositor with a layer tree that corresponds to nothing on the client side, which scares me a little.
Attachment #8373810 -
Flags: review?(roc)
Attachment #8373810 -
Flags: review?(nical.bugzilla)
Assignee | ||
Comment 2•10 years ago
|
||
Matt is a good reviewer for these patches, too ---I don't know whom exactly to get a review from, for these patches.
I think it makes sense that we should verify the tree structure of the shadow tree. (I'm not sure I agree this is a high enough priority to work on now, but I'll accept that someone has determined that.) I think doing these piecemeal checks is not the right way to go; I don't think it catches all possible errors, and even if we fix that, it'll complex to prove that it does. Instead in LayerTransactionParent::RecvUpdate, after processing all the edits, but before calling EndTransaction, we should just verify that the shadow tree is actually a tree. (Maybe skip this verification if there were no edits that affect layer tree structure.) You could do this with an pre-order traversal of the tree, keeping track of which layers you've already seen and checking you never see the same layer twice. You should also traverse MaskLayers when you do this. If you detect an error, we should terminate the child process, preferably with a crash report for it. It sounds like your IPC error does that?
Assignee | ||
Comment 4•10 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #3) > I think it makes sense that we should verify the tree structure of the > shadow tree. (I'm not sure I agree this is a high enough priority to work on > now, but I'll accept that someone has determined that.) Sorry, I should have detailed the context for this. I'm fuzzing Gfx IPC protocols. See https://intranet.mozilla.org/User:Bjacob@mozilla.com/Gfx_IPC_fuzzing (this page is private because it contains information about security bugs). As part of that, I have to fix the bugs that I find, otherwise I can't continue fuzzing (I keep hitting over and over again the bugs that I haven't fixed). > I think doing these piecemeal checks is not the right way to go; Note that the first patch here is just in-place changing NS_ASSERTION(cond, msg) to if(!cond) { NS_ERROR(msg); return false; } --- the piecemeal nature of the checks is unchanged. > I don't > think it catches all possible errors, and even if we fix that, it'll complex > to prove that it does. Instead in LayerTransactionParent::RecvUpdate, after > processing all the edits, but before calling EndTransaction, we should just > verify that the shadow tree is actually a tree. That sounds like a very good idea: a complete verification of state at a given time is indeed a lot easier to reason about than a collection of piecemeal checks of incremental state changes. One thing worries me though: not validating incremental changes immediately in InsertAfter means allowing a bad client to temporarily corrupt parent-side state. Temporarily here means: between the time of the bad InsertAfter call and the time when the parent does post-child-termination cleanup. Now we need to reason about why this temporary bad state might not last long enough to be harmful. In particular, how confident are we that the compositing thread, which is a different thread than the IPC thread, won't try to composite during the lapse of time where the parent-side layers tree is bad, then possibly hitting e.g. bug 967989 again? It seems, then, that we have to stick to always validating client-issued layer tree updates before executing them on the parent side, right? But I totally agree that full non-incremental state verification as you describe is a lot more provable and we should probably do that. Should we do both kinds of checks, then?
Assignee | ||
Comment 5•10 years ago
|
||
Note that implicit in this conversation is that once we agree about what to do about these InsertAfter operations, we'll apply the same solution to other layer tree updates (RemoveChild and Reposition...).
Assignee | ||
Comment 6•10 years ago
|
||
Just hit this. This is with the present patches applied. I still crash in a InsertAfter operation, accessing an already-destroyed Layer. This Layer was just destroyed by a RemoveChild operation. A single RecvUpdate can proceed an entire "changeset" which can contain multiple Edits. We loop over all Edits and handle them, before we get to calling EndTransaction. A bad RemoveChild Edit could be followed by a InsertAfter Edit within the same Changeset. That could be what happened there (what I don't know is if here the bad RemoveChild was part of the same changeset as the InsertAfter that is crashing). To prevent such crashes, we have to validate each Edit before processing the next one.
Assignee | ||
Updated•10 years ago
|
Attachment #8373996 -
Attachment description: Faulty session showing why we can't do without immediate validation of layer tree updates → Faulty session showing why we can't do without immediate validation of each Edit
Assignee | ||
Comment 7•10 years ago
|
||
And, oops, use-after-free in this Faulty/ASan log makes this security sensitive.
Group: core-security
Assignee | ||
Updated•10 years ago
|
Summary: ContainerLayer::InsertAfter needs graceful, runtime error handling, even in release builds → ContainerLayer::InsertAfter, RemoveChild, and other tree updates need graceful, runtime error handling, even in release builds
Updated•10 years ago
|
Attachment #8373805 -
Flags: review?(nical.bugzilla) → review+
Updated•10 years ago
|
Attachment #8373810 -
Flags: review?(nical.bugzilla) → review+
Comment 9•10 years ago
|
||
While I agree with roc that it would be good to validate the state at the end of the transaction, I also really like it when the error is caught as early as possible, because it makes debugging much easier, and also because, as bjacob mentioned, it can be hard to survive the entire transaction with a messed up state. We could do both types of checks, but I would like to have at least the early ones.
(In reply to Benoit Jacob [:bjacob] from comment #4) > One thing worries me though: not validating incremental changes immediately > in InsertAfter means allowing a bad client to temporarily corrupt > parent-side state. Temporarily here means: between the time of the bad > InsertAfter call and the time when the parent does post-child-termination > cleanup. Now we need to reason about why this temporary bad state might not > last long enough to be harmful. In particular, how confident are we that the > compositing thread, which is a different thread than the IPC thread, won't > try to composite during the lapse of time where the parent-side layers tree > is bad, then possibly hitting e.g. bug 967989 again? I don't think it makes sense to have only partially checked the layer subtree for validity before allowing the compositor to see the subtree. So, in LayerTransactionParent::RecvUpdate, after processing the list of edits, let's check the subtree for validity. (I think the edit-processing loop itself can tolerate non-tree-ness of the intermediate states.) If it's invalid, let's disconnect it before calling EndTransaction and ensure that the compositor never sees it (i.e. that reflayer resolution fails to find the bad subtree).
(In reply to Nicolas Silva [:nical] from comment #9) > While I agree with roc that it would be good to validate the state at the > end of the transaction, I also really like it when the error is caught as > early as possible, because it makes debugging much easier, and also because, > as bjacob mentioned, it can be hard to survive the entire transaction with a > messed up state. We could do both types of checks, but I would like to have > at least the early ones. Assertions would make sense, but I don't think we should actually exit early. That would make the full checks harder to test.
Updated•10 years ago
|
Keywords: sec-critical
Assignee | ||
Comment 12•10 years ago
|
||
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #10) > (In reply to Benoit Jacob [:bjacob] from comment #4) > > One thing worries me though: not validating incremental changes immediately > > in InsertAfter means allowing a bad client to temporarily corrupt > > parent-side state. Temporarily here means: between the time of the bad > > InsertAfter call and the time when the parent does post-child-termination > > cleanup. Now we need to reason about why this temporary bad state might not > > last long enough to be harmful. In particular, how confident are we that the > > compositing thread, which is a different thread than the IPC thread, won't > > try to composite during the lapse of time where the parent-side layers tree > > is bad, then possibly hitting e.g. bug 967989 again? > > I don't think it makes sense to have only partially checked the layer > subtree for validity before allowing the compositor to see the subtree. > > So, in LayerTransactionParent::RecvUpdate, after processing the list of > edits, let's check the subtree for validity. (I think the edit-processing > loop itself can tolerate non-tree-ness of the intermediate states.) Comment 6 found otherwise!
Updated•10 years ago
|
Assignee | ||
Comment 13•10 years ago
|
||
I am holding off until we come to consensus here, before requesting review on these patches. But this is otherwise ready for review, and as explained above, I still don't see an alternative to doing these immediate checks on each Edit; but yes, it would be very nice to also have whole-tree checks in addition to that.
Assignee | ||
Comment 14•10 years ago
|
||
Assignee | ||
Comment 15•10 years ago
|
||
Comment 16•10 years ago
|
||
(In reply to Benoit Jacob [:bjacob] from comment #6) > Created attachment 8373996 [details] > Faulty session showing why we can't do without immediate validation of each > Edit > > Just hit this. This is with the present patches applied. I still crash in a > InsertAfter operation, accessing an already-destroyed Layer. This Layer was > just destroyed by a RemoveChild operation. > > A single RecvUpdate can proceed an entire "changeset" which can contain > multiple Edits. We loop over all Edits and handle them, before we get to > calling EndTransaction. A bad RemoveChild Edit could be followed by a > InsertAfter Edit within the same Changeset. That could be what happened > there (what I don't know is if here the bad RemoveChild was part of the same > changeset as the InsertAfter that is crashing). To prevent such crashes, we > have to validate each Edit before processing the next one. Can't we fix this by putting removed layers into a cleanup list, so that they still alive until we finish processing all edits?
Assignee | ||
Comment 17•10 years ago
|
||
(In reply to Matt Woodrow (:mattwoodrow) from comment #16) > Can't we fix this by putting removed layers into a cleanup list, so that > they still alive until we finish processing all edits? Isn't that much more complicated than just validating untrusted inputs immediately before updating our state?
Assignee | ||
Comment 18•10 years ago
|
||
Comment 19•10 years ago
|
||
Whatever the outcome of the debate is, I am happy if I have at least assertions to catch bugs as soon as we see them when there is still a meaningful stack trace. I don't think there is harm in doing at least part of the validation of the inputs as soon as we have them. If it's not enough let's do both types of validations?
Assignee | ||
Comment 20•10 years ago
|
||
(In reply to Nicolas Silva [:nical] from comment #19) > Whatever the outcome of the debate is, I am happy if I have at least > assertions to catch bugs as soon as we see them when there is still a > meaningful stack trace. Neither the current mozilla-central code, nor the proposed patches, give you that. The current mozilla-central code only uses NS_ASSERTION; the first half of the proposed patches turn that into NS_ERROR (which is the same) and returning false instead of void; the second half of proposed patches turned these false retvals into IPC errors which would typically terminate child processes, but even if you attach GDB to them that wouldn't tell you anything about these errors, because of asynchronousness.
Assignee | ||
Comment 21•10 years ago
|
||
I still don't see another preferable alternative. Do you?
Flags: needinfo?(roc)
Flags: needinfo?(matt.woodrow)
Comment 22•10 years ago
|
||
I think we should MOZ_ASSERT as well as returning the ipc error. We should never hit these errors when handling messages from a well behaved content process, so it must be a bug in our code and we should just die and make it easy to debug. In the release case the MOZ_ASSERT won't do anything, which is good because we could be dealing (potentially) with a rogue child process, and we don't want to let it take down the parent. If you manage to get owned while running a debug build, then I guess you lose. I don't think that's an issue.
Flags: needinfo?(matt.woodrow)
Assignee | ||
Comment 23•10 years ago
|
||
The problem with MOZ_ASSERT here is that that would prevent fuzzing debug builds, because in debug builds, under fuzzing, the parent process would abort. It is useful to be able to fuzz debug builds, not just release builds, because the crashes one gets from fuzzing debug builds are typically easier to understand and fix (it made my work a lot easier while fuzzing gfx protocols).
Assignee | ||
Comment 24•10 years ago
|
||
Also, notice that NS_ERROR will cause test harnesses to record test failures, so we will have some chance of noticing.
Comment 25•10 years ago
|
||
Ok, that's reasonable. It would be nice if we had another option, something like MOZ_ASSERT_UNLESS_THE_FUZZER_IS_RUNNING.
Assignee | ||
Comment 26•10 years ago
|
||
Yes, and that idea has come up in similar discussions on other fuzzing-related bugs. It might well happen. Meanwhile, NS_ERROR here is closest to the current NS_ASSERTION so it should be easiest to r+ ;-) Decide whether you want / can steal the reviews here ;-) Note that all 6 patches here are up for review, I just didn't bother flagging them all until we have some agreement on the general approach.
Assignee | ||
Updated•10 years ago
|
Attachment #8373805 -
Flags: review?(roc) → review?(matt.woodrow)
Assignee | ||
Updated•10 years ago
|
Attachment #8373810 -
Flags: review?(roc) → review?(matt.woodrow)
Assignee | ||
Updated•10 years ago
|
Attachment #8373805 -
Attachment description: Make ContainerLayer::InsertAfter always perform checks and return bool → Part 1: Make ContainerLayer::InsertAfter always perform checks and return bool
Assignee | ||
Updated•10 years ago
|
Attachment #8373810 -
Attachment description: Make LayerTransactionParent::RecvUpdate error out if InsertAfter fails → Part 2: Make LayerTransactionParent::RecvUpdate error out if InsertAfter fails
Assignee | ||
Updated•10 years ago
|
Attachment #8375963 -
Attachment description: Make ContainerLayer::RemoveChild always perform checks and return bool → Part 3: Make ContainerLayer::RemoveChild always perform checks and return bool
Assignee | ||
Updated•10 years ago
|
Attachment #8375963 -
Flags: review?(matt.woodrow)
Assignee | ||
Updated•10 years ago
|
Attachment #8375965 -
Attachment description: Make LayerTransactionParent::RecvUpdate error out if RemoveChild fails → Part 4: Make LayerTransactionParent::RecvUpdate error out if RemoveChild fails
Attachment #8375965 -
Flags: review?(matt.woodrow)
Assignee | ||
Updated•10 years ago
|
Attachment #8375966 -
Attachment is obsolete: true
Assignee | ||
Updated•10 years ago
|
Attachment #8375968 -
Attachment description: Make ContainerLayer::RepositionChild always perform checks and return bool → Part 5: Make ContainerLayer::RepositionChild always perform checks and return bool
Attachment #8375968 -
Flags: review?(matt.woodrow)
Assignee | ||
Comment 27•10 years ago
|
||
Attachment #8379071 -
Flags: review?(matt.woodrow)
Updated•10 years ago
|
Attachment #8373805 -
Flags: review?(matt.woodrow) → review+
Comment 28•10 years ago
|
||
How far back does this go and does this only matter in b2g? I'm trying to determine how far back we'll need to take patches. ESR24?
Assignee | ||
Comment 29•10 years ago
|
||
While the bug exists everywhere, it only is a security bug where we have separate untrusted client processes. So far, that is only B2G.
Updated•10 years ago
|
Attachment #8373810 -
Flags: review?(matt.woodrow) → review+
Updated•10 years ago
|
Attachment #8375963 -
Flags: review?(matt.woodrow) → review+
Updated•10 years ago
|
Attachment #8375965 -
Flags: review?(matt.woodrow) → review+
Updated•10 years ago
|
Attachment #8375968 -
Flags: review?(matt.woodrow) → review+
Updated•10 years ago
|
Attachment #8379071 -
Flags: review?(matt.woodrow) → review+
Assignee | ||
Updated•10 years ago
|
Flags: needinfo?(roc)
Assignee | ||
Comment 30•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=ebb61bda35d1
Assignee | ||
Comment 31•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=6d755127af30 The crashes on the previous try pushes were an... interesting bug in the Part 3 patch here. Before my patches, we have: virtual void RemoveChild(Layer* aChild) MOZ_OVERRIDE { NS_ASSERTION(ClientManager()->InConstruction(), "Can only set properties in construction phase"); ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), ClientManager()->Hold(aChild)); ContainerLayer::RemoveChild(aChild); } My part 3 patch here changes this as follows. Notice in particular how we now do the client-side RemoveChild operation _before_ we send a message to the parent. This is intentional: we don't want to propagate a bogus operation to the parent if we can detect already on the client side that it's bad. virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE { if (!ClientManager()->InConstruction()) { NS_ERROR("Can only set properties in construction phase"); return false; } if (!ContainerLayer::RemoveChild(aChild)) { return false; } ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), ClientManager()->Hold(aChild)); return true; } The problem though is that the client-side operation, ContainerLayer::RemoveChild(aChild), can result in the immediate destruction of aChild, since it's managed by reference counting. So the crashes on the previous tryserver run were use-after-free's right there, as that operation had destroyed aChild and we were in the next line, ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), ClientManager()->Hold(aChild)); The simple 1-line fix is to hold a temporary nsRefPtr to aChild in this function: + nsRefPtr<Layer> holdChild = aChild So the updated code is: virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE { if (!ClientManager()->InConstruction()) { NS_ERROR("Can only set properties in construction phase"); return false; } // Hold a strong reference to aChild to prevent it from being destroyed by // the client-side removal nsRefPtr<Layer> holdChild = aChild; // Remove the child on the client side, check if that worked before propagating // this change over IPC if (!ContainerLayer::RemoveChild(aChild)) { return false; } // OK, send IPC message. ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), ClientManager()->Hold(aChild)); return true; } I don't suppose, by default, that this change of 1 line plus some comments requires another review.
Assignee | ||
Comment 32•10 years ago
|
||
Oh, wait... is that exactly what this Hold() method is doing ? :-)
Assignee | ||
Comment 33•10 years ago
|
||
So, this is what RemoveChild looks like in the current version of my patch, which I'm about to land. virtual bool RemoveChild(Layer* aChild) MOZ_OVERRIDE { if (!ClientManager()->InConstruction()) { NS_ERROR("Can only set properties in construction phase"); return false; } // hold on to aChild before we remove it! ShadowableLayer *heldChild = ClientManager()->Hold(aChild); if (!ContainerLayer::RemoveChild(aChild)) { return false; } ClientManager()->AsShadowForwarder()->RemoveChild(ClientManager()->Hold(this), heldChild); return true; }
Assignee | ||
Comment 34•10 years ago
|
||
Comment on attachment 8373805 [details] [diff] [review] Part 1: Make ContainerLayer::InsertAfter always perform checks and return bool [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily at all. Sure, it's not too hard to understand from the patch that this is fixing a potential use-after-free. But then, these patches doesn't start telling anything about how this particular use-after-free is exploitable. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No. Which older supported branches are affected by this flaw? All of them. Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? No, and I don't think we want to backport anything, for the same reasons as on all these fuzzing-discovered bugs: most of our IPC protocols are wildly insecure at the moment, so we are and remain very insecure anyway, so it will only make sense backporting such changes once the majority of IPC protocols are fixed. How likely is this patch to cause regressions; how much testing does it need? Medium-risk. There's 6 patches here, it's not tiny.
Attachment #8373805 -
Flags: sec-approval?
Comment 35•10 years ago
|
||
We can call this "disabled" in Fx-desktop, so sec-approval+ to land on b2g without worrying what old bugs this might reveal in shipping releases. I can't see us backporting this to b2g18 so I'll just wontfix that now. b2g1.2 seems unlikely but I'll leave that one open for now.
status-b2g18:
--- → wontfix
status-b2g-v1.2:
--- → affected
status-b2g-v1.3:
--- → affected
status-firefox27:
--- → disabled
status-firefox28:
--- → disabled
status-firefox29:
--- → disabled
status-firefox-esr24:
--- → disabled
Updated•10 years ago
|
Attachment #8373805 -
Flags: sec-approval? → sec-approval+
Assignee | ||
Comment 36•10 years ago
|
||
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/b0f5a1cf6f8a remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/f1a9d225b8ee remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/d89cd4492922 remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/a471ffd2211f remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/1b20e5b1e9a5 remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/e9f9f941a9a7
Assignee: nobody → bjacob
https://hg.mozilla.org/mozilla-central/rev/b0f5a1cf6f8a https://hg.mozilla.org/mozilla-central/rev/f1a9d225b8ee https://hg.mozilla.org/mozilla-central/rev/d89cd4492922 https://hg.mozilla.org/mozilla-central/rev/a471ffd2211f https://hg.mozilla.org/mozilla-central/rev/1b20e5b1e9a5 https://hg.mozilla.org/mozilla-central/rev/e9f9f941a9a7
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla30
Updated•10 years ago
|
Comment 38•10 years ago
|
||
If we were going to backport these to 1.2 and 1.3, that time was a long time ago.
Comment 39•10 years ago
|
||
Marking [qa-] for desktop QA verification. FxOS QA may choose to verify at a later date.
Whiteboard: [qa-]
Updated•9 years ago
|
Group: core-security
You need to log in
before you can comment on or make changes to this bug.
Description
•