Closed Bug 1177013 Opened 5 years ago Closed 4 years ago

Cancel CPOWs instead of crashing content process

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox41 --- fixed
firefox42 --- fixed

People

(Reporter: billm, Assigned: billm)

References

Details

Attachments

(14 files)

1.58 KB, patch
dvander
: review+
Details | Diff | Splinter Review
1.85 KB, patch
mrbkap
: review+
Details | Diff | Splinter Review
1.58 KB, patch
mrbkap
: review+
Details | Diff | Splinter Review
10.38 KB, patch
dvander
: review+
Details | Diff | Splinter Review
18.85 KB, patch
dvander
: review+
Details | Diff | Splinter Review
10.31 KB, patch
dvander
: review+
Details | Diff | Splinter Review
5.34 KB, patch
dvander
: review+
Details | Diff | Splinter Review
7.28 KB, patch
dvander
: review+
Details | Diff | Splinter Review
5.33 KB, patch
dvander
: review+
Details | Diff | Splinter Review
3.80 KB, patch
dvander
: review+
Details | Diff | Splinter Review
2.12 KB, patch
dvander
: review+
Details | Diff | Splinter Review
3.46 KB, patch
dvander
: review+
Details | Diff | Splinter Review
5.21 KB, patch
dvander
: review+
Details | Diff | Splinter Review
6.32 KB, patch
dvander
: review+
Details | Diff | Splinter Review
Right now there are a couple scenarios that cause us to crash the content process if someone does something special with a CPOW. Generally, these are:
1. Run a nested event loop while a CPOW is on stack.
2. Send a lower priority message while dispatching a higher priority message.

This bug is about changing the behavior so we cancel the CPOW (causing it to throw) instead of crashing. Once that's done, we're allowed to send any kind of message we want, since a CPOW is no longer on the stack. We can also run a nested event loop, since we're no longer dispatching a CPOW.

The basic idea is to add a new primitive, CancelCurrentTransaction. It causes any Send calls that are currently on stack in either process to fail.

[Aside: One thing to note is that we have to cancel a whole transaction at a time. If we try to cancel just one CPOW out of a whole transaction, we'll get trouble. Imagine this scenario:

Parent: sync msg M1 to child
Child: receive M1, send sync msg M2 to parent
Parent: receive M2, cancel M2, do more stuff

At this point, the parent is still doing work. We can't force it to go back to waiting for the M1 reply since a bunch of functions may be on the stack in between receiving M2 and canceling it. At the same time, the child can start doing work once it's told that M2 was canceled (if it couldn't, then we couldn't allow stuff like nested event loops once the CPOW was canceled). Being in a situation where both sides can potentially send messages while there's still a sync message outstanding (M1) is bad because it causes issues with out-of-order replies (we have to deal with this for rpc messages and it causes a lot of complexity).]

One tricky part here is that, since we have to cancel a whole transaction at a time, a normal-priority sync message from the child can also be canceled as part of the transaction. That's a big problem because we actually rely on them succeeding for Gecko to work properly.

To get around that, I've written some precursor patches that make it impossible to use CPOWs in combination with normal-priority sync messages. This could cause some regressions, but I think they should be fixable.
I ran into a problem here, but I think I've found a way to sort it out. I have to wait for try server to know for sure. In the mean time, I'm going to post the patches that I know work.
As I mentioned, we no longer want the child to process CPOWs while it's waiting for a sync message reply. There are two parts to making that happen:

1. We need to make sure that the parent can't send a CPOW while it's answering a sync message.
2. We also need to deal with CPOWs that race with sync messages from the child.

This patch deals with part 1. (I'm still sorting out how to do part 2.) It just returns an error if we try to send a CPOW while answering a sync message. It looks like we don't do this very much in the browser, and it was easy to fix the places where we do.
Attachment #8626209 - Flags: review?(dvander)
Attached patch Fix context menuSplinter Review
Because of the previous patch, we now need to use sendRpcMessage instead of sendSyncMessage if we expect the parent to send CPOWs down while replying to the message. So we need to fix up the context menu code. I don't think this will cause any interesting changes in behavior.
Attachment #8626211 - Flags: review?(mrbkap)
Attachment #8626209 - Attachment description: patch1 → Don't allow sending CPOWs while dispatching sync msgs
Attachment #8626211 - Attachment description: patch2 → Fix context menu
The installer tests work as follows:
1. When a web page asks to install an add-on (via installtrigger) we send a sync message to the parent.
2. The parent get the message and fires some callbacks, including this onNewInstall thing.
3. Once the install is complete (which the child hears about via a separate message from the parent), the web page is notified via an event.

Our tests listen for onNewInstall and add an event listener on the content page for the install event. The install event listener is added using a CPOW, which is sent while we're still dispatching the sync message in the parent. The CPOW now fails because of the first patch here.

I just worked around this by listening for the event with a frame script. The one danger here is that the event might fire before the frame script is received in the child. However, the event fires based on an "install complete" message that will always be sent after the frame script is sent.
Attachment #8626214 - Flags: review?(mrbkap)
The first thing this patch does is convert a bunch of sendSyncMessage calls in test_cpows to sendRpcMessage. I had to do that for the same reason as the context menu patch above.

I also found a lot of bugs in test_cpows where it starts running one test before the previous test has finished. So those are fixed too.

Finally, I had to remove nested_sync_test since it's no longer supported.
Attachment #8626215 - Flags: review?(dvander)
This is the main patch. It adds CancelCurrentTransaction() to the channel. It operates by setting mCurrentTransaction to 0. Then, Send() calls check if the current transaction has been set to 0 after doing anything that could cancel, and they return an error.

The other side of the channel is also informed of cancellation. It also sets mCurrentTransaction to 0 and operates similarly.
Attachment #8626217 - Flags: review?(dvander)
This patch uses CancelCurrentTransaction in all the places where we currently crash due to some sort of CPOW error.
Attachment #8626218 - Flags: review?(dvander)
Forgot about this one... it also deals with probalems caused by the first patch here. It fixes IPDL tests to deal with the fact that we no longer allow sending CPOWs while dispatching sync messages.
Attachment #8626219 - Flags: review?(dvander)
Attachment #8626211 - Flags: review?(mrbkap) → review+
Attachment #8626214 - Flags: review?(mrbkap) → review+
Attachment #8626209 - Flags: review?(dvander) → review+
Attachment #8626215 - Flags: review?(dvander) → review+
Can you add a telemetry counter for the cases where we cancel, so that we can monitor this over time and correlate against addons?
Comment on attachment 8626217 [details] [diff] [review]
add CancelCurrentTransaction

Review of attachment 8626217 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +646,5 @@
>              mTimedOutMessageSeqno = 0;
>              return;
>          }
>  
> +        if (aMsg.transaction_id() != mCurrentTransaction) {

It might be good to wrap this and other such sites in something like:
        
        if (WasTransactionCancelled(aMsg.transaction_id()) {
Attachment #8626217 - Flags: review?(dvander) → review+
Attachment #8626218 - Flags: review?(dvander) → review+
Attachment #8626219 - Flags: review?(dvander) → review+
This is the patch we discussed today. It makes it so that, upon receiving a sync message, we don't do this:
  * set the current transaction ID to the message's ID
  * release the lock
  * dispatch the message
  * acquire the lock
  * send the reply
  * release the lock
  * acquire the lock

Instead we only drop the lock once and re-acquire it once.
Attachment #8628065 - Flags: review?(dvander)
As we discussed, I still can't figure out how to handle this once case. So we will still crash intentionally there.
Attachment #8628068 - Flags: review?(dvander)
Attached patch fix memory leakSplinter Review
I noticed this memory leak while doing more testing. If we |return false| at the top of MessageChannel::Send, we leak the Message. We already use an nsAutoPtr lower down, so I just moved that up.
Attachment #8628508 - Flags: review?(dvander)
Attachment #8628065 - Flags: review?(dvander) → review+
Comment on attachment 8628068 [details] [diff] [review]
crash in bad situations

Review of attachment 8628068 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +797,5 @@
> +        return false;
> +    }
> +
> +    // This isn't an assert so much as an intentional crash because we're in a
> +    // situation that we don't know how to recover from: The parent is awaiting

s/parent/child/
Attachment #8628068 - Flags: review?(dvander) → review+
Attachment #8628069 - Flags: review?(dvander) → review+
Attachment #8628070 - Flags: review?(dvander) → review+
Attachment #8628508 - Flags: review?(dvander) → review+
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/0d06cc55aed1 on suspicion of causing b2g emulator debug test bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=11338434&repo=mozilla-inbound, though only time will tell whether or not it was you, since you landed on top of some other debug bustage so either it'll be all better on the backout and you're guilty, or it won't be and you'll be an innocent bystander. And dead, but innocent dead.
This didn't seem to fix the crashes. I'll need to investigate more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch more testsSplinter Review
I made a couple mistakes in the above patches. Here are the tests I should have written in the first place. (The IPDL tests are nice, but these tests actually exercise cancellation on the MessageChannel::Send path.)
Attachment #8633702 - Flags: review?(dvander)
Attached patch bug fixesSplinter Review
I'll comment inline about what all this is for.
Attachment #8633704 - Flags: review?(dvander)
Attachment #8633702 - Flags: review?(dvander) → review+
Comment on attachment 8633704 [details] [diff] [review]
bug fixes

Review of attachment 8633704 [details] [diff] [review]:
-----------------------------------------------------------------

Before landing, I discovered some problems with races because (1) CancelCurrentTransaction can run on the I/O thread when it receives a CancelMessage, and (2) CancelCurrentTransaction was writing to mDispatchingSyncMessage, which is not protected by any lock. mDispatchingSyncMessage is unprotected because it's written to by DispatchSyncMessage during the phase where it has already dropped locks. All other uses are on the worker thread, so this was safe until now.

I tried to fix this by eliminating the write to mDispatchingSyncMessage inside CancelCurrentTransaction. My thinking was that setting the current transaction to 0 was enough to cancel everything. That was mostly true, except that I forgot about the assertions. I think we can fix those by also checking mCurrentTransaction first; if it's zero, we can behave as if that mDispatchingSyncMessage is 0.

::: ipc/glue/MessageChannel.cpp
@@ +866,5 @@
>          // again.
>          return false;
>      }
>  
> +    if (mCurrentTransaction &&

Adding the mCurrentTransaction check to cover the case where we canceled.

@@ -877,5 @@
>  
>      if (mCurrentTransaction &&
>          (msg->priority() < DispatchingSyncMessagePriority() ||
> -         mAwaitingSyncReplyPriority > msg->priority() ||
> -         DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT ||

This check is redundant I think. If DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT, then it must also be true that:
1. we're in the parent, since only the parent dispatches urgent messages; also,
2. the message we're sending is not urgent, since the parent is not allowed to send urgent messages.

So it must be true that msg->priority() < DispatchingSyncMessagePriority() since any non-urgent message has priority less than urgent.

@@ -878,5 @@
>      if (mCurrentTransaction &&
>          (msg->priority() < DispatchingSyncMessagePriority() ||
> -         mAwaitingSyncReplyPriority > msg->priority() ||
> -         DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT ||
> -         DispatchingAsyncMessagePriority() == IPC::Message::PRIORITY_URGENT))

I realized that canceling doesn't make any sense in this case. Async messages aren't part of transactions, so there's nothing we can cancel. Instead, we'll just crash below.

@@ +888,2 @@
>  
> +    if (mCurrentTransaction) {

The assertions on the left got folded into here, but now they're only done if we haven't canceled the current transaction.

@@ -1318,5 @@
>  
> -    IPC_ASSERT(prio >= DispatchingSyncMessagePriority(),
> -               "priority inversion while dispatching sync message");
> -    IPC_ASSERT(prio >= mAwaitingSyncReplyPriority,
> -               "dispatching a message of lower priority while waiting for a response");

I think these checks are redundant with the ones in Send. Basically, if you never have priority inversions when sending, I don't think you can have them when receiving either.

@@ +2018,5 @@
>      // avoid that, these RAII classes check if the variable they set has been
>      // tampered with (by us). If so, they don't reset the variable to the old
>      // value.
>  
> +    MOZ_ASSERT(mCurrentTransaction);

This was really stupid.

@@ +2023,4 @@
>      mCurrentTransaction = 0;
> +
> +    mAwaitingSyncReply = false;
> +    mAwaitingSyncReplyPriority = 0;

Unlike mDispatchingSyncMessage, mAwaitingSyncReply is protected by the MessageChannel monitor, so we can assign to it here. For some reason I thought that wasn't necessary. But looking again I think it's important for ShouldDeferIncomingMessage.

@@ +2034,5 @@
>  bool
>  MessageChannel::CancelCurrentTransaction(bool onlyHighPrio)
>  {
>      MonitorAutoLock lock(*mMonitor);
> +    if (mCurrentTransaction &&

Same deal as above: we only want to do this if we haven't already canceled.
Attachment #8633704 - Flags: review?(dvander) → review+
https://hg.mozilla.org/mozilla-central/rev/18d6d206080c
https://hg.mozilla.org/mozilla-central/rev/4be0dec068cb
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Comment on attachment 8626209 [details] [diff] [review]
Don't allow sending CPOWs while dispatching sync msgs

The patches in this bug appear to fix a lot of crashes in e10s. We'd like to uplift.

Approval Request Comment
[Feature/regressing bug #]: e10s
[User impact if declined]: crashes
[Describe test coverage new/current, TreeHerder]: Has been on Nightly for the last few days.
[Risks and why]: The code is mostly e10s only, but there are a lot of patches so there is some risk.
[String/UUID change made/needed]: none
Attachment #8626209 - Flags: approval-mozilla-aurora?
Comment on attachment 8626209 [details] [diff] [review]
Don't allow sending CPOWs while dispatching sync msgs

Patches have been in m-c for a few days so should be safe.
Attachment #8626209 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Blocks: 1188605
Duplicate of this bug: 1187448
I think this needs reopened. https://crash-stats.mozilla.com/report/index/e69112be-c817-4644-8430-71f292150826 is a test crash of 1187448 which is marked as a duplicate.
You need to log in before you can comment on or make changes to this bug.