Closed
Bug 1177013
Opened 9 years ago
Closed 9 years ago
Cancel CPOWs instead of crashing content process
Categories
(Core :: IPC, defect)
Core
IPC
Tracking
()
RESOLVED
FIXED
mozilla42
People
(Reporter: billm, Assigned: billm)
References
Details
Attachments
(14 files)
1.58 KB,
patch
|
dvander
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
1.85 KB,
patch
|
mrbkap
:
review+
|
Details | Diff | Splinter Review |
1.58 KB,
patch
|
mrbkap
:
review+
|
Details | Diff | Splinter Review |
10.38 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
18.85 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
10.31 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
5.34 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
7.28 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
5.33 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
3.80 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
2.12 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
3.46 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
5.21 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
6.32 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
Right now there are a couple scenarios that cause us to crash the content process if someone does something special with a CPOW. Generally, these are: 1. Run a nested event loop while a CPOW is on stack. 2. Send a lower priority message while dispatching a higher priority message. This bug is about changing the behavior so we cancel the CPOW (causing it to throw) instead of crashing. Once that's done, we're allowed to send any kind of message we want, since a CPOW is no longer on the stack. We can also run a nested event loop, since we're no longer dispatching a CPOW. The basic idea is to add a new primitive, CancelCurrentTransaction. It causes any Send calls that are currently on stack in either process to fail. [Aside: One thing to note is that we have to cancel a whole transaction at a time. If we try to cancel just one CPOW out of a whole transaction, we'll get trouble. Imagine this scenario: Parent: sync msg M1 to child Child: receive M1, send sync msg M2 to parent Parent: receive M2, cancel M2, do more stuff At this point, the parent is still doing work. We can't force it to go back to waiting for the M1 reply since a bunch of functions may be on the stack in between receiving M2 and canceling it. At the same time, the child can start doing work once it's told that M2 was canceled (if it couldn't, then we couldn't allow stuff like nested event loops once the CPOW was canceled). Being in a situation where both sides can potentially send messages while there's still a sync message outstanding (M1) is bad because it causes issues with out-of-order replies (we have to deal with this for rpc messages and it causes a lot of complexity).] One tricky part here is that, since we have to cancel a whole transaction at a time, a normal-priority sync message from the child can also be canceled as part of the transaction. That's a big problem because we actually rely on them succeeding for Gecko to work properly. To get around that, I've written some precursor patches that make it impossible to use CPOWs in combination with normal-priority sync messages. This could cause some regressions, but I think they should be fixable.
Assignee | ||
Comment 1•9 years ago
|
||
I ran into a problem here, but I think I've found a way to sort it out. I have to wait for try server to know for sure. In the mean time, I'm going to post the patches that I know work.
Assignee | ||
Comment 2•9 years ago
|
||
As I mentioned, we no longer want the child to process CPOWs while it's waiting for a sync message reply. There are two parts to making that happen: 1. We need to make sure that the parent can't send a CPOW while it's answering a sync message. 2. We also need to deal with CPOWs that race with sync messages from the child. This patch deals with part 1. (I'm still sorting out how to do part 2.) It just returns an error if we try to send a CPOW while answering a sync message. It looks like we don't do this very much in the browser, and it was easy to fix the places where we do.
Attachment #8626209 -
Flags: review?(dvander)
Assignee | ||
Comment 3•9 years ago
|
||
Because of the previous patch, we now need to use sendRpcMessage instead of sendSyncMessage if we expect the parent to send CPOWs down while replying to the message. So we need to fix up the context menu code. I don't think this will cause any interesting changes in behavior.
Attachment #8626211 -
Flags: review?(mrbkap)
Assignee | ||
Updated•9 years ago
|
Attachment #8626209 -
Attachment description: patch1 → Don't allow sending CPOWs while dispatching sync msgs
Assignee | ||
Updated•9 years ago
|
Attachment #8626211 -
Attachment description: patch2 → Fix context menu
Assignee | ||
Comment 4•9 years ago
|
||
The installer tests work as follows: 1. When a web page asks to install an add-on (via installtrigger) we send a sync message to the parent. 2. The parent get the message and fires some callbacks, including this onNewInstall thing. 3. Once the install is complete (which the child hears about via a separate message from the parent), the web page is notified via an event. Our tests listen for onNewInstall and add an event listener on the content page for the install event. The install event listener is added using a CPOW, which is sent while we're still dispatching the sync message in the parent. The CPOW now fails because of the first patch here. I just worked around this by listening for the event with a frame script. The one danger here is that the event might fire before the frame script is received in the child. However, the event fires based on an "install complete" message that will always be sent after the frame script is sent.
Attachment #8626214 -
Flags: review?(mrbkap)
Assignee | ||
Comment 5•9 years ago
|
||
The first thing this patch does is convert a bunch of sendSyncMessage calls in test_cpows to sendRpcMessage. I had to do that for the same reason as the context menu patch above. I also found a lot of bugs in test_cpows where it starts running one test before the previous test has finished. So those are fixed too. Finally, I had to remove nested_sync_test since it's no longer supported.
Attachment #8626215 -
Flags: review?(dvander)
Assignee | ||
Comment 6•9 years ago
|
||
This is the main patch. It adds CancelCurrentTransaction() to the channel. It operates by setting mCurrentTransaction to 0. Then, Send() calls check if the current transaction has been set to 0 after doing anything that could cancel, and they return an error. The other side of the channel is also informed of cancellation. It also sets mCurrentTransaction to 0 and operates similarly.
Attachment #8626217 -
Flags: review?(dvander)
Assignee | ||
Comment 7•9 years ago
|
||
This patch uses CancelCurrentTransaction in all the places where we currently crash due to some sort of CPOW error.
Attachment #8626218 -
Flags: review?(dvander)
Assignee | ||
Comment 8•9 years ago
|
||
Forgot about this one... it also deals with probalems caused by the first patch here. It fixes IPDL tests to deal with the fact that we no longer allow sending CPOWs while dispatching sync messages.
Attachment #8626219 -
Flags: review?(dvander)
Updated•9 years ago
|
Attachment #8626211 -
Flags: review?(mrbkap) → review+
Updated•9 years ago
|
Attachment #8626214 -
Flags: review?(mrbkap) → review+
Attachment #8626209 -
Flags: review?(dvander) → review+
Attachment #8626215 -
Flags: review?(dvander) → review+
Comment 9•9 years ago
|
||
Can you add a telemetry counter for the cases where we cancel, so that we can monitor this over time and correlate against addons?
Comment on attachment 8626217 [details] [diff] [review] add CancelCurrentTransaction Review of attachment 8626217 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/glue/MessageChannel.cpp @@ +646,5 @@ > mTimedOutMessageSeqno = 0; > return; > } > > + if (aMsg.transaction_id() != mCurrentTransaction) { It might be good to wrap this and other such sites in something like: if (WasTransactionCancelled(aMsg.transaction_id()) {
Attachment #8626217 -
Flags: review?(dvander) → review+
Attachment #8626218 -
Flags: review?(dvander) → review+
Attachment #8626219 -
Flags: review?(dvander) → review+
Assignee | ||
Comment 11•9 years ago
|
||
This is the patch we discussed today. It makes it so that, upon receiving a sync message, we don't do this: * set the current transaction ID to the message's ID * release the lock * dispatch the message * acquire the lock * send the reply * release the lock * acquire the lock Instead we only drop the lock once and re-acquire it once.
Attachment #8628065 -
Flags: review?(dvander)
Assignee | ||
Comment 12•9 years ago
|
||
As we discussed, I still can't figure out how to handle this once case. So we will still crash intentionally there.
Attachment #8628068 -
Flags: review?(dvander)
Assignee | ||
Comment 13•9 years ago
|
||
Attachment #8628069 -
Flags: review?(dvander)
Assignee | ||
Comment 14•9 years ago
|
||
Attachment #8628070 -
Flags: review?(dvander)
Assignee | ||
Comment 15•9 years ago
|
||
I noticed this memory leak while doing more testing. If we |return false| at the top of MessageChannel::Send, we leak the Message. We already use an nsAutoPtr lower down, so I just moved that up.
Attachment #8628508 -
Flags: review?(dvander)
Attachment #8628065 -
Flags: review?(dvander) → review+
Comment on attachment 8628068 [details] [diff] [review] crash in bad situations Review of attachment 8628068 [details] [diff] [review]: ----------------------------------------------------------------- ::: ipc/glue/MessageChannel.cpp @@ +797,5 @@ > + return false; > + } > + > + // This isn't an assert so much as an intentional crash because we're in a > + // situation that we don't know how to recover from: The parent is awaiting s/parent/child/
Attachment #8628068 -
Flags: review?(dvander) → review+
Attachment #8628069 -
Flags: review?(dvander) → review+
Attachment #8628070 -
Flags: review?(dvander) → review+
Attachment #8628508 -
Flags: review?(dvander) → review+
Comment 17•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/bff9f07dad52 https://hg.mozilla.org/integration/mozilla-inbound/rev/d0f5a3474659 https://hg.mozilla.org/integration/mozilla-inbound/rev/e14a7b70b6fa https://hg.mozilla.org/integration/mozilla-inbound/rev/f5b52fa19511 https://hg.mozilla.org/integration/mozilla-inbound/rev/978a77b60f2a https://hg.mozilla.org/integration/mozilla-inbound/rev/de79eba232f0 https://hg.mozilla.org/integration/mozilla-inbound/rev/f7de893911bc https://hg.mozilla.org/integration/mozilla-inbound/rev/4d7f5205b60b https://hg.mozilla.org/integration/mozilla-inbound/rev/e6bf35115c11 https://hg.mozilla.org/integration/mozilla-inbound/rev/2af18bef5703 https://hg.mozilla.org/integration/mozilla-inbound/rev/3b6448172e50 https://hg.mozilla.org/integration/mozilla-inbound/rev/912aae0815f8
Comment 18•9 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/0d06cc55aed1 on suspicion of causing b2g emulator debug test bustage like https://treeherder.mozilla.org/logviewer.html#?job_id=11338434&repo=mozilla-inbound, though only time will tell whether or not it was you, since you landed on top of some other debug bustage so either it'll be all better on the backout and you're guilty, or it won't be and you'll be an innocent bystander. And dead, but innocent dead.
Comment 19•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/c9abe8523b1a https://hg.mozilla.org/integration/mozilla-inbound/rev/d748fc4b5266 https://hg.mozilla.org/integration/mozilla-inbound/rev/a657fbc60968 https://hg.mozilla.org/integration/mozilla-inbound/rev/c192d46ef789 https://hg.mozilla.org/integration/mozilla-inbound/rev/1675f613a6c2 https://hg.mozilla.org/integration/mozilla-inbound/rev/68877a48bbb6 https://hg.mozilla.org/integration/mozilla-inbound/rev/e1751a40ce0b https://hg.mozilla.org/integration/mozilla-inbound/rev/c72080be5876 https://hg.mozilla.org/integration/mozilla-inbound/rev/b201e1d5883f https://hg.mozilla.org/integration/mozilla-inbound/rev/609e206fc5c0 https://hg.mozilla.org/integration/mozilla-inbound/rev/c32ba604ec1b https://hg.mozilla.org/integration/mozilla-inbound/rev/42effec49f3f
https://hg.mozilla.org/mozilla-central/rev/c9abe8523b1a https://hg.mozilla.org/mozilla-central/rev/d748fc4b5266 https://hg.mozilla.org/mozilla-central/rev/a657fbc60968 https://hg.mozilla.org/mozilla-central/rev/c192d46ef789 https://hg.mozilla.org/mozilla-central/rev/1675f613a6c2 https://hg.mozilla.org/mozilla-central/rev/68877a48bbb6 https://hg.mozilla.org/mozilla-central/rev/e1751a40ce0b https://hg.mozilla.org/mozilla-central/rev/c72080be5876 https://hg.mozilla.org/mozilla-central/rev/b201e1d5883f https://hg.mozilla.org/mozilla-central/rev/609e206fc5c0 https://hg.mozilla.org/mozilla-central/rev/c32ba604ec1b https://hg.mozilla.org/mozilla-central/rev/42effec49f3f
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox42:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Assignee | ||
Comment 21•9 years ago
|
||
This didn't seem to fix the crashes. I'll need to investigate more.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 22•9 years ago
|
||
I made a couple mistakes in the above patches. Here are the tests I should have written in the first place. (The IPDL tests are nice, but these tests actually exercise cancellation on the MessageChannel::Send path.)
Attachment #8633702 -
Flags: review?(dvander)
Assignee | ||
Comment 23•9 years ago
|
||
I'll comment inline about what all this is for.
Attachment #8633704 -
Flags: review?(dvander)
Attachment #8633702 -
Flags: review?(dvander) → review+
Assignee | ||
Comment 24•9 years ago
|
||
Comment on attachment 8633704 [details] [diff] [review] bug fixes Review of attachment 8633704 [details] [diff] [review]: ----------------------------------------------------------------- Before landing, I discovered some problems with races because (1) CancelCurrentTransaction can run on the I/O thread when it receives a CancelMessage, and (2) CancelCurrentTransaction was writing to mDispatchingSyncMessage, which is not protected by any lock. mDispatchingSyncMessage is unprotected because it's written to by DispatchSyncMessage during the phase where it has already dropped locks. All other uses are on the worker thread, so this was safe until now. I tried to fix this by eliminating the write to mDispatchingSyncMessage inside CancelCurrentTransaction. My thinking was that setting the current transaction to 0 was enough to cancel everything. That was mostly true, except that I forgot about the assertions. I think we can fix those by also checking mCurrentTransaction first; if it's zero, we can behave as if that mDispatchingSyncMessage is 0. ::: ipc/glue/MessageChannel.cpp @@ +866,5 @@ > // again. > return false; > } > > + if (mCurrentTransaction && Adding the mCurrentTransaction check to cover the case where we canceled. @@ -877,5 @@ > > if (mCurrentTransaction && > (msg->priority() < DispatchingSyncMessagePriority() || > - mAwaitingSyncReplyPriority > msg->priority() || > - DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT || This check is redundant I think. If DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT, then it must also be true that: 1. we're in the parent, since only the parent dispatches urgent messages; also, 2. the message we're sending is not urgent, since the parent is not allowed to send urgent messages. So it must be true that msg->priority() < DispatchingSyncMessagePriority() since any non-urgent message has priority less than urgent. @@ -878,5 @@ > if (mCurrentTransaction && > (msg->priority() < DispatchingSyncMessagePriority() || > - mAwaitingSyncReplyPriority > msg->priority() || > - DispatchingSyncMessagePriority() == IPC::Message::PRIORITY_URGENT || > - DispatchingAsyncMessagePriority() == IPC::Message::PRIORITY_URGENT)) I realized that canceling doesn't make any sense in this case. Async messages aren't part of transactions, so there's nothing we can cancel. Instead, we'll just crash below. @@ +888,2 @@ > > + if (mCurrentTransaction) { The assertions on the left got folded into here, but now they're only done if we haven't canceled the current transaction. @@ -1318,5 @@ > > - IPC_ASSERT(prio >= DispatchingSyncMessagePriority(), > - "priority inversion while dispatching sync message"); > - IPC_ASSERT(prio >= mAwaitingSyncReplyPriority, > - "dispatching a message of lower priority while waiting for a response"); I think these checks are redundant with the ones in Send. Basically, if you never have priority inversions when sending, I don't think you can have them when receiving either. @@ +2018,5 @@ > // avoid that, these RAII classes check if the variable they set has been > // tampered with (by us). If so, they don't reset the variable to the old > // value. > > + MOZ_ASSERT(mCurrentTransaction); This was really stupid. @@ +2023,4 @@ > mCurrentTransaction = 0; > + > + mAwaitingSyncReply = false; > + mAwaitingSyncReplyPriority = 0; Unlike mDispatchingSyncMessage, mAwaitingSyncReply is protected by the MessageChannel monitor, so we can assign to it here. For some reason I thought that wasn't necessary. But looking again I think it's important for ShouldDeferIncomingMessage. @@ +2034,5 @@ > bool > MessageChannel::CancelCurrentTransaction(bool onlyHighPrio) > { > MonitorAutoLock lock(*mMonitor); > + if (mCurrentTransaction && Same deal as above: we only want to do this if we haven't already canceled.
Attachment #8633704 -
Flags: review?(dvander) → review+
Comment 25•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/18d6d206080c https://hg.mozilla.org/integration/mozilla-inbound/rev/4be0dec068cb
Comment 26•9 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/18d6d206080c https://hg.mozilla.org/mozilla-central/rev/4be0dec068cb
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Assignee | ||
Comment 27•9 years ago
|
||
Comment on attachment 8626209 [details] [diff] [review] Don't allow sending CPOWs while dispatching sync msgs The patches in this bug appear to fix a lot of crashes in e10s. We'd like to uplift. Approval Request Comment [Feature/regressing bug #]: e10s [User impact if declined]: crashes [Describe test coverage new/current, TreeHerder]: Has been on Nightly for the last few days. [Risks and why]: The code is mostly e10s only, but there are a lot of patches so there is some risk. [String/UUID change made/needed]: none
Attachment #8626209 -
Flags: approval-mozilla-aurora?
Comment on attachment 8626209 [details] [diff] [review] Don't allow sending CPOWs while dispatching sync msgs Patches have been in m-c for a few days so should be safe.
Attachment #8626209 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
status-firefox41:
--- → affected
Comment 29•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/32692b459f37 https://hg.mozilla.org/releases/mozilla-aurora/rev/66de4a3517ef https://hg.mozilla.org/releases/mozilla-aurora/rev/b7045505b377 https://hg.mozilla.org/releases/mozilla-aurora/rev/28d6a227814a https://hg.mozilla.org/releases/mozilla-aurora/rev/bdc01d55b974 https://hg.mozilla.org/releases/mozilla-aurora/rev/e4e856f1d56f https://hg.mozilla.org/releases/mozilla-aurora/rev/a237339bf4c8 https://hg.mozilla.org/releases/mozilla-aurora/rev/231a13d3f989 https://hg.mozilla.org/releases/mozilla-aurora/rev/8143f84d76c7 https://hg.mozilla.org/releases/mozilla-aurora/rev/d48ec38732e2 https://hg.mozilla.org/releases/mozilla-aurora/rev/a65955dbafdb https://hg.mozilla.org/releases/mozilla-aurora/rev/f9828f111585 https://hg.mozilla.org/releases/mozilla-aurora/rev/e7a487a3cc1a https://hg.mozilla.org/releases/mozilla-aurora/rev/0bdd39b4a6db
Comment 31•9 years ago
|
||
I think this needs reopened. https://crash-stats.mozilla.com/report/index/e69112be-c817-4644-8430-71f292150826 is a test crash of 1187448 which is marked as a duplicate.
You need to log in
before you can comment on or make changes to this bug.
Description
•