Closed Bug 1240985 Opened 8 years ago Closed 8 years ago

Fix bugs found by IPC fuzzer

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla47

Tracking Flags:

Tracking

Status

firefox45

---

fixed

firefox46

---

fixed

firefox47

---

fixed

People

(Reporter: billm, Assigned: billm)

References

Details

(Whiteboard: [e10s-45-uplift])

Attachments

(17 files)

fix for chromium ipc bug in glib message pump 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 3.65 KB, patch	jld : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
logging 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 7.41 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
fix cancel during timeout 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 1.65 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Fix bug where mAwaitingSyncReply can be overwritten in Send after Cancel 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 1.32 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Cancel racy CPOW messages 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 8.48 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
return false from Send when dispatching urgent messages 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 2.24 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
don't return MsgNotAllowed 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 2.03 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
Make sure to queue mDequeueOneTasks after cancellation 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 1.82 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+	Details \| Diff \| Splinter Review
crash if cancelling during urgent dispatch 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 1.47 KB, patch	dvander : review+ Sylvestre : approval-mozilla-aurora+ Sylvestre : approval-mozilla-beta+	Details \| Diff \| Splinter Review
add mLastError field 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 11.27 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Fix some timeout/cancel interactions 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 7.98 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Crash sooner when cancelling racy messages 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 9.43 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Hold off processing some messages during timeout 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 7.93 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Always enqueue dispatch task on message arrival 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 2.06 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
patch 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 6.80 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Check WasTransactionCanceled after timeout 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 1.90 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
fuzzer code (for testing only) 8 years ago Bill McCloskey [inactive unless it's an emergency] (:billm) 17.59 KB, patch	gkrizsanits : review+	Details \| Diff \| Splinter Review

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Description

•

8 years ago

I've been working on a MessageChannel fuzzer for the last two weeks. It basically just randomly sends/receives messages of different priorities and sometimes cancels them or times them out. It has found a number of bugs, almost all of them having to do with CPOW cancellation. I'm going to post patches here.

Working on this, I'm also trying to eliminate some of the weird edge cases in the code. Besides bug fixes, I'm going to make a few policy changes:

1. Right now, if a CPOW from the parent races with a sync message from the child, we dispatch the CPOW in the child first. This causes all sorts of weird issues. Most importantly, if we then cancel the CPOW, it's very difficult to know what to do with the sync message from the child, so we crash. To avoid this, and to simplify the possible interleavings, I've decided that we should just cancel any CPOWs that race with sync messages from the child. It makes CPOWs a little less reliable, but they're already a kind of "best effort" mechanism, so I think that's okay. And it's better than crashing later on.

2. Currently if we time out a CPOW, we refuse to respond to any sync messages from the child until we get a response to the timed out message. The comment that justifies this no longer makes sense to me and I can't think of any reason why it's needed. I think the need for it may have been eliminated when we added cancellation. In any case, I think we should remove this and allow these messages to be dispatched.

I feel a lot more confident making these changes now that we have the fuzzer. It seems to be really good at exposing the kinds of races that were very hard to track down otherwise.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 1

•

8 years ago

Attached patch fix for chromium ipc bug in glib message pump — Details — Splinter Review

This is sort of an odd bug. The code uses a Unix pipe to signal to the glib event loop that it needs to wake up. There's a comment that claims:

   // We should only ever have a single message on the wakeup pipe, since we
   // are only signaled when the queue went from empty to non-empty.

As far as I can tell, this isn't true. We call MessagePumpForUI::ScheduleWork unconditionally from here:
https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#335

This is a problem because if we write too many bytes to the pipe without reading them, the OS causes the write calls to block. That, in turn, can lead to a deadlock.

To solve this, I added a flag to ensure that we only write to the pipe if it's empty.

Attachment #8709769 - Flags: review?(jld)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 2

•

8 years ago

Attached patch logging — Details — Splinter Review

Fixing these bugs would have been very difficult without logging. The later patches also include more logging.

Attachment #8709772 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 3

•

8 years ago

Attached patch fix cancel during timeout — Details — Splinter Review

This fixes a somewhat minor bug that happens if the child cancels a CPOW that the parent timed out. In that case we want to end the timeout rather than trying to cancel anything (since the transaction was already effectively cancelled by the parent).

Attachment #8709773 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 4

•

8 years ago

Attached patch Fix bug where mAwaitingSyncReply can be overwritten in Send after Cancel — Details — Splinter Review

The way the cancellation code works, several state variables (mCurrentTransaction and mAwaitingSyncReply) are overwritten when we cancel. These variables are normally set/unset in a stacky way using the RAII classes AutoEnterTransaction and AutoSetValue. The sequence normally looks like this:

On entering Send:
  set mCurrentTransaction, mAwaitingSyncReply
  send and wait for reply
  set mCurrentTransaction, mAwaitingSyncReply to their previous values

However, if we cancel while waiting for the reply, we *don't* want to restore them to their original values. The AutoEnterTransaction destructor was fixed a while ago to handle this, but somehow I didn't realize I also needed to fix AutoSetValue. This patch does that.

Attachment #8709774 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 5

•

8 years ago

Attached patch Cancel racy CPOW messages — Details — Splinter Review

This patch implements the first policy change I mentioned in comment 0: racy CPOWs are cancelled instead of being executed. It also removes the intentional crash that was needed if we cancel during a racy CPOW. And it fixes AutoEnterTransaction to have a more precise assertion about how transaction IDs nest now that we can't have these races.

Finally, it fixes an unrelated problem where we need to call WasTransactionCanceled during ProcessPendingRequests. (This should really be a separate patch but I had trouble splitting it out.) The comment hopefully explains.

Attachment #8709775 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 6

•

8 years ago

Attached patch return false from Send when dispatching urgent messages — Details — Splinter Review

This is kind of another policy change. We're never supposed to send anything when dispatching urgent messages. Right now, we do a mix of cancelling (when dispatching sync urgent messages) and crashing (when dispatching async urgent messages) in this situation. However, I realized that we can just return false when doing the Send. This seems like a much better solution.

Attachment #8709776 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 7

•

8 years ago

Attached patch don't return MsgNotAllowed — Details — Splinter Review

Like I said in comment 0, I don't think we need to do this anymore.

Attachment #8709777 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 8

•

8 years ago

Attached patch Make sure to queue mDequeueOneTasks after cancellation — Details — Splinter Review

This fixes another cancellation bug. Let's say we're doing a sync Send and we receive a message. In some cases we want to process that message during the Send. But before the worker thread gets a chance to do that, we receive a cancel message for the transaction. In that case, the Send call returns immediately without processing the message that was received. In order to process it, we need to make sure to queue an OnMaybeDequeueOneTask. This patch does that.

Attachment #8709779 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 9

•

8 years ago

Attached patch crash if cancelling during urgent dispatch — Details — Splinter Review

The fuzzer also found a problem if you cancel while dispatching an urgent message. I couldn't figure out how to handle that, so I added this crash. Hopefully it will never happen. (We only cancel when we run a nested event loop, so I hope that never happens.)

Attachment #8709781 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 10

•

8 years ago

Oh, one thing I forgot is that I started attaching a transaction ID to the cancel messages so that we can make assertions about it. That part is kind of mixed into several of these patches. Sorry.

Also, I'll get the actual fuzzer posted tomorrow.

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709772 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709773 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709774 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 11

•

8 years ago

Comment on attachment 8709775 [details] [diff] [review]
Cancel racy CPOW messages

Review of attachment 8709775 [details] [diff] [review]:
-----------------------------------------------------------------

I like this a lot more than crashing. Do we know if that condition occurred a lot in the wild?

::: ipc/glue/MessageChannel.cpp
@@ +806,5 @@
> +
> +            if (mSide == ChildSide && msg.is_sync() && msg.transaction_id() != mCurrentTransaction) {
> +                // A message has arrived that is racing with our own message. It
> +                // must be a CPOW message since those are the only kind of sync
> +                // messages that parent can send. We just cancel it. This avoids

"the parent"

@@ +820,5 @@
> +            }
> +
> +            bool defer = ShouldDeferMessage(msg);
> +            if (msg.is_sync() || msg.priority() == IPC::Message::PRIORITY_URGENT) {
> +                // Only log the interesting messages.

Could you move this comment out of the |if| and change it to "Log interesting messages."? (Every time I see an if-condition in IPDL I automatically expect some critical logic to follow :)

Attachment #8709775 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709776 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 12

•

8 years ago

Comment on attachment 8709777 [details] [diff] [review]
don't return MsgNotAllowed

Review of attachment 8709777 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ -1398,5 @@
> -        //
> -        // We do this because want to avoid a situation where we process an
> -        // incoming message from the child here while it simultaneously starts
> -        // processing our timed-out CPOW. It's very bad for both sides to
> -        // be processing sync messages concurrently.

I agree we can probably drop this. It looks like mTimedOutMessagePriority might be dead after this change (unless it's used again later).

Attachment #8709777 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709779 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 13

•

8 years ago

(In reply to Bill McCloskey (:billm) from comment #9)
> Created attachment 8709781 [details] [diff] [review]
> crash if cancelling during urgent dispatch
> 
> The fuzzer also found a problem if you cancel while dispatching an urgent
> message. I couldn't figure out how to handle that, so I added this crash.
> Hopefully it will never happen. (We only cancel when we run a nested event
> loop, so I hope that never happens.)

What makes this difficult to handle?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Comment 14

•

8 years ago

Comment on attachment 8709769 [details] [diff] [review]
fix for chromium ipc bug in glib message pump

Review of attachment 8709769 [details] [diff] [review]:
-----------------------------------------------------------------

I stared at this for a while trying to see if there were some kind of ordering of events that would do something unexpected, but I think the “worst” that can happen is for the pipe buffer to briefly hold 2 bytes instead of 0 or 1, but that should be fine.

::: ipc/chromium/src/base/message_pump_glib.cc
@@ +231,5 @@
>    if (!state_)  // state_ may be null during tests.
>      return false;
>  
>    // We should only ever have a single message on the wakeup pipe, since we
>    // are only signaled when the queue went from empty to non-empty.  The glib

Does this comment need to be adjusted?

Attachment #8709769 - Flags: review?(jld) → review+

Pulsebot

Comment 15

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/fddcf8b3b088
https://hg.mozilla.org/integration/mozilla-inbound/rev/34f87cc8ac24
https://hg.mozilla.org/integration/mozilla-inbound/rev/12de76467424
https://hg.mozilla.org/integration/mozilla-inbound/rev/a7eecc68ca20
https://hg.mozilla.org/integration/mozilla-inbound/rev/488690ba4c8f
https://hg.mozilla.org/integration/mozilla-inbound/rev/3fcd50d83821
https://hg.mozilla.org/integration/mozilla-inbound/rev/b87f893bd6aa
https://hg.mozilla.org/integration/mozilla-inbound/rev/26e6fe3875a5

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 16

•

8 years ago

I wasn't able to land the policy change about CPOWs racing with sync messages because it breaks a bunch of tests that rely on CPOWs. I'll think some more about what to do about that.

I also made some changes to my fuzzer and now I'm seeing more assertions, so I think there's more work to do here.

> What makes this difficult to handle?

If we're dispatching an async urgent message, it's not clear what it means to cancel it. There might be another sync message that the other side is waiting on, and that should be cancelled. But there might be nothing going on, and in that case the cancel shouldn't do anything I guess.

Anyway, it would be pretty bad if we cancel during these messages. That would mean that we're spinning a nested event loop while dispatching an urgent message. We're not supposed to send at all while dispatching urgent messages, and we can't make any claims about what will happen during a nested event loop.

Keywords: leave-open

Pulsebot

Comment 17

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/1a5165c7be18

Andrew McCreight [:mccr8]

Comment 18

•

8 years ago

It looked like the way you were doing variadic macros broke on Visual Studio, so I pushed a followup, trying to match what NS_IMPL_CYCLE_COLLECTION_TRAVERSE does. Though looking at it, I think I did not quite get it right...

Flags: needinfo?(wmccloskey)

Andrew McCreight [:mccr8]

Comment 19

•

8 years ago

Ok, that made it worse, so Kwierso is just going to back all these patches out.

Andrew McCreight [:mccr8]

Comment 20

•

8 years ago

This might work, though I've lost my nerve to actually try it on inbound:

#define IPC_LOG(...)                                                           \
  MOZ_STATIC_ASSERT_VALID_ARG_COUNT(__VA_ARGS__);                              \
  MOZ_LOG(sLogModule, LogLevel::Debug, (__VA_ARGS__))

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 21

•

8 years ago

All backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/10d4e8736dbb

Probably wouldn't have been able to merge or cherry-pick tonight with all of the infra problems that popped up, anyway...

Ryan VanderMeulen [:RyanVM]

Comment 22

•

8 years ago

Mass B2G debug bustage too like the log below:
https://treeherder.mozilla.org/logviewer.html#?job_id=20163871&repo=mozilla-inbound

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 23

•

8 years ago

Thanks for the help guys.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

8 years ago

Whiteboard: [e10s-45-uplift]

Pulsebot

Comment 24

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/19ab70a7137c
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d7dd1de5dd5
https://hg.mozilla.org/integration/mozilla-inbound/rev/5a2a8262a368
https://hg.mozilla.org/integration/mozilla-inbound/rev/2020b3de24bc
https://hg.mozilla.org/integration/mozilla-inbound/rev/e7ce9369ddb6
https://hg.mozilla.org/integration/mozilla-inbound/rev/908c5be531ae
https://hg.mozilla.org/integration/mozilla-inbound/rev/cda5277636b6
https://hg.mozilla.org/integration/mozilla-inbound/rev/aac985c60052

Christoph Diehl [:posidron]

Comment 25

•

8 years ago

Awesome work but please file found issues by a fuzzer in separate bugs. This makes it easier to track down and to reference to.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 26

•

8 years ago

These patches don't make sense in separate bugs.

Ryan VanderMeulen [:RyanVM]

Comment 27

•

8 years ago

(In reply to Pulsebot from comment #24)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/e7ce9369ddb6

This needs rebasing for Aurora. The others graft cleanly.

Flags: needinfo?(wmccloskey)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 28

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/19ab70a7137c
https://hg.mozilla.org/mozilla-central/rev/6d7dd1de5dd5
https://hg.mozilla.org/mozilla-central/rev/5a2a8262a368
https://hg.mozilla.org/mozilla-central/rev/2020b3de24bc
https://hg.mozilla.org/mozilla-central/rev/e7ce9369ddb6
https://hg.mozilla.org/mozilla-central/rev/908c5be531ae
https://hg.mozilla.org/mozilla-central/rev/cda5277636b6
https://hg.mozilla.org/mozilla-central/rev/aac985c60052

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 29

•

8 years ago

Comment on attachment 8709769 [details] [diff] [review]
fix for chromium ipc bug in glib message pump

I'm requesting uplift for all the patches here. I'm hoping these patches will fix a bunch of e10s crashes like bug 1177013.

Approval Request Comment
[Feature/regressing bug #]: bug 1177013
[User impact if declined]: crashes like bug 1210821
[Describe test coverage new/current, TreeHerder]: we have a new fuzzer to discover bugs like this now
[Risks and why]: These changes could result in additional crashes for e10s users (although they're designed to *fix* crashes). However, they really only affect e10s, which won't ship in 45. So the risk of breaking the 45 release is quite low.
[String/UUID change made/needed]: none

Flags: needinfo?(wmccloskey)

Attachment #8709769 - Flags: approval-mozilla-aurora?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 30

•

8 years ago

Also, it's probably easier if I handle the uplifts here. That way I can do the rebasing.

Sylvestre Ledru [:Sylvestre]

Comment 31

•

8 years ago

Comment on attachment 8709779 [details] [diff] [review]
Make sure to queue mDequeueOneTasks after cancellation

[Triage Comment]
Fix crashes, taking it

Attachment #8709779 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Comment 32

•

8 years ago

Comment on attachment 8709777 [details] [diff] [review]
don't return MsgNotAllowed

[Triage Comment]

Attachment #8709777 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709776 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709775 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709774 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709773 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709772 - Flags: approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

8 years ago

Attachment #8709769 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 33

•

8 years ago

Attached patch add mLastError field — Details — Splinter Review

This patch causes us to record the reason why a Send call failed. It's important for the fuzzer to know whether a message actually made it to the other side or not so that it can make assertions about message ordering. This allows that to happen.

Attachment #8711277 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 34

•

8 years ago

Just to follow up on comment 16, the additional bugs I found are mostly caused by the message races that I had hoped to eliminate, but I couldn't because cancelling in this case broke our tests. Instead, I've put together some different fixes for them.

I also found some additional bugs when I added some artificial sleeps and timeouts.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 35

•

8 years ago

Attached patch Fix some timeout/cancel interactions — Details — Splinter Review

This patch fixes some more cancellation issues. Namely:

1. I had fixed a problem where a cancel message for a transaction that was timed out would do the right thing. However, there's also a related problem where we send CPOW message M, time it out, then receive a message M' that's in the same transaction as M, and cancel M' (and hence M as well). In this case, the cancellation happens on *our* side rather than from a message. We need to end the timeout in that case, which we weren't doing.

2. Let's say that the child sends a sync message MC at the same time the parent sends a high prio sync message MP. MP will be dispatched first, in the child. If we cancel MP, then MC will be automatically be cancelled as well, in the child at least (since WasTransactionCancelled will return true). But, in the parent, MC will still be in mPending. When the parent eventually replies to MC, the child will crash. This patch removes MC from mPending in the parent if this happens. (Note that the "Cancel racy CPOW messages" would have forbidden this scenario, but I couldn't land it.)

Attachment #8711285 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 36

•

8 years ago

Attached patch Crash sooner when cancelling racy messages — Details — Splinter Review

This is kind of a weird situation that mostly impacts fuzzing, but I'd like to fix it. Let's say that we have one of these race situations where the child sends a sync normal prio message at the same time the parent sends a CPOW message. The parent's message will dispatch in the child first. Suppose the child cancels it. This is the situation where we intentionally crash, since my "Cancel racy CPOW messages" patch didn't land. However, we don't intentionally crash until the stack has unwound to the original sync send message call in the child. At that point, we call WasTransactionCancelled and crash.

However, in between the cancellation and crashing, a lot of messages can be sent and received. The request to cancel happens inside a message handler, which could do a lot more stuff after it cancels. The fact that the sync message was never delivered means that message ordering has been violated, so the fuzzer will find "bugs", even though we're planning to crash soon after.

This patch moves up the crash so that it happens at the time of the cancel. That way we can't do any more messaging and so the fuzzer won't find any false positives like this.

Attachment #8711286 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 37

•

8 years ago

Attached patch Hold off processing some messages during timeout — Details — Splinter Review

Remember that patch "don't return MsgNotAllowed"? It turns out it was only valid to remove MsgNotAllowed when we had the "Cancel racy CPOW messages" patch. The comment in the patch explains:

// If we've timed out a message and we're awaiting the reply to the timed
// out message, we have to be careful what messages we process. Here's what
// can go wrong:
// 1. child sends a normal priority sync message S
// 2. parent sends a high priority sync message H at the same time
// 3. parent times out H
// 4. child starts processing H and sends a high priority message H' nested
//    within the same transaction
// 5. parent dispatches S and sends reply
// 6. child asserts because it instead expected a reply to H'.

Rather than using MsgNotAllowed as before, I thought of a better way:

// To solve this, we refuse to process S in the parent until we get a reply
// to H. More generally, let the timed out message be M. We don't process a
// message unless the child would need the response to that message in order
// to process M. Those messages are the ones that have a higher priority
// than M or that are part of the same transaction as M.

Attachment #8711287 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 38

•

8 years ago

Attached patch Always enqueue dispatch task on message arrival — Details — Splinter Review

Here's another problem I ran into:

Suppose that we're waiting for a reply to a sync message M. Eventually we time out, call ShouldContinueFromTimeout, and it returns false. However, ShouldContinueFromTimeout drops the lock. Suppose that, in that time, a nested message for the same transaction as M is received--call it M'. Once ShouldContinueFromTimeout returns, Send will return without having looked at M'.

At this point we have a problem. When the link thread added M' to mPending, it did not post an OnMaybeDequeueOne task because Send was still on the stack. But Send returned without looking at mPending, so now the message is essentially lost. The parent is not allowed to send anymore until it gets a response to the timed out message M. And the child is blocked waiting for a response to M'. So we're basically deadlocked.

We could solve this by calling ProcessPendingRequests after ShouldContinueFromTimeout. However, that causes some other weird issues. It seems easier to just unconditionally post an OnMaybeDequeueOne task every time a message is received. Even if we dispatch the message from ProcessPendingRequests, the worst that will happen is that OnMaybeDequeueOne will find an empty queue and do nothing. This should be fairly uncommon anyway since nested CPOW messages are uncommon. Overall, it's a nice way of eliminating weird bugs.

Attachment #8711289 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 39

•

8 years ago

Attached patch patch — Details — Splinter Review

This patch changes a function signature and also changes WasTransactionCanceled to null out mRecvd in a weird case. Here's the comment:

// Imagine this scenario:
// 1. Child sends high prio sync message H1.
// 2. Parent sends reply to H1.
// 3. Parent sends high prio sync message H2.
// 4. Child's link thread receives H1 reply and H2 before worker wakes up.
// 5. Child dispatches H2 while still waiting for H1 reply.
// 6. Child cancels H2.
//
// In this case H1 will also be considered cancelled. However, its
// reply is still sitting in mRecvd, which can trip up later Sends. So
// we null it out here.

I thought about changing things so that Send checks mRecvd before calling ProcessPendingRequests. I think that might work, but this seems like a more conservative change.

Attachment #8711290 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 40

•

8 years ago

Attached patch Check WasTransactionCanceled after timeout — Details — Splinter Review

Another problem caused by ShouldContinueFromTimeout dropping the lock. In this case, the transaction we're waiting on could be cancelled while ShouldContinueFromTimeout runs. In that case, we don't want to act as if a timeout occurred because we'll never get a reply to the timed out message.

Attachment #8711291 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 41

•

8 years ago

Attached patch fuzzer code (for testing only) — Details — Splinter Review

Here's the fuzzer. With the patches above, it runs cleanly for me.

Besides randomly sending messages of different types, it checks that normal and urgent priority messages are dispatched in the order they were sent. It doesn't check high priority messages (CPOWs) because cancellation and races basically means that they might not run in order.

It also has hooks in MessageChannel.cpp that allow it to simulate timeouts and also to inject sleep calls when the mMonitor is not held. Both of these found several bugs.

Finally, I added a hook so that an intentional crash just causes the fuzzer to exit normally. That way our intentional crashes don't look like bugs.

If you run the fuzzer, it will usually exit pretty quickly since it hits intentional crashes so often. If it doesn't exit, that means you hit a deadlock. I've been running it in a loop, checking the exit code.

Attachment #8711292 - Flags: review?(dvander)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 42

•

8 years ago

It's getting a bit late. Tomorrow I'm going to uplift everything that had approval except for the MsgNotAllowed patch, which was wrong. It's likely the remaining patches will miss beta 1, but I think they're probably less important.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 43

•

8 years ago

Unfortunately, the patches that have been approved for Aurora cause test_cpows.xul to die with an assertion failure on Aurora. I guess I'll look into this tomorrow.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 44

•

8 years ago

Fixed the Aurora thing. It turns out that a line from the patch David didn't review is needed to avoid an assert. It just sets the ID of the cancel message. I'm not sure why it's not failing on central. I'm going to fix that now, as well as backing out the MsgNotAllowed patch.

remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/b7104abd6362
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/d1a3b8124b7a
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/0e4c292ead72
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/882c647d9103
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/20666a406bfa
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/4128f782d644

Pulsebot

Comment 45

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/4a51cd342caf
https://hg.mozilla.org/integration/mozilla-inbound/rev/376dc3151e42

Guilherme Lima

Comment 46

•

8 years ago

https://hg.mozilla.org/mozilla-central/rev/4a51cd342caf
https://hg.mozilla.org/mozilla-central/rev/376dc3151e42
landed these morning, in time for Firefox 46

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711277 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8709781 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711285 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711286 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711287 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 47

•

8 years ago

Comment on attachment 8711289 [details] [diff] [review]
Always enqueue dispatch task on message arrival

Review of attachment 8711289 [details] [diff] [review]:
-----------------------------------------------------------------

This sounds like a good idea.

Attachment #8711289 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711290 - Flags: review?(dvander) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

8 years ago

Attachment #8711291 - Flags: review?(dvander) → review+

Pulsebot

Comment 48

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/b94c1a472d46
https://hg.mozilla.org/integration/mozilla-inbound/rev/f10c0445c450
https://hg.mozilla.org/integration/mozilla-inbound/rev/06b4cb890ab0
https://hg.mozilla.org/integration/mozilla-inbound/rev/85c8cb422bad
https://hg.mozilla.org/integration/mozilla-inbound/rev/2c0c0ed35656
https://hg.mozilla.org/integration/mozilla-inbound/rev/b7f2ce03b34e
https://hg.mozilla.org/integration/mozilla-inbound/rev/c3dfc2c674ff
https://hg.mozilla.org/integration/mozilla-inbound/rev/6d7f80a057f0
https://hg.mozilla.org/integration/mozilla-inbound/rev/30f1acd9387f

Pulsebot

Comment 49

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/296fd9d50c25

Jan de Mooij [:jandem]

Comment 50

•

8 years ago

(In reply to Pulsebot from comment #49)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/296fd9d50c25

(Pushed a follow-up to make CancelMessage |explicit|.)

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 51

•

8 years ago

All backed out between these two commmits:
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/f8be949a7397
remote:   https://hg.mozilla.org/integration/mozilla-inbound/rev/928befa33041

Something from this or bug 1242097 caused near permafails like https://treeherder.mozilla.org/logviewer.html#?job_id=20470069&repo=mozilla-inbound

Flags: needinfo?(wmccloskey)

Pulsebot

Comment 52

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/4ce037f2107d
https://hg.mozilla.org/integration/mozilla-inbound/rev/3274c88c07a2
https://hg.mozilla.org/integration/mozilla-inbound/rev/3e614d3776e0
https://hg.mozilla.org/integration/mozilla-inbound/rev/cbdcef6d6e01
https://hg.mozilla.org/integration/mozilla-inbound/rev/2a967aa0f327
https://hg.mozilla.org/integration/mozilla-inbound/rev/abc21e8ce10c
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b1d57fc5b3c
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e5480083084
https://hg.mozilla.org/integration/mozilla-inbound/rev/958a7057e1ac

Alice0775 White

Updated

•

8 years ago

Depends on: 1243557

Alice0775 White

Updated

•

8 years ago

No longer depends on: 1243557

Carsten Book [:Tomcat]

Comment 53

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4ce037f2107d
https://hg.mozilla.org/mozilla-central/rev/3274c88c07a2
https://hg.mozilla.org/mozilla-central/rev/3e614d3776e0
https://hg.mozilla.org/mozilla-central/rev/cbdcef6d6e01
https://hg.mozilla.org/mozilla-central/rev/2a967aa0f327
https://hg.mozilla.org/mozilla-central/rev/abc21e8ce10c
https://hg.mozilla.org/mozilla-central/rev/8b1d57fc5b3c
https://hg.mozilla.org/mozilla-central/rev/9e5480083084
https://hg.mozilla.org/mozilla-central/rev/958a7057e1ac

Carsten Book [:Tomcat]

Comment 54

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4ce037f2107d
https://hg.mozilla.org/mozilla-central/rev/3274c88c07a2
https://hg.mozilla.org/mozilla-central/rev/3e614d3776e0
https://hg.mozilla.org/mozilla-central/rev/cbdcef6d6e01
https://hg.mozilla.org/mozilla-central/rev/2a967aa0f327
https://hg.mozilla.org/mozilla-central/rev/abc21e8ce10c
https://hg.mozilla.org/mozilla-central/rev/8b1d57fc5b3c
https://hg.mozilla.org/mozilla-central/rev/9e5480083084
https://hg.mozilla.org/mozilla-central/rev/958a7057e1ac

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

8 years ago

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 55

•

8 years ago

Comment on attachment 8709781 [details] [diff] [review]
crash if cancelling during urgent dispatch

Asking for approval for the remainder of these patches. They're a continuation of the same work that wasn't finished in time for the merge. Reasons are the same as comment 29.

Attachment #8709781 - Flags: approval-mozilla-aurora?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

8 years ago

Attachment #8711292 - Attachment description: patch → fuzzer code (for testing only)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

8 years ago

Attachment #8709781 - Flags: approval-mozilla-beta?

Sylvestre Ledru [:Sylvestre]

Comment 56

•

8 years ago

Comment on attachment 8709781 [details] [diff] [review]
crash if cancelling during urgent dispatch

Let's take it too
Should be in 45 beta 2.

Attachment #8709781 - Flags: approval-mozilla-beta?

Attachment #8709781 - Flags: approval-mozilla-beta+

Attachment #8709781 - Flags: approval-mozilla-aurora?

Attachment #8709781 - Flags: approval-mozilla-aurora+

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 57

•

8 years ago

Landing on beta first in the hopes it will make it into beta2:

remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/e26322e1dbb0
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/2e2b6ee1256d
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/1f14d50e3ee6
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/3bd120e0ce2c
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/37bdb0e79e27
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/a78b92088738
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/c04e48db3167
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/f1d61f8c26d3
remote:   https://hg.mozilla.org/releases/mozilla-beta/rev/55272f23ad95

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 58

•

8 years ago

remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/25abb1fd2b07
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/d7943f066935
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/344dc7f95bf7
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/f4688b5593f3
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/f2b01eab1bfe
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/172b78c572c2
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/e2eeee0ceb5f
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/3860af14030c
remote:   https://hg.mozilla.org/releases/mozilla-aurora/rev/980fea2f7011

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 59

•

8 years ago

Comment on attachment 8711292 [details] [diff] [review]
fuzzer code (for testing only)

Review of attachment 8711292 [details] [diff] [review]:
-----------------------------------------------------------------

Gabor asked to review some IPC code. This seems like it would be a good start.

Attachment #8711292 - Flags: review?(dvander) → review?(gkrizsanits)

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

status-firefox45: --- → affected

status-firefox47: --- → affected

Sylvestre Ledru [:Sylvestre]

Comment 60

•

8 years ago

It was too late for beta2, should be in beta 3.

Please open a new bug for further patches. It is going to be too complex otherwise.
Thanks

Carsten Book [:Tomcat]

Comment 61

•

8 years ago

(In reply to Sylvestre Ledru [:sylvestre] from comment #60)
> It was too late for beta2, should be in beta 3.
> 
> Please open a new bug for further patches. It is going to be too complex
> otherwise.
> Thanks

seems this was not to late for beta 2 (at least in https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=55272f23ad95) was before the release tags

status-firefox45: affected → fixed

status-firefox46: affected → fixed

Ryan VanderMeulen [:RyanVM]

Comment 62

•

8 years ago

(In reply to Carsten Book [:Tomcat] from comment #61)
> seems this was not to late for beta 2 (at least in
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-
> beta&revision=55272f23ad95) was before the release tags

You need to look at the revision that was actually tagged for release, not just the order they showed up on Treeherder/pushlog. Mercurial history isn't guaranteed to be linear. This did in fact miss beta 2.

Gabor Krizsanits (INACTIVE)

Comment 63

•

8 years ago

Comment on attachment 8711292 [details] [diff] [review]
fuzzer code (for testing only)

Review of attachment 8711292 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the lag, I wanted to play with it a bit, but could not figure out how to run the test (relevant docs are outdated, and it seems like I'm missing one more flag to get the test compiled...). Anyway I don't want to hold up landing this any longer I can do the experimenting part later. I think I understand more or less everything what the test does, my only remark is:

::: ipc/ipdl/test/cxx/TestDemon.h
@@ +49,5 @@
> +
> +private:
> +    bool mDone;
> +    int mIncoming[3];
> +    int mOutgoing[3];

Both for mIncoming and mOutgoing you only use 0 and 2. It would be easier to follow if you just used mIncomingUrgent/mIncomingNormal pairs or something... Or do you plan to expend it to high prio too somehow and that's why you need 3 pairs?

Attachment #8711292 - Flags: review?(gkrizsanits) → review+

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Updated

•

8 years ago

Depends on: 1248750

Pulsebot

Comment 64

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/c82028946fee

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

•

8 years ago

Keywords: leave-open

Ryan VanderMeulen [:RyanVM]

Comment 65

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c82028946fee

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox47: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla47

You need to log in before you can comment on or make changes to this bug.