Closed Bug 1170231 Opened 5 years ago Closed 5 years ago

Interrupt race resolution may cause deadlocks

Categories

(Core :: IPC, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox39 + disabled
firefox40 + fixed
firefox41 + fixed
firefox42 --- fixed

People

(Reporter: kairo, Assigned: aklotz)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(2 files, 3 obsolete files)

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-41e8cc89-ca30-47df-a7b0-110a32150601.
=============================================================

Top Frames:
0 	ntdll.dll 	ZwWaitForMultipleObjects 	
1 	kernelbase.dll 	WaitForMultipleObjectsEx 	
2 	kernel32.dll 	WaitForMultipleObjectsExImplementation 	
3 	user32.dll 	RealMsgWaitForMultipleObjectsEx 	
4 	user32.dll 	MsgWaitForMultipleObjects 	
5 	xul.dll 	mozilla::ipc::MessageChannel::WaitForInterruptNotify() 	ipc/glue/WindowsMessageLoop.cpp
6 	xul.dll 	mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
7 	xul.dll 	mozilla::plugins::PPluginInstanceChild::CallNPN_GetValue_NPNVdocumentOrigin(nsCString*, short*) 	obj-firefox/ipc/ipdl/PPluginInstanceChild.cpp
8 	xul.dll 	mozilla::plugins::PluginInstanceChild::NPN_GetValue(NPNVariable, void*) 	dom/plugins/ipc/PluginInstanceChild.cpp
9 	xul.dll 	mozilla::plugins::child::_getvalue 	dom/plugins/ipc/PluginModuleChild.cpp
10 	npswf32_17_0_0_169.dll 	F21225463__________________________________________________________________________________________________________ 	F209679109___________________________________________________________________________________:1234
[...]

This hang is spiking with 39.0b1 which is the first beta to include async plugin init.

Find more of those crashes in 39.0b1 at https://crash-stats.mozilla.com/report/list?product=Firefox&signature=hang+|+WaitForMultipleObjectsEx+|+RealMsgWaitForMultipleObjectsEx+|+MsgWaitForMultipleObjects+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AWaitForInterruptNotify%28%29+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACall%28IPC%3A%3AMessage*%2C+IPC%3A%3AMessage*%29&version=Firefox%3A39.0b1#tab-reports

Aaron says he knows what the problem is and he's looking into a fix so assigning to him right away.
This is a problem with interrupt race resolution in our MessageChannel code. When a particular sequence of interrupt calls race between parent and child processes, it is possible to deadlock both of them.
Moving component to IPC.
Component: Plug-ins → IPC
Summary: Flash hang mozilla::ipc::MessageChannel::Call with async plugin init → Interrupt race resolution may cause deadlocks
Tracking since this is a top crash in 39.0b1
Attached patch Unit Test (obsolete) — Splinter Review
This test exhibits the problem - it hangs permanently. Note that your build needs to be configured with --enable-ipdl-tests to try this.
Attached patch Unit Test (obsolete) — Splinter Review
Whoops, messed up the mercurial export. Here's the real patch :-)
Attachment #8615589 - Attachment is obsolete: true
Attachment #8615590 - Attachment is obsolete: true
Attached patch Part 1 - Possible Fix (obsolete) — Splinter Review
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

The existing predicate is fine for determining when a race has occurred and when to defer a message. The problem with _undeferring_ an incall with this predicate is that in some cases (particularly the one covered by the test in this bug), interrupt_remote_stack_depth_guess will never be correct as long as the outcall that the incall raced with is on the stack. In this case, this results in the message sitting in mDeferred in perpetuity.

This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run.

I've run this against all existing IPDL regression tests without any regressions. I've noticed that this does cause a bit of "hot potato" passing back and forth of some messages between mPending and mDeferred. This ends up having the effect of allowing the pending messages that need to run to propagate through the mPending queue and eventually break the deadlock.

Here's a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=de071b34a7a3
Attachment #8616532 - Flags: review?(dvander)
Attachment #8616531 - Flags: review?(dvander)
Ping?
Flags: needinfo?(dvander)
Sorry Aaron, I'm looking at this now and should have it reviewed by the end of the day. This code is pretty obscure, cjones would be a better reviewer but it's probably best if someone else sucks it up and understands the race detection stuff as well.
Flags: needinfo?(dvander)
Okay, I think I see what is going wrong here now:
 1) Parent sends RPC1 to child. Its remote stack guess is 0, local is 1.
 2) Child sends RPC2 to parent. Its remote stack guess is 0, local is 1.
 3) Child wakes up for RPC1, detects a race. Child defers and remote stack guess is 1.
 4) Parent sends RPC3. Its remote stack guess is still 0, local is now 2.
 5) Parent wakes up for RPC2, detects a race. Parent defers and remote stack guess is 1.
 6) Child wakes up for RPC3, detects a race. Child defers and remote stack guess is 2.

In this sequence of events, the remote stack guess will always be < RemoteViewOfStackDepth(), and we'll never wake up. So clearly that condition can't be right.
The logic for race detection is roughly: "the other process thought I had N calls on the stack, but actually I've already sent another and it just hasn't processed it yet." The check in MaybeUndeferIncall was attempting to duplicate that so it wouldn't undefer a message that would just be deferred again.

You said "This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run." But MayebUndeferIncall doesn't run anything, DispatchInterruptMessage does, and it has a different check. In theory MaybeUndeferIncall could have *no* predicate and always undefer, and we'd just be stressing the race condition code. That's what we're probably seeing with the message bouncing.
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

I think the new predicate needs a very clear explanation of what its relation to the check in DispatchInterruptMessage() is. Unless I'm mistaken and they're unrelated, in which case we should have an explanation of why the new predicate is any better than having none at all.

I'm also yet not sure what the best condition would be, to both not cause deadlocks and to not wake up the thread just to defer again. I'll keep thinking about it though.

Sorry again for the late review, I think I understand this well enough to have a much faster turnaround time now though.
Attachment #8616532 - Flags: review?(dvander)
Comment on attachment 8616531 [details] [diff] [review]
Part 2 - Unit Test

Review of attachment 8616531 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for writing this test case, it does reproduce for me. I do see that part 1 fixes it, but I want to make sure that's because MaybeUndeferIncall is correct rather than just letting the race detection logic run until it succeeds.
Attachment #8616531 - Flags: review?(dvander) → review+
We turned off async plugin init in 39.0b5. So we will likely still need this for 40+. 
Wontfix for 39 though.
Aaron, any reason why the patch didn't land? Thanks
Flags: needinfo?(aklotz)
Part 2 is just the test. Part 1 needs more work.
Flags: needinfo?(aklotz)
Attachment #8616532 - Attachment is obsolete: true
Aaron, the hangs including the signatures in this bug are high again in 40 beta 1 (of course, as no fix landed here) - are you doing more work on this or did you miss to request review for the fix?
Flags: needinfo?(aklotz)
dvander and I talked about this at Whistler. We have a path going forward here. The revised patch is written but I am running this through a bunch of tests.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

This patch removes the condition outright. try and ipdl regression tests look good.
Attachment #8626828 - Flags: review?(dvander)
Crash Signature: [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)] → [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)] [@ hang | CleanupPerAppKey | Msg…
Attachment #8626828 - Flags: review?(dvander) → review+
Nominations for uplift will be done gradually depending on how things look on treeherder.
Hi Aaron, can you update us on how things are going in treeherder? We are interested in getting an uplift for 40/41.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

I'm OK with uplifting to Aurora at this time. I think we should let it bake there for a bit. I'm not concerned with Treeherder so much as I am with crashstats.

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly.
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8626828 - Flags: approval-mozilla-aurora?
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Looking good on m-c, IPDL test suite.
Attachment #8626828 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Aaron, how many days of baking do you need for this before the uplift to beta? Thanks
Flags: needinfo?(aklotz)
We should reevaluate when I get back from PTO on Monday the 20th. We need sufficient crashstats data.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite, has been on Aurora for over a week.
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly and Aurora. Simple backout + shut off asyncInit if necessary to revert.
[String/UUID change made/needed]: None
Attachment #8626828 - Flags: approval-mozilla-beta?
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Let's take this small change and measure the impact in beta6. Beta+
Attachment #8626828 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
This is affecting our plugin as well in 40+. I had to disable dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the plugin side to avoid the problem?
(In reply to akorchemniy from comment #34)
> This is affecting our plugin as well in 40+. I had to disable
> dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the
> plugin side to avoid the problem?

If you needed to disable dom.ipc.plugins.asyncInit in 40 then this bug is not your problem. We've had quite a few compatibility issues roll in with async plugin init over the weekend. Those bugs are being tracked as dependencies of bug 1116806. We're probably going to push out a hotfix to disable asyncInit in 40 until the problem is fixed. Ideally async plugin init should just work seamlessly with all plugins (provided that they are correctly following the API contracts).
You need to log in before you can comment on or make changes to this bug.