Interrupt race resolution may cause deadlocks

RESOLVED FIXED in Firefox 40

Status

()

Core
IPC
--
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: Robert Kaiser, Assigned: aklotz)

Tracking

({crash, topcrash})

Trunk
mozilla42
x86
Windows NT
crash, topcrash
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox39+ disabled, firefox40+ fixed, firefox41+ fixed, firefox42 fixed)

Details

(crash signature)

Attachments

(2 attachments, 3 obsolete attachments)

(Reporter)

Description

2 years ago
[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-41e8cc89-ca30-47df-a7b0-110a32150601.
=============================================================

Top Frames:
0 	ntdll.dll 	ZwWaitForMultipleObjects 	
1 	kernelbase.dll 	WaitForMultipleObjectsEx 	
2 	kernel32.dll 	WaitForMultipleObjectsExImplementation 	
3 	user32.dll 	RealMsgWaitForMultipleObjectsEx 	
4 	user32.dll 	MsgWaitForMultipleObjects 	
5 	xul.dll 	mozilla::ipc::MessageChannel::WaitForInterruptNotify() 	ipc/glue/WindowsMessageLoop.cpp
6 	xul.dll 	mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
7 	xul.dll 	mozilla::plugins::PPluginInstanceChild::CallNPN_GetValue_NPNVdocumentOrigin(nsCString*, short*) 	obj-firefox/ipc/ipdl/PPluginInstanceChild.cpp
8 	xul.dll 	mozilla::plugins::PluginInstanceChild::NPN_GetValue(NPNVariable, void*) 	dom/plugins/ipc/PluginInstanceChild.cpp
9 	xul.dll 	mozilla::plugins::child::_getvalue 	dom/plugins/ipc/PluginModuleChild.cpp
10 	npswf32_17_0_0_169.dll 	F21225463__________________________________________________________________________________________________________ 	F209679109___________________________________________________________________________________:1234
[...]

This hang is spiking with 39.0b1 which is the first beta to include async plugin init.

Find more of those crashes in 39.0b1 at https://crash-stats.mozilla.com/report/list?product=Firefox&signature=hang+|+WaitForMultipleObjectsEx+|+RealMsgWaitForMultipleObjectsEx+|+MsgWaitForMultipleObjects+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AWaitForInterruptNotify%28%29+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACall%28IPC%3A%3AMessage*%2C+IPC%3A%3AMessage*%29&version=Firefox%3A39.0b1#tab-reports

Aaron says he knows what the problem is and he's looking into a fix so assigning to him right away.
This is a problem with interrupt race resolution in our MessageChannel code. When a particular sequence of interrupt calls race between parent and child processes, it is possible to deadlock both of them.
Moving component to IPC.
Component: Plug-ins → IPC
Summary: Flash hang mozilla::ipc::MessageChannel::Call with async plugin init → Interrupt race resolution may cause deadlocks
Blocks: 1116806

Comment 3

2 years ago
Tracking since this is a top crash in 39.0b1
tracking-firefox39: ? → +
Keywords: topcrash
Created attachment 8615589 [details] [diff] [review]
Unit Test

This test exhibits the problem - it hangs permanently. Note that your build needs to be configured with --enable-ipdl-tests to try this.
Created attachment 8615590 [details] [diff] [review]
Unit Test

Whoops, messed up the mercurial export. Here's the real patch :-)
Attachment #8615589 - Attachment is obsolete: true
Created attachment 8616531 [details] [diff] [review]
Part 2 - Unit Test
Attachment #8615590 - Attachment is obsolete: true
Created attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

The existing predicate is fine for determining when a race has occurred and when to defer a message. The problem with _undeferring_ an incall with this predicate is that in some cases (particularly the one covered by the test in this bug), interrupt_remote_stack_depth_guess will never be correct as long as the outcall that the incall raced with is on the stack. In this case, this results in the message sitting in mDeferred in perpetuity.

This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run.

I've run this against all existing IPDL regression tests without any regressions. I've noticed that this does cause a bit of "hot potato" passing back and forth of some messages between mPending and mDeferred. This ends up having the effect of allowing the pending messages that need to run to propagate through the mPending queue and eventually break the deadlock.

Here's a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=de071b34a7a3
Attachment #8616532 - Flags: review?(dvander)
Attachment #8616531 - Flags: review?(dvander)
Ping?
Flags: needinfo?(dvander)
Sorry Aaron, I'm looking at this now and should have it reviewed by the end of the day. This code is pretty obscure, cjones would be a better reviewer but it's probably best if someone else sucks it up and understands the race detection stuff as well.
Flags: needinfo?(dvander)
Okay, I think I see what is going wrong here now:
 1) Parent sends RPC1 to child. Its remote stack guess is 0, local is 1.
 2) Child sends RPC2 to parent. Its remote stack guess is 0, local is 1.
 3) Child wakes up for RPC1, detects a race. Child defers and remote stack guess is 1.
 4) Parent sends RPC3. Its remote stack guess is still 0, local is now 2.
 5) Parent wakes up for RPC2, detects a race. Parent defers and remote stack guess is 1.
 6) Child wakes up for RPC3, detects a race. Child defers and remote stack guess is 2.

In this sequence of events, the remote stack guess will always be < RemoteViewOfStackDepth(), and we'll never wake up. So clearly that condition can't be right.
The logic for race detection is roughly: "the other process thought I had N calls on the stack, but actually I've already sent another and it just hasn't processed it yet." The check in MaybeUndeferIncall was attempting to duplicate that so it wouldn't undefer a message that would just be deferred again.

You said "This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run." But MayebUndeferIncall doesn't run anything, DispatchInterruptMessage does, and it has a different check. In theory MaybeUndeferIncall could have *no* predicate and always undefer, and we'd just be stressing the race condition code. That's what we're probably seeing with the message bouncing.
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

I think the new predicate needs a very clear explanation of what its relation to the check in DispatchInterruptMessage() is. Unless I'm mistaken and they're unrelated, in which case we should have an explanation of why the new predicate is any better than having none at all.

I'm also yet not sure what the best condition would be, to both not cause deadlocks and to not wake up the thread just to defer again. I'll keep thinking about it though.

Sorry again for the late review, I think I understand this well enough to have a much faster turnaround time now though.
Attachment #8616532 - Flags: review?(dvander)
Comment on attachment 8616531 [details] [diff] [review]
Part 2 - Unit Test

Review of attachment 8616531 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for writing this test case, it does reproduce for me. I do see that part 1 fixes it, but I want to make sure that's because MaybeUndeferIncall is correct rather than just letting the race detection logic run until it succeeds.
Attachment #8616531 - Flags: review?(dvander) → review+
We turned off async plugin init in 39.0b5. So we will likely still need this for 40+. 
Wontfix for 39 though.
status-firefox40: --- → affected
status-firefox41: --- → affected
tracking-firefox40: --- → +
tracking-firefox41: --- → +
Aaron, any reason why the patch didn't land? Thanks
status-firefox39: affected → disabled
Flags: needinfo?(aklotz)
Part 2 is just the test. Part 1 needs more work.
Flags: needinfo?(aklotz)
Created attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
Attachment #8616532 - Attachment is obsolete: true
(Reporter)

Comment 19

2 years ago
Aaron, the hangs including the signatures in this bug are high again in 40 beta 1 (of course, as no fix landed here) - are you doing more work on this or did you miss to request review for the fix?
Flags: needinfo?(aklotz)
dvander and I talked about this at Whistler. We have a path going forward here. The revised patch is written but I am running this through a bunch of tests.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

This patch removes the condition outright. try and ipdl regression tests look good.
Attachment #8626828 - Flags: review?(dvander)
(Reporter)

Updated

2 years ago
Crash Signature: [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)] → [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)] [@ hang | CleanupPerAppKe&hellip;
Attachment #8626828 - Flags: review?(dvander) → review+

Comment 22

2 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/3586d497bde2
https://hg.mozilla.org/integration/mozilla-inbound/rev/61a5b8d678f3
Nominations for uplift will be done gradually depending on how things look on treeherder.
https://hg.mozilla.org/mozilla-central/rev/3586d497bde2
https://hg.mozilla.org/mozilla-central/rev/61a5b8d678f3
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox42: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42

Comment 25

2 years ago
Hi Aaron, can you update us on how things are going in treeherder? We are interested in getting an uplift for 40/41.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

I'm OK with uplifting to Aurora at this time. I think we should let it bake there for a bit. I'm not concerned with Treeherder so much as I am with crashstats.

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly.
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8626828 - Flags: approval-mozilla-aurora?

Comment 27

2 years ago
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Looking good on m-c, IPDL test suite.
Attachment #8626828 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/19937e5a2904
https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1058741c2c
status-firefox41: affected → fixed
Flags: in-testsuite+
Aaron, how many days of baking do you need for this before the uplift to beta? Thanks
Flags: needinfo?(aklotz)
We should reevaluate when I get back from PTO on Monday the 20th. We need sufficient crashstats data.
Flags: needinfo?(aklotz)
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite, has been on Aurora for over a week.
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly and Aurora. Simple backout + shut off asyncInit if necessary to revert.
[String/UUID change made/needed]: None
Attachment #8626828 - Flags: approval-mozilla-beta?
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Let's take this small change and measure the impact in beta6. Beta+
Attachment #8626828 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/ddc45855ad74
https://hg.mozilla.org/releases/mozilla-beta/rev/75e056cdd421
status-firefox40: affected → fixed

Comment 34

2 years ago
This is affecting our plugin as well in 40+. I had to disable dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the plugin side to avoid the problem?
(In reply to akorchemniy from comment #34)
> This is affecting our plugin as well in 40+. I had to disable
> dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the
> plugin side to avoid the problem?

If you needed to disable dom.ipc.plugins.asyncInit in 40 then this bug is not your problem. We've had quite a few compatibility issues roll in with async plugin init over the weekend. Those bugs are being tracked as dependencies of bug 1116806. We're probably going to push out a hotfix to disable asyncInit in 40 until the problem is fixed. Ideally async plugin init should just work seamlessly with all plugins (provided that they are correctly following the API contracts).
Comment hidden (spam)
You need to log in before you can comment on or make changes to this bug.