Closed Bug 1170231 Opened 9 years ago Closed 9 years ago

Interrupt race resolution may cause deadlocks

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla42

Tracking Flags:

Tracking

Status

firefox39

disabled

firefox40

fixed

firefox41

fixed

firefox42

---

fixed

People

(Reporter: kairo, Assigned: bugzilla)

References

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(2 files, 3 obsolete files)

Unit Test 9 years ago (No longer employed by Mozilla) Aaron Klotz 918 bytes, patch		Details \| Diff \| Splinter Review
Unit Test 9 years ago (No longer employed by Mozilla) Aaron Klotz 7.03 KB, patch		Details \| Diff \| Splinter Review
Part 2 - Unit Test 9 years ago (No longer employed by Mozilla) Aaron Klotz 6.57 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Part 1 - Possible Fix 9 years ago (No longer employed by Mozilla) Aaron Klotz 1.20 KB, patch		Details \| Diff \| Splinter Review
Possible Fix (r2) 9 years ago (No longer employed by Mozilla) Aaron Klotz 1.12 KB, patch	dvander : review+ kglazko : approval-mozilla-aurora+ lmandel : approval-mozilla-beta+	Details \| Diff \| Splinter Review

Robert Kaiser

Reporter

Description

•

9 years ago

[Tracking Requested - why for this release]:

This bug was filed from the Socorro interface and is 
report bp-41e8cc89-ca30-47df-a7b0-110a32150601.
=============================================================

Top Frames:
0 	ntdll.dll 	ZwWaitForMultipleObjects 	
1 	kernelbase.dll 	WaitForMultipleObjectsEx 	
2 	kernel32.dll 	WaitForMultipleObjectsExImplementation 	
3 	user32.dll 	RealMsgWaitForMultipleObjectsEx 	
4 	user32.dll 	MsgWaitForMultipleObjects 	
5 	xul.dll 	mozilla::ipc::MessageChannel::WaitForInterruptNotify() 	ipc/glue/WindowsMessageLoop.cpp
6 	xul.dll 	mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) 	ipc/glue/MessageChannel.cpp
7 	xul.dll 	mozilla::plugins::PPluginInstanceChild::CallNPN_GetValue_NPNVdocumentOrigin(nsCString*, short*) 	obj-firefox/ipc/ipdl/PPluginInstanceChild.cpp
8 	xul.dll 	mozilla::plugins::PluginInstanceChild::NPN_GetValue(NPNVariable, void*) 	dom/plugins/ipc/PluginInstanceChild.cpp
9 	xul.dll 	mozilla::plugins::child::_getvalue 	dom/plugins/ipc/PluginModuleChild.cpp
10 	npswf32_17_0_0_169.dll 	F21225463__________________________________________________________________________________________________________ 	F209679109___________________________________________________________________________________:1234
[...]

This hang is spiking with 39.0b1 which is the first beta to include async plugin init.

Find more of those crashes in 39.0b1 at https://crash-stats.mozilla.com/report/list?product=Firefox&signature=hang+|+WaitForMultipleObjectsEx+|+RealMsgWaitForMultipleObjectsEx+|+MsgWaitForMultipleObjects+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AWaitForInterruptNotify%28%29+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACall%28IPC%3A%3AMessage*%2C+IPC%3A%3AMessage*%29&version=Firefox%3A39.0b1#tab-reports

Aaron says he knows what the problem is and he's looking into a fix so assigning to him right away.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 1

•

9 years ago

This is a problem with interrupt race resolution in our MessageChannel code. When a particular sequence of interrupt calls race between parent and child processes, it is possible to deadlock both of them.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 2

•

9 years ago

Moving component to IPC.

Component: Plug-ins → IPC

(No longer employed by Mozilla) Aaron Klotz

Assignee

Updated

•

9 years ago

Summary: Flash hang mozilla::ipc::MessageChannel::Call with async plugin init → Interrupt race resolution may cause deadlocks

(No longer employed by Mozilla) Aaron Klotz

Assignee

Updated

•

9 years ago

Blocks: asyncplugininit

Kate Glazko

Comment 3

•

9 years ago

Tracking since this is a top crash in 39.0b1

tracking-firefox39: ? → +

Keywords: topcrash

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 4

•

9 years ago

Attached patch Unit Test (obsolete) — Details — Splinter Review

This test exhibits the problem - it hangs permanently. Note that your build needs to be configured with --enable-ipdl-tests to try this.

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 5

•

9 years ago

Attached patch Unit Test (obsolete) — Details — Splinter Review

Whoops, messed up the mercurial export. Here's the real patch :-)

Attachment #8615589 - Attachment is obsolete: true

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 6

•

9 years ago

Attached patch Part 2 - Unit Test — Details — Splinter Review

Attachment #8615590 - Attachment is obsolete: true

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 7

•

9 years ago

Attached patch Part 1 - Possible Fix (obsolete) — Details — Splinter Review

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 8

•

9 years ago

Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

The existing predicate is fine for determining when a race has occurred and when to defer a message. The problem with _undeferring_ an incall with this predicate is that in some cases (particularly the one covered by the test in this bug), interrupt_remote_stack_depth_guess will never be correct as long as the outcall that the incall raced with is on the stack. In this case, this results in the message sitting in mDeferred in perpetuity.

This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run.

I've run this against all existing IPDL regression tests without any regressions. I've noticed that this does cause a bit of "hot potato" passing back and forth of some messages between mPending and mDeferred. This ends up having the effect of allowing the pending messages that need to run to propagate through the mPending queue and eventually break the deadlock.

Here's a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=de071b34a7a3

Attachment #8616532 - Flags: review?(dvander)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Updated

•

9 years ago

Attachment #8616531 - Flags: review?(dvander)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 9

•

9 years ago

Ping?

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 10

•

9 years ago

Sorry Aaron, I'm looking at this now and should have it reviewed by the end of the day. This code is pretty obscure, cjones would be a better reviewer but it's probably best if someone else sucks it up and understands the race detection stuff as well.

Flags: needinfo?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 11

•

9 years ago

Okay, I think I see what is going wrong here now:
 1) Parent sends RPC1 to child. Its remote stack guess is 0, local is 1.
 2) Child sends RPC2 to parent. Its remote stack guess is 0, local is 1.
 3) Child wakes up for RPC1, detects a race. Child defers and remote stack guess is 1.
 4) Parent sends RPC3. Its remote stack guess is still 0, local is now 2.
 5) Parent wakes up for RPC2, detects a race. Parent defers and remote stack guess is 1.
 6) Child wakes up for RPC3, detects a race. Child defers and remote stack guess is 2.

In this sequence of events, the remote stack guess will always be < RemoteViewOfStackDepth(), and we'll never wake up. So clearly that condition can't be right.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 12

•

9 years ago

The logic for race detection is roughly: "the other process thought I had N calls on the stack, but actually I've already sent another and it just hasn't processed it yet." The check in MaybeUndeferIncall was attempting to duplicate that so it wouldn't undefer a message that would just be deferred again.

You said "This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run." But MayebUndeferIncall doesn't run anything, DispatchInterruptMessage does, and it has a different check. In theory MaybeUndeferIncall could have *no* predicate and always undefer, and we'd just be stressing the race condition code. That's what we're probably seeing with the message bouncing.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 13

•

9 years ago

Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix

I think the new predicate needs a very clear explanation of what its relation to the check in DispatchInterruptMessage() is. Unless I'm mistaken and they're unrelated, in which case we should have an explanation of why the new predicate is any better than having none at all.

I'm also yet not sure what the best condition would be, to both not cause deadlocks and to not wake up the thread just to defer again. I'll keep thinking about it though.

Sorry again for the late review, I think I understand this well enough to have a much faster turnaround time now though.

Attachment #8616532 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 14

•

9 years ago

Comment on attachment 8616531 [details] [diff] [review]
Part 2 - Unit Test

Review of attachment 8616531 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for writing this test case, it does reproduce for me. I do see that part 1 fixes it, but I want to make sure that's because MaybeUndeferIncall is correct rather than just letting the race detection logic run until it succeeds.

Attachment #8616531 - Flags: review?(dvander) → review+

Liz Henry (:lizzard) (relman/hg->git project)

Comment 15

•

9 years ago

We turned off async plugin init in 39.0b5. So we will likely still need this for 40+. 
Wontfix for 39 though.

status-firefox40: --- → affected

status-firefox41: --- → affected

tracking-firefox40: --- → +

tracking-firefox41: --- → +

Sylvestre Ledru [:Sylvestre]

Comment 16

•

9 years ago

Aaron, any reason why the patch didn't land? Thanks

status-firefox39: affected → disabled

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 17

•

9 years ago

Part 2 is just the test. Part 1 needs more work.

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 18

•

9 years ago

Attached patch Possible Fix (r2) — Details — Splinter Review

Attachment #8616532 - Attachment is obsolete: true

Robert Kaiser

Reporter

Comment 19

•

9 years ago

Aaron, the hangs including the signatures in this bug are high again in 40 beta 1 (of course, as no fix landed here) - are you doing more work on this or did you miss to request review for the fix?

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 20

•

9 years ago

dvander and I talked about this at Whistler. We have a path going forward here. The revised patch is written but I am running this through a bunch of tests.

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 21

•

9 years ago

Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

This patch removes the condition outright. try and ipdl regression tests look good.

Attachment #8626828 - Flags: review?(dvander)

Robert Kaiser

Reporter

Updated

•

9 years ago

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

9 years ago

Attachment #8626828 - Flags: review?(dvander) → review+

Pulsebot

Comment 22

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/3586d497bde2
https://hg.mozilla.org/integration/mozilla-inbound/rev/61a5b8d678f3

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 23

•

9 years ago

Nominations for uplift will be done gradually depending on how things look on treeherder.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 24

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/3586d497bde2
https://hg.mozilla.org/mozilla-central/rev/61a5b8d678f3

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox42: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla42

Kate Glazko

Comment 25

•

9 years ago

Hi Aaron, can you update us on how things are going in treeherder? We are interested in getting an uplift for 40/41.

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 26

•

9 years ago

Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

I'm OK with uplifting to Aurora at this time. I think we should let it bake there for a bit. I'm not concerned with Treeherder so much as I am with crashstats.

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly.
[String/UUID change made/needed]: None

Flags: needinfo?(aklotz)

Attachment #8626828 - Flags: approval-mozilla-aurora?

Kate Glazko

Comment 27

•

9 years ago

Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Looking good on m-c, IPDL test suite.

Attachment #8626828 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Ryan VanderMeulen [:RyanVM]

Comment 28

•

9 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/19937e5a2904
https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1058741c2c

status-firefox41: affected → fixed

Flags: in-testsuite+

Sylvestre Ledru [:Sylvestre]

Comment 29

•

9 years ago

Aaron, how many days of baking do you need for this before the uplift to beta? Thanks

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 30

•

9 years ago

We should reevaluate when I get back from PTO on Monday the 20th. We need sufficient crashstats data.

Flags: needinfo?(aklotz)

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 31

•

9 years ago

Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite, has been on Aurora for over a week.
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly and Aurora. Simple backout + shut off asyncInit if necessary to revert.
[String/UUID change made/needed]: None

Attachment #8626828 - Flags: approval-mozilla-beta?

Lawrence Mandel [:lmandel] (use needinfo)

Comment 32

•

9 years ago

Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)

Let's take this small change and measure the impact in beta6. Beta+

Attachment #8626828 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 33

•

9 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/ddc45855ad74
https://hg.mozilla.org/releases/mozilla-beta/rev/75e056cdd421

status-firefox40: affected → fixed

akorchemniy

Comment 34

•

9 years ago

This is affecting our plugin as well in 40+. I had to disable dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the plugin side to avoid the problem?

(No longer employed by Mozilla) Aaron Klotz

Assignee

Comment 35

•

9 years ago

(In reply to akorchemniy from comment #34)
> This is affecting our plugin as well in 40+. I had to disable
> dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the
> plugin side to avoid the problem?

If you needed to disable dom.ipc.plugins.asyncInit in 40 then this bug is not your problem. We've had quite a few compatibility issues roll in with async plugin init over the weekend. Those bugs are being tracked as dependencies of bug 1116806. We're probably going to push out a hotfix to disable asyncInit in 40 until the problem is fixed. Ideally async plugin init should just work seamlessly with all plugins (provided that they are correctly following the API contracts).

Comment hidden (spam)

https://crash-stats.mozilla.com/report/index/5ca21e26-f063-47f0-8a38-94c402151004

You need to log in before you can comment on or make changes to this bug.