1307708 - Crash in mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState

Reporter

Description

•

9 years ago

This bug was filed from the Socorro interface and is report bp-a619d2b5-2fab-4e36-8b53-28ad32161005. ============================================================= As far as I found out the crash associated multiple users. It started with Build ID 20160927030200 on 2016-09-30 07:18:13 and it is still continuing, latest crash on Build ID 20161003030438 . The crash is on Firefox 52.0a1 version and Firefox 51 version ,Windows NT Maximum crashes encountered on October 3,2016. Uptime range: 1-5 min

Tabs16 (:Tabs16)

Reporter

Updated

•

9 years ago

Flags: needinfo?(benjamin)

Marcia Knous [:marcia]

Comment 1

•

9 years ago

This signature looks similar to one I found in Bug 1303755 - ni on David to see if it is the same issue.

Component: Untriaged → Security: Process Sandboxing

Flags: needinfo?(benjamin) → needinfo?(davidp99)

Product: Firefox → Core

David Parks [:handyman]

Assignee

Comment 2

•

9 years ago

This bug may have been exacerbated by the patch -- I'm not yet sure -- but the patch itself looks to have no real influence on the crash posted in comment 0. The patch from bug 1303755 tests to make sure that the PMCGetKeyState method is running on the main thread. If not, it posts to the main thread and waits (using a semaphore) for the main thread to return a result. Looking at the stack for thread 0 and the other threads in the bug report, the PMCGetKeyState call originated on the main thread (also, it does not appear in the non-crashing-thread stacks). In this case, the old code is run.

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Crash Signature: [@ mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState] → [@ mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState] [@ MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::Messa…

Jim Mathies [:jimm]

Updated

•

9 years ago

Assignee: nobody → davidp99

Flags: needinfo?(davidp99)

Whiteboard: sbwc2

Jim Mathies [:jimm]

Comment 3

•

9 years ago

Resetting the ownership and moving to the proper component. This is low volume and not a regression from our sandboxing work.

Assignee: davidp99 → nobody

Component: Security: Process Sandboxing → Plug-ins

Whiteboard: sbwc2

Mats Palmgren (inactive)

Updated

•

8 years ago

Crash Signature: mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState ] → mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState ] [@ RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc:…

Ryan VanderMeulen [:RyanVM]

Updated

•

8 years ago

status-firefox51: --- → affected

status-firefox52: --- → affected

status-firefox53: --- → affected

Ryan VanderMeulen [:RyanVM]

Comment 4

•

8 years ago

Currently the #14 topcrasher on Aurora. All the reports are on Win10. David, can you please take another look?

status-firefox50: --- → unaffected

Flags: needinfo?(davidp99)

David Parks [:handyman]

Assignee

Comment 5

•

8 years ago

Wow. I'm on it. Note: There are three crash signatures attached to this bug. All entries with the first signature are Win10, the second belong to Win8 and the third to Win7. So this is on pretty much all Windows.

Assignee: nobody → davidp99

Flags: needinfo?(davidp99)

David Parks [:handyman]

Assignee

Comment 6

•

8 years ago

I think I see what's happening here. This is all conjecture as we have no kind of STR for this. But the conditions of this crash are such that it the cause has to be fairly extreme. This condition happens when the GetKeyState call is on the main thread. The IPDL call then happens immediately. This is reflected in crash-stats's call stacks and is why I dont think bug 1303755 is related. Instead, I think the problem can be traced to the fact that the IPDL call is 'intr' [1]. This was set in bug 1180684. What I believe is happening is that the plugin itself -- the PluginModuleChild -- is being ActorDestroy-ed by a message that is interrupting the GetKeyState ipdl call. The ActorDestroy runs fine but since the actor is above it on the stack due to the GetKeyState call, when the stack unwinds to that point, we get a crash. For us, this is in the OS's NtWaitForMultipleObjects interrupt handler, so it isn't obvious the issue is due to the actor, but it is. The handle we wait on, MessageChannel::mEvent [2], is closed in the MessageChannel destructor [3], which is freed as a member of the PluginModuleChild actor. I didn't suspect this as I didn't think that intr would allow an actor to be destroyed if it is on the stack... but I'm pretty sure it does. I'm pretty certain that detecting this condition is what the AutoStackGuard in PluginInstanceChild's (many) intr message handlers is about (e.g. [4]). AutoStackGuard forces a crash if a message tries to delete a PluginInstanceChild that is currently on the stack. I take it that PluginInstanceChild is carefully designed to keep from triggering this crash. There's a lot of complexity in figuring out the order of these messages and how this sort of thing can happen but I don't think we need to go through it here because we may just be able to change the GetKeyState ipdl message from intr to sync. This is certainly the easiest prospect. NIing masayuki about the patch that set the message to intr in bug 1180684. Do we actually need this message to be interruptable? If so, do you know what guarantees we can exploit to establish the protocol that the interruptions won't delete our actor from underneath us? ---- [1] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PPluginModule.ipdl#163 [2] https://hg.mozilla.org/mozilla-central/annotate/0ddfec7126ec/ipc/glue/WindowsMessageLoop.cpp#l1247 [3] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/ipc/glue/MessageChannel.cpp#528 [4] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PluginInstanceChild.cpp#714

Flags: needinfo?(masayuki)

David Parks [:handyman]

Assignee

Comment 7

•

8 years ago

Err... wrong person! Fixing the ni.

Flags: needinfo?(masayuki) → needinfo?(m_kato)

Makoto Kato [:m_kato]

Comment 8

•

8 years ago

I think most (all?) crashes are that s10s is turned off. If e10s, plugin process connects to chrome process directly. Then, since chrome process won't use intr/sync to content process, this issue won't occur. But when disabling e10s, chrome process uses "intr" to plugin process. It is OK to use sync instead. But we can only support GetKeyState hook with e10s. It becomes a limitation not to use e10s.

Flags: needinfo?(m_kato)

David Parks [:handyman]

Assignee

Comment 9

•

8 years ago

Thanks Makoto. A few more questions: (In reply to Makoto Kato [:m_kato] from comment #8) > I think most (all?) crashes are that s10s is turned off. Is this a hunch or do you see that in crash-stats? It didn't even occur to me to look for non-e10s but I don't know how to tell in crash-stats. > If e10s, plugin > process connects to chrome process directly. Then, since chrome process > won't use intr/sync to content process, this issue won't occur. But when > disabling e10s, chrome process uses "intr" to plugin process. I think I am either misunderstanding the chrome/content/plugin process connections or this is something I don't know about the limitations of the 'intr' protocol. I know that, with e10s on, PPluginModule connects the plugin process to chrome as you say. I assumed its the same in non-e10s. Are you saying its not? I didn't think the content process was involved in either case -- is it? I figure that using 'intr GetKeyState' in "plugin -> chrome" messages allows the plugin process to handle other messages from chrome while waiting for a response to the 'intr GetKeyState' message... in both e10s and non-e10s cases. Do you mean that e10s will forbid this? Are you suggesting that the crash is due to intr being illegal instead of the ActorDestroy message I guessed? I know e10s forbids sync/intr messages that are "chrome -> plugin" and "chrome -> content" but I've been assuming it allows sync/intr messages that are "plugin -> chrome"... and I guessed that using intr instead of sync is what causes the crash (regardless of e10s/non-e10s). > > It is OK to use sync instead. But we can only support GetKeyState hook with > e10s. It becomes a limitation not to use e10s. Are you saying that changing to sync will not fix non-e10s? Since you said e10s should be working, that would mean changing to sync has no benefit? Do you have any better ideas that won't leave non-e10s broken? --- Since we don't have an STR or much else to go on with this bug, I feel pretty limited in how far I can get here without experimentation. I'd like to try changing the protocol to 'sync' to see if the crashes go away... as long as it won't cause problems. I can't see any problems it could create but if it is inherently incompatible with the non-e10s case as you suggest then I won't bother. Can you tell me what changing to sync would break in non-e10s? Does it sound like a reasonable idea to try it?

Flags: needinfo?(m_kato)

Jim Mathies [:jimm]

Comment 10

•

8 years ago

(In reply to David Parks (dparks) [:handyman] from comment #9) > (In reply to Makoto Kato [:m_kato] from comment #8) > > I think most (all?) crashes are that s10s is turned off. > > Is this a hunch or do you see that in crash-stats? It didn't even occur to > me to look for non-e10s but I don't know how to tell in crash-stats. You can search for the signature in crash-stats using the SuperSearch interface to find this. https://crash-stats.mozilla.com/search/?product=Firefox&_dont_run=1 Expand the more options section and add a 'facet on' property, for example dom ipc enabled - https://crash-stats.mozilla.com/search/?signature=~PPluginModuleChild%3A%3ACallGetKeyState&product=Firefox&date=%3E%3D2017-01-19T20%3A23%3A00.000Z&date=%3C2017-01-26T20%3A23%3A00.000Z&_sort=-date&_facets=signature&_facets=dom_ipc_enabled&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-dom_ipc_enabled FWIW, looks like this crash is 100% e10s enabled. > > If e10s, plugin > > process connects to chrome process directly. Then, since chrome process > > won't use intr/sync to content process, this issue won't occur. But when > > disabling e10s, chrome process uses "intr" to plugin process. > > I think I am either misunderstanding the chrome/content/plugin process > connections or this is something I don't know about the limitations of the > 'intr' protocol. I know that, with e10s on, PPluginModule connects the > plugin process to chrome as you say. I assumed its the same in non-e10s. > Are you saying its not? I didn't think the content process was involved in > either case -- is it? Under e10s we use a 3-way bridge. The plugin process is connected to content through its normal ipdl interfaces. There also a light weight connection to the browser. http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleChild.cpp#67 > I figure that using 'intr GetKeyState' in "plugin -> chrome" messages allows > the plugin process to handle other messages from chrome while waiting for a > response to the 'intr GetKeyState' message... in both e10s and non-e10s > cases. Yes, although under e10s the PluginModuleParent call would fail in the content process. But I don't think this actually gets called. Looks like the chrome (browser) PluginModuleParent makes the actual call. http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleParent.cpp#3387 http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleParent.cpp#3394 Under non-e10s, if this was called, the call would fail in PluginModuleParent::AnswerGetKeyState. > Do you mean that e10s will forbid this? Are you suggesting that the > crash is due to intr being illegal instead of the ActorDestroy message I > guessed? I know e10s forbids sync/intr messages that are "chrome -> plugin" > and "chrome -> content" but I've been assuming it allows sync/intr messages > that are "plugin -> chrome"... Yep, the code wouldn't compile otherwise. > and I guessed that using intr instead of sync > is what causes the crash (regardless of e10s/non-e10s). Are we dealing with some sort or re-entrant call back over on the plugin side? > > It is OK to use sync instead. But we can only support GetKeyState hook with > > e10s. It becomes a limitation not to use e10s. > > Are you saying that changing to sync will not fix non-e10s? Since you said > e10s should be working, that would mean changing to sync has no benefit? Do > you have any better ideas that won't leave non-e10s broken? Do we want this to work under non-e10s? I thought this was an e10s only thing. From the code, non-e10s calls to GetKeyState wouldn't work currently.

Flags: needinfo?(m_kato) → needinfo?(davidp99)

Jim Mathies [:jimm]

Comment 11

•

8 years ago

This crash also only occurs in 64-bit builds of Firefox. https://crash-stats.mozilla.com/search/?signature=~PPluginModuleChild%3A%3ACallGetKeyState&product=Firefox&date=%3E%3D2017-01-19T20%3A23%3A00.000Z&date=%3C2017-01-26T20%3A23%3A00.000Z&_sort=-date&_facets=signature&_facets=dom_ipc_enabled&_facets=cpu_arch&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_arch

Hardware: Unspecified → x86_64

David Parks [:handyman]

Assignee

Comment 12

•

8 years ago

(In reply to Jim Mathies [:jimm] from comment #10) > > and I guessed that using intr instead of sync > > is what causes the crash (regardless of e10s/non-e10s). > > Are we dealing with some sort or re-entrant call back over on the plugin > side? Indeed. As I understand it, you are confirming what I thought I knew about how this setup works and the intr GetKeyState call [1] is likely the culprit. I'd assumed this stuff was entirely e10s code so we don't need to consider what happens without it. The plugin process' PluginModuleChild singleton that connects to the chrome process (the one you called lightweight) is always what these hijacked sandbox methods use, so the content process is irrelevant. Sounds like crash-stats agrees. I spell out why I think allowing the GetKeyState message to be intr causes a use-after-free error in comment 6. The gist is that the intr call gets interrupted by a destroy message while it waits for its response. The destroy message destroys the object that was interrupted. Then we return to that newly deleted object we had interrupted. Then we crash. My only question was whether we need the message to be intr. We don't do anything obvious to prevent the destruction of the MessageChannel during an interruption so, unless there is something fragile preventing the destruction while GetKeyState waits for a response, that can cause this crash. Simply changing GetKeyState to sync would fix this bug in that case. Do you know why this message is intr? If we do _need_ it to be intr for some reason then maybe we can put a setup in place that delays deletion when the ActorDestroy is received. Since that could be big and hairy and cause more bugs, I was checking to see if it is warranted. (Thanks for the hint on crash-stats. "dom ipc enabled" had fallen off of my list of alternative e10s names.) [1] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PPluginModule.ipdl#163

Flags: needinfo?(davidp99) → needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 13

•

8 years ago

Looking over bug 1180684 I'm not seeing the argument for it. Lets change this to sync.

Flags: needinfo?(jmathies)

David Parks [:handyman]

Assignee

Comment 14

•

8 years ago

Attached patch Make GetKeyState sync (obsolete) — Details — Splinter Review

Tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f1b19fc3c10018267551711ee96ef7d4dd584a4a&selectedJob=74911965

Attachment #8834137 - Flags: review?(jmathies)

Jim Mathies [:jimm]

Updated

•

8 years ago

Attachment #8834137 - Flags: review?(jmathies) → review+

David Parks [:handyman]

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Pulsebot

Comment 15

•

8 years ago

Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f4c6758e15f9 Make the GetKeyState IPDL call sync (it was intr). r=jimm

Keywords: checkin-needed

Carsten Book [:Tomcat]

Comment 16

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f4c6758e15f9

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox54: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Ryan VanderMeulen [:RyanVM]

Comment 17

•

8 years ago

crash-stats shows a pretty consistent crash rate up to the Feb 7 nightly and nothing with today's. Looks encouraging! Please request Aurora/Beta approval on this when you feel it has baked sufficiently.

Flags: needinfo?(davidp99)

David Parks [:handyman]

Assignee

Comment 18

•

8 years ago

I shared in Ryan's optimism... for a moment before I realized that my change would also change the method signature -- CallGetKeyState became SendGetKeyState. I searched for that crash signature and lo and behold: https://crash-stats.mozilla.com/report/index/ee978111-ba24-47cd-978d-66dd42170209 So, unfortunately, this didn't work at all. At least I was happy for a moment. :Tomcat, can you back out the change?

Status: RESOLVED → REOPENED

Flags: needinfo?(davidp99) → needinfo?(cbook)

Resolution: FIXED → ---

Carsten Book [:Tomcat]

Comment 19

•

8 years ago

(In reply to David Parks (dparks) [:handyman] from comment #18) > :Tomcat, can you back out the change? too bad :( yeah sure backed this out in https://hg.mozilla.org/mozilla-central/rev/b772e0f4138540113e91a46c99bb0d14ecc8acca so it should make todays nightly too

Flags: needinfo?(cbook)

David Parks [:handyman]

Assignee

Comment 20

•

8 years ago

Attached patch wip.patch — Details — Splinter Review

Attachment #8834137 - Attachment is obsolete: true

Gerry Chang [:gchang]

Comment 21

•

8 years ago

Too late for 51 and the volume of crashes is low. Mark 51 won't fix.

status-firefox51: affected → wontfix

Marco Castelluccio [:marco]

Updated

•

8 years ago

status-firefox54: fixed → affected

Jim Mathies [:jimm]

Updated

•

8 years ago

Whiteboard: sb?

Jim Mathies [:jimm]

Updated

•

8 years ago

Whiteboard: sb?

Julien Cristau [:jcristau]

Comment 22

•

8 years ago

Too late for firefox 52, mass-wontfix.

status-firefox52: affected → wontfix

Benjamin Smedberg

Updated

•

8 years ago

Priority: -- → P5

David Parks [:handyman]

Assignee

Comment 23

•

7 years ago

This antique may have been fixed by bug 1382251.

Marco Castelluccio [:marco]

Updated

•

7 years ago

status-firefox58: --- → affected

status-firefox59: --- → affected

status-firefox60: --- → ?

Jim Mathies [:jimm]

Updated

•

7 years ago

Status: REOPENED → RESOLVED

Closed: 8 years ago → 7 years ago

Resolution: --- → WORKSFORME

BMO Automation

Updated

•

3 years ago

Product: Core → Core Graveyard

Make GetKeyState sync 8 years ago David Parks [:handyman] 4.54 KB, patch	jimm : review+	Details \| Diff \| Splinter Review
wip.patch 8 years ago David Parks [:handyman] 30.12 KB, patch		Details \| Diff \| Splinter Review