Closed Bug 1307708 Opened 6 years ago Closed 4 years ago

Crash in mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState

Categories

(Core Graveyard :: Plug-ins, defect, P5)

x86_64
Windows 10
defect

Tracking

(firefox50 unaffected, firefox51 wontfix, firefox52 wontfix, firefox53 affected, firefox54 affected, firefox58 affected, firefox59 affected, firefox60 ?)

RESOLVED WORKSFORME
mozilla54
Tracking Status
firefox50 --- unaffected
firefox51 --- wontfix
firefox52 --- wontfix
firefox53 --- affected
firefox54 --- affected
firefox58 --- affected
firefox59 --- affected
firefox60 --- ?

People

(Reporter: Tabs16, Assigned: handyman)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-a619d2b5-2fab-4e36-8b53-28ad32161005.
=============================================================
As far as I found out the crash associated multiple users.
It started with Build ID   20160927030200  on   2016-09-30 07:18:13    and it is still continuing, latest crash on Build ID   20161003030438 .
The crash is on Firefox 52.0a1 version and Firefox 51 version ,Windows NT
Maximum crashes encountered on October 3,2016.
Uptime range: 1-5 min
Flags: needinfo?(benjamin)
This signature looks similar to one I found in Bug 1303755 - ni on David to see if it is the same issue.
Component: Untriaged → Security: Process Sandboxing
Flags: needinfo?(benjamin) → needinfo?(davidp99)
Product: Firefox → Core
This bug may have been exacerbated by the patch -- I'm not yet sure -- but the patch itself looks to have no real influence on the crash posted in comment 0.  The patch from bug 1303755 tests to make sure that the PMCGetKeyState method is running on the main thread.  If not, it posts to the main thread and waits (using a semaphore) for the main thread to return a result.  Looking at the stack for thread 0 and the other threads in the bug report, the PMCGetKeyState call originated on the main thread (also, it does not appear in the non-crashing-thread stacks).  In this case, the old code is run.
Crash Signature: [@ mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState] → [@ mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState] [@ MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc::Messa…
Assignee: nobody → davidp99
Flags: needinfo?(davidp99)
Whiteboard: sbwc2
Resetting the ownership and moving to the proper component. This is low volume and not a regression from our sandboxing work.
Assignee: davidp99 → nobody
Component: Security: Process Sandboxing → Plug-ins
Whiteboard: sbwc2
Crash Signature: mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState ] → mozilla::ipc::MessageChannel::Call | mozilla::plugins::PPluginModuleChild::CallGetKeyState ] [@ RtlAnsiStringToUnicodeString | MsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify | mozilla::ipc:…
Currently the #14 topcrasher on Aurora. All the reports are on Win10. David, can you please take another look?
Flags: needinfo?(davidp99)
Wow.  I'm on it.

Note: There are three crash signatures attached to this bug.  All entries with the first signature are Win10, the second belong to Win8 and the third to Win7.  So this is on pretty much all Windows.
Assignee: nobody → davidp99
Flags: needinfo?(davidp99)
I think I see what's happening here.  This is all conjecture as we have no kind of STR for this.  But the conditions of this crash are such that it the cause has to be fairly extreme.

This condition happens when the GetKeyState call is on the main thread.  The IPDL call then happens immediately.  This is reflected in crash-stats's call stacks and is why I dont think bug 1303755 is related.

Instead, I think the problem can be traced to the fact that the IPDL call is 'intr' [1].  This was set in  bug 1180684.  What I believe is happening is that the plugin itself -- the PluginModuleChild -- is being ActorDestroy-ed by a message that is interrupting the GetKeyState ipdl call.  The ActorDestroy runs fine but since the actor is above it on the stack due to the GetKeyState call, when the stack unwinds to that point, we get a crash.  For us, this is in the OS's NtWaitForMultipleObjects interrupt handler, so it isn't obvious the issue is due to the actor, but it is.  The handle we wait on, MessageChannel::mEvent [2], is closed in the MessageChannel destructor [3], which is freed as a member of the PluginModuleChild actor.

I didn't suspect this as I didn't think that intr would allow an actor to be destroyed if it is on the stack... but I'm pretty sure it does.  I'm pretty certain that detecting this condition is what the AutoStackGuard in PluginInstanceChild's (many) intr message handlers is about (e.g. [4]).  AutoStackGuard forces a crash if a message tries to delete a PluginInstanceChild that is currently on the stack.  I take it that PluginInstanceChild is carefully designed to keep from triggering this crash.

There's a lot of complexity in figuring out the order of these messages and how this sort of thing can happen but I don't think we need to go through it here because we may just be able to change the GetKeyState ipdl message from intr to sync.  This is certainly the easiest prospect.

NIing masayuki about the patch that set the message to intr in bug 1180684.  Do we actually need this message to be interruptable?  If so, do you know what guarantees we can exploit to establish the protocol that the interruptions won't delete our actor from underneath us?

----

[1] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PPluginModule.ipdl#163
[2] https://hg.mozilla.org/mozilla-central/annotate/0ddfec7126ec/ipc/glue/WindowsMessageLoop.cpp#l1247
[3] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/ipc/glue/MessageChannel.cpp#528
[4] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PluginInstanceChild.cpp#714
Flags: needinfo?(masayuki)
Err... wrong person!  Fixing the ni.
Flags: needinfo?(masayuki) → needinfo?(m_kato)
I think most (all?) crashes are that s10s is turned off.  If e10s, plugin process connects to chrome process directly.  Then, since chrome process won't use intr/sync to content process, this issue won't occur.  But when disabling e10s, chrome process uses "intr" to plugin process.

It is OK to use sync instead.  But we can only support GetKeyState hook with e10s.  It becomes a limitation not to use e10s.
Flags: needinfo?(m_kato)
Thanks Makoto.  A few more questions:

(In reply to Makoto Kato [:m_kato] from comment #8)
> I think most (all?) crashes are that s10s is turned off.

Is this a hunch or do you see that in crash-stats?  It didn't even occur to me to look for non-e10s but I don't know how to tell in crash-stats.

> If e10s, plugin
> process connects to chrome process directly.  Then, since chrome process
> won't use intr/sync to content process, this issue won't occur.  But when
> disabling e10s, chrome process uses "intr" to plugin process.

I think I am either misunderstanding the chrome/content/plugin process connections or this is something I don't know about the limitations of the 'intr' protocol.  I know that, with e10s on, PPluginModule connects the plugin process to chrome as you say.  I assumed its the same in non-e10s.  Are you saying its not?  I didn't think the content process was involved in either case -- is it?

I figure that using 'intr GetKeyState' in "plugin -> chrome" messages allows the plugin process to handle other messages from chrome while waiting for a response to the 'intr GetKeyState' message... in both e10s and non-e10s cases.  Do you mean that e10s will forbid this?  Are you suggesting that the crash is due to intr being illegal instead of the ActorDestroy message I guessed?  I know e10s forbids sync/intr messages that are "chrome -> plugin" and "chrome -> content" but I've been assuming it allows sync/intr messages that are "plugin -> chrome"... and I guessed that using intr instead of sync is what causes the crash (regardless of e10s/non-e10s).

> 
> It is OK to use sync instead.  But we can only support GetKeyState hook with
> e10s.  It becomes a limitation not to use e10s.

Are you saying that changing to sync will not fix non-e10s?  Since you said e10s should be working, that would mean changing to sync has no benefit?  Do you have any better ideas that won't leave non-e10s broken?

---

Since we don't have an STR or much else to go on with this bug, I feel pretty limited in how far I can get here without experimentation.  I'd like to try changing the protocol to 'sync' to see if the crashes go away... as long as it won't cause problems.  I can't see any problems it could create but if it is inherently incompatible with the non-e10s case as you suggest then I won't bother.  Can you tell me what changing to sync would break in non-e10s?  Does it sound like a reasonable idea to try it?
Flags: needinfo?(m_kato)
(In reply to David Parks (dparks) [:handyman] from comment #9)
> (In reply to Makoto Kato [:m_kato] from comment #8)
> > I think most (all?) crashes are that s10s is turned off.
> 
> Is this a hunch or do you see that in crash-stats?  It didn't even occur to
> me to look for non-e10s but I don't know how to tell in crash-stats.


You can search for the signature in crash-stats using the SuperSearch interface to find this.

https://crash-stats.mozilla.com/search/?product=Firefox&_dont_run=1

Expand the more options section and add a 'facet on' property, for example dom ipc enabled -

https://crash-stats.mozilla.com/search/?signature=~PPluginModuleChild%3A%3ACallGetKeyState&product=Firefox&date=%3E%3D2017-01-19T20%3A23%3A00.000Z&date=%3C2017-01-26T20%3A23%3A00.000Z&_sort=-date&_facets=signature&_facets=dom_ipc_enabled&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-dom_ipc_enabled

FWIW, looks like this crash is 100% e10s enabled.
 
> > If e10s, plugin
> > process connects to chrome process directly.  Then, since chrome process
> > won't use intr/sync to content process, this issue won't occur.  But when
> > disabling e10s, chrome process uses "intr" to plugin process.
> 
> I think I am either misunderstanding the chrome/content/plugin process
> connections or this is something I don't know about the limitations of the
> 'intr' protocol.  I know that, with e10s on, PPluginModule connects the
> plugin process to chrome as you say.  I assumed its the same in non-e10s. 
> Are you saying its not?  I didn't think the content process was involved in
> either case -- is it?

Under e10s we use a 3-way bridge. The plugin process is connected to content through its normal ipdl interfaces. There also a light weight connection to the browser.

http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleChild.cpp#67


> I figure that using 'intr GetKeyState' in "plugin -> chrome" messages allows
> the plugin process to handle other messages from chrome while waiting for a
> response to the 'intr GetKeyState' message... in both e10s and non-e10s
> cases.

Yes, although under e10s the PluginModuleParent call would fail in the content process. But I don't think this actually gets called. Looks like the chrome (browser) PluginModuleParent makes the actual call. 

http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleParent.cpp#3387
http://searchfox.org/mozilla-central/rev/7da3c9dcf467964f2fb82f3a4c63972ee79bf696/dom/plugins/ipc/PluginModuleParent.cpp#3394

Under non-e10s, if this was called, the call would fail in PluginModuleParent::AnswerGetKeyState.

> Do you mean that e10s will forbid this?  Are you suggesting that the
> crash is due to intr being illegal instead of the ActorDestroy message I
> guessed?  I know e10s forbids sync/intr messages that are "chrome -> plugin"
> and "chrome -> content" but I've been assuming it allows sync/intr messages
> that are "plugin -> chrome"... 

Yep, the code wouldn't compile otherwise.

> and I guessed that using intr instead of sync
> is what causes the crash (regardless of e10s/non-e10s).

Are we dealing with some sort or re-entrant call back over on the plugin side?

> > It is OK to use sync instead.  But we can only support GetKeyState hook with
> > e10s.  It becomes a limitation not to use e10s.
> 
> Are you saying that changing to sync will not fix non-e10s?  Since you said
> e10s should be working, that would mean changing to sync has no benefit?  Do
> you have any better ideas that won't leave non-e10s broken?

Do we want this to work under non-e10s? I thought this was an e10s only thing. From the code, non-e10s calls to GetKeyState wouldn't work currently.
Flags: needinfo?(m_kato) → needinfo?(davidp99)
(In reply to Jim Mathies [:jimm] from comment #10)
> > and I guessed that using intr instead of sync
> > is what causes the crash (regardless of e10s/non-e10s).
> 
> Are we dealing with some sort or re-entrant call back over on the plugin
> side?

Indeed.  As I understand it, you are confirming what I thought I knew about how this setup works and the intr GetKeyState call [1] is likely the culprit.  I'd assumed this stuff was entirely e10s code so we don't need to consider what happens without it.  The plugin process' PluginModuleChild singleton that connects to the chrome process (the one you called lightweight) is always what these hijacked sandbox methods use, so the content process is irrelevant.  Sounds like crash-stats agrees.

I spell out why I think allowing the GetKeyState message to be intr causes a use-after-free error in comment 6.  The gist is that the intr call gets interrupted by a destroy message while it waits for its response.  The destroy message destroys the object that was interrupted.  Then we return to that newly deleted object we had interrupted.  Then we crash.

My only question was whether we need the message to be intr.  We don't do anything obvious to prevent the destruction of the MessageChannel during an interruption so, unless there is something fragile preventing the destruction while GetKeyState waits for a response, that can cause this crash.  Simply changing GetKeyState to sync would fix this bug in that case.  Do you know why this message is intr?

If we do _need_ it to be intr for some reason then maybe we can put a setup in place that delays deletion when the ActorDestroy is received.  Since that could be big and hairy and cause more bugs, I was checking to see if it is warranted.

(Thanks for the hint on crash-stats.  "dom ipc enabled" had fallen off of my list of alternative e10s names.)

[1] https://dxr.mozilla.org/mozilla-central/rev/2963cf6be7f830c0d2155e2968cfc53585868a76/dom/plugins/ipc/PPluginModule.ipdl#163
Flags: needinfo?(davidp99) → needinfo?(jmathies)
Looking over bug 1180684 I'm not seeing the argument for it. Lets change this to sync.
Flags: needinfo?(jmathies)
Attachment #8834137 - Flags: review?(jmathies) → review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f4c6758e15f9
Make the GetKeyState IPDL call sync (it was intr). r=jimm
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/f4c6758e15f9
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
crash-stats shows a pretty consistent crash rate up to the Feb 7 nightly and nothing with today's. Looks encouraging! Please request Aurora/Beta approval on this when you feel it has baked sufficiently.
Flags: needinfo?(davidp99)
I shared in Ryan's optimism... for a moment before I realized that my change would also change the method signature -- CallGetKeyState became SendGetKeyState.  I searched for that crash signature and lo and behold:

https://crash-stats.mozilla.com/report/index/ee978111-ba24-47cd-978d-66dd42170209

So, unfortunately, this didn't work at all.  At least I was happy for a moment.

:Tomcat, can you back out the change?
Status: RESOLVED → REOPENED
Flags: needinfo?(davidp99) → needinfo?(cbook)
Resolution: FIXED → ---
(In reply to David Parks (dparks) [:handyman] from comment #18)

> :Tomcat, can you back out the change?

too bad :(

yeah sure backed this out in https://hg.mozilla.org/mozilla-central/rev/b772e0f4138540113e91a46c99bb0d14ecc8acca so it should make todays nightly too
Flags: needinfo?(cbook)
Attached patch wip.patchSplinter Review
Attachment #8834137 - Attachment is obsolete: true
Too late for 51 and the volume of crashes is low. Mark 51 won't fix.
Whiteboard: sb?
Whiteboard: sb?
Too late for firefox 52, mass-wontfix.
Priority: -- → P5
This antique may have been fixed by bug 1382251.
Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → WORKSFORME
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.