Last Comment Bug 677711 - [OOPP] Trigger child shutdown on parent time outs
: [OOPP] Trigger child shutdown on parent time outs
Status: RESOLVED FIXED
[qa-]
:
Product: Core
Classification: Components
Component: Plug-ins (show other bugs)
: Trunk
: x86_64 Windows 7
: -- normal (vote)
: mozilla9
Assigned To: Jim Mathies [:jimm]
:
: Benjamin Smedberg [:bsmedberg]
Mentors:
Depends on:
Blocks: 680130 679238 683967
  Show dependency treegraph
 
Reported: 2011-08-09 15:21 PDT by Jim Mathies [:jimm]
Modified: 2012-01-04 11:33 PST (History)
12 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
fixed
-
-


Attachments
experimental patch (6.18 KB, patch)
2011-08-09 15:22 PDT, Jim Mathies [:jimm]
no flags Details | Diff | Splinter Review
patch v.1 (9.40 KB, patch)
2011-08-10 12:12 PDT, Jim Mathies [:jimm]
benjamin: review+
Details | Diff | Splinter Review
disable for aurora and up patch (1.00 KB, patch)
2011-09-30 11:24 PDT, Jim Mathies [:jimm]
christian: approval‑mozilla‑aurora+
Details | Diff | Splinter Review

Description Jim Mathies [:jimm] 2011-08-09 15:21:55 PDT

    
Comment 1 Jim Mathies [:jimm] 2011-08-09 15:22:24 PDT
Created attachment 551907 [details] [diff] [review]
experimental patch
Comment 2 Jim Mathies [:jimm] 2011-08-09 15:33:42 PDT
Been experimenting with this today. This actually works pretty well at getting out from under some of the parent side hangs I have test cases for. (bug 675645 & bug 611826 are good examples).

I started with just using NS_ABORT(), but that triggered a messy set of assertions in the parent that didn't look good. So I came up with this, which shuts down the module. The parent handles this fine and the child doesn't actually have to crash. Down side to this is we don't get crash stacks for either process. Also looked for a way to get a bit of data over to the parent async but so far no luck. I'll dig into that a bit more tomorrow.

I also experimented with destroying and re-creating plugin child windows which are often the root of the problem. Unfortunately flash did not like that at all, crashing in most cases. (I believe plugin api consumers are supposed to support swapping out to new windows, but apparently flash doesn't support it.)
Comment 3 Jim Mathies [:jimm] 2011-08-09 15:37:01 PDT
One other note - the 45 second time out in the child case is way too long. If we're going to do this, we should set that lower, at least for the child. If the browser is hung, I don't think any user is going to sit around staring at a frozen window for almost a minute. Plus, we should be proactive about freeing up the browser anyway. I've been testing with a five second time out which works quite well. That might be a little short.. maybe the number should be somewhere in between that and the parent hang time.
Comment 4 Benjamin Smedberg [:bsmedberg] 2011-08-10 07:14:41 PDT
We really should just NS_RUNTIMEABORT. What assertions do you see in the parent?
Comment 5 Jim Mathies [:jimm] 2011-08-10 07:23:28 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> We really should just NS_RUNTIMEABORT. What assertions do you see in the
> parent?

http://pastebin.mozilla.org/1295747

This also brings up the "plugin container problem" Windows dialog in debug mode on Windows. I didn't like that at all. I'd rather it get shutdown by the parent cleanly.. but I'm open to whatever. I was hoping a could get some stack data over to the parent somehow, was planning on looking into that today. It looks like we have a independent crash reporter pipe open for communication I might be able to use.
Comment 6 Benjamin Smedberg [:bsmedberg] 2011-08-10 07:51:05 PDT
###!!! ABORT: terminating child process: file f:/Mozilla/firefox/mc/dom/plugins/ipc/PluginModuleChild.cpp, line 532 is expected, that's what NS_RUNTIMEABORT prints.

###!!! ASSERTION: Mismatched RPC stack frames: 'this == mChannel->mTopFrame', file f:/Mozilla/firefox/mc/ipc/glue/WindowsMessageLoop.cpp, line 612
This sounds like it ought to be investigated.

###!!! ASSERTION: IPDL error:: 'Error', file f:/Mozilla/firefox/MC-DBG/ipc/ipdl/PPluginScriptableObjectChild.cpp, line 1226
This doesn't make much sense: once you issue a NS_RUNTIMEABORT, the child ought to be dead, but it's still running and issuing assertions?

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv
expected, this is normal

Also the windows crash dialog is normal if you don't have crash reporting enabled: if it is enabled you should be getting a normal minidump.
Comment 7 Ted Mielczarek [:ted.mielczarek] 2011-08-10 07:53:01 PDT
It might not work properly if you're killing the parent process. plugin-container doesn't write its own minidumps, it asks the chrome process to do so, and if you've killed that it will probably just give up and you'll get the Windows dialog.
Comment 8 Benjamin Smedberg [:bsmedberg] 2011-08-10 08:17:35 PDT
We're not killing the parent process; the plugin process is just committing suicide.
Comment 9 Jim Mathies [:jimm] 2011-08-10 08:21:52 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> ###!!! ASSERTION: Mismatched RPC stack frames: 'this ==
> mChannel->mTopFrame', file
> f:/Mozilla/firefox/mc/ipc/glue/WindowsMessageLoop.cpp, line 612
> This sounds like it ought to be investigated.
> 
> ###!!! ASSERTION: IPDL error:: 'Error', file
> f:/Mozilla/firefox/MC-DBG/ipc/ipdl/PPluginScriptableObjectChild.cpp, line
> 1226
> This doesn't make much sense: once you issue a NS_RUNTIMEABORT, the child
> ought to be dead, but it's still running and issuing assertions?

In debug mode, the child process gets hung up by the windows error collection dialog, so I think stuff underneath continues to run. I can debug a bit and see what's going on. This spews while the windows error reporting dialog is visible and is "checking for solutions".
Comment 10 Jim Mathies [:jimm] 2011-08-10 08:25:26 PDT
Note, even with the use of runtime abort, we still don't get minidumps:

WARNING: [PluginModuleParent::ActorDestroy] abnormal shutdown without minidump!

This is in ActorDestroy of PluginModuleParent. Not sure yet if there's a way to go ahead and trigger some sort of report. Whether we shut down the module or just crash, it seems some work will need to be done in getting crash data stored away.
Comment 11 Jim Mathies [:jimm] 2011-08-10 12:12:06 PDT
Created attachment 552165 [details] [diff] [review]
patch v.1

Ok, with the crash reporter enabled this is working out pretty well.
Comment 12 Jim Mathies [:jimm] 2011-08-10 12:13:38 PDT
Hmm, I have an unused timeoutMs var in the prefs observer, I'll remove that.
Comment 14 Benjamin Smedberg [:bsmedberg] 2011-08-15 07:45:46 PDT
Comment on attachment 552165 [details] [diff] [review]
patch v.1

Please mark tracking-firefoxN on this when it lands, since we'll probably want to track this very carefully. And file a followup on getting a parent-side stack when we do this child-side abort, since we really want a hang-pair in this case.
Comment 15 Jim Mathies [:jimm] 2011-08-16 03:39:40 PDT
Landed on inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/bef7bcdcd41e

Note to drivers: we probably want to disable this feature in releases. This will depend on the frequency stacks generated by these changes show up in crash stats.

To disable - the following release pref should be set to 0:

+// How long a plugin process will wait for a response from the parent
+// to a synchronous request before terminating itself. After this
+// point the child assumes the parent is hung.
+pref("dom.ipc.plugins.parentTimeoutSecs", 15);
Comment 16 Jim Mathies [:jimm] 2011-08-16 03:48:43 PDT
This isn't going to make the last 8 inbound merge, so when tracking 9 flags arrive in bugzilla they need to be set on this bug.
Comment 18 Asa Dotzler [:asa] 2011-09-27 14:53:29 PDT
Is this an approval request? This is already in 9, I presume. If it is an approval request, please unset the tracking flag and set the approval request in the patch with an explanation of the risk / reward.
Comment 19 Jim Mathies [:jimm] 2011-09-28 07:18:09 PDT
(In reply to Asa Dotzler [:asa] from comment #18)
> Is this an approval request? This is already in 9, I presume. If it is an
> approval request, please unset the tracking flag and set the approval
> request in the patch with an explanation of the risk / reward.

We want this disabled in releases for now. I'll post a patch to disable this in 9 now that we've branched.
Comment 20 Jim Mathies [:jimm] 2011-09-30 11:24:16 PDT
Created attachment 563783 [details] [diff] [review]
disable for aurora and up patch
Comment 21 Jim Mathies [:jimm] 2011-10-10 09:51:45 PDT
(In reply to Jim Mathies [:jimm] from comment #20)
> Created attachment 563783 [details] [diff] [review] [diff] [details] [review]
> disable for aurora and up patch

https://hg.mozilla.org/releases/mozilla-aurora/rev/c4862aaec55b
Comment 22 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-11-21 16:37:41 PST
Is there something QA can do to verify this fix?
Comment 23 Alex Keybl [:akeybl] 2011-12-13 16:09:41 PST
[Triage Comment]
Please provide STR here by tomorrow at 12:00PM PT 12/14. We will be holding our FF9 sign-offs later that afternoon, and need to be able to verify. Thanks!
Comment 24 Jim Mathies [:jimm] 2011-12-13 16:24:14 PST
(In reply to Alex Keybl [:akeybl] from comment #23)
> [Triage Comment]
> Please provide STR here by tomorrow at 12:00PM PT 12/14. We will be holding
> our FF9 sign-offs later that afternoon, and need to be able to verify.
> Thanks!

This code is only enabled on mc, so you don't have to worry about verifying anything. To verify it's disabled, you can check the pref in 'disable for aurora and up patch' is properly set.
Comment 25 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-13 16:46:39 PST
Thanks Jim. So to be completely clear...

dom.ipc.plugins.parentTimeoutSecs == 0 in Firefox 9 means this is FIXED?
Comment 26 Jim Mathies [:jimm] 2011-12-13 17:24:01 PST
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> Thanks Jim. So to be completely clear...
> 
> dom.ipc.plugins.parentTimeoutSecs == 0 in Firefox 9 means this is FIXED?

Disabled really. There's no need to confirm the feature is "fixed", the reports this generates have been showing up in crash stats since the first patch landed. I don't see the stacks from this in aurora builds so it's properly disabled. So yes, basically, confirmed fixed on mc and confirmed disabled on aurora. :)

The crash stack in question is:

mozalloc_abort(char const* const) | NS_DebugBreak_P | mozilla::plugins::PluginModuleChild::ShouldContinueFromReplyTimeout()

We may enabled in the future for all releases, but we'll file a new bug on that.
Comment 27 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-12-13 19:10:21 PST
Okay, thanks Jim. I'm going to mark this one qa-. We can verify the new bug when it gets enabled.
Comment 28 Alex Keybl [:akeybl] 2012-01-04 11:31:42 PST
Looks like this isn't fixed on FF10..

dom.ipc.plugins.parentTimeoutSecs=15

.. and is thus suspected to be causing bug 683967.

Jim - can you prepare the patch for FF10/11?
Comment 29 Alex Keybl [:akeybl] 2012-01-04 11:33:19 PST
Looks like bsmedberg will actually create the disable patch in 683967. Untracking here.

Note You need to log in before you can comment on or make changes to this bug.