Closed Bug 677711 Opened 8 years ago Closed 8 years ago

[OOPP] Trigger child shutdown on parent time outs

Categories

(Core :: Plug-ins, defect)

x86_64
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla9
Tracking Status
firefox9 - fixed
firefox10 - ---
firefox11 - ---

People

(Reporter: jimm, Assigned: jimm)

References

Details

(Whiteboard: [qa-])

Attachments

(2 files, 1 obsolete file)

No description provided.
Attached patch experimental patch (obsolete) — Splinter Review
Been experimenting with this today. This actually works pretty well at getting out from under some of the parent side hangs I have test cases for. (bug 675645 & bug 611826 are good examples).

I started with just using NS_ABORT(), but that triggered a messy set of assertions in the parent that didn't look good. So I came up with this, which shuts down the module. The parent handles this fine and the child doesn't actually have to crash. Down side to this is we don't get crash stacks for either process. Also looked for a way to get a bit of data over to the parent async but so far no luck. I'll dig into that a bit more tomorrow.

I also experimented with destroying and re-creating plugin child windows which are often the root of the problem. Unfortunately flash did not like that at all, crashing in most cases. (I believe plugin api consumers are supposed to support swapping out to new windows, but apparently flash doesn't support it.)
One other note - the 45 second time out in the child case is way too long. If we're going to do this, we should set that lower, at least for the child. If the browser is hung, I don't think any user is going to sit around staring at a frozen window for almost a minute. Plus, we should be proactive about freeing up the browser anyway. I've been testing with a five second time out which works quite well. That might be a little short.. maybe the number should be somewhere in between that and the parent hang time.
We really should just NS_RUNTIMEABORT. What assertions do you see in the parent?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #4)
> We really should just NS_RUNTIMEABORT. What assertions do you see in the
> parent?

http://pastebin.mozilla.org/1295747

This also brings up the "plugin container problem" Windows dialog in debug mode on Windows. I didn't like that at all. I'd rather it get shutdown by the parent cleanly.. but I'm open to whatever. I was hoping a could get some stack data over to the parent somehow, was planning on looking into that today. It looks like we have a independent crash reporter pipe open for communication I might be able to use.
###!!! ABORT: terminating child process: file f:/Mozilla/firefox/mc/dom/plugins/ipc/PluginModuleChild.cpp, line 532 is expected, that's what NS_RUNTIMEABORT prints.

###!!! ASSERTION: Mismatched RPC stack frames: 'this == mChannel->mTopFrame', file f:/Mozilla/firefox/mc/ipc/glue/WindowsMessageLoop.cpp, line 612
This sounds like it ought to be investigated.

###!!! ASSERTION: IPDL error:: 'Error', file f:/Mozilla/firefox/MC-DBG/ipc/ipdl/PPluginScriptableObjectChild.cpp, line 1226
This doesn't make much sense: once you issue a NS_RUNTIMEABORT, the child ought to be dead, but it's still running and issuing assertions?

###!!! [Parent][RPCChannel] Error: Channel error: cannot send/recv
expected, this is normal

Also the windows crash dialog is normal if you don't have crash reporting enabled: if it is enabled you should be getting a normal minidump.
It might not work properly if you're killing the parent process. plugin-container doesn't write its own minidumps, it asks the chrome process to do so, and if you've killed that it will probably just give up and you'll get the Windows dialog.
We're not killing the parent process; the plugin process is just committing suicide.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> ###!!! ASSERTION: Mismatched RPC stack frames: 'this ==
> mChannel->mTopFrame', file
> f:/Mozilla/firefox/mc/ipc/glue/WindowsMessageLoop.cpp, line 612
> This sounds like it ought to be investigated.
> 
> ###!!! ASSERTION: IPDL error:: 'Error', file
> f:/Mozilla/firefox/MC-DBG/ipc/ipdl/PPluginScriptableObjectChild.cpp, line
> 1226
> This doesn't make much sense: once you issue a NS_RUNTIMEABORT, the child
> ought to be dead, but it's still running and issuing assertions?

In debug mode, the child process gets hung up by the windows error collection dialog, so I think stuff underneath continues to run. I can debug a bit and see what's going on. This spews while the windows error reporting dialog is visible and is "checking for solutions".
Note, even with the use of runtime abort, we still don't get minidumps:

WARNING: [PluginModuleParent::ActorDestroy] abnormal shutdown without minidump!

This is in ActorDestroy of PluginModuleParent. Not sure yet if there's a way to go ahead and trigger some sort of report. Whether we shut down the module or just crash, it seems some work will need to be done in getting crash data stored away.
Attached patch patch v.1Splinter Review
Ok, with the crash reporter enabled this is working out pretty well.
Assignee: nobody → jmathies
Attachment #551907 - Attachment is obsolete: true
Attachment #552165 - Flags: review?(benjamin)
Hmm, I have an unused timeoutMs var in the prefs observer, I'll remove that.
Comment on attachment 552165 [details] [diff] [review]
patch v.1

Please mark tracking-firefoxN on this when it lands, since we'll probably want to track this very carefully. And file a followup on getting a parent-side stack when we do this child-side abort, since we really want a hang-pair in this case.
Attachment #552165 - Flags: review?(benjamin) → review+
Blocks: 679238
Landed on inbound:
http://hg.mozilla.org/integration/mozilla-inbound/rev/bef7bcdcd41e

Note to drivers: we probably want to disable this feature in releases. This will depend on the frequency stacks generated by these changes show up in crash stats.

To disable - the following release pref should be set to 0:

+// How long a plugin process will wait for a response from the parent
+// to a synchronous request before terminating itself. After this
+// point the child assumes the parent is hung.
+pref("dom.ipc.plugins.parentTimeoutSecs", 15);
This isn't going to make the last 8 inbound merge, so when tracking 9 flags arrive in bugzilla they need to be set on this bug.
http://hg.mozilla.org/mozilla-central/rev/bef7bcdcd41e
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla9
Blocks: 680130
Is this an approval request? This is already in 9, I presume. If it is an approval request, please unset the tracking flag and set the approval request in the patch with an explanation of the risk / reward.
(In reply to Asa Dotzler [:asa] from comment #18)
> Is this an approval request? This is already in 9, I presume. If it is an
> approval request, please unset the tracking flag and set the approval
> request in the patch with an explanation of the risk / reward.

We want this disabled in releases for now. I'll post a patch to disable this in 9 now that we've branched.
Attachment #563783 - Flags: approval-mozilla-aurora?
Attachment #563783 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Jim Mathies [:jimm] from comment #20)
> Created attachment 563783 [details] [diff] [review] [diff] [details] [review]
> disable for aurora and up patch

https://hg.mozilla.org/releases/mozilla-aurora/rev/c4862aaec55b
Is there something QA can do to verify this fix?
Whiteboard: [qa?]
[Triage Comment]
Please provide STR here by tomorrow at 12:00PM PT 12/14. We will be holding our FF9 sign-offs later that afternoon, and need to be able to verify. Thanks!
(In reply to Alex Keybl [:akeybl] from comment #23)
> [Triage Comment]
> Please provide STR here by tomorrow at 12:00PM PT 12/14. We will be holding
> our FF9 sign-offs later that afternoon, and need to be able to verify.
> Thanks!

This code is only enabled on mc, so you don't have to worry about verifying anything. To verify it's disabled, you can check the pref in 'disable for aurora and up patch' is properly set.
Thanks Jim. So to be completely clear...

dom.ipc.plugins.parentTimeoutSecs == 0 in Firefox 9 means this is FIXED?
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #25)
> Thanks Jim. So to be completely clear...
> 
> dom.ipc.plugins.parentTimeoutSecs == 0 in Firefox 9 means this is FIXED?

Disabled really. There's no need to confirm the feature is "fixed", the reports this generates have been showing up in crash stats since the first patch landed. I don't see the stacks from this in aurora builds so it's properly disabled. So yes, basically, confirmed fixed on mc and confirmed disabled on aurora. :)

The crash stack in question is:

mozalloc_abort(char const* const) | NS_DebugBreak_P | mozilla::plugins::PluginModuleChild::ShouldContinueFromReplyTimeout()

We may enabled in the future for all releases, but we'll file a new bug on that.
Okay, thanks Jim. I'm going to mark this one qa-. We can verify the new bug when it gets enabled.
Whiteboard: [qa?] → [qa-]
Looks like this isn't fixed on FF10..

dom.ipc.plugins.parentTimeoutSecs=15

.. and is thus suspected to be causing bug 683967.

Jim - can you prepare the patch for FF10/11?
Blocks: 683967
Looks like bsmedberg will actually create the disable patch in 683967. Untracking here.
You need to log in before you can comment on or make changes to this bug.