Closed Bug 1262852 Opened 8 years ago Closed 8 years ago

[e10s] plugin? hang mozilla::widget::WinUtils::WaitForMessage | base::MessagePumpForUI::DoRunLoop

Categories

(Core :: DOM: Content Processes, defect, P3)

x86
Windows NT
defect

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
e10s + ---
firefox50 --- fixed

People

(Reporter: benjamin, Assigned: gsvelto)

References

Details

(Whiteboard: btpp-fixlater [fce-active-legacy])

Crash Data

Attachments

(2 files, 10 obsolete files)

366.54 KB, text/html
Details
16.10 KB, patch
jimm
: review+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-a650b636-8004-420b-88c1-41f402160325.
=============================================================

I was looking at the increased rate of plugin crashes with e10s enabled, and in general not a lot showed up. This signature, however, is much worse with e10s and is also strange/concerning.

To see the stacks of all the processes involved, visit http://bsmedberg.github.io/socorro-toolbox/html/multiple-minidumps.html?crashID=a650b636-8004-420b-88c1-41f402160325

The Firefox ("browser") process is doing nothing interesting.
The plugin process is doing nothing, waiting for an event (either IPC or a native event should wake up that event loop)
The content process is not doing plugin things! The stack is:

7	mozilla::ipc::MessageChannel::Send(IPC::Message *,IPC::Message *)
8	mozilla::dom::PScreenManagerChild::SendScreenForBrowser(mozilla::dom::IdType<mozilla::dom::TabParent> const &,mozilla::dom::ScreenDetails *,bool *)
9	nsScreenManagerProxy::ScreenForNativeWidget(void *,nsIScreen * *)
10	nsDeviceContext::FindScreen(nsIScreen * *)
11	nsDeviceContext::ComputeFullAreaUsingScreen(nsRect *)
12	nsLayoutUtils::GetDeviceContextForScreenInfo(nsPIDOMWindow *)

This stack should not be related to plugins at all. I don't understand how we could get a plugin hang out of this stack at all.

http://bsmedberg.github.io/socorro-toolbox/html/multiple-minidumps.html?crashID=127073c8-6176-4722-8157-cee1b2160325 has another one, and the content stack is even weirder, no evidence of a hang at all.

My current theory is that the stack here is not correct: if content was hung, it could trigger plugin hang monitoring to collect the stacks but then immediately resume running, which would leave us with "random" stacks in the content process.

Jim/Mike/Aaron, do any of you remember/know how the plugin hang detector works with e10s enabled? And whether we could "fix" it so that the content process doesn't resume execution until the hang minidumps have all been collected?
Flags: needinfo?(jmathies)
Priority: -- → P3
No longer blocks: e10s-crashes
Flags: needinfo?(jmathies)
Flags: needinfo?(jmathies)
(I'm putting 'fixlater' here but it's not like this isn't important ... this at least gets it out of the DOM team's query results ;)
Whiteboard: btpp-fixlater
Here's a breakdown of 27 of these during the beta experiment.
Flags: needinfo?(jmathies)
> Jim/Mike/Aaron, do any of you remember/know how the plugin hang detector
> works with e10s enabled? And whether we could "fix" it so that the content
> process doesn't resume execution until the hang minidumps have all been
> collected?

PluginModuleContentParent notifies the browser when it detects a hang in IPC code [1], which it sends on a background thread [2][3]. When the browser gets this notification, it takes a minidump of itself (which captures the state of the main thread during the hang) and then notifies the browser ui to display a hang notification. After that it's up to the user to decide what to do. If they chose 'fix it', TerminatePlugin is called and the uber crash reports gets saved out [4].

[1] http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#1202
[2] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessHangMonitor.cpp#415
[3] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessHangMonitor.cpp#431
[4] http://mxr.mozilla.org/mozilla-central/source/dom/ipc/ProcessHangMonitor.cpp#880

There appear to be two different types -

1) triple WaitForMessage hangs
2) hangs where the plugin and browser are in WaitForMessage and the content process is sending to the browser.

I've seen the triple WaitForMessage hangs occasionally but never in appreciable numbers. I was thinking it might have something to do with pending messages getting marked as not-new through some sort of message processing overlap such that the plugin gets stuck in a peek message type wait. This would lock up our message processing as well.

It could be a situation where the plugin frees itself up shortly after the hang evidence gets sent. The user would see a notification which would stick around for a short while giving them the opportunity to terminate. I don't see a way of avoiding a corner case like this though. We could confirm the hang on the background thread.. but you always have that window between the send from content to the browser where the plugin might free up.
We could add a background thread confirm after the user selects "fix it". No harm in that and it should clean up most of your "freed plugins that get terminated" if they exist.
Curious what you think, and if we want to do something like this, are you interested in doing the work or should I find someone on the e10s team to do it.
Flags: needinfo?(benjamin)
So, I think my question is why #2 is async/background thread instead of sync. Is that necessary? If we can make that essentially sync, that would solve the main problem here of collecting minidumps from a running process.

My understanding is that we collect all the minidumps at once, but don't *send* them until the user selects fix it?
Flags: needinfo?(benjamin) → needinfo?(jmathies)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #6)
> So, I think my question is why #2 is async/background thread instead of
> sync. Is that necessary? If we can make that essentially sync, that would
> solve the main problem here of collecting minidumps from a running process.

We switch threads there so that we don't disrupt the state of the browser stack when we receive the message on the browser side. 

> My understanding is that we collect all the minidumps at once, but don't
> *send* them until the user selects fix it?

Not currently, we collect the browser dump first before we trigger the notification on the main thread. Everything else gets collected when we call TerminatePlugin [1][2]. I guess this isn't ideal, it would be nice to cache everything immediately.

[1] http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#368
[2] http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#1213
Flags: needinfo?(jmathies)
I'm going to pick this bug up next week if it's OK with everybody.
(In reply to Gabriele Svelto [:gsvelto] from comment #8)
> I'm going to pick this bug up next week if it's OK with everybody.

Thanks! If you can't get to it in a week or two please ni me.
Assignee: nobody → gsvelto
(In reply to Jim Mathies [:jimm] from comment #9)
> Thanks! If you can't get to it in a week or two please ni me.

OK, thanks!
Status: NEW → ASSIGNED
I've been looking at the affected code for the past two days but before I dive into it and prepare a fix I wanted to ask a few questions to confirm I'm attempting the right thing:

- First of all I want to make sure that I'm going for the right fix. Comment 4 suggests confirming the hang from the background thread once the user selected the option to terminate the plugin; comment 7 on the other hand suggests gathering the minidumps immediately and deleting them in case the user doesn't choose to terminate the plug-in. Which one of the two solutions is the right one?

- Secondly, I've poked at the MessageChannel code but I couldn't find a simple way to check if a process had become unstuck. In the first case should confirmation be done by sending another message and checking that it also times out?

- Finally in the second solution is there a way to artificially reproduce this scenario so that I can trigger the affected code? I suppose I could write a plug-in that deliberately hangs to test this but it seems like a roundabout way to do it.
Flags: needinfo?(jmathies)
(In reply to Gabriele Svelto [:gsvelto] from comment #11)
> I've been looking at the affected code for the past two days but before I
> dive into it and prepare a fix I wanted to ask a few questions to confirm
> I'm attempting the right thing:
> 
> - First of all I want to make sure that I'm going for the right fix. Comment
> 4 suggests confirming the hang from the background thread once the user
> selected the option to terminate the plugin; comment 7 on the other hand
> suggests gathering the minidumps immediately and deleting them in case the
> user doesn't choose to terminate the plug-in. Which one of the two solutions
> is the right one?

I like the idea of collecting everything when we receive the first hang notification however I'm concerned about the overhead with doing this. Looking at PluginModuleChromeParent::TerminateChildProcess we can collect:

1) plugin process dump
2) content process dump
3) optionally collect two flash process dumps (most common scenario)

This would happen on the PBackground thread.

I'm not sure we want to take the overhead hit of doing this for every plugin hang notification. Benjamin, do you have opinion on this?


> - Secondly, I've poked at the MessageChannel code but I couldn't find a
> simple way to check if a process had become unstuck. In the first case
> should confirmation be done by sending another message and checking that it
> also times out?

We send multiple notifications about timeouts from PluginModuleContentParent::ShouldContinueFromReplyTimeout(). We also track the timeout via ProcessHangMonitor., see PluginModuleContentParent::OnExitedSyncSend().

> - Finally in the second solution is there a way to artificially reproduce
> this scenario so that I can trigger the affected code? I suppose I could
> write a plug-in that deliberately hangs to test this but it seems like a
> roundabout way to do it.

I usually put a really long Sleep statement in an input event handler in PluginInstanceChild. That code is a bit loopy, there are two event paths depending on whether you're working with windowed or windowless plugins. Ping me if you need some test cases. We have a test plugin you can load or you can test with flash, both work for this kind of testing.
Flags: needinfo?(jmathies) → needinfo?(benjamin)
Actually we collect four dumps, I forgot about the browser, which we already collect.

1) plugin process dump
2) content process dump
3) optionally collect two flash process dumps (most common scenario)
4) browser
The plugin hang notifications are done after what timeout, 10 or 45 seconds? I don't think the overhead of collecting dumps is significant compared to that amount of hanging. So I think we should treat each of these as a hang. The only question would be if this happens a lot we may end up eating up disk space. Are we counting these in telemetry now?
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #14)
> The plugin hang notifications are done after what timeout, 10 or 45 seconds?
> I don't think the overhead of collecting dumps is significant compared to
> that amount of hanging. So I think we should treat each of these as a hang.
> The only question would be if this happens a lot we may end up eating up
> disk space. Are we counting these in telemetry now?

The ProcessHangMonitor over on the parent side gets multiple notifications but is smart in that it only takes a browser dump the first time and caches that info -

https://dxr.mozilla.org/mozilla-central/rev/4feb4dd910a5a2d3061dbdd376a80975206819c6/dom/ipc/ProcessHangMonitor.cpp#608

We could take additional dumps here, maybe through a PluginModuleChromeParent helper similar to TerminatePlugin.
 
AFAICT nsExceptionHandler doesn't interact with telemetry.
https://dxr.mozilla.org/mozilla-central/source/toolkit/crashreporter/nsExceptionHandler.cpp#3732
Sorry for the late reply but I've been sick for a few days in a row. I will go ahead next week and take the minidump's approach as per comment 12.

On a side note, I was discussing with :ddurst the opportunity of enabling client-side stack traces wouldn't this bug be a good scenario for those? They would allow us to grab small, easy to de-duplicate traces of the problem without wasting too much of the client's storage. I've been evaluating the unwinding infrastructure we have in the profiler and it seems appropriate for the job.
Here's a WIP of the code that grabs the minidump right during the hang instead of waiting for later. I've just checked that it compiles; I'll test next week. The original code also refrained from taking redundant dumps so I should probably do the same here.
Attachment #8747066 - Flags: feedback?(jmathies)
(In reply to Gabriele Svelto [:gsvelto] from comment #16)
> Sorry for the late reply but I've been sick for a few days in a row. I will
> go ahead next week and take the minidump's approach as per comment 12.
> 
> On a side note, I was discussing with :ddurst the opportunity of enabling
> client-side stack traces wouldn't this bug be a good scenario for those?
> They would allow us to grab small, easy to de-duplicate traces of the
> problem without wasting too much of the client's storage. I've been
> evaluating the unwinding infrastructure we have in the profiler and it seems
> appropriate for the job.

Not sure, I don't know anything about client-side stack traces. Is there a bug where they are being discussed I can take a look at? Where do the traces end up, telemetry maybe?
(In reply to Jim Mathies [:jimm] from comment #18)
> Not sure, I don't know anything about client-side stack traces. Is there a
> bug where they are being discussed I can take a look at? Where do the traces
> end up, telemetry maybe?

There's an old bug (bug 650239) but it's actually under active discussion since it involves more than just capturing the traces (where do we send them? are minidumps still useful once we have them? etc...). If you're interested I'll keep you in the loop. Meanwhile I'll polish the patch here to avoid taking redundant minidumps and submit for review. If you have time to look at the WIP and give me some feedback that'll be very appreciated.
I've further tweaked the patch and reinstated the mechanism that ensured we wouldn't take redundant browser dumps. I did not apply the same mechanism to the child dumps because I assume those are going to be different but we might as well apply it to them if need be.

I feel like this is in good shape for review already.
Attachment #8747066 - Attachment is obsolete: true
Attachment #8747066 - Flags: feedback?(jmathies)
Attachment #8747632 - Flags: review?(jmathies)
Comment on attachment 8747632 [details] [diff] [review]
[PATCH] Create a minidump upon each plugin hang

Review of attachment 8747632 [details] [diff] [review]:
-----------------------------------------------------------------

generally looking great. One request if it's feasible.

::: dom/ipc/ProcessHangMonitor.cpp
@@ +620,5 @@
>          }
>        }
>      }
> +
> +    plugins::GatherMinidumps(phd.pluginId(), phd.contentProcessId(),

I think it might be cleaner if we take the opportunity to move the browser TakeMinidump work into this helper? Seems odd that we take that here then take the rest in plugins::GatherMinidumps. Maybe we could pass bit flags through telling it what we want to do?
Attachment #8747632 - Flags: review?(jmathies) → review-
(In reply to Jim Mathies [:jimm] from comment #21)
> I think it might be cleaner if we take the opportunity to move the browser
> TakeMinidump work into this helper? Seems odd that we take that here then
> take the rest in plugins::GatherMinidumps. Maybe we could pass bit flags
> through telling it what we want to do?

Good point, I left that bit there because I wanted to leave the logic used to decide if we need to take a new minidump or not intact; but those bits can easily be folded into the helper provided it returns the crash ID once the minidump has been taken.
Here's the updated patch with your comment addressed.
Attachment #8747632 - Attachment is obsolete: true
Attachment #8749609 - Flags: review?(jmathies)
The previous patch had an issue that prevented it from building properly on Windows, I didn't notice it until I ran a try build on all architectures.  This one should be OK.
Attachment #8749609 - Attachment is obsolete: true
Attachment #8749609 - Flags: review?(jmathies)
Attachment #8749627 - Flags: review?(jmathies)
Comment on attachment 8749627 [details] [diff] [review]
[PATCH v3] Create a minidump upon each plugin hang

Review of attachment 8749627 [details] [diff] [review]:
-----------------------------------------------------------------

Running this in a debug or opt build should generate debug breaks on these threading issues.. did you test this? Whenever you mess with plugins it's important to take the code for a spin. I noticed the thread safety issues while testing the generation of a report using the test plugin and a long Sleep to simulate a hang.

::: dom/ipc/ProcessHangMonitor.cpp
@@ +591,5 @@
>    if (aHangData.type() == HangData::TPluginHangData) {
>      MutexAutoLock lock(mBrowserCrashDumpHashLock);
>      const PluginHangData& phd = aHangData.get_PluginHangData();
>      if (!mBrowserCrashDumpIds.Get(phd.pluginId(), &crashId)) {
> +      if (!plugins::GatherMinidump(phd.pluginId(), phd.contentProcessId(),

You have an issue here, you're calling this on a background thread.

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +360,5 @@
> +                                 nsAString& aCrashId)
> +{
> +    MOZ_ASSERT(XRE_IsParentProcess());
> +
> +    RefPtr<nsPluginHost> host = nsPluginHost::GetInst();

nsPluginHost, nsPluginTag, and nsNPAPIPlugin don't implement thread safe nsISupports and you are accessing these on a background thread. We could add thread safe support (NS_DECL_THREADSAFE_ISUPPORTS vs. NS_DECL_ISUPPORTS) to each, although that can have performance consequences. I'd suggesting looking at the way we access the data in these (pluginTag->mPlugin?) to be sure what you access is thread safe as well, it might not be. In which case AddRef/Release is the simplest problem we have here.
Attachment #8749627 - Flags: review?(jmathies) → review-
(In reply to Jim Mathies [:jimm] from comment #25)
> Running this in a debug or opt build should generate debug breaks on these
> threading issues.. did you test this? Whenever you mess with plugins it's
> important to take the code for a spin. I noticed the thread safety issues
> while testing the generation of a report using the test plugin and a long
> Sleep to simulate a hang.

No, I didn't get to test it and I apologize for wasting your time. I've tried resurrecting an old plugin I wrote years ago on Friday night so that I could test it but couldn't get it to work. Could you point me to the the plugin/patch you use for testing?

> 
> ::: dom/ipc/ProcessHangMonitor.cpp
> @@ +591,5 @@
> >    if (aHangData.type() == HangData::TPluginHangData) {
> >      MutexAutoLock lock(mBrowserCrashDumpHashLock);
> >      const PluginHangData& phd = aHangData.get_PluginHangData();
> >      if (!mBrowserCrashDumpIds.Get(phd.pluginId(), &crashId)) {
> > +      if (!plugins::GatherMinidump(phd.pluginId(), phd.contentProcessId(),
> 
> You have an issue here, you're calling this on a background thread.
> 
> ::: dom/plugins/ipc/PluginModuleParent.cpp
> @@ +360,5 @@
> > +                                 nsAString& aCrashId)
> > +{
> > +    MOZ_ASSERT(XRE_IsParentProcess());
> > +
> > +    RefPtr<nsPluginHost> host = nsPluginHost::GetInst();
> 
> nsPluginHost, nsPluginTag, and nsNPAPIPlugin don't implement thread safe
> nsISupports and you are accessing these on a background thread. We could add
> thread safe support (NS_DECL_THREADSAFE_ISUPPORTS vs. NS_DECL_ISUPPORTS) to
> each, although that can have performance consequences. I'd suggesting
> looking at the way we access the data in these (pluginTag->mPlugin?) to be
> sure what you access is thread safe as well, it might not be. In which case
> AddRef/Release is the simplest problem we have here.

Yeah, that flew under my radar even though I was aware we were sending a runnable to the main thread when terminating the plugin which should have raised a red flag. I have to re-analyze this code more in-depth to figure out which would be the best way to do this in a thread-safe way.
Yeah, nsPluginHost access is definitely not thread-safe and calling nsPluginHost::PluginWithId() from any thread is racy since it iterates over the elements of a link-list held by the nsPluginHost instance. I have to rethink this code around the problem because we cannot touch the main thread by definition if we want better minidumps than what we get now.
OK, so after thinking a little more about the problem I think that the proper approach might be to:

1. Take all the minidumps on the background thread, within RecvHangData()
2. Send over the id for the aggregated minidumps to plugins::TerminatePlugin() which will add the remaining annotations like it did before

This way we should be taking all the dumps in the right place without having to touch nsPluginHost & friends from another thread.
Quick update: I'm trying to figure out how to reliably get the PID of the plugin & flash processes from the background thread which are required for taking the dumps.
OK, here's an outline of what I'm doing (and which is proving a little bit tricky). When we enter ProcessHangMonitor::NotifyPluginHang() we're in the right place to gather all required information that we'll need later: we're in the parent process' main thread so we can access the plugin facilities to retrieve the plugin process' PID, Flash processes' PID as well as the content process PID. I'm trying to gather all this data and then send it over to the child first and to the background thread in the main process later. Once I'm there I'm trying to gather the full minidump plus all necessary annotations. This is proving to be very tricky due to the thread-safety issues but it looks like the right place to get the best possible data. I hope to be able to post a new WIP patch between today and tomorrow.
Another quick update: I'm making progress with the patch but more importantly I finally managed to test this in an easy and repeatable way on both Windows, Linux and Mac and I feel silly for having wasted so much time trying to roll my own testing stuff. Long story short, the nptest plug-in we always build has a method named hang() which does exactly what it suggest it does. I've found it while wading through our mochitests until I found one which was testing exactly plugin hangs. Ironically enough the test is disabled in e10s mode :-/

Anyway I'm done with modifying how the minidump is taken I'll have to update all the relevant mochitests.
Whiteboard: btpp-fixlater → btpp-fixlater [fct-active]
Whiteboard: btpp-fixlater [fct-active] → btpp-fixlater [fce-active]
After much wrestling against this code I've finally managed to get something working that doesn't touch non-thread-safe bits were it shouldn't. The changes are actually smaller than with my previous patches but there's still a couple of quirks to fix.

First of all I've changed the approach here so that all the relevant PIDs are available to the background thread within RecvHangEvidence. This allow us to grab minidumps of the browser, content and plugin processes directly and later hand them over to TerminatePlugin().

PluginModuleChromeParent::TerminateChildProcess() was adjusted to cope with this scenario, for which I added an extra small helper to the CrashReporterParent class in order to generate - but not finalize right away - a crash report from an existing dump.

There's a couple of things I'm not sure about:

- First of all this code still takes the dumps for the flash processes in PluginModuleChromeParent::TerminateChildProcess(). Is this too late? Shall we also grab those minidumps in RecvHangEvidence()?

- Before my changes we would take a minidump of the plugin process and then attach the browser, content and other minidumps to it. Now we start with a minidump of the browser process to which we attach plugin, content and other minidumps as children. Does the order matter? If it does I'll tweak the logic in RecvHangEvidence() so that we start by grabbing the plugin dump.

Last but not least I've tested this on Windows and Linux and while sometimes it works correctly, most of the time it doesn't. It seems that PluginModuleChromeParent::ActorDestroy() is not being called and I'm not sure why yet.

This is why I'm holding back on asking for review just yet, I want to be sure I've got all the quirks ironed out first.
Attachment #8749627 - Attachment is obsolete: true
Attachment #8754898 - Flags: feedback?(jmathies)
Attachment #8754898 - Flags: feedback?(jmathies) → feedback+
(In reply to Gabriele Svelto [:gsvelto] from comment #32)
> Created attachment 8754898 [details] [diff] [review]
> [PATCH v4 WIP] Create a minidump upon each plugin hang
> 
> After much wrestling against this code I've finally managed to get something
> working that doesn't touch non-thread-safe bits were it shouldn't. The
> changes are actually smaller than with my previous patches but there's still
> a couple of quirks to fix.
> 
> First of all I've changed the approach here so that all the relevant PIDs
> are available to the background thread within RecvHangEvidence. This allow
> us to grab minidumps of the browser, content and plugin processes directly
> and later hand them over to TerminatePlugin().
> 
> PluginModuleChromeParent::TerminateChildProcess() was adjusted to cope with
> this scenario, for which I added an extra small helper to the
> CrashReporterParent class in order to generate - but not finalize right away
> - a crash report from an existing dump.
> 
> There's a couple of things I'm not sure about:
> 
> - First of all this code still takes the dumps for the flash processes in
> PluginModuleChromeParent::TerminateChildProcess(). Is this too late? Shall
> we also grab those minidumps in RecvHangEvidence()?

Hmm, I'd say no, since if these are locked up nothing will change during notification.

> - Before my changes we would take a minidump of the plugin process and then
> attach the browser, content and other minidumps to it. Now we start with a
> minidump of the browser process to which we attach plugin, content and other
> minidumps as children. Does the order matter? If it does I'll tweak the
> logic in RecvHangEvidence() so that we start by grabbing the plugin dump.

Yes I think we want the mail report to be a plugin report. Otherwise it'll probably show up in crash stats as a browser report. It'll also break some existing tools for pulling these out of crashstats.

> Last but not least I've tested this on Windows and Linux and while sometimes
> it works correctly, most of the time it doesn't. It seems that
> PluginModuleChromeParent::ActorDestroy() is not being called and I'm not
> sure why yet.
> 
> This is why I'm holding back on asking for review just yet, I want to be
> sure I've got all the quirks ironed out first.

Interesting, I don't see anything here that could cause that. Might be an existing bug?
Modified patch which takes a plugin dump first and then attaches the others to it. I'm holding back the review until my try run comes back green and I can test on Windows too. I have the feeling this will require some mochitest changes.

For now it's definitely working well on Linux, I've double-checked the minidumps with and without the patch and nothing seems to be missing.
Attachment #8754898 - Attachment is obsolete: true
Revised patch, now properly taking a plugin dump first then attaching the browser, content and flash dumps to it. I've also fixed an issue I had introduced w/o realizing it which caused two crash dumps to be created instead of one and I've factored out the code dealing with the minidumps to make it more readable.

There's only a couple of mochitests left to fix and we should be good to go.
Attachment #8755871 - Attachment is obsolete: true
I've spent most of today poring over the mochitests that broke only to find out that they were just fine, it was my patch's fault if they were failing. I was adding the content minidump to the crash annotations unconditionally which caused tests run in non-e10s mode to fail since that dump wasn't present. Updated patch coming soon.
This time it's looking good, manual testing yielded correct results and try is green:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a8d61cc0c994&selectedJob=21619419

Compared to the previous patch you reviewed the biggest differences are:

- I've added a field to the PluginHangData message to hold the plugin process PID, this way we can safely retrieve it in the main thread.

- RecvHangData() is now only grabbing the content, plugin and browser minidumps, all the remaining annotation is still done in TerminateChildProcess() so that we don't have to touch potentially thread-unsafe objects

- I've factored out the code grabbing the minidumps so that it's easier to read and RecvHangData() is not excessively bloated

- We first grab the browser minidump because it's the easiest to take (we're in the browser process after all) but then we grab the plugin minidump and pair the browser minidump to it, this way the resulting crashdump has the plugin minidump as it's main file as it used to be
Attachment #8756831 - Attachment is obsolete: true
Attachment #8757633 - Flags: review?(jmathies)
Comment on attachment 8757633 [details] [diff] [review]
[PATCH v7] Create a minidump upon each plugin hang

Review of attachment 8757633 [details] [diff] [review]:
-----------------------------------------------------------------

Looking good, a few outstanding questions.

Also please test this by triggering a plugin hang and confirming that a crash report with the desired information is generated. You can submit it via about:crashes and confirm everything makes it to crash stats too.

::: dom/ipc/ProcessHangMonitor.cpp
@@ +592,5 @@
> +      !CrashReporter::CreateMinidumpsAndPair(pluginHandle, 0,
> +                                             NS_LITERAL_CSTRING("browser"),
> +                                             browserDump,
> +                                             getter_AddRefs(pluginDump))) {
> +    NS_WARNING("Failed to generate a minidump for the plugin process");

do we need to delete browserDump here?

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +353,5 @@
>  
>  } // namespace
>  
> +base::ProcessId
> +mozilla::plugins::PluginProcessId(uint32_t aPluginId)

nit - lets null check pointers in here before accessing them and bail early if something unexpected happens.

@@ +358,5 @@
> +{
> +  RefPtr<nsPluginHost> host = nsPluginHost::GetInst();
> +  nsPluginTag* pluginTag = host->PluginWithId(aPluginId);
> +  if (!pluginTag || !pluginTag->mPlugin) {
> +      return false;

Do we have an invalid handle value for base::ProcessId we can use here? false might be the right numeric value but it's not the right type.
Attachment #8757633 - Flags: review?(jmathies) → review+
(In reply to Jim Mathies [:jimm] from comment #38)
> Also please test this by triggering a plugin hang and confirming that a
> crash report with the desired information is generated. You can submit it
> via about:crashes and confirm everything makes it to crash stats too.

Right. I've tested this on Windows and Linux but haven't sent the crash-report. I'll do that and post the link here.

> ::: dom/ipc/ProcessHangMonitor.cpp
> @@ +592,5 @@
> > +      !CrashReporter::CreateMinidumpsAndPair(pluginHandle, 0,
> > +                                             NS_LITERAL_CSTRING("browser"),
> > +                                             browserDump,
> > +                                             getter_AddRefs(pluginDump))) {
> > +    NS_WARNING("Failed to generate a minidump for the plugin process");
> 
> do we need to delete browserDump here?

Yes, good catch.

> ::: dom/plugins/ipc/PluginModuleParent.cpp
> @@ +353,5 @@
> >  
> >  } // namespace
> >  
> > +base::ProcessId
> > +mozilla::plugins::PluginProcessId(uint32_t aPluginId)
> 
> nit - lets null check pointers in here before accessing them and bail early
> if something unexpected happens.

Which pointers? I'm doing the same checks that were in plugins::TerminatePlugin(), should I also check host and chromeParent? BTW since that bit of code is almost identical between plugins::TerminatePlugin() and plugins::PluginProcessId() I can try and factor it out to avoid the duplication.

> @@ +358,5 @@
> > +{
> > +  RefPtr<nsPluginHost> host = nsPluginHost::GetInst();
> > +  nsPluginTag* pluginTag = host->PluginWithId(aPluginId);
> > +  if (!pluginTag || !pluginTag->mPlugin) {
> > +      return false;
> 
> Do we have an invalid handle value for base::ProcessId we can use here?
> false might be the right numeric value but it's not the right type.

Yes, mozilla::ipc::kInvalidProcessId, I modified the method signature while working on it and I must have forgot changing it.
Fixed the review nits (mainly more extensive null checks in  plugins::PluginProcessId() and proper removal of the browser minidump in case we couldn't capture the plugin minidump). Carrying over the r=jimm flag. The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=513975fea355
Attachment #8757633 - Attachment is obsolete: true
Attachment #8759625 - Flags: review+
After re-triggering a whole bunch of intermittent failures (none of which related to this patch) I managed to get the Try run completely green. Time to land.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c1dd7376263e
Create a minidump upon each plugin hang r=jimm
https://hg.mozilla.org/mozilla-central/rev/c1dd7376263e
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1279435
Backed out for causing bug 179435.

https://hg.mozilla.org/integration/mozilla-inbound/rev/60b340e55efe
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've spent some time re-thinking how to fix this and there doesn't seem to be an easy way out: the biggest issue is "bubbling" the thread ID of the plugin process' main thread all the way to the background thread in the browser process where this patch was taking the minidumps. If we were in the main thread we'd be able to get the proper minidump complete with thread and all from the main process but it's out of reach from the background thread.

One way to fix this would be take minidumps elsewhere: in the content process. When HangMonitorChild::NotifyPluginHang() is called we're in the content process main thread; we might take a minidump of both the plugin and content process from here (complete with the appropriate thread and all) and then pass it up to the browser process' background thread that would then proceed to take a dump of itself. An issue here is that we'd have to move the logic that prevents taking multiple minidumps here instead of in the browser process.

An alternative would be to send the thread ID along with the other info to the browser process' background thread. It's currently stored in the CrashReporter object as a private field so making it accessible would require us to expose it which feels fairly clunky. This would require less changes compared to the current patch but overall would touch more places in the code. I'll experiment with both approaches and see what gives the best results.
I've spent the last 48h moving the machinery for taking the plugin minidump outside of the browser process and into the content one only to find out that the CrashReporterParent object is instanced only in the browser process and not into the content process (which I should have figured out before since it's present in the PluginModuleChromeParent but not in the PluginModuleContentParent).

Long story short, I have to try and fix this via the hacky solution I outlined above, "bubbling up" the plugin process' main thread ID all the way to the background thread in the parent browser. Yuck.
Fourth different attempt at fixing this (and 10th patch, yuck). This is very different from the previous patches because after wrestling with making the native ID of the main thread accessible to the hang monitor I finally gave up on that approach: it's complicated, possibly non-portable or would require re-engineering most of our crash-reporting machinery. The plug-in-related stuff was never meant to be accessed outside of the main thread and the IPC-aware crash reporter classes cannot be used outside of the browser process.

Long story short I've decided to implement this using the next best thing that came to mind: capturing the plug-in and content process minidumps from the browser process's main thread *after* having capture a minidump of the browser process itself. This means we have safe access to all the browser-only, main-thread only bits we need to touch. The downside is that the plug-in minidump will be captured slightly later; it's still going to be a delay measured in milliseconds versus the current delay of seconds. I've also tried to factor out some bits to avoid duplications in the code-paths.

Here's the try run:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fbded315e36
Attachment #8759625 - Attachment is obsolete: true
Attachment #8764915 - Flags: review?(jmathies)
Another day another patch revision. The previous one "mostly" worked but would trigger the deadlock-detection logic when running mochitests in debug mode under Windows. This patch addresses that issue and factors out some of the shared code to avoid duplications. The try run is here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e370fc47678
Attachment #8764915 - Attachment is obsolete: true
Attachment #8764915 - Flags: review?(jmathies)
Attachment #8765386 - Flags: review?(jmathies)
Comment on attachment 8765386 [details] [diff] [review]
[PATCH v10] Create a minidump upon each plugin hang

Review of attachment 8765386 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ProcessHangMonitor.cpp
@@ +162,5 @@
>      mActor = nullptr;
>    }
>  
>    void SetHangData(const HangData& aHangData) { mHangData = aHangData; }
> +  void SetDumpId(nsString& aId) {

lets add a header comment here explaining what this is. I know this file isn't very good with header commenting but lets try to change that as we add new code. Maybe do a combined comment here for both the id and SetHangData.

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +364,5 @@
> +    return nullptr;
> +  }
> +  RefPtr<nsNPAPIPlugin> plugin = pluginTag->mPlugin;
> +
> +  return static_cast<PluginModuleChromeParent*>(plugin->GetLibrary());

GetLibrary returns a PluginModuleChromeParent? I would have expected some sort of nspr type here.

@@ +1324,5 @@
> +        TakeFullMinidump(aContentPid, EmptyString(), dumpId);
> +    }
> +
> +#ifdef XP_WIN
> +    mozilla::MutexAutoLock lock(mCrashReporterMutex);

I'm confused about mCrashReporterMutex, what does it protect against?
Attachment #8765386 - Flags: review?(jmathies) → review+
(In reply to Jim Mathies [:jimm] from comment #51)
> lets add a header comment here explaining what this is. I know this file
> isn't very good with header commenting but lets try to change that as we add
> new code. Maybe do a combined comment here for both the id and SetHangData.

I'll add some comments, they're badly needed actually; it took me a while to figure out how the whole machinery worked.

> GetLibrary returns a PluginModuleChromeParent? I would have expected some
> sort of nspr type here.

I've just factored out the code in mozilla::plugins::TerminatePlugin() which did this. Since it has always worked I suppose that's the case.

> I'm confused about mCrashReporterMutex, what does it protect against?

I'm not sure; but it's used only on Windows all over the place so I left it in place where I had to access the |mCrashReporter| field.
After re-triggering a bunch of intermittent failures the try run is finally green, ready to land.
Pushed by gsvelto@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e309c8e399ad
Create a minidump of the plugin process as soon as possible during hang r=jimm
https://hg.mozilla.org/mozilla-central/rev/e309c8e399ad
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
So this seems to have created a bunch of crash reports in crash-stats that are related to plugins but aren't marked in any way as plugin crashes.

Could you file appropriate bugs on Socorro to cause these crashes to be appropriately described/displayed in crash-stats?

e.g.:
bp-1dce7537-0134-4502-be4b-b8b942160701
Flags: needinfo?(gsvelto)
Flags: needinfo?(benjamin)
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #56)
> So this seems to have created a bunch of crash reports in crash-stats that
> are related to plugins but aren't marked in any way as plugin crashes.

I'm confused, this should affect only plug-in hangs, it shouldn't have changed the way plug-in crashes are reported. In my tests the generated reports always had the process type correctly set so 'plugin' but the report in your post doesn't seem to have any process type set :-/

> Could you file appropriate bugs on Socorro to cause these crashes to be
> appropriately described/displayed in crash-stats?
> 
> e.g.:
> bp-1dce7537-0134-4502-be4b-b8b942160701

If my patch here is really causing that issue then I think it should be backed out instead because there's something wrong with it... Though I have no idea what would that be :-|
Flags: needinfo?(gsvelto)
I've looked at the reports and maybe I know what's wrong: the additional_minidumps metadata entry doesn't seem to have been populated correctly. The plugin metadata information also seems to be missing; I'll double-check my patch to see what might be causing it.
I've manually checked all the code paths I could think of in my patch and I couldn't get it to generate a crash dump that looks like https://crash-stats.mozilla.com/report/index/1dce7537-0134-4502-be4b-b8b942160701

They all look like these (I've used different combinations of e10s/non-e10s as well as where the minidump is taken):

https://crash-stats.mozilla.com/report/index/8870f3aa-33f6-468d-9730-235402160702
https://crash-stats.mozilla.com/report/index/73c65cf9-1f35-4cdd-a539-a7c2f2160702
https://crash-stats.mozilla.com/report/index/e9d47bc9-b80c-4aa4-9648-100a62160702
https://crash-stats.mozilla.com/report/index/e9d47bc9-b80c-4aa4-9648-100a62160702

If you're sure that my patch is causing those crash reports to be generated then I'll back it out ASAP because it's really not supposed to generate crash reports that are different than what we already had.
Flags: needinfo?(dbaron)
Depends on: 1284030
OK, it's a confirmed regression, I'll have a patch ready soon to fix it.
Flags: needinfo?(dbaron)
Flags: needinfo?(benjamin)
Depends on: 1284821
Whiteboard: btpp-fixlater [fce-active] → btpp-fixlater [fce-active][fce-active-legacy]
Whiteboard: btpp-fixlater [fce-active][fce-active-legacy] → btpp-fixlater [fce-active]
Whiteboard: btpp-fixlater [fce-active] → [fce-active-legacy]
Whiteboard: [fce-active-legacy] → btpp-fixlater [fce-active-legacy]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: