Last Comment Bug 764660 - Collect and report on the CPU utilization of a hung plugin process
: Collect and report on the CPU utilization of a hung plugin process
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Plug-ins (show other bugs)
: unspecified
: x86 Windows 7
: -- normal (vote)
: mozilla17
Assigned To: Georg Fritzsche [:gfritzsche]
:
Mentors:
Depends on: 781133
Blocks: 781124
  Show dependency treegraph
 
Reported: 2012-06-13 17:14 PDT by Benjamin Smedberg [:bsmedberg]
Modified: 2012-09-17 06:28 PDT (History)
7 users (show)
ryanvm: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Collect and report plugin CPU utilization + test (14.51 KB, patch)
2012-07-11 08:40 PDT, Georg Fritzsche [:gfritzsche]
no flags Details | Diff | Splinter Review
Collect and report plugin CPU utilization + test (15.02 KB, patch)
2012-07-25 05:41 PDT, Georg Fritzsche [:gfritzsche]
netzen: review+
Details | Diff | Splinter Review
Data dump, tab-delimited (62.68 KB, text/plain)
2012-08-07 14:14 PDT, Benjamin Smedberg [:bsmedberg]
no flags Details

Description Benjamin Smedberg [:bsmedberg] 2012-06-13 17:14:00 PDT
One of the things that we came up with this week for better classifying plugin hang reports: we should report a "profile" of what the plugin process is doing while we're waiting for it. Actual profiling isn't really possible without frame pointers, but we should be able to collect information about whether the plugin process is using CPU. I think it should be pretty interesting to classify plugin crashes according to deadlocks, idle, or active processes.
Comment 1 Benjamin Smedberg [:bsmedberg] 2012-06-13 17:14:22 PDT
Oh, and we really only care about this on Windows. The other platforms are not worth doing.
Comment 2 Brian R. Bondy [:bbondy] 2012-06-13 18:12:26 PDT
We could use the performance counters API for this: 
http://msdn.microsoft.com/en-us/library/aa373078%28v=vs.85%29.aspx
Comment 3 Brian R. Bondy [:bbondy] 2012-06-13 18:32:53 PDT
An easier option for Vista and up: QueryProcessCycleTime
http://msdn.microsoft.com/en-us/library/ms684929
Comment 4 Georg Fritzsche [:gfritzsche] 2012-07-11 08:40:43 PDT
Created attachment 641065 [details] [diff] [review]
Collect and report plugin CPU utilization + test
Comment 5 Georg Fritzsche [:gfritzsche] 2012-07-11 08:54:44 PDT
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> Created attachment 641065 [details] [diff] [review]

Worth pointing out that i ended up using GetProcessTimes() as the performance counters API is a little more involved:
http://gfritzsche.blogspot.de/2012/06/pdh-api-gotchas.html
Comment 6 Brian R. Bondy [:bbondy] 2012-07-13 19:07:44 PDT
Sorry for the delay in reviewing this. I started reviewing this today, I'll try to finish it up over the weekend.
Comment 7 Brian R. Bondy [:bbondy] 2012-07-15 19:50:42 PDT
Comment on attachment 641065 [details] [diff] [review]
Collect and report plugin CPU utilization + test

Review of attachment 641065 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work on the tests, thanks for including them :)

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +173,4 @@
>              notes.Put(CS("HangID"), NS_ConvertUTF16toUTF8(hangID));
> +            if (mPluginCpuUsageOnHang >= 0) {
> +              notes.Put(CS("PluginCpuUsage"), 
> +                        nsPrintfCString("%.2f", mPluginCpuUsageOnHang));

Since this value is divided by the number of processors, I think it's of critical importance that we also have the number of processors reported here via PR_GetNumberOfProcessors().

@@ +173,5 @@
>              notes.Put(CS("HangID"), NS_ConvertUTF16toUTF8(hangID));
> +            if (mPluginCpuUsageOnHang >= 0) {
> +              notes.Put(CS("PluginCpuUsage"), 
> +                        nsPrintfCString("%.2f", mPluginCpuUsageOnHang));
> +            }

Do we need to reset the value of mPluginCpuUsageOnHang to -1 after this? I don't think so but just wanted to point it out in case.

@@ +205,4 @@
>          Close();
>  }
>  
> +namespace {

nit: put this inside the #ifdef XP_WIN

@@ +208,5 @@
> +namespace {
> +
> +#ifdef XP_WIN
> +
> +PRUint64 FileTimeToUTC(const FILETIME& ftime) 

PRUint64 on the previous line. Also put static before the return type.
Also use ULARGE_INTEGER since we're dealing with unsigned int64 here

@@ +216,5 @@
> +  li.HighPart = ftime.dwHighDateTime;
> +  return li.QuadPart;
> +}
> +
> +bool 

nit: Put static before the return type since this function isn't used anywhere else.

@@ +236,5 @@
> +  sampleTimes[0] = FileTimeToUTC(currentTime);
> +  cpuTimes[0]    = FileTimeToUTC(kernelTime) + FileTimeToUTC(userTime);
> +
> +  // we already hung for a while, a little bit longer won't matter
> +  ::Sleep(50);

Please check with bsmedberg whether a sleep on the main thread here for 50ms is OK. I'm not very familiar with the plugin code, so I can't confirm this myself.

@@ +257,5 @@
> +}
> +
> +#endif // #ifdef XP_WIN
> +
> +} // anonymous namespace

nit: put this inside the #endif

::: dom/plugins/ipc/PluginModuleParent.h
@@ +307,4 @@
>      nsString mPluginDumpID;
>      nsString mBrowserDumpID;
>      nsString mHangID;
> +    float mPluginCpuUsageOnHang;

nit: Surround this with #ifdef XP_WIN

::: dom/plugins/test/mochitest/hang_test.js
@@ +80,5 @@
> +    let extraData = parseKeyValuePairsFromFile(pluginExtraFile);
> +    ok("PluginCpuUsage" in extraData, "got extra field for plugin cpu usage");
> +    let cpuUsage = parseFloat(extraData["PluginCpuUsage"].replace(',', '.'));
> +    if (this.idleHang) {
> +      ok(cpuUsage == 0, "plugin cpu usage is 0%");

I think this may be susceptible to intermittent failures.  Could you push to try and then re-run the chrome mochitest that contains this test a bunch of times on each platform?  If it is we could try using <= some smaller value instead of explicitly 0.
Comment 8 Brian R. Bondy [:bbondy] 2012-07-16 01:40:11 PDT
Ignore the static before return type comments, I forgot momentarily when writing that, that the functions were inside an anonymous namespace.
Comment 9 Georg Fritzsche [:gfritzsche] 2012-07-25 05:41:48 PDT
Created attachment 645731 [details] [diff] [review]
Collect and report plugin CPU utilization + test

Sorry for the delay in getting back to this.
Addressed review comments and limited a few parts to Windows that i had overlooked.
Comment 10 Georg Fritzsche [:gfritzsche] 2012-07-25 05:43:56 PDT
(In reply to Brian R. Bondy [:bbondy] from comment #7)
> @@ +236,5 @@
> > +  sampleTimes[0] = FileTimeToUTC(currentTime);
> > +  cpuTimes[0]    = FileTimeToUTC(kernelTime) + FileTimeToUTC(userTime);
> > +
> > +  // we already hung for a while, a little bit longer won't matter
> > +  ::Sleep(50);
> 
> Please check with bsmedberg whether a sleep on the main thread here for 50ms
> is OK. I'm not very familiar with the plugin code, so I can't confirm this
> myself.

I checked this with him and it's ok as the browser already hung for a while (default timeout is 45 seconds).

> ::: dom/plugins/test/mochitest/hang_test.js
> @@ +80,5 @@
> > +    let extraData = parseKeyValuePairsFromFile(pluginExtraFile);
> > +    ok("PluginCpuUsage" in extraData, "got extra field for plugin cpu usage");
> > +    let cpuUsage = parseFloat(extraData["PluginCpuUsage"].replace(',', '.'));
> > +    if (this.idleHang) {
> > +      ok(cpuUsage == 0, "plugin cpu usage is 0%");
> 
> I think this may be susceptible to intermittent failures.  Could you push to
> try and then re-run the chrome mochitest that contains this test a bunch of
> times on each platform?  If it is we could try using <= some smaller value
> instead of explicitly 0.

Will do try tests later.
Comment 11 Brian R. Bondy [:bbondy] 2012-07-25 06:07:51 PDT
Comment on attachment 645731 [details] [diff] [review]
Collect and report plugin CPU utilization + test

Review of attachment 645731 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, given that the try tests mentioned earlier passes.
Comment 12 Georg Fritzsche [:gfritzsche] 2012-07-30 08:57:09 PDT
Try builds in progress:
https://tbpl.mozilla.org/?tree=Try&rev=5e40194ac943
Comment 13 Georg Fritzsche [:gfritzsche] 2012-08-01 04:24:18 PDT
The tests look fine, ran the chrome-tests 6 times on Win/WinXP debug/opt. No failures from this test, only 2 unrelated oranges.
Comment 14 Ryan VanderMeulen [:RyanVM] 2012-08-01 13:35:37 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/d2805aefa5b7
Comment 15 Ryan VanderMeulen [:RyanVM] 2012-08-01 19:39:07 PDT
https://hg.mozilla.org/mozilla-central/rev/d2805aefa5b7
Comment 16 Benjamin Smedberg [:bsmedberg] 2012-08-07 14:14:20 PDT
Created attachment 649806 [details]
Data dump, tab-delimited

Georg, here is a data dump of all the reports with PluginCpuUsage numbers. I think that we're probably going to need to try collecting the CPU for the other Flash processes on Windows as well in order to make this useful. Can you file and take the followup for that?
Comment 17 Benjamin Smedberg [:bsmedberg] 2012-08-07 14:38:00 PDT
Also, I think we have locale-specific formatting sneaking in: some of the numbers are '00.00' and some are '00,00'
Comment 18 Jim Mathies [:jimm] 2012-08-07 14:46:50 PDT
(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #16)
> Created attachment 649806 [details]
> Data dump, tab-delimited
> 
> Georg, here is a data dump of all the reports with PluginCpuUsage numbers. I
> think that we're probably going to need to try collecting the CPU for the
> other Flash processes on Windows as well in order to make this useful. Can
> you file and take the followup for that?

Seems like ~90% of these are cross process message hangs vs cpu spikes. I see a lot of zeros, and a lot of calls I know process events in the background.

Some of these we might be able to hook, like NtUserSetWindowPos, which seems pretty common.
Comment 19 Georg Fritzsche [:gfritzsche] 2012-08-08 04:42:53 PDT
Hm, looks like the occasional CPU spikes for stacks that are basically waiting on some synchronisation etc. might skew the picture.
Also, from the naming i would have expected "fastzero_I | _VEC_memzero" and "ycc_rgb_convert" to not be idle.
Otherwise the data looks sensible to me so far.

(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #16)
> I think that we're probably going to need to try collecting the CPU for the
> other Flash processes on Windows as well in order to make this useful. Can
> you file and take the followup for that?
Filed bug 781124.

(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #17)
> Also, I think we have locale-specific formatting sneaking in: some of the
> numbers are '00.00' and some are '00,00'

Oh, i fixed the locale issue in the test instead of where the .extra fields are written - filed bug 781133.

Note You need to log in before you can comment on or make changes to this bug.