Closed Bug 764660 Opened 12 years ago Closed 12 years ago

Collect and report on the CPU utilization of a hung plugin process

Categories

(Core Graveyard :: Plug-ins, defect)

x86
Windows 7
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla17

People

(Reporter: benjamin, Assigned: gfritzsche)

References

Details

Attachments

(2 files, 1 obsolete file)

One of the things that we came up with this week for better classifying plugin hang reports: we should report a "profile" of what the plugin process is doing while we're waiting for it. Actual profiling isn't really possible without frame pointers, but we should be able to collect information about whether the plugin process is using CPU. I think it should be pretty interesting to classify plugin crashes according to deadlocks, idle, or active processes.
Oh, and we really only care about this on Windows. The other platforms are not worth doing.
We could use the performance counters API for this: 
http://msdn.microsoft.com/en-us/library/aa373078%28v=vs.85%29.aspx
An easier option for Vista and up: QueryProcessCycleTime
http://msdn.microsoft.com/en-us/library/ms684929
Assignee: nobody → georg.fritzsche
OS: Mac OS X → Windows 7
Attachment #641065 - Flags: review?(netzen)
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> Created attachment 641065 [details] [diff] [review]

Worth pointing out that i ended up using GetProcessTimes() as the performance counters API is a little more involved:
http://gfritzsche.blogspot.de/2012/06/pdh-api-gotchas.html
Status: NEW → ASSIGNED
Sorry for the delay in reviewing this. I started reviewing this today, I'll try to finish it up over the weekend.
Comment on attachment 641065 [details] [diff] [review]
Collect and report plugin CPU utilization + test

Review of attachment 641065 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work on the tests, thanks for including them :)

::: dom/plugins/ipc/PluginModuleParent.cpp
@@ +173,4 @@
>              notes.Put(CS("HangID"), NS_ConvertUTF16toUTF8(hangID));
> +            if (mPluginCpuUsageOnHang >= 0) {
> +              notes.Put(CS("PluginCpuUsage"), 
> +                        nsPrintfCString("%.2f", mPluginCpuUsageOnHang));

Since this value is divided by the number of processors, I think it's of critical importance that we also have the number of processors reported here via PR_GetNumberOfProcessors().

@@ +173,5 @@
>              notes.Put(CS("HangID"), NS_ConvertUTF16toUTF8(hangID));
> +            if (mPluginCpuUsageOnHang >= 0) {
> +              notes.Put(CS("PluginCpuUsage"), 
> +                        nsPrintfCString("%.2f", mPluginCpuUsageOnHang));
> +            }

Do we need to reset the value of mPluginCpuUsageOnHang to -1 after this? I don't think so but just wanted to point it out in case.

@@ +205,4 @@
>          Close();
>  }
>  
> +namespace {

nit: put this inside the #ifdef XP_WIN

@@ +208,5 @@
> +namespace {
> +
> +#ifdef XP_WIN
> +
> +PRUint64 FileTimeToUTC(const FILETIME& ftime) 

PRUint64 on the previous line. Also put static before the return type.
Also use ULARGE_INTEGER since we're dealing with unsigned int64 here

@@ +216,5 @@
> +  li.HighPart = ftime.dwHighDateTime;
> +  return li.QuadPart;
> +}
> +
> +bool 

nit: Put static before the return type since this function isn't used anywhere else.

@@ +236,5 @@
> +  sampleTimes[0] = FileTimeToUTC(currentTime);
> +  cpuTimes[0]    = FileTimeToUTC(kernelTime) + FileTimeToUTC(userTime);
> +
> +  // we already hung for a while, a little bit longer won't matter
> +  ::Sleep(50);

Please check with bsmedberg whether a sleep on the main thread here for 50ms is OK. I'm not very familiar with the plugin code, so I can't confirm this myself.

@@ +257,5 @@
> +}
> +
> +#endif // #ifdef XP_WIN
> +
> +} // anonymous namespace

nit: put this inside the #endif

::: dom/plugins/ipc/PluginModuleParent.h
@@ +307,4 @@
>      nsString mPluginDumpID;
>      nsString mBrowserDumpID;
>      nsString mHangID;
> +    float mPluginCpuUsageOnHang;

nit: Surround this with #ifdef XP_WIN

::: dom/plugins/test/mochitest/hang_test.js
@@ +80,5 @@
> +    let extraData = parseKeyValuePairsFromFile(pluginExtraFile);
> +    ok("PluginCpuUsage" in extraData, "got extra field for plugin cpu usage");
> +    let cpuUsage = parseFloat(extraData["PluginCpuUsage"].replace(',', '.'));
> +    if (this.idleHang) {
> +      ok(cpuUsage == 0, "plugin cpu usage is 0%");

I think this may be susceptible to intermittent failures.  Could you push to try and then re-run the chrome mochitest that contains this test a bunch of times on each platform?  If it is we could try using <= some smaller value instead of explicitly 0.
Attachment #641065 - Flags: review?(netzen)
Ignore the static before return type comments, I forgot momentarily when writing that, that the functions were inside an anonymous namespace.
Sorry for the delay in getting back to this.
Addressed review comments and limited a few parts to Windows that i had overlooked.
Attachment #641065 - Attachment is obsolete: true
(In reply to Brian R. Bondy [:bbondy] from comment #7)
> @@ +236,5 @@
> > +  sampleTimes[0] = FileTimeToUTC(currentTime);
> > +  cpuTimes[0]    = FileTimeToUTC(kernelTime) + FileTimeToUTC(userTime);
> > +
> > +  // we already hung for a while, a little bit longer won't matter
> > +  ::Sleep(50);
> 
> Please check with bsmedberg whether a sleep on the main thread here for 50ms
> is OK. I'm not very familiar with the plugin code, so I can't confirm this
> myself.

I checked this with him and it's ok as the browser already hung for a while (default timeout is 45 seconds).

> ::: dom/plugins/test/mochitest/hang_test.js
> @@ +80,5 @@
> > +    let extraData = parseKeyValuePairsFromFile(pluginExtraFile);
> > +    ok("PluginCpuUsage" in extraData, "got extra field for plugin cpu usage");
> > +    let cpuUsage = parseFloat(extraData["PluginCpuUsage"].replace(',', '.'));
> > +    if (this.idleHang) {
> > +      ok(cpuUsage == 0, "plugin cpu usage is 0%");
> 
> I think this may be susceptible to intermittent failures.  Could you push to
> try and then re-run the chrome mochitest that contains this test a bunch of
> times on each platform?  If it is we could try using <= some smaller value
> instead of explicitly 0.

Will do try tests later.
Comment on attachment 645731 [details] [diff] [review]
Collect and report plugin CPU utilization + test

Review of attachment 645731 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, given that the try tests mentioned earlier passes.
Attachment #645731 - Flags: review+
The tests look fine, ran the chrome-tests 6 times on Win/WinXP debug/opt. No failures from this test, only 2 unrelated oranges.
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/d2805aefa5b7
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla17
Georg, here is a data dump of all the reports with PluginCpuUsage numbers. I think that we're probably going to need to try collecting the CPU for the other Flash processes on Windows as well in order to make this useful. Can you file and take the followup for that?
Also, I think we have locale-specific formatting sneaking in: some of the numbers are '00.00' and some are '00,00'
(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #16)
> Created attachment 649806 [details]
> Data dump, tab-delimited
> 
> Georg, here is a data dump of all the reports with PluginCpuUsage numbers. I
> think that we're probably going to need to try collecting the CPU for the
> other Flash processes on Windows as well in order to make this useful. Can
> you file and take the followup for that?

Seems like ~90% of these are cross process message hangs vs cpu spikes. I see a lot of zeros, and a lot of calls I know process events in the background.

Some of these we might be able to hook, like NtUserSetWindowPos, which seems pretty common.
Blocks: 781124
Depends on: 781133
Hm, looks like the occasional CPU spikes for stacks that are basically waiting on some synchronisation etc. might skew the picture.
Also, from the naming i would have expected "fastzero_I | _VEC_memzero" and "ycc_rgb_convert" to not be idle.
Otherwise the data looks sensible to me so far.

(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #16)
> I think that we're probably going to need to try collecting the CPU for the
> other Flash processes on Windows as well in order to make this useful. Can
> you file and take the followup for that?
Filed bug 781124.

(In reply to Benjamin Smedberg  [:bsmedberg] [away 27-July until 7-Aug] from comment #17)
> Also, I think we have locale-specific formatting sneaking in: some of the
> numbers are '00.00' and some are '00,00'

Oh, i fixed the locale issue in the test instead of where the .extra fields are written - filed bug 781133.
See Also: → 1315935
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: