Closed Bug 562216 Opened 14 years ago Closed 14 years ago

I would like the ability to show or hide all plugin crashes/hangs on the topcrasher report

Categories

(Socorro :: General, task)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: christian, Assigned: ryansnyder)

Attachments

(1 file)

I would like the ability to show or hide all plugin crashes / hangs on the topcrasher report. Generally, I don't care about plugin crashes/hangs if they don't bring down the browser. It would be nice to be able to filter them all in or out depending on the data set I want to look at.

Also, I would like the show/hide to apply to other views as well, but topcrasher is a good start.
Severity: normal → enhancement
(In reply to comment #0)
The one place this is supported with 1.6.2 is with Advanced Search
http://crash-stats.stage.mozilla.com/query
Austin, the Advanced Search only offers "All" or "Plugins Only", and both of these
options include plugin crashes.  What I think Christian is asking for,
and what I'm asking for in bug 558567, is a third option that only shows
crashes in Firefox proper (excluding all plugin crashes).
I'm doing this daily in reports produced at http://people.mozilla.com/~chofmann/crash-stats/20100913/top-4.0b6pre.html

in producing these reports I look for

   process_type matches "\N"
   hangid  matches "\N"

we should figure out a way to get this change, and the simple version of the "also found in releases" feature (bug 564018), into the main report as soon as possible.  It would really help to speed up analysis and make the top crash reports a lot more useful.
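
As a rough illustration of the browser-only filter described above, here is a minimal Python sketch. It assumes the daily export is tab-separated text in which a NULL is written as \N. The sample rows and the three-column layout are made up for the example; only the process_type and hangid field names come from this bug.

```python
import csv
import io

NULL = "\\N"  # assumed text representation of NULL in the export

# Hypothetical sample data; real daily files carry many more columns.
SAMPLE = (
    "signature\tprocess_type\thangid\n"
    "nsParser::cycleCollection::UnmarkPurple(nsISupports*)\t\\N\t\\N\n"
    "hang | KiFastSystemCallRet\tplugin\th1\n"
    "hang | KiFastSystemCallRet\t\\N\th1\n"
    "NPSWF32.dll@0x182cf0\tplugin\t\\N\n"
)


def browser_only(rows):
    """Keep reports that have neither a process_type nor a hangid."""
    return [r for r in rows if r["process_type"] == NULL and r["hangid"] == NULL]


rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
kept = browser_only(rows)
print([r["signature"] for r in kept])
```

Both conditions are needed: the third sample row has no process_type but does carry a hangid, so it is the browser half of a plugin hang and is dropped.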
Assignee: nobody → ryan
Target Milestone: --- → 1.8
(In reply to comment #1)
> (In reply to comment #0)
> The one place this is supported with 1.6.2 is with Advanced Search
> http://crash-stats.stage.mozilla.com/query

yeah, re-using the advanced query UI on the top crash lists seems fine.

 Report Process:  0 Any  X Browser  0 Plugins Only 

I talked with several people on this and the consensus is to show "browser only" as the default, with options to get at the other two forms of the report.
Target Milestone: 1.8 → 1.7.4
Status: NEW → ASSIGNED
Target Milestone: 1.7.4 → 1.7.5
Attached patch: Patch for 562216
This patch will allow the Socorro UI to make web service calls that return results for crashes that are plugin-related, browser-related or a combination of the 2.

Crashes that are plugin-related or browser-related may appear in either list, as identified in Bug 598160.

This patch will require an alter statement to the top_crashes_by_signature table, which will include the fields plugin_count and hang_count.  Hang_count will not be utilized, but will be available in case we want to filter based on hangs in the future.

Running a script to update the top_crashes_by_signature table will take an amazingly long time to complete.  So rather than run that script, I've chosen to keep some of the existing queries/logic in place on the Socorro UI side.  These can be removed in Socorro 1.8 (see bug 604740), once the cron/topCrashesBySignature.py has populated 4 weeks worth of hang_count and plugin_count data in the top_crashes_by_signature table.
Attachment #483603 - Flags: review?(lars)
Attachment #483603 - Flags: feedback?(robert)
Attachment #483603 - Flags: review?(laura)
Comment on attachment 483603
Patch for 562216

ok, the code in socorro/cron/topCrashesBySignature.py wasn't nearly as involved as I thought it would be.  Everything in there looks fine.

My only trouble is with the change to the topCrashBySignatureTrends service.  As you demonstrate by changing the URL to a new version number, you're really creating a new service with different parameters and different results.  We have the version numbers so different versions can live side by side and still be usable.  

However, because you have control over all the callers of the service and can change (have changed) them all, it likely doesn't matter that the old service goes away without first going through deprecation.  

go for it...
Attachment #483603 - Flags: review?(lars) → review+
Attachment #483603 - Flags: feedback?(robert) → feedback+
Attachment #483603 - Flags: review?(laura) → review+
Necessary database updates performed on stage.  1.7.5 release document updated on Google Code.  Committing files. 

==

Sending        socorro/cron/topCrashesBySignature.py
Sending        socorro/database/schema.py
Sending        socorro/services/topCrashBySignatureTrends.py
Sending        socorro/unittest/cron/testTopCrashesBySignature.py
Sending        webapp-php/application/config/topcrashbysig.php
Sending        webapp-php/application/config/topcrashers.php
Sending        webapp-php/application/controllers/topcrasher.php
Sending        webapp-php/application/models/topcrashers.php
Sending        webapp-php/application/views/common/hang_details.php
Sending        webapp-php/application/views/common/list_topcrashers.php
Sending        webapp-php/css/screen.css
Transmitting file data ...........
Committed revision 2614.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Christian, mind taking a look at this on staging (http://crash-stats.stage.mozilla.com) when you get a chance?  Thanks!
I don't see it on stage yet.  Should be at a place like

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/4.0b6

right?
Yeah, I don't see any of the top crashes by signature reports working. Does stage just not have that data? I tried these releases:

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11
http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.10
http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/4.0b6
Looks fixed now, thanks!
I should say it is fixed for http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.9. The other pages likely don't have enough data loaded.
It's not clear to me that the filtering is working right.

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.9/14/browser

has signatures like  hang | KiFastSystemCallRet
which are plugin hangs.
Hmm, does it really work correctly?
http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.9

Actual Results:
Clicking on All I get "The report covers 95.72% of all 327 crashes during this period." (Note NPSWF32.dll@0x182cf0 at rank 10, with 4 incidents)
Clicking on Browser I get "The report covers 95.69% of all 325 crashes during this period." (Note NPSWF32.dll@0x182cf0 still at rank 10)
Clicking on Plugin I get "No results were found.".

Expected Results:
Clicking on Browser should exclude crashes in
plugin-container.exe (e.g. the NPSWF32.dll crashes above)
Clicking on Plugin should show those crashes
@choffman - I'm seeing a number of reports reporting inconsistent process_type values, as reported in Bug 598160.  Essentially, there are signatures (like 'hang | KiFastSystemCallRet') in which I was expecting to see 100% of the crashes logged as a plugin crash, but found otherwise in the data.

Here's a quick example from our dev snapshot of production data:  Of the 25,002 crashes containing the signature 'hang | KiFastSystemCallRet', 24,923 of those crashes are considered plugin crashes, while the remaining 79 crashes were saved with a null value for process_type and are as such considered browser crashes.
for firefox only crashes do we have the logic in comment 3?

   process_type matches "\N"
   and
   hangid  matches "\N"

that's what we need to weed out all the plugin related problems.
and the logic for plugin reports should be

   process_type matches "plugin"
   and
   hangid  does not match "\N"

we get two reports for each "hang event."  one contains the stack of the OOPP, and the other report contains the stack of the firefox process.  we use the hang id to match these up and figure out what is happening on both sides of the interface when the hang occurs.

> Here's a quick example from our dev snapshot of production data:  
> Of the 25,002 crashes containing the signature 'hang | KiFastSystemCallRet',
> 24,923 of those crashes are considered plugin crashes, 

 actually these are plugin hangs with the stack of the out of process plugin

> while the remaining 79 crashes were  saved with a null value for 
> process_type and are as such considered browser crashes.

 these are actually the stack of the firefox process when a hang occurs

The signatures on both sides of the hang event may, or may not, be the same during a hang event.   The signatures are just a reflection of the code that happens to be running on either side of the interface.  

https://wiki.mozilla.org/CrashKill/Crashr#3.6.4_redo  has some stats on these hang pairs.  you should see similar ratios of near matching numbers of hang pairs in your snapshot of production data.  occasionally one side of the hang pair gets lost in transmission, and occasionally we get duplicate hang pair reports, but these things happen in very low volume.
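
The hang-pair matching described in this comment could be sketched roughly as follows. This is only an illustration, not Socorro code: the function name, the dict-based report shape, and the three bucket names are hypothetical, and it assumes process_type is either "plugin" or None.

```python
from collections import defaultdict


def pair_hang_reports(reports):
    """Group hang reports by hangid and classify each group.

    Each report is a dict with 'hangid' and 'process_type'
    ('plugin' or None). Reports without a hangid are not hangs
    and are skipped.
    """
    by_hang = defaultdict(list)
    for report in reports:
        if report["hangid"] is not None:
            by_hang[report["hangid"]].append(report)

    complete, incomplete, duplicated = [], [], []
    for hangid, group in by_hang.items():
        plugin_side = [r for r in group if r["process_type"] == "plugin"]
        browser_side = [r for r in group if r["process_type"] is None]
        if len(plugin_side) == 1 and len(browser_side) == 1:
            complete.append(hangid)      # one report from each side
        elif not plugin_side or not browser_side:
            incomplete.append(hangid)    # one side lost in transmission
        else:
            duplicated.append(hangid)    # extra copies of a side
    return complete, incomplete, duplicated


# Hypothetical sample: h1 is a clean pair, h2 lost its browser side,
# h3 has a duplicated browser report, and the last row is a plain crash.
reports = [
    {"hangid": "h1", "process_type": "plugin"},
    {"hangid": "h1", "process_type": None},
    {"hangid": "h2", "process_type": "plugin"},
    {"hangid": "h3", "process_type": "plugin"},
    {"hangid": "h3", "process_type": None},
    {"hangid": "h3", "process_type": None},
    {"hangid": None, "process_type": None},
]
complete, incomplete, duplicated = pair_hang_reports(reports)
print(complete, incomplete, duplicated)
```

Comparing the sizes of the three buckets is one way to check the claim that lost and duplicated halves are low volume in a given snapshot.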
for the top plugin list I think we can combine both plugin hangs and crashes in the same list.  So the reports in that list would be

hangs

   process_type matches "plugin"  or process_type matches "\N"
   and
   hangid  does not match "\N"

plus the reports for plugin crashes

   process_type matches "plugin"
   and
   hangid  matches "\N"
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
or if we need to optimize the performance for these checks for plugin related reports we might be able to collapse to just

if
   process_type matches "plugin"
else if
   hangid  does not match "\N"  

then it's plugin related

I guess we could run tests to see which checks run faster.  If there isn't much difference in performance we can go with whatever makes the code more clear. That might be the way the filtering is described in comment 18, but you can decide.
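
A quick way to convince yourself that the collapsed check selects the same reports as the longer hang-plus-crash rules is to enumerate every combination. The sketch below assumes process_type can only be "plugin" or null and models null as None; both function names are hypothetical.

```python
from itertools import product


def plugin_related_full(process_type, hangid):
    """Longer form: hang reports from either side, plus plugin crashes."""
    is_hang = process_type in ("plugin", None) and hangid is not None
    is_plugin_crash = process_type == "plugin" and hangid is None
    return is_hang or is_plugin_crash


def plugin_related_short(process_type, hangid):
    """Collapsed form: plugin process, or any report carrying a hangid."""
    return process_type == "plugin" or hangid is not None


# All four combinations of (process_type, hangid) under the assumption above.
cases = list(product(("plugin", None), ("hang-123", None)))
mismatches = [
    case for case in cases
    if plugin_related_full(*case) != plugin_related_short(*case)
]
print(mismatches)
```

If process_type ever takes a third value, the two forms can disagree for reports with that value and a hangid, so the assumption is worth checking against real data.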
Thanks Choffman.  The queries were written based on the notes in the Development Notes found in:
http://code.google.com/p/socorro/wiki/HangsAndOutOfProcessCrashes 

For browsers, I was pulling all crashes that matched process_type is null regardless of the hangid value.  I'll update the browser-specific query to pull all crashes that have process_type is null and hangid is null.

For plugins, I'll update the logic to pull all crashes that have either process_type = 'plugin' OR hangid is not null.

Let me know if either of those statements are incorrect.  I'll update the code and have that pushed to stage this afternoon.  

The issue that @mats identified is due to not having enough data in the stage environment.  I'll look into it this afternoon to see whether we can remedy it so that we have a better environment for testing.
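
For reference, the two revised conditions from this comment can be tried out against a throwaway in-memory SQLite table. This is only a sketch: the `reports` table name, its three columns, and the sample rows are assumptions made for the example, not the actual Socorro schema or queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (signature TEXT, process_type TEXT, hangid TEXT)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [
        ("nsParser::cycleCollection::UnmarkPurple(nsISupports*)", None, None),
        ("nsParser::cycleCollection::UnmarkPurple(nsISupports*)", None, None),
        ("hang | KiFastSystemCallRet", "plugin", "h1"),  # plugin side of a hang
        ("hang | KiFastSystemCallRet", None, "h1"),      # browser side of the same hang
        ("NPSWF32.dll@0x182cf0", "plugin", None),        # plain plugin crash
    ],
)

# Browser list: no process_type AND no hangid.
BROWSER_SQL = """
    SELECT signature, COUNT(*) FROM reports
    WHERE process_type IS NULL AND hangid IS NULL
    GROUP BY signature ORDER BY COUNT(*) DESC, signature
"""

# Plugin list: plugin process OR any report that carries a hangid.
PLUGIN_SQL = """
    SELECT signature, COUNT(*) FROM reports
    WHERE process_type = 'plugin' OR hangid IS NOT NULL
    GROUP BY signature ORDER BY COUNT(*) DESC, signature
"""

browser = conn.execute(BROWSER_SQL).fetchall()
plugin = conn.execute(PLUGIN_SQL).fetchall()
print(browser)
print(plugin)
```

With the earlier browser condition (process_type is null, regardless of hangid), the fourth row would have been counted as a browser crash, which matches the 'hang | KiFastSystemCallRet' entries seen in the browser view on stage.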
> Let me know if either of those statements are incorrect.  I'll update the code
> and have that pushed to stage this afternoon.  

 I think these are correct and we are on a good path now.

http://code.google.com/p/socorro/wiki/HangsAndOutOfProcessCrashes is generally correct, but that page could be enhanced a bit to provide some context about which of the reports should be lumped together for areas of interest when analyzing problems.  

There are two basic impacts on the user.

There are times when the browser dies and the user needs to restart and that is one area of interest.  

There are times when the plugin dies and the user just needs to reload the page or the plugin content area; that is another, lower-priority area of interest.

  the plugin might die when it crashes, or the plugin might die when it hangs.
  when it hangs we get a report from both the browser and the plugin to help sort it out.

The development testing notes table at the bottom of http://code.google.com/p/socorro/wiki/HangsAndOutOfProcessCrashes could be color coded to reflect these areas of user impact.


"hangid=null proc=Bro" in the center of the table could be marked red to indicate the "browser dies" impact


            | ANY Report Type   | CRASH                | OOPP HANG
------------+-------------------+----------------------+---------------
ANY PROCESS |                   |                      |                    
------------+-------------------+----------------------+---------------
    BROWSER |                   |hangid=null proc=Bro: | 
------------+-------------------+----------------------+---------------
    PLUG-IN |                   |                      | 


bottom row and "hangid=123 Proc=Bro"  could be marked yellow to show plugin reload as the user impact

            | ANY Report Type   | CRASH                | OOPP HANG
------------+-------------------+----------------------+---------------
ANY PROCESS |                   |                      |                    
------------+-------------------+----------------------+---------------
    BROWSER |                   |                      | hangid=123 Proc=Bro
------------+-------------------+----------------------+---------------
    PLUG-IN |                   | hangid=null proc=Plu | hangid=123 Proc=Plu
we could also probably change that table to remove "all process" and "any report" rows and columns.

that's just the union of the 1 red and 3 yellow boxes in the example above.
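
The colour coding proposed for the table boils down to a two-way split on process type and hangid. A minimal sketch, assuming null is modelled as None and using a hypothetical function name, with the colour strings standing in for the two user impacts:

```python
from collections import Counter
from itertools import product


def user_impact(process_type, hangid):
    """Return the table colour for one report.

    'red'    -> the browser dies and the user has to restart
    'yellow' -> the plugin dies and the user reloads the page or plugin area
    """
    if process_type is None and hangid is None:
        return "red"
    return "yellow"


# The four inner cells of the table: BROWSER/PLUG-IN x CRASH/OOPP HANG.
cells = {
    (proc or "browser", "hang" if hangid else "crash"): user_impact(proc, hangid)
    for proc, hangid in product((None, "plugin"), (None, "123"))
}
tally = Counter(cells.values())
print(cells)
print(tally)
```

The tally comes out to one red cell and three yellow cells, which is the union mentioned in the comment above.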
Committing an update to the topcrasher code, as well as a script to help prepopulate the data in the top_crashes_by_signature table.  We can test this again on stage after the script has finished running.

==

Adding         scripts/updateTopCrashesBySignature.py
Sending        socorro/services/topCrashBySignatureTrends.py
Sending        webapp-php/application/models/topcrashers.php
Transmitting file data ...
Committed revision 2628.
Because the top crasher cron job has been running every hour over the past week, it has updated the top crasher data on stage.  I'm starting to see data that is closer to what I've expected to see:

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11/3/browser
http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11/3/plugin
yeah, it's looking good, and the data we are testing against looks pretty up to date.  here is the verification I just did.

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11/3/browser

sort by "Ver" ascending and look for the first "1" in the Ver column

that's the signature  for nsParser::cycleCollection::UnmarkPurple(nsISupports*)

and links to
http://crash-stats.stage.mozilla.com/report/list?range_value=2&range_unit=weeks&date=2010-10-22%2011%3A00%3A00&signature=nsParser%3A%3AcycleCollection%3A%3AUnmarkPurple%28nsISupports*%29&version=Firefox%3A3.6.11

we were actually looking at this yesterday in https://bugzilla.mozilla.org/show_bug.cgi?id=606316 and confirmed it's a new regression only showing up in 3.6.11


SHIP IT!  ;-)
and 

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11/3/browser

looks pretty close to

http://people.mozilla.com/~chofmann/crash-stats/20101021/compare-rank-3.6.11-3.6.10.txt

which generates the same kind of filtered browser list out of the daily .csv files.  ranking will be shifted around a bit due to different samples, but should generally match up.

That part looks good too.
I'd probably expect to see a larger number of browser side hang signatures in this view, but I haven't looked at that data in a while, and it also might be a problem with the sample of data on stage.

http://crash-stats.stage.mozilla.com/topcrasher/byversion/Firefox/3.6.11/3/plugin
Great, thanks Chofmann.  Yes, having a limited data set on stage has been problematic in verifying issues.  When clicking All (instead of Browser or Plugin) it shows all of the data that you would find in both of the other 2 views, so I haven't been too concerned about it.

I'm committing a minor update to a script that should help prepopulate some of the data.  This data will fill itself in as the cron job runs every hour, but this should help things look close to what we expect shortly after we release.

==

Sending        updateTopCrashesBySignature.py
Transmitting file data .
Committed revision 2645.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Verified per comment 25, comment 26; thanks, Chofmann!
Status: RESOLVED → VERIFIED
As noted in dev.planning, I am radically opposed to the current *default* view for topcrashes which does not show plugin crashes. I believe the option of showing only the browser crashes is good, but the default should remain to show all crashes.
Component: Socorro → General
Product: Webtools → Socorro