Closed Bug 637661 Opened 9 years ago Closed 8 years ago

new report(s) that roll up combined plugin hang and crash pairs

Categories

(Socorro :: General, task, P1)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chofmann, Assigned: rhelmer)

References

Details

Attachments

(2 files, 2 obsolete files)

can't remember if there is a bug on this, but we need some report that rolls up lists of combined signatures for both sides of plugin pairs.

for example in the analysis of top hang [Bug 637533] "hang | ntdll.dll@0xe514"  we want an easy way to see the reports and signatures on the other side of the plugin hang pairs like https://bug637533.bugzilla.mozilla.org/attachment.cgi?id=515838 and some stats like https://bugzilla.mozilla.org/show_bug.cgi?id=637533#c5

one basic report would show the list of the reports side by side

hang|ntdll.dll@0xe514 [browser metadata] hang | F457974632_________ [meta data]
hang|ntdll.dll@0xe514 [browser metadata] hang | hang|F_2090751469___ [meta data]
...
...

the report could be sorted to allow easy grouping of the plugin side reports so we could start to determine like sets of reports.

meta data would include flash version info, os version, uptime, and similar info to the lists we produce for signature reporting now.

I think bsmedberg had some similar reports going when he had snapshots of plugin/browser pair data last year and may have some examples of useful reports too.
I did have this based on importing data into a couchdb. It's not something that's easy to do with map/reduce because it involves correlating data from different records.
the way I'm gathering this info now in my hack is to 

  take a signature like hang|ntdll.dll@0xe514, 
  then get a list of the corresponding plugin side of the reports for that sig,
  then grep the report_id back out of the .csv files to grab the info I
   want for both the browser and plugin side reports.

something like


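set file=20110227*

    # collect field 23 for the first 300 matching 4.0b12 rows of this signature
    set list=`grep "hang | ntdll.dll@0xe514" $file | awk -F'\t' '$8 ~ /4.0b12/ {print $23}' | head -300`

    # for each id, print every row that mentions it (browser and plugin side)
    foreach report ($list)
      grep $report $file | awk -F'\t' '{printf "%s\t%s\t%s\t%s\t%s\n",$23,$22,$8,$1,$2}'
      echo ""
    end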
this one also helps to surface dups that show up on both browser and plugin side.
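The manual steps above amount to grouping rows by a shared hang id and keeping the pairs whose browser side has the signature of interest. A minimal Python sketch, assuming each row has already been parsed into a dict; the keys `hangid`, `signature`, and `process_type` are hypothetical names, not the actual .csv column layout:

```python
from collections import defaultdict


def pair_hang_reports(rows, browser_signature):
    """Return (browser_row, plugin_row) pairs for one browser-side signature.

    Assumes (hypothetically) that both halves of a hang share a `hangid`
    and that plugin-side rows carry process_type == "plugin".
    """
    by_hang = defaultdict(list)
    for row in rows:
        by_hang[row["hangid"]].append(row)

    pairs = []
    for group in by_hang.values():
        browsers = [r for r in group if r["process_type"] != "plugin"]
        plugins = [r for r in group if r["process_type"] == "plugin"]
        for b in browsers:
            if b["signature"] != browser_signature:
                continue
            for p in plugins:
                pairs.append((b, p))
    return pairs


# Made-up sample rows, for illustration only.
rows = [
    {"hangid": "h1", "signature": "hang | ntdll.dll@0xe514", "process_type": "browser"},
    {"hangid": "h1", "signature": "hang | F457974632_________", "process_type": "plugin"},
    {"hangid": "h2", "signature": "hang | other", "process_type": "browser"},
    {"hangid": "h2", "signature": "hang | F_2090751469___", "process_type": "plugin"},
]
pairs = pair_hang_reports(rows, "hang | ntdll.dll@0xe514")
```

Sorting `pairs` by the plugin-side signature would give the grouping described in the first comment.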
Assignee: nobody → chris.lonnen
Target Milestone: --- → 1.7.8
Target Milestone: 1.7.8 → 2.0
Assignee: chris.lonnen → rhelmer
Priority: -- → P1
Haven't started on this for 2.0; will try to get it in but bumping to 2.1 since freeze for 2.0 is tomorrow.
Target Milestone: 2.0 → 2.1
I think the equivalent SQL for this would be (given a particular signature and version):

SELECT uuid, flash_version, version, signature, url 
FROM reports 
WHERE hangid IN 
  (SELECT hangid 
   FROM reports 
   WHERE signature = 'hang | ntdll.dll@0xe514' 
   AND version = '6.0a2' LIMIT 300);

Attached is the result from a dev database with a smallish set of production crashes loaded.

Does this look reasonable?
Any guidance for how the report page should look? I could easily expose this directly as a table if that's desirable.
yeah, looks like it's on the right track.  since this has URLs we will have to keep it behind login auth.

one other formatting idea for the report that gets rid of the redundancy could be to show all the info for a pair on a single line with a count of dups at the end

 browser_sig | plugin_sig | hangid | flash_version | url | dups?

this would also allow us to easily get counts of the frequency of particular browser/plugin pair combinations.
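Counting how often each browser/plugin combination occurs could be sketched like this in Python. The input shape (one `(browser_sig, plugin_sig)` tuple per hang pair) and the sample data are assumptions for illustration; the other columns (hangid, flash_version, url) are left out to keep the sketch short:

```python
from collections import Counter

# Hypothetical input: one (browser_sig, plugin_sig) tuple per hang pair.
hang_pairs = [
    ("hang | ntdll.dll@0xe514", "hang | F457974632_________"),
    ("hang | ntdll.dll@0xe514", "hang | F457974632_________"),
    ("hang | ntdll.dll@0xe514", "hang | F_2090751469___"),
]

pair_counts = Counter(hang_pairs)

# Most frequent combinations first, one line per distinct pair,
# with the count of duplicates as the last field.
lines = [
    f"{browser_sig} | {plugin_sig} | {count}"
    for (browser_sig, plugin_sig), count in pair_counts.most_common()
]
```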
I don't think we should hide the complete report behind login auth. If possible, we should show it without the url field to unauthenticated people. Can this be done?
yeah, that sounds right. have two versions of the report.  one that we can use to correlate hang signatures with, and another that also includes the urls.
(In reply to comment #6)
> yeah, looks like on the right track.  since this has url's we will have to
> keep it behind login auth.
> 
> one other formating idea for the report that gets rid of the redundancy
> could be to show all the info for a pair on a single line with a count of
> dups at the end
> 
>  browser_sig | plugin_sig | hangid | flash_version | url | dups?
> 
> this would also allow us to easily get counts of the frequency of particular
> browser/plugin pair combinations.


Thanks, this is helpful. I'll start putting a report page together.


(In reply to comment #7)
> I don't think we should hide the complete report behind login auth, if
> possible, we should show it without the url field to unauthenticated people,
> can this be done?

Yes we could hide just that column if the user is unauthenticated... that helps solve the question of how to get to this report, we could put it in the "Report" pulldown and link to it as we would any other report.
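Hiding only the URL column for unauthenticated users might look roughly like the following Python sketch. The column names follow the layout proposed earlier in this bug; `visible_row` and the `can_view_urls` flag are hypothetical and not part of Socorro:

```python
REPORT_COLUMNS = ["browser_sig", "plugin_sig", "hangid", "flash_version", "url"]


def visible_row(row, can_view_urls):
    """Return a copy of `row` with `url` dropped unless the viewer may see URLs."""
    return {
        col: row[col]
        for col in REPORT_COLUMNS
        if col != "url" or can_view_urls
    }


# Made-up sample row, for illustration only.
sample_row = {
    "browser_sig": "hang | ntdll.dll@0xe514",
    "plugin_sig": "hang | F457974632_________",
    "hangid": "h1",
    "flash_version": "10.2.152.26",
    "url": "http://example.com/",
}
public_view = visible_row(sample_row, can_view_urls=False)
private_view = visible_row(sample_row, can_view_urls=True)
```

Filtering in one place like this keeps a single report page for both audiences, which matches the idea of linking it from the "Report" pulldown like any other report.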
actually, it might be better to reverse the first two fields so it looks like this.

 plugin_sig | browser_sig | hangid | flash_version | url | dups?

since it's more likely we want to investigate and associate the plugin signature with the underlying problem.  often the browser sig is just along for the ride and not offering much useful data as in bug 612309, bug 564298, and several others.
(In reply to comment #9)
> Yes we could hide just that column if the user is unauthenticated... that
> helps solve the question of how to get to this report, we could put it in
> the "Report" pulldown and link to it as we would any other report.

Sounds good to me! :)
Target Milestone: 2.1 → 2.2
Target Milestone: 2.2 → 2.3
Depends on: 677790
We think the best way to do this is to process all hang pairs.  Filed bug 677790 for statistical and HBase implications of that move.

We should also consider:
- impact on PostgreSQL of increased throughput
- impact on processors
- impact on collectors.
I'd like to revisit my assertion in comment #12.  If we don't mind only looking at the other halves that are processed we should still be able to build the correlations, so I think we can go ahead and build this report with only the throttled crashes.

rhelmer, do you want to pick this up again since you were the last working on it?
Status: NEW → ASSIGNED
Laura,

hang crashes currently account for around 20% of crash reports overall.  This means that not throttling any of them would result in a 300% increase in the number of total crashes stored.  From experience, this would require some adjustments in the monitor to process this volume, and would require increased storage for PostgreSQL or a shorter expiration date for saved crashes.
Attached patch backend/middleware impl (obsolete) — Splinter Review
Proposed stored procedure to generate a "hang_report" matview, and the middleware implementation to expose it.

Most interested in thoughts on the middleware - I think it'll need to take at least one argument - the report_day. I think that it would be appropriate to implement pagination as well.

I know that Adrian has been doing a lot of work on refactoring reports, I wonder if there's a better way I should be doing it. Even if not, I think it'd be good for people to be familiar with this method of making reports because it seems like it's a decent template for reports as we move them out of PHP+SQL and into middleware.

I ran the SQL in the stored procedure by jberkus, but comments welcome on that too of course. I have my local VM loaded with prod data now so I can see that it works.

The missing pieces here are:

* cron job (this will likely run once per day)
* Kohana/PHP code to expose (this part will unfortunately probably be the most code, but it'll be mostly boilerplate model/view/controller code)

For the PHP side, I am thinking of doing a simple transform of the JSON from the new mware service into a sortable HTML table. It'll be much like search results, except it'll show the URL field only if the user is authenticated and has the same perms as for displaying URLs on the individual reports page.
Attachment #560298 - Flags: feedback?(lars)
Attachment #560298 - Flags: feedback?(chris.lonnen)
Comment on attachment 560298 [details] [diff] [review]
backend/middleware impl

Meant to ask adrian too
Attachment #560298 - Flags: feedback?(adrian)
The details in comment 15 are important of course, but I am going to first generate a quick sample report to get it in front of chofmann/kairo and make sure it looks good.
(In reply to Robert Helmer [:rhelmer] from comment #17)
> The details in comment 15 are important of course, but I am going to first
> generate a quick sample report to get it in front of chofmann/kairo and make
> sure it looks good.

Email with sample HTML report and raw JSON data sent.
Attached patch backend/middleware impl (obsolete) — Splinter Review
Attachment #560298 - Attachment is obsolete: true
Attachment #560298 - Flags: feedback?(lars)
Attachment #560298 - Flags: feedback?(chris.lonnen)
Attachment #560298 - Flags: feedback?(adrian)
Attachment #560316 - Flags: feedback?(lars)
Attachment #560316 - Flags: feedback?(chris.lonnen)
Attachment #560316 - Flags: feedback?(adrian)
(In reply to Robert Helmer [:rhelmer] from comment #18)
> (In reply to Robert Helmer [:rhelmer] from comment #17)
> > The details in comment 15 are important of course, but I am going to first
> > generate a quick sample report to get it in front of chofmann/kairo and make
> > sure it looks good.
> 
> Email with sample HTML report and raw JSON data sent.

thunderbird seems to spin out of control and hang when I try to open that message from you.  maybe try sending without the raw JSON, or maybe put it on a server somewhere.
kairo took a look over this, looks like for some rows we're getting plugin signatures in the browser column and browser signatures in the plugin column.

It looks like I am just not eliminating duplicates correctly doing the self-join on reports table, should be simple to fix. Will send out a new sample once this is done.
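To illustrate how a self-join can put signatures in the wrong column, here is a small Python sketch using an in-memory SQLite database. The schema is an assumption (in particular the `process_type` column used to tell the two sides apart), and this is not the actual query from the patch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports (uuid TEXT, hangid TEXT, signature TEXT, process_type TEXT)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [
        ("u1", "h1", "hang | ntdll.dll@0xe514", None),  # browser side
        ("u2", "h1", "hang | F457974632_________", "plugin"),  # plugin side
    ],
)

# Naive self-join: every pair shows up twice, once in each column order.
naive = conn.execute(
    "SELECT a.signature, b.signature FROM reports a "
    "JOIN reports b ON a.hangid = b.hangid AND a.uuid <> b.uuid"
).fetchall()

# Constrained self-join: pin which side each alias is allowed to be.
fixed = conn.execute(
    "SELECT br.signature, pl.signature FROM reports br "
    "JOIN reports pl ON br.hangid = pl.hangid "
    "WHERE br.process_type IS NULL AND pl.process_type = 'plugin'"
).fetchall()
```

With the naive join the same hang appears once as (browser, plugin) and once as (plugin, browser); constraining each alias to one side leaves a single row per pair.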
(In reply to chris hofmann from comment #20)
> (In reply to Robert Helmer [:rhelmer] from comment #18)
> > (In reply to Robert Helmer [:rhelmer] from comment #17)
> > > The details in comment 15 are important of course, but I am going to first
> > > generate a quick sample report to get it in front of chofmann/kairo and make
> > > sure it looks good.
> > 
> > Email with sample HTML report and raw JSON data sent.
> 
> thunderbird seems to spin out of control and hang when I try to open that
> message from you.  maybe try sending without the raw JSON, or maybe put it
> on a server somewhere.

Sent you a smaller sample ("page 1"). Note the issues from comment 21 though, there's a new sample coming soon.
lonnen, brandon, r? https://github.com/mozilla/socorro/pull/56

Please see comments in the pull request for caveats, etc. Basically I'd like to land this now (since it should be at least as good as the current report example in comment #2) and continue improving this over the next few weeks (adding pagination, etc.)
I'll r+ what's in the pull request now. Everything works for me in my VM. However, I am not merging it, per our irc discussion about adding pagination, etc.
(In reply to Chris Lonnen :lonnen from comment #24)
> I'll r+ whats in the pull request now. Everything works for me in my VM.
> However, I am not merging it, per our irc discussion about adding
> pagination, etc.

Thought about it a bit more this evening, I'd like to go ahead and merge and give QA a head start, while I work on adding pagination and the 1/3/7/14 day selector.
Add pagination and 3/7/14 day selector:
lonnen/brandon, r? https://github.com/mozilla/socorro/pull/58
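A rough Python sketch of how a duration selector and pagination can be turned into query bounds. The function name, the default page size, and the return shape are hypothetical; only the 1/3/7/14 day choices come from this bug:

```python
from datetime import date, timedelta

ALLOWED_DURATIONS = (1, 3, 7, 14)  # days, as discussed in this bug


def report_window(end_day, duration, page, page_size=100):
    """Return (start_day, end_day, offset, limit) for one page of the report."""
    if duration not in ALLOWED_DURATIONS:
        raise ValueError(f"unsupported duration: {duration}")
    if page < 1:
        raise ValueError("page numbers start at 1")
    start_day = end_day - timedelta(days=duration)
    offset = (page - 1) * page_size  # rows to skip before this page
    return start_day, end_day, offset, page_size


window = report_window(date(2011, 9, 22), duration=14, page=2)
```

The `offset` and `limit` values would map onto `OFFSET` and `LIMIT` in the SQL behind the report.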
(In reply to Robert Helmer [:rhelmer] from comment #27)
> Add pagination and 3/7/14 day selector:
> lonnen/brandon, r? https://github.com/mozilla/socorro/pull/58

r+
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/34788ed25426a2a9d461613772a9e7f16447c5c0
bug 637661 - add pagination and duration selector, disable duplicates until we have a better way to show them

https://github.com/mozilla/socorro/commit/afc45b9d3eb67f0958985f90eb5ad2c7db8aaf33
Merge pull request #58 from rhelmer/bug637661-add-pagination-and-duration

Bug637661 add pagination and duration
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
OK here it is on crash-stats-dev (which is our server that automatically tracks the latest build):

https://crash-stats-dev.allizom.org/hangreport/byversion/Firefox/6.0.2?duration=14

Some known deficiencies that I'd like to tackle in followup bugs over the next few weeks (should not be long since we release weekly now):

* show duplicate column
* show URL column when logged in
* UI selection always jumps back to beginning (this is consistent with other reports like topcrashers, I find it really unhelpful though)

Also there's only one day loaded on crash-stats-dev right now, we'll get a more realistic dataset up once this hits staging (hopefully later today or tomorrow).
thx rhelmer. QA verified, the report looks correct. Flash versions are now properly being reported.

https://crash-stats.allizom.org/hangreport/byversion/Firefox/6.0.2
Status: RESOLVED → VERIFIED
Commit pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/1947ec405eea824c874bd6a72fde4ff5f9d87bc1
Merge pull request #64 from rhelmer/bug637661-hang-pair-report-cleanup

fix problems found during QA on stage
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/67e12b5921ffbcb5191da80220b6a045c1f01103
bug 637661 - unlink hang report until query is ready

https://github.com/mozilla/socorro/commit/e32ac1c8554bb0e86d173fd1f6f6cdfce6986b1e
Merge pull request #68 from rhelmer/bug637661-unlink-hang-report

bug 637661 - unlink hang report until query is ready
Josh noticed a problem with the query (not separating "final beta" from "release" correctly), which is causing duplicates. 

Unlinking this from the UI until the query is ready (which will be on or before next Monday).
Status: VERIFIED → REOPENED
Resolution: FIXED → ---
Attachment #560316 - Attachment is obsolete: true
Attachment #560316 - Flags: feedback?(lars)
Attachment #560316 - Flags: feedback?(chris.lonnen)
Attachment #560316 - Flags: feedback?(adrian)
Target Milestone: 2.3 → 2.3.1
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/58beedcf3f026ee5e816bfcde8591285b8a9a8ec
bug 637661 - link to correct date range for signature

https://github.com/mozilla/socorro/commit/052c3fa5053e79545297ca77cf6d51a7a1d5fc60
Merge pull request #80 from rhelmer/bug637661-hang-pair-report-bugs

bug 637661 - link to correct date range for signature
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/87d5a8ea70296475c7d392ee24d48b6b7b3bc227
bug 637661 - hang report is ready, add link back

https://github.com/mozilla/socorro/commit/6176d7a8f4eccb6df436e3a8660ad7fde29dfbc0
Merge pull request #81 from rhelmer/bug637661-hang-pair-report-bugs

bug 637661 - hang report is ready, add link back
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
We're having problems on our dev and staging DBs, so we have to bump this again.

Working on a workaround now (moving the DBs from the SAN which is suspected of causing the problem to local disk) but this won't leave time for QA.
Target Milestone: 2.3.1 → 2.3.2
This remains linked in trunk.
Commits pushed to https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/b26dbf0d09a41071446a3882c3cbbb39093954f1
bug 637661 - days not day param for timedelta, and use list for single param

https://github.com/mozilla/socorro/commit/92f8e1e7888c1cfae98bcc93319307cadb73dcbb
Merge pull request #124 from rhelmer/bug637661-hang-pair-report-cron

bug 637661 - days not day param for timedelta, and use list for single pa
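The diff for that commit is not shown here, but the two pitfalls its title names are easy to reproduce in Python. This is a sketch of the general behaviour, not the Socorro code; reading "use list for single param" as being about query parameters is an assumption:

```python
from datetime import timedelta

# The keyword is `days`; `day` is not accepted and raises TypeError.
try:
    timedelta(day=1)
    day_keyword_ok = True
except TypeError:
    day_keyword_ok = False

one_day = timedelta(days=1)

# Parentheses alone do not make a sequence: ("x") is just the string "x",
# so a single query parameter is safer passed as a list (or a 1-tuple).
single_param_wrong = ("2011-09-20")
single_param_list = ["2011-09-20"]
```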
Component: Socorro → General
Product: Webtools → Socorro
Blocks: 812511