Closed Bug 629062 Opened 13 years ago Closed 11 years ago

Detect explosive crashes

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: laura, Assigned: shuhao)

References

Details

(Whiteboard: [Q42011wanted] [Q12012wanted][2.5.1])

Attachments

(3 files, 2 obsolete files)

From
https://wiki.mozilla.org/Socorro:PRD_2.x#New_.2F_explosive_.2F_critical_crash_tracking

We need a detection process, and a set of criteria that will need tuning.  

My suggested initial criteria are:
- New crash in the top 100 for a product, with no associated bug
- Crash that has increased more than (insert frequency formula here) in a time period.
I hope chofmann will chime in with feedback on this.

Once we have a process, we can turn this into a calculation that runs hourly/daily.  Depending on the final complexity, this may run against PostgreSQL or HBase.

What I'd like is to get a prototype of this feature up and running.  We can tune the calculation later.
Blocks: 629075
Assignee: nobody → chofmann
the use cases for where this would have been valuable in the past, and where it might save us in the future can be found in bugs marked with [explosive] in the status whiteboard.

here is a query to find those.

https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=explos

these bugs are the best way to get test cases and tuning parameters: figure out the earliest points at which we could have predicted that these crashes were getting out of control, then use those values for future prediction.

I'll try and dig some of those numbers out in the next few days.
jotting down more notes for further refinement later...

"explosive across all releases" will catch some situations where website content or external factors like new plugin releases will blow up on us.

"explosive for a particular release" will be more for catching regressions that we introduce into our own code.

so one of the factors for the "explosion detector" will be for which sets of releases, or all releases, where we will have detection running.
The detector should be capable of spotting the rise of entirely new crashes, but also tell us about significant volume increases on existing signatures.

bug https://bugzilla.mozilla.org/show_bug.cgi?id=528798 represents an interesting use case.

Here we had an existing signature "\N" (the null signature) for which we saw a volume increase.

Detection of volume increases is the first step, but then we also get bonus points if the detector could also tell us there is an abnormal increase in frequency of keywords like "Zone Alarm" in the crash report comments.
All of the data required to detect these should already be present in the PostgreSQL database.  It's just a question of adding interfaces/APIs to view that data.
Assignee: chofmann → laura
Scope for this bug is:
- define "explosive"
- develop query
- wrap it in an API call
Since we still don't have a good definition of this, this bug is going to slip.
Target Milestone: 1.7.7 → 1.7.8
I'm currently working on an algorithm for that and have some experimental reports going to find out how well it can work, but I still need to verify with the rest of the CrashKill team that what I have there is what we really want from this. Current notes and proposals are at https://wiki.mozilla.org/CrashKill/Plan/Explosive but this is not finalized yet.
Assignee: laura → chris.lonnen
Other priorities are higher for now.
Target Milestone: 1.7.8 → 2.0
KaiRo has made finishing this algorithm a Q2 goal and stated that implementing it should be a Q3 goal, anyhow.
Whiteboard: Q3
Target Milestone: 2.0 → 2.1
Target Milestone: 2.1 → 2.2
Target Milestone: 2.2 → 2.3
KaiRo do you want to take this bug until you've settled on an appropriate algorithm?
I have the algorithm: as we decided, we'd like to have what I'm running as prototypes, which works well for detecting rising issues. I also now know that we have the numbers the algorithm works on, since they should match what we have for TCBS (esp. the new one with the daily numbers).

I just need to spec it out here in a way that it is reasonably understandable.

Still, the code I'm running for my prototypes is in http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/file/tip/get-explosives.php and the meat of the algorithm is at http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/file/tip/get-explosives.php#l469 - but it's probably more helpful if I do some writeup in words as well.
Assignee: chris.lonnen → kairo
Target Milestone: 2.3 → 2.4
Whiteboard: Q3 → Q42011wanted
I'll try on a spec here, duplicating what I have done in my own reports.

The final reports - with a somewhat crude UI - look like this:
https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.7.explosiveness.html
https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.nightly.components.html

The algorithms are as linked in comment #11 - http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/file/tip/get-explosives.php - and also in https://wiki.mozilla.org/CrashKill/Plan/Explosive - but here's another description:

2 values for "explosiveness" (a "1-day" and a "3-day" value) are calculated for every signature and date (and product/version). An unresolved point is whether we should match other aggregations by doing this per-platform or sum across platforms as my prototype reports do.

Both "explosiveness" values are analyzing statistically how strongly recent (the analyzed day for "1-day", the average of that and the two days leading up to it for "3-day") values rise over longer-term values. In short, "1-day" is a factor of how much the recent value is outside the previous maximum, "3-day" is a factor of how much the average of the recent 3 days is outside the standard deviation of the days before.

The base values taken for the analysis are the aggregated daily numbers of crashes per (1M) ADU for the respective signature for the 10 days up to the analyzed date (or fewer if 10 are not available, though there should be a minimum of 4-6). These are values we should already have in the (new) tcbs tables (and the ADU ones).
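
As a rough illustration of the base values described above, here is a minimal Python sketch of normalizing a daily crash count to crashes per 1M ADU. The function name and the handling of missing ADU data are assumptions made for illustration; they are not taken from the prototype script or from the tcbs tables.

```python
def crashes_per_million_adu(crashes, adu):
    """Normalize a daily crash count to crashes per 1M ADU.

    Returns None when no ADU figure is available (an assumed convention),
    so that the day can be skipped instead of dividing by zero.
    """
    if not adu:
        return None
    return crashes * 1_000_000 / adu


# Example: 50 crashes on a day with 2 million ADU.
rate = crashes_per_million_adu(50, 2_000_000)
```

Working on rates rather than raw counts keeps a rise in users from looking like a rise in crashes.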

The "1-day" value is calculated by taking the difference between the maximum and the average of the (9) days leading up to the analyzed day, clamping that difference to a (configurable) minimum value, and then dividing the difference between the analyzed day's value and that previous average by the clamped difference.

The "3-day" value is calculated by taking the standard deviation of the (7) days before the 3 most recent days, clamping it to a (configurable) minimum value, and then dividing the difference between the average of the 3 most recent days' values and the average underlying the deviation by the clamped deviation.

Signatures are highlighted in the UI if one of those "explosiveness" values is over a configurable limit.

So, in the end, the configurable variables in the algorithm are the minima to clamp the differences in the divisor to (both preventing division by zero and cleaning out variations at low crash volume) - and the limits to highlight things in the UI. See "*** explosiveness tuning ***" in my script for the values I've used in prototyping.
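
The two calculations described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a port of get-explosives.php: the clamp minima and the highlight limit are placeholder values, the function names are made up, 4 days is picked as the minimum history from the "4-6" range mentioned above, and a population standard deviation is assumed because the description does not say which kind is used.

```python
from statistics import mean, pstdev

# Placeholder tuning values (assumptions). The values used in prototyping
# are under "*** explosiveness tuning ***" in get-explosives.php.
MIN_CLAMP_1DAY = 1.0   # minimum for (previous maximum - previous average)
MIN_CLAMP_3DAY = 1.0   # minimum for the standard deviation
HIGHLIGHT_LIMIT = 2.0  # highlight a signature when a value is over this
MAX_DAYS = 10          # days of history, including the analyzed day
MIN_DAYS = 4           # with less history than this, compute nothing


def _window(rates):
    """Return the last MAX_DAYS rates, or None if there is too little data."""
    window = list(rates)[-MAX_DAYS:]
    return window if len(window) >= MIN_DAYS else None


def oneday_explosiveness(rates, min_clamp=MIN_CLAMP_1DAY):
    """rates: crashes per 1M ADU per day, oldest first, analyzed day last."""
    window = _window(rates)
    if window is None:
        return None
    *previous, today = window
    prev_avg = mean(previous)
    # How far the previous maximum sits above the previous average, clamped
    # so that low-volume noise and division by zero are avoided.
    spread = max(max(previous) - prev_avg, min_clamp)
    return (today - prev_avg) / spread


def threeday_explosiveness(rates, min_clamp=MIN_CLAMP_3DAY):
    """Same input as above; compares the last 3 days with the days before."""
    window = _window(rates)
    if window is None:
        return None
    older, recent = window[:-3], window[-3:]
    older_avg = mean(older)
    # Population standard deviation of the older days (assumed), clamped.
    deviation = max(pstdev(older), min_clamp)
    return (mean(recent) - older_avg) / deviation


def is_explosive(rates, limit=HIGHLIGHT_LIMIT):
    """True if either explosiveness value is over the highlight limit."""
    values = (oneday_explosiveness(rates), threeday_explosiveness(rates))
    return any(v is not None and v > limit for v in values)
```

With a flat history of 10 crashes per 1M ADU and a jump to 30 on the analyzed day, the clamp takes over (the previous maximum equals the previous average), so the "1-day" value is simply the size of the jump divided by the clamp minimum.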


I hope the description here plus the two linked sources give enough of a spec to work on this, and so I'm reassigning this back to Chris. If any more questions come up, feel free to ask me.
Assignee: kairo → chris.lonnen
I can start working on the UI aspects of this if that makes sense
Oh, I missed that this had a formula now.

I'll see if I can translate Kairo's formula into code.
Whiteboard: Q42011wanted → [Q42011wanted] [Q12012wanted]
Target Milestone: 2.4 → 2.4.1
Component: Socorro → General
Product: Webtools → Socorro
Target Milestone: 2.4.1 → 2.4.2
Schalk: put this on your UI list
Assignee: chris.lonnen → sneethling
As per IRC discussion, this is a new report that will be accessible from the drop down, same as top changers and top crashers.
[:kairo] So in comment 12 you link to two separate HTML pages that cover different data. Both of these are really big so I assume you do not want these combined on one page.

We agreed to link to the explosiveness report from the drop down and I am thinking that from there we can present the user with the options to switch between the different reports.

I am also thinking that we might want to take a more, on demand approach to especially https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.nightly.components.html

This one, https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.7.explosiveness.html, should probably be the default and then one can link to the other from there.
(In reply to Schalk Neethling from comment #17)
> I am also thinking that we might want to take a more, on demand approach to
> especially
> https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.
> nightly.components.html

Umm, that's not explosiveness, that's a components report and doesn't belong in this bug. I think I tried linking something like https://crash-analysis.mozilla.com/rkaiser/2012-02-08/2012-02-08.firefox.nightly.explosiveness.html there (just for a different date).

> This one,
> https://crash-analysis.mozilla.com/rkaiser/2011-10-18/2011-10-18.firefox.7.
> explosiveness.html, should probably be the default and then one can link to
> the other from there.

Yes, this is the one I envision to be there.
[:kairo] I assume all 15 columns are of equal importance or, would it be an option to display on first load only the first 4 columns and then allow the user to show and hide the additional 7?
[:kairo] Would it be possible to relocate the 'Total crashes' row at the bottom of the table? It would make adding sorting to the table a whole lot simpler.
It might be reasonable to hide the daily data the explosiveness data is based on, as long as it's just one click to show it and hopefully not a reload of the page, if possible.

The totals are intentionally at the top, as they are a reference point for the other values - if the totals are explosive themselves, the per-signature values may be "biased" by that.
Attached image Full view on initial load (obsolete)
Attached image total crashes hidden (obsolete)
total crashes can be hidden and shown without the need for a page reload
For one thing, we should try to have crashes per million ADU instead of total crashes - we have the needed ADU data for that (which I don't always have in my custom reports), see e.g. the Firefox 7 report linked in comment #12 or the Firefox 12 one here: https://crash-analysis.mozilla.com/rkaiser/2012-02-12/2012-02-12.firefox.12.explosiveness.html

Also, it probably would be good to de-emphasize the historic data somewhat if in any way possible (but else, let's leave it).

I'd like to see a mockup of something containing highlighted "explosive" crashes with one explosiveness number of 2 or higher.
'Also, it probably would be good to de-emphasize the historic data somewhat if in any way possible (but else, let's leave it).'

[:kairo] Could you clarify a little more what you mean by historic data? Do you mean the data under the 'Data (total crashes / 1M ADU)' column?

'I'd like to see a mockup of something containing highlighted "explosive" crashes with one explosiveness number of 2 or higher.'

[:kairo] Will send this to you ASAP
Status: NEW → ASSIGNED
Shouldn't the entry under Total Crashes/Explosiveness be blank?  It's unclear to me what those numbers would mean.

I'd also like to see a rank number for explosiveness, but Kairo might think that's irrelevant.
[:laura] I believe the rank number might be the one on the far left under TC#.

'Shouldn't the entry under Total Crashes/Explosiveness be blank?  It's unclear to me what those numbers would mean.'

[:laura] I assume here you are referring to the blue line. I am thinking this is the total of calculating all of the totals from the bottom up. [:kairo] would of course be able to answer these things better than me most likely.
Target Milestone: 2.4.2 → 2.4.3
(In reply to Schalk Neethling from comment #25)
> [:kairo] Could you clarify a little more what you mean by historic data? Do
> you mean the data under the 'Data (total crashes / 1M ADU)' column?

Yes.

(In reply to Laura Thomson :laura from comment #26)
> Shouldn't the entry under Total Crashes/Explosiveness be blank?  It's
> unclear to me what those numbers would mean.

This means how "explosive" that complete set of crashes for this version is and gives a reference value for the rest of the lines.

> I'd also like to see a rank number for explosiveness, but Kairo might think
> that's irrelevant.

How would a rank number help there? Explosiveness itself is already a pretty good indicator number, IMHO.
Target Milestone: 2.4.3 → 2.4.4
Attached image initial loading state
initial loading state
Attachment #596768 - Attachment is obsolete: true
Attachment #596769 - Attachment is obsolete: true
Attached image expanded view
expanded view - no page reload needed
and again the collapsed view - no page reload needed
I like those screenshots - the numbers would look better if they were right-aligned, though. :)
[:kairo] Awesome. Yeah, I was toying with aligning the numbers either right or center. Will play around with it a bit.
Unfortunately, work on ESR is going to delay implementing this.

Bumping to 2.5.1.
Whiteboard: [Q42011wanted] [Q12012wanted] → [Q42011wanted] [Q12012wanted][2.5.1]
Target Milestone: 2.4.4 → 2.5.1
UI work is targeted for completion in 2.5.1; the backend is targeted for 2.5.2.
Target Milestone: 2.5.1 → 2.5.2
Blocks: 525316
Target Milestone: 2.5.2 → 3
Target Milestone: 3 → 4
Target Milestone: 4 → 5
Target Milestone: 5 → 6
Target Milestone: 6 → 7
Target Milestone: 7 → 8
Target Milestone: 8 → 9
Target Milestone: 9 → 10
Target Milestone: 10 → 11
The final version of the table for this is now deployed in v.8.0.

Try "select * from explosiveness".
Actually, better example:  to select just the explosiveness for a specific product and version:

SELECT signature, oneday, threeday
FROM explosiveness
  JOIN product_versions USING ( product_version_id )
  JOIN signatures USING ( signature_id )
WHERE product_name = 'Firefox' AND version_string = '14.0a1'
ORDER BY oneday DESC LIMIT 20;

Now, getting the data per day is a bit more complex.  Table explosiveness has a column called "last_date".  This is a DATE, and is the date of "yesterday" as far as explosiveness is concerned.  Then it has columns day0, day1, ... day9, where day0 corresponds to the last_date, and day9 corresponds to last_date - 9 days.

So:

SELECT signature, oneday, threeday, last_date,
  day0, day1, day2, day3, day4, day5,
  day6, day7, day8, day9
FROM explosiveness
  JOIN product_versions USING ( product_version_id )
  JOIN signatures USING ( signature_id )
WHERE product_name = 'Firefox' AND version_string = '14.0a1'
ORDER BY oneday DESC LIMIT 20;

... and then you need to compute the dates for the headers yourself.
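
A small Python sketch of that header computation, assuming last_date has already been fetched from the explosiveness table as a Python date. The helper name is made up for illustration and is not part of Socorro.

```python
from datetime import date, timedelta


def day_column_dates(last_date, days=10):
    """Map day0..day9 to calendar dates: day0 is last_date, day9 is 9 days earlier."""
    return {f"day{i}": last_date - timedelta(days=i) for i in range(days)}


# Example with an arbitrary last_date.
headers = day_column_dates(date(2012, 2, 12))
```

Here headers["day0"] is last_date itself and headers["day9"] is the date 9 days before it, matching the column layout described above.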

Note that some explosiveness charts will have fewer than 20 signatures available.  Also, you can only select *one* product and *one* version.  Kairo is aware of this.
Target Milestone: 11 → 12
Josh, when I try to use any of those queries I get
ERROR:  permission denied for relation explosiveness
Kairo,

Ooops, sorry, I forgot to grant user Analyst permissions on that table.  Will fix.
Target Milestone: 12 → 14
Target Milestone: 14 → Future
Assignee: sneethling → nobody
Status: ASSIGNED → NEW
I've been working on a newer version of this as a part of my intern projects.

See PR: https://github.com/mozilla/socorro/pull/1394 for the first iteration.
Assignee: nobody → shwu
Status: NEW → ASSIGNED
Depends on: 909572
merged in https://github.com/mozilla/socorro/commit/e3305abd5b75e734f055d445533ff053d3589796
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED