Closed
Bug 519423
Opened 15 years ago
Closed 14 years ago
add tracking and alerts for "explosive" crash signatures.
Categories
(Socorro :: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
1.7.6
People
(Reporter: chofmann, Assigned: ryansnyder)
Attachments
(2 files, 1 obsolete file)
205.54 KB, image/png
40.39 KB, patch (ozten: review+)
In the past month we have had 2 incidents where non-existent or low volume crashes have exploded and zoomed to the top of the crash list within hours or days.
For details see
[Bug 519039] CoolIris Top Crasher [@ cooliris19.dll@0x351f2 ] and [@ cooliris19.dll@0x351a2 ] and [@ libcooliris19.dylib@0x31ea2 ]
and
Bug 512122 KB article: Possible Adware.DoubleD related Crash [@ NPFFAddOn.dll@0x11867][@ NPFFAddOn.dll@0xceb8][@ NPFFAddOn.dll@0x11657][@ NPFFAddOn.dll@0xe707][@ NPFFAddOn.dll@0xe590]
These were caught only because the people keeping an eye on the top crash report just happened to be around to see events starting to unfold.
We need to find ways to detect these events earlier and notify a few investigators to start looking at the problems sooner.
https://bug519039.bugzilla.mozilla.org/attachment.cgi?id=403459 shows that we might have caught that bug a day earlier, when it went from 2 or 3 crashes per minute to over 10 or 15 crashes per minute.
We will need to figure out at what rate we can monitor all crashes and what thresholds make the most sense for existing and new crashes. Hourly checks and some percentage increase over recent similar daytime time slices might work. We will need to examine more closely how much crashes stretch with the daily ebb and flow of general browser use so we don't kick off a steady stream of alerts.
This might be a case where we want to start real simple and then expand.
I suspect we will see more, not fewer, of these kinds of events as the user base continues to grow.
Comment 1•15 years ago
related is bug 411397 Need to add changes in rank to top crash reports
Comment 2•15 years ago (Reporter)
I added a comment over there. That bug seems to be about trying to do a better job of measuring the relative behavior and ranking of crashes against other crashes.
I think this is more about trying to measure the current behavior of a single signature against its own past performance or behavior.
We could also factor in the overall past performance of all crashes entering the system to help set the thresholds.
The attached chart shows we are running at an overall mean rate of 125 crashes per minute entering the system. If we get about 65% above that mark, or drop to below 75% of that figure, there might be an interesting event going on that needs some attention. In this case going below 65 crashes per minute might mean network or system problems that inhibit the flow of incoming crash reports.
If we get above that, some crashing plugin, website, or new release of software might be driving the numbers higher. Producing and watching this kind of overall report might give us a better idea of the dynamics of incoming crashes.
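As a rough illustration of the fixed-band idea in this comment, here is a minimal Python sketch. The function name and return labels are made up for the example; the default mean of 125 crashes per minute and the 65% / 75% factors are simply the figures quoted above, not tuned values.

```python
def classify_rate(crashes_per_minute, mean_rate=125.0,
                  high_factor=1.65, low_factor=0.75):
    """Classify the overall incoming crash rate against a fixed band.

    mean_rate, high_factor and low_factor are assumptions taken from the
    figures quoted in the comment (125/min, 65% above, 75% of the mean).
    """
    high = mean_rate * high_factor   # upper edge of the "normal" band
    low = mean_rate * low_factor     # lower edge of the "normal" band
    if crashes_per_minute > high:
        return "high"    # e.g. a crashing plugin, website, or new release
    if crashes_per_minute < low:
        return "low"     # e.g. network or collector problems
    return "normal"
```

A fixed band like this ignores the daily ebb and flow of traffic, so it would probably only be a starting point.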
Comment 3•15 years ago
This is the type of thing that predictive monitoring software does very well.
If we instrument the number of crashes per minute (by product and version) and then pick this up in our standard IT monitoring (cactus or whatever) then we could page or email when the number is above a threshold.
I don't know if we can do predictive monitoring with our current monitoring software, but this would analyze historical patterns and predict a high and low band. If the crash rate per minute goes out of band then an alert would fire.
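To make the "high and low band" idea concrete, here is a deliberately simple Python sketch. It is not what any real predictive monitoring product does; it just derives a band from the mean and standard deviation of past samples, and the function names and the multiplier `k` are assumptions for illustration.

```python
from statistics import mean, stdev

def predicted_band(history, k=3.0):
    """Return a (low, high) band from historical crashes-per-minute samples.

    history needs at least two samples; k is an assumed width multiplier.
    """
    center = mean(history)
    spread = stdev(history)
    return max(0.0, center - k * spread), center + k * spread

def out_of_band(current, history, k=3.0):
    """True if the current rate falls outside the band built from history."""
    low, high = predicted_band(history, k)
    return current < low or current > high
```

In practice the history would likely be restricted to comparable time slices (same hour, same weekday) so that normal intra-day swings do not widen the band too much.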
Comment 4•15 years ago
We use nagios for monitoring and cacti for trending. Neither of those two do any kind of predictive alerts/checks. If you want this as a part of the monitoring system, you'd have to write the corresponding plugins, figure out a way to store and analyze historic data and alert on it. That kind of capability isn't present in the systems we use now. Seems like this would be an excellent candidate for the metrics super-cluster.
Comment 5•15 years ago
So.. I was wrong.. there is a way to do this in nagios, however it will involve a lot of setup etc, to the point that this should be a quarterly goal sort of thing.
http://cricket.sourceforge.net/aberrant/rrd_hw.htm
That doc talks about the stuff involved. First, we'd need a way to create rrd databases of any interesting crash signatures, and automate that system to pick up new signatures over time. Then we need a nagios check to examine these rrd databases and use the method described in that link to alert us.
This would be something worth pursuing, but is a huge time sink for whoever is doing the work.
Comment 6•15 years ago
We can do this with max thresholds. The predictive thing is just much nicer
from a maintenance perspective.
Okay I can provide a cacti [1] script that prints out the total number of crashes in the
last 5 minutes.
So if this script was run at 2009-10-14 11:34:59 it would have output
Firefox_3.0.1:4 Firefox_3.0.10:2 Firefox_3.0.11:5 Firefox_3.0.14:38
Firefox_3.0.2:1 Firefox_3.0.3:1 Firefox_3.0.4:1 Firefox_3.0.7:2 Firefox_3.0.8:3
Firefox_3.0.9:1 Firefox_3.5.2:5 Firefox_3.5.3:77 Thunderbird_3.0b1pre: 1
Option 1:
The field names would change through time as they only report product/versions
that had crashes during the last 5 minutes.
You could add and remove cacti outputs on this data source as needed w/o any
changes to the script.
Option 2:
The script could take a list of inputs so that you get back some expected
output. So given the input:
Firefox_2.0.0.18,Firefox_3.5.2,Firefox_3.5.3,Thunderbird_3.0b1pre
it would output:
Firefox_2.0.0.18:0 Firefox_3.5.2:5 Firefox_3.5.3:77 Thunderbird_3.0b1pre: 1
Which Option would work/be easiest to manage with cacti?
Do we have an alert/alarm plugin installed on cacti?
[1] http://docs.cacti.net/manual:087:3a_advanced_topics.1_data_input_methods#data_input_methods
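For reference, Option 2 could look roughly like the Python sketch below. It assumes the crashes from the last 5 minutes have already been fetched as (product, version) pairs; the function name and input shape are hypothetical, and the real script would read from the database instead.

```python
from collections import Counter

def format_counts(crashes, expected):
    """Format per product/version crash counts as 'name:count' fields.

    crashes:  iterable of (product, version) tuples from the last 5 minutes
              (assumed input shape for this sketch).
    expected: list of 'Product_Version' names that must always be reported,
              so the set of output fields stays stable over time.
    """
    counts = Counter(f"{product}_{version}" for product, version in crashes)
    # A Counter returns 0 for names that had no crashes in the window.
    return " ".join(f"{name}:{counts[name]}" for name in expected)
```

The fixed input list is what keeps the field names from changing between runs, which is the main difference from Option 1.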
Comment 7•15 years ago
(In reply to comment #6)
These data points are to detect a surge in overall crashes, but it doesn't detect a burst of a specific crash signature.
We can also set up a query that looks at the unique signatures for a 1 hour time period and finds the ones that are over a certain threshold as a % of total crashes.
Example:
10/14 10am
All Crashes:
Firefox 3.5.3 5834
Top Crashers:
Firefox 3.5.3 74 nsCycleCollectingAutoRefCnt::decr(nsISupports*)
Firefox 3.5.3 74 UserCallWinProcCheckWow
Firefox 3.5.3 65 nsEventListenerManager::Release()
Firefox 3.5.3 64 nsGlobalWindow::cycleCollection::UnmarkPurple(nsISupports*)
Firefox 3.5.3 55 GraphWalker::DoWalk(nsDeque&)
Firefox 3.5.3 54 RtlpWaitOnCriticalSection
Highest % for a unique signature 74/5834 -> 1.3 %
Running this on 9/26 at 10am (cool iris day)
All Crashes:
Firefox 3.5.3 6206
Top Crashers:
Firefox 3.5.3 458 cooliris19.dll@0x351f2
Firefox 3.5.3 103 RtlpWaitOnCriticalSection
Firefox 3.5.3 73 nsEventListenerManager::Release()
Firefox 3.5.3 73 nsGlobalWindow::cycleCollection...
Firefox 3.5.3 73 libcooliris19.dylib@0x31ea2
Firefox 3.5.3 71 nsCycleCollectingAutoRefCnt::de...
Highest % for a unique signature 458/6206 -> 7.4%
I don't know how to set that up in a cacti friendly way... but we could make an email alert or some other mechanism for the occasion where this % goes over a threshold.
I switched from a 5 minute time slice to 1 hour, because there were only tens of crashes per 5 minutes per signature.
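The percentage check described in this comment can be sketched in Python as follows. The function names and the 5% threshold are assumptions chosen for illustration; the input is assumed to be the per-signature counts for one product/version and hour, plus the total for that hour.

```python
def top_signature_share(signature_counts, total_crashes):
    """Return (signature, share) for the signature with the most crashes."""
    signature, count = max(signature_counts.items(), key=lambda item: item[1])
    return signature, count / total_crashes

def is_burst(signature_counts, total_crashes, threshold=0.05):
    """True if the top signature exceeds the assumed share threshold."""
    _, share = top_signature_share(signature_counts, total_crashes)
    return share > threshold
```

With the numbers from the two examples above, the 10/14 hour gives a top share of about 1.3% and the 9/26 hour about 7.4%, so any threshold between those would have separated the two cases.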
Comment 8•15 years ago
Here is the SQL for comment #7
--
-- signature bursts
SELECT product, version, COUNT(date_processed), signature
FROM reports
WHERE
    date_processed > '2009-09-26 10:34:59' AND
    date_processed <= '2009-09-26 11:34:59' AND
    signature IS NOT NULL
GROUP BY product, version, signature
HAVING COUNT(date_processed) > 10
ORDER BY product, version, COUNT(date_processed) DESC;
-- Taking product and versions from above query... get total # of crashes
SELECT product, version, COUNT(date_processed)
FROM reports
WHERE
    date_processed > '2009-09-26 10:34:59' AND
    date_processed <= '2009-09-26 11:34:59' AND
    signature IS NOT NULL AND
    ((product = 'Firefox' AND version = '3.0.14') OR
     (product = 'Firefox' AND version = '3.5.3'))
GROUP BY product, version;
Comment 9•15 years ago (Reporter)
I'm not sure about the 5 minute or even the 1 hour time slice. The granularity of those periods might be too small and deliver too many false positives. We need to model in the ebb and flow of intra-day browser traffic and weekday/weekend effects.
From comment 0, this is a profile of the kind of thing we are trying to monitor:
https://bug519039.bugzilla.mozilla.org/attachment.cgi?id=403459
We want the system to tell us something was up sometime before that Friday morning crash peak. A 12 hour or even 24 hour cycle might be good enough for the kind of warning we need without generating too many false positives.
If we are only looking at the top 10 or top 10% of crashes for performance reasons, that might be useful, but that top 10% needs to be dynamically updated so we are looking mostly at new entries into that list.
Another valuable aspect of this is to alert on the introduction of new signatures we have never seen before. They might even be low volume signatures. Here is an example of that:
https://bugzilla.mozilla.org/show_bug.cgi?id=523529
Adobe ships a new Acrobat Reader on Oct 13th and we start seeing one or more new crashes the next day or within hours of the release.
It's the old QA premise: finding bugs as close to their introduction as possible makes them easier and faster to diagnose and fix. Most of the time the top 10 list or the top 10% list is pretty stable and uninteresting. The cooliris example is more of a rare case where the signature made it to the top of the crash list. We actually would prefer to get alerted well before it gets near the top 10%.
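Alerting on never-before-seen signatures is the simplest piece of this to sketch. The Python below assumes we keep a set of all signatures seen so far and compare each new time window against it; the function name is hypothetical.

```python
def new_signatures(seen_before, current_window):
    """Return signatures in the current window that were never seen before.

    seen_before:    iterable of all previously known signatures.
    current_window: iterable of signatures observed in the latest window.
    """
    return sorted(set(current_window) - set(seen_before))
```

Because this ignores volume entirely, it would flag low volume signatures too, which is the point of the Acrobat Reader example. It would probably need some filtering to avoid alerting on every one-off crash.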
Comment 10•15 years ago
(In reply to comment #9)
Thanks, I'm still digesting this...
Any comment on monitoring bursts in the total number of crashes with cacti? Would that be useful? (This is comments 2-6.) It will be very quick and easy to build.
I'll keep working through comment #9.
Comment 11•15 years ago
Wearing my "I'm picky" hat about comment #8: The standard behavior for a range of items is:
floor <= item AND item < ceiling // Start at floor, never quite reach ceiling
All the materialized views now follow that standard.
FYI: I once had to deal with legacy code similar to that in comment #8, and it caused endless subtle problems, so I'm gun shy about it.
There's another small problem with the code in comment #8: date_processed is a timestamp with very small granularity. You would want to use a count of bins, not a count of individual date_processed, which will very seldom be equal.
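Both points in this comment (the half-open range and counting bins rather than raw timestamps) can be illustrated with a short Python sketch. The hourly bin size and the function names are assumptions for the example.

```python
from collections import Counter
from datetime import datetime

def in_range(item, floor, ceiling):
    """Half-open range test: start at floor, never quite reach ceiling."""
    return floor <= item < ceiling

def count_by_hour(timestamps, floor, ceiling):
    """Count crash timestamps per hourly bin within [floor, ceiling)."""
    bins = Counter()
    for ts in timestamps:
        if in_range(ts, floor, ceiling):
            # Truncate to the start of the hour so that nearby timestamps
            # land in the same bin instead of each being unique.
            bins[ts.replace(minute=0, second=0, microsecond=0)] += 1
    return dict(bins)
```

With half-open ranges, adjacent windows never double count a timestamp that falls exactly on a boundary.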
Comment 12•15 years ago (Reporter)
Lars had some ideas on this, and maybe it could be done for the next Socorro release.
Here is another place where getting some alerts sometime after 1pm yesterday might have been helpful in starting the analysis sooner:
https://bugzilla.mozilla.org/show_bug.cgi?id=538998#c2
One idea, as a start, would be to just watch the rank changes in the 3 day report:
http://crash-stats.mozilla.com/topcrasher/byversion/Firefox/3.6/3
If the rank changes by more than some threshold, say 50 or 100 ranking slots, then send e-mail.
We could also hook this up to a report for all versions of Firefox, instead of specific versions. That would not have the noise around specific releases of Firefox and would be more attuned to catching the spread of malware and external crashes that we aren't watching as closely for.
Then maybe the throttling can be turned down to 1 day, 12 hour, or 6 hour rank changes.
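The rank-change idea might be sketched like this in Python. It assumes two top crasher lists (previous and current), each ordered by rank; the function names and the default threshold of 50 slots are illustrative, and actually sending the e-mail is left out.

```python
def rank_changes(previous, current):
    """Return {signature: slots moved up} for signatures in both lists.

    previous/current: lists of signatures ordered by rank (index 0 = rank 1).
    Signatures that are new in current are skipped in this sketch.
    """
    prev_rank = {sig: i + 1 for i, sig in enumerate(previous)}
    changes = {}
    for i, sig in enumerate(current):
        if sig in prev_rank:
            changes[sig] = prev_rank[sig] - (i + 1)  # positive = moved up
    return changes

def movers(previous, current, threshold=50):
    """Signatures that climbed at least `threshold` ranking slots."""
    changes = rank_changes(previous, current)
    return sorted(sig for sig, delta in changes.items() if delta >= threshold)
```

Signatures that appear for the first time have no previous rank, so they would need separate handling.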
OS: Mac OS X → All
Target Milestone: --- → 1.4
Updated•15 years ago
Target Milestone: 1.4 → 1.5
Comment 13•15 years ago (Reporter)
A larger list of the explosive crashes we have seen is in this query:
https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=allwordssubstr;query_format=advanced;status_whiteboard=explos
Updated•15 years ago
Assignee: nobody → ryan
Updated•15 years ago
Target Milestone: 1.5 → 1.6
Comment 14•15 years ago (Assignee)
Still awaiting feedback on this.
In 1.5 we released a new UI that contained top moving top crashers on each product and version dashboard.
The primary question is whether or not the dashboards provide enough information on explosive crash signatures, or if other information or communication mechanisms are necessary. If more is needed, please explain in detail.
Pushing to 1.7 to allow time for proper feedback / specs / implementation.
Target Milestone: 1.6 → 1.7
Comment 15•15 years ago (Reporter)
One possible quick fix would be to change the "top changer" report at
http://crash-stats.mozilla.com/products/Firefox to cut it down to a 2 or 3 day window, and also to only show the red (trending upward) signatures. I think it does the latter, but right now I get "Top changers currently unavailable." when trying to view the page.
The reduced window would allow quicker spotting of upward trending signatures by people who happen to visit the page.
Next would be to add notifications (e-mail and/or possibly an RSS feed) so that interested trackers of this stuff don't have to actually visit the page to learn of spiking crashes.
as we get these foundations in place we could
Comment 16•15 years ago (Assignee)
Has 3.6.2 been released? I don't see it on the Firefox download page. As such, it shouldn't be showing up in the Firefox dashboard, and the reason it is showing up is because the dates for 3.6.2 are incorrect in the admin panel:
https://crash-stats.mozilla.com/admin/branch_data_sources
Here is what top crashers look like on the 3.6 dashboard:
http://crash-stats.mozilla.com/products/Firefox/versions/3.6
We can add a 3 day window to each of the dashboards.
I like the RSS feed idea, because that would be the easiest/quickest solution to implement.
Comment 17•15 years ago (Reporter)
(In reply to comment #16)
> Has 3.6.2 been released? I don't see it on the Firefox download page. As
> such, it shouldn't be showing up in the Firefox dashboard, and the reason it is
> showing up is because the dates for 3.6.2 are incorrect in the admin panel:
> https://crash-stats.mozilla.com/admin/branch_data_sources
>
OK, I see the problem here. Going to http://crash-stats.mozilla.com redirects
to a page that ends up with a blank top changer section. Maybe that's what we need to fix.
> Here is what top crashers look like on the 3.6 dashboard:
> http://crash-stats.mozilla.com/products/Firefox/versions/3.6
>
> We can add a 3 day window to each of the dashboards.
>
> I like the RSS feed idea, because that would be the easiest/quickest solution
> to implement.
http://crash-stats.mozilla.com/products/Firefox/versions/3.6 looks pretty good. One thing to add in addition to the trending info would be the current ranking.
That would help to provide some context of where the movement is happening. If it's up 500 slots to become the #1000 top crash, we might give it a few more hours or days to establish the trend and keep an eye on it, rather than acting as we would if it had jumped 500 slots into the top 100.
I think doing these couple of small things might yield some good improvements, and then we could evaluate again, looking closer at each of the use cases in the "explosive" bug list to determine what might have been done to detect and notify people sooner.
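The point about needing the current rank for context can be sketched as a tiny triage rule in Python. The function name, the labels, and the cutoffs (a 500 slot jump, the top 100) are assumptions lifted from the example in this comment, not agreed thresholds.

```python
def triage(change_in_rank, current_rank, jump=500, top=100):
    """Combine how far a signature moved with where it ended up.

    change_in_rank: slots moved up (positive = trending upward).
    current_rank:   rank after the move (1 = top crasher).
    """
    if change_in_rank >= jump and current_rank <= top:
        return "alert"   # big jump that lands near the top of the list
    if change_in_rank >= jump:
        return "watch"   # big jump, but still far down the list
    return "ignore"
```

For example, a signature up 500 slots to rank 1000 would be "watch", while one up 500 slots to rank 100 would be "alert".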
Comment 19•14 years ago (Assignee)
I have this in progress at the moment.
I am ensuring that the changeInRank and currentRank values for each trending top crasher are available on the dashboard, so that the severity of the trend will be readily apparent. I am also creating a separate trending top crasher page, which will have the data available via RSS and CSV.
The last piece to put in place will be to add a 3 day date range to the already existing values of 7, 14 and 28 days.
All other notifications for these trends will take place in #525316.
Status: NEW → ASSIGNED
Target Milestone: 1.9 → 1.8
Comment 20•14 years ago (Assignee)
See comment 19 for the changes this patch encompasses.
To see this in my sandbox, please visit the dashboard for a product / version:
http://rsnyder.khan.mozilla.org/reporter/products/Firefox/versions/3.6.7
Or the trending top crashes page for a product / version:
http://rsnyder.khan.mozilla.org/reporter/products/Firefox/versions/3.6.7/topchangers
To apply this patch, in application/config/products.php, you will need to replace $config['topchangers_count'] with:
/**
 * The number of topchangers to feature on the product dashboard.
 */
$config['topchangers_count_dashboard'] = 15;
/**
 * The number of topchangers to feature on the topchangers page.
 */
$config['topchangers_count_page'] = 50;
Attachment #459313 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: feedback?
Comment 21•14 years ago (Assignee)
Submitting an updated patch. The rss and csv links for the trending top crashers did not contain the duration variable in the url.
Attachment #459313 - Attachment is obsolete: true
Attachment #459944 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: review?(ozten.bugs)
Attachment #459313 - Flags: feedback?
Comment 22•14 years ago
Comment on attachment 459944
Patch 2 for 5159423
Wow, thanks for the quick turnaround... Lots of code!
Thanks for fixing those docstrings.
Looks great!
Attachment #459944 - Flags: review?(ozten.bugs) → review+
Comment 23•14 years ago (Assignee)
Thanks Austin. Filed https://bugzilla.mozilla.org/show_bug.cgi?id=581679 to get the config file updated on stage. Added documentation to the rollout procedure for 1.8 at http://code.google.com/p/socorro/wiki/SocorroUpgrade#Socorro_1.8 .
==
Sending webapp-php/application/config/products.php-dist
Sending webapp-php/application/config/routes.php
Sending webapp-php/application/controllers/products.php
Sending webapp-php/application/views/common/dashboard_product.php
Adding webapp-php/application/views/common/product_topchangers.php
Sending webapp-php/application/views/layout.php
Sending webapp-php/application/views/moz_pagination/nav.php
Adding webapp-php/application/views/products/product_topchangers.php
Sending webapp-php/css/screen.css
Sending webapp-php/js/socorro/daily.js
Sending webapp-php/js/socorro/dashboard.js
Transmitting file data ...........
Committed revision 2247.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 25•14 years ago
Review for possible inclusion in 1.7.6.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 1.8 → 1.7.6
Comment 26•14 years ago (Assignee)
Updated upgrade docs at https://code.google.com/p/socorro/wiki/SocorroUpgrade
Updated Bug 612981 to include config change.
Working on integrating remaining code changes.
Comment 27•14 years ago (Assignee)
This will resolve Bug 603561 as well.
Committing.
==
Sending webapp-php/application/config/products.php-dist
Sending webapp-php/application/config/routes.php
Sending webapp-php/application/views/common/dashboard_product.php
Adding webapp-php/application/views/common/product_topchangers.php
Sending webapp-php/application/views/layout.php
Adding webapp-php/application/views/products/product_topchangers.php
Sending webapp-php/js/socorro/daily.js
Sending webapp-php/js/socorro/dashboard.js
Transmitting file data ........
Committed revision 2752.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Comment 28•14 years ago (Reporter)
Another example:
https://bugzilla.mozilla.org/show_bug.cgi?id=585913#c9
Blocks: 585913
Updated•13 years ago
Component: Socorro → General
Product: Webtools → Socorro