Closed Bug 733021 Opened 13 years ago Closed 13 years ago

Detect explosive crashes backend work

Categories

(Socorro :: Backend, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: espressive, Assigned: jberkus)

References

Details

(Whiteboard: [qa-])

Backend work related to bug 629062
Target Milestone: --- → 2.5.2
So this is the base SQL I came up with (follows). My thoughts are to wrap this in a stored procedure or python module called like: get_explosiveness( base_date, product_versions[], days, compare_days, threshold ) e.g. get_explosiveness ( '2012-03-04', [ 876 ], 1, 9, 50 ) Not sure whether it should be a stored procedure or a python function. Opinions? There's no real percentage in storing this in a matview initially, because the matview would need to be quite large. Once we set the # of days and the thresholds permanently, we can make add an explosiveness column to TCBS. Anyway, Kairo, see if the below matches the algorithm you outlined: WITH adusum AS ( SELECT adu_date, sum(adu_count) as adu_count FROM product_adu WHERE product_version_id IN ( 876 ) GROUP BY adu_date ), reportsum AS ( SELECT signature_id, report_date, sum(report_count) as report_count FROM tcbs WHERE product_version_id IN ( 876 ) GROUP BY signature_id, report_date ), oneday AS ( SELECT reportsum.signature_id, report_count, (report_count/adu_count::numeric)*1000000 as crash_ratio FROM reportsum JOIN adusum ON reportsum.report_date = adusum.adu_date WHERE reportsum.report_date = '2012-03-04' AND report_count > 50 ), nineday AS ( SELECT reportsum.signature_id, sum(report_count) as report_count, avg( report_count/adu_count::numeric )*1000000 as crash_ratio FROM reportsum JOIN adusum ON reportsum.report_date = adusum.adu_date AND reportsum.report_date BETWEEN '2012-02-24' AND '2012-03-03' GROUP BY reportsum.signature_id ) SELECT oneday.signature_id, signature, round(oneday.crash_ratio / COALESCE(nineday.crash_ratio,0.1), 1) as explosiveness, oneday.report_count FROM oneday JOIN signatures USING (signature_id) LEFT OUTER JOIN nineday USING (signature_id) ORDER BY explosiveness DESC LIMIT 10;
Assignee: nobody → josh
(In reply to [:jberkus] Josh Berkus from comment #1) > Anyway, Kairo, see if the below matches the algorithm you outlined: If I read this correctly, I think it wants match the algorithm for "1-day explosiveness". The "3-day explosiveness" one is somewhat different (yes, this is complicated analysis all in all, but is has proven quite helpful). I don't see you comparing the difference between the maximum and average of the nineday to the difference between the oneday value and the nineday average, and I'm completely out of clue what the COALESCE is for, but that may be because we're off track there already. Is the code in http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/file/005d90fe9450/get-explosives.php#l530 any more help then the other descriptions of the algorithm?
Kairo, Yes, that's an example of 1-day explosiveness. 3-day explosiveness seems to be different only in the number of days in each comparison bucket and the threshold, correct? I'm sorry, the PHP is not helpful. Could you express it in pure symbolic math? With definitions of the symbols, of course.
Kairo, Let me get you started so that you know what I mean: c1 = one-day total crashes a1 = one-day ADU c9a = 9-day average crashes a9a = 9-day average ADU t = minimum number of crashes threshold e = explosiveness So the SQL query implements: e = ( c1 / a1 ) / ( c9a / a9a ) IF c1 > t So, please expand that to the math I'm supposed to be implementing. Thanks!
(In reply to [:jberkus] Josh Berkus from comment #4) > e = ( c1 / a1 ) / ( c9a / a9a ) IF c1 > t Which is something other than my algorithm. Apparently you were unwilling to read the function in my PHP that actually has the mathematics of this, or you'd at least have spotted my "creative use of the minus sign" (see http://en.memory-alpha.org/wiki/Good_Shepherd_%28episode%29#Memorable_Quotes for the source of that quote). ;-) > So, please expand that to the math I'm supposed to be implementing. Thanks! Yes, will do that, probably on Monday.
Kairo, How about this? c1 = one-day total crashes a1 = one-day ADU c9a = 9-day average crashes a9a = 9-day average ADU c9m = 9-day maximum crashes a9m = 9-day maximum ADU t = minimum number of crashes threshold e = explosiveness New formula: e = ( ( c1 / a1 ) - ( c9a / a9a ) ) - ( ( c9m / a9m ) - ( c9a / a9a ) ) / ( c9a / a9a ) IF c1 > t that then subtracts the maximum max-mean variance from the current crash figure. Does that work for you for the 9-day threshold? I'll go ahead and implement that as code; we can always change it later if it's not showing what you want.
Kairo, Actually, there's only one issue with the formula above, which is that it gives hugely inflated explosiveness numbers if the crash is brand-new today. For example, imagine we have a signature with 70 crashes per million today which never had any before: ( 70 - 0 ) - ( 0 - 0 ) / 1 == 70.0 ( if there are 0 crashes, we round up to 1 to avoid DIVBYZERO ) Any thoughts on an adjustment for the "brand new crash" case?
Please just wait for me to do the cosmetic touches and make mathematical formulae out of my PHP code (as you don't want to read it otherwise) instead of randomly inventing new formulae that I need to try and understand but which are different again from anything I proposed.
Postponed to 2.5.3; awaiting readable formulae.
Target Milestone: 2.5.2 → 2.5.3
Still waiting on formula, bumped to 4.
Target Milestone: 3 → 4
Target Milestone: 4 → 5
OK, here's a try to make this mathematical: First of all, we have two parameters being passed to the generation of those values. For simplicity, we specify it in crashes and hardcode them (we might want to tweak it at some point, so it's a var to the algorithm): One is a minimum crash rate in crashes per million ADU: minrate = 10 cr/MADU The other is a "clamp" that both prevents a division by zero and masks out some noise. And actually, it's two values, one for each type of explosiveness (see below): clamp_f = 30 clamp_s = 15 Then, we always want the rates of cr/MADU of the last 10 days for the total of the product+version in the report plus for every signature that was over the minrate on the analyzed day: c0, ... , c9: crashes on day 0 (the one to analyze) and days 1-9 before that one a0, ... , a9: ADUs on day 0, etc. r0, ... , r9 = 1e6 * c0 / a0, ... , 1e6 * c9 / a9 [IF r0 > minrate] We'll also need the clamp on a cr/MADU scale, so let's calculate it: clampperadu = 1e6 * clamp / avg(a0, ..., a9) There is one main formula for both variants (let's call them "fast-blow" and "slow-blow" instead of "1-day" and "3-day" to avoid too many numbers in names and to use an analogy to fuses): exp = (val - base) / max(dist, clampperadu) We are comparing the difference between a value and a base to some clamped-to-minimum distance. Fast-blow explosiveness: val_f = r0 base_f = avg(r1, ... , r9) dist_f = max(r1, ... , r9) - base_f Slow-blow explosiveness: val_s = avg(r0, r1, r2) base_s = avg(r3, ... , r9) dist_s = stddev(r3, ... , r9) The avg(), max(), and stddev() functions exist with this naming in PostgreSQL, I think, so are hopefully straight forward. You might of course want to inline some of the extra definitions I made into a larger formula, I broke it up so it hopefully is easier to understand. What the UI should get for display in the end are the values exp_f, exp_s, r0, ... r9 and name for the total and every analyzed signature (i.e. any over the minrate).
Kairo, I think I've got that. Lemme try coding it and I'll see if there's still stuff I don't understand.
Kairo, Actually, one issue: this formula has become complex enough that we don't want to calculate it at user request time. That would mean NOT making it dynamic, and only allowing filtering by a specific product/version. Is that an issue?
This can surely be pregenerated daily by a cron job in a similar way other reports.
OK, I'll take that as a "yes, I'm OK with that".
Yes, I always expected this would be a per-product-and-version daily-calculated report, as I knew it takes some math to get there.
Kairo, I think I have the query for 1-day explosiveness. I'll start rolling it into a stored procedure right away. Below is a sample query I tested on StageDB for 1-day explosiveness. A couple minor things: (1) in your forumula clamperadu is compared against aggregate statistics, but is not itself aggregated. As such, I'm averaging it. If you'd like me to use min() or max() instead, let me know. (2) I'm seeing some negative values, which makes sense for signatures which are declining in frequency. Query: WITH adusum AS ( SELECT adu_date, sum(adu_count) as adu_count, ( 30 * 1000000::numeric / sum(adu_count)) as mindivisor, product_version_id FROM product_adu WHERE adu_date BETWEEN '2012-03-11' and '2012-03-20' AND adu_count > 0 GROUP BY adu_date, product_version_id ), reportsum AS ( SELECT report_date, sum(report_count) as report_count, product_version_id, signature_id FROM tcbs WHERE report_date BETWEEN '2012-03-11' and '2012-03-20' GROUP BY report_date, product_version_id, signature_id ), crash_madu AS ( SELECT ( report_count * 1000000::numeric ) / adu_count AS crash_madu, reportsum.product_version_id, reportsum.signature_id, report_date, mindivisor FROM adusum JOIN reportsum ON adu_date = report_date AND adusum.product_version_id = reportsum.product_version_id ), sum1day AS ( SELECT product_version_id, signature_id, crash_madu as sum1day, mindivisor FROM crash_madu WHERE report_date = '2012-03-20' AND crash_madu > 10 ), agg9day AS ( SELECT product_version_id, signature_id, AVG(crash_madu) AS avg9day, MAX(crash_madu) as max9day FROM crash_madu WHERE report_date BETWEEN '2012-03-11' and '2012-03-19' AND crash_madu > 10 GROUP BY product_version_id, signature_id ) SELECT sum1day.signature_id, sum1day.product_version_id , round ( ( sum1day.sum1day - agg9day.avg9day ) / GREATEST ( agg9day.max9day - agg9day.avg9day, sum1day.mindivisor ) , 2 ) as explosive_1day FROM sum1day JOIN agg9day USING ( signature_id, product_version_id ) WHERE product_version_id = 908 ORDER BY explosive_1day DESC limit 10;
Oh, one other thing: should signatures need to appear on both reports to be listed? I'm thinking not ...
(In reply to [:jberkus] Josh Berkus from comment #17) > (1) in your forumula clamperadu is compared against aggregate statistics, > but is not itself aggregated. Umm, I don't understand what you are saying here. The original clamp is being divided by the average ADUs to get clampperadu, isn't that enough? > (2) I'm seeing some negative values, which makes sense for signatures which > are declining in frequency. Exactly, that's intended. The explosiveness value is basically a factor of how strongly rising a crash is - negative values mean they have a downwards trend. > WITH adusum AS ( > SELECT adu_date, sum(adu_count) as adu_count, > ( 30 * 1000000::numeric / sum(adu_count)) as mindivisor, > product_version_id Umm, why sum(ad_count) - doesn't avg(adu_count) match what I have in my formula? At least if this mindivisor is clampperadu as I think it would be. Still, I'm not clear which adu_count this is, the original one or the post-as one (yes, my SQL is not good enough to know the convention there). You using completely different variable names than me doesn't make me checking your stuff very easy though, as I have a hard time already wrapping my mind around complex SQL like that. Also, I wonder if the 30 here (and the 10 later on) should really be put there without any names attached to them. I guess that's what one pays for more efficiency in the SQL but it makes things even harder to follow, like such "magic values/constants" usually do in code. Still, the rest looks good from what I can tell though I'm unsure about the "AND crash_madu > 10" in the agg9day case (see below). > ORDER BY explosive_1day DESC limit 10; I guess that's just for the moment, right? (In reply to [:jberkus] Josh Berkus from comment #18) > Oh, one other thing: should signatures need to appear on both reports to be > listed? I'm thinking not ... Not sure what you mean by that. As r0 > minrate is the only condition restricting what appears, we calculate everything for all signature+productversion combinations that fulfill this one condition.
> Umm, I don't understand what you are saying here. The original clamp is > being divided by the average ADUs to get clampperadu, isn't that enough? OK, I didn't read it that way. We're in good shape then. > Exactly, that's intended. The explosiveness value is basically a factor of > how strongly rising a crash is - negative values mean they have a downwards > trend. OK, great, just checking. > Umm, why sum(ad_count) - doesn't avg(adu_count) match what I have in my > formula? At least if this mindivisor is clampperadu as I think it would be. > Still, I'm not clear which adu_count this is, the original one or the > post-as one (yes, my SQL is not good enough to know the convention there). > You using completely different variable names than me doesn't make me > checking your stuff very easy though, as I have a hard time already wrapping > my mind around complex SQL like that. The SUM() is because we're drawing from the table product_adu, which has one line for each platform, not just for each product and day. So we need to SUM() the various platforms first. The averaging of ADU happens further down in the query. > Also, I wonder if the 30 here (and the 10 later on) should really be put > there without any names attached to them. I guess that's what one pays for > more efficiency in the SQL but it makes things even harder to follow, like > such "magic values/constants" usually do in code. That's a test query; there's no such things as variables in pure SQL. The eventual function will have variables (and comments) in it. > Still, the rest looks good from what I can tell though I'm unsure about the > "AND crash_madu > 10" in the agg9day case (see below). Yeah, that was me applying the minimum of 10 crashes per day in the comparison set. I realized this morning that that didn't make sense; I see you caught it too. > > > ORDER BY explosive_1day DESC limit 10; > > I guess that's just for the moment, right? Limit and order will be applied by the UI/Middleware code. > Not sure what you mean by that. As r0 > minrate is the only condition > restricting what appears, we calculate everything for all > signature+productversion combinations that fulfill this one condition. So, one problem which I realized with the test query; currently, if there are no rows in the comparison set (e.g. no crash reports for the prior 9 days) for a specific signature, then we get no results. I don't think that's what you intend (since a signature jumping from nothing to 400 would be explosive indeed), so I'll be fixing that too.
Ok, checked in, issued pull request.
Kairo: So, there's a problem with your three-day formula. I can't do a STDDEV unless there's at least 3 days of data. For many signatures, there's less than that. How do you want to modify the formula for signatures where an stddev isn't possible?
Actually, let me explain the above in more detail. For any given signature+product_version, we have days which have 0 crashes and 0 ADU. In my current implementation, I've been ignoring those days; not counting them at all, and not caculating them into the averages. That seems to work fine, except for STDDEV, which needs at least 3 days of data to produce a result. The alternative is to count 0crash:0ADU days as 0 crash_MADU rather than skipping them. However, this would have the effect of skewing the results even more heavily towards new crashes. I'd also need to calculate clampperadu based on the *last* day for which we have ADU rather than for the average, or the effect of the clamp will be erased by averaging over days with 0 ADU. So, do you want me to do that, or instead substitute something for the STDDEV?
(In reply to [:jberkus] Josh Berkus from comment #23) > The alternative is to count 0crash:0ADU days as 0 crash_MADU rather than > skipping them. That's what we should do there, IMHO (and what I just confirmed my code does right now). Note that the clamp will prohibit a division by zero from that fact there anyhow, as well as not make new low-volume crashes have a too high explosiveness. That's exactly (part of) what it is there for. :) > I'd also need to calculate > clampperadu based on the *last* day for which we have ADU rather than for > the average, or the effect of the clamp will be erased by averaging over > days with 0 ADU. Why can't it be based on the days for which we do have ADUs? In any case, my current code seems to actually average over days with 0 ADUs, and making sure that the ADU average can never be smaller than 1.
> Why can't it be based on the days for which we do have ADUs? In any case, my > current code seems to actually average over days with 0 ADUs, and making > sure that the ADU average can never be smaller than 1. If that works for you, I'm fine implementing it!
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [qa-]
Marking this resolved so that we can push it and developers can get started on a UI. Creating a new 6.0 bug "revise math for explosiveness" to fix the known math issues described above.
Blocks: 744492
[:jberkus] Can you please post the 'final' query to be used on https://bugzilla.mozilla.org/show_bug.cgi?id=629062 That would be awesome.
You need to log in before you can comment on or make changes to this bug.