Closed
Bug 733021
Opened 13 years ago
Closed 13 years ago
Detect explosive crashes backend work
Categories
(Socorro :: Backend, task)
Socorro
Backend
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: espressive, Assigned: jberkus)
References
Details
(Whiteboard: [qa-])
Backend work related to bug 629062
Reporter | ||
Updated•13 years ago
|
Target Milestone: --- → 2.5.2
Assignee | ||
Comment 1•13 years ago
|
||
So this is the base SQL I came up with (follows). My thoughts are to wrap this in a stored procedure or python module called like:
get_explosiveness( base_date, product_versions[], days, compare_days, threshold )
e.g. get_explosiveness ( '2012-03-04', [ 876 ], 1, 9, 50 )
Not sure whether it should be a stored procedure or a python function. Opinions?
There's no real percentage in storing this in a matview initially, because the matview would need to be quite large. Once we set the # of days and the thresholds permanently, we can make add an explosiveness column to TCBS.
Anyway, Kairo, see if the below matches the algorithm you outlined:
WITH adusum AS (
SELECT adu_date, sum(adu_count) as adu_count
FROM product_adu
WHERE product_version_id IN ( 876 )
GROUP BY adu_date ),
reportsum AS (
SELECT signature_id, report_date,
sum(report_count) as report_count
FROM tcbs
WHERE product_version_id IN ( 876 )
GROUP BY signature_id, report_date
),
oneday AS (
SELECT reportsum.signature_id, report_count,
(report_count/adu_count::numeric)*1000000 as crash_ratio
FROM reportsum
JOIN adusum
ON reportsum.report_date = adusum.adu_date
WHERE reportsum.report_date = '2012-03-04'
AND report_count > 50
),
nineday AS (
SELECT reportsum.signature_id, sum(report_count) as report_count,
avg( report_count/adu_count::numeric )*1000000 as crash_ratio
FROM reportsum
JOIN adusum
ON reportsum.report_date = adusum.adu_date
AND reportsum.report_date BETWEEN '2012-02-24' AND '2012-03-03'
GROUP BY reportsum.signature_id
)
SELECT oneday.signature_id, signature, round(oneday.crash_ratio /
COALESCE(nineday.crash_ratio,0.1), 1) as explosiveness,
oneday.report_count
FROM oneday
JOIN signatures USING (signature_id)
LEFT OUTER JOIN nineday USING (signature_id)
ORDER BY explosiveness DESC LIMIT 10;
Assignee: nobody → josh
Comment 2•13 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #1)
> Anyway, Kairo, see if the below matches the algorithm you outlined:
If I read this correctly, I think it wants match the algorithm for "1-day explosiveness". The "3-day explosiveness" one is somewhat different (yes, this is complicated analysis all in all, but is has proven quite helpful).
I don't see you comparing the difference between the maximum and average of the nineday to the difference between the oneday value and the nineday average, and I'm completely out of clue what the COALESCE is for, but that may be because we're off track there already.
Is the code in http://hg.mozilla.org/users/kairo_kairo.at/crash-report-tools/file/005d90fe9450/get-explosives.php#l530 any more help then the other descriptions of the algorithm?
Assignee | ||
Comment 3•13 years ago
|
||
Kairo,
Yes, that's an example of 1-day explosiveness. 3-day explosiveness seems to be different only in the number of days in each comparison bucket and the threshold, correct?
I'm sorry, the PHP is not helpful. Could you express it in pure symbolic math? With definitions of the symbols, of course.
Assignee | ||
Comment 4•13 years ago
|
||
Kairo,
Let me get you started so that you know what I mean:
c1 = one-day total crashes
a1 = one-day ADU
c9a = 9-day average crashes
a9a = 9-day average ADU
t = minimum number of crashes threshold
e = explosiveness
So the SQL query implements:
e = ( c1 / a1 ) / ( c9a / a9a ) IF c1 > t
So, please expand that to the math I'm supposed to be implementing. Thanks!
Comment 5•13 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #4)
> e = ( c1 / a1 ) / ( c9a / a9a ) IF c1 > t
Which is something other than my algorithm. Apparently you were unwilling to read the function in my PHP that actually has the mathematics of this, or you'd at least have spotted my "creative use of the minus sign" (see http://en.memory-alpha.org/wiki/Good_Shepherd_%28episode%29#Memorable_Quotes for the source of that quote). ;-)
> So, please expand that to the math I'm supposed to be implementing. Thanks!
Yes, will do that, probably on Monday.
Assignee | ||
Comment 6•13 years ago
|
||
Kairo,
How about this?
c1 = one-day total crashes
a1 = one-day ADU
c9a = 9-day average crashes
a9a = 9-day average ADU
c9m = 9-day maximum crashes
a9m = 9-day maximum ADU
t = minimum number of crashes threshold
e = explosiveness
New formula:
e = ( ( c1 / a1 ) - ( c9a / a9a ) ) - ( ( c9m / a9m ) - ( c9a / a9a ) )
/ ( c9a / a9a ) IF c1 > t
that then subtracts the maximum max-mean variance from the current crash figure. Does that work for you for the 9-day threshold?
I'll go ahead and implement that as code; we can always change it later if it's not showing what you want.
Assignee | ||
Comment 7•13 years ago
|
||
Kairo,
Actually, there's only one issue with the formula above, which is that it gives hugely inflated explosiveness numbers if the crash is brand-new today. For example, imagine we have a signature with 70 crashes per million today which never had any before:
( 70 - 0 ) - ( 0 - 0 ) / 1 == 70.0
( if there are 0 crashes, we round up to 1 to avoid DIVBYZERO )
Any thoughts on an adjustment for the "brand new crash" case?
Comment 8•13 years ago
|
||
Please just wait for me to do the cosmetic touches and make mathematical formulae out of my PHP code (as you don't want to read it otherwise) instead of randomly inventing new formulae that I need to try and understand but which are different again from anything I proposed.
Assignee | ||
Comment 9•13 years ago
|
||
Postponed to 2.5.3; awaiting readable formulae.
Target Milestone: 2.5.2 → 2.5.3
Updated•13 years ago
|
Target Milestone: 4 → 5
Comment 11•13 years ago
|
||
OK, here's a try to make this mathematical:
First of all, we have two parameters being passed to the generation of those values. For simplicity, we specify it in crashes and hardcode them (we might want to tweak it at some point, so it's a var to the algorithm):
One is a minimum crash rate in crashes per million ADU:
minrate = 10 cr/MADU
The other is a "clamp" that both prevents a division by zero and masks out some noise.
And actually, it's two values, one for each type of explosiveness (see below):
clamp_f = 30
clamp_s = 15
Then, we always want the rates of cr/MADU of the last 10 days for the total of the product+version in the report plus for every signature that was over the minrate on the analyzed day:
c0, ... , c9: crashes on day 0 (the one to analyze) and days 1-9 before that one
a0, ... , a9: ADUs on day 0, etc.
r0, ... , r9 = 1e6 * c0 / a0, ... , 1e6 * c9 / a9 [IF r0 > minrate]
We'll also need the clamp on a cr/MADU scale, so let's calculate it:
clampperadu = 1e6 * clamp / avg(a0, ..., a9)
There is one main formula for both variants (let's call them "fast-blow" and "slow-blow" instead of "1-day" and "3-day" to avoid too many numbers in names and to use an analogy to fuses):
exp = (val - base) / max(dist, clampperadu)
We are comparing the difference between a value and a base to some clamped-to-minimum distance.
Fast-blow explosiveness:
val_f = r0
base_f = avg(r1, ... , r9)
dist_f = max(r1, ... , r9) - base_f
Slow-blow explosiveness:
val_s = avg(r0, r1, r2)
base_s = avg(r3, ... , r9)
dist_s = stddev(r3, ... , r9)
The avg(), max(), and stddev() functions exist with this naming in PostgreSQL, I think, so are hopefully straight forward.
You might of course want to inline some of the extra definitions I made into a larger formula, I broke it up so it hopefully is easier to understand.
What the UI should get for display in the end are the values exp_f, exp_s, r0, ... r9 and name for the total and every analyzed signature (i.e. any over the minrate).
Assignee | ||
Comment 12•13 years ago
|
||
Kairo,
I think I've got that. Lemme try coding it and I'll see if there's still stuff I don't understand.
Assignee | ||
Comment 13•13 years ago
|
||
Kairo,
Actually, one issue: this formula has become complex enough that we don't want to calculate it at user request time. That would mean NOT making it dynamic, and only allowing filtering by a specific product/version. Is that an issue?
Comment 14•13 years ago
|
||
This can surely be pregenerated daily by a cron job in a similar way other reports.
Assignee | ||
Comment 15•13 years ago
|
||
OK, I'll take that as a "yes, I'm OK with that".
Comment 16•13 years ago
|
||
Yes, I always expected this would be a per-product-and-version daily-calculated report, as I knew it takes some math to get there.
Assignee | ||
Comment 17•13 years ago
|
||
Kairo,
I think I have the query for 1-day explosiveness. I'll start rolling it into a stored procedure right away. Below is a sample query I tested on StageDB for 1-day explosiveness.
A couple minor things:
(1) in your forumula clamperadu is compared against aggregate statistics, but is not itself aggregated. As such, I'm averaging it. If you'd like me to use min() or max() instead, let me know.
(2) I'm seeing some negative values, which makes sense for signatures which are declining in frequency.
Query:
WITH adusum AS (
SELECT adu_date, sum(adu_count) as adu_count,
( 30 * 1000000::numeric / sum(adu_count)) as mindivisor,
product_version_id
FROM product_adu
WHERE adu_date BETWEEN '2012-03-11' and '2012-03-20'
AND adu_count > 0
GROUP BY adu_date, product_version_id
),
reportsum AS (
SELECT report_date, sum(report_count) as report_count,
product_version_id, signature_id
FROM tcbs
WHERE report_date BETWEEN '2012-03-11' and '2012-03-20'
GROUP BY report_date, product_version_id, signature_id
),
crash_madu AS (
SELECT ( report_count * 1000000::numeric ) / adu_count AS crash_madu,
reportsum.product_version_id, reportsum.signature_id,
report_date, mindivisor
FROM adusum JOIN reportsum
ON adu_date = report_date
AND adusum.product_version_id = reportsum.product_version_id
),
sum1day AS (
SELECT product_version_id, signature_id, crash_madu as sum1day,
mindivisor
FROM crash_madu
WHERE report_date = '2012-03-20'
AND crash_madu > 10
),
agg9day AS (
SELECT product_version_id, signature_id,
AVG(crash_madu) AS avg9day,
MAX(crash_madu) as max9day
FROM crash_madu
WHERE report_date BETWEEN '2012-03-11' and '2012-03-19'
AND crash_madu > 10
GROUP BY product_version_id, signature_id
)
SELECT sum1day.signature_id,
sum1day.product_version_id ,
round (
( sum1day.sum1day - agg9day.avg9day )
/
GREATEST ( agg9day.max9day - agg9day.avg9day, sum1day.mindivisor )
, 2 )
as explosive_1day
FROM sum1day JOIN agg9day USING ( signature_id, product_version_id )
WHERE product_version_id = 908
ORDER BY explosive_1day DESC limit 10;
Assignee | ||
Comment 18•13 years ago
|
||
Oh, one other thing: should signatures need to appear on both reports to be listed? I'm thinking not ...
Comment 19•13 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #17)
> (1) in your forumula clamperadu is compared against aggregate statistics,
> but is not itself aggregated.
Umm, I don't understand what you are saying here. The original clamp is being divided by the average ADUs to get clampperadu, isn't that enough?
> (2) I'm seeing some negative values, which makes sense for signatures which
> are declining in frequency.
Exactly, that's intended. The explosiveness value is basically a factor of how strongly rising a crash is - negative values mean they have a downwards trend.
> WITH adusum AS (
> SELECT adu_date, sum(adu_count) as adu_count,
> ( 30 * 1000000::numeric / sum(adu_count)) as mindivisor,
> product_version_id
Umm, why sum(ad_count) - doesn't avg(adu_count) match what I have in my formula? At least if this mindivisor is clampperadu as I think it would be. Still, I'm not clear which adu_count this is, the original one or the post-as one (yes, my SQL is not good enough to know the convention there).
You using completely different variable names than me doesn't make me checking your stuff very easy though, as I have a hard time already wrapping my mind around complex SQL like that.
Also, I wonder if the 30 here (and the 10 later on) should really be put there without any names attached to them. I guess that's what one pays for more efficiency in the SQL but it makes things even harder to follow, like such "magic values/constants" usually do in code.
Still, the rest looks good from what I can tell though I'm unsure about the "AND crash_madu > 10" in the agg9day case (see below).
> ORDER BY explosive_1day DESC limit 10;
I guess that's just for the moment, right?
(In reply to [:jberkus] Josh Berkus from comment #18)
> Oh, one other thing: should signatures need to appear on both reports to be
> listed? I'm thinking not ...
Not sure what you mean by that. As r0 > minrate is the only condition restricting what appears, we calculate everything for all signature+productversion combinations that fulfill this one condition.
Assignee | ||
Comment 20•13 years ago
|
||
> Umm, I don't understand what you are saying here. The original clamp is
> being divided by the average ADUs to get clampperadu, isn't that enough?
OK, I didn't read it that way. We're in good shape then.
> Exactly, that's intended. The explosiveness value is basically a factor of
> how strongly rising a crash is - negative values mean they have a downwards
> trend.
OK, great, just checking.
> Umm, why sum(ad_count) - doesn't avg(adu_count) match what I have in my
> formula? At least if this mindivisor is clampperadu as I think it would be.
> Still, I'm not clear which adu_count this is, the original one or the
> post-as one (yes, my SQL is not good enough to know the convention there).
> You using completely different variable names than me doesn't make me
> checking your stuff very easy though, as I have a hard time already wrapping
> my mind around complex SQL like that.
The SUM() is because we're drawing from the table product_adu, which has one line for each platform, not just for each product and day. So we need to SUM() the various platforms first. The averaging of ADU happens further down in the query.
> Also, I wonder if the 30 here (and the 10 later on) should really be put
> there without any names attached to them. I guess that's what one pays for
> more efficiency in the SQL but it makes things even harder to follow, like
> such "magic values/constants" usually do in code.
That's a test query; there's no such things as variables in pure SQL. The eventual function will have variables (and comments) in it.
> Still, the rest looks good from what I can tell though I'm unsure about the
> "AND crash_madu > 10" in the agg9day case (see below).
Yeah, that was me applying the minimum of 10 crashes per day in the comparison set. I realized this morning that that didn't make sense; I see you caught it too.
>
> > ORDER BY explosive_1day DESC limit 10;
>
> I guess that's just for the moment, right?
Limit and order will be applied by the UI/Middleware code.
> Not sure what you mean by that. As r0 > minrate is the only condition
> restricting what appears, we calculate everything for all
> signature+productversion combinations that fulfill this one condition.
So, one problem which I realized with the test query; currently, if there are no rows in the comparison set (e.g. no crash reports for the prior 9 days) for a specific signature, then we get no results. I don't think that's what you intend (since a signature jumping from nothing to 400 would be explosive indeed), so I'll be fixing that too.
Assignee | ||
Comment 21•13 years ago
|
||
Ok, checked in, issued pull request.
Assignee | ||
Comment 22•13 years ago
|
||
Kairo:
So, there's a problem with your three-day formula. I can't do a STDDEV unless there's at least 3 days of data. For many signatures, there's less than that.
How do you want to modify the formula for signatures where an stddev isn't possible?
Assignee | ||
Comment 23•13 years ago
|
||
Actually, let me explain the above in more detail.
For any given signature+product_version, we have days which have 0 crashes and 0 ADU. In my current implementation, I've been ignoring those days; not counting them at all, and not caculating them into the averages.
That seems to work fine, except for STDDEV, which needs at least 3 days of data to produce a result.
The alternative is to count 0crash:0ADU days as 0 crash_MADU rather than skipping them. However, this would have the effect of skewing the results even more heavily towards new crashes. I'd also need to calculate clampperadu based on the *last* day for which we have ADU rather than for the average, or the effect of the clamp will be erased by averaging over days with 0 ADU.
So, do you want me to do that, or instead substitute something for the STDDEV?
Comment 24•13 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #23)
> The alternative is to count 0crash:0ADU days as 0 crash_MADU rather than
> skipping them.
That's what we should do there, IMHO (and what I just confirmed my code does right now). Note that the clamp will prohibit a division by zero from that fact there anyhow, as well as not make new low-volume crashes have a too high explosiveness. That's exactly (part of) what it is there for. :)
> I'd also need to calculate
> clampperadu based on the *last* day for which we have ADU rather than for
> the average, or the effect of the clamp will be erased by averaging over
> days with 0 ADU.
Why can't it be based on the days for which we do have ADUs? In any case, my current code seems to actually average over days with 0 ADUs, and making sure that the ADU average can never be smaller than 1.
Assignee | ||
Comment 25•13 years ago
|
||
> Why can't it be based on the days for which we do have ADUs? In any case, my
> current code seems to actually average over days with 0 ADUs, and making
> sure that the ADU average can never be smaller than 1.
If that works for you, I'm fine implementing it!
Updated•13 years ago
|
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [qa-]
Assignee | ||
Comment 26•13 years ago
|
||
Marking this resolved so that we can push it and developers can get started on a UI. Creating a new 6.0 bug "revise math for explosiveness" to fix the known math issues described above.
Reporter | ||
Comment 27•13 years ago
|
||
[:jberkus] Can you please post the 'final' query to be used on https://bugzilla.mozilla.org/show_bug.cgi?id=629062
That would be awesome.
You need to log in
before you can comment on or make changes to this bug.
Description
•