Closed Bug 672606 Opened 9 years ago Closed 8 years ago

Aggregate numbers and topcrashes for Nightly and Aurora channels based on build ID date instead of crash date

Categories

(Socorro :: Webapp, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Assigned: lonnen)

References

Details

(Whiteboard: Q42011wanted)

With the number of changes happening in Nightly and Aurora builds, we often have problems to see if a crash has really been fixed in builds of a certain day because people using older builds continue to crash. This also makes crash rate graphs not go down as fast as wanted after the fix has been checked in.

The idea to fix this is to use the date portion of the build ID instead of the crash date for aggregating topcrashes and numbers.

For the rates, as the current ones are crashes per (100) ADU per day, I *think* that (sum of crashes on Aurora channel with Build IDs starting with 20110719 in the 7 days starting with 2011-07-19) / (sum of daily ADU numbers of those build IDs on those days) *should* result in a number comparable to the ones we currently have - but some testing of the feasibility of that approach is probably needed.

With this change, we could also go and not care (in aggregations) about version number on those channels, but go with the channel (and the build ID date) only, just displaying the latest version on that channel in UI for informative reasons.
Discussed this with Kairo on IRC.  From a database perspective, this is fairly straightforwards; it's just another matview.

However, I really think it makes sense to have a separate UI screen for this.  Intermixing nightlies grouped by build date with betas grouped by report date would be quite confusing to the user, as well as somewhat impractical.  No idea what this UI would look like.
Assignee: nobody → chris.lonnen
Target Milestone: --- → 2.3.2
Whiteboard: Q42011wanted
Target Milestone: 2.3.2 → 2.4
Blocks: 640242
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #0)
> For the rates, as the current ones are crashes per (100) ADU per day, I
> *think* that (sum of crashes on Aurora channel with Build IDs starting with
> 20110719 in the 7 days starting with 2011-07-19) / (sum of daily ADU numbers
> of those build IDs on those days) *should* result in a number comparable to
> the ones we currently have - but some testing of the feasibility of that
> approach is probably needed.

I thought about this often, as I had the gut feeling that this calculation should be reasonable, but it looked strange somehow all the time because I couldn't mathematically / statistically explain why that would be. Today, among thinking about other stuff, it suddenly came to me how this works.

What we want there is a reasonable average crash rate for the build across several days, ideally the first week it was out there. A straight average would be the sum of all averages divided by the number of days, but that's not really what we want - we'd like a weighted average here. And here's how to get there:

Let's take the rates [r1, r2, ..., rX] be the ratio of crashes [c1, c2, ..., cX] per ADU [a1, a2, ..., aX], the total number of days (only needed for showing the way to get there) be N.
The normal average would be:
  sum(r1, r2, ..., rX) / N = r1 * 1/N + r2 * 1/N + ... + rX * 1/N
In the latter formula we have a weighing factor for every rate that is 1/N, but we want the weighing to be the relative amount of ADUs on that day, i.e. aX / sum(a1, a2, ..., aX).
That makes the weighed average:
  r1 * a1 / sum(a1...aX) + r2 * a2 / sum(a1...aX) + ... + rX * aX / sum(a1...aX) =
  (c1 / a1) * a1 / sum(a1...aX) + ... + (cX / aX) * aX / sum(a1...aX) =
  c1 / sum(a1...aX) + c2 / sum(a1...aX) + ... + cX / sum(a1...aX) =
  sum(c1 + c2 + ... + cX) / sum(a1...aX)

So, in the end, my gut feeling was right - and now we have mathematical / statistical proof that this gives us the weighed average of crash rates for those days. Phew.
Kairo,

OK, you've melted my brain now.

Can you explain in terms suitable for a DBA what sort of calculation I would be looking at to derive one day of crash stats for a build, and one day of ADU?
Lonnen,

Here's a script to generate some test data for you.  If this works, let me know and I'll turn it into a full-fledged matview.

\SET ON_ERROR_STOP ON

create or replace function utctz2date (
	timestamptz )
returns date
language sql
stable
set timezone = 'UTC'
as
$f$
SELECT $1::DATE;
$f$;

SELECT create_table_if_not_exists( 'crashes_by_build', $x$
create table crashes_by_build (
	product_version_id int not null references product_versions(product_version_id) on delete cascade,
	build_id numeric not null,
	build_date date not null,
	report_date date not null,
	crash_count int not null default 0,
	constraint crashes_by_build_key 
		primary key ( product_version_id, build_id, report_date )
);$x$, 'breakpad_rw', 
	ARRAY [ 'report_date', 'build_date' ] );
	
SET work_mem = '512MB';
	
INSERT INTO crashes_by_build
SELECT reports_clean.product_version_id, reports_clean.build, 	
	build_date(reports_clean.build),
	utctz2date(date_processed), count(*)
FROM reports_clean
	JOIN product_versions USING ( product_version_id )
	JOIN product_version_builds ON reports_clean.product_version_id =
		product_version_builds.product_version_id
		AND reports_clean.build = product_version_builds.build_id
WHERE date_processed BETWEEN '2011-11-09' AND '2011-11-16'
GROUP BY reports_clean.product_version_id, reports_clean.build, 
	build_date(build), utctz2date(date_processed);
Kairo -- I'd like to expand this to cover all channels. I understand that betas and releases would usually only have a single build, but for chemspill releases I think it would be useful. Thoughts?
note that the above doesn't restrict by any kind of a time window.  for the final version of this feature, I'll need to know how for many days after a build we want to track.
(In reply to [:jberkus] Josh Berkus from comment #3)
> Kairo,
> 
> OK, you've melted my brain now.

Sorry. I just wanted to deliver something that makes even metrics people be satisfied that our approach is correct. ;-)

> Can you explain in terms suitable for a DBA what sort of calculation I would
> be looking at to derive one day of crash stats for a build, and one day of
> ADU?

The conclusion from all my Mathematics there is easy: The average crash rate per build that we want is just a sum of all its crashes divided by a sum of its ADU values, both over the affected days. For the days, I suggest we take 7 days beginning with the one given by its build ID.

(In reply to Chris Lonnen :lonnen from comment #5)
> Kairo -- I'd like to expand this to cover all channels. I understand that
> betas and releases would usually only have a single build, but for chemspill
> releases I think it would be useful. Thoughts?

As long as the default reports for betas and releases are by crash date and the default reports for nightly and aurora by build date, I'm fine with generating both and having them available in some way.
Kairo,

Just to be sure I understand this: we're going to treat it just like a release for calculating crash ratio?  i.e, calculate ( sum(adu_count) * 100 ) / sum(crash_count) per day in the affected period?

Or did you want some kind of moving window for smoothing?
(In reply to [:jberkus] Josh Berkus from comment #8)
> Just to be sure I understand this: we're going to treat it just like a
> release for calculating crash ratio?  i.e, calculate ( sum(adu_count) * 100
> ) / sum(crash_count) per day in the affected period?


I think you're on a different ally there.
What I found is how we can get an average crash rate for a build over several days, which is sum(crashes in those days) / sum(ADUs on those days) - I didn't mention the factor 100 applied, as that's trivial anyhow.
When we are charting/calculating things by *build date* instead of *crash date*, we need that, and it felt fishy without proof to simply sum up the ADU values there, as we usually can't treat those by themselves already *daily* values as additive, but we can for this purpose.

What we'll want to end up with here is to have crash rates by build, which we can graph by build date and so get a picture of how crashy builds created at certain days were instead of how crashy the conglomerate of builds that were run on a day was, which is what we're getting right now.
Kairo,

Oh ... <light bulb going on> I get it.

Right, I'll start working on the DB code as soon as we know what's going on with 2.4.
This bug will track the matview completion, bug 640242 will track the middleware service and UI exposition of the matviews.
Target Milestone: 2.4 → ---
Component: Socorro → General
Product: Webtools → Socorro
@jberkus --

I was taking look at (https://bugzilla.mozilla.org/show_bug.cgi?id=640238), and that ticket will need similar data with a little more resolution. Crashes aggregated by build, broken down by day. You may want to take a look at that ticket before you do any more work towards this. 

@kairo --
I'm not sure yet if resolving 640238 (above) will completely subsume the solution to this one, or if it is still worth implementing separate views in the web app.
bug 640238 and the bug here are completely orthogonal, AFAIK.
Component: General → Webapp
As I understand them, both reports will show crashes aggregated by build over a date range. The solution to this bug will sum them over a 7 day window, while bug 640238 will show the same data broken down by date for the last 10 days.

If I've got it right, one matview could be sufficient for both, but they'll need separate reports.
One main difference is that here we request topcrash reports also being done by build ID date, and bug 640238 is about graphs (overall numbers) only.
By "topcrash reports" do you mean TCBS for crashes occurring on a particular build? Merely aggregating crash counts on build ID seems to be covered by bug 640238.
(In reply to Chris Lonnen :lonnen from comment #16)
> By "topcrash reports" do you mean TCBS for crashes occurring on a particular
> build?

Yes.

> Merely aggregating crash counts on build ID seems to be covered by
> bug 640238.

It seems to, though I've always had a hard time understanding what that report was really about. Let's see how that one works out before picking up any work here.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #15)
> One main difference is that here we request topcrash reports also being done
> by build ID date, and bug 640238 is about graphs (overall numbers) only.

And this just got new importance with the "Rapid Beta" plans of of release management.

Laura, I think this is the prime requirement here that we unfortunately need to put at high priority now.
Just to note this would also be great if we can get a graph of this similar to the top crashes graph.
Naoki: you will get these changes as a side effect of the rapid beta work.
MoBeta resolves this.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.