Last Comment Bug 790844 - per-build crash and ADU data is incomplete/strange
: per-build crash and ADU data is incomplete/strange
Status: RESOLVED FIXED
:
Product: Socorro
Classification: Server Software
Component: Database (show other bugs)
: unspecified
: All All
: -- normal (vote)
: 19
Assigned To: [:jberkus] Josh Berkus
:
:
Mentors:
Depends on:
Blocks: 790845
  Show dependency treegraph
 
Reported: 2012-09-12 19:02 PDT by Robert Kaiser
Modified: 2012-09-18 10:30 PDT (History)
7 users (show)
See Also:
QA Whiteboard:
Iteration: ---
Points: ---


Attachments

Description Robert Kaiser 2012-09-12 19:02:31 PDT
The new per-build crash and ADU data sets seem to miss some data from those aggregations on some days, at least for Firefox.

If you look at the data table on the following page, you'll see that there are huge fluctuations with both crashes and ADUs between days:
https://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate

Comparing that to the data I gather in my crashes by build reports show no such fluctuations there, and single instances of my daily reports showing higher numbers for a single day for build IDs in the days that have low numbers in Socorro, see the "Firefox Aurora" and "Firefox Nightly" sections of e.g.
https://crash-analysis.mozilla.com/rkaiser/2012-09-11/2012-09-11.buildcrashes.html
https://crash-analysis.mozilla.com/rkaiser/2012-09-10/2012-09-10.buildcrashes.html
https://crash-analysis.mozilla.com/rkaiser/2012-09-09/2012-09-09.buildcrashes.html


That said, this strange effect (or bug) isn't there for FennecAndroid, see https://crash-stats.mozilla.com/daily?form_selection=by_version&p=FennecAndroid&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate
Comment 1 Robert Kaiser 2012-09-12 19:09:34 PDT
OK, this looks to be very much related to or even the same bug as bug 790845 as numbers suddenly look realistic when I display Windows only: https://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate
Comment 2 Robert Kaiser 2012-09-13 13:52:52 PDT
OK, from what I see, it really looks like we are "randomly" taking the value of one checked platform instead of summing up all the platforms that are checked.
Comment 3 Schalk Neethling [:espressive] 2012-09-13 14:11:24 PDT
:kairo Do you see these problems in the graph or the table data or both?
Comment 4 Robert Kaiser 2012-09-13 14:13:57 PDT
Both, on the daily crashes side. Not on the front page graph, though, that is fine.
Comment 5 [:jberkus] Josh Berkus 2012-09-13 14:23:27 PDT
So, there's two bugs with Crashes By User.  One is in the database (unreasonably high crash counts and ADU), which should be my bug.  The other is in the UI, where the various OSes are not getting totalled correctly, which is Schalk's bug. 

I'm going to claim THIS bug, and make 790845 Schalk's bug.
Comment 6 [:jberkus] Josh Berkus 2012-09-13 16:28:18 PDT
So, I've been playing with this and can't find a solution which doesn't involve taking a machete to Adrian's code.

Here's the source of the problem:

 product_name | version_string | os_name  |  crash_type  | report_count |  adu  
--------------+----------------+----------+--------------+--------------+-------
 Firefox      | 18.0a1         | Linux    | OOP Plugin   |            4 |  2495
 Firefox      | 18.0a1         | Linux    | Hang Browser |            1 |  2495
 Firefox      | 18.0a1         | Linux    | Hang Plugin  |            1 |  2495
 Firefox      | 18.0a1         | Linux    | Browser      |           18 |  2495
 Firefox      | 18.0a1         | Mac OS X | Hang Browser |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | Hang Plugin  |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | OOP Plugin   |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | Browser      |           75 |  1823
 Firefox      | 18.0a1         | Windows  | Hang Plugin  |          156 | 64780
 Firefox      | 18.0a1         | Windows  | Browser      |         1127 | 64780
 Firefox      | 18.0a1         | Windows  | OOP Plugin   |          153 | 64780
 Firefox      | 18.0a1         | Windows  | Hang Browser |          156 | 64780

As you can see, we have separate ADU for each Product/Version/OS, but NOT for each crash type (which is obvious if you think about it).  This means if we do a straight sum across the ADU, the ADU gets multiplied if there are multiple crash types.

Unfortunately, Adrian is using the same code for the Home Page (which has no crash types) as for Crashes By User (which has them).  So far, I can't come up with a way to fix this which doesn't require having a completely separate code path for the two interfaces.
Comment 7 [:jberkus] Josh Berkus 2012-09-13 16:33:25 PDT
FYI, there are two reasons we didn't find this issue during testing:

1) Our automated test data is not as complex as our real data, so the problem didn't show.

2) Nobody from analytics/crashkill actually looked at the numbers when were were in dev or staging.

This is probably a lesson for the future.
Comment 8 [:jberkus] Josh Berkus 2012-09-13 18:34:23 PDT
Actually, I have a way to solve this using custom aggegates.  Will test tommorrow.
Comment 9 [:jberkus] Josh Berkus 2012-09-14 14:30:47 PDT
Ok, the quick-fix version didn't work.

I've put in a fix for product_crash_ratio on staging.  I'd like Kairo to check it.

I'm working on new code for Adrian now ...
Comment 10 [:jberkus] Josh Berkus 2012-09-14 16:59:20 PDT
Adrian,

So, bad news.  There's no way around making the query for totals MUCH more complicated.

This would be crashes by report date:

SELECT product_versions.product_name, 
	version_string, 
	report_date, 
	( sum(report_count) / avg(throttle) )::int as report_count, 
	sum(adu)::bigint as adu, 
	crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu
FROM 
	( SELECT product_version_id, report_date, os_short_name,
	    SUM(report_count)::int as report_count,
	    MAX(adu) as adu
	  FROM crashes_by_user
	    JOIN crash_types USING (crash_type_id)
	  WHERE crash_type_short IN ( 'browser','oop','hang plugin' )
	    AND include_agg
	  GROUP BY product_version_id, report_date, os_short_name ) as crcounts
JOIN product_versions USING ( product_version_id )
JOIN product_release_channels ON
	product_versions.product_name = product_release_channels.product_name
	and product_versions.build_type = product_release_channels.release_channel
WHERE product_versions.product_name='Firefox' 
AND version_string='18.0a1'
AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date 
GROUP BY product_versions.product_name, version_string, report_date
ORDER BY report_date;


And this would be crashes by build date:


SELECT product_versions.product_name, 
	version_string, 
	crcounts.build_date, 
	( sum(report_count) / avg(throttle) )::int as report_count, 
	sum(adu)::bigint as adu, 
	crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu
FROM 
	( SELECT product_version_id, report_date, 
		build_date, os_short_name,
	    SUM(report_count)::int as report_count,
	    MAX(adu) as adu
	  FROM crashes_by_user_build
	    JOIN crash_types USING (crash_type_id)
	  WHERE crash_type_short IN ( 'browser','oop','hang plugin' )
	    AND include_agg
	  GROUP BY product_version_id, report_date, build_date, os_short_name ) as crcounts
JOIN product_versions USING ( product_version_id )
JOIN product_release_channels ON
	product_versions.product_name = product_release_channels.product_name
	and product_versions.build_type = product_release_channels.release_channel
WHERE product_versions.product_name='Firefox' 
AND version_string='18.0a1'
AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date 
GROUP BY product_versions.product_name, version_string, crcounts.build_date
ORDER BY crcounts.build_date;


The only alternative I can think of is putting these in a stored procedure.
Comment 11 [:jberkus] Josh Berkus 2012-09-15 11:17:39 PDT
Adrian,

Hmmm.  There may be another solution to this without making your code much more complicated.  However, it will take me a while to work it out.  Not sure how long we have on this ...
Comment 12 [:jberkus] Josh Berkus 2012-09-15 11:23:17 PDT
Ok, checked things, and my alternate solution works out.

ADRIAN, READ THIS COMMENT FIRST.

The solution to this without messing up the mware code is for me to revise the code which builds crashes_by_user and crashes_by_user_build to have zero crashes for each crash_type which isn't represented; then I can employ math tricks to get the right totals, and adrian will only have to add 2-3 lines of code instead of 40-50.  However, given that I'm getting on an airplane tommorrow, I'm not convinced that I'll get it done in time for Wednesday.  I'm also concerned about testing it.
Comment 13 Laura Thomson :laura 2012-09-15 13:09:53 PDT
Just as an aside: fix it the right way, no duct tape here.  We'll ship it when it's ready.
Comment 14 Robert Kaiser 2012-09-15 15:24:44 PDT
(In reply to Laura Thomson :laura from comment #13)
> Just as an aside: fix it the right way, no duct tape here.  We'll ship it
> when it's ready.

Well, I agree at one level and have alarm clock ring on another here.
If I would have realized the actual extent of this one and bug 790845, I would have tried to stop and block the mobeta push for this. I wrongly thought it would be limited to the new by-build-date stuff only, unfortunately this covers by-crash-date as well and we depend on that data quite a bit in our analysis. The only reason I'm not shouting "please roll back the release" is because of this being mobeta and I know what complexity it has.

I'm all for fixing the right way but this is harming our analysis of Firefox stability in serious ways and I don't like us being in that state post Monday - I think we probably cannot really tolerate it past Wednesday because we need to be able to make reasonable stability assessments on Firefox beta.
Comment 15 [:jberkus] Josh Berkus 2012-09-15 15:36:51 PDT
Kairo,

Well, if we're going to get this turned around quickly, we really need your team to help by testing whether our numbers are correct.  As I said before, only your team knows.

Have you checked product_crash_ratio on StageDB?
Comment 16 Robert Kaiser 2012-09-15 15:41:48 PDT
(In reply to [:jberkus] Josh Berkus from comment #15)
> Have you checked product_crash_ratio on StageDB?

I have marked that TODO, but will probably not be able to test before late tomorrow my time, as I'm spending some time with my parents, brother, etc. this weekend to celebrate my birthday.
Comment 17 [:jberkus] Josh Berkus 2012-09-15 16:54:59 PDT
Ok, back in your court Adrian.  My Pull request is here:

https://github.com/mozilla/socorro/pull/823

... but please don't merge with Stage until Adrian tests this.    I've done a week of backfill on DevDB, but not on StageDB.

Adrian, you'll need to make some small changes to how you calculate totals for crashes_by_user and crashes_by_user_build, as follows:

		db_fields = [
			"product_name",
			"version_string",
			date_range_field,
			"sum(adjusted_report_count)::bigint as report_count",
			"( sum(adu) / count(distinct crash_type_short) )::bigint as adu",
			"""crash_hadu(sum(report_count)::bigint, 
			( sum(adu) / count(distinct crash_type_short) )::bigint::bigint,
avg(throttle)) as crash_hadu""",
			"avg(throttle) as throttle"
		]

Basically, where you have sum(adu) currently, you're replacing it with ( sum(adu) / count(distinct crash_type_short) ).

The above only works properly if you also add a full order by clause.  For "by version" this should be:

ORDER BY product_name, version_string, { report_date | build_date }

For "by os" it should be:

ORDER BY product_name, version_string, os_name, { report_date | build_date }

Without the ORDER BY, we'll still get the wrong totals.  If that's going to be a problem, then I'll need to add another workaround.  Thanks!
Comment 18 [:jberkus] Josh Berkus 2012-09-15 16:56:13 PDT
Adrian,

Oh, one other note: you should't feed "adjusted_report_count" to crash_hadu(), since it takes the Throttle into account.  Use "report_count" instead.  Thanks!
Comment 19 [:jberkus] Josh Berkus 2012-09-15 18:17:21 PDT
Adrian,

Actually, the stuff above about ORDER BY?  Not necessary, as it turns out.  Ignore it.  You don't need to add an ORDER BY.
Comment 20 Laura Thomson :laura 2012-09-15 18:47:04 PDT
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #14)
> (In reply to Laura Thomson :laura from comment #13)
> > Just as an aside: fix it the right way, no duct tape here.  We'll ship it
> > when it's ready.
> 
> Well, I agree at one level and have alarm clock ring on another here.
> If I would have realized the actual extent of this one and bug 790845, I
> would have tried to stop and block the mobeta push for this. I wrongly
> thought it would be limited to the new by-build-date stuff only,
> unfortunately this covers by-crash-date as well and we depend on that data
> quite a bit in our analysis. The only reason I'm not shouting "please roll
> back the release" is because of this being mobeta and I know what complexity
> it has.
> 
> I'm all for fixing the right way but this is harming our analysis of Firefox
> stability in serious ways and I don't like us being in that state post
> Monday - I think we probably cannot really tolerate it past Wednesday
> because we need to be able to make reasonable stability assessments on
> Firefox beta.

Two things:

1.  We can't ship this until you make time to test it.  

2.  It's no longer possible to roll this release back without data loss and significant downtime.  There's a small window for that because of the non-reversible database changes in this case, and we've missed it.
Comment 21 Schalk Neethling [:espressive] 2012-09-16 05:00:37 PDT
Hey there Adrian,

I see both Josh' and your pull requests are on Github. Can I merge these and pull down to also help testing? How are things looking on your end? Does it seem like these changes have solved the problem?
Comment 22 Robert Kaiser 2012-09-16 15:19:36 PDT
(In reply to Laura Thomson :laura from comment #20)
> 1.  We can't ship this until you make time to test it.  

Sorry that I take any time off at all. Maybe I should not do that any more. Don't blame me for burning out, though.

That said, I apparently don't even have the info to log into the stage DB so I can't test at all for the moment unless it gets fixed on prod.
Comment 23 Laura Thomson :laura 2012-09-17 04:52:30 PDT
Schalk, Adrian, Josh: Let's get the code (not just the database change) on staging, so anyone from crashkill can test it.
Comment 24 Schalk Neethling [:espressive] 2012-09-17 05:23:59 PDT
So, there are no code changes at the moment from my side. I believe we need to merge the two open PRs to master (stage branch?) and then the auto deploy will kick in right?

I am just wondering, the changes made by Josh, does this need manual intervention or, did Josh already run everything on dev and stage and the PR is just a code check in to have it in the repo when deploying to other env?

Just let me know how I can help.
Comment 25 Laura Thomson :laura 2012-09-17 05:40:24 PDT
Schalk: so there are changes to the DB (from Josh) - these have already been applied to stage - and changes to the code (from Adrian), which need to land in staging (and master) branches if they haven't already done so.
Comment 26 Adrian Gaudebert [:adrian] 2012-09-17 06:12:03 PDT
Josh: so I made my changes before reading all your comments here... The pull request I have opened yesterday ( https://github.com/mozilla/socorro/pull/824 ) uses the first way you showed, using a sub-query in my query. It wasn't that long to do, the code didn't change much and I believe it fixes our issues (I see very similar data between the home page and the daily crashes page). I also improved unit tests with data that reflect a bit more the complexity of our prod numbers, and I fixed the problem with adjusted_report_count being used for crash_hadu calculation. 

What do we do now? I believe either your or my pull request is useless as is. Do we merge mine in and drop yours, or do we merge yours and I rework mine? Assuming my pull request works as expected the first solution would be faster (and no db change would be required during the release).
Comment 27 [:jberkus] Josh Berkus 2012-09-17 08:18:33 PDT
Adrian,

If the subselect is working for you, it'll save us some storage space ( do read to the end of the comments before coding in the future, though!).   Either approach is equally valid.

If so, we want to remove my last commit from my pull request.  The other 3 commits we want to keep; those are other types of cleanup.
Comment 28 [:jberkus] Josh Berkus 2012-09-17 08:19:40 PDT
Adrian,

On second thought: your code will still work whether we push my changes or not.  As such, let's just keep it all in.  Your changes just mean we don't have to backfill.
Comment 29 [github robot] 2012-09-17 08:58:21 PDT
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/97102761503d862f4ca7907d2529b6315be517fb
Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware.

https://github.com/mozilla/socorro/commit/5acf7352ef1d3ddc49930cc13a742ab77980a953
Merge pull request #824 from AdrianGaudebert/790844-fix-sql-for-crashes-by-user

Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware.
Comment 30 Robert Kaiser 2012-09-17 11:05:21 PDT
(In reply to [:jberkus] Josh Berkus from comment #15)
> Have you checked product_crash_ratio on StageDB?

I have now:

breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC;
  adu_date  | version | crashes |    adu    
------------+---------+---------+-----------
 2012-09-15 | 15.0.1  | 1303510 | 176486874
 2012-09-14 | 15.0.1  | 1324640 | 215200008
 2012-09-13 | 15.0.1  | 1307350 | 216329748
(3 rows)


Those are the same numbers as for prod, so it's obvious that they are wrong. Those are more ADUs than Firefox has as a whole, so surely a wrong number for a single release.
Comment 31 Matt Brandt [:mbrandt] 2012-09-17 11:12:07 PDT
Bumping to reopened per comment 30. Adrian noted that the fixes should be on stage.
Comment 32 Robert Kaiser 2012-09-17 13:15:33 PDT
OK, mpressman applied the fix to stageDB again after the DB refresh had re-set it. Now things look good there:

breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC;
  adu_date  | version | crashes |   adu    
------------+---------+---------+----------
 2012-09-15 | 15.0.1  | 1303510 | 58828958
 2012-09-14 | 15.0.1  | 1324640 | 71733336
 2012-09-13 | 15.0.1  | 1307350 | 72109916
(3 rows)

Those numbers sound totally plausible to me, and the ones for other versions look good as well.
Comment 33 Matt Brandt [:mbrandt] 2012-09-17 13:17:07 PDT
Fantastic .. thanks Kairo.

Bumping to verified per comment 32.
Comment 34 [github robot] 2012-09-17 13:43:48 PDT
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/02623394e260d3250b263482e6c35dc706863f65
Changes to crash accounting in order to fix bug 790844

https://github.com/mozilla/socorro/commit/475cae10c234411036cdc612cbd2656d9106294d
Merge pull request #823 from jberkus/master

Changes to crashes_by_user and product_crash_ratio resolve bug 790844
Comment 35 Adrian Gaudebert [:adrian] 2012-09-17 13:57:57 PDT
I forgot to insert something in the SQL query to filter crashes by type. We don't want to include browser hangs in our graphs for example. There's a field in the db that gives that information, just need to add a simple filter.
Comment 36 Adrian Gaudebert [:adrian] 2012-09-17 14:03:00 PDT
Actually, Josh is going to work on it in the database directly, thus assigning to him.
Comment 37 [github robot] 2012-09-17 14:54:54 PDT
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/7b6201ad2f3d1d8fb75df192de5be0cb60c91b1c
Merge pull request #829 from jberkus/master

Changes to crashes_by_user VIEWs resolve bug 790844

Note You need to log in before you can comment on or make changes to this bug.