Closed Bug 790844 Opened 12 years ago Closed 12 years ago

per-build crash and ADU data is incomplete/strange

Categories

(Socorro :: Database, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: kairo, Assigned: jberkus)

References

Details

The new per-build crash and ADU data sets seem to miss some data from those aggregations on some days, at least for Firefox.

If you look at the data table on the following page, you'll see that there are huge fluctuations with both crashes and ADUs between days:
https://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate

Comparing that to the data I gather in my crashes by build reports show no such fluctuations there, and single instances of my daily reports showing higher numbers for a single day for build IDs in the days that have low numbers in Socorro, see the "Firefox Aurora" and "Firefox Nightly" sections of e.g.
https://crash-analysis.mozilla.com/rkaiser/2012-09-11/2012-09-11.buildcrashes.html
https://crash-analysis.mozilla.com/rkaiser/2012-09-10/2012-09-10.buildcrashes.html
https://crash-analysis.mozilla.com/rkaiser/2012-09-09/2012-09-09.buildcrashes.html


That said, this strange effect (or bug) isn't there for FennecAndroid, see https://crash-stats.mozilla.com/daily?form_selection=by_version&p=FennecAndroid&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate
Target Milestone: --- → 19
Assignee: nobody → sneethling
OS: Linux → All
Hardware: x86_64 → All
OK, from what I see, it really looks like we are "randomly" taking the value of one checked platform instead of summing up all the platforms that are checked.
:kairo Do you see these problems in the graph or the table data or both?
Both, on the daily crashes side. Not on the front page graph, though, that is fine.
So, there's two bugs with Crashes By User.  One is in the database (unreasonably high crash counts and ADU), which should be my bug.  The other is in the UI, where the various OSes are not getting totalled correctly, which is Schalk's bug. 

I'm going to claim THIS bug, and make 790845 Schalk's bug.
Assignee: sneethling → josh
So, I've been playing with this and can't find a solution which doesn't involve taking a machete to Adrian's code.

Here's the source of the problem:

 product_name | version_string | os_name  |  crash_type  | report_count |  adu  
--------------+----------------+----------+--------------+--------------+-------
 Firefox      | 18.0a1         | Linux    | OOP Plugin   |            4 |  2495
 Firefox      | 18.0a1         | Linux    | Hang Browser |            1 |  2495
 Firefox      | 18.0a1         | Linux    | Hang Plugin  |            1 |  2495
 Firefox      | 18.0a1         | Linux    | Browser      |           18 |  2495
 Firefox      | 18.0a1         | Mac OS X | Hang Browser |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | Hang Plugin  |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | OOP Plugin   |            2 |  1823
 Firefox      | 18.0a1         | Mac OS X | Browser      |           75 |  1823
 Firefox      | 18.0a1         | Windows  | Hang Plugin  |          156 | 64780
 Firefox      | 18.0a1         | Windows  | Browser      |         1127 | 64780
 Firefox      | 18.0a1         | Windows  | OOP Plugin   |          153 | 64780
 Firefox      | 18.0a1         | Windows  | Hang Browser |          156 | 64780

As you can see, we have separate ADU for each Product/Version/OS, but NOT for each crash type (which is obvious if you think about it).  This means if we do a straight sum across the ADU, the ADU gets multiplied if there are multiple crash types.

Unfortunately, Adrian is using the same code for the Home Page (which has no crash types) as for Crashes By User (which has them).  So far, I can't come up with a way to fix this which doesn't require having a completely separate code path for the two interfaces.
FYI, there are two reasons we didn't find this issue during testing:

1) Our automated test data is not as complex as our real data, so the problem didn't show.

2) Nobody from analytics/crashkill actually looked at the numbers when were were in dev or staging.

This is probably a lesson for the future.
Actually, I have a way to solve this using custom aggegates.  Will test tommorrow.
Ok, the quick-fix version didn't work.

I've put in a fix for product_crash_ratio on staging.  I'd like Kairo to check it.

I'm working on new code for Adrian now ...
Adrian,

So, bad news.  There's no way around making the query for totals MUCH more complicated.

This would be crashes by report date:

SELECT product_versions.product_name, 
	version_string, 
	report_date, 
	( sum(report_count) / avg(throttle) )::int as report_count, 
	sum(adu)::bigint as adu, 
	crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu
FROM 
	( SELECT product_version_id, report_date, os_short_name,
	    SUM(report_count)::int as report_count,
	    MAX(adu) as adu
	  FROM crashes_by_user
	    JOIN crash_types USING (crash_type_id)
	  WHERE crash_type_short IN ( 'browser','oop','hang plugin' )
	    AND include_agg
	  GROUP BY product_version_id, report_date, os_short_name ) as crcounts
JOIN product_versions USING ( product_version_id )
JOIN product_release_channels ON
	product_versions.product_name = product_release_channels.product_name
	and product_versions.build_type = product_release_channels.release_channel
WHERE product_versions.product_name='Firefox' 
AND version_string='18.0a1'
AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date 
GROUP BY product_versions.product_name, version_string, report_date
ORDER BY report_date;


And this would be crashes by build date:


SELECT product_versions.product_name, 
	version_string, 
	crcounts.build_date, 
	( sum(report_count) / avg(throttle) )::int as report_count, 
	sum(adu)::bigint as adu, 
	crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu
FROM 
	( SELECT product_version_id, report_date, 
		build_date, os_short_name,
	    SUM(report_count)::int as report_count,
	    MAX(adu) as adu
	  FROM crashes_by_user_build
	    JOIN crash_types USING (crash_type_id)
	  WHERE crash_type_short IN ( 'browser','oop','hang plugin' )
	    AND include_agg
	  GROUP BY product_version_id, report_date, build_date, os_short_name ) as crcounts
JOIN product_versions USING ( product_version_id )
JOIN product_release_channels ON
	product_versions.product_name = product_release_channels.product_name
	and product_versions.build_type = product_release_channels.release_channel
WHERE product_versions.product_name='Firefox' 
AND version_string='18.0a1'
AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date 
GROUP BY product_versions.product_name, version_string, crcounts.build_date
ORDER BY crcounts.build_date;


The only alternative I can think of is putting these in a stored procedure.
Assignee: josh → adrian
Adrian,

Hmmm.  There may be another solution to this without making your code much more complicated.  However, it will take me a while to work it out.  Not sure how long we have on this ...
Ok, checked things, and my alternate solution works out.

ADRIAN, READ THIS COMMENT FIRST.

The solution to this without messing up the mware code is for me to revise the code which builds crashes_by_user and crashes_by_user_build to have zero crashes for each crash_type which isn't represented; then I can employ math tricks to get the right totals, and adrian will only have to add 2-3 lines of code instead of 40-50.  However, given that I'm getting on an airplane tommorrow, I'm not convinced that I'll get it done in time for Wednesday.  I'm also concerned about testing it.
Just as an aside: fix it the right way, no duct tape here.  We'll ship it when it's ready.
(In reply to Laura Thomson :laura from comment #13)
> Just as an aside: fix it the right way, no duct tape here.  We'll ship it
> when it's ready.

Well, I agree at one level and have alarm clock ring on another here.
If I would have realized the actual extent of this one and bug 790845, I would have tried to stop and block the mobeta push for this. I wrongly thought it would be limited to the new by-build-date stuff only, unfortunately this covers by-crash-date as well and we depend on that data quite a bit in our analysis. The only reason I'm not shouting "please roll back the release" is because of this being mobeta and I know what complexity it has.

I'm all for fixing the right way but this is harming our analysis of Firefox stability in serious ways and I don't like us being in that state post Monday - I think we probably cannot really tolerate it past Wednesday because we need to be able to make reasonable stability assessments on Firefox beta.
Kairo,

Well, if we're going to get this turned around quickly, we really need your team to help by testing whether our numbers are correct.  As I said before, only your team knows.

Have you checked product_crash_ratio on StageDB?
(In reply to [:jberkus] Josh Berkus from comment #15)
> Have you checked product_crash_ratio on StageDB?

I have marked that TODO, but will probably not be able to test before late tomorrow my time, as I'm spending some time with my parents, brother, etc. this weekend to celebrate my birthday.
Ok, back in your court Adrian.  My Pull request is here:

https://github.com/mozilla/socorro/pull/823

... but please don't merge with Stage until Adrian tests this.    I've done a week of backfill on DevDB, but not on StageDB.

Adrian, you'll need to make some small changes to how you calculate totals for crashes_by_user and crashes_by_user_build, as follows:

		db_fields = [
			"product_name",
			"version_string",
			date_range_field,
			"sum(adjusted_report_count)::bigint as report_count",
			"( sum(adu) / count(distinct crash_type_short) )::bigint as adu",
			"""crash_hadu(sum(report_count)::bigint, 
			( sum(adu) / count(distinct crash_type_short) )::bigint::bigint,
avg(throttle)) as crash_hadu""",
			"avg(throttle) as throttle"
		]

Basically, where you have sum(adu) currently, you're replacing it with ( sum(adu) / count(distinct crash_type_short) ).

The above only works properly if you also add a full order by clause.  For "by version" this should be:

ORDER BY product_name, version_string, { report_date | build_date }

For "by os" it should be:

ORDER BY product_name, version_string, os_name, { report_date | build_date }

Without the ORDER BY, we'll still get the wrong totals.  If that's going to be a problem, then I'll need to add another workaround.  Thanks!
Adrian,

Oh, one other note: you should't feed "adjusted_report_count" to crash_hadu(), since it takes the Throttle into account.  Use "report_count" instead.  Thanks!
Adrian,

Actually, the stuff above about ORDER BY?  Not necessary, as it turns out.  Ignore it.  You don't need to add an ORDER BY.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #14)
> (In reply to Laura Thomson :laura from comment #13)
> > Just as an aside: fix it the right way, no duct tape here.  We'll ship it
> > when it's ready.
> 
> Well, I agree at one level and have alarm clock ring on another here.
> If I would have realized the actual extent of this one and bug 790845, I
> would have tried to stop and block the mobeta push for this. I wrongly
> thought it would be limited to the new by-build-date stuff only,
> unfortunately this covers by-crash-date as well and we depend on that data
> quite a bit in our analysis. The only reason I'm not shouting "please roll
> back the release" is because of this being mobeta and I know what complexity
> it has.
> 
> I'm all for fixing the right way but this is harming our analysis of Firefox
> stability in serious ways and I don't like us being in that state post
> Monday - I think we probably cannot really tolerate it past Wednesday
> because we need to be able to make reasonable stability assessments on
> Firefox beta.

Two things:

1.  We can't ship this until you make time to test it.  

2.  It's no longer possible to roll this release back without data loss and significant downtime.  There's a small window for that because of the non-reversible database changes in this case, and we've missed it.
Hey there Adrian,

I see both Josh' and your pull requests are on Github. Can I merge these and pull down to also help testing? How are things looking on your end? Does it seem like these changes have solved the problem?
(In reply to Laura Thomson :laura from comment #20)
> 1.  We can't ship this until you make time to test it.  

Sorry that I take any time off at all. Maybe I should not do that any more. Don't blame me for burning out, though.

That said, I apparently don't even have the info to log into the stage DB so I can't test at all for the moment unless it gets fixed on prod.
Schalk, Adrian, Josh: Let's get the code (not just the database change) on staging, so anyone from crashkill can test it.
So, there are no code changes at the moment from my side. I believe we need to merge the two open PRs to master (stage branch?) and then the auto deploy will kick in right?

I am just wondering, the changes made by Josh, does this need manual intervention or, did Josh already run everything on dev and stage and the PR is just a code check in to have it in the repo when deploying to other env?

Just let me know how I can help.
Schalk: so there are changes to the DB (from Josh) - these have already been applied to stage - and changes to the code (from Adrian), which need to land in staging (and master) branches if they haven't already done so.
Josh: so I made my changes before reading all your comments here... The pull request I have opened yesterday ( https://github.com/mozilla/socorro/pull/824 ) uses the first way you showed, using a sub-query in my query. It wasn't that long to do, the code didn't change much and I believe it fixes our issues (I see very similar data between the home page and the daily crashes page). I also improved unit tests with data that reflect a bit more the complexity of our prod numbers, and I fixed the problem with adjusted_report_count being used for crash_hadu calculation. 

What do we do now? I believe either your or my pull request is useless as is. Do we merge mine in and drop yours, or do we merge yours and I rework mine? Assuming my pull request works as expected the first solution would be faster (and no db change would be required during the release).
Adrian,

If the subselect is working for you, it'll save us some storage space ( do read to the end of the comments before coding in the future, though!).   Either approach is equally valid.

If so, we want to remove my last commit from my pull request.  The other 3 commits we want to keep; those are other types of cleanup.
Adrian,

On second thought: your code will still work whether we push my changes or not.  As such, let's just keep it all in.  Your changes just mean we don't have to backfill.
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/97102761503d862f4ca7907d2529b6315be517fb
Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware.

https://github.com/mozilla/socorro/commit/5acf7352ef1d3ddc49930cc13a742ab77980a953
Merge pull request #824 from AdrianGaudebert/790844-fix-sql-for-crashes-by-user

Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to [:jberkus] Josh Berkus from comment #15)
> Have you checked product_crash_ratio on StageDB?

I have now:

breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC;
  adu_date  | version | crashes |    adu    
------------+---------+---------+-----------
 2012-09-15 | 15.0.1  | 1303510 | 176486874
 2012-09-14 | 15.0.1  | 1324640 | 215200008
 2012-09-13 | 15.0.1  | 1307350 | 216329748
(3 rows)


Those are the same numbers as for prod, so it's obvious that they are wrong. Those are more ADUs than Firefox has as a whole, so surely a wrong number for a single release.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bumping to reopened per comment 30. Adrian noted that the fixes should be on stage.
OK, mpressman applied the fix to stageDB again after the DB refresh had re-set it. Now things look good there:

breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC;
  adu_date  | version | crashes |   adu    
------------+---------+---------+----------
 2012-09-15 | 15.0.1  | 1303510 | 58828958
 2012-09-14 | 15.0.1  | 1324640 | 71733336
 2012-09-13 | 15.0.1  | 1307350 | 72109916
(3 rows)

Those numbers sound totally plausible to me, and the ones for other versions look good as well.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Fantastic .. thanks Kairo.

Bumping to verified per comment 32.
Status: RESOLVED → VERIFIED
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/02623394e260d3250b263482e6c35dc706863f65
Changes to crash accounting in order to fix bug 790844

https://github.com/mozilla/socorro/commit/475cae10c234411036cdc612cbd2656d9106294d
Merge pull request #823 from jberkus/master

Changes to crashes_by_user and product_crash_ratio resolve bug 790844
Status: VERIFIED → RESOLVED
Closed: 12 years ago12 years ago
I forgot to insert something in the SQL query to filter crashes by type. We don't want to include browser hangs in our graphs for example. There's a field in the db that gives that information, just need to add a simple filter.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Actually, Josh is going to work on it in the database directly, thus assigning to him.
Assignee: adrian → josh
Status: REOPENED → ASSIGNED
Commit pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/7b6201ad2f3d1d8fb75df192de5be0cb60c91b1c
Merge pull request #829 from jberkus/master

Changes to crashes_by_user VIEWs resolve bug 790844
Status: ASSIGNED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.