Closed
Bug 790844
Opened 12 years ago
Closed 12 years ago
per-build crash and ADU data is incomplete/strange
Categories
(Socorro :: Database, task)
Socorro
Database
Tracking
(Not tracked)
RESOLVED
FIXED
19
People
(Reporter: kairo, Assigned: jberkus)
References
Details
The new per-build crash and ADU data sets seem to miss some data from those aggregations on some days, at least for Firefox. If you look at the data table on the following page, you'll see that there are huge fluctuations with both crashes and ADUs between days: https://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate Comparing that to the data I gather in my crashes by build reports show no such fluctuations there, and single instances of my daily reports showing higher numbers for a single day for build IDs in the days that have low numbers in Socorro, see the "Firefox Aurora" and "Firefox Nightly" sections of e.g. https://crash-analysis.mozilla.com/rkaiser/2012-09-11/2012-09-11.buildcrashes.html https://crash-analysis.mozilla.com/rkaiser/2012-09-10/2012-09-10.buildcrashes.html https://crash-analysis.mozilla.com/rkaiser/2012-09-09/2012-09-09.buildcrashes.html That said, this strange effect (or bug) isn't there for FennecAndroid, see https://crash-stats.mozilla.com/daily?form_selection=by_version&p=FennecAndroid&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&os[]=Mac+OS+X&os[]=Linux&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate
Reporter | ||
Comment 1•12 years ago
|
||
OK, this looks to be very much related to or even the same bug as bug 790845 as numbers suddenly look realistic when I display Windows only: https://crash-stats.mozilla.com/daily?form_selection=by_version&p=Firefox&v[]=18.0a1&v[]=17.0a2&v[]=16.0b2&v[]=15.0.1&hang_type=any&os[]=Windows&date_range_type=build&date_start=2012-08-29&date_end=2012-09-12&submit=Generate
Updated•12 years ago
|
Target Milestone: --- → 19
Updated•12 years ago
|
Assignee: nobody → sneethling
Updated•12 years ago
|
OS: Linux → All
Hardware: x86_64 → All
Reporter | ||
Comment 2•12 years ago
|
||
OK, from what I see, it really looks like we are "randomly" taking the value of one checked platform instead of summing up all the platforms that are checked.
Comment 3•12 years ago
|
||
:kairo Do you see these problems in the graph or the table data or both?
Reporter | ||
Comment 4•12 years ago
|
||
Both, on the daily crashes side. Not on the front page graph, though, that is fine.
Assignee | ||
Comment 5•12 years ago
|
||
So, there's two bugs with Crashes By User. One is in the database (unreasonably high crash counts and ADU), which should be my bug. The other is in the UI, where the various OSes are not getting totalled correctly, which is Schalk's bug. I'm going to claim THIS bug, and make 790845 Schalk's bug.
Assignee: sneethling → josh
Assignee | ||
Comment 6•12 years ago
|
||
So, I've been playing with this and can't find a solution which doesn't involve taking a machete to Adrian's code. Here's the source of the problem: product_name | version_string | os_name | crash_type | report_count | adu --------------+----------------+----------+--------------+--------------+------- Firefox | 18.0a1 | Linux | OOP Plugin | 4 | 2495 Firefox | 18.0a1 | Linux | Hang Browser | 1 | 2495 Firefox | 18.0a1 | Linux | Hang Plugin | 1 | 2495 Firefox | 18.0a1 | Linux | Browser | 18 | 2495 Firefox | 18.0a1 | Mac OS X | Hang Browser | 2 | 1823 Firefox | 18.0a1 | Mac OS X | Hang Plugin | 2 | 1823 Firefox | 18.0a1 | Mac OS X | OOP Plugin | 2 | 1823 Firefox | 18.0a1 | Mac OS X | Browser | 75 | 1823 Firefox | 18.0a1 | Windows | Hang Plugin | 156 | 64780 Firefox | 18.0a1 | Windows | Browser | 1127 | 64780 Firefox | 18.0a1 | Windows | OOP Plugin | 153 | 64780 Firefox | 18.0a1 | Windows | Hang Browser | 156 | 64780 As you can see, we have separate ADU for each Product/Version/OS, but NOT for each crash type (which is obvious if you think about it). This means if we do a straight sum across the ADU, the ADU gets multiplied if there are multiple crash types. Unfortunately, Adrian is using the same code for the Home Page (which has no crash types) as for Crashes By User (which has them). So far, I can't come up with a way to fix this which doesn't require having a completely separate code path for the two interfaces.
Assignee | ||
Comment 7•12 years ago
|
||
FYI, there are two reasons we didn't find this issue during testing: 1) Our automated test data is not as complex as our real data, so the problem didn't show. 2) Nobody from analytics/crashkill actually looked at the numbers when were were in dev or staging. This is probably a lesson for the future.
Assignee | ||
Comment 8•12 years ago
|
||
Actually, I have a way to solve this using custom aggegates. Will test tommorrow.
Assignee | ||
Comment 9•12 years ago
|
||
Ok, the quick-fix version didn't work. I've put in a fix for product_crash_ratio on staging. I'd like Kairo to check it. I'm working on new code for Adrian now ...
Assignee | ||
Comment 10•12 years ago
|
||
Adrian, So, bad news. There's no way around making the query for totals MUCH more complicated. This would be crashes by report date: SELECT product_versions.product_name, version_string, report_date, ( sum(report_count) / avg(throttle) )::int as report_count, sum(adu)::bigint as adu, crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu FROM ( SELECT product_version_id, report_date, os_short_name, SUM(report_count)::int as report_count, MAX(adu) as adu FROM crashes_by_user JOIN crash_types USING (crash_type_id) WHERE crash_type_short IN ( 'browser','oop','hang plugin' ) AND include_agg GROUP BY product_version_id, report_date, os_short_name ) as crcounts JOIN product_versions USING ( product_version_id ) JOIN product_release_channels ON product_versions.product_name = product_release_channels.product_name and product_versions.build_type = product_release_channels.release_channel WHERE product_versions.product_name='Firefox' AND version_string='18.0a1' AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date GROUP BY product_versions.product_name, version_string, report_date ORDER BY report_date; And this would be crashes by build date: SELECT product_versions.product_name, version_string, crcounts.build_date, ( sum(report_count) / avg(throttle) )::int as report_count, sum(adu)::bigint as adu, crash_hadu(sum(report_count)::bigint, sum(adu)::bigint, avg(throttle)) as crash_hadu FROM ( SELECT product_version_id, report_date, build_date, os_short_name, SUM(report_count)::int as report_count, MAX(adu) as adu FROM crashes_by_user_build JOIN crash_types USING (crash_type_id) WHERE crash_type_short IN ( 'browser','oop','hang plugin' ) AND include_agg GROUP BY product_version_id, report_date, build_date, os_short_name ) as crcounts JOIN product_versions USING ( product_version_id ) JOIN product_release_channels ON product_versions.product_name = product_release_channels.product_name and product_versions.build_type = product_release_channels.release_channel WHERE product_versions.product_name='Firefox' AND version_string='18.0a1' AND report_date BETWEEN '2012-09-05'::date AND '2012-09-13'::date GROUP BY product_versions.product_name, version_string, crcounts.build_date ORDER BY crcounts.build_date; The only alternative I can think of is putting these in a stored procedure.
Assignee | ||
Updated•12 years ago
|
Assignee: josh → adrian
Assignee | ||
Comment 11•12 years ago
|
||
Adrian, Hmmm. There may be another solution to this without making your code much more complicated. However, it will take me a while to work it out. Not sure how long we have on this ...
Assignee | ||
Comment 12•12 years ago
|
||
Ok, checked things, and my alternate solution works out. ADRIAN, READ THIS COMMENT FIRST. The solution to this without messing up the mware code is for me to revise the code which builds crashes_by_user and crashes_by_user_build to have zero crashes for each crash_type which isn't represented; then I can employ math tricks to get the right totals, and adrian will only have to add 2-3 lines of code instead of 40-50. However, given that I'm getting on an airplane tommorrow, I'm not convinced that I'll get it done in time for Wednesday. I'm also concerned about testing it.
Comment 13•12 years ago
|
||
Just as an aside: fix it the right way, no duct tape here. We'll ship it when it's ready.
Reporter | ||
Comment 14•12 years ago
|
||
(In reply to Laura Thomson :laura from comment #13) > Just as an aside: fix it the right way, no duct tape here. We'll ship it > when it's ready. Well, I agree at one level and have alarm clock ring on another here. If I would have realized the actual extent of this one and bug 790845, I would have tried to stop and block the mobeta push for this. I wrongly thought it would be limited to the new by-build-date stuff only, unfortunately this covers by-crash-date as well and we depend on that data quite a bit in our analysis. The only reason I'm not shouting "please roll back the release" is because of this being mobeta and I know what complexity it has. I'm all for fixing the right way but this is harming our analysis of Firefox stability in serious ways and I don't like us being in that state post Monday - I think we probably cannot really tolerate it past Wednesday because we need to be able to make reasonable stability assessments on Firefox beta.
Assignee | ||
Comment 15•12 years ago
|
||
Kairo, Well, if we're going to get this turned around quickly, we really need your team to help by testing whether our numbers are correct. As I said before, only your team knows. Have you checked product_crash_ratio on StageDB?
Reporter | ||
Comment 16•12 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #15) > Have you checked product_crash_ratio on StageDB? I have marked that TODO, but will probably not be able to test before late tomorrow my time, as I'm spending some time with my parents, brother, etc. this weekend to celebrate my birthday.
Assignee | ||
Comment 17•12 years ago
|
||
Ok, back in your court Adrian. My Pull request is here: https://github.com/mozilla/socorro/pull/823 ... but please don't merge with Stage until Adrian tests this. I've done a week of backfill on DevDB, but not on StageDB. Adrian, you'll need to make some small changes to how you calculate totals for crashes_by_user and crashes_by_user_build, as follows: db_fields = [ "product_name", "version_string", date_range_field, "sum(adjusted_report_count)::bigint as report_count", "( sum(adu) / count(distinct crash_type_short) )::bigint as adu", """crash_hadu(sum(report_count)::bigint, ( sum(adu) / count(distinct crash_type_short) )::bigint::bigint, avg(throttle)) as crash_hadu""", "avg(throttle) as throttle" ] Basically, where you have sum(adu) currently, you're replacing it with ( sum(adu) / count(distinct crash_type_short) ). The above only works properly if you also add a full order by clause. For "by version" this should be: ORDER BY product_name, version_string, { report_date | build_date } For "by os" it should be: ORDER BY product_name, version_string, os_name, { report_date | build_date } Without the ORDER BY, we'll still get the wrong totals. If that's going to be a problem, then I'll need to add another workaround. Thanks!
Assignee | ||
Comment 18•12 years ago
|
||
Adrian, Oh, one other note: you should't feed "adjusted_report_count" to crash_hadu(), since it takes the Throttle into account. Use "report_count" instead. Thanks!
Assignee | ||
Comment 19•12 years ago
|
||
Adrian, Actually, the stuff above about ORDER BY? Not necessary, as it turns out. Ignore it. You don't need to add an ORDER BY.
Comment 20•12 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #14) > (In reply to Laura Thomson :laura from comment #13) > > Just as an aside: fix it the right way, no duct tape here. We'll ship it > > when it's ready. > > Well, I agree at one level and have alarm clock ring on another here. > If I would have realized the actual extent of this one and bug 790845, I > would have tried to stop and block the mobeta push for this. I wrongly > thought it would be limited to the new by-build-date stuff only, > unfortunately this covers by-crash-date as well and we depend on that data > quite a bit in our analysis. The only reason I'm not shouting "please roll > back the release" is because of this being mobeta and I know what complexity > it has. > > I'm all for fixing the right way but this is harming our analysis of Firefox > stability in serious ways and I don't like us being in that state post > Monday - I think we probably cannot really tolerate it past Wednesday > because we need to be able to make reasonable stability assessments on > Firefox beta. Two things: 1. We can't ship this until you make time to test it. 2. It's no longer possible to roll this release back without data loss and significant downtime. There's a small window for that because of the non-reversible database changes in this case, and we've missed it.
Comment 21•12 years ago
|
||
Hey there Adrian, I see both Josh' and your pull requests are on Github. Can I merge these and pull down to also help testing? How are things looking on your end? Does it seem like these changes have solved the problem?
Reporter | ||
Comment 22•12 years ago
|
||
(In reply to Laura Thomson :laura from comment #20) > 1. We can't ship this until you make time to test it. Sorry that I take any time off at all. Maybe I should not do that any more. Don't blame me for burning out, though. That said, I apparently don't even have the info to log into the stage DB so I can't test at all for the moment unless it gets fixed on prod.
Comment 23•12 years ago
|
||
Schalk, Adrian, Josh: Let's get the code (not just the database change) on staging, so anyone from crashkill can test it.
Comment 24•12 years ago
|
||
So, there are no code changes at the moment from my side. I believe we need to merge the two open PRs to master (stage branch?) and then the auto deploy will kick in right? I am just wondering, the changes made by Josh, does this need manual intervention or, did Josh already run everything on dev and stage and the PR is just a code check in to have it in the repo when deploying to other env? Just let me know how I can help.
Comment 25•12 years ago
|
||
Schalk: so there are changes to the DB (from Josh) - these have already been applied to stage - and changes to the code (from Adrian), which need to land in staging (and master) branches if they haven't already done so.
Comment 26•12 years ago
|
||
Josh: so I made my changes before reading all your comments here... The pull request I have opened yesterday ( https://github.com/mozilla/socorro/pull/824 ) uses the first way you showed, using a sub-query in my query. It wasn't that long to do, the code didn't change much and I believe it fixes our issues (I see very similar data between the home page and the daily crashes page). I also improved unit tests with data that reflect a bit more the complexity of our prod numbers, and I fixed the problem with adjusted_report_count being used for crash_hadu calculation. What do we do now? I believe either your or my pull request is useless as is. Do we merge mine in and drop yours, or do we merge yours and I rework mine? Assuming my pull request works as expected the first solution would be faster (and no db change would be required during the release).
Assignee | ||
Comment 27•12 years ago
|
||
Adrian, If the subselect is working for you, it'll save us some storage space ( do read to the end of the comments before coding in the future, though!). Either approach is equally valid. If so, we want to remove my last commit from my pull request. The other 3 commits we want to keep; those are other types of cleanup.
Assignee | ||
Comment 28•12 years ago
|
||
Adrian, On second thought: your code will still work whether we push my changes or not. As such, let's just keep it all in. Your changes just mean we don't have to backfill.
Comment 29•12 years ago
|
||
Commits pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/97102761503d862f4ca7907d2529b6315be517fb Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware. https://github.com/mozilla/socorro/commit/5acf7352ef1d3ddc49930cc13a742ab77980a953 Merge pull request #824 from AdrianGaudebert/790844-fix-sql-for-crashes-by-user Fixes bug 790844 - Corrected SQL query for daily crashes in the middleware.
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 30•12 years ago
|
||
(In reply to [:jberkus] Josh Berkus from comment #15) > Have you checked product_crash_ratio on StageDB? I have now: breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC; adu_date | version | crashes | adu ------------+---------+---------+----------- 2012-09-15 | 15.0.1 | 1303510 | 176486874 2012-09-14 | 15.0.1 | 1324640 | 215200008 2012-09-13 | 15.0.1 | 1307350 | 216329748 (3 rows) Those are the same numbers as for prod, so it's obvious that they are wrong. Those are more ADUs than Firefox has as a whole, so surely a wrong number for a single release.
Updated•12 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 31•12 years ago
|
||
Bumping to reopened per comment 30. Adrian noted that the fixes should be on stage.
Reporter | ||
Comment 32•12 years ago
|
||
OK, mpressman applied the fix to stageDB again after the DB refresh had re-set it. Now things look good there: breakpad=> SELECT adu_date, version_string as version, adjusted_crashes as crashes, adu_count as adu FROM product_crash_ratio WHERE product_name = 'Firefox' AND version_string IN ('15.0.1') AND adu_date BETWEEN '2012-09-13' and '2012-09-16' ORDER BY adu_date DESC, version DESC; adu_date | version | crashes | adu ------------+---------+---------+---------- 2012-09-15 | 15.0.1 | 1303510 | 58828958 2012-09-14 | 15.0.1 | 1324640 | 71733336 2012-09-13 | 15.0.1 | 1307350 | 72109916 (3 rows) Those numbers sound totally plausible to me, and the ones for other versions look good as well.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 33•12 years ago
|
||
Fantastic .. thanks Kairo. Bumping to verified per comment 32.
Status: RESOLVED → VERIFIED
Comment 34•12 years ago
|
||
Commits pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/02623394e260d3250b263482e6c35dc706863f65 Changes to crash accounting in order to fix bug 790844 https://github.com/mozilla/socorro/commit/475cae10c234411036cdc612cbd2656d9106294d Merge pull request #823 from jberkus/master Changes to crashes_by_user and product_crash_ratio resolve bug 790844
Updated•12 years ago
|
Status: VERIFIED → RESOLVED
Closed: 12 years ago → 12 years ago
Comment 35•12 years ago
|
||
I forgot to insert something in the SQL query to filter crashes by type. We don't want to include browser hangs in our graphs for example. There's a field in the db that gives that information, just need to add a simple filter.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 36•12 years ago
|
||
Actually, Josh is going to work on it in the database directly, thus assigning to him.
Assignee: adrian → josh
Status: REOPENED → ASSIGNED
Comment 37•12 years ago
|
||
Commit pushed to master at https://github.com/mozilla/socorro https://github.com/mozilla/socorro/commit/7b6201ad2f3d1d8fb75df192de5be0cb60c91b1c Merge pull request #829 from jberkus/master Changes to crashes_by_user VIEWs resolve bug 790844
Updated•12 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•