Closed Bug 1223788 (Opened 9 years ago, Closed 8 years ago)

Wrong ADI data for Nov 10, 2015

Categories: Socorro :: General, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: kairo, Assigned: peterbe

Details
The ADI data for yesterday, Nov 10, 2015, are wrong on Socorro, see e.g. https://crash-stats.mozilla.com/daily?p=Firefox - for some reason we only have a very small percentage of the actual number we should have. This blocks us from having any useful crash rate data for this day right now.
Updated•9 years ago
Assignee: nobody → peterbe
Reporter
Comment 1•9 years ago
As I'll be leaving work pretty soon, I have prepared this psql query to show the underlying issue in the raw_adi table:

SELECT date, SUM(adi_count) AS adi
FROM raw_adi
WHERE product_guid = '{ec8030f7-c20a-464f-9b0e-13a3a9e97384}'
  AND update_channel = 'release'
  AND product_version = '42.0'
  AND build = '20151029151421'
  AND date >= '2015-11-08'
GROUP BY date;

That query can be used to check whether the value for 2015-11-10 is in the same range as the days before. It's OK that the 8th is ~38M and the 9th is ~53M, as that's weekend vs. weekday, but the 10th should be in the same order of magnitude and it isn't (right now it ends up as 138k). So the error is already in the data in raw_adi (and almost certainly in raw_adi_log too, though I don't understand that table exactly). AFAIK, once that's fixed, a backfill for yesterday should get the matviews and UI fixed up fine (I will still need to manually re-run my custom reports).
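The order-of-magnitude sanity check described in this comment can be sketched as a standalone script. This uses an in-memory SQLite database as a stand-in for the production Postgres, with the approximate numbers quoted above; it is an illustration, not Socorro code:

```python
import sqlite3

# Stand-in for the production raw_adi table (illustrative numbers only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_adi (date TEXT, adi_count INTEGER)")
conn.executemany(
    "INSERT INTO raw_adi VALUES (?, ?)",
    [
        ("2015-11-08", 38_000_000),
        ("2015-11-09", 53_000_000),
        ("2015-11-10", 138_000),
    ],
)

rows = conn.execute(
    "SELECT date, SUM(adi_count) FROM raw_adi "
    "WHERE date >= '2015-11-08' GROUP BY date ORDER BY date"
).fetchall()

# Flag any day that dropped more than an order of magnitude vs. the
# previous day -- weekend/weekday swings are ~1.5x, not ~400x.
suspect = [
    (day, total)
    for (prev_day, prev_total), (day, total) in zip(rows, rows[1:])
    if total * 10 < prev_total
]
print(suspect)
```

With the numbers above, only 2015-11-10 trips the check, matching the reporter's observation.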
Assignee
Comment 2•9 years ago
The problem lies with Hive, and I'm not sure who to escalate this to. Before I dug into backfilling and worrying about our PG and our crontabber, I ran some queries on the SCL3 metrics server to see what the output is. I ran them like this:

$ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-08 > output.2015-11-08.log
$ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-09 > output.2015-11-09.log
$ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-10 > output.2015-11-10.log

The sizes of those output files were as follows:

-rw-r----- 1 pbengtsson pbengtsson 31301945 Nov 11 18:17 output.2015-11-08.log
-rw-r----- 1 pbengtsson pbengtsson 36182999 Nov 11 18:23 output.2015-11-09.log
-rw-r----- 1 pbengtsson pbengtsson  3393550 Nov 11 18:27 output.2015-11-10.log

And:

$ wc -l output*.log
  219543 output.2015-11-08.log
  253344 output.2015-11-09.log
   24005 output.2015-11-10.log
  496892 total
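The eyeball comparison of the per-day line counts can be automated with a simple threshold; a minimal sketch, using the counts from the `wc -l` output above (the 20% cutoff is an arbitrary choice for illustration):

```python
# Per-day record counts from the Hive extracts (values from wc -l above).
counts = {
    "2015-11-08": 219543,
    "2015-11-09": 253344,
    "2015-11-10": 24005,
}

# A healthy day should be in the same order of magnitude as the others;
# flag any day below 20% of the largest observed count.
threshold = 0.2 * max(counts.values())
bad_days = sorted(day for day, n in counts.items() if n < threshold)
print(bad_days)
```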
Comment 3•9 years ago
Nov 10th is known - weblogs moved and there was a miscommunication about where they moved to, so our proactivity did not help. This data is being re-processed now and there will be NO data loss. Also known is a dip for Nov 4th (we had a 2.5 hour data center outage). This data is NOT recoverable.
Reporter
Comment 4•9 years ago
(In reply to Sheeri Cabral [:sheeri] from comment #3)

> Nov 10th is known - weblogs moved and there was a miscommunication about
> where they moved to, so our proactivity did not help. This data is being
> re-processed now and there will be NO data loss.

OK, we should get that into Socorro once it's done so that our multi-day data views are correct.

> Also known is a dip for Nov 4th (we had a 2.5 hour data center outage). This
> data is NOT recoverable.

That one seems to be pretty small, as it's not really noticeable in the crash rate graphs - but thanks for letting us know.
Assignee
Comment 5•9 years ago
Supposedly, November 10th is now in. We're still waiting for November 11th.
Assignee
Comment 6•9 years ago
Here's how I backfill on prod:

breakpad=> select count(*) from raw_adi_logs where report_date='2015-11-10';
 count
-------
 23957
(1 row)

breakpad=> delete from raw_adi_logs where report_date='2015-11-10';
DELETE 23957

breakpad=> select count(*) from raw_adi_logs where report_date='2015-11-11';
 count
-------
     0
(1 row)

breakpad=> select count(*) from raw_adi where date='2015-11-10';
 count
-------
 23918
(1 row)

breakpad=> delete from raw_adi where date='2015-11-10';
DELETE 23918

breakpad=> select count(*) from raw_adi where date='2015-11-11';
 count
-------
     0
(1 row)

breakpad=> select count(*) from product_adu where adu_date = '2015-11-10';
 count
-------
   131
(1 row)

breakpad=> delete from product_adu where adu_date = '2015-11-10';
DELETE 131

breakpad=> select count(*) from product_adu where adu_date = '2015-11-11';
 count
-------
     0
(1 row)

breakpad=> begin;
BEGIN
breakpad=> update crontabber set next_run = next_run - interval '2 days', last_run = last_run - interval '2 days', last_success = last_success - interval '2 days' where app_name='fetch-adi-from-hive';
UPDATE 1
breakpad=> commit;
COMMIT
Assignee
Comment 7•9 years ago
Actually, since we want it to restart at Nov 10, I added one more day:

breakpad=> update crontabber set last_run = last_run - interval '1 days', last_success = last_success - interval '1 days' where app_name='fetch-adi-from-hive';
UPDATE 1

breakpad=> select app_name, next_run, last_run, last_success from crontabber where app_name='fetch-adi-from-hive';
      app_name       |        next_run        |           last_run           |      last_success
---------------------+------------------------+------------------------------+------------------------
 fetch-adi-from-hive | 2015-11-11 08:00:00+00 | 2015-11-09 08:00:03.23486+00 | 2015-11-09 08:00:00+00
(1 row)

I.e. we're now saying that the last success was on the 9th. The job will re-run and then attempt the 10th and the 11th as the next days.
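The date arithmetic behind this rewind can be sketched in plain Python. This is a simplified model of the idea (crontabber infers the days still to fetch from last_success), not crontabber's actual code; the `rewind` helper is made up for illustration:

```python
from datetime import datetime, timedelta

def rewind(last_success: datetime, days: int) -> datetime:
    # Moving last_success back makes the scheduler believe those days were
    # never fetched, so the next run picks them up again.
    return last_success - timedelta(days=days)

# The job had last recorded success at Nov 12 08:00; rewinding 3 days in
# total (2 in comment 6, 1 more here) leaves last_success on Nov 9 ...
last_success = rewind(datetime(2015, 11, 12, 8, 0), 3)

# ... so the days still to fetch are the 10th and the 11th.
to_fetch = [
    (last_success + timedelta(days=n)).date().isoformat() for n in (1, 2)
]
print(last_success, to_fetch)
```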
Reporter
Comment 8•9 years ago
OK, the manual steps somehow didn't end up with completely the right results; see the current state of raw_adi:

breakpad=> SELECT date, SUM(adi_count) as adi FROM raw_adi WHERE product_guid = '{ec8030f7-c20a-464f-9b0e-13a3a9e97384}' AND update_channel = 'release' AND date >= '2015-11-01' GROUP BY date;
    date    |    adi
------------+-----------
 2015-11-01 |  85257219
 2015-11-02 | 121852851
 2015-11-03 | 125658841
 2015-11-04 | 109660017
 2015-11-05 | 124234558
 2015-11-06 | 117263951
 2015-11-07 |  86998090
 2015-11-08 |  86847863
 2015-11-09 | 362678109
 2015-11-10 | 124553042
 2015-11-11 | 118096248
 2015-11-13 | 115967952
 2015-11-14 |  85769040
 2015-11-15 |  86776565

Note that the 9th has inflated (duplicated) numbers and the 12th is missing entirely. I guess both are caused by the manual steps you tried. Also, the missing 12th is blocking other reports from running, which currently prevents me from doing substantial parts of my job. And we should run a matview backfill for the 10th to get the numbers in the matviews corrected for that day, now that raw_adi has the right numbers.
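Both problems called out here (a day loaded twice and a day missing entirely) are easy to detect programmatically. A minimal sketch over an abbreviated slice of the numbers above; the 2x cutoff is an illustrative assumption:

```python
from datetime import date, timedelta

# date -> summed adi_count, from the query output above (abbreviated).
adi = {
    "2015-11-09": 362678109,  # inflated: rows loaded more than once
    "2015-11-10": 124553042,
    "2015-11-11": 118096248,
    "2015-11-13": 115967952,  # note: no 2015-11-12 row at all
}

days = sorted(date.fromisoformat(d) for d in adi)

# Any calendar day between the first and last that has no row is a gap.
missing = [
    (days[0] + timedelta(days=n)).isoformat()
    for n in range(1, (days[-1] - days[0]).days)
    if (days[0] + timedelta(days=n)).isoformat() not in adi
]

# A day at 2x or more of the typical level suggests a double load.
typical = sorted(adi.values())[len(adi) // 2]  # rough median
inflated = sorted(d for d, v in adi.items() if v > 2 * typical)
print(missing, inflated)
```

On this data the gap check finds the missing 12th and the level check flags the doubled 9th.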
Assignee
Comment 9•9 years ago
breakpad=> delete from raw_adi_logs where report_date >= '2015-11-09';
DELETE 1697541

breakpad=> delete from raw_adi where date >= '2015-11-09';
DELETE 1692478

breakpad=> update crontabber set next_run = next_run - interval '1 days', last_success = last_success - interval '7 days' where app_name='fetch-adi-from-hive';
UPDATE 1

breakpad=> select app_name, next_run, last_run, last_success from crontabber where app_name='fetch-adi-from-hive';
      app_name       |        next_run        |           last_run            |      last_success
---------------------+------------------------+-------------------------------+------------------------
 fetch-adi-from-hive | 2015-11-16 08:00:00+00 | 2015-11-16 08:00:02.918468+00 | 2015-11-09 08:00:00+00
(1 row)
Assignee
Comment 10•9 years ago
Full backfill for Nov 10th is complete on prod.
Reporter
Comment 11•8 years ago
And the data looks good, so let's mark this fixed. Thanks a lot for the work on this!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED