Closed Bug 1223788 Opened 9 years ago Closed 8 years ago

Wrong ADI data for Nov 10, 2015

Categories: Socorro :: General
Type: task
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: kairo; Assignee: peterbe

Details
The ADI data for yesterday, Nov 10, 2015, are wrong on Socorro (see e.g. https://crash-stats.mozilla.com/daily?p=Firefox): for some reason we only have a very small percentage of the number we should actually have.

This blocks us from having any useful crash rate data for this day right now.
Assignee: nobody → peterbe
As I'll be leaving work pretty soon, I have prepared this psql query to show the underlying issue in the raw_adi table:

SELECT date, SUM(adi_count) AS adi
FROM raw_adi
WHERE product_guid = '{ec8030f7-c20a-464f-9b0e-13a3a9e97384}'
  AND update_channel = 'release'
  AND product_version = '42.0'
  AND build = '20151029151421'
  AND date >= '2015-11-08'
GROUP BY date;

That query can be used to see whether the value for 2015-11-10 is in the same range as the days before. It's OK that the 8th is ~38M and the 9th is ~53M, as that's weekend vs. weekday, but the 10th should be in the same order of magnitude and it isn't (right now it ends up at 138k).
So the error is already in the data in raw_adi (and almost certainly in raw_adi_logs as well, though I don't understand that table exactly).
AFAIK, once that's fixed, a backfill for yesterday should get the matviews and UI fixed up fine (I will still need to manually re-run my custom reports).
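
For a broader sanity check than eyeballing one build, the same table supports a day-over-day comparison that flags anomalous dates directly. A minimal sketch, assuming the raw_adi schema used above and a PostgreSQL version with window functions; the 10x threshold is an arbitrary illustration, not anything Socorro shipped:

 -- Flag days whose release-channel ADI total drops to less than a
 -- tenth of the previous day's total.
 WITH daily AS (
     SELECT date, SUM(adi_count) AS adi
     FROM raw_adi
     WHERE product_guid = '{ec8030f7-c20a-464f-9b0e-13a3a9e97384}'
       AND update_channel = 'release'
       AND date >= '2015-11-01'
     GROUP BY date
 ), with_prev AS (
     SELECT date, adi, LAG(adi) OVER (ORDER BY date) AS prev_adi
     FROM daily
 )
 SELECT date, adi, prev_adi
 FROM with_prev
 WHERE prev_adi IS NOT NULL
   AND adi * 10 < prev_adi;

Against the broken data described above, this would flag 2015-11-10, whose total was orders of magnitude below the 9th's.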
The problem lies with Hive, and I'm not sure who to escalate this to. Before I dug into backfilling and worrying about our PG and our crontabber, I ran some queries on the SCL3 metrics server to see what the output looks like.

I ran them like this:

 $ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-08 > output.2015-11-08.log
 $ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-09 > output.2015-11-09.log
 $ /data/socorro/socorro-virtualenv/bin/python testhive.py peach-gw.peach.metrics.scl3.mozilla.com 2015-11-10 > output.2015-11-10.log

The sizes of those output files were as follows:

 -rw-r----- 1 pbengtsson pbengtsson 31301945 Nov 11 18:17 output.2015-11-08.log
 -rw-r----- 1 pbengtsson pbengtsson 36182999 Nov 11 18:23 output.2015-11-09.log
 -rw-r----- 1 pbengtsson pbengtsson  3393550 Nov 11 18:27 output.2015-11-10.log

And:

 $ wc -l output*.log
  219543 output.2015-11-08.log
  253344 output.2015-11-09.log
   24005 output.2015-11-10.log
  496892 total
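
Dumping each day and counting lines works, but a count pushed down to Hive would have exposed the gap without transferring ~30 MB per day. A sketch of such a check, runnable through the same gateway; the table name raw_adi_logs and the report_date column are assumptions about the Hive-side schema, not confirmed in this bug:

 -- HiveQL sketch; table and column names are assumed.
 SELECT report_date, COUNT(*) AS num_rows
 FROM raw_adi_logs
 WHERE report_date BETWEEN '2015-11-08' AND '2015-11-10'
 GROUP BY report_date;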
Nov 10th is known - weblogs moved and there was a miscommunication about where they moved to, so our proactivity did not help. This data is being re-processed now and there will be NO data loss.

Also known is a dip for Nov 4th (we had a 2.5 hour data center outage). This data is NOT recoverable.
(In reply to Sheeri Cabral [:sheeri] from comment #3)
> Nov 10th is known - weblogs moved and there was a miscommunication about
> where they moved to, so our proactivity did not help. This data is being
> re-processed now and there will be NO data loss.

OK, we should get that into Socorro once it's done so that our multi-day data views are correct.

> Also known is a dip for Nov 4th (we had a 2.5 hour data center outage). This
> data is NOT recoverable.

That seems to be pretty small as it's not really noticeable in the crash rate graphs - but thanks for letting us know.
Supposedly, November 10th is now in. We're still waiting for November 11th.
Here's how I backfilled on prod:

breakpad=> select count(*) from raw_adi_logs where report_date='2015-11-10';
 count
-------
 23957
(1 row)

breakpad=> delete from raw_adi_logs where report_date='2015-11-10';
DELETE 23957
breakpad=> select count(*) from raw_adi_logs where report_date='2015-11-11';
 count
-------
     0
(1 row)


breakpad=> select count(*) from raw_adi where date='2015-11-10';
 count
-------
 23918
(1 row)

breakpad=> delete from raw_adi where date='2015-11-10';
DELETE 23918
breakpad=> select count(*) from raw_adi where date='2015-11-11';
 count
-------
     0
(1 row)


breakpad=> select count(*) from product_adu where adu_date = '2015-11-10';
 count
-------
   131
(1 row)

breakpad=> delete from product_adu where adu_date = '2015-11-10';
DELETE 131
breakpad=> select count(*) from product_adu where adu_date = '2015-11-11';
 count
-------
     0
(1 row)


breakpad=> begin;
BEGIN
breakpad=> update crontabber set next_run = next_run - interval '2 days', last_run=last_run - interval '2 days', last_success=last_success - interval '2 days' where app_name='fetch-adi-from-hive';
UPDATE 1
breakpad=> commit;
COMMIT
Actually, since we want it to restart with Nov 10, I added one more day:

breakpad=> update crontabber set last_run=last_run - interval '1 days', last_success=last_success - interval '1 days' where app_name='fetch-adi-from-hive';
UPDATE 1
breakpad=> select app_name, next_run, last_run, last_success from crontabber where app_name='fetch-adi-from-hive';
      app_name       |        next_run        |           last_run           |      last_success
---------------------+------------------------+------------------------------+------------------------
 fetch-adi-from-hive | 2015-11-11 08:00:00+00 | 2015-11-09 08:00:03.23486+00 | 2015-11-09 08:00:00+00
(1 row)

I.e., this says the last success was on the 9th. That'll make the job re-run, with the next runs attempting the 10th and then the 11th.
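
For future reference, the whole reset could be issued as a single transaction, so a failure partway through can't leave some tables cleared while the crontabber state still points at the old dates. A sketch only, using the dates from this bug; the interval arithmetic still has to be checked against the actual crontabber row:

 BEGIN;
 -- Clear the bad days everywhere they landed.
 DELETE FROM raw_adi_logs WHERE report_date >= '2015-11-10';
 DELETE FROM raw_adi      WHERE date        >= '2015-11-10';
 DELETE FROM product_adu  WHERE adu_date    >= '2015-11-10';
 -- Rewind the job so crontabber re-fetches from the 10th onward.
 UPDATE crontabber
    SET next_run     = next_run     - interval '2 days',
        last_run     = last_run     - interval '2 days',
        last_success = last_success - interval '2 days'
  WHERE app_name = 'fetch-adi-from-hive';
 COMMIT;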
OK, the manual steps somehow didn't produce completely correct results; see the current state of raw_adi:

breakpad=> SELECT date,SUM(adi_count) as adi FROM raw_adi WHERE product_guid = '{ec8030f7-c20a-464f-9b0e-13a3a9e97384}' AND update_channel = 'release' AND date >= '2015-11-01' GROUP BY date;               
    date    |    adi
------------+-----------
 2015-11-01 |  85257219
 2015-11-02 | 121852851
 2015-11-03 | 125658841
 2015-11-04 | 109660017
 2015-11-05 | 124234558
 2015-11-06 | 117263951
 2015-11-07 |  86998090
 2015-11-08 |  86847863
 2015-11-09 | 362678109
 2015-11-10 | 124553042
 2015-11-11 | 118096248
 2015-11-13 | 115967952
 2015-11-14 |  85769040
 2015-11-15 |  86776565

Note that the 9th has double the numbers and the 12th is missing. I guess both are caused by the manual steps you tried. Also, the missing 12th is blocking other reports from running, which prevents me from doing substantial parts of my job.
And we should run a matview backfill for the 10th to get the numbers in the matviews corrected for that day, now that raw_adi has the right numbers.
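
For the record, a matview backfill was typically kicked off from psql. A minimal sketch, assuming the backfill_matviews() stored procedure from Socorro's PostgreSQL schema of that era; verify the function's signature on the deployed schema before running it:

 -- Rebuild the matviews for the affected day (start date, end date).
 SELECT backfill_matviews('2015-11-10', '2015-11-10');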
breakpad=> delete from raw_adi_logs where report_date >= '2015-11-09';
DELETE 1697541
breakpad=> delete from raw_adi where date >= '2015-11-09';
DELETE 1692478
breakpad=> update crontabber set next_run = next_run - interval '1 days', last_success=last_success - interval '7 days' where app_name='fetch-adi-from-hive';
UPDATE 1
breakpad=> select app_name, next_run, last_run, last_success from crontabber where app_name='fetch-adi-from-hive';
      app_name       |        next_run        |           last_run            |      last_success
---------------------+------------------------+-------------------------------+------------------------
 fetch-adi-from-hive | 2015-11-16 08:00:00+00 | 2015-11-16 08:00:02.918468+00 | 2015-11-09 08:00:00+00
(1 row)
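
Once the job has re-run, the aggregate query used earlier in this bug (without the per-build filters) can confirm the reload over the affected range; a sketch:

 SELECT date, SUM(adi_count) AS adi
 FROM raw_adi
 WHERE update_channel = 'release'
   AND date BETWEEN '2015-11-09' AND '2015-11-12'
 GROUP BY date
 ORDER BY date;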
Full backfill for Nov 10th is complete on prod.
And the data looks good, so let's mark this fixed. Thanks a lot for the work on this!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED