Closed Bug 1050387 Opened 10 years ago Closed 8 years ago

duplicate rows in Vertica table "copy_adi_dimensional_by_date"

Categories

(Data & BI Services Team :: Data Processing : ETL : Vertica target, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bcolloran, Assigned: scabral)

Details

As discussed over email (on 2014-08-06) the table Vertica table "copy_adi_dimensional_by_date" appears to have duplicate rows.

"""Each row is meant to be a non-overlapping bin (see rows from table at below). The 'safe' query is a good band-aid, but users of the data shouldn't be required to know that the data may contain duplicates -- the point of this roll-up table is to be a 'safe' place to access BLP data without needing to know about a lot of hidden foot guns.

It appears that the counts returned by 'safe' (de-duplicating) and 'unsafe' queries are identical until 2014-05-30, which I think is when some back-end changed were made. If those changes started introducing duplicate rows, we should fix whatever bug was introduced at that time; or, failing that, we should at the least have a daily job that cleans up the table and removes duplicate rows."""
Product: mozilla.org → Data & BI Services Team
Component: DB: MySQL → Data Processing : ETL : Vertica target
The duplicate rows still exist. Let's work on cleaning these up.
Assignee: server-ops-database → scabral
For example:


dbadmin=> select bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale, continent_code, cntry_code, tot_requests_on_date, distro_name, distro_version, buildid ,count(*) from copy_adi_dimensional_by_date where bl_date between '2016-06-19' and '2016-06-19' group by 1,2,3,4,5,6,7,8,9,10,11,12,13 having count(*)>1;
  bl_date   |       product        | v_prod_major |  prod_os   | v_prod_os |          channel          |                       locale                       | continent_code | cntry_code | tot_requests_on_date |  distro_name  | distro_version |    buildid     | count 
------------+----------------------+--------------+------------+-----------+---------------------------+----------------------------------------------------+----------------+------------+----------------------+---------------+----------------+----------------+-------
 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9        | Windows_NT | 6.2       | release                   | en-US                                              | -              | HK         |                    1 | default       | default        | 20081021       |     4
 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9        | Windows_NT | 5.1       | release                   | en-US                                              | -              | HK         |                    1 | default       | default        | 20081021       |     2
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.1       | release-cck-mozillaonline | zh-CN                                              | -              | GE         |                    1 | MozillaOnline | 2016.4         | 20160604131506 |     2
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.0       | release                   | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | -              | US         |                    1 | default       | default        | 20160604131506 |     4
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.1       | release                   | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | -              | US         |                    1 | default       | default        | 20160604131506 |     7
(5 rows)

Interesting, looks like the duplicates are Windows. Checking to see if that's the case throughout.
That's mostly the case, but not entirely, I'm seeing a smattering of Linux and Darwin entries as well.
Vertica does not allow DELETE with LIMIT so I'm deleting all the duplicate rows and adding one of each back in.
Interestingly, current entries have much fewer duplicates than the past. I easily ran the dedupe on 6/1 through 6/19. I am currently running the script to dedupe 6/1/2014-12/31/2014, which I expect will take quite a while.
And indeed, the script is still running, having only gotten to 6/16/2014.
duplicate script is up to 2014-10-23
2014 is fully deduped. Starting on 2015 now.
Looks like 2015 is going much faster, so whatever issue was causing the duplicates was heavily mitigated (but not entirely fixed) sometime between August 2014 and Dec 2014. Fascinating.
Indeed, 2015 just finished, for a total runtime of 214minutes or about 3.5 hours. 2014 took over a week.
2016 is also complete. The script was made to accept an argument of "yesterday", so we just need to set it running in cron, monitor it, and check it into revision control.
Tested cron, added in cron, added in script. Used the wrapper, so output will appear in the metrics checks. Will check tomorrow to make sure all is good.
The wrapper logs appeared today - which means they are automatically verified and checked. Whee!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.