duplicate rows in Vertica table "copy_adi_dimensional_by_date"

Status: RESOLVED FIXED
Product: Data & BI Services Team
Component: Data Processing : ETL : Vertica target
Reported: 3 years ago
Last modified: a year ago
Reporter: brendan c
Assigned to: sheeri

(Reporter)

Description

3 years ago
As discussed over email (on 2014-08-06), the Vertica table "copy_adi_dimensional_by_date" appears to have duplicate rows.

"""Each row is meant to be a non-overlapping bin (see rows from table at below). The 'safe' query is a good band-aid, but users of the data shouldn't be required to know that the data may contain duplicates -- the point of this roll-up table is to be a 'safe' place to access BLP data without needing to know about a lot of hidden foot guns.

It appears that the counts returned by 'safe' (de-duplicating) and 'unsafe' queries are identical until 2014-05-30, which I think is when some back-end changes were made. If those changes started introducing duplicate rows, we should fix whatever bug was introduced at that time; or, failing that, we should at least have a daily job that cleans up the table and removes duplicate rows."""
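
For reference, a minimal sketch of what the 'unsafe' versus 'safe' (de-duplicating) queries might look like, assuming the column names shown in the sample output later in this bug (the exact queries from the email may differ, and the date here is just an example):

-- "Unsafe": sums raw rows, so any duplicated row is counted more than once.
SELECT bl_date, SUM(tot_requests_on_date) AS total_requests
FROM copy_adi_dimensional_by_date
WHERE bl_date = '2014-06-01'
GROUP BY bl_date;

-- "Safe" band-aid: collapse exact duplicates in a subquery before aggregating.
SELECT bl_date, SUM(tot_requests_on_date) AS total_requests
FROM (
    SELECT DISTINCT *
    FROM copy_adi_dimensional_by_date
    WHERE bl_date = '2014-06-01'
) AS deduped
GROUP BY bl_date;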
Product: mozilla.org → Data & BI Services Team
(Assignee)

Updated

2 years ago
Component: DB: MySQL → Data Processing : ETL : Vertica target
(Assignee)

Comment 1

2 years ago
The duplicate rows still exist. Let's work on cleaning these up.
(Assignee)

Updated

2 years ago
Assignee: server-ops-database → scabral
(Assignee)

Comment 2

a year ago
For example:


dbadmin=> select bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale, continent_code, cntry_code, tot_requests_on_date, distro_name, distro_version, buildid ,count(*) from copy_adi_dimensional_by_date where bl_date between '2016-06-19' and '2016-06-19' group by 1,2,3,4,5,6,7,8,9,10,11,12,13 having count(*)>1;
  bl_date   |       product        | v_prod_major |  prod_os   | v_prod_os |          channel          |                       locale                       | continent_code | cntry_code | tot_requests_on_date |  distro_name  | distro_version |    buildid     | count 
------------+----------------------+--------------+------------+-----------+---------------------------+----------------------------------------------------+----------------+------------+----------------------+---------------+----------------+----------------+-------
 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9        | Windows_NT | 6.2       | release                   | en-US                                              | -              | HK         |                    1 | default       | default        | 20081021       |     4
 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9        | Windows_NT | 5.1       | release                   | en-US                                              | -              | HK         |                    1 | default       | default        | 20081021       |     2
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.1       | release-cck-mozillaonline | zh-CN                                              | -              | GE         |                    1 | MozillaOnline | 2016.4         | 20160604131506 |     2
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.0       | release                   | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | -              | US         |                    1 | default       | default        | 20160604131506 |     4
 2016-06-19 | Firefox              | 47.0         | Windows_NT | 6.1       | release                   | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | -              | US         |                    1 | default       | default        | 20160604131506 |     7
(5 rows)

Interesting; it looks like the duplicates are all on Windows. Checking to see whether that's the case throughout.
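
(A sketch of one way to check, reusing the same 13-column grouping; not necessarily how it was actually verified:)

-- Tally duplicate groups by OS over a recent window.
SELECT prod_os, COUNT(*) AS duplicate_groups
FROM (
    SELECT bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
           continent_code, cntry_code, tot_requests_on_date, distro_name,
           distro_version, buildid
    FROM copy_adi_dimensional_by_date
    WHERE bl_date BETWEEN '2016-06-01' AND '2016-06-19'
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
    HAVING COUNT(*) > 1
) dups
GROUP BY prod_os
ORDER BY duplicate_groups DESC;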
(Assignee)

Comment 3

a year ago
That's mostly the case, but not entirely; I'm seeing a smattering of Linux and Darwin entries as well.
(Assignee)

Comment 4

a year ago
Vertica does not allow DELETE with LIMIT, so I'm deleting all the duplicate rows and adding one copy of each back in.
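
A rough single-day sketch of that approach (assuming the 13 columns from the query in comment 2 make up the whole row; the actual script may be structured differently, and NULLs in any key column would need extra handling):

-- 1. Stash one copy of each row that appears more than once on the target date.
CREATE LOCAL TEMPORARY TABLE dedupe_tmp ON COMMIT PRESERVE ROWS AS
SELECT bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
       continent_code, cntry_code, tot_requests_on_date, distro_name,
       distro_version, buildid
FROM copy_adi_dimensional_by_date
WHERE bl_date = '2016-06-19'
GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
HAVING COUNT(*) > 1;

-- 2. Delete every copy of those rows (no DELETE ... LIMIT, so all copies go).
DELETE FROM copy_adi_dimensional_by_date
WHERE bl_date = '2016-06-19'
  AND (bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
       continent_code, cntry_code, tot_requests_on_date, distro_name,
       distro_version, buildid) IN (SELECT * FROM dedupe_tmp);

-- 3. Add exactly one copy of each back in.
INSERT INTO copy_adi_dimensional_by_date
       (bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
        continent_code, cntry_code, tot_requests_on_date, distro_name,
        distro_version, buildid)
SELECT * FROM dedupe_tmp;

COMMIT;
DROP TABLE dedupe_tmp;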
(Assignee)

Comment 5

a year ago
Interestingly, current entries have far fewer duplicates than older ones. I ran the dedupe on 6/1 through 6/19 easily. I am currently running the script to dedupe 6/1/2014 through 12/31/2014, which I expect will take quite a while.
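
A rough way to gauge how much of a range still needs work (a sketch, with the same column assumptions as in comment 2) is a per-day tally of the extra copies left to remove:

-- Extra copies remaining per day across the 2014 backfill range.
SELECT bl_date, SUM(cnt - 1) AS extra_rows_to_remove
FROM (
    SELECT bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
           continent_code, cntry_code, tot_requests_on_date, distro_name,
           distro_version, buildid, COUNT(*) AS cnt
    FROM copy_adi_dimensional_by_date
    WHERE bl_date BETWEEN '2014-06-01' AND '2014-12-31'
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
    HAVING COUNT(*) > 1
) d
GROUP BY bl_date
ORDER BY bl_date;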
(Assignee)

Comment 6

a year ago
And indeed, the script is still running, having only gotten to 6/16/2014.
(Assignee)

Comment 7

a year ago
The dedupe script is up to 2014-10-23.
(Assignee)

Comment 8

a year ago
2014 is fully deduped. Starting on 2015 now.
(Assignee)

Comment 9

a year ago
Looks like 2015 is going much faster, so whatever issue was causing the duplicates was heavily mitigated (but not entirely fixed) sometime between August 2014 and Dec 2014. Fascinating.
(Assignee)

Comment 10

a year ago
Indeed, 2015 just finished, for a total runtime of 214 minutes, or about 3.5 hours. 2014 took over a week.
(Assignee)

Comment 11

a year ago
2016 is also complete. The script was made to accept an argument of "yesterday", so we just need to set it running in cron, monitor it, and check it into revision control.
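
For the daily run, "yesterday" in Vertica is simply CURRENT_DATE - 1, so the nightly check that the wrapper could assert on might look something like this sketch:

-- Should return 0 once the daily dedupe has run for yesterday's bin.
SELECT COUNT(*) AS duplicate_groups
FROM (
    SELECT bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale,
           continent_code, cntry_code, tot_requests_on_date, distro_name,
           distro_version, buildid
    FROM copy_adi_dimensional_by_date
    WHERE bl_date = CURRENT_DATE - 1
    GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13
    HAVING COUNT(*) > 1
) d;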
(Assignee)

Comment 12

a year ago
Tested the cron job, added the cron entry, and added the script. Used the wrapper, so output will appear in the metrics checks. Will check tomorrow to make sure all is good.
(Assignee)

Comment 13

a year ago
The wrapper logs appeared today, which means they are automatically verified and checked. Whee!
Status: NEW → RESOLVED
Last Resolved: a year ago
Resolution: --- → FIXED