Closed
Bug 1050387
Opened 10 years ago
Closed 8 years ago
duplicate rows in Vertica table "copy_adi_dimensional_by_date"
Categories
(Data & BI Services Team :: Data Processing : ETL : Vertica target, task)
Data & BI Services Team
Data Processing : ETL : Vertica target
x86_64
Linux
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bcolloran, Assigned: scabral)
Details
As discussed over email (on 2014-08-06) the table Vertica table "copy_adi_dimensional_by_date" appears to have duplicate rows. """Each row is meant to be a non-overlapping bin (see rows from table at below). The 'safe' query is a good band-aid, but users of the data shouldn't be required to know that the data may contain duplicates -- the point of this roll-up table is to be a 'safe' place to access BLP data without needing to know about a lot of hidden foot guns. It appears that the counts returned by 'safe' (de-duplicating) and 'unsafe' queries are identical until 2014-05-30, which I think is when some back-end changed were made. If those changes started introducing duplicate rows, we should fix whatever bug was introduced at that time; or, failing that, we should at the least have a daily job that cleans up the table and removes duplicate rows."""
Updated•10 years ago
|
Product: mozilla.org → Data & BI Services Team
Assignee | ||
Updated•9 years ago
|
Component: DB: MySQL → Data Processing : ETL : Vertica target
Assignee | ||
Comment 1•8 years ago
|
||
The duplicate rows still exist. Let's work on cleaning these up.
Assignee | ||
Updated•8 years ago
|
Assignee: server-ops-database → scabral
Assignee | ||
Comment 2•8 years ago
|
||
For example: dbadmin=> select bl_date, product, v_prod_major, prod_os, v_prod_os, channel, locale, continent_code, cntry_code, tot_requests_on_date, distro_name, distro_version, buildid ,count(*) from copy_adi_dimensional_by_date where bl_date between '2016-06-19' and '2016-06-19' group by 1,2,3,4,5,6,7,8,9,10,11,12,13 having count(*)>1; bl_date | product | v_prod_major | prod_os | v_prod_os | channel | locale | continent_code | cntry_code | tot_requests_on_date | distro_name | distro_version | buildid | count ------------+----------------------+--------------+------------+-----------+---------------------------+----------------------------------------------------+----------------+------------+----------------------+---------------+----------------+----------------+------- 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9 | Windows_NT | 6.2 | release | en-US | - | HK | 1 | default | default | 20081021 | 4 2016-06-19 | CMS%203.0%20Client%2 | 1.0.9 | Windows_NT | 5.1 | release | en-US | - | HK | 1 | default | default | 20081021 | 2 2016-06-19 | Firefox | 47.0 | Windows_NT | 6.1 | release-cck-mozillaonline | zh-CN | - | GE | 1 | MozillaOnline | 2016.4 | 20160604131506 | 2 2016-06-19 | Firefox | 47.0 | Windows_NT | 6.0 | release | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | - | US | 1 | default | default | 20160604131506 | 4 2016-06-19 | Firefox | 47.0 | Windows_NT | 6.1 | release | en-US;%20SHC;%20SHC-KIOSK;%20SHC-HTS;%20SHC-Unit-0 | - | US | 1 | default | default | 20160604131506 | 7 (5 rows) Interesting, looks like the duplicates are Windows. Checking to see if that's the case throughout.
Assignee | ||
Comment 3•8 years ago
|
||
That's mostly the case, but not entirely, I'm seeing a smattering of Linux and Darwin entries as well.
Assignee | ||
Comment 4•8 years ago
|
||
Vertica does not allow DELETE with LIMIT so I'm deleting all the duplicate rows and adding one of each back in.
Assignee | ||
Comment 5•8 years ago
|
||
Interestingly, current entries have much fewer duplicates than the past. I easily ran the dedupe on 6/1 through 6/19. I am currently running the script to dedupe 6/1/2014-12/31/2014, which I expect will take quite a while.
Assignee | ||
Comment 6•8 years ago
|
||
And indeed, the script is still running, having only gotten to 6/16/2014.
Assignee | ||
Comment 7•8 years ago
|
||
duplicate script is up to 2014-10-23
Assignee | ||
Comment 8•8 years ago
|
||
2014 is fully deduped. Starting on 2015 now.
Assignee | ||
Comment 9•8 years ago
|
||
Looks like 2015 is going much faster, so whatever issue was causing the duplicates was heavily mitigated (but not entirely fixed) sometime between August 2014 and Dec 2014. Fascinating.
Assignee | ||
Comment 10•8 years ago
|
||
Indeed, 2015 just finished, for a total runtime of 214minutes or about 3.5 hours. 2014 took over a week.
Assignee | ||
Comment 11•8 years ago
|
||
2016 is also complete. The script was made to accept an argument of "yesterday", so we just need to set it running in cron, monitor it, and check it into revision control.
Assignee | ||
Comment 12•8 years ago
|
||
Tested cron, added in cron, added in script. Used the wrapper, so output will appear in the metrics checks. Will check tomorrow to make sure all is good.
Assignee | ||
Comment 13•8 years ago
|
||
The wrapper logs appeared today - which means they are automatically verified and checked. Whee!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•