Airflow task bqetl_firefox_ios.funnel_retention_clients_week_[n] failed since exec_date 2024-03-19
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: ascholtz, Assigned: kik)
References
Details
(Whiteboard: [airflow-triage])
Attachments
(1 file)
Airflow task bqetl_firefox_ios.funnel_retention_clients_week_[n] failed since exec_date 2024-03-19
Log extract:
Duplicates detected (Expected combined set of values for columns ['client_id'] to be unique.)
| Assignee | ||
Comment 1•1 year ago
|
||
Looks like we have three clients with duplicates in this table:
SELECT
client_id
FROM `moz-fx-data-shared-prod.firefox_ios_derived.funnel_retention_clients_week_4_v1`
GROUP BY 1
HAVING COUNT(*) > 1
Yields:
client_id
a7e556a3-ced9-4506-9dcb-d6974edb1ad9
2756f2ac-a023-41a3-8d32-e3250065819c
0cd70a23-8a11-4eb7-a3cf-686080f4110d
Investigating this further it appears one of the clients: 2756f2ac-a023-41a3-8d32-e3250065819c has more than one entry in the upstream table firefox_ios.baseline_clients_last_seen:
WITH duplicates AS (
SELECT
client_id
FROM `moz-fx-data-shared-prod.firefox_ios_derived.funnel_retention_clients_week_4_v1`
GROUP BY 1
HAVING COUNT(*) > 1
)
SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.firefox_ios.baseline_clients_last_seen`
INNER JOIN duplicates USING(client_id)
WHERE submission_date > "2023-03-01"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC
Result:
client_id first_seen_date normalized_channel
a7e556a3-ced9-4506-9dcb-d6974edb1ad9 2024-02-28 release
2756f2ac-a023-41a3-8d32-e3250065819c 2024-02-28 release
2756f2ac-a023-41a3-8d32-e3250065819c 2023-10-23 release
0cd70a23-8a11-4eb7-a3cf-686080f4110d 2024-02-23 release
Somehow we ended up with the client_id 2756f2ac-a023-41a3-8d32-e3250065819c having multiple entries in the baseline_clients_last_seen view for the release channel.
This duplication can be traced up to: org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1:
SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
INNER JOIN duplicates USING(client_id)
WHERE
submission_date > "2023-10-01"
AND client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC
We get 2 rows back (1 expected):
client_id first_seen_date normalized_channel
2756f2ac-a023-41a3-8d32-e3250065819c 2024-02-28 release
2756f2ac-a023-41a3-8d32-e3250065819c 2023-10-23 release
Link the the org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1 query:
https://github.com/mozilla/bigquery-etl/blob/generated-sql/sql/moz-fx-data-shared-prod/org_mozilla_ios_firefox_derived/baseline_clients_last_seen_v1/query.sql
| Assignee | ||
Comment 2•1 year ago
|
||
In this query org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1 references itself and org_mozilla_ios_firefox_derived.baseline_clients_daily_v1. Having a quick look at the upstream table:
SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_daily_v1`
INNER JOIN duplicates USING(client_id)
WHERE
submission_date > "2023-10-01"
AND client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC
We see that the client has a single entry:
client_id first_seen_date normalized_channel
2756f2ac-a023-41a3-8d32-e3250065819c 2024-02-28 release
| Assignee | ||
Comment 3•1 year ago
|
||
Will handle this specific client duplication by deleting the row with the older first_seen_date to match that returned by the daily table:
DELETE `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
AND first_seen_date = "2023-10-23"
AND submission_date IN ("2023-12-25", "2023-12-26")
To be on the safe side, :ascholtz does that DELETE statement look reasonable to you?
| Reporter | ||
Comment 4•1 year ago
|
||
Yes, that looks good to fix the immediate issue.
Do we know why the self-reference results in the duplicates? And whether we can adjust the query to prevent this from happening in the future?
Comment 5•1 year ago
•
|
||
I backed up the data before deleting using this:
CREATE TABLE `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1_bkp_delete`
AS
SELECT *
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
AND first_seen_date = "2023-10-23"
AND submission_date IN ("2023-12-25", "2023-12-26");
And then ran the delete statement below:
DELETE
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
AND first_seen_date = "2023-10-23"
AND submission_date IN ("2023-12-25", "2023-12-26");
Comment 6•1 year ago
•
|
||
Did some research and put the findings here. There are 6 client IDs causing issues, 5 causing issues in the week 4 table because each has 2 rows, and 1 causing issues in the week 1 table, because it has 2 rows.
More details here: https://docs.google.com/spreadsheets/d/1FYH1Vngds0Hh4K-AcHqcJApbdXNSK7SxIEZ97C5E9Ic/edit?usp=sharing
Comment 7•1 year ago
|
||
Perhaps this is barking up the wrong tree, but IIRC we had some duplicate client ID issues a month or 2 ago and that was tracked down, I believe, to some table(s) not being included in Shredder.
Comment 8•1 year ago
|
||
Why don't we just include first_seen_date in the join in the retention_calculation cte (line 41 of the query in https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/firefox_ios_derived/funnel_retention_clients_week_2_v1/query.sql and 42 of of the query in https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/firefox_ios_derived/funnel_retention_clients_week_4_v1/query.sql).
I am proposing this because each client has only one first_seen_date on the base clients table (supposedly the correct first_seen_date per our new profiles definition). It follows that including the first_seen_date in the join will eliminate the duplicates in the week_[n] table.
Comment 9•1 year ago
|
||
| Assignee | ||
Comment 10•1 year ago
|
||
Looks like we have not seeing this task failing for at least the most recent 25 runs. Feel free to open a new issue or reopen this one if the error comes back.
Description
•