Closed Bug 1887708 Opened 1 year ago Closed 1 year ago

Airflow task bqetl_firefox_ios.funnel_retention_clients_week_[n] failed since exec_date 2024-03-19

Categories

(Data Platform and Tools :: General, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ascholtz, Assigned: kik)

References

Details

(Whiteboard: [airflow-triage])

Attachments

(1 file)

Airflow task bqetl_firefox_ios.funnel_retention_clients_week_[n] failed since exec_date 2024-03-19

Task link:
https://workflow.telemetry.mozilla.org/dags/bqetl_firefox_ios/grid?dag_run_id=scheduled__2024-03-24T04%3A00%3A00%2B00%3A00&task_id=checks__fail_firefox_ios_derived__funnel_retention_clients_week_2__v1

Log extract:

Duplicates detected (Expected combined set of values for columns ['client_id'] to be unique.)

Looks like we have three clients with duplicates in this table:

SELECT
    client_id
  FROM `moz-fx-data-shared-prod.firefox_ios_derived.funnel_retention_clients_week_4_v1`
  GROUP BY 1
  HAVING COUNT(*) > 1

Yields:

client_id
a7e556a3-ced9-4506-9dcb-d6974edb1ad9
2756f2ac-a023-41a3-8d32-e3250065819c
0cd70a23-8a11-4eb7-a3cf-686080f4110d

Investigating this further it appears one of the clients: 2756f2ac-a023-41a3-8d32-e3250065819c has more than one entry in the upstream table firefox_ios.baseline_clients_last_seen:

WITH duplicates AS (
  SELECT
    client_id
  FROM `moz-fx-data-shared-prod.firefox_ios_derived.funnel_retention_clients_week_4_v1`
  GROUP BY 1
  HAVING COUNT(*) > 1
)

SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.firefox_ios.baseline_clients_last_seen`
INNER JOIN duplicates USING(client_id)
WHERE submission_date > "2023-03-01"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC

Result:

client_id	                                                        first_seen_date	normalized_channel
a7e556a3-ced9-4506-9dcb-d6974edb1ad9	2024-02-28	        release
2756f2ac-a023-41a3-8d32-e3250065819c	2024-02-28	        release
2756f2ac-a023-41a3-8d32-e3250065819c	2023-10-23	        release
0cd70a23-8a11-4eb7-a3cf-686080f4110d	2024-02-23	        release

Somehow we ended up with the client_id 2756f2ac-a023-41a3-8d32-e3250065819c having multiple entries in the baseline_clients_last_seen view for the release channel.

This duplication can be traced up to: org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1:

SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
INNER JOIN duplicates USING(client_id)
WHERE
  submission_date > "2023-10-01"
  AND client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC

We get 2 rows back (1 expected):

client_id	first_seen_date	normalized_channel
2756f2ac-a023-41a3-8d32-e3250065819c	2024-02-28	release
2756f2ac-a023-41a3-8d32-e3250065819c	2023-10-23	release

Link the the org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1 query:
https://github.com/mozilla/bigquery-etl/blob/generated-sql/sql/moz-fx-data-shared-prod/org_mozilla_ios_firefox_derived/baseline_clients_last_seen_v1/query.sql

In this query org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1 references itself and org_mozilla_ios_firefox_derived.baseline_clients_daily_v1. Having a quick look at the upstream table:

SELECT client_id, first_seen_date, normalized_channel
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_daily_v1`
INNER JOIN duplicates USING(client_id)
WHERE
  submission_date > "2023-10-01"
  AND client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
GROUP BY client_id, first_seen_date, normalized_channel
HAVING COUNT(*) > 1
ORDER BY client_id DESC, first_seen_date DESC

We see that the client has a single entry:

client_id								first_seen_date	normalized_channel
2756f2ac-a023-41a3-8d32-e3250065819c	2024-02-28		release

Will handle this specific client duplication by deleting the row with the older first_seen_date to match that returned by the daily table:

DELETE `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
  client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
  AND first_seen_date = "2023-10-23"
  AND submission_date IN ("2023-12-25", "2023-12-26")

To be on the safe side, :ascholtz does that DELETE statement look reasonable to you?

Flags: needinfo?(ascholtz)

Yes, that looks good to fix the immediate issue.

Do we know why the self-reference results in the duplicates? And whether we can adjust the query to prevent this from happening in the future?

Flags: needinfo?(ascholtz)

I backed up the data before deleting using this:

CREATE TABLE `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1_bkp_delete`
AS 
SELECT * 
FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
  client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
  AND first_seen_date = "2023-10-23"
  AND submission_date IN ("2023-12-25", "2023-12-26");

And then ran the delete statement below:

  DELETE 
  FROM `moz-fx-data-shared-prod.org_mozilla_ios_firefox_derived.baseline_clients_last_seen_v1`
WHERE
  client_id = "2756f2ac-a023-41a3-8d32-e3250065819c"
  AND first_seen_date = "2023-10-23"
  AND submission_date IN ("2023-12-25", "2023-12-26");

Did some research and put the findings here. There are 6 client IDs causing issues, 5 causing issues in the week 4 table because each has 2 rows, and 1 causing issues in the week 1 table, because it has 2 rows.

More details here: https://docs.google.com/spreadsheets/d/1FYH1Vngds0Hh4K-AcHqcJApbdXNSK7SxIEZ97C5E9Ic/edit?usp=sharing

Perhaps this is barking up the wrong tree, but IIRC we had some duplicate client ID issues a month or 2 ago and that was tracked down, I believe, to some table(s) not being included in Shredder.

Why don't we just include first_seen_date in the join in the retention_calculation cte (line 41 of the query in https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/firefox_ios_derived/funnel_retention_clients_week_2_v1/query.sql and 42 of of the query in https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/firefox_ios_derived/funnel_retention_clients_week_4_v1/query.sql).

I am proposing this because each client has only one first_seen_date on the base clients table (supposedly the correct first_seen_date per our new profiles definition). It follows that including the first_seen_date in the join will eliminate the duplicates in the week_[n] table.

See Also: → 1891523

Looks like we have not seeing this task failing for at least the most recent 25 runs. Feel free to open a new issue or reopen this one if the error comes back.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: