Stable table not populated for topsites_impression_v1 and others on 2022-03-07
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: klukas, Unassigned)
References
Details
(Whiteboard: [dataquality])
:akim reported that contextual services event aggregates looked very low for 2022-03-07 in Looker.
Digging into the data, I could see that no topsites-impression events were included in the aggregates for that day. I then looked at the source tables:
WITH
live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(DISTINCT document_id) AS n_live
  FROM
    `moz-fx-data-shared-prod.contextual_services_live.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) >= "2022-02-10"
  GROUP BY
    1
),
stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(*) AS n_stable
  FROM
    `moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) >= "2022-02-10"
  GROUP BY
    1
)
SELECT
  *
FROM
  live
LEFT JOIN
  stable
USING
  (submission_date)
ORDER BY
  1
That shows that the count in the live table is about as expected, but the stable table is empty for 2022-03-07. Looking at the Airflow logs for copy_deduplicate_all, there's nothing immediately suspicious: the task ran at the normal time and took about as long as usual.
We need to repopulate the partition, rerun downstream ETL, and understand the root cause.
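For context on why the two sides of the comparison differ: the live table can contain duplicate deliveries of the same ping, while copy_deduplicate keeps one row per document_id when populating the stable table, so COUNT(DISTINCT document_id) on live should match COUNT(*) on stable for a healthy day. A minimal Python sketch of that invariant (the rows and helper are made up for illustration):

```python
from collections import Counter

# Hypothetical live-table contents: (submission_date, document_id) pairs.
# Duplicate deliveries of the same document_id can occur in the live table.
live_rows = [
    ("2022-03-07", "doc-a"),
    ("2022-03-07", "doc-a"),  # duplicate delivery
    ("2022-03-07", "doc-b"),
    ("2022-03-06", "doc-c"),
]

def deduplicate(rows):
    """Keep one row per document_id, mimicking copy_deduplicate."""
    seen = set()
    out = []
    for row in rows:
        if row[1] not in seen:
            seen.add(row[1])
            out.append(row)
    return out

stable_rows = deduplicate(live_rows)

# COUNT(DISTINCT document_id) per day on live...
n_live = {
    d: len({doc for dd, doc in live_rows if dd == d})
    for d in {r[0] for r in live_rows}
}
# ...should equal COUNT(*) per day on stable.
n_stable = Counter(d for d, _ in stable_rows)
```

When a stable partition is missing entirely, as here, `n_stable` for that day is absent rather than merely lower, which is what the LEFT JOIN in the query above surfaces.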
Comment 1 (Reporter) • 4 years ago
There are log lines indicating that a job ran for this table:
[2022-03-08 01:04:51,530] {{pod_launcher.py:149}} INFO - INFO:root:Completed query job for moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1$20220307 with params: [ScalarQueryParameter('start_time', 'TIMESTAMP', datetime.datetime(2022, 3, 7, 0, 0, tzinfo=<UTC>)), ScalarQueryParameter('end_time', 'TIMESTAMP', datetime.datetime(2022, 3, 8, 0, 0, tzinfo=<UTC>)), ScalarQueryParameter('num_preceding_days', 'INT64', 0)]
[2022-03-08 01:16:44,813] {{pod_launcher.py:149}} INFO - INFO:root:Processed 292663536326 bytes to populate moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1$20220307
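For reference, the `$20220307` suffix in those log lines is BigQuery's partition decorator: a write addressed to `table$YYYYMMDD` replaces only that day's partition rather than the whole table. A small sketch of how such a target string is formed (the helper name is ours, not from the job code):

```python
from datetime import date

def partition_decorator(table: str, day: date) -> str:
    """Address a single date partition, BigQuery-decorator-style."""
    return f"{table}${day:%Y%m%d}"

target = partition_decorator(
    "moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1",
    date(2022, 3, 7),
)
```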
Comment 2 (Reporter) • 4 years ago
I was able to find the jobId a10c348a-c8f2-4426-b2d4-aaf34b3e5ad4 via:
SELECT
  creation_time,
  job_id,
  destination_table
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = "QUERY"
  AND destination_table.dataset_id = 'contextual_services_stable'
  AND destination_table.table_id LIKE 'topsites_impression%'
  AND creation_time > '2022-03-01 00:00:00Z'
ORDER BY
  creation_time DESC
LIMIT
  10
Comment 3 (Reporter) • 4 years ago
That job ran successfully and populated the table. I see the table populated when time traveling to a few minutes after the query finished:
WITH
live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(DISTINCT document_id) AS n_live
  FROM
    `moz-fx-data-shared-prod.contextual_services_live.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) = "2022-03-07"
  GROUP BY
    1
),
stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(*) AS n_stable
  FROM
    `moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1`
      FOR SYSTEM_TIME AS OF '2022-03-08 01:05:00Z'
  WHERE
    DATE(submission_timestamp) = "2022-03-07"
  GROUP BY
    1
)
SELECT
  *
FROM
  live
LEFT JOIN
  stable
USING
  (submission_date)
ORDER BY
  1
LIMIT
  1000
So this partition was somehow deleted between that timestamp and when the event aggregates query ran.
Comment 4 (Reporter) • 4 years ago
The partition disappears sometime between 2022-03-08 01:47:00Z and 2022-03-08 01:48:00Z (determined by rerunning the query above with different system_time values).
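That narrowing process can be mechanized as a binary search over snapshot timestamps. A hedged Python sketch, where `partition_has_rows` stands in for issuing the FOR SYSTEM_TIME AS OF query at a given timestamp (here stubbed with an assumed deletion time, since each real call would be a BigQuery query):

```python
from datetime import datetime, timedelta

# Assumed for illustration; in reality this is what we are trying to find.
DELETION_TIME = datetime(2022, 3, 8, 1, 47, 28)

def partition_has_rows(ts: datetime) -> bool:
    """Stub for: run the time-travel query AS OF `ts` and check whether
    the 2022-03-07 partition still has rows."""
    return ts < DELETION_TIME

def bisect_deletion(lo: datetime, hi: datetime, step=timedelta(minutes=1)):
    """Narrow [lo, hi) to within `step`, maintaining the invariant that
    the partition exists at `lo` and is gone at `hi`."""
    while hi - lo > step:
        mid = lo + (hi - lo) / 2
        if partition_has_rows(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi

# Start from a snapshot known-good (comment 3) and one known-empty.
lo, hi = bisect_deletion(datetime(2022, 3, 8, 1, 5), datetime(2022, 3, 8, 2, 0))
```

With minute-level granularity this takes only a handful of queries to bracket the deletion, versus scanning every minute in the window.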
Comment 5 • 4 years ago
Given the timing, I think the following job is responsible for the truncation observed in comment 3, since it occurred at 2022-03-08T01:47:27.669973Z:
https://cloudlogging.app.goo.gl/ju2b7PaAshnqcciy7
gcloud --project moz-fx-data-bq-batch-prod alpha bq jobs describe de19647d-33e0-4346-ae77-51d8bff93f03
configuration:
dryRun: false
jobType: QUERY
query:
destinationTable:
datasetId: contextual_services_stable
projectId: moz-fx-data-shared-prod
tableId: topsites_impression_v1
priority: INTERACTIVE
query: |-
SELECT
_target.*
FROM
`moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1` AS _target
LEFT JOIN
(
SELECT
payload.scalars.parent.deletion_request_context_id AS _source_0
FROM
`moz-fx-data-shared-prod.telemetry_stable.deletion_request_v4`
WHERE
DATE(submission_timestamp) >= '2022-01-10'
AND DATE(submission_timestamp) < '2022-03-07'
)
ON
context_id = _source_0
WHERE
(_source_0 IS NULL)
AND CAST(submission_timestamp AS DATE) < '2022-03-07'
useLegacySql: false
writeDisposition: WRITE_TRUNCATE
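The failure mode is visible in that job config: the SELECT filters to rows with submission_timestamp strictly before 2022-03-07, but the result is written back with WRITE_TRUNCATE over the entire table, so any partition outside the filter window (here, 2022-03-07 itself) is wiped along with the shredded rows. A minimal Python sketch of that write semantics (in-memory rows stand in for the table; the names are ours):

```python
from datetime import date

# Hypothetical stable-table contents: one dict per row.
table = [
    {"submission_date": date(2022, 3, 6), "context_id": "keep-1"},
    {"submission_date": date(2022, 3, 6), "context_id": "shred-me"},
    {"submission_date": date(2022, 3, 7), "context_id": "keep-2"},  # newest partition
]
deletion_requests = {"shred-me"}

def shred_write_truncate(table, deletion_requests, cutoff):
    """Mimic the shredder job: select rows before `cutoff` that match no
    deletion request, then WRITE_TRUNCATE the whole table with the result.
    Rows on/after `cutoff` are silently dropped -- the bug."""
    kept = [
        row for row in table
        if row["submission_date"] < cutoff
        and row["context_id"] not in deletion_requests
    ]
    return kept  # WRITE_TRUNCATE: this replaces the entire table

new_table = shred_write_truncate(table, deletion_requests, date(2022, 3, 7))
# "keep-2" (the 2022-03-07 partition) is gone even though it matched no
# deletion request.
```

Writing the filtered result only to the partitions it actually covers (e.g. via partition decorators) would avoid clobbering the newest day; the linked shredder bug tracks the actual fix.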
Updated (Reporter) • 4 years ago
Comment 6 (Reporter) • 4 years ago
Per :whd's discovery, this is definitely due to a shredder logic bug. We reran copy_deduplicate_all and downstream ETL, so this is now in good shape. There is a separate linked bug for resolving the shredder behavior.
Comment 7 • 4 years ago
moz-fx-data-shared-prod:firefox_accounts_derived.fxa_users_last_seen_v1 was also impacted by this: until just now, the 20220307 partition remained empty, which had a cascading impact on the following days. The table it is derived from, fxa_users_daily_v1, showed no gap.
Updated • 3 years ago