Closed Bug 1758949 Opened 4 years ago Closed 4 years ago

Stable table not populated for topsites_impression_v1 and others on 2022-03-07

Categories

(Data Platform and Tools :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Unassigned)

References

Details

(Whiteboard: [dataquality])

:akim reported that contextual services event aggregates looked very low for 2022-03-07 in Looker.

Digging into the data, I could see that there were no topsites-impression events included in the aggregates for that day. I then looked at the source tables:

WITH
  live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(DISTINCT document_id) AS n_live
  FROM
    `moz-fx-data-shared-prod.contextual_services_live.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) >= "2022-02-10"
  GROUP BY
    1),
  stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(*) AS n_stable
  FROM
    `moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) >= "2022-02-10"
  GROUP BY
    1)
SELECT
  *
FROM
  live
LEFT JOIN
  stable
USING
  (submission_date)
ORDER BY
  1

That shows that the counts in the live table are about as expected, but the stable table is empty for 2022-03-07. Looking at Airflow logs for copy_deduplicate_all, there's nothing immediately suspicious. The task ran at the normal time and lasted about as long as usual.

We need to repopulate, rerun downstream ETL, and understand root cause.

There are log lines indicating that a job ran for this table:

[2022-03-08 01:04:51,530] {{pod_launcher.py:149}} INFO - INFO:root:Completed query job for moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1$20220307 with params: [ScalarQueryParameter('start_time', 'TIMESTAMP', datetime.datetime(2022, 3, 7, 0, 0, tzinfo=<UTC>)), ScalarQueryParameter('end_time', 'TIMESTAMP', datetime.datetime(2022, 3, 8, 0, 0, tzinfo=<UTC>)), ScalarQueryParameter('num_preceding_days', 'INT64', 0)]

[2022-03-08 01:16:44,813] {{pod_launcher.py:149}} INFO - INFO:root:Processed 292663536326 bytes to populate moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1$20220307

I was able to find the jobId a10c348a-c8f2-4426-b2d4-aaf34b3e5ad4 via:

SELECT
  creation_time, job_id, destination_table
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = "QUERY"
  AND destination_table.dataset_id = 'contextual_services_stable'
  AND destination_table.table_id LIKE 'topsites_impression%'
  AND creation_time > '2022-03-01 00:00:00Z'
ORDER BY creation_time DESC
LIMIT 10

That job ran successfully and populated the table. I see the table populated when time traveling to a few minutes after the query finished:

WITH
  live AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(DISTINCT document_id) AS n_live
  FROM
    `moz-fx-data-shared-prod.contextual_services_live.topsites_impression_v1`
  WHERE
    DATE(submission_timestamp) = "2022-03-07"
  GROUP BY
    1),
  stable AS (
  SELECT
    DATE(submission_timestamp) AS submission_date,
    COUNT(*) AS n_stable
  FROM
    `moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1`
    FOR SYSTEM_TIME AS OF '2022-03-08 01:05:00Z'
  WHERE
    DATE(submission_timestamp) = "2022-03-07"
  GROUP BY
    1)
SELECT
  *
FROM
  live
LEFT JOIN
  stable
USING
  (submission_date)
ORDER BY
  1
LIMIT
  1000

So this partition was somehow deleted between that timestamp and when the event aggregates query ran.

The partition disappears sometime between 2022-03-08 01:47:00Z and 2022-03-08 01:48:00Z (determined by rerunning the above query with different `FOR SYSTEM_TIME AS OF` values to narrow the window).
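The narrowing procedure above amounts to a binary search over snapshot timestamps. A minimal sketch of that search (the `probe` function is hypothetical; in practice each probe is a `FOR SYSTEM_TIME AS OF` query that reports whether the partition still has rows at that snapshot):

```python
from datetime import datetime, timedelta

def bisect_disappearance(probe, start, end, step=timedelta(minutes=1)):
    """Binary-search the window in which the partition disappeared.
    probe(t) must return True while the data is still present at
    snapshot time t, and False once it is gone."""
    lo, hi = start, end
    while hi - lo > step:
        mid = lo + (hi - lo) / 2
        if probe(mid):
            lo = mid  # data still present; deletion happened later
        else:
            hi = mid  # data already gone; deletion happened earlier
    return lo, hi

# Stub probe matching the observations in this bug: the partition was
# present shortly after 01:05Z and gone within the 01:47Z minute.
deleted_at = datetime(2022, 3, 8, 1, 47, 28)
probe = lambda t: t < deleted_at

window = bisect_disappearance(
    probe,
    datetime(2022, 3, 8, 1, 5),   # partition known present
    datetime(2022, 3, 8, 2, 0),   # partition known absent
)
```

Each probe halves the window, so pinning the deletion to a one-minute window takes only a handful of time-travel queries.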

Given the timing, I think the following job is responsible for the truncation described in comment #3, since it ran at 2022-03-08T01:47:27.669973Z:

https://cloudlogging.app.goo.gl/ju2b7PaAshnqcciy7

gcloud --project moz-fx-data-bq-batch-prod alpha bq jobs describe de19647d-33e0-4346-ae77-51d8bff93f03

configuration:
  dryRun: false
  jobType: QUERY
  query:
    destinationTable:
      datasetId: contextual_services_stable
      projectId: moz-fx-data-shared-prod
      tableId: topsites_impression_v1
    priority: INTERACTIVE
    query: |-
      SELECT
        _target.*
      FROM
        `moz-fx-data-shared-prod.contextual_services_stable.topsites_impression_v1` AS _target
      LEFT JOIN
        (
          SELECT
            payload.scalars.parent.deletion_request_context_id AS _source_0
          FROM
            `moz-fx-data-shared-prod.telemetry_stable.deletion_request_v4`
          WHERE
            DATE(submission_timestamp) >= '2022-01-10'
            AND DATE(submission_timestamp) < '2022-03-07'
        )
      ON
        context_id = _source_0
      WHERE
        (_source_0 IS NULL)
        AND CAST(submission_timestamp AS DATE) < '2022-03-07'
    useLegacySql: false
    writeDisposition: WRITE_TRUNCATE
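Note the interaction in this configuration: the query keeps only rows with `submission_timestamp` before 2022-03-07, but the destination is the whole table (no partition decorator) with `writeDisposition: WRITE_TRUNCATE`, so every partition is replaced by the query result and the freshly copied 2022-03-07 partition ends up empty. A toy Python model of that interaction (table modeled as a dict of date partitions; all names here are illustrative, not the shredder's actual code):

```python
from collections import defaultdict

def write_truncate(table, rows):
    """Model a WRITE_TRUNCATE query into a date-partitioned table with no
    partition decorator: the entire table is replaced by the query result."""
    table.clear()
    new = defaultdict(list)
    for row in rows:
        new[row["submission_date"]].append(row)
    table.update(new)

# Stable table after copy_deduplicate: an older partition plus the fresh one.
stable = {
    "2022-03-06": [{"submission_date": "2022-03-06", "context_id": "a"}],
    "2022-03-07": [{"submission_date": "2022-03-07", "context_id": "b"}],
}

# Shredder query result: rows strictly before 2022-03-07 surviving the
# deletion-request anti-join (no deletions requested in this toy example).
surviving = [r for rows in stable.values() for r in rows
             if r["submission_date"] < "2022-03-07"]

write_truncate(stable, surviving)
# The 2022-03-07 partition is now gone even though shredder never
# intended to touch it.
```

The same query written against a `$20220307`-style partition decorator, or with a date filter matching the destination, would have left the fresh partition intact.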
See Also: → 1759019
Summary: Stable table not populated for topsites_impression_v1 on 2022-03-07 → Stable table not populated for topsites_impression_v1 and others on 2022-03-07

Per :whd's discovery, this is definitely due to a shredder logic bug. We reran copy_deduplicate_all and downstream ETL, so this is now in good shape. There is a separate linked bug for resolving the shredder behavior.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

moz-fx-data-shared-prod:firefox_accounts_derived.fxa_users_last_seen_v1 was also impacted by this: until just now, the 20220307 partition remained empty, which had a cascading impact on following days. The table it is derived from, fxa_users_daily_v1, showed no gap.
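The cascade happens because last_seen-style tables are cumulative: each day's partition is built by carrying forward the previous day's partition and resetting counters for users active that day. A toy sketch of why one empty partition loses history downstream (function and values are illustrative, not the actual fxa_users_last_seen_v1 logic):

```python
def next_last_seen(prev_partition, daily_users):
    """Toy last_seen rollup: carry every user forward from yesterday's
    partition with days_since_seen + 1, then reset to 0 for anyone
    active today."""
    out = {user: days + 1 for user, days in prev_partition.items()}
    for user in daily_users:
        out[user] = 0
    return out

p_0306 = next_last_seen({}, {"x", "y"})   # both users seen on 03-06
p_0307 = {}                               # the empty partition from this bug
p_0308 = next_last_seen(p_0307, {"x"})    # only "x" active on 03-08
# "y" vanishes from every subsequent partition even though the 03-06
# data and the daily source table were intact.
```

This is why backfilling only the missing partition is not enough for cumulative tables: every partition after the gap has to be recomputed as well.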

See Also: → 1767474
Whiteboard: [data-quality] → [dataquality]