Closed Bug 1776156 Opened 2 years ago Closed 2 years ago

Multiple tasks failing in bqetl_fxa_events Airflow DAG on 2022-06-23

Categories

(Data Platform and Tools :: General, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: akomar, Assigned: klukas)

Details

(Whiteboard: [airflow-triage])

Multiple tasks failing in bqetl_fxa_events Airflow DAG on 2022-06-23

Invalid schema update. Field
jsonPayload.fields.metricsoptoutat has changed type from STRING to FLOAT

This has happened several times before and I have usually contacted :jbuck for intervention upstream. We sometimes have had to modify the queries to skip these fields, but sometimes he has been able to get the tables recreated with the correct type.

Query: https://github.com/mozilla/bigquery-etl/blob/main/sql/moz-fx-data-shared-prod/firefox_accounts_derived/fxa_auth_events_v1/query.sql

This failure case exists because the fxa logs sink is set up with the older sharded table approach rather than using date-partitioned tables. So it's possible to have schema drift between different days.

?ni :relud - You have the most context on this fxa ETL besides me. Would you be willing to be named owner for these queries and the overall DAG going forward?

?ni :jbuck - It looks like moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_* had a type regression. Can you take a look at what's going on upstream?

Flags: needinfo?(jbuckley)
Flags: needinfo?(dthorn)

Alright, looks like it only broke on a single day so I'll fix tomorrow's schema today, then fix today's schema tomorrow.

Should I switch the BQ sink to a partitioned table? Seems like it'd fix this particular problem anyways...

Alright, looks like it only broke on a single day so I'll fix tomorrow's schema today, then fix today's schema tomorrow.

Sounds great. Thank you!

Should I switch the BQ sink to a partitioned table? Seems like it'd fix this particular problem anyways

I do think this would be a good long-term stability improvement, but we shouldn't do it piecemeal. We should transition all the logs together, since it will require coordinated changes to the queries to reference the new names and change how we reference dates.

If we can have both old and new sinks running concurrently for a few days, that would be ideal for a seamless transition.

yeah, i can be the owner on this going forward

Flags: needinfo?(dthorn)

(In reply to Daniel Thorn [:relud] from comment #4)

yeah, i can be the owner on this going forward

See https://github.com/mozilla/bigquery-etl/pull/3051

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #3)

If we can have both old and new sinks running concurrently for a few days, that would be ideal for a seamless transition.

Oh duh, right, I can just setup parallel sinks. I'll file a bug to get this migration started.

As for tomorrow's schema, I've created the table with the fixed schema.

This failed again last night, but this time with:

Cannot read field of type FLOAT64 as STRING Field: jsonPayload.fields.metricsoptoutat

:jbuck - Can you take another look?

:klukas - okay, I can fix the 23rd table now. Because I need to look this up every time, here's the query I'm going to run

SELECT
  * REPLACE (
    (
      SELECT AS STRUCT
        jsonPayload.* REPLACE (
          (
            SELECT AS STRUCT
              jsonPayload.fields.* REPLACE (
                CAST(jsonPayload.fields.metricsoptoutat AS STRING) AS metricsoptoutat
              )
          ) AS fields
        )
    ) AS jsonPayload
  )
FROM
  `moz-fx-fxa-prod-0712.fxa_prod_logs.docker_fxa_auth_20220623`

:klukas - this query has been run now, can you trigger the backfill and see if everything works correctly?

Flags: needinfo?(jbuckley) → needinfo?(jklukas)

Thank you!

Looks like the affected queries ran successfully. Just waiting on all the downstream tasks to finish now.

Flags: needinfo?(jklukas)

Downstream ETL has all finished successfully except for https://workflow.telemetry.mozilla.org/tree?dag_id=bqetl_mozilla_vpn_site_metrics where we are still failing the "empty check", which I don't understand. ?ni :relud on that.

Flags: needinfo?(dthorn)

that's unrelated to fxa events

Flags: needinfo?(dthorn)

(In reply to Daniel Thorn [:relud] from comment #11)

that's unrelated to fxa events

Then we're done here. Thanks all!

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

I filed https://mozilla-hub.atlassian.net/browse/SVCSE-644 for follow-up with partitioned tables

Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.