Closed Bug 1706683 Opened 5 years ago Closed 5 years ago

clients_first_seen_v1 Cannot replace a table with a different partitioning spec

Categories

(Data Platform and Tools :: General, defect)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ascholtz, Assigned: ascholtz)

Details

The dryrun for the clients_first_seen_v1 init.sql in bigquery-etl is failing with

Invalid value: Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:first_seen_date) clustering(normalized_channel,sample_id) and existing spec is none at

Some background on how this happened:

On 2021-04-19 a couple of new fields were added to clients_daily_v6 and clients_last_seen_v1 in https://github.com/mozilla/bigquery-etl/pull/1953. This caused some problems: The changes did set allow_field_addition_on_date for clients_daily_v6 in the metadata.yaml file. Having allow_field_addition_on_date set in the clients_daily_v6 metadata file, changed the order of the columns based on how they appeared in the query. This resulted in different schemas for clients_daily_v6 and clients_first_seen_v1 (I manually updated the schema by appending the new fields here) which then resulted in the clients_first_seen_v1 query failing. Since the clients_first_seen_v1 table is relatively small I decided to create a new table with the schema matching clients_daily_v6 and copied the data over. This is where I forgot to specify the correct clustering and partitioning.

Eventually, I realized that clients_last_seen also has incompatible schemas but that re-creating this table with a new schema is not an option due to the table size. So finally I ended up running the clients_daily query locally but changing the field order in the query to match the expected schema and revert back the clients_first_seen table schema (again forgetting to set the right clustering and partitioning).

Anyway, to solve this issue, I think creating a new table for clients_first_seen_v1 with the right partitioning and clustering and copying the existing data over should fix this.

Thank you for the detailed writeup here. This is definitely pointing to some more vulnerabilities in the clients_daily ETL flow and good to have the situation documented.

I recreated the table and copied the data over. The dry run is now succeeding.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.