clients_first_seen_v1 Cannot replace a table with a different partitioning spec
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: ascholtz, Assigned: ascholtz)
Details
The dryrun for the clients_first_seen_v1 init.sql in bigquery-etl is failing with
Invalid value: Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:first_seen_date) clustering(normalized_channel,sample_id) and existing spec is none at
Some background on how this happened:
On 2021-04-19 a couple of new fields were added to clients_daily_v6 and clients_last_seen_v1 in https://github.com/mozilla/bigquery-etl/pull/1953. This caused some problems: The changes did set allow_field_addition_on_date for clients_daily_v6 in the metadata.yaml file. Having allow_field_addition_on_date set in the clients_daily_v6 metadata file, changed the order of the columns based on how they appeared in the query. This resulted in different schemas for clients_daily_v6 and clients_first_seen_v1 (I manually updated the schema by appending the new fields here) which then resulted in the clients_first_seen_v1 query failing. Since the clients_first_seen_v1 table is relatively small I decided to create a new table with the schema matching clients_daily_v6 and copied the data over. This is where I forgot to specify the correct clustering and partitioning.
Eventually, I realized that clients_last_seen also has incompatible schemas but that re-creating this table with a new schema is not an option due to the table size. So finally I ended up running the clients_daily query locally but changing the field order in the query to match the expected schema and revert back the clients_first_seen table schema (again forgetting to set the right clustering and partitioning).
Anyway, to solve this issue, I think creating a new table for clients_first_seen_v1 with the right partitioning and clustering and copying the existing data over should fix this.
Comment 1•5 years ago
|
||
Thank you for the detailed writeup here. This is definitely pointing to some more vulnerabilities in the clients_daily ETL flow and good to have the situation documented.
| Assignee | ||
Comment 2•5 years ago
|
||
I recreated the table and copied the data over. The dry run is now succeeding.
Updated•3 years ago
|
Description
•