Remove zombie columns from parquet-imported bigquery tables in user-facing views
Categories
(Data Platform and Tools :: General, task, P3)
Tracking
(Not tracked)
People
(Reporter: bugzilla, Unassigned)
Details
Attachments
(2 files)
Because bigquery does not support deleting columns, when we imported all history, old columns that we later dropped have been resurrected (but null for recent dates.) User-facing views should drop these old columns
Bonus: we could (should?) run a script over these tables to identify null columns programmatically and come up with the EXCEPT list
Updated•6 years ago
|
Comment 1•6 years ago
|
||
Is there a list of tables that got imported from parquet into BigQuery?
Comment 2•6 years ago
|
||
We replaced most imported Parquet datasets by backfilling into historical tables and providing compatibility views. It's not clear to me which tables still exist that would have these phantom columns.
:sunahsuh Can you provide more context for which tables to target here?
Nearly all of the imported tables have zombie columns – e.g. active_experiment is back in main_summary, clients_daily, events, etc (this is an artifact from an experiment system that preceded normandy.)
I think the preferred approach would be to make this systematic anyway and create a report of which columns come up entirely null for the latest day for each table of interest, which we could run on an ongoing basis.
Comment 4•6 years ago
|
||
Tables and their zombie columns
Comment 5•6 years ago
|
||
I started writing a script to get the names of zombie columns (I'm not sure where that script should live) but there are a few tables that have zombie columns. I added an attachment that lists the tables and the names of those columns: https://bugzilla.mozilla.org/attachment.cgi?id=9121403
There were also a few tables that could not be queried and returned this error: Table was not found in location US
Comment 6•6 years ago
|
||
Here is also the script. I executed it with the following parameters to get the results: python3 null_columns.py --project=moz-fx-data-shar-nonprod-efed --dataset=telemetry --date="2020-01-15"
Comment 7•6 years ago
|
||
I created a PR with EXCEPT lists here: https://github.com/mozilla/bigquery-etl/pull/688
Comment 8•6 years ago
|
||
Updated•6 years ago
|
| Assignee | ||
Updated•3 years ago
|
Description
•