Closed Bug 1595141 Opened 6 years ago Closed 6 years ago

Remove zombie columns from parquet-imported bigquery tables in user-facing views

Categories

(Data Platform and Tools :: General, task, P3)

task
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bugzilla, Unassigned)

Details

Attachments

(2 files)

Because bigquery does not support deleting columns, when we imported all history, old columns that we later dropped have been resurrected (but null for recent dates.) User-facing views should drop these old columns

Bonus: we could (should?) run a script over these tables to identify null columns programmatically and come up with the EXCEPT list

Points: --- → 2
Priority: -- → P3

Is there a list of tables that got imported from parquet into BigQuery?

We replaced most imported Parquet datasets by backfilling into historical tables and providing compatibility views. It's not clear to me which tables still exist that would have these phantom columns.

:sunahsuh Can you provide more context for which tables to target here?

Flags: needinfo?(ssuh)

Nearly all of the imported tables have zombie columns – e.g. active_experiment is back in main_summary, clients_daily, events, etc (this is an artifact from an experiment system that preceded normandy.)

I think the preferred approach would be to make this systematic anyway and create a report of which columns come up entirely null for the latest day for each table of interest, which we could run on an ongoing basis.

Flags: needinfo?(ssuh)
Attached file null-columns.txt

Tables and their zombie columns

I started writing a script to get the names of zombie columns (I'm not sure where that script should live) but there are a few tables that have zombie columns. I added an attachment that lists the tables and the names of those columns: https://bugzilla.mozilla.org/attachment.cgi?id=9121403

There were also a few tables that could not be queried and returned this error: Table was not found in location US

Attached file null_columns.py

Here is also the script. I executed it with the following parameters to get the results: python3 null_columns.py --project=moz-fx-data-shar-nonprod-efed --dataset=telemetry --date="2020-01-15"

I created a PR with EXCEPT lists here: https://github.com/mozilla/bigquery-etl/pull/688

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: