Attach table and field descriptions as metadata in BigQuery
Categories: Data Platform and Tools :: General, enhancement, P2
Tracking: Not tracked
People: Reporter: mreid, Assigned: ascholtz
Attachments: 1 file
As part of the BigQuery table generation pipeline, we have access to the JSONSchema and associated probe info - we should add field descriptions to the table where possible.
We may also want to do something similar for tables, at least for derived tables in bigquery-etl.
Comment 1 (Assignee) • 6 years ago
Propagation of descriptions and titles from JSON schemas to BigQuery schemas was implemented in: https://github.com/mozilla/jsonschema-transpiler/pull/93
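To illustrate the idea of propagating descriptions, here is a minimal Python sketch. It is not the implementation from jsonschema-transpiler; the function name `to_bigquery_fields`, the simplified type map, and the example schema are all assumptions made for illustration. It only shows how a `description` on a JSON Schema property could be carried over to the `description` key of a BigQuery schema field, and it leaves titles out.

```python
from typing import Any, Dict, List

# Hypothetical, heavily simplified mapping from JSON Schema types to BigQuery types.
_TYPE_MAP = {
    "string": "STRING",
    "integer": "INTEGER",
    "number": "FLOAT",
    "boolean": "BOOLEAN",
}


def to_bigquery_fields(schema: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert the properties of a JSON Schema object into BigQuery field dicts.

    Descriptions are copied over when present; fields without one are left
    without a description key.
    """
    required = set(schema.get("required", []))
    fields = []
    for name, prop in sorted(schema.get("properties", {}).items()):
        field: Dict[str, Any] = {
            "name": name,
            "mode": "REQUIRED" if name in required else "NULLABLE",
        }
        if prop.get("type") == "object":
            # Nested objects become RECORD fields with their own sub-fields.
            field["type"] = "RECORD"
            field["fields"] = to_bigquery_fields(prop)
        else:
            field["type"] = _TYPE_MAP[prop["type"]]
        if "description" in prop:
            field["description"] = prop["description"]
        fields.append(field)
    return fields


# Made-up example schema, not taken from any real ping.
example = {
    "type": "object",
    "required": ["client_id"],
    "properties": {
        "client_id": {"type": "string", "description": "Unique client identifier."},
        "payload": {
            "type": "object",
            "properties": {"count": {"type": "integer"}},
        },
    },
}
bq_fields = to_bigquery_fields(example)
```

In this sketch, `client_id` ends up with a description in the BigQuery schema, while `payload` and `payload.count` do not, because the source schema has none for them.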
Comment 2 (Assignee) • 5 years ago
Metadata of views/tables, such as descriptions, can now be defined in bigquery-etl by adding metadata.yaml files. See https://github.com/mozilla/bigquery-etl/pull/684
That's also a first step towards making data sets public in GCP.
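As a rough sketch of how such metadata could end up on a table, the Python below turns already-parsed metadata into a request body for the BigQuery `tables.patch` REST method. The metadata keys (`friendly_name`, `description`, `labels`), the function name `table_patch_body`, and the example values are assumptions for illustration and are not taken from the bigquery-etl pull request; `description`, `friendlyName`, and `labels` are fields of the BigQuery table resource.

```python
from typing import Any, Dict


def table_patch_body(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Build a `tables.patch` request body from parsed table metadata.

    Only keys that are present and non-empty in the metadata are included,
    so existing values on the table are not overwritten with blanks.
    """
    body: Dict[str, Any] = {}
    if metadata.get("description"):
        body["description"] = metadata["description"].strip()
    if metadata.get("friendly_name"):
        body["friendlyName"] = metadata["friendly_name"]
    if metadata.get("labels"):
        # BigQuery labels are string key/value pairs.
        body["labels"] = {str(k): str(v) for k, v in metadata["labels"].items()}
    return body


# Hypothetical content of a metadata.yaml file after parsing it into a dict.
example_metadata = {
    "friendly_name": "Example Derived Table",
    "description": "Daily aggregates for an example derived table.\n",
    "labels": {"schedule": "daily"},
}
patch = table_patch_body(example_metadata)
```

Keeping the metadata-to-request mapping in a pure function like this makes it easy to check without BigQuery credentials; the actual API call is left out on purpose.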
Comment 3 (Assignee) • 5 years ago
As a next step for this, I was wondering if it would make sense to get information from the probe dictionary for fields where the description is missing in the JSON schema and then add those descriptions to the schema?
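The proposal could look roughly like the following Python sketch, which fills in descriptions only where the JSON schema has none. The function name `fill_missing_descriptions`, the dotted-path keys of the probe lookup, and the example data are assumptions for illustration; the real probe dictionary format is not shown in this bug.

```python
from typing import Any, Dict, Tuple


def fill_missing_descriptions(
    schema: Dict[str, Any],
    probe_descriptions: Dict[str, str],
    path: Tuple[str, ...] = (),
) -> Dict[str, Any]:
    """Return a copy of a JSON schema with missing descriptions filled in.

    probe_descriptions maps dotted property paths (an assumed convention)
    to description strings. Existing descriptions are never overwritten.
    """
    filled = dict(schema)
    key = ".".join(path)
    if path and not filled.get("description") and key in probe_descriptions:
        filled["description"] = probe_descriptions[key]
    if "properties" in filled:
        # Recurse into nested properties, extending the dotted path.
        filled["properties"] = {
            name: fill_missing_descriptions(sub, probe_descriptions, path + (name,))
            for name, sub in filled["properties"].items()
        }
    return filled


# Made-up schema and probe descriptions.
schema = {
    "type": "object",
    "properties": {
        "payload": {
            "type": "object",
            "properties": {
                "a_probe": {"type": "integer"},
                "documented": {"type": "string", "description": "Already documented."},
            },
        }
    },
}
probes = {
    "payload.a_probe": "Counts example events.",
    "payload.documented": "Should not replace the schema description.",
}
filled = fill_missing_descriptions(schema, probes)
```

Treating the JSON schema as the primary source and the probe information as a fallback keeps hand-written descriptions intact.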
Comment 4 (Reporter) • 5 years ago
This sounds good to me!
Do you think there's any likelihood that adding significantly more information to the schema is likely to cause (or compound) performance issues like we've seen with the main ping table recently?
Comment 5 • 5 years ago
(In reply to Mark Reid [:mreid] from comment #4)
> This sounds good to me!
> Do you think there's any likelihood that adding significantly more information to the schema is likely to cause (or compound) performance issues like we've seen with the main ping table recently?
We should definitely ask BigQuery support about this.
Anna - Feel free to open a ticket for this, or let me know if you'd rather let me handle it.
Comment 6 (Assignee) • 5 years ago
I opened a ticket for this: https://console.cloud.google.com/support/cases/detail/22127080?project=moz-fx-data-derived-datasets
Comment 7 (Assignee) • 5 years ago
According to Google Cloud Support, adding descriptions should not cause any memory/performance issues:
> Since column descriptions are not used in materialization stats, it wouldn't affect you like the issue you had in the ticket #20528460.
> Thus, column descriptions won't lead to memory issues.
Comment 8 • 5 years ago
Comment 9 (Assignee) • 5 years ago
Parsing descriptions from main probes and adding them to the schema was implemented in: https://github.com/mozilla/mozilla-schema-generator/pull/107