Histograms that aren't accumulated in a session changed from `null` to `[]` in main_summary around 2019-11-09
Categories
(Data Platform and Tools :: General, defect, P1)
Tracking
(Not tracked)
People
(Reporter: MattN, Assigned: relud)
References
()
Details
(Keywords: regression)
Attachments
(1 file)
For many years, Presto, Athena, and BQ returned a NULL value in main_summary if a histogram wasn't recorded/accumulated in the session. Since around 2019-11-09, `[]` is returned instead, which broke some of my existing queries. See https://sql.telemetry.mozilla.org/queries/66809/source#169318 for how null values stop being reported for this column.
tdsmith: i think this is because https://github.com/mozilla/bigquery-etl/blob/master/udf/json_extract_int_map.sql, which is used in the main_summary etl, returns an empty array for a null input
I can work around this by switching from a null check to something else, but I don't think this breaking change was communicated and I'm not sure it was intentional.
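As a rough illustration of that kind of workaround (a Python sketch of the check, not the actual query; `histogram_recorded` is a hypothetical helper name), treating both NULL and an empty array as "not recorded" gives the same answer for rows written before and after the change:

```python
from typing import Optional, Sequence


def histogram_recorded(value: Optional[Sequence]) -> bool:
    """Return True only when the histogram column holds at least one entry.

    None models the NULL seen in older rows; an empty sequence models
    the [] seen in rows written since around 2019-11-09. Both are
    treated as "not recorded".
    """
    return value is not None and len(value) > 0


# A plain null check would count the empty list as "recorded";
# this check does not.
rows = [None, [], [{"key": 0, "value": 3}]]
recorded = [histogram_recorded(r) for r in rows]
```

In SQL terms this corresponds to replacing a bare null check with a check that the array is non-empty, which is only an assumption about how affected queries are written.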
Comment 1•6 years ago
This was likely an unintentional effect of the transition to the SQL-based main_summary job, and I agree it should be fixed. I don't think we ever intend to have a zero-length array.
It sounds like you have a stable workaround, though, so I'm not setting this as an urgent priority.
Reporter
Comment 2•6 years ago
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)
It sounds like you having a stable work-around, though, so not setting this as an urgent priority.
That still requires me to know which of my queries are affected by this bug, which isn't trivial to notice. I also filed this bug so others don't waste time on this issue, as I'm sure others have affected queries too. I think this should be communicated widely if it's not going to be fixed soon.
Assignee
Comment 3•6 years ago
This is a side-effect of fields in BigQuery tables being either NULLABLE or REPEATED, but not both. Before 2019-11-09 this was handled by nesting each REPEATED field inside a NULLABLE single-field RECORD and removing that nesting in a view. For data on or after 2019-11-09 we removed this layer of nesting in the underlying table, causing all NULL values to be converted to [] when BigQuery writes to main_summary.
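A minimal Python model of the two storage layouts may make the difference concrete. This is only a sketch of the semantics described above, not the ETL code; the function names and the `"list"` field name are hypothetical:

```python
from typing import Optional


def write_flat(value: Optional[list]) -> list:
    """Model a bare REPEATED column: it cannot hold NULL, so NULL becomes []."""
    return [] if value is None else list(value)


def write_nested(value: Optional[list]) -> Optional[dict]:
    """Model a NULLABLE single-field RECORD wrapping a REPEATED field.

    The record itself may be NULL, so a missing histogram stays NULL.
    """
    return None if value is None else {"list": list(value)}


def view_unnest(record: Optional[dict]) -> Optional[list]:
    """Model the view that strips the wrapper record again."""
    return None if record is None else record["list"]


# Old layout: NULL survives the round trip through table and view.
old_result = view_unnest(write_nested(None))
# New layout: NULL is turned into an empty list at write time.
new_result = write_flat(None)
```

With the wrapper in place the view can still distinguish "no histogram" from "histogram present"; once the wrapper is removed, that distinction is lost when the row is written.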
I agree it should be fixed. I don't think we ever intend to have a zero-length array.
I think we can safely convert `[]` to NULL for histograms in the main_summary_v4 view, because as :chutten confirmed for me in Slack:
Histograms without samples should not be present in the snapshot in Firefox, so they shouldn't be present in the payload.
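The proposed conversion can be sketched in Python as follows. This models the intended behavior only, not the view definition itself, and `empty_to_null` is a hypothetical name:

```python
from typing import Optional


def empty_to_null(value: Optional[list]) -> Optional[list]:
    """Map an empty histogram array back to None (NULL).

    This relies on the assumption quoted above: a histogram without
    samples is never present in the payload, so an empty array can only
    mean "not recorded".
    """
    if value is not None and len(value) == 0:
        return None
    return value


# Empty arrays become NULL again; everything else passes through unchanged.
converted = [empty_to_null(v) for v in ([], None, [{"key": 1, "value": 2}])]
```

Applied in the view, a conversion like this would let existing null checks behave as they did before 2019-11-09.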
Comment 4•6 years ago