Closed Bug 1602161 Opened 6 years ago Closed 6 years ago

Histograms that aren't accumulated in a session changed from `null` to `[]` in main_summary around 2019-11-09

Categories

(Data Platform and Tools :: General, defect, P1)

Points: 1

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: MattN, Assigned: relud)

Details

(Keywords: regression)

Attachments

(1 file)

For many years, Presto, Athena, and BQ returned NULL for a histogram column in main_summary if that histogram wasn't recorded/accumulated in the session. Since around 2019-11-09, [] is returned instead, which broke some of my existing queries. See https://sql.telemetry.mozilla.org/queries/66809/source#169318 for how null values stop being reported for this column.

tdsmith: I think this is because https://github.com/mozilla/bigquery-etl/blob/master/udf/json_extract_int_map.sql, which is used in the main_summary ETL, returns an empty array for a null input.

I can work around this by switching from a null check to something else, but I don't think this breaking change was communicated, and I'm not sure it was intentional.

Flags: needinfo?(jklukas)
Flags: needinfo?(dthorn)

This was likely an unintentional effect of the transition to the SQL-based main_summary job, and I agree it should be fixed. I don't think we ever intend to have a zero-length array.

It sounds like you have a stable work-around, though, so I'm not setting this as an urgent priority.

Flags: needinfo?(jklukas)

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)

> It sounds like you have a stable work-around, though, so not setting this as an urgent priority.

That still requires me to know which of my queries are affected by this bug, which isn't trivial to notice. I also filed this bug so that others don't waste time on this issue, as I'm sure others have affected queries too. I think this should be communicated widely if it's not going to be fixed soon.

This is a side-effect of fields in BigQuery tables being either NULLABLE or REPEATED, but not both. Before 2019-11-09 this was handled by nesting each REPEATED field inside a NULLABLE single-field RECORD and removing that nesting in a view. For data on or after 2019-11-09 we removed this layer of nesting in the underlying table, causing all NULL values to be converted to [] when BigQuery writes to main_summary.
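To make the NULLABLE-versus-REPEATED point concrete, here is a minimal Python sketch that only models the behavior described above. It is not the actual ETL code: the function names and the `list` field name are hypothetical, and the real table and view are defined in SQL.

```python
from typing import List, Optional


def write_repeated(value: Optional[List[int]]) -> List[int]:
    """Model a plain REPEATED column: it cannot hold NULL, so NULL is stored as []."""
    return [] if value is None else list(value)


def write_nested(value: Optional[List[int]]) -> Optional[dict]:
    """Model the older layout: a NULLABLE RECORD wrapping a single REPEATED field.

    The field name "list" is an assumption made for this illustration.
    """
    return None if value is None else {"list": list(value)}


def view_unnest(record: Optional[dict]) -> Optional[List[int]]:
    """Model the view that removed the nesting; a NULL record stays NULL."""
    return None if record is None else record["list"]


# Old layout: NULL survives the round trip through the table and the view.
old_result = view_unnest(write_nested(None))

# New layout: NULL is turned into an empty array when the table is written.
new_result = write_repeated(None)
```

In this model, `old_result` is `None` while `new_result` is `[]`, which is the change in behavior reported in this bug.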

> I agree it should be fixed. I don't think we ever intend to have a zero-length array.

I think we can safely convert [] to NULL for histograms in the main_summary_v4 view, because as :chutten confirmed for me in Slack:

> Histograms without samples should not be present in the snapshot in Firefox, so they shouldn't be present in the payload.
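The proposed conversion can be sketched in Python as follows. This is an illustration of the idea only, not the change that landed: `empty_histogram_to_null` and the sample rows are hypothetical, and it relies on the assumption quoted above that a histogram with samples never appears as an empty array. In BigQuery SQL the same idea could presumably be written as `IF(ARRAY_LENGTH(col) = 0, NULL, col)`.

```python
from typing import List, Optional


def empty_histogram_to_null(histogram: Optional[List[int]]) -> Optional[List[int]]:
    """Map an empty array back to NULL (None); leave non-empty histograms unchanged."""
    if histogram is None or len(histogram) == 0:
        return None
    return histogram


# Hypothetical column values as they might be read from the underlying table.
stored = [[], [0, 3, 1], []]

# What a view applying the conversion would expose instead.
exposed = [empty_histogram_to_null(h) for h in stored]

# Existing queries that count NULLs would then behave as they did before.
null_count = sum(1 for h in exposed if h is None)
```

With this conversion in the view, a query that filters on `IS NULL` would not need to be rewritten.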

Assignee: nobody → dthorn
Status: NEW → ASSIGNED
Points: --- → 1
Flags: needinfo?(dthorn)
Priority: -- → P1
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Main Summary → General