Closed Bug 1669288 Opened 5 years ago Closed 5 years ago

Investigate ~20% increase in telemetry.main partition sizes starting 2020-09-28

Categories

(Data Platform and Tools :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ascholtz, Assigned: ascholtz)

References

Details

(Whiteboard: [dataquality])

The telemetry.main partition sizes increased by about 20% starting 2020-09-28: https://sql.telemetry.mozilla.org/queries/73282/source#183488
There was a new release on 2020-09-22 which might be related to this.

Assignee: nobody → ascholtz

I wrote a script to determine column sizes by performing dry runs: https://github.com/mozilla/bigquery-etl/pull/1486
I ran the script for 2020-10-13 and 2020-08-25 (before we saw the increase) and compared column sizes. It looks like payload.keyed_histograms.sqlite_store_query added 1 TB, payload.keyed_histograms.sqlite_store_open added another 1 TB, and additional_properties grew slightly, by about 200 GB.
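
For readers who do not want to open the PR, the dry-run idea can be sketched roughly as follows. A BigQuery dry run reports how many bytes a query would scan without executing it, so selecting a single column for one day gives an estimate of that column's size. This is not the script from the PR: the helper names (`column_query`, `column_sizes`, `size_increase`, `bigquery_dry_run_bytes`) and the `submission_timestamp` filter are assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable


def column_query(table: str, column: str, date: str) -> str:
    """Build a query that reads exactly one column for one day."""
    # Assumption: the table is partitioned on submission_timestamp.
    return (
        f"SELECT {column} FROM `{table}` "
        f"WHERE DATE(submission_timestamp) = '{date}'"
    )


def column_sizes(
    table: str,
    columns: Iterable[str],
    date: str,
    dry_run_bytes: Callable[[str], int],
) -> Dict[str, int]:
    """Map each column to the bytes a dry run says its query would scan."""
    return {c: dry_run_bytes(column_query(table, c, date)) for c in columns}


def size_increase(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-column growth in bytes between two dates, largest first."""
    diff = {c: after[c] - before.get(c, 0) for c in after}
    return dict(sorted(diff.items(), key=lambda kv: kv[1], reverse=True))


def bigquery_dry_run_bytes(sql: str) -> int:
    """Estimate bytes scanned with a BigQuery dry run (needs credentials)."""
    from google.cloud import bigquery  # imported lazily; optional dependency

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=config).total_bytes_processed
```

Calling `column_sizes` once per date with `bigquery_dry_run_bytes` and passing both results to `size_increase` would give a ranking of the columns that grew the most. Passing the estimator in as a callable keeps the comparison logic usable without BigQuery access.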

I did some further investigation into how much compressing the histograms would save here:

For payload.keyed_histograms.sqlite_store_open I ran the following analysis:

| Table | Description | Size |
| --- | --- | --- |
| analysis.ascholtz_sqlite_store_open_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 | 922.91 GB |
| analysis.ascholtz_sqlite_store_open_compressed_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 but histograms use compact notation | 371.82 GB |
| analysis.ascholtz_sqlite_store_open_compressed_stripped_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 but histograms use compact notation and zero counts stripped | 337.91 GB |

Keys alone take up 193.2 GB, which is about 25% of the size of the non-compact histograms (729.7 GB).
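
To make the three table variants concrete, here is a rough sketch of how a JSON-encoded histogram could be rewritten as a compact string, with and without zero-count buckets. The format used here (`bucket_count;histogram_type;sum;range_low,range_high;bucket:count,...`) and the sample histogram are assumptions for illustration only; the exact encoding used by the pipeline is not specified in this bug.

```python
import json


def compact(histogram_json: str, strip_zeros: bool = False) -> str:
    """Encode a JSON histogram as one delimited string (hypothetical format)."""
    h = json.loads(histogram_json)
    # Sort buckets numerically so the output is deterministic.
    buckets = sorted((int(k), v) for k, v in h["values"].items())
    if strip_zeros:
        buckets = [(k, v) for k, v in buckets if v != 0]
    values = ",".join(f"{k}:{v}" for k, v in buckets)
    low, high = h["range"]
    return f"{h['bucket_count']};{h['histogram_type']};{h['sum']};{low},{high};{values}"


# Made-up sample histogram, not taken from telemetry data.
sample = json.dumps({
    "bucket_count": 10,
    "histogram_type": 0,
    "sum": 7,
    "range": [1, 3000],
    "values": {"0": 0, "1": 2, "5": 1, "12": 0},
})

print(len(sample), sample)
print(len(compact(sample)), compact(sample))
print(len(compact(sample, strip_zeros=True)), compact(sample, strip_zeros=True))
```

Most of the saving in this sketch comes from dropping the repeated JSON field names and quoting; stripping zero counts removes only a few more characters, which is at least consistent with the smaller second step seen in the table.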

And for payload.keyed_histograms.sqlite_store_query:

| Table | Description | Size |
| --- | --- | --- |
| analysis.ascholtz_sqlite_store_query_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 | 948.3 GB |
| analysis.ascholtz_sqlite_store_query_compressed_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 but histograms use compact notation | 387.53 GB |
| analysis.ascholtz_sqlite_store_query_compressed_stripped_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 but histograms use compact notation and zero counts stripped | 353.03 GB |

Again, keys alone take up 195.9 GB, which is about 25% of the size of the non-compact histograms (752.4 GB).

So it looks like using compact histogram encoding can reduce the column sizes here by about 60%. I only looked at data for a single day (2020-11-02) here, but I don't think these numbers would be much different for other days.
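
The roughly 60% figure can be checked directly against the table sizes above. The snippet is plain arithmetic on the reported numbers; the `reduction` helper is introduced only for illustration.

```python
def reduction(original_gb: float, reduced_gb: float) -> float:
    """Percentage by which reduced_gb is smaller than original_gb."""
    return 100 * (1 - reduced_gb / original_gb)


# Sizes in GB as reported in the tables above for 2020-11-02:
# (original, compact notation, compact notation with zero counts stripped)
sizes = {
    "sqlite_store_open": (922.91, 371.82, 337.91),
    "sqlite_store_query": (948.3, 387.53, 353.03),
}

for column, (full, compact_gb, stripped_gb) in sizes.items():
    print(
        f"{column}: compact {reduction(full, compact_gb):.1f}%, "
        f"compact + stripped {reduction(full, stripped_gb):.1f}%"
    )
```

For both columns this works out to a little under 60% for compact notation alone and about 63% once zero counts are stripped as well.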

Are these savings worth using the compact string encoding for the keyed histograms here?

Flags: needinfo?(jklukas)

> Are these savings worth using the compact string encoding for the keyed histograms here?

Probably yes, although it's hard to really estimate the cost savings without knowing how often these columns are referenced in queries.

This is a less clear-cut situation compared to use counters. I'm a bit hesitant to continue adding special cases for compact encodings, as it will become difficult to document and communicate which histograms we should expect to be stored compactly vs. as JSON. I would prefer to see us proceed with enabling compact encodings for all histogram types; there were only one or two known blocking use cases left when I deprioritized that work (see https://bugzilla.mozilla.org/show_bug.cgi?id=1646825).

Flags: needinfo?(jklukas)
See Also: → 1646825
Depends on: 1646825

Investigation done.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
Whiteboard: [data-quality] → [dataquality]