Investigate ~20% increase in telemetry.main partition sizes starting 2020-09-28
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: ascholtz, Assigned: ascholtz)
Details
(Whiteboard: [dataquality])
The telemetry.main partition sizes increased by about 20% starting 2020-09-28: https://sql.telemetry.mozilla.org/queries/73282/source#183488
There was a new release on 2020-09-22 which might be related to this.
Comment 1•5 years ago
I wrote a script to determine column sizes by performing dry runs: https://github.com/mozilla/bigquery-etl/pull/1486
I ran the script for 2020-10-13 and for 2020-08-25 (before we saw the increase) and compared column sizes. It looks like payload.keyed_histograms.sqlite_store_query grew by about 1 TB and payload.keyed_histograms.sqlite_store_open grew by another 1 TB. There was also a smaller increase of about 200 GB in additional_properties.
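The comparison step described above can be sketched roughly as follows. This is a minimal illustration, not the script from the linked pull request: it assumes the per-column sizes for each date have already been collected (for example from dry runs) into plain dictionaries. The function name, column names, and byte counts are hypothetical placeholders, not measured values.

```python
def column_size_deltas(before, after):
    """Return (column, delta_bytes) pairs, largest increase first.

    Columns missing on one side are treated as 0 bytes on that side.
    """
    columns = set(before) | set(after)
    deltas = {c: after.get(c, 0) - before.get(c, 0) for c in columns}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)


TB = 10**12
GB = 10**9

# Placeholder sizes for illustration only, NOT the real dry-run results.
before = {"column_a": 2 * TB, "column_b": 300 * GB, "column_c": 50 * GB}
after = {"column_a": 3 * TB, "column_b": 500 * GB, "column_c": 50 * GB}

deltas = column_size_deltas(before, after)
for column, delta in deltas:
    print(f"{column}: {delta / GB:+.1f} GB")
```

Sorting by delta puts the columns responsible for most of the growth at the top, which is how the two sqlite_store_* columns would stand out in a comparison like this.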
Comment 2•5 years ago
I did some further investigation into how much space compressing the histograms would save here.
For payload.keyed_histograms.sqlite_store_open I ran the following analysis:
| Table | Description | Size |
|---|---|---|
| analysis.ascholtz_sqlite_store_open_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 | 922.91 GB |
| analysis.ascholtz_sqlite_store_open_compressed_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 but histograms use compact notation | 371.82 GB |
| analysis.ascholtz_sqlite_store_open_compressed_stripped_20201103 | Contains all payload.keyed_histograms.sqlite_store_open for 2020-11-02 but histograms use compact notation and zero counts stripped | 337.91 GB |
Keys alone take up 193.2 GB, which is about 25% of the size of the non-compact histogram values (729.7 GB).
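To illustrate why compact notation and zero stripping shrink these columns, here is a small sketch. The delimited layout used here (`bucket_count;histogram_type;sum;min,max;bucket:count,...`) is an assumption for illustration and may not match the encoding the pipeline actually uses. The function name and the example histogram are hypothetical.

```python
import json


def to_compact(hist, strip_zeros=False):
    """Encode a histogram dict as a single delimited string.

    Assumed layout: bucket_count;histogram_type;sum;min,max;bucket:count,...
    """
    values = hist["values"]
    if strip_zeros:
        # Drop buckets whose count is zero.
        values = {k: v for k, v in values.items() if v != 0}
    pairs = ",".join(f"{k}:{v}" for k, v in values.items())
    lo, hi = hist["range"]
    return (
        f'{hist["bucket_count"]};{hist["histogram_type"]};'
        f'{hist["sum"]};{lo},{hi};{pairs}'
    )


# Made-up histogram, only to compare encoded lengths.
example = {
    "bucket_count": 10,
    "histogram_type": 0,
    "sum": 4,
    "range": [1, 100],
    "values": {"0": 0, "1": 2, "2": 1, "3": 0},
}

as_json = json.dumps(example)
compact = to_compact(example)
stripped = to_compact(example, strip_zeros=True)

print(len(as_json), len(compact), len(stripped))
```

The JSON form repeats every field name in every histogram, while the compact form relies on position. That is consistent with most of the savings in the table coming from the compact notation, with zero stripping adding a smaller amount on top.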
And for payload.keyed_histograms.sqlite_store_query:
| Table | Description | Size |
|---|---|---|
| analysis.ascholtz_sqlite_store_query_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 | 948.3 GB |
| analysis.ascholtz_sqlite_store_query_compressed_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 but histograms use compact notation | 387.53 GB |
| analysis.ascholtz_sqlite_store_query_compressed_stripped_20201103 | Contains all payload.keyed_histograms.sqlite_store_query for 2020-11-02 but histograms use compact notation and zero counts stripped | 353.03 GB |
Again, keys alone take up 195.9 GB, which is about 25% of the size of the non-compact histogram values (752.4 GB).
So it looks like using compact histogram encoding can reduce the column sizes here by about 60%. I only looked at data for a single day (2020-11-02), but I don't think these numbers would be much different for other days.
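The "about 60%" figure can be checked directly against the table sizes above. The snippet below is only a worked version of that arithmetic. The dictionary keys and the helper name are made up for illustration; the sizes in GB are the ones reported in the tables.

```python
# Sizes in GB, copied from the two tables above.
sizes_gb = {
    "sqlite_store_open": {
        "json": 922.91,
        "compact": 371.82,
        "compact_stripped": 337.91,
    },
    "sqlite_store_query": {
        "json": 948.3,
        "compact": 387.53,
        "compact_stripped": 353.03,
    },
}


def reduction_pct(original, reduced):
    """Percentage by which `reduced` is smaller than `original`."""
    return 100.0 * (1.0 - reduced / original)


reductions = {
    column: {
        encoding: round(reduction_pct(sizes["json"], sizes[encoding]), 1)
        for encoding in ("compact", "compact_stripped")
    }
    for column, sizes in sizes_gb.items()
}

for column, by_encoding in reductions.items():
    print(column, by_encoding)
```

For both columns this comes out to a reduction of roughly 59 to 60% with compact notation alone, and roughly 63% when zero counts are also stripped.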
Are these savings worth using the compact string encoding for the keyed histograms here?
Comment 3•5 years ago
> Are these savings worth using the compact string encoding for the keyed histograms here?
Probably yes, although it's hard to really estimate the cost savings without knowing how often these columns are referenced in queries.
This is a less clear-cut situation compared to use counters. I'm a bit hesitant to continue adding special cases for compact encodings, as it will become difficult to document and communicate which histograms we should expect are stored compactly vs. as JSON. I would prefer to see us proceed with enabling compact encodings for all histogram types; there were only one or two known blocking use cases left when I deprioritized that work (see https://bugzilla.mozilla.org/show_bug.cgi?id=1646825).
Comment 4•5 years ago
Investigation done.