New paint.build_displaylist_time metric on desktop seems to be taking more storage than expected
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: wlach, Unassigned)
References
Details
(Whiteboard: [dataplatform])
When Firefox 94 was released, we noticed a fairly large jump in the disk space used by Firefox desktop (from about 100GB on an average weekday to about 280GB):
https://sql.telemetry.mozilla.org/queries/82992#205657
Most of this seems to be due to the new paint.build_displaylist_time metric, which landed in Firefox 94. BigQuery's size estimator reports that this query will scan 1.8TB:
SELECT
  metrics.timing_distribution.paint_build_displaylist_time
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'
Compare that to this query against a legacy histogram, which should scan only 150GB for the same interval:
SELECT
  payload.histograms.checkerboard_potential_duration
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'
There do seem to be more instances of the former histogram than the latter in the data, but IMO not enough to account for this discrepancy (see the sketch after these links):
- Checkerboard potential duration: https://sql.telemetry.mozilla.org/queries/83105/source#205942
- Paint list display time: https://sql.telemetry.mozilla.org/queries/83106/source
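For reference, here is a minimal sketch of how such counts could be pulled directly; this is not the linked Redash queries, and the sum field / string field are just used as presence checks:
-- Sketch: how many pings contain each histogram over the same November interval.
SELECT
  'paint_build_displaylist_time' AS histogram,
  COUNTIF(metrics.timing_distribution.paint_build_displaylist_time.sum IS NOT NULL) AS pings_with_histogram,
  COUNT(*) AS total_pings
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'
UNION ALL
SELECT
  'checkerboard_potential_duration',
  COUNTIF(payload.histograms.checkerboard_potential_duration IS NOT NULL),
  COUNT(*)
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'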
We should probably investigate why this metric is taking up so much space before adding any more glean histograms to desktop.
Comment 1•2 years ago
:klukas, could the difference be entirely or partially explained by the efficient histogram encoding being used for Telemetry-based distribution data?
Reporter
Comment 2•2 years ago
Ok, that was actually not so complicated.
It looks like, for the main ping, we're storing the histogram data as a JSON string (e.g. https://docs.telemetry.mozilla.org/cookbooks/main_ping_exponential_histograms.html#getting-client-level-data), whereas for Glean data we're storing it as structured data inside BigQuery. I suspect the latter takes up more space than the former, though I'm not sure it makes that much of a difference (and I'm guessing this was a deliberate choice; :klukas can confirm).
The thing that probably makes the most difference is that this is a timing distribution with a large range, so there is a large number of buckets in the histogram. Here's a quick example of one client's data run through BigQuery's TO_JSON_STRING:
{"bucket_count":null,"histogram_type":null,"overflow":null,"range":[],"sum":64547612200,"time_unit":null,"underflow":null,"values":[{"key":"3846193","value":466},{"key":"18295683","value":30},{"key":"311743","value":1777},{"key":"25267","value":39},{"key":"440871","value":1372},{"key":"623487","value":2682},{"key":"370727","value":1523},{"key":"961548","value":1807},{"key":"19483","value":41},{"key":"9975792","value":59},{"key":"30048","value":40},{"key":"262144","value":1272},{"key":"3526975","value":490},{"key":"16777216","value":25},{"key":"25874004","value":5},{"key":"32768","value":31},{"key":"679917","value":2338},{"key":"220435","value":1146},{"key":"4573920","value":280},{"key":"480774","value":1552},{"key":"5439339","value":198},{"key":"1923096","value":926},{"key":"15024","value":6},{"key":"77935","value":152},{"key":"33554432","value":1},{"key":"12633","value":10},{"key":"1143480","value":1454},{"key":"185363","value":989},{"key":"8388608","value":65},{"key":"21247","value":55},{"key":"28215801","v…
It's probably worth verifying that this is operating within parameters but it might be fine. Unfortunately the person who originally implemented this metric is no longer at Mozilla.
Comment 3•2 years ago
The custom string encoding for telemetry histograms is discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1646825, which links to the proposal doc with an analysis of the storage savings.
Comment 4•2 years ago
Of note, the only encoding optimization that was implemented for the compact histogram encoding proposal was for histogram_type=2, which doesn't apply in this situation. checkerboard_potential_duration is type 0 (exponential), and the value stored in BQ is a JSON string like:
{"bucket_count":50,"histogram_type":0,"sum":30913,"range":[1,1000000],"values":{"65":0,"86":1,"149":2,"258":3,"340":7,"448":3,"590":5,"777":2,"1347":2,"2336":1,"4053":2,"5338":1,"7031":0}}
I am guessing that paint_build_displaylist_time is being populated with more values, and thus all the buckets tend to be filled.
Some other relevant details (see the BQ data sizing docs):
- Values in the Glean case are INT64, which are always billed at 8 bytes each. In the telemetry JSON case, we are billed one byte per character, and the values will usually be fewer than 8 characters.
- There is an overhead cost of 2 bytes for each individual STRING-type value, though this likely balances out since the JSON representation spends characters on quotes and separators.
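As a back-of-the-envelope illustration of those rules, using one bucket ("3846193": 466) from the example in comment 2:
-- Sketch: approximate billed bytes for a single bucket under each representation.
SELECT
  2 * 8 AS structured_bytes_per_bucket,              -- key INT64 + value INT64 = 16 bytes
  LENGTH('"3846193":466') AS json_chars_per_bucket   -- 13 characters, billed at 1 byte each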
Comment 5•2 years ago
Additional context that came up in the Glean Platform WG: the use of additional buckets is a desired property for the users of this metric; they were actually looking for better-than-millisecond resolution. So the obvious suggestion of dropping the < 1ms buckets may not be sufficient. Perhaps setting a lower limit of 100ns or so would work, but we don't currently have functionality for that.
To suggest one possible optimization: for timing distributions, the keys (defining the lower bounds of the buckets) should always come from the same set, so they could be reconstructed from just a list of values (and the first key). There would be a bunch of downstream implications on analysis and tooling for such a change, but it could have a significant impact.
Maybe some analysis of the actual data, and of where the space usage is mostly coming from, would help in understanding the impact of such a change.
Comment 6•2 years ago
I think the next step is to look at the data and see where the storage costs are coming from. That might inform any solutions going forward.
Comment 7•2 years ago
(In reply to Michael Droettboom [:mdroettboom] from comment #5)
> Additional context that came up in the Glean Platform WG: the use of additional buckets is a desired property for the users of this metric; they were actually looking for better-than-millisecond resolution. So the obvious suggestion of dropping the < 1ms buckets may not be sufficient. Perhaps setting a lower limit of 100ns or so would work, but we don't currently have functionality for that.
Bas mentions 10^-7s (100ns) as perhaps the finest resolution that might be reasonable even for performance stuff, which aligns with your intuition.
But we need some way to determine what storage cost samples less than 100ns contribute. I suppose we could "just" count the number of buckets... but my brain hasn't math'd that way in a donkey's age. Lemme stretch it out first...
Oh, duh. Since this is the first N buckets up to whatever bucket holds 100ns, we can just take the bucket index of 100ns, which is 53 (the bucket index for a sample x is given by floor(8 * log2(x))).
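A quick sanity check of that arithmetic (a sketch; note that BigQuery's LOG(x, y) takes the value first and the base second):
-- Sketch: bucket index of a 100ns sample under floor(8 * log2(x)).
SELECT
  CAST(FLOOR(8 * LOG(100, 2)) AS INT64) AS bucket_index_for_100ns  -- 53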
Given that we were worried enough about bucket counts in Firefox Telemetry to limit them to an absolute count of 100 (unless given a good reason to extend that limit), using a full half of that limit on values from 0 to 100ns seems... wasteful. And we'll be paying this cost for every distribution that times something quick (perf-scale, not user-scale timings).
Comment 8•2 years ago
Hm. But it looks as though it actually doesn't matter? The number of buckets in use seems to be mostly around 100: https://sql.telemetry.mozilla.org/queries/83777/source
So it seems as though our BigQuery encoding of this data is where the storage expense is coming from, not the number of buckets.
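A sketch (not the linked query; the date is an arbitrary example) of how the buckets-in-use distribution can be pulled:
-- Sketch: how many buckets each ping's histogram actually populates.
SELECT
  ARRAY_LENGTH(metrics.timing_distribution.paint_build_displaylist_time.values) AS buckets_in_use,
  COUNT(*) AS pings
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2021-11-02'
  AND metrics.timing_distribution.paint_build_displaylist_time.sum IS NOT NULL
GROUP BY
  buckets_in_use
ORDER BY
  buckets_in_use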
Comment 9•2 years ago
And here's a query for what those bucket values are: https://sql.telemetry.mozilla.org/queries/83778/source
The first decile is 32768, suggesting that any lower-bound changes wouldn't make much of a difference here. The 100-or-so buckets come from fine resolution overall, not from having too-fine resolution at the bottom end.
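That distribution can be approximated along these lines (a sketch, not the linked query; the date is arbitrary and the CAST covers either a STRING or INT64 key column):
-- Sketch: approximate deciles of the bucket keys (i.e. bucket lower bounds).
SELECT
  APPROX_QUANTILES(CAST(v.key AS INT64), 10) AS key_deciles
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`,
  UNNEST(metrics.timing_distribution.paint_build_displaylist_time.values) AS v
WHERE
  DATE(submission_timestamp) = '2021-11-02'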
Comment 10•2 years ago
I intend to spend an hour or two doing a more thorough analysis of storage space for the JSON-encoded histogram vs. the structured version, to better quantify whether our BQ representation is significantly wasteful.
Comment 11•2 years ago
I took a look at this today, focusing on data from March 13 2022.
SELECT
  SUM(ARRAY_LENGTH(metrics.timing_distribution.paint_build_displaylist_time.values))
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2022-03-13'
That gives 4 billion as the total number of value entries across all histograms. Each entry consists of an INT64 key and an INT64 value, so we get:
4 billion * 2 * (8 bytes) = 64 GB
The values for this histogram consume 64 GB of space for this day.
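(The same arithmetic can be folded into the query directly; this is a sketch, and it uses decimal GB to match the figure above.)
-- Sketch: total INT64 bytes billed for keys + values of this histogram on one day.
SELECT
  SUM(ARRAY_LENGTH(metrics.timing_distribution.paint_build_displaylist_time.values))
    * 2   -- key and value per entry
    * 8   -- bytes per INT64
    / 1e9 AS approx_gb
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2022-03-13'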
Let's compare this to a JSON encoding. A lower bound on the space needed in a JSON encoding is the number of characters in the string representations of the individual keys and values:
SELECT
  SUM(LENGTH(CAST(key AS string)))
  + SUM(LENGTH(CAST(value AS string)))
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`,
  UNNEST(metrics.timing_distribution.paint_build_displaylist_time.values)
WHERE
  DATE(submission_timestamp) = '2022-03-13'
This gives a total of 35 billion characters to represent these values. At 1 byte per character, that is a minimum of 35 GB encoded as JSON, and in reality it would be higher due to quote and separator characters.
So from a storage standpoint, I don't think a JSON encoding would particularly help. The storage cost here comes from the real behavior of the underlying histogram (which has many populated buckets) rather than from the BQ representation.
Comment 12•2 years ago
Unassigning myself, as I've ruled out the BQ representation as a major factor.
chutten - Do you think this histogram is meeting a need and should be kept as-is?
Comment 13•2 years ago
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #12)
> chutten - Do you think this histogram is meeting a need and should be kept as-is?
That's ultimately up to the timing_distribution's owner, not me, but I believe this to be a key graphics performance canary: if this shifts, we want to know why.
Sounds like this is "just" a heavy metric and that there's nothing we can do about it from the data collection system side.
Comment 14•2 years ago
> Sounds like this is "just" a heavy metric and that there's nothing we can do about it from the data collection system side.
That agrees with my assessment at this point.
Comment 15•2 years ago
Bereft of a :wlach to decide to resolve this, I'm gonna call this resolved. No real great resolution available for it, so I'm choosing WONTFIX.