Closed Bug 1743683 Opened 2 years ago Closed 2 years ago

New paint.build_displaylist_time metric on desktop seems to be taking more storage than expected

Categories

(Data Platform and Tools :: Glean: SDK, defect, P3)


Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: wlach, Unassigned)

References

Details

(Whiteboard: [dataplatform])

When Firefox 94 was released, we noticed a fairly large jump in the storage used by Firefox desktop telemetry data (from about 100GB on an average weekday to about 280GB):

https://sql.telemetry.mozilla.org/queries/82992#205657

Most of this seems to be due to the new paint.build_displaylist_time metric which landed in Firefox 94. BigQuery's size estimator estimates that this query will scan 1.8TB:

SELECT
  metrics.timing_distribution.paint_build_displaylist_time
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'

Versus this query against a legacy histogram, which should only scan 150GB for the same interval:

SELECT
  payload.histograms.checkerboard_potential_duration
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'

There do seem to be more instances of the former histogram than the latter in the data, but IMO not enough to account for this discrepancy.
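
(For reference, instance counts for the two probes over the same interval can be compared with queries along these lines -- a sketch, not necessarily the exact queries behind the observation above:)

SELECT
  COUNTIF(metrics.timing_distribution.paint_build_displaylist_time.sum IS NOT NULL) AS instances
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'

SELECT
  COUNTIF(payload.histograms.checkerboard_potential_duration IS NOT NULL) AS instances
FROM
  telemetry.main
WHERE
  DATE(submission_timestamp) > '2021-11-01'
  AND DATE(submission_timestamp) < '2021-12-01'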

We should probably investigate why this metric is taking up so much space before adding any more Glean histograms to desktop.

:klukas, could the difference be entirely or partially explained by the efficient histogram encoding being used for Telemetry-based distribution data?

Flags: needinfo?(jklukas)

Ok, that was actually not so complicated.

It looks like, in the case of the main ping, we're storing the histogram data as JSON (e.g. https://docs.telemetry.mozilla.org/cookbooks/main_ping_exponential_histograms.html#getting-client-level-data), whereas for Glean data we're storing it as structured data inside BigQuery. I suspect the latter takes up more space than the former, though I'm not sure it makes that much of a difference (and I'm guessing this was a deliberate choice; :klukas can confirm).

The thing that probably makes most of the difference is that this is a timing distribution with a large range, and there is a large number of buckets in the histogram. Here's a quick example of a client's information processed through BigQuery's TO_JSON_STRING:

{"bucket_count":null,"histogram_type":null,"overflow":null,"range":[],"sum":64547612200,"time_unit":null,"underflow":null,"values":[{"key":"3846193","value":466},{"key":"18295683","value":30},{"key":"311743","value":1777},{"key":"25267","value":39},{"key":"440871","value":1372},{"key":"623487","value":2682},{"key":"370727","value":1523},{"key":"961548","value":1807},{"key":"19483","value":41},{"key":"9975792","value":59},{"key":"30048","value":40},{"key":"262144","value":1272},{"key":"3526975","value":490},{"key":"16777216","value":25},{"key":"25874004","value":5},{"key":"32768","value":31},{"key":"679917","value":2338},{"key":"220435","value":1146},{"key":"4573920","value":280},{"key":"480774","value":1552},{"key":"5439339","value":198},{"key":"1923096","value":926},{"key":"15024","value":6},{"key":"77935","value":152},{"key":"33554432","value":1},{"key":"12633","value":10},{"key":"1143480","value":1454},{"key":"185363","value":989},{"key":"8388608","value":65},{"key":"21247","value":55},{"key":"28215801","v…

It's probably worth verifying that this is operating within parameters, but it might be fine. Unfortunately, the person who originally implemented this metric is no longer at Mozilla.

The custom string encoding for telemetry histograms is discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1646825 which links to the proposal doc with analysis of the storage savings.

Flags: needinfo?(jklukas)
See Also: → 1646825

Of note, the only encoding optimization that was implemented from the compact histogram encoding proposal was for histogram_type=2, which doesn't apply in this situation. checkerboard_potential_duration is type 0 (exponential), and the value stored in BQ is a JSON string like:

{"bucket_count":50,"histogram_type":0,"sum":30913,"range":[1,1000000],"values":{"65":0,"86":1,"149":2,"258":3,"340":7,"448":3,"590":5,"777":2,"1347":2,"2336":1,"4053":2,"5338":1,"7031":0}}

I am guessing that the paint_build_displaylist_time is being populated with more values, and thus all the buckets tend to be filled.

Some other relevant details (see the BQ data sizing docs); a rough per-entry comparison is sketched after this list:

  • Values in the Glean case are INT64, which are always billed as 8 bytes. In the telemetry JSON case, we are billed one byte per character, and most values are fewer than 8 characters long, so they usually cost less than 8 bytes.
  • There is an overhead cost of 2 bytes for each individual STRING-type value, though this likely balances out since the JSON representation spends characters on quotes and separators.
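
(To make that concrete, a rough back-of-the-envelope comparison, taking the key/value pair 3846193 → 466 from the TO_JSON_STRING example above and ignoring per-field overhead:)

  Structured (Glean) entry:  key INT64 (8 bytes) + value INT64 (8 bytes) = 16 bytes
  JSON (telemetry) entry:    "3846193":466,  →  14 characters ≈ 14 bytes

So the two representations end up in the same rough ballpark for a typical entry.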

Additional context that came up in the Glean Platform WG: the use of additional buckets is a desired property for the users of this metric -- they were actually looking for better-than-ms resolution. So the obvious suggestion of dropping < 1ms buckets may not be sufficient. Perhaps setting a lower limit of 100ns or so would work, but we don't currently have functionality for that.

To suggest one possible optimization -- for timing distributions, the keys (defining the lower bounds of the buckets) should always come from the same set, so could be reconstructed from just a list of values (and the first key). There would be a bunch of downstream implications on analysis and tooling for such a change, but it could have significant impact.
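
(As a very rough illustration of the idea -- this assumes the floor(8 * log2(x)) bucketing described later in this bug, uses made-up counts, and is not a proposed schema or the actual Glean algorithm:)

WITH example AS (
  -- 120 is the bucket index of a 32768ns sample; the counts here are invented
  SELECT 120 AS first_bucket_index, [31, 980, 0, 451] AS counts
)
SELECT
  -- approximate lower bound of bucket i is 2^(i/8)
  CAST(FLOOR(POW(2, (first_bucket_index + offset) / 8)) AS INT64) AS approx_key,
  sample_count
FROM example, UNNEST(counts) AS sample_count WITH OFFSET AS offset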

Maybe some analysis of the actual data and where the space usage is mostly coming from would help to understand the impact of such a change.

I think next step is to look at the data and see where the storage costs are coming from. That might inform any solutions going forward.

Priority: -- → P2
See Also: → 1745660

(In reply to Michael Droettboom [:mdroettboom] from comment #5)

> Additional context that came up in the Glean Platform WG: the use of additional buckets is a desired property for the users of this metric -- they were actually looking for better-than-ms resolution. So the obvious suggestion of dropping < 1ms buckets may not be sufficient. Perhaps setting a lower limit of 100ns or so would work, but we don't currently have functionality for that.

Bas mentions 10^-7 seconds (i.e. 100ns) as perhaps the finest resolution that might be reasonable even for performance stuff, which aligns with your intuition.

But we need some way to determine what storage cost samples less than 100ns contribute. I suppose we could "just" count the number of buckets... but my brain hasn't math'd that way in a donkey's age. Lemme stretch it out first...

Oh, duh. Since this is the first N buckets up to whatever bucket holds 100ns, we can just take the bucket index of 100ns, which is 53 (the bucket index for a sample of x nanoseconds is given by floor(8 * log2(x))).
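
(A quick sanity check of that arithmetic in BigQuery, assuming nanosecond units as above:)

SELECT
  CAST(FLOOR(8 * LOG(100, 2)) AS INT64) AS bucket_index_of_100ns  -- returns 53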

Given that we were worried enough about bucket counts in Firefox Telemetry to limit them to an absolute count of 100 (unless given a good reason to extend that limit), using a full half of that limit on values from 0 to 100ns seems... wasteful. And we'll be paying this cost for every distribution that times something quick (perf-scale, not user-scale timings).

Hm. But it looks as though it actually doesn't matter? The number of buckets in use seems to be mostly right around 100: https://sql.telemetry.mozilla.org/queries/83777/source

So it seems as though our BigQuery encoding of this data is where the storage expense is coming from, not the number of buckets.

And here's a query for what those bucket values are: https://sql.telemetry.mozilla.org/queries/83778/source
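
(The linked query isn't reproduced here, but its shape is roughly the following -- a sketch only; the date is arbitrary and the linked query may differ:)

SELECT
  APPROX_QUANTILES(CAST(key AS INT64), 10) AS key_deciles
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`,
  UNNEST(metrics.timing_distribution.paint_build_displaylist_time.values)
WHERE
  DATE(submission_timestamp) = '2021-12-01'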

The first decile is 32768, suggesting that any lower-bound changes wouldn't make much of a difference here. The 100-or-so buckets are coming from fine resolution overall, not from having too-fine resolution at the bottom end.

I intend to spend an hour or two doing a more thorough analysis of storage space for the JSON-encoded histogram vs. the structured version, to better quantify whether our BQ representation is significantly wasteful.

Assignee: nobody → jklukas

I took a look at this today, focusing on data from March 13 2022.

SELECT
  SUM(ARRAY_LENGTH(metrics.timing_distribution.paint_build_displaylist_time.values))
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`
WHERE
  DATE(submission_timestamp) = '2022-03-13'

That gives 4 billion as the total number of value entries across all histograms. Each value consists of an INT64 for key and an INT64 for value, so we get:

4 billion * 2 * (8 bytes) = 64 GB

The values for this histogram consume 64 GB of space for this day.

Let's compare this to a JSON encoding. A minimal estimate of the space these would take in a JSON encoding is the number of characters in the string encodings of the individual values:

SELECT
  SUM(LENGTH(CAST(key AS STRING)))
    + SUM(LENGTH(CAST(value AS STRING)))
FROM
  `moz-fx-data-shared-prod.firefox_desktop_stable.metrics_v1`,
  UNNEST(metrics.timing_distribution.paint_build_displaylist_time.values)
WHERE
  DATE(submission_timestamp) = '2022-03-13'

This gives us a total of 35 billion characters to represent these values. At 1 byte per character, this would be at minimum 35 GB encoded as JSON, and in reality it would be higher due to quotes and separator characters.

So from a storage standpoint, I don't think JSON encoding would particularly help.

So the storage issue here is coming from the real behavior of the underlying histogram (which has many populated buckets) rather than the BQ representation.

Unassigning myself, as I've ruled out the BQ representation as a major factor.

chutten - Do you think this histogram is meeting a need and should be kept as-is?

Assignee: jklukas → nobody
Priority: P2 → P3
Flags: needinfo?(chutten)

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #12)

> chutten - Do you think this histogram is meeting a need and should be kept as-is?

That's ultimately up to the timing_distribution's owner, not me, but I believe this to be a key graphics performance canary: if this shifts, we want to know why.

Sounds like this is "just" a heavy metric and that there's nothing we can do about it from the data collection system side.

Flags: needinfo?(chutten)

> Sounds like this is "just" a heavy metric and that there's nothing we can do about it from the data collection system side.

That agrees with my assessment at this point.

Bereft of a :wlach to decide to resolve this, I'm gonna call this resolved. There's no really great resolution available for it, so I'm choosing WONTFIX.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
Whiteboard: [data-platform-infra-wg] → [dataplatform]