Closed Bug 1313592 Opened 3 years ago Closed 3 years ago

Estimate storage impact of Event Telemetry

Categories

(Toolkit :: Telemetry, defect, P1)

Points:
3

Tracking

()

RESOLVED FIXED

People

(Reporter: gfritzsche, Assigned: gfritzsche)

References

(Blocks 1 open bug)

Details

(Whiteboard: [measurement:client])

We need to pessimistically estimate the storage impact of Event Telemetry.

We will also need to decide on what numbers are acceptable and tune the client-side limits from that.
Priority: P2 → P1
Points: --- → 2
Assignee: nobody → gfritzsche
I started to look at storage estimation here:
https://docs.google.com/spreadsheets/d/1o1ZLfiEEj1nA0ViKA67PAP2q8adzmEzQt4BT-gsyPsA/

This only looks at the raw, upper-bound size impact, but it should give us a worst-case scenario to work from.

E.g. for a simple event of the form [timestamp,"category","method","object","value",null], sending 1k events per ping costs us ~0.16MB, and 10k events ~1.6MB.
For event-driven data collection like the cases being considered (clicks, navigations, tab opens, ...), 1k events doesn't seem like very much.
For comparison, the payload size of the whole opt-out "main" ping from release is currently ~0.15MB, and we just discard any pings >1MB (raw or compressed).

Adding any information in the "extra" dictionary quickly makes this more expensive; I've added some example rows with different amounts of extra submission ratios.
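
For anyone who wants to reproduce this kind of estimate locally, here is a minimal sketch. It is not the script behind the spreadsheet. The placeholder field values, the compact JSON encoding, and the use of zlib as a stand-in compressor are all assumptions, so the absolute numbers will differ from the ones quoted above.

```python
import json
import zlib


def make_event(i):
    # Hypothetical placeholder values; real events carry real names and
    # may be longer, so treat the result as illustrative only.
    return [i * 1000, "category", "method", "object", "value", None]


def estimate_sizes(n_events):
    """Return (raw_bytes, compressed_bytes) for n_events serialized as JSON."""
    events = [make_event(i) for i in range(n_events)]
    raw = json.dumps(events, separators=(",", ":")).encode("utf-8")
    # zlib is only a stand-in here because it ships with Python.
    return len(raw), len(zlib.compress(raw))


if __name__ == "__main__":
    for n in (1000, 10000):
        raw, compressed = estimate_sizes(n)
        print(f"{n} events: raw={raw} bytes, compressed={compressed} bytes")
```

Swapping make_event for a version that fills in an "extra" dictionary shows how quickly the per-event cost grows.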

There are different parts in the whole pipeline where this might be problematic:
- client side storage
- client bandwidth & upload times
- pipeline storage? (although this should compress away well?)
- worker processing?

As-is, I think we can't use this on any population without strict limits in place.
We can probably talk about different approaches to this:
- optimization/compression of event submissions
- population sampling (1% of clients only)
- different limits & policies for release & pre-release
- accepting hard cutoff after reaching limit of N events
- ...?
I added data on how events impact ping size, raw & compressed, based on an opt-out & opt-in sample ping:
https://docs.google.com/spreadsheets/d/1o1ZLfiEEj1nA0ViKA67PAP2q8adzmEzQt4BT-gsyPsA/

The script to generate this is here:
https://gist.github.com/georgf/989d484da9b75bc86eb858dfe02b3768
We discussed the initial options here in a smaller group:
https://docs.google.com/document/d/1hxpqQefc2QiIdZlNhaIAd3GpgN9VbABD71q96nh9Xec/

While we have good options to do things more cleverly in the medium to longer term, there is a short-term path we are taking for Fx52:
* cap recording at N=1000 events per subsession
* only pre-release for now, disable recording on release
* no sampling for now as we don't go to release
* ride this on fx52, including at least the initial search probe (bug 1316281)
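
To illustrate what a hard cap per subsession could look like, here is a minimal sketch. The EventRecorder class, its method names, and the dropped counter are hypothetical and are not the Firefox implementation. It only shows the policy of recording up to N events and discarding the rest until the next subsession starts.

```python
class EventRecorder:
    """Hypothetical recorder that keeps at most `limit` events per subsession."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.events = []
        self.dropped = 0  # number of events discarded after hitting the cap

    def record(self, event):
        """Store the event if under the cap; return whether it was stored."""
        if len(self.events) >= self.limit:
            self.dropped += 1
            return False
        self.events.append(event)
        return True

    def start_new_subsession(self):
        """Return the events recorded so far and reset the cap."""
        events, self.events = self.events, []
        self.dropped = 0
        return events
```

Counting dropped events is an assumption added here; it would make it possible to see how often clients hit the limit.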

There will be another meeting with more people about the next steps from there, for which we need to:
* enable others to make budget decisions and state requirements better
* summarize options and concerns to make them more actionable
* build out size estimation with raw worst case & snappy compression
* estimate some expected event impact based on engagement measurements
Blocks: 1316810
Points: 2 → 3
We have more updated numbers in the notes in [1] and settled on roughly the following:
* collect events only from a sample of the population (bug 1320716)
* hard-limit event collection (1000 for built-in events on pre-release for now, bug 1316810)
* limit event collection to pre-release until we are confident about the mechanism (bug 1319102)
* on release, use a much lower limit for built-in events (e.g. 100, bug 1320713)
* allow dynamic event registration to override with higher limits (commented on bug 1302681)
  - this would be for smaller studies or experiments, which would not have as broad an impact
* we will monitor impact on telemetry sending via a minimal health ping (bug 1318297)
* we will need a tool that makes it easy to estimate the budget impact of new event collections (bug 1320711)

1: https://docs.google.com/document/d/1QJhCnuBWR5xVc0zegXDoFXBdswCGIcDBYcQl8DS-UJI/
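
As a rough illustration of how sampling, per-channel limits and overrides could combine, here is a minimal sketch. Everything in it is an assumption: the function names, hashing the client ID into 100 buckets, and letting an override take effect only for sampled clients. The limits of 1000 and 100 are the example values from the list above.

```python
import hashlib

# Example limits taken from the list above; the channel names are assumptions.
LIMITS = {"pre-release": 1000, "release": 100}


def in_sample(client_id, sample_percent):
    """Deterministically place a client into one of 100 buckets."""
    digest = hashlib.sha256(client_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < sample_percent


def event_limit(channel, client_id, sample_percent, override=None):
    """Return how many events a client may record (0 if not sampled)."""
    if not in_sample(client_id, sample_percent):
        return 0
    if override is not None:
        # Hypothetical: a dynamically registered study supplies its own limit.
        return override
    return LIMITS[channel]
```

Hashing the client ID keeps a client's sampling decision stable across sessions, which is one possible design and not necessarily the one chosen in bug 1320716.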
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED