1440673 - Add the ability to report a summary of Telemetry events

Reporter

Description

7 years ago

Currently Telemetry supports instrumenting data collection using events, which are reported in the "main" ping. These events can be used to easily measure engagement with various parts of the browser. One drawback is that they quickly add up in terms of storage and transmission cost.

Instead of event data reporting being simply "on" or "off", we could introduce a "summary" option. This would report scalar *counts* of events, rather than the full sequence of events.

Event data would be recorded as usual, but at report time, the array of event objects would be digested into a keyed scalar for "event counts", with the key being "<event category>.<event method>.<event object>" and the value being the count of events of this type.

So if event reporting is "on", we send the full sequence. If it is "off", we send no event data. If it is "summary", we send the scalar counts of events only.

This would allow efficiently answering questions like:
- What is the MAU and DAU for any given feature that is instrumented with events?
- What fraction of clients use UI feature X?
- What is the level of engagement with feature X?
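A minimal sketch of the digest described above, in Python for illustration only. The dictionary shape of each event record and the sample values are assumptions made for this example; they are not the format Telemetry uses internally.

```python
from collections import Counter


def summarize_events(events):
    """Digest a sequence of event records into per-type counts.

    Each record is assumed (for this sketch) to be a dict with
    'category', 'method' and 'object' entries.
    """
    counts = Counter()
    for event in events:
        # Key format proposed in the description: category.method.object
        key = f"{event['category']}.{event['method']}.{event['object']}"
        counts[key] += 1
    return dict(counts)


# Hypothetical sample data: three recorded events, two of the same type.
events = [
    {"category": "navigation", "method": "search", "object": "urlbar"},
    {"category": "navigation", "method": "search", "object": "urlbar"},
    {"category": "navigation", "method": "search", "object": "searchbar"},
]
summary = summarize_events(events)
```

In "summary" mode only `summary` would be sent, so the payload grows with the number of distinct event types rather than with the number of events.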

Georg Fritzsche [:gfritzsche]

Comment 1

7 years ago

(In reply to Mark Reid [:mreid] from comment #0)
> Event data would be recorded as usual, but at report time, [...]

Quick note: This would have memory impact, we should probably not record the events into storage.

Georg Fritzsche [:gfritzsche]

Updated

7 years ago

Priority: -- → P2

Georg Fritzsche [:gfritzsche]

Comment 2

7 years ago

We need some more time to go through the client design implications.

Chris H-C :chutten

Assignee

Comment 3

7 years ago

I'll take the design aspect this iteration.

Assignee: nobody → chutten

Status: NEW → ASSIGNED

Priority: P2 → P1

Chris H-C :chutten

Assignee

Comment 4

7 years ago

Storage
-------
The way I see it, we have two options for storage:

Option 1 - Scalar taxonomy
We reserve a portion of the scalar namespace (say telemetry.event.counts.*) for "category_method_object" uint scalars.
Pros:
- Plain scalars, which have better tooling than keyed scalars
Cons:
- Bit of a pain to implement.
- 40-character limit for combined `${category}_${method}_${object}` scalar name means we'll likely be truncating things (30 (category) + 20 (method) + 20 (object) + 2 (underscores) > 40)
- The typical scalar name taxonomy recommends underscore delimiters, which are likely to be used within category, method, and objects: results in ambiguous parsing.
- '.' is allowed in a category, method, or object but is not permitted in scalar names

Option 2 - Keyed scalar
We define a single keyed uint scalar (say telemetry.event_counts) whose keys are `${category}|${method}|${object}` (or similar, using a non-period non-underscore delimiter).
Pros:
- We're allowed any characters we want, so unambiguous parsing
- Cheap to implement
- 70-character limit for combined key string means we're unlikely to truncate (30 (category) + 20 (method) + 20 (object) + 2 (delimiters) = 72)
Cons:
- Tooling's not as good (no automated alerts from cerberus+medusa, queries get a wee bit more difficult to write, TMO doesn't nicely display keyed things in its current state (shows only the most popular 4))
- Limit of 100 keys per subsession means we have a limit of how many events we can summarize

Implementation
--------------
Regardless of storage, we're using scalars to count uses of Telemetry Events. This means we will be instrumenting RecordEvent to increment scalars even if that category's event recording is disabled. I do not think we will want to record the events themselves, even to throw them away later, but simply increment a scalar and move on.

This will be release collection and will not expire.

We will want a little bit of introspective Telemetry on top of this as well. If we go with Option 1 we should consider reporting the number of different category+method+object tuples we map to the same scalar name. If we go with Option 2 we should consider reporting the number of category+method+object tuples we failed to summarize due to key exhaustion.

The original request was to control this through a preference. If we choose Option 1, I recommend we include a preference in case we wish to turn this off. There are ~27^70 possible scalars that could be automatically created should this go awry, and upstream resource requirements could expand as a result. If we choose Option 2, the 100-key limit and single scalar limit the impact of runaway use, so I don't think we will need a panic pref.

In terms of ping size impact, both options will introduce long strings and small numbers to the "main" ping. I do not anticipate this ballooning the "main" ping's size sufficiently to register in the budget dashboards or trigger "ping size exceeds limit" discarding conditions.

We will need to test this summarization, likely within existing Telemetry Events gtests and/or xpcshell tests.

We will need to update the Events documentation to include information on this summarization.

Recommendation
--------------
I recommend Option 2. It should prove quicker to implement, and most of its concerns are with tooling we plan on improving this year.

Target
------
Nightly 61 early enough to be uplifted to mid-Beta 60.

Future
------
:gfritzsche, :mreid - Do you have any questions, concerns, comments, clarifications, or criticisms? If you can get me this feedback before the Event Telemetry meeting tomorrow, I can present a plan at that time and start filing implementation bugs.
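To make the parsing and length arguments concrete, here is a small Python sketch (illustration only). The sample category, method, and object values are made up, and the sketch assumes the pipe character never appears inside a category, method, or object.

```python
# Length limits as quoted in the comment above.
MAX_CATEGORY, MAX_METHOD, MAX_OBJECT = 30, 20, 20


def make_key(category, method, obj, delimiter):
    """Join an event tuple into one string using the given delimiter."""
    return delimiter.join((category, method, obj))


# Option 1 style: underscores also occur inside the parts, so two
# different tuples collapse to the same name.
underscore_a = make_key("ui_click", "open", "menu", "_")
underscore_b = make_key("ui", "click_open", "menu", "_")
collide = underscore_a == underscore_b

# Option 2 style: a delimiter that cannot occur inside the parts keeps
# the tuples distinct and lets the key be split back into its parts.
pipe_a = make_key("ui_click", "open", "menu", "|")
pipe_b = make_key("ui", "click_open", "menu", "|")
roundtrip = tuple(pipe_a.split("|"))

# Longest possible key: three full-length parts plus two delimiters.
longest_key = MAX_CATEGORY + MAX_METHOD + MAX_OBJECT + 2
```

`longest_key` comes to 72, which is 2 more than the 70-character key limit mentioned for Option 2.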

Flags: needinfo?(mreid)

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 5

7 years ago

Jan-Erik, with your fresh perspective, does this design make sense?

Flags: needinfo?(jrediger)

Georg Fritzsche [:gfritzsche]

Comment 6

7 years ago

This looks good. A few things here:
1) We can use an internal flag for recording the keyed scalars, so we can skip the key length limit. This can be "private" as we don't need to use the "public" API in Telemetry.h.
2) How does this affect dynamic/addon events?
3) We need to make sure search events data is not showing up in the aggregator.
4) Related to 2) & 3), we'll need to have conversations about whether we can show all dynamic events on the TMO dashboard. Are some of them non-public, like when running search experiments?

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 7

7 years ago

How can I tell which process the event came from?

Mark Reid [:mreid]

Reporter

Comment 8

7 years ago

Option 2 sounds good to me! Should we consider increasing the max length of keys to 72 characters to avoid possible truncation? (Note I saw Georg's suggestion after I wrote this - that sounds fine too). Are the planned improvements to tooling around keyed scalars already captured by existing bugs?

Flags: needinfo?(mreid)

Jan-Erik Rediger [:janerik]

Comment 9

7 years ago

Option 2 indeed sounds doable. I'd go with the slightly increased length limit as well to avoid truncation (given it's 2 characters only). Do we have a rough idea how quickly we would hit the 100-item limit or is that to be found out once we have the data collected?

Flags: needinfo?(jrediger)

Chris H-C :chutten

Assignee

Comment 10

7 years ago

(In reply to Georg Fritzsche (slow to respond) [:gfritzsche] from comment #6)
> 1) We can use an internal flag for recording the keyed scalars, so we can
> skip the key length limit. This can be "private" as we don't need to use the
> "public" API in Telemetry.h.

(In reply to Mark Reid [:mreid] from comment #8)
> Should we consider increasing the max length of keys to 72 characters to
> avoid possible truncation? (Note I saw Georg's suggestion after I wrote this
> - that sounds fine too).

(In reply to Jan-Erik Rediger [:janerik] from comment #9)
> I'd go with the slightly increased length limit as well to avoid truncation
> (given it's 2 characters only).

Okay, okay, I get it! :D

> 2) How does this affect dynamic/addon events?

This treats them identically to static events.

> 3) We need to make sure search events data is not showing up in the
> aggregator.

Javaun's starting an email thread with Legal and BD to see if this is actually necessary.

> 4) Related to 2) & 3), we'll need to have conversations if we can show all
> dynamic events on the TMO dashboard. Are some of them non-public like when
> running search experiments?

You're right. At present Events has no expectation of publication. It would be surprising to suddenly have those event counts public.

(In reply to Georg Fritzsche (slow to respond) [:gfritzsche] from comment #7)
> How can I tell which process the event came from?

At present, you can't. This might not be a problem for most uses, but we can prepend the process and another delimiter to the key (since we'll be extending the key length anyway) to make it possible to tell.

(In reply to Mark Reid [:mreid] from comment #8)
> Are the planned improvements to tooling around keyed scalars already
> captured by existing bugs?

TMO's improvements are still in planning documents. Bugs will come later.

(In reply to Jan-Erik Rediger [:janerik] from comment #9)
> Do we have a rough idea how quickly we would hit the 100-item limit or is
> that to be found out once we have the data collected?

It'll be easier to find out post-facto, especially since we don't have a whole lot of static events coming in. I mean, I've counted them on the existing events dataset (https://sql.telemetry.mozilla.org/queries/51963/source) and the max there is under twenty... but at present we're hardly using Events at all. I expect the use of Events will grow, and I'll be able to assume that this summary thing will be one of the reasons why :)

---

Thank you, all, for your prompt and helpful input!

I will proceed with Option 2 with three addenda:

1) Extend the key length for either all keyed scalars or just this keyed scalar to be just big enough to fit everything including the delimiters. I'd prefer to keep this added capability to just the events summary, but we'll see what's clearest when we get into the bits.

2) The key will now be `${process}|${category}|${method}|${object}` to identify from which process the event came. This is definitely not a concern for existing static events (search), but there's nothing saying it couldn't be important to dynamic events or future static events.

3) Forbid python_mozaggregator from aggregating or publishing (or both) the events summary scalar. We may be able to relax this requirement later, but for now it seems prudent to begin by limiting analysis to internal tools.
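The process-prefixed key from addendum 2 could be built and parsed along these lines. This is a Python sketch for illustration; the `EventKey` type, the helper names, and the sample values are hypothetical, and it assumes the delimiter never occurs inside any of the four parts.

```python
from typing import NamedTuple

DELIMITER = "|"  # the delimiter proposed in this comment


class EventKey(NamedTuple):
    """Hypothetical container for the four parts of a summary key."""
    process: str
    category: str
    method: str
    obj: str


def encode(key: EventKey) -> str:
    """Build the `${process}|${category}|${method}|${object}` string."""
    return DELIMITER.join(key)


def decode(text: str) -> EventKey:
    """Split a summary key back into its parts.

    Raises TypeError if the text does not contain exactly four parts.
    """
    return EventKey(*text.split(DELIMITER))


encoded = encode(EventKey("content", "navigation", "search", "urlbar"))
decoded = decode(encoded)
```

With the process as the first field, an analysis can group counts per process by splitting on the first delimiter only.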

Georg Fritzsche [:gfritzsche]

Comment 11

7 years ago

(In reply to Chris H-C :chutten from comment #4)
> - Limit of 100 keys per subsession means we have a limit of how many events
> we can summarize

What are we protecting against with this limit? Do we need this?
Is this concern different from how we think about e.g. dynamic scalars?

Georg Fritzsche [:gfritzsche]

Comment 12

7 years ago

(In reply to Chris H-C :chutten from comment #10)
> I will proceed with Option 2 with three addenda:

Can you summarize with an updated plan for easier reading?

> 2) The key will now be `${process}|${category}|${method}|${object}` to
> identify from which process the event came. This is definitely not a concern
> for existing static events (search), but there's nothing saying it couldn't
> be important to dynamic events or future static events.

We already have an existing internal naming pattern of "category#method#object" in `UniqueEventName()`. Should we re-use that pattern?

Chris H-C :chutten

Assignee

Comment 13

7 years ago

(In reply to Georg Fritzsche (slow to respond) [:gfritzsche] from comment #11)
> (In reply to Chris H-C :chutten from comment #4)
> > - Limit of 100 keys per subsession means we have a limit of how many events
> > we can summarize
>
> What are we protecting against with this limit? Do we need this?
> Is this concern different from how we think about e.g. dynamic scalars?

We're protecting against accidentally ballooning the size of "main" pings with too many keys. The idea being, I presume, that dynamic scalars will be sent by fewer users than All Of Them whereas scalars in Scalars.yaml will indeed be sent by All Of Them. The event summary falls into the All Of Them category, so having this protection is probably worth potential data loss (especially since this is "just" a summary).

(In reply to Georg Fritzsche (slow to respond) [:gfritzsche] from comment #12)
> (In reply to Chris H-C :chutten from comment #10)
> > I will proceed with Option 2 with three addenda:
>
> Can you summarize with an updated plan for easier reading?

I'll put that in the next comment, sure.

> > 2) The key will now be `${process}|${category}|${method}|${object}` to
> > identify from which process the event came. This is definitely not a concern
> > for existing static events (search), but there's nothing saying it couldn't
> > be important to dynamic events or future static events.
>
> We already have an existing internal naming pattern of
> "category#method#object" in `UniqueEventName()`. Should we re-use that
> pattern?

I'm happy to take nearly any letter-width, printing, one-byte delimiter that isn't a dot or a hyphen. '#' fits the bill, so I'll gladly use it instead of pipe.

Chris H-C :chutten

Assignee

Comment 14

7 years ago

THE PLAN
--------
Define a single keyed uint scalar to count (process, category, method, object) tuples that want to be recorded. These will be counted whether that category has been enabled to record or not. These will include `dynamic` scalars. These counts will be reported on all channels (opt-out).

This will necessitate a few extra things that might not be immediately apparent:
- Extend scalar keys' character limit for either all keys scalars-wide or just these keys so we can fit the whole (process, category, method, object) tuple in a string with delimiters without truncation.
- Forbid python_mozaggregator from aggregating this particular scalar, as events consumers presently do not expect public publication of their data. The data will still be available in ATMO and sql.tmo as other non-aggregated data is, so it'll still be very useful.
- Count and report the number of event tuples we can't count because we've reached the "maximum number of keys (100)" limit. This can and should be aggregated by python_mozaggregator to enable regression detection (cerberus) and general "just looking at things on TMO to see if they're working" use cases.

...I think that's about it.
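The counting part of the plan can be sketched as follows. This is a Python illustration under stated assumptions, not the Gecko implementation: the `EventSummary` class and its attribute names are hypothetical, the '#' delimiter follows comment 13, and the overflow counter here counts every recording attempt that was refused (the plan does not say whether refused attempts or distinct refused tuples should be counted).

```python
class EventSummary:
    """Hypothetical keyed counter with a cap on the number of keys."""

    def __init__(self, max_keys=100, delimiter="#"):
        self.max_keys = max_keys
        self.delimiter = delimiter
        self.counts = {}   # key string -> number of recordings
        self.dropped = 0   # recordings refused because the cap was hit

    def record(self, process, category, method, obj):
        """Count one event, whether or not its category is enabled."""
        key = self.delimiter.join((process, category, method, obj))
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.max_keys:
            self.counts[key] = 1
        else:
            # Key limit reached: do not add a key, but keep a tally so
            # that the loss itself can be reported.
            self.dropped += 1


# Usage with a tiny cap so the overflow path is exercised.
summary = EventSummary(max_keys=2)
summary.record("parent", "a", "b", "c")
summary.record("parent", "a", "b", "c")
summary.record("content", "a", "b", "c")
summary.record("parent", "x", "y", "z")  # third distinct key: refused
```

Existing keys keep counting after the cap is reached; only new keys are refused.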

Georg Fritzsche [:gfritzsche]

Comment 15

7 years ago

(In reply to Chris H-C :chutten from comment #13)
> (In reply to Georg Fritzsche (slow to respond) [:gfritzsche] from comment #11)
> > (In reply to Chris H-C :chutten from comment #4)
> > > - Limit of 100 keys per subsession means we have a limit of how many events
> > > we can summarize
> >
> > What are we protecting against with this limit? Do we need this?
> > Is this concern different from how we think about e.g. dynamic scalars?
>
> We're protecting against accidentally ballooning the size of "main" pings
> with too many keys. The idea being, I presume, that dynamic scalars will be
> sent by fewer users than All Of Them whereas scalars in Scalars.yaml will
> indeed be sent by All Of Them.

The current first stage of Project Savant (Firefox UI metrics) is looking to add >70 events; other events might get added in parallel (Shield, Lockbox, DevTools, ...).
I'm worried that this limit is too low to start with and we need to raise it (just for the events summary counts).

Chris H-C :chutten

Assignee

Comment 16

7 years ago

Dynamic events have the interesting property of being enabled just before they're recorded, so come to think of it they might not benefit from summarization. Is Savant going to dynamically-register, or are they static? What would you think if I made a further change to the plan and disallowed summarization of dynamic events, due to them always being enabled when recording? That would give us more space in the summary, regardless of what we decide wrt limits.

Georg Fritzsche [:gfritzsche]

Comment 17

7 years ago

(In reply to Chris H-C :chutten from comment #16)
> Dynamic events have the interesting property of being enabled just before
> they're recorded, so come to think of it they might not benefit from
> summarization. Is Savant going to dynamically-register, or are they static?

It's registering dynamically for the first iteration, but I assume this might become standard Firefox instrumentation and grow.

> What would you think if I made a further change to the plan and disallowed
> summarization of dynamic events, due to them always being enabled when
> recording? That would give us more space in the summary, regardless of what
> we decide wrt limits.

I think that's breaking the promise of "we'll always summarize your events". The convenient pitch here is to always have your events summarized in the main ping, no matter whether we send them or on what ping.

The 100-key limit is currently in place for the standard keyed scalar usage. For the event summary, keyed scalars are just a serialization/transport detail. We should enable event usage and make it easy to use event summary counts, without thinking much about it.

If we need a limit, let's look at it from the perspective of "what's an acceptable upper bound for the ping size impact".

Chris H-C :chutten

Assignee

Comment 18

7 years ago

(In reply to Georg Fritzsche (away Mar 16 - 26) [:gfritzsche] from comment #17)
> If we need a limit, let's look at it from the perspective of "what's an
> acceptable upper bound for the ping size impact".

I argue we should have a limit. So let's crunch some numbers.

The scalar name is going to be at most 30 bytes.
Each key will be <process name length> + <delimiter length> + 72 (see comment #4) = 83 ("extension" is the longest process name at present)
Value size and punctuation... let's round up to 100 bytes.

100 keys is under 10k.
1000 keys is under 100k.

What would you like a limit to be? We have room: https://mzl.la/2FRHCGL

mreid: in your opinion, what would be a sensible and prudent (maximum) size increase per ping for this feature?
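The back-of-the-envelope estimate can be reproduced with a few lines of Python (illustration only; the constant and function names are made up for this sketch). It assumes one byte per character and a single one-byte delimiter after the process name, and it leaves out the one-time scalar name of at most 30 bytes. Summing the listed parts this way gives 82 characters for the longest key, one fewer than the figure quoted, which may count something not listed here; either way the entry rounds up to the same 100 bytes.

```python
LONGEST_PROCESS_NAME = "extension"  # longest process name per the comment
DELIMITER_BYTES = 1                 # assumed single-byte delimiter
EVENT_PART_BYTES = 72               # category + method + object + 2 delimiters
ENTRY_BUDGET_BYTES = 100            # key + value + punctuation, rounded up

# Longest possible key under the assumptions above.
key_bytes = len(LONGEST_PROCESS_NAME) + DELIMITER_BYTES + EVENT_PART_BYTES


def ceiling_bytes(num_keys):
    """Upper bound on the summary's size for a given number of keys."""
    return num_keys * ENTRY_BUDGET_BYTES


estimates = {n: ceiling_bytes(n) for n in (100, 500, 1000)}
```

Because each entry is budgeted at 100 bytes while the longest key needs fewer, the real size stays below these ceilings (10,000 bytes for 100 keys, 100,000 bytes for 1000 keys).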

Flags: needinfo?(mreid)

Georg Fritzsche [:gfritzsche]

Comment 19

7 years ago

(In reply to Chris H-C :chutten from comment #13)
> > > 2) The key will now be `${process}|${category}|${method}|${object}` to
> > > identify from which process the event came. This is definitely not a concern
> > > for existing static events (search), but there's nothing saying it couldn't
> > > be important to dynamic events or future static events.
> >
> > We already have an existing internal naming pattern of
> > "category#method#object" in `UniqueEventName()`. Should we re-use that
> > pattern?
>
> I'm happy to take nearly any letter-width, printing, one-byte delimiter that
> isn't a dot or a hyphen. '#' fits the bill, so I'll gladly use it instead of
> pipe.

One thing came to mind:
We don't need to put the process name into the key.
Instead we can put the keyed scalar in the right process payloads (payload.processes.{main,...}), matching the semantics we use elsewhere.
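A rough Python sketch of this alternative: the process selects which section of the payload the count goes into, and the key holds only category, method, and object. The scalar name `telemetry.event_counts` is the placeholder floated in comment 4; the `keyedScalars` field name, the helper name, and the sample data are assumptions made for this illustration.

```python
from collections import defaultdict

SCALAR_NAME = "telemetry.event_counts"  # placeholder name from comment 4


def build_payload(recorded):
    """Group event counts under the process that recorded them.

    `recorded` is an iterable of (process, category, method, object)
    tuples; the process is not part of the key.
    """
    per_process = defaultdict(lambda: defaultdict(int))
    for process, category, method, obj in recorded:
        key = "#".join((category, method, obj))
        per_process[process][key] += 1
    return {
        "processes": {
            process: {"keyedScalars": {SCALAR_NAME: dict(counts)}}
            for process, counts in per_process.items()
        }
    }


# Hypothetical recordings from two processes.
payload = build_payload([
    ("main", "navigation", "search", "urlbar"),
    ("main", "navigation", "search", "urlbar"),
    ("content", "navigation", "search", "urlbar"),
])
```

The same event recorded in two processes then appears under the same key in two different sections, and keys stay within the 72-character budget because no process prefix is needed.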

Chris H-C :chutten

Assignee

Comment 20

7 years ago

(In reply to Georg Fritzsche (away Mar 16 - 26) [:gfritzsche] from comment #19)
> One thing came to mind:
> We don't need to put the process name into the key.
> Instead we can put the keyed scalar in the right process payloads
> (payload.processes.{main,...}), matching the semantics we use elsewhere.

That'll require a deeper approach than what I was currently trying, but it would almost certainly be worth it.

Oh, and I bothered :mreid and others about limits at the Events Meeting and we collectively came to the decision that the limit should:
1) Be controlled by a (hidden) pref reported via the Environment (so that we have an off switch and reporting of when it's used)
2) Default to 500 (since 100 is too low, and we're really only trying to save infra while we patch the offender)
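The two decisions could look roughly like this in a Python sketch. The pref name is hypothetical (the comment does not give one), prefs are modelled as a plain dict of user-set values, and treating a value of 0 as the "off switch" is an assumption of this example.

```python
DEFAULT_MAX_SUMMARY_KEYS = 500
# Hypothetical pref name, used for illustration only.
PREF_NAME = "example.telemetry.event_summary_max_keys"


def max_summary_keys(prefs):
    """Return the key limit, falling back to the default of 500."""
    value = prefs.get(PREF_NAME, DEFAULT_MAX_SUMMARY_KEYS)
    # Ignore values that are not non-negative integers (bool is an int
    # subclass in Python, so it is excluded explicitly).
    if isinstance(value, bool) or not isinstance(value, int) or value < 0:
        return DEFAULT_MAX_SUMMARY_KEYS
    return value


def environment_section(prefs):
    """Report the pref only when it was set, so its use is visible."""
    return {PREF_NAME: prefs[PREF_NAME]} if PREF_NAME in prefs else {}


default_limit = max_summary_keys({})
switched_off = max_summary_keys({PREF_NAME: 0})
reported = environment_section({PREF_NAME: 0})
```

With a limit of 0 no key can ever be added, which is what makes the pref usable as an off switch in this sketch.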

Flags: needinfo?(mreid)

Comment hidden (mozreview-request)

bug 1440673 - Fix off-by-one errors in keyed scalars | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Allow scalar keys to be an extra 2 chars long | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Allow TelemetryScalar.h to be included in tests | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Summarize events to a keyed scalar | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Allow changing the max number of keys per-keyed-scalar | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Set the max number of event summary keys by pref | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Test event summary scalar collection | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Permit snapshotting non-parent-process scalars | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Summarize dynamic events to a dynamic scalar | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Test that dynamic events are summarized to a dynamic scalar | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Test Event Summarization in xpcshell | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+
bug 1440673 - Test Event Summary's key limit pref | 7 years ago | Chris H-C :chutten | 59 bytes, text/x-review-board-request | Dexter: review+