Create a "rate" metric type
Categories
(Data Platform and Tools Graveyard :: Glean Metric Types, task, P1)
Tracking
(Not tracked)
People
(Reporter: frank, Assigned: mdroettboom)
Attachments
(1 file, 1 obsolete file: 85 bytes, text/x-google-doc; tdsmith: data-review+)
Proposal for changing an existing or adding a new Glean metric type
Who is the individual/team requesting this change?
Frank Bertsch
Is this about changing an existing metric type or creating a new one?
Creating a new metric type
Can you describe the data that needs to be recorded?
This issue is for recording rates. A rate is essentially two counters: one for the numerator and one for the denominator. For example, application services records the read_query_count and read_query_error_count, and uses them to derive the read query error rate.
It is possible to create this from existing metrics, but the big win here would be enabling this data in GLAM and having solid ways of querying it in BQ.
Things to consider:
- Can we use existing counters as parts of the rate? i.e. define a rate, with the denominator being an already-collected counter.
- Can we define the rate abstraction on top of existing metrics, without ever implementing it in the ping? i.e. use counters for numerator and denominator. (The difficulty is how to query with this.)
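A minimal sketch of the idea in Python, assuming nothing about the eventual Glean API: the class name `Rate` and its methods are hypothetical, and only the counter names read_query_count and read_query_error_count come from the example above. It keeps the numerator and denominator as two integer counters and derives the rate only when asked.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rate:
    """Hypothetical rate metric: two integer counters, not a stored float."""

    numerator: int = 0
    denominator: int = 0

    def add_to_numerator(self, amount: int = 1) -> None:
        self.numerator += amount

    def add_to_denominator(self, amount: int = 1) -> None:
        self.denominator += amount

    def value(self) -> Optional[float]:
        # The rate is undefined when nothing has been counted yet.
        if self.denominator == 0:
            return None
        return self.numerator / self.denominator


# Numerator plays the role of read_query_error_count,
# denominator the role of read_query_count.
read_query_error_rate = Rate()
for query_failed in [False, False, True, False]:
    read_query_error_rate.add_to_denominator()
    if query_failed:
        read_query_error_rate.add_to_numerator()

print(read_query_error_rate.value())  # 0.25
```

Because the two counts are kept separately, a consumer can still tell a rate built on four queries from one built on four million.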
Can you provide a raw sample of the data that needs to be recorded (this is in the abstract, and not any particular implementation details about its representation in the payload or the database)
Here's another - % of page loads that are reader mode capable. Fraction of url bar presses that search.
What is the business question/use-case that requires the data to be recorded?
See above.
How would the data be consumed?
- GLAM dashboard especially would show rates with client experience of each rate.
- BQ querying would have some nice UDFs to make this easy.
Why are existing metric types not enough?
Lots of custom SQL, cannot see these in GLAM.
What is the timeline by which the data needs to be collected?
No timeline.
Comment 1•5 years ago
This is the discussion document for the new metric type proposal.
Comment 2•5 years ago
Mike, can you designate the initial set of people required to work on the design stage of this?
Comment 3•5 years ago
Note: Let's involve Emilio, who's been working on CSS use counters, since this is the metric type that will power them in Firefox after FOG is completed.
Comment 4 (Assignee)•5 years ago
Assigning :travis_ for the design work.
Comment 5•5 years ago
Hey Emilio, Dexter thought you should be involved due to the CSS use counters that you have been working on. I thought I would ni? you and see if you have any input on the design of a rate/ratio metric type for Glean.
<edit> It has been brought to my attention that it might be a little early for this input... clearing the ni? and will ask for comments later. Sorry for the noise. :)
Comment 6•5 years ago
Hey Frank, I'm thinking about the design for this and I had questions:
- From a query/parsing perspective, does it make sense to transmit this as a floating point type (i.e. calculated by the client), or would it be more useful/convenient to see the numerator and denominator as distinct in the data?
- If we did use a floating point type, is there a preference on width? I'm guessing that a Float64 would be the most compatible with BigQuery, but I don't want to make things difficult if another type/format would work better for ingestion.
Comment 7 (Reporter)•5 years ago
> - From a query/parsing perspective, does it make sense to transmit this as a floating point type (i.e. calculated by the client), or would it be more useful/convenient to see the numerator and denominator as distinct in the data?
> - If we did use a floating point type, is there a preference on width? I'm guessing that a Float64 would be the most compatible with BigQuery, but I don't want to make things difficult if another type/format would work better for ingestion.
We definitely want to transfer this as two ints. Otherwise we have no way to aggregate this on our end (unless we include a weight, but I would consider that more confusing, and not as accurate).
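To illustrate why two integers aggregate where a client-computed float does not, here is a small Python sketch. The per-client numbers are made up for illustration, and `pooled_rate` and `mean_of_rates` are hypothetical helper names, not part of any existing tooling.

```python
from typing import List, Optional, Tuple

# (numerator, denominator) per client; illustrative values only.
clients: List[Tuple[int, int]] = [(1, 1), (5, 100), (0, 10)]


def pooled_rate(pairs: List[Tuple[int, int]]) -> Optional[float]:
    """Sum numerators and denominators across clients, then divide once."""
    total_num = sum(n for n, _ in pairs)
    total_den = sum(d for _, d in pairs)
    return total_num / total_den if total_den else None


def mean_of_rates(pairs: List[Tuple[int, int]]) -> Optional[float]:
    """The only aggregate available if each client sends just its own float."""
    rates = [n / d for n, d in pairs if d]
    return sum(rates) / len(rates) if rates else None


print(pooled_rate(clients))
print(mean_of_rates(clients))
```

With these numbers the pooled rate is 6/111 (about 5%), while the unweighted mean of per-client rates is about 35%, because the client with a single event counts as much as the client with a hundred. Recovering the pooled figure from floats alone would require shipping a weight, which is the extra field the comment above argues against.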
Comment 8•5 years ago
DS comments are here: https://docs.google.com/document/d/1PLLVQTRmmz18sAZKsvylggNT1uYcHEQhP-hklIHLSM4/edit
I could not edit the gdoc proposal to add this link to that.
Comment 9•5 years ago
Copied from above google doc
The assumption is that the thing being measured is A/B. Storing just a computed ratio without context can be misleading. And if one needs to store both A and B:
- how is this better than the DS accessing A and B in their original locations?
- Will this improve querying ergonomics?
A type that can handle something beyond times and counts such as rates is a nice thing to have. But merely reporting rates is not enough. A rate needs context.
- The ratio R = A/B can be NA/undefined when B is zero (for example, when A and B are both zero). There might be different handling to compute the ratio when B is zero, so the DS would need to know the value of B.
- If R is negative (bad data), why? The cause could be A or B, so the DS needs access to both A and B.
- How to aggregate R across pings? A simple average does not weight clients; if you want to weight clients, you’d need to know B.
- Also 1/1 is not the “same” as 50/50
- In many cases, B is the number of samples so encodes information about the variance of the rate
- FLOAT types are useful for things that are naturally floats rather than computed ratios, e.g. statistical model coefficients (think federated learning, correlation coefficients, functions that return floats, etc.), but this discussion is not about float types but about ratios.
- I don’t understand why we can’t have this in GLAM. Is it ingesting events directly with no post-processing? That doesn’t feel right. It would provide more flexibility unless you want this all ‘for free’. (not sure who said this)
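The bullets above noting that 1/1 is not the “same” as 50/50, and that B carries information about the variance of the rate, can be made concrete with a short Python sketch. The Wilson score interval used here is my own choice of illustration, not something proposed in this thread, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Optional, Tuple


def wilson_interval(
    successes: int, trials: int, z: float = 1.96
) -> Optional[Tuple[float, float]]:
    """Wilson score interval for a proportion; None when there are no trials."""
    if trials == 0:
        # R = A/B is undefined when B is zero, so no interval either.
        return None
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / trials + z2 / (4 * trials * trials)
    )
    return (centre - half, centre + half)


# Both ratios equal 1.0, but the denominators tell very different stories.
low_small, _ = wilson_interval(1, 1)
low_large, _ = wilson_interval(50, 50)
print(low_small, low_large)
```

Under these assumptions the lower bound for 1/1 comes out at about 0.21, while for 50/50 it is about 0.93. A bare float of 1.0 would hide that difference entirely.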
So if the desire is to report a rate, why not merely query the numerator and denominator and compute rates remotely? Per Frank’s comments, the motivation appears to be ergonomic (the difficulty of writing SQL): the purpose is for GLAM to read these metrics directly and plot them. This definitely has merit, though the GLAM team needs to be aware of the bullets raised above.
With that said, storing tuples of numbers can make sense (in other words, a ‘struct’ type). One use for that would be computing rates, and having the information in the struct would solve the issues raised above. I think using the word ‘rate’ is too specific, but a struct type might address general needs, i.e. storing related bits of information in one easily accessible place. For example, in some DS explorations, storing a struct of the sum of a histogram’s values and the total count, rather than just the mean, is useful. It solves the ergonomics and the GLAM use case described above (they could include the ratio as an element in the struct, but this means Glean needs to support floats).
[tdsmith] Having a way to encode that a specific counter is the denominator for another counter sounds useful; it’s often ambiguous in desktop telemetry. Creating a new data type representing a pair of integers feels like a loss of generality; these are all integers representing a count. I allege that there’s often a many-to-one relationship between numerators and a denominator. Some other kind of annotation that a specific count is a subset of another might provide the same benefit as a new data type.
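One way to picture the annotation idea in the comment above is the following Python sketch. It is purely illustrative: `CounterDef`, its `denominator` field, and the `read_query_slow_count` counter are hypothetical names invented here, and nothing in it reflects how Glean metrics are actually defined.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CounterDef:
    """Hypothetical counter definition with an optional denominator link."""

    name: str
    # Name of the counter this one is a subset of, if any.
    denominator: Optional[str] = None


# Several numerators can point at the same denominator (many-to-one).
definitions = [
    CounterDef("read_query_count"),
    CounterDef("read_query_error_count", denominator="read_query_count"),
    CounterDef("read_query_slow_count", denominator="read_query_count"),
]


def rates(
    counts: Dict[str, int], defs: List[CounterDef]
) -> Dict[str, Optional[float]]:
    """Derive one rate per annotated counter from plain integer counts."""
    result: Dict[str, Optional[float]] = {}
    for d in defs:
        if d.denominator is None:
            continue  # plain counter, not a numerator
        total = counts.get(d.denominator, 0)
        result[d.name] = counts.get(d.name, 0) / total if total else None
    return result
```

In this shape every value in the payload stays an ordinary counter, and the numerator-to-denominator relationship lives only in the definitions, so the shared denominator is recorded once.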
Comment 10•5 years ago
(In reply to "Saptarshi Guha[:joy]" from comment #8)
> DS comments are here: https://docs.google.com/document/d/1PLLVQTRmmz18sAZKsvylggNT1uYcHEQhP-hklIHLSM4/edit
> I could not edit the gdoc proposal to add this link to that.
You now have write-access. Sorry for the inconvenience.
Comment 11•5 years ago
I've added the design to the main document. Untaking this bug for the next contributor's portion.
Comment 12•5 years ago
Hey folks,
the discussion document is in design stage and the design was completed by Travis Long.
Please review the design and flag any major concern on it.
Left my comments there and flagged travis for questions.
Comment 16•5 years ago
Hey folks,
the document moved from design to comment stage. It is ready for one final look. Final feedback due by June 5th, 2020.
If that looks good to you, please sign off at the top of the document.
Comment 17•5 years ago
Hi Tim,
we need data-steward review for the attached proposal. Please check the related Data-Steward section at the top of the document. More information about this process here.
Comment 18•5 years ago
hm, wonder why it decided the url was a binary attachment. here's another go; &shrug;
One last concern about the error reporting, but otherwise no blocker from my side.
Comment 21•5 years ago
Hey Mike, looks like everybody signed-off on this document. It's your turn now to decide if the proposal is approved or rejected. Please update the document accordingly and also comment on the bug with the outcome. Thanks!
Comment 22 (Assignee)•5 years ago
Approved! Bug 1645166 was created to track its implementation.