Should metrics be scrutable on their own inside a ping or can they reference external information?
Categories
(Data Platform and Tools :: Glean: SDK, defect, P3)
Tracking
(Not tracked)
People
(Reporter: chutten, Unassigned)
References
Details
(Whiteboard: [telemetry:glean-rs:m?][schema-v2])
The rate metric type is in implementation (bug 1645166) which brings up a couple of design questions. This is the first (of several) metrics that may (or must) depend on other metrics in the payload for full information.
An example rate css.property_used_per_doc with an external denominator might be a count of the uses of a given css property per loaded document. The denominator of "count of documents" is shared between all use counters, and is itself a counter named dom.documents_destroyed.
We could send a payload where css.property_used_per_doc copies the value of dom.documents.destroyed into its denominator. This is how the design is written and ensures the full information for understanding the metric is present inside the metric's own column in the dataset.
Or we could send a payload where css.property_used_per_doc implicitly references its denominator and contains no value for denominator in its column. In analysis, the analyst will have to use external information to use this data.
(( This is further complicated by how exactly to handle rates with external denominators when the numerator is 0, but let's keep it short ))
This concern isn't limited to rate. Depending on how the design portion of struct goes, we might need an answer sooner rather than later.
I think we should develop a design doctrine that permits payload encodings to have external referents. At the same time it should set limits on just how external the references can be so that we don't make understanding the data any harder than it has to be.
For example, maybe excluding external denominators in the payload is only permissible if we can write a UDF to make analysis easier.
Comment 1•5 years ago
|
||
IIRC, one of the reasons for the current design is that the resolution happens in the client, and everything downstream of that can be straightforward, including the SQL for an analysis.
If we do anything else, we should definitely involve platform in the discussion, both to make sure things are possible, and (as is often the case) there may be magic we can do that is a win/win on performance and usability.
Or we could send a payload where css.property_used_per_doc implicitly references its denominator and contains no value for denominator in its column. In analysis, the analyst will have to use external information to use this data.
Strictly speaking, the reference to the denominator doesn't have to be in the payload at all -- it is static at metric definition time. That would create a few issues if we allowed the denominator reference to change for a given metric, of course. But I struggle to see the upside of sending a reference to another column in the payload without just putting its value there. It feels like the potential improvement would be not putting it in the payload at all.
:frank ni? to get the platform perspective and maybe identify others who have an interest in this space?
Comment 2•5 years ago
|
||
Strictly speaking, the reference to the denominator doesn't have to be in the payload at all -- it is static at metric definition time. That would create a few issues if we allowed the denominator reference to change for a given metric, of course. But I struggle to see the upside of sending a reference to another column in the payload without just putting its value there. It feels like the potential improvement would be not putting it in the payload at all.
I concur with Mike - the overheard of maintaining and adjusting for internal relationships is probably not worth it.
External, however, is something we could probably handle. I'm not necessarily convinced that's worth it, however, because I see two ways of integrating that externally defined data into BQ:
- Generate SQL of some sort that includes this. For example, a UDF that is generated from the source code that identifiers the field and can get the denominator
- Push the external data into a separate table and do a join
Both of these have plenty of flaws, in addition to the headache of changing those values for new builds. For now we should probably stick to what's simple, and only consider these alternatives if we're convinced we need to save those bytes in the payload (or is there another advantage I'm missing?)
| Reporter | ||
Comment 3•5 years ago
|
||
(In reply to Michael Droettboom [:mdroettboom] from comment #1)
Strictly speaking, the reference to the denominator doesn't have to be in the payload at all -- it is static at metric definition time.
Sorry, I wasn't clear. By external referent I meant "not being in the payload". e.g. analysts need to read the docs, or UDFs need to get rather clever.
As for benefits:
On the SDK side, pulling the correct value of an external denominator is very tricky. The database doesn't work that way (it likes working metric by metric). This complicates the client side because each update to dom.documents_destroyed needs to be mirrored to each and every rate's denominator so that the value in the db is always correct. (multiply by 2093 use counters). The advantage here is cpu time, db space, and code complexity in the client. But I have to write it this way for now anyway, so it's a sunk cost.
Another advantage is in bytes of the payload. Sending "denominator": 163214, isn't free. And it isn't necessary, since we're sending "dom.documents_destroyed": 163214 no matter what. Now, with compression, it's pretty close to free. So I'm not too worried. (doubly not worried about space in dataset columns)
But this is just for rate. When we hit struct we're going to be making a metric that is 100% made up of other metrics. A reasonable implementation of this would be to have specially-named metrics in all current parts of the payload, reconstructed into a structlike format via UDF in analysis. (For a struct named category.metric with a string field s and a counter field c, maybe we put in metrics.strings["struct.category.metric.s"] and metrics.counters["struct.category.metric.c"]). Or we'll have to come up with a part of the payload that can contain arbitrarily-typed data in named fields just so that all of the information is present completely within a given column of a dataset.
Even if we get arbitrary JSON columns and have some neato schema generation tech to validate them, the SDK still would need to figure out a way to store them together in the db, complicating a layer that is presently very straightforward.
Comment 4•5 years ago
|
||
The initial question of this bug got answered.
For now we stick with each metric as it is reported being scrutable on their own.
The rate implementation can work with that for now.
In the future we might come back and revisit it and see if we can be smarter about storage and how its exposed in data table views maybe.
Description
•