Closed Bug 1672273 Opened 4 years ago Closed 2 years ago

Labeled Metrics with staticly-defined Labels should be encouraged

Categories

(Data Platform and Tools :: Glean Metric Types, task, P2)

task

Tracking

(firefox114 fixed)

RESOLVED FIXED
Tracking Status
firefox114 --- fixed

People

(Reporter: chutten, Assigned: chutten)

References

(Blocks 2 open bugs)

Details

Attachments

(5 files)

Proposal for changing an existing or adding a new Glean metric type

Who is the individual/team requesting this change?

:chutten

Is this about changing an existing metric type or creating a new one?

Changing labeled metric types (e.g. Labeled Booleans)

Can you describe the data that needs to be recorded?

No change to the data being recorded.

Can you provide a raw sample of the data that needs to be recorded (this is in the abstract, and not any particular implementation details about its representation in the payload or the database)

No change to the data being recorded.

What is the business question/use-case that requires the data to be recorded?

No change to the business question/use-case.

How would the data be consumed?

No change to how the data is consumed.

Why existing metric types are not enough?

Right now there is no incentive besides (perhaps) Data Hygiene to induce an instrumentor to use statically-defined (as in, defined in the labels property in a metrics.yaml) labels for labeled types.

We want to encourage the use of statically-defined labels for reasons including Data Hygiene ( : D ).

So thus we should probably provide a mechanism that makes it worthwhile to go to the effort of statically-defining your labels and ensuring that the list stays up-to-date.

I propose that, like Categorical Histograms, we provide an enum for each label (plus one for __other__) that can be used instead of strings. If we can come up with other inducements (perhaps by discouraging the use of the non-enum API somehow), that'd be nice as well or instead.

What is the timeline by which the data needs to be collected?

The sooner, the better, as established habits are harder to break.

There's some helpful context over here and in the comments following, which I will summarize here:

  • The use case of "error code" is one we've heard before and will hear again.
  • Being able to specify a partial list of known values of interest to label them would be nice
    • The metric should accept any number of that width, but map anything not in the list to e.g. __other__
    • So for an image decoding library, we could sparsely label some likely values (success, file too small) but if we get an error we didn't care much about (retroencabulator is fumbling) we still get a count for that, but it gets bucketed with other errors we don't care about (baffle spindling overwound, etc).
  • Generating an enum for these values would be nice, so users could e.g. Glean.myCategory.myErrorCode.add(Glean.myCategory.myErrorCode.codes.lp0_on_fire);
  • Using an existing enum that is already in the codebase would be ideal, so users could e.g. nsresult rv = SomeGeckoOperation(); glean::my_gecko_category::my_error_code.Add(rv);
    • But it might be frighteningly difficult. For example, a specific class of "error code" collections would be Gecko's nsresult codes. They are both enum class values and also constants. We could write an API to take the nsresult enum class (ErrorCodeMetric::Add(nsresult aCode, uint32_t aErrorCount = 1), but what we get on the inside of this API is something like the number 2147500037. At some point we need to figure out that this means NS_ERROR_FAILURE and then perhaps translate to that to Glean's label format (ns_error_failure maybe) so we can store and transmit and analyze and display it.
    • Getting this right across languages will be interesting as well as I think the JS manifestation of nsresult is a thrown error. But this is very Firefox-specific so maybe I shouldn't overrotate here.
    • The important thing is that for C++, especially with the X Macro pattern, you basically need to run a full compiler frontend in order to be able to codegen a map from a given enum or enum class's values to their symbol names as strings. Other languages may make this easier
  • All these challenges being challenging doesn't change that having a single source of truth for what values mean what strings is crucial. If we force humans to keep these in-sync through static_assert or linters since the tools can only generate transformation functions, then it's still worth it.

(Oh, and some context for the initial submission: I originally made this submission because you couldn't have more than 16 static labels (the same number as dynamic labels). That was a bug. The original design was to allow unlimited static labels. In resolving the bug we fixed the maximum number of static labels to 100. So there is an incentive to use static labels and in fact they are the only choice for categorical counts that have more than 16 categories.)

When trying to record something with a given label, it would be useful to have a way to know if the label is part of the static list of labels that are accepted by the probe, or if the values will be counted as __other__. In the latter case, I would like to report the missing label name using another probe (likely a string_list).

Oh, and keeping this up-to-date, we've lifted the limit on the count of static labels in labeled_* metrics to 4096 in bug 1772163

See Also: → 1657473

bug 1790872 will be adding a data collection that would have used Glean if it could.

See Also: → 1790872

I intend to draft a proposal for exploring and discussing how to solve this specifically for FOG C++ consumers (and, if I'm lucky, in general for all consumers) next week.

Assignee: nobody → chutten
Status: NEW → ASSIGNED
Priority: -- → P2

Alrighty, I've come up with a couple of (additive) proposals for consideration. I thought about making this a full Change Proposal, but these are so small that it didn't seem right. So let's do this the BMO+ni way.

Proposal 1: Remove the label regex

Labeled metrics' labels are not and have not ever been bigquery-column-name safe. The regex is causing friction for use that is not balanced by any (intentional) benefit. (I mean, maybe it accidentally makes it harder for a PII leak, but only because it makes everything harder).

Retain a length limit (expressed as bytes of UTF-8 as is tradition) and reserve __other__, but otherwise permit everything. Or at least everything printable.

Proposal 2: enums for labels

For static labels we can generate an enum with a variant for each label plus one for __other__. This helps by:

  • Providing a cheap (not string) and accurate (no typos) way for API consumers to map external values (like IPC message names or nsresult values) to labels
  • It becomes possible to iterate over all possible labels (helpful for about:glean improvements in bug 1762281 and has been requested for testing)
  • Comment #4 via e.g. MyCategoryMyMetricLabel EnumForLabel(const nsACString& aLabel) returning MyCategoryMyMetricLabel::other__ (can't start a C++ identifier with __ as that's reserved) for anything not recognized
  • Retains all the benefits of the current system by augmenting and not replacing it

It should be performant to implement as we could e.g. opt to implement the enum variants to carry the values of the indices of the string label in the metric's labels vector.

This can be implemented in the language binding layer, leaving glean-core's CowString-based Labeled<T>::get alone. This is much the same way that we did things when we added types to event extras.


Travis, Jan-Erik, Bruno please let me know what you think. Feedback deadline is Jan 17, 2023.

(and of course I'm happy to get input from other folks not named. Did I miss a big use case? Is this the worst idea? Let me know.)

Flags: needinfo?(tlong)
Flags: needinfo?(jrediger)
Flags: needinfo?(brosa)

Re Proposal 1: Yes (though maybe restricted to ascii). It's bitten us more often than it solved things. We should check in with pipeline/data-engineering though just to make sure.

Re Proposal 2: It's surprisingly easy to implement in Rust!. Other languages will require some additional work, but I think that shouldn't be too hard either.

Flags: needinfo?(jrediger)

I agree with this, the label restrictions have mostly been a pain or caused confusion and I cannot point to a single specific thing that it has helped or prevented.

On the second proposal, I'm also in favor of enumerations. I am fairly certain that both Swift and Kotlin enums shouldn't make this too difficult to do, and Jan-Erik already points out that it's simple in Rust.

r+ from me on both counts.

Flags: needinfo?(tlong)

Proposal 1: Based on everything mentioned, this makes sense. I am not as familiar with the pain points described, but the solution sounds like the way to go.

Proposal 2: Enums sound great - this should be fine for Glean.js. Enums on the Typescript side work similar to other languages and we can generate the same sort of object in JS too.

r+ from me as well

Flags: needinfo?(brosa)

chutten merged PR #2364: "Bug 1672273 - Remove label regex, replace with '71-bytes of printable ASCII'" in 1b0be12.

Proposal 1 is now implemented, but not released, as "labels (both static and dynamic) must be at most 71 characters of printable ASCII ( to ~, inclusive)". As always this still requires release and vendoring of both glean_parser and Glean SDK before anyone can take advantage of it.

Next up: "Proposal 2: enums". Which is proving a mite tricky.

We are changing the type of impl::Labeled<T> to impl::Labeled<T, E> making the
current macro use of CPU use metrics within FOGIPC impossible.

Thus we introduce a ProcessType enum to replace storing the metrics and remove
the macros to acheive the same end.

First we have to replace the venerable metric hashmaps with metric match
statements (like events). This is because, by adding the enum to the Labeled<>
type, we've made it impossible to use HashMap (since its type requires a known
Value type V, and Labeled<Counter, $SomeEnumType> is incomplete).

Then we had to generate meaningful enums. The order of the values is important,
as I foresee a future in which we may wish to use the enum variant value
(its "discriminant") to index into the glean-core labels list.

The enums can't have naming collisions. In Rust this is easy, so long as we
stay away from reserved keywords (bug 1814767). In C++ we are using enum class so the variants' identifiers are scoped. However, on my system at least,
this is still susceptible to collision with preprocessor defines. So we prefix
the variants with the letter e. So far so good.

C++ Labeled metric implementations is regrettably moved from Labeled.cpp to
Labeled.h, as the template specialization is no longer full.

No decision about JS is made or implied by its exclusion from this commit.
This includes support for runtime registration (via JOG).

The use of strings for labeled_* metrics is preserved mostly unchanged.

The transformation from enum back to string label is performed in Rust during
enum_get. This is an improvement over the current situation as the strings
in use are guaranteed to not be allocated. However, I suspect we can improve
even more by moving this translation lower (into the SDK, ideally).

GIFFT support requires getting the label string from the enum, which is exposed
on the FFI.

Depends on D170108

I wouldn't mind some feedback on these. They're landable, but I'm not sure they're 100%. And I'm not 100% sold on the ergonomics of it. I'm especially open to suggestions on how to make the C++ API more pleasant, and whether switching to strings as soon as we hit Rust in FOG is worth the effort (or whether I should try for the Full Solution and make the enums literally refer to the indices in the labels: Vec<String> in glean-core's labeled.rs)

Putting these up now for discussion (no ni? yet. Not gonna compel feedback until I have a more concrete request) and so I can switch gears a bit.

(Alrighty, I've figured out what concrete questions I have)

:nika, :florian - As two of the folks who've in the past expressed interest in using static-labeled_* metrics for enum-type-things, what are your thoughts about the ergonomics I've settled on? (e.g. the enum is in the metric category's namespace, the prefixing and casing of the variants, the existence of the __other__ variant and its position at the end of the enum, the naming of EnumGet, etc etc). Are there pieces missing that'd make this a pain for you to use? Feel free to use the patchsets as a guide, but don't feel you need to perform code review.

:Travis - As Glean Tech Lead, what's your thoughts about this? We discussed synchronously about this FOG-first approach, but what do you think about using the enum discriminant as an index into the labels array? Should I push ahead on running these ideas down into the Glean SDK before landing, or will these changes be enough of an improvement (and forward-looking) to land as-is?

As always, feel free to bring in or suggest others who might wish to have a seat at this table.

Flags: needinfo?(tlong)
Flags: needinfo?(nika)
Flags: needinfo?(florian)

The part that worries me the most about indexing into the labels array using the enum is something you already stated:

"The order of the values is important"

How can we guarantee a deterministic ordering of the enums? That feels like it could be a deceptively complicated problem considering that sorting the labels in any way could change the order if/when new labels are added, as well as we don't control the order in the metrics.yaml file that static labels are listed.

I do really like the concept of using the discriminate as an index, it cuts down on the amount of strings crossing the FFI which should always be good, but I'm worried about guaranteeing the stability of the discriminate

Flags: needinfo?(tlong)

Luckily, so long as we control both the SDK and the parser, the order of the labels can be stabilized. glean_parser's object model has ordered_labels which embedders who care about ordering can be sure is the order of the Vec that gets passed to the LabeledXMetric's ctor. Then the order of the labels: [] in the definition matches the order of the enum matches the order of the labels: Vec<CowString> in glean-core, and everyone's happy.

And so long as we keep storing strings, this ordering need only be stable within a build.

e__Other__ is a bit awkward as a name, but I suppose it's fine. I'm a bit uncertain about how well this will work with IPC code because all labels still need to be written in the metrics.yaml file, but it's still definitely an improvement over the old approach. We'll probably need to have IPDL consume the metrics.yaml file as an input to make sure we don't try to name an enum which isn't in the set.

Overall though I think that having explicit enums is definitely a big improvement.

Flags: needinfo?(nika)

(In reply to Nika Layzell [:nika] (ni? for response) from comment #22)

all labels still need to be written in the metrics.yaml file

We need some sort of mapping from enum value to string label someplace (since enum values aren't stable, though their meanings are intended to be), and it's the SDK that needs to know it. Alas, the only way we currently have to tell the SDK anything is via metrics.yaml, so that's where it all goes. Everything else I could come up with was worse : (

We'll probably need to have IPDL consume the metrics.yaml file as an input to make sure we don't try to name an enum which isn't in the set

glean_parser is in-tree and ready to parse files for you whenever you need it.

Attachment #9318245 - Attachment description: WIP: Bug 1672273 - Rework thread cpu metrics to not depend on type of metrics r?florian! → Bug 1672273 - Rework thread cpu metrics to not depend on type of metrics r?janerik!
Attachment #9318246 - Attachment description: WIP: Bug 1672273 - Provide enums in Rust and C++ for labeled_* metrics with static labels r?TravisLong! → Bug 1672273 - Provide enums in Rust and C++ for labeled_* metrics with static labels r?janerik!
Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/16f56b56505b Rework thread cpu metrics to not depend on type of metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/b1508e3f22f5 Provide enums in Rust and C++ for labeled_* metrics with static labels r=janerik https://hg.mozilla.org/integration/autoland/rev/8bec73152dfb apply code formatting via Lando
[task 2023-04-26T20:43:53.168Z] 20:43:53    ERROR -  error[E0412]: cannot find type `_Pred` in this scope
[task 2023-04-26T20:43:53.168Z] 20:43:53     INFO -     --> /builds/worker/workspace/obj-build/x86_64-apple-darwin/release/build/style-bcd2002c8774f423/out/gecko/structs.rs:460:40

I confess a degree of confusion about how this has anything to do with my patch, but I'm looking.

Flags: needinfo?(chutten)

Not sure why this only pops up as part of this patchset. I was forced to move
some std::tuple-using code into a widely-included header. Maybe that's it?

Depends on D170109

Pushed by chutten@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d671c627fba3 Rework thread cpu metrics to not depend on type of metrics r=janerik https://hg.mozilla.org/integration/autoland/rev/349d6fa3f929 Provide enums in Rust and C++ for labeled_* metrics with static labels r=janerik https://hg.mozilla.org/integration/autoland/rev/7c551c5379b4 Mac CI builds need to treat tuple subclasses as opaque r=emilio
Duplicate of this bug: 1704193
Flags: needinfo?(florian)
Blocks: 1892465
See Also: → 1937167
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: