Bug 1703095 (Open) — Opened 3 years ago, Updated 3 years ago

Provide convenience function for nullable values

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P4)

enhancement

Tracking

(Not tracked)

People

(Reporter: mhentges, Unassigned)

References

Details

(Whiteboard: [telemetry:glean-rs:backlog])

A recent issue affected a Python consumer of Glean: in a rare circumstance, one of the string metrics was being set to None (see the associated patch).

However, this triggered some thinking: from Glean's perspective, what's the difference between a metric not being set, and a metric being set to None? Can we consider these things equivalent? If so, that would simplify interacting with Glean when values are optional:

metric_value = calculate_optional_metric_value()
if metric_value is not None:
    metric.set(metric_value)

# Would become:
metric.set(calculate_optional_metric_value())
See Also: → 1703092, 1653046

(In reply to Mitchell Hentges [:mhentges] 🦀 from comment #0)

However, this triggered some thinking: from glean's perspective, what's the difference between a metric not being set, and a metric being set to None? Can we consider these things equivalent?

This is a great question with so many implications! In general, we found that overloading a state with different meanings (e.g. None and no value being equivalent) is confusing for analysts and can be problematic when looking back at the data months later. However, we need not despair!

We also introduced Glean error reporting to report anomalous usage (like the one you stumbled upon, i.e. attempting to set a None on a string metric). Unfortunately, the errors that string types currently report do not account for dynamically typed languages. But that doesn't mean we can't add new error reporting!

Instead of leaving the burden of checking for None values to the user, I think we could make the string type report InvalidValue when attempting to set a None.
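A minimal sketch of what this could look like, assuming a simplified metric object; the names ErrorType, StringMetric, and the errors dict are illustrative, not the real Glean internals:

```python
# Hypothetical sketch: a string metric whose set() records an
# InvalidValue error instead of raising when passed None, so that
# recording telemetry can never crash the host application.
from enum import Enum
from typing import Dict, Optional


class ErrorType(Enum):
    INVALID_VALUE = "invalid_value"


class StringMetric:
    def __init__(self, name: str) -> None:
        self.name = name
        self.value: Optional[str] = None
        self.errors: Dict[ErrorType, int] = {}

    def set(self, value: Optional[str]) -> None:
        if value is None:
            # Count the error in the data rather than raising; analysts
            # can check these counts later, or tests can assert on them.
            self.errors[ErrorType.INVALID_VALUE] = (
                self.errors.get(ErrorType.INVALID_VALUE, 0) + 1
            )
            return
        self.value = value


metric = StringMetric("mach.command")
metric.set(None)     # recorded as an error, not an exception
metric.set("build")  # normal recording still works
```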

Since Jan-Erik is on PTO, I'm flagging Bea (since this also has implications on Glean.js) and Mike. What do you think?

Flags: needinfo?(mdroettboom)
Flags: needinfo?(brizental)

Instead of leaving the burden of checking for None values to the user, I think we could make the string type report InvalidValue when attempting to set a None.

If we do this for strings, it would make sense to do it for all metric types, no? And if we do that for all types, it might even be worth having a specific error for it, e.g. UnexpectedNullValue. What makes strings special, such that only they would require this behaviour?

As you point out, this would be a language-specific error. Statically typed languages wouldn't suffer from this problem, so currently it would be an API only for JavaScript and Python. Or do you suggest we accept None values in all languages?

In JavaScript it would not be a big deal to include this, because of Glean.js' architecture. In the Glean SDK it would require that we let language bindings report errors, which is currently not possible.

I do think it is advantageous to include error reporting for this specific scenario. In an ideal world we have error reporting for all possible Recording API errors. I am not sure about the scale of work on the SDK to provide this feature, though.

Flags: needinfo?(brizental)

(In reply to Beatriz Rizental [:brizental] from comment #2)

Instead of leaving the burden of checking for None values to the user, I think we could make the string type report InvalidValue when attempting to set a None.

If we do this for strings, it would make sense to do it for all metric types, no? And if we do that for all types, it might even be worth having a specific error for it, e.g. UnexpectedNullValue. What makes strings special, such that only they would require this behaviour?

Yes, it makes sense to do that for all the metric types IMHO. And yes, it might make sense to add another error type, so that we won't conflate meanings into InvalidValue for metrics that support it.

As you point out, this would be a language-specific error. Statically typed languages wouldn't suffer from this problem, so currently it would be an API only for JavaScript and Python. Or do you suggest we accept None values in all languages?

No, I don't think we should accept None from other bindings.

I do think it is advantageous to include error reporting for this specific scenario. In an ideal world we have error reporting for all possible Recording API errors. I am not sure about the scale of work on the SDK to provide this feature, though.

Yeah, it's worth waiting for Jan-Erik to make a final call, but I wanted to kick off the conversation.

(In reply to Beatriz Rizental [:brizental] from comment #2)

In Javascript it would not be a big deal to include this because of Glean.js' architecture. In the Glean SDK it would require that we let language bindings report errors, which is currently not possible.

I'm just here to note that we have registered the intent to pursue this possibility in bug 1691073. Even at levels higher than LBs! (FOG would like to use the error stream to record IPC errors, for example.)

Not sure that it'd be a proper solution or is of sufficient priority to help out here, but I try not to miss any chance to link bugs.

See Also: → 1691073

In general, we found that overloading a state with different meanings ... is confusing for analysts and can be problematic ...

Good point 👍

But that doesn't mean we can't add new error reporting!

Improved error reporting would be fantastic as well!
Because the validation currently happens in a different Python stack than where the metric is set, the exceptions caught by Sentry aren't incredibly useful (it's unclear which metric is causing the error). Adjustments made here to make it "fail fast/better" would be greatly appreciated :)

Thanks all for the discussion here, I'm happy regardless of the solution here!

I generally agree with the suggestion to improve the error reporting and create fewer footguns.

However, just wanted to note that in a dynamically-typed language like Python, there are many things that could go wrong here, not just passing None where not expected. So we probably need a more general "catch-all" so that if any exception is thrown when setting a metric, we record it as unexpected_error (or some such name).
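A sketch of that catch-all idea, under the assumption that the bindings wrap each recording call; safe_record, error_counts, and the error name unexpected_error are illustrative placeholders, not real Glean API:

```python
# Hypothetical catch-all sketch: any exception raised while recording a
# metric is converted into an "unexpected_error" count instead of
# propagating to the caller.
from collections import Counter

error_counts: Counter = Counter()
store = {}


def safe_record(metric_name, record_fn, *args):
    """Call record_fn, converting any exception into an error count."""
    try:
        record_fn(*args)
    except Exception:
        # Swallow the exception; telemetry failures must not break
        # the product. The count is still visible in the data.
        error_counts[(metric_name, "unexpected_error")] += 1


def set_string(value):
    # .strip() raises AttributeError when value is None, standing in
    # for the class of bugs a dynamic language allows here.
    store["example.metric"] = value.strip()


safe_record("example.metric", set_string, None)    # counted, not raised
safe_record("example.metric", set_string, " ok ")  # records normally
```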

Additionally, wanted to note that glean's Python bindings have full type annotations which may have helped here. I know that moving to type annotations and mypy linting can be a bit of a challenge with legacy projects like mach etc., of course.
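For illustration, a small sketch of how type annotations would surface this statically; the signatures below mirror the shape of the Python bindings but are not the real classes:

```python
# With set() annotated to take only str, mypy would reject passing an
# Optional[str] directly and force the caller to narrow it first:
#   error: Argument 1 to "set" has incompatible type "Optional[str]"
from typing import Optional


def calculate_optional_metric_value() -> Optional[str]:
    # Stand-in for the caller's logic; may legitimately produce None.
    return None


class StringMetricType:
    def __init__(self) -> None:
        self._value = ""

    def set(self, value: str) -> None:  # None is not an accepted type
        self._value = value


metric = StringMetricType()
value = calculate_optional_metric_value()
if value is not None:  # the narrowing check mypy would insist on
    metric.set(value)
```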

Flags: needinfo?(mdroettboom)

Makes sense 👍.

Additionally, wanted to note that glean's Python bindings have full type annotations which may have helped here. I know that moving to type annotations and mypy linting can be a bit of a challenge with legacy projects like mach etc., of course.

Absolutely, I look forward to the day where we can embrace type checking in Mach :)


At the risk of de-railing the conversation: how would you recommend determining the cause of glean usage errors?
If the error is tucked into the Glean.error structure, then I can add logic to check it, but it will be hard to determine the cause of the failure. I've been leaning on Sentry, which is very good at collecting the state at the time of a failure and including it in a useful "error report".

Do you know if it would be possible to have Glean throw an error immediately when it is misused? That would allow Sentry to pick up the stack trace and the values of local variables when this occurs.
However, I don't know how viable this is, especially if validation is centralized in the Glean core (native non-Python code).

Some historical context: Glean errors are quite different from error codes or exceptions. The general idea is that Glean itself should never raise an exception to the user, because we don't want the act of recording telemetry to ever break the product or cause crashes. So Glean just provides a signal in the data that errors occurred in the wild, which you can check later or in your project's unit tests. In practice, the lack of a stack trace hasn't been much of an issue in things like Fenix.

(Of course, you've found an exception to this in this bug -- an error is raising an exception -- but that is partly due to how hard it is to account for every possible error in a dynamic language. I consider that a bug in Glean that we should find a good, general way to fix.)

We have discussed whether Glean could have a "mode" that would turn these errors into thrown exceptions at runtime to fail fast during testing (but wouldn't be on in a release build or something), but I don't think that's ever gone beyond discussion.
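A rough sketch of that discussed-but-unimplemented mode; the testing_mode flag and report_error method are hypothetical, not part of the Glean API:

```python
# Hypothetical "strict mode" sketch: recording errors are silently
# counted in release builds but raised during testing, so tools like
# Sentry or pytest can capture a stack trace at the point of misuse.
class RecordingError(Exception):
    pass


class Glean:
    testing_mode = False  # would be enabled only in test builds
    error_count = 0

    @classmethod
    def report_error(cls, message: str) -> None:
        cls.error_count += 1  # always visible in the data
        if cls.testing_mode:
            # Fail fast so the misuse is caught with a stack trace.
            raise RecordingError(message)


Glean.report_error("invalid value")      # release behaviour: counted only

Glean.testing_mode = True
try:
    Glean.report_error("invalid value")  # test behaviour: raises
    raised = False
except RecordingError:
    raised = True
```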

Thanks! That makes sense.
My only (very hypothetical) fear is that, if a metric is set in multiple different locations, it may be tough to track down which one is causing an error reported in Glean.error.
However, I haven't run into this situation yet with mach.
Either way, I need my own internal management of Glean to integrate with Glean.error, and that will solve most of my current Glean-related problems. Thanks :)

Assignee: nobody → mdroettboom
Priority: -- → P2
Assignee: mdroettboom → nobody
Priority: P2 → --
Whiteboard: [telemetry:glean-rs:backlog]
Priority: -- → P4