Open Bug 1691073 Opened 3 years ago Updated 1 month ago

Language Bindings have difficulty recording errors

Categories

(Data Platform and Tools :: Glean: SDK, enhancement, P3)

enhancement

Tracking

(Not tracked)

People

(Reporter: chutten, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [telemetry:glean-rs:backlog])

Attachments

(2 files)

The Glean SDK has a lovely Error Reporting mechanism for metrics. Our specific metrics APIs trie to forbid any and all misuse, but for the ones that sneak in despite our best efforts we need to record and report them.

Some errors of usage are only known and checked deep in the core. What is the length for an event extra value? What is the regex for valid labels? What is the maximum value we permit in a Memory Distribution?

Some errors are known at the Language Binding layer, but are passed deep into the core anyway so that we can record errors about them. Counters cannot count backwards, so negative values are an error. We could forbid them at the API by using an unsigned type... but not all of our language bindings support unsigned types. So we weaken our specific metrics API to accept a signed type just so negative values can reach the error reporting machinery in glean-core.

And then there's errors known at the Language Binding layer that we can't pass down. FOG forbids certain operations over IPC. It would be nice to be able to report errors in the same place that our users expect to find them, and in the same ways.

LBs could manufacture their own labeled_counter-based error system, but that would have some notable flaws:

  • It'd live in a different place from glean-core error reporting, making life difficult for testing, analysis, and tooling.
  • It has a Ping Problem: errors in glean-core are reported on all pings in which the offending metric is sent (and the "metrics" ping). That is straight up impossible to do with labeled_counters.
  • It'd be limited to sixteen erroring metrics, which is a limit that glean-core errors doesn't seem to share.

I suggest maybe we should open up core error reporting to the Language Binding layer.

Whiteboard: [telemetry:glean-rs:m?]
See Also: → 1703095
Assignee: nobody → mdroettboom
Blocks: 1704504

ni?me to fill in FOG's use case to move towards a proposed design

Flags: needinfo?(chutten)

FOG would like to:

  1. Add counts to existing Glean errors.
    • In the event someone hands me a negative number for a counter addition, I don't want to have to pass it all the way to the core for it to be recorded as an error. I'd like the MLA counter_add FFI to be unsigned and to be able to record the error so it shows up in tests.
  2. Add new errors
    • The most common example of this is IPC. If someone tries to set a boolean metric in a non-parent process, that is a non-commutative operation and is thus forbidden (at least for now). I'd like to be able to add an error that would be returned from test_get_num_recorded_errors to aid in testing and added to that metric's pings in case the error manifests in the wild.

(( and while we're at it, maybe we should tackle this suggestion of mine where we consider the ergonomics of using test_get_num_recorded_errors from our users who might not know which specific errors they're looking for. ))

The way I see this taking shape is basically an exported version of record_error. Maybe it doesn't bother with the impl Display param, or maybe it does. Since CommonMetricData and ErrorType are necessarily exposed by RLB (possibly other LBs?), we should treat this carefully as it becomes a public-facing API even if we ask people not to use it.

Flags: needinfo?(chutten)
See Also: → 1730112

I think we'll still want to do this, and it remains valid, but it is not actively being worked on.

Assignee: mdroettboom → nobody
Priority: P3 → --
Whiteboard: [telemetry:glean-rs:backlog]
Priority: -- → P4
See Also: → 1827926

Coming back to this now:
Back then we didn't follow up with this because it would have meant a whole lot of extra code manually added to each and every metric.
These days we have UniFFI and might be able to keep the additional code to a minimum inside glean-core.

We have the trait MetricType that each metric type implements.
We can add a function record_error(error_type, count) there with a default implementation that calls error_recording::record_error.
Then we add that record_error(error_type, count) method to each metric type in glean-core/src/glean.udl and moments later we can record errors without much hassle from foreign languages.
That then still needs wiring into FOG, but that might ultimately be easier then.

Priority: P4 → P3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: