Open Bug 1866357 Opened 1 year ago

Better error telemetry

Categories

(Application Services :: General, enhancement, P3)

enhancement

Tracking

(Not tracked)

People

(Reporter: markh, Unassigned)

Details

From GitHub: https://github.com/mozilla/application-services/issues/4982.

Recently, we've been discussing how to improve the way we use Glean to capture error telemetry. After taking a quick survey of our current error telemetry, I think our current code is actually close to what we want; we just need a few tweaks.

I think we should take the logins Android error handling as a starting point:

  • In metrics.yaml we define several metrics:
    • Total write query count
    • Total read query count
    • Read query error counts (this is a labeled counter, so we can create a count for each error type)
    • Write query error counts (also a labeled counter)
  • In DatabaseLoginsStorage.kt we increment those counters
  • Finally we graph the errors on our logins dashboard
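
As a rough illustration of the scheme described above, here is a minimal Rust sketch. It does not use the real Glean API: `LabeledCounter` is a hypothetical in-memory stand-in for a Glean labeled counter, and `StorageError`, `LoginsMetrics`, `read_query_count`, `write_query_count` and the label strings are assumed names chosen for the example. Only `read_query_error_count`, `write_query_error_count` and the `__other__` label come from this issue.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a Glean labeled counter.
#[derive(Debug, Default)]
pub struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    /// Increment the count stored under `label` by one.
    pub fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Current count for `label` (0 if it was never incremented).
    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// Hypothetical error type, for illustration only.
#[derive(Debug)]
pub enum StorageError {
    InvalidRecord,
    Interrupted,
    Unexpected(String),
}

/// Map an error to a counter label. Anything we cannot classify
/// ends up under `__other__`.
pub fn error_label(err: &StorageError) -> &'static str {
    match err {
        StorageError::InvalidRecord => "invalid_record",
        StorageError::Interrupted => "interrupted",
        StorageError::Unexpected(_) => "__other__",
    }
}

/// The four metrics from the list above, modeled in memory.
#[derive(Debug, Default)]
pub struct LoginsMetrics {
    pub read_query_count: u64,
    pub write_query_count: u64,
    pub read_query_error_count: LabeledCounter,
    pub write_query_error_count: LabeledCounter,
}

impl LoginsMetrics {
    /// Count one read query, plus one labeled error if it failed.
    pub fn record_read<T>(&mut self, result: Result<T, StorageError>) -> Result<T, StorageError> {
        self.read_query_count += 1;
        if let Err(e) = &result {
            self.read_query_error_count.add(error_label(e));
        }
        result
    }

    /// Count one write query, plus one labeled error if it failed.
    pub fn record_write<T>(&mut self, result: Result<T, StorageError>) -> Result<T, StorageError> {
        self.write_query_count += 1;
        if let Err(e) = &result {
            self.write_query_error_count.add(error_label(e));
        }
        result
    }
}
```

Each storage call would be wrapped in `record_read` or `record_write`, so the totals and the per-type error counts stay in step. Dividing one by the other gives the error rate that a dashboard could graph.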

I think we can get pretty good error telemetry with a few tweaks:

  • Better metrics.
    • I don't think the read/write distinction is that useful. What if we replace it with a single total query count?
    • We use labeled counters to track error types, but we don't use those labels in the graph. What if we:
      • Combined read_query_error_count and write_query_error_count into a single errors_by_type labeled counter.
      • Visualized that on the dashboard as errors per day, grouped by type, like we do with sync errors.
      • Improved the error type detection. Right now, almost all errors are grouped under the __other__ type. Getting access to Glean from Rust would be great, since then we could have this code in Rust.
    • Add an errors_by_function labeled counter and visualize that in a similar way. This could help us track down which function was generating errors. It seems like an improvement on the read/write distinction to me.
  • Use this system for other components as well.
    • I think this means a bunch of similar metrics.yaml files, one per component.
    • Maybe the upcoming struct metric could reduce the duplication?
    • Create a global errors_by_component labeled counter. This would provide a nice overview for our main dashboard.
  • Create a shared metric system. We should track the same metrics on iOS (and desktop once we're there). Getting access to Glean from Rust would be a huge help here too.
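
To make the proposal above more concrete, here is a minimal Rust sketch of how the combined counters could fit together. It is an in-memory model under stated assumptions, not the Glean API: `LabeledCounter`, `ComponentError`, `ComponentMetrics`, `ErrorTelemetry`, `query_count` and the label strings are hypothetical names. Only `errors_by_type`, `errors_by_function`, `errors_by_component` and the `__other__` label come from this issue.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a Glean labeled counter.
#[derive(Debug, Default)]
pub struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    /// Increment the count stored under `label` by one.
    pub fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Current count for `label` (0 if it was never incremented).
    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// Hypothetical shared error type, for illustration only.
#[derive(Debug)]
pub enum ComponentError {
    DatabaseLocked,
    InvalidInput,
    Unexpected(String),
}

/// Error type detection done once, in Rust, so every platform
/// would report the same labels.
pub fn error_type(err: &ComponentError) -> &'static str {
    match err {
        ComponentError::DatabaseLocked => "database_locked",
        ComponentError::InvalidInput => "invalid_input",
        ComponentError::Unexpected(_) => "__other__",
    }
}

/// Per-component metrics: one total query count and two labeled counters.
#[derive(Debug, Default)]
pub struct ComponentMetrics {
    pub query_count: u64,
    pub errors_by_type: LabeledCounter,
    pub errors_by_function: LabeledCounter,
}

/// Global view: per-component metrics plus an overview counter.
#[derive(Debug, Default)]
pub struct ErrorTelemetry {
    pub errors_by_component: LabeledCounter,
    components: HashMap<String, ComponentMetrics>,
}

impl ErrorTelemetry {
    /// Record the outcome of one call to `function` in `component`.
    pub fn record<T>(&mut self, component: &str, function: &str, result: &Result<T, ComponentError>) {
        let metrics = self.components.entry(component.to_string()).or_default();
        metrics.query_count += 1;
        if let Err(e) = result {
            metrics.errors_by_type.add(error_type(e));
            metrics.errors_by_function.add(function);
            self.errors_by_component.add(component);
        }
    }

    /// Metrics recorded so far for `name`, if any.
    pub fn component(&self, name: &str) -> Option<&ComponentMetrics> {
        self.components.get(name)
    }
}
```

In this sketch a single `record` call updates all three labeled counters, which is the kind of duplication a shared Rust layer could remove from the per-platform wrappers. The component and function names passed in would be whatever labels the real metrics end up declaring.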

