Open
Bug 1866357
Opened 1 year ago
Better error telemetry
Categories
(Application Services :: General, enhancement, P3)
Tracking
(Not tracked)
NEW
People
(Reporter: markh, Unassigned)
Details
From GitHub: https://github.com/mozilla/application-services/issues/4982
Recently, we've been discussing how to improve our use of Glean to capture error telemetry. After taking a quick survey of our current error telemetry, I think our current code is actually close to what we want; we just need a few tweaks.
I think we should take the logins android error handling as a starting point:
- In `metrics.yaml` we define several metrics:
  - Total write query count
  - Total read query count
  - Read query error counts (this is a labeled counter, so we can create a count for each error type)
  - Write query error counts (also a labeled counter)
- In `DatabaseLoginsStorage.kt` we increment those counters
- Finally we graph the errors on our logins dashboard
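To make the current pattern concrete, here is a rough sketch of the counting logic in Rust. It is not the actual Kotlin code in `DatabaseLoginsStorage.kt` and it does not use the Glean API: `LabeledCounter`, `LoginsMetrics`, the field names, and the use of a plain `String` as the error label are all assumptions made up for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory stand-in for a labeled counter: one count per label.
#[derive(Default, Debug)]
pub struct LabeledCounter {
    counts: HashMap<String, u64>,
}

impl LabeledCounter {
    /// Increment the count stored under `label`.
    pub fn add(&mut self, label: &str) {
        *self.counts.entry(label.to_string()).or_insert(0) += 1;
    }

    /// Read the count for `label` (0 if it was never incremented).
    pub fn get(&self, label: &str) -> u64 {
        self.counts.get(label).copied().unwrap_or(0)
    }
}

/// The four metrics described above (field names are illustrative).
#[derive(Default, Debug)]
pub struct LoginsMetrics {
    pub read_query_count: u64,
    pub write_query_count: u64,
    pub read_query_error_count: LabeledCounter,
    pub write_query_error_count: LabeledCounter,
}

impl LoginsMetrics {
    /// Count a read operation and record its error label if it fails.
    pub fn read_query<T>(&mut self, op: impl FnOnce() -> Result<T, String>) -> Result<T, String> {
        self.read_query_count += 1;
        let result = op();
        if let Err(label) = &result {
            self.read_query_error_count.add(label);
        }
        result
    }

    /// Count a write operation and record its error label if it fails.
    pub fn write_query<T>(&mut self, op: impl FnOnce() -> Result<T, String>) -> Result<T, String> {
        self.write_query_count += 1;
        let result = op();
        if let Err(label) = &result {
            self.write_query_error_count.add(label);
        }
        result
    }
}

/// Example usage: one successful read and one failing write.
pub fn demo_current_scheme() -> LoginsMetrics {
    let mut metrics = LoginsMetrics::default();
    let _ = metrics.read_query(|| Ok::<u32, String>(3));
    let _ = metrics.write_query(|| Err::<(), String>("invalid_record".to_string()));
    metrics
}
```

The point of the sketch is only that every operation bumps a total and every failure bumps a per-label count, which is what the dashboard would then graph.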
I think we can get pretty good error telemetry with a few tweaks:
- Better metrics.
  - I don't think the read/write distinction is that useful. What if we replace that with just a total query count?
  - We use labeled counters to track error types, but we don't use that in the graph. What if we:
    - Combine `read_query_error_count` and `write_query_error_count` into a single `errors_by_type` labeled counter.
    - Visualize that on the dashboard as errors per day, grouped by type, like we do with sync errors
  - Improve the error type detection. Right now, almost all errors are grouped under the `__other__` type. Getting access to Glean from Rust would be great, since then we could have this code in Rust.
  - Add an `errors_by_function` labeled counter and visualize that in a similar way. This could help us track down which function was generating errors. It seems like an improvement on the read/write distinction to me.
- Use this system for other components as well.
  - I think this means a bunch of similar `metrics.yaml` files per-component
    - Maybe the upcoming `struct` metric could reduce the duplication?
  - Create a global `errors_by_component` labeled counter. This would provide a nice overview for our main dashboard.
- Create a shared metric system. We should track the same metrics on iOS (and desktop once we're there). Getting access to Glean from Rust would be a huge help here too.
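As a rough illustration of the proposal, the sketch below records a single total query count plus `errors_by_type`, `errors_by_function`, and `errors_by_component` counts from one place in Rust. It assumes an in-memory stand-in rather than the Glean API, and `StorageError`, its variants, the label strings other than `__other__`, `ErrorTelemetry`, and `track` are hypothetical names invented for the example.

```rust
use std::collections::HashMap;

/// Hypothetical error type for a storage component (not the real logins errors).
#[derive(Debug)]
pub enum StorageError {
    InvalidKey,
    Interrupted,
    Unexpected(String),
}

impl StorageError {
    /// Map each error to a stable label. Anything unclassified falls into
    /// `__other__`, the catch-all bucket that better detection should shrink.
    pub fn label(&self) -> &'static str {
        match self {
            StorageError::InvalidKey => "invalid_key",
            StorageError::Interrupted => "interrupted",
            StorageError::Unexpected(_) => "__other__",
        }
    }
}

/// In-memory stand-in for the proposed metrics (an assumption, not Glean).
#[derive(Default, Debug)]
pub struct ErrorTelemetry {
    pub query_count: u64,
    pub errors_by_type: HashMap<String, u64>,
    pub errors_by_function: HashMap<String, u64>,
    pub errors_by_component: HashMap<String, u64>,
}

impl ErrorTelemetry {
    /// Count one operation; on failure, record it under its error type,
    /// the calling function, and the component. The result is passed through.
    pub fn track<T>(
        &mut self,
        component: &str,
        function: &str,
        result: Result<T, StorageError>,
    ) -> Result<T, StorageError> {
        self.query_count += 1;
        if let Err(err) = &result {
            *self.errors_by_type.entry(err.label().to_string()).or_insert(0) += 1;
            *self.errors_by_function.entry(function.to_string()).or_insert(0) += 1;
            *self.errors_by_component.entry(component.to_string()).or_insert(0) += 1;
        }
        result
    }
}

/// Example usage: one success and two failures from the same function.
pub fn demo_proposed_scheme() -> ErrorTelemetry {
    let mut telemetry = ErrorTelemetry::default();
    let _ = telemetry.track("logins", "list", Ok(vec!["a".to_string()]));
    let _ = telemetry.track::<()>("logins", "update", Err(StorageError::InvalidKey));
    let _ = telemetry.track::<()>(
        "logins",
        "update",
        Err(StorageError::Unexpected("disk full".to_string())),
    );
    telemetry
}
```

Keeping the label mapping next to the error type is one way the classification could live in Rust and be shared across platforms, but that depends on Glean being reachable from Rust as discussed above.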
┆Issue is synchronized with this Jira Task
Change performed by the Move to Bugzilla add-on.