Should we ignore expected errors in sync success rates?
Categories
(Application Services :: Places, enhancement)
Tracking
(Not tracked)
People
(Reporter: bdk, Unassigned)
References
Details
(Whiteboard: [fxsync-])
Our sync errors are currently dominated by things like httperror, Network error, shutdownerror, etc. These are errors that we expect to see in normal operation. Maybe they're more noise than signal and we shouldn't include them in the sync success rate calculation (they wouldn't count as success or failure). We should still consider counting and tracking them, but in a different visualization.
Comment 1•13 hours ago
I'm a little torn here. I think different visualizations are problematic because at some point we simply don't check them - if we can squeeze them onto a single dashboard it might be fine, but that only scales so far.
Also, I guess this is a kind of philosophical question: what are our dashboards actually for? If they are only to track bugs in the code etc, then ignoring these errors makes sense. If we want a true measure of success as seen by our users, ignoring them makes less sense. Are elevated 401s a problem? Is a spike in ShutdownErrors a problem? I'd say they are and we should know about them.
That said, I agree it does make our stuff noisy. Is there a middle ground? Can we split into "failure" and "error", where the latter are the expected ones, but keep them on the same graph?
Reporter
Comment 2•2 hours ago
Yeah, I feel torn too. I like the idea of trying to display both on the same graph; maybe we could lean into the idea of "engine success" vs "general sync success". If we see a network error, then we count it as a failure to sync, but not a failure for the current engine. Then maybe we aim for 99.9% success rates for our engines, and accept that sync in general will be closer to 90% (I'm kind of guessing at these numbers). We could graph both on the same chart, maybe with a logarithmic scale.
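To make the proposed split concrete, here's a minimal sketch of how the two rates could be computed. The error names, the record shape, and the helper function are all hypothetical, invented for illustration; the actual telemetry schema would differ:

```python
# Errors we expect during normal operation (network flakiness, shutdown
# mid-sync, auth failures). Hypothetical names, for illustration only.
EXPECTED_ERRORS = {"httperror", "networkerror", "shutdownerror"}

def success_rates(sync_records):
    """sync_records: list of dicts like {"error": None} for a success
    or {"error": "networkerror"} for a failure.
    Returns (engine_rate, overall_rate)."""
    total = len(sync_records)
    failures = [r["error"] for r in sync_records if r["error"] is not None]
    successes = total - len(failures)

    # Overall sync rate: every error counts as a failure.
    overall_rate = successes / total if total else 1.0

    # Engine rate: expected errors are excluded from the denominator
    # entirely, so they count as neither success nor failure.
    expected = sum(1 for e in failures if e in EXPECTED_ERRORS)
    engine_total = total - expected
    engine_rate = successes / engine_total if engine_total else 1.0

    return engine_rate, overall_rate
```

With 9 clean syncs and 1 network error, this yields an engine rate of 1.0 but an overall rate of 0.9, which matches the intuition above that engines should sit near 99.9% while sync as a whole sits lower.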