Open Bug 1305142 Opened 8 years ago Updated 3 years ago

Better telemetry alerts for regressions in the extremes of behavior

Categories

(Data Platform and Tools :: General, defect, P4)

Points:
3

Tracking

(Not tracked)

REOPENED

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug)

Details

We recently regressed cycle collector pauses very badly (bug 1301301) on August 26, but this did not generate a telemetry alert.

The regression shows up very clearly in the mean and the 95th percentile for CYCLE_COLLECTOR_MAX_PAUSE: https://mzl.la/2ctDxdF

I think the difficulty here is that most of the time the CC does very little (running every 10 seconds), so almost all of the data we get is around 5ms. However, when the user closes a tab, the CC can take much longer, on the order of a second in some bad cases. To a user, it doesn't matter too much if CC pauses go from 5ms to 8ms, but it is more noticeable if a 500ms pause goes to 800ms, even though these pauses are much less frequent. It is in these longer CCs that we had the regression.

I could add a new telemetry measure that only reports MAX_PAUSE when the value is greater than, say, 10ms, but that seems kind of hacky. There are likely other measurements where we care about the extremes of behavior in addition to the overall picture, such as garbage collector pauses and tab switching times, so it would be nice to have a general system for this. I'm not sure what the right approach would be.
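
As a rough illustration of one possible direction, here is a small Python sketch of a percentile-based check: estimate a high percentile of the histogram per build and alert when it jumps by more than some factor. The bucket layout, counts, and function names are all made up for this example; this is not existing pipeline code.

def percentile_from_histogram(buckets, q):
    # Estimate the q-th percentile (0-100) from a {bucket_lower_bound: count} dict.
    total = sum(buckets.values())
    if total == 0:
        return 0
    target = total * q / 100.0
    running = 0
    for edge in sorted(buckets):
        running += buckets[edge]
        if running >= target:
            return edge
    return max(buckets)

def tail_regressed(before, after, q=95, threshold=1.5):
    # Flag a regression when the q-th percentile grows by more than `threshold` times.
    p_before = percentile_from_histogram(before, q)
    p_after = percentile_from_histogram(after, q)
    return p_before > 0 and p_after > threshold * p_before

# Illustrative counts only: most samples sit near 5ms, and the rare
# long-pause tail moves from ~500ms to ~800ms after the regression.
before = {5: 9000, 8: 400, 500: 600}
after = {5: 9000, 8: 400, 800: 600}
print(tail_regressed(before, after))  # True: the 95th percentile goes from 500 to 800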
Component: Telemetry Server → Metrics: Pipeline
Product: Webtools → Cloud Services
Version: Trunk → unspecified
Points: --- → 3
Priority: -- → P3
Another example of this is GHOST_WINDOWS: https://mzl.la/2qlSZyH

Bug 1357872 was a regression that showed up on 4-18, and is clearly visible on the graph, but there was no regression email.

Even worse, bug 1336811 was a really bad regression with commonly used addons like AdblockPlus that landed around Sept 13, but it made it all the way to release before somebody reported it and I was able to fix it (which you can see as the huge drop on 2-21).
It's almost certainly the overwhelmingly dominant 0 bucket that prevented the detector from registering the changes: the detector reacts to changes in the overall shape of the distributions, and that shape didn't really change.

This is the kind of change that cerberus[1] is not capable of identifying in its current form. Which sucks.

[1]: https://github.com/mozilla/cerberus/blob/master/alert/alert.py#L42
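
To make that concrete, here is a toy Python example (made-up buckets and counts, not the actual cerberus code) comparing a shape-based measure, a Bhattacharyya-style distance between normalized histograms, against a simple tail measure on a GHOST_WINDOWS-like distribution where the 0 bucket dominates: the shape distance barely moves, while the share of sessions in the worst bucket quadruples.

import math

def normalize(buckets):
    # Convert raw counts into a probability distribution over buckets.
    total = sum(buckets.values())
    return {k: v / total for k, v in buckets.items()}

def bhattacharyya_distance(h1, h2):
    # Shape distance between two normalized histograms: 0 means identical
    # shapes, larger values mean more different shapes.
    keys = set(h1) | set(h2)
    bc = sum(math.sqrt(h1.get(k, 0.0) * h2.get(k, 0.0)) for k in keys)
    return -math.log(bc) if bc > 0 else float("inf")

# GHOST_WINDOWS-like case with made-up counts: the 0 bucket holds ~99% of
# submissions, and the regression only shifts mass within the rare tail.
before = {0: 99000, 1: 800, 5: 200}
after = {0: 99000, 1: 200, 5: 800}

d = bhattacharyya_distance(normalize(before), normalize(after))
print(f"shape distance: {d:.5f}")  # ~0.002: the dominant 0 bucket swamps the comparison
print(f"share in the worst bucket: {normalize(before)[5]:.2%} -> {normalize(after)[5]:.2%}")  # 0.20% -> 0.80%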
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Hooking this back up to the current Telemetry alerting bugtree in Webtools::Telemetry Dashboard.
Blocks: 1450729
Status: RESOLVED → REOPENED
Component: Metrics: Pipeline → Telemetry Dashboard
Priority: P3 → P4
Product: Cloud Services → Webtools
Resolution: INCOMPLETE → ---
Version: unspecified → other
Product: Webtools → Data Platform and Tools
Component: Telemetry Dashboards (TMO) → General