Better telemetry alerts for regressions in the extremes of behavior

Status

RESOLVED INCOMPLETE
Product: Cloud Services
Component: Metrics: Pipeline
Priority: P3
Severity: normal
Opened: a year ago
Closed: 3 months ago

People

(Reporter: mccr8, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

a year ago
We recently regressed cycle collector pauses very badly (bug 1301301) on August 26, but this did not generate a telemetry alert.

The regression shows up very clearly in the mean and the 95th percentile for CYCLE_COLLECTOR_MAX_PAUSE: https://mzl.la/2ctDxdF

I think the difficulty here is that most of the time the CC does very little (running every 10 seconds), so almost all of the data we get is around 5ms. However, when the user closes a tab, the CC can take much longer, on the order of a second in some bad cases. To a user, it doesn't matter too much if CC pauses go from 5ms to 8ms, but it is more noticeable if a 500ms pause goes to 800ms, even though these pauses are much less frequent. It is in these longer CCs that we had the regression.

I could add a new telemetry measure that only reports MAX_PAUSE when the value is greater than, say, 10ms, but that seems kind of hacky. There are also likely other measurements where we care about the extremes of behavior in addition to the overall picture, such as garbage collector pauses and tab switching times, so it would be nice to have some system set up for this. I'm not sure what the right approach would be.
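To make the threshold idea concrete, here is a minimal sketch of a "tail-only" measure. All names here are hypothetical, not an existing Telemetry API: samples go into the regular histogram as usual, and additionally into a separate tail histogram only when they exceed a cutoff, so a 500ms→800ms regression dominates the tail measure instead of drowning in 5ms noise.

```python
# Sketch of a "tail-only" probe: accumulate a CC max-pause sample into a
# separate measure only when it exceeds a threshold. Names are hypothetical.

TAIL_THRESHOLD_MS = 10  # only pauses longer than this reach the tail measure

def record_cc_max_pause(pause_ms, full_hist, tail_hist):
    """Accumulate into the regular histogram always, and into the
    tail histogram only for long pauses."""
    full_hist.append(pause_ms)
    if pause_ms > TAIL_THRESHOLD_MS:
        tail_hist.append(pause_ms)

full, tail = [], []
for sample in [5, 6, 5, 500, 5, 7, 800, 5]:
    record_cc_max_pause(sample, full, tail)

# `full` still holds every sample; `tail` holds only the long pauses,
# so a shift in the slow CCs changes the tail distribution's shape directly.
```

The same filtering could be done server-side on existing data instead of adding a new probe, which avoids the proliferation of near-duplicate measures.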

Updated

a year ago
Component: Telemetry Server → Metrics: Pipeline
Product: Webtools → Cloud Services
Version: Trunk → unspecified

Updated

a year ago
Points: --- → 3
Priority: -- → P3
(Reporter)

Comment 1

8 months ago
Another example of this is GHOST_WINDOWS: https://mzl.la/2qlSZyH

Bug 1357872 was a regression that showed up on 4-18, and is clearly visible on the graph, but there was no regression email.

Even worse, bug 1336811 was a really bad regression with commonly used add-ons like Adblock Plus that landed around Sept 13, but it shipped all the way to release before somebody reported it and I was able to fix it (visible as the huge drop on 2-21).

Comment 2

8 months ago
It's almost certainly the overwhelmingly dominant 0 shape that prevented the detector from registering the changes. The detector reacts to changes in the shape of the distributions, and these didn't really change.

This is one kind of change that cerberus[1] is not capable of identifying in its current form. Which sucks.

[1]: https://github.com/mozilla/cerberus/blob/master/alert/alert.py#L42
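A small sketch of the failure mode described above. It assumes a Bhattacharyya-style distance between normalized histograms as the shape comparison; that is one common choice for this kind of detector, used here purely as an illustration, not as cerberus's exact implementation. The bucket counts are made up.

```python
import math

def bhattacharyya_distance(p, q):
    """Distance between two histograms after normalization; 0 means
    identical shape. Illustrative only, not cerberus's actual code."""
    sp, sq = sum(p), sum(q)
    bc = sum(math.sqrt((a / sp) * (b / sq)) for a, b in zip(p, q))
    return -math.log(bc)

# Hypothetical buckets: [0-10ms, 10-100ms, 100-600ms, 600ms+]
before = [9900, 50, 45, 5]   # almost all samples near zero, small slow tail
after  = [9900, 50, 5, 45]   # the slow tail moved into the worst bucket

# The 0-bucket holds 99% of the mass in both, so the distance stays tiny
# even though the user-visible tail regressed badly.
distance = bhattacharyya_distance(before, after)
```

With these numbers the distance comes out around 0.002, far below any plausible alert threshold, which matches the observation that the detector never fired despite the obvious regression in the tail.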
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Last Resolved: 3 months ago
Resolution: --- → INCOMPLETE