Open Bug 1305142 Opened 8 years ago Updated 3 years ago

Better telemetry alerts for regressions in the extremes of behavior

Categories

(Data Platform and Tools :: General, defect, P4)

Points:
3

Tracking

(Not tracked)

REOPENED

People

(Reporter: mccr8, Unassigned)

References

(Blocks 1 open bug)

Details

We recently regressed cycle collector pauses very badly (bug 1301301) on August 26, but this did not generate a telemetry alert.

The regression shows up very clearly in the mean and the 95th percentile for CYCLE_COLLECTOR_MAX_PAUSE: https://mzl.la/2ctDxdF

I think the difficulty here is that most of the time the CC does very little (running every 10 seconds), so almost all of the data we get is around 5ms. However, when the user closes a tab, the CC can take much longer, on the order of a second in some bad cases. To a user, it doesn't matter too much if CC pauses go from 5ms to 8ms, but it is more noticeable if a 500ms pause goes to 800ms, even though these pauses are much less frequent. It is in these longer CCs that we had the regression.

I could add a new telemetry measure that only reports MAX_PAUSE when the value is greater than, say, 10ms, but that seems kind of hacky. There are likely other measurements where we care about the extremes of behavior in addition to the overall picture, such as garbage collector pauses and tab switching times, so it would be nice to have a general system for this. I'm not sure what the right approach would be.
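
As a rough illustration of one possible direction, here is a small Python sketch of a percentile-based check: estimate a high percentile of the histogram per build and alert when it jumps by more than some factor. The bucket layout, counts, and function names are all made up for this example; this is not existing pipeline code.

def percentile_from_histogram(buckets, q):
    # Estimate the q-th percentile (0-100) from a {bucket_lower_bound: count} dict.
    total = sum(buckets.values())
    if total == 0:
        return 0
    target = total * q / 100.0
    running = 0
    for edge in sorted(buckets):
        running += buckets[edge]
        if running >= target:
            return edge
    return max(buckets)

def tail_regressed(before, after, q=95, threshold=1.5):
    # Flag a regression when the q-th percentile grows by more than `threshold` times.
    p_before = percentile_from_histogram(before, q)
    p_after = percentile_from_histogram(after, q)
    return p_before > 0 and p_after > threshold * p_before

# Illustrative counts only: most samples sit near 5ms, and the rare
# long-pause tail moves from ~500ms to ~800ms after the regression.
before = {5: 9000, 8: 400, 500: 600}
after = {5: 9000, 8: 400, 800: 600}
print(tail_regressed(before, after))  # True: the 95th percentile goes from 500 to 800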
Component: Telemetry Server → Metrics: Pipeline
Product: Webtools → Cloud Services
Version: Trunk → unspecified
Points: --- → 3
Priority: -- → P3
Another example of this is GHOST_WINDOWS: https://mzl.la/2qlSZyH

Bug 1357872 was a regression that showed up on 4-18, and is clearly visible on the graph, but there was no regression email.

Even worse, bug 1336811 was a really bad regression with commonly used addons like AdblockPlus that landed around Sept 13, but it made it all the way to release before somebody reported it and I was able to fix it (which you can see as the huge drop on 2-21).
It's almost certainly the overwhelmingly dominant 0 bucket that prevented the detector from registering the changes: the detector reacts to changes in the overall shape of the distributions, and that shape didn't really change.

This is the kind of change that cerberus[1] is not capable of identifying in its current form. Which sucks.

[1]: https://github.com/mozilla/cerberus/blob/master/alert/alert.py#L42
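
To make that concrete, here is a toy Python example (made-up buckets and counts, not the actual cerberus code) comparing a shape-based measure, a Bhattacharyya-style distance between normalized histograms, against a simple tail measure on a GHOST_WINDOWS-like distribution where the 0 bucket dominates: the shape distance barely moves, while the share of sessions in the worst bucket quadruples.

import math

def normalize(buckets):
    # Convert raw counts into a probability distribution over buckets.
    total = sum(buckets.values())
    return {k: v / total for k, v in buckets.items()}

def bhattacharyya_distance(h1, h2):
    # Shape distance between two normalized histograms: 0 means identical
    # shapes, larger values mean more different shapes.
    keys = set(h1) | set(h2)
    bc = sum(math.sqrt(h1.get(k, 0.0) * h2.get(k, 0.0)) for k in keys)
    return -math.log(bc) if bc > 0 else float("inf")

# GHOST_WINDOWS-like case with made-up counts: the 0 bucket holds ~99% of
# submissions, and the regression only shifts mass within the rare tail.
before = {0: 99000, 1: 800, 5: 200}
after = {0: 99000, 1: 200, 5: 800}

d = bhattacharyya_distance(normalize(before), normalize(after))
print(f"shape distance: {d:.5f}")  # ~0.002: the dominant 0 bucket swamps the comparison
print(f"share in the worst bucket: {normalize(before)[5]:.2%} -> {normalize(after)[5]:.2%}")  # 0.20% -> 0.80%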
Closing abandoned bugs in this product per https://bugzilla.mozilla.org/show_bug.cgi?id=1337972
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Hooking this back up to the current Telemetry alerting bugtree in Webtools::Telemetry Dashboard.
Blocks: 1450729
Status: RESOLVED → REOPENED
Component: Metrics: Pipeline → Telemetry Dashboard
Priority: P3 → P4
Product: Cloud Services → Webtools
Resolution: INCOMPLETE → ---
Version: unspecified → other
Product: Webtools → Data Platform and Tools
Component: Telemetry Dashboards (TMO) → General