1031032 - Automatic alerts for Telemetry regressions

Reporter

Description

•

11 years ago

We have over 1,000 Telemetry probes so we need an automated way to monitor them for regressions. Noise is a major challenge, even more so than with Talos data; Telemetry data is collected from a wide variety of computers, configurations and workloads. We need a reliable means of detecting regressions, improvements, and changes in a measurement's distribution.

Vladan Djeric (:vladan)

Reporter

Updated

•

11 years ago

URL: https://wiki.mozilla.org/Performance/...

Avi Halachmi (:avih)

Comment 1

•

11 years ago

(In reply to Vladan Djeric (:vladan) from comment #0)
> Noise is a major challenge, even more so than with
> Talos data;

Is there some data to back up this claim? Does it suggest that existing tools/algorithms (such as those used for Talos data) would not suffice?

While I don't disagree that the environment/parameters are not as strictly controlled as with Talos and therefore the noise levels could be higher, the noise patterns (that's not an oxymoron) are not _necessarily_ different.

The ability to utilize existing systems/algorithms is affected more by noise patterns than by noise levels IMO, and existing systems, especially the one used at graph server (which sends perf change alerts to the mozilla.dev.tree-management mailing list), have been deployed and tuned over the years to handle different patterns (as well as levels) effectively, even if admittedly tuned using Talos data.

Also, a major aspect of such a system would be to research and find useful "perspectives" or clusters of the Telemetry data where there's more coherency between the data points, and to find the parameters to slice the data by that result in higher coherency (GPU vendor, for instance).

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 2

•

11 years ago

One of the tools we would need has to detect anomalies in histograms, given a series of past histograms which are considered to be OK. Since the distributions don't always fit a well-known distribution (e.g. Gaussian), we should ideally look at non-parametric methods. One way to do it is to use something like a Chi-Square test with a Bonferroni correction or a One-Class Support Vector Machine.

Avi, given the above problem statement, could you please give us some more information, e.g. specify which of the available statistical tools that have already been written can be applied to our problem, or where we can find documentation for them? Furthermore, how do the current tools compare to something simple like a chi-square test or a one-class SVM?
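A minimal sketch of the chi-square idea, under stated assumptions and not necessarily what the final system does: the new histogram is compared against each past "OK" histogram with a chi-square test of homogeneity, and the significance level is divided by the number of comparisons (Bonferroni). The function names, the default `alpha` of 0.01, and the rule of flagging only when every comparison rejects are all assumptions made for illustration. The p-value uses the closed-form chi-square tail so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) of a chi-square variable with integer df.

    Uses the closed-form finite series, which is adequate for the modest
    number of buckets in a histogram.
    """
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term = total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    z = math.sqrt(x)
    pdf = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(z / math.sqrt(2)) + 2 * pdf * total


def homogeneity_p_value(a, b):
    """Chi-square test of homogeneity for two histograms with the same buckets."""
    total_a, total_b = sum(a), sum(b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms need at least one sample")
    grand = total_a + total_b
    stat, used = 0.0, 0
    for count_a, count_b in zip(a, b):
        column = count_a + count_b
        if column == 0:
            continue  # a bucket empty in both histograms carries no information
        used += 1
        for observed, row_total in ((count_a, total_a), (count_b, total_b)):
            expected = column * row_total / grand
            stat += (observed - expected) ** 2 / expected
    df = used - 1
    return chi2_sf(stat, df) if df > 0 else 1.0


def is_anomalous(new_hist, past_hists, alpha=0.01):
    """Flag new_hist only if it differs from every past histogram (assumed rule)."""
    threshold = alpha / len(past_hists)  # Bonferroni correction
    return all(homogeneity_p_value(new_hist, past) < threshold for past in past_hists)


past = [[100, 200, 300, 200, 100], [110, 190, 310, 195, 105]]
print(is_anomalous([105, 195, 305, 198, 102], past))  # similar shape
print(is_anomalous([300, 250, 200, 100, 50], past))   # mass shifted to low buckets
```

With very large sample counts a chi-square test rejects on tiny, practically irrelevant differences, so a real deployment would probably need an effect-size check as well.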

Flags: needinfo?(avihpit)

Roberto Agostino Vitillo (:rvitillo)

Assignee

Updated

•

11 years ago

Assignee: nobody → rvitillo

Avi Halachmi (:avih)

Comment 3

•

11 years ago

I'm not proficient enough with statistics, nor do I know the existing systems well enough, to answer these questions. I think Matt has dealt in the past with graph server regression detection tuning (active/deployed for some years now), and Kyle is currently refining the models of a new system - Datazilla alerts - https://wiki.mozilla.org/Auto-tools/Projects/Alerts . Perhaps they could answer your questions.

Flags: needinfo?(mbrubeck)

Flags: needinfo?(klahnakoski)

Flags: needinfo?(avihpit)

Kyle Lahnakoski [:ekyle]

Comment 4

•

11 years ago

Roberto knows more than me: The variety of hardware and software makes the probe measurements very noisy. If a regression happens only in a subset of the population, simple statistics may not capture it. Furthermore, I have seen the probability distributions, and they have a pattern (which is sometimes explained by uncovering hidden variables). Using histograms and the Chi-Square test helps capture that pattern. Uncovering as many machine attributes as possible may help explain the complex distribution, but I do not know how many attributes are accounted for currently, or how expensive they are to gather.

The problem of detecting regressions in Telemetry data is very different from that in Talos: Talos has the benefit of a controlled environment and a smaller data volume, which enables simple statistics to be effective. Telemetry requires more advanced statistical techniques to handle the hidden variables, and more advanced data management techniques to handle the data volume.
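A toy illustration of the subset problem, using made-up numbers: 10% of the population (a hypothetical `gpu_b` slice) regresses by 50 ms while the rest is unchanged. The pooled median barely moves, but the per-slice median shows the full shift. The slice names and sample values are assumptions chosen only to make the arithmetic easy to follow.

```python
from statistics import median


def median_shift_by_slice(before, after):
    """Return {slice: median(after) - median(before)}, plus "ALL" for the pooled data."""
    shifts = {name: median(after[name]) - median(before[name]) for name in before}
    pooled_before = [v for values in before.values() for v in values]
    pooled_after = [v for values in after.values() for v in values]
    shifts["ALL"] = median(pooled_after) - median(pooled_before)
    return shifts


# Hypothetical timings in ms: 90 samples from one slice, 10 from another.
before = {
    "gpu_a": [100 + i % 10 for i in range(90)],
    "gpu_b": [100 + i for i in range(10)],
}
after = {
    "gpu_a": before["gpu_a"],                    # unchanged
    "gpu_b": [v + 50 for v in before["gpu_b"]],  # regressed by 50 ms
}

shifts = median_shift_by_slice(before, after)
print(shifts)
```

Here the pooled median moves by half a millisecond while the `gpu_b` median moves by 50 ms, which is why slicing by machine attributes can expose changes that a single aggregate number hides.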

Flags: needinfo?(klahnakoski)

Matt Brubeck (:mbrubeck)

Comment 5

•

11 years ago

The Talos regression system treats a single number (usually an average) as the output of each benchmark run, and uses a Student's t-test to decide whether a new set of values is regressed from a baseline set. I agree this isn't sufficient for the type of data in Telemetry, though some of the ideas could be useful. Here are some more details and pointers: http://limpet.net/mbrubeck/2013/11/10/improving-regression-detection.html

In addition to any complex statistical analysis we do across the whole Telemetry dataset, I think we should also do some very simple, easy-to-understand alerting for carefully chosen key metrics. For example, we could decide that we want an alert if SIMPLE_MEASURES_START goes over 100ms at the 50th percentile, 500ms at the 75th percentile, or 2.5s at the 95th percentile. This won't catch every single meaningful change in behavior, but the things it *does* catch are guaranteed to be things we want to investigate.

(We should also have manual review of these key metrics on a weekly or monthly basis, by teams whose goals are reflected in the metrics.)
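The fixed-threshold alerting could be sketched roughly as follows. This assumes a histogram is available as sorted `(bucket_lower_bound_ms, count)` pairs and reports a percentile as the lower bound of the bucket that contains it, which slightly underestimates the true value. The function names and the sample histogram are hypothetical; only the three limits come from the example above.

```python
def histogram_percentile(buckets, fraction):
    """Lower bound of the bucket that contains the given fraction of samples.

    buckets: list of (lower_bound_ms, count) pairs sorted by lower bound.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    target = fraction * total
    seen = 0
    for lower, count in buckets:
        seen += count
        if seen >= target:
            return lower
    return buckets[-1][0]


# (percentile, limit in ms) taken from the example in the comment.
STARTUP_LIMITS_MS = [(0.50, 100), (0.75, 500), (0.95, 2500)]


def breached_limits(buckets, limits=STARTUP_LIMITS_MS):
    """Return (percentile, limit, observed) for every limit that is exceeded."""
    breaches = []
    for fraction, limit in limits:
        observed = histogram_percentile(buckets, fraction)
        if observed > limit:
            breaches.append((fraction, limit, observed))
    return breaches


# Hypothetical startup histogram with a heavy slow tail.
slow = [(0, 10), (50, 20), (200, 30), (1000, 30), (3000, 10)]
for fraction, limit, observed in breached_limits(slow):
    print(f"p{int(fraction * 100)}: {observed}ms exceeds {limit}ms")
```

Because the limits are absolute, an alert from this check needs no statistical interpretation, at the cost of missing changes that stay below the limits.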

Flags: needinfo?(mbrubeck)

Avi Halachmi (:avih)

Comment 6

•

11 years ago

Following some discussion in bug 1032185, it's probably best if any reports from this system end up with as tight a date range as possible, preferably while the version which regressed was still at its "Nightly" stage. Regressions which happen when a version is on another release channel are relatively rare unless they are caused by an uplift, in which case the report should make that clear.

Roberto Agostino Vitillo (:rvitillo)

Assignee

Updated

•

11 years ago

Depends on: 1034119

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 7

•

11 years ago

Vladan,

Here are the histograms that fired the alert system from the 29th of April to the 22nd of June, with their respective day of regression. Could you have a quick look at the telemetry dashboard for some of the histograms and give me your opinion? Mind that the dashboard can get "stuck" on a histogram, i.e. after selecting a new histogram the previous one continues to be displayed.

MEMORY_VSIZE_MAX_CONTIGUOUS, 09/05/2014
UPDATER_SERVICE_ENABLED, 09/05/2014
UPDATER_SERVICE_INSTALLED, 09/05/2014
NEWTAB_PAGE_DIRECTORY_AFFILIATE_SHOWN, 10/05/2014
NEWTAB_PAGE_DIRECTORY_ORGANIC_SHOWN, 10/05/2014
NEWTAB_PAGE_DIRECTORY_SPONSORED_SHOWN, 10/05/2014
CYCLE_COLLECTOR_TIME_BETWEEN, 18/05/2014
SPDY_PARALLEL_STREAMS, 18/05/2014
SPDY_REQUEST_PER_CONN, 18/05/2014
FX_TAB_ANIM_ANY_FRAME_INTERVAL_MS, 21/05/2014
FX_TAB_ANIM_OPEN_FRAME_INTERVAL_MS, 21/05/2014
FX_TAB_ANIM_OPEN_PREVIEW_FRAME_INTERVAL_MS, 21/05/2014
GC_REASON_2, 21/05/2014 (seems to be a false positive)
GC_SLICE_MS, 21/05/2014
MEMORY_IMAGES_CONTENT_USED_UNCOMPRESSED, 22/05/2014
PLACES_BACKUPS_BOOKMARKSTREE_MS, 23/05/2014
GC_MARK_MS, 03/06/2014
GC_MARK_ROOTS_MS, 03/06/2014
GC_MS, 03/06/2014
GC_SCC_SWEEP_MAX_PAUSE_MS, 03/06/2014
GC_SCC_SWEEP_TOTAL_MS, 03/06/2014
MEMORY_IMAGES_CONTENT_USED_UNCOMPRESSED, 07/06/2014
IMAGE_DECODE_CHUNKS, 13/06/2014
XUL_BACKGROUND_REFLOW_MS, 20/06/2014

Flags: needinfo?(vdjeric)

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 8

•

11 years ago

I made some tweaks to the algorithm (https://github.com/vitillo/cerberus), here are the revised alerts: https://etherpad.mozilla.org/lt1bHdjKUP. I will keep the etherpad updated so that I don't have to spam here.

Roberto Agostino Vitillo (:rvitillo)

Assignee

Updated

•

11 years ago

Depends on: 1037494

Vladan Djeric (:vladan)

Reporter

Comment 9

•

11 years ago

Which data did the system operate on?

- All Nightlies or the current Nightly?
- All OSes?
- Are the regression dates buildIDs or calendar days?
- Did you take into account discontinuities while the Firefox population switches between versions?

Flags: needinfo?(vdjeric) → needinfo?(rvitillo)

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 10

•

11 years ago

(In reply to Vladan Djeric (:vladan) from comment #9)
> Which data did the system operate on?
>
> - All Nightlies or the current Nightly?

I use only the latest and the previous Nightly versions; that should make transitions across Nightlies smooth.

> - All OSes?

Only WINNT, as the other ones have too much noise and not enough data.

> - Are the regression dates buildIDs or calendar days?

The regressions are based on build dates.

> - Did you take into account discontinuities while the Firefox population
> switches between versions?

Yes, see my first comment.

I have attached a picture of what the alert looks like at the current stage.

Mark, is there a machine where I can deploy the alerting system? I could just send e-mails to myself for now and check that it works as expected while we figure out the final details.

Flags: needinfo?(rvitillo) → needinfo?(mreid)

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 11

•

11 years ago

Attached image: Example alert.

Avi Halachmi (:avih)

Comment 12

•

11 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> Created attachment 8460185 [details]
> Example alert.

This is an interesting graph. The distribution actually looks nicer after the regression, though its peak is also higher.

Any chance the numbers before the regression were somehow inaccurate? E.g. look at the chunk of very low values to the left.

Or maybe the code changed/improved, and one of the new code's side effects is that the distribution is nicer but higher?

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 13

•

11 years ago

(In reply to Avi Halachmi (:avih) from comment #12)
> (In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> > Created attachment 8460185 [details]
> > Example alert.
>
> This is an interesting graph. The distribution actually looks nicer after
> the regression, though its peak is also higher.
>
> Any chance the numbers before the regression were somehow inaccurate? e.g.
> look at the chunk of very-low values to the left.

That doesn't seem to be the case.

> Or maybe the code changed/improved and one of the new code's side effect is
> that the distribution is nicer but higher?

Yes, that's what I think is going on here. Probably "regression" is not the right nomenclature, as it could also be an improvement.

Mark Reid [:mreid]

Comment 14

•

11 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #10)
> Mark, is there a machine where I can deploy the alerting system? I could
> just send e-mails to myself for now and check that it works as expected in
> the meantime that we figure out the final details.

I have a machine where I run these kinds of things - I can deploy it there for testing. For the longer term, though, I'd like to set up a machine dedicated to running our monitoring code as part of the telemetry CloudFormation setup. I'll file a separate bug for that.

Flags: needinfo?(mreid)

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 15

•

11 years ago

Update: To ease the burden of manually looking at the data generated by the alerting system, I built a simple dashboard to display the regressions. It temporarily lives in my Dropbox at https://dl.dropboxusercontent.com/u/12274806/dashboard/index.html.

Roberto Agostino Vitillo (:rvitillo)

Assignee

Comment 16

•

11 years ago

The alerting system has been deployed along with its dashboard, which can be accessed at http://vitillo.github.io/cerberus/dashboard/.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Webtools → Webtools Graveyard