We have over 1,000 Telemetry probes so we need an automated way to monitor them for regressions. Noise is a major challenge, even more so than with Talos data; Telemetry data is collected from a wide variety of computers, configurations and workloads. We need a reliable means of detecting regressions, improvements, and changes in a measurement's distribution.
(In reply to Vladan Djeric (:vladan) from comment #0)
> Noise is a major challenge, even more so than with Talos data;

Is there some data to back up this claim? Does it suggest that existing tools/algorithms (such as those used for Talos data) would not suffice?

While I don't disagree that the environment/parameters are not as strictly controlled as Talos, and therefore the noise levels could be higher, the noise patterns (that's not an oxymoron) are not _necessarily_ different. The ability to utilize existing systems/algorithms is affected more by noise patterns than by noise levels, IMO, and existing systems, especially the one used at graph server (which sends perf-change alerts to the mozilla.dev.tree-management mailing list), have been deployed and tuned over the years to handle different patterns (as well as levels) effectively, even if admittedly tuned using Talos data.

Also, a major aspect of such a system would be to research and find useful "perspectives" or clusters of the telemetry data where there's more coherency between the data points, i.e. to find the parameters to slice the data by which result in higher coherency (GPU vendor, for instance).
One of the tools we would need has to detect anomalies in a histogram given a series of past histograms which are considered to be OK. Since the distributions don't always fit a well-known distribution (e.g. Gaussian), we should ideally look at non-parametric methods. One way to do it is to use something like a Chi-Square test with a Bonferroni correction, or a One-Class Support Vector Machine.

Avi, given the above problem statement, could you please give us some more information? E.g., which of the available statistical tools that have already been written can be applied to our problem, and where can we find documentation for them? Furthermore, how do the current tools compare to something simple like a chi-square test or a one-class SVM?
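For concreteness, here is a minimal sketch of the chi-square approach with a Bonferroni correction. The probe names, bucket counts, and alpha level are illustrative, and the dict-based data shape is an assumption; this is not the eventual system, just one way the test could be wired up:

```python
# Sketch: run one chi-square goodness-of-fit test per probe, comparing the
# current histogram against the baseline distribution, and Bonferroni-correct
# the significance threshold for the number of probes tested so the
# family-wise false-alarm rate stays near alpha.
from scipy.stats import chisquare

def detect_regressions(baselines, currents, alpha=0.05):
    """baselines/currents: dict of probe name -> list of bucket counts,
    with identical bucketing. Assumes all baseline buckets are non-zero
    (in practice, empty buckets would need to be merged first)."""
    threshold = alpha / len(baselines)  # Bonferroni correction
    alerts = []
    for name, base in baselines.items():
        cur = currents[name]
        total_base = sum(base)
        total_cur = sum(cur)
        # Expected counts if `cur` followed the baseline distribution.
        expected = [b * total_cur / total_base for b in base]
        _, p_value = chisquare(cur, f_exp=expected)
        if p_value < threshold:
            alerts.append(name)
    return alerts
```

With 1,000+ probes the correction matters: at alpha = 0.05 uncorrected, dozens of false alerts per run would be expected by chance alone.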
I'm not proficient enough with statistics, nor do I know the existing systems well enough, to answer these questions. I think Matt has dealt in the past with graph server regression detection tuning (active/deployed for some years now), and Kyle is currently refining the models of a new system - Datazilla alerts - https://wiki.mozilla.org/Auto-tools/Projects/Alerts. Perhaps they could answer your questions.
Roberto knows more than me:

The variety of hardware and software makes the probe measurements very noisy, and if a regression happens only in a subset of the population, simple statistics may not capture it. Furthermore, I have seen the probability distributions, and they have a pattern (which is sometimes explained by uncovering hidden variables). Using histograms and the Chi-Square test helps capture that pattern. Uncovering as many machine attributes as possible may help explain the complex distributions, but I do not know how many attributes are accounted for currently, or how expensive they are to gather.

The problem of detecting regressions in Telemetry data is very different from the Talos one: Talos has the benefit of a controlled environment and a smaller data volume, which enables simple statistics to be effective. Telemetry requires more advanced statistical techniques to handle the hidden variables, and more advanced data-management techniques to handle the data volume.
The Talos regression system treats a single number (usually an average) as the output of each benchmark run, and uses a Student's t-test to decide whether a new set of values has regressed from a baseline set. I agree this isn't sufficient for the type of data in Telemetry, though some of the ideas could be useful. Here are some more details and pointers: http://limpet.net/mbrubeck/2013/11/10/improving-regression-detection.html

In addition to any complex statistical analysis we do across the whole Telemetry dataset, I think we should also do some very simple, easy-to-understand alerting for carefully chosen key metrics. For example, we could decide that we want an alert if SIMPLE_MEASURES_START goes over 100ms at the 50th percentile, 500ms at the 75th percentile, or 2.5s at the 95th percentile. This won't catch every single meaningful change in behavior, but the things it *does* catch are guaranteed to be things we want to investigate. (We should also have manual review of these key metrics on a weekly or monthly basis, by teams whose goals are reflected in the metrics.)
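The simple threshold alerting suggested above could be sketched roughly as follows. The SIMPLE_MEASURES_START limits are the ones from the comment; the helper functions and the shape of the sample data are hypothetical:

```python
# Sketch: alert when a key metric's percentiles cross hand-picked limits.

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already-sorted list (0 < q <= 100)."""
    idx = max(0, int(round(q / 100.0 * len(sorted_values))) - 1)
    return sorted_values[idx]

# Limits in milliseconds: (percentile, max allowed value).
LIMITS = {
    "SIMPLE_MEASURES_START": [(50, 100), (75, 500), (95, 2500)],
}

def check_metric(name, samples):
    """Return a list of (percentile, observed, limit) violations."""
    values = sorted(samples)
    return [(q, percentile(values, q), limit)
            for q, limit in LIMITS.get(name, [])
            if percentile(values, q) > limit]
```

The appeal of this scheme is exactly what the comment says: every alert it raises is, by construction, something worth investigating, because the limits were chosen by hand for metrics people care about.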
Following some discussion at bug 1032185, it's probably best if any reports from this system end up with as tight a date range as possible, and preferably while the version which regressed was still at its "Nightly" stage. Regressions which happen when a version is on another release channel are relatively rare, unless caused by an uplift, in which case the report should make that clear.
Vladan, here are the histograms that fired the alert system from the 29th of April to the 22nd of June, with their respective day of regression. Could you have a quick look at the Telemetry dashboard for some of the histograms and give me your opinion? Mind that the dashboard can get "stuck" on a histogram, i.e. after selecting a new histogram the previous one continues to be displayed.

MEMORY_VSIZE_MAX_CONTIGUOUS, 09/05/2014
UPDATER_SERVICE_ENABLED, 09/05/2014
UPDATER_SERVICE_INSTALLED, 09/05/2014
NEWTAB_PAGE_DIRECTORY_AFFILIATE_SHOWN, 10/05/2014
NEWTAB_PAGE_DIRECTORY_ORGANIC_SHOWN, 10/05/2014
NEWTAB_PAGE_DIRECTORY_SPONSORED_SHOWN, 10/05/2014
CYCLE_COLLECTOR_TIME_BETWEEN, 18/05/2014
SPDY_PARALLEL_STREAMS, 18/05/2014
SPDY_REQUEST_PER_CONN, 18/05/2014
FX_TAB_ANIM_ANY_FRAME_INTERVAL_MS, 21/05/2014
FX_TAB_ANIM_OPEN_FRAME_INTERVAL_MS, 21/05/2014
FX_TAB_ANIM_OPEN_PREVIEW_FRAME_INTERVAL_MS, 21/05/2014
GC_REASON_2, 21/05/2014 (seems to be a false positive)
GC_SLICE_MS, 21/05/2014
MEMORY_IMAGES_CONTENT_USED_UNCOMPRESSED, 22/05/2014
PLACES_BACKUPS_BOOKMARKSTREE_MS, 23/05/2014
GC_MARK_MS, 03/06/2014
GC_MARK_ROOTS_MS, 03/06/2014
GC_MS, 03/06/2014
GC_SCC_SWEEP_MAX_PAUSE_MS, 03/06/2014
GC_SCC_SWEEP_TOTAL_MS, 03/06/2014
MEMORY_IMAGES_CONTENT_USED_UNCOMPRESSED, 07/06/2014
IMAGE_DECODE_CHUNKS, 13/06/2014
XUL_BACKGROUND_REFLOW_MS, 20/06/2014
I made some tweaks to the algorithm (https://github.com/vitillo/cerberus); here are the revised alerts: https://etherpad.mozilla.org/lt1bHdjKUP. I will keep the etherpad updated so that I don't have to spam this bug.
Which data did the system operate on?

- All Nightlies or the current Nightly?
- All OSes?
- Are the regression dates buildIDs or calendar days?
- Did you take into account discontinuities while the Firefox population switches between versions?
Flags: needinfo?(vdjeric) → needinfo?(rvitillo)
(In reply to Vladan Djeric (:vladan) from comment #9)
> Which data did the system operate on?
>
> - All Nightlies or the current Nightly?

I use only the latest and the previous Nightly versions; that should make transitions across Nightlies smooth.

> - All OSes?

Only WINNT, as the other ones have too much noise and not enough data.

> - Are the regression dates buildIDs or calendar days?

The regressions are based on build dates.

> - Did you take into account discontinuities while the Firefox population
> switches between versions?

Yes, see my first comment.

I have attached a picture of what the alert looks like at the current stage.

Mark, is there a machine where I can deploy the alerting system? I could just send e-mails to myself for now and check that it works as expected while we figure out the final details.
Flags: needinfo?(rvitillo) → needinfo?(mreid)
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> Created attachment 8460185 [details]
> Example alert.

This is an interesting graph. The distribution actually looks nicer after the regression, though its peak is also higher.

Any chance the numbers before the regression were somehow inaccurate? E.g. look at the chunk of very low values to the left. Or maybe the code changed/improved, and one of the new code's side effects is that the distribution is nicer but higher?
(In reply to Avi Halachmi (:avih) from comment #12)
> This is an interesting graph. The distribution actually looks nicer after
> the regression, though its peak is also higher.
>
> Any chance the numbers before the regression were somehow inaccurate? e.g.
> look at the chunk of very-low values to the left.

That doesn't seem to be the case.

> Or maybe the code changed/improved and one of the new code's side effect is
> that the distribution is nicer but higher?

Yes, that's what I think is going on here. "Regression" is probably not the right nomenclature, as it could also be an improvement.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #10)
> Mark, is there a machine where I can deploy the alerting system? I could
> just send e-mails to myself for now and check that it works as expected in
> the meantime that we figure out the final details.

I have a machine where I run these kinds of things - I can deploy it there for testing. For the longer term, though, I'd like to set up a machine dedicated to running our monitoring code as part of the telemetry CloudFormation setup. I'll file a separate bug for that.
Update: To reduce the burden of manually looking at the data generated by the alerting system, I built a simple dashboard to display the regressions. It temporarily lives in my Dropbox at https://dl.dropboxusercontent.com/u/12274806/dashboard/index.html.
The alerting system has been deployed, along with its dashboard, which can be accessed at http://vitillo.github.io/cerberus/dashboard/.
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED