We need derived datasets for the current and future e10s experiments. To avoid biasing our analyses, we have to use a representative set of the clients participating in the experiment. Since some clients submit their data with severe lag, we currently either ignore their submissions or waste a considerable amount of resources filtering for experiment submissions on days well beyond the experiment's end date. Let's do this work just once, when creating the derived stream. Furthermore, the derived stream should group all submissions by client and compute a representative measure for each metric considered. Currently we randomly select a single session per client, which is not good enough for low signal-to-noise metrics like plugin crashes and slow script notice counts, as we don't have enough statistical power to detect differences.
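To make the aggregation concrete, here is a minimal Scala sketch of the per-client grouping, assuming a simplified submission record with a flat metric map; the names are hypothetical and the real schema and logic live in E10sExperiment.scala (linked below). One possible representative measure is the per-client sum across sessions, which is what low signal-to-noise counts like plugin crashes call for:

    // Hypothetical, simplified submission record; the real schema is defined
    // by the derived stream in E10sExperiment.scala.
    case class Submission(clientId: String, metrics: Map[String, Long])

    // Group all submissions by client and sum each metric across the
    // client's sessions, instead of randomly picking a single session.
    def aggregateByClient(submissions: Seq[Submission]): Map[String, Map[String, Long]] =
      submissions
        .groupBy(_.clientId)
        .map { case (clientId, subs) =>
          val summed = subs
            .flatMap(_.metrics.toSeq)               // all (metric, value) pairs
            .groupBy { case (metric, _) => metric } // pairs per metric
            .map { case (metric, pairs) => metric -> pairs.map(_._2).sum }
          clientId -> summed
        }

Summing is just one choice; medians or client-level histograms would work the same way. The point is that the choice is made once, in the derived stream, rather than in every analysis.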
The code for the derived stream lives at https://github.com/vitillo/telemetry-batch-view/blob/master/src/main/scala/streams/E10sExperiment.scala. Rerunning the "all histogram comparison analysis" (http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_all_histograms_experiment.ipynb) on the derived stream, for the data collected from 22/10 to 17/11, took less than 10 minutes on a single machine (about 100K users). In comparison, the same analysis on the raw data collected from 22/10 to 27/10 took more than an hour on a 10-machine cluster (about 30K users). We should rerun all our current e10s experiment analyses on the derived dataset and check for changes. It should be easy to re-use the code not only for future e10s experiments but, more generally, for any experiment; a sketch of how the experiment-specific parameters could be factored out follows.
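Under that generalization, the experiment-specific parts boil down to an experiment id and a submission window. The names below (ExperimentSpec, inExperiment) are hypothetical illustrations, not the actual API of telemetry-batch-view:

    import java.time.LocalDate

    // Hypothetical description of an experiment: its Telemetry experiment id
    // and the submission window during which pings are accepted.
    case class ExperimentSpec(experimentId: String, start: LocalDate, end: LocalDate)

    // True if a submission belongs to the experiment and falls inside its
    // window, so late submissions are filtered exactly once, upstream.
    def inExperiment(spec: ExperimentSpec)(experimentId: String, submissionDate: LocalDate): Boolean =
      experimentId == spec.experimentId &&
        !submissionDate.isBefore(spec.start) &&
        !submissionDate.isAfter(spec.end)

A future experiment would then only need to supply its own ExperimentSpec, with the grouping and aggregation shared unchanged.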