The current dataset is written using the Spark RDD API and contains a number of code paths that make generating the data slower than it needs to be. It is also difficult to reuse the cleaning logic for similar data sources, such as the main ping. Supporting new data sources will require a fairly extensive overhaul of the ETL code. The following changes are expected:

* Refactor the script entrypoint to conform to the rest of the mozetl scripts
* Make the cleaning process reusable across multiple data sources
* Convert the map/reduceByKey actions into Spark SQL compatible operations
* Join in the effective version using a separate table

In addition, the new dataset needs to be validated against a previous version. This will be performed using a 1% sample of the data.
Initial performance testing has brought the runtime down from 50 minutes on 5 machines to 20 minutes on 3 machines. The accuracy of the new code still needs to be verified against the old code. The tests also need to be rewritten, since the code has been restructured into smaller parts: the first part cleans the data before aggregation, and the second part aggregates the data into groups and applies any post-processing (such as computing the weighted mean as an additional metric). The script is now more agnostic to the start date, reporting period, and overall ping latency, all of which can be adjusted via the command line. It is also increasingly important to have good routines for verifying that the data looks correct; a generic tool for this has been considered in bug 1347705. These changes should make it easier to reuse the cleaning routines with other data sources, as well as prepare for the eventual HLL dataset proposed in bug 1381840.
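The weighted-mean post-processing step mentioned above boils down to the following calculation. This is a minimal plain-Python sketch of the logic, not the actual mozetl implementation; the function name and the zero-weight behavior are assumptions for illustration.

```python
# Illustrative weighted mean, e.g. a per-group metric weighted by client
# counts. Returns 0.0 for an empty or all-zero weight vector, which is an
# assumed convention, not necessarily what the real code does.
def weighted_mean(values, weights):
    """Return the mean of `values`, each weighted by the matching entry in `weights`."""
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

For example, `weighted_mean([1.0, 3.0], [1, 3])` gives 2.5, since the value 3.0 carries three times the weight of 1.0.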
It might make sense to bump the version number and make this dataset purely about retention figures, instead of cramming extra things like usage hours into it. This would make the set computation significantly easier and avoid integration issues across data sources (see the new-profile ping).
Comment 2 should read "bump the version number".
I spent a little time writing up a document on handling timestamps and dates within main_summary, in order to keep a consistent view of time. This should clear up confusion about dealing with skew and outliers in these fields. https://docs.google.com/a/mozilla.com/spreadsheets/d/1DQk0YQx2PLaY2ZMTdhF2TA-ueZ6g5PWwUqbPWL90AHM/edit?usp=sharing
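One common rule for this kind of cleanup is to clamp client-reported dates into a plausible window around the server-side submission date. The sketch below is illustrative only: the window bounds and function name are made-up values, not the ones specified in the linked document.

```python
# Illustrative guard against client clock skew: clamp a client-reported
# date into [submission - MAX_PAST, submission + MAX_FUTURE]. The bounds
# here are assumptions for the example, not real mozetl constants.
from datetime import date, timedelta

MAX_FUTURE = timedelta(days=2)    # tolerate a little forward clock skew
MAX_PAST = timedelta(days=180)    # treat far-past dates as outliers

def clamp_client_date(client_date, submission_date):
    """Clamp `client_date` into a plausible window around `submission_date`."""
    lower = submission_date - MAX_PAST
    upper = submission_date + MAX_FUTURE
    return min(max(client_date, lower), upper)
```

A date years in the future collapses to two days past submission, a date decades in the past collapses to the start of the window, and anything in between passes through unchanged.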
The churn cleaning routines have been updated to use the DataFrame API, and additional tests have been added to verify that the behavior is unchanged.