
Port Churn dataset to Spark 2.0

Status: RESOLVED FIXED
Product: Data Platform and Tools
Component: Datasets: General
Priority: P1
Severity: normal
Opened: 2 months ago
Last modified: 2 days ago

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Tracking

(Blocks: 2 bugs)

Details

Attachments

(1 attachment)

(Assignee)

Description

2 months ago
The current dataset is written using the Spark RDD API and contains many code paths that make data generation slower than it needs to be. In addition, it is difficult to clean data from similar data sources, such as the main ping.

This will require a fairly extensive overhaul of the ETL code. The following changes are expected before new data sources can be included.

* Refactor the script entrypoint to conform to the rest of the mozetl scripts
* Build a reusable cleaning process for multiple data sources
* Convert the map/reduceByKey actions into Spark SQL-compatible operations
* Derive the effective version via a join against a separate lookup table

In addition, the dataset must be validated against the previous version; this will be performed using a 1% sample of the data.
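For readers unfamiliar with the map/reduceByKey pattern being replaced, the following pure-Python sketch shows its shape (the rows, keys, and column names here are invented for illustration; the real churn code operates on ping records):

```python
from collections import defaultdict

# Hypothetical stand-in rows for the churn data: (channel, is_active)
# pairs instead of real ping records.
rows = [
    ("release", 1), ("release", 0), ("beta", 1),
    ("release", 1), ("beta", 0), ("beta", 1),
]

def reduce_by_key(pairs):
    """RDD-style aggregation: emit (key, value) pairs, then sum the
    values per key. This mirrors rdd.map(...).reduceByKey(add), which
    the port replaces with a declarative grouped aggregation, roughly
    df.groupBy("channel").agg(F.sum("is_active")) in Spark SQL.
    """
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

print(reduce_by_key(rows))  # active-client counts per channel
```

Expressing the aggregation declaratively is what lets Spark's query planner optimize it, which is the point of the conversion listed above.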
(Assignee)

Updated

2 months ago
Blocks: 1381806
Points: --- → 3
(Assignee)

Updated

2 months ago
Blocks: 1389231
(Assignee)

Updated

a month ago
Assignee: nobody → amiyaguchi
Status: NEW → ASSIGNED
Priority: -- → P1
(Assignee)

Comment 1

a month ago
Initial performance testing has brought the runtime down from 50 minutes on 5 machines to 20 minutes on 3 machines. The accuracy of the new code still needs to be verified against the old code.

In addition, the tests need to be rewritten, since the code has been significantly restructured into smaller parts. The first part handles cleaning the data before it is aggregated. The second part handles aggregating the data into groups, along with any post-processing that needs to occur (such as computing the weighted mean as an additional metric). The script is now more agnostic to the start date, period, and overall ping latency, which can be adjusted via the command line.

It is also increasingly important to have good routines for verifying that the data looks correct. A generic tool for this has been considered in bug 1347705.

These changes should make it easier to reuse the cleaning routines with other data sources, as well as prepare for the eventual HLL dataset proposed in bug 1381840.
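The weighted-mean post-processing metric mentioned above can be sketched as follows (a minimal illustration; the actual column names and aggregate expressions in the port may differ):

```python
def weighted_mean(values, weights):
    """Weighted mean, computed as sum(v * w) / sum(w).

    In the DataFrame port this kind of post-aggregation metric would
    be expressed with aggregate expressions rather than RDD reduces.
    Returns None when the total weight is zero, to avoid dividing by
    zero on empty groups.
    """
    total_weight = sum(weights)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# e.g. usage values 2.0 and 4.0 weighted by client counts 1 and 3:
# weighted_mean([2.0, 4.0], [1, 3]) -> 3.5
```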
(Assignee)

Comment 2

a month ago
It might make sense to bump the usage hours so that this dataset is purely about retention figures, instead of cramming extra metrics like usage hours into it. This would make the set computation significantly easier and avoid issues with integration across data sources (see the new-profile ping).
(Assignee)

Comment 3

a month ago
Comment 2 should read "bump the version number".
(Assignee)

Comment 4

15 days ago
I spent a little time writing up a document on handling timestamps and dates within main_summary, in order to keep a consistent view of time. [1] This should clear up confusion about dealing with skew and outliers in these fields.


[1] https://docs.google.com/a/mozilla.com/spreadsheets/d/1DQk0YQx2PLaY2ZMTdhF2TA-ueZ6g5PWwUqbPWL90AHM/edit?usp=sharing
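A consistent view of time usually comes down to accepting only timestamps that fall inside the reporting window. The helper below is a hypothetical sketch of that idea (the function name and the default period/latency values are invented, not taken from the document above):

```python
from datetime import datetime, timedelta

def in_reporting_window(ts, period_start, period_days=7, latency_days=10):
    """Accept a client timestamp only if it falls between the start of
    the reporting period and the end of the period plus the allowed
    ping latency. Timestamps outside this window are treated as clock
    skew or outliers and dropped before aggregation.
    """
    window_end = period_start + timedelta(days=period_days + latency_days)
    return period_start <= ts < window_end

# A ping dated years in the future (skewed client clock) is rejected:
# in_reporting_window(datetime(2020, 1, 1), datetime(2017, 9, 1)) -> False
```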
(Assignee)

Comment 5

8 days ago
Created attachment 8908839 [details] [review]
Bug 1389230 - Use DataFrames API in Churn #114
(Assignee)

Comment 6

2 days ago
The churn cleaning routines have been updated to use the DataFrame API, and additional tests have been added to verify that the behavior is unchanged.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 days ago
Resolution: --- → FIXED
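The equivalence checks described in this bug (comparing the new output against the old on a sample) amount to an order-insensitive comparison of result rows. A sketch of that kind of check, with an illustrative float tolerance for recomputed metrics:

```python
import math

def results_match(old_rows, new_rows, tol=1e-9):
    """Compare two aggregated result sets while ignoring row order,
    as one might when validating the DataFrame port against the old
    RDD output on a sample. Floats are compared with a tolerance,
    since recomputed metrics can differ in the last few bits.
    """
    if len(old_rows) != len(new_rows):
        return False
    for a, b in zip(sorted(old_rows), sorted(new_rows)):
        for x, y in zip(a, b):
            if isinstance(x, float):
                if not math.isclose(x, y, abs_tol=tol):
                    return False
            elif x != y:
                return False
    return True
```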