Closed Bug 1389230 Opened 7 years ago Closed 7 years ago

Use DataFrame API in Churn

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Attachments

(1 file)

Bug 1389230 - Use DataFrames API in Churn #114 (7 years ago, Anthony Miyaguchi [:amiyaguchi], 49 bytes, text/x-github-pull-request)

Anthony Miyaguchi [:amiyaguchi]

Assignee

Description

•

7 years ago

The current dataset is generated using the Spark RDD API, and the job contains many code paths that make generating the data slower than it could be. In addition, it is difficult to clean data from similar data sources, such as the main ping.

This will require a fairly extensive overhaul of the ETL code. The following changes are expected before new data sources can be included.

* Refactoring of script entrypoint to conform to the rest of mozetl scripts
* Reusable cleaning process for multiple data sources
* Conversion of the map/reduceByKey steps into Spark SQL compatible operations
* Looking up the effective version by joining against a separate table
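
As a rough sketch of the map/reduceByKey conversion listed above: the first function mimics the RDD-style aggregation in plain Python, and the second expresses the same grouping with the DataFrame API. The column names (channel, geo, usage_hours) and both function names are hypothetical and are not taken from the churn job itself.

```python
from collections import defaultdict


def aggregate_reference(rows):
    """Plain-Python stand-in for the RDD pattern (illustrative only).

    Roughly equivalent to:
        rdd.map(lambda r: ((r["channel"], r["geo"]), (1, r["usage_hours"])))
           .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    """
    totals = defaultdict(lambda: [0, 0.0])
    for row in rows:
        key = (row["channel"], row["geo"])  # hypothetical grouping columns
        totals[key][0] += 1                 # number of pings in the group
        totals[key][1] += row["usage_hours"]
    return {key: tuple(value) for key, value in totals.items()}


def aggregate_dataframe(df):
    """The same aggregation expressed with the DataFrame API."""
    # Imported lazily so the reference function above runs without Spark.
    from pyspark.sql import functions as F

    return df.groupBy("channel", "geo").agg(
        F.count("*").alias("n_pings"),
        F.sum("usage_hours").alias("usage_hours"),
    )
```

Expressing the aggregation as groupBy/agg lets Spark SQL plan and optimize it, instead of running opaque Python lambdas over every record.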

In addition, the dataset needs to be validated against a previous version. This will be performed using a 1% sample of the data.
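
One way such a validation could look, as a minimal sketch: keep a deterministic 1% of clients, then compare the per-group values produced by the old and new code. The hashing scheme, the bucket number, the 1% relative tolerance, and all function names here are assumptions for illustration. This bug does not specify how the sample is drawn or how close the results need to be.

```python
import zlib


def in_sample(client_id, bucket=42, n_buckets=100):
    """Deterministically keep roughly 1/n_buckets of all clients."""
    return zlib.crc32(client_id.encode("utf-8")) % n_buckets == bucket


def compare_aggregates(old, new, rel_tol=0.01):
    """Return groups that are missing on one side or differ by more than rel_tol.

    `old` and `new` map a group key to a numeric value (e.g. a profile count).
    """
    mismatches = {}
    for key in sorted(set(old) | set(new)):
        a, b = old.get(key), new.get(key)
        if a is None or b is None:
            mismatches[key] = (a, b)
        elif abs(a - b) > rel_tol * max(abs(a), abs(b)):
            mismatches[key] = (a, b)
    return mismatches
```

An empty result from compare_aggregates means every group is present in both versions and agrees within the tolerance.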

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

7 years ago

Blocks: 1381806

Points: --- → 3

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

7 years ago

Blocks: 1389231

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

7 years ago

Assignee: nobody → amiyaguchi

Status: NEW → ASSIGNED

Priority: -- → P1

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 1

•

7 years ago

Initial performance testing shows the runtime dropping from 50 minutes on 5 machines to 20 minutes on 3 machines. The accuracy of the new code still needs to be tested against the old code.

In addition, the tests need to be rewritten, since the code has changed significantly and is now broken down into smaller parts. The first part cleans the data before it is aggregated. The second part aggregates the data into groups and handles any post-processing that needs to occur (such as computing the weighted mean as an additional metric). The script is now more agnostic to the start date, period, and overall ping latency, all of which can be adjusted via the command line.
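
The weighted mean mentioned above could be computed along these lines. This is a minimal sketch: what gets weighted by what (here, a per-group value weighted by the group size) is an assumption, and the function name is hypothetical rather than taken from the churn code.

```python
def weighted_mean(values, weights):
    """Return sum(v * w) / sum(w), or None when the total weight is zero."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        return None
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

For example, a group of 1 profile averaging 2.0 and a group of 3 profiles averaging 4.0 combine to (2.0 * 1 + 4.0 * 3) / 4 = 3.5, rather than the unweighted 3.0.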

It's also more important to have good routines for verifying that the data looks correct. A generic tool for this has been considered in bug 1347705.

These additions should make it easier to reuse the cleaning routines against other data sources, as well as prepare for an eventual HLL dataset proposed in bug 1381840.

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 2

•

7 years ago

It might make sense to bump the usage hours to make this dataset purely about retention figures, instead of cramming extra things like usage hours into the dataset. This makes the set computation significantly easier, and avoids issues with integration across data sources (see new profile ping).

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 3

•

7 years ago

Comment 2 should read "bump the version number".

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 4

•

7 years ago

I spent a little bit of time writing up a document for handling timestamps and dates within main_summary, in order to keep a consistent view of time.[1] This should clear up confusion about dealing with skew and outliers in these values.


[1] https://docs.google.com/a/mozilla.com/spreadsheets/d/1DQk0YQx2PLaY2ZMTdhF2TA-ueZ6g5PWwUqbPWL90AHM/edit?usp=sharing

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 5

•

7 years ago

Attached file: Bug 1389230 - Use DataFrames API in Churn #114

Anthony Miyaguchi [:amiyaguchi]

Assignee

Comment 6

•

7 years ago

The churn cleaning routines have been updated to use the DataFrame API, and additional tests have been added to verify the functionality is the same.

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Anthony Miyaguchi [:amiyaguchi]

Assignee

Updated

•

7 years ago

Summary: Port Churn dataset to Spark 2.0 → Use DataFrame API in Churn

Nobody; OK to take it and work on it

Updated

•

2 years ago

Component: Datasets: General → General

Categories

(Data Platform and Tools :: General, enhancement, P1)

Security

(public)
