Status

defect
P1
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Tracking

(Blocks 1 bug)

Details

Attachments

(4 attachments, 1 obsolete attachment)

The churn.v2 dataset computes unbounded 7-day churn for Firefox profiles.

This is useful for measuring the rate of churn week over week across a wide set of attributes, but it is limited for exploratory purposes. The underlying data exists in telemetry, but it is not easily accessible.

A new generic churn dataset can support n-day retention, unbounded retention, and even bracket retention with carefully chosen pre-computed views of main_summary. [1] 

# Approach
The approach is to calculate n-day retention over a daily rolling window that captures 95+% of pings in a given retention window. Good values to pre-compute are the 1-, 2-, 3-, 5-, 7-, 14-, and 30-day retention periods. These should be sufficient for most n-day retention needs, since the period can be shifted through the choice of a start (reference) date.
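As a sketch of the core computation, the toy below (plain Python with hypothetical names; the real job would run over main_summary in Spark) computes bounded n-day retention for a single reference-date cohort. Note that definitions of n-day retention vary; this toy treats the window as bounded.

```python
from datetime import date, timedelta

def n_day_retention(pings, n, reference):
    """Fraction of the reference-day cohort seen again within the
    n-day window (reference, reference + n]."""
    cohort = {cid for cid, d in pings if d == reference}
    end = reference + timedelta(days=n)
    returned = {cid for cid, d in pings
                if cid in cohort and reference < d <= end}
    return len(returned) / len(cohort) if cohort else 0.0

# toy ping log: (client_id, activity date)
d0 = date(2017, 8, 1)
pings = [("a", d0), ("a", d0 + timedelta(days=3)),
         ("b", d0),
         ("c", d0), ("c", d0 + timedelta(days=10))]
```

Shifting the reference date slides the retention period without recomputing anything else, which is what makes the pre-computed windows reusable.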

The retention of arbitrary windows can be reconstructed through set-intersection sketches (HyperLogLog and MinHash). [2] It is possible to maintain good error bounds for arbitrary, unbounded retention periods.
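To make the sketch-based reconstruction concrete, here is a minimal, illustrative HyperLogLog with a register-max merge (union) and an inclusion-exclusion intersection estimate. This is a toy to show the mechanics only; a production job would use a library implementation such as algebird's.

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch: p index bits -> 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)                      # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # lossless union: element-wise max of registers
        out = HLL(self.p)
        out.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return out

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:             # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw

def intersection(a, b):
    """Inclusion-exclusion estimate: |A ∩ B| ≈ |A| + |B| - |A ∪ B|."""
    return a.count() + b.count() - a.merge(b).count()

# two overlapping populations of 10k clients each (5k shared)
a, b = HLL(), HLL()
for i in range(10000):
    a.add(f"client-{i}")
for i in range(5000, 15000):
    b.add(f"client-{i}")
```

The intersection estimate inherits error from three count estimates, which is why it degrades for small overlaps; the MinHash approach in [2] gives tighter bounds there.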

Bracket retention can be implemented following the same approach.

# Benchmarking
The existing ETL job is slow: churn.v2 runs an average of 80 minutes on 5 c4.8xlarge machines. The new job should take advantage of the Catalyst optimizer and caching mechanisms of Spark 2.0, allowing faster computation and support for processing significantly more data.

There is a significant amount of re-computation, but this is necessary to capture the full set of data. Jobs can be pruned from the schedule; for example, windows without sufficient statistical power due to client latency are skipped. It may also be possible to intersect many smaller datasets to build larger ones.

A multi-tenant Spark cluster may be able to take advantage of caching across many smaller jobs.

# Dataset monitoring
Given the sheer number of derived datasets, there is room for quality control. The process of creating the dataset decomposes the incoming telemetry data in a time window into many smaller frequency components (retention periods). Validation can occur across different frequencies in the same time window, or across time windows.

For example, the set intersection of many 1-day retention windows should be within some tolerance of the 30-day retention. Additionally, the 1-day retention should not differ significantly from day to day.

These assumptions should be encoded in a validation model that asserts the dataset is within design specifications, alerting (and filing a bug) otherwise.
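A minimal sketch of such a validation model, using exact client-id sets as a stand-in for HLL sketches (with real sketches, the tolerances would also need to absorb estimation error); the function name and thresholds are illustrative:

```python
def validate(daily_sets, monthly_set, union_tol=0.02, day_tol=0.25):
    """Return a list of violated design specifications (empty = healthy).

    Checks the two invariants from the proposal:
      1. the union of the 1-day windows should be within tolerance
         of the 30-day count;
      2. the 1-day count should not jump significantly day to day.
    """
    problems = []
    union_count = len(set().union(*daily_sets))
    if abs(union_count - len(monthly_set)) > union_tol * len(monthly_set):
        problems.append("union of dailies drifts from 30-day count")
    for prev, cur in zip(daily_sets, daily_sets[1:]):
        if abs(len(cur) - len(prev)) > day_tol * max(len(prev), 1):
            problems.append("1-day count jumped day-over-day")
    return problems

# toy data: 30 overlapping daily windows and their exact 30-day union
daily = [set(range(i, i + 100)) for i in range(0, 300, 10)]
monthly = set().union(*daily)
```

A scheduled job could run these checks after each daily load and file a bug when the returned list is non-empty.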

# Backfills
Backfilling this job should be done separately from the scheduled daily job. The backfill job should aim to reconstruct the dataset with the minimal amount of computation. 


# Other comments
It would be nice to retain properties about cohorts over time, such as engagement metrics. However, maintaining an accurate distribution of metrics like `unique_domains_count` across sub-populations is not trivial with SQL aggregate functions (min, max, sum, avg, count, count(*)). One approach is to use a reduce-by-key function that operates on a parameterized distribution, implemented as a user-defined aggregate function. [3]
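As an illustration of the reduce-by-key idea, the sketch below maintains a mergeable Gaussian (mean/variance) summary, the same shape as a UDAF buffer with per-row update and cross-partition merge steps. Names are hypothetical and this is plain Python, not the Spark API:

```python
from dataclasses import dataclass
from functools import reduce

@dataclass
class GaussianSummary:
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        # Welford's online update for a single value (UDAF "update" step)
        n = self.n + 1
        delta = x - self.mean
        mean = self.mean + delta / n
        return GaussianSummary(n, mean, self.m2 + delta * (x - mean))

    def merge(self, other):
        # Chan et al. parallel combine (UDAF "merge" step)
        if self.n == 0:
            return other
        if other.n == 0:
            return self
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return GaussianSummary(n, mean, m2)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

# reduce-by-key shape: summarize partitions, then merge the summaries
values = list(range(1, 101))
chunks = [values[i:i + 25] for i in range(0, 100, 25)]
partials = [reduce(lambda s, x: s.add(x), c, GaussianSummary()) for c in chunks]
total = reduce(lambda a, b: a.merge(b), partials)
```

Because the merge is associative and order-independent, the same summary type works per cohort key regardless of how Spark partitions the rows.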


API:
> create_churn(start: date, end: date, n: int, unbounded: bool, attributes: list[col]) -> dataset[row]


[1] https://amplitude.com/blog/2016/08/11/3-ways-measure-user-retention/
[2] http://tech.adroll.com/media/hllminhash.pdf
[3] https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
Assignee

Updated

2 years ago
Points: --- → 5
Priority: -- → P3
Anthony, could this be extended to calculate churn on any incoming dataset, à la GenericLongitudinal and GenericClientCount? I could see this being used on mobile, at least.
Yes, it should be able to accept arbitrary rows in the form of `(timestamp, client_id, [attributes], [metrics])`.

Attributes describe the cohort that the client is part of. Metrics are measurements associated with clients and cohorts, such as subsession length, number of profiles, or unique domains count. Metrics are operated on by aggregate (or user-defined aggregate) functions, e.g. `count(*)`, to calculate measures across cohorts and retention periods.
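A toy illustration of that row shape (field names and schema are illustrative, not final), counting distinct clients per (attributes, day_number) cell of a cohort table:

```python
from collections import defaultdict
from datetime import date

def retention_table(rows, cohort_start):
    """Count distinct clients per (attributes, day_number) cell,
    where day_number is days since the cohort start date."""
    cells = defaultdict(set)
    for ts, client_id, attributes, _metrics in rows:
        day_number = (ts - cohort_start).days
        if day_number >= 0:
            cells[(attributes, day_number)].add(client_id)
    return {k: len(v) for k, v in cells.items()}

# rows in the proposed (timestamp, client_id, [attributes], [metrics]) form
rows = [
    (date(2017, 9, 1), "a", ("release", "US"), {"subsession_hours": 1.5}),
    (date(2017, 9, 2), "a", ("release", "US"), {"subsession_hours": 2.0}),
    (date(2017, 9, 1), "b", ("beta", "DE"), {"subsession_hours": 0.5}),
]
table = retention_table(rows, date(2017, 9, 1))
```

The metrics column is carried but unused here; in the full design it would be folded by an aggregate function per cell.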


It might be tricky to use in this form, though; I'm still thinking of a good API to go along with the processed data. I imagine most of the exploratory analysis being done through Spark, with maybe a couple of UDFs in Presto/Athena that make it easy to plot retention curves and cohort charts.
Blocks: 1381806
There has been a bit of progress on this bug -- I'd like to transition this to be the bug for a 1-day Firefox retention dataset. A generic proposal would be a bit ambitious without first constructing a more specific one.

I've done a bit more reading into the usage and error properties of HyperLogLog (HLL), enough to know that it's probably unnecessary to keep overlapping days between different HLL objects. I've also constructed two small datasets (`retention_dev` on Presto) that use the generic client count script, to understand the two source datasets a bit better. I've made very rough estimates of the size of the dataset, the kinds of questions it can answer, and the tractability of a client count dataset in higher dimensions.

Precomputing rolling windows for retention is now unnecessary with the advent of ping-sender in Firefox 55 and a better understanding of merging. Ping-sender reduces the latency for 95% of pings to about 2 days.[1] In addition, the union (or merge) of HLL objects is a lossless operation: the merge takes the max of each register in the object. This property is why HLL is of interest in streaming and map-reduce applications. Together, these mean the "rolling window" concept in comment #1 provides little added value.
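The lossless-merge claim can be demonstrated directly on raw register arrays: register-max merge is commutative, associative, and idempotent, so daily sketches roll up into weekly (or monthly) ones in any order without double-counting. A toy demonstration:

```python
from functools import reduce
import random

def merge(r1, r2):
    # HLL merge: element-wise max of two register arrays (lossless union)
    return [max(a, b) for a, b in zip(r1, r2)]

# seven fake daily sketches, each with 16 registers of small ranks
random.seed(0)
days = [[random.randint(0, 20) for _ in range(16)] for _ in range(7)]
week = reduce(merge, days)
```

This order-independence is what removes the need to precompute overlapping windows: any window can be assembled after the fact from daily sketches.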

In general, the dataset is limited to the 1-day granularity of the profile creation date. It is technically possible to produce the dataset at a 1-hour granularity (subsession start is rounded to the nearest hour). One [expensive] way of removing skew in the profile creation date is to use the `new_profile` ping to generate an offset for each client; this requires a large random-access table defining clock skew for each client. A spreadsheet for understanding columns related to skew in telemetry pings has been compiled in this reference sheet. [2]


I've made two iterations on this dataset using a limited subset of fields and a 1% sample of clients. These datasets use `main_summary` in conjunction with the `new_profile` parquet dataset. They have been fed into the generic client count dataset and queried in Redash to build a 1-day retention graph. It should be possible to construct 7-day unbounded retention from these datasets. The `first-shutdown` ping in Firefox 57 should replace `new_profile` in this context.

`retention_dev_v1` was the first attempt at making a dataset. This one was focused on clients that appeared in the new_profile ping and was mostly a learning experience in constructing and using HLL objects. This data is effective starting 20170808 using a 1% sample of clients (sample_id=57). The scripts to generate the data can be found in [3], and the example query starting from Firefox 55 release can be found in [4]. 

`retention_dev_v2` tries to include a wider variety of fields to make it possible to recreate the original churn_v2 dataset. The first iteration is missing fields that would make it easier to cross-reference with the 7-day unbounded retention. This script uses a subset of the cleaning procedure developed for an updated version of the churn script. The channel, country, locale, and distribution_id are now queryable fields. There were no existing cohort graphs for the 7-day churn dataset, so I created one for reference in [5]. The script and query can be found in [6] and [7] respectively.

I think that these queries probably need a bit more attention, but they can be worked on once new_profiles have been integrated into churn.

I still need to reproduce the 7-day retention dataset and see if it aligns. There are a few other interesting cases, like release-by-release retention and uptake of updates, that require a bit more thought before they can be performed. Specifically, the current queries use a `day_number` field representing the number of days since profile creation; this should really be the number of days since an arbitrary cohort start.

I haven't chosen a particular value for the HLL precision bits, which should depend on the telemetry data. It might be useful to know the number of unique clients across a day, week, month, and year. A value of 12 bits leads to a 1.63% error with a 4 KB object.[8] The current churn dataset has 2e7 rows per week.[9] This is, of course, inflated by dimensions like `default_search_engine`, so 1e6 rows may be a more reasonable estimate. That works out to roughly 4 GB a day.
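The sizing arithmetic can be checked with a back-of-envelope helper (assuming a dense layout with 1 byte per register and the standard 1.04/sqrt(m) HLL error bound; algebird's actual serialized size may differ):

```python
import math

def hll_sizing(p, rows_per_day, bytes_per_register=1):
    """Estimate relative error, per-object size, and daily storage
    for a dataset of HLL sketches with 2**p registers per row."""
    m = 2 ** p
    rel_error = 1.04 / math.sqrt(m)      # standard HLL error bound
    object_bytes = m * bytes_per_register
    daily_bytes = rows_per_day * object_bytes
    return rel_error, object_bytes, daily_bytes

# 12 bits, ~1e6 rows/day as estimated above
err, obj, daily = hll_sizing(12, 1_000_000)
```

With p=12 this reproduces the figures in the text: ~1.63% error, a 4 KB object, and roughly 4 GB per day at 1e6 rows.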

Some dimensions add sparsity and bloat the size of the dataset more than they should. These include sparse fields like distribution_id, attribution, experiment_id, etc. It might be interesting to experiment with intersections via MinHash, which provides better guarantees than the inclusion-exclusion intersection estimate bundled with algebird's implementation.[10][11] Intersections can decompose a sparse, high-dimensional dataset into smaller, dense datasets. One interesting example of HLL intersection would be estimating add-on retention by intersecting an add-on table `Addon[addon_id: String, client_count: HLL]` with the retention table `Retention[date: DateTime, client_count: HLL]`.

[1] https://docs.google.com/document/d/1_4szwRs4RGeKhkvlIFCHFNJgp9hQ30iAZGoiArtaVDU/edit#heading=h.nk5led35uaci
[2] https://docs.google.com/a/mozilla.com/spreadsheets/d/1DQk0YQx2PLaY2ZMTdhF2TA-ueZ6g5PWwUqbPWL90AHM/edit?usp=sharing
[3] https://gist.github.com/acmiyaguchi/f261f80920f86d23a14ca105c2a75e63
[4] https://sql.telemetry.mozilla.org/queries/36836#98680
[5] https://sql.telemetry.mozilla.org/queries/35768#95755
[6] https://gist.github.com/acmiyaguchi/8fb19ea223e2e400439fa74ae1906afb
[7] https://sql.telemetry.mozilla.org/queries/38607/source#103680
[8] https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L239
[9] https://sql.telemetry.mozilla.org/queries/35738
[10] http://tech.adroll.com/blog/data/2013/07/10/hll-minhash.html
[11] https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/HyperLogLog.scala#L632-L652
Summary: Generic churn dataset proposal → 1 Day Firefox Retention Dataset
Posted file 1 Day Retention v1 (obsolete) —
Attachment #8909525 - Attachment is obsolete: true
Attachment #8909525 - Attachment mime type: text/plain → text/url
An initial dashboard showcasing simple examples of all the datasets mentioned here can be found on STMO (https://sql.telemetry.mozilla.org/dashboard/firefox-telemetry-retention-dataset-example-usage).
Assignee: nobody → amiyaguchi
Points: 5 → 3
Priority: P3 → P1
Depends on: 1407410
Attachment #8917172 - Attachment description: Bug 131840 - Schedule 1-Day Firefox Retention #188 → Bug 1381840 - Schedule 1-Day Firefox Retention #188
Attachment #8913870 - Attachment description: Link to GitHub pull-request: https://github.com/mozilla/python_mozetl/pull/133 → Bug 1381840 - Add 1 day Firefox retention intermediate dataset #133
Depends on: 1409174, 1409175
Depends on: 1410254
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED