Determine initial slack for computing 1 day retention



2 years ago


(Reporter: amiyaguchi, Assigned: amiyaguchi)




(1 attachment)

The 1-day retention dataset describes the count of users seen on a particular activity date. Submission latency affects what portion of activity we observe after a given amount of waiting. In the `mozetl.engagement.retention` job, there is a configurable `--slack` argument that offsets the computation of the retention dataset by `n` days.
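As a rough illustration of what an offset of `n` days means, the sketch below maps a job run date to the activity date it would process. The function name `activity_date_for_run` is hypothetical and is not taken from `mozetl`; the actual job may apply the offset differently.

```python
from datetime import date, timedelta


def activity_date_for_run(run_date: date, slack_days: int) -> date:
    """Return the activity date a run would process when offset by slack_days.

    Hypothetical helper for illustration only, not the mozetl implementation.
    """
    if slack_days < 0:
        raise ValueError("slack_days must be non-negative")
    # Waiting slack_days gives late submissions time to arrive before we count.
    return run_date - timedelta(days=slack_days)


if __name__ == "__main__":
    # With a slack of 2 days, a run on 2017-10-15 looks at activity from 2017-10-13.
    print(activity_date_for_run(date(2017, 10, 15), 2))
```

A larger slack trades freshness for completeness: each extra day delays the dataset by a day but lets more late pings land.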

The churn dataset is sensitive to this value, and its slack has been set to 10 days. However, this means that there are up to 17 days of lag. This dataset focuses on Firefox 55+ and should benefit from lower-latency submissions.

Currently, this option defaults to 2 days, which should be enough to capture 95% of the data. On the telemetry-health dashboard, the 95th percentile graph for the latest nightly seems to vary wildly (up to 447 hours for nightly 57). [1]

1. Looking at Firefox Beta 57, is it safe to assume the following?

- ~1 day captures about 95% of activity
- ~4 days captures about 99% of activity

2. From Firefox 55+, are these assumptions reasonable?

- For 95% of clients, wait 2 days
- For 99% of clients, wait 5 days

3. What value of slack should be set for 1-day retention?
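One way to frame these questions is to pick the smallest whole number of days whose window covers a target fraction of submissions. The sketch below assumes a list of per-ping submission delays in hours is available; the function name and the sample delays are made up for illustration and are not dashboard data.

```python
def slack_days_for_coverage(delays_hours, target, max_days=30):
    """Return the smallest number of whole days covering `target` of the pings.

    delays_hours: submission delays in hours (hypothetical input).
    target: desired fraction of pings, e.g. 0.95 or 0.99.
    Returns None if max_days is not enough to reach the target.
    """
    if not delays_hours:
        raise ValueError("delays_hours must not be empty")
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    total = len(delays_hours)
    for days in range(max_days + 1):
        # Count pings that would have arrived within `days` full days.
        covered = sum(1 for delay in delays_hours if delay <= days * 24)
        if covered / total >= target:
            return days
    return None


if __name__ == "__main__":
    # Made-up delays: most pings arrive quickly, with a long tail.
    sample = [1] * 90 + [30] * 6 + [80] * 3 + [400]
    print(slack_days_for_coverage(sample, 0.95))
    print(slack_days_for_coverage(sample, 0.99))
```

Rounding up to whole days is a deliberate choice here, since the slack is expressed in days.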



2 years ago
Blocks: 1381840
Are points 1 and 2 in comment #1 reasonable assumptions to make based on the data in the telemetry health dashboard?
Flags: needinfo?(chutten)
The wild variations in Nightly 57 are because we're now on Nightly 58. The maximum delay we've seen on a current nightly for the 95%ile has only twice been more than 30 hours over the past three months.

I don't know the particular qualities of the 1-day retention dataset. Can you direct me to some documentation?
Flags: needinfo?(chutten) → needinfo?(amiyaguchi)
Thanks, that sheds more light on the bottommost plot.

Bug 1381840 Comment 3 is probably the closest thing to documentation besides the docstring in the retention module at the moment. [1] I'll be adding a doc page to DTMO before it's deployed on Airflow.

Flags: needinfo?(amiyaguchi)
The slack value is one of the last configurations to tune before deploying the 1-day retention dataset. Is 95% of client activity per day an adequate figure for 1-day retention?
Flags: needinfo?(pdolanjski)
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #4)
> The slack value is one of the last configurations to tune before deploying
> the 1-day retention dataset. Is 95% of client activity per day an
> adequate figure for 1-day retention?

Is it safe to assume that 95% gives us a really good sample of the whole data set, such that if we see statistically significant variation in retention when only 95% of the data is present, we can assume that will usually hold true with 100%?
Flags: needinfo?(pdolanjski) → needinfo?(amiyaguchi)
Probably not; it would depend on the size of the cohort and the type of bias introduced by latency. In addition, the use of HLL will introduce standard error that will most likely compound. I can think of a way of performing validation to quantify the error with 95% of activity, but waiting an extra 2-3 days for 99% of the activity is the safe route.

As I understand it, significance testing will require going back to the raw data to calculate the standard deviation. I imagine this to be straightforward to automate once relevant subpopulations are identified, by hand or other means.

In any case, it sounds like it's better to err on the side of caution and increase the slack to accommodate 99% of activity for a day.
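To give a sense of how HLL error might compound, here is a back-of-the-envelope sketch. It uses the standard HyperLogLog relative standard error of roughly 1.04 / sqrt(m) for m registers, and first-order error propagation for a ratio of two estimates. The precision of 12 bits and the assumption that the two errors are independent are illustrative guesses, not properties of the retention dataset.

```python
import math


def hll_relative_error(precision_bits: int) -> float:
    """Approximate relative standard error of an HLL with 2**precision_bits registers."""
    registers = 2 ** precision_bits
    return 1.04 / math.sqrt(registers)


def ratio_relative_error(err_numerator: float, err_denominator: float) -> float:
    """First-order relative error of a ratio, assuming independent errors."""
    return math.sqrt(err_numerator ** 2 + err_denominator ** 2)


if __name__ == "__main__":
    # precision_bits=12 is an assumed setting for illustration.
    single = hll_relative_error(12)
    # Retention as retained / cohort, both estimated with the same HLL precision.
    combined = ratio_relative_error(single, single)
    print(f"single estimate: {single:.4%}, retention ratio: {combined:.4%}")
```

Under these assumptions the ratio's error is about sqrt(2) times that of a single count, before any bias from missing late submissions is added on top.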
Flags: needinfo?(amiyaguchi)


2 years ago
Assignee: nobody → amiyaguchi
Points: --- → 2
Priority: -- → P1


2 years ago
Closed: 2 years ago
Resolution: --- → FIXED