Closed Bug 1323598 Opened 7 years ago Closed 7 years ago

Add additional fields for search retention to churn

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Shraddha, Assigned: amiyaguchi)

Attachments

(1 file)

Hi,

There is a cohort retention report in Tableau dataviz:
https://dataviz.mozilla.org/views/FirefoxDesktopCohortAnalysis-UT_0/ByCountry

The BizDev team would like the report to support two additional filters:
a) distribution_id
b) default_search_engine

Please let me know if any other information is needed for this request. Thanks!
Priority: -- → P3
Points: --- → 2
Hi All,

Adding more context on what this bug needs. The BD team (Joanne/Amit) has made search retention data a priority for 1Q2017. Search data is also important from the business side, so we would like to know the next steps for moving this forward (e.g. meetings with the teams concerned).

Thanks for helping out
Assignee: nobody → amiyaguchi
Priority: P3 → P2
Blocks: 1337044
Priority: P2 → P1
Below is an updated list of additional filters that will be added to the churn/retention dataset located at [1].

a) distribution_id
b) default_search_engine
c) locale

[1] https://github.com/mozilla/mozilla-reports/blob/master/etl/churn.kp/orig_src/Churn.ipynb
Retention data currently lives in the `telemetry-parquet` bucket under `churn/v2` [1]. The data is stored as parquet and is partitioned by `week_start`, the start of the retention period.

Scripts that implicitly assume the current granularity of the data may be affected by these changes, since each new column splits existing rows into finer groups. Scripts should aggregate explicitly over the set of columns they need for analysis/visualizations, as in the query below:

> SELECT channel, distribution_id, SUM(n_profiles)
> FROM churn
> GROUP BY channel, distribution_id;
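
To make the granularity point concrete, here is a minimal sketch that runs the same kind of query against an in-memory SQLite table. The table layout, the partner and engine names, and the counts are all made up for illustration; only the column names `channel`, `distribution_id`, `default_search_engine` and `n_profiles` come from this bug.

```python
import sqlite3

# Hypothetical rows: (channel, distribution_id, default_search_engine, n_profiles).
# The values are invented; the real dataset has more columns than this.
rows = [
    ("release", "partner-a", "engine-x", 100),
    ("release", "partner-a", "engine-y", 40),
    ("release", "partner-b", "engine-x", 25),
    ("beta", "partner-a", "engine-x", 10),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE churn ("
    "channel TEXT, distribution_id TEXT, "
    "default_search_engine TEXT, n_profiles INTEGER)"
)
conn.executemany("INSERT INTO churn VALUES (?, ?, ?, ?)", rows)

# Explicit aggregation: default_search_engine is not in the GROUP BY,
# so its rows are summed into each (channel, distribution_id) pair.
totals = conn.execute(
    "SELECT channel, distribution_id, SUM(n_profiles) "
    "FROM churn "
    "GROUP BY channel, distribution_id "
    "ORDER BY channel, distribution_id"
).fetchall()

for row in totals:
    print(row)
```

A script that instead assumed one row per (channel, distribution_id) would read 100 for release/partner-a here, not the full 140.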

The staging copy of this data will be located in a private bucket at [2]. I plan to fill this location with 1-3 months' worth of data within the next week.

[1] s3://telemetry-parquet/churn/v2/
[2] s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/churn-staging/
I've updated the churn notebook to include the requested fields, along with some other changes. I've verified in this notebook [1] that the updated dataset is equivalent to the older dataset. Next week I will backfill the job by a few months.

On a tangential note, it would also be nice to have unit tests that can automatically verify changes between versions of the churn notebook, but it is not a blocking issue.

[1] https://gist.github.com/acmiyaguchi/f21a92b2980e177ab7fc4468c0c55074
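
For anyone who wants to turn this kind of check into a unit test, here is a minimal sketch of one possible approach. It is not the code from the notebook at [1]. It assumes that "equivalent" means the new dataset reproduces the old counts once the newly added columns are summed away; the helper name `collapse` and all sample rows are hypothetical.

```python
from collections import Counter


def collapse(rows, keep):
    """Sum n_profiles over every column that is not listed in `keep`."""
    totals = Counter()
    for row in rows:
        key = tuple(row[col] for col in keep)
        totals[key] += row["n_profiles"]
    return dict(totals)


# Hypothetical output of the older job: one dimension only.
old_rows = [
    {"channel": "release", "n_profiles": 165},
    {"channel": "beta", "n_profiles": 10},
]

# Hypothetical output of the updated job, with the added fields.
new_rows = [
    {"channel": "release", "distribution_id": "partner-a", "locale": "en-US", "n_profiles": 140},
    {"channel": "release", "distribution_id": "partner-b", "locale": "de", "n_profiles": 25},
    {"channel": "beta", "distribution_id": "partner-a", "locale": "en-US", "n_profiles": 10},
]

# Compare both versions at the granularity of the old dataset.
old_dims = ["channel"]
is_equivalent = collapse(new_rows, old_dims) == collapse(old_rows, old_dims)
print(is_equivalent)
```

A real test would load a fixed sample from both versions of the job rather than hard-coding rows, and `old_dims` would list every column the old dataset had.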
I have updated the private bucket location [1] and backfilled it with data since 01-01-2017. You can access the data through redash [2] for exploration.

[1] s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/churn_testing/
[2] https://sql.telemetry.mozilla.org/queries/3382/source
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #5)
> I have updated the private bucket location [1] and backfilled it with data
> since 01-01-2017. You can access the data through redash [2] for exploration.
> 
> [1]
> s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/churn_testing/
> [2] https://sql.telemetry.mozilla.org/queries/3382/source

Thanks Anthony!
(In reply to Anthony Miyaguchi [:amiyaguchi] from comment #5)
> I have updated the private bucket location [1] and backfilled it with data
> since 01-01-2017. You can access the data through redash [2] for exploration.
> 
> [1]
> s3://net-mozaws-prod-us-west-2-pipeline-analysis/amiyaguchi/churn_testing/
> [2] https://sql.telemetry.mozilla.org/queries/3382/source

Hey Anthony.
Thanks for the great work here. Regarding the redash report, Joanne has asked that this not be made available via redash, as it could expose us to a disclosure of usage data comparing one public partner against another, which is something we need to be extremely careful with.

Could you please remove the redash report?
We will make this data available via Tableau, which is accessed under trackable credentials and is where Desktop Retention lives today.

If you have questions, please let me know.
Thanks
(In reply to Heather from comment #8)
> Regarding the redash report, Joanne has
> asked that this not be made available via redash as this data could expose
> us to a disclosure of usage data comparing one public partner over another,
> which is something we need to be extremely careful with.  

There are other consumers of this dataset. 

Are the values of concern contained within distribution_id and/or default_search_engine? If so, everything in this processed dataset is already accessible to users with Mozilla credentials via `main_summary`. Or is this more an issue of intent than of accessibility of data within our ecosystem?

I am curious about the nature of this issue, since it might require either a fork of the dataset or extra processing to make other fields available to other users.
Flags: needinfo?(hcrince)
Depends on: 1345555
The data is now available from 20160306 in s3://telemetry-parquet/churn/v2.
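
As a rough sketch of how the available partitions could be enumerated: the snippet below assumes Hive-style partition directories named `week_start=YYYYMMDD` and a step of 7 days between partitions. Neither is confirmed in this bug, and the helper name `week_start_partitions` is hypothetical.

```python
from datetime import date, timedelta

BASE = "s3://telemetry-parquet/churn/v2"


def week_start_partitions(first, last):
    """Yield assumed partition paths for each weekly period from first to last."""
    current = first
    while current <= last:
        # Assumed directory layout: <BASE>/week_start=YYYYMMDD
        yield f"{BASE}/week_start={current:%Y%m%d}"
        current += timedelta(days=7)


# The first four weeks, starting from the earliest available date.
paths = list(week_start_partitions(date(2016, 3, 6), date(2016, 3, 27)))
for path in paths:
    print(path)
```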
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Blocks: 1381806
Flags: needinfo?(hcrince)
Summary: Cohort Retention data with additional filters → Add additional fields for search retention to churn
Product: Cloud Services → Cloud Services Graveyard