Closed Bug 1354632 Opened 8 years ago Closed 8 years ago

Grouping by channel and distribution_id on churn does not return the same results

Categories

(Data Platform and Tools :: General, defect, P1)

x86_64
Linux
defect
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: amiyaguchi, Assigned: amiyaguchi)

Details

Attachments

(1 file)

The churn dataset reports different numbers for distribution_id and channel. Channel is the concatenation of channel and distribution_id if the distribution_id is a funnelcake. I would expect the number of profiles grouped by distribution_id to be strictly equal to or greater than the equivalent channel. This is not the case. It seems that roughly 20% of the distribution_id's go into the void, but somehow make it into the channel [1]. This should not be the case. In addition, null values should not appear in the distribution_id field. [1] https://sql.telemetry.mozilla.org/queries/4013/source#table
Assignee: nobody → amiyaguchi
With some help from :mreid, there are some leads into a potential solution to this problem. The following query will show all of the rows where the distribution_id field is null but is detected to be a funnelcake. This is the source of inconsistency. The first row is 20170212, which means all data with start_dates later than that one are fine. > SELECT week_start, > count(*) > FROM churn > WHERE distribution_id IS NULL > AND is_funnelcake = 'yes' > GROUP BY 1 > ORDER BY 1 DESC This can be verified by replacing the above query with `churn_dev`, which contains data since 20170129. This query returns no results, and is therefore consistent with the code. By cross referencing the last modified date in s3 and the commit history of the notebook, we notice a few odd things that were not clearly documented. There was a bug fix related to the version number filed on March 8th, 2017 [1], and what seems to be a failed backfill on my part. Let's step through the s3 bucket to delve into more detail. > aws s3 ls s3://telemetry-parquet/churn/v2` > ... > 2017-03-08 13:05:43 0 week_start=20160403_$folder$ > 2017-03-08 13:53:22 0 week_start=20160410_$folder$ > 2016-12-20 19:56:45 0 week_start=20160904_$folder$ > 2016-12-20 15:10:29 0 week_start=20160911_$folder$ > ... This first section shows that the churn dataset is missing 5 months (or ~20 data points)! This seems like the backfill job on 20170308 failed early and did not overwrite the previous data as expected. The September data looks consistent with a backfill job for Stub Attribution, and covers the period up to 20161220. > ... > 2017-03-04 16:48:27 0 week_start=20170212_$folder$ > 2017-03-11 17:05:14 0 week_start=20170219_$folder$ > ... This tells a plausible story of how 20170212 could have been the last date with this inconsistency. The next batch of data was generated with an updated notebook fixing this particular issue. This also happens to be the version that `churn_dev` is built on. The next step is to backfill a particular date (20170212) and check for the distribution_id and is_funnelcake discrepancy. This is likely to solve this particular bug. This does not solve the null value problem, which may be caused by empty strings in the source dataset. [1] https://github.com/mozilla/mozilla-reports/commit/27fc2e45b15a0712f61cf150c748e0b488ed4680
Points: --- → 2
Priority: -- → P1
This is fixed with the most recent backfill completed today, 2017-04-19. The result of the the original query now results in a sane result.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: