Closed Bug 1422892 Opened 7 years ago Closed 7 years ago

Decide on `submission_date` vs `activity_date`

Categories

(Data Platform and Tools :: General, enhancement, P1)

x86_64
Linux
enhancement
Points: 2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: harter)

We experimented with using `activity_date` instead of `submission_date` when developing the `clients_daily` etl job. We should summarize our findings and decide on which of these measures we'd like to standardize against in the future.
Summary of the problem
----------------------

`activity_date` is generally preferable to `submission_date` because it's closer to what we actually want to measure. There's a delay between user activity and us receiving the data. :chutten has some great analysis [1] on the empirical difference between submission and activity dates, if you want to read more. 95% of pings are received within two days of the actual activity [2], but that means using **`submission_date` "smears" data between today and yesterday** (mostly).

However, **`submission_date` is much easier to work with computationally**. When we partition by `submission_date`, most jobs only need to process one day of data at a time. This makes it much easier to continuously update datasets and backfill missing data.

`clients_daily` is currently limited to 6 months of historical data because the **entire dataset needs to be regenerated every day**. This is inconvenient and causes real limitations when using the dataset [3]. The job takes between 90 and 120 minutes to run and currently finishes near 9:00 UTC. Adding more data to this job will push that completion time back, meaning the data will be unavailable for the first few working hours every day.

Solutions
---------

I see three possible options:

1. Standardize to `submission_date`
2. Standardize to `activity_date` and try to mitigate the performance losses
3. Allow both, but provide guidance for when to use each configuration

So far, the data engineering team has strongly recommended using `submission_date`. The difference between `submission_date` and `activity_date` has become even smaller with our team's work on ping sender [4]. Without a strong counterargument, I recommend continuing with `submission_date`.

If we do have a strong reason to continue keying datasets by `activity_date`, I recommend only using `activity_date` on "small" datasets. These are datasets built over a sample of our data, built over a rarer type of ping (e.g. not main pings), or heavily aggregated (e.g. to country-day).

Someone should add documentation to [docs.tmo](https://docs.telemetry.mozilla.com) on when `activity_date` is [un]necessary.

------------------------------------------------------------------------

1. https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
2. https://chuttenblog.wordpress.com/2017/09/12/two-days-or-how-long-until-the-data-is-in/
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1414044
4. https://chuttenblog.wordpress.com/2017/07/12/latency-improvements-or-yet-another-satisfying-graph/
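To make the partitioning trade-off in the summary concrete, here is a minimal sketch in plain Python. It is not the actual `clients_daily` job: the ping tuples, the `PINGS` sample, and all function names are hypothetical, and it only assumes that raw data is stored in one partition per `submission_date`. It shows why a `submission_date` job can read a single partition, while an `activity_date` job has to scan every partition in which a late ping might have landed.

```python
from collections import defaultdict
from datetime import date

# Hypothetical pings: (client_id, activity_date, submission_date).
PINGS = [
    ("a", date(2017, 12, 1), date(2017, 12, 1)),
    ("b", date(2017, 12, 1), date(2017, 12, 2)),  # arrived one day late
    ("c", date(2017, 12, 2), date(2017, 12, 2)),
    ("d", date(2017, 12, 1), date(2017, 12, 3)),  # arrived two days late
]


def partition_by_submission_date(pings):
    """Group pings into one partition per submission_date (assumed storage layout)."""
    partitions = defaultdict(list)
    for ping in pings:
        partitions[ping[2]].append(ping)
    return partitions


def daily_clients_by_submission(partitions, day):
    """Incremental job: reads only the single partition for `day`."""
    return {client for client, _, _ in partitions.get(day, [])}


def daily_clients_by_activity(partitions, day):
    """Scans every partition, because a late ping for `day` can sit in any of them."""
    return {
        client
        for pings in partitions.values()
        for client, activity, _ in pings
        if activity == day
    }


partitions = partition_by_submission_date(PINGS)
by_submission = daily_clients_by_submission(partitions, date(2017, 12, 1))
by_activity = daily_clients_by_activity(partitions, date(2017, 12, 1))
```

In this toy data, clients `b` and `d` were active on 2017-12-01 but are attributed to later days under `submission_date`, which is the "smearing" described above. Counting them on the correct day requires revisiting later partitions, which is what forces the full daily regeneration.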
See Also: → 1422099
Adding Dave, Saptarshi, and Brendan to comment.
Flags: needinfo?(dzeber)
Flags: needinfo?(sguha)
Flags: needinfo?(bcolloran)
I agree with Ryan on (2). Some analyses might require activity date but, as Ryan said, usually on much smaller datasets.
Flags: needinfo?(sguha)
One point brought up during our in-person conversations this week: activity date is subject to client clock skew. This may actually be a bigger problem than the distinctions described above, and it makes submission date even more attractive. I talked to Dave over the work week and I believe I have his OK on this. With Brendan's OK we can seal this up.
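As a rough illustration of the clock skew concern, the sketch below assumes that `activity_date` is derived from the client's clock and `submission_date` from the server's clock. The function `recorded_dates` and its parameters are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta, timezone


def recorded_dates(true_activity_time, clock_skew, upload_delay):
    """Return (activity_date, submission_date) as they would be recorded.

    activity_date comes from the client's clock, so it inherits the skew.
    submission_date comes from the server's clock, so it does not.
    """
    client_reported = true_activity_time + clock_skew
    server_received = true_activity_time + upload_delay
    return client_reported.date(), server_received.date()


true_time = datetime(2017, 12, 4, 12, 0, tzinfo=timezone.utc)

# A client with an accurate clock that uploads an hour after the activity.
accurate = recorded_dates(true_time, timedelta(0), timedelta(hours=1))

# The same activity from a client whose clock runs 30 days behind.
skewed = recorded_dates(true_time, timedelta(days=-30), timedelta(hours=1))
```

Here the skewed client's `activity_date` lands a month early while its `submission_date` is unaffected, so a dataset keyed by `activity_date` would file that activity under the wrong day.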
Blocks: 1426172
I'm OK with whatever Dave and Saptarshi recommend.
Flags: needinfo?(bcolloran)
Awesome, `submission_date` is preferred to `activity_date` :gavel bang:
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(dzeber)
Resolution: --- → FIXED
Blocks: 1424411
Component: Datasets: General → General