Closed
Bug 1422892
Opened 7 years ago
Closed 7 years ago
Decide on `submission_date` vs `activity_date`
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: harter, Assigned: harter)
References
Details
We experimented with using `activity_date` instead of `submission_date` when developing the `clients_daily` ETL job. We should summarize our findings and decide which of these dates we'd like to standardize on in the future.
Assignee
Comment 1•7 years ago
Summary of the problem
----------------------
`activity_date` is generally preferable to `submission_date` because it's closer to what we actually want to measure: there's a delay between user activity and when we receive the data. :chutten has some great analysis [1] on the empirical difference between submission and activity dates, if you want to read more. 95% of pings are received within two days of the actual activity [2], so the delay is short, but it still means that using **`submission_date` "smears" data between today and yesterday** (mostly): a given day's submissions mix that day's activity with the previous day's.
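To make the "smearing" concrete, here is a minimal sketch in Python. The pings are made up for illustration (they are not real telemetry), and `count_by` is a hypothetical helper. It counts the same pings once by activity date and once by submission date.

```python
from collections import Counter
from datetime import date

# Hypothetical pings as (activity_date, submission_date) pairs.
# Most arrive the same day; some arrive a day after the activity.
pings = [
    (date(2017, 12, 1), date(2017, 12, 1)),
    (date(2017, 12, 1), date(2017, 12, 1)),
    (date(2017, 12, 1), date(2017, 12, 2)),  # sent a day late
    (date(2017, 12, 2), date(2017, 12, 2)),
    (date(2017, 12, 2), date(2017, 12, 3)),  # sent a day late
]


def count_by(pings, index):
    """Count pings per day, keyed by the date at position `index`."""
    return Counter(ping[index] for ping in pings)


by_activity = count_by(pings, 0)    # what we want to measure
by_submission = count_by(pings, 1)  # what we see when keying by receipt

for day in sorted(set(by_activity) | set(by_submission)):
    print(day, by_activity[day], by_submission[day])
```

In this toy data, the three pings of activity on December 1 show up as two on December 1 and one on December 2 when keyed by submission date.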
However, **`submission_date` is much easier to work with computationally**. When we partition by `submission_date`, most jobs only need to process one day of data at a time. This makes it much easier to continuously update datasets and backfill missing data.
`clients_daily` is currently limited to 6 months of historical data because the **entire dataset needs to be regenerated every day**. This is inconvenient and causes real limitations when using the dataset [3]. The job takes between 90 and 120 minutes to run and currently finishes near 9:00 UTC. Adding more data to this job will push that completion time back, meaning the data will be unavailable for the first few working hours every day.
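The computational difference can be sketched roughly as follows. This is a simplified model, not the actual `clients_daily` job: `partitions_to_read` is a hypothetical helper, and it assumes the raw data is stored in one partition per `submission_date`.

```python
from datetime import date, timedelta


def partitions_to_read(run_date, available_dates, key):
    """Return the submission-date partitions a daily run has to read.

    Assumes raw pings are partitioned by submission date (an assumption
    of this sketch, not something taken from the job's code).
    """
    if key == "submission_date":
        # Today's output depends only on today's partition.
        return [d for d in available_dates if d == run_date]
    if key == "activity_date":
        # A ping received today can belong to any earlier activity date,
        # so any output row may change: reread the whole window.
        return [d for d in available_dates if d <= run_date]
    raise ValueError(f"unknown key: {key}")


run_date = date(2017, 12, 1)
# Roughly six months of history, matching the limit described above.
window = [run_date - timedelta(days=i) for i in range(180)]

print(len(partitions_to_read(run_date, window, "submission_date")))
print(len(partitions_to_read(run_date, window, "activity_date")))
```

Under these assumptions, keying by `submission_date` reads one partition per run, while keying by `activity_date` rereads every partition in the window, which is why extending the history makes the job slower.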
Solutions
---------
I see three possible options:
1. Standardize to `submission_date`
2. Standardize to `activity_date` and try to mitigate the performance losses
3. Allow both, but provide guidance for when to use each configuration
So far, the data engineering team has strongly recommended using `submission_date`. The difference between `submission_date` and `activity_date` has become even smaller with our team's work on ping sender [4]. Without a strong counterargument, I recommend continuing with `submission_date`.
If we do have a strong reason to continue keying datasets by `activity_date`, I recommend only using `activity_date` on "small" datasets. These are datasets built over a sample of our data, built over a rarer type of ping (e.g. not main pings), or heavily aggregated (e.g. to country-day). Someone should write documentation, to be included in [docs.tmo](https://docs.telemetry.mozilla.com), on when `activity_date` is [un]necessary.
------------------------------------------------------------------------
1. https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
2. https://chuttenblog.wordpress.com/2017/09/12/two-days-or-how-long-until-the-data-is-in/
3. https://bugzilla.mozilla.org/show_bug.cgi?id=1414044
4. https://chuttenblog.wordpress.com/2017/07/12/latency-improvements-or-yet-another-satisfying-graph/
Assignee
Comment 2•7 years ago
Adding Dave, Saptarshi, and Brendan to comment.
Flags: needinfo?(dzeber)
Assignee
Updated•7 years ago
Flags: needinfo?(sguha)
Flags: needinfo?(bcolloran)
Comment 3•7 years ago
I agree with Ryan on (2). Some analyses might require activity date but, as Ryan said, usually on much smaller datasets.
Flags: needinfo?(sguha)
Assignee
Comment 4•7 years ago
One point brought up during our in-person conversations this week: activity date is subject to client clock skew. This may actually be a bigger problem than the distinctions described above. This makes submission date even more attractive.
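As a rough illustration of the clock skew concern, the sketch below flags activity dates that are implausible relative to the submission date. It assumes the submission date is stamped on the server and so is not affected by the client's clock. The function name and both thresholds are hypothetical, chosen only for illustration.

```python
from datetime import date


def is_plausible_activity_date(activity_date, submission_date,
                               min_delay_days=-1, max_delay_days=30):
    """Check a client-reported activity date against the submission date.

    min_delay_days=-1 tolerates a client whose local date is one day
    ahead of the server's; max_delay_days is an arbitrary cutoff for
    this sketch, not a recommended value.
    """
    delay = (submission_date - activity_date).days
    return min_delay_days <= delay <= max_delay_days


received = date(2017, 12, 2)
print(is_plausible_activity_date(date(2017, 12, 1), received))  # normal delay
print(is_plausible_activity_date(date(2030, 1, 1), received))   # clock far ahead
print(is_plausible_activity_date(date(1970, 1, 1), received))   # clock reset
```

A check like this can only catch large skews. A clock that is off by a day or two is indistinguishable from an ordinary delay, which is part of what makes `activity_date` harder to trust.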
I talked to Dave over the work week and I believe I have his OK on this. With Brendan's OK we can seal this up.
Comment 5•7 years ago
I'm ok with whatever Dave and Saptarshi recommend.
Flags: needinfo?(bcolloran)
Assignee
Comment 6•7 years ago
Awesome, `submission_date` is preferred to `activity_date`
:gavel bang:
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(dzeber)
Resolution: --- → FIXED
Updated•2 years ago
Component: Datasets: General → General