Closed Bug 1422892 Opened 3 years ago Closed 3 years ago

Decide on `submission_date` vs `activity_date`


(Data Platform and Tools :: Datasets: General, enhancement, P1)



(Not tracked)



(Reporter: harter, Assigned: harter)



We experimented with using `activity_date` instead of `submission_date` when developing the `clients_daily` ETL job. We should summarize our findings and decide which of these measures we'd like to standardize on in the future.
Summary of the problem

`activity_date` is generally preferable to `submission_date` because it's closer to what we actually want to measure. There's a delay between user activity and us receiving the data. :chutten has some great analysis [1] on the empirical difference between submission and activity dates, if you want to read more. 95% of pings are received within two days of the actual activity [2], but that means using **`submission_date` "smears" data between today and yesterday** (mostly).
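To make the "smearing" concrete, here is a minimal sketch with made-up pings (the client IDs, dates, and the `clients_per_day` helper are all hypothetical, not taken from any real dataset or job). One client's ping arrives a day late, so the daily client counts differ depending on which date we key by.

```python
from collections import Counter
from datetime import date

# Hypothetical pings: (client_id, activity_date, submission_date).
pings = [
    ("a", date(2017, 12, 1), date(2017, 12, 1)),
    ("b", date(2017, 12, 1), date(2017, 12, 1)),
    ("c", date(2017, 12, 1), date(2017, 12, 2)),  # arrived a day late
    ("d", date(2017, 12, 2), date(2017, 12, 2)),
]

def clients_per_day(pings, key):
    """Count distinct clients per day, keyed by 'activity' or 'submission' date."""
    index = 1 if key == "activity" else 2
    # Deduplicate (day, client) pairs so a client counts once per day.
    seen = {(p[index], p[0]) for p in pings}
    return Counter(day for day, _client in seen)

by_activity = clients_per_day(pings, "activity")
by_submission = clients_per_day(pings, "submission")
```

Keyed by activity, Dec 1 has three clients and Dec 2 has one. Keyed by submission, client "c" moves to Dec 2, so both days show two.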

However, **`submission_date` is much easier to work with computationally**. When we partition by `submission_date`, most jobs only need to process one day of data at a time. This makes it much easier to continuously update datasets and backfill missing data.
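A rough sketch of why this matters for incremental jobs, using invented data (the `partitions_touched` helper and the ping dictionaries are hypothetical and only illustrate the idea). Every ping received on a given day lands in a single `submission_date` partition, while the same batch can touch many `activity_date` partitions, each of which would then need to be recomputed.

```python
from datetime import date

def partitions_touched(batch, key):
    """Return the set of day partitions a batch of newly received pings writes to."""
    return {ping[key] for ping in batch}

# Hypothetical pings, all received on 2017-12-03.
batch = [
    {"activity_date": date(2017, 12, 3), "submission_date": date(2017, 12, 3)},
    {"activity_date": date(2017, 12, 2), "submission_date": date(2017, 12, 3)},
    {"activity_date": date(2017, 11, 28), "submission_date": date(2017, 12, 3)},
]

# Keyed by submission date, the daily job only writes one partition.
touched_by_submission = partitions_touched(batch, "submission_date")
# Keyed by activity date, late pings reach back into older partitions.
touched_by_activity = partitions_touched(batch, "activity_date")
```

Here the submission-keyed job writes one partition, while the activity-keyed job has to revisit three.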

`clients_daily` is currently limited to 6 months of historical data because the **entire dataset needs to be regenerated every day**. This is inconvenient and causes real limitations when using the dataset [3]. The job takes between 90 and 120 minutes to run and currently finishes near 9:00 UTC. Adding more data to this job will push that completion time back, meaning the data will be unavailable for the first few working hours every day.


I see three possible options:

1.  Standardize to `submission_date`
2.  Standardize to `activity_date` and try to mitigate the performance losses
3.  Allow both, but provide guidance for when to use each configuration

So far, the data engineering team has strongly recommended using `submission_date`. The difference between `submission_date` and `activity_date` has become even smaller with our team's work on ping sender [4]. Without a strong counterargument, I recommend continuing with `submission_date`.

If we do have a strong reason to continue keying datasets by `activity_date`, I recommend only using `activity_date` on "small" datasets. These are datasets built over a sample of our data, built over a rarer type of ping (e.g. not main pings), or heavily aggregated (e.g. to country-day). Someone should provide documentation on when `activity_date` is [un]necessary to be included in [docs.tmo](


See Also: → 1422099
Adding Dave, Saptarshi, and Brendan to comment.
Flags: needinfo?(dzeber)
Flags: needinfo?(sguha)
Flags: needinfo?(bcolloran)
I agree with Ryan on (2). There are some analyses which might require activity date but, as Ryan said, usually on much smaller datasets.
Flags: needinfo?(sguha)
One point brought up during our in-person conversations this week: activity date is subject to client clock skew. This may actually be a bigger problem than the distinctions described above. This makes submission date even more attractive.
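As an illustration of the clock skew concern, here is a hedged sketch of a plausibility check. The `looks_skewed` helper and the 30-day cutoff are assumptions made up for this example, not an existing check in any pipeline. The idea is that a client-reported date after the server receipt date, or implausibly far before it, suggests the client clock is wrong.

```python
from datetime import date, timedelta

def looks_skewed(activity_date, submission_date, max_delay=timedelta(days=30)):
    """Flag pings whose client-reported date is implausible given server receipt.

    max_delay is an arbitrary, assumed cutoff for this sketch.
    """
    delay = submission_date - activity_date
    # Negative delay: the client claims activity after the server received it.
    # Very large delay: the client clock is probably set far in the past.
    return delay < timedelta(0) or delay > max_delay

future_dated = looks_skewed(date(2017, 12, 5), date(2017, 12, 3))
normal = looks_skewed(date(2017, 12, 2), date(2017, 12, 3))
ancient = looks_skewed(date(2010, 1, 1), date(2017, 12, 3))
```

`submission_date` is stamped by the server, so it does not depend on the client clock at all.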

I talked to Dave during the work week and I believe I have his OK on this. With Brendan's OK we can seal this up.
Blocks: 1426172
I'm ok with whatever Dave and Saptarshi recommend.
Flags: needinfo?(bcolloran)
Awesome, `submission_date` is preferred to `activity_date`.

:gavel bang:
Closed: 3 years ago
Flags: needinfo?(dzeber)
Resolution: --- → FIXED
Blocks: 1424411