Decide on `submission_date` vs `activity_date`

RESOLVED FIXED

Status

enhancement
P1
normal
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: harter, Assigned: harter)

Assignee

Description

2 years ago
We experimented with using `activity_date` instead of `submission_date` when developing the `clients_daily` ETL job. We should summarize our findings and decide which of these two dates we'd like to standardize on in the future.
Assignee

Comment 1

2 years ago
Summary of the problem
----------------------

`activity_date` is generally preferable to `submission_date` because it's closer to what we actually want to measure. There's a delay between user activity and us receiving the data. :chutten has some great analysis [1] on the empirical difference between submission and activity dates, if you want to read more. 95% of pings are received within two days of the actual activity [2], but that means using **`submission_date` "smears" data between today and yesterday** (mostly).
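
To make the "smearing" concrete, here is a minimal sketch. The pings and their dates are made up for illustration and do not come from real telemetry. It counts the same pings once by the day the activity happened and once by the day the ping was received:

```python
from collections import Counter
from datetime import date

# Hypothetical pings as (activity_date, submission_date) pairs.
# Most arrive on the day of activity; some arrive a day late.
pings = [
    (date(2017, 12, 1), date(2017, 12, 1)),
    (date(2017, 12, 1), date(2017, 12, 1)),
    (date(2017, 12, 1), date(2017, 12, 2)),  # received one day late
    (date(2017, 12, 2), date(2017, 12, 2)),
    (date(2017, 12, 2), date(2017, 12, 3)),  # received one day late
]

# Daily counts keyed by when the activity happened.
by_activity = Counter(activity for activity, _ in pings)

# Daily counts keyed by when we received the ping.
by_submission = Counter(submission for _, submission in pings)

for day in sorted(set(by_activity) | set(by_submission)):
    print(day, "activity:", by_activity[day], "submission:", by_submission[day])
```

In this toy data, the three pings for December 1 activity are split across the December 1 and December 2 submission dates, so a per-day count keyed by `submission_date` mixes some of yesterday's activity into today.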

However, **`submission_date` is much easier to work with computationally**. When we partition by `submission_date`, most jobs only need to process one day of data at a time. This makes it much easier to continuously update datasets and backfill missing data.
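
A rough sketch of why this matters for incremental jobs. The ping records and the `partitions_to_rewrite` helper are hypothetical, not part of any existing job. Given the pings received on a single day, it lists which output partitions a daily run would have to touch under each keying:

```python
from datetime import date

def partitions_to_rewrite(new_pings, key):
    """Return the set of output partitions touched by newly received pings.

    `key` is the field the output dataset is partitioned by.
    """
    return {ping[key] for ping in new_pings}

# Hypothetical pings, all received on 2017-12-04.
new_pings = [
    {"activity_date": date(2017, 12, 4), "submission_date": date(2017, 12, 4)},
    {"activity_date": date(2017, 12, 3), "submission_date": date(2017, 12, 4)},
    {"activity_date": date(2017, 11, 20), "submission_date": date(2017, 12, 4)},
]

# Keyed by submission_date: exactly one partition, the day just received.
print(partitions_to_rewrite(new_pings, "submission_date"))

# Keyed by activity_date: every day that any late ping refers back to.
print(partitions_to_rewrite(new_pings, "activity_date"))
```

With `submission_date`, each day's input maps to exactly one output partition, so a backfill only needs that day's data. With `activity_date`, a single day of input can reach arbitrarily far back, which is what pushes a job toward regenerating everything.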

`clients_daily` is currently limited to 6 months of historical data because the **entire dataset needs to be regenerated every day**. This is inconvenient and causes real limitations when using the dataset [3]. The job takes between 90 and 120 minutes to run and currently finishes near 9:00 UTC. Adding more data to this job will push that completion time back, meaning the data will be unavailable for the first few working hours every day.

Solutions
---------

I see three possible options:

1.  Standardize to `submission_date`
2.  Standardize to `activity_date` and try to mitigate the performance losses
3.  Allow both, but provide guidance for when to use each configuration

So far, the data engineering team has strongly recommended using `submission_date`. The difference between `submission_date` and `activity_date` has become even smaller with our team's work on ping sender [4]. Without a strong counterargument, I recommend continuing with `submission_date`.

If we do have a strong reason to continue keying datasets by `activity_date`, I recommend only using `activity_date` on "small" datasets. These are datasets built over a sample of our data, built over a rarer type of ping (e.g. not main pings), or heavily aggregated (e.g. to country-day). Someone should write documentation on when `activity_date` is and isn't necessary, and add it to [docs.tmo](https://docs.telemetry.mozilla.com).

------------------------------------------------------------------------

1.  https://chuttenblog.wordpress.com/2017/02/09/data-science-is-hard-client-delays-for-crash-pings/
2.  https://chuttenblog.wordpress.com/2017/09/12/two-days-or-how-long-until-the-data-is-in/
3.  https://bugzilla.mozilla.org/show_bug.cgi?id=1414044
4.  https://chuttenblog.wordpress.com/2017/07/12/latency-improvements-or-yet-another-satisfying-graph/
Assignee

Updated

2 years ago
See Also: → 1422099
Assignee

Comment 2

2 years ago
Adding Dave, Saptarshi, and Brendan to comment.
Flags: needinfo?(dzeber)
Assignee

Updated

2 years ago
Flags: needinfo?(sguha)
Flags: needinfo?(bcolloran)
Comment 3

I agree with Ryan on (2). Some analyses might require activity date, but as Ryan said, usually on much smaller datasets.
Flags: needinfo?(sguha)
Assignee

Comment 4

2 years ago
One point brought up during our in-person conversations this week: activity date is subject to client clock skew. This may actually be a bigger problem than the distinctions described above. This makes submission date even more attractive.
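
As a rough illustration of the skew problem: `activity_date` comes from the client's clock, while `submission_date` is stamped when we receive the ping. The `looks_skewed` helper and its 60-day threshold below are assumptions made up for this sketch, not an existing check:

```python
from datetime import date

def looks_skewed(activity_date, submission_date, max_delay_days=60):
    """Flag pings whose client-reported date is implausible.

    A ping cannot be received before the activity happened, and
    (by assumption) a delay beyond `max_delay_days` is more likely
    a bad client clock than a genuinely late ping.
    """
    delay = (submission_date - activity_date).days
    return delay < 0 or delay > max_delay_days

received = date(2017, 12, 4)
print(looks_skewed(date(2017, 12, 3), received))  # plausible one-day delay
print(looks_skewed(date(2018, 1, 1), received))   # activity "in the future"
print(looks_skewed(date(2001, 1, 1), received))   # client clock was reset
```

Any dataset keyed by `activity_date` would need some rule like this to decide what to do with such pings; keying by `submission_date` avoids the question.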

I talked to Dave during the work week and I believe I have his OK on this. With Brendan's OK we can seal this up.
Assignee

Updated

2 years ago
Blocks: 1426172

Comment 5

2 years ago
I'm ok with whatever Dave and Saptarshi recommend.
Flags: needinfo?(bcolloran)
Assignee

Comment 6

2 years ago
Awesome, `submission_date` is preferred to `activity_date`

:gavel bang:
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(dzeber)
Resolution: --- → FIXED
Blocks: 1424411