Closed Bug 1422099 Opened 7 years ago Closed 7 years ago

Provide client-day level dataset for search analysis

Categories

(Data Platform and Tools :: General, enhancement, P2)

Hardware: x86_64
OS: Linux
Type: enhancement
Points: 3

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1426437

People

(Reporter: harter, Unassigned)

This is a summary of a request I received from mconnor and arana yesterday.

BD is interested in a dataset similar to `clients_daily` for search analyses.  The new dataset, `search_clients_daily`, will be keyed by (`client_id`, `submission_date`, `search_engine`). We're only interested in chrome-sap search counts for this dataset. The goal is to have this dataset complete in the first month of 2018.
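The keying described above can be sketched in plain Python. This is only an illustration under assumptions: the row layout, the `CHROME_SAP_SOURCES` set, and the function name are hypothetical and are not taken from the real telemetry schema or job.

```python
from collections import defaultdict

# Hypothetical ping-level rows:
# (client_id, submission_date, search_engine, source, count).
EXAMPLE_ROWS = [
    ("a", "20180101", "google", "urlbar", 2),
    ("a", "20180101", "google", "searchbar", 1),
    ("a", "20180101", "google", "in-content", 5),  # assumed non-chrome-sap, dropped
    ("a", "20180102", "bing", "urlbar", 1),
    ("b", "20180101", "google", "urlbar", 4),
]

# Assumed subset of chrome-sap sources; the real list may differ.
CHROME_SAP_SOURCES = {"urlbar", "searchbar"}


def search_clients_daily(rows):
    """Sum chrome-sap search counts per (client_id, submission_date, search_engine)."""
    totals = defaultdict(int)
    for client_id, submission_date, engine, source, count in rows:
        if source in CHROME_SAP_SOURCES:
            totals[(client_id, submission_date, engine)] += count
    return dict(totals)


EXAMPLE_RESULT = search_clients_daily(EXAMPLE_ROWS)
```

Each key in `EXAMPLE_RESULT` corresponds to one row of the proposed dataset, with the summed search count as its value.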

I suggest we refactor the `clients_daily` job so that it is partitioned by `submission_date` instead of `activity_date`. This will have the beneficial side-effect of making the `clients_daily` job easier to calculate and backfill (we'll only need to compute one `submission_date` at a time). We can then easily modify the `clients_daily` job to generate both `clients_daily` and `search_clients_daily`. I considered deriving `clients_daily` directly from `search_clients_daily`, but some of the aggregations wouldn't be trivial.
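A rough sketch of why partitioning by `submission_date` simplifies backfill: each partition depends only on the pings submitted that day, so one pass can emit both outputs. The ping structure and the names `build_partition` and `backfill` are assumptions made for illustration, not the actual job.

```python
from collections import defaultdict

# Hypothetical pings. The second ping describes activity on 20171231 but was
# submitted on 20180101, so it lands only in the 20180101 partition.
EXAMPLE_PINGS = [
    {"client_id": "a", "submission_date": "20180101",
     "activity_date": "20180101", "search_counts": {"google": 2}},
    {"client_id": "a", "submission_date": "20180101",
     "activity_date": "20171231", "search_counts": {"google": 1, "bing": 2}},
    {"client_id": "b", "submission_date": "20180102",
     "activity_date": "20180102", "search_counts": {"google": 4}},
]


def build_partition(pings, submission_date):
    """Build both outputs for a single submission_date from one pass over its pings."""
    clients = defaultdict(lambda: {"pings": 0, "searches": 0})
    searches = defaultdict(int)
    for ping in pings:
        if ping["submission_date"] != submission_date:
            continue
        row = clients[ping["client_id"]]
        row["pings"] += 1
        for engine, count in ping["search_counts"].items():
            row["searches"] += count
            searches[(ping["client_id"], engine)] += count
    return dict(clients), dict(searches)


def backfill(pings, submission_dates):
    """Recompute each partition independently; no partition reads another day's data."""
    return {day: build_partition(pings, day) for day in submission_dates}
```

With `activity_date` partitions, a late ping such as the second one above would force an already-written partition to be recomputed.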
Mark, do we have an owner for `clients_daily` yet? Does it make sense to switch `clients_daily` from `activity_date` to `submission_date`? If so, let's discuss who can take this work (possibly me).
Flags: needinfo?(mreid)
Owner for clients_daily is still me, unfortunately. I expect to remedy that situation early in Q1.

As far as changing from activity_date to submission_date, that is a long-standing question. I believe that we are in better shape than we've ever been to make that decision, but I lack strong evidence that this is so.

We have recently put together a client_count_daily dataset with both activity date and submission date, so at least in terms of client counts we can empirically compare things. I've written a query [1] to do a simple comparison; it's running now and should complete in the next hour or two.
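As a rough illustration of that kind of comparison (not the linked query itself), the sketch below counts distinct clients per day under each date definition. The rows are toy data and the field and function names are assumptions.

```python
from collections import defaultdict

# Toy rows only; client "b" has activity on 20171201 that was submitted a day later.
EXAMPLE_ROWS = [
    {"client_id": "a", "activity_date": "20171201", "submission_date": "20171201"},
    {"client_id": "b", "activity_date": "20171201", "submission_date": "20171202"},
    {"client_id": "b", "activity_date": "20171202", "submission_date": "20171202"},
    {"client_id": "c", "activity_date": "20171202", "submission_date": "20171202"},
]


def distinct_clients_by(rows, date_field):
    """Count distinct client_ids per day, using the given date field."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[date_field]].add(row["client_id"])
    return {day: len(ids) for day, ids in seen.items()}


def compare_client_counts(rows):
    """Return (day, count_by_activity_date, count_by_submission_date) tuples."""
    by_activity = distinct_clients_by(rows, "activity_date")
    by_submission = distinct_clients_by(rows, "submission_date")
    days = sorted(set(by_activity) | set(by_submission))
    return [(day, by_activity.get(day, 0), by_submission.get(day, 0)) for day in days]
```

Days where the two counts diverge show how much late submission shifts clients between partitions.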

Would you propose to drop activity_date from the dataset entirely and switch to using submission_date? Or would you want to retain both dates in the dataset?

[1] https://sql.telemetry.mozilla.org/queries/49729/source#133815
Flags: needinfo?(mreid) → needinfo?(rharter)
Flags: needinfo?(rharter)
See Also: → 1422892
Filed Bug 1422892 to reach a decision on submission_date vs. activity_date.

ETA for this dataset will be Jan 2018. If that decision begins to drag out we may need to have a parallel dataset to support BD since activity_date will not serve for this purpose.
Depends on: 1426170
Depends on: 1426172
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Component: Datasets: Search → Datasets: General
Component: Datasets: General → General