Bug 1422099
Opened 7 years ago
Closed 7 years ago
Provide client-day level dataset for search analysis
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 1426437
People
(Reporter: harter, Unassigned)
This is a summary of a request I received from mconnor and arana yesterday.
BD is interested in a dataset similar to `clients_daily` for search analyses. The new dataset, `search_clients_daily`, will be keyed by (`client_id`, `submission_date`, `search_engine`). We're only interested in chrome-sap search counts for this dataset. The goal is to have this dataset complete in the first month of 2018.
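As a rough illustration of the proposed key, the sketch below rolls ping-level records up to one row per (`client_id`, `submission_date`, `search_engine`). The record layout (`search_counts` entries with `engine`, `source`, `count`) and the `SAP_SOURCES` set are assumptions made for the example; this bug does not specify which sources count as chrome-sap.

```python
from collections import defaultdict

# Hypothetical: which sources count as chrome SAP is not defined in this bug.
SAP_SOURCES = {"searchbar", "urlbar"}


def search_clients_daily(pings):
    """Sum SAP search counts per (client_id, submission_date, search_engine)."""
    totals = defaultdict(int)
    for ping in pings:
        for sc in ping.get("search_counts", []):
            # Drop counts that do not come from an assumed SAP source.
            if sc["source"] not in SAP_SOURCES:
                continue
            key = (ping["client_id"], ping["submission_date"], sc["engine"])
            totals[key] += sc["count"]
    return dict(totals)


# Made-up example records, only to show the shape of the output.
example_pings = [
    {"client_id": "a", "submission_date": "2017-12-01",
     "search_counts": [
         {"engine": "google", "source": "urlbar", "count": 2},
         {"engine": "google", "source": "in-content", "count": 5},
     ]},
    {"client_id": "a", "submission_date": "2017-12-01",
     "search_counts": [{"engine": "google", "source": "searchbar", "count": 1}]},
    {"client_id": "b", "submission_date": "2017-12-01",
     "search_counts": [{"engine": "bing", "source": "urlbar", "count": 4}]},
]

rows = search_clients_daily(example_pings)
```

Each key in `rows` corresponds to one row of the proposed dataset; a client with no SAP searches on a given day simply produces no row in this sketch.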
I suggest we refactor the `clients_daily` job so that it is partitioned by `submission_date` instead of `activity_date`. This will have the beneficial side-effect of making the `clients_daily` job easier to calculate and backfill (we'll only need to compute one `submission_date` at a time). We can then easily modify the `clients_daily` job to generate both `clients_daily` and `search_clients_daily`. I considered deriving `clients_daily` directly from `search_clients_daily`, but some of the aggregations wouldn't be trivial.
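To make the backfill argument concrete, here is a minimal sketch, assuming a simplified ping layout (`searches` as an engine-to-count mapping) and hypothetical output columns (`n_pings`, `total_searches`). It reads one `submission_date` partition and emits both outputs in a single pass. The helper `activity_dates_touched` shows why partitioning by `activity_date` is harder: one day of submissions can update several activity days.

```python
from collections import defaultdict


def process_submission_date(pings, submission_date):
    """Build both outputs from a single submission_date partition."""
    clients = defaultdict(lambda: {"n_pings": 0, "total_searches": 0})
    searches = defaultdict(int)
    for p in pings:
        if p["submission_date"] != submission_date:
            continue
        row = clients[p["client_id"]]
        row["n_pings"] += 1
        for engine, count in p["searches"].items():
            row["total_searches"] += count
            searches[(p["client_id"], engine)] += count
    return dict(clients), dict(searches)


def activity_dates_touched(pings, submission_date):
    """Activity-date partitions that one day of submissions would update."""
    return {p["activity_date"] for p in pings
            if p["submission_date"] == submission_date}


# Made-up pings: client "a" sends one ping late (activity two days earlier).
example_pings = [
    {"client_id": "a", "submission_date": "2017-12-01",
     "activity_date": "2017-12-01", "searches": {"google": 2}},
    {"client_id": "a", "submission_date": "2017-12-01",
     "activity_date": "2017-11-29", "searches": {"google": 1, "bing": 1}},
    {"client_id": "b", "submission_date": "2017-12-02",
     "activity_date": "2017-12-01", "searches": {}},
]

clients_rows, search_rows = process_submission_date(example_pings, "2017-12-01")
touched = activity_dates_touched(example_pings, "2017-12-01")
```

With `submission_date` partitioning, rerunning a day only rewrites that day's output. With `activity_date` partitioning, the same input would require rewriting every partition in `touched`.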
Reporter
Comment 1 • 7 years ago
Mark, do we have an owner for clients_daily yet? Does it make sense to transition clients_daily to submission_date? If so, let's discuss who can take this work (possibly me).
Flags: needinfo?(mreid)
Comment 2 • 7 years ago
Owner for clients_daily is still me, unfortunately. I expect to remedy that situation early in Q1.
As far as changing from activity_date to submission_date, that is a long-standing question. I believe that we are in better shape than we've ever been to make that decision, but I lack strong evidence that this is so.
We have recently put together a client_count_daily dataset with both activity date and submission date, so at least in terms of client counts we can compare the two empirically. I've written a query[1] to do a simple comparison; it's running now and should complete in the next hour or two.
Would you propose to drop activity_date from the dataset entirely and switch to using submission_date? Or would you want to retain both dates in the dataset?
[1] https://sql.telemetry.mozilla.org/queries/49729/source#133815
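The query at [1] is not reproduced in this bug. As a hedged sketch of the kind of comparison described above, the code below counts distinct clients per date under each definition and lines the two counts up side by side. The record fields and sample data are assumptions made for illustration.

```python
from collections import defaultdict


def distinct_clients_by(pings, date_field):
    """Count distinct client_ids per value of the given date field."""
    seen = defaultdict(set)
    for p in pings:
        seen[p[date_field]].add(p["client_id"])
    return {date: len(ids) for date, ids in seen.items()}


def compare_counts(pings):
    """Return (date, clients_by_activity_date, clients_by_submission_date)."""
    by_activity = distinct_clients_by(pings, "activity_date")
    by_submission = distinct_clients_by(pings, "submission_date")
    dates = sorted(set(by_activity) | set(by_submission))
    return [(d, by_activity.get(d, 0), by_submission.get(d, 0)) for d in dates]


# Made-up pings: client "b" was active on the 1st but submitted on the 2nd.
example_pings = [
    {"client_id": "a", "activity_date": "2017-12-01", "submission_date": "2017-12-01"},
    {"client_id": "b", "activity_date": "2017-12-01", "submission_date": "2017-12-02"},
    {"client_id": "c", "activity_date": "2017-12-02", "submission_date": "2017-12-02"},
]

comparison = compare_counts(example_pings)
```

Large or persistent gaps between the two columns would indicate how much reporting latency shifts counts between days.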
Flags: needinfo?(mreid) → needinfo?(rharter)
Reporter
Comment 3 • 7 years ago
Filed Bug 1422892 to establish a decision re: submission_date vs. activity_date.
The ETA for this dataset is Jan 2018. If that decision begins to drag out, we may need a parallel dataset to support BD, since activity_date will not serve this purpose.
Reporter
Updated • 7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Assignee
Updated • 4 years ago
Component: Datasets: Search → Datasets: General
Assignee
Updated • 2 years ago
Component: Datasets: General → General