Remove incorrectly aggregated fields from `search_clients_daily_v8` ahead of backfill
Categories
(Data Platform and Tools :: General, defect)
Tracking
(Not tracked)
People
(Reporter: frank, Assigned: benwu)
Details
Attachments
(1 file)
Information here: https://docs.google.com/document/d/1mwmmn6daxqdp_oX82pVP2clwtFZ7YieWLD45zBKd7bk/edit#
Basically, a few fields in search_clients_daily
are incorrectly aggregated. This includes:
subsession_hours_sum
active_addons_count_mean
max_concurrent_tab_count_max
tab_open_event_count_sum
active_hours_sum
sessions_started_on_this_day
total_uri_count
experiments
All of these fields should be removed before the backfill. We will, in the future, decide if and how to manage getting more data back-in to search_clients_daily
, but correctly.
Reporter | ||
Comment 1•6 years ago
|
||
Note that the only one that really works to correctly aggregate is max_concurrent_tab_count
, since you can take the max
of that field for that client on that day. However this is tedious and if we're removing the others should probably remove that one as well, since it's value on any one row does not necessarily represent the value for that client for that day.
Assignee | ||
Comment 3•6 years ago
|
||
Having a single row per client per day sounds the most sensible to me. We could create a view to make it more convenient to query if needed. Does it make sense to have the modified_search_clients_daily
table exist only as a view that's a join between scd and cd that way we don't need to backfill and maintain two tables? I'm going to experiment with this a bit.
What should the behaviour for experiments be? Currently it returns a deduped list of all experiments for the client for each day.
Reporter | ||
Comment 4•6 years ago
|
||
In some sense, every field, except searches, are incorrectly aggregated in search_clients_daily
. If a client reports different values for e.g. os_version
in one day, then different rows of search_clients_daily
will have different values for that field.
Having a single row per client per day sounds the most sensible to me. We could create a view to make it more convenient to query if needed. Does it make sense to have the
modified_search_clients_daily
table exist only as a view that's a join between scd and cd that way we don't need to backfill and maintain two tables? I'm going to experiment with this a bit.
That makes a lot of sense. If we're going to update the job, I would love to see the modified_search_clients_daily
have 1-row per-client per-day. Each row would contain every engine -> source -> type -> searches, such that we could unnest those fields to get the original search_clients_daily
(with 1-row per client-day-engine-source-type, and multiple columns for different search types).
We would unnest this and join it with clients_daily
to get the search_clients_daily
view.
What should the behaviour for experiments be? Currently it returns a deduped list of all experiments for the client for each day.
Correct, experiments
should be a list of all experiments for the client for each day.
Stepping back, Will requested we backfill with the existing aggregates first, removing these minefields before doing so (the ones listed in Comment #1). Once we've removed the minefields and backfilled, we can decide what the next iteration should look like. This is based on analysis that backfilling is really not overly expensive, and we want to move off python-mozetl for this job sooner rather than later. Does that sound good to you, Ben?
Comment 5•6 years ago
|
||
In some sense, every field, except searches, are incorrectly aggregated in search_clients_daily. If a client reports different values for e.g. os_version in one day, then different rows of search_clients_daily will have different values for that field.
Yes, this is known behavior--although clients_daily
would be subject to those same concerns no? This is mostly an issue for things like country
which we use a lot in search queries.
Stepping back, Will requested we backfill with the existing aggregates first, removing these minefields before doing so (the ones listed in Comment #1). Once we've removed the minefields and backfilled, we can decide what the next iteration should look like. This is based on analysis that backfilling is really not overly expensive, and we want to move off python-mozetl for this job sooner rather than later. Does that sound good to you, Ben?
SGTM, Thanks all
Assignee | ||
Comment 6•6 years ago
|
||
I'll remove those fields, backfill a month and let bmiroglio validate.
For the categorical aggregated fields, I think the conclusion is that the effect on analysis is insignificant. For fields like country
I would expect it to be very rare for a client to have multiple sessions in a day with different countries, but either way I don't have a solution for that
Comment 7•6 years ago
•
|
||
Yes, I'd say the effect is small but we are aware it isn't perfect. This also gets into cloned profiles, which are profiles on different machines that share a client_id, which is more prevalent I'd say than the same machine reporting two countries on the same day due to travel etc.
Comment 8•6 years ago
|
||
Comment 9•6 years ago
•
|
||
could we solve this by adding the necessary fields to clients_daily
and deriving search_clients_daily
from that, so we get the correct values without a join?
Reporter | ||
Comment 10•6 years ago
•
|
||
(In reply to Daniel Thorn [:relud] from comment #9)
could we solve this by adding the necessary fields to
clients_daily
and derivingsearch_clients_daily
from that, so we get the correct values without a join?
Yes, we absolutely could - I'm just not sure how the search team would want to handle having the search data in a non-locked-down table. (In some sense that is the ideal solution, since then we only need to calculate the aggregates once. I wonder if we could lock the clients_daily_vN
table down, have an authorized search_clients_daily
view with access to search fields, and an authorized clients_daily
view with no access to search fields?)
Assignee | ||
Comment 11•6 years ago
|
||
:arana, are you currently using the above fields for any search analysis? We would like everyone to move to v8 versions of the search datasets as soon as possible which will not have those fields to start, and then we can decommission the v7 datasets running on AWS.
Comment 12•6 years ago
|
||
Hi ,
These fields are used sometimes for analysis but are not a part of any regularly tracked metrics.
Are we planning to get these fields back in in correct form?
Assignee | ||
Comment 13•6 years ago
|
||
Yes we plan on putting these fields back in soon but that will require some more planning.
So would dashboards be able to use the v8 datasets without these fields? If so, we'll move forward with backfilling and then ask everyone to start moving to the new datasets.
Comment 14•6 years ago
|
||
There are no dashboards active using these fields. Please go ahead.
Assignee | ||
Comment 15•6 years ago
|
||
The fields were readded with correct aggregation here: https://github.com/mozilla/bigquery-etl/pull/503/files
Aggregations are done before expanding the search counts field so there shouldn't be any more multi counting. The only field that wasn't readded is experiments because the query wouldn't finish when it was included. More details here: bug 1597361
Backfill is done now so I'm going to close this one.
Updated•4 years ago
|
Updated•3 years ago
|
Description
•