Closed Bug 1592733 Opened 6 years ago Closed 6 years ago

Remove incorrectly aggregated fields from `search_clients_daily_v8` ahead of backfill

Categories

(Data Platform and Tools :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: frank, Assigned: benwu)

Details

Attachments

(1 file)

Information here: https://docs.google.com/document/d/1mwmmn6daxqdp_oX82pVP2clwtFZ7YieWLD45zBKd7bk/edit#

Basically, a few fields in search_clients_daily are incorrectly aggregated. This includes:

  • subsession_hours_sum
  • active_addons_count_mean
  • max_concurrent_tab_count_max
  • tab_open_event_count_sum
  • active_hours_sum
  • sessions_started_on_this_day
  • total_uri_count
  • experiments

All of these fields should be removed before the backfill. We will, in the future, decide if and how to manage getting more data back-in to search_clients_daily, but correctly.

Note that the only one that really works to correctly aggregate is max_concurrent_tab_count, since you can take the max of that field for that client on that day. However this is tedious and if we're removing the others should probably remove that one as well, since it's value on any one row does not necessarily represent the value for that client for that day.

:bewu is probably best placed to look into this.

Assignee: nobody → bewu

Having a single row per client per day sounds the most sensible to me. We could create a view to make it more convenient to query if needed. Does it make sense to have the modified_search_clients_daily table exist only as a view that's a join between scd and cd that way we don't need to backfill and maintain two tables? I'm going to experiment with this a bit.

What should the behaviour for experiments be? Currently it returns a deduped list of all experiments for the client for each day.

In some sense, every field, except searches, are incorrectly aggregated in search_clients_daily. If a client reports different values for e.g. os_version in one day, then different rows of search_clients_daily will have different values for that field.

Having a single row per client per day sounds the most sensible to me. We could create a view to make it more convenient to query if needed. Does it make sense to have the modified_search_clients_daily table exist only as a view that's a join between scd and cd that way we don't need to backfill and maintain two tables? I'm going to experiment with this a bit.

That makes a lot of sense. If we're going to update the job, I would love to see the modified_search_clients_daily have 1-row per-client per-day. Each row would contain every engine -> source -> type -> searches, such that we could unnest those fields to get the original search_clients_daily (with 1-row per client-day-engine-source-type, and multiple columns for different search types).

We would unnest this and join it with clients_daily to get the search_clients_daily view.

What should the behaviour for experiments be? Currently it returns a deduped list of all experiments for the client for each day.

Correct, experiments should be a list of all experiments for the client for each day.

Stepping back, Will requested we backfill with the existing aggregates first, removing these minefields before doing so (the ones listed in Comment #1). Once we've removed the minefields and backfilled, we can decide what the next iteration should look like. This is based on analysis that backfilling is really not overly expensive, and we want to move off python-mozetl for this job sooner rather than later. Does that sound good to you, Ben?

Flags: needinfo?(bewu)

In some sense, every field, except searches, are incorrectly aggregated in search_clients_daily. If a client reports different values for e.g. os_version in one day, then different rows of search_clients_daily will have different values for that field.

Yes, this is known behavior--although clients_daily would be subject to those same concerns no? This is mostly an issue for things like country which we use a lot in search queries.

Stepping back, Will requested we backfill with the existing aggregates first, removing these minefields before doing so (the ones listed in Comment #1). Once we've removed the minefields and backfilled, we can decide what the next iteration should look like. This is based on analysis that backfilling is really not overly expensive, and we want to move off python-mozetl for this job sooner rather than later. Does that sound good to you, Ben?

SGTM, Thanks all

I'll remove those fields, backfill a month and let bmiroglio validate.

For the categorical aggregated fields, I think the conclusion is that the effect on analysis is insignificant. For fields like country I would expect it to be very rare for a client to have multiple sessions in a day with different countries, but either way I don't have a solution for that

Flags: needinfo?(bewu)

Yes, I'd say the effect is small but we are aware it isn't perfect. This also gets into cloned profiles, which are profiles on different machines that share a client_id, which is more prevalent I'd say than the same machine reporting two countries on the same day due to travel etc.

could we solve this by adding the necessary fields to clients_daily and deriving search_clients_daily from that, so we get the correct values without a join?

Flags: needinfo?(fbertsch)

(In reply to Daniel Thorn [:relud] from comment #9)

could we solve this by adding the necessary fields to clients_daily and deriving search_clients_daily from that, so we get the correct values without a join?

Yes, we absolutely could - I'm just not sure how the search team would want to handle having the search data in a non-locked-down table. (In some sense that is the ideal solution, since then we only need to calculate the aggregates once. I wonder if we could lock the clients_daily_vN table down, have an authorized search_clients_daily view with access to search fields, and an authorized clients_daily view with no access to search fields?)

Flags: needinfo?(fbertsch)

:arana, are you currently using the above fields for any search analysis? We would like everyone to move to v8 versions of the search datasets as soon as possible which will not have those fields to start, and then we can decommission the v7 datasets running on AWS.

Flags: needinfo?(arana)

Hi ,

These fields are used sometimes for analysis but are not a part of any regularly tracked metrics.
Are we planning to get these fields back in in correct form?

Flags: needinfo?(arana)
Flags: needinfo?(bewu)

Yes we plan on putting these fields back in soon but that will require some more planning.
So would dashboards be able to use the v8 datasets without these fields? If so, we'll move forward with backfilling and then ask everyone to start moving to the new datasets.

Flags: needinfo?(bewu) → needinfo?(arana)

There are no dashboards active using these fields. Please go ahead.

Flags: needinfo?(arana)

The fields were readded with correct aggregation here: https://github.com/mozilla/bigquery-etl/pull/503/files

Aggregations are done before expanding the search counts field so there shouldn't be any more multi counting. The only field that wasn't readded is experiments because the query wouldn't finish when it was included. More details here: bug 1597361

Backfill is done now so I'm going to close this one.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Datasets: Search → Datasets: General
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: