Create static lookup table for normalized search engines
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: benwu, Assigned: benwu)
Details
The search datasets, especially for mobile, have many different possible values for the same search engine. e.g. ddg
, duckduckgo
, DuckDuckGo
It would be useful to have a normalized list so that doing analysis wouldn't require a long case statement. We can create a static table of mappings that can be joined on the search data to get a normalized engine column.
We can also do this during table population so that the columns are in search clients daily so a join isn't necessary during analysis.
We'll likely end up following the approach of bug 1591207 (whatever that ends up being)
edit: probably only for major search engines
Assignee | ||
Comment 1•5 years ago
|
||
Ben, can you provide a list of the engines we're interested in? I can then create a lookup table that covers all the cases.
Comment 2•5 years ago
•
|
||
here is a query that handles some of logic for the top engines:
https://sql.telemetry.mozilla.org/queries/65652/source#166785
But, in short:
Google
DuckDuckGo
Bing
Yandex
Assignee | ||
Comment 3•5 years ago
|
||
We decided to go with a udf since it's a lot cleaner to use: https://sql.telemetry.mozilla.org/queries/67071/source
This is just a wrapper around the case statement: https://github.com/mozilla/bigquery-etl/blob/master/udf/normalize_search_engine.sql
There are a few extra cases compared to the query you gave. Does this look good to you?
Comment 4•5 years ago
|
||
This is great, however it would be useful to have the Other
engines listed out as well, in case we need to dig deeper into a engine-specific analysis or bug. My understanding is we'd map those major engines in Comment 2 to something consistent, since they are the most used, but leave the rest as-is.
Let me know what you think.
Assignee | ||
Comment 5•5 years ago
|
||
That makes sense to me. I'll make the change
Assignee | ||
Comment 6•5 years ago
•
|
||
Actually since the engine
column still exists would it be enough to have that along with the normalized engine name? So for "others" you can filter for that. It would be cleaner to have the normalized column be a determined set of values. We'll provide two udfs strict_normalize_search_engine
and normalize_search_engine
where strict
will give Other
and non-strict will give the unrecognized engine.
Basically, the question I have is can we have Other
in a normalized_engine
column in all the search tables but provide a udf that can return the engine name instead?
Comment 7•5 years ago
|
||
How about just having a normalized_search_engine
(which implements the strict case you mention) alongside engine
. This way, we can still have full granularity for engine
when/if needed. If this is what you're already suggesting, then great :).
Assignee | ||
Comment 8•5 years ago
|
||
Yes that's what I meant
Assignee | ||
Comment 9•5 years ago
|
||
Created udf.normalize_search_engine
and udf.strict_normalize_search_engine
. Strict will coerce unrecognized engines to Other
while non-strict will leave the engine as is. This covers both use cases but I'm not sure if there is much of a need for the non-strict version and if it isn't needed it would be better to remove it to avoid confusion.
Also added normalized_engine
field to search aggregates and clients daily, not backfilled for desktop but will be backfilled for mobile.
Closing this bug now since the udfs do the job of the lookup table. We can open a new bug if changes to the udfs need to be made. I'll leave this example here: https://sql.telemetry.mozilla.org/queries/67071/source
Comment 10•5 years ago
•
|
||
I'd agree we can remove the non-strict version, and just call udf.strict_normalize_search_engine
-> udf.udf.normalize_search_engine
instead.
Updated•4 years ago
|
Updated•3 years ago
|
Description
•