Closed Bug 1602019 Opened 5 years ago Closed 5 years ago

Create static lookup table for normalized search engines

Categories

(Data Platform and Tools :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: benwu, Assigned: benwu)

Details

The search datasets, especially for mobile, have many different possible values for the same search engine. e.g. ddg, duckduckgo, DuckDuckGo

It would be useful to have a normalized list so that doing analysis wouldn't require a long case statement. We can create a static table of mappings that can be joined on the search data to get a normalized engine column.

We can also do this during table population so that the columns are in search clients daily so a join isn't necessary during analysis.

We'll likely end up following the approach of bug 1591207 (whatever that ends up being)

edit: probably only for major search engines

Ben, can you provide a list of the engines we're interested in? I can then create a lookup table that covers all the cases.

Flags: needinfo?(bmiroglio)

here is a query that handles some of logic for the top engines:

https://sql.telemetry.mozilla.org/queries/65652/source#166785

But, in short:

Google
DuckDuckGo
Bing
Yandex

Flags: needinfo?(bmiroglio)

We decided to go with a udf since it's a lot cleaner to use: https://sql.telemetry.mozilla.org/queries/67071/source

This is just a wrapper around the case statement: https://github.com/mozilla/bigquery-etl/blob/master/udf/normalize_search_engine.sql
There are a few extra cases compared to the query you gave. Does this look good to you?

Flags: needinfo?(bmiroglio)

This is great, however it would be useful to have the Other engines listed out as well, in case we need to dig deeper into a engine-specific analysis or bug. My understanding is we'd map those major engines in Comment 2 to something consistent, since they are the most used, but leave the rest as-is.

Let me know what you think.

Flags: needinfo?(bmiroglio) → needinfo?(bewu)

That makes sense to me. I'll make the change

Flags: needinfo?(bewu)

Actually since the engine column still exists would it be enough to have that along with the normalized engine name? So for "others" you can filter for that. It would be cleaner to have the normalized column be a determined set of values. We'll provide two udfs strict_normalize_search_engine and normalize_search_engine where strict will give Other and non-strict will give the unrecognized engine.

Basically, the question I have is can we have Other in a normalized_engine column in all the search tables but provide a udf that can return the engine name instead?

Flags: needinfo?(bmiroglio)

How about just having a normalized_search_engine (which implements the strict case you mention) alongside engine. This way, we can still have full granularity for engine when/if needed. If this is what you're already suggesting, then great :).

Flags: needinfo?(bmiroglio)

Yes that's what I meant

Created udf.normalize_search_engine and udf.strict_normalize_search_engine. Strict will coerce unrecognized engines to Other while non-strict will leave the engine as is. This covers both use cases but I'm not sure if there is much of a need for the non-strict version and if it isn't needed it would be better to remove it to avoid confusion.

Also added normalized_engine field to search aggregates and clients daily, not backfilled for desktop but will be backfilled for mobile.

Closing this bug now since the udfs do the job of the lookup table. We can open a new bug if changes to the udfs need to be made. I'll leave this example here: https://sql.telemetry.mozilla.org/queries/67071/source

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

I'd agree we can remove the non-strict version, and just call udf.strict_normalize_search_engine -> udf.udf.normalize_search_engine instead.

Component: Datasets: Search → Datasets: General
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.