1602019 - Create static lookup table for normalized search engines

Assignee

Description

•

5 years ago

•

The search datasets, especially for mobile, have many different possible values for the same search engine. e.g. ddg, duckduckgo, DuckDuckGo

It would be useful to have a normalized list so that doing analysis wouldn't require a long case statement. We can create a static table of mappings that can be joined on the search data to get a normalized engine column.

We can also do this during table population so that the columns are in search clients daily so a join isn't necessary during analysis.

We'll likely end up following the approach of bug 1591207 (whatever that ends up being)

edit: probably only for major search engines

Ben Wu [:benwu]

Assignee

Comment 1

•

5 years ago

Ben, can you provide a list of the engines we're interested in? I can then create a lookup table that covers all the cases.

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 2

•

5 years ago

•

Edited

here is a query that handles some of logic for the top engines:

https://sql.telemetry.mozilla.org/queries/65652/source#166785

But, in short:

Google
DuckDuckGo
Bing
Yandex

Flags: needinfo?(bmiroglio)

Ben Wu [:benwu]

Assignee

Comment 3

•

5 years ago

We decided to go with a udf since it's a lot cleaner to use: https://sql.telemetry.mozilla.org/queries/67071/source

This is just a wrapper around the case statement: https://github.com/mozilla/bigquery-etl/blob/master/udf/normalize_search_engine.sql
There are a few extra cases compared to the query you gave. Does this look good to you?

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 4

•

5 years ago

This is great, however it would be useful to have the Other engines listed out as well, in case we need to dig deeper into a engine-specific analysis or bug. My understanding is we'd map those major engines in Comment 2 to something consistent, since they are the most used, but leave the rest as-is.

Let me know what you think.

Flags: needinfo?(bmiroglio) → needinfo?(bewu)

Ben Wu [:benwu]

Assignee

Comment 5

•

5 years ago

That makes sense to me. I'll make the change

Flags: needinfo?(bewu)

Ben Wu [:benwu]

Assignee

Comment 6

•

5 years ago

•

Edited

Actually since the engine column still exists would it be enough to have that along with the normalized engine name? So for "others" you can filter for that. It would be cleaner to have the normalized column be a determined set of values. We'll provide two udfs strict_normalize_search_engine and normalize_search_engine where strict will give Other and non-strict will give the unrecognized engine.

Basically, the question I have is can we have Other in a normalized_engine column in all the search tables but provide a udf that can return the engine name instead?

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 7

•

5 years ago

How about just having a normalized_search_engine (which implements the strict case you mention) alongside engine. This way, we can still have full granularity for engine when/if needed. If this is what you're already suggesting, then great :).

Flags: needinfo?(bmiroglio)

Ben Wu [:benwu]

Assignee

Comment 8

•

5 years ago

Yes that's what I meant

Ben Wu [:benwu]

Assignee

Comment 9

•

5 years ago

Created udf.normalize_search_engine and udf.strict_normalize_search_engine. Strict will coerce unrecognized engines to Other while non-strict will leave the engine as is. This covers both use cases but I'm not sure if there is much of a need for the non-strict version and if it isn't needed it would be better to remove it to avoid confusion.

Also added normalized_engine field to search aggregates and clients daily, not backfilled for desktop but will be backfilled for mobile.

Closing this bug now since the udfs do the job of the lookup table. We can open a new bug if changes to the udfs need to be made. I'll leave this example here: https://sql.telemetry.mozilla.org/queries/67071/source

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Ben Miroglio [:bmiroglio]

Comment 10

•

5 years ago

•

Edited

I'd agree we can remove the non-strict version, and just call udf.strict_normalize_search_engine -> udf.udf.normalize_search_engine instead.

Nobody; OK to take it and work on it

Updated

•

4 years ago

Component: Datasets: Search → Datasets: General

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: Datasets: General → General

Bugzilla

Create static lookup table for normalized search engines

Categories

(Data Platform and Tools :: General, task)

Tracking

(Not tracked)

People

(Reporter: benwu, Assigned: benwu)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated

Updated