1543434 - Build derived datasets for mobile search

Reporter

Description

•

6 years ago

We need datasets like search_clients_daily for mobile (most notably fenix). We need to verify the data format in Bug 1543433 before we can get started on this bug.

Ryan Harter [:harter]

Reporter

Updated

•

6 years ago

Type: defect → enhancement

Ryan Harter [:harter]

Reporter

Comment 1

•

6 years ago

cc:frank as an FYI.

Ryan Harter [:harter]

Reporter

Comment 3

•

6 years ago

Hey All,

We've validated Fenix search telemetry. Now we need a search_clients_daily like dataset for this data. Bumping this thread to get the work prioritized. NI? Frank to hot-potato this bug into the right hands.

Thanks!

Flags: needinfo?(fbertsch)

Ryan Harter [:harter]

Reporter

Updated

•

6 years ago

Group: mozilla-employee-confidential

Frank Bertsch [:frank]

Comment 4

•

6 years ago

Hey Ryan, sorry about the delay - what's the priority on your end? Is this blocking any pending work?

Flags: needinfo?(fbertsch)

Ryan Harter [:harter]

Reporter

Comment 5

•

6 years ago

:mconnor is probably the right person to establish priority.

cc:bkal for context.

Flags: needinfo?(mconnor)

Mike Connor [:mconnor]

Comment 6

•

6 years ago

It's blocking me from planning 2020 Mobile optimization work. It also makes it a lot harder to build cross-product dashboards without a whole bunch of data munging due to inconsistent Telemetry. Ideally I'd have it before Q4, so I have time to dig in and find targets.

Flags: needinfo?(mconnor)

Frank Bertsch [:frank]

Comment 7

•

5 years ago

Okay, let's be clear about what this ask is, and what we have for mobile. I have not investigated all of mobile and what every mobile product reports. However, based simply on cursory information from the core ping, we can begin to derive a subset of search_clients_daily that we can use, with:

country
app_version
distribution_id
locale
os
os_version
channel
profile_creation_date
default_search_engine
sample_id
profile_age_in_days

However, I do not know how to correspond the searches column to what we do on desktop. Given values like: google-b-m.suggestion: 10 and google-b-m.actionbar: 1, how do we want to break those down?

Flags: needinfo?(rharter)

Flags: needinfo?(mconnor)

Ryan Harter [:harter]

Reporter

Comment 8

•

5 years ago

Hey! Sorry for the looong ni?. Specific to your question:

However, I do not know how to correspond the searches column to what we do on desktop. Given values like: google-b-m.suggestion: 10 and google-b-m.actionbar: 1, how do we want to break those down?

I believe this is the same format that we send on Desktop. They key is of the form: engine.source. We split they key into engine and source while building main_summary (source).

That list of columns sounds good to me. I expect we can add to these later? Leaving the ni? for mconnor to make sure he doesn't see anything missing.

Flags: needinfo?(rharter)

Frank Bertsch [:frank]

Updated

•

5 years ago

Blocks: 1577600

Frank Bertsch [:frank]

Updated

•

5 years ago

Blocks: 1577604

Frank Bertsch [:frank]

Updated

•

5 years ago

Assignee: nobody → bewu

Priority: -- → P2

Ben Wu [:benwu]

Assignee

Comment 9

•

5 years ago

I just want to double check the columns to use because there are some differences in the metrics between Fennec and Fenix.
If we want to use only values available for both, the table could look like

submission_date
client_id
searches
default_search_engine
country
app_version
channel
os
os_version
profile_creation_date
profile_age_in_days
sample_id

Some questions:

Fenix doesn't have a distribution_id or locale equivalent. Should we leave those out or have those be null in Fenix?
We can get default_search_engine.submission_url for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway.
would app_name be useful (Fennec/Fenix)?

We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.

Another observation is that older versions of fennec have source.engine instead of engine.source so that's a fun case to deal with.

Flags: needinfo?(rharter)

Ryan Harter [:harter]

Reporter

Comment 10

•

5 years ago

CC: Ben who's taking over as the search analysis team lead. Ben, please object if you disagree with any of these design decisions.

(In reply to Benjamin Wu from comment #9)

Fenix doesn't have a distribution_id or locale equivalent. Should we leave those out or have those be null in Fenix?

I'd say remove these columns. The only complication I can see is if we want to union Desktop and Mobile datasets. In that case, we should probably create a NULL column at analysis time.

We can get default_search_engine.submission_url for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway.

Yes, please add it and leave it null where we don't have information.

would app_name be useful (Fennec/Fenix)?

Yes please!

We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.

Another observation is that older versions of fennec have source.engine instead of engine.source so that's a fun case to deal with.

Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?

Flags: needinfo?(rharter) → needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 11

•

5 years ago

No objections on my end, all LGTM.

Flags: needinfo?(bmiroglio)

Ben Wu [:benwu]

Assignee

Comment 12

•

5 years ago

Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?

A quick look at the last week suggests ~24% of sessions with search have source.engine so that's fairly significant. In any case, we need to treat it as a special case so counting it shouldn't be much harder than throwing it out.

Frank Bertsch [:frank]

Comment 13

•

5 years ago

•

Edited

We do have locale in metrics.string.glean_baseline_locale. Distribution we don't have, but we definitely need it, so include it as NULL for now and we will get it added. (edit: for Fenix in particular. I noted what we need here: https://github.com/mozilla-mobile/fenix/issues/232#issuecomment-548901478)

GitHub Bugzilla PR Linker

Comment 14

•

5 years ago

Attached file Link to GitHub pull-request: https://github.com/mozilla/bigquery-etl/pull/559 — Details

Ben Wu [:benwu]

Assignee

Comment 15

•

5 years ago

I finished my first attempt at these datasets and backfilled a couple of days of data for testing. In the derived datasets project I created search_staging.mobile_search_aggregates_v1 and search_staging.mobile_search_clients_daily_v1 with data from 2019-11-27 to 2019-12-01.

bmiroglio: Could you take a look at those tables and check to see if the schema is fine and if there's anything obviously wrong with the data?

Some questions I have:

The possible source values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values?
It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 16

•

5 years ago

I'll take a look at this data today.

Regarding your questions:

The possible source values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values?

I don't think we need any bucketing here, however I'm not sure if you're missing any source values since I'm not very familiar with Mobile search for Fenix. Do you know of a mobile engineer we can flag here to verify?

It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?

How difficult would it be to normalized these? We can certainly handle this downstream, but I foresee lots of CASE statements for engine-specific queries which are no fun.

Flags: needinfo?(bmiroglio) → needinfo?(bewu)

Ben Wu [:benwu]

Assignee

Comment 17

•

5 years ago

For fenix those two are the only values I found so I'm fairly certain those are correct. Core pings though are a bit of a mess so I had to filter out what looked to be garbage values and I have those four values hard coded in to be used in search key parsing. I'm not sure who to ask for this so I'll direct this to Frank.

For normalizing the engines, one way to do this is to have a static table of mappings that can be used as a lookup when populating the datasets. Right now I'm looking into doing this for os names and versions. I would expect that this wouldn't be too hard; we just need to come up with a list. The upper case engine names look to be an issue for only Firefox Focus so the other less robust option would be to lower case all engine values in Focus. Either way shouldn't be much work.

Flags: needinfo?(bewu) → needinfo?(fbertsch)

Ben Miroglio [:bmiroglio]

Comment 18

•

5 years ago

Do you have a reference query for the core pings garbage values?

Otherwise, I looked at the schemas and did some simple queries against both datasets and things look good to me, nothing stood out as obviously broken. Pending confirmation on the source fields, I'd say we're OK to move forward.

Flags: needinfo?(bewu)

Ben Wu [:benwu]

Assignee

Comment 19

•

5 years ago

I must have mixed up source and engine. I couldn't find any invalid source values. I was probably thinking of when I originally looked through the data before I normalized the search key where older keys are source.engine instead of engine.source. So there are some weird engines which you can see in mobile search clients daily but source seems fine.

Flags: needinfo?(bewu)

Ben Miroglio [:bmiroglio]

Comment 20

•

5 years ago

Oh ok, cool. Weird engines are fine, we can handle those downstream. All the more reason to have common names among the major engines across apps though to keep things clean :).

Ben Miroglio [:bmiroglio]

Comment 21

•

5 years ago

Clearing frank's NI, since the sources are fine.

Flags: needinfo?(fbertsch)

Ben Wu [:benwu]

Assignee

Comment 22

•

5 years ago

The datasets have been created: mobile_search_clients_daily_v1 and mobile_search_aggregates in the search dataset. I'm going to backfill back to the start of search data in core pings which looks to be 2017-01-01. It will be inexpensive since there isn't much mobile telemetry compared to desktop.

I created bug 1602019 for normalizing the search engines. I expect to do that next week and I think it would make sense to backfill after that gets done so the normalized search engine is another column in the dataset without requiring a join. bmiroglio, does this sound good to you? If you want to look at the historical data sooner I can backfill now but I would prefer a single backfill.

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 23

•

5 years ago

SGTM, thanks Ben!

Flags: needinfo?(bmiroglio)

Ben Wu [:benwu]

Assignee

Comment 24

•

5 years ago

•

Edited

Both mobile_search_clients_daily and mobile_search_aggregates have been backfilled back to 2017-01-01. Here's a small example query: https://sql.telemetry.mozilla.org/queries/67118/source#170007

I did a very quick look over and the only obvious issue is missing data from 2017-02-02 and 2017-02-03 due to lack of core pings. That should be fixed soon.

Ben, I assume you want to validate?

edit: 2017-02-02 and 2017-02-03 fixed now

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 25

•

5 years ago

I also did a once-over and things LGTM. Thanks Ben

Flags: needinfo?(bmiroglio)

Chenxia Liu [:liuche]

Updated

•

5 years ago

Blocks: 1608997

Ben Wu [:benwu]

Assignee

Comment 26

•

5 years ago

Ben, is this bug ok to close? I would guess that any other changes would be new bugs.

Flags: needinfo?(bmiroglio)

Ben Miroglio [:bmiroglio]

Comment 27

•

5 years ago

Yes let's use new bugs for future changes.

Status: NEW → RESOLVED

Closed: 5 years ago

Flags: needinfo?(mconnor)

Flags: needinfo?(bmiroglio)

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

4 years ago

Component: Datasets: Search → Datasets: General

Nobody; OK to take it and work on it

Updated

•

3 years ago

Component: Datasets: General → General