Closed Bug 1543434 Opened 6 years ago Closed 5 years ago

Build derived datasets for mobile search

Categories

(Data Platform and Tools :: General, enhancement, P2)

enhancement
Points:
2

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: harter, Assigned: benwu)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We need datasets like search_clients_daily for mobile (most notably fenix). We need to verify the data format in Bug 1543433 before we can get started on this bug.

Type: defect → enhancement

cc:frank as an FYI.

Hey All,

We've validated Fenix search telemetry. Now we need a search_clients_daily like dataset for this data. Bumping this thread to get the work prioritized. NI? Frank to hot-potato this bug into the right hands.

Thanks!

Flags: needinfo?(fbertsch)
Group: mozilla-employee-confidential

Hey Ryan, sorry about the delay - what's the priority on your end? Is this blocking any pending work?

Flags: needinfo?(fbertsch)

:mconnor is probably the right person to establish priority.

cc:bkal for context.

Flags: needinfo?(mconnor)

It's blocking me from planning 2020 Mobile optimization work. It also makes it a lot harder to build cross-product dashboards without a whole bunch of data munging due to inconsistent Telemetry. Ideally I'd have it before Q4, so I have time to dig in and find targets.

Flags: needinfo?(mconnor)

Okay, let's be clear about what this ask is, and what we have for mobile. I have not investigated all of mobile and what every mobile product reports. However, based simply on cursory information from the core ping, we can begin to derive a subset of search_clients_daily that we can use, with:

country
app_version
distribution_id
locale
os
os_version
channel
profile_creation_date
default_search_engine
sample_id
profile_age_in_days

However, I do not know how to correspond the searches column to what we do on desktop. Given values like: google-b-m.suggestion: 10 and google-b-m.actionbar: 1, how do we want to break those down?

Flags: needinfo?(rharter)
Flags: needinfo?(mconnor)

Hey! Sorry for the looong ni?. Specific to your question:

However, I do not know how to correspond the searches column to what we do on desktop. Given values like: google-b-m.suggestion: 10 and google-b-m.actionbar: 1, how do we want to break those down?

I believe this is the same format that we send on Desktop. They key is of the form: engine.source. We split they key into engine and source while building main_summary (source).

That list of columns sounds good to me. I expect we can add to these later? Leaving the ni? for mconnor to make sure he doesn't see anything missing.

Flags: needinfo?(rharter)
Blocks: 1577600
Blocks: 1577604
Assignee: nobody → bewu
Priority: -- → P2

I just want to double check the columns to use because there are some differences in the metrics between Fennec and Fenix.
If we want to use only values available for both, the table could look like

submission_date
client_id
searches
default_search_engine
country
app_version
channel
os
os_version
profile_creation_date
profile_age_in_days
sample_id

Some questions:

  • Fenix doesn't have a distribution_id or locale equivalent. Should we leave those out or have those be null in Fenix?
  • We can get default_search_engine.submission_url for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway.
  • would app_name be useful (Fennec/Fenix)?

We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.

Another observation is that older versions of fennec have source.engine instead of engine.source so that's a fun case to deal with.

Flags: needinfo?(rharter)

CC: Ben who's taking over as the search analysis team lead. Ben, please object if you disagree with any of these design decisions.

(In reply to Benjamin Wu from comment #9)

  • Fenix doesn't have a distribution_id or locale equivalent. Should we leave those out or have those be null in Fenix?

I'd say remove these columns. The only complication I can see is if we want to union Desktop and Mobile datasets. In that case, we should probably create a NULL column at analysis time.

  • We can get default_search_engine.submission_url for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway.

Yes, please add it and leave it null where we don't have information.

  • would app_name be useful (Fennec/Fenix)?

Yes please!

We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.

Another observation is that older versions of fennec have source.engine instead of engine.source so that's a fun case to deal with.

Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?

Flags: needinfo?(rharter) → needinfo?(bmiroglio)

No objections on my end, all LGTM.

Flags: needinfo?(bmiroglio)

Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?

A quick look at the last week suggests ~24% of sessions with search have source.engine so that's fairly significant. In any case, we need to treat it as a special case so counting it shouldn't be much harder than throwing it out.

We do have locale in metrics.string.glean_baseline_locale. Distribution we don't have, but we definitely need it, so include it as NULL for now and we will get it added. (edit: for Fenix in particular. I noted what we need here: https://github.com/mozilla-mobile/fenix/issues/232#issuecomment-548901478)

I finished my first attempt at these datasets and backfilled a couple of days of data for testing. In the derived datasets project I created search_staging.mobile_search_aggregates_v1 and search_staging.mobile_search_clients_daily_v1 with data from 2019-11-27 to 2019-12-01.

bmiroglio: Could you take a look at those tables and check to see if the schema is fine and if there's anything obviously wrong with the data?

Some questions I have:

  • The possible source values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values?
  • It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?
Flags: needinfo?(bmiroglio)

I'll take a look at this data today.

Regarding your questions:

The possible source values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values?

I don't think we need any bucketing here, however I'm not sure if you're missing any source values since I'm not very familiar with Mobile search for Fenix. Do you know of a mobile engineer we can flag here to verify?

It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?

How difficult would it be to normalized these? We can certainly handle this downstream, but I foresee lots of CASE statements for engine-specific queries which are no fun.

Flags: needinfo?(bmiroglio) → needinfo?(bewu)

For fenix those two are the only values I found so I'm fairly certain those are correct. Core pings though are a bit of a mess so I had to filter out what looked to be garbage values and I have those four values hard coded in to be used in search key parsing. I'm not sure who to ask for this so I'll direct this to Frank.

For normalizing the engines, one way to do this is to have a static table of mappings that can be used as a lookup when populating the datasets. Right now I'm looking into doing this for os names and versions. I would expect that this wouldn't be too hard; we just need to come up with a list. The upper case engine names look to be an issue for only Firefox Focus so the other less robust option would be to lower case all engine values in Focus. Either way shouldn't be much work.

Flags: needinfo?(bewu) → needinfo?(fbertsch)

Do you have a reference query for the core pings garbage values?

Otherwise, I looked at the schemas and did some simple queries against both datasets and things look good to me, nothing stood out as obviously broken. Pending confirmation on the source fields, I'd say we're OK to move forward.

Flags: needinfo?(bewu)

I must have mixed up source and engine. I couldn't find any invalid source values. I was probably thinking of when I originally looked through the data before I normalized the search key where older keys are source.engine instead of engine.source. So there are some weird engines which you can see in mobile search clients daily but source seems fine.

Flags: needinfo?(bewu)

Oh ok, cool. Weird engines are fine, we can handle those downstream. All the more reason to have common names among the major engines across apps though to keep things clean :).

Clearing frank's NI, since the sources are fine.

Flags: needinfo?(fbertsch)

The datasets have been created: mobile_search_clients_daily_v1 and mobile_search_aggregates in the search dataset. I'm going to backfill back to the start of search data in core pings which looks to be 2017-01-01. It will be inexpensive since there isn't much mobile telemetry compared to desktop.

I created bug 1602019 for normalizing the search engines. I expect to do that next week and I think it would make sense to backfill after that gets done so the normalized search engine is another column in the dataset without requiring a join. bmiroglio, does this sound good to you? If you want to look at the historical data sooner I can backfill now but I would prefer a single backfill.

Flags: needinfo?(bmiroglio)

SGTM, thanks Ben!

Flags: needinfo?(bmiroglio)

Both mobile_search_clients_daily and mobile_search_aggregates have been backfilled back to 2017-01-01. Here's a small example query: https://sql.telemetry.mozilla.org/queries/67118/source#170007

I did a very quick look over and the only obvious issue is missing data from 2017-02-02 and 2017-02-03 due to lack of core pings. That should be fixed soon.

Ben, I assume you want to validate?

edit: 2017-02-02 and 2017-02-03 fixed now

Flags: needinfo?(bmiroglio)

I also did a once-over and things LGTM. Thanks Ben

Flags: needinfo?(bmiroglio)
Blocks: 1608997

Ben, is this bug ok to close? I would guess that any other changes would be new bugs.

Flags: needinfo?(bmiroglio)

Yes let's use new bugs for future changes.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mconnor)
Flags: needinfo?(bmiroglio)
Resolution: --- → FIXED
Component: Datasets: Search → Datasets: General
Component: Datasets: General → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: