Build derived datasets for mobile search
Categories
(Data Platform and Tools :: General, enhancement, P2)
Tracking
(Not tracked)
People
(Reporter: harter, Assigned: benwu)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
We need datasets like search_clients_daily for mobile (most notably fenix). We need to verify the data format in Bug 1543433 before we can get started on this bug.
Reporter | ||
Updated•6 years ago
|
Reporter | ||
Comment 1•6 years ago
|
||
cc:frank as an FYI.
Reporter | ||
Comment 3•6 years ago
|
||
Hey All,
We've validated Fenix search telemetry. Now we need a search_clients_daily
like dataset for this data. Bumping this thread to get the work prioritized. NI? Frank to hot-potato this bug into the right hands.
Thanks!
Reporter | ||
Updated•6 years ago
|
Comment 4•6 years ago
|
||
Hey Ryan, sorry about the delay - what's the priority on your end? Is this blocking any pending work?
Reporter | ||
Comment 5•6 years ago
|
||
:mconnor is probably the right person to establish priority.
cc:bkal for context.
Comment 6•6 years ago
|
||
It's blocking me from planning 2020 Mobile optimization work. It also makes it a lot harder to build cross-product dashboards without a whole bunch of data munging due to inconsistent Telemetry. Ideally I'd have it before Q4, so I have time to dig in and find targets.
Comment 7•5 years ago
|
||
Okay, let's be clear about what this ask is, and what we have for mobile. I have not investigated all of mobile and what every mobile product reports. However, based simply on cursory information from the core ping, we can begin to derive a subset of search_clients_daily
that we can use, with:
country
app_version
distribution_id
locale
os
os_version
channel
profile_creation_date
default_search_engine
sample_id
profile_age_in_days
However, I do not know how to correspond the searches
column to what we do on desktop. Given values like: google-b-m.suggestion: 10
and google-b-m.actionbar: 1
, how do we want to break those down?
Reporter | ||
Comment 8•5 years ago
|
||
Hey! Sorry for the looong ni?. Specific to your question:
However, I do not know how to correspond the
searches
column to what we do on desktop. Given values like:google-b-m.suggestion: 10
andgoogle-b-m.actionbar: 1
, how do we want to break those down?
I believe this is the same format that we send on Desktop. They key is of the form: engine.source
. We split they key into engine
and source
while building main_summary
(source).
That list of columns sounds good to me. I expect we can add to these later? Leaving the ni? for mconnor to make sure he doesn't see anything missing.
Updated•5 years ago
|
Assignee | ||
Comment 9•5 years ago
|
||
I just want to double check the columns to use because there are some differences in the metrics between Fennec and Fenix.
If we want to use only values available for both, the table could look like
submission_date
client_id
searches
default_search_engine
country
app_version
channel
os
os_version
profile_creation_date
profile_age_in_days
sample_id
Some questions:
- Fenix doesn't have a
distribution_id
orlocale
equivalent. Should we leave those out or have those be null in Fenix? - We can get
default_search_engine.submission_url
for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway. - would
app_name
be useful (Fennec/Fenix)?
We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.
Another observation is that older versions of fennec have source.engine
instead of engine.source
so that's a fun case to deal with.
Reporter | ||
Comment 10•5 years ago
|
||
CC: Ben who's taking over as the search analysis team lead. Ben, please object if you disagree with any of these design decisions.
(In reply to Benjamin Wu from comment #9)
- Fenix doesn't have a
distribution_id
orlocale
equivalent. Should we leave those out or have those be null in Fenix?
I'd say remove these columns. The only complication I can see is if we want to union Desktop and Mobile datasets. In that case, we should probably create a NULL
column at analysis time.
- We can get
default_search_engine.submission_url
for fenix but not fennec (and probably other useful columns). Should we add it and leave it null for Fennec? I assume we want to focus on Fenix anyway.
Yes, please add it and leave it null where we don't have information.
- would
app_name
be useful (Fennec/Fenix)?
Yes please!
We can always add more in the future and backfill, especially since the mobile datasets are small and it would be inexpensive.
Another observation is that older versions of fennec have
source.engine
instead ofengine.source
so that's a fun case to deal with.
Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?
Assignee | ||
Comment 12•5 years ago
|
||
Depending on how old these versions are and how many users are still on these versions, I might consider writing these data off as malformed instead of complicating the pipeline. Is this a prominent problem?
A quick look at the last week suggests ~24% of sessions with search have source.engine
so that's fairly significant. In any case, we need to treat it as a special case so counting it shouldn't be much harder than throwing it out.
Comment 13•5 years ago
•
|
||
We do have locale in metrics.string.glean_baseline_locale
. Distribution we don't have, but we definitely need it, so include it as NULL
for now and we will get it added. (edit: for Fenix in particular. I noted what we need here: https://github.com/mozilla-mobile/fenix/issues/232#issuecomment-548901478)
Comment 14•5 years ago
|
||
Assignee | ||
Comment 15•5 years ago
|
||
I finished my first attempt at these datasets and backfilled a couple of days of data for testing. In the derived datasets project I created search_staging.mobile_search_aggregates_v1
and search_staging.mobile_search_clients_daily_v1
with data from 2019-11-27 to 2019-12-01.
bmiroglio: Could you take a look at those tables and check to see if the schema is fine and if there's anything obviously wrong with the data?
Some questions I have:
- The possible
source
values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values? - It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?
Comment 16•5 years ago
|
||
I'll take a look at this data today.
Regarding your questions:
The possible source values from core pings are actionbar, listitem, suggestion, and quicksearch. For Fenix: action and suggestion. Do you want to bucket these in some way like desktop scd? And am I missing any source values?
I don't think we need any bucketing here, however I'm not sure if you're missing any source values since I'm not very familiar with Mobile search for Fenix. Do you know of a mobile engineer we can flag here to verify?
It looks like the engine values for older app versions use display names (e.g. "DuckDuckGo" instead of "ddg" or "duckduckgo"). Do you want to normalize these somehow or just leave them as is and handle them downstream?
How difficult would it be to normalized these? We can certainly handle this downstream, but I foresee lots of CASE
statements for engine-specific queries which are no fun.
Assignee | ||
Comment 17•5 years ago
|
||
For fenix those two are the only values I found so I'm fairly certain those are correct. Core pings though are a bit of a mess so I had to filter out what looked to be garbage values and I have those four values hard coded in to be used in search key parsing. I'm not sure who to ask for this so I'll direct this to Frank.
For normalizing the engines, one way to do this is to have a static table of mappings that can be used as a lookup when populating the datasets. Right now I'm looking into doing this for os names and versions. I would expect that this wouldn't be too hard; we just need to come up with a list. The upper case engine names look to be an issue for only Firefox Focus so the other less robust option would be to lower case all engine values in Focus. Either way shouldn't be much work.
Comment 18•5 years ago
|
||
Do you have a reference query for the core pings garbage values?
Otherwise, I looked at the schemas and did some simple queries against both datasets and things look good to me, nothing stood out as obviously broken. Pending confirmation on the source fields, I'd say we're OK to move forward.
Assignee | ||
Comment 19•5 years ago
|
||
I must have mixed up source and engine. I couldn't find any invalid source values. I was probably thinking of when I originally looked through the data before I normalized the search key where older keys are source.engine
instead of engine.source
. So there are some weird engines which you can see in mobile search clients daily but source seems fine.
Comment 20•5 years ago
|
||
Oh ok, cool. Weird engines are fine, we can handle those downstream. All the more reason to have common names among the major engines across apps though to keep things clean :).
Comment 21•5 years ago
|
||
Clearing frank's NI, since the sources are fine.
Assignee | ||
Comment 22•5 years ago
|
||
The datasets have been created: mobile_search_clients_daily_v1
and mobile_search_aggregates
in the search
dataset. I'm going to backfill back to the start of search data in core pings which looks to be 2017-01-01. It will be inexpensive since there isn't much mobile telemetry compared to desktop.
I created bug 1602019 for normalizing the search engines. I expect to do that next week and I think it would make sense to backfill after that gets done so the normalized search engine is another column in the dataset without requiring a join. bmiroglio, does this sound good to you? If you want to look at the historical data sooner I can backfill now but I would prefer a single backfill.
Assignee | ||
Comment 24•5 years ago
•
|
||
Both mobile_search_clients_daily
and mobile_search_aggregates
have been backfilled back to 2017-01-01. Here's a small example query: https://sql.telemetry.mozilla.org/queries/67118/source#170007
I did a very quick look over and the only obvious issue is missing data from 2017-02-02 and 2017-02-03 due to lack of core pings. That should be fixed soon.
Ben, I assume you want to validate?
edit: 2017-02-02 and 2017-02-03 fixed now
Comment 25•5 years ago
|
||
I also did a once-over and things LGTM. Thanks Ben
Assignee | ||
Comment 26•5 years ago
|
||
Ben, is this bug ok to close? I would guess that any other changes would be new bugs.
Comment 27•5 years ago
|
||
Yes let's use new bugs for future changes.
Updated•4 years ago
|
Updated•3 years ago
|
Description
•