Sanitize search engine values on ingestion for mobile telemetry
Categories
(Data Platform and Tools :: General, task, P1)
Tracking
(Not tracked)
People
(Reporter: klukas, Assigned: klukas)
References
Details
(Whiteboard: [dataquality])
This is an immediate follow-up to https://bugzilla.mozilla.org/show_bug.cgi?id=1751753
We need to sanitize values for probes in Focus Android and Fenix Android, using the same allowlist of codes.
Comment 1 (Assignee) • 2 years ago
I need to identify exactly which probes are of concern for Fenix and Focus.
:ANich pointed me to https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/browser_search_in_content, which is metrics.labeled_counter.browser_search_in_content in the metrics ping. I haven't seen any problematic values there so far, but it's likely one we need to sanitize.
Comment 2 (Assignee) • 2 years ago
I've looked at telemetry.core, which has a searches field. There are some very low-incidence values there, but spot checking the last several months of data shows no entries that look problematic. The single-occurrence values are often Wikipedia variants like quicksearch.wikipedia-az.
Comment 3 (Assignee) • 2 years ago
Here's the source in android-components where the in-content string is produced, which gives context on the format:
<provider>.in-content.[sap|sap-follow-on|organic].code|none?
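To make the format concrete, here is a minimal Python sketch of how a label could be split into its fields. The name `parse_in_content_label`, the `InContentLabel` type, and the example labels are hypothetical, and treating everything after the type as a single code field is an assumption; this is not the android-components implementation.

```python
from typing import NamedTuple, Optional

# Search types named in the format string above.
SEARCH_TYPES = {"sap", "sap-follow-on", "organic"}


class InContentLabel(NamedTuple):
    provider: str
    search_type: str
    code: str  # may be the literal string "none"


def parse_in_content_label(label: str) -> Optional[InContentLabel]:
    """Split a label into fields, or return None if it does not match.

    Assumption: the provider contains no "." and everything after the
    search type is kept together as the code field.
    """
    parts = label.split(".", 3)
    if len(parts) != 4:
        return None
    provider, marker, search_type, code = parts
    if marker != "in-content" or search_type not in SEARCH_TYPES:
        return None
    return InContentLabel(provider, search_type, code)
```

A label that does not follow the format, such as quicksearch.wikipedia-az, yields `None` in this sketch and would be left for other handling.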
Comment 4 (Assignee) • 2 years ago
The two fields I've found that potentially contain in-content strings are:
- browser_search_in_content
- browser_search_ad_clicks
Comment 5 (Assignee) • 2 years ago
PR available for review: https://github.com/mozilla/gcp-ingestion/pull/1961
Of note:
- We replace invalid code values with scrubbed, which is a bit different from the desktop case (other.scrubbed), since . is already used as the separator between fields here
- We process all labeled_counter metrics named like browser.search.*
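The two points above can be sketched roughly as follows. This is an illustrative Python sketch, not the code in the PR: the allowlist entries are placeholders, the function names are hypothetical, and summing counts when two labels collapse to the same scrubbed label is an assumption about the desired behavior.

```python
from collections import defaultdict

# Placeholder allowlist; the real list of codes lives in the pipeline.
ALLOWED_CODES = {"none", "example-code-1", "example-code-2"}


def sanitize_label(label: str) -> str:
    """Replace an unrecognized code (fourth field) with "scrubbed"."""
    parts = label.split(".")
    if len(parts) >= 4 and parts[1] == "in-content":
        if parts[3] not in ALLOWED_CODES:
            parts[3] = "scrubbed"
        return ".".join(parts)
    return label  # not an in-content label; left unchanged in this sketch


def sanitize_labeled_counters(labeled_counters: dict) -> dict:
    """Sanitize labels of every labeled_counter named like browser.search.*."""
    result = {}
    for name, counts in labeled_counters.items():
        if not name.startswith("browser.search."):
            result[name] = dict(counts)
            continue
        merged = defaultdict(int)
        for label, count in counts.items():
            # Assumption: counts for labels that collapse together are summed.
            merged[sanitize_label(label)] += count
        result[name] = dict(merged)
    return result
```

Using plain scrubbed rather than other.scrubbed keeps the number of "."-separated fields in the label unchanged.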
Comment 6 (Assignee) • 2 years ago
:srose brought up in code review that the channel value should also be sanitized.
After discussion in Slack with :mconnor and :royang, we will drop any value in the channel position unless the value is "ts", both client-side and server-side. So we will accept that we won't be able to track modifications to channel values.
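A rough Python sketch of the channel rule, assuming the channel is an optional fifth "."-separated field after the code (the thread does not spell out the position, so that is an assumption) and using a hypothetical function name:

```python
def sanitize_channel(label: str) -> str:
    """Keep the channel field only when it is exactly "ts"; drop it otherwise.

    Assumption: the label looks like
    <provider>.in-content.<type>.<code>[.<channel>]
    and anything beyond the channel position is dropped as well.
    """
    parts = label.split(".")
    if len(parts) < 5 or parts[1] != "in-content":
        return label  # no channel field present, nothing to drop
    kept = parts[:4]
    if parts[4] == "ts":
        kept.append("ts")
    return ".".join(kept)
```

Dropping the field rather than replacing it with a marker matches the decision above, at the cost of not seeing how often channel values are modified.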
Comment 7 • 2 years ago
Patch landed in main https://github.com/mozilla-mobile/android-components/pull/11622
Comment 8 (Assignee) • 2 years ago
We deployed https://github.com/mozilla/gcp-ingestion/pull/1961 to the stage pipeline yesterday and found a much higher rate of scrubbing than expected. I looked into historical values in BQ and realized we were missing two cases:
- Mobile has additional Baidu codes that start with numerics which weren't present in the pipeline PR
- Mobile lowercases the values before sending, so Bing codes weren't being recognized
So additional changes are needed before we push this change to the prod pipeline.
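One way to handle the lowercasing mismatch is to lowercase the allowlist once and match mobile codes against that set. The sketch below is illustrative only: the allowlist entries are placeholders (one mixed-case, one starting with digits), and the names are hypothetical rather than taken from the fixup PR.

```python
# Placeholder entries standing in for the shared allowlist of codes.
SHARED_ALLOWLIST = {"MixedCaseCode", "1234code"}

# Mobile lowercases values before sending, so match against a lowercased copy.
MOBILE_ALLOWLIST = frozenset(code.lower() for code in SHARED_ALLOWLIST)


def is_allowed_mobile_code(code: str) -> bool:
    """Return True if a code as sent by mobile is in the lowercased allowlist."""
    return code in MOBILE_ALLOWLIST
```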
Comment 9 (Assignee) • 2 years ago
Fixups up for review in https://github.com/mozilla/gcp-ingestion/pull/1962
Comment 10 (Assignee) • 2 years ago
This is in prod and confirmed working. See https://bugzilla.mozilla.org/show_bug.cgi?id=1752239 for some further work to fix up DDG data.