Sanitize search engine values on ingestion for desktop telemetry
Categories: Data Platform and Tools :: General, task, P1
Tracking: Not tracked
People: Reporter: klukas, Assigned: klukas
Details: Whiteboard: [dataquality]
We need to sanitize the keys of the search_counts histogram during ingestion based on an allowlist of known engines.
Per :standard8:
From a desktop perspective, we’re going to have this list in remote settings (existing version using prefixes), and that also gets updated a few days after remote settings into the main repositories.
From a BQ perspective, the histogram exists in main_v4 under payload.keyed_histograms.search_counts, which is a key/value struct.
From the JSON perspective, the histogram would be payload.keyedHistograms.SEARCH_COUNTS, with potential casing differences we'll need to account for.
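One way to account for the casing differences is to normalize each path segment before comparing. The following is only a sketch, not the ingestion code: the helper name `get_case_insensitive` and the sample ping are made up for illustration, and it assumes that ignoring case and underscores is enough to match the BQ and JSON spellings.

```python
def get_case_insensitive(obj, *path):
    """Walk nested dicts, matching each segment regardless of case or underscores.

    Hypothetical helper: treats keyed_histograms, keyedHistograms and
    KEYED_HISTOGRAMS as the same segment.
    """
    def norm(segment):
        return segment.replace("_", "").lower()

    current = obj
    for segment in path:
        if not isinstance(current, dict):
            return None
        matches = [k for k in current if norm(k) == norm(segment)]
        if not matches:
            return None
        current = current[matches[0]]
    return current


# Made-up ping fragment using the JSON-style casing.
ping = {
    "payload": {
        "keyedHistograms": {
            "SEARCH_COUNTS": {"google.urlbar": {"sum": 3}},
        }
    }
}

# The BQ-style spelling resolves to the same node.
search_counts = get_case_insensitive(ping, "payload", "keyed_histograms", "search_counts")
```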
The structure of the histogram is documented in Histograms.json:
Records search counts for search access points and in-content searches.
For search access points in general, the format is: <engine-name>.<search-access-point>
For the urlbar when in search mode, the format is: <engine name>.urlbar-searchmode
For the urlbar when an internal @engine shortcut is used, the format is: <engine-name>.alias
For in-content searches, the format is: <provider>.in-content:[sap|sap-follow-on|organic]:[code|none]
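Given those key formats, the engine portion can be separated from the rest of the key and checked against the allowlist. This is a rough sketch under stated assumptions: the function names, the `[scrubbed]` placeholder, and the sample allowlist are hypothetical, and it only checks the engine or provider part, not the trailing code of in-content keys.

```python
SCRUBBED = "[scrubbed]"  # hypothetical placeholder, not confirmed by the bug


def split_search_counts_key(key):
    """Split a search_counts key into (engine, suffix)."""
    marker = ".in-content:"
    if marker in key:
        # <provider>.in-content:<type>:<code>
        provider, _, rest = key.partition(marker)
        return provider, "in-content:" + rest
    # <engine-name>.<search-access-point>; split on the last dot because
    # an engine name may itself contain dots.
    engine, sep, access_point = key.rpartition(".")
    if not sep:
        return key, ""
    return engine, access_point


def sanitize_search_counts_key(key, allowed_engines):
    """Replace the engine part with a placeholder if it is not allowlisted."""
    engine, suffix = split_search_counts_key(key)
    if engine not in allowed_engines:
        engine = SCRUBBED
    return f"{engine}.{suffix}" if suffix else engine


allowed = {"google", "bing"}  # made-up allowlist for illustration
```

With this sketch, a key such as `google.urlbar` passes through unchanged, while a key whose engine is not in the allowlist keeps its access point but loses the engine name.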
Comment 1 (Assignee) • 2 years ago
I've updated the title to show that the scope is a little bigger than just the SEARCH_COUNTS histogram. We also need to account for the browser.search.content.* keyed scalars.
Comment 2 (Assignee) • 2 years ago
From the BQ perspective, the scalars are found at paths like payload.processes.parent.keyed_scalars.browser_search_content_about_home, which should correspond to JSON path payload.processes.parent.keyedScalars with key browser.search.content.about_home.
These probes are documented as:
The key format is <provider>:[tagged|tagged-follow-on|organic]:[code|none]
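The scalar key format splits cleanly on colons, so each part can be validated separately. The sketch below is illustrative only: the function name, the `[scrubbed]` placeholder, and the sample allowlists are assumptions, and the real ingestion code may handle malformed keys differently.

```python
SCRUBBED = "[scrubbed]"  # hypothetical placeholder, not confirmed by the bug
SEARCH_TYPES = {"tagged", "tagged-follow-on", "organic"}


def sanitize_content_key(key, allowed_providers, allowed_codes):
    """Sanitize a <provider>:<type>:<code> key from browser.search.content.*."""
    parts = key.split(":")
    if len(parts) != 3 or parts[1] not in SEARCH_TYPES:
        # The key does not follow the documented format; scrub it entirely.
        return SCRUBBED
    provider, search_type, code = parts
    if provider not in allowed_providers:
        provider = SCRUBBED
    if code not in allowed_codes:
        code = SCRUBBED
    return f"{provider}:{search_type}:{code}"


providers = {"google"}            # made-up allowlist for illustration
codes = {"firefox-b-d", "none"}   # made-up subset of allowed codes
```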
Comment 3 (Assignee) • 2 years ago
Based on the posted client PR, I've processed the following list of allowed codes:
$ curl -O https://d2mfgivbiy2fiw.cloudfront.net/file/data/yjvjoterwpwif3zsnk6c/PHID-FILE-fya6hv5d52g26jesmjd4/services_settings_dumps_main_search-telemetry-v2.json
$ jq -c '.data|map(.taggedCodes)|flatten' services_settings_dumps_main_search-telemetry-v2.json
["MOZ2","MOZ4","MOZ5","MOZA","MOZB","MOZD","MOZE","MOZI","MOZM","MOZO","MOZT","MOZW","MOZSL01","MOZSL02","MOZSL03","monline_dg","monline_4_dg","monline_7_dg","firefox-a","firefox-b","firefox-b-1","firefox-b-ab","firefox-b-1-ab","firefox-b-d","firefox-b-1-d","firefox-b-e","firefox-b-1-e","firefox-b-m","firefox-b-1-m","firefox-b-o","firefox-b-1-o","firefox-b-lm","firefox-b-1-lm","firefox-b-lg","firefox-b-huawei-h1611","firefox-b-is-oem1","firefox-b-oem1","firefox-b-oem2","firefox-b-tinno","firefox-b-pn-wt","firefox-b-pn-wt-us","ubuntu","ffab","ffcm","ffhp","ffip","ffit","ffnt","ffocus","ffos","ffsb","fpas","fpsa","ftas","ftsa","newext",null]
Comment 4 • 2 years ago
other and none also need to be allowed values.
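Putting the extracted list together with these two extra values could look roughly like the following. This is a sketch, not the merged implementation: the function name is hypothetical, the sample dump is made up, and it assumes the dump has the same `.data[].taggedCodes` shape that the jq command above relies on. Records without `taggedCodes` (the source of the trailing `null` in the jq output) are skipped.

```python
import json


def allowed_codes_from_dump(dump_text):
    """Collect taggedCodes from a settings dump and add 'other' and 'none'."""
    records = json.loads(dump_text)["data"]
    codes = set()
    for record in records:
        # A record without taggedCodes shows up as null in the jq output.
        for code in record.get("taggedCodes") or []:
            if code is not None:
                codes.add(code)
    # Per the comment above, these two values must also be allowed.
    codes.update({"other", "none"})
    return codes


# Made-up miniature dump for illustration; the second record has no taggedCodes.
sample_dump = '{"data": [{"taggedCodes": ["MOZ2", "firefox-b-d"]}, {"telemetryId": "x"}]}'
allowed_codes = allowed_codes_from_dump(sample_dump)
```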
Comment 5 (Assignee) • 2 years ago
The evolving PR for this is https://github.com/mozilla/gcp-ingestion/pull/1956
Comment 6 (Assignee) • 2 years ago
The change is merged, has been deployed to stage, and has been running for an hour.
I'm doing some validation: https://sql.telemetry.mozilla.org/queries/83894/source
That shows ~0.5% of entries getting scrubbed for search_counts. :mconnor shared numbers for desktop overall showing that about 1% of entries were unknown, so these are not wildly off. If I substitute in one of the keyed scalars (payload.processes.parent.keyed_scalars.browser_search_content_searchbar), I see about 1% of entries being scrubbed.
These volumes seem reasonable, and in spot checks of the values I am not seeing any concerning content, so the scrubbing appears to be working.
Comment 7 (Assignee) • 2 years ago
This is now deployed to prod, and we see about 1% of entries getting scrubbed: https://sql.telemetry.mozilla.org/queries/83895/source
Comment 8 (Assignee) • 2 years ago
Note that desktop nightly now includes client-side scrubbing, so we can watch the evolution of client-side vs. server-side scrubbing: https://sql.telemetry.mozilla.org/queries/83897/source