Closed Bug 1751753 Opened 2 years ago Closed 2 years ago

Sanitize search engine values on ingestion for desktop telemetry

Categories

(Data Platform and Tools :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

Details

(Whiteboard: [dataquality])

We need to sanitize the keys of the search_counts histogram during ingestion based on an allowlist of known engines.

Per :standard8:

From a desktop perspective, we’re going to have this list in remote settings (existing version using prefixes), and it also gets synced into the main repositories a few days after remote settings is updated.

From a BQ perspective, the histogram exists in main_v4 under payload.keyed_histograms.search_counts, which is a key/value struct.

From the JSON perspective, the histogram would be payload.keyedHistograms.SEARCH_COUNTS, with potential casing differences we'll need to account for.
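To illustrate the casing concern, here is a minimal Python sketch of a case-insensitive lookup for the histogram in a decoded ping. The helper names (`get_case_insensitive`, `find_search_counts`) are hypothetical, and the sketch assumes the payload has already been parsed into nested dicts; the real ingestion code may handle casing differently.

```python
from typing import Any, Optional


def get_case_insensitive(obj: dict, name: str) -> Optional[Any]:
    """Return the value of the first key that equals `name` ignoring case."""
    wanted = name.lower()
    for key, value in obj.items():
        if key.lower() == wanted:
            return value
    return None


def find_search_counts(ping: dict) -> Optional[dict]:
    """Locate payload.keyedHistograms.SEARCH_COUNTS regardless of key casing."""
    payload = ping.get("payload", {})
    keyed = get_case_insensitive(payload, "keyedHistograms") or {}
    return get_case_insensitive(keyed, "SEARCH_COUNTS")


# Hypothetical example ping with a lowercased histogram name.
ping = {"payload": {"keyedHistograms": {"search_counts": {"google.urlbar": 3}}}}
found = find_search_counts(ping)
```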

The structure of the histogram is documented in Histograms.json:

Records search counts for search access points and in-content searches.
For search access points in general, the format is: <engine-name>.<search-access-point>
For the urlbar when in search mode, the format is <engine name>.urlbar-searchmode
For the urlbar when an internal @engine shortcut is used, the format is: <engine-name>.alias
For in-content searches, the format is <provider>.in-content:[sap|sap-follow-on|organic]:[code|none]
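The key formats above could be split into their parts roughly as follows. This is only a sketch: `parse_search_counts_key` and the returned field names are hypothetical, and it assumes the access point name itself contains no dots, so that the last dot separates the engine from the access point.

```python
import re
from typing import Optional

# <provider>.in-content:[sap|sap-follow-on|organic]:[code|none]
IN_CONTENT = re.compile(
    r"^(?P<provider>.+)\.in-content:(?P<type>sap|sap-follow-on|organic):(?P<code>.+)$"
)


def parse_search_counts_key(key: str) -> Optional[dict]:
    """Split a SEARCH_COUNTS key into engine, source, type, and code."""
    match = IN_CONTENT.match(key)
    if match:
        return {
            "engine": match["provider"],
            "source": "in-content",
            "type": match["type"],
            "code": match["code"],
        }
    # <engine-name>.<search-access-point>, including .urlbar-searchmode and .alias
    engine, sep, source = key.rpartition(".")
    if not sep or not engine or not source:
        return None  # does not follow any documented format
    return {"engine": engine, "source": source, "type": None, "code": None}
```

Once the engine part is isolated, it can be compared against the allowlist independently of the access point.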

I've updated the title to show that scope is a little bigger than just the SEARCH_COUNTS histogram. We also need to account for the browser.search.content.* keyed scalars.

Summary: Sanitize engine values in search_counts histogram on ingestion → Sanitize search engine values on ingestion

From the BQ perspective, the scalars are found at paths like payload.processes.parent.keyed_scalars.browser_search_content_about_home, which should correspond to the JSON path payload.processes.parent.keyedScalars with key browser.search.content.about_home.
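The correspondence between the two paths can be sketched as a simple name conversion. The functions below are hypothetical and deliberately simplified (dots and hyphens become underscores, camelCase becomes snake_case); the pipeline's actual column naming rules may cover more cases.

```python
import re
from typing import List


def to_snake_case(name: str) -> str:
    """Convert a JSON field or probe name to a BigQuery-style column name."""
    name = name.replace(".", "_").replace("-", "_")
    # Insert an underscore at each lowercase-to-uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


def bq_path(json_path: List[str], probe: str) -> str:
    """Build the BigQuery field path for a probe under a JSON path."""
    parts = [to_snake_case(part) for part in json_path]
    parts.append(to_snake_case(probe))
    return ".".join(parts)


path = bq_path(
    ["payload", "processes", "parent", "keyedScalars"],
    "browser.search.content.about_home",
)
```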

These probes are documented as:

The key format is <provider>:[tagged|tagged-follow-on|organic]:[code|none]

Based on the posted client PR, I've processed the following list of allowed codes:

$ curl -O https://d2mfgivbiy2fiw.cloudfront.net/file/data/yjvjoterwpwif3zsnk6c/PHID-FILE-fya6hv5d52g26jesmjd4/services_settings_dumps_main_search-telemetry-v2.json

$ jq -c '.data|map(.taggedCodes)|flatten' services_settings_dumps_main_search-telemetry-v2.json
["MOZ2","MOZ4","MOZ5","MOZA","MOZB","MOZD","MOZE","MOZI","MOZM","MOZO","MOZT","MOZW","MOZSL01","MOZSL02","MOZSL03","monline_dg","monline_4_dg","monline_7_dg","firefox-a","firefox-b","firefox-b-1","firefox-b-ab","firefox-b-1-ab","firefox-b-d","firefox-b-1-d","firefox-b-e","firefox-b-1-e","firefox-b-m","firefox-b-1-m","firefox-b-o","firefox-b-1-o","firefox-b-lm","firefox-b-1-lm","firefox-b-lg","firefox-b-huawei-h1611","firefox-b-is-oem1","firefox-b-oem1","firefox-b-oem2","firefox-b-tinno","firefox-b-pn-wt","firefox-b-pn-wt-us","ubuntu","ffab","ffcm","ffhp","ffip","ffit","ffnt","ffocus","ffos","ffsb","fpas","fpsa","ftas","ftsa","newext",null]

other and none also need to be allowed values.
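As a rough illustration of how the code allowlist might be applied to the browser.search.content.* keys, here is a Python sketch. The allowlist is abbreviated to a handful of the codes listed above; the function names are hypothetical. Replacing an unknown code with "other", dropping malformed keys, and summing counts that collapse onto the same key are all assumptions, not necessarily what the deployed change does. The provider part is not checked here.

```python
from typing import Dict, Optional

# Abbreviated subset of the allowed codes, plus the literals "none" and "other".
ALLOWED_CODES = frozenset(
    {"MOZ2", "firefox-b-d", "firefox-b-1-d", "ubuntu", "ffab", "newext", "none", "other"}
)
ALLOWED_TYPES = frozenset({"tagged", "tagged-follow-on", "organic"})


def sanitize_content_key(key: str) -> Optional[str]:
    """Sanitize a <provider>:<type>:<code> key; return None if malformed."""
    parts = key.split(":")
    if len(parts) != 3 or parts[1] not in ALLOWED_TYPES:
        return None
    provider, search_type, code = parts
    if code not in ALLOWED_CODES:
        code = "other"  # assumption: unknown codes collapse to "other"
    return f"{provider}:{search_type}:{code}"


def sanitize_scalar(counts: Dict[str, int]) -> Dict[str, int]:
    """Sanitize every key of a keyed scalar, summing counts that collide."""
    result: Dict[str, int] = {}
    for key, value in counts.items():
        clean = sanitize_content_key(key)
        if clean is None:
            continue  # assumption: malformed keys are dropped
        result[clean] = result.get(clean, 0) + value
    return result
```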

Blocks: 1751920
Summary: Sanitize search engine values on ingestion → Sanitize search engine values on ingestion for desktop telemetry
See Also: → 1751955
Blocks: 1751979

The change is merged, has been deployed to stage, and has been running there for an hour.

I'm doing some validation: https://sql.telemetry.mozilla.org/queries/83894/source

That shows ~0.5% of entries getting scrubbed for search_counts. :mconnor showed some numbers for desktop overall that showed about 1% of entries were unknown, so these are not wildly off. If I substitute in one of the keyed scalars (payload.processes.parent.keyed_scalars.browser_search_content_searchbar), I see about 1% of entries being scrubbed.

These volumes seem reasonable, and spot checking values, I am not seeing any concerning content, so the scrubbing appears to be working.

This is now deployed to prod, and we see about 1% of entries getting scrubbed: https://sql.telemetry.mozilla.org/queries/83895/source

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Note that desktop nightly now includes client-side scrubbing, so we can watch the evolution of client-side vs. server-side scrubbing: https://sql.telemetry.mozilla.org/queries/83897/source

Blocks: 1752239
Group: mozilla-employee-confidential
Component: Pipeline Ingestion → General
Whiteboard: [data-quality] → [dataquality]