Closed Bug 1751955 Opened 2 years ago Closed 2 years ago

Sanitize search engine values on ingestion for mobile telemetry

Categories

(Data Platform and Tools :: General, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: klukas, Assigned: klukas)

References

Details

(Whiteboard: [dataquality])

This is an immediate follow-up to https://bugzilla.mozilla.org/show_bug.cgi?id=1751753

We need to sanitize values for probes in Focus Android and Fenix Android, using the same allowlist of codes.

I need to identify exactly which probes are of concern for Fenix and Focus.

:ANich pointed me to https://dictionary.telemetry.mozilla.org/apps/fenix/metrics/browser_search_in_content which is metrics.labeled_counter.browser_search_in_content in the metrics ping. I haven't seen any problematic values there so far, but it's likely one we need to sanitize.

I've looked at telemetry.core, which has a searches field. There are some very low-incidence values there, but spot checking the last several months of data shows no entries that look problematic. The single-occurrence values are often Wikipedia variants like quicksearch.wikipedia-az.

The in-content string is produced in android-components; the source there gives context on the format:

<provider>.in-content.[sap|sap-follow-on|organic].code|none?
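
To make the format concrete, here is a minimal Python sketch of splitting such a label into its fields. It is not the android-components or gcp-ingestion code. The name parse_in_content, the InContentLabel type, and the optional trailing channel field are assumptions made for illustration, and "examplecode" in the usage note is a placeholder rather than a real partner code.

```python
from typing import NamedTuple, Optional


class InContentLabel(NamedTuple):
    provider: str
    search_type: str
    code: str
    channel: Optional[str]


# The three search types named in the format string above.
SEARCH_TYPES = {"sap", "sap-follow-on", "organic"}


def parse_in_content(label: str) -> Optional[InContentLabel]:
    """Split a dot-separated in-content label; return None if it does not fit.

    Assumption: four fields (provider, "in-content", type, code), optionally
    followed by a fifth channel field.
    """
    parts = label.split(".")
    if len(parts) not in (4, 5):
        return None
    if parts[1] != "in-content" or parts[2] not in SEARCH_TYPES:
        return None
    channel = parts[4] if len(parts) == 5 else None
    return InContentLabel(parts[0], parts[2], parts[3], channel)
```

For example, parse_in_content("google.in-content.sap.examplecode") yields provider "google", type "sap", code "examplecode", and no channel, while a label such as quicksearch.wikipedia-az does not match and returns None.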

The two fields I've found that potentially contain in-content strings are:

  • browser_search_in_content
  • browser_search_ad_clicks

PR available for review: https://github.com/mozilla/gcp-ingestion/pull/1961

Of note:

  • We replace invalid code values with "scrubbed", which differs slightly from the desktop case ("other.scrubbed"), because "." is already used as the separator between fields here
  • We process all labeled_counter metrics named like browser.search.*
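
The two points above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not the code from the PR: ALLOWED_CODES holds hypothetical placeholder values (the real allowlist is not reproduced here), the function names are invented, and summing the counts of labels that collide after scrubbing is one possible design choice.

```python
# Hypothetical placeholder allowlist; the real codes live in the pipeline.
ALLOWED_CODES = frozenset({"none", "examplecode"})


def sanitize_label(label: str) -> str:
    """Replace a code that is not on the allowlist with "scrubbed"."""
    parts = label.split(".")
    # Assumed layout: provider, "in-content", type, code, then optional extras.
    if len(parts) >= 4 and parts[1] == "in-content":
        if parts[3] not in ALLOWED_CODES:
            parts[3] = "scrubbed"
    return ".".join(parts)


def sanitize_metrics(labeled_counters: dict) -> dict:
    """Sanitize the labels of every labeled_counter named browser.search.*."""
    out = {}
    for metric, labels in labeled_counters.items():
        if not metric.startswith("browser.search."):
            out[metric] = labels
            continue
        cleaned = {}
        for label, count in labels.items():
            key = sanitize_label(label)
            # Labels that collapse to the same scrubbed key are summed.
            cleaned[key] = cleaned.get(key, 0) + count
        out[metric] = cleaned
    return out
```

Using a bare "scrubbed" keeps the label at the same number of dot-separated fields, so downstream parsing of the field positions is unaffected.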

:srose brought up in code review that the channel value should also be sanitized.

After discussion in Slack with :mconnor and :royang, we will drop any value in the channel position unless the value is "ts", both client-side and server-side. So we will accept that we won't be able to track modifications to channel values.
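
A minimal Python sketch of the channel rule, assuming the channel is a fifth dot-separated field after the code. The function name drop_unknown_channel is hypothetical, "examplecode" is a placeholder, and this is not the code that was deployed.

```python
def drop_unknown_channel(label: str) -> str:
    """Keep the channel field only when it is exactly "ts"; otherwise drop it."""
    parts = label.split(".")
    # Assumed layout: provider, "in-content", type, code, optional channel.
    if len(parts) > 4 and parts[1] == "in-content":
        kept = parts[:4]
        if parts[4:] == ["ts"]:
            kept.append("ts")
        return ".".join(kept)
    return label
```

With this rule, google.in-content.sap.examplecode.ts passes through unchanged, while any other trailing value is removed rather than replaced, so the original channel value is not recoverable.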

We deployed https://github.com/mozilla/gcp-ingestion/pull/1961 to the stage pipeline yesterday and found a much higher rate of scrubbing than expected. I looked into historical values in BQ and realized we were missing two cases:

  • Mobile has additional Baidu codes that start with digits, which weren't present in the pipeline PR
  • Mobile lowercases the values before sending, so Bing codes weren't being recognized
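
One way to handle both cases is to normalize case on both sides of the comparison and to avoid assuming that codes start with a letter. The Python sketch below is only illustrative: the entries in ALLOWED_CODES are hypothetical placeholders, not real Bing or Baidu codes, and is_allowed_code is an invented name.

```python
# Hypothetical placeholders: one mixed-case code and one starting with digits.
ALLOWED_CODES = ("ExampleCode", "1234example")

# Lowercase the allowlist once so that values lowercased by the client match.
_ALLOWED_LOWER = frozenset(code.lower() for code in ALLOWED_CODES)


def is_allowed_code(code: str) -> bool:
    """Case-insensitive membership check against the allowlist."""
    return code.lower() in _ALLOWED_LOWER
```

Here is_allowed_code("examplecode") is true even though the allowlist entry is mixed-case, and a code beginning with digits is accepted as long as it is listed.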

So additional changes are needed before we push this change to the prod pipeline.

This is in prod and confirmed working. See https://bugzilla.mozilla.org/show_bug.cgi?id=1752239 for some further work to fix up DDG data.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Group: mozilla-employee-confidential
Component: Pipeline Ingestion → General
Whiteboard: [data-quality] → [dataquality]