Closed Bug 1694764 Opened 3 years ago Closed 3 years ago

Decide on uri-based dedup semantics for stub-installer pings

Categories

(Data Platform and Tools :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: whd, Unassigned)

Details

Since the 16th we've had uri-based pubsub deduplication enabled on all pipeline family decoders. It occurred to me that semantically this is probably inappropriate for stub-installer pings, which AFAIK have no document id in the URL.

As a practical matter I think the number of varied parameters in stub installer GET requests precludes a significant number of "false-positive" duplicates, but we should verify this and perform a backfill if necessary. We'd also want to disable uri-based deduplication on the stub installer decoder only. Worth noting is that the stub_installer pipeline family, currently separated from structured and telemetry exclusively for operational convenience, would then exhibit what could be considered a user-facing difference in behavior from other pipeline families. Since this is a deprecated endpoint I don't think we need to update any documentation to this effect, but perhaps we should. Put another way, perhaps we should update deduplication documentation to explicitly mention uri instead of document id for the standard pipeline families.

An alternative is to declare deduplication by uri the correct (or correct enough) thing to do based on specific knowledge of the data and WONTFIX this bug.

I'm pretty sure the stub installer should not be deduplicating on URI, but if I remember correctly :mhowell owns the client and would know for sure.

Flags: needinfo?(mhowell)

Right, URI-based deduplication for stub pings doesn't seem appropriate, since they don't contain any unique identifiers. The closest things in there would be a few time durations that various phases took to run. Those can be pretty variable, but they're only reported to the nearest second, so I doubt there would be enough variation there to make URIs unique enough. The rest of the ping is just Firefox and OS version numbers and other pretty broad system parameters.
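To make the concern concrete, here is a minimal sketch of exact-match URI deduplication. It is not the gcp-ingestion implementation; `dedup_by_uri` and all of the URIs below are hypothetical, chosen only to contrast a URI that carries a per-document id with one built from coarse parameters.

```python
from typing import Iterable, List


def dedup_by_uri(uris: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each URI and drop later exact matches.

    Hypothetical helper for illustration only; the real pipeline's
    deduplication logic is not reproduced here.
    """
    seen = set()
    kept = []
    for uri in uris:
        if uri in seen:
            continue  # treated as a duplicate submission
        seen.add(uri)
        kept.append(uri)
    return kept


# Illustrative URIs (assumed shapes, not real submissions).
# The telemetry-style URI embeds a document id, so an exact match
# really is a resend of the same document.
telemetry_uris = [
    "/submit/telemetry/doc-id-1/main/Firefox/86.0/release/20210222142601",
    "/submit/telemetry/doc-id-1/main/Firefox/86.0/release/20210222142601",
    "/submit/telemetry/doc-id-2/main/Firefox/86.0/release/20210222142601",
]

# The stub-style URI has only versions, broad system parameters, and
# durations rounded to whole seconds, so two different installs can collide.
stub_uris = [
    "/stub/v8/release/en-US/10/0/3/2/5",  # install on machine A
    "/stub/v8/release/en-US/10/0/3/2/5",  # install on machine B, same values
    "/stub/v8/release/en-US/10/0/4/2/5",  # one phase took a second longer
]

if __name__ == "__main__":
    print(len(dedup_by_uri(telemetry_uris)))  # resend correctly dropped
    print(len(dedup_by_uri(stub_uris)))       # machine B's ping wrongly dropped
```

In both cases one record is discarded, but only in the telemetry-style case is that record known to be a true duplicate.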

Flags: needinfo?(mhowell)

As of UTC 26th we're no longer deduplicating based on URI for the stub_installer pipeline family:

https://github.com/mozilla/gcp-ingestion/pull/1576
https://github.com/mozilla-services/cloudops-infra/pull/2910

NI :mreid to decide if backfill from payload_bytes_raw is necessary, and to prioritize it if it is. Conservatively the upper bound on discarded duplicates is 2.7%[1], but given the semantics here I'd guess the actual number of affected pings would be much lower than that. It's not super straightforward to sum live (with installer_type = 'stub') and error tables against the raw counts given the new dedup semantics and existing BQ sink dedup semantics. An analysis based on actual duplicate submission times or IPs might be appropriate (but I would hope unnecessary).

[1] select 1 - count(distinct(uri)) / count(uri) from payload_bytes_raw.stub_installer where date(submission_timestamp) > '2021-02-15'
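For readers who want the arithmetic behind [1] spelled out, the following sketch computes the same ratio over an in-memory list of URIs. `duplicate_uri_fraction` is a hypothetical name and the sample input is made up; the 2.7% figure above comes from the real table, not from this snippet.

```python
from typing import Iterable


def duplicate_uri_fraction(uris: Iterable[str]) -> float:
    """Return 1 - (distinct URIs / total URIs).

    This is the share of rows that exact-match URI dedup would discard,
    i.e. an upper bound on wrongly dropped pings, since some of those
    rows may be genuine resends.
    """
    rows = list(uris)
    if not rows:
        return 0.0  # avoid dividing by zero on an empty window
    return 1 - len(set(rows)) / len(rows)


if __name__ == "__main__":
    # Made-up sample: four rows, three distinct URIs.
    sample = ["/stub/a", "/stub/a", "/stub/b", "/stub/c"]
    print(duplicate_uri_fraction(sample))  # 0.25
```

The bound is conservative because it counts every repeated URI as a loss, whether it was a client retry or a distinct install that happened to produce the same parameters.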

Flags: needinfo?(mreid)

I think we should add an item on the "notable events" list and accept that there may be a small number of discarded records in this 10-day window.

Flags: needinfo?(mreid)
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Component: Pipeline Ingestion → General