Decide on uri-based dedup semantics for stub-installer pings
Categories
(Data Platform and Tools :: General, task)
Tracking
(Not tracked)
People
(Reporter: whd, Unassigned)
Details
Since the 16th we've had uri-based pubsub deduplication enabled on all pipeline family decoders. It occurred to me that semantically this is probably inappropriate for stub-installer pings, which AFAIK have no document id in the URL.
As a practical matter I think the number of varied parameters in stub installer GET requests precludes a significant number of "false-positive" duplicates, but we should verify this and perform a backfill if necessary. We'd also want to disable uri-based deduplication on the stub installer decoder only. Worth noting is that the stub_installer pipeline family, currently separated from structured and telemetry exclusively for operational convenience, would then exhibit what could be considered a user-facing difference in behavior from other pipeline families. Since this is a deprecated endpoint I don't think we need to update any documentation to this effect, but perhaps we should. Put another way, perhaps we should update deduplication documentation to explicitly mention uri instead of document id for the standard pipeline families.
An alternative is to declare deduplication by uri the correct (or correct enough) thing to do based on specific knowledge of the data and WONTFIX this bug.
Comment 1•3 years ago
I'm pretty sure the stub installer should not be deduplicating on URI, but if I remember correctly :mhowell owns the client and would know for sure.
Comment 2•3 years ago
Right, URI-based deduplication for stub pings doesn't seem appropriate, since they don't contain any unique identifiers. The closest things in there would be a few time durations that various phases took to run. Those can be pretty variable, but they're only reported to the nearest second, so I doubt there would be enough variation there to make URIs unique enough. The rest of the ping is just Firefox and OS version numbers and other pretty broad system parameters.
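To make the concern concrete, here is a minimal sketch of how first-seen-wins deduplication keyed on the URI can discard a ping from a different install. The URI layout below is hypothetical and heavily simplified (it is not the real stub ping format), and dedup_by_uri is an illustrative helper, not the decoder's actual code.

```python
def dedup_by_uri(uris):
    """Keep the first occurrence of each URI and drop later repeats."""
    seen = set()
    kept = []
    for uri in uris:
        if uri in seen:
            # Treated as a duplicate even if it came from a different install.
            continue
        seen.add(uri)
        kept.append(uri)
    return kept


# Hypothetical URIs: channel, OS, then phase durations in whole seconds.
pings = [
    "/stub/v8/release/win10/3/12/1",  # install A
    "/stub/v8/release/win10/3/12/1",  # install B, different machine, same values
    "/stub/v8/release/win10/4/12/1",  # install C
]

kept = dedup_by_uri(pings)
print(len(pings) - len(kept), "ping(s) discarded")
```

With a document id in the URL, installs A and B would produce different keys; without one, B is indistinguishable from a resubmission of A.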
Comment 3•3 years ago (Reporter)
As of the 26th (UTC) we're no longer deduplicating based on URI for the stub_installer pipeline family:
https://github.com/mozilla/gcp-ingestion/pull/1576
https://github.com/mozilla-services/cloudops-infra/pull/2910
NI :mreid to decide if backfill from payload_bytes_raw is necessary, and to prioritize it if it is. Conservatively the upper bound on discarded duplicates is 2.7%[1], but given the semantics here I'd guess the actual number of affected pings would be much lower than that. It's not super straightforward to sum live (with installer_type = 'stub') and error tables against the raw counts given the new dedup semantics and existing BQ sink dedup semantics. An analysis based on actual duplicate submission times or IPs might be appropriate (but I would hope unnecessary).
[1] select 1 - count(distinct(uri)) / count(uri) from payload_bytes_raw.stub_installer where date(submission_timestamp) > '2021-02-15'
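For readers who want to see what the query in [1] measures, the same ratio can be sketched in Python over an in-memory list of URIs. The function name and sample values are illustrative assumptions, not taken from the pipeline. The result is only an upper bound on affected pings because it counts every repeated URI as discarded, whether the repeat was a genuine resubmission or a distinct install with identical parameters.

```python
def duplicate_upper_bound(uris):
    """Fraction of rows that share a URI with an earlier row.

    Mirrors: 1 - count(distinct uri) / count(uri)
    """
    if not uris:
        return 0.0
    return 1 - len(set(uris)) / len(uris)


# Illustrative values only: four pings, one repeated URI.
sample = ["a", "a", "b", "c"]
print(duplicate_upper_bound(sample))  # 1 - 3/4
```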
Comment 4•3 years ago
I think we should add an item on the "notable events" list and accept that there may be a small number of discarded records in this 10-day window.
Comment 5•3 years ago (Reporter)
Updated•2 years ago (Assignee)