Investigate duplicates in the reported data

Status: NEW (Unassigned)
Type: defect
Priority: P1
Severity: normal
Opened: 2 months ago
Last modified: 20 days ago
Reporter: Dexter

Description • Dexter (Reporter) • 2 months ago

In bug 1525603 Chris found that we have a problem with duplicates:

(In reply to Chris H-C :chutten from bug 1525603 comment #2)

> Taking a closer look at the dupes, only about two-thirds of them are fully dupes (i.e., having the same docid). Over a third have different document ids.

We seem to have two different problems:

(1) full dupes with the same document id, hinting that we might be sending dupes spread across a long time period, or that the deduper on the pipeline is not catching them for other reasons;
(2) "half dupes", i.e. dupes with a different document id, hinting that we have a problem in the SDK with re-using sequence numbers when we shouldn't.

Comment 1 • Jeff Klukas [:klukas] (UTC-4) • 2 months ago

For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out. We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.
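To make that strategy concrete, here's a minimal Kotlin sketch of the centralized approach: remember each seen docId for 24 hours and drop anything already in the window. The names (DocIdDeduper, checkAndRecord) are illustrative, not actual pipeline code.

```kotlin
import java.time.Instant

// Hypothetical sketch of centralized docId dedup with a 24-hour window,
// analogous to the Redis-backed approach described above. Not actual pipeline code.
class DocIdDeduper(private val ttlSeconds: Long = 24L * 60 * 60) {
    private val seen = HashMap<String, Instant>()

    /** Returns true if the ping is new, false if its docId was seen within the window. */
    fun checkAndRecord(docId: String, now: Instant = Instant.now()): Boolean {
        // Expire entries older than the window (Redis would do this via key TTLs).
        seen.entries.removeIf { it.value.isBefore(now.minusSeconds(ttlSeconds)) }
        return seen.putIfAbsent(docId, now) == null
    }
}

fun main() {
    val deduper = DocIdDeduper()
    println(deduper.checkAndRecord("doc-1")) // true: first sighting
    println(deduper.checkAndRecord("doc-1")) // false: full dupe, same docId
    println(deduper.checkAndRecord("doc-2")) // true: a "half dupe" passes docId dedup
}
```

Note that if each ingestion server keeps its own seen map, as in the AWS/hindsight setup, a duplicate that lands on a different server passes this check; centralizing the store, as with Redis on GCP, closes that gap.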

Comment 2 • Dexter (Reporter) • 2 months ago

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)

> For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out.

Yup, we're aware of this. However, we're seeing a 9% duplicate rate, which seems a bit too high. We're interested in understanding why these pings are not filtered out on the pipeline, in addition to fixing the root cause in the SDK. Knowing this would also point us in the right direction on the SDK side :-D

> We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.

Nice!

> we're seeing a 9% duplicate rate, which seems a bit too high

Indeed. In a validation exercise we undertook earlier this week, we saw an overall dupe rate below 1% on AWS. If something in the SDK is causing more retried Glean payloads than the general case, I could certainly see that pushing the dupe rate higher, but I agree we can't rule out an issue in the pipeline as a contributor to the dupes.
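For clarity on the numbers above, here's a minimal sketch of a dupe-rate computation, assuming "dupe rate" means the fraction of received pings whose docId was already seen; that definition is an assumption, not taken from the pipeline code.

```kotlin
// Minimal sketch, assuming "dupe rate" = duplicated docIds / total pings received.
// The definition is an assumption, not taken from the pipeline code.
fun dupeRate(docIds: List<String>): Double {
    if (docIds.isEmpty()) return 0.0
    val duplicates = docIds.size - docIds.toSet().size
    return duplicates.toDouble() / docIds.size
}
```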

For reference, here's a query that picks up a specific instance of the "half dupes" (2):

https://sql.telemetry.mozilla.org/queries/62480/

Interesting findings from the above query:

For these pings with the same seq but different doc_id, the time periods as marked in ping_info are all non-overlapping and at least 5 minutes apart, so racing on updating the seq in SharedPreferences seems unlikely. My best theory is that updating the seq number in SharedPreferences just keeps failing for a very long time for these clients...?
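To illustrate that theory, here's a hypothetical Kotlin sketch of the failure mode; SeqStore and assemblePing are invented stand-ins for the Glean internals, with SeqStore playing the role of SharedPreferences (whose Editor.commit() returns false when a write fails).

```kotlin
import java.util.UUID

// Hypothetical sketch of the suspected SDK failure mode; this is not the actual
// Glean code. SeqStore stands in for SharedPreferences.
interface SeqStore {
    fun get(key: String, default: Int): Int
    fun put(key: String, value: Int): Boolean // false when the write fails
}

data class PingInfo(val docId: String, val seq: Int)

fun assemblePing(store: SeqStore, pingName: String): PingInfo {
    val key = "${pingName}_seq"
    val seq = store.get(key, 0)
    val persisted = store.put(key, seq + 1)
    if (!persisted) {
        // If this write keeps failing and the failure is ignored, every later ping
        // gets a fresh random docId but the same seq: exactly a "half dupe".
    }
    return PingInfo(docId = UUID.randomUUID().toString(), seq = seq)
}
```

Since the pings in the query are at least 5 minutes apart, a persistently failing write like this fits the data better than a race between threads.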

Comment 6 • 2 months ago

Time distribution of duped pings, split by "full" (docid is duped too) vs. "half" (different docid, duped {client_id, seq} tuple): https://sql.telemetry.mozilla.org/queries/62490/source#160463

Most (over 90%) of the dupes are received within a day of each other; in fact, most are received within 10 minutes of each other.
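For reference, a hypothetical sketch of this classification (field names are illustrative, not the actual ping schema): pings sharing a {client_id, seq} tuple are grouped, and each dupe is labelled "full" when the docId also matches, "half" otherwise, together with how long after the original it arrived.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical sketch of the full-vs-half classification; field names are
// illustrative, not the actual ping schema.
data class Ping(val clientId: String, val seq: Int, val docId: String, val received: Instant)

enum class DupeKind { FULL, HALF }

// Group pings by {client_id, seq}; within each group, every ping after the first
// is a dupe: FULL when it repeats the docId, HALF otherwise. Also report the
// arrival delay of each dupe relative to the original.
fun classifyDupes(pings: List<Ping>): List<Pair<DupeKind, Duration>> =
    pings.groupBy { it.clientId to it.seq }
        .values
        .filter { it.size > 1 }
        .flatMap { group ->
            val first = group.minByOrNull { it.received }!!
            group.filter { it !== first }.map { dupe ->
                val kind = if (dupe.docId == first.docId) DupeKind.FULL else DupeKind.HALF
                kind to Duration.between(first.received, dupe.received)
            }
        }
```

Bucketing the resulting durations is what yields the distribution above (over 90% within a day, most within 10 minutes).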

Updated • Dexter (Reporter) • last month
Priority: -- → P1

Updated • Dexter (Reporter) • 20 days ago
See Also: → bug 1554729