Investigate duplicates in the reported data
Categories
(Data Platform and Tools :: Glean: SDK, defect, P1)
Tracking
(Not tracked)
People
(Reporter: Dexter, Unassigned)
References
Details
In bug 1525603 Chris found that we have a problem with duplicates:
(In reply to Chris H-C :chutten from bug 1525603 comment #2)
Taking a closer look at the dupes, only about two-thirds of them are full dupes (i.e., having the same docid). Over a third have different document ids.
We seem to have two different problems:
(1) full dupes with the same document id, hinting that we might be sending dupes spread across a long time period, or that the deduper on the pipeline is not catching them for other reasons;
(2) "half dupes", i.e. dupes with a different document id, hinting that we have a problem in the SDK where we re-use sequence numbers when we shouldn't (see the sketch below).
Comment 1•6 years ago
For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out. We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.
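To illustrate the idea (this is only a sketch of the centralized-dedup concept described above, using an in-memory map instead of the Redis cluster the pipeline actually uses):

import java.time.Duration
import java.time.Instant
import java.util.concurrent.ConcurrentHashMap

// Sketch of docId-based dedup with a 24-hour retention window. Because a
// single shared store sees every ping, a retry that hits a different frontend
// is still caught; keeping this state per server (as on AWS) lets some full
// dupes through.
class DocIdDeduper(private val window: Duration = Duration.ofHours(24)) {
    private val seen = ConcurrentHashMap<String, Instant>()

    /** Returns true if this docId has not been seen within the window. */
    fun firstSeen(docId: String, now: Instant = Instant.now()): Boolean {
        seen.entries.removeIf { Duration.between(it.value, now) > window }
        return seen.putIfAbsent(docId, now) == null
    }
}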
Reporter
Comment 2•6 years ago
(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)
For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out.
Yup, we're aware of this. However, we're seeing a 9% duplicate rate, which seems a bit too high. We're interested in understanding why they are not filtered out on the pipeline, in addition to fixing the root cause in the SDK. Knowing this would also point us in the right direction on the SDK side :-D
We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.
Nice!
Comment 3•6 years ago
we're seeing 9% of duplicates, which seems a bit too high
Indeed. In a validation exercise we undertook earlier this week, we saw an overall dupe rate below 1% on AWS. If something in the SDK is causing more retried Glean payloads than the general case, I could certainly see that pushing the dupe rate higher, but I'd agree we can't rule out an issue in the pipeline as a contributor to the dupes.
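For context, retries alone can produce full dupes even when the client behaves correctly. A hypothetical upload loop (not the Glean SDK's actual uploader) that re-sends under the same document id on a network error will create a second copy whenever the server stored the first attempt but the response was lost:

import java.io.IOException

// Hypothetical upload-with-retry sketch. Re-sending under the same docId is
// the intended behaviour (the pipeline dedupes on docId), but if the first
// attempt was persisted server-side and only the response was lost, the retry
// becomes a "full" dupe wherever dedup has a gap.
fun uploadWithRetry(
    docId: String,
    payload: ByteArray,
    send: (docId: String, payload: ByteArray) -> Boolean,
    maxAttempts: Int = 3
): Boolean {
    repeat(maxAttempts) {
        try {
            if (send(docId, payload)) return true // 2xx response: done
        } catch (e: IOException) {
            // Timeout or dropped response: the server may have stored the
            // ping anyway, so the next attempt is a potential full dupe.
        }
    }
    return false
}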
Comment 4•6 years ago
For reference: Here's a query to pick up a specific instance of "half-dupes" (2):
Comment 5•6 years ago
Interesting findings from the above query:
For these pings with the same seq but different doc_id, the time periods as marked in ping_info are all non-overlapping and at least 5 minutes apart, so racing on updating the seq in SharedPreferences seems unlikely. My best theory is that updating the seq number in SharedPreferences is just failing for a very long time for these clients...?
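A minimal sketch of that failure mode (not the actual Glean SDK code; the preference key name is made up): the sequence number lives in SharedPreferences, and if the write-back keeps failing, every subsequent ping reuses the same seq with a fresh doc_id, which is exactly the "half dupe" pattern.

import android.content.SharedPreferences
import android.util.Log

// Sketch of a per-ping sequence counter kept in SharedPreferences.
// The key name "glean_ping_seq" is hypothetical.
fun nextSequenceNumber(prefs: SharedPreferences): Int {
    val current = prefs.getInt("glean_ping_seq", 0)
    // commit() is synchronous and reports whether the write reached disk.
    val persisted = prefs.edit().putInt("glean_ping_seq", current + 1).commit()
    if (!persisted) {
        // If this write keeps failing (full disk, broken storage, ...), every
        // ping assembled afterwards reuses `current` with a new doc_id:
        // same {client_id, seq}, different document id, i.e. a "half dupe".
        Log.w("PingSeq", "Failed to persist next sequence number")
    }
    return current
}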
Comment 6•6 years ago
Time distribution of duped pings, by "full" (docid is duped, too) or "half" (docid is different, duped {client_id, seq} tuple) dupe: https://sql.telemetry.mozilla.org/queries/62490/source#160463
Most (over 90%) of dupes are received within a day of each other.
In fact, most are received within 10min of each other.
Comment 7•6 years ago
As mentioned in bug 1548819 comment 16, I accidentally learned that (at least on Firefox for Fire TV) a disproportionate number of these duplicate pings appear to be sent from clients running Android SDK 22 (Android 5.1 Lollipop).
Comment 8•5 years ago
Bug 1596810 is a newer investigation of this with Fenix (created accidentally as a dupe).