Closed Bug 1547234 Opened 5 years ago Closed 5 years ago

Investigate duplicates in the reported data

Categories

(Data Platform and Tools :: Glean: SDK, defect, P1)

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1586810

People

(Reporter: Dexter, Unassigned)

In bug 1525603 Chris found that we have a problem with duplicates:

(In reply to Chris H-C :chutten from bug 1525603 comment #2)

Taking a closer look at the Dupes, only about two-thirds of them are fully dupes (ie, having the same docid). Over a third have different document ids.

We seem to have two different problems:

(1) full dupes with the same document id, hinting that we might be sending dupes spread across a long time period or that the deduper on the pipeline is not catching them for other reasons;
(2) "half dupes", aka dupes with different document ids, hinting at a problem in the SDK where sequence numbers are re-used when they shouldn't be.
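
To make the two categories concrete, here is a minimal sketch of how pings could be split into "full" and "half" dupes. The field names (client_id, seq, doc_id) and the sample data are assumptions for illustration only; they are not the pipeline's actual schema or real telemetry.

```python
from collections import defaultdict


def classify_dupes(pings):
    """Return (full, half) dupe keys for a list of ping dicts.

    Each ping is a dict with the hypothetical keys client_id, seq, doc_id.
    """
    by_doc = defaultdict(list)
    by_seq = defaultdict(list)
    for p in pings:
        by_doc[p["doc_id"]].append(p)
        by_seq[(p["client_id"], p["seq"])].append(p)

    # Full dupes: the same document id was received more than once.
    full = {doc_id for doc_id, group in by_doc.items() if len(group) > 1}
    # Half dupes: one (client_id, seq) pair carries several distinct document ids.
    half = {
        key
        for key, group in by_seq.items()
        if len({p["doc_id"] for p in group}) > 1
    }
    return full, half


# Made-up sample data.
pings = [
    {"client_id": "a", "seq": 1, "doc_id": "d1"},
    {"client_id": "a", "seq": 1, "doc_id": "d1"},  # full dupe of d1
    {"client_id": "a", "seq": 2, "doc_id": "d2"},
    {"client_id": "a", "seq": 2, "doc_id": "d3"},  # half dupe: seq 2 re-used
    {"client_id": "b", "seq": 1, "doc_id": "d4"},
]
full, half = classify_dupes(pings)
```

With this split, `full` holds document ids and `half` holds (client_id, seq) pairs, so the two problems can be counted and investigated separately.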

For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out. We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.
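
As a rough illustration of why per-server history misses dupes that a central store catches, here is a toy model. The `SeenIds` class, the alternating routing function, and the timestamps are all assumptions made for this sketch; it is not the hindsight or Redis implementation.

```python
class SeenIds:
    """Remembers document ids for a limited time window (toy model)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.seen = {}  # doc_id -> timestamp when first recorded

    def is_duplicate(self, doc_id, now):
        first = self.seen.get(doc_id)
        if first is not None and now - first < self.ttl:
            return True
        # Unknown or expired: record it and let the ping through.
        self.seen[doc_id] = now
        return False


def count_passed(pings, stores, route):
    """Count pings that get past deduplication.

    pings is a list of (doc_id, timestamp); route(i) picks a store index.
    """
    passed = 0
    for i, (doc_id, ts) in enumerate(pings):
        if not stores[route(i)].is_duplicate(doc_id, ts):
            passed += 1
    return passed


DAY = 24 * 60 * 60
# Made-up traffic: d1 is retried after 1 minute and again after more than a day.
pings = [("d1", 0), ("d1", 60), ("d2", 120), ("d1", DAY + 120)]

# Two independent stores, pings alternate between them.
per_server = count_passed(pings, [SeenIds(DAY), SeenIds(DAY)], lambda i: i % 2)
# One shared store with 24 hours of history.
central = count_passed(pings, [SeenIds(DAY)], lambda i: 0)
```

In this toy run the split stores let every ping through, while the shared store drops the one-minute retry but still passes the retry that arrives after the 24 hour window.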

(In reply to Jeff Klukas [:klukas] (UTC-4) from comment #1)

For (1), we expect imperfect deduplication on the AWS pipeline due to maintaining seen docIds separately on the various hindsight servers; duplicate pings that hit different servers won't be filtered out.

Yup, we're aware of this. However, we're seeing 9% of duplicates, which seems a bit too high. We're interested in understanding why they are not filtered out on the pipeline, in addition to fixing the root cause in the SDK. Knowing this would also point us in the right direction on the SDK side :-D

We are observing substantially better deduplication performance on the GCP pipeline, which stores seen docIds centrally on a Redis cluster and maintains 24 hours of history.

Nice!

we're seeing 9% of duplicates, which seems a bit too high

Indeed. In a validation exercise we undertook earlier this week, we saw an overall dupe rate below 1% on AWS. But if something in the SDK is causing more retried glean payloads than the general case, I could certainly see that pushing the dupe rate higher. But I'd agree we can't rule out an issue in the pipeline as a contributor to the dupes.

For reference: Here's a query to pick up a specific instance of "half-dupes" (2):

https://sql.telemetry.mozilla.org/queries/62480/

Interesting findings from the above query:

These pings share the same seq but have different doc_ids. The time periods marked in ping_info are all non-overlapping and at least 5 minutes apart, so a race on updating the seq in SharedPreferences seems unlikely. My best theory is that updating the seq number in SharedPreferences is just failing for a very long time for these clients...?
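
The theory above can be sketched with a toy model: if the write that bumps the sequence number is silently lost, every new ping gets a fresh document id but the same seq. `FlakyStore` and `build_ping` are hypothetical names invented for this sketch; this is not the Glean SDK's code and says nothing about how SharedPreferences actually fails.

```python
import uuid


class FlakyStore:
    """Stand-in for a key-value store whose writes can silently fail."""

    def __init__(self, fail_writes=False):
        self.data = {}
        self.fail_writes = fail_writes

    def get(self, key, default):
        return self.data.get(key, default)

    def put(self, key, value):
        if not self.fail_writes:
            self.data[key] = value


def build_ping(store):
    """Read the current seq, try to persist seq + 1, return (seq, doc_id)."""
    seq = store.get("seq", 0)
    store.put("seq", seq + 1)  # if this write is lost, the next ping re-uses seq
    return seq, str(uuid.uuid4())


healthy_store = FlakyStore()
healthy = [build_ping(healthy_store) for _ in range(3)]

broken_store = FlakyStore(fail_writes=True)
broken = [build_ping(broken_store) for _ in range(3)]
```

The healthy store yields increasing seq values, while the broken one yields the same seq with three different document ids, which matches the shape of the "half dupes" without requiring any race.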

Time distribution of duped pings, by "full" (docid is duped, too) or "half" (docid is different, duped {client_id, seq} tuple) dupe: https://sql.telemetry.mozilla.org/queries/62490/source#160463

Most (over 90%) of dupes are received within a day of each other.
In fact, most are received within 10min of each other.
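
For anyone wanting to redo this outside of the linked query, a minimal sketch of the calculation might look like the following. The tuple layout and the sample timestamps are made up for illustration and are not taken from the real data.

```python
from collections import defaultdict


def dupe_gaps(pings):
    """Seconds between the first ping and each later ping sharing a key.

    pings is an iterable of (client_id, seq, received_at_seconds).
    """
    groups = defaultdict(list)
    for client_id, seq, ts in pings:
        groups[(client_id, seq)].append(ts)

    gaps = []
    for times in groups.values():
        times.sort()
        gaps.extend(t - times[0] for t in times[1:])
    return gaps


def share_within(gaps, limit_seconds):
    """Fraction of gaps that are at most limit_seconds long."""
    if not gaps:
        return 0.0
    return sum(g <= limit_seconds for g in gaps) / len(gaps)


DAY = 24 * 60 * 60
# Synthetic sample: three duped keys and one key seen only once.
sample = [
    ("a", 1, 0), ("a", 1, 300),
    ("b", 4, 1000), ("b", 4, 1000 + 7200),
    ("c", 2, 50), ("c", 2, 50 + 3 * DAY),
    ("d", 9, 10),
]
gaps = dupe_gaps(sample)
within_day = share_within(gaps, DAY)
within_10min = share_within(gaps, 10 * 60)
```

Grouping on doc_id instead of (client_id, seq) would give the same breakdown for "full" dupes.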

Priority: -- → P1
See Also: → 1554729

As mentioned in bug 1548819 comment 16, I accidentally learned that (at least on Firefox for Fire TV) it appears as though a disproportionate amount of these duplicate pings are sent from clients running Android SDK 22 (Android 5.1 Lollipop).

See Also: → 1552507
Blocks: 1552507

1596810 is a newer investigation of this with Fenix (created accidentally as a dupe).

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → DUPLICATE