Closed Bug 1348008 Opened 7 years ago Closed 7 years ago

Verify that duplicates are being flagged properly

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: mreid)

References

Details

Attachments

(1 file, 1 obsolete file)

In particular:
- Check the rate of false positives (pings marked as duplicates but whose documentId is unique)
- Check the rate of remaining duplicates
- Check the distribution of "time until duplicate observed" below some upper bound
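A minimal sketch of how the first two checks above might be expressed, assuming a Spark DataFrame of pings with hypothetical columns document_id and was_tagged_duplicate (the real notebook and schema may differ):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical input path and column names; placeholders only.
pings = spark.read.parquet("s3://example-bucket/main_pings")

per_doc = pings.groupBy("document_id").agg(
    F.count("*").alias("copies"),
    F.sum(F.col("was_tagged_duplicate").cast("long")).alias("tagged"),
)

# False positives: documentId seen exactly once, yet tagged as a duplicate.
false_positives = per_doc.filter(
    (F.col("copies") == 1) & (F.col("tagged") > 0)
).count()

# Remaining duplicates: extra copies of repeated documentIds that were never tagged.
missed = per_doc.filter(F.col("copies") > 1).agg(
    F.sum(F.col("copies") - 1 - F.col("tagged")).alias("missed")
).first()["missed"]

print(false_positives, missed)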
We need to understand what we trade off to take this. Mark would like this done soon, but it's not yet clear who should pick it up.
Assignee: nobody → kparlante
Priority: -- → P2
Attached file DupesBug1348008_sanitized.ipynb (obsolete) —
I ran some analysis for this.

Source data: docType == main, appName == Firefox, 20170401 <= submissionDate <= 20170411

> - Check the rate of false positives (pings marked as duplicates but whose documentId is unique)
There were 7 false positives out of a total of 4,019,499,683 documents, or just under 2 per billion.

> - Check the rate of remaining duplicates
Duplicates detected: 28,843,079 (1,587,103 unique documentIds)
Duplicates missed: 82,160,531 (14,947,605 unique documentIds)

> - Check the distribution of "time until duplicate observed" below some upper bound
Using 11 days as the upper bound, the distribution of time-until-duplicate can be seen in the attached notebook. At log scale, it looks like a descending sawtooth pattern with peaks every 24 hours.
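For reference, a hedged sketch of how the time-until-duplicate distribution could be computed, using the same hypothetical dataset and assuming "timestamp" is submission time in epoch seconds:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
pings = spark.read.parquet("s3://example-bucket/main_pings")  # placeholder path

# Delay between each copy and the first sighting of its documentId.
w = Window.partitionBy("document_id").orderBy("timestamp")
delays = (
    pings
    .withColumn("first_seen", F.first("timestamp").over(w))
    .withColumn("hours_until_duplicate",
                (F.col("timestamp") - F.col("first_seen")) / 3600.0)
    .filter(F.col("hours_until_duplicate") > 0)
)

# Bucket by hour out to the 11-day upper bound (264 hours).
hourly = (delays
          .withColumn("hour_bucket", F.floor("hours_until_duplicate"))
          .filter(F.col("hour_bucket") < 264)
          .groupBy("hour_bucket").count()
          .orderBy("hour_bucket"))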
Assignee: kparlante → mreid
Attachment #8860059 - Attachment mime type: text/plain → application/json
Attachment #8860059 - Flags: review?(fbertsch)
Comment on attachment 8860059 [details]
DupesBug1348008_sanitized.ipynb

Analysis looks solid. I'm going to add a few more plots for us to explore :)
Attachment #8860059 - Flags: review?(fbertsch) → review+
The 24h peak is neat, but not necessarily surprising (e.g. daily wakeup/work times).

I'm sure you can think of other important criteria, but I'd be curious to know:
- What's the distribution of duplicate counts? Are a significant number of clients submitting more than one duplicate?
- What proportion of clients is affected?
- What proportion of pings is being duplicated?
I added two more plots.

1. The integral of "duplicate count by hours of day" up to x hours, which gives us the number of dupes we would expect to find if we set x hours as our dupe window. I have a red line at 4 hours, our current window.

I found that we've missed 32,006,549 dupes - these are dupes that occurred within the four-hour window. (Is this due to the Kafka streams not being partitioned by clientId or docId?)

2. Histogram of dupe counts per client. It's hard to read because one client has > 2,559,483 dupes and another has > 2,405,914; 99.999% of clients fall in the first bucket.
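For reference, the window-size plot in (1) is essentially a cumulative sum over the hourly delay histogram. A hedged pandas sketch, assuming the histogram has been exported to a CSV with columns hour_bucket and count (file name is made up):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the hourly time-until-duplicate histogram.
hist = pd.read_csv("hourly_dupe_counts.csv").sort_values("hour_bucket")

# Dupes caught with a window of x hours = cumulative count up to x.
plt.plot(hist["hour_bucket"], hist["count"].cumsum())
plt.axvline(x=4, color="red")  # current 4-hour dedup window
plt.xlabel("dedup window size (hours)")
plt.ylabel("dupes that would be caught")
plt.show()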
Attachment #8860059 - Attachment is obsolete: true
Attachment #8860375 - Flags: review?(mreid)
Comment on attachment 8860375 [details]
DupesBug1348008_new_plots

New plots look good, thanks!
Attachment #8860375 - Flags: review?(mreid) → review+
The expiration on the window is 255 minutes, but if the filter is full (26.8MM entries; the working set becomes smaller as the repeat offenders bubble up and hold on to the top slots) the oldest entries will be discarded immediately. The data is partitioned across 15 decoders (each with 26.8MM entries), so the buffer should be big enough to allow for the full 255 minutes. Any one duplicated documentId can have up to 15 duplicates, one per decoder.
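For intuition only, a simplified stand-in for one decoder's de-duplication window as described above (the real implementation is a probabilistic cuckoo filter, which is why false positives exist at all; this exact-set toy just mirrors the capacity, expiration and eviction numbers from the comment):

import time
from collections import OrderedDict

class DedupWindow:
    """Toy stand-in for one decoder's dedup filter: entries expire after
    ttl seconds, and when the filter is full the oldest entry is
    discarded immediately."""

    def __init__(self, max_entries=26_800_000, ttl=255 * 60):
        self.max_entries = max_entries
        self.ttl = ttl
        self.seen = OrderedDict()  # document_id -> first-seen timestamp

    def is_duplicate(self, document_id, now=None):
        now = now if now is not None else time.time()
        # Expire entries older than the ttl.
        while self.seen:
            oldest_id, ts = next(iter(self.seen.items()))
            if now - ts > self.ttl:
                self.seen.popitem(last=False)
            else:
                break
        if document_id in self.seen:
            return True
        # Full filter: discard the oldest entry immediately.
        if len(self.seen) >= self.max_entries:
            self.seen.popitem(last=False)
        self.seen[document_id] = now
        return False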
For the purposes of validating the behaviour of the de-duping cuckoo filter, I think we now have the info we need.

The false positive rate is extremely low, under 2 per billion.
We appear to tag about half of all dupes observed within 4 hours.

This is a major improvement over doing nothing, so I think we should go ahead and filter tagged dupes out of the primary "telemetry" dataset in bug 1342111.

In the future, there are two clear improvements we could make:
1. Increase the time window to ~24 hours, bringing the "best case" duplicate detection up from 65% to 90+%.
2. Partition the data between decoders based on documentId, ensuring that the same id always goes to the same dupe-checker.
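As a hedged illustration of improvement 2, the routing could be a stable hash of the documentId modulo the number of decoders (hash choice and decoder count here are placeholders):

import zlib

NUM_DECODERS = 15  # matches the decoder count mentioned earlier

def decoder_for(document_id: str) -> int:
    # Stable hash: the same documentId always lands on the same
    # dupe-checker, no matter which edge node received it.
    return zlib.crc32(document_id.encode("utf-8")) % NUM_DECODERS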

Thoughts :trink?
Flags: needinfo?(mtrinkala)
(In reply to Mark Reid [:mreid] from comment #9)
> 2. Partition the data between decoders based on documentId, ensuring that
> the same id always goes to the same dupe-checker.

FYI this conflicts with bug 1357275. We've discussed partitioning by clientId, and presumably that would handle duplicates as well. We could run the analysis to ensure that.
(In reply to Frank Bertsch [:frank] from comment #10)
They would be different topics. The raw topic would be keyed on documentId (we don't have the clientId yet) and the validated topic on clientId. If the partition key is missing then we could fall back on a UUID, giving us a random distribution. However, before de-duplication the raw topic partitions would definitely be unbalanced.
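A sketch of the key selection being described, with names invented purely for illustration (raw topic keyed by documentId since clientId isn't available yet, validated topic keyed by clientId, random UUID fallback when the key is missing):

import uuid

def partition_key(ping: dict, topic: str) -> str:
    # Raw topic: key on documentId (clientId is not parsed yet).
    # Validated topic: key on clientId.
    key = ping.get("document_id") if topic == "raw" else ping.get("client_id")
    # Missing key: fall back on a random UUID for an even spread.
    return key or str(uuid.uuid4())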
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #11)
> However, before de-duplication the raw topic partitions would definitely be unbalanced.

There are definitely document IDs we see very frequently which would cause imbalance; however, this is exactly the case where we want to feed them all through a single decoder to catch all the duplicates.

For example, looking at the data for yesterday (April 27), the top 5 most frequent documentIds occur this many times:
286,606
281,138
70,277
70,245
70,204

For April 26, the worst offenders have counts:
257,549
249,564
56,104
56,087
56,067

This doesn't seem all that bad in light of the overall submission volume.
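A back-of-envelope check (the daily volume here is an assumption, derived from the ~4.02B documents over 11 days reported earlier; the April 27 volume may differ):

# Hedged estimate of how much extra load the hottest documentId adds to
# one partition, assuming ~365M docs/day split across 15 partitions.
daily_docs = 4_019_499_683 / 11     # ~365M/day from the earlier analysis
per_partition = daily_docs / 15     # ~24.4M/day with even hashing
hottest_key = 286_606               # worst offender on April 27
print(hottest_key / per_partition)  # ~0.012, i.e. roughly 1% extra load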

:whd do you think it's feasible to partition the raw topic by documentId?
Flags: needinfo?(whd)
On another note, looking at these super-frequent duplicates, we should maybe just add a special case for some of the worst docids to drop them at the edge?
(In reply to Mark Reid [:mreid] from comment #12)

> :whd do you think it's feasible to partition the raw topic by documentId?

This brings up the very good point that at ingest time (raw topic) we don't have information such as documentId, so this is currently infeasible.
Flags: needinfo?(whd)
(In reply to Mark Reid [:mreid] from comment #13)
> On another note, looking at these super-frequent duplicates, we should maybe
> just add a special case for some of the worst docids to drop them at the
> edge?

We do have an openresty filter on the edge now that we could modify without requiring a change to the moz ingest logic.
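The real check would live in that OpenResty/Lua filter; purely as an illustration of the logic, it amounts to a small blocklist lookup before accepting a submission (the entries below are placeholders, not real documentIds):

# Illustrative only; the actual filter is Lua in OpenResty at the edge.
BLOCKED_DOC_IDS = {
    "00000000-0000-0000-0000-000000000000",  # placeholder worst offender
}

def should_reject(document_id: str) -> bool:
    return document_id in BLOCKED_DOC_IDS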
I'm going to call the "verification" part of this "done". We can follow up with a deployment strategy in bug 1342111.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard