Closed Bug 1348008 Opened 7 years ago Closed 7 years ago

Verify that duplicates are being flagged properly

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Assigned: mreid)

References

Details

Attachments

(1 file, 1 obsolete file)

In particular:
- Check the rate of false positives (pings marked as duplicates but whose documentId is unique)
- Check the rate of remaining duplicates
- Check the distribution of "time until duplicate observed" below some upper bound
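A minimal sketch of how the first two checks above might be expressed, assuming a Spark DataFrame of pings with hypothetical columns document_id and was_tagged_duplicate (the real notebook and schema may differ):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical input path and column names; placeholders only.
pings = spark.read.parquet("s3://example-bucket/main_pings")

per_doc = pings.groupBy("document_id").agg(
    F.count("*").alias("copies"),
    F.sum(F.col("was_tagged_duplicate").cast("long")).alias("tagged"),
)

# False positives: documentId seen exactly once, yet tagged as a duplicate.
false_positives = per_doc.filter(
    (F.col("copies") == 1) & (F.col("tagged") > 0)
).count()

# Remaining duplicates: extra copies of repeated documentIds that were never tagged.
missed = per_doc.filter(F.col("copies") > 1).agg(
    F.sum(F.col("copies") - 1 - F.col("tagged")).alias("missed")
).first()["missed"]

print(false_positives, missed)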
We need to understand what we trade off to take this. Mark would like this done soon, but it's not yet clear who should pick it up.
Assignee: nobody → kparlante
Priority: -- → P2
Attached file DupesBug1348008_sanitized.ipynb (obsolete) —
I ran some analysis for this.

Source data: docType == main, appName == Firefox, 20170401 <= submissionDate <= 20170411

> - Check the rate of false positives (pings marked as duplicates but whose documentId is unique)
There were 7 false positives out of a total of 4,019,499,683 documents, or just under 2 per billion.

> - Check the rate of remaining duplicates
Duplicates detected: 28,843,079 (1,587,103 unique documentIds)
Duplicates missed: 82,160,531 (14,947,605 unique documentIds)

> - Check the distribution of "time until duplicate observed" below some upper bound
Using 11 days as the upper bound, the distribution of time-until-duplicate can be seen in the attached notebook. At log scale, it looks like a descending sawtooth pattern with peaks every 24 hours.
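For reference, a hedged sketch of how the time-until-duplicate distribution could be computed, using the same hypothetical dataset and assuming "timestamp" is submission time in epoch seconds:

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
pings = spark.read.parquet("s3://example-bucket/main_pings")  # placeholder path

# Delay between each copy and the first sighting of its documentId.
w = Window.partitionBy("document_id").orderBy("timestamp")
delays = (
    pings
    .withColumn("first_seen", F.first("timestamp").over(w))
    .withColumn("hours_until_duplicate",
                (F.col("timestamp") - F.col("first_seen")) / 3600.0)
    .filter(F.col("hours_until_duplicate") > 0)
)

# Bucket by hour out to the 11-day upper bound (264 hours).
hourly = (delays
          .withColumn("hour_bucket", F.floor("hours_until_duplicate"))
          .filter(F.col("hour_bucket") < 264)
          .groupBy("hour_bucket").count()
          .orderBy("hour_bucket"))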
Assignee: kparlante → mreid
Attachment #8860059 - Attachment mime type: text/plain → application/json
Attachment #8860059 - Flags: review?(fbertsch)
Comment on attachment 8860059 [details]
DupesBug1348008_sanitized.ipynb

Analysis looks solid. I'm going to add a few more plots for us to explore :)
Attachment #8860059 - Flags: review?(fbertsch) → review+
The 24h peak is neat, but not necessarily surprising (e.g. daily wakeup/work times).

I'm sure you can think of other important criteria, but I'd be curious to know:
- What's the distribution of duplicate counts? Are a significant number of clients submitting more than one duplicate?
- What proportion of clients is affected?
- What proportion of pings is being duplicated?
I added two more plots.

1. The integral of "duplicate count by hours of day" up to x hours, which gives us the number of dupes we would expect to find if we set x hours as our dupe window. I have a red line at 4 hours, our current window.

I found that we've missed 32,006,549 dupes - these are dupes that occurred within the four-hour window. (Is this due to the Kafka streams not being partitioned by clientId or docId?)

2. Histogram of dupe counts per client. It's hard to read because one client has > 2,559,483 dupes and another has > 2,405,914; 99.999% of clients fall in the first bucket.
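For reference, the window-size plot in (1) is essentially a cumulative sum over the hourly delay histogram. A hedged pandas sketch, assuming the histogram has been exported to a CSV with columns hour_bucket and count (file name is made up):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the hourly time-until-duplicate histogram.
hist = pd.read_csv("hourly_dupe_counts.csv").sort_values("hour_bucket")

# Dupes caught with a window of x hours = cumulative count up to x.
plt.plot(hist["hour_bucket"], hist["count"].cumsum())
plt.axvline(x=4, color="red")  # current 4-hour dedup window
plt.xlabel("dedup window size (hours)")
plt.ylabel("dupes that would be caught")
plt.show()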
Attachment #8860059 - Attachment is obsolete: true
Attachment #8860375 - Flags: review?(mreid)
Comment on attachment 8860375 [details]
DupesBug1348008_new_plots

New plots look good, thanks!
Attachment #8860375 - Flags: review?(mreid) → review+
The expiration on the window is 255 minutes, but if the filter is full (26.8MM entries; the working set becomes smaller as the repeat offenders bubble up and hold on to the top slots) the oldest entries will be discarded immediately. The data is partitioned across 15 decoders (each with 26.8MM entries), so the buffer should be big enough to allow for the full 255 minutes. Any one duplicated documentId can have up to 15 duplicates, one per decoder.
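For intuition only, a simplified stand-in for one decoder's de-duplication window as described above (the real implementation is a probabilistic cuckoo filter, which is why false positives exist at all; this exact-set toy just mirrors the capacity, expiration and eviction numbers from the comment):

import time
from collections import OrderedDict

class DedupWindow:
    """Toy stand-in for one decoder's dedup filter: entries expire after
    ttl seconds, and when the filter is full the oldest entry is
    discarded immediately."""

    def __init__(self, max_entries=26_800_000, ttl=255 * 60):
        self.max_entries = max_entries
        self.ttl = ttl
        self.seen = OrderedDict()  # document_id -> first-seen timestamp

    def is_duplicate(self, document_id, now=None):
        now = now if now is not None else time.time()
        # Expire entries older than the ttl.
        while self.seen:
            oldest_id, ts = next(iter(self.seen.items()))
            if now - ts > self.ttl:
                self.seen.popitem(last=False)
            else:
                break
        if document_id in self.seen:
            return True
        # Full filter: discard the oldest entry immediately.
        if len(self.seen) >= self.max_entries:
            self.seen.popitem(last=False)
        self.seen[document_id] = now
        return False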
For the purposes of validating the behaviour of the de-duping cuckoo filter, I think we now have the info we need.

The false positive rate is extremely low, under 2 per billion.
We appear to tag about half of all dupes observed within 4 hours.

This is a major improvement over doing nothing, so I think we should go ahead and filter tagged dupes out of the primary "telemetry" dataset in bug 1342111.

In the future, there are two clear improvements we could make:
1. Increase the time window to ~24 hours, bringing the "best case" duplicate detection up from 65% to 90+%.
2. Partition the data between decoders based on documentId, ensuring that the same id always goes to the same dupe-checker.
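As a hedged illustration of improvement 2, the routing could be a stable hash of the documentId modulo the number of decoders (hash choice and decoder count here are placeholders):

import zlib

NUM_DECODERS = 15  # matches the decoder count mentioned earlier

def decoder_for(document_id: str) -> int:
    # Stable hash: the same documentId always lands on the same
    # dupe-checker, no matter which edge node received it.
    return zlib.crc32(document_id.encode("utf-8")) % NUM_DECODERS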

Thoughts :trink?
Flags: needinfo?(mtrinkala)
(In reply to Mark Reid [:mreid] from comment #9)
> 2. Partition the data between decoders based on documentId, ensuring that
> the same id always goes to the same dupe-checker.

FYI this conflicts with bug 1357275. We've discussed partitioning by clientId, and presumably that would handle duplicates as well. We could run the analysis to ensure that.
(In reply to Frank Bertsch [:frank] from comment #10)
They would be different topics. The raw topic would be keyed on documentId (we don't have the clientId yet) and the validated topic on clientId. If the partition key is missing then we could fall back on a UUID, giving us a random distribution. However, before de-duplication the raw topic partitions would definitely be unbalanced.
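A sketch of the key selection being described, with names invented purely for illustration (raw topic keyed by documentId since clientId isn't available yet, validated topic keyed by clientId, random UUID fallback when the key is missing):

import uuid

def partition_key(ping: dict, topic: str) -> str:
    # Raw topic: key on documentId (clientId is not parsed yet).
    # Validated topic: key on clientId.
    key = ping.get("document_id") if topic == "raw" else ping.get("client_id")
    # Missing key: fall back on a random UUID for an even spread.
    return key or str(uuid.uuid4())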
Flags: needinfo?(mtrinkala)
(In reply to Mike Trinkala [:trink] from comment #11)
> However, before de-duplication the raw topic partitions would definitely be unbalanced.

There are definitely document IDs we see very frequently which would cause imbalance; however, this is exactly the case where we want to feed them all through a single decoder to catch all the duplicates.

For example, looking at the data for yesterday (April 27), the top 5 most frequent documentIds occur this many times:
286,606
281,138
70,277
70,245
70,204

For April 26, the worst offenders have counts:
257,549
249,564
56,104
56,087
56,067

This doesn't seem all that bad in light of the overall submission volume.
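A back-of-envelope check (the daily volume here is an assumption, derived from the ~4.02B documents over 11 days reported earlier; the April 27 volume may differ):

# Hedged estimate of how much extra load the hottest documentId adds to
# one partition, assuming ~365M docs/day split across 15 partitions.
daily_docs = 4_019_499_683 / 11     # ~365M/day from the earlier analysis
per_partition = daily_docs / 15     # ~24.4M/day with even hashing
hottest_key = 286_606               # worst offender on April 27
print(hottest_key / per_partition)  # ~0.012, i.e. roughly 1% extra load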

:whd do you think it's feasible to partition the raw topic by documentId?
Flags: needinfo?(whd)
On another note, looking at these super-frequent duplicates, we should maybe just add a special case for some of the worst docids to drop them at the edge?
(In reply to Mark Reid [:mreid] from comment #12)

> :whd do you think it's feasible to partition the raw topic by documentId?

This brings up the very good point that at ingest time (raw topic) we don't have information such as documentId, so this is currently infeasible.
Flags: needinfo?(whd)
(In reply to Mark Reid [:mreid] from comment #13)
> On another note, looking at these super-frequent duplicates, we should maybe
> just add a special case for some of the worst docids to drop them at the
> edge?

We do have an openresty filter on the edge now that we could modify without requiring a change to the moz ingest logic.
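The real check would live in that OpenResty/Lua filter; purely as an illustration of the logic, it amounts to a small blocklist lookup before accepting a submission (the entries below are placeholders, not real documentIds):

# Illustrative only; the actual filter is Lua in OpenResty at the edge.
BLOCKED_DOC_IDS = {
    "00000000-0000-0000-0000-000000000000",  # placeholder worst offender
}

def should_reject(document_id: str) -> bool:
    return document_id in BLOCKED_DOC_IDS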
I'm going to call the "verification" part of this "done". We can follow up with a deployment strategy in bug 1342111.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard