Closed
Bug 1348008
Opened 7 years ago
Closed 7 years ago
Verify that duplicates are being flagged properly
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Assigned: mreid)
References
Details
Attachments
(1 file, 1 obsolete file)
In particular:
- Check the rate of false positives (pings marked as duplicates but whose documentId is unique)
- Check the rate of remaining duplicates
- Check the distribution of "time until duplicate observed" below some upper bound
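The three checks can be illustrated with a toy sketch (hypothetical names; pings are modeled as (documentId, flagged_as_duplicate) pairs — this is not the actual pipeline code):

```python
from collections import Counter

def dedup_metrics(pings):
    """Compute false-positive and missed-duplicate counts for a stream of
    (documentId, flagged_as_duplicate) pairs. Toy sketch only."""
    id_counts = Counter(doc_id for doc_id, _ in pings)
    # False positive: flagged as duplicate, but the documentId occurs only once.
    false_positives = sum(1 for doc_id, flagged in pings
                          if flagged and id_counts[doc_id] == 1)
    # Missed duplicate: documentId was seen before, but this copy was not flagged.
    seen = set()
    missed = 0
    for doc_id, flagged in pings:
        if doc_id in seen and not flagged:
            missed += 1
        seen.add(doc_id)
    return false_positives, missed

pings = [("a", False), ("a", True), ("b", True), ("c", False), ("c", False)]
print(dedup_metrics(pings))  # (1, 1): "b" is a false positive, second "c" was missed
```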
Comment 1•7 years ago
Need to understand what we trade off to take this. Mark would like this done soon, but it's not clear who should pick it up.
Assignee: nobody → kparlante
Priority: -- → P2
Assignee
Comment 2•7 years ago
I ran some analysis for this.

Source data: docType == main, appName == Firefox, 20170401 <= submissionDate <= 20170411

> - Check the rate of false positives (pings marked as duplicates but whose documentId is unique)

There were 7 false positives out of a total of 4,019,499,683 documents, or just under 2 per billion.

> - Check the rate of remaining duplicates

Duplicates detected: 28,843,079 (1,587,103 unique documentIds)
Duplicates missed: 82,160,531 (14,947,605 unique documentIds)

> - Check the distribution of "time until duplicate observed" below some upper bound

Using 11 days as the upper bound, the distribution of time-until-duplicate can be seen in the attached notebook. At log scale, it looks like a descending sawtooth pattern with peaks every 24 hours.
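As a sanity check, the headline rates can be recomputed from the raw counts above (note the ~26% detected share counts all duplicates over the full 11-day sample, regardless of how long after the original they arrived — not just those within the 4-hour window):

```python
false_positives = 7
total_docs = 4_019_499_683
detected = 28_843_079
missed = 82_160_531

# False-positive rate, expressed per billion documents.
fp_per_billion = false_positives / total_docs * 1e9

# Share of all duplicate submissions that were tagged.
detected_share = detected / (detected + missed)

print(f"{fp_per_billion:.2f} false positives per billion")        # ~1.74
print(f"{detected_share:.1%} of duplicate submissions detected")  # ~26.0%
```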
Assignee
Updated•7 years ago
Assignee: kparlante → mreid
Assignee
Updated•7 years ago
Attachment #8860059 -
Attachment mime type: text/plain → application/json
Comment 3•7 years ago
gist link: https://gist.github.com/fbertsch/3d388761e383e52e826e4979dab0f81a
Assignee
Updated•7 years ago
Attachment #8860059 -
Flags: review?(fbertsch)
Comment 4•7 years ago
Comment on attachment 8860059 [details]
DupesBug1348008_sanitized.ipynb
Analysis looks solid. I'm going to add a few more plots for us to explore :)
Attachment #8860059 -
Flags: review?(fbertsch) → review+
Comment 5•7 years ago
The 24h peak is neat, but not necessarily surprising (e.g. daily wakeup/work times). I'm sure you can think of other important criteria, but I'd be curious to know:
- What's the distribution of duplicate counts? Is a significant number of clients submitting more than one duplicate?
- What proportion of clients is affected?
- What proportion of pings is being duplicated?
Comment 6•7 years ago
I added two more plots.

1. The integral of "duplicate count by hours of day" per x hours, which gives us the number of expected dupes found if we set x hours as our dupe window. I have a red line at 4 hours, our current window. I found that we've missed 32,006,549 dupes - these are dupes that occurred during the four-hour window. (Is this due to Kafka streams not being partitioned by clientId or docId?)
2. Histogram of number of dupe counts per client. Hard to see because there is one client with > 2,559,483 dupes, and another with > 2,405,914. 99.999% of clients fall in the first bucket.
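The plot in item 1 is effectively a cumulative sum over the time-until-duplicate histogram; a minimal sketch with made-up hourly counts (the real numbers live in the attached notebook):

```python
import itertools

# Toy histogram: number of duplicates first observed in each hour after the original.
dupes_per_hour = [50, 30, 20, 10, 8, 6, 4, 2]

# Cumulative dupes caught for each candidate window size of x hours.
caught_by_window = list(itertools.accumulate(dupes_per_hour))
total = caught_by_window[-1]

for hours, caught in enumerate(caught_by_window, start=1):
    print(f"{hours:2d}h window catches {caught}/{total} ({caught/total:.0%})")
```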
Attachment #8860059 -
Attachment is obsolete: true
Attachment #8860375 -
Flags: review?(mreid)
Assignee
Comment 7•7 years ago
Comment on attachment 8860375 [details]
DupesBug1348008_new_plots
New plots look good, thanks!
Attachment #8860375 -
Flags: review?(mreid) → review+
Comment 8•7 years ago
The expiration on the window is 255 minutes, but if the filter is full (26.8MM entries; the working set becomes smaller as the repeat offenders bubble up and hold on to the top slots) the oldest entries will be discarded immediately. The data is partitioned across 15 decoders (each with 26.8MM entries), so the buffer should be big enough to allow for the full 255 minutes. Any one duplicate documentId can have up to 15 duplicates (one per decoder).
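A simplified stand-in for this eviction behaviour (TTL expiry plus immediate discard of the oldest entry once capacity is reached) — the real implementation is a cuckoo filter, this sketch just uses an ordered map:

```python
from collections import OrderedDict

class BoundedDedupFilter:
    """Toy stand-in for a per-decoder dedup filter: entries expire after
    ttl_seconds, and when capacity is reached the oldest entry is evicted."""
    def __init__(self, capacity, ttl_seconds):
        self.capacity = capacity
        self.ttl = ttl_seconds
        self.entries = OrderedDict()  # doc_id -> insertion time

    def is_duplicate(self, doc_id, now):
        # Expire entries older than the TTL (oldest first).
        while self.entries:
            oldest_id, t = next(iter(self.entries.items()))
            if now - t > self.ttl:
                self.entries.pop(oldest_id)
            else:
                break
        if doc_id in self.entries:
            return True
        # Full: discard the oldest entry immediately, before its TTL is up.
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)
        self.entries[doc_id] = now
        return False

f = BoundedDedupFilter(capacity=2, ttl_seconds=255 * 60)
print(f.is_duplicate("a", 0))    # False
print(f.is_duplicate("a", 10))   # True
print(f.is_duplicate("b", 20))   # False
print(f.is_duplicate("c", 30))   # False: "a" evicted early to make room
print(f.is_duplicate("a", 40))   # False again, since "a" was discarded
```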
Assignee
Comment 9•7 years ago
For the purposes of validating the behaviour of the de-duping cuckoo filter, I think we now have the info we need. The false positive rate is extremely low, under 2 per billion. We appear to tag about half of all dupes observed within 4 hours. This is a major improvement over doing nothing, so I think we should go ahead and filter tagged dupes out of the primary "telemetry" dataset in bug 1342111.

In the future, there are two clear improvements we could make:
1. Increase the time window to ~24 hours, bringing the "best case" duplicate detection up from 65% to 90+%.
2. Partition the data between decoders based on documentId, ensuring that the same id always goes to the same dupe-checker.

Thoughts :trink?
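Improvement 2 amounts to stable hash partitioning; a sketch assuming 15 decoders (the hash choice here is illustrative, not what the pipeline uses):

```python
import hashlib

NUM_DECODERS = 15

def decoder_for(document_id: str) -> int:
    # Stable hash, so the same documentId always routes to the same decoder
    # and all of its duplicates hit the same dedup filter.
    digest = hashlib.sha1(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DECODERS

# Every copy of a given id lands on the same decoder.
assert decoder_for("some-doc-id") == decoder_for("some-doc-id")
print(decoder_for("some-doc-id"))
```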
Flags: needinfo?(mtrinkala)
Comment 10•7 years ago
(In reply to Mark Reid [:mreid] from comment #9)
> 2. Partition the data between decoders based on documentId, ensuring that the same id always goes to the same dupe-checker.

FYI this conflicts with bug 1357275. We've discussed partitioning by clientId, and presumably that would handle duplicates as well. We could run the analysis to ensure that.
Comment 11•7 years ago
(In reply to Frank Bertsch [:frank] from comment #10)

They would be different topics. The raw topic would be partitioned on documentId (we don't have the clientId yet) and the validated topic on clientId. If the partition key is missing, we could fall back on a UUID, giving us a random distribution. However, before de-duplication the raw topic partitions would definitely be unbalanced.
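The key-selection scheme described here might be sketched as follows (hypothetical function names; the UUID fallback spreads keyless messages uniformly at random across partitions):

```python
import hashlib
import uuid
from typing import Optional

NUM_PARTITIONS = 15

def partition_key(doc_id: Optional[str], client_id: Optional[str], topic: str) -> str:
    # Raw topic partitions on documentId (clientId isn't parsed yet);
    # validated topic partitions on clientId. A missing key falls back
    # to a random UUID, i.e. a uniform random partition assignment.
    key = doc_id if topic == "raw" else client_id
    return key if key is not None else str(uuid.uuid4())

def partition_for(key: str) -> int:
    # Illustrative stable hash, not the pipeline's actual partitioner.
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS
```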
Flags: needinfo?(mtrinkala)
Assignee
Comment 12•7 years ago
(In reply to Mike Trinkala [:trink] from comment #11)
> However, before de-duplication the raw topic partitions would definitely be unbalanced.

There are definitely document IDs we see very frequently, which would cause imbalance; however, this is exactly the case where we want to feed them all through a single decoder to catch all the duplicates.

For example, looking at the data for yesterday (April 27), the top 5 most frequent documentIds occur this many times: 286,606; 281,138; 70,277; 70,245; 70,204. For April 26, the worst offenders have counts: 257,549; 249,564; 56,104; 56,087; 56,067.

This doesn't seem all that bad in light of the overall submission volume. :whd do you think it's feasible to partition the raw topic by documentId?
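To put the worst offender in context, approximating daily volume from the 11-day sample in comment 2 (an estimate, not a measured figure):

```python
total_docs_11_days = 4_019_499_683
approx_daily_volume = total_docs_11_days / 11  # ~365M documents/day

worst_offender_apr27 = 286_606
share = worst_offender_apr27 / approx_daily_volume
print(f"worst offender is {share:.4%} of approx daily volume")  # roughly 0.08%
```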
Flags: needinfo?(whd)
Assignee
Comment 13•7 years ago
On another note, looking at these super-frequent duplicates, we should maybe just add a special case for some of the worst docids to drop them at the edge?
Comment 14•7 years ago
(In reply to Mark Reid [:mreid] from comment #12)
> :whd do you think it's feasible to partition the raw topic by documentId?

This brings up the very good point that at ingest time (raw topic) we don't have information such as documentId, so this is currently infeasible.
Flags: needinfo?(whd)
Comment 15•7 years ago
(In reply to Mark Reid [:mreid] from comment #13)
> On another note, looking at these super-frequent duplicates, we should maybe just add a special case for some of the worst docids to drop them at the edge?

We do have an openresty filter on the edge now that we could modify without requiring a change to the moz ingest logic.
Assignee
Comment 16•7 years ago
I'm going to call the "verification" part of this "done". We can follow up with a deployment strategy in bug 1342111.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard