Based on some investigations into a large number of submissions from a single client, we discovered that many of the submissions had the same document id. We should retain document ids for at least a short time window and de-duplicate incoming submissions in the ingestion pipeline. A short window should be enough to take care of this particular problem, even if we can't easily solve the "global uniqueness" case.
https://reports.telemetry.mozilla.org/post/projects/problematic_client.kp
https://sql.telemetry.mozilla.org/queries/3244/source
Chutten has put together the top docIds and their counts for the problematic client that led to this discussion: https://gist.github.com/chutten/786bbcca8f848ac65ad75daf9f1a24f5
Per today's discussion, we want to:
1) Create a custom filter to detect duplicate document ids.
2) Deploy this logic as an input while reading from Kafka, and append a Heka field that tags records as duplicates.
3) Update the Telemetry data warehouse loader (DWL) to filter the dupes out of the main data lake based on the "duplicate" field above.
4) Add a new output on the DWL to store these potential dupes in a separate location on S3 for later analysis and verification.

For #1 above, the intention is to implement a cuckoo filter with accuracy tuned to ensure no more than approximately 1 in 1,000,000 false positives. This should ensure that we end up with no duplicates in the output, but may redirect a small number of non-duplicate documents per #4.

We can roll out #1 and #2 without any effect on the current behaviour of the system, and then do steps #3 and #4 once we have tested that the rate of false positives is acceptably low.

The intention is to continue to make all data (including duplicates) available to the CEP for the purposes of monitoring and analysis, but to at least make it easy to detect and filter duplicates using the annotated field in #2.
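To make steps #1 and #2 concrete, here is a minimal sketch of a cuckoo filter used to tag records. It is an illustration only, not the production filter: the class, the `tag_duplicates` helper, the `docId` record key, and all sizing defaults (bucket count, 4 slots per bucket, 16-bit fingerprints) are assumptions made for the example. Only the "duplicate" field name comes from the plan above.

```python
import hashlib
import random


class CuckooFilter:
    """Toy cuckoo filter: stores short fingerprints, two candidate buckets per item."""

    def __init__(self, num_buckets=1 << 16, bucket_size=4, fp_bits=16,
                 max_kicks=500, seed=0):
        # A power-of-two bucket count makes the XOR alternate-index trick reversible.
        assert num_buckets & (num_buckets - 1) == 0
        self.mask = num_buckets - 1
        self.bucket_size = bucket_size
        self.fp_bits = fp_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, doc_id: str) -> int:
        # Fingerprints are kept non-zero, in the range 1 .. 2^fp_bits - 1.
        return self._hash(b"fp:" + doc_id.encode()) % ((1 << self.fp_bits) - 1) + 1

    def _index(self, doc_id: str) -> int:
        return self._hash(doc_id.encode()) & self.mask

    def _alt_index(self, index: int, fp: int) -> int:
        # Applying this twice returns the original index.
        return (index ^ self._hash(fp.to_bytes(4, "big"))) & self.mask

    def contains(self, doc_id: str) -> bool:
        fp = self._fingerprint(doc_id)
        i1 = self._index(doc_id)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def add(self, doc_id: str) -> bool:
        fp = self._fingerprint(doc_id)
        i1 = self._index(doc_id)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a random fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is considered full


def tag_duplicates(records, cf):
    """Annotate each record with a boolean "duplicate" field; nothing is dropped."""
    for record in records:
        doc_id = record["docId"]  # hypothetical key for the document id
        record["duplicate"] = cf.contains(doc_id)
        if not record["duplicate"]:
            cf.add(doc_id)
        yield record
```

Because the filter only stores fingerprints, a "duplicate" tag can occasionally be a false positive, which is why steps #3 and #4 route tagged records to a separate location rather than deleting them.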
cf. https://chuttenblog.wordpress.com/2017/02/23/data-science-is-hard-anomalies-and-what-to-do-about-them/
I had a discussion with someone at Google about another strange, massive volume of Firefox traffic, but with a fake UA: https://gist.github.com/cramforce/63a8f25639b0201a7c48c90f7a4e1eba
That is interesting, but I think unrelated. I agree that it's fairly likely that it is someone using a spoofed user-agent for reasons that aren't entirely clear. (Well, the WebKit mobile monoculture is a pretty clear reason, but probably isn't specific enough :S )
The measured false positive rate for the production configuration is: 0.000025
So 25 in a million (running at a lower capacity did not help as much as I expected). Will this be acceptable, or do I need to increase the fingerprint size (basically doubling the storage requirements)?
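For a rough sense of the trade-off between fingerprint size and false positives, the sketch below uses the standard cuckoo filter approximation that the false positive rate is bounded by about 2b / 2^f for b slots per bucket and f fingerprint bits. The bucket size of 4 is an assumption (the thread does not state the production bucket or fingerprint size), and the function names are made up for the example.

```python
import math


def fpr_upper_bound(fp_bits: int, bucket_size: int = 4) -> float:
    """Approximate upper bound on the false positive rate: 2b / 2^f."""
    return 2 * bucket_size / 2 ** fp_bits


def min_fingerprint_bits(target_fpr: float, bucket_size: int = 4) -> int:
    """Smallest fingerprint width whose approximate bound meets target_fpr."""
    return math.ceil(math.log2(2 * bucket_size / target_fpr))


def expected_false_positives(rate: float, submissions: int) -> float:
    """Expected number of non-duplicates wrongly tagged as duplicates."""
    return rate * submissions


if __name__ == "__main__":
    # Measured rate from this comment, applied to one million submissions.
    print(expected_false_positives(0.000025, 1_000_000))
    # Fingerprint width needed for the original 1-in-a-million target,
    # under the assumed bucket size.
    print(min_fingerprint_bits(1e-6))
```

The bound is a worst case at high occupancy, so a measured rate can sit below it; the numbers are only meant to show how quickly the required fingerprint width grows as the target rate shrinks.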
(In reply to Mike Trinkala [:trink] from comment #6)
> The measured false positive rate for the production configuration is:
> 0.000025
> So 25 in a million (running at a lower capacity did not help as much as I
> expected). Will this be acceptable, or do I need to increase the fingerprint
> size (basically doubling the storage requirements)?

I'm happy starting with that.
The duplicate docid testing has finally begun (for all practical purposes, this version will produce 0 false positives):
https://hsadmin.trink.com/dashboard_output/graphs/analysis.moz_docid_dupes.message_per_minute.html
(FYI, hsadmin is only processing about 10% of the traffic.)
The filter covers a 256-minute window or ~27MM entries; when either limit is reached, the oldest entries are discarded.
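The "either limit" eviction rule can be sketched as follows. This is an exact, dictionary-based stand-in rather than the cuckoo filter actually under test, and it assumes timestamps arrive as non-decreasing minute counts. The class and method names are hypothetical.

```python
from collections import OrderedDict


class WindowedDocIds:
    """Remember doc ids until they age out of the window or capacity is exceeded."""

    def __init__(self, window_minutes=256, max_entries=27_000_000):
        self.window_minutes = window_minutes
        self.max_entries = max_entries
        self.seen = OrderedDict()  # doc_id -> minute first seen, oldest first

    def _evict(self, now_minute):
        # Discard the oldest entries while either limit is exceeded.
        while self.seen:
            _, oldest_minute = next(iter(self.seen.items()))
            too_old = now_minute - oldest_minute >= self.window_minutes
            too_many = len(self.seen) > self.max_entries
            if not (too_old or too_many):
                break
            self.seen.popitem(last=False)

    def check_and_add(self, doc_id, now_minute) -> bool:
        """Return True if doc_id was already seen inside the current window."""
        self._evict(now_minute)
        if doc_id in self.seen:
            return True
        self.seen[doc_id] = now_minute
        self._evict(now_minute)
        return False
```

An exact structure like this never produces false positives but costs far more memory per entry than fingerprints would, which is the trade-off the cuckoo filter is meant to address.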
De-duping has been added to the telemetry ping decoder. At this point in time it will only add Fields[duplicate_delta] to the message, i.e.

    name: duplicate_delta
    type: 2
    representation: 1m
    value: 4

The delta value is the number of intervals since the previous duplicate (in this case 4); the representation is the number of minutes each interval represents. So this duplicate came in ~4 minutes later (technically 3 < x < 5).

This allows us to examine the distribution and tune the de-duplication window. It also allows one to query messages within the distribution, e.g. "show me some duplicates that were really delayed", as they probably had a different root cause than the high-frequency duplicates.
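As an illustration of how an interval-based delta like this could be derived, the sketch below buckets timestamps into fixed intervals and reports the difference in interval index between consecutive sightings of the same document id. It is not the decoder's implementation; the `DeltaTagger` class and the use of epoch seconds are assumptions for the example.

```python
from collections import Counter


class DeltaTagger:
    """Report how many intervals passed since a document id was last seen."""

    def __init__(self, interval_minutes=1):
        self.interval_s = interval_minutes * 60
        self.representation = f"{interval_minutes}m"
        self.last_interval = {}  # doc_id -> interval index of the last sighting

    def process(self, doc_id, ts_s):
        """Return a duplicate_delta field for a repeat sighting, else None."""
        interval = int(ts_s // self.interval_s)
        prev = self.last_interval.get(doc_id)
        self.last_interval[doc_id] = interval
        if prev is None:
            return None
        return {
            "name": "duplicate_delta",
            "value": interval - prev,
            "representation": self.representation,
        }


def delta_histogram(fields):
    """Count duplicates per delta value, skipping first sightings (None)."""
    return Counter(f["value"] for f in fields if f is not None)
```

Because both timestamps are truncated to an interval index, a delta of 4 with one-minute intervals corresponds to an actual gap strictly between 3 and 5 minutes, matching the "3 < x < 5" note above.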
Assigning to whd for deployment. The moz_telemetry ping decoder cfg should be updated with:

    -- number of items in the de-duping cuckoo filter
    cf_items = 32e6,
    -- interval size in minutes for cuckoo filter pruning
    cf_interval_size = 1,
This was just deployed (https://github.com/mozilla-services/puppet-config/commit/ff6bdc7a92285df102da214efc5ca75e6c020413, in addition to the changes in https://github.com/mozilla-services/puppet-config/pull/2515), and I see duplicate_delta showing up as a field on the CEP. As we're not dropping things based on this information I assume this bug isn't complete yet so I will simply un-assign myself.
See the last few comments in Bug 1348008 for some suggestions about deploying this change.
The big docids are:
975fae0b-70f3-4a57-823c-8a4c791655e9
22952ba9-6e8a-4de1-a5c5-71c1d25bbb16
358f9afc-75bc-406b-be35-489e1b82804c
85bcc4bf-42dd-46dc-a453-daddebf49379

The second rank (two orders of magnitude less frequent) are:
1fa0c61d-0137-4f4a-9017-de584ed98b96
708aebb1-0688-4051-8ced-35365d199240
86a93bef-c5be-4aaf-b08c-2d4ec81d2c24
edb125ee-0cf0-46ce-b9da-94c6dae49703

(There are others, up to 15 that I've seen from this client id specifically, but they are in low enough volumes that we can skip them if you'd like.)
There are a few options for our discard procedure (in order of preference):
1) Create a new message for each duplicate, e.g. "telemetry.duplicate", containing only the URI information (id, docType, appName, appVersion, appUpdateChannel, appBuildId) so we don't have to parse/validate/discard.
2) Create some kind of summary report with counts that is emitted every minute.
3) Just completely drop all duplicates.

Options 1 and 2 are for monitoring purposes.
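A rough sketch of option 1, building a small "telemetry.duplicate" message from the submission URI alone. The path layout (`/submit/telemetry/<id>/<docType>/...`), the `duplicate_message` helper, and the `Type`/`Fields` dictionary shape are assumptions for illustration; only the message type name and the six field names come from the list above.

```python
URI_FIELDS = ("id", "docType", "appName", "appVersion",
              "appUpdateChannel", "appBuildId")


def duplicate_message(uri: str) -> dict:
    """Build a minimal duplicate record without touching the payload."""
    path = uri.split("?", 1)[0]  # ignore any query string
    parts = path.strip("/").split("/")
    # Assumed layout: submit/telemetry/ followed by the six URI dimensions.
    if len(parts) != 2 + len(URI_FIELDS) or parts[:2] != ["submit", "telemetry"]:
        raise ValueError(f"unexpected submission path: {path!r}")
    return {
        "Type": "telemetry.duplicate",
        "Fields": dict(zip(URI_FIELDS, parts[2:])),
    }
```

Since only the URI is inspected, the payload never needs to be decompressed, parsed, or validated for a submission that is about to be set aside.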
I prefer option 1 above.
Having the raw information (#1) gives us more flexibility to alter/create a summary report. Plus it allows us to detect bugs in duplicate filtering, problems with client documentID creation and measure any false-positive impact on different dimensions.
Sounds good to me