Closed
Bug 1342111
Opened 7 years ago
Closed 7 years ago
Drop duplicate telemetry submissions based on recent history of documentId
Categories
(Data Platform and Tools :: General, defect, P1)
Data Platform and Tools
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mreid, Unassigned)
References
Details
Based on some investigations [1] into a large number of submissions from a single client, we discovered that many of the submissions had the same document id [2]. We should keep at least a small time period of document ids and de-duplicate incoming submissions in the ingestion pipeline. It should only require a short window of time to take care of this particular problem, even if we can't easily solve the "global uniqueness" case.

[1] https://reports.telemetry.mozilla.org/post/projects/problematic_client.kp
[2] https://sql.telemetry.mozilla.org/queries/3244/source
Comment 1•7 years ago
Chutten has put together the top docIds and their counts for the problematic client that led to this discussion: https://gist.github.com/chutten/786bbcca8f848ac65ad75daf9f1a24f5
Reporter
Comment 3•7 years ago
Per today's discussion, we want to:

1) Create a custom filter to detect duplicate document ids.
2) Deploy this logic as an input while reading from Kafka, and append a Heka field that tags records as duplicates.
3) Update the Telemetry data warehouse loader to filter the dupes out of the main data lake based on the "duplicate" field above.
4) Add a new output on the DWL to store these potential dupes in a separate location on S3 for later analysis and verification.

For #1 above, the intention is to implement a cuckoo filter with accuracy tuned to ensure no more than approximately 1 in 1,000,000 false positives. This should ensure that we end up with no duplicates in the output, but may redirect a small number of non-duplicate documents per #4.

We can roll out #1 and #2 without any effect on the current behaviour of the system, and then do steps #3 and #4 once we have tested that the rate of false positives is acceptably low.

The intention is to continue to make all data (including duplicates) available to the CEP for the purposes of monitoring and analysis, but to at least make it easy to detect and filter duplicates using the annotated field in #2.
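For readers unfamiliar with the data structure named in step 1, here is a minimal cuckoo filter sketch in Python. All names and sizes are invented for illustration; the production filter was implemented in the Lua sandbox with far larger capacity.

```python
import hashlib
import random

class CuckooFilter:
    """Minimal cuckoo filter (partial-key cuckoo hashing): each item maps to
    two candidate buckets and is stored as a short fingerprint. Lookups can
    return false positives (fingerprint collisions) but no false negatives,
    as long as inserts succeed."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=16, max_kicks=500):
        assert num_buckets & (num_buckets - 1) == 0, "power of two required"
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.fp_bits = fp_bits
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _hash(self, s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def _fingerprint(self, item):
        # short fingerprint; 0 is reserved, so map it to 1
        return (self._hash("fp:" + item) & ((1 << self.fp_bits) - 1)) or 1

    def _alt_index(self, index, fp):
        # XOR with hash(fp) is an involution, so either bucket finds the other
        return (index ^ self._hash(str(fp))) % self.num_buckets

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._hash(item) % self.num_buckets
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def add(self, item):
        fp = self._fingerprint(item)
        i1 = self._hash(item) % self.num_buckets
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # both buckets full: evict ("kick") fingerprints until one fits
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is effectively full
```

The false-positive rate is governed by the fingerprint width (`fp_bits`) and bucket size, which is why increasing the fingerprint size trades storage for accuracy, as discussed later in this bug.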
Assignee: nobody → mtrinkala
Points: --- → 2
Priority: -- → P1
Comment 4•7 years ago
Cf. https://chuttenblog.wordpress.com/2017/02/23/data-science-is-hard-anomalies-and-what-to-do-about-them/

I had a discussion with someone at Google about another strange, massive Firefox access pattern, but with a fake UA: https://gist.github.com/cramforce/63a8f25639b0201a7c48c90f7a4e1eba
Flags: needinfo?(chutten)
Comment 5•7 years ago
That is interesting, but I think unrelated. I agree that it's fairly likely that it is someone using a spoofed user-agent for reasons that aren't entirely clear. (Well, the WebKit mobile monoculture is a pretty clear reason, but probably isn't specific enough :S )
Flags: needinfo?(chutten)
Comment 6•7 years ago
The measured false positive rate for the production configuration is 0.000025, i.e. 25 in a million (running at a lower capacity didn't help as much as I expected). Will this be acceptable, or do I need to increase the fingerprint size (basically doubling the storage requirements)?
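To put that rate in perspective, a quick back-of-the-envelope calculation (the daily volume figure here is hypothetical, purely for illustration):

```python
fp_rate = 25e-6        # measured: 25 false positives per million documents
docs_per_day = 100e6   # hypothetical daily ping volume, for illustration only

# non-duplicate documents that would be redirected to the dupes bucket per day
misrouted = fp_rate * docs_per_day
print(round(misrouted))  # 2500
```

Since these documents are redirected to a separate S3 location rather than deleted, a false positive at this rate is recoverable, which is part of why starting with the smaller fingerprint is defensible.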
Comment 7•7 years ago
(In reply to Mike Trinkala [:trink] from comment #6)
> The measured false positive rate for the production configuration is
> 0.000025, i.e. 25 in a million (running at a lower capacity didn't help as
> much as I expected). Will this be acceptable, or do I need to increase the
> fingerprint size (basically doubling the storage requirements)?

I'm happy starting with that.
Comment 8•7 years ago
The duplicate docid testing has finally begun (this version, for all practical purposes, will produce 0 false positives): https://hsadmin.trink.com/dashboard_output/graphs/analysis.moz_docid_dupes.message_per_minute.html

(FYI: hsadmin is only processing about 10% of the traffic.)

The filter consists of a 256-minute window or ~27MM entries; when either limit is reached, the oldest entries are discarded.
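A toy Python version of that dual-limit window, with a plain set standing in for the cuckoo-filter storage (class and parameter names are invented for the sketch):

```python
from collections import deque

class WindowedDeduper:
    """De-duplicate doc ids over a window bounded both by time (number of
    one-minute intervals) and by total capacity; when either limit is hit,
    the oldest interval's entries are discarded."""

    def __init__(self, max_minutes=256, max_items=27_000_000):
        self.max_minutes = max_minutes
        self.max_items = max_items
        self.intervals = deque([set()])  # one set of ids per minute, oldest first
        self.size = 0

    def tick(self):
        """Advance to the next one-minute interval, pruning expired ones."""
        self.intervals.append(set())
        while len(self.intervals) > self.max_minutes:
            self.size -= len(self.intervals.popleft())

    def seen(self, doc_id):
        """True if doc_id already appeared in the window; otherwise record it."""
        if any(doc_id in s for s in self.intervals):
            return True
        # enforce the capacity limit by dropping whole oldest intervals
        while self.size >= self.max_items and len(self.intervals) > 1:
            self.size -= len(self.intervals.popleft())
        self.intervals[-1].add(doc_id)
        self.size += 1
        return False
```

Evicting whole intervals (rather than individual entries) keeps the bookkeeping cheap and matches the "oldest entries are discarded" behavior described above.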
Comment 9•7 years ago
De-duping has been added to the telemetry ping decoder. At this point in time it will only add Fields[duplicate_delta] to the message, i.e.:

name: duplicate_delta
type: 2
representation: 1m
value: 4

The delta value is the number of intervals since the previous duplicate (in this case 4); the representation is the number of minutes each interval represents. So this duplicate came in ~4 minutes later (technically 3 < x < 5). This allows us to examine the distribution and tune the de-duplication window. It also allows one to query messages within the distribution, e.g. "show me some duplicates that were really delayed", as they probably had a different root cause than the high-frequency duplicates.
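A sketch of how that delta could be computed (illustrative Python; the class and method names are invented, only the interval arithmetic mirrors the description):

```python
class DeltaTagger:
    """Record the interval in which each doc id was last seen; a repeat
    sighting yields the number of intervals elapsed, analogous to the
    Fields[duplicate_delta] value described above."""

    def __init__(self, interval_minutes=1):
        self.interval = interval_minutes
        self.last_slot = {}

    def observe(self, doc_id, ts_minutes):
        """Return None for a first sighting, else the interval delta."""
        slot = ts_minutes // self.interval
        prev = self.last_slot.get(doc_id)
        self.last_slot[doc_id] = slot
        return None if prev is None else slot - prev
```

Because timestamps are bucketed into whole intervals before subtracting, a delta of 4 with a 1-minute representation means the duplicate arrived roughly, not exactly, 4 minutes later, hence the 3 < x < 5 bound above.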
Comment 10•7 years ago
Assigning to whd for deployment. The moz_telemetry ping decoder cfg should be updated with:

-- number of items in the de-duping cuckoo filter
cf_items = 32e6,
-- interval size in minutes for cuckoo filter pruning
cf_interval_size = 1,
Assignee: mtrinkala → whd
Comment 11•7 years ago
This was just deployed (https://github.com/mozilla-services/puppet-config/commit/ff6bdc7a92285df102da214efc5ca75e6c020413, in addition to the changes in https://github.com/mozilla-services/puppet-config/pull/2515), and I see duplicate_delta showing up as a field on the CEP. As we're not dropping things based on this information, I assume this bug isn't complete yet, so I will simply un-assign myself.
Assignee: whd → nobody
Updated•7 years ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Reporter
Comment 12•7 years ago
See the last few comments in Bug 1348008 for some suggestions about deploying this change.
Comment 13•7 years ago
The big docids are:

975fae0b-70f3-4a57-823c-8a4c791655e9
22952ba9-6e8a-4de1-a5c5-71c1d25bbb16
358f9afc-75bc-406b-be35-489e1b82804c
85bcc4bf-42dd-46dc-a453-daddebf49379

The second rank (two orders of magnitude less frequent) are:

1fa0c61d-0137-4f4a-9017-de584ed98b96
708aebb1-0688-4051-8ced-35365d199240
86a93bef-c5be-4aaf-b08c-2d4ec81d2c24
edb125ee-0cf0-46ce-b9da-94c6dae49703

(There are others, up to 15 I've seen from this client id specifically, but they are in low enough volumes that we can skip them if you'd like.)
Comment 14•7 years ago
There are a few options for our discard procedure (in order of preference):

1) Create a new message for each duplicate, e.g. "telemetry.duplicate", containing only the URI information (id, docType, appName, appVersion, appUpdateChannel, appBuildId), so we don't have to parse/validate/discard.
2) Create some kind of summary report with counts that is emitted every minute.
3) Just completely drop all duplicates.

Options 1 and 2 are for monitoring purposes.
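Option 1 might look roughly like this. The message shape is purely illustrative (the real pipeline emits Heka messages from the Lua sandbox); only the field list comes from the comment above.

```python
# URI metadata retained for each discarded duplicate, per option 1 above
URI_FIELDS = ("id", "docType", "appName", "appVersion",
              "appUpdateChannel", "appBuildId")

def duplicate_message(uri_info):
    """Build the slim notification emitted for a discarded duplicate:
    URI metadata only, so downstream consumers never have to
    parse/validate the full payload."""
    return {
        "Type": "telemetry.duplicate",
        "Fields": {k: uri_info[k] for k in URI_FIELDS},
    }
```

Keeping only URI-level fields makes the duplicate stream cheap to store and query while still supporting the monitoring uses mentioned in the follow-up comments.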
Flags: needinfo?(chutten)
Reporter | ||
Comment 15•7 years ago
I prefer option 1 above.
Comment 16•7 years ago
Having the raw information (#1) gives us more flexibility to alter or create a summary report. Plus, it allows us to detect bugs in duplicate filtering and problems with client documentID creation, and to measure any false-positive impact on different dimensions.
Reporter | ||
Comment 18•7 years ago
See https://github.com/mozilla-services/lua_sandbox_extensions/pull/133
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee
Updated•2 years ago
Component: Pipeline Ingestion → General