1342111 - Drop duplicate telemetry submissions based on recent history of documentId

Reporter

Description

•

7 years ago

Based on some investigations[1] into a large number of submissions from a single client, we discovered that many of the submissions had the same document id[2]. 

We should keep at least a small time period of document ids and de-duplicate incoming submissions in the ingestion pipeline. It should only require a short window of time to take care of this particular problem, even if we can't easily solve the "global uniqueness" case.

[1] https://reports.telemetry.mozilla.org/post/projects/problematic_client.kp
[2] https://sql.telemetry.mozilla.org/queries/3244/source

Frank Bertsch [:frank]

Comment 1

•

7 years ago

Chutten has put together the top docIds and their counts for the problematic client that led to this discussion: https://gist.github.com/chutten/786bbcca8f848ac65ad75daf9f1a24f5

Mark Reid [:mreid]

Reporter

Comment 3

•

7 years ago

Per today's discussion, we want to:
1) Create a custom filter to detect duplicate document ids.
2) Deploy this logic as an input while reading from Kafka, and append a Heka field that tags records as duplicates
3) Update the Telemetry data warehouse loader to filter the dupes out of the main data lake based on the "duplicate" field above
4) Add a new output on the DWL to store these potential dupes in a separate location on S3 for later analysis and verification.

For #1 above, the intention is to implement a cuckoo filter with accuracy tuned to ensure no more than approximately 1 in 1,000,000 false positives. This should ensure that we end up with no duplicates in the output, but may redirect a small number of non-duplicate documents to the separate location described in #4.
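
As a rough sizing sketch for the target above: the commonly quoted upper bound on a cuckoo filter's false positive rate is about 2b / 2^f, where b is the number of entries per bucket and f is the fingerprint width in bits. The bucket size of 4 and the function names below are assumptions for illustration, not the parameters of the filter actually deployed.

```python
import math


def approx_fp_rate(fingerprint_bits: int, bucket_size: int = 4) -> float:
    """Approximate upper bound on the false positive rate: 2b / 2^f."""
    return 2 * bucket_size / 2 ** fingerprint_bits


def min_fingerprint_bits(target_fp_rate: float, bucket_size: int = 4) -> int:
    """Smallest fingerprint width f (in bits) with 2b / 2^f <= target."""
    return math.ceil(math.log2(2 * bucket_size / target_fp_rate))


if __name__ == "__main__":
    bits = min_fingerprint_bits(1e-6)  # target: 1 in 1,000,000
    print(bits, approx_fp_rate(bits))
```

Under these assumptions, each extra fingerprint bit halves the bound, so tightening the target is a storage trade-off rather than an algorithm change.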

We can roll out #1 and #2 without any effect on the current behaviour of the system, and then do steps #3 and #4 once we have tested that the rate of false positives is acceptably low.

The intention is to continue to make all data (including duplicates) available to the CEP for the purposes of monitoring and analysis, but to at least make it easy to detect and filter duplicates using the annotated field in #2.
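
A minimal sketch of how steps 2 to 4 fit together, assuming each record is a dict with a `documentId` key. The function names, the `duplicate` key, and the plain `set` standing in for the cuckoo filter are all hypothetical; a set is exact and unbounded, so it has none of the filter's false positives or windowing.

```python
from typing import Iterable, Iterator, List, Tuple


def tag_duplicates(records: Iterable[dict]) -> Iterator[dict]:
    """Step 2: annotate every record, but drop nothing."""
    seen = set()  # stand-in for the cuckoo filter (assumption)
    for record in records:
        doc_id = record["documentId"]
        yield dict(record, duplicate=doc_id in seen)
        seen.add(doc_id)


def split_for_loader(tagged: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Steps 3 and 4: main output without dupes, dupes kept separately."""
    main, dupes = [], []
    for record in tagged:
        (dupes if record["duplicate"] else main).append(record)
    return main, dupes


if __name__ == "__main__":
    stream = [{"documentId": "a"}, {"documentId": "b"}, {"documentId": "a"}]
    main, dupes = split_for_loader(tag_duplicates(stream))
    print(len(main), len(dupes))
```

Because tagging and splitting are separate, the tagged stream can still be consumed in full for monitoring while only the loader applies the filter.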

Assignee: nobody → mtrinkala

Points: --- → 2

Priority: -- → P1

Karl Dubost💡 :karlcow

Comment 4

•

7 years ago

cf https://chuttenblog.wordpress.com/2017/02/23/data-science-is-hard-anomalies-and-what-to-do-about-them/
I had a discussion with a Google person about another strange massive Firefox access but with a fake UA.
https://gist.github.com/cramforce/63a8f25639b0201a7c48c90f7a4e1eba

Flags: needinfo?(chutten)

Chris H-C :chutten

Comment 5

•

7 years ago

That is interesting, but I think unrelated. I agree that it's fairly likely that it is someone using a spoofed user-agent for reasons that aren't entirely clear. (Well, the WebKit mobile monoculture is a pretty clear reason, but probably isn't specific enough :S )

Flags: needinfo?(chutten)

Mike Trinkala [:trink]

Comment 6

•

7 years ago

The measured false positive rate for the production configuration is: 0.000025
So 25 in a million (running at a lower capacity did not help as much as I expected). Will this be acceptable or do I need to increase the fingerprint size (basically doubling the storage requirements)?
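
To put the measured rate in context, here is a small arithmetic sketch. It assumes the false positive rate roughly halves with each additional fingerprint bit, which is the usual approximation for cuckoo filters rather than a measurement from this deployment; the function names are hypothetical.

```python
import math


def expected_false_positives(n_docs: int, fp_rate: float) -> float:
    """Expected number of non-duplicates flagged among n_docs unique docs."""
    return n_docs * fp_rate


def extra_fingerprint_bits(measured: float, target: float) -> int:
    """Extra bits needed if each added bit halves the rate (assumption)."""
    if measured <= target:
        return 0
    return math.ceil(math.log2(measured / target))


if __name__ == "__main__":
    print(expected_false_positives(1_000_000, 0.000025))  # about 25
    print(extra_fingerprint_bits(0.000025, 1e-6))
```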

Sam Penrose

Comment 7

•

7 years ago

I'm happy starting with that.

(In reply to Mike Trinkala [:trink] from comment #6)
> The measured false positive rate for the production configuration is:
> 0.000025
> So 25 in a million (running at a lower capacity did not help as much as I
> expected). Will this be acceptable or do I need to increase the fingerprint
> size (basically doubling the storage requirements)?

Mike Trinkala [:trink]

Comment 8

•

7 years ago

The duplicate docid testing has finally begun: https://hsadmin.trink.com/dashboard_output/graphs/analysis.moz_docid_dupes.message_per_minute.html (FYI, hsadmin is only processing about 10% of the traffic). For all practical purposes, this version will produce 0 false positives.

The filter covers a 256-minute window or ~27MM entries; when either limit is reached, the oldest entries are discarded.

Mike Trinkala [:trink]

Comment 9

•

7 years ago

De-duping has been added to the telemetry ping decoder. At this point it only adds Fields[duplicate_delta] to the message.

i.e. name: duplicate_delta type: 2 representation: 1m value: 4

The delta value is the number of intervals since the previous duplicate (4 in this case), and the representation is the number of minutes each interval represents. So this duplicate came in ~4 minutes later (technically 3 < x < 5). This allows us to examine the distribution and tune the de-duplication window. It also allows one to query messages within the distribution, e.g. "show me some duplicates that were really delayed", as they probably had a different root cause than the high frequency duplicates.
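
The interval arithmetic can be sketched as follows. This is a simplified illustration, not the decoder's implementation: it uses an exact dict instead of a cuckoo filter, measures the delta from the most recent sighting, and treats entries older than the window as unseen without actually freeing them. The class and method names are hypothetical.

```python
from typing import Dict, Optional


class WindowedDedupe:
    """Track the last interval in which each document id was seen."""

    def __init__(self, window_intervals: int = 256, interval_minutes: int = 1):
        self.window = window_intervals
        self.interval_seconds = interval_minutes * 60
        self.last_seen: Dict[str, int] = {}  # doc id -> interval index

    def check(self, doc_id: str, timestamp_seconds: float) -> Optional[int]:
        """Return None on first sighting, else the delta in whole intervals."""
        now = int(timestamp_seconds // self.interval_seconds)
        prev = self.last_seen.get(doc_id)
        self.last_seen[doc_id] = now
        if prev is None or now - prev >= self.window:
            return None  # never seen, or fell out of the window
        return now - prev


if __name__ == "__main__":
    dedupe = WindowedDedupe()
    dedupe.check("doc", 59)          # first sighting, interval 0
    print(dedupe.check("doc", 240))  # interval 4, real gap just over 3 min
```

Because both timestamps are truncated to interval boundaries, a delta of 4 with 1-minute intervals corresponds to a real gap strictly between 3 and 5 minutes, which is the "3 < x < 5" caveat above.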

Mike Trinkala [:trink]

Comment 10

•

7 years ago

Assigning to whd for deployment.  The moz_telemetry ping decoder cfg should be updated with
    -- number of items in the de-duping cuckoo filter
    cf_items = 32e6,
    -- interval size in minutes for cuckoo filter pruning
    cf_interval_size = 1,

Assignee: mtrinkala → whd

Wesley Dawson [:whd]

Comment 11

•

7 years ago

This was just deployed (https://github.com/mozilla-services/puppet-config/commit/ff6bdc7a92285df102da214efc5ca75e6c020413, in addition to the changes in https://github.com/mozilla-services/puppet-config/pull/2515), and I see duplicate_delta showing up as a field on the CEP. As we're not dropping things based on this information, I assume this bug isn't complete yet, so I will simply un-assign myself.

Assignee: whd → nobody

Mark Reid [:mreid]

Reporter

Updated

•

7 years ago

Depends on: 1348008

Mike Trinkala [:trink]

Updated

•

7 years ago

Component: Metrics: Pipeline → Pipeline Ingestion

Product: Cloud Services → Data Platform and Tools

Mark Reid [:mreid]

Reporter

Comment 12

•

7 years ago

See the last few comments in Bug 1348008 for some suggestions about deploying this change.

Chris H-C :chutten

Comment 13

•

7 years ago

The big docids are:

975fae0b-70f3-4a57-823c-8a4c791655e9
22952ba9-6e8a-4de1-a5c5-71c1d25bbb16
358f9afc-75bc-406b-be35-489e1b82804c
85bcc4bf-42dd-46dc-a453-daddebf49379

The second rank (two orders of magnitude less frequent) are:

1fa0c61d-0137-4f4a-9017-de584ed98b96
708aebb1-0688-4051-8ced-35365d199240
86a93bef-c5be-4aaf-b08c-2d4ec81d2c24
edb125ee-0cf0-46ce-b9da-94c6dae49703


(there are others, up to 15 I've seen from this client id specifically, but they are in low enough volumes we can skip them if you'd like)

Mike Trinkala [:trink]

Comment 14

•

7 years ago

There are a few options for our discard procedure (in order of preference):
1) create a new message for each duplicate e.g. "telemetry.duplicate" containing only the URI information (id, docType, appName, appVersion, appUpdateChannel, appBuildId) so we don't have to parse/validate/discard
2) create some kind of summary report with counts that is emitted every minute
3) just completely drop all duplicates

1 or 2 are for monitoring purposes
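
A rough sketch of option 1, building the lightweight message from the submission URI alone so the payload never has to be parsed. The field names come from the list above; the `/submit/telemetry/` prefix, the path layout, the message dict shape, and the build id in the usage line are assumptions for illustration.

```python
URI_FIELDS = ("id", "docType", "appName", "appVersion",
              "appUpdateChannel", "appBuildId")


def duplicate_message(uri: str, prefix: str = "/submit/telemetry/") -> dict:
    """Build a 'telemetry.duplicate' message from the URI path only."""
    path = uri.split("?", 1)[0]  # ignore any query string
    if not path.startswith(prefix):
        raise ValueError(f"unexpected URI prefix: {uri!r}")
    parts = path[len(prefix):].strip("/").split("/")
    if len(parts) != len(URI_FIELDS):
        raise ValueError(f"expected {len(URI_FIELDS)} path parts: {uri!r}")
    return {"Type": "telemetry.duplicate",
            "Fields": dict(zip(URI_FIELDS, parts))}


if __name__ == "__main__":
    msg = duplicate_message(
        "/submit/telemetry/975fae0b-70f3-4a57-823c-8a4c791655e9"
        "/main/Firefox/52.0/release/20170302120751")
    print(msg["Fields"]["docType"])
```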

Flags: needinfo?(chutten)

Mark Reid [:mreid]

Reporter

Comment 15

•

7 years ago

I prefer option 1 above.

Mike Trinkala [:trink]

Comment 16

•

7 years ago

Having the raw information (#1) gives us more flexibility to alter/create a summary report. Plus it allows us to detect bugs in duplicate filtering and problems with client documentID creation, and to measure any false-positive impact on different dimensions.

Chris H-C :chutten

Comment 17

•

7 years ago

Sounds good to me

Flags: needinfo?(chutten)

Mark Reid [:mreid]

Reporter

Comment 18

•

7 years ago

See https://github.com/mozilla-services/lua_sandbox_extensions/pull/133

Mike Trinkala [:trink]

Updated

•

7 years ago

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Assignee

Updated

•

2 years ago

Component: Pipeline Ingestion → General

Categories

(Data Platform and Tools :: General, defect, P1)

Tracking

(Not tracked)

People

(Reporter: mreid, Unassigned)

Security

(public)
