Closed Bug 1142153 Opened 10 years ago Closed 9 years ago

Deduplicate Telemetry submissions by Document ID

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mreid, Unassigned)

References

Details

Once we have an "errors" stream, we should start considering duplicate Document IDs as errors rather than publishing multiple copies to the data store. Specifically, the Document ID we will use is the one present in the submission URL, which is also available in the JSON payload as the top-level "id" field as described here: https://ci.mozilla.org/job/mozilla-central-docs/Tree_Documentation/toolkit/components/telemetry/telemetry/common-ping.html We will need to take care that our method of detecting duplicates is reliable, since false positives will result in data loss. The benefit of doing this is that downstream consumers of the data will not need to worry about implementing deduplication for each analysis.
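To illustrate the intended behaviour described above, here is a minimal sketch in Python (not the production decoder; the function name and in-memory "stores" are hypothetical stand-ins): submissions carrying an already-seen document ID are diverted to the error stream instead of being published again.

    # Hypothetical sketch of per-document-ID deduplication; illustrative only.
    def route_submission(submission, seen_ids, data_store, error_store):
        """Publish a submission once; divert repeats to the error stream."""
        doc_id = submission["id"]  # same value as the document ID in the submission URL
        if doc_id in seen_ids:
            # Duplicate: treat as an error instead of storing a second copy.
            error_store.append({"reason": "duplicate_document_id", "id": doc_id})
        else:
            seen_ids.add(doc_id)
            data_store.append(submission)

    # Example usage with plain in-memory stand-ins for the stores.
    seen, data, errors = set(), [], []
    route_submission({"id": "e2d52618-9741-4eea-bed5-2e1abf8308f9", "payload": {}}, seen, data, errors)
    route_submission({"id": "e2d52618-9741-4eea-bed5-2e1abf8308f9", "payload": {}}, seen, data, errors)
    # data now holds one copy; errors holds one duplicate record.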
Depends on: 1137747
Priority: -- → P2
We were waiting for bug 1139460; now we should just do this. We will also want to send the duplicates to the error stream, so we don't need Heka #1398.
Assignee: nobody → mtrinkala
Priority: P2 → P1
If possible it would be great to annotate with the date we saw the first instance of the duplicate (so we can refer back to it without looking through the entire history).
Duplicate Analysis Results (we are seeing 0.4% - 0.5% duplicates)

Selection schema:
{
  "version": 1,
  "dimensions": [
    { "field_name": "submissionDate", "allowed_values": { "min": "20150507" } },
    { "field_name": "sourceName", "allowed_values": "telemetry" },
    { "field_name": "sourceVersion", "allowed_values": "4" },
    { "field_name": "docType", "allowed_values": "main" },
    { "field_name": "appName", "allowed_values": "Firefox" },
    { "field_name": "appUpdateChannel", "allowed_values": "nightly" },
    { "field_name": "appVersion", "allowed_values": { "min": "41.0" } },
    { "field_name": "appBuildId", "allowed_values": { "min": "20150507000000" } }
  ]
}

This data is not a normal distribution, but I have still included the standard deviation because it gives you a sense of how skewed the data is.

Unique documentIds: 1278016
Unique duplicate documentIds: 3450
Unique number of clients submitting duplicates: 1139
Total number of unique clients: 76148
Avg number of times a duplicate is submitted: 1.53 (standard deviation: 4.42)
Avg time between duplicate submissions (s): 71397 (standard deviation: 88186)
Number of documentIds duplicated between clients: 0

Maximum number of duplicates submitted by a single client: 661 (the counts are interesting)
clientId: 814f1e43-a669-4bc8-9b59-edbd054a3446
documentIds:
  e2d52618-9741-4eea-bed5-2e1abf8308f9 = 16
  9b627308-a0bb-4912-9f6d-a6f384b68d14 = 47
  b8b7600d-3b6d-44b8-b677-ca1dc8eebc0c = 47
  0e8bc7cf-26b3-4360-a6c2-6f9ca02314b7 = 47
  231ccb6f-b85f-43f6-90d9-c467eab895a3 = 16
  cc57ccc2-6628-4dff-b243-48ef91e5b854 = 16
  6ae61046-064e-4f87-bb5e-afb89289ffbc = 16
  914836c3-5139-4a8a-89b3-7c32c366af3a = 47
  2280def3-19c0-443a-93a6-19402699693b = 47
  0b7a7659-90d5-4a0f-81f3-12212eb2f6cc = 16
  d375dba6-2a36-423c-9796-27e797fa50b8 = 16
  5b3708cd-98af-4adf-80a4-7f17cde0b5e3 = 47
  90f72448-08f6-4197-8c6b-0e8bd674aefe = 47
  1bdf62be-c04c-45cc-91b2-c44ebe8d61eb = 16
  4e92cc58-3c37-4145-8c84-e97853bc212d = 47
  d30f5cef-296b-4fd9-8301-e54045094bc7 = 47
  207a10e4-597f-43b6-85ff-fd171d1976df = 47
  642eb40f-cd9d-41c1-bd20-3fa5d1730424 = 16
  e0f4c97d-a435-4627-868c-da79f1a3d5ef = 16
  585c7d33-3883-4e62-b32e-2e2772cac942 = 47

Average number of unique documentIds duplicated per client: 2.58 (standard deviation: 2.93)
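For context, statistics like the ones above can be derived from a list of (clientId, documentId) pairs roughly as follows. This is a hedged sketch only; the exact definitions used for the report above may differ, and the function and field names are mine.

    from collections import Counter, defaultdict
    from statistics import mean, pstdev

    def duplicate_stats(records):
        """records: list of (client_id, document_id) pairs (illustrative only)."""
        doc_counts = Counter(doc_id for _, doc_id in records)
        dupes = {doc_id: n for doc_id, n in doc_counts.items() if n > 1}

        clients_per_doc = defaultdict(set)
        dupe_docs_per_client = defaultdict(set)
        for client_id, doc_id in records:
            clients_per_doc[doc_id].add(client_id)
            if doc_id in dupes:
                dupe_docs_per_client[client_id].add(doc_id)

        # "times a duplicate is submitted" is read here as extra copies beyond the
        # first; the report above may count it slightly differently.
        extra_copies = [n - 1 for n in dupes.values()]
        return {
            "unique_document_ids": len(doc_counts),
            "unique_duplicate_document_ids": len(dupes),
            "clients_submitting_duplicates": len(dupe_docs_per_client),
            "total_unique_clients": len({c for c, _ in records}),
            "doc_ids_duplicated_between_clients":
                sum(1 for d in dupes if len(clients_per_doc[d]) > 1),
            "avg_times_duplicate_submitted": mean(extra_copies) if extra_copies else 0.0,
            "stddev_times_duplicate_submitted": pstdev(extra_copies) if extra_copies else 0.0,
        }

    # Tiny illustrative run: one duplicated document, one clean one.
    print(duplicate_stats([("c1", "d1"), ("c1", "d1"), ("c2", "d2")]))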
Based on the results above it will not be possible to remove all of the duplicates from the release data stream in the decoder. The number of documentIds (potentially over a billion a day) and the length of time we would have to hold on to them prevents this. It is also unclear whether there is any advantage to removing 'some' of the duplicates: downstream processors would still have to dedupe the data set, or accept a resulting data set that contains fewer duplicates but is still not duplicate-free. Are either of these acceptable? What amount of duplicate noise are we willing to tolerate?

Cost
----
To perform this action in the decoder, an additional 2.25 GiB of memory per billion ids is required at 99.99% accuracy (we will erroneously flag up to 100K documents as duplicates per day). Currently there is no estimate of the additional processing time using a set of bloom filters this large. In addition, the sandbox would need some modification if we want to push it beyond 4 GiB.

Proposal
--------
Optimally we would fix the client bugs and reduce the duplicates, at the source, to a level that is acceptable without additional post-processing. So, instead of deduping the real-time stream in the decoder, I propose we set up a report to monitor the duplicate submission rate. This report can be used to verify the client fixes and spot-check the data in the future. Also, running it as a report eliminates the risk of backlogging or breaking the data loader.
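The 2.25 GiB figure is consistent with the standard Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits for n items at false positive rate p. A back-of-the-envelope check (a sketch only, not the sandbox implementation; the function name is mine):

    import math

    def bloom_filter_gib(n_items, false_positive_rate):
        """Optimal Bloom filter size in GiB: m = -n * ln(p) / (ln 2)^2 bits."""
        bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
        return bits / 8 / 2**30

    # One billion document IDs at 99.99% accuracy (0.01% false positive rate):
    print(round(bloom_filter_gib(1_000_000_000, 0.0001), 2))  # ~2.23 GiB, in line with the ~2.25 GiB above

    # 0.01% of a billion documents is the "up to 100K" false duplicate flags per day quoted above.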
As an extension to this, I propose that we should de-duplicate submissions on the pre-release channels (or at least nightly) in addition to the overall report/monitor. This should require only a small fraction of the resources when compared to the expected release volume, and should correct for any change in duplicate-submitting behaviour on the client before it reaches the release channel.
(In reply to Mark Reid [:mreid] from comment #6) What is the advantage of having a de-duped nightly stream and non de-duped release stream? All plugins will still have to be designed to expect/ignore/correct duplicates in the set of data they are processing.
My hope is that any bugs introduced on nightly would be fixed before making their way through the trains to release, and that de-duping pre-release would result in nearly-fully-deduped data across all channels. There will still be the occasional duplicate on the release channel, since there are legitimate events that would result in duplicate submissions, but hopefully the rate will be low enough that it would not have any statistical impact on data analysis. Brendan, what would a "low enough" rate of duplicate documents look like for your purposes?
Flags: needinfo?(bcolloran)
I have a number of thoughts here...

1. We should get Roberto's opinion. In some ways this will affect his session-oriented work more than the Metrics Team's client-oriented work. So needinfo-ed Roberto.

2. Which brings me to the second point: in the client-oriented dataset, since we'll need to aggregate by client anyway, within each set of per-client pings we must dedupe -- the allowable dupe rate is zero.

3. I think that maybe the real question here is: "what is a 'low enough' rate of duplicate docs for the purpose of real-time analytics on the incoming data streams?". I don't really have an answer for that, because TBH I'm not sure that that's the Metrics Team's main interest. I think that we will mostly be doing post-hoc analytics on the de-duped, cleaned, and consolidated data. I'm not sure who the owner of the real-time streaming metrics effort is, or what purposes the metrics generated there will be used for. The answer to "what is a 'low enough' rate?" will be quite dependent on these whos and whats -- regression detection on nightly will have a different audience and may have to meet a different bar than search counts. Needinfo John and Benjamin for their thoughts on this.

4. It also occurs to me that it is hard to know in advance whether dupes are distributed at random across the population of pings, or whether they preferentially accrue to e.g. crashed sessions, long sessions (which maybe only ended because they crashed?), a certain OS, etc. Not knowing this, it's hard to know what is acceptable. Maybe dupes have a disproportionate share of some feature that someone is interested in, and allowing 0.01% to pass throws off the final metric by a full 1%. Even if it does, probably no one cares.

5. We certainly want to be careful about not submitting dupes from the client... but because v4 does not have the nice redundancy feature of v2, we also need to be aggressive about submitting pings to avoid data loss. I tend to think that we should err on the side of sending too many pings rather than not enough, because we have no chance to dedupe data that we have not received, and it's important that the per-client data is comprehensive.
Flags: needinfo?(rvitillo)
Flags: needinfo?(jjensen)
Flags: needinfo?(benjamin)
Flags: needinfo?(bcolloran)
I think we're discussing two things here:

A. There may be duplicate submissions because of client bugs. We should fix the known bugs, but also be able to monitor this to make sure we're not introducing new client bugs.

B. There may be duplicate submissions *by design* due to network issues. This can happen if the client submits a ping and the network goes down after submission but before the collector has responded. A similar situation may happen if a ping is submitted near client shutdown. In these cases the client will resubmit the ping at a later date to ensure that we have a complete record.

A probably only requires prerelease monitoring. B may require release monitoring or other techniques. Would it be possible to mitigate this by only searching for duplicates if the client is *resubmitting*? So something like this:

On first submission, the client sends the HTTP header x-Moz-Telemetry-Submission: Initial. The collector accepts this immediately without any duplicate checking.

On resubmission, the client sends the HTTP header x-Moz-Telemetry-Submission: Resubmit. The collector uses the clientID to look up prior submissions and throws the submission away if a prior submission with the same document ID already exists.
Flags: needinfo?(benjamin)
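A rough sketch of the collector-side handling proposed in comment #10. The header name is as proposed above; the store class, its interface, and the function name are hypothetical stand-ins, not an existing API.

    class InMemoryStore:
        """Stand-in for a per-client index of previously seen document IDs (hypothetical)."""
        def __init__(self):
            self.docs = {}  # client_id -> set of document_ids

        def has_document(self, client_id, document_id):
            return document_id in self.docs.get(client_id, set())

        def save(self, client_id, document_id, payload):
            self.docs.setdefault(client_id, set()).add(document_id)

    def handle_submission(headers, client_id, document_id, payload, store):
        """Accept 'Initial' submissions unchecked; dedupe only on 'Resubmit'."""
        submission_type = headers.get("x-Moz-Telemetry-Submission", "Initial")
        if submission_type == "Resubmit" and store.has_document(client_id, document_id):
            return "discarded-duplicate"  # a prior copy is already stored for this client
        store.save(client_id, document_id, payload)
        return "accepted"

    # Example: the resubmitted copy is dropped, the first copy is kept.
    store = InMemoryStore()
    print(handle_submission({"x-Moz-Telemetry-Submission": "Initial"}, "c1", "d1", {}, store))
    print(handle_submission({"x-Moz-Telemetry-Submission": "Resubmit"}, "c1", "d1", {}, store))

This keeps the expensive lookup off the common path: only the small fraction of pings flagged as resubmissions ever trigger a per-client duplicate check.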
(In reply to brendan c from comment #9)
> I have a number of thoughts here...
>
> 1. We should get Roberto's opinion. In some ways this will affect his more
> session-oriented work than the Metrics Team's client-oriented work. So
> needinfo-ed Roberto.

As I usually work with smallish datasets that encompass a few build-ids, I could deduplicate the submissions in my analysis job. That said, if deduplicating on pre-release is feasible upstream, we should do it there.
Flags: needinfo?(rvitillo)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #10)
> Would it be possible to mitigate this by only searching for duplicates if
> the client is *resubmitting*? So something like this:
>
> On first submission, the client submits HTTP x-Moz-Telemetry-Submission:
> Initial. The collector accepts this immediately without any duplicate
> checking.
> On resubmission, the client submits HTTP x-Moz-Telemetry-Submission:
> Resubmit. The collector uses the clientID to look up prior submissions and
> throws the submission away if a prior submission with the same document ID
> already exists.

This would decrease the number of documents that need to be checked for duplicates, but the root of the problem is that the full list of observed documents (the list to check against) is prohibitively large.
I wasn't suggesting building up a full observed list. The deduplication was to be done by clientID lookup.
Ok, so an external "is this a dupe?" request for only the documents that are reported as retries. I'm not sure how that would impact stream processing, but I will look into it.
(In reply to brendan c from comment #9)
> The answer to "what is a 'low enough' rate?" will be quite dependent on
> these whos and whats-- regression detection on nightly will have a different
> audience and may have to meet a different bar than search counts. Needinfo
> John and Benjamin for their thoughts on this.

(Finally) responding to this. I don't have a specific answer to this question, but I can say what factors should, I think, go into a decision regarding it:

a) Our level of understanding of what the most prevalent causes are. That is, where we are on the spectrum from "absolutely no idea" to "almost certainly a result of Necko bug XXX".

b) Related -- an impression of how far along we are on the cost-benefit curve: how much more effort would be required to reduce this error rate further?

c) What is the relative and absolute impact of these duplicates on key metrics that we care about?

d) What is the trajectory of this rate? Do we think it will get worse or better over time?
Flags: needinfo?(jjensen)
My action item from this bug has been completed here: https://bugzilla.mozilla.org/show_bug.cgi?id=1168412
Assignee: mtrinkala → nobody
We are waiting to see what the monitor shows, bumping to P2 in the meantime.
Priority: P1 → P2
Let's leave this as a monitor for now, and if the duplicate rate becomes a problem we will consider deduplicating pre-release channels at that time.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard