Closed
Bug 1484276
Opened 7 years ago
Closed 7 years ago
EventsToAmplitude dropping some events
Categories
(Data Platform and Tools :: General, enhancement, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: klukas, Assigned: klukas)
References
Details
While backfilling events to the Savant_Prod Amplitude project for https://bugzilla.mozilla.org/show_bug.cgi?id=1482924 we discovered higher historical event rates for this backfill compared with the original backfill.
Consider the following chart in Amplitude:
https://analytics.amplitude.com/mozilla-corp/chart/paetimn/edit/a3j9lfh
Before Aug 3rd, the event rate for the v1 session split event (which was included with the original Savant backfill) is consistently about 60% of the v2 backfill. This suggests that ~40% of events were dropped in the original backfill, perhaps due to hitting rate limits from Amplitude.
From Aug 3rd on, event rates are consistent between the deprecated v1 and the new v3 events. That seems to coincide with the date that the backfill ended and we moved to doing nightly processing of the previous day's events.
We need to address the following points:
- Confirm with Amplitude whether rate limiting was occurring during the backfill, which ended around Aug 3rd
- Discuss solutions with Amplitude for preventing further event loss, such as switching to the batch API
- Implement logging of HTTP responses in telemetry-streaming (which :akomar is already looking into)
- Implement more robust backoff or batch logic to prevent event loss
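The backoff logic in the last point could look roughly like the following sketch (function and parameter names here are illustrative, not from telemetry-streaming):

```python
import random
import time

def send_with_backoff(post, payload, initial_delay=1.0, max_delay=60.0):
    """Retry a request with exponential backoff while the server keeps
    returning 429 (Too Many Requests). `post` is any callable that sends
    the payload and returns an HTTP status code."""
    delay = initial_delay
    while True:
        status = post(payload)
        if status != 429:
            return status
        # Sleep with jitter, doubling the delay up to max_delay.
        time.sleep(delay + random.uniform(0, delay / 2))
        delay = min(delay * 2, max_delay)
```

With infinite retries the job slows down under rate limiting instead of silently dropping events, which is the failure mode described above.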
Updated • 7 years ago
Assignee: nobody → jklukas
Status: NEW → ASSIGNED
Priority: -- → P1
Comment 1 (Assignee) • 7 years ago
I located logs for the ATMO cluster I used for the original savant backfill. For example:
https://s3-us-west-2.amazonaws.com/telemetry-analysis-logs-2/clusters/nifty-volhard-0930/2018-08-03T19%3A51%3A57.862556%2B00%3A00/j-3H3AT69T0K3YH/node/i-06179841e175fc652/applications/spark/spark.log.gz
That log shows a large number of warning messages indicating requests that failed with error code 429. As discussed in Amplitude's API docs [0], this means "Too many requests for a device". The docs specify backoff logic we'll need to follow, or we can switch to the batch API.
[0] https://developers.amplitude.com/#http-api
Comment 2 • 7 years ago
FYI we can slow down requests by tuning the MaxParallelism and MinDelay parameters: https://github.com/mozilla/telemetry-streaming/blob/master/src/main/scala/com/mozilla/telemetry/streaming/EventsToAmplitude.scala#L33
Another option is increasing the number of backoffs allowed in HTTPSink: https://github.com/mozilla/telemetry-streaming/blob/master/src/main/scala/com/mozilla/telemetry/sinks/HTTPSink.scala#L13
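For illustration, enforcing a minimum delay between requests (the idea behind the MinDelay parameter referenced above) can be sketched like this in Python; the class and method names are invented, and this is not the actual Scala implementation:

```python
import time

class ThrottledSender:
    """Minimal sketch of a sender that enforces a minimum delay
    between consecutive requests, analogous to a MinDelay knob."""

    def __init__(self, post, min_delay_ms=0):
        self.post = post
        self.min_delay = min_delay_ms / 1000.0
        self._last = 0.0

    def send(self, payload):
        # Sleep just long enough that min_delay has elapsed since
        # the previous request.
        wait = self._last + self.min_delay - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()
        return self.post(payload)
```

Combined with capping parallelism, this bounds the request rate the sink presents to Amplitude.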
Comment 3 (Assignee) • 7 years ago
Amplitude's docs specify some pretty explicit recommendations about how to handle backoff logic based on error codes. I'm starting to think we need to make some structural changes and introduce an Amplitude-specific HTTPSink to get to a better end-state rather than trying to tune the parameters for a generic HTTPSink.
Comment 4 • 7 years ago
The HTTPSink is Amplitude-specific - the fact that Amplitude is probably just a bug. I will say that many of those recommendations assume the HTTP API is being used by a specific client. For example, the 429 error recovery assumes a single device, with a single device_id, is sending those requests. However, it looks like the docs have been updated since the initial work was done, so there may have been some changes.
See here: https://github.com/mozilla/telemetry-streaming/blob/master/src/main/scala/com/mozilla/telemetry/sinks/HTTPSink.scala#L20
Comment 5 • 7 years ago
Oops, "The fact that Amplitude _is not in the name_ is just a bug".
Comment 6 (Assignee) • 7 years ago
Word from caniox at Amplitude is that the batch API should be stable at this point, so I'm going to move forward with a PR to move to the /batch endpoint in our batch jobs. I will also rename the class in question to AmplitudeHTTPSink.
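Moving to the /batch endpoint mostly means grouping events into request bodies of the form `{"api_key": ..., "events": [...]}`. A sketch of the batching step (the helper name and the 1000-event default are assumptions, not values from Amplitude's docs):

```python
def build_batch_payloads(api_key, events, batch_size=1000):
    """Group events into request bodies for a batch upload endpoint.
    Each payload carries the API key plus up to batch_size events."""
    return [
        {"api_key": api_key, "events": events[i:i + batch_size]}
        for i in range(0, len(events), batch_size)
    ]
```

Each payload would then be POSTed with the retry-on-429 logic discussed earlier.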
Comment 7 (Assignee) • 7 years ago
In starting to make changes, I'm seeing that ExperimentEnrollmentsToTestTube also uses HTTPSink, so the sink appears to have Amplitude-specific defaults but can be overridden to serve other uses.
Comment 8 • 7 years ago
We should decouple Amplitude-specific parts from generic sink (Bug 1484971).
Comment 9 (Assignee) • 7 years ago
There's some uncertainty around what exactly we should do for metrics (Datadog vs. Stackdriver) that will take some time to resolve, so I'm removing 1465974 as a dependency here.
I just ran the new /batch endpoint code that includes infinite retries with backoff in case of 429 responses. I sent a day of data to Savant_Dev, validated that the logs showed no errors, and checked in the Amplitude UI that I had a number of session split events that matched the total number of pings processed (the Amplitude UI number was within 1%, and there appears to be some tail of events still being processed).
Given success in that test, I've now started a full backfill into Savant_Stg. I will do a comprehensive validation tomorrow morning and then contact Josephine to switch in the Stg project for Prod.
No longer depends on: 1465974
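The count validation described above (Amplitude total within 1% of pings processed) amounts to a simple tolerance check; a sketch, with the helper name and parameterized threshold being illustrative:

```python
def within_tolerance(observed, expected, pct=1.0):
    """Return True if observed is within pct percent of expected.
    With pct=1.0 this mirrors the 1% validation threshold used for
    comparing Amplitude event counts to pings processed."""
    return abs(observed - expected) <= expected * pct / 100.0
```

A small tail of events still being processed on the Amplitude side explains why an exact match isn't expected.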
Comment 10 (Assignee) • 7 years ago
Backfill is complete. Here are overall event counts:
- 779M events in Savant_Stg
- 823M events in Savant_Prod
The higher event count in prod seems to be explainable solely by the extra backfills we did for the "Meta - session split" events. Summing all versions of the events, we have:
- 30M in stage
- 75M in prod
So, a difference of 45M, which is very close to the 44M difference when we look at all event types.
I was expecting to see more events in Stg than Prod, so this seems strange.
Let's look at a different event, SAVANT - open tab. For this, we see the following counts:
- 106M in prod
- 104M in stage
Very close on this one, though still slightly higher in prod.
The small difference could perhaps be explained by the backfills running at different times; it may be possible for the underlying data files to change between runs. I'll need to follow up on that.
Comment 11 (Assignee) • 7 years ago
As far as errors, I downloaded all logs from the ATMO cluster where these ran. There were zero WARN or ERROR messages from HttpSink, which is what we would see if we had encountered any error codes from Amplitude:
$ zgrep 'HttpSink' j-1J4SZPVJJ1Y0G/node/*/applications/spark/spark.log.gz | grep -vc INFO
0
There are ~30 overall error messages about bad pings, but that seems insignificant out of millions of pings:
ERROR EventsToAmplitude$: Encountered an error; skipping this ping!
Comment 12 (Assignee) • 7 years ago
One potential explanation: Do we sometimes see duplicate pings separated by more than a week? Amplitude's deduplication logic forgets about IDs after 1 week, so we could have a situation where we backfilled up to 8/6, but then jobs that ran on 8/13 or later saw some pings from earlier dates that were resent, not deduplicated on the Amplitude side, and thus appear twice in Amplitude.
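That deduplication behavior can be modeled with a small sketch: copies of the same event resent within the window are dropped, while copies arriving after the window are counted again. This is a simplified model of the behavior described above, not Amplitude's actual implementation:

```python
from datetime import date

DEDUP_WINDOW_DAYS = 7  # Amplitude forgets event IDs after about a week

def count_after_dedup(submission_dates):
    """Given the submission dates of all copies of one event, count how
    many copies would survive deduplication: a resend within the window
    is dropped, a resend after the window is counted again."""
    kept = 0
    last_kept = None
    for d in sorted(submission_dates):
        if last_kept is None or (d - last_kept).days > DEDUP_WINDOW_DAYS:
            kept += 1
            last_kept = d
    return kept
```

Under this model, a ping resent on 8/13 that was already backfilled through 8/6 would count once, but a resend weeks later would count twice.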
Comment 13 (Assignee) • 7 years ago
I do see some evidence of duplicate pings that appear in different parquet files with submission_date_s3 separated by days or weeks, so that looks like a reasonable explanation for why we see slightly higher numbers in the Prod project where incremental nightly jobs would have fallen outside the 7 day deduplication window that Amplitude provides.
experiment_ids = [
    "pref-flip-savant-1457226-de-existing_users",
    "pref-flip-savant-1457226-de-new_users",
    "pref-flip-savant-1457226-en-existing_users",
    "pref-flip-savant-1457226-en-new_users",
]
experiment_ids_underscored = [x.replace('-', '_') for x in experiment_ids]

# Pull a 1% sample of main pings for the Savant experiment cohorts.
records = (
    Dataset.from_source("telemetry-cohorts")
    .where(docType='main')
    .where(experimentId=lambda x: x in experiment_ids_underscored)
    .select(
        "clientId",
        submission_date="meta.submissionDate",
        session_start="payload.info.sessionStartDate",
        session_length="payload.info.sessionLength",
    )
).records(sc, sample=0.01)

# Group pings describing the same session; more than one record per key
# means the same ping was submitted multiple times.
r = records.groupBy(lambda p: (
    p["clientId"],
    p["session_start"],
    p["session_length"],
)).cache()

# For each duplicated session, list the dates its copies were submitted.
multipledates = r.filter(lambda x: len(x[1]) > 1).mapValues(
    lambda p: sorted([x['submission_date'] for x in p]))
[y for x, y in multipledates.collect()]
Most of the results there show multiple submissions on the same day, but a handful of the entries are like `[u'20180709', u'20180825']`, with wide gaps between submissions.
Comment 14 (Assignee) • 7 years ago
Calling this closed as I think we have as much understanding of the differences here as we're going to get, and we also now have error handling that is demonstrated to avoid the 429 data dropping scenario.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated • 3 years ago
Component: Datasets: General → General