Closed Bug 1158127 (Opened 10 years ago, Closed 10 years ago)

Investigate AWS lambda errors and throttled submissions.

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: rvitillo)

Details

A significant fraction of S3 events is being throttled for telemetry_index_ping.

According to the FAQ: "On exceeding the limit, Lambda functions being invoked synchronously will return a throttling error (429 error code). Lambda functions being invoked asynchronously can absorb reasonable bursts of traffic for approximately 15-30 minutes, after which incoming events will be rejected as throttled. In case the Lambda function is being invoked in response to Amazon S3 events, events rejected by AWS Lambda may be retained and retried by S3 for 24 hours. Events from Amazon Kinesis streams and Amazon DynamoDB streams are retried until the Lambda function succeeds or the data expires. Amazon Kinesis and Amazon DynamoDB Streams retain data for 24 hours."

And according to an AWS representative: "We do not currently have a good way of providing customers with insight into the number of messages being held and retried by Amazon S3 when they are throttled by Lambda's invoke API request limit. We're aware of this deficiency in the metrics and are looking at ways of addressing it."

In other words, the AWS console doesn't tell us how many of the throttled events are still in flight. It's also not clear whether throttled events that are not retried, or that have expired, show up as errors in the Lambda metrics.

We should:

1) Add additional logging of failures, since the CloudWatch logs can't be grepped.
2) Run a nightly batch job and count the number of missing filenames (a rough sketch of such a check is below).

If the number of missing filenames is not negligible, we should move to a batch job until AWS increases the maximum concurrency level and improves metrics reporting.
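As a rough illustration of item 2, the nightly check could list the day's objects under the relevant prefix and diff them against the filenames recorded by the index. This is a minimal sketch only; the bucket name, prefix, and index dump format below are assumptions, not the actual pipeline configuration.

# Hypothetical sketch of the nightly "missing filenames" check.
# Bucket, prefix and index dump format are assumptions, not the real pipeline config.
import boto3

def list_s3_keys(bucket, prefix):
    """Yield every object key under the given prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def load_indexed_keys(index_dump_path):
    """Read the keys the lambda-backed index already knows about
    (assumed here to be one key per line in a local dump file)."""
    with open(index_dump_path) as f:
        return {line.strip() for line in f if line.strip()}

def missing_keys(bucket, prefix, index_dump_path):
    """Return the keys present on S3 but absent from the index."""
    return set(list_s3_keys(bucket, prefix)) - load_indexed_keys(index_dump_path)

if __name__ == "__main__":
    missing = missing_keys("telemetry-published-v2", "telemetry/20150424/", "index_dump.txt")
    print("%d submissions missing from the index" % len(missing))

Any keys present on S3 but absent from the index would point at throttled events that were never retried.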
The lambda function works as expected:

1) The logging revealed a few failures (~10 per day), due to files that contain dots in the buildid (FxOS).
2) The nightly batch job confirmed that no submission went missing; in fact, some indexed files (FTU pings) are no longer on S3. According to mreid, "there is a job that rewrites the FxOS ftu pings, and sometimes removes files if they only contain dupes".
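For context on the dot failures, a plausible failure mode (assumed here purely for illustration; the real filename schema may differ) is a parser that splits the S3 filename on dots to recover submission dimensions, which breaks when the buildid itself contains a dot:

# Illustrative only: an assumed dot-delimited filename layout,
# not the actual telemetry filename schema.
EXPECTED_FIELDS = ["reason", "appName", "appUpdateChannel", "appVersion", "appBuildId"]

def parse_dimensions(filename):
    parts = filename.split(".")
    if len(parts) != len(EXPECTED_FIELDS):
        # A buildid containing dots produces extra fields and lands here;
        # such files account for the handful of daily failures seen in the logs.
        raise ValueError("unexpected field count in %r" % filename)
    return dict(zip(EXPECTED_FIELDS, parts))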
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard