Closed Bug 1236942 Opened 10 years ago Closed 9 years ago

Socorro Processors running too fast for S3 to keep up

Categories

(Socorro :: Backend, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: lars, Unassigned)

Details

We've seen this before, but it's apparently happening more frequently than in the past. Crashes get sent to S3 and RabbitMQ by the crashmover. When a processor fetches a job from RabbitMQ and then tries to fetch the corresponding crash from S3, S3 frequently isn't ready yet and returns a "not found" error. The processor then gives up and moves on to the next job. Processors are supposed to retry on a "not found" exception, but they aren't doing so because that exception is missing from the list of exceptions eligible for retry. During some hours, especially at night, this means up to 11% of crashes are rejected by the processors.
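Here's a minimal sketch of that failure mode, with hypothetical names (this is not Socorro's actual processor code): only exceptions in the retriable list trigger a retry, and S3's "not found" error isn't in that list, so the job is dropped.

    import time
    import boto3

    s3 = boto3.client("s3")
    RETRIABLE = (ConnectionError,)  # note: no S3 "not found" entry

    def fetch_raw_crash(bucket, crash_id, attempts=5):
        for attempt in range(attempts):
            try:
                # key layout simplified for the sketch
                obj = s3.get_object(Bucket=bucket, Key=crash_id)
                return obj["Body"].read()
            except RETRIABLE:
                time.sleep(2 ** attempt)  # back off and retry
            # a botocore ClientError with code "NoSuchKey" is *not* caught
            # here, so it propagates and the job is rejected, not retried
        raise RuntimeError("gave up after %d attempts" % attempts)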
Is it at all possible that processor ids get queued and consumed before the upload returns a 200? I chatted with Leo, a Solutions Architect at AWS, and here's what he had to say.

TL;DR: The files should be available immediately upon the write completing, and if we're seeing that's not the case, we'll open a ticket with AWS.

-------
Leo: Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. So if you are overwriting, it will be eventually consistent; if it's a new item, it is read-after-write consistent. us-standard used to be eventually consistent for puts too, but not anymore.
me: ok. my fellas are reporting a bit of delay sometimes in it being available
Leo: it won't give you a 200 OK on a new item until it's available
me: i'm trying to pull up metrics to see if i can prove or disprove
Leo: if you see otherwise, you should definitely cut a support ticket, as it's not normal behavior
me: well, what does being available mean?
Leo: available means being able to download it
me: like, my app is done writing, 1ms later i go to read, should be there, right? or is there an expected few second delay as it replicates across buildings or something?
Leo: yes, if you got a 200 on the write request, it should be available for a read as soon as the 200 comes in, without delay
-------
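For reference, here's the write-then-immediately-read pattern under discussion as a minimal boto3 sketch (the bucket name is made up): per Leo, if the PUT of a brand-new key returns 200, the GET should succeed right away.

    import uuid
    import boto3

    s3 = boto3.client("s3")
    bucket = "example-crash-bucket"  # hypothetical bucket
    key = str(uuid.uuid4())          # brand-new key, never overwritten

    # PUT a new object; read-after-write consistency should apply.
    s3.put_object(Bucket=bucket, Key=key, Body=b"payload")

    # Per the conversation above, this GET should succeed immediately.
    assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == b"payload"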
So the fix is as simple as adding the S3 "not found" exception to the list of exceptions that trigger an infinite-with-backoff (or something like that) retry in the processor. We don't do overwrite PUTS, since the crash ID is always unique for each new crash.
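Something along these lines, as a sketch (the helper names are made up; Socorro's real retry machinery is structured differently): treat S3's "not found" as retriable and keep retrying with capped exponential backoff.

    import time
    from botocore.exceptions import ClientError

    def is_not_found(exc):
        # S3 reports a missing key as a ClientError with code "NoSuchKey"
        return (isinstance(exc, ClientError)
                and exc.response["Error"]["Code"] == "NoSuchKey")

    def fetch_with_backoff(fetch, *args, max_wait=30):
        wait = 1
        while True:  # the "infinite-with-backoff" retry
            try:
                return fetch(*args)
            except ClientError as exc:
                if not is_not_found(exc):
                    raise
                time.sleep(wait)
                wait = min(wait * 2, max_wait)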
There is a "retry" solution in place, but it is broken (probably written by me). Since the reordering of the crashstores in the crashmover, the error has not recurred even once.
We rewrote the Socorro collector, so the infrastructure is different now. Antenna saves the crash to S3. When the raw crash is saved, the PutObject event triggers an AWS Lambda function (Pigeon) that looks at the crash id and adds it to RabbitMQ if it needs to be processed. This way, we no longer have the race condition we had before. Given that, I think this is FIXED now.
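For illustration only, an S3-event-triggered Lambda along those lines might look like the sketch below. This is not Pigeon's actual source; the hostname, queue name, and the use of pika are all assumptions.

    import urllib.parse
    import pika  # RabbitMQ client; assumed here, not necessarily what Pigeon uses

    def handler(event, context):
        # Invoked by S3 PutObject events; queue each new crash id for processing.
        connection = pika.BlockingConnection(
            pika.ConnectionParameters(host="rabbitmq.example.com"))  # hypothetical host
        channel = connection.channel()
        channel.queue_declare(queue="socorro.normal", durable=True)  # hypothetical queue
        for record in event["Records"]:
            key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
            crash_id = key.rsplit("/", 1)[-1]  # assume crash id is the last key segment
            channel.basic_publish(exchange="", routing_key="socorro.normal",
                                  body=crash_id)
        connection.close()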
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED