Closed Bug 922906 Opened 12 years ago Closed 12 years ago

Processors crash when internal queues get out of sync with RabbitMQ assigned jobs

Categories

(Socorro :: Backend, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brandon, Assigned: brandon)

Details

(Whiteboard: [qa-])

The internal queue that the processor maintains is responsible for ACKing jobs when they are complete. If that internal queue gets out of sync with the actual jobs assigned by RabbitMQ, and a job is submitted for an ACK that isn't in the internal queue, a KeyError is raised. Due to a programming error, the line that logs this exception contains a syntax error, which crashes the processors. See the logs: https://gist.github.com/selenamarie/60504d73044708056837
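A minimal sketch of the failure mode described above. The class and method names here are illustrative, not Socorro's actual API: an internal dict maps crash uuids to RabbitMQ ack tokens, and asking it to ACK a uuid it no longer tracks raises KeyError. In the buggy code the logging call in the except block itself had a syntax error, so the exception escaped and killed the processor; with a well-formed logging call the processor survives:

```python
import logging

logger = logging.getLogger("processor")

class AckTokenStore:
    """Hypothetical stand-in for the processor's internal ack-token queue."""

    def __init__(self):
        self._tokens = {}  # crash uuid -> RabbitMQ delivery tag

    def add(self, uuid, token):
        self._tokens[uuid] = token

    def ack(self, uuid):
        # Raises KeyError when the internal store is out of sync with
        # the jobs RabbitMQ actually assigned.
        return self._tokens.pop(uuid)

store = AckTokenStore()
store.add("crash-1", 101)
store.ack("crash-1")           # normal case: token found and removed

try:
    store.ack("crash-1")       # out of sync: uuid is no longer tracked
except KeyError:
    # The fix is simply that this logging line is now syntactically
    # valid, so the KeyError is recorded instead of crashing the process.
    logger.warning("no ack token found for uuid %s", "crash-1")
```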
Commits pushed to master at https://github.com/mozilla/socorro

https://github.com/mozilla/socorro/commit/c977a9545274f0c3355da56443f16684abd36a92
Fixes Bug 922906 - Resolves syntax error that caused processors to fail when queues are out of sync.

https://github.com/mozilla/socorro/commit/992f93e98d892e9d0e21acbcfb05576905c9247b
Merge pull request #1555 from brandonsavage/logfix2
Fixes Bug 922906 - Resolves syntax error that caused processors to fail when queues are out of sync.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Brandon Savage [:brandon] from comment #0)
> The internal queue that the processor maintains is responsible for ACKing
> jobs when they are complete. If that internal queue gets out of sync with
> the actual jobs assigned by RabbitMQ, and a job is submitted for an ACK that
> isn't in the internal queue, a KeyError is raised.

How do they get out of sync? BTW, I am able to reproduce this behavior (e.g. I can cause prod processors to crash) using only public interfaces.

Lars and I looked at this overall issue a bit today. There is nothing stopping priority jobs from being inserted into the queue many times, and it's very likely that the same processor will pick them up. In particular, the comment on https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/postgresql/priorityjobs.py#L51 seems to be a lie :) I am going to file a separate bug to fix that and add some logging in a few places throughout the system.

All that said, the patch in this bug looks sufficient to fix the problem. It will still happen that multiple threads work on the same crash at the same time, but the exception will be caught and logged instead of crashing because of a syntax error in the line doing the logging. The logging patch I mention above is so that we can trace a crash through the system and see if this ever happens for *non-priority* jobs, which should *not* be possible according to our analysis.
(In reply to Robert Helmer [:rhelmer] from comment #2)
> In particular the comment on
> https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/postgresql/priorityjobs.py#L51
> seems to be a lie :) I am going to file a separate bug to fix that and add
> some logging in a few places throughout the system.

I'm sorry that I have failed - I meant to link https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/rabbitmq/priorityjobs.py#L44
Filed bug 924007 to follow up with more logging and comment correction.
We use the uuid as the key for the internal ack token store. Keys in dicts have to be unique, but RabbitMQ is capable of having the same uuid more than once in a queue. As a result, if the same processor gets the same crash more than once, it can only internally track the ack token for the last crash it was given. However, when we ack back to RabbitMQ, we delete the uuid from the internal ack token store. So, when the processor finishes processing the crash for a second time, there's no ack token in the internal token store, causing the processor's internal queue to be out of sync with the jobs it was assigned.
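The root cause above can be demonstrated with a plain dict (an illustrative sketch, not Socorro code): because dict keys are unique, a duplicate delivery of the same crash uuid overwrites the first ack token, so ACKing the first copy leaves nothing behind for the second copy:

```python
tokens = {}  # crash uuid -> RabbitMQ delivery tag

# RabbitMQ delivers the same crash uuid twice, with distinct delivery tags.
tokens["abc-123"] = 1  # first delivery
tokens["abc-123"] = 2  # duplicate delivery silently overwrites token 1

# Finishing the first copy of the crash ACKs it and removes the uuid...
tokens.pop("abc-123")

# ...so finishing the second copy finds no token left to ACK:
try:
    tokens.pop("abc-123")
except KeyError:
    print("out of sync: no ack token left for abc-123")
```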
Whiteboard: [qa-]