Bug 922906 (Closed)
Opened 12 years ago • Closed 12 years ago
Processors crash when internal queues get out of sync with RabbitMQ assigned jobs
Categories: Socorro :: Backend, task
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: (Reporter: brandon, Assigned: brandon)
Details: (Whiteboard: [qa-])

Description
The internal queue that the processor maintains is responsible for ACKing jobs when they are complete. If that internal queue gets out of sync with the actual jobs assigned by RabbitMQ, and a job is submitted for an ACK that isn't in the internal queue, a KeyError is raised.
Due to a programming error, the line that logs this KeyError contains a syntax error of its own, so instead of recording the problem it crashes the processors. See the logs: https://gist.github.com/selenamarie/60504d73044708056837
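To illustrate the failure mode, here is a hypothetical reconstruction (these names are illustrative, not Socorro's actual identifiers). The report calls the logging bug a syntax error; one common way a log line blows up at runtime is eager %-formatting with mismatched arguments, which is what this sketch uses:

import logging

logger = logging.getLogger(__name__)
ack_token_cache = {}  # crash uuid -> RabbitMQ delivery tag

def ack_crash(channel, uuid):
    try:
        # KeyError here when the internal token store is out of sync
        # with the jobs RabbitMQ actually assigned:
        delivery_tag = ack_token_cache.pop(uuid)
        channel.basic_ack(delivery_tag=delivery_tag)
    except KeyError:
        # Broken error path: two placeholders but only one argument,
        # formatted eagerly, so this line raises TypeError inside the
        # except block and crashes the processor instead of logging.
        logger.error('no ack token for %s in %s' % uuid)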
Comment 1•12 years ago
Commits pushed to master at https://github.com/mozilla/socorro
https://github.com/mozilla/socorro/commit/c977a9545274f0c3355da56443f16684abd36a92
Fixes Bug 922906 - Resolves syntax error that caused processors to fail when queues are out of sync.
https://github.com/mozilla/socorro/commit/992f93e98d892e9d0e21acbcfb05576905c9247b
Merge pull request #1555 from brandonsavage/logfix2
Fixes Bug 922906 - Resolves syntax error that caused processors to fail when queues are out of sync.
Updated•12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 2•12 years ago
(In reply to Brandon Savage [:brandon] from comment #0)
> The internal queue that the processor maintains is responsible for ACKing
> jobs when they are complete. If that internal queue gets out of sync with
> the actual jobs assigned by RabbitMQ, and a job is submitted for an ACK that
> isn't in the internal queue, a KeyError is raised.
How do they get out of sync? BTW, I am able to reproduce this behavior (i.e. I can cause the prod processors to crash) just using public interfaces.
Lars and I looked at this overall issue a bit today. There is nothing stopping priority jobs from being inserted into the queue many times, and it's very likely that the same processor will pick them up. In particular, the comment on https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/postgresql/priorityjobs.py#L51 seems to be a lie :) I am going to file a separate bug to fix that and add some logging in a few places throughout the system.
However, all that said, the patch in this bug looks sufficient to fix the problem. It will still happen that multiple threads work on the same crash at the same time, but the exception will be caught and logged instead of crashing the processor because of a syntax error in the line doing the logging. The logging patch I mention above is so that we can trace a crash through the system and see if this ever happens for *non-priority* jobs, which should *not* be possible according to our analysis.
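For concreteness, a hedged sketch of what "caught and logged instead of crashing" looks like -- hypothetical names again, not the actual patch. The key change from the broken version shown in the description is a well-formed, lazily-formatted logging call inside the except block:

import logging

logger = logging.getLogger(__name__)
ack_token_cache = {}  # crash uuid -> RabbitMQ delivery tag

def ack_crash(channel, uuid):
    try:
        delivery_tag = ack_token_cache.pop(uuid)
        channel.basic_ack(delivery_tag=delivery_tag)
    except KeyError:
        # Lazy %-formatting: the logging module fills in the argument
        # itself, so an unknown or duplicate uuid is recorded and the
        # worker thread survives.
        logger.error('no ack token found for crash %s', uuid)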
Comment 3•12 years ago
(In reply to Robert Helmer [:rhelmer] from comment #2)
> In particular the comment on
> https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/postgresql/priorityjobs.py#L51
> seems to be a lie :) I am going to file a separate bug to fix that and add
> some logging in a few places throughout the system.
I'm sorry that I have failed - I meant to link https://github.com/mozilla/socorro/blob/b753af737016d9f19313cffd55fa8704f8a61976/socorro/external/rabbitmq/priorityjobs.py#L44
Comment 4•12 years ago
Filed bug 924007 to follow up with more logging and comment correction.
Assignee
Comment 5•12 years ago
We use the uuid as the key for the internal ack token store. Keys in a dict have to be unique, but RabbitMQ is capable of having the same uuid in a queue more than once. As a result, if the same processor is given the same crash more than once, it can only track the ack token for the most recent delivery; the earlier token is overwritten.
However, when we ack back to RabbitMQ, we delete the uuid from the internal ack token store. So when the processor finishes processing the crash a second time, there's no ack token left in the internal store, and the processor's internal queue is out of sync with the jobs it was assigned.
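A minimal walkthrough of that sequence, with a hypothetical uuid and delivery tags and a plain dict standing in for the internal token store:

ack_token_cache = {}

# RabbitMQ delivers the same crash uuid twice to one processor; the
# second delivery tag silently overwrites the first in the dict:
ack_token_cache['a1b2c3'] = 7   # first delivery
ack_token_cache['a1b2c3'] = 9   # second delivery; tag 7 is lost

# The first processing pass finishes and acks, deleting the uuid:
ack_token_cache.pop('a1b2c3')   # acks with tag 9

# The second pass finishes and tries to ack the same uuid:
ack_token_cache.pop('a1b2c3')   # raises KeyError: 'a1b2c3'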
Updated•12 years ago
Whiteboard: [qa-]