Closed Bug 822106
Opened 12 years ago
Closed 12 years ago
[socorro production] processors can't connect to postgres
Categories: mozilla.org Graveyard :: Server Operations (task)
Status: RESOLVED FIXED
People: (Reporter: rhelmer, Assigned: rbryce)
Processors have been reporting postgres failures all morning; can we try restarting the socorro-processor service on sp-processor*?

Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,671 CRITICAL - Thread-21 - something's gone horribly wrong with the database connection
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - Caught Error: <class 'psycopg2.OperationalError'>
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - trace back follows:
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - Traceback (most recent call last):
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob
	threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId))
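The traceback above shows a cursor.execute() call dying with psycopg2.OperationalError after postgres dropped the connection. A common mitigation (not Socorro's actual code; everything below is a hypothetical, self-contained sketch using a stand-in error class instead of psycopg2) is to catch that error, discard the dead connection, reconnect, and retry the statement a bounded number of times:

```python
class OperationalError(Exception):
    """Stand-in for psycopg2.OperationalError (assumption: real code would catch that)."""


class FlakyConnection:
    """Toy connection that fails its first `failures` execute calls, then succeeds."""

    def __init__(self, failures):
        self.failures = failures

    def execute(self, sql, params):
        if self.failures > 0:
            self.failures -= 1
            raise OperationalError("server closed the connection unexpectedly")
        return "ok"


def execute_with_retry(connect, sql, params, attempts=3):
    """Run one statement, reconnecting and retrying on a dropped connection."""
    conn = connect()
    for attempt in range(attempts):
        try:
            return conn.execute(sql, params)
        except OperationalError:
            if attempt == attempts - 1:
                raise            # out of retries; let the caller log CRITICAL
            conn = connect()     # discard the dead connection and reconnect


# Demo: the first execute fails, the retry after "reconnecting" succeeds.
conn = FlakyConnection(failures=1)
result = execute_with_retry(
    lambda: conn,
    "update jobs set starteddatetime = %s where id = %s",
    ("2012-12-16 05:41:50", 42),
)
print(result)
```

A bounded retry like this turns a transient connection drop into a recoverable event instead of a CRITICAL failure, while still surfacing a persistent outage after the last attempt.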
Reporter
Updated 12 years ago
Severity: normal → blocker
Assignee
Updated 12 years ago
Assignee: server-ops-webops → rbryce
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
Assignee
Comment 1 • 12 years ago
Kicked processors and socorro monitor from sp-admin01
Comment 2 • 12 years ago
Here's my take on what happened. The logging rollover system started this problem between 3am and 4am this morning, and it appears to have hit the processors and the monitor in essentially random order. Because the monitor noticed that processors were dropping out, it started to reassign their jobs to other processors. Then it got slapped with its own cease and desist order, in the form of a SIGTERM, while it was in the middle of a complicated job-reassignment transaction. Trying to complete its work, it did not respond fast enough for whatever was telling it to stop, so it was likely hit with SIGKILL. Its transaction was left hanging in postgres. When the monitor restarted, it ran into its own hanging transaction and deadlocked.

Solution: during log rotation, always do the monitor first and, once that is done, proceed with the processors. Audit the transaction in the monitor that reassigns jobs; make sure it has the ability to halt quickly and roll back its transaction. See https://bugzilla.mozilla.org/show_bug.cgi?id=822119 for changes to the monitor.
Assignee
Comment 3 • 12 years ago
#breakpad reports that all the processors are running properly.
Assignee
Updated 12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated 9 years ago
Product: mozilla.org → mozilla.org Graveyard