Closed Bug 822106 Opened 12 years ago Closed 12 years ago

[socorro production] processors can't connect to postgres

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rhelmer; Assigned: rbryce

Processors have been reporting postgres failures all morning; can we try restarting the socorro-processor service on sp-processor*?

Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,671 CRITICAL - Thread-21 - something's gone horribly wrong with the database connection
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - Caught Error: <class 'psycopg2.OperationalError'>
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - server closed the connection unexpectedly#012#011This probably means the server terminated abnormally#012#011before or while processing the request.
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - trace back follows:
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - Traceback (most recent call last):
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob#012    threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId))
Severity: normal → blocker
Assignee: server-ops-webops → rbryce
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
Kicked processors and socorro monitor from sp-admin01.
Here's my take on what happened.  The log rotation system started this problem between 3am and 4am this morning.  It appears to have hit the processors and the monitor in essentially random order.  Because the monitor noticed that processors were dropping out, it started reassigning their jobs to other processors.  Then it got slapped with its own cease and desist order in the form of a SIGTERM while it was in the middle of the complicated job-reassignment transaction.  Trying to complete its work, it did not respond fast enough for whatever was telling it to stop and so was likely hit with SIGKILL.  Its transaction was left hanging in postgres.  When the monitor restarted, it ran into its own hanging transaction and deadlocked.
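
For what it's worth, a hanging transaction like that should show up in pg_stat_activity as a backend sitting "<IDLE> in transaction". Here is a rough diagnostic sketch, not Socorro code: the DSN is made up, and the column names assume a pre-9.2 postgres (newer releases use pid, state and query instead of procpid and current_query):

import psycopg2

# Connect to the crash database; this DSN is hypothetical.
conn = psycopg2.connect("dbname=breakpad user=postgres")
cur = conn.cursor()
# List backends that opened a transaction and then went idle without
# committing or rolling back, which is what an orphaned monitor
# transaction would look like.
cur.execute("""
    select procpid, usename, xact_start
    from pg_stat_activity
    where current_query = '<IDLE> in transaction'
    order by xact_start
""")
for pid, user, started in cur.fetchall():
    print("%s %s %s" % (pid, user, started))
cur.close()
conn.close()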

Solution: during log rotation, always do the monitor first, and once that is done, proceed with the processors.  Audit the transaction in the monitor that reassigns jobs.  Make sure it has the ability to halt quickly and roll back its transaction.
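
Here is roughly what "halt quickly and roll back" could look like in the monitor's reassignment loop. This is only a sketch, not the actual monitor code: the table, column names, and jobs_to_reassign list are hypothetical. The idea is to trap SIGTERM, stop at the next iteration, and roll back the partial transaction instead of racing the SIGKILL that follows an ignored SIGTERM.

import signal
import psycopg2

quit_requested = False

def request_quit(signum, frame):
    # Only set a flag here; cleanup happens in the main loop so the
    # process never exits with a transaction still open.
    global quit_requested
    quit_requested = True

signal.signal(signal.SIGTERM, request_quit)

conn = psycopg2.connect("dbname=breakpad")   # hypothetical DSN
jobs_to_reassign = [(1, 101), (2, 102)]      # hypothetical (job_id, new_owner) pairs
try:
    cur = conn.cursor()
    for job_id, new_owner in jobs_to_reassign:
        if quit_requested:
            # Undo the partial reassignment and get out quickly.
            conn.rollback()
            break
        cur.execute("update jobs set owner = %s where id = %s",
                    (new_owner, job_id))
    else:
        # Only commit if we got through the whole batch.
        conn.commit()
finally:
    conn.close()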

see https://bugzilla.mozilla.org/show_bug.cgi?id=822119 for changes to the monitor
#breakpad reports that all the processors are running properly.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard