Closed Bug 822106 Opened 12 years ago Closed 12 years ago

[socorro production] processors can't connect to postgres

Categories: mozilla.org Graveyard :: Server Operations
Type: task
Priority: Not set
Severity: blocker
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rhelmer; Assigned: rbryce

Processors have been reporting postgres failures all morning; can we try restarting the socorro-processor service on sp-processor*?

Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,671 CRITICAL - Thread-21 - something's gone horribly wrong with the database connection
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - Caught Error: <class 'psycopg2.OperationalError'>
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,672 CRITICAL - Thread-21 - server closed the connection unexpectedly#012#011This probably means the server terminated abnormally#012#011before or while processing the request.
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - trace back follows:
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - Traceback (most recent call last):
Dec 16 05:41:50 Socorro Processor (pid 31780): 2012-12-16 05:41:50,673 CRITICAL - Thread-21 - File "/data/socorro/application/socorro/processor/processor.py", line 526, in processJob#012    threadLocalCursor.execute("update jobs set starteddatetime = %s where id = %s", (startedDateTime, jobId))
Severity: normal → blocker
Assignee: server-ops-webops → rbryce
Component: Server Operations: Web Operations → Server Operations
QA Contact: nmaul → shyam
Kicked processors and socorro monitor from sp-admin01.
Here's my take on what happened.  The log rotation system started this problem between 3am and 4am this morning.  It appears to have hit the processors and the monitor in essentially random order.  Because the monitor noticed that processors were dropping out, it started reassigning their jobs to other processors.  Then it got slapped with its own cease and desist order in the form of a SIGTERM while it was in the middle of the complicated job-reassignment transaction.  Trying to complete its work, it did not respond fast enough for whatever was telling it to stop and so was likely hit with SIGKILL.  Its transaction was left hanging in postgres.  When the monitor restarted, it ran into its own hanging transaction and deadlocked.
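
For what it's worth, a hanging transaction like that should show up in pg_stat_activity as a backend sitting "<IDLE> in transaction". Here is a rough diagnostic sketch, not Socorro code: the DSN is made up, and the column names assume a pre-9.2 postgres (newer releases use pid, state and query instead of procpid and current_query):

import psycopg2

# Connect to the crash database; this DSN is hypothetical.
conn = psycopg2.connect("dbname=breakpad user=postgres")
cur = conn.cursor()
# List backends that opened a transaction and then went idle without
# committing or rolling back, which is what an orphaned monitor
# transaction would look like.
cur.execute("""
    select procpid, usename, xact_start
    from pg_stat_activity
    where current_query = '<IDLE> in transaction'
    order by xact_start
""")
for pid, user, started in cur.fetchall():
    print("%s %s %s" % (pid, user, started))
cur.close()
conn.close()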

Solution: during log rotation, always do the monitor first, and once that is done, proceed with the processors.  Audit the transaction in the monitor that reassigns jobs.  Make sure it has the ability to halt quickly and roll back its transaction.
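
Here is roughly what "halt quickly and roll back" could look like in the monitor's reassignment loop. This is only a sketch, not the actual monitor code: the table, column names, and jobs_to_reassign list are hypothetical. The idea is to trap SIGTERM, stop at the next iteration, and roll back the partial transaction instead of racing the SIGKILL that follows an ignored SIGTERM.

import signal
import psycopg2

quit_requested = False

def request_quit(signum, frame):
    # Only set a flag here; cleanup happens in the main loop so the
    # process never exits with a transaction still open.
    global quit_requested
    quit_requested = True

signal.signal(signal.SIGTERM, request_quit)

conn = psycopg2.connect("dbname=breakpad")   # hypothetical DSN
jobs_to_reassign = [(1, 101), (2, 102)]      # hypothetical (job_id, new_owner) pairs
try:
    cur = conn.cursor()
    for job_id, new_owner in jobs_to_reassign:
        if quit_requested:
            # Undo the partial reassignment and get out quickly.
            conn.rollback()
            break
        cur.execute("update jobs set owner = %s where id = %s",
                    (new_owner, job_id))
    else:
        # Only commit if we got through the whole batch.
        conn.commit()
finally:
    conn.close()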

see https://bugzilla.mozilla.org/show_bug.cgi?id=822119 for changes to the monitor
#breakpad reports that all the processors are running properly.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard