processor hangs when there is a database connection problem

RESOLVED FIXED

Status

Product: Socorro
Component: General
Status: RESOLVED FIXED
Reported: 7 years ago
Modified: 3 years ago

People

(Reporter: rhelmer, Unassigned)

Tracking

Trunk
x86_64
Linux

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

7 years ago
We've been seeing this over the weekend: the processors seem to hang. We're not sure what is causing the underlying problem yet, but processors should either quit or retry if the database is not reachable:

2011-06-19 23:59:20,001 ERROR - MainThread - trace back follows:
2011-06-19 23:59:20,039 ERROR - MainThread - Traceback (most recent call last):
2011-06-19 23:59:20,040 ERROR - MainThread - File "/data/socorro/application/scripts/startProcessor.py", line 34, in <module>
    p.start()
2011-06-19 23:59:20,041 ERROR - MainThread - File "/data/socorro/application/socorro/processor/processor.py", line 482, in start
    for aJobTuple in self.incomingJobStream():
2011-06-19 23:59:20,041 ERROR - MainThread - File "/data/socorro/application/socorro/processor/processor.py", line 455, in incomingJobStream
    aJobTuple = priorityJobIter.next()
2011-06-19 23:59:20,042 ERROR - MainThread - File "/data/socorro/application/socorro/processor/processor.py", line 386, in newPriorityJobsIter
    getPriorityJobsSql)
2011-06-19 23:59:20,042 ERROR - MainThread - File "/data/socorro/application/socorro/database/database.py", line 33, in f
    result = fn(self, *args, **kwargs)
2011-06-19 23:59:20,043 ERROR - MainThread - File "/data/socorro/application/socorro/database/database.py", line 85, in transaction_execute_with_retry
    connection = database_connection_pool.connection()
2011-06-19 23:59:20,043 ERROR - MainThread - File "/data/socorro/application/socorro/database/database.py", line 186, in connection
    return self.setdefault(name, self.database.connection())
2011-06-19 23:59:20,044 ERROR - MainThread - File "/data/socorro/application/socorro/database/database.py", line 170, in connection
    raise CannotConnectToDatabase(x)
2011-06-19 23:59:20,044 ERROR - MainThread - CannotConnectToDatabase: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
2011-06-19 23:59:20,045 INFO - MainThread - done.

Comment 1

We've been seeing this problem ever since we moved the processor to PGBouncer.  Interestingly, it seems to be happening only on the MainThread and never on any of the worker threads.  There must be a behavior of the MainThread that PGBouncer doesn't like.

The code that does the MainThread's three database interactions is wrapped with a transaction retry decorator.  If an error occurs within a transaction and that error is a psycopg2.OperationalError, then the transaction will be retried.

The processors do their own connection pooling keyed on the thread name requesting the connection.  The connection pooling, on failing to connect to the database, captures the exception and raises its own CannotConnectToDatabase exception.
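
A rough sketch of that pooling behavior, for illustration only. This is not the actual socorro/database/database.py code: the class name ConnectionPool and the connect argument are hypothetical, and unlike the setdefault call visible in the traceback, this sketch only opens a connection when the thread has none cached.

```python
import threading


class CannotConnectToDatabase(Exception):
    """Raised by the pool when the underlying connect call fails."""


class ConnectionPool(dict):
    """Hypothetical pool: one cached connection per thread, keyed on thread name."""

    def __init__(self, connect):
        super().__init__()
        # `connect` is an assumed zero-argument callable that opens a connection.
        self._connect = connect

    def connection(self, name=None):
        if name is None:
            name = threading.current_thread().name
        if name not in self:
            try:
                self[name] = self._connect()
            except Exception as x:
                # Capture the driver's error and raise the pool's own exception,
                # which is why a retry check for OperationalError never sees it.
                raise CannotConnectToDatabase(x)
        return self[name]
```

Because the original error is wrapped, any retry logic that only looks for psycopg2.OperationalError will treat a failed connect as non-retryable.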

The transaction retry decorator could add that CannotConnectToDatabase exception to its list of exceptions eligible for retry.  With that change, on encountering this problem, the processor will retry the connection.  If retrying eventually succeeds, then the processor can continue to run.  If there is a problem with PGBouncer, this change would mask it from our view.  If the processor doesn't succeed in reconnecting, it will eventually stop doing any other work and focus all of its efforts on retrying the connection every five minutes.
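
The proposed change could look something like the following minimal sketch. It is not the real transaction_execute_with_retry code: retry_transaction, its delays and sleep parameters, and the stand-in OperationalError class (used so the sketch runs without psycopg2 installed) are all assumptions made for illustration. The delay schedule is arbitrary and does not reproduce the five-minute behavior described above.

```python
import functools
import time


class OperationalError(Exception):
    """Stand-in for psycopg2.OperationalError (assumption, to avoid the import)."""


class CannotConnectToDatabase(Exception):
    """Stand-in for the connection pool's own connection-failure exception."""


def retry_transaction(retryable=(OperationalError, CannotConnectToDatabase),
                      delays=(1, 5, 30), sleep=time.sleep):
    """Retry the wrapped function when it raises one of the `retryable` exceptions.

    `delays` lists the seconds to wait after each failed attempt; the values
    here are hypothetical.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for delay in delays:
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    # Connection-level failure: wait, then try again.
                    sleep(delay)
            # Final attempt: any exception now propagates to the caller.
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

With CannotConnectToDatabase in the retryable tuple, a failed connect is handled the same way as an OperationalError raised inside a transaction. Exceptions outside the tuple still propagate immediately.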

I advise making this change and seeing what happens.  Why don't we ever see this in staging?
(Reporter)

Comment 2

7 years ago
(In reply to comment #1)
> I advise making this change and seeing what happens.  Why don't we ever see
> this in staging?

We have been seeing it in staging over the weekend, sorry I was not clear about this in comment 0.
(Assignee)

Updated

7 years ago
Component: Socorro → General
Product: Webtools → Socorro
Fixed long ago
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED