Closed Bug 550535 Opened 15 years ago Closed 15 years ago

Sisyphus - fail to reconnect to database

Categories

(Testing Graveyard :: Sisyphus, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(4 files, 1 obsolete file)

Attached patch patch (obsolete) — Splinter Review
Currently, database related function calls attempt to filter connection related exceptions by looking for conn_request in the exception text. This can result in failures to recover from database connection errors when different exceptions are thrown. Also, sisyphus.couchdb.connectToDatabase() does not terminate until a connection is reestablished. This can cause problems if the couchdb call enters a permanent failure mode independent of the actual database availability. This patch eliminates the exception check for conn_request and the limits the number of attempts that connectToDatabase() will attempt to reconnect before failing.
Attachment #430682 - Flags: review?(ctalbert)
Attached patch patch v2Splinter Review
I missed some 'conn_request' conditionals in the worker.
Attachment #430682 - Attachment is obsolete: true
Attachment #431112 - Flags: review?(ctalbert)
Attachment #430682 - Flags: review?(ctalbert)
Attachment #431112 - Flags: review?(ctalbert) → review+
The only issue I see here (and I can't really see it in the diff view) is that removing some of these "raise" statements might have some unintended consequences, so just be sure this is well tested. I think it is a good approach though, overall to do the connection retries this way instead of guessing at "which error should I reconnect on"
I'll double check tonight. Definitely the CouchDBExceptions such as update or delete conflicts should be handled properly and not temporarily ignored. The right way to do this is to audit couchquery and httplib2 for all connection related exceptions and only retry for them. Or I could check for /couchquery/ or /httplib2/ in the exceptionValue. That might be the preferred method. See http://qm-crash-linux01:5984/crashtest/_design/default/_view/logs?startkey=%222010-03-08T08%22 for this list of exceptions since I pushed these recent changes to the server. Let me redo this with the path checks and see how that behaves.
Attached patch patch v3Splinter Review
Ok, I think this is more robust.
Attachment #431367 - Flags: review?(ctalbert)
Comment on attachment 431367 [details] [diff] [review] patch v3 I think so too.
Attachment #431367 - Flags: review?(ctalbert) → review+
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
When a connection related error occurs, recreate the httplib2 Http object.
Attachment #442033 - Flags: review?(ctalbert)
Attachment #442033 - Flags: review?(ctalbert) → review+
This doesn't quite work when the scripts restart and the last logMessage can not recover. The fix is to recreate the couchquery Database object when the database connection fails.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch patch ultimoSplinter Review
I tested this by running several workers and killing the couchdb server then restarting it after a delay. They do reconnect fine.
Attachment #452081 - Flags: review?(ctalbert)
Attachment #452081 - Flags: review?(ctalbert) → review+
Status: REOPENED → RESOLVED
Closed: 15 years ago15 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: