Closed Bug 857323 Opened 11 years ago Closed 6 years ago

Sync server SQLAlchemy connection pool fails to recycle stale DB connections.

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P4)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bobm, Unassigned)

References

Details

(Whiteboard: [qa+])

Attachments

(6 files)

When a database connection from a connection pool becomes stale, it is not always recycled.  This manifests as an error such as the following in the application log: BackendError: GET /1.1/SYNCUSER/storage/bookmarks.  The problem is easily reproduced by recycling a backend database under load.
Whiteboard: [qa+]
As a test, the staging pool_recycle setting was set to 60 seconds.  This shorter recycle time did not clear stale pool connections after a disruptive database restart.
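For reference, a minimal sketch of what that setting means at the SQLAlchemy level (the DSN below is a placeholder, not the staging configuration): any pooled connection older than pool_recycle seconds should be discarded and replaced at checkout time.

from sqlalchemy import create_engine

# Hypothetical DSN; only the pool_recycle value reflects the test above.
engine = create_engine(
    "mysql://sync:secret@db-stage/sync",
    pool_recycle=60,  # discard pooled connections older than 60 seconds at checkout
)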
A couple possibilities here:

1) the config variable name for pool_recycle is wrong. That's bitten us in the past. Ryan, can you confirm?

2) This doesn't look like a reap failure. This looks like the connections being held, such that they aren't released back to the pool in a way that would allow them to be reaped at a later point. So, the DB goes down, but the event workers fail to realize that their db connection is gone, so they sit there hanging onto the invalid connection waiting for a response.

Could we split into a pool of readers and a pool of writers? Reads could force-timeout a lot faster.
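A rough sketch of that reader/writer split, assuming two separate engines; the DSNs and numbers are illustrative, not a concrete proposal:

from sqlalchemy import create_engine

# Writes keep the usual, more patient pool settings.
write_engine = create_engine(
    "mysql://sync:secret@db-primary/sync",  # hypothetical DSN
    pool_size=10,
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1200,
)

# Reads go to a separate pool that is allowed to fail fast.
read_engine = create_engine(
    "mysql://sync:secret@db-replica/sync",  # hypothetical DSN
    pool_size=10,
    pool_timeout=5,     # force-timeout reads much sooner
    pool_recycle=1200,
)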
Comments 4, 5, and 6 contain the MySQL slow query analysis for 2013-04-02.  The slow query time limit is currently defined as greater than 10 seconds in Stage.  None of the other Sync staging database hosts logged slow queries during yesterday's load test.
Judging from the staging Connection Pool Failures graph, the pool_recycle setting is correct.  New connections are spaced by the expected interval when the setting is twenty minutes, and are nearly constant when it is set to one minute.
> 1) the config variable name for pool_recycle is wrong. That's bitten us in the
> past. Ryan, can you confirm?

AFAICT the name "pool_recycle" is correct.
I can reproduce this locally by running a pool with pool_size=1, accessing it in a tight loop, and restarting mysql.  It does receive an error "2013 Lost connection to MySQL server during query", but then doesn't invalidate the connection or return it to the pool.  Very strange.
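Rough reproduction sketch, in case anyone else wants to try it (placeholder DSN; restart mysqld in another terminal while this runs):

import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import DBAPIError

engine = create_engine(
    "mysql://sync:secret@localhost/test",  # hypothetical DSN
    pool_size=1,
    max_overflow=0,
)

while True:
    try:
        conn = engine.connect()
        try:
            conn.execute(text("SELECT 1"))
        finally:
            conn.close()
    except DBAPIError as err:
        # After restarting MySQL you see "2013 Lost connection to MySQL server
        # during query" here; a healthy pool would invalidate that connection
        # and hand out a fresh one on the next iteration.
        print("error:", err.orig)
    time.sleep(0.01)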
This appears to be a bug in sqlalchemy itself.  It has some cleanup logic that is supposed to return connections to the pool if an error occurs.  But if a second error occurs while the cleanup logic is running, the connection does not get returned to the pool.

Restarting MySQL triggers exactly this scenario - you get an initial "1317 Query Execution was interrupted" error as it kills any running queries, followed by a "2006 MySQL server has gone away" if you attempt to use the connection again during the error-handling logic.

The code for this is rather complicated, I will try to figure out an appropriate bug report and/or fix for upstream...
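Schematically (this is not SQLAlchemy's actual code, just an illustration of the failure mode, with made-up pool/connection method names):

def run_query(pool, sql):
    conn = pool.checkout()
    try:
        result = conn.execute(sql)
    except Exception:
        # First error, e.g. "1317 Query execution was interrupted".  If this
        # cleanup call hits a second error ("2006 MySQL server has gone away"),
        # the new exception propagates and the checkin below never runs, so the
        # broken connection is neither invalidated nor returned to the pool.
        conn.rollback()
        pool.checkin(conn)
        raise
    pool.checkin(conn)
    return result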
Verified that this problem exists in the 1.15-5 code base as well.
Filed an upstream bug with SQLAlchemy:

  http://www.sqlalchemy.org/trac/ticket/2695
Remediation for this bug is coming in two parts.  The first is a fix to our custom pooling subclass, to prevent it from recursively wrapping the creator() callback each time the pool is re-created.  This prevents an infinite recursion that can occur when we invalidate the pool, which we'll do in the next patch.
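The shape of the fix, roughly (class and helper names here are illustrative, not the actual server-core code):

from sqlalchemy.pool import QueuePool

def wrap_creator(creator):
    # Hypothetical wrapper standing in for the extra behaviour we add around
    # connection creation.
    def wrapped():
        return creator()
    wrapped._is_wrapped = True
    return wrapped

class RecreatablePool(QueuePool):  # illustrative subclass name
    def __init__(self, creator, *args, **kwds):
        # Only wrap the creator callback once.  Previously every pool
        # re-creation wrapped it again, nesting the wrappers deeper each time,
        # which is what produced the recursion described above.
        if not getattr(creator, "_is_wrapped", False):
            creator = wrap_creator(creator)
        QueuePool.__init__(self, creator, *args, **kwds)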
Attachment #733738 - Flags: review?(telliott)
This patch calls engine.dispose() if we detect a connection-related error that was not properly handled by SQLAlchemy.  This will result in the pool being discarded and re-created from scratch.

The list of error codes covers the things I've seen raised when MySQL goes away, including "lost connection" and "server has gone away".  We're not currently seeing any of these error codes in production, so this patch should not cause unnecessary purges of the connection pool during normal operation.  Rather, it should only be triggered when MySQL is actually restarted or otherwise goes away.
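The gist of the patch, as a sketch (the error-code list and helper name are illustrative; the real change lives in server-core):

from sqlalchemy.exc import DBAPIError

# MySQL error codes observed when the server is restarted or goes away.
DISCONNECT_ERROR_CODES = set([1317, 2006, 2013])

def execute_with_pool_reset(engine, query):
    conn = engine.connect()
    try:
        return conn.execute(query)
    except DBAPIError as err:
        code = getattr(err.orig, "args", (None,))[0]
        if code in DISCONNECT_ERROR_CODES:
            # Throw away the whole pool; it gets re-created from scratch on
            # the next checkout, instead of handing out dead connections.
            engine.dispose()
        raise
    finally:
        conn.close()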
Attachment #733740 - Flags: review?(telliott)
Attachment #733738 - Flags: review?(telliott) → review+
Comment on attachment 733740 [details] [diff] [review]
server-core patch to dispose pool upon suspicious connection errors

Patch is fine, but I'll note that action has already been taken on the sqlalchemy bug. Worth exploring that before committing such a brute force solution?
Attachment #733740 - Flags: review?(telliott) → review+
> Patch is fine, but I'll note that action has already been taken on the sqlalchemy
> bug. Worth exploring that before committing such a brute force solution?

We're not API-compatible with the new 0.8 series of SQLAlchemy, and the patch used in this bug will have basically the same effect as the official solution, so I'm comfortable with putting it in.
> We're not API-compatible with the new 0.8 series of SQLAlchemy, and the patch
> used in this bug will have basically the same effect as the official solution,
> so I'm comfortable with putting it in.

I'll also note that if we do update to a fixed version of SQLAlchemy, the "if not exc.is_disconnect" clause in our patch will stop being triggered and our fix will become a no-op.  So we don't need to remember to back it out straight away, although we should do so eventually.
:rfkelly did this patch make it to Prod today?
no, not part of latest deploy
Blocks: 862613
This fix was deployed to stage, and although it does seem to have improved the situation, we're still seeing a residual level of 503s after restarting one of the databases.  But at least it's not a full-on cascade like before the patch!

Specifically the 503s are produced by TimeoutErrors from sqlalchemy.

As before, restarting the gunicorn processes makes the error rate drop back to zero, so I'm pretty sure there's some bug remaining on the python side of things.

Possibly related: even with gunicorn restarted and the 503 rate at zero, there is still a residual level of "timeout" metlog events being generated.  Maybe false positives, since they're not producing application-level errors.  I need to dig into the code and sanity-check this counter.
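For context on those TimeoutErrors (general SQLAlchemy pool behaviour, not our code): when every pooled connection is checked out and none is returned within pool_timeout seconds, checkout raises sqlalchemy.exc.TimeoutError, which is what surfaces as a 503.  A contrived sketch:

from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = create_engine(
    "mysql://sync:secret@db-stage/sync",  # hypothetical DSN
    pool_size=1,
    max_overflow=0,
    pool_timeout=2,  # give up after 2 seconds waiting for a free connection
)

held = engine.connect()      # holds the only pooled connection
try:
    engine.connect()         # second checkout waits pool_timeout, then raises
except PoolTimeoutError:
    print("pool exhausted")  # this is what turns into a 503 at the app layer
finally:
    held.close()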
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bumping this, we should poke at it again for sync1.5.  No rush though.
:bobm maybe we can tie this into the following work: bug 1006792
Setting this to p4.
Priority: -- → P4
Is this issue still relevant in production today?
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #26)
> Is this issue still relevant in production today?

No, I feel confident we can close this one out.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 6 years ago
Flags: needinfo?(bobm)
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard