Closed Bug 857323 Opened 11 years ago Closed 6 years ago

Sync server SQLAlchemy connection pool fails to recycle stale DB connections.

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P4)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bobm, Unassigned)

References

Details

(Whiteboard: [qa+])

Attachments

(6 files)

When a database connection from a connection pool becomes stale, it is not always recycled.  This manifests as an error such as the following in the application log: BackendError: GET /1.1/SYNCUSER/storage/bookmarks.  The problem is easily reproduced by recycling a backend database under load.
Whiteboard: [qa+]
As a test, the staging pool_recycle setting was set to 60 seconds.  This shorter recycle time did not clear stale pool connections after a disruptive database restart.
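For reference, a minimal sketch of what that setting means at the SQLAlchemy level (the DSN below is a placeholder, not the staging configuration): any pooled connection older than pool_recycle seconds should be discarded and replaced at checkout time.

from sqlalchemy import create_engine

# Hypothetical DSN; only the pool_recycle value reflects the test above.
engine = create_engine(
    "mysql://sync:secret@db-stage/sync",
    pool_recycle=60,  # discard pooled connections older than 60 seconds at checkout
)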
A couple possibilities here:

1) the config variable name for pool_recycle is wrong. That's bitten us in the past. Ryan, can you confirm?

2) This doesn't look like a reap failure. This looks like the connections being held, such that they aren't released back to the pool in a way that would allow them to be reaped at a later point. So, the DB goes down, but the event workers fail to realize that their db connection is gone, so they sit there hanging onto the invalid connection waiting for a response.

Could we split into a pool of readers and a pool of writers? Reads could force-timeout a lot faster.
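A rough sketch of that reader/writer split, assuming two separate engines; the DSNs and numbers are illustrative, not a concrete proposal:

from sqlalchemy import create_engine

# Writes keep the usual, more patient pool settings.
write_engine = create_engine(
    "mysql://sync:secret@db-primary/sync",  # hypothetical DSN
    pool_size=10,
    pool_timeout=30,    # seconds to wait for a free connection
    pool_recycle=1200,
)

# Reads go to a separate pool that is allowed to fail fast.
read_engine = create_engine(
    "mysql://sync:secret@db-replica/sync",  # hypothetical DSN
    pool_size=10,
    pool_timeout=5,     # force-timeout reads much sooner
    pool_recycle=1200,
)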
Comments 4, 5, and 6 contain the MySQL slow query analysis for 2013-04-02.  The slow query time limit is currently defined as greater than 10 seconds in Stage.  None of the other Sync staging database hosts logged slow queries during yesterday's load test.
Judging from the staging Connection Pool Failures graph, the pool_recycle setting is correct.  New connections are spaced by the expected interval when the setting is twenty minutes, and are nearly constant when it is set to one minute.
> 1) the config variable name for pool_recycle is wrong. That's bitten us in the
> past. Ryan, can you confirm?

AFAICT the name "pool_recycle" is correct.
I can reproduce this locally by running a pool with pool_size=1, accessing it in a tight loop, and restarting mysql.  It does receive an error "2013 Lost connection to MySQL server during query", but then doesn't invalidate the connection or return it to the pool.  Very strange.
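Rough reproduction sketch, in case anyone else wants to try it (placeholder DSN; restart mysqld in another terminal while this runs):

import time

from sqlalchemy import create_engine, text
from sqlalchemy.exc import DBAPIError

engine = create_engine(
    "mysql://sync:secret@localhost/test",  # hypothetical DSN
    pool_size=1,
    max_overflow=0,
)

while True:
    try:
        conn = engine.connect()
        try:
            conn.execute(text("SELECT 1"))
        finally:
            conn.close()
    except DBAPIError as err:
        # After restarting MySQL you see "2013 Lost connection to MySQL server
        # during query" here; a healthy pool would invalidate that connection
        # and hand out a fresh one on the next iteration.
        print("error:", err.orig)
    time.sleep(0.01)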
This appears to be a bug in sqlalchemy itself.  It has some cleanup logic that is supposed to return connections to the pool if an error occurs.  But if a second error occurs while the cleanup logic is running, the connection does not get returned to the pool.

Restarting MySQL triggers exactly this scenario - you get an initial "1317 Query Execution was interrupted" error as it kills any running queries, followed by a "2006 MySQL server has gone away" if you attempt to use the connection again during the error-handling logic.

The code for this is rather complicated, I will try to figure out an appropriate bug report and/or fix for upstream...
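Schematically (this is not SQLAlchemy's actual code, just an illustration of the failure mode, with made-up pool/connection method names):

def run_query(pool, sql):
    conn = pool.checkout()
    try:
        result = conn.execute(sql)
    except Exception:
        # First error, e.g. "1317 Query execution was interrupted".  If this
        # cleanup call hits a second error ("2006 MySQL server has gone away"),
        # the new exception propagates and the checkin below never runs, so the
        # broken connection is neither invalidated nor returned to the pool.
        conn.rollback()
        pool.checkin(conn)
        raise
    pool.checkin(conn)
    return result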
Verified that this problem exists in the 1.15-5 code base as well.
Filed an upstream bug with SQLAlchemy:

  http://www.sqlalchemy.org/trac/ticket/2695
Remediation for this bug is coming in two parts.  The first is a fix to our custom pooling subclass, to prevent it from recursively wrapping the creator() callback each time the pool is re-created.  This prevents an infinite recursion that can occur when we invalidate the pool, which we'll do in the next patch.
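The shape of the fix, roughly (class and helper names here are illustrative, not the actual server-core code):

from sqlalchemy.pool import QueuePool

def wrap_creator(creator):
    # Hypothetical wrapper standing in for the extra behaviour we add around
    # connection creation.
    def wrapped():
        return creator()
    wrapped._is_wrapped = True
    return wrapped

class RecreatablePool(QueuePool):  # illustrative subclass name
    def __init__(self, creator, *args, **kwds):
        # Only wrap the creator callback once.  Previously every pool
        # re-creation wrapped it again, nesting the wrappers deeper each time,
        # which is what produced the recursion described above.
        if not getattr(creator, "_is_wrapped", False):
            creator = wrap_creator(creator)
        QueuePool.__init__(self, creator, *args, **kwds)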
Attachment #733738 - Flags: review?(telliott)
This patch calls engine.dispose() if we detect a connection-related error that was not properly handled by SQLAlchemy.  This will result in the pool being discarded and re-created from scratch.

The list of error codes covers the things I've seen raised when MySQL goes away, including "lost connection" and "server has gone away".  We're not currently seeing any of these error codes in production, so this patch should not cause unnecessary purges of the connection pool during normal operation.  Rather, it should only be triggered when MySQL is actually restarted or otherwise goes away.
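The gist of the patch, as a sketch (the error-code list and helper name are illustrative; the real change lives in server-core):

from sqlalchemy.exc import DBAPIError

# MySQL error codes observed when the server is restarted or goes away.
DISCONNECT_ERROR_CODES = set([1317, 2006, 2013])

def execute_with_pool_reset(engine, query):
    conn = engine.connect()
    try:
        return conn.execute(query)
    except DBAPIError as err:
        code = getattr(err.orig, "args", (None,))[0]
        if code in DISCONNECT_ERROR_CODES:
            # Throw away the whole pool; it gets re-created from scratch on
            # the next checkout, instead of handing out dead connections.
            engine.dispose()
        raise
    finally:
        conn.close()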
Attachment #733740 - Flags: review?(telliott)
Attachment #733738 - Flags: review?(telliott) → review+
Comment on attachment 733740 [details] [diff] [review]
server-core patch to dispose pool upon suspicious connection errors

Patch is fine, but I'll note that action has already been taken on the sqlalchemy bug. Worth exploring that before committing such a brute force solution?
Attachment #733740 - Flags: review?(telliott) → review+
> Patch is fine, but I'll note that action has already been taken on the sqlalchemy
> bug. Worth exploring that before committing such a brute force solution?

We're not API-compatible with the new 0.8 series of SQLAlchemy, and the patch used in this bug will have basically the same effect as the official solution, so I'm comfortable with putting it in.
> We're not API-compatible with the new 0.8 series of SQLAlchemy, and the patch
> used in this bug will have basically the same effect as the official solution,
> so I'm comfortable with putting it in.

I'll also note that if we do update to a fixed version of SQLAlchemy, the "if not exc.is_disconnect" clause in our patch will stop being triggered and our fix will become a no-op.  So we don't need to remember to back it out straight away, although we should do so eventually.
:rfkelly did this patch make it to Prod today?
no, not part of latest deploy
Blocks: 862613
This fix was deployed to stage, and although it does seem to have improved the situation, we're still seeing a residual level of 503s after restarting one of the databases.  But at least it's not a full-on cascade like before the patch!

Specifically the 503s are produced by TimeoutErrors from sqlalchemy.

As before, restarting the gunicorn processes makes the error rate drop back to zero, so I'm pretty sure there's some bug remaining on the python side of things.

Possibly related: even with gunicorn restarted and the 503 rate at zero, there is still a residual level of "timeout" metlog events being generated.  Maybe false positives, since they're not producing application-level errors.  I need to dig into the code and sanity-check this counter.
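For context on those TimeoutErrors (general SQLAlchemy pool behaviour, not our code): when every pooled connection is checked out and none is returned within pool_timeout seconds, checkout raises sqlalchemy.exc.TimeoutError, which is what surfaces as a 503.  A contrived sketch:

from sqlalchemy import create_engine
from sqlalchemy.exc import TimeoutError as PoolTimeoutError

engine = create_engine(
    "mysql://sync:secret@db-stage/sync",  # hypothetical DSN
    pool_size=1,
    max_overflow=0,
    pool_timeout=2,  # give up after 2 seconds waiting for a free connection
)

held = engine.connect()      # holds the only pooled connection
try:
    engine.connect()         # second checkout waits pool_timeout, then raises
except PoolTimeoutError:
    print("pool exhausted")  # this is what turns into a 503 at the app layer
finally:
    held.close()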
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Bumping this, we should poke at it again for sync1.5.  No rush though.
:bobm maybe we can tie this into the following work: bug 1006792
Setting this to p4.
Priority: -- → P4
Is this issue still relevant in production today?
Flags: needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #26)
> Is this issue still relevant in production today?

No, I feel confident we can close this one out.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 6 years ago
Flags: needinfo?(bobm)
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard