Closed Bug 600208 Opened 15 years ago Closed 13 years ago

Send 503 + Retry-After when the DB queries are timing out

Categories

(Cloud Services :: Server: Other, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: tarek, Unassigned)

References

Details

(Whiteboard: [qa?])

Attachments

(1 file)

We should set the SQL connector to wait at most 10 minutes for a query, and return an X-Weave-Backoff header if the DB does not respond (using PDO::ATTR_TIMEOUT in PHP). This will inform the client that the node is melting down.
mconnor suggested a Retry-After instead of X-Weave-Backoff in that case.
I don't think this'll help us much. Apache will probably have timed out on the frontend long before this, and nobody will get the response.
The approach we probably need to take is some combination of monitoring and Bug 599018, where we manually or automatically trigger a backoff based on how long certain queries are taking. It would also be useful to have the queries that really cause us pain logged somewhere.
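A minimal sketch of the monitoring idea described above, assuming query durations are measured by the caller. The class name `BackoffMonitor`, the thresholds, and the window size are hypothetical illustrations and are not taken from the sync server code.

```python
import logging
from collections import deque

log = logging.getLogger("slow_queries")


class BackoffMonitor:
    """Track recent query durations and decide when clients should back off."""

    def __init__(self, slow_threshold=5.0, window=20, max_slow_ratio=0.5):
        # All three values are hypothetical defaults, chosen for illustration.
        self.slow_threshold = slow_threshold   # seconds before a query counts as slow
        self.max_slow_ratio = max_slow_ratio   # share of slow queries that triggers backoff
        self.durations = deque(maxlen=window)  # keeps only the most recent samples

    def record(self, query_name, seconds):
        """Store one measurement and log the query if it was slow."""
        self.durations.append(seconds)
        if seconds >= self.slow_threshold:
            log.warning("slow query %s took %.2fs", query_name, seconds)

    def should_back_off(self):
        """Return True when enough of the recent queries were slow."""
        if not self.durations:
            return False
        slow = sum(1 for d in self.durations if d >= self.slow_threshold)
        return slow / len(self.durations) >= self.max_slow_ratio
```

The sliding window means the backoff clears on its own once recent queries are fast again; a manual override, as mentioned for Bug 599018, would sit alongside this check.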
(In reply to comment #2)
> I don't think this'll help us much. Apache will probably have timed out on the
> frontend long before this, and nobody will get the response.

As long as the PHP thread is killed when Apache is timing out, that seems fine.
Yes, but killing php may not kill the mysql thread.
Oh, I thought PDO was taking care of that when the thread receives the SIGKILL. So what about setting timeouts this way (with 1 second between each timeout):

apache timeout > php timeout > mysql timeout

This way, PHP/PDO can handle query timeouts and we're sure there's no orphan process running on the server.
Summary: send backoff header when the DB query are > 10mn → send backoff header when the DB query are approaching 5mn
The client gives up 5 minutes after the start of the request. We need to either set a global timer at the start of the request and abort at 4m30s of total time spent, or set no more than three or four timers of 1m00s each, so that we always return a temporary error and a backoff/Retry-After to the client.
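The two timer options above can be combined in a small request-scoped budget. This is a sketch under stated assumptions: the `Deadline` class, the `backoff_response` helper, and the 600-second Retry-After value are hypothetical, while the 270-second budget and 60-second per-step cap mirror the 4m30s and 1m00s figures in the comment.

```python
import time


class Deadline:
    """Request-scoped time budget (hypothetical helper, not from the server code)."""

    def __init__(self, budget_seconds=270.0, clock=time.monotonic):
        # 270 s corresponds to the 4m30s total suggested above.
        self._clock = clock
        self._expires_at = clock() + budget_seconds

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def step_timeout(self, cap=60.0):
        """Timeout for the next backend call: the 1m00s cap or what is left."""
        return min(cap, self.remaining())

    def expired(self):
        return self.remaining() <= 0.0


def backoff_response(retry_after=600):
    """Build a (status, headers, body) triple; the Retry-After value is an assumption."""
    return 503, {"Retry-After": str(retry_after)}, ""
```

A handler would pass `deadline.step_timeout()` to each backend call and return `backoff_response()` as soon as `deadline.expired()` is true, so the client always hears back before its own 5-minute limit.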
Making my previous comment clearer, as it seemed unclear. By setting the Apache timeout to, let's say, 30 s, the PHP timeout to 25 s and the DB connector timeout to 20 s (on the MySQL side too), we can return a timeout error and make sure we don't leave long-running processes on the server. The same thing applies to LDAP.
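To make the ordering constraint concrete, here is a small sanity check that could run at deploy time. It is only an illustration: the function name `check_timeout_chain` and the idea of validating the values in Python are assumptions, and the 1-second margin follows the earlier comment about leaving 1 second between each timeout.

```python
def check_timeout_chain(frontend, app, database, margin=1.0):
    """Return a list of problems; an empty list means the chain is ordered correctly.

    The intended ordering is frontend > app > database, with at least
    `margin` seconds between neighbouring layers.
    """
    problems = []
    if app + margin > frontend:
        problems.append(
            "app timeout must be at least %gs below the frontend timeout" % margin
        )
    if database + margin > app:
        problems.append(
            "database timeout must be at least %gs below the app timeout" % margin
        )
    return problems
```

With the values from the comment, `check_timeout_chain(30, 25, 20)` reports no problems, because each inner layer times out before the layer that wraps it.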
I am checking on my side that the Python server behaves properly on slow LDAP or SQL servers. tc was a pain to use depending on your kernel options/OS, so I have created a small port forwarder script that I am using to add delays. It adds a bigger delay on each call until it reaches a max delay, then reduces it to no delay, and starts over.

http://bitbucket.org/tarek/sync-server/src/tip/tests/delay/delay.py

If you want to use it for the PHP app, install twisted and run it like this:

    $ sudo python delay.py 390 localhost 389
    Forwarding from 390 to localhost:389 with delays

This will add a delay to every call on localhost:390, then forward it to the LDAP server.
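The delay pattern described above (grow on each call, reset to zero after the maximum) can be sketched as a generator. This is not the code from delay.py; the function name `delay_schedule` and the step and maximum values are hypothetical.

```python
import itertools


def delay_schedule(step=0.5, max_delay=2.0):
    """Yield an endless sequence of delays in seconds.

    The delay grows by `step` on each call until it reaches `max_delay`,
    then drops back to no delay and the cycle starts over.
    """
    delay = 0.0
    while True:
        yield delay
        delay += step
        if delay > max_delay:
            delay = 0.0


# First few delays a forwarder would apply, one per forwarded call.
first_delays = list(itertools.islice(delay_schedule(0.5, 2.0), 7))
```

A forwarder would sleep for the next value of the generator before relaying each call, which exercises both the fast path and the timeout path of the server under test.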
As described in bug 616393, last night's incident has revealed that timing out MySQL queries will in some cases cause the web head to return a 200 response with an invalid JSON body. This will either confuse the client into wiping + reuploading or just surface an Unknown Error. Neither is acceptable. Web heads should return a 503 + Retry-After as soon as there's a noticeable delay in database response time. This will tell the client to back off and show the right kind of notification to the user.
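One way to avoid a 200 with a broken body is to finish the query and serialization before committing to a status code. The sketch below assumes the storage layer raises an exception on timeout; `QueryTimeout`, `respond`, and the 600-second Retry-After default are hypothetical names and values, not part of the actual web head code.

```python
import json


class QueryTimeout(Exception):
    """Hypothetical exception raised by the storage layer when a query times out."""


def respond(fetch, retry_after=600):
    """Run `fetch` and return a (status, headers, body) triple.

    The body is fully serialized before the 200 is chosen, so a timeout
    can never leave a partial JSON document behind a success status.
    """
    try:
        data = fetch()
        body = json.dumps(data)
    except QueryTimeout:
        # Tell the client to back off instead of handing it a broken payload.
        return 503, {"Retry-After": str(retry_after)}, "server busy"
    return 200, {"Content-Type": "application/json"}, body
```

The design choice here is simply ordering: nothing is sent until the whole response is known to be valid, at the cost of buffering the body in memory.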
Summary: send backoff header when the DB query are approaching 5mn → Send 503 + Retry-After when the DB queries are timing out
Attached file error_file
Here's how I fixed it:
* uploaded as an extra file
** Catalogs | Extra Files | Miscellaneous Files
* changed error_file to sync-500-to-503.html
** Services | Virtual Servers | sync01 | Connection Management | Connection Error Settings

To revert, change error_file to Default.
I do not agree that 616393 is a duplicate of 600208. Attachment 495152 is good for when we actually get as far as returning a 500 error to the client (per bug 616393), but it is not sufficient for interrupting an active request at 30 seconds to send back a 503 to the client (per bug 600208, this one).
Whiteboard: [qa?]
probably outdated
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID