Bug 665874 (Closed). Opened 13 years ago; closed 12 years ago.

zeus: enable non-idempotent passive monitoring for http backends

Component: Cloud Services :: Operations: Miscellaneous (task)
Platform: x86 macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: Atoll; Assigned: Atoll
Whiteboard: [qa?]

Zeus has a passive monitoring feature that lets it detect when a backend webserver has failed to handle a request and resend it to another backend webserver. However, this isn't consistently enabled everywhere, and I haven't confirmed that it's actually a good idea for us. So, do that.
ref. Zeus User Manual, section 13.2 Passive Health Monitoring.

There are four failure conditions that apply to us: unable to connect to a node within 4 seconds, connection closed by the node, response not received within 30 seconds, or an invalid HTTP response received.

Zeus considers a request idempotent if its HTTP method is one of GET, HEAD, PUT, DELETE, OPTIONS, or TRACE.

503 responses are excluded from being retried when session persistence is enabled (I haven't confirmed this).

So: if sending a request to a backend node fails due to one of the four failure conditions above, and the request is idempotent, and the response isn't a 503 while session persistence is enabled, then the request is retried against another backend node.

One aspect of this is great: If a request can't reach a backend node, it absolutely should be retried against another node.

The rest of this isn't so great: if a request times out, or the web node replies incorrectly, it may actually not be safe to retry the request against another server.  This could lead to DELETEs being stacked across multiple nodes, which might actually *cause* some of the wacky deadlocks we see with long-running DELETEs.

So my proposed settings for this are:

- Enable passive monitoring
- Mark all requests as non-idempotent (rule sketch at the end of this comment)

If a backend server returns spurious errors (which has not happened in recent memory), this will prevent the Zeus from retrying requests elsewhere.  That slightly increases the user impact of a failed node.

In exchange for that, we don't risk sending requests that cause MySQL writes multiple times within 90 seconds, even when the user has gone away / stopped caring after 30 seconds and the query is still running.  That is a considerable win in everyday use.
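
A minimal sketch of what the second bullet looks like as a TrafficScript request rule (passive monitoring itself is toggled in the pool's Health Monitoring settings, not in a rule; this is the same one-liner applied later in this bug):

# Treat every request as non-idempotent, so passive monitoring only
# retries on connection failure and never replays a request that a
# node may have already started processing.
http.setIdempotent(0);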
Status: NEW → ASSIGNED
One easy route out of this is to ask the application developers to explicitly declare *which* commands are safe to be retried N times in a row by the server, and which are not.

And then we can mark only the permitted safe commands as "idempotent" using trafficscript, so they can be retried; a sketch follows.
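
Roughly, as a hypothetical illustration (nothing like this is deployed; the "/retry-safe/" path is a made-up placeholder, and this assumes TrafficScript's string.startsWith() is available):

$method = http.getMethod();
$path = http.getPath();
# Default: nothing gets retried against another node.
http.setIdempotent(0);
# Re-mark only the commands the app developers have declared retry-safe.
if ($method == "GET" && string.startsWith($path, "/retry-safe/")) {
   http.setIdempotent(1);
}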

There is one big problem with this, however.  If we enable this logic, the Zeus will retry requests if the server doesn't respond within 30 seconds.  At this point, the client has given up, so the Zeus retrying is actually not helpful and potentially risky.

Thoughts, everyone?  (If you're new to the bug, please spend 5 minutes to read the tl;dr comment #1 above.)
Status: ASSIGNED → NEW
While working with rnewman on bug 691257, I realized that preventing PUT, POST, and DELETE requests from being retried on backend timeout is the correct behavior.

In peak load scenarios, this would keep a write timeout from triggering more copies of the same write, which lead to deadlocks, extra database load, and outages that generally last longer (traffic jam effects).

Will confirm with the server team that this is a safe bet, but I imagine the fact that the retry logic is already actively doing this will be an unwelcome surprise.
$method = http.getMethod();
if ($method == "POST" || $method == "PUT" || $method == "DELETE") {
   # prevent Zeus from retrying writes against multiple nodes so we don't crush ourselves under load
   http.setIdempotent(0);
}
Group: services-infra
I concur: trying to send the same write request again and again is a very bad thing.

The code does nothing to reject a second write while a first write is still in flight (because it's slow for some reason), so at the very least we're loading the servers for nothing by doing the same writes several times, and at worst we're generating conflict errors on this call: PUT /user/storage/collection/singleitem

Moreover, in all write cases the client seems able to know that it failed and try again later.
(In reply to Richard Soderberg [:atoll] from comment #3)

> Will confirm with the server team that this is a safe bet, but I imagine
> the fact that the retry logic is already actively doing this will be an
> unwelcome surprise.

You imagine correctly. Some really weird things could happen, especially with POSTs.

Is there any reason not to do the same for GETs? If the server fails one, everything is probably horribly loaded, and retrying will just add more pain.
(In reply to Toby Elliott [:telliott] from comment #6)

> Is there any reason not to do the same for GETs? If the server fails one,
> everything is probably horribly loaded, and retrying will just add more pain.

The client certainly doesn't expect GETs to be retried on the server side.

It's possible that atoll and I got our wires crossed: I suggested that the response-buffering behavior (Bug 691257) should only apply to PUT and POST, with GET still streaming. Removing retries should apply everywhere.
In comment #0 or #1 above, I asked everyone to chime in on this issue.  Now that everyone has, it seems we can turn off auto-retries for *all* requests.  I was comfortable turning them off for *writes* without getting feedback on *reads*, but since we have "removing retries ... everywhere" from both client and server devs, this is terrifically simple to do, and I can spam the change out in a few minutes today.

(Wires not crossed.)
This is already disabled on scl2-sync and phx-sync, leaving open to cover the rest of the products.
Confirmed that passive monitoring defaults to on. We're okay with passive monitoring in the "connection refused" sense, but not in the "idempotent requests" sense, so this needs to convert to "enable passive monitoring with all requests marked as NOT idempotent", so that Zeus only retries on connection failure, not on a bad response from the node.
To implement comment 10 above, assign a request TrafficScript rule to an HTTPS vserver:

http.setIdempotent(0);

And then enable passive monitoring on the pool.  Testing this in staging.
Status: NEW → ASSIGNED
This is ready to be tested.  Virtual Servers -> stage-sync -> Rules -> 'set HTTP as non-idempotent', applied.  To enable/disable for testing:

Login to staging zeus, go to Pools -> stage-sync -> 'Health Monitoring', enable 'Passive Monitoring'.

Verify that n_retries > 0 appears in the Zeus HTTP logs when port 80 on any single backend node is briefly blocked. Zeus should adapt within 10-15 seconds, if not much sooner, and the retries should cease. No error should be visible to clients. A small increase in response time is guaranteed for each retry.
Whiteboard: [needs loadtest]
Summary: either enable or disable zeus passive monitoring of http sitewide → zeus: enable non-idempotent passive monitoring for http backends
Assignee: rsoderberg → petef
Whiteboard: [needs loadtest] → [needs loadtest][qa?]
(In reply to Richard Soderberg [:atoll] from comment #12)
> Verify that n_retries > 0 appears in the Zeus HTTP logs when port 80 on
> any single backend node is briefly blocked. Zeus should adapt within 10-15
> seconds, if not much sooner, and the retries should cease. No error should
> be visible to clients. A small increase in response time is guaranteed for
> each retry.

Works exactly as advertised.  I left it on for stage-sync.
Assignee: petef → rsoderberg
Whiteboard: [needs loadtest][qa?] → [qa?]
My only concern: what happens if something is going on that forces people to do lots of initial syncs (or whatever query ends up timing out to gunicorn), marking backends as bad (until the next health check)? We might end up denying service to other users with "safe" queries.
Richard S. and Pete - recommendations for QA here?
Seems like some verification could be done in Stage and Prod.
(In reply to Pete Fritchman [:petef] from comment #14)
> My only concern: what happens if something is going on that forces people
> to do lots of initial syncs (or whatever query ends up timing out to
> gunicorn), marking backends as bad (until the next health check)? We might
> end up denying service to other users with "safe" queries.

After "node_connection_attempts" passive failures (default 3) in a row, Zeus will mark the backend node as passive-offline for "node_fail_time" seconds (default 60).

Once the passive-offline timer expires, service will resume as normal to the node, and normal health check rules will apply.

On stage-sync, active health monitoring is configured to fail a node after 3 retries, with a timeout of 3 seconds, and an initial retry interval of 3 seconds (with retry backoff).

Configured stage-sync passive health monitoring to fail a node after 3 retries, with a timeout of 3 seconds (rather than the default of 60).

(In reply to James Bonacci [:jbonacci] from comment #15)
> Richard S. and Pete - recommendations for QA here?
> Seems like some verification could be done in Stage and Prod.

We'll need to force two types of node failures during a first sync: (a) connection refused, and (b) connection timed out.  The first (a) should be retried successfully without the client knowing something went wrong.  The second (b) should not be retried and result in a client-visible error.
And now we're moving away from Zeus, and it's already working basically fine, so marking this entire thing as WONTFIX unless we decide to care for some reason.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX