Bug 665874 (Closed). Opened 13 years ago; closed 12 years ago.

zeus: enable non-idempotent passive monitoring for http backends

Component: Cloud Services :: Operations: Miscellaneous (task)
Platform: x86 macOS
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: Atoll; Assigned: Atoll
Whiteboard: [qa?]

Zeus has a passive monitoring feature that lets it detect when a backend webserver has failed to handle a request and resend it to another backend webserver. However, this isn't consistently enabled everywhere, and I haven't confirmed that it's actually a good idea for us. So, do that.
ref. Zeus User Manual, section 13.2 Passive Health Monitoring.

There are four failure conditions that apply to us: unable to connect to a node within 4 seconds, connection closed by the node, response not received within 30 seconds, or an invalid HTTP response received.

Zeus considers a request idempotent if its HTTP method is one of GET, HEAD, PUT, DELETE, OPTIONS, or TRACE.

503 responses are excluded from being retried when session persistence is enabled (I haven't confirmed this).

So: if sending a request to a backend node fails due to one of the four failure conditions above, and the request is idempotent, and the response isn't a 503 while session persistence is enabled, then the request is retried against another backend node.

One aspect of this is great: If a request can't reach a backend node, it absolutely should be retried against another node.

The rest of this isn't so great: if a request times out, or the web node replies incorrectly, it may actually not be safe to retry the request against another server.  This could lead to DELETEs being stacked across multiple nodes, which might actually *cause* some of the wacky deadlocks we see with long-running DELETEs.

So my proposed settings for this are:

- Enable passive monitoring
- Mark all requests as non-idempotent (rule sketch at the end of this comment)

If a backend server returns spurious errors (which has not happened in recent memory), this will prevent the Zeus from retrying requests elsewhere.  That slightly increases the user impact of a failed node.

In exchange for that, we don't risk sending requests that cause MySQL writes multiple times within 90 seconds, even when the user has gone away / stopped caring after 30 seconds and the query is still running.  That is a considerable win in everyday use.
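
A minimal sketch of what the second bullet looks like as a TrafficScript request rule (passive monitoring itself is toggled in the pool's Health Monitoring settings, not in a rule; this is the same one-liner applied later in this bug):

# Treat every request as non-idempotent, so passive monitoring only
# retries on connection failure and never replays a request that a
# node may have already started processing.
http.setIdempotent(0);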
Status: NEW → ASSIGNED
One easy route out of this is to ask the application developers to explicitly declare *which* commands are safe to be retried N times in a row by the server, and which are not.

And then we can mark only the permitted safe commands as "idempotent" using trafficscript, so they can be retried; a sketch follows.
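
Roughly, as a hypothetical illustration (nothing like this is deployed; the "/retry-safe/" path is a made-up placeholder, and this assumes TrafficScript's string.startsWith() is available):

$method = http.getMethod();
$path = http.getPath();
# Default: nothing gets retried against another node.
http.setIdempotent(0);
# Re-mark only the commands the app developers have declared retry-safe.
if ($method == "GET" && string.startsWith($path, "/retry-safe/")) {
   http.setIdempotent(1);
}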

There is one big problem with this, however.  If we enable this logic, the Zeus will retry requests if the server doesn't respond within 30 seconds.  At this point, the client has given up, so the Zeus retrying is actually not helpful and potentially risky.

Thoughts, everyone?  (If you're new to the bug, please spend 5 minutes to read the tl;dr comment #1 above.)
Status: ASSIGNED → NEW
While working with rnewman on bug 691257, I realized that preventing PUT, POST, and DELETE requests from being retried on backend timeout is the correct behavior.

In peak load scenarios, this would keep a write timeout from triggering more copies of the same write, which lead to deadlocks, extra database load, and outages that generally last longer (traffic jam effects).

Will confirm with the server team that this is a safe bet, but I imagine the fact that the retry logic is already actively doing this will be an unwelcome surprise.
$method = http.getMethod();
if ($method == "POST" || $method == "PUT" || $method == "DELETE") {
   # prevent Zeus from retrying writes against multiple nodes so we don't crush ourselves under load
   http.setIdempotent(0);
}
Group: services-infra
I concur: trying to send the same write request again and again is a very bad thing.

The code does nothing to reject a second write while a first write is still in flight (because it's slow for some reason), so at the very least we're loading the servers for nothing by doing the same writes several times, and at worst we're generating conflict errors on this call: PUT /user/storage/collection/singleitem

Moreover, in all write cases the client seems able to know that it failed and try again later.
(In reply to Richard Soderberg [:atoll] from comment #3)

> Will confirm with the server team that this is a safe bet, but I imagine
> the fact that the retry logic is already actively doing this will be an
> unwelcome surprise.

You imagine correctly. Some really weird things could happen, especially with POSTs.

Is there any reason not to do the same for GETs? If the server fails one, everything is probably horribly loaded, and retrying will just add more pain.
(In reply to Toby Elliott [:telliott] from comment #6)

> Is there any reason not to do the same for GETs? If the server fails one,
> everything is probably horribly loaded, and retrying will just add more pain.

The client certainly doesn't expect GETs to be retried on the server side.

It's possible that atoll and I got our wires crossed: I suggested that the response-buffering behavior (Bug 691257) should only apply to PUT and POST, with GET still streaming. Removing retries should apply everywhere.
In comment #0 or #1 above, I asked everyone to chime in on this issue.  Now that everyone has, it seems we can turn off auto-retries for *all* requests.  I was comfortable turning them off for *writes* without getting feedback on *reads*, but since we have "removing retries ... everywhere" from both client and server devs, this is terrifically simple to do, and I can spam the change out in a few minutes today.

(Wires not crossed.)
This is already disabled on scl2-sync and phx-sync, leaving open to cover the rest of the products.
Confirmed that passive monitoring defaults to on. We're okay with passive monitoring in the "connection refused" sense, but not in the "idempotent requests" sense, so this needs to convert to "enable passive monitoring with all requests marked as NOT idempotent", so that Zeus only retries on connection failure, not on a bad response from the node.
To implement comment 10 above, assign a request TrafficScript rule to an HTTPS vserver:

http.setIdempotent(0);

And then enable passive monitoring on the pool.  Testing this in staging.
Status: NEW → ASSIGNED
This is ready to be tested.  Virtual Servers -> stage-sync -> Rules -> 'set HTTP as non-idempotent', applied.  To enable/disable for testing:

Login to staging zeus, go to Pools -> stage-sync -> 'Health Monitoring', enable 'Passive Monitoring'.

Verify that n_retries > 0 appears in the Zeus HTTP logs when port 80 on any single backend node is briefly blocked. Zeus should adapt within 10-15 seconds, if not much sooner, and the retries should cease. No error should be visible to clients. A small increase in response time is guaranteed for each retry.
Whiteboard: [needs loadtest]
Summary: either enable or disable zeus passive monitoring of http sitewide → zeus: enable non-idempotent passive monitoring for http backends
Assignee: rsoderberg → petef
Whiteboard: [needs loadtest] → [needs loadtest][qa?]
(In reply to Richard Soderberg [:atoll] from comment #12)
> Verify that n_retries > 0 appears in the Zeus HTTP logs when port 80 on
> any single backend node is briefly blocked. Zeus should adapt within 10-15
> seconds, if not much sooner, and the retries should cease. No error should
> be visible to clients. A small increase in response time is guaranteed for
> each retry.

Works exactly as advertised.  I left it on for stage-sync.
Assignee: petef → rsoderberg
Whiteboard: [needs loadtest][qa?] → [qa?]
My only concern: what happens if something is going on that forces people to do lots of initial syncs (or whatever query ends up timing out to gunicorn), marking backends as bad (until the next health check)? We might end up denying service to other users with "safe" queries.
Richard S. and Pete - recommendations for QA here?
Seems like some verification could be done in Stage and Prod.
(In reply to Pete Fritchman [:petef] from comment #14)
> My only concern: what happens if something is going on that forces people
> to do lots of initial syncs (or whatever query ends up timing out to
> gunicorn), marking backends as bad (until the next health check)? We might
> end up denying service to other users with "safe" queries.

After "node_connection_attempts" passive failures (default 3) in a row, Zeus will mark the backend node as passive-offline for "node_fail_time" seconds (default 60).

Once the passive-offline timer expires, service will resume as normal to the node, and normal health check rules will apply.

On stage-sync, active health monitoring is configured to fail a node after 3 retries, with a timeout of 3 seconds, and an initial retry interval of 3 seconds (with retry backoff).

Configured stage-sync passive health monitoring to fail a node after 3 retries, with a timeout of 3 seconds (rather than the default of 60).

(In reply to James Bonacci [:jbonacci] from comment #15)
> Richard S. and Pete - recommendations for QA here?
> Seems like some verification could be done in Stage and Prod.

We'll need to force two types of node failures during a first sync: (a) connection refused, and (b) connection timed out.  The first (a) should be retried successfully without the client knowing something went wrong.  The second (b) should not be retried and result in a client-visible error.
And now we're moving away from Zeus, and it's already working basically fine, so marking this entire thing as WONTFIX unless we decide to care for some reason.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX