Closed Bug 1467553 (opened 7 years ago, closed 7 years ago)

Brocade traffic manager doesn't honor "draining" state when using IP-based session persistence

Categories: Infrastructure & Operations :: Runtime
Type: task, Priority: Not set, Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: gps, Assignee: ericz
Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/6687]

If you configure a pool's "Session Persistence" to use a "Persistence Class" set to "IP-based persistence," the Brocade traffic manager (which we use for load balancing in e.g. SCL3) will continue to send new HTTP requests to nodes that are in the "draining" state. If you turn off IP-based session persistence, nodes in the "draining" state stop receiving new requests.

The behavior is independent of the order in which the node state and session persistence options are changed: with IP-based session persistence enabled, "draining" nodes *always* receive traffic. AFAICT the only way to use IP-based session persistence and keep new requests away from a node is to put that node in the "disabled" state. But that severs established connections, disrupting in-flight requests, which is obviously bad for whoever is on the opposite end of those connections.

This feels like a bug in the load balancer and should probably be reported upstream.
I should add that it's entirely plausible the "doesn't honor draining state" bug is present for other "Session Persistence" settings. I only tested IP-based session persistence.
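For concreteness, here is a rough sketch of the pool/persistence-class wiring that reproduces this, expressed against the traffic manager's REST API (Pulse/Brocade vTM style). Everything below (endpoint paths, property names, node/host names, and credentials) is an assumption for illustration, not taken from our actual config, and should be checked against the vTM REST API docs:

    # Rough sketch only: paths, property names, and values are assumptions
    # in the style of the Pulse vTM 3.x REST API, not verified vendor API.
    import requests

    BASE = "https://vtm.example.internal:9070/api/tm/3.0/config/active"
    AUTH = ("admin", "password")  # placeholder credentials

    # An IP-based persistence class (name is hypothetical)...
    requests.put(
        f"{BASE}/persistence/example-ip-persist",
        json={"properties": {"basic": {"type": "ip"}}},
        auth=AUTH, verify=False,
    ).raise_for_status()

    # ...attached to a pool that has one node marked "draining". With the
    # persistence class attached, the draining node keeps receiving new
    # requests from client IPs that are already mapped to it.
    requests.put(
        f"{BASE}/pools/example-pool",
        json={"properties": {"basic": {
            "persistence_class": "example-ip-persist",
            "nodes_table": [
                {"node": "app1.example.internal:80", "state": "active"},
                {"node": "app2.example.internal:80", "state": "draining"},
            ],
        }}},
        auth=AUTH, verify=False,
    ).raise_for_status()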
Assignee: server-ops-webops → eziegenhorn
I opened case 2018-0618-2965 with PulseSecure about this behavior. Thanks for the heads-up, :gps.
Component: WebOps: Other → Runtime
PulseSecure has replicated this behavior in their lab and is calling it "a known behavior". I pushed for this to be treated as a bug and fixed in a future release, and they're going to talk to second-level support again. They have, however, provided a workaround which I haven't tested yet but which seems plausible. Instead of draining a node as usual when IP persistence is enabled: in the pool settings, change node_delete_behavior from the default of "All connections to the node are closed immediately." to "Allow existing connections to the node to finish before deletion.", then mark the node Disabled in the pool. Existing connections are then allowed to finish, as draining normally does, except this should also work with IP persistence enabled. (A sketch of this follows below.)
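A sketch of the workaround in the same REST API terms (again, the property names and the "drain" value for node_delete_behavior are assumptions to verify against the vTM docs for your release):

    # Workaround sketch: let existing connections finish on node deletion,
    # then mark the node disabled instead of draining. Names and values are
    # assumptions, not verified vendor API.
    import requests

    BASE = "https://vtm.example.internal:9070/api/tm/3.0/config/active"
    AUTH = ("admin", "password")  # placeholder credentials

    requests.put(
        f"{BASE}/pools/example-pool",
        json={"properties": {"basic": {
            # presumed REST equivalent of the UI option "Allow existing
            # connections to the node to finish before deletion."
            "node_delete_behavior": "drain",
            "nodes_table": [
                {"node": "app1.example.internal:80", "state": "active"},
                {"node": "app2.example.internal:80", "state": "disabled"},
            ],
        }}},
        auth=AUTH, verify=False,
    ).raise_for_status()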
PulseSecure adds, FWIW, that there is a concept of sessions at work here: the traffic manager keeps client IPs in its persistence hash table longer than individual connections live, which matches the behavior we see and is "by design". From their response: when the node is set to "Draining", the IP-based session persistence mapping for that particular client IP address remains in the mapping table, so all new "sessions" from that IP address still go to the node; new "sessions" from different IPs do not. That's why, when the node is set to "Disabled" with the above option chosen, only the existing "connections" are allowed to complete before any further "sessions" to the node are stopped. Per PulseSecure, this is currently working as designed. (A toy model of these semantics follows below.)
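To make the sessions-vs-connections distinction concrete, here is a toy model of the routing semantics PulseSecure describes. This is purely illustrative Python, not vendor code:

    # Toy model: "draining" only stops client IPs that have no existing
    # persistence mapping; "disabled" makes the mapping unusable.
    persistence_table = {}  # client IP -> node it is pinned to

    def route(client_ip, nodes):
        node = persistence_table.get(client_ip)
        if node is not None and nodes[node] != "disabled":
            # An existing mapping wins even if the node is draining, so new
            # "sessions" from an already-mapped IP still reach that node.
            return node
        # No usable mapping: pick an active node and pin this client IP.
        node = next(n for n, state in nodes.items() if state == "active")
        persistence_table[client_ip] = node
        return node

    nodes = {"app1:80": "active", "app2:80": "draining"}
    persistence_table["10.0.0.5"] = "app2:80"   # was mapped before the drain
    print(route("10.0.0.5", nodes))  # app2:80 (draining node still gets traffic)
    print(route("10.0.0.9", nodes))  # app1:80 (new IPs avoid the draining node)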
Spoke with GPS and we're going to close this now that we understand the behavior better.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED