The __heartbeat__ health check endpoint on this and many of our services returns an error when backends and other upstream services are unavailable. That's great for uptime monitoring of the full service, but it's not suitable for a load balancer membership test. It can trigger a spiraling failure scenario where a load balancer marks otherwise healthy hosts as failing and reaps them, when the problem is actually with a backend service such as a DB or caching server. Adding a load-balancer-specific health check endpoint that only verifies health local to the instance is one solution to this problem. An alternative approach might be to report upstream service outages in the payload but still return a successful HTTP response.
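To illustrate the distinction, here is a minimal WSGI sketch (not the actual code of any of our services): `__lbheartbeat__` answers 200 without touching anything outside the process, while `__heartbeat__` fails when a dependency is down. `check_database` is a hypothetical probe standing in for a real backend check.

```python
def check_database():
    # Placeholder for a real dependency probe (DB, cache, etc.).
    return True

def app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    if path == "/__lbheartbeat__":
        # Load balancer membership test: only verifies that this
        # process is serving requests; never touches upstream services.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]
    if path == "/__heartbeat__":
        # Full health check: returns 503 when a backend is unreachable,
        # so uptime monitoring can alert on it.
        if check_database():
            start_response("200 OK", [("Content-Type", "text/plain")])
            return [b"OK"]
        start_response("503 Service Unavailable",
                       [("Content-Type", "text/plain")])
        return [b"database unreachable"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

With this split, the ELB can probe `/__lbheartbeat__` while ops monitoring watches `/__heartbeat__`, and a database outage no longer causes the load balancer to reap healthy instances.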
> The __heartbeat__ health check endpoint on this, and many of our services,
> return an error when backends and other upstream services are unavailable.

Are you sure that's the case for tokenserver? Bug 996870 suggests that it's only checking the health of the web process itself, not its dependencies.
Summary: Add secondary __heartbeat__ check specific to load balancers → Add __lbheartbeat__ check specific to load balancers
__heartbeat__ should be used to monitor whether the service is OK. A failure here should trigger an alert to ops that something is wrong. __lbheartbeat__ should be used to monitor whether the node is OK. A failure here should trigger the ELB and ASG to stop sending traffic to the instance and replace it.
(In reply to Ryan Kelly [:rfkelly] from comment #1)

> Are you sure that's the case for tokenserver? Bug 996870 suggests that it's
> only checking the health of the web process itself, not its dependencies.

Good point. But when it does, we'll need a load-balancer-specific check.
Should we consider that if the node cannot contact the database anymore, it is still OK to send traffic to it? Right now the strategy we had was to stop sending traffic to a node with no database contact until it got its connection back. When a node doesn't have a database connection it returns a 503. When an ELB doesn't have any healthy nodes it returns a 503. The current strategy lets us handle the case where not all of the nodes can contact their database. Thoughts?
IIUC, the problem here is that the auto-scale group will reap the boxes if they're unhealthy and try to spin up new ones in their place, which is not productive when it's the db that's at fault.
In the case of auto-scaling I totally agree, then. I didn't know we had such a strategy for loop-server. BTW, we added a lbhealthcheck endpoint for loop-server and Kinto yesterday.
:mostlygeek can you confirm that we can close this bug?
:rfkelly Did we add a __lbheartbeat__ endpoint to tokenserver? The __lbheartbeat__ only needs to respond with 200 OK.
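For reference, the load balancer's side of that contract can be sketched in a few lines: the probe only cares about getting a 200 back, nothing else. `node_is_healthy` and the toy handler below are illustrative names, not anything from our deployment.

```python
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def node_is_healthy(base_url, timeout=5):
    """Return True iff the node's __lbheartbeat__ answers 200 OK."""
    try:
        with urllib.request.urlopen(base_url + "/__lbheartbeat__",
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

class LBHeartbeat(BaseHTTPRequestHandler):
    # Minimal handler: __lbheartbeat__ always answers 200 regardless of
    # backend state; everything else gets a 404.
    def do_GET(self):
        if self.path == "/__lbheartbeat__":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs
```

Because the check never touches upstream dependencies, a database outage can't cause the ELB/ASG to cycle otherwise healthy instances.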
Flags: needinfo?(bwong) → needinfo?(rfkelly)
Nope, doesn't look like it:

    rfk@tangello:tokenserver(master)$ grep -r lbheartbeat tokenserver
    rfk@tangello:tokenserver(master)$

Any other services we need to add this on?
I think adding it to Tokenserver would be enough for now.
This got added \o/
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED