Closed
Bug 1246008
Opened 8 years ago
Closed 7 years ago
Add __lbheartbeat__ check specific to load balancers
Categories
(Cloud Services Graveyard :: Server: Token, defect)
Cloud Services Graveyard
Server: Token
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bobm, Unassigned)
Details
The __heartbeat__ health check endpoint on this and many of our services returns an error when backends and other upstream services are unavailable. That's great for uptime monitoring of the full service, but it's not suitable for a load balancer membership test. It can trigger a spiraling failure scenario in which the load balancer marks otherwise healthy hosts as failing and reaps them, when the problem is actually with a backend service such as a DB or caching server. Adding a load balancer specific health check endpoint that only verifies health local to the instance is one solution to this problem. An alternate approach might be to report upstream service outages in the payload but return a successful HTTP response.
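The split described above can be sketched as two endpoints in a plain WSGI app. This is a hypothetical illustration, not the tokenserver implementation; `check_database` is a placeholder for a real backend dependency check.

```python
import json

def check_database():
    # Placeholder: a real service would ping its DB/cache here.
    return False  # simulate a backend outage for illustration

def app(environ, start_response):
    path = environ.get("PATH_INFO", "")
    if path == "/__lbheartbeat__":
        # Local-only check: the web process is up, so always 200.
        # The load balancer keeps the node in rotation even when a
        # shared backend (DB, cache) is down.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"OK"]
    if path == "/__heartbeat__":
        # Full check: report backend health. A failure here should
        # alert ops, not cause the ASG to reap the instance.
        ok = check_database()
        status = "200 OK" if ok else "503 Service Unavailable"
        body = json.dumps({"database": ok}).encode()
        start_response(status, [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"Not Found"]
```

With this split, a database outage makes `/__heartbeat__` return 503 (paging ops) while `/__lbheartbeat__` still returns 200, so the load balancer leaves the otherwise healthy node in service.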
Comment 1•8 years ago
> The __heartbeat__ health check endpoint on this, and many of our services,
> return an error when backends and other upstream services are unavailable.

Are you sure that's the case for tokenserver? Bug 996870 suggests that it's only checking the health of the web process itself, not its dependencies.
Flags: needinfo?(bobm)
Updated•8 years ago
Summary: Add secondary __heartbeat__ check specific to load balancers → Add __lbheartbeat__ check specific to load balancers
Comment 2•8 years ago
__heartbeat__ should be used for monitoring if the service is ok. This should trigger an alert to ops that something is wrong. __lbheartbeat__ should be used to monitor if the node is ok. This should trigger the ELB and ASG to stop sending traffic to the instance and replace it.
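For the load balancer side of this split, the ELB health check would be pointed at the local-only endpoint so a backend outage doesn't make the ASG reap healthy web nodes. A hedged sketch using the classic ELB CLI; the load balancer name, port, and thresholds are placeholders:

```shell
# Hypothetical classic-ELB health check: probe the local-only endpoint.
# A shared DB/cache outage then no longer drops every node from rotation.
aws elb configure-health-check \
  --load-balancer-name my-token-elb \
  --health-check Target=HTTP:80/__lbheartbeat__,Interval=30,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2
```

Monitoring (alerting to ops) would keep polling `__heartbeat__` separately.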
Reporter
Comment 3•8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> Are you sure that's the case for tokenserver? Bug 996870 suggests that it's
> only checking the health of the web process itself, not its dependencies.

Good point. But when it does, we'll need a load balancer specific check.
Flags: needinfo?(bobm)
Comment 4•8 years ago
Should we consider it still OK to send traffic to a node that can no longer contact the database? Up to now, our strategy was to stop sending traffic to a node that had lost database contact until it got its connection back. When a node doesn't have a database connection it returns a 503; when an ELB doesn't have any healthy nodes it returns a 503. The current strategy lets us handle the case where not all of the nodes can make contact with their database. Thoughts?
Comment 5•8 years ago
IIUC, the problem here is that the auto-scale group will reap the boxes if they're unhealthy and try to spin up new ones in their place, which is not productive when it's the db that's at fault.
Comment 6•8 years ago
In the case of auto-scaling I totally agree, then. I didn't know we had such a strategy for loop-server. BTW, we added a lbhealthcheck endpoint for loop-server and Kinto yesterday.
Reporter | ||
Comment 7•8 years ago
:mostlygeek can you confirm that we can close this bug?
Flags: needinfo?(bwong)
Comment 8•8 years ago
:rfkelly Did we add a __lbheartbeat__ endpoint to tokenserver? The __lbheartbeat__ only needs to respond with 200 OK.
Flags: needinfo?(bwong) → needinfo?(rfkelly)
Comment 9•8 years ago
Nope, doesn't look like it:

rfk@tangello:tokenserver(master)$ grep -r lbheartbeat tokenserver
rfk@tangello:tokenserver(master)$

Any other services we need to add this on?
Flags: needinfo?(rfkelly)
Comment 10•8 years ago
I think adding it to Tokenserver would be enough for now.
Comment 11•7 years ago
This got added \o/
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•1 year ago
Product: Cloud Services → Cloud Services Graveyard