1246008 - Add __lbheartbeat__ check specific to load balancers

Bob Micheletto [:bobm]

Reporter

Description

8 years ago

The __heartbeat__ health check endpoint on this service, and on many of our other services, returns an error when backends and other upstream services are unavailable.  That's great for uptime monitoring of a full service; however, it's not suitable for a load balancer membership test.

It can trigger a spiraling problem scenario where a load balancer marks otherwise healthy hosts as failing and reaps them when the problem is actually with a backend service such as a DB or caching server.

Adding a load balancer specific health check endpoint that only verifies health local to the instance is one solution to this problem.  An alternate approach might be to report upstream service outages in a payload but return a successful HTTP response.
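
A minimal sketch of the alternate approach, assuming each upstream check is a zero-argument callable that raises when the service is unreachable. The function name `heartbeat_report` and the service names `db` and `cache` are hypothetical and not taken from any existing service code.

```python
import json


def heartbeat_report(upstream_checks):
    """Run each upstream check and describe the result in the payload.

    upstream_checks maps a (hypothetical) service name to a zero-argument
    callable that raises an exception when that service is unreachable.
    """
    report = {}
    for name, check in upstream_checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:
            report[name] = "error: %s" % exc
    # Always answer 200: the instance itself responded, so a load balancer
    # keeps it in rotation while monitoring can still read the payload.
    return 200, json.dumps(report, sort_keys=True)


def db_down():
    # Stand-in for a failing database connectivity check.
    raise RuntimeError("connection refused")


status, body = heartbeat_report({"db": db_down, "cache": lambda: None})
print(status, body)
```

With this shape, uptime monitoring has to parse the payload rather than rely on the status code, which is the trade-off against a separate load balancer endpoint.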

Ryan Kelly [:rfkelly]

Comment 1

8 years ago

> The __heartbeat__ health check endpoint on this service, and on many of our
> other services, returns an error when backends and other upstream services
> are unavailable.

Are you sure that's the case for tokenserver?  Bug 996870 suggests that it's only checking the health of the web process itself, not its dependencies.

Flags: needinfo?(bobm)

Benson Wong [:mostlygeek]

Updated

8 years ago

Summary: Add secondary __heartbeat__ check specific to load balancers → Add __lbheartbeat__ check specific to load balancers

Benson Wong [:mostlygeek]

Comment 2

8 years ago

__heartbeat__ should be used for monitoring if the service is ok. This should trigger an alert to ops that something is wrong. 

__lbheartbeat__ should be used to monitor if the node is ok. This should trigger the ELB and ASG to stop sending traffic to the instance and replace it.
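
The split between the two endpoints could be sketched as a plain WSGI app. This is an illustration under stated assumptions, not the tokenserver implementation: `make_app`, `check_dependencies`, and the `call` helper are hypothetical names, and the JSON bodies are made up for the example.

```python
def make_app(check_dependencies):
    """Build a WSGI app exposing both health check endpoints.

    check_dependencies is a hypothetical callable that returns True when
    the database and other upstream services are reachable.
    """
    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/__lbheartbeat__":
            # Node-level check: the process answered, nothing else is tested.
            status, body = "200 OK", b"{}"
        elif path == "/__heartbeat__":
            # Service-level check: fails when an upstream dependency is down.
            if check_dependencies():
                status, body = "200 OK", b'{"status": "ok"}'
            else:
                status, body = "503 Service Unavailable", b'{"status": "error"}'
        else:
            status, body = "404 Not Found", b"{}"
        headers = [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
        ]
        start_response(status, headers)
        return [body]
    return app


def call(app, path):
    """Invoke the app in-process and return the status line it produced."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    app({"PATH_INFO": path, "REQUEST_METHOD": "GET"}, start_response)
    return captured["status"]


# Simulate a database outage: only the service-level check should fail.
app = make_app(lambda: False)
print(call(app, "/__lbheartbeat__"))
print(call(app, "/__heartbeat__"))
```

In this sketch the load balancer would be pointed at `/__lbheartbeat__` and the alerting system at `/__heartbeat__`, so a database outage pages ops without removing nodes.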

Bob Micheletto [:bobm]

Reporter

Comment 3

8 years ago

(In reply to Ryan Kelly [:rfkelly] from comment #1)
 
> Are you sure that's the case for tokenserver?  Bug 996870 suggests that it's
> only checking the health of the web process itself, not its dependencies.

Good point.  But once it does check its dependencies, we'll need a load balancer specific check.

Flags: needinfo?(bobm)

Rémy Hubscher (:natim)

Comment 4

8 years ago

Should we consider it OK to keep sending traffic to a node that can no longer contact the database?

Our strategy so far has been to stop sending traffic to a node that has lost its database connection until the connection comes back.

When a node doesn't have a database connection, it returns a 503.
When an ELB doesn't have any nodes, it returns a 503.

The current strategy lets us handle the case where only some of the nodes can reach their database.

Thoughts?

Ryan Kelly [:rfkelly]

Comment 5

8 years ago

IIUC, the problem here is that the auto-scale group will reap the boxes if they're unhealthy and try to spin up new ones in their place, which is not productive when it's the db that's at fault.
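
A toy simulation of that churn, assuming a greatly simplified auto-scaling loop that replaces every instance failing its health check on each pass. All names here (`reconcile`, `launch`, `coupled_check`, `local_check`) are hypothetical and only illustrate the failure mode.

```python
import itertools

_ids = itertools.count(1)
DB_UP = False  # simulate a database outage


def launch():
    """Pretend to start a fresh instance and return its name."""
    return "node-%d" % next(_ids)


def reconcile(instances, is_healthy):
    """One pass of the simplified loop: replace each unhealthy instance."""
    replaced = 0
    for i, inst in enumerate(instances):
        if not is_healthy(inst):
            instances[i] = launch()
            replaced += 1
    return replaced


def coupled_check(inst):
    # Health check that fails whenever the database is unreachable.
    return DB_UP


def local_check(inst):
    # Health check that only reflects the instance itself.
    return True


fleet = [launch() for _ in range(3)]
churn_coupled = sum(reconcile(fleet, coupled_check) for _ in range(4))
churn_local = sum(reconcile(fleet, local_check) for _ in range(4))
print(churn_coupled, churn_local)
```

With the coupled check, every pass replaces the whole fleet and the replacements fail the same way; with the local check, nothing is replaced while the database is down.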

Rémy Hubscher (:natim)

Comment 6

8 years ago

In the case of auto-scaling, I totally agree then. I didn't know we had such a strategy for loop-server.
BTW, we added an lbhealthcheck endpoint for loop-server and Kinto yesterday.

Bob Micheletto [:bobm]

Reporter

Comment 7

8 years ago

:mostlygeek can you confirm that we can close this bug?

Flags: needinfo?(bwong)

Benson Wong [:mostlygeek]

Comment 8

8 years ago

:rfkelly Did we add a __lbheartbeat__ endpoint to tokenserver? 
The __lbheartbeat__ only needs to respond with 200 OK.

Flags: needinfo?(bwong) → needinfo?(rfkelly)

Ryan Kelly [:rfkelly]

Comment 9

8 years ago

Nope, doesn't look like it:

  rfk@tangello:tokenserver(master)$ grep -r lbheartbeat tokenserver
  rfk@tangello:tokenserver(master)$ 

Any other services we need to add this on?

Flags: needinfo?(rfkelly)

Benson Wong [:mostlygeek]

Comment 10

8 years ago

I think adding it to Tokenserver would be enough for now.

Ryan Kelly [:rfkelly]

Comment 11

7 years ago

This got added \o/

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

BMO Automation

Updated

1 year ago

Product: Cloud Services → Cloud Services Graveyard

Categories

(Cloud Services Graveyard :: Server: Token, defect)

Tracking

(Not tracked)

People

(Reporter: bobm, Unassigned)

Security

(public)
