Closed Bug 1243015 Opened 8 years ago Closed 5 years ago

Monitor long-lived server processes to make sure they don't go away

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jhford, Unassigned)

Details

We have things like the AWS Provisioner which have long-lived processes that ought not to disappear.  We should use something like Deadman's Snitch to make sure that emails are generated when one of these long-lived processes disappears.  Things like the s3-copy-proxy and the cloud-mirror should have these alerts set up.
What do we have where this is missing?
Flags: needinfo?(jhford)
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> What do we have where this is missing?

My understanding is that we can use SignalFX for this now?  I'm not sure if we're intending to continue using dead man's snitch or not.
Flags: needinfo?(jhford)
We have data within signalfx and alerts from deadman snitch.  John, is there more you wanted to set up with this bug?
Flags: needinfo?(jhford)
Only one service that I'm aware of uses deadman's snitch.  The reason for this bug was to have something like deadman's snitch more widely deployed.
Flags: needinfo?(jhford)
Found in triage.

jhford: is there anything actionable here right now?
Flags: needinfo?(jhford)
Summary: we should monitor that long living server processes to make sure they don't go away → Monitor long-lived server processes to make sure they don't go away
I think we should do a deadman's snitch thing for long running processes.
But to do it everywhere we should have a generic solution that doesn't involve manually creating deadman snitches and copying authentication tokens around.

We should have a programmatic way to create these deadman snitches.
Like:
---
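  // Proposed API (does not exist yet): create a snitch using TC credentials.
  let {url} = await auth.createDeadManSnitch('<snitchName>');
  while (true) {
    // Ping the deadman snitch once per iteration.
    await fetch(url);

    // Do iterative thing...
  }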
---

This could be a service that just calls deadman snitch, or it could be a reimplementation of deadman snitch for TC.
The important part:
  A deadman snitch end-point (or similar thing) should be something that can be created with TC creds/scopes,
  without the need for manual button pushing.
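
For the "reimplementation of deadman snitch for TC" option, the core bookkeeping is small. Below is a minimal in-memory sketch, not an existing TC service: the class name `SnitchRegistry`, its methods and the injectable clock are all hypothetical, and a real version would need persistence, scopes and an alerting hook.

```javascript
// Hypothetical sketch of dead man's switch bookkeeping; all names are illustrative.
class SnitchRegistry {
  constructor(now = () => Date.now()) {
    this.now = now;             // injectable clock, handy for testing
    this.snitches = new Map();  // name -> {intervalMs, lastCheckIn}
  }

  // Register a snitch that must check in at least every intervalMs milliseconds.
  create(name, intervalMs) {
    this.snitches.set(name, {intervalMs, lastCheckIn: this.now()});
  }

  // Called by the monitored process on every loop iteration.
  checkIn(name) {
    const snitch = this.snitches.get(name);
    if (!snitch) {
      throw new Error(`unknown snitch: ${name}`);
    }
    snitch.lastCheckIn = this.now();
  }

  // Names of snitches that have missed their check-in window.
  overdue() {
    const now = this.now();
    return [...this.snitches.entries()]
      .filter(([, s]) => now - s.lastCheckIn > s.intervalMs)
      .map(([name]) => name);
  }
}
```

Something would then call `overdue()` periodically and send an email for each name it returns.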

---
Note: dustin has convinced me to stop adding end-points like auth.sentryDSN/webhooktunnel/statsumToken to auth,
but we still like the concept; we just want it placed on a different service.
Kubernetes will do some of this for us (representing up-ness as the status of a replicaset), and then it's just a matter of configuring something to monitor those replicasets in kubernetes.
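
A rough sketch of what "monitoring those replicasets" could check, assuming the objects have the shape the Kubernetes API returns for a ReplicaSet (`metadata.name`, `spec.replicas`, `status.readyReplicas`). The function name `degradedReplicaSets` is hypothetical, and how the list is fetched and how alerts are delivered is left out.

```javascript
// Return the names of ReplicaSets with fewer ready replicas than desired.
// Input objects are assumed to follow the Kubernetes ReplicaSet shape.
function degradedReplicaSets(replicaSets) {
  return replicaSets
    .filter((rs) => {
      const desired = rs.spec?.replicas ?? 1;       // Kubernetes defaults replicas to 1
      const ready = rs.status?.readyReplicas ?? 0;  // field is omitted when zero
      return ready < desired;
    })
    .map((rs) => rs.metadata.name);
}

// Example input with made-up names: only the second entry is short of replicas.
const example = [
  {metadata: {name: 'queue-abc'}, spec: {replicas: 2}, status: {readyReplicas: 2}},
  {metadata: {name: 'provisioner-def'}, spec: {replicas: 1}, status: {}},
];
console.log(degradedReplicaSets(example));
```

ReplicaSets that are scaled to zero on purpose are not reported, since zero ready replicas matches zero desired.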
Component: General → Operations
Flags: needinfo?(jhford)
Component: Operations → Redeployability
I think "Operations" was the right component for this.  This is above and beyond the redeployability work.  A responsible deployment will connect the status of the various K8s deployments to their monitoring solution.
Component: Redeployability → Operations
Component: Operations → Operations and Service Requests
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED