Closed Bug 1243015 Opened 8 years ago Closed 5 years ago

Monitor long-lived server processes to make sure they don't go away

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: jhford, Unassigned)

Details

John Ford [:jhford] CET/CEST Berlin Time

Reporter

Description

•

8 years ago

We have things like the AWS Provisioner, which run long-lived processes that ought not to disappear.  We should use something like Dead Man's Snitch to make sure that emails are generated when one of these long-lived processes disappears.  Things like the s3-copy-proxy and the cloud-mirror should have these alerts set up.

Jonas Finnemann Jensen (:jonasfj)

Comment 1

•

8 years ago

What do we have where this is missing?

Flags: needinfo?(jhford)

John Ford [:jhford] CET/CEST Berlin Time

Reporter

Comment 2

•

8 years ago

(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> What do we have where this is missing?

My understanding is that we can use SignalFx for this now?  I'm not sure whether we intend to continue using Dead Man's Snitch.

Flags: needinfo?(jhford)

Greg Arndt [:garndt]

Comment 3

•

7 years ago

We have data within SignalFx and alerts from Dead Man's Snitch.  John, is there more you wanted to set up with this bug?

Flags: needinfo?(jhford)

John Ford [:jhford] CET/CEST Berlin Time

Reporter

Comment 4

•

7 years ago

Only one service that I'm aware of uses Dead Man's Snitch.  The reason for this bug was to have something like Dead Man's Snitch more widely deployed.

Flags: needinfo?(jhford)

Chris Cooper [:coop] (he/him)

Comment 5

•

6 years ago

Found in triage.

jhford: is there anything actionable here right now?

Flags: needinfo?(jhford)

Summary: we should monitor that long living server processes to make sure they don't go away → Monitor long-lived server processes to make sure they don't go away

Jonas Finnemann Jensen (:jonasfj)

Comment 6

•

6 years ago

I think we should do a Dead Man's Snitch thing for long-running processes.
But to do it everywhere we should have a generic solution that doesn't involve manually creating snitches and copying authentication tokens around.

We should have a programmatic way to create these snitches.
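Like:
---
  // Proposed API (does not exist yet): create a snitch using TC credentials
  let {url} = await auth.createDeadManSnitch('<snitchName>');
  while (true) {
    // Ping the deadman snitch to signal that this process is still alive
    await fetch(url);

    // Do iterative thing...
  }
---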
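
A slightly fuller sketch of the loop above, assuming a check-in URL has already been obtained somehow. Everything here is hypothetical: `runWithHeartbeat`, `iterate`, `ping`, and `maxIterations` are illustrative names, not part of any Taskcluster or Dead Man's Snitch API. Unlike the snippet above, this variant pings after each successful iteration, so an iteration that throws stops the pings and lets the snitch fire.

```javascript
// Hypothetical helper: run `iterate` repeatedly and check in after each
// successful iteration. `ping` is injectable so the loop can be tested
// without network access; by default it uses the global fetch (Node 18+).
async function runWithHeartbeat({ url, iterate, ping = (u) => fetch(u), maxIterations = Infinity }) {
  let completed = 0;
  while (completed < maxIterations) {
    // If this throws, the loop ends and no further pings are sent,
    // which is what makes the snitch report the process as missing.
    await iterate();

    try {
      await ping(url);
    } catch (err) {
      // A failed ping should not kill the worker; the snitch alerts
      // on its own if check-ins keep failing.
      console.error(`heartbeat ping failed: ${err.message}`);
    }
    completed += 1;
  }
  return completed;
}
```

In a real service `maxIterations` would be left at its default so the loop runs until the process exits.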

This could be a service that just calls deadman snitch, or it could be a reimplementation of deadman snitch for TC.
The important part:
  A deadman snitch end-point (or similar thing) should be something that can be created with TC creds/scopes. 
  Without the need for manual button pushing.
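
If this ended up as a reimplementation rather than a wrapper, the core bookkeeping could be quite small. The following is only a sketch under that assumption: `SnitchRegistry` and its methods are made-up names, state is kept in memory, and authentication, persistence, and the actual alerting are left out.

```javascript
// Hypothetical in-memory dead man's switch: each snitch records when it
// last checked in and how long it may stay silent before it is overdue.
class SnitchRegistry {
  constructor() {
    this.snitches = new Map();
  }

  // Register a snitch that is expected to check in at least every intervalMs.
  create(name, intervalMs, now = Date.now()) {
    this.snitches.set(name, { intervalMs, lastCheckIn: now });
  }

  // Record a check-in from a running process.
  checkIn(name, now = Date.now()) {
    const snitch = this.snitches.get(name);
    if (!snitch) {
      throw new Error(`unknown snitch: ${name}`);
    }
    snitch.lastCheckIn = now;
  }

  // Names of all snitches that have been silent for longer than their interval.
  overdue(now = Date.now()) {
    return [...this.snitches]
      .filter(([, s]) => now - s.lastCheckIn > s.intervalMs)
      .map(([name]) => name);
  }
}
```

A periodic job would call `overdue()` and send an email or similar for each name it returns.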

---
Note: Dustin has convinced me to stop adding end-points like auth.sentryDSN/webhooktunnel/statsumToken to auth,
but we still like the concept; we just want it placed on a different service.

Dustin J. Mitchell [:dustin] (he/him)

Comment 7

•

6 years ago

Kubernetes will do some of this for us (representing up-ness as the status of a replicaset), and then it's just a matter of configuring something to monitor those replicasets in kubernetes.
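
As a rough illustration of what "monitoring those replicasets" could mean, here is a sketch that inspects ReplicaSet objects in the JSON shape the Kubernetes API returns (`spec.replicas`, `status.readyReplicas`). The function names are hypothetical, and how the list is fetched and where alerts go are deliberately left out.

```javascript
// Summarize one ReplicaSet object (as JSON from the Kubernetes API).
function replicaSetHealth(rs) {
  // spec.replicas defaults to 1 when unset.
  const desired = rs.spec?.replicas ?? 1;
  // status.readyReplicas is omitted from the JSON when it is zero.
  const ready = rs.status?.readyReplicas ?? 0;
  return {
    name: rs.metadata?.name ?? '<unknown>',
    desired,
    ready,
    healthy: ready >= desired,
  };
}

// Given a ReplicaSetList, return summaries for the ones that have fewer
// ready pods than desired. ReplicaSets scaled to zero count as healthy.
function unhealthyReplicaSets(list) {
  return list.items.map(replicaSetHealth).filter((h) => !h.healthy);
}
```

A monitoring job could run this against the list periodically and alert on any non-empty result.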

Component: General → Operations

Flags: needinfo?(jhford)

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

6 years ago

Component: Operations → Redeployability

Dustin J. Mitchell [:dustin] (he/him)

Comment 8

•

6 years ago

I think "Operations" was the right component for this.  This is above and beyond the redeployability work.  A responsible deployment will connect the status of the various K8s deployments to their monitoring solution.

Component: Redeployability → Operations

Nobody; OK to take it and work on it

Assignee

Updated

•

5 years ago

Component: Operations → Operations and Service Requests

Brian Stack [:bstack]

Updated

•

5 years ago

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Categories

(Taskcluster :: Operations and Service Requests, task)
