Closed
Bug 1243015
Opened 8 years ago
Closed 5 years ago
Monitor long-lived server processes to make sure they don't go away
Categories
(Taskcluster :: Operations and Service Requests, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jhford, Unassigned)
Details
We have things like the AWS Provisioner which have long-lived processes that ought not to disappear. We should use something like Deadman's Snitch to make sure that emails are generated when one of these long-lived processes disappears. Things like the s3-copy-proxy and the cloud-mirror should have these alerts set up.
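For anyone unfamiliar with the idea, here is a minimal sketch of what a dead man's switch does on the monitoring side: each process checks in periodically, and anything that stays silent past its allowed interval gets reported. The class name, method names, and in-memory storage are all hypothetical; this is not how Deadman's Snitch itself is implemented, and a real setup would send the email instead of returning a list.

```javascript
// Hypothetical in-memory dead man's switch, for illustration only.
class DeadMansSwitch {
  constructor() {
    this.snitches = new Map(); // name -> {intervalMs, lastSeenMs}
  }

  // Register a process and the longest silence we tolerate from it.
  register(name, intervalMs, nowMs = Date.now()) {
    this.snitches.set(name, {intervalMs, lastSeenMs: nowMs});
  }

  // Record a heartbeat from a registered process.
  checkIn(name, nowMs = Date.now()) {
    const snitch = this.snitches.get(name);
    if (!snitch) {
      throw new Error(`unknown snitch: ${name}`);
    }
    snitch.lastSeenMs = nowMs;
  }

  // Names of processes whose last heartbeat is older than their interval.
  overdue(nowMs = Date.now()) {
    const missing = [];
    for (const [name, {intervalMs, lastSeenMs}] of this.snitches) {
      if (nowMs - lastSeenMs > intervalMs) {
        missing.push(name);
      }
    }
    return missing;
  }
}
```

The timestamps are passed in explicitly so the behaviour can be checked without waiting on a real clock.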
Reporter | ||
Comment 2•8 years ago
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #1)
> What do we have where this is missing?

My understanding is that we can use SignalFX for this now? I'm not sure if we're intending to continue using dead man's snitch or not.
Flags: needinfo?(jhford)
Comment 3•7 years ago
We have data within signalfx and alerts from deadman snitch. John, is there more you wanted to set up with this bug?
Flags: needinfo?(jhford)
Reporter | ||
Comment 4•7 years ago
Only one service that I'm aware of uses deadman's snitch. The reason for this bug was to have something like deadman's snitch more widely deployed.
Flags: needinfo?(jhford)
Comment 5•6 years ago
Found in triage. jhford: is there anything actionable here right now?
Flags: needinfo?(jhford)
Summary: we should monitor that long living server processes to make sure they don't go away → Monitor long-lived server processes to make sure they don't go away
Comment 6•6 years ago
I think we should do a deadman's snitch thing for long-running processes. But to do it everywhere we should have a generic solution that doesn't involve manually creating deadman snitches and copying authentication tokens around. We should have a programmatic way to create these deadman snitches. Like:
---
let {url} = await auth.createDeadManSnitch('<snitchName>');
while (true) {
  // ping deadman snitch
  await fetch(url);
  // Do iterative thing...
}
---
This could be a service that just calls deadman snitch, or it could be a reimplementation of deadman snitch for TC. The important part: a deadman snitch end-point (or similar thing) should be something that can be created with TC creds/scopes, without the need for manual button pushing.
---
Note: Dustin has convinced me to stop adding end-points like auth.sentryDSN/webhooktunnel/statsumToken to auth, but we still like the concept; we just want it placed on a different service.
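To make the loop above a bit more concrete, here is a rough sketch of how the pinging side could be wrapped up so that services don't each hand-roll it. Everything here is an assumption for illustration: `runWithHeartbeat`, its option names, and the injectable `ping` are hypothetical, and the snitch URL is assumed to have been obtained already (for example from the proposed, not yet existing, `createDeadManSnitch` call).

```javascript
// Hypothetical helper: run `iterate` repeatedly, pinging the snitch URL
// before each iteration. `ping` defaults to the global fetch (Node 18+ or a
// browser) and is injectable so the loop can be exercised without a network.
async function runWithHeartbeat({
  url,
  iterate,
  ping = (u) => fetch(u),
  maxIterations = Infinity,
}) {
  let failedPings = 0;
  for (let i = 0; i < maxIterations; i++) {
    try {
      await ping(url);
    } catch (err) {
      // A failed ping should not stop the real work; the missed check-in
      // is what eventually triggers the alert.
      failedPings++;
    }
    // If this throws, the loop ends, the pings stop, and the snitch fires.
    await iterate(i);
  }
  return {failedPings};
}
```

The one design choice worth noting is that ping failures are swallowed while failures in the work itself are not: a monitoring outage shouldn't take the service down, but a dead work loop should go quiet so the alert is raised.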
Comment 7•6 years ago
Kubernetes will do some of this for us (representing up-ness as the status of a replicaset), and then it's just a matter of configuring something to monitor those replicasets in kubernetes.
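As a sketch of what "monitor those replicasets" might boil down to, the check below compares the desired replica count with the ready count for each ReplicaSet. It assumes the objects have already been fetched from the Kubernetes API by some other means; the function name `degradedReplicaSets` is hypothetical, and only the standard `metadata.name`, `spec.replicas`, and `status.readyReplicas` fields are used.

```javascript
// Hypothetical check: given ReplicaSet objects as returned by the Kubernetes
// API, list those that have fewer ready pods than desired.
function degradedReplicaSets(replicaSets) {
  return replicaSets
    .map((rs) => ({
      name: rs.metadata.name,
      desired: rs.spec.replicas ?? 1, // Kubernetes defaults replicas to 1
      ready: rs.status?.readyReplicas ?? 0, // the field is omitted when zero
    }))
    .filter(({desired, ready}) => ready < desired);
}
```

Whatever monitoring solution is in use would then alert on a non-empty result, which gives roughly the same "tell me when it goes away" signal as a snitch without the service having to ping anything itself.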
Component: General → Operations
Flags: needinfo?(jhford)
Updated•6 years ago
Component: Operations → Redeployability
Comment 8•6 years ago
I think "Operations" was the right component for this. This is above and beyond the redeployability work. A responsible deployment will connect the status of the various K8s deployments to their monitoring solution.
Component: Redeployability → Operations
Updated•5 years ago
Component: Operations → Operations and Service Requests
Updated•5 years ago
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED