Closed Bug 1534841 Opened 6 years ago Closed 5 years ago

Replicate SignalFX monitoring in Stackdriver

Categories

(Taskcluster :: Operations and Service Requests, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Unassigned)

I've been going through SignalFX today removing some cruft, but also trying to get a feeling for the types of monitoring we'd like to reproduce in Stackdriver.

First, there was per-service monitoring (e.g. secrets, queue, auth) where we were tracking API requests, procs (cpu/mem), tables (db), and pulse events. With the exception of the services called out below, these dashboards have had no data since Feb 22, which corresponds with our cutover to Stackdriver, so I've deleted them. My understanding is that we will set up equivalent monitoring for these services as required.

I have not touched the dashboards for ec2-manager, aws-provisioner, and cloud-mirror. If the equivalent services (worker manager and the object service, respectively) won't be ready before May 1 (likely), we will need to patch these services to log to Stackdriver.

Next, there are some artifact upload stats for docker-worker. I'm not sure if these are still relevant, so I've cc-ed Wander.

There are also some roll-up dashboards:

  • Gecko pending
  • service levels
  • Taskcluster overview

I think these ^^ are all safe to delete, but I'd like someone to verify that the sub-graphs contained therein are not useful. Those fed by statsum are still plotting trends.

Last among the dashboards are statsum stats (requests and req/cpu) and SFx Usage. I'm guessing/hoping that statsum counts will drop to 0 when bug 1529461 is fixed. I'm not sure what SFx Usage is actually tracking, but I would have naively assumed that number should be going down, not up as it currently is.

Finally, there are the detectors that generate the email alerts we receive. Here's the full list of detectors:

  • gecko-t-osx-1010 pending wait times Detector
  • DataPoints received Detector
  • Failed Tasks Detector
  • Failing instance terminations
  • Failing to get instances
  • Instance Lifespans Detector
  • Linux failed/completed ratio
  • Pending gecko-decision wait times > 20 minutes
  • SLO Detector: Gecko Pending
  • Spot Request Info Detector
  • auth process monitoring
  • cloud-mirror not copying
  • diskspace threshold reached Detector
  • pending time for balrog worker
  • pending time for signing worker
  • pending times for pushapk workers
  • taskcluster-backups.end.1h.count Detector
  • taskcluster-queue.dependency-resolver.resolved-queue-polling-stopped
  • taskcluster-treeherder not publishing

I confess that I ignore the mail I get from SignalFX about these, but that doesn't mean that some of them are not valuable. Can someone weigh in on what should be replicated in Stackdriver?

Related: there are a handful of non-Taskcluster people (mostly releng and CIDuty) who also have access to SignalFX. I'll be emailing them to make sure they have nothing of value still in SignalFX.

(In reply to Chris Cooper [:coop] pronoun: he from comment #1)

Related: there are a handful of non-Taskcluster people (mostly releng and CIDuty) who also have access to SignalFX. I'll be emailing them to make sure they have nothing of value still in SignalFX.

I've reached out to everyone who had an account in SignalFX, and no one uses SignalFX but us (TC).

I've started deleting accounts and removing alerts/detectors.

There were two active access tokens that I just disabled:

  1. statsum-kubernetes, created by bstack on Apr 8, 2018
  2. tc-stats-collector-tests, created by dustin on Jul 15, 2018

Brian: the statsum-kubernetes token was receiving ~2,000/min when I disabled it. The source will likely start failing now that the token is disabled. Should we turn this off at the source? Does that involve shutting down statsum (bug 1529461 and bug 1534672)?

Flags: needinfo?(bstack)

I expect that is the statsum token, yeah. I am wondering though if turning off statsum will make services that are trying to write to statsum fail. Otherwise I think we're safe to turn off statsum.

Flags: needinfo?(bstack)

I just turned statsum off, along with its static IP.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED