I've been going through SignalFX today removing some cruft, but also trying to get a feeling for the types of monitoring we'd like to reproduce in Stackdriver.

First, there was per-service monitoring (e.g., secrets, queue, auth) tracking API requests, procs (cpu/mem), tables (db), and pulse events. With the exception of the services called out below, these dashboards have had no data since Feb 22, which corresponds to our cutover to Stackdriver, so I've deleted them. My understanding is that we will set up equivalent monitoring for these services as required.
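
For what it's worth, here's a rough sketch of what pushing one equivalent per-service metric into Stackdriver could look like, assuming the google-cloud-monitoring Python client; the project ID and metric type below are placeholders, not anything we've actually defined:

```python
# Rough sketch: report a custom per-service metric (e.g., an API request count)
# to Stackdriver Monitoring. Project ID and metric type are placeholders.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project = "projects/my-gcp-project"  # placeholder project

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/secrets/api_request_count"  # placeholder metric
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now - int(now)) * 1e9)}}
)
series.points = [monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})]

client.create_time_series(name=project, time_series=[series])
```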

I have not touched the dashboards for ec2-manager, aws-provisioner, and cloud-mirror. If the equivalent services (worker-manager, which replaces ec2-manager and aws-provisioner, and the object service, which replaces cloud-mirror) won't be ready before May 1 (likely), we will need to patch these services to log to Stackdriver.
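
As a rough sketch only (not a commitment to any particular approach), patching a service to log to Stackdriver could be as simple as emitting structured entries via the google-cloud-logging client; the log name and payload fields below are hypothetical:

```python
# Minimal sketch: write a structured entry to Stackdriver (Google Cloud Logging)
# so dashboards/alerts can filter on its fields. Log name and fields are hypothetical.
from google.cloud import logging

client = logging.Client()
logger = client.logger("aws-provisioner-events")  # hypothetical log name

logger.log_struct(
    {
        "event": "instance-requested",  # illustrative fields, not a real schema
        "workerType": "gecko-t-linux",
        "region": "us-west-2",
    },
    severity="INFO",
)
```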

Next, there are some artifact upload stats for docker-worker. I'm not sure if these are still relevant, so I've cc-ed Wander.

There are also some roll-up dashboards:
* Gecko pending
* service levels
* Taskcluster overview

I think these ^^ are all safe to delete, but I'd like someone to verify that the sub-graphs contained therein are not useful. Those fed by statsum are still plotting trends.

Last among the dashboards are statsum stats (requests and req/cpu) and SFx Usage. I'm guessing/hoping that statsum counts will drop to 0 when bug 1529461 is fixed. I'm not sure what SFx Usage is actually tracking, but I would have naively assumed that number should be going down, not up as it currently is.

Finally, there are the detectors that generate the email alerts we receive. Here's the full list of detectors:
* gecko-t-osx-1010 pending wait times Detector
* DataPoints received Detector
* Failed Tasks Detector
* Failing instance terminations
* Failing to get instances
* Instance Lifespans Detector
* Linux failed/completed ratio
* Pending gecko-decision wait times > 20 minutes
* SLO Detector: Gecko Pending
* Spot Request Info Detector
* auth process monitoring
* cloud-mirror not copying
* diskspace threshold reached Detector
* pending time for balrog worker
* pending time for signing worker
* pending times for pushapk workers
* taskcluster-backups.end.1h.count Detector
* taskcluster-queue.dependency-resolver.resolved-queue-polling-stopped
* taskcluster-treeherder not publishing

I confess that I ignore the mail I get from SignalFX about these, but that doesn't mean that some of them are not valuable. Can someone weigh in on what should be replicated in Stackdriver?
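
To make the ask concrete, here's a rough sketch of how one of these (say, the gecko-decision pending-time detector) might be replicated as a Stackdriver alerting policy using the google-cloud-monitoring Python client; the project, metric filter, and thresholds are placeholders, not our actual values:

```python
# Rough sketch: recreate something like "Pending gecko-decision wait times > 20 minutes"
# as a Stackdriver alerting policy. Project, metric type, and thresholds are placeholders.
from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

client = monitoring_v3.AlertPolicyServiceClient()
project = "projects/my-gcp-project"  # placeholder project

policy = monitoring_v3.AlertPolicy(
    display_name="Pending gecko-decision wait times > 20 minutes",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="pending wait time above 20 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                # placeholder metric; whatever the queue ends up reporting
                filter='metric.type = "custom.googleapis.com/queue/pending_wait_time"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=20 * 60,                      # seconds
                duration=duration_pb2.Duration(seconds=300),  # sustained for 5 minutes
            ),
        )
    ],
)

created = client.create_alert_policy(name=project, alert_policy=policy)
print("created", created.name)
```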