Configure monitoring for cloudops taskcluster
Categories
(Cloud Services :: Operations: Taskcluster, task)
Tracking
(Not tracked)
People
(Reporter: brian, Assigned: edunham)
We need to decide what we want to monitor for taskcluster, and how we're going to do it.
For the question of "are the services even running" we can count on k8s to try to keep them running and alert on rising restart count, which would indicate crashes.
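As a rough illustration of "alert on rising restart count", the check amounts to comparing two samples of per-pod restart counts. This is a minimal sketch only: the pod names, the threshold, and the idea of passing plain dictionaries are assumptions for illustration, not part of the actual setup.

```python
def restart_increases(previous, current, threshold=3):
    """Return pods whose restart count rose by at least `threshold`
    between two samples. Both arguments map pod name -> restart count.
    The threshold value is a hypothetical default."""
    rising = {}
    for pod, count in current.items():
        # A pod missing from the previous sample is treated as starting at 0.
        delta = count - previous.get(pod, 0)
        if delta >= threshold:
            rising[pod] = delta
    return rising
```

For example, comparing `{"queue-abc": 1}` with `{"queue-abc": 5}` would flag `queue-abc` with an increase of 4.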
For the web services we can track request rates and timings for success and errors using info from the load balancer and alert on high errors.
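The "alert on high errors" idea could be sketched as below, assuming the load balancer data has already been reduced to a count of responses per HTTP status code for some time window. The 5% threshold and the minimum request count are hypothetical values chosen for illustration.

```python
def error_rate(status_counts):
    """Fraction of requests that returned a 5xx status.
    `status_counts` maps HTTP status code -> number of requests."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return errors / total


def should_alert(status_counts, max_error_rate=0.05, min_requests=100):
    """Alert only when there is enough traffic for the rate to be meaningful
    and the 5xx rate exceeds the (hypothetical) threshold."""
    total = sum(status_counts.values())
    return total >= min_requests and error_rate(status_counts) > max_error_rate
```

The minimum request count is there so that a single failed request on a quiet service does not page anyone.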
For the background services we need to figure out a way to know if they are completing work successfully or not.
Comment 1 (Reporter) • 5 years ago
For background services and crons we now have https://docs.taskcluster.net/docs/manual/deploying/monitoring
In addition to what was mentioned before, we should monitor rabbitmq queue depth. We can deploy https://github.com/influxdata/telegraf/tree/master/plugins/inputs/rabbitmq to the nonprod and prod per-realm telegrafs and point them at the correct rabbitmqs.
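To make the queue depth check concrete, here is a small sketch that is not the telegraf plugin itself. It assumes the input is the JSON list returned by the RabbitMQ management API's `/api/queues` endpoint, where each queue object carries a `name` and a `messages` count. The depth threshold is a hypothetical value.

```python
import json


def deep_queues(queues_json, max_depth=1000):
    """Return (name, depth) pairs for queues holding more than `max_depth`
    messages, sorted by queue name. `queues_json` is the raw JSON text of a
    list of queue objects with "name" and "messages" fields."""
    queues = json.loads(queues_json)
    return sorted(
        (q["name"], q.get("messages", 0))
        for q in queues
        if q.get("messages", 0) > max_depth
    )
```

In practice the same comparison would live in whatever alerting system consumes the telegraf metrics; the function only shows the shape of the check.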
Comment 2 (Reporter) • 5 years ago
Edunham has the rest set up (including PagerDuty and Pingdom), but rabbitmq monitoring is still WIP.
Comment 3 (Reporter) • 5 years ago
Talked with edunham this morning about what's needed to close this:
- Set everything up for firefox ci and have bpitts review
- Retest log-based metrics PR then have bpitts re-review and merge
- Automate or document rabbitmq user creation for use by telegraf plugin
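One possible shape for automating the rabbitmq user creation is sketched below. It only builds the `rabbitmqctl` command lines (`add_user` and `set_user_tags`) rather than running them. The `telegraf` username, the generated password, and the choice of the `monitoring` tag are assumptions for illustration; any per-vhost permissions the deployment needs are not covered here.

```python
import secrets


def telegraf_user_commands(username="telegraf", password=None):
    """Build rabbitmqctl invocations that create a monitoring-only user.
    The default username is hypothetical; a random password is generated
    when none is supplied."""
    password = password or secrets.token_urlsafe(24)
    return [
        ["rabbitmqctl", "add_user", username, password],
        ["rabbitmqctl", "set_user_tags", username, "monitoring"],
    ]
```

Each list could then be passed to `subprocess.run(cmd, check=True)` on a host with access to the broker, with the password stored wherever the telegraf configuration reads its secrets from.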
Comment 4 (Reporter) • 5 years ago
We can continue to iterate on what we graph and what we alert on, but I think all the basics are in place and working.