Create skeleton dashboards and alerts for AWS and GCP projects
Categories
(Cloud Services :: Operations: Metrics/Monitoring, enhancement, P5)
Tracking
(Not tracked)
People
(Reporter: brian, Unassigned)
References
(Depends on 1 open bug)
Details
When spinning up a new project, instead of requiring people to write their infrastructure code from scratch, we render templates from a "skeleton" that defines the base structure and resources that every project needs.
We should expand the skeleton to include Grafana dashboards with graphs and alerts. As with our existing skeletons, this will set people up on the happy path but let them customize as much as they need.
This solves a few problems:
- Creating dashboards by hand is tedious
- The Grafana defaults for alerts are not always appropriate
- People do not always know the best metric to use or the best way to aggregate in order to measure something, so some of our existing graphs and alerts are misleading
- We have no baseline recommended alerting across projects, so some projects have gaps
The approach also has to respect a few constraints:
- most projects will need to add new graphs and alerts
- most projects will need to customize their alert thresholds
- graphs are much easier to build iteratively through the UI than to write as JSON
To do this, I propose
- building the skeleton dashboards in Grafana
- creating a process to export and transform them into a jinja template
- rendering these templates in our existing scripts for bootstrapping new projects
- uploading the rendered templates to Grafana, either from our existing bootstrap script or from a new one
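A rough sketch of the export-and-transform step, assuming the exported dashboard is plain JSON and that the hand-built skeleton dashboard uses a recognizable project name. The function name `dashboard_to_template`, the `project_name` template variable, and the sample dashboard are hypothetical; `id`, `uid`, and `version` are the instance-specific fields in Grafana's dashboard JSON.

```python
import json

# Fields that Grafana assigns per instance. Dropping them lets the rendered
# template create a fresh dashboard instead of colliding with the original.
INSTANCE_FIELDS = ("id", "uid", "version")


def dashboard_to_template(exported: dict, project_name: str) -> str:
    """Turn an exported dashboard into jinja template text.

    Every occurrence of the skeleton project's name is replaced with a
    ``{{ project_name }}`` placeholder (a hypothetical variable name).
    """
    dashboard = {k: v for k, v in exported.items() if k not in INSTANCE_FIELDS}
    text = json.dumps(dashboard, indent=2, sort_keys=True)
    return text.replace(project_name, "{{ project_name }}")


# Made-up export, trimmed to the fields that matter for the example.
exported = {
    "id": 42,
    "uid": "abc123",
    "version": 7,
    "title": "skeleton-project prod",
    "panels": [
        {
            "title": "Request rate",
            "targets": [
                {"expr": 'rate(requests_total{project="skeleton-project"}[5m])'}
            ],
        }
    ],
}

template_text = dashboard_to_template(exported, "skeleton-project")
print(template_text)
```

One thing a real transform would likely need to handle: Grafana legend formats also use double curly braces, so those would have to be escaped (for example with jinja `raw` blocks) before rendering.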
For each project, I propose that the skeleton generate two dashboards:
A) A dashboard where all the graphs are constrained to the prod resources and are configured to alert
B) A templated dashboard that can be used to graph any environment
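To illustrate how both dashboards could come from one base definition, here is a minimal sketch. It assumes the base panels reference a dashboard variable named `$env`; that variable name, the `build_dashboards` helper, the environment list, and the sample query are all hypothetical. Alert configuration is left out for brevity.

```python
import copy
from typing import List, Tuple


def build_dashboards(base: dict, project: str, environments: List[str]) -> Tuple[dict, dict]:
    """Return (prod_dashboard, templated_dashboard) derived from one base."""
    # Dashboard A: pin every query to prod so alerts evaluate a fixed target.
    prod = copy.deepcopy(base)
    prod["title"] = f"{project} prod"
    for panel in prod["panels"]:
        for target in panel["targets"]:
            target["expr"] = target["expr"].replace("$env", "prod")

    # Dashboard B: keep $env in the queries and expose it as a dashboard
    # variable so any environment can be selected in the UI.
    templated = copy.deepcopy(base)
    templated["title"] = f"{project} (any environment)"
    templated["templating"] = {
        "list": [{"name": "env", "type": "custom", "query": ",".join(environments)}]
    }
    return prod, templated


base = {
    "title": "base",
    "panels": [{"title": "Up", "targets": [{"expr": 'up{env="$env"}'}]}],
}
prod, templated = build_dashboards(base, "myproject", ["prod", "stage", "dev"])
print(prod["panels"][0]["targets"][0]["expr"])
print(templated["templating"]["list"][0]["query"])
```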
I agree that having a set of "standard" dashboards as a starting point would be a very useful addition to the skeleton. Additionally, I would be interested in using code to automate the creation of panels in projects where the same alerts need to be configured identically across multiple variables. Taskcluster is one example: its alerting for stage, communitytc, and firefoxci is nearly identical, except that the cluster name and alerting address differ for each.
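Something like the following is what I have in mind for the Taskcluster case. This is only a sketch: the cluster names come from above, but the notification channel identifiers, the metric name, and the `panel_for_cluster` helper are made up, and the panel layout only loosely follows Grafana's panel alert JSON.

```python
# Cluster name -> notification channel identifier (placeholder values).
CLUSTERS = {
    "stage": "notify-stage",
    "communitytc": "notify-communitytc",
    "firefoxci": "notify-firefoxci",
}


def panel_for_cluster(cluster: str, channel: str) -> dict:
    """Build one alerting panel; only the cluster and channel vary."""
    return {
        "title": f"Error rate ({cluster})",
        "targets": [
            # errors_total is a hypothetical metric name.
            {"expr": f'rate(errors_total{{cluster="{cluster}"}}[5m])'}
        ],
        "alert": {
            "name": f"{cluster} error rate",
            "notifications": [{"uid": channel}],
        },
    }


panels = [panel_for_cluster(name, channel) for name, channel in CLUSTERS.items()]
for panel in panels:
    print(panel["title"], "->", panel["alert"]["notifications"][0]["uid"])
```

Keeping the per-cluster differences in one mapping means a threshold or query change only has to be made once.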
It would be particularly nice to include panels that show off Grafana's various features like table display of data in the default/skeleton setup, mainly as a reminder that the more advanced visualizations are available and how to use them.
I also wonder whether we could use a couple of API calls at project setup time, instead of the current manual process, to create a skeleton project's PagerDuty service and PagerDuty/Grafana integration.
This does all rely on having recommended alerting, but I think that could potentially be included in our onboarding standards as they become more concrete.