Please create a dashboard for monitoring durable sync operational status metrics
Categories
(Cloud Services :: Operations: Miscellaneous, task, P2)
Tracking
(Not tracked)
People
(Reporter: rachel, Assigned: eolson)
Details
Hi,
We've recently drafted this doc around operational status related to durable sync.
Can we get a dashboard tracking the metrics referred to in that doc so we can review as needed? We'll probably want to monitor (and revise) these metrics for a while before wiring this up to pagers, etc., but this would be step one.
The metrics that we're watching here are:
- Heartbeat endpoint
- Status codes
- Request times
See the doc for details on them, and let me know if you have any questions here.
Reporter
Comment 1 • 5 years ago
:eolson:, I know that you were at one point looking into this; is that still happening?
Assignee
Comment 2 • 5 years ago
I am currently working on this, using Influx/Kapacitor to generate the request rate percentages. The heartbeat will be monitored with Pingdom, and I still believe the request times can be aggregated by HTTP method with a log-based metric in Stackdriver. I'm not sure yet how to tie them all together.
Reporter
Comment 3 • 5 years ago
Sounds good, thanks Erik. One more thing that occurs to me: if it's easier to skip the request times and find another metric that we can track without jumping through a lot of hoops, we could go that route. If you think that would help here, let me know, and I can dig into what we actually CAN get from GCP that might make sense for tracking operational status.
Assignee
Comment 4 • 5 years ago
That makes sense. I want to chat with Brian when he gets back from leave/vacation tomorrow to see how to get the custom metrics from Stackdriver into Influx so we can look at it all on one dashboard.
Assignee
Comment 5 • 5 years ago
Operational Status Dashboard is here: https://earthangel-b40313e5.influxcloud.net/d/JyobZHPZk/sync-operational-status?orgId=1&var-environment=stage-sync
I’m still working on the 2xx % calculation, and also a way to import uptime checks for the heartbeat endpoint.
The data is all out there, just figuring out how to get it in one place.
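For reference, the 2xx % figure could be derived roughly as follows. This is only a minimal sketch, assuming per-status-code request counts for a time window are already available; the function name and the sample counts are hypothetical and not taken from the dashboard's actual query.

```python
def success_percentage(counts_by_status):
    """Return the share of 2xx responses as a percentage of all requests.

    counts_by_status maps an HTTP status code to the number of requests
    that returned it during the window (hypothetical input shape).
    """
    total = sum(counts_by_status.values())
    if total == 0:
        return None  # no traffic in the window; avoid dividing by zero
    ok = sum(n for code, n in counts_by_status.items() if 200 <= int(code) < 300)
    return 100.0 * ok / total


# Made-up example window: 90 OK responses out of 100 requests.
sample = {200: 85, 204: 5, 404: 5, 500: 5}
print(success_percentage(sample))
```

Returning `None` for an empty window keeps "no traffic" distinct from "0% success", which matters if this ever gets wired up to alerting.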
Assignee
Comment 6 • 5 years ago
Heartbeat endpoints are now being reported as well. I'm still working on the request rate percentage calculation metric.
Reporter
Comment 7 • 5 years ago
Awesome, thank you, Erik.
Reporter
Comment 8 • 5 years ago
Is there a(n easy) way to get averages visible on these charts, and the ones for request durations? Poking around the UI now and not seeing anything obvious, but let me know if I'm missing anything. If not, is it a pain to add it?
Assignee
Comment 9 • 5 years ago
It looks like we have the 99th and 50th percentiles available; they are listed as different "measurements". There aren't a ton of options for these because of how the data is created: a field is extracted from the nginx log, converted to a log-based metric of type "distribution", then exported/imported to influxdb as the 3 different measurements.
I added the two other measurements; if you refresh the page you should see them now.
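To illustrate why a "distribution" metric only yields a limited set of values, here is a rough sketch of estimating a percentile from bucketed request times by linear interpolation. It is an assumption about the general technique, not a description of how Stackdriver or influxdb actually compute these measurements; the bucket layout and the numbers are hypothetical.

```python
def estimate_percentile(buckets, pct):
    """Estimate a percentile from histogram buckets.

    buckets is a list of (lower_ms, upper_ms, count) tuples in ascending
    order (hypothetical layout). The value is interpolated linearly
    inside the bucket that contains the target rank.
    """
    total = sum(count for _, _, count in buckets)
    if total == 0:
        return None  # nothing recorded in this window
    target = total * pct / 100.0
    seen = 0
    for lower, upper, count in buckets:
        if count and seen + count >= target:
            fraction = (target - seen) / count
            return lower + fraction * (upper - lower)
        seen += count
    return buckets[-1][1]


# Made-up request-time histogram, in milliseconds.
request_times = [(0, 100, 50), (100, 200, 40), (200, 1000, 10)]
print(estimate_percentile(request_times, 50))
print(estimate_percentile(request_times, 99))
```

Because the raw request times are gone once they are bucketed, any percentile is only as precise as the bucket boundaries, which is one reason an exact average may not be recoverable from the same data.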
Reporter
Comment 10 • 5 years ago
👍