Closed Bug 1605391 Opened 5 years ago Closed 5 years ago

Please create a dashboard for monitoring durable sync operational status metrics

Categories

(Cloud Services :: Operations: Miscellaneous, task, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rachel, Assigned: eolson)

Details

Hi,

We've recently drafted this doc around operational status related to durable sync.

Can we get a dashboard tracking the metrics referred to in that doc so we can review them as needed? We'll probably want to monitor (and revise) these metrics for a while before wiring this up to pagers, etc., but this would be step one.

The metrics that we're watching here are:

  1. Heartbeat endpoint
  2. Status codes
  3. Request times

See doc for details around them, and let me know if you have any questions here.

Priority: -- → P2

:eolson:, I know that you were at one point looking into this; is that still happening?

Flags: needinfo?(eolson)

I am currently working on this, using influx/kapacitor to generate the request rate percentages. Heartbeat will be monitored with pingdom, and I still believe the request times can be aggregated by HTTP method with a log-based metric in stackdriver. I'm not sure yet how to tie them all together.

Flags: needinfo?(eolson)

Sounds good, thanks Erik. One other thing that occurs to me: if it's easier to skip the request times and track another metric that doesn't require jumping through a lot of hoops, we could go that route. If you think that would help here, let me know, and I can dig into what we actually CAN get from GCP that might make sense for tracking operational status.

That makes sense. I want to chat with Brian when he gets back from leave/vacation tomorrow to see how to get the custom metrics from stackdriver into influx so we can look at it all on one dashboard.

Operational Status Dashboard is here: https://earthangel-b40313e5.influxcloud.net/d/JyobZHPZk/sync-operational-status?orgId=1&var-environment=stage-sync
I’m still working on the 2xx % calculation, and also a way to import uptime checks for the heartbeat endpoint.
The data is all out there, just figuring out how to get it in one place.
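
For reference, the 2xx % calculation mentioned above could look roughly like the following. This is only a minimal sketch in plain Python, not the kapacitor task itself: the function name `success_percentage` and the input shape (a list of integer status codes seen in one time window) are assumptions made for illustration.

```python
from collections import Counter


def success_percentage(status_codes):
    """Return the percentage of responses whose status code is 2xx.

    `status_codes` is a hypothetical input: an iterable of integer HTTP
    status codes observed in one time window.
    """
    # Group codes by their class (2 for 2xx, 4 for 4xx, and so on).
    counts = Counter(code // 100 for code in status_codes)
    total = sum(counts.values())
    if total == 0:
        # No traffic in the window, so there is no meaningful percentage.
        return None
    return 100.0 * counts[2] / total
```

Returning `None` for an empty window keeps a quiet period from being reported as 0% success, which would otherwise look like an outage on the dashboard.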

Heartbeat endpoints are now being reported as well. I'm still working on the request rate percentage calculation.

Awesome, thank you Erik

Is there a(n easy) way to get averages visible on these charts, and the ones for request durations? Poking around the UI now and not seeing anything obvious, but let me know if I'm missing anything. If not, is it a pain to add it?

Flags: needinfo?(eolson)

Looks like we have the 99th and 50th percentiles available; they are listed as different "measurements". There aren't a ton of options for these because of how the data is created: a field is extracted from the nginx log, converted to a log-based metric of type "distribution", then exported/imported to influxdb as the 3 different measurements.

I added the two other measurements, if you refresh the page you should see them now.
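
To illustrate the idea behind those measurements, here is a rough local sketch of extracting a request time from log lines, grouping by HTTP method, and taking the 50th and 99th percentiles. It is not how stackdriver computes a distribution metric (which works from histogram buckets), and the log format is an assumption: the regex expects a quoted request line followed somewhere by a hypothetical `request_time=<seconds>` field. All names here are made up for the example.

```python
import math
import re

# Hypothetical log format: a quoted request line such as "GET /path HTTP/1.1"
# followed later on the same line by request_time=<seconds>.
REQUEST_TIME_RE = re.compile(
    r'"(?P<method>[A-Z]+) [^"]*".*request_time=(?P<seconds>\d+(?:\.\d+)?)'
)


def request_times_by_method(lines):
    """Group request durations (in seconds) by HTTP method."""
    times = {}
    for line in lines:
        match = REQUEST_TIME_RE.search(line)
        if match:
            method = match.group("method")
            times.setdefault(method, []).append(float(match.group("seconds")))
    return times


def percentile(values, pct):
    """Return the nearest-rank percentile of a non-empty list of values."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Nearest-rank method: take the value at rank ceil(pct/100 * N), 1-indexed.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

With durations grouped this way, `percentile(times["GET"], 50)` and `percentile(times["GET"], 99)` would correspond loosely to the 50th and 99th percentile measurements on the dashboard.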

Flags: needinfo?(eolson)

👍

Assignee: nobody → eolson
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED