Bug 1309716 - Create a framework for displaying team dashboards
Status: RESOLVED FIXED (opened 8 years ago, closed 8 years ago)
Component: Taskcluster :: Operations and Service Requests (task)
Tracking: not tracked
People: Reporter: dustin; Assignee: dustin
Attachments: 5 files, 1 obsolete file
Description:
The idea is that we should have a common place where everyone in the team can look to see the things that we have collectively determined are important to us (service level indicators and objectives).
Comment 1 • 8 years ago (Assignee)
The idea here is to provide a central place where we display the things that are important to us as a team. So this does not include alerting (that should be some other service), incident management (other tools), or any kind of kanban board.
I see that breaking down into a few categories:
Service Levels
Service Level Objectives and their status
Service Level Indicators and their history
Incidents
Postmortems
Tools
Bug lists, PR lists, etc.
Useful dashboards (implemented here or links elsewhere e.g., signalfx)
On the other hand, maybe incidents should be their own service, and maybe tools should live on a wiki or Mana page. Hrm.
Anyway, I'm sketching this out. One thing I did agree on with Eli was to make this a separate site from tools.
Comment 2 • 8 years ago (Assignee)
I'm scoping myself down to just the service levels.
Reviewing the SRE book, we have:
SLI -- a measure of something important, possibly a distribution over dimensions and/or time
SLO -- a boolean measure of whether one or more SLIs are acceptable
Error Budget -- the degree to which we tolerate SLO violations over time (# of nines)
As a concrete (but non-TC) example:
SLI:
sli-homepage-load-time = client-measured homepage load times measured over 5m intervals
sli-frontend-ajax-errors = count of ajax errors as measured client-side in 5m bins, divided by homepage loads in 5m bin
SLO:
slo-homepage-load-time-100ms = 95th percentile of sli-homepage-load-time < 100ms
slo-frontend-ajax-errors = sli-frontend-ajax-errors < 1%
Error Budget:
eb-frontend = 1.0
- (time slo-homepage-load-time-100ms was false over 14 days / 1.5h) # 99.5% uptime
- (time slo-frontend-ajax-errors was false over 14 days / 1.5h) # 99.5% uptime
All of these are measures of some sort:
- distribution
- value
- boolean
A few things to note:
* SLOs don't add an additional time dimension: every measure of an SLI on which an SLO depends generates a new measure of the SLO
* SLOs, if based on an SLI that is a distribution, specify the percentile of interest
* SLOs are always boolean - you're OK or you're not
* error budgets run from 1.0 (move faster! break more things!) to 0.0 (stop feature work! fix broken things!)
* error budgets are additive: failing to hit a single SLO for enough nines will exhaust your error budget; you don't
get to average performance across all of the SLOs! (we could also use `max` here instead of adding)
* error budgets involve some stored history (14 days, here).
We could implement the error budgets as daily (or hourly) fractions: for each SLO in an error budget, for each day, calculate the fraction of time the SLO was false. To calculate the error budget, average the fractions over the error budget duration (14 days here) for each SLO, and subtract the sum (or max) of the averages from 1.0.
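As a rough sketch of that calculation (the names, the 14-day window, and the 99.5% target here are just illustrative, not a final implementation):

// Sketch only: given, for each SLO, an array of daily "fraction of time the SLO
// was false" values covering the error budget window, compute the remaining budget.
const windowDays = 14;
const budgetFraction = 0.005;  // 99.5% uptime target -> 0.5% of the window may be "false"

function errorBudget(sloDailyFractions) {
  const overages = Object.values(sloDailyFractions).map(fractions => {
    const avg = fractions.reduce((sum, f) => sum + f, 0) / windowDays;
    return avg / budgetFraction;  // 1.0 means this single SLO has used the whole budget
  });
  // additive across SLOs, as described above (max would be the alternative)
  const used = overages.reduce((sum, o) => sum + o, 0);
  return Math.max(0, 1.0 - used);
}

With that shape, a single SLO that is false for 0.5% of the window (the ~1.5h in the example above) drives the budget to zero on its own, matching the "no averaging across SLOs" note.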
Comment 3 • 8 years ago (Assignee)
In terms of implementation, I'd like to use signalfx as the storage backend for all of these. SLIs are easy (and in many cases, already exist in signalfx -- we can build tools to capture new ones).
SLOs -- and any boolean SLIs -- can be implemented as simple 0/1 metrics (a quick sketch follows below). These will bypass statsum and be written directly to signalfx (or, if that's easier to authenticate, through some "passthrough" API in statsum that avoids its aggregation -- booleans don't have percentiles).
EBs can be implemented as simple SignalFX gauges.
I'd like to implement the calculation of SLOs and EBs in tc-stats-collector. For the most part, that means occasionally reading data out of signalfx, transforming it, and writing it back in under a new metric name. Signalfx may make that easier (if it can do the necessary transforms and provide that data via API).
tc-stats-collector will also run an API that allows limited access to data from signalfx (basically an authenticating proxy), as well as metadata about all of the SLIs, SLOs, and EBs that we have defined.
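As a sketch of the 0/1 SLO idea above (this assumes the signalfx-nodejs Ingest client; the metric naming, threshold handling, and env var are placeholders rather than a settled design):

// Sketch: write an SLO as a 0/1 gauge straight to SignalFx, bypassing statsum.
const signalfx = require('signalfx');
const client = new signalfx.Ingest(process.env.SIGNALFX_API_TOKEN);

function reportSlo(name, sliValue, threshold, timestamp) {
  const met = sliValue < threshold ? 1 : 0;  // SLOs are boolean: met or not
  return client.send({
    gauges: [{metric: 'slo.' + name, value: met, timestamp}],
  });
}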
Comment 4 • 8 years ago (Assignee)
I forgot to mention, the idea with an error budget is that when it nears zero, the team changes their focus to quality instead of features. Since we work across several more-or-less independent "areas", I think it will make sense to have several error budgets, one for each area.
The UI will start with error budgets and allow users to drill down to the underlying SLOs and from there to SLIs. It will display some simple graphs of each (using some basic JS library) for public consumption, and link to signalfx graphs for those who want to do more fiddling (signalfx graphs require a signalfx account).
The home page will list all error budgets with their current values. Drilling into one will show a chart with the component SLOs in color-coded rows, overlaid with the total budget, making it easy to see which SLO violations have driven the error budget down. All of the metadata (description, links to SLOs, etc.) will be here, too.
A list of all SLOs will color-code the status of each in some kind of grid layout. Drilling down to an SLO (either from the list or from an EB) will show the value of the SLI being measured, a line indicating the threshold, and a color-coded row showing when the SLO was violated.
Finally, an SLI display will show either a line graph, a graph of the distribution, or a color-coded boolean-by-time display, depending on the SLI. The metadata for the SLI may link to several signalfx measures if it is some kind of compound measure (e.g., the ratio of measure A and measure B).
Comment 5 • 8 years ago (Assignee)
It's looking like SignalFX does not really support real-time creation of derived statistics that are then stored. It also doesn't have an API method that can say 'get me the latest value of this metric' or anything like that. Rather, its whole API is built around SignalFlow.
SignalFlow "programs" basically define (statically) a stream processor that takes a bunch of input metrics and produces a one or more time-series and/or events. I *think* the best way to think of this is as the basis for a chart in the UI. You can only run a SignalFlow program while you have an active HTTP connection to SignalFX, and the resulting time-series is streamed back to you. You can request that it stream historical data, and it can also stream in near-real time.
I've asked for SignalFlow to be implemented on the mozilla account (it's in beta).
So, if we want to go "all in" on SignalFx, then we could probably implement all of the calculations described above in SignalFlow. Then, displaying an EB or an SLO would entail running a (complex!) SignalFlow program and streaming the results to the browser via some kind of WS proxy. The browser UI could redefine the timespan, allowing both historical analysis and live read-outs. I don't see a way to tell the SignalFx UI "show me a chart for this signalflow program", but that would be even better (the programs you can create by pointy-clicking seem pretty limited, especially since we don't use dimensions).
SignalFlow is not really a programming language, so much as an expression-evaluation language with variables. It does appear to support functions, but no loops or conditionals. Or lists, I think. At any rate, its syntax is basically undocumented ("python-like" .. uh, thanks).
The other option is to implement these calculations locally, using simple SignalFlow programs to extract data from SignalFx and either feeding that data back into SignalFx or possibly using something simpler to query, like AWS CloudWatch.
Comment 6 • 8 years ago (Assignee)
On reflection, I think "the other option" is the right choice here. It avoids tying us too closely to signalfx, and skips the inevitable disappointment when we discover that SignalFlow is not powerful enough to calculate EBs.
Comment 7 • 8 years ago (Assignee)
SignalFX is a bad fit here.
The wait-time SLO should be based on wait times for all workerTypes with a particular prefix (aws-provisioner-v1/gecko-1-b*). SignalFX has no way to query metrics by a substring of the name. So I'll need to hard-code those names. Also, SignalFlow stopped working, so at this point I have no way to get data out of signalfx. I put in a support request.
CloudWatch may be an option.
This is getting way more complicated than I wanted it to.
Comment 8 • 8 years ago
*cough* influxdb? *cough*
Comment 9 • 8 years ago (Assignee)
Turns out we have a super-expensive influxdb instance sitting around doing very little of any use, so that's definitely a possibility. This would be a relatively low volume of measurements (a few dozen SLIs, fewer SLOs, fewer EBs).
Comment 10 • 8 years ago
Since this was really no longer used except for something in aws-provisioner, I forecasted that we would be cancelling its use in 2017 Q1, and we've talked about getting rid of it in team meetings. Are we saying we should keep it around now? Or cancel that and forecast spending money on a different type of influxdb instance?
Are things like redshift or bigquery options for storing this data or is that overkill?
Comment 11 • 8 years ago (Assignee)
I think that's overkill. A simple install of influx, managed in-house, would be sufficient here.
I want to give signalfx another chance, though.
Comment 12 • 8 years ago (Assignee)
From a chat with SignalFX, it sounds like SignalFlow will support all of this sort of analysis. Specifically:
- function abstraction in signalflow programs
- display charts in the UI based on a signalflow program
- calculate boolean values (SLOs)
Of course, it's easy to make promises about vaporware, but there you are.
There are some caveats:
- long-running queries may not be allowed
- not sure we can be in the beta
- GA date for SignalFlow is still unknown
For the moment, I'm going to scale this back to get a basic proof-of-concept working using tc-stats-collector and the newly-discovered "timeserieswindow" API.
Comment 13 • 8 years ago (Assignee)
https://github.com/signalfx/signalfx-nodejs/issues/19 - no REST API in the nodejs client, duh.
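In the meantime, calling the REST endpoint directly over HTTPS works; here's a minimal sketch (the /v1/timeserieswindow path and the X-SF-Token header are my assumptions about the REST API, and the query parameters mirror the logs in the following comments):

const https = require('https');

// Sketch: query the timeserieswindow endpoint directly, since the nodejs client
// only covers ingest/SignalFlow. The query parameters (query, startMs, endMs,
// resolution) match what the collector logs show.
function timeseriesWindow({query, startMs, endMs, resolution}) {
  const qs = `query=${encodeURIComponent(query)}&startMs=${startMs}&endMs=${endMs}&resolution=${resolution}`;
  return new Promise((resolve, reject) => {
    https.get({
      host: 'api.signalfx.com',
      path: `/v1/timeserieswindow?${qs}`,
      headers: {'X-SF-Token': process.env.SIGNALFX_API_TOKEN},
    }, res => {
      let body = '';
      res.on('data', chunk => body += chunk);
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

// e.g. timeseriesWindow({query: 'sf_metric:sli.gecko.pending.other',
//   startMs: Date.now() - 3600000, endMs: Date.now(), resolution: 300000})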
Comment 14 • 8 years ago (Assignee)
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> There are some caveats:
> - long-running queries may not be allowed
They are
> - not sure we can be in the beta
We can't ..
> - GA date for SignalFlow is still unknown
.. which suggests, "not soon".
Comment 15 • 8 years ago (Assignee)
I made a thing!
This is just running locally. The shaded areas on the first three graphs are the SLIs (the maximum of all the other lines on each graph).
The bottom is the SLO, which is set to 1 when the threshold is exceeded. I have the threshold set really low right now for testing :)
Comment 16 • 8 years ago (Assignee)
Well, SignalFX is throwing errors in their UI for me, so maybe something is sick.
When I look at datapoints I am generating, they are delayed by over 5 minutes via the API.
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
collector.test received from sli: ts=1480461000000: 0 (historical) +424ms
wakeup in 0.5s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +502ms
query 1480460995000 - 1480461391386
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 0.75s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1s
query 1480460995000 - 1480461392532
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 1.125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1s
query 1480460995000 - 1480461394016
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 1.6875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +2s
query 1480460995000 - 1480461396104
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 2.53125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +3s
query 1480460995000 - 1480461399026
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 3.796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +4s
query 1480460995000 - 1480461403222
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 5.6953125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +6s
query 1480460995000 - 1480461409316
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 8.54296875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +9s
query 1480460995000 - 1480461418169
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 12.814453125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +13s
query 1480460995000 - 1480461431518
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 19.2216796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +20s
query 1480460995000 - 1480461451150
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 28.83251953125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +29s
query 1480460995000 - 1480461480625
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 43.248779296875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +44s
query 1480460995000 - 1480461524282
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 64.8731689453125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1m
query 1480460995000 - 1480461589587
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 97.30975341796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +2m
query 1480460995000 - 1480461687292
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ], [ 1480461300000, 0 ] ] }
collector.test received from sli: ts=1480461300000: 0 (live, 387.679s delay) +387ms
Each "Running" is a call to the timeserieswindow API endpoint, with the given ("query") startMs and endMs. The result from the API is printed directly. So the datapoint at 1480461300000, for which I saw the ingest call logged at a delay of about 13 seconds (1480461313000), did not appear until somewhere between 290 and 387 seconds delay (1480461589587 - 1480461687292). That range, curiously, includes the time when the *next* datapoint was ingested. The ingestion doesn't add any delay that I can see. Equally curiously, the 300-second delay is equivalent to the resolution with which I'm querying this data source, but experimentally verifying that does not change the behavior.
So, maybe there's some 5-minute rollup occurring on the signalfx end here?
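(Aside: the "wakeup in Xs" lines above come from a poll loop that backs off by 1.5x until a new datapoint appears. A simplified sketch of that behaviour -- not the actual tc-stats-collector code, and fetchLatest is a placeholder:)

// Simplified sketch of the polling behaviour visible in the log above; fetchLatest()
// stands in for one timeserieswindow query and is not a real function name.
async function poll(fetchLatest, onDatapoint) {
  let delay = 500;  // ms; "wakeup in 0.5s"
  let lastTs = null;
  while (true) {
    const [ts, value] = await fetchLatest();
    if (ts !== lastTs) {
      onDatapoint(ts, value);
      lastTs = ts;
      delay = 500;        // reset once new data arrives
    } else {
      delay *= 1.5;       // 0.5s, 0.75s, 1.125s, 1.6875s, ...
    }
    await new Promise(resolve => setTimeout(resolve, delay));
  }
}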
Comment 17 • 8 years ago (Assignee)
The UI is fixed, but the 300s delay remains. I'm consulting with signalfx about it.
I don't think this is a huge problem. Sure, our service levels are five minutes out of date, but they are meant for mid-term trend monitoring, not for up-to-the-minute status (that's what the underlying metrics are for).
I can see a few fixes that we may want to apply later:
1. Connect SLI output and SLO input in-process (avoiding writing the datapoints to SFX only to read them back out again)
2. Use InfluxDB
3. Switch to an hourly batch query model
There are a few issues with all of these:
1. Connecting components means we can't later separate them into different dynos for load-spreading purposes
2. Managing an InfluxDB install doesn't sound like fun
3. There are lots of boundary conditions on batch processing metrics, since you need some history from the previous datapoint
Comment 18 • 8 years ago (Assignee)
Message to SignalFX:
---
I'm working on a project I discussed in Mid-November with Chris, Harnit, and Patrick.
It involves a component that submits datapoints via the "Ingest" client, and another component that reads those datapoints. The intent is to generate some complex summaries of multiple timeseries that are beyond the abilities of the signalfx graph UI.
I'm seeing something odd, though: the data I send via the ingest endpoint does not appear via the timeserieswindow API for 300 seconds (the metric has a 5-minute resolution).
From the first process (calling ingest):
collector.sli.gecko.pending.other write datapoint: now=1480512915750 ts=1480512900000: 0 (live, 15.75s delay) +501ms
collector.sli.gecko.pending.other write datapoint: now=1480513215475 ts=1480513200000: 0 (live, 15.475s delay) +227ms
The `now` field is the time when the ingest.send(..) call was made.
From the second process, making queries via timeserieswindow approximately every five seconds (I poll less frequently in production, but this provides better information about the issue):
query: sf_metric:sli.gecko.pending.other startMs: 1480512295000 -> endMs: 1480512913932; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512300000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
query: sf_metric:sli.gecko.pending.other startMs: 1480512295000 -> endMs: 1480512919247; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512300000,0],[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480512919605 ts=1480512600000: 0 (live, 319.605s delay) +358ms
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480512924611; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
...
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480513212667; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480513217968; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0],[1480512900000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480513218422 ts=1480512900000: 0 (live, 318.422s delay) +454ms
query: sf_metric:sli.gecko.pending.other startMs: 1480512895000 -> endMs: 1480513223428; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512900000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
...
query: sf_metric:sli.gecko.pending.other startMs: 1480512895000 -> endMs: 1480513517260; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512900000,0],[1480513200000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480513517586 ts=1480513200000: 0 (live, 317.586s delay) +327ms
query: sf_metric:sli.gecko.pending.other startMs: 1480513195000 -> endMs: 1480513522591; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480513200000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
So, the first query took place at 1480512913932, just before the 1480512900000 datapoint was inserted. The query results contain neither the 1480512900000 timestamp nor the one before it, 1480512600000.
The second query took place at 1480512919247, just after the 1480512900000 datapoint was inserted. The query results contain not that datapoint, but finally the 1480512600000 datapoint appears, 319.605 seconds late or about 304s after it was ingested.
The fifth query shows the 1480512900000 datapoint finally arriving, this time 318.422 seconds late or about 303s after it was ingested.
Can you explain why this is occurring? I have a few guesses, but no good way to confirm them.
Comment 19 • 8 years ago (Assignee)
This is intentional -- it's occurring because the data is quantized on 5m intervals, and the quantizer is waiting for further data in that interval.
Comment 20 • 8 years ago (Assignee)
That means that SLOs will be delayed 10 minutes, and error budgets 15 minutes.. that's probably OK for now.
Comment 21 • 8 years ago (Assignee)
Comment 22 • 8 years ago (Assignee)
Well, that's not working very well. Among the issues I see:
* The last three hours are all flat lines, despite the trees being open.
* The build pending SLI spiked above the build pending times from about 11-12
* The SLO is incorrect (why did it go from 1.0 to 0.0?)
Attachment #8813891 - Attachment is obsolete: true
Comment 23 • 8 years ago (Assignee)
Indeed, changing the extrapolation method for the SLO and EB gives this result -- which is to say, it's only calculated the EB twice, and the SLO only for a few hours.
Comment 24 • 8 years ago (Assignee)
> * The build pending SLI spiked above the build pending times from about 11-12
This was because the signalfx graph did not include the windows workerTypes. So the SLI calculations appear to be correct.
Comment 25 • 8 years ago (Assignee)
Ergh, Heroku metrics show it's busy-looping and not even logging anything since 18:15UTC (13:15 on the screenshots). CPU spiked to 100% and memory usage flatlined.
Comment 26 • 8 years ago (Assignee)
2016-12-12T18:15:00.006940+00:00 app[run.1]: Mon, 12 Dec 2016 18:15:00 GMT clock Running 'calculate error budget for gecko.pending'
2016-12-12T18:15:00.008307+00:00 app[run.1]: Mon, 12 Dec 2016 18:15:00 GMT collector.eb.gecko.pending next calculation at Mon Dec 12 2016 19:15:00 GMT+0000 (UTC)
but no logging of the completion. Derp:
while (history[0] && history[0][0] < earliest) {
  history.unshift();
}
I wanted history.shift().
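(In other words, the loop is meant to drop old datapoints from the front of the array; unshift() inserts at the front rather than removing, and with no arguments it changes nothing, so the condition stays true forever and the process busy-loops. The intended version:)

// Intended behaviour: discard history entries older than `earliest`.
while (history[0] && history[0][0] < earliest) {
  history.shift();  // remove the oldest datapoint from the front
}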
Comment 27 • 8 years ago (Assignee)
I pushed a fix for the shift/unshift error:
https://github.com/taskcluster/taskcluster-stats-collector/commit/4de5b1f7ef6f6991d97c2c5acee40598c3a126ae
and also switched things around because SLO=1 means the SLO was met.
https://github.com/taskcluster/taskcluster-stats-collector/commit/9431e99a47cffe03f1305383f04182ada8f21781
I'm working on writing tests for eb.js to find things like that shift/unshift error, but mocha is being annoying.
Comment 28 • 8 years ago (Assignee)
EB tests are up for review: https://github.com/taskcluster/taskcluster-stats-collector/pull/19
Comment 29 • 8 years ago (Assignee)
So, it looks like the SLI / SLO calculations are only lasting about 4 hours. I'm a little concerned with the sawtooth pattern in the pending. For example, the big blue spike is tc-stats-collector.tasks.aws-provisioner-v1.gecko-t-linux-xlarge.pending.5m.p95 shooting up over 60 minutes. That angle is one minute per minute, meaning that there's a single task that was pending from 20:15 to 21:25. That might be real, or it might be an artifact of a bug in tc-stats-collector.
Comment 30 • 8 years ago (Assignee)
It looks like all of the metric stream machinery fails at once: the incoming datapoints stop, and the metric stream multiplexer fails at the same time.
Comment 31 • 8 years ago (Assignee)
I added some debugging in
https://github.com/taskcluster/taskcluster-stats-collector/commit/a6f4309f28322e38409fd72241a07c9ad0311eb5
so hopefully that will help figure out why things time out.
I wonder if this is rate-limiting.
Comment 32 • 8 years ago (Assignee)
Nope, a bug in how I use streams: https://github.com/taskcluster/taskcluster-stats-collector/pull/20
Comment 33 • 8 years ago (Assignee)
SLIs are now working better. However, SLOs are being shown in the SignalFX UI as an hourly metric, despite being submitted every 5 minutes (admittedly, with a 700-second delay, but that's not my fault..)
Comment 34 • 8 years ago (Assignee)
Ah, in fact the SLO data is being recorded at full resolution (the timeseries' "native resolution"), as can be seen here:
https://app.signalfx.com/#/chart/v1/new?template=default&filters=sf_metric:slo.gecko.pending&startTime=-12h&endTime=Now&density=4
but it's co-plotted with the error budget which is an hourly metric, so it is getting rounded to the hour for display. So all is well.
I have adjusted the SLO threshold down to 70% within 1 day, just so I can see the EB go above zero. Once I see that, I'll revert that to our normal level and close this bug.
Comment 35 • 8 years ago (Assignee)
error budget success!
Updated • 8 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: Operations → Operations and Service Requests