Closed Bug 1309716 Opened 8 years ago Closed 8 years ago

Create a framework for displaying team dashboards

Categories

(Taskcluster :: Operations and Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

Attachments

(5 files, 1 obsolete file)

The idea is that we should have a common place where everyone in the team can look to see the things that we have collectively determined are important to us (service level indicators and objectives).
The idea here is to provide a central place where we display the things that are important to us as a team. So this does not include alerting (that should be some other service), incident management (other tools), or any kind of kanban board.

I see that breaking down into a few categories:

* Service Levels
  - Service Level Objectives and their status
  - Service Level Indicators and their history
* Incidents
  - Postmortems
* Tools
  - Bug lists, PR lists, etc.
  - Useful dashboards (implemented here or links elsewhere e.g., signalfx)

On the other hand, maybe incidents should be its own service, and maybe tools should be on a wiki or mana page. Hrm. Anyway, I'm sketching this out. One thing I did agree on with Eli was to make this a separate site from tools.
I'm scoping myself down to just the service levels. Reviewing the SRE book, we have:

SLI -- a measure of something important, possibly a distribution over dimensions and/or time
SLO -- a boolean measure of whether one or more SLIs are acceptable
Error Budget -- the degree to which we tolerate SLO violations over time (# of nines)

As a concrete (but non-TC) example:

SLI:
  sli-homepage-load-time = client-measured homepage load times measured over 5m intervals
  sli-frontend-ajax-errors = count of ajax errors as measured client-side in 5m bins, divided by homepage loads in 5m bin
SLO:
  slo-homepage-load-time-100ms = 95th percentile of sli-homepage-load-time < 100ms
  slo-frontend-ajax-errors = sli-frontend-ajax-errors < 1%
Error Budget:
  eb-frontend = 1.0
    - (time slo-homepage-load-time-100ms was false over 14 days / 1.5h)  # 99.5% uptime
    - (time slo-frontend-ajax-errors was false over 14 days / 1.5h)      # 99.5% uptime

All of these are measures of some sort:
 - distribution
 - value
 - boolean

A few things to note:
 * SLOs don't add an additional time dimension: every measure of an SLI on which an SLO depends generates a new measure of the SLO
 * SLOs, if based on an SLI that is a distribution, specify the percentile of interest
 * SLOs are always boolean - you're OK or you're not
 * error budgets run from 1.0 (move faster! break more things!) to 0.0 (stop feature work! fix broken things!)
 * error budgets are additive: failing to hit a single SLO for enough nines will exhaust your error budget; you don't get to average performance across all of the SLOs! (we could also use `max` here instead of adding)
 * error budgets involve some stored history (14 days, here)

We could implement the error budgets as daily (or hourly) fractions: for each SLO in an error budget, for each day, calculate the fraction of time the SLO was false. To calculate the error budget, average the fractions over the error budget duration (14 days here) for each SLO, and subtract the sum (or max) of the averages from 1.0.
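For illustration only, a minimal sketch of that daily-fraction calculation (not the eventual implementation; the {day, fractionFalse} history shape is an assumption, and the "/ 1.5h" normalization from the example above is left out for brevity):

// sketch: error budget from per-SLO daily violation fractions
const WINDOW_DAYS = 14;

// average one SLO's daily violation fractions over the error-budget window
const averageFraction = history => {
  const recent = history.slice(-WINDOW_DAYS);
  return recent.reduce((sum, d) => sum + d.fractionFalse, 0) / recent.length;
};

// error budget: 1.0 minus the sum (or max) of the per-SLO averages
const errorBudget = (sloHistories, combine = 'sum') => {
  const averages = sloHistories.map(averageFraction);
  const used = combine === 'max'
    ? Math.max(...averages)
    : averages.reduce((a, b) => a + b, 0);
  return Math.max(0, 1.0 - used);
};

// e.g. errorBudget([loadTimeSloHistory, ajaxErrorSloHistory], 'sum')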
In terms of implementation, I'd like to use signalfx as the storage backend for all of these. SLIs are easy (and in many cases, already exist in signalfx -- we can build tools to capture new ones). SLOs -- and any boolean SLIs -- can be implemented as simple 0/1 metrics. These will bypass statsum and be written directly to signalfx (or if it's easier to authenticate, through some "passthrough" API in statsum to avoid its aggregation.. booleans don't have percentiles..) EBs can be implemented as simple SignalFX gauges. I'd like to implement the calculation of SLOs and EB's in tc-stats-collector. For the most part, that means occasionally reading data out of signalfx, transforming it, and writing it back in under a new metric name. Signalfx may make that easier (if it can do the necessary transforms and provide that data via API). tc-stats-collector will also run an API that allows limited access to data from signalfx (basically an authenticating proxy), as well as metadata about all of the SLIs, SLOs, and EBs that we have defined.
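As a sketch of what "simple 0/1 metrics" written directly to signalfx might look like, assuming the send({gauges: [...]}) usage from the signalfx-nodejs README (the metric naming and token handling here are assumptions, not the real collector code):

const signalfx = require('signalfx');
const client = new signalfx.Ingest(process.env.SIGNALFX_API_TOKEN);

// submit an SLO (or boolean SLI) as a plain 0/1 gauge, bypassing statsum
const reportSLO = (name, met) =>
  client.send({
    gauges: [{
      metric: `slo.${name}`,     // naming is an assumption
      value: met ? 1 : 0,        // booleans don't have percentiles
      timestamp: Date.now(),
    }],
  });

// e.g. reportSLO('gecko.pending', pendingP95 < thresholdMs);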
I forgot to mention, the idea with an error budget is that when it nears zero, the team changes their focus to quality instead of features. Since we work across several more-or-less independent "areas", I think it will make sense to have several error budgets, one for each area.

The UI will start with error budgets and allow users to drill down to the underlying SLOs and from there to SLIs. It will display some simple graphs of each (using some basic JS library) for public consumption, and link to signalfx graphs for those who want to do more fiddling (signalfx graphs require a signalfx account).

The home page will list all error budgets with their current values. Drilling into one will show a chart with the component SLOs in color-coded rows, overlaid with the total budget, making it easy to see which SLO violations have driven the error budget down. All of the metadata (description, links to SLOs, etc.) will be here, too.

A list of all SLOs will color-code the status of each in some kind of grid layout. Drilling down to an SLO (either from the list or from an EB) will show the value of the SLI being measured, a line indicating the threshold, and a color-coded row showing when the SLO was violated.

Finally, an SLI display will show either a line graph, a graph of the distribution, or a color-coded boolean-by-time display, depending on the SLI. The metadata for the SLI may link to several signalfx measures if it is some kind of compound measure (e.g., the ratio of measure A and measure B).
It's looking like SignalFX does not really support real-time creation of derived statistics that are then stored. It also doesn't have an API method that can say 'get me the latest value of this metric' or anything like that. Rather, its whole API is built around SignalFlow.

SignalFlow "programs" basically define (statically) a stream processor that takes a bunch of input metrics and produces one or more time-series and/or events. I *think* the best way to think of this is as the basis for a chart in the UI. You can only run a SignalFlow program while you have an active HTTP connection to SignalFX, and the resulting time-series is streamed back to you. You can request that it stream historical data, and it can also stream in near-real time. I've asked for SignalFlow to be enabled on the mozilla account (it's in beta).

So, if we want to go "all in" on SignalFx, then we could probably implement all of the calculations described above in SignalFlow. Then, displaying an EB or an SLO would entail running a (complex!) SignalFlow program and streaming the results to the browser via some kind of WS proxy. The browser UI could redefine the timespan, allowing both historical analysis and live read-outs. I don't see a way to tell the SignalFx UI "show me a chart for this signalflow program", but that would be even better (the programs you can create by pointy-clicking seem pretty limited, especially since we don't use dimensions).

SignalFlow is not really a programming language, so much as an expression-evaluation language with variables. It does appear to support functions, but no loops or conditionals. Or lists, I think. At any rate, its syntax is basically undocumented ("python-like" .. uh, thanks).

The other option is to implement these calculations locally, using simple SignalFlow programs to extract data from SignalFx and either feeding that data back into SignalFx or possibly using something simpler to query, like AWS CloudWatch.
On reflection, I think "the other option" is the right choice here. It avoids tying us too closely to signalfx, and skips the inevitable disappointment when we discover that SignalFlow is not powerful enough to calculate EB's.
SignalFX is a bad fit here. The wait-time SLO should be based on wait times for all workerTypes with a particular prefix (aws-provisioner-v1/gecko-1-b*). SignalFX has no way to query metrics by a substring of the name. So I'll need to hard-code those names. Also, SignalFlow stopped working, so at this point I have no way to get data out of signalfx. I put in a support request. CloudWatch may be an option. This is getting way more complicated than I wanted it to.
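To illustrate the hard-coding this forces (the workerType names below are placeholders, not the real list; the metric naming follows the tc-stats-collector pattern that shows up later in this bug):

// Illustrative only: sf_metric can't be matched by prefix, so the workerTypes
// behind gecko-1-b* have to be listed by hand.
const BUILD_WORKER_TYPES = [
  'aws-provisioner-v1/gecko-1-b-linux',    // hypothetical name
  'aws-provisioner-v1/gecko-1-b-win2012',  // hypothetical name
];

// tc-stats-collector.tasks.<provisioner>.<workerType>.pending.5m.p95
const pendingMetrics = BUILD_WORKER_TYPES.map(wt =>
  `tc-stats-collector.tasks.${wt.replace('/', '.')}.pending.5m.p95`);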
*cough* influxdb? *cough*
Turns out we have a super-expensive influxdb instance sitting around doing very little of any use, so that's definitely a possibility. This would be a relatively low volume of measurements (a few dozen SLIs, fewer SLOs, fewer EBs).
Since this was really no longer used except for something in aws-provisioner, I forecasted that we would be cancelling its use in 2017 Q1, and we've talked about getting rid of it in team meetings. Are we saying we should keep it around now? Or cancel that and forecast spending money on a different type of influxdb instance? Are things like redshift or bigquery options for storing this data, or is that overkill?
I think that's overkill. A simple install of influx, managed in-house, would be sufficient here. I want to give signalfx another chance, though.
From a chat with SignalFX, it sounds like SignalFlow will support all of this sort of analysis. Specifically:
 - function abstraction in signalflow programs
 - display charts in the UI based on a signalflow program
 - calculate boolean values (SLOs)

Of course, it's easy to make promises about vaporware, but there you are. There are some caveats:
 - long-running queries may not be allowed
 - not sure we can be in the beta
 - GA date for SignalFlow is still unknown

For the moment, I'm going to scale this back to get a basic proof-of-concept working using tc-stats-collector and the newly-discovered "timeserieswindow" API.
https://github.com/signalfx/signalfx-nodejs/issues/19 - no REST API in the nodejs client, duh.
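Since the nodejs client only covers ingest, the collector has to call the REST endpoint by hand. A minimal sketch of that call, assuming the host, path, auth header, and response shape implied by the query/response logging later in this bug (none of this is the actual tc-stats-collector code):

const https = require('https');
const querystring = require('querystring');

// fetch raw datapoints for one metric from the timeserieswindow endpoint
const timeseriesWindow = (metric, startMs, endMs, resolution) => {
  const qs = querystring.stringify({
    query: `sf_metric:${metric}`,
    startMs,
    endMs,
    resolution,
  });
  return new Promise((resolve, reject) => {
    https.get({
      host: 'api.signalfx.com',                                  // assumed host
      path: `/v1/timeserieswindow?${qs}`,                        // assumed path
      headers: {'X-SF-Token': process.env.SIGNALFX_API_TOKEN},   // assumed header
    }, res => {
      let body = '';
      res.on('data', chunk => { body += chunk; });
      // expected shape: {"data": {"<tsid>": [[timestampMs, value], ...]}, "errors": []}
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
};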
(In reply to Dustin J. Mitchell [:dustin] from comment #12)
> There are some caveats:
> - long-running queries may not be allowed

They are

> - not sure we can be in the beta

We can't ..

> - GA date for SignalFlow is still unknown

.. which suggests, "not soon".
Attached image Screenshot from 2016-11-23 18-05-10.png (obsolete) —
I made a thing! This is just running locally. The shaded areas on the first three graphs are the SLIs (the maximum of all the other lines on each graph). The bottom is the SLO, which is set to 1 when the threshold is exceeded. I have the threshold set really low right now for testing :)
Well, SignalFX is throwing errors in their UI for me, so maybe something is sick. When I look at datapoints I am generating, they are delayed by over 5 minutes via the API.

{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
collector.test received from sli: ts=1480461000000: 0 (historical) +424ms
wakeup in 0.5s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +502ms
query 1480460995000 - 1480461391386
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 0.75s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1s
query 1480460995000 - 1480461392532
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 1.125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1s
query 1480460995000 - 1480461394016
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 1.6875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +2s
query 1480460995000 - 1480461396104
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 2.53125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +3s
query 1480460995000 - 1480461399026
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 3.796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +4s
query 1480460995000 - 1480461403222
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 5.6953125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +6s
query 1480460995000 - 1480461409316
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 8.54296875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +9s
query 1480460995000 - 1480461418169
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 12.814453125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +13s
query 1480460995000 - 1480461431518
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 19.2216796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +20s
query 1480460995000 - 1480461451150
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 28.83251953125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +29s
query 1480460995000 - 1480461480625
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 43.248779296875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +44s
query 1480460995000 - 1480461524282
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 64.8731689453125s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +1m
query 1480460995000 - 1480461589587
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ] ] }
wakeup in 97.30975341796875s
clock Running 'fetch sf_metric:sli.gecko.pending.other' +2m
query 1480460995000 - 1480461687292
{ 'Cx-xIy3AgAM': [ [ 1480461000000, 0 ], [ 1480461300000, 0 ] ] }
collector.test received from sli: ts=1480461300000: 0 (live, 387.679s delay) +387ms

Each "Running" is a call to the timeserieswindow API endpoint, with the given ("query") startMs and endMs. The result from the API is printed directly. So the datapoint at 1480461300000, for which I saw the ingest call logged at a delay of about 13 seconds (1480461313000), did not appear until somewhere between 290 and 387 seconds delay (1480461589587 - 1480461687292). That range, curiously, includes the time when the *next* datapoint was ingested. The ingestion doesn't add any delay that I can see. Equally curiously, the 300-second delay is equivalent to the resolution with which I'm querying this data source, but experimentally verifying that does not change the behavior. So, maybe there's some 5-minute rollup occurring on the signalfx end here?
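For reference, the "wakeup in ..." lines above back off by a factor of 1.5 each time no new datapoint shows up. A minimal sketch of that polling behavior, using the hypothetical timeseriesWindow() helper sketched earlier (again an illustration, not the real collector code):

// poll until a datapoint at or after expectedTs appears, backing off by 1.5x
const waitForDatapoint = async (metric, expectedTs, resolution = 300000) => {
  let delayMs = 500;  // matches the initial "wakeup in 0.5s"
  for (;;) {
    const result = await timeseriesWindow(
      metric, expectedTs - 5000, Date.now(), resolution);
    const tsids = Object.keys(result.data);
    const points = tsids.length ? result.data[tsids[0]] : [];
    const hit = points.find(([ts]) => ts >= expectedTs);
    if (hit) {
      return hit;  // [timestampMs, value]
    }
    await new Promise(resolve => setTimeout(resolve, delayMs));
    delayMs *= 1.5;  // 0.5s, 0.75s, 1.125s, ...
  }
};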
The UI is fixed, but the 300s delay remains. I'm consulting with signalfx about it.

I don't think this is a huge problem. Sure, our service levels are five minutes out of date, but they are meant for mid-term trend monitoring, not for up-to-the-minute status (that's what the underlying metrics are for). I can see a few fixes that we may want to apply later:

1. Connect SLI output and SLO input in-process (avoiding writing the datapoints to SFX only to read them back out again)
2. Use InfluxDB
3. Switch to an hourly batch query model

There are a few issues with all of these:

1. Connecting components means we can't later separate them into different dynos for load-spreading purposes
2. Managing an InfluxDB install doesn't sound like fun
3. There are lots of boundary conditions on batch processing metrics, since you need some history from the previous datapoint
Message to SignalFX:

---

I'm working on a project I discussed in Mid-November with Chris, Harnit, and Patrick. It involves a component that submits datapoints via the "Ingest" client, and another component that reads those datapoints. The intent is to generate some complex summaries of multiple timeseries that are beyond the abilities of the signalfx graph UI.

I'm seeing something odd, though: the data I send via the ingest endpoint does not appear via the timeserieswindow API for 300 seconds (the metric has a 5-minute resolution).

From the first process (calling ingest):

collector.sli.gecko.pending.other write datapoint: now=1480512915750 ts=1480512900000: 0 (live, 15.75s delay) +501ms
collector.sli.gecko.pending.other write datapoint: now=1480513215475 ts=1480513200000: 0 (live, 15.475s delay) +227ms

The `now` field is the time when the ingest.send(..) call was made.

From the second process, making queries via timeserieswindow approximately every five seconds (I poll less frequently in production, but this provides better information about the issue)

query: sf_metric:sli.gecko.pending.other startMs: 1480512295000 -> endMs: 1480512913932; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512300000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
query: sf_metric:sli.gecko.pending.other startMs: 1480512295000 -> endMs: 1480512919247; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512300000,0],[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480512919605 ts=1480512600000: 0 (live, 319.605s delay) +358ms
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480512924611; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
...
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480513212667; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
query: sf_metric:sli.gecko.pending.other startMs: 1480512595000 -> endMs: 1480513217968; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512600000,0],[1480512900000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480513218422 ts=1480512900000: 0 (live, 318.422s delay) +454ms
query: sf_metric:sli.gecko.pending.other startMs: 1480512895000 -> endMs: 1480513223428; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512900000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
...
query: sf_metric:sli.gecko.pending.other startMs: 1480512895000 -> endMs: 1480513517260; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480512900000,0],[1480513200000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other
collector.test received from sli: now=1480513517586 ts=1480513200000: 0 (live, 317.586s delay) +327ms
query: sf_metric:sli.gecko.pending.other startMs: 1480513195000 -> endMs: 1480513522591; resolution: 300000
got {"data":{"Cx-xIy3AgAM":[[1480513200000,0]]},"errors":[]} for sf_metric:sli.gecko.pending.other

So, the first query took place at 1480512913932, just before the 1480512900000 datapoint was inserted. The query results contain neither the 1480512900000 timestamp nor the one before it, 1480512600000. The second query took place at 1480512919247, just after the 1480512900000 datapoint was inserted. The query results contain not that datapoint, but finally the 1480512600000 datapoint appears, 319.605 seconds late or about 304s after it was ingested.
The fifth query shows the 1480512900000 datapoint finally arriving, this time 318.422 seconds late or about 303s after it was ingested. Can you explain why this is occurring? I have a few guesses, but no good way to confirm them.
This is intentional -- it's occurring because the data is quantized on 5m intervals, and the quantizer is waiting for further data in that interval.
That means that SLO's will be delayed 10 minutes, and error budgets 15 minutes.. that's probably OK for now.
Well, that's not working very well. Among the issues I see:
 * The last three hours are all flat lines, despite the trees being open.
 * The build pending SLI spiked above the build pending times from about 11-12
 * The SLO is incorrect (why did it go from 1.0 to 0.0?)
Attachment #8813891 - Attachment is obsolete: true
Indeed, changing the extrapolation method for the SLO and EB gives this result -- which is to say, it's only calculated the EB twice, and the SLO only for a few hours.
* The build pending SLI spiked above the build pending times from about 11-12

This was because the signalfx graph did not include the windows workerTypes. So the SLI calculations appear to be correct.
Ergh, Heroku metrics show it's busy-looping and not even logging anything since 18:15UTC (13:15 on the screenshots). CPU spiked to 100% and memory usage flatlined.
2016-12-12T18:15:00.006940+00:00 app[run.1]: Mon, 12 Dec 2016 18:15:00 GMT clock Running 'calculate error budget for gecko.pending'
2016-12-12T18:15:00.008307+00:00 app[run.1]: Mon, 12 Dec 2016 18:15:00 GMT collector.eb.gecko.pending next calculation at Mon Dec 12 2016 19:15:00 GMT+0000 (UTC)

but no logging of the completion. Derp:

67  while (history[0] && history[0][0] < earliest) {
68    history.unshift();
69  }

I wanted history.shift().
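For anyone reading along: Array.prototype.unshift() called with no arguments adds nothing and leaves the array unchanged, so the loop condition never becomes false and the process spins at 100% CPU. The intended version:

// shift() removes the oldest entry, so datapoints older than the error-budget
// window get discarded and the loop terminates
while (history[0] && history[0][0] < earliest) {
  history.shift();
}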
I pushed a fix for the shift/unshift error:
https://github.com/taskcluster/taskcluster-stats-collector/commit/4de5b1f7ef6f6991d97c2c5acee40598c3a126ae

and also switched things around because SLO=1 means the SLO was met.
https://github.com/taskcluster/taskcluster-stats-collector/commit/9431e99a47cffe03f1305383f04182ada8f21781

I'm working on writing tests for eb.js to find things like that shift/unshift error, but mocha is being annoying.
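A hedged sketch of the kind of mocha test that would have caught this; the real eb.js interface isn't shown here, so this exercises the trimming logic directly rather than the module itself:

const assert = require('assert');

describe('error budget history trimming', () => {
  it('drops entries older than the window and terminates', () => {
    const history = [[1000, 0], [2000, 1], [3000, 1]];
    const earliest = 2500;
    // the fixed loop: shift() (not unshift()) removes old entries
    while (history[0] && history[0][0] < earliest) {
      history.shift();
    }
    assert.deepEqual(history, [[3000, 1]]);
  });
});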
So, it looks like the SLI / SLO calculations are only lasting about 4 hours. I'm a little concerned with the sawtooth pattern in the pending. For example, the big blue spike is tc-stats-collector.tasks.aws-provisioner-v1.gecko-t-linux-xlarge.pending.5m.p95 shooting up over 60 minutes. That angle is one minute per minute, meaning that there's a single task that was pending from 20:15 to 21:25. That might be real, or it might be an artifact of a bug in tc-stats-collector.
It looks like all of the metric stream machinery fails at once: the incoming datapoints stop, and the metric stream multiplexer fails at the same time.
I added some debugging in https://github.com/taskcluster/taskcluster-stats-collector/commit/a6f4309f28322e38409fd72241a07c9ad0311eb5 so hopefully that will help figure out why things time out. I wonder if this is rate-limiting.
SLIs are now working better. However, SLOs are being shown in the SignalFX UI as an hourly metric, despite being submitted every 5 minutes (admittedly, with a 700-second delay, but that's not my fault..)
Ah, in fact the SLO data is being recorded at full resolution (the timeseries' "native resolution"), as can be seen here:
https://app.signalfx.com/#/chart/v1/new?template=default&filters=sf_metric:slo.gecko.pending&startTime=-12h&endTime=Now&density=4
but it's co-plotted with the error budget, which is an hourly metric, so it is getting rounded to the hour for display. So all is well.

I have adjusted the SLO threshold down to 70% within 1 day, just so I can see the EB go above zero. Once I see that, I'll revert that to our normal level and close this bug.
error budget success!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Operations → Operations and Service Requests