Closed Bug 1567989 Opened 6 years ago Closed 6 years ago

Monitor non-200s http errors in autograph stackdriver

Categories

(Cloud Services :: Operations: Autograph, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jvehent, Unassigned)

Details

Since Autograph is an internal application, it should rarely, if ever, return non-200 status codes. We should monitor for elevated levels of 400s and 500s and alert as appropriate.

No longer blocks: 1567986

We discussed some options here, with the preference being for log-based metrics in stackdriver. The idea is that autograph should log HTTP responses in a Stackdriver-friendly format for parsing as log-based metrics, and we should monitor those metrics in Influx / Datadog.
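A minimal sketch of what "Stackdriver-friendly" logging could look like, assuming one JSON object per line with a numeric status field. The field names `time`, `method` and `path` and the function `access_log_line` are hypothetical; only `code` is taken from this bug, where the filters match on `jsonPayload.code`.

```python
import json
from datetime import datetime, timezone


def access_log_line(method, path, code, now=None):
    """Return one JSON log line for an HTTP response.

    'code' is emitted as a number (not a string) so that numeric
    comparisons such as `jsonPayload.code >= 300` can apply to it.
    All field names other than 'code' are illustrative assumptions.
    """
    now = now or datetime.now(timezone.utc)
    entry = {
        "time": now.isoformat(),
        "method": method,
        "path": path,
        "code": int(code),
    }
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical request, for illustration only.
    print(access_log_line("POST", "/example", 502))
```

Keeping the status as an integer is the main design point here: a string-valued code would not compare numerically in a range filter.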

An alternative option would be to have autograph send statsd metrics to a local agent (such as telegraf, datadog agent), which sidesteps the need for configuration of log-based metrics and would mean that those metrics go directly to our monitoring backend.
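For comparison, the statsd route could be sketched roughly as below. This is not how autograph does it; the metric prefix `autograph.http`, the helper names, and the agent address `127.0.0.1:8125` are assumptions for illustration. The sketch uses the plain statsd counter line format `name:value|c` sent over UDP.

```python
import socket


def statsd_counter_line(prefix, code):
    """Build a statsd counter line bucketed by status class, e.g. '5xx'."""
    status_class = f"{int(code) // 100}xx"
    return f"{prefix}.{status_class}:1|c"


def send_counter(prefix, code, host="127.0.0.1", port=8125):
    """Fire-and-forget one counter increment to a local agent over UDP.

    host and port are assumed values for a local telegraf or datadog agent.
    """
    payload = statsd_counter_line(prefix, code).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


if __name__ == "__main__":
    # Hypothetical metric prefix, for illustration only.
    send_counter("autograph.http", 503)
```

Bucketing by status class keeps the number of distinct metric names small while still separating 4xx from 5xx.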

These now exist as log-based metrics named user/autograph-http-errors.

For prod (bobm created this one?), using the filter:

resource.type="aws_ec2_instance"
logName="projects/aws-aws-autograph-p-1535037642/logs/nginx-access"
(jsonPayload.code < 200 OR jsonPayload.code >= 300) AND (jsonPayload.code != 404)
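To make the status-code condition in this filter concrete, here is a small sketch of the same predicate in Python. The function name `is_counted` is hypothetical, and this only mirrors the `jsonPayload.code` clause, not the `resource.type` or `logName` clauses.

```python
def is_counted(code):
    """Mirror the filter's status clause: any non-2xx code except 404."""
    return (code < 200 or code >= 300) and code != 404


if __name__ == "__main__":
    for code in (200, 204, 301, 400, 404, 500):
        print(code, is_counted(code))
```

Note that, as written, the clause also counts 1xx and 3xx responses, not only 400s and 500s, and it excludes 404 specifically.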

example logs:

https://console.cloud.google.com/logs/viewer?project=aws-aws-autograph-p-1535037642&organizationId=442341870013&minLogLevel=0&expandAll=false&timestamp=2019-09-10T18:11:37.981000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22aws_ec2_instance%22%0AlogName%3D%22projects%2Faws-aws-autograph-p-1535037642%2Flogs%2Fnginx-access%22%0A(jsonPayload.code%20%3C%20200%20OR%20jsonPayload.code%20%3E%3D%20300)%20AND%20(jsonPayload.code%20!%3D%20404)&dateRangeStart=2019-09-10T17:11:38.724Z&dateRangeEnd=2019-09-10T18:11:38.724Z&interval=PT1H&scrollTimestamp=2019-09-10T17:18:02.483445523Z

and in metrics viewer with count aggregation and aligner over the last week:

https://app.google.stackdriver.com/metrics-explorer?project=moz-fx-data-aws-logging&timeSelection=%7B%22timeRange%22:%221w%22%7D&xyChart=%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22logging.googleapis.com%2Fuser%2Fautograph-http-errors%5C%22%22,%22perSeriesAligner%22:%22ALIGN_COUNT%22,%22crossSeriesReducer%22:%22REDUCE_COUNT%22,%22secondaryCrossSeriesReducer%22:%22REDUCE_NONE%22,%22minAlignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D,%22unitOverride%22:%221%22%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D&isAutoRefresh=true

for stage:

filtering on:

resource.type="aws_ec2_instance"
logName="projects/aws-aws-autograph-s-1534261406/logs/nginx-access"
(jsonPayload.code < 200 OR jsonPayload.code >= 300) AND (jsonPayload.code != 404)

example logs:

https://console.cloud.google.com/logs/viewer?project=aws-aws-autograph-s-1534261406&organizationId=442341870013&minLogLevel=0&expandAll=false&timestamp=2019-09-10T18:16:13.725000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22aws_ec2_instance%22%0AlogName%3D%22projects%2Faws-aws-autograph-s-1534261406%2Flogs%2Fnginx-access%22%0A(jsonPayload.code%20%3C%20200%20OR%20jsonPayload.code%20%3E%3D%20300)%20AND%20(jsonPayload.code%20!%3D%20404)&dateRangeStart=2019-09-10T17:16:14.514Z&dateRangeEnd=2019-09-10T18:16:14.514Z&interval=PT1H

in metrics explorer with count aggregation and aligner over the last week:

https://app.google.stackdriver.com/metrics-explorer?project=moz-fx-data-aws-logging&timeSelection=%7B%22timeRange%22:%221w%22%7D&xyChart=%7B%22dataSets%22:%5B%7B%22timeSeriesFilter%22:%7B%22filter%22:%22metric.type%3D%5C%22logging.googleapis.com%2Fuser%2Fautograph-http-errors%5C%22%22,%22perSeriesAligner%22:%22ALIGN_COUNT%22,%22crossSeriesReducer%22:%22REDUCE_COUNT%22,%22secondaryCrossSeriesReducer%22:%22REDUCE_NONE%22,%22minAlignmentPeriod%22:%2260s%22,%22groupByFields%22:%5B%5D,%22unitOverride%22:%221%22%7D,%22targetAxis%22:%22Y1%22,%22plotType%22:%22LINE%22%7D%5D,%22options%22:%7B%22mode%22:%22COLOR%22%7D,%22constantLines%22:%5B%5D,%22timeshiftDuration%22:%220s%22,%22y1Axis%22:%7B%22label%22:%22y1Axis%22,%22scale%22:%22LINEAR%22%7D%7D&isAutoRefresh=true

Created a similar email alert for stage: https://app.google.stackdriver.com/policies/5444744706530780110?project=moz-fx-data-aws-logging

Currently it just emails me, but other people can subscribe, or we could send it to autograph-notifications@mozilla.com.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED