1344037 - set up monitoring for pigeon in -stage [antenna]

Will Kahn-Greene [:willkg]

Assignee

Description

8 years ago

We want to know what's going on in Pigeon land, so we need to set up monitoring for it. This likely requires code changes and maybe some other stuff. Maybe a sandwich and a milkshake. Miles mentioned this: https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/ That's probably insightful. This bug covers figuring out how we want to monitor Pigeon activities and then making the appropriate changes.

Will Kahn-Greene [:willkg]

Assignee

Comment 1

8 years ago

This is helpful, too: http://docs.datadoghq.com/integrations/awslambda/

I'll spin off a new bug for the ops side of this, but for our side, Datadog automatically captures these metrics:

* aws.lambda.duration
* aws.lambda.duration.maximum
* aws.lambda.duration.minimum
* aws.lambda.duration.sum
* aws.lambda.errors
* aws.lambda.invocations
* aws.lambda.throttles

We probably also want to capture accepts and defers. Thus we need to have pigeon print these lines to stdout:

For accept:

MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.accept|

For defer:

MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.defer|

Replacing "unix_epoch_timestamp" with int(time.time()).
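A rough sketch of how pigeon could build and print those lines. The helper names (monitoring_line, record_accept, record_defer) are made up for illustration and are not taken from the socorro-pigeon code; only the line format and the two metric names come from the comment above.

```python
import time


def monitoring_line(metric_name, value=1, metric_type="count", timestamp=None):
    """Build one MONITORING line in the format described above.

    Hypothetical helper: the name and signature are assumptions.
    """
    if timestamp is None:
        # Unix epoch timestamp as an integer, per the comment above.
        timestamp = int(time.time())
    # The trailing "|" leaves the final (tags) field empty.
    return "MONITORING|%d|%s|%s|%s|" % (timestamp, value, metric_type, metric_name)


def record_accept():
    # Printing to stdout is what gets the line into the Lambda logs.
    print(monitoring_line("antenna.pigeon.accept"))


def record_defer():
    print(monitoring_line("antenna.pigeon.defer"))


if __name__ == "__main__":
    record_accept()
    record_defer()
```

Passing the timestamp explicitly keeps the formatting testable without patching time.time().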

Will Kahn-Greene [:willkg]

Assignee

Comment 2

8 years ago

Also this: http://docs.aws.amazon.com/lambda/latest/dg/python-logging.html

Will Kahn-Greene [:willkg]

Assignee

Comment 3

8 years ago

In a PR: https://github.com/mozilla/socorro-pigeon/pull/14

Will Kahn-Greene [:willkg]

Assignee

Comment 4

8 years ago

I merged that PR. We don't have Pigeon running yet, so we'll have to wait to set the rest of this up.

Assignee: nobody → willkg

Status: NEW → ASSIGNED

Will Kahn-Greene [:willkg]

Assignee

Comment 5

8 years ago

One thing that popped into my head is that we want to keep track of AWS Lambda limit errors:

http://docs.aws.amazon.com/lambda/latest/dg/limits.html
http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html

We definitely want an alert on that, because we want to know if we ever exceed those limits. I don't think we should be anywhere near them, but we should keep tabs on the concurrent executions limit. Socorro gets like 20 crashes per second (~req/s) and the AWS Lambda concurrent executions limit is for 100 req/s. (Assumes I read the docs correctly.)
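For a back-of-the-envelope check: the concurrent executions limit counts invocations running at the same time rather than requests per second, and the usual estimate is invocations per second multiplied by average duration in seconds. A small sketch of that arithmetic follows. The 20 per second rate and the limit of 100 are the figures from this comment; the 0.5 second average duration is purely an assumed number for illustration, not a measured Pigeon value.

```python
def estimated_concurrency(invocations_per_second, avg_duration_seconds):
    """Estimate concurrent executions as rate multiplied by average duration."""
    return invocations_per_second * avg_duration_seconds


def max_avg_duration(limit, invocations_per_second):
    """Average duration (seconds) at which the estimate reaches the limit."""
    return limit / float(invocations_per_second)


if __name__ == "__main__":
    rate = 20        # crashes per second, from the comment above
    limit = 100      # concurrent executions limit, from the comment above
    duration = 0.5   # ASSUMPTION: illustrative average invocation time in seconds

    print("estimated concurrency:", estimated_concurrency(rate, duration))
    print("average duration that would hit the limit:", max_avg_duration(limit, rate))
```

Under those assumptions the estimate stays well below the limit, and the average invocation would have to take about 5 seconds before 20 invocations per second reached 100 concurrent executions.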

Will Kahn-Greene [:willkg]

Assignee

Comment 6

8 years ago

Lonnen mentioned that we should tell AWS folks to raise our AWS Lambda rate limit. Then we don't have to deal with it.

Lonnen :lonnen

Comment 7

8 years ago

We definitely should ask them to raise it a bit. It's probably useful to have an alert of some kind. IIRC when we hit a soft limit like this they notify the account mail, which would be good enough.

Miles Crabill [:miles]

Comment 8

8 years ago

Submitted an AWS Lambda concurrent execution limit increase request in the Ops prod account. We should be approved for 400 concurrent executions shortly, which should give us significant padding. There will also be monitoring in place per https://bugzilla.mozilla.org/show_bug.cgi?id=1344037.

Miles Crabill [:miles]

Comment 9

8 years ago

The limit was approved. We now have 400 concurrent executions in prod.

Miles Crabill [:miles]

Comment 10

8 years ago

We have Datadog graphs for Pigeon in ops infra, but we will be running pigeon in webeng infra in order to use the Socorro stage crash bucket. I set up a WIP dashboard here: https://app.datadoghq.com/dash/269494/pigeon No data yet.

Miles Crabill [:miles]

Comment 11

8 years ago

Actually, via the magic of letting ops Datadog have access to mozilla-webeng AWS integration, we should be able to keep our pigeon monitoring in ops Datadog. It might *just work*.

Will Kahn-Greene [:willkg]

Assignee

Comment 12

8 years ago

We've got Pigeon graphs in the dashboards and we watched data go by yesterday when we transitioned Socorro -stage to use Antenna. We're all set here, so marking as FIXED.

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Will Kahn-Greene [:willkg]

Assignee

Comment 13

8 years ago

Switching Antenna bugs to Antenna component.

Component: General → Antenna

Categories

(Socorro :: Antenna, task)

Tracking

(Not tracked)

People

(Reporter: willkg, Assigned: willkg)

References

Details

Crash Data

Security

(public)
