Closed Bug 1344037 Opened 5 years ago Closed 5 years ago

set up monitoring for pigeon in -stage [antenna]

Product/Component: Socorro :: Antenna
Type: task
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: willkg
Assigned: willkg

We want to know what's going on in Pigeon land, so we need to set up monitoring for it.

This likely requires code changes and maybe some other stuff. Maybe a sandwich and a milkshake.

Miles mentioned this:

https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/

That's probably insightful.

This bug covers figuring out how we want to monitor Pigeon activities and then making the appropriate changes.
This is helpful, too:

http://docs.datadoghq.com/integrations/awslambda/

I'll spin off a new bug for the ops side of this, but for our side, Datadog automatically captures metrics:

* aws.lambda.duration
* aws.lambda.duration.maximum
* aws.lambda.duration.minimum
* aws.lambda.duration.sum
* aws.lambda.errors
* aws.lambda.invocations
* aws.lambda.throttles

We probably also want to capture accepts and defers, so we need to have Pigeon print these lines to stdout:

For accept:

    MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.accept|

For defer:

    MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.defer|

Replace "unix_epoch_timestamp" with the value of int(time.time()).
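
A minimal sketch of how those lines could be produced, assuming Python. The names `monitoring_line` and `emit` are made up for illustration and are not taken from the Pigeon code; the trailing `|` is kept because the lines above end with one.

```python
import time


def monitoring_line(metric, value=1, metric_type="count", timestamp=None):
    """Build one MONITORING line in the format shown above.

    Hypothetical helper: the name and signature are assumptions,
    not Pigeon's actual code.
    """
    if timestamp is None:
        # Unix epoch seconds as an integer
        timestamp = int(time.time())
    # The trailing "|" matches the example lines in this bug.
    return "MONITORING|%d|%d|%s|%s|" % (timestamp, value, metric_type, metric)


def emit(metric):
    """Print the line to stdout so it ends up in the Lambda logs."""
    print(monitoring_line(metric))


if __name__ == "__main__":
    emit("antenna.pigeon.accept")
    emit("antenna.pigeon.defer")
```

Passing an explicit `timestamp` makes the output deterministic, which is handy for tests.
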
I merged that PR.

We don't have Pigeon running, yet, so we'll have to wait to set the rest of this up.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
One thing that popped in my head is that we want to keep track of AWS Lambda limit errors:

http://docs.aws.amazon.com/lambda/latest/dg/limits.html

http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html

We definitely want an alert on that--we want to know if we ever exceed those.

I don't think we should be anywhere near those limits, but we should keep tabs on the concurrent executions limit. Socorro gets about 20 crashes per second (~20 req/s) and the AWS Lambda concurrent executions limit is 100 concurrent executions. (Assumes I read the docs correctly.)
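
A rough back-of-the-envelope check, as a sketch: the concurrent executions limit counts invocations in flight at the same time, which is approximately the request rate multiplied by the average invocation duration. The 20 req/s and the limit of 100 come from the comment above; the 0.5 second duration is a made-up placeholder, not a measured Pigeon number, and the function names are hypothetical.

```python
def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate number of invocations running at the same time."""
    return requests_per_second * avg_duration_seconds


def max_duration_before_limit(requests_per_second, concurrency_limit):
    """Average duration (seconds) at which the rate would reach the limit."""
    return concurrency_limit / requests_per_second


if __name__ == "__main__":
    rate = 20               # crashes per second, from the comment above
    limit = 100             # concurrent executions limit, from the comment above
    assumed_duration = 0.5  # seconds; hypothetical placeholder value

    print("estimated concurrency:", estimated_concurrency(rate, assumed_duration))
    print("seconds per invocation to hit the limit:",
          max_duration_before_limit(rate, limit))
```

With these inputs the headroom depends mostly on how long each invocation runs, so `aws.lambda.duration` and `aws.lambda.throttles` are the metrics to watch.
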
Lonnen mentioned that we should tell AWS folks to raise our AWS Lambda rate limit. Then we don't have to deal with it.
We definitely should ask them to raise it a bit. It's probably useful to have an alert of some kind. IIRC when we hit a soft limit like this they notify the account mail, which would be good enough.
Submitted a request to increase the AWS Lambda concurrent execution limit in the Ops prod account. We should be approved for 400 concurrent executions shortly, which should give us significant padding. There will also be monitoring in place per https://bugzilla.mozilla.org/show_bug.cgi?id=1344037.
The limit was approved. We now have 400 concurrent executions in prod.
We have Datadog graphs for Pigeon in ops infra, but we will be running Pigeon in webeng infra in order to use the Socorro stage crash bucket.

I set up a WIP dashboard here: https://app.datadoghq.com/dash/269494/pigeon
No data yet.
Actually, via the magic of letting ops Datadog have access to the mozilla-webeng AWS integration, we should be able to keep our Pigeon monitoring in ops Datadog. It might *just work*.
We've got Pigeon graphs in the dashboards and we watched data go by yesterday when we transitioned Socorro -stage to use Antenna. We're all set here, so marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Switching Antenna bugs to Antenna component.
Component: General → Antenna