Closed
Bug 1344037
Opened 7 years ago
Closed 7 years ago
set up monitoring for pigeon in -stage [antenna]
Categories
(Socorro :: Antenna, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
References
Details
We want to know what's going on in Pigeon land, so we need to set up monitoring for it. This likely requires code changes and maybe some other stuff. Maybe a sandwich and a milkshake.

Miles mentioned this: https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/

That's probably insightful.

This bug covers figuring out how we want to monitor Pigeon activities and then making the appropriate changes.
Comment 1 • 7 years ago (Assignee)
This is helpful, too: http://docs.datadoghq.com/integrations/awslambda/

I'll spin off a new bug for the ops side of this, but for our side, Datadog automatically captures these metrics:

* aws.lambda.duration
* aws.lambda.duration.maximum
* aws.lambda.duration.minimum
* aws.lambda.duration.sum
* aws.lambda.errors
* aws.lambda.invocations
* aws.lambda.throttles

We probably want to also capture accepts and defers. Thus we need to have pigeon print these lines to stdout:

For accept:

MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.accept|

For defer:

MONITORING|unix_epoch_timestamp|1|count|antenna.pigeon.defer|

Replacing "unix_epoch_timestamp" with int(time.time()).
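
A minimal sketch of how those lines could be printed from Python, assuming the format shown above. The helper name `monitoring_line` and its `now` parameter are hypothetical and are not taken from the Pigeon code:

```python
import time


def monitoring_line(metric, value=1, metric_type="count", now=None):
    """Build one MONITORING line in the format described in this comment.

    This helper is illustrative only; it is not Pigeon's actual code.
    `now` lets a caller (or a test) supply the timestamp explicitly.
    """
    timestamp = int(time.time()) if now is None else int(now)
    return "MONITORING|%d|%d|%s|%s|" % (timestamp, value, metric_type, metric)


# In AWS Lambda, text printed to stdout ends up in the function's logs,
# which is where the Datadog integration looks for these lines.
print(monitoring_line("antenna.pigeon.accept"))
print(monitoring_line("antenna.pigeon.defer"))
```

Keeping the formatting in one small function means the accept and defer paths cannot drift apart in format.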
Comment 2 • 7 years ago (Assignee)
Also this: http://docs.aws.amazon.com/lambda/latest/dg/python-logging.html
Comment 3 • 7 years ago (Assignee)
In a PR: https://github.com/mozilla/socorro-pigeon/pull/14
Comment 4 • 7 years ago (Assignee)
I merged that PR. We don't have Pigeon running yet, so we'll have to wait to set the rest of this up.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Comment 5 • 7 years ago (Assignee)
One thing that popped into my head is that we want to keep track of AWS Lambda limit errors:

http://docs.aws.amazon.com/lambda/latest/dg/limits.html
http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html

We definitely want an alert on that: we want to know if we ever exceed those limits. I don't think we should be anywhere near them, but we should keep tabs on the concurrent executions limit. Socorro gets about 20 crashes per second (~req/s) and the AWS Lambda concurrent executions limit is 100. (Assumes I read the docs correctly.)
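
As a rough check of that headroom: for event-driven invocations, the concurrent-executions doc linked above estimates concurrency as the request rate multiplied by the average function duration. The sketch below applies that estimate. The 0.5-second duration is a made-up placeholder, not a measured Pigeon value, and `estimated_concurrency` is a hypothetical helper name:

```python
def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Estimate concurrent executions as rate times average duration."""
    return requests_per_second * avg_duration_seconds


rate = 20        # crashes per second, from the comment above
duration = 0.5   # seconds per invocation; hypothetical placeholder
limit = 100      # concurrent executions limit, from the comment above

needed = estimated_concurrency(rate, duration)
print("estimated concurrent executions: %.1f of %d" % (needed, limit))
```

With these assumed numbers the estimate stays well under the limit, but the real margin depends on how long each Pigeon invocation actually takes.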
Comment 6 • 7 years ago (Assignee)
Lonnen mentioned that we should tell AWS folks to raise our AWS Lambda rate limit. Then we don't have to deal with it.
Comment 7 • 7 years ago
We definitely should ask them to raise it a bit. It's probably useful to have an alert of some kind. IIRC when we hit a soft limit like this they notify the account email address, which would be good enough.
Comment 8 • 7 years ago
Submitted a request for an AWS Lambda concurrent execution limit increase in the Ops prod account. We should be approved for 400 concurrent executions shortly, which should give us significant padding. There will also be monitoring in place per https://bugzilla.mozilla.org/show_bug.cgi?id=1344037.
Comment 9 • 7 years ago
The limit was approved. We now have 400 concurrent executions in prod.
Comment 10 • 7 years ago
We have Datadog graphs for Pigeon in ops infra, but we will be running pigeon in webeng infra in order to use the Socorro stage crash bucket.

I set up a WIP dashboard here: https://app.datadoghq.com/dash/269494/pigeon

No data yet.
Comment 11 • 7 years ago
Actually, via the magic of letting ops Datadog have access to mozilla-webeng AWS integration, we should be able to keep our pigeon monitoring in ops Datadog. It might *just work*.
Comment 12 • 7 years ago (Assignee)
We've got Pigeon graphs in the dashboards and we watched data go by yesterday when we transitioned Socorro -stage to use Antenna. We're all set here, so marking as FIXED.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment 13 • 7 years ago (Assignee)
Switching Antenna bugs to Antenna component.
Component: General → Antenna