Closed
Bug 1356401
Opened 7 years ago
Closed 6 years ago
monitor pigeon exceptions and alert ops/devs
Categories
(Socorro :: Antenna, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
Details
Attachments
(1 file)
If Pigeon can't connect to RabbitMQ, it should let us know in a way that gets our attention. Theoretically, it logs something and emits a statsd metric: https://github.com/mozilla/socorro-pigeon/blob/735c183ecc626f1bc31648cc68f8f33870aea6c2/pigeon.py#L324 This bug covers verifying that's correct and tying it to a monitor/alert.
Comment 1 (assignee) • 7 years ago
I'll verify the code is correct. Then Miles can add an alert to that statsd key.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Comment 2 (assignee) • 7 years ago
Switching Antenna bugs to Antenna component.
Component: General → Antenna
Comment 3 (assignee) • 6 years ago
I'm going to rescope this to "we need to know when Pigeon is erroring out" generally rather than in the specific Pika connection case. Adjusting the summary accordingly. The AWS Lambda docs cover how exceptions are handled: https://docs.aws.amazon.com/lambda/latest/dg/python-exceptions.html We want to rewrite the stage submitter as a Lambda function, but that's significantly more complex, and I want to make sure we have a basic set of best practices figured out before embarking on that. Given that, making this a P2.
Priority: -- → P2
Summary: monitor pigeon connection errors and alert when things are bad [antenna] → monitor pigeon exceptions and alert ops/devs
Comment 4 (assignee) • 6 years ago
Bumping this down. This is important, but I don't foresee more pigeon work for a while and I'm pushing off the stage submitter work, so this can wait.
Priority: P2 → P3
Comment 5 (assignee) • 6 years ago
Oops--meant to unassign myself, too. Doing that now.
Assignee: willkg → nobody
Status: ASSIGNED → NEW
Comment 6 (assignee) • 6 years ago
Grabbing this to do now. The issue here is that pigeon is handling exceptions itself, so the Lambda runtime never sees them. We want it to log things and emit statsd metrics, but re-raise the exceptions so that the Lambda runtime sees them, thus triggering its machinery.
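A minimal sketch of that log, count, and re-raise pattern follows. It is not pigeon's actual code: the `Metrics` class is an in-memory stand-in for a statsd client, and `process_event`, the logger name, and the metric key `pigeon.unhandled_exception` are all hypothetical names chosen for illustration.

```python
import logging

logger = logging.getLogger("pigeon")


class Metrics:
    """Hypothetical stand-in for a statsd client; keeps counts in memory."""

    def __init__(self):
        self.counts = {}

    def incr(self, key, value=1):
        self.counts[key] = self.counts.get(key, 0) + value


metrics = Metrics()


def process_event(event):
    """Hypothetical placeholder for the real work (e.g. publishing to RabbitMQ)."""
    if not event.get("Records"):
        raise ValueError("no records in event")
    return len(event["Records"])


def handler(event, context):
    """Lambda entry point: record the failure, then let it propagate."""
    try:
        return process_event(event)
    except Exception:
        # Log with a traceback and bump a counter so the failure is visible.
        logger.exception("unhandled error while processing event")
        metrics.incr("pigeon.unhandled_exception")
        # Bare raise re-raises the original exception, so the caller
        # (here, the Lambda runtime) sees the invocation as failed.
        raise
```

The bare `raise` is the important part: swallowing the exception after logging it would make the invocation look successful to whatever called `handler`.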
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Comment 7 (assignee) • 6 years ago
Comment 8 (assignee) • 6 years ago
Landed in https://github.com/mozilla-services/socorro-pigeon/commit/b03c16a18b0babf0011252cc137829fcde52655d That changes pigeon so that it doesn't handle exceptions, thus letting the runtime handle them and do whatever it does. Pretty sure that's what we want, but I'm not sure how that manifests itself in AWS-land. Miles, Brian: After we update pigeon, how do we monitor when it raises exceptions? Can we get exception counts in Datadog somewhere?
Flags: needinfo?(miles)
Flags: needinfo?(bpitts)
Comment 9 • 6 years ago
Yes, there is the aws.lambda.errors metric. https://docs.datadoghq.com/integrations/amazon_lambda/#data-collected
Flags: needinfo?(bpitts)
Comment 10 (assignee) • 6 years ago
Cool! Can one of you add that to the Socorro collector -new-prod dashboard?
Comment 11 • 6 years ago
It is there; that's the "Pigeon Errors" graph. We have been monitoring/alerting on this from the start, but we were never throwing the exceptions to the Lambda runtime.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(miles)
Resolution: --- → FIXED
Comment 12 (assignee) • 6 years ago
Awesome! Thanks, Miles!