Closed Bug 1356401 Opened 7 years ago Closed 6 years ago

monitor pigeon exceptions and alert ops/devs

Categories

(Socorro :: Antenna, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Details

Attachments

(1 file)

If Pigeon can't connect to RabbitMQ, it should let us know in a way that gets our attention. Theoretically, it logs something and sends something to statsd:

https://github.com/mozilla/socorro-pigeon/blob/735c183ecc626f1bc31648cc68f8f33870aea6c2/pigeon.py#L324

This bug covers verifying that's correct and tying it to a monitor/alert.
I'll verify the code is correct. Then Miles can add an alert to that statsd key.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Switching Antenna bugs to Antenna component.
Component: General → Antenna
I'm going to rescope this to "we need to know when Pigeon is erroring out" generally rather than in the specific Pika connection case. Adjusting the summary accordingly.

AWS Lambda docs have this:

https://docs.aws.amazon.com/lambda/latest/dg/python-exceptions.html

We want to rewrite the stage submitter as a Lambda function, but that's significantly more complex and I want to make sure we have a basic set of best practices figured out before embarking on that. Given that, making this a P2.
Priority: -- → P2
Summary: monitor pigeon connection errors and alert when things are bad [antenna] → monitor pigeon exceptions and alert ops/devs
Bumping this down. This is important, but I don't foresee more pigeon work for a while and I'm pushing off the stage submitter work, so this can wait.
Priority: P2 → P3
Oops--meant to unassign myself, too. Doing that now.
Assignee: willkg → nobody
Status: ASSIGNED → NEW
Grabbing this to do now.

The issue here is that pigeon is handling exceptions itself. We want it to log them and send metrics to statsd, but then re-raise them so that the Lambda runtime sees them, which triggers its machinery.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
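For illustration, the log, count, and re-raise behavior described above could look roughly like the sketch below. This is not the actual pigeon code: the decorator name, the metric key, and the in-memory `incr` stand-in for a statsd client are all hypothetical. The only thing taken from AWS is the usual `handler(event, context)` shape of a Python Lambda entry point.

```python
import functools
import logging

logger = logging.getLogger("pigeon_sketch")

# Hypothetical stand-in for a statsd client: counts increments in memory.
METRIC_COUNTS = {}


def incr(key):
    """Record one increment for a metric key (assumed name, not pigeon's)."""
    METRIC_COUNTS[key] = METRIC_COUNTS.get(key, 0) + 1


def report_and_reraise(fn):
    """Log and count any exception from fn, then let it propagate."""
    @functools.wraps(fn)
    def wrapper(event, context):
        try:
            return fn(event, context)
        except Exception:
            # logger.exception includes the traceback in the log record.
            logger.exception("unhandled error in %s", fn.__name__)
            incr("pigeon_sketch.unhandled_exception")
            # Bare raise re-raises the original exception unchanged, so the
            # caller (here, the Lambda runtime) still sees the failure.
            raise
    return wrapper


@report_and_reraise
def handler(event, context):
    # Illustrative body only: a malformed event raises KeyError.
    records = event["Records"]
    return len(records)
```

The point of the bare `raise` is that the wrapper only observes the exception; it never swallows it, so the invocation still fails from the runtime's point of view.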
Landed in https://github.com/mozilla-services/socorro-pigeon/commit/b03c16a18b0babf0011252cc137829fcde52655d

That changes pigeon so that it doesn't handle exceptions, which lets the runtime handle them and do whatever it does. Pretty sure that's what we want, but I'm not sure how that manifests itself in AWS-land.

Miles, Brian: After we update pigeon, how do we monitor when it raises exceptions? Can we get exception counts in Datadog somewhere?
Flags: needinfo?(miles)
Flags: needinfo?(bpitts)
Yes, there is the aws.lambda.errors metric.

https://docs.datadoghq.com/integrations/amazon_lambda/#data-collected
Flags: needinfo?(bpitts)
Cool! Can one of you add that to the Socorro collector -new-prod dashboard?
It is there; that's the "Pigeon Errors" graph. We have been monitoring and alerting on this from the start, but we were never throwing the exceptions to the Lambda runtime.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(miles)
Resolution: --- → FIXED
Awesome! Thanks, Miles!