Closed Bug 1306597 Opened 8 years ago Closed 6 years ago

Set up CloudWatch & event subscriptions for Heroku RDS instances

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

Our monitoring options for the RDS instances used on Heroku are:
1) AWS CloudWatch (eg free disk space, CPU usage, other hardware metrics)
2) Event subscriptions (eg RDS instance configuration changes, failover, reboots, ...)
3) New Relic MySQL plugin
4) Database stats recorded by the New Relic Python agent, from the app's perspective.

#3 is a bit more involved (since it would require a service on another machine to run the plugin) and already has bug 1201063 filed. #4 is already occurring. This bug is about #1-2.

Initially I'll get alerts/notifications to be sent to just me. Then once proven non-spammy, I'll send them to the treeherder-internal list. Finally, we can then select a subset of the alerts to send to MOC too.

https://console.aws.amazon.com/cloudwatch/home?region=us-east-1#alarm:alarmFilter=ANY
https://console.aws.amazon.com/rds/home?region=us-east-1#event-subscriptions:
Some CloudWatch alerts on max queue size, or on some of the other equally obscure metrics, might have helped with the diagnosis of bug 1386331.
Blocks: 1386331
Priority: P2 → P1
Alert types to enable:
* disk usage
* CPU
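For reference, the two alert types above could be sketched roughly as follows using boto3's CloudWatch `put_metric_alarm` call. This is only an illustration, not what was actually configured: the instance identifier, SNS topic ARN, thresholds (10 GiB free, 90% CPU) and evaluation windows are all assumed placeholder values, and `build_rds_alarms` / `apply_alarms` are hypothetical helper names.

```python
def build_rds_alarms(instance_id, sns_topic_arn,
                     min_free_bytes=10 * 1024**3, max_cpu_percent=90.0):
    """Build keyword arguments for CloudWatch put_metric_alarm calls.

    Thresholds are illustrative guesses and would need tuning to avoid
    false positives.
    """
    common = {
        "Namespace": "AWS/RDS",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 3,   # consecutive breaching datapoints required
        "AlarmActions": [sns_topic_arn],
    }
    return [
        {
            **common,
            "AlarmName": f"{instance_id}-low-disk",
            "MetricName": "FreeStorageSpace",  # reported in bytes
            "Threshold": float(min_free_bytes),
            "ComparisonOperator": "LessThanThreshold",
        },
        {
            **common,
            "AlarmName": f"{instance_id}-high-cpu",
            "MetricName": "CPUUtilization",    # reported as a percentage
            "Threshold": float(max_cpu_percent),
            "ComparisonOperator": "GreaterThanThreshold",
        },
    ]


def apply_alarms(cloudwatch_client, alarms):
    """Create or update each alarm using the given CloudWatch client."""
    for alarm in alarms:
        cloudwatch_client.put_metric_alarm(**alarm)


# Example usage (requires boto3 and suitable IAM permissions):
#   import boto3
#   client = boto3.client("cloudwatch", region_name="us-east-1")
#   apply_alarms(client, build_rds_alarms("example-db", "arn:aws:sns:..."))
```

Keeping the parameter construction separate from the API call means the alarm definitions can be reviewed or tested without AWS credentials.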
Assignee: emorley → nobody
Status: ASSIGNED → NEW
Priority: P1 → P2
Today I received a low disk space alert for the dev RDS instance - which was resolved by an instance restart (guessing perhaps stray temp tables or similar? very strange). The alert only went to me since the notification settings have been unchanged since comment 0 - however that's not ideal moving forwards.

As such I've enabled failover, low disk space, ... notifications for all RDS instances, which will now be sent to treeherder-internal@ rather than just me.

To modify settings go to:
https://console.aws.amazon.com/rds/home?region=us-east-1#event-subscriptions:

This doesn't include more granular alerts around things like CPU usage, since they have to be configured via CloudWatch (which we don't have sufficient IAM permissions to do at present) and would need a fair amount of tweaking to ensure that there are no false positives - however we can always add those at a later date.
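The subscription above was set up through the console. For anyone wanting to script something similar, a rough sketch using boto3's RDS `create_event_subscription` call might look like the following. The subscription name, SNS topic ARN, instance identifiers and the exact category list are assumptions for illustration only, and `build_event_subscription` is a hypothetical helper.

```python
def build_event_subscription(name, sns_topic_arn, instance_ids=(),
                             categories=("failover", "low storage")):
    """Build keyword arguments for the RDS create_event_subscription call.

    The default categories are an assumed subset; the set actually enabled
    in the console may differ.
    """
    params = {
        "SubscriptionName": name,
        "SnsTopicArn": sns_topic_arn,
        "SourceType": "db-instance",
        "EventCategories": list(categories),
        "Enabled": True,
    }
    # Leaving SourceIds out subscribes to events from all DB instances.
    if instance_ids:
        params["SourceIds"] = list(instance_ids)
    return params


# Example usage (requires boto3 and suitable IAM permissions):
#   import boto3
#   rds = boto3.client("rds", region_name="us-east-1")
#   rds.create_event_subscription(
#       **build_event_subscription("example-rds-events", "arn:aws:sns:..."))
```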
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 6 years ago
Priority: P2 → P1
Resolution: --- → FIXED