Closed Bug 1155042 Opened 10 years ago Closed 9 years ago

Setup applicable monitoring on socorro via datadog

Categories

(Socorro :: Infra, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jschneider, Assigned: jschneider)

References

Details

* App level performance monitoring via New Relic
* Potential RUM via New Relic
* CPU / Disk / Mem monitoring via New Relic and/or Cloudwatch
* Endpoint monitoring via New Relic and maybe something else?
* CPU based autoscaling alarm triggers
* Scaling activities
* ELB health monitoring via Cloudwatch

Any other ideas?
Blocks: 1123833
Depends on: 1155040
I've added SNS topics to our AWS account to provide alerting.

TOPIC NAME -- ARN
AWS-alerts-mocotools -- arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools
AWS-scaling-notifications -- arn:aws:sns:us-west-2:293989542403:AWS-scaling-notifications

I've then added a few example alarms to Cloudwatch, hooking those into the notifications for the AWS-alerts-mocotools topic. We're watching CPU on our DB servers, and watching for unhealthy hosts in our ELBs. I should probably write these alarms out in CLI to make this automated for each piece of infra, huh?

ALARM NAME -- Metric / Threshold
elbhealth-stage-symbolapi -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-webapp -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-rabbitmq -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-elasticsearch -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-collector -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-prod-etherpadlite -- UnHealthyHostCount >= 1 for 1 minute
cpu-rds-techops-production -- CPUUtilization >= 85 for 5 minutes
cpu-rds-socorrotest -- CPUUtilization >= 85 for 5 minutes
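For reference, the ELB health alarms listed above could be scripted roughly as follows. This is only a sketch, not the script that ended up being used: it assumes boto3 is installed, that each ELB is named after its alarm minus the "elbhealth-" prefix, and that the "Maximum" statistic is the one wanted. The function names `elb_health_alarm` and `create_alarms` are made up for illustration.

```python
def elb_health_alarm(elb_name, topic_arn):
    """Build the keyword arguments for one CloudWatch PutMetricAlarm call.

    Mirrors the "UnHealthyHostCount >= 1 for 1 minute" rule above.
    The ELB name and the "Maximum" statistic are assumptions.
    """
    return {
        "AlarmName": "elbhealth-" + elb_name,
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": elb_name}],
        "Statistic": "Maximum",       # assumption, not stated in the bug
        "Period": 60,                 # seconds, i.e. "for 1 minute"
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # notify the SNS topic when in ALARM
    }


def create_alarms(elb_names, topic_arn, region="us-west-2"):
    """Create or update one alarm per ELB. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for name in elb_names:
        cloudwatch.put_metric_alarm(**elb_health_alarm(name, topic_arn))
```

Calling `create_alarms(["stage-symbolapi", "stage-socorro-webapp"], "arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools")` from a Jenkins job would then keep the alarms in sync with whatever list of ELBs is passed in, since PutMetricAlarm overwrites an existing alarm of the same name.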
Assignee: nobody → jschneider
Instead of New Relic, we're using datadog with the email address of mocotoolseng@mozilla.com
Summary: Setup applicable monitoring on socorro via New Relic/Cloudfront → Setup applicable monitoring on socorro via datadog
See Also: → 1144173
jp, what else do we need to do here?
Status: NEW → ASSIGNED
Flags: needinfo?(jschneider)
Blocks: 1118288
No longer blocks: 1123833
The last remaining item is to get statsd up and running, which I'll file as a bug hooked onto this one in just a minute. With monitoring so far, I have:

1) Set up a Pingdom account (currently just pinging the staging webapp)
2) In datadog, integrated github events for socorro and socorro-infra into the events stream (pushes, commits, etc) [Uses mocotoolseng github account]
3) In datadog, integrated the current buildbox events into the events stream [Run as centos agent]
4) In datadog, integrated our AWS change audit trail into the events stream (using cloudtrail) [Using datadog IAM creds] (Includes Scaling/Alarm alerts)
5) In datadog, experimented with a test ES server integration (for perf stream), which looked possibly worthwhile. Will file a bug on this. [Run as centos agent]
6) In datadog, integrated our other infra cloudwatch events/alarms into the events stream [IAM user for datadog]
7) In datadog, integrated RSS feeds for AWS alerts into the events feed.
8) In datadog, integrated the new pingdom account into the events feed.
9) In datadog, created a couple of dashboards as demos/templates.
10) Took screencaps of using the events layer on top of performance data.
11) Got the creation of all the ELB alerts automated with a script on jenkins
12) I maybe did other stuff too.

Dashboard example: https://p.datadoghq.com/sb/f34bd0200e

Gifs showing use of the Datadog events stream w/ perf data:
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-stream.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-saved-search.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-dashboard-event-layer.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-correlating-activity.gif

Just a cool way Datadog lets you easily parse hosts on multiple variables:
* https://dl.dropboxusercontent.com/u/2273146/datadog-infra-map.gif

We could go live with the data and logs we're now getting, with lots of room to improve. I'll file other bugs, but I'm not sure they should block prod for real.
Sometime, we'll want to:

1) Set up statsd with datadogd client and start instrumenting the app
2) Set up some frontend RUM with Pingdom
3) Determine if we need a pagerduty type of thing for alerting
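To give an idea of what instrumenting the app over statsd involves, here is a rough sketch that sends a metric in the plain statsd wire format ("name:value|type") over UDP using only the standard library. It assumes an agent listening on the conventional statsd port 8125 on localhost; the helper names and the metric name in the usage note are hypothetical, and a real setup would more likely use an existing statsd client library.

```python
import socket


def format_metric(name, value, metric_type="c"):
    """Render one metric in the statsd line format: name:value|type.

    Common types: "c" (counter), "g" (gauge), "ms" (timer).
    """
    return "{}:{}|{}".format(name, value, metric_type)


def send_metric(name, value, metric_type="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric to a statsd-compatible agent over UDP.

    Host and port are assumptions; 8125 is the usual statsd default.
    Returns the bytes that were sent.
    """
    payload = format_metric(name, value, metric_type).encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload
```

For example, `send_metric("socorro.collector.crashes_received", 1)` would bump a counter each time a crash is accepted. Because the transport is UDP, the call does not block on or fail because of a missing agent, which is the main reason statsd is cheap to sprinkle through application code.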
Flags: needinfo?(jschneider)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED