1155042 - Setup applicable monitoring on socorro via datadog

Assignee

Description

•

10 years ago

* App level performance monitoring via New Relic
* Potential RUM via New Relic
* CPU / Disk / Mem monitoring via New Relic and/or Cloudwatch
* Endpoint monitoring via New Relic and maybe something else?
* CPU based autoscaling alarm triggers
* Scaling activities
* ELB health monitoring via Cloudwatch

Any other ideas?

JP Schneider [:jp]

Assignee

Updated

•

10 years ago

Blocks: 1123833

Depends on: 1155040

JP Schneider [:jp]

Assignee

Comment 1

•

10 years ago

I've added SNS topics to our AWS account to provide alerting.

TOPIC NAME -- ARN
AWS-alerts-mocotools -- arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools
AWS-scaling-notifications -- arn:aws:sns:us-west-2:293989542403:AWS-scaling-notifications

I've then added a few example alarms to CloudWatch and hooked them into the notifications for the AWS-alerts-mocotools topic. We're watching CPU on our DB servers and watching for unhealthy hosts in our ELBs. I should probably write these alarms out in CLI to make this automated for each piece of infra, huh?

ALARM NAME -- Metric / Threshold
elbhealth-stage-symbolapi -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-webapp -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-rabbitmq -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-elasticsearch -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-collector -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-prod-etherpadlite -- UnHealthyHostCount >= 1 for 1 minute
cpu-rds-techops-production -- CPUUtilization >= 85 for 5 minutes
cpu-rds-socorrotest -- CPUUtilization >= 85 for 5 minutes
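For anyone wanting to script the alarms listed in this comment, here is a minimal sketch of how the definitions might be generated. It is not the script that was actually used. It assumes that each ELB name and RDS instance identifier matches the suffix of its alarm name (for example `stage-symbolapi`), that `Maximum` and `Average` are the intended statistics, and that "for 1 minute" and "for 5 minutes" each mean a single evaluation period; none of that is stated in this bug. The helper names are hypothetical. The dictionary keys follow the parameters of boto3's CloudWatch `put_metric_alarm` call, and boto3 is imported only when the alarms are applied.

```python
# Sketch only: builds keyword arguments for CloudWatch put_metric_alarm.
# Topic ARN taken from this comment; everything marked "assumption" is a guess.
ALERT_TOPIC_ARN = "arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools"


def elb_health_alarm(elb_name):
    """Alarm when an ELB reports >= 1 unhealthy host for 1 minute."""
    return {
        "AlarmName": "elbhealth-" + elb_name,
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        # Assumption: the ELB is named after the alarm suffix.
        "Dimensions": [{"Name": "LoadBalancerName", "Value": elb_name}],
        "Statistic": "Maximum",  # assumption
        "Period": 60,  # seconds
        "EvaluationPeriods": 1,  # assumption: one 60 s period
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }


def rds_cpu_alarm(db_instance_id):
    """Alarm when an RDS instance sits at >= 85% CPU for 5 minutes."""
    return {
        "AlarmName": "cpu-rds-" + db_instance_id,
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        # Assumption: the instance identifier matches the alarm suffix.
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",  # assumption
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,  # assumption: one 300 s period
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [ALERT_TOPIC_ARN],
    }


def build_alarms(elb_names, db_instance_ids):
    """Return alarm definitions for every ELB and RDS instance given."""
    alarms = [elb_health_alarm(name) for name in elb_names]
    alarms += [rds_cpu_alarm(db_id) for db_id in db_instance_ids]
    return alarms


def apply_alarms(alarms, region="us-west-2"):
    """Create or update the alarms; needs boto3 and AWS credentials."""
    import boto3  # third-party, imported lazily so the builders work without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for alarm in alarms:
        cloudwatch.put_metric_alarm(**alarm)
```

Calling `build_alarms(["stage-symbolapi", "stage-socorro-webapp"], ["socorrotest"])` returns three definitions that can be reviewed before being passed to `apply_alarms`. Keeping the builders separate from the AWS call makes the definitions easy to check from a CI job.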

JP Schneider [:jp]

Assignee

Updated

•

10 years ago

Assignee: nobody → jschneider

JP Schneider [:jp]

Assignee

Comment 2

•

10 years ago

Instead of New Relic, we're using datadog with the email address of mocotoolseng@mozilla.com

Robert Helmer [:rhelmer]

Updated

•

10 years ago

Summary: Setup applicable monitoring on socorro via New Relic/Cloudfront → Setup applicable monitoring on socorro via datadog

Daniel Maher [:phrawzty]

Updated

•

10 years ago

Comment 4

•

9 years ago

jp, what else do we need to do here?

Status: NEW → ASSIGNED

Flags: needinfo?(jschneider)

Robert Helmer [:rhelmer]

Updated

•

9 years ago

Blocks: 1118288
No longer blocks: 1123833

JP Schneider [:jp]

Assignee

Comment 5

•

9 years ago

The last remaining item is to get statsd up and running, which I'll file hooked onto this one in just a minute.

With monitoring so far, I have:

1) Set up a Pingdom account (currently just pinging the staging webapp)
2) In datadog, integrated github events for socorro and socorro-infra into the events stream (pushes, commits, etc.) [Uses mocotoolseng github account]
3) In datadog, integrated the current buildbox events into the events stream [Run as centos agent]
4) In datadog, integrated our AWS change audit trail into the events stream (using cloudtrail) [Using datadog IAM creds] (Includes Scaling/Alarm alerts)
5) In datadog, experimented with a test ES server integration (for perf stream); it looked possibly worthwhile. Will file a bug on this. [Run as centos agent]
6) In datadog, integrated our other infra cloudwatch events/alarms into the events stream [IAM user for datadog]
7) In datadog, integrated RSS feeds for AWS alerts into the events feed.
8) In datadog, integrated the new pingdom account into the events feed.
9) In datadog, created a couple of dashboards as demos/templates.
10) Took screencaps of using the events layer on top of performance data.
11) Got the creation of all the ELB alerts automated with a script on jenkins
12) I maybe did other stuff too.

Dashboard example: https://p.datadoghq.com/sb/f34bd0200e

GIFs showing use of the Datadog events stream w/ perf data:
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-stream.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-saved-search.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-dashboard-event-layer.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-correlating-activity.gif

Just a cool way Datadog lets you easily parse hosts on multiple variables:
* https://dl.dropboxusercontent.com/u/2273146/datadog-infra-map.gif

We could go live with the data and logs we're now getting, with lots of room to improve. I'll file other bugs, but I'm not sure they should block prod for real.

Sometime, we'll want to:
1) Set up statsd with datadogd client and start instrumenting the app
2) Set up some frontend RUM with Pingdom
3) Determine if we need a pagerduty type of thing for alerting
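As a rough idea of what the statsd instrumentation mentioned above might look like on the wire, here is a minimal sketch using only the Python standard library. It formats a metric in the StatsD line format (`name:value|type`), with the optional Datadog-style tag suffix (`|#tag1,tag2`), and sends it over UDP. The metric name, the tag, and the helper names are hypothetical, and the sketch assumes an agent listening on the conventional port 8125 on localhost; a real setup would more likely use an existing statsd client library.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one StatsD datagram, e.g. 'name:1|c' or 'name:1|c|#env:stage'."""
    line = "{}:{}|{}".format(name, value, metric_type)
    if tags:
        # Tag suffix is a Datadog extension to the plain StatsD format.
        line += "|#" + ",".join(tags)
    return line


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are assumed defaults."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Hypothetical metric: count one crash report accepted on stage.
    send_metric(format_metric("socorro.collector.crashes", 1, "c", ["env:stage"]))
```

Because the transport is UDP, the send does not wait for or depend on a reply, so instrumentation calls add little overhead to the app.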

Flags: needinfo?(jschneider)

JP Schneider [:jp]

Assignee

Updated

•

9 years ago

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Daniel Maher [:phrawzty]

Comment 6

•

9 years ago

http://i.imgur.com/VLJtnW5.gif

Categories

(Socorro :: Infra, task)

Tracking

(Not tracked)

People

(Reporter: jschneider, Assigned: jschneider)

Security

(public)
