Closed
Bug 1155042
Opened 10 years ago
Closed 9 years ago
Setup applicable monitoring on socorro via datadog
Categories
(Socorro :: Infra, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jschneider, Assigned: jschneider)
References
Details
* App-level performance monitoring via New Relic
* Potential RUM via New Relic
* CPU / Disk / Mem monitoring via New Relic and/or CloudWatch
* Endpoint monitoring via New Relic and maybe something else?
* CPU-based autoscaling alarm triggers
* Scaling activities
* ELB health monitoring via CloudWatch
Any other ideas?
Comment 1•10 years ago
I've added SNS topics to our AWS account to provide alerting.
TOPIC NAME -- ARN
AWS-alerts-mocotools -- arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools
AWS-scaling-notifications -- arn:aws:sns:us-west-2:293989542403:AWS-scaling-notifications
I've then added a few example alarms to CloudWatch, hooking those into the notifications for the AWS-alerts-mocotools topic. We're watching CPU on our DB servers, and watching for unhealthy hosts in our ELBs. I should probably write these alarms out in CLI to make this automated for each piece of infra, huh?
ALARM NAME -- Metric / Threshold
elbhealth-stage-symbolapi -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-webapp -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-rabbitmq -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-elasticsearch -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-stage-socorro-collector -- UnHealthyHostCount >= 1 for 1 minute
elbhealth-prod-etherpadlite -- UnHealthyHostCount >= 1 for 1 minute
cpu-rds-techops-production -- CPUUtilization >= 85 for 5 minutes
cpu-rds-socorrotest -- CPUUtilization >= 85 for 5 minutes
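A minimal sketch of how alarms like the ones listed above could be scripted, assuming Python with boto3 rather than the CLI. The helper names (`elb_health_alarm`, `rds_cpu_alarm`, `create_alarms`) are hypothetical, as is the mapping from alarm name to load balancer name and DB instance identifier. The `Statistic` values, and reading "for 5 minutes" as a single 300-second period, are assumptions, since the thread does not state them.

```python
"""Sketch: describe CloudWatch alarms as data, then create them in a loop."""

# ARN of the alerting topic, as listed in the comment above.
ALERT_TOPIC_ARN = "arn:aws:sns:us-west-2:293989542403:AWS-alerts-mocotools"


def elb_health_alarm(alarm_name, load_balancer_name, topic_arn=ALERT_TOPIC_ARN):
    """Arguments for an 'UnHealthyHostCount >= 1 for 1 minute' alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ELB",
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        "Statistic": "Maximum",  # assumption: not stated in the thread
        "Period": 60,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def rds_cpu_alarm(alarm_name, db_instance_id, topic_arn=ALERT_TOPIC_ARN):
    """Arguments for a 'CPUUtilization >= 85 for 5 minutes' alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/RDS",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",  # assumption: not stated in the thread
        "Period": 300,  # assumption: one 5-minute period
        "EvaluationPeriods": 1,
        "Threshold": 85.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],
    }


def create_alarms(alarm_definitions, region="us-west-2"):
    """Create or update each alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work without it

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    for params in alarm_definitions:
        cloudwatch.put_metric_alarm(**params)


if __name__ == "__main__":
    # Hypothetical resource names derived from the alarm names above.
    definitions = [
        elb_health_alarm("elbhealth-stage-socorro-webapp", "stage-socorro-webapp"),
        rds_cpu_alarm("cpu-rds-socorrotest", "socorrotest"),
    ]
    for definition in definitions:
        print(definition["AlarmName"], definition["MetricName"])
    # create_alarms(definitions)  # uncomment to actually call AWS
```

Keeping the alarm definitions as plain dictionaries means the same list can be generated per environment or per ELB and reviewed before anything is sent to AWS.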
Updated•10 years ago
Assignee: nobody → jschneider
Comment 2•10 years ago
Instead of New Relic, we're using Datadog, under the email address mocotoolseng@mozilla.com.
Updated•10 years ago
Summary: Setup applicable monitoring on socorro via New Relic/Cloudfront → Setup applicable monitoring on socorro via datadog
Comment 4•9 years ago
jp, what else do we need to do here?
Status: NEW → ASSIGNED
Flags: needinfo?(jschneider)
Comment 5•9 years ago
The last remaining item is to get statsd up and running, which I'll file as a separate bug hooked onto this one in just a minute.
With monitoring so far, I have:
1) Set up a Pingdom account (currently just pinging the staging webapp).
2) In Datadog, integrated GitHub events for socorro and socorro-infra into the events stream (pushes, commits, etc.) [Uses the mocotoolseng GitHub account]
3) In Datadog, integrated the current buildbox events into the events stream [Run as centos agent]
4) In Datadog, integrated our AWS change audit trail into the events stream (using CloudTrail) [Using Datadog IAM creds] (Includes Scaling/Alarm alerts)
5) In Datadog, experimented with a test ES server integration (for perf stream); it looked possibly worthwhile. Will file a bug on this. [Run as centos agent]
6) In Datadog, integrated our other infra CloudWatch events/alarms into the events stream [IAM user for Datadog]
7) In Datadog, integrated RSS feeds for AWS alerts into the events feed.
8) In Datadog, integrated the new Pingdom account into the events feed.
9) In Datadog, created a couple of dashboards as demos/templates.
10) Took screencaps of using the events layer on top of performance data.
11) Automated the creation of all the ELB alerts with a script on Jenkins.
12) I maybe did other stuff too.
Dashboard example: https://p.datadoghq.com/sb/f34bd0200e
GIFs showing use of the Datadog events stream with perf data:
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-stream.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-events-saved-search.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-dashboard-event-layer.gif
* https://dl.dropboxusercontent.com/u/2273146/datadog-correlating-activity.gif
Also, a cool way Datadog lets you easily group hosts on multiple variables:
* https://dl.dropboxusercontent.com/u/2273146/datadog-infra-map.gif
We could go live with the data and logs we're now getting, with lots of room to improve. I'll file other bugs, but I'm not sure they should really block prod. At some point, we'll want to:
1) Set up statsd with datadogd client and start instrumenting the app
2) Set up some frontend RUM with Pingdom
3) Determine if we need a PagerDuty type of thing for alerting
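For the statsd item, a rough sketch of what instrumenting the app could look like, assuming a statsd-compatible daemon listening on UDP port 8125 on localhost (the usual statsd default). The `StatsdClient` class and the metric names are hypothetical and only illustrate the plain statsd line format (`<name>:<value>|<type>`); in practice an existing client library would likely be used instead.

```python
import socket


def format_metric(name, value, metric_type):
    """Format one metric in the plain statsd line format: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class StatsdClient:
    """Tiny fire-and-forget statsd client over UDP (illustrative only)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self._address = (host, port)
        self._sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _send(self, payload):
        try:
            self._sock.sendto(payload.encode("ascii"), self._address)
        except OSError:
            # Metrics are best-effort; never let them break the app.
            pass

    def incr(self, name, count=1):
        """Increment a counter."""
        self._send(format_metric(name, count, "c"))

    def timing(self, name, milliseconds):
        """Record a duration in milliseconds."""
        self._send(format_metric(name, milliseconds, "ms"))

    def gauge(self, name, value):
        """Record the current value of something."""
        self._send(format_metric(name, value, "g"))


if __name__ == "__main__":
    stats = StatsdClient()
    # Hypothetical metric names, for illustration only.
    stats.incr("socorro.collector.crashes_received")
    stats.timing("socorro.webapp.request_time", 320)
    stats.gauge("socorro.processor.queue_depth", 42)
```

UDP keeps the instrumentation cheap: if the daemon is down, the packets are simply dropped and the app carries on.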
Flags: needinfo?(jschneider)
Updated•9 years ago
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED