Closed
Bug 1283111
Opened 9 years ago
Closed 9 years ago
Determine which Nagios alerts need adjusting during the Treeherder Heroku migration
Categories
(Tree Management :: Treeherder: Infrastructure, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: emorley, Assigned: emorley)
As part of the migration, the Treeherder DNS will be updated to point https://treeherder.{mozilla,allizom}.org at the new Heroku instances.
Some Nagios checks will be pointed at that domain, and no doubt some others at the Zeus VIPs or individual hostnames.
We need to figure out:
1) Which checks pointed at the standard domain rely on SCL3-only functionality (e.g. the /server_info endpoint)
2) Which checks point at the VIP (or some other SCL3-specific hostname) and need to be replicated in some form in the Heroku world (if any)
Kendall, I don't suppose you know what checks are currently set up?
I'm guessing the list includes:
* rabbitmq queue sizes
* site SSL certificate validity
* individual node cpu/disk/... checks
* various DB checks
* ...?
Thanks :-)
Flags: needinfo?(klibby)
Comment 1•9 years ago
Checks should be fairly self-evident, but let me know if you'd like details on specific checks:
DB all
check_command => 'check_mysql_config_diffs',
check_command => 'check_mysql_table_checksums!24',
check_command => 'check_swap!10%!5%',
check_command => 'check_mysql_uptime',
check_command => "check_mysql!${secrets::nagiosdaemon::username}!${secrets::nagiosdaemon::password}",
DB slaves
check_command => "check_mysql_replication_lag!600!1200",
check_command => 'check_mysql_readonly',
VMs (all)
check_command => 'check_disk_all',
check_command => 'check_ssh!22',
check_command => 'check_swap!50%!25%',
check_command => 'check_nrpe_load!20,25,40!25,50,50',
check_command => 'check_ntp_time_multi!20!60',
check_command => 'check_dnsmasq!mozilla.com',
check_command => 'check_puppet_freshness!7200',
check_command => 'check_puppet_catalog',
check_command => 'check_auditd',
check_command => 'check_uptime!15',
check_command => 'check_nrpe_procs_regex!collectd!1!32',
check_command => 'check_nrpe_procs_regex!crond!1!32',
check_command => 'check_nrpe_procs_regex!atd!1!6',
check_command => 'check_file_age!1500!1800!/var/lib/mig/mig-agent.ok',
rabbit nodes
check_command => 'check_nrpe_procs_regex!rabbitmq!1!5',
check_command => 'check_tcp!5672',
check_command => 'check_rabbitmq_overview!2000,2000,2000!4000,4000,4000,',
web nodes
check_command => 'check_zeus_svc!8080!zlb1.ops.scl3.mozilla.com',
treeherder.m.o
check_command => 'check_https_cert_only!14', (cert expiration)
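For readers less familiar with the syntax: a Nagios check_command value is the command name followed by its arguments, separated by "!". A minimal Python sketch of splitting the entries above follows. The function name parse_check_command is hypothetical, and what each argument means (warning vs critical threshold, units) depends on the command definitions, which are not shown in this bug.

```python
def parse_check_command(value):
    """Split a Nagios check_command value into (command_name, [arguments])."""
    # Nagios separates the command name and each argument with "!".
    command, *args = value.split("!")
    return command, args


# Entries copied from the list above.
print(parse_check_command("check_mysql_replication_lag!600!1200"))
print(parse_check_command("check_uptime!15"))
print(parse_check_command("check_auditd"))
```

Entries without a "!" (such as check_auditd) simply yield an empty argument list.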
Flags: needinfo?(klibby)
Comment 2•9 years ago (Assignee)
Thank you - that's really helpful. I can't see any checks that have to be turned off when we switch, to prevent false alarms.
As for us needing to come up with equivalents in the Heroku/RDS world, the only ones that seem relevant are:
DB:
* check_mysql_config_diffs -> when we have the Terraform configs in a public repo (ideally Treeherder's) we can diff against the Travis/Vagrant mysql config, and, I guess, check that the Terraform config matches what is running on RDS.
* Various other DB checks -> We can replace with RDS Cloudwatch alerts.
VMs (all):
* I don't think most of these are relevant/available to us. I think we're best off setting alerts for errors/response times in New Relic instead.
rabbitmq:
* we can use the CloudAMQP alerts, as well as the New Relic plugin alerting features
web:
* We can replace this with the New Relic ping checks (synthetics)
treeherder.m.o:
* we'll likely want this check left in place in Nagios
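As an illustration of what the certificate check being kept covers, here is a minimal Python sketch of a cert-expiry test. It assumes that the "14" in check_https_cert_only!14 is a threshold in days, which the bug does not state; the function names are hypothetical and this is not the Nagios plugin's actual implementation.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate's notAfter time."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400


def fetch_not_after(host, port=443, timeout=10):
    """Return the notAfter string of the certificate served by host."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def cert_is_ok(host, warn_days=14):
    """True if the served certificate has more than warn_days left."""
    # warn_days=14 mirrors the assumed meaning of the Nagios argument.
    return days_until_expiry(fetch_not_after(host)) > warn_days
```

Because the check only needs the public hostname, it should keep working unchanged once DNS points at Heroku, which is consistent with leaving it in Nagios.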
Is there prior art for how to best get New Relic/Cloudwatch alerts to Nagios/MOC?
Flags: needinfo?(klibby)
Comment 3•9 years ago
(In reply to Ed Morley [:emorley] from comment #2)
>
> Is there prior art for how to best get New Relic/Cloudwatch alerts to
> Nagios/MOC?
Not to my knowledge; roping in Linda so we can work out best path forward.
Flags: needinfo?(klibby) → needinfo?(lypulong)
Comment 4•9 years ago
Or, I could remember this page:
https://mana.mozilla.org/wiki/display/MOC/How+to+request+MOC+support+for+a+new+production+system+or+service
Flags: needinfo?(lypulong)
Comment 5•9 years ago (Assignee)
That's not quite the same though.
This isn't a new system per se (though I'm sure we'll go through a similar process), but more importantly, prior to filling that page out, it would be helpful to know how other non-SCL3 systems have hooked into Nagios or MOC systems, so we can figure out an approach.
Linda, could you advise as to comment 3? Thanks :-)
Flags: needinfo?(lypulong)
Comment 6•9 years ago
We can integrate into PagerDuty. It is a native integration, so it should not take long.
Keegan or Peter, can you take this bug and get the needed checks added to PagerDuty, with some quick documentation on how to react in case of alerting?
Feel free to pull in ryanc or jedi.
Flags: needinfo?(pradcliffe+bugzilla)
Flags: needinfo?(lypulong)
Flags: needinfo?(kferrando)
Comment 7•9 years ago
There are things being monitored in Datadog for CloudWatch and RDS. We haven't done anything direct from CloudWatch to PagerDuty that I'm aware of. Did you want to set up thresholds and such in CloudWatch? The AWS account details will be needed to set things up there:
https://www.pagerduty.com/blog/aws-cloudwatch-now-integrates-with-pagerduty/
http://docs.datadoghq.com/integrations/awsrds/
For New Relic, I believe we currently only have email alerts coming in. It can be integrated directly with PagerDuty, but again we'll need account information:
https://www.pagerduty.com/blog/new-relic-integration-with-pagerduty/
What timescale are we talking about? This isn't a quick set of things to set up and tune to make sure they don't spam the hell out of the pager. Alerts should go to the people working on it first, until the volume is down to a reasonable level, and then be migrated to us.
Do people have pagerduty accounts? Those will need to be set up.
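To make the "thresholds in CloudWatch" question concrete, here is a hedged Python sketch of what a replacement for the existing replication-lag check could look like as a CloudWatch alarm definition. It only builds the parameters for CloudWatch's PutMetricAlarm call and sends nothing. The instance identifier and SNS topic ARN are placeholders, the period and evaluation counts are illustrative guesses, and it assumes 600 and 1200 in check_mysql_replication_lag!600!1200 are warning and critical thresholds in seconds.

```python
def replica_lag_alarm(db_instance_id, threshold_seconds, sns_topic_arn):
    """Build keyword arguments for a CloudWatch PutMetricAlarm request."""
    return {
        "AlarmName": f"{db_instance_id}-replica-lag-{threshold_seconds}s",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",  # RDS reports this metric in seconds
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": db_instance_id},
        ],
        "Statistic": "Average",
        "Period": 60,            # illustrative: one-minute datapoints
        "EvaluationPeriods": 5,  # illustrative: sustained for five minutes
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder identifiers, not real resources.
TOPIC = "arn:aws:sns:us-east-1:000000000000:example-alerts"
warning = replica_lag_alarm("example-replica", 600, TOPIC)
critical = replica_lag_alarm("example-replica", 1200, TOPIC)
print(warning["AlarmName"], critical["AlarmName"])
```

With boto3 installed and credentials configured, each dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**warning)`. Starting with a topic that notifies only the Treeherder team would match the suggestion above that alerts go to the people working on the system first.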
Flags: needinfo?(pradcliffe+bugzilla)
Updated•9 years ago
Flags: needinfo?(kferrando)
Comment 8•9 years ago (Assignee)
Until the migration is complete, and we've tweaked New Relic/Cloudwatch thresholds appropriately, we'll manage the Treeherder Heroku instance ourselves. At that point, I'll file a new bug for setting things up more - thank you for the info about pagerduty etc.
Closing this bug out for now, since the answer to its question is:
During prod migration, we need to silence these two:
check_command => 'check_mysql_config_diffs',
check_command => 'check_mysql_readonly',
(this will happen as part of the migration checklist, bug incoming)
All other checks can remain until the VMs are decommed.
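The conclusion can be expressed as a small Python sketch that splits a list of check_command values into those to silence during the prod migration and those to leave alone. The function and constant names are hypothetical; the two silenced check names are the ones given in this comment.

```python
# The two checks this bug concluded need silencing during prod migration.
SILENCE_DURING_MIGRATION = {"check_mysql_config_diffs", "check_mysql_readonly"}


def partition_checks(check_commands):
    """Split check_command values into (to_silence, to_keep) lists."""
    to_silence, to_keep = [], []
    for value in check_commands:
        # The command name is everything before the first "!".
        name = value.split("!", 1)[0]
        if name in SILENCE_DURING_MIGRATION:
            to_silence.append(value)
        else:
            to_keep.append(value)
    return to_silence, to_keep


# A few entries from comment 1.
checks = [
    "check_mysql_config_diffs",
    "check_mysql_replication_lag!600!1200",
    "check_mysql_readonly",
    "check_https_cert_only!14",
]
to_silence, to_keep = partition_checks(checks)
print("silence:", to_silence)
print("keep:", to_keep)
```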
Assignee: nobody → emorley
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED