1283111 - Determine which Nagios alerts need adjusting during the Treeherder Heroku migration

Assignee

Description

•

9 years ago

As part of the migration, the Treeherder DNS will be updated to point https://treeherder.{mozilla,allizom}.org at the new Heroku instances. Some Nagios checks will be pointed at that domain, and no doubt some others at the Zeus VIPs or individual hostnames.

We need to figure out:
1) Which checks pointed at the standard domain rely on SCL3-only functionality (eg the /server_info endpoint)
2) Which checks point at the VIP (/some other SCL3 specific hostname) that need to be replicated in some form in the Heroku world (if any)

Kendall, I don't suppose you know what checks are currently set up? I'm guessing the list includes:
* rabbitmq queue sizes
* site SSL certificate validity
* individual node cpu/disk/... checks
* various DB checks
* ...?

Thanks :-)

Flags: needinfo?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 1

•

9 years ago

Checks should be fairly self-evident, but let me know if you'd like details on specific checks:

DB all
check_command => 'check_mysql_config_diffs',
check_command => 'check_mysql_table_checksums!24',
check_command => 'check_swap!10%!5%',
check_command => 'check_mysql_uptime',
check_command => "check_mysql!${secrets::nagiosdaemon::username}!${secrets::nagiosdaemon::password}",

DB slaves
check_command => "check_mysql_replication_lag!600!1200",
check_command => 'check_mysql_readonly',

VMs (all)
check_command => 'check_disk_all',
check_command => 'check_ssh!22',
check_command => 'check_swap!50%!25%',
check_command => 'check_nrpe_load!20,25,40!25,50,50',
check_command => 'check_ntp_time_multi!20!60',
check_command => 'check_dnsmasq!mozilla.com',
check_command => 'check_puppet_freshness!7200',
check_command => 'check_puppet_catalog',
check_command => 'check_auditd',
check_command => 'check_uptime!15',
check_command => 'check_nrpe_procs_regex!collectd!1!32',
check_command => 'check_nrpe_procs_regex!crond!1!32',
check_command => 'check_nrpe_procs_regex!atd!1!6',
check_command => 'check_file_age!1500!1800!/var/lib/mig/mig-agent.ok',

rabbit nodes
check_command => 'check_nrpe_procs_regex!rabbitmq!1!5',
check_command => 'check_tcp!5672',
check_command => 'check_rabbitmq_overview!2000,2000,2000!4000,4000,4000,',

web nodes
check_command => 'check_zeus_svc!8080!zlb1.ops.scl3.mozilla.com',

treeherder.m.o
check_command => 'check_https_cert_only!14', (cert expiration)
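For readers unfamiliar with the notation: in Nagios, a check_command value is the command name followed by `!`-separated arguments, which Nagios passes to the command definition as $ARG1$, $ARG2$ and so on. A minimal Python sketch of that split follows. The helper name `parse_check_command` is made up for illustration, and what each argument means depends on the individual command definition, which is not shown in this bug.

```python
def parse_check_command(value):
    """Split a Nagios check_command value into (command_name, [args])."""
    # Arguments are separated by "!"; a command with no "!" has no arguments.
    name, *args = value.split("!")
    return name, args


# Examples taken from the list above.
print(parse_check_command("check_mysql_replication_lag!600!1200"))
print(parse_check_command("check_nrpe_load!20,25,40!25,50,50"))
print(parse_check_command("check_mysql_uptime"))
```

Values that are commonly a warning/critical pair (such as 600 and 1200 for replication lag) come back as plain strings; interpreting them is left to whoever knows the command definition.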

Flags: needinfo?(klibby)

Ed Morley [:emorley]

Assignee

Comment 2

•

9 years ago

Thank you - that's really helpful.

I can't see any checks that have to be turned off when we switch, to prevent false alarms.

As for us needing to come up with equivalents in the Heroku/RDS world, the only ones that seem relevant are:

DB:
* check_mysql_config_diffs -> when we have the Terraform configs in a public repo (ideally Treeherder's) we can diff against the Travis/Vagrant mysql config, plus I guess check Terraform config matching what is running on RDS.
* Various other DB checks -> We can replace with RDS Cloudwatch alerts.

VMs (all):
* I don't think most of these are relevant/available to us. Think we're best setting alerts for errors/response times on New Relic instead

rabbitmq:
* we can use the CloudAMQP alerts, as well as the New Relic plugin alerting features

web:
* We can replace this with the New Relic ping checks (synthetics)

treeherder.m.o:
* we'll likely want this check left in place in Nagios

Is there prior art for how to best get New Relic/Cloudwatch alerts to Nagios/MOC?
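As a rough illustration of what the one check that stays in Nagios (the cert expiration check, `check_https_cert_only!14`) evaluates, here is a minimal Python sketch. It assumes the 14 is a threshold in days, which this bug does not confirm, and the function names are hypothetical. It works on the `notAfter` string that Python's `ssl` module reports for a peer certificate, so it needs no network access.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Return whole days between `now` and a certificate's notAfter time.

    `not_after` uses the format found in getpeercert()["notAfter"],
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def cert_needs_attention(not_after, threshold_days=14, now=None):
    """True when the certificate expires in fewer than `threshold_days` days.

    The default of 14 is an assumed reading of the check's argument.
    """
    return days_until_expiry(not_after, now) < threshold_days
```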

Flags: needinfo?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 3

•

9 years ago

(In reply to Ed Morley [:emorley] from comment #2)
> Is there prior art for how to best get New Relic/Cloudwatch alerts to
> Nagios/MOC?

Not to my knowledge; roping in Linda so we can work out best path forward.

Flags: needinfo?(klibby) → needinfo?(lypulong)

Kendall Libby [:fubar] (he/him)

Comment 4

•

9 years ago

Or, I could remember this page: https://mana.mozilla.org/wiki/display/MOC/How+to+request+MOC+support+for+a+new+production+system+or+service

Flags: needinfo?(lypulong)

Ed Morley [:emorley]

Assignee

Comment 5

•

9 years ago

That's not quite the same though. This isn't a new system per se (though I'm sure we'll go through a similar process), but more importantly, prior to filling that page out, it would be helpful to know how other non-SCL3 systems have hooked into Nagios or MOC systems, so we can figure out an approach.

Linda, could you advise as to comment 3? Thanks :-)

Flags: needinfo?(lypulong)

Linda Ypulong [:unixfairy]

Comment 6

•

9 years ago

We can integrate into PagerDuty. It is a native integration, so it should not take long.

Keegan or Peter, can you take this bug and get the needed checks added to PagerDuty, with some quick documentation on how to react in case of alerting? Feel free to pull in ryanc or jedi.

Flags: needinfo?(pradcliffe+bugzilla)

Flags: needinfo?(lypulong)

Flags: needinfo?(kferrando)

Peter Radcliffe [:pir]

Comment 7

•

9 years ago

There are things being monitored in datadog for cloudwatch and rds. We haven't done anything direct from cloudwatch to pagerduty that I'm aware of. Did you want to set up thresholds and such in cloudwatch? The AWS account details will be needed to set things up there:
https://www.pagerduty.com/blog/aws-cloudwatch-now-integrates-with-pagerduty/
http://docs.datadoghq.com/integrations/awsrds/

For New Relic, I believe we currently only have email alerts coming in. It can be integrated directly with pagerduty, but again we'll need account information:
https://www.pagerduty.com/blog/new-relic-integration-with-pagerduty/

What timescale are we talking about? This isn't a quick set of things to set up and tune to make sure they don't spam the hell out of the pager. Alerts should go to the people working on it first until they're down to a reasonable level, then be migrated to us when reasonable to do so.

Do people have pagerduty accounts? Those will need to be set up.
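To make "thresholds in cloudwatch" concrete, here is a hedged Python sketch that builds the parameters for a CloudWatch alarm on the RDS `ReplicaLag` metric, as a possible stand-in for `check_mysql_replication_lag!600!1200` from comment 1. It assumes alerts are routed through an SNS topic that PagerDuty subscribes to; the instance identifier, topic ARN, alarm name, period and evaluation count are all hypothetical placeholders, not values from this bug.

```python
def replica_lag_alarm_params(db_instance_id, threshold_seconds, sns_topic_arn):
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""
    return {
        "AlarmName": f"{db_instance_id}-replica-lag-{threshold_seconds}s",
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",  # RDS reports this metric in seconds
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "Statistic": "Average",
        "Period": 60,            # assumed: evaluate one-minute datapoints
        "EvaluationPeriods": 5,  # assumed: require five consecutive breaches
        "Threshold": float(threshold_seconds),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical identifiers; 600 mirrors the first value of the Nagios check.
params = replica_lag_alarm_params(
    "treeherder-prod-ro", 600, "arn:aws:sns:us-east-1:123456789012:pagerduty-alerts"
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain function makes it easy to review and tune thresholds before anything is sent to the pager, which matches the concern above about noisy alerts.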

Flags: needinfo?(pradcliffe+bugzilla)

Keegan Ferrando [:fauweh]

Updated

•

9 years ago

Flags: needinfo?(kferrando)

Ed Morley [:emorley]

Assignee

Comment 8

•

9 years ago

Until the migration is complete, and we've tweaked New Relic/Cloudwatch thresholds appropriately, we'll manage the Treeherder Heroku instance ourselves. At that point, I'll file a new bug for setting things up more - thank you for the info about pagerduty etc.

Closing this bug out for now, since the answer to its question is:

During prod migration, we need to silence these two:
check_command => 'check_mysql_config_diffs',
check_command => 'check_mysql_readonly',
(this will happen as part of the migration checklist, bug incoming)

All other checks can remain until the VMs are decommed.
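The bug does not say which mechanism was used to silence the two checks. One common option in Nagios is to schedule service downtime through the external command file; the Python sketch below formats such a `SCHEDULE_SVC_DOWNTIME` command line. The helper name is made up, and the host name and service description in the usage comment are hypothetical, since the real ones are not given here.

```python
import time


def schedule_svc_downtime(host, service, duration_s, author, comment, now=None):
    """Format a Nagios SCHEDULE_SVC_DOWNTIME external command (fixed downtime)."""
    start = int(now if now is not None else time.time())
    end = start + duration_s
    fields = [
        "SCHEDULE_SVC_DOWNTIME", host, service,
        str(start), str(end),
        "1",  # fixed downtime: runs exactly from start to end
        "0",  # trigger_id 0: not triggered by another downtime
        str(duration_s), author, comment,
    ]
    return f"[{start}] " + ";".join(fields)


# Hypothetical host and service names:
# line = schedule_svc_downtime("treeherder-db1", "MySQL Read Only",
#                              3600, "emorley", "Heroku migration")
# The resulting line would then be appended to the Nagios command file.
```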

Assignee: nobody → emorley

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)

Tracking

(Not tracked)

People

(Reporter: emorley, Assigned: emorley)

Security

(public)
