Closed Bug 1308228 Opened 8 years ago Closed 7 years ago

Monitoring of Bugzilla DR in AWS

Categories

(Infrastructure & Operations :: MOC: Projects, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: riweiss, Assigned: lypulong)

Details

Please establish monitors for the Bugzilla DR instance in AWS per the following requirements to achieve parity with the monitoring in SCL3:

External checks
Pingdom
?
DataDog (for Nubis)
?

Nagios checks
Check with the MOC for exact details of what each check does, though they are mostly self-explanatory. 

The lists are broken out as follows:
hostname
hostgroup
check, with any args
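
To make the "check, with any args" notation concrete, here is a minimal sketch of splitting an entry on `!`, the separator Nagios uses between a command name and the values it passes as `$ARG1$`, `$ARG2$`, and so on. The `Check` type and `parse_check` helper are hypothetical names for illustration only. What each argument means depends on the command definition in the Nagios config, which is not reproduced in this bug.

```python
from typing import List, NamedTuple


class Check(NamedTuple):
    """One parsed entry from the lists in this bug (hypothetical helper type)."""
    command: str     # e.g. "check_swap"
    args: List[str]  # e.g. ["50%", "25%"]; empty when the check takes no args


def parse_check(entry: str) -> Check:
    """Split a 'command!arg1!arg2' entry into the command and its arguments."""
    command, *args = entry.strip().split("!")
    return Check(command, args)


if __name__ == "__main__":
    for entry in ("check_swap!50%!25%",
                  "check_nrpe_load!20,25,40!25,50,50",
                  "check_disk_all"):
        print(parse_check(entry))
```

Note that an entry such as `check_https!bugzilla.mozilla.org!/ -p 443` yields two arguments, the second being `/ -p 443` as written.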

Database checks
The DBA team is primarily responsible for these, and alerts will usually go to them (NOT to Pythian) rather than to the MOC.

backup2.db.scl3.mozilla.com
mysql-backups-sensitive
check_nrpe_disk!4%!2%!/data
check_disk_all
check_swap!50%!25%
check_nrpe_load!20,25,40!25,50,50
check_mysql_backup_table_checksums
check_backup_replication
db-swap-high
check_swap!10%!5%

bugzilla-reporting1.db.scl3.mozilla.com
mysql2-puppetized-servers-sensitive
check_mysql_config_diffs
mysql-checksum
check_mysql_table_checksums!24
mysql-ro
check_mysql_readonly
mysql-slaves-sensitive
check_mysql_uptime
check_disk_all
check_nrpe_load!20,25,40!25,50,50
check_mysql
mysql-slow-repl
check_mysql_replication_lag!3600!7200
db-swap-high
check_swap!10%!5%

bugzilla[12].stage.db.scl3.mozilla.com
mysql2-puppetized-servers-sensitive
check_mysql_config_diffs
mysql-checksum
check_mysql_table_checksums!24
mysql-ro-nopage
check_mysql_readonly
mysql-slaves-nopage
check_disk_all
check_ssh!22
check_nrpe_load!20,25,40!25,50,50
check_mysql_replication_lag!600!1200
check_mysql
db-swap-high
check_swap!10%!5%

bugzilla[5678].db.scl3.mozilla.com
mysql2-puppetized-servers-sensitive
check_mysql_config_diffs
mysql-checksum
check_mysql_table_checksums!24
mysql-ro-sensitive (slaves only)
check_mysql_readonly
mysql-rw-sensitive (master only)
check_mysql_writable
mysql-masters-sensitive (same as next, but on master)
mysql-slaves-sensitive
check_disk_all
check_mysql_uptime
check_nrpe_load!20,25,40!25,50,50
check_mysql
check_mysql_replication_lag!600!1200
db-swap-high
check_swap!10%!5%
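
Hostnames such as bugzilla[12].stage.db.scl3.mozilla.com and bugzilla[5678].db.scl3.mozilla.com appear to use a bracket shorthand for one host per digit. Assuming that reading is correct, the sketch below expands a pattern into individual hostnames. The `expand_hosts` helper is hypothetical and handles only a single bracket group of digits.

```python
import re
from typing import List

# Matches one bracket group of digits, e.g. "[5678]".
_BRACKET = re.compile(r"\[([0-9]+)\]")


def expand_hosts(pattern: str) -> List[str]:
    """Expand the first [digits] group into one hostname per digit.

    Assumption: each digit inside the brackets names a separate host.
    A pattern without brackets is returned unchanged.
    """
    match = _BRACKET.search(pattern)
    if match is None:
        return [pattern]
    prefix, suffix = pattern[:match.start()], pattern[match.end():]
    return [prefix + digit + suffix for digit in match.group(1)]


if __name__ == "__main__":
    for host in expand_hosts("bugzilla[5678].db.scl3.mozilla.com"):
        print(host)
```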

VIP and site checks
bugzilla-rw-vip.db.scl3.mozilla.com
schwartz-db
check_theschwartz_queues
bugzilla-wild-zlb.vips.scl3.mozilla.com
bugzilla-wild-vip
check_https!bugzilla.mozilla.org!/ -p 443
wildcard.bugzilla.mozilla.org
api-dev.bugzilla.mozilla.org
https-websites
check_https_cert_only!14
bugzilla-zlb.vips.scl3.mozilla.com
bugzilla-vips
check_http!bugzilla.mozilla.org!/
check_https!bugzilla.mozilla.org!/ -p 443
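
The exact behavior of check_https_cert_only!14 is defined in the Nagios config and should be confirmed with the MOC. As a rough sketch, assuming the argument is a minimum number of days the certificate must remain valid, the logic might look like the following. The `cert_status` function and the choice of WARNING versus CRITICAL are assumptions for illustration; only the numeric exit codes are the standard Nagios plugin values.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def cert_status(not_after: datetime, now: datetime, min_days: int = 14) -> int:
    """Classify a certificate by how long it remains valid.

    Assumption: an expired certificate is CRITICAL, and one expiring in
    fewer than min_days days is WARNING. The real check may differ.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return CRITICAL
    if remaining < timedelta(days=min_days):
        return WARNING
    return OK


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(cert_status(now + timedelta(days=10), now))
```

A DR-side equivalent would feed `not_after` from the certificate actually served by the AWS endpoint; fetching it is left out here to keep the sketch free of network calls.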

Node checks
These tend to be NRPE checks. See below for details on the generic and generic-preprod hostgroup checks.

bugzillaadm.private.scl3.mozilla.com
generic
bug-checks (alerting for blocker bugs)
check_bugzilla_push, "Bugzilla - Push Backlog"
check_serverops_bugs
check_netops_bugs
check_infra_bugs
check_infra_vpn_acl_bugs
check_infra_vpn_support_bugs
check_devservices_bugs
check_database_bugs
check_desktop_bugs
check_webops_bugs
check_telecom_bugs
check_mdn_bugs
check_moc_incident_bugs
check_moc_problem_bugs
check_moc_project_bugs
check_moc_service_bugs
check_releng_loaner_bugs
check_releng_bugs
no_swap
check_mem!10!6

jobqueue[12].bugs.scl3.mozilla.com
generic
bugs-admin
check_nrpe_procs!/usr/libexec/postfix/master!1!1
check_mailq_postfix!1000!1500
no_swap
check_mem!10!6

web[12345].bugs.scl3.mozilla.com
generic
bugs-web
check_http_string!bugzilla.mozilla.org!/!Mozilla
check_http_string!bugzilla.mozilla.org!/!Mozilla -p 443
check_nfs_mounts!/mnt/bugzilla_prod
zeus-scl3-https-443
check_http_string!bugzilla.mozilla.org!/!Mozilla
check_http_string!bugzilla.mozilla.org!/!Mozilla -p 443
check_zeus_svc!443!zlb1.ops.scl3.mozilla.com
check_max_clients

web[1345].stage.bugs.scl3.mozilla.com
memcache[12].stage.bugs.scl3.mozilla.com
generic-preprod

push1.bugs.scl3.mozilla.com
generic
bugs-push
check_nrpe_procs_regex!bugzilla-pushd.pl!1!1

memcache[12].bugs.scl3.mozilla.com
esfrontline1.bugs.scl3.mozilla.com
etl1.bugs.scl3.mozilla.com
generic

elasticsearch1.bugs.scl3.mozilla.com
generic
check_elasticsearch_nodes
check_nrpe_procs_regex!org.elasticsearch.bootstrap.ElasticSearch!1!1
check_disk_all_early!70%!95%

Generic Nagios checks
The major difference between these hostgroups is that generic-preprod does NOT page, and its checks run only during the work week.

generic
check_disk_all
check_ssh!22
check_swap!50%!25%
check_nrpe_load!20,25,40!25,50,50
check_ntp_time_multi!20!60
check_dnsmasq!mozilla.com
check_puppet_freshness!7200
check_puppet_catalog
check_collectd_log
check_logwarn!/var/log/messages!Out of memory
check_auditd
check_uptime!15
check_nrpe_procs_regex!collectd!1!32
check_nrpe_procs_regex!crond!1!32
check_nrpe_procs_regex!atd!1!6
check_file_age!1500!1800!/var/lib/mig/mig-agent.ok
generic-preprod
check_disk_all
check_ssh!22
check_swap!60%!20%
check_nrpe_load!20,25,40!25,50,50
check_ntp_time_multi!20!60
check_uptime!15
check_nrpe_procs_regex!crond!1!5
check_nrpe_procs_regex!atd!1!6
check_file_age!1500!1800!/var/lib/mig/mig-agent.ok
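
Besides paging and schedule, the two hostgroups also differ in which checks they carry. The sketch below compares the two lists exactly as written above. The `only_in` helper is a hypothetical name, and entries are compared verbatim, so a check with different thresholds counts as a different entry.

```python
GENERIC = [
    "check_disk_all",
    "check_ssh!22",
    "check_swap!50%!25%",
    "check_nrpe_load!20,25,40!25,50,50",
    "check_ntp_time_multi!20!60",
    "check_dnsmasq!mozilla.com",
    "check_puppet_freshness!7200",
    "check_puppet_catalog",
    "check_collectd_log",
    "check_logwarn!/var/log/messages!Out of memory",
    "check_auditd",
    "check_uptime!15",
    "check_nrpe_procs_regex!collectd!1!32",
    "check_nrpe_procs_regex!crond!1!32",
    "check_nrpe_procs_regex!atd!1!6",
    "check_file_age!1500!1800!/var/lib/mig/mig-agent.ok",
]

GENERIC_PREPROD = [
    "check_disk_all",
    "check_ssh!22",
    "check_swap!60%!20%",
    "check_nrpe_load!20,25,40!25,50,50",
    "check_ntp_time_multi!20!60",
    "check_uptime!15",
    "check_nrpe_procs_regex!crond!1!5",
    "check_nrpe_procs_regex!atd!1!6",
    "check_file_age!1500!1800!/var/lib/mig/mig-agent.ok",
]


def only_in(first, second):
    """Return entries of `first` that do not appear verbatim in `second`."""
    seen = set(second)
    return [entry for entry in first if entry not in seen]


if __name__ == "__main__":
    print("generic only:")
    for entry in only_in(GENERIC, GENERIC_PREPROD):
        print("  " + entry)
    print("generic-preprod only:")
    for entry in only_in(GENERIC_PREPROD, GENERIC):
        print("  " + entry)
```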
Assignee: nobody → lypulong
We need to wait until the blockers for the DR environment (email, attachments, and file system) are cleared.

r2, can you link the bugs for the three items to this one?
Flags: needinfo?(riweiss)
Depends on: 1296872
Flags: needinfo?(riweiss)
Depends on: 1298076
Status: NEW → ASSIGNED
Linda, are we killing this bug with bmo going to CloudOps?
Flags: needinfo?(lypulong)
Yes, both the production and failover sites are moving to CloudOps in Oct.
Flags: needinfo?(lypulong)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX