Closed Bug 1308228 Opened 8 years ago Closed 7 years ago

Monitoring of Bugzilla DR in AWS

Categories

(Infrastructure & Operations :: MOC: Projects, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: riweiss, Assigned: lypulong)

Details

Please establish monitors for the Bugzilla DR instance in AWS per the following requirements to achieve parity with the monitoring in SCL3:

External checks
  Pingdom ?
  DataDog (for Nubis) ?

Nagios checks

Check with the MOC for exact details of what each check does, though they are mostly self-explanatory. The lists are broken out as follows:

hostname
  hostgroup
    check, with any args

Database checks

The DBA team is primarily responsible for these and alerts will usually go to them (and NOT Pythian) instead of the MOC.

backup2.db.scl3.mozilla.com
  mysql-backups-sensitive
    check_nrpe_disk!4%!2%!/data
    check_disk_all
    check_swap!50%!25%
    check_nrpe_load!20,25,40!25,50,50
    check_mysql_backup_table_checksums
    check_backup_replication
  db-swap-high
    check_swap!10%!5%

bugzilla-reporting1.db.scl3.mozilla.com
  mysql2-puppetized-servers-sensitive
    check_mysql_config_diffs
  mysql-checksum
    check_mysql_table_checksums!24
  mysql-ro
    check_mysql_readonly
  mysql-slaves-sensitive
    check_mysql_uptime
    check_disk_all
    check_nrpe_load!20,25,40!25,50,50
    check_mysql
  mysql-slow-repl
    check_mysql_replication_lag!3600!7200
  db-swap-high
    check_swap!10%!5%

bugzilla[12].stage.db.scl3.mozilla.com
  mysql2-puppetized-servers-sensitive
    check_mysql_config_diffs
  mysql-checksum
    check_mysql_table_checksums!24
  mysql-ro-nopage
    check_mysql_readonly
  mysql-slaves-nopage
    check_disk_all
    check_ssh!22
    check_nrpe_load!20,25,40!25,50,50
    check_mysql_replication_lag!600!1200
    check_mysql
  db-swap-high
    check_swap!10%!5%

bugzilla[5678].db.scl3.mozilla.com
  mysql2-puppetized-servers-sensitive
    check_mysql_config_diffs
  mysql-checksum
    check_mysql_table_checksums!24
  mysql-ro-sensitive (slaves only)
    check_mysql_readonly
  mysql-rw-sensitive (master only)
    check_mysql_writable
  mysql-masters-sensitive (same as next, but on master)
  mysql-slaves-sensitive
    check_disk_all
    check_mysql_uptime
    check_nrpe_load!20,25,40!25,50,50
    check_mysql
    check_mysql_replication_lag!600!1200
  db-swap-high
    check_swap!10%!5%

VIP and site checks

bugzilla-rw-vip.db.scl3.mozilla.com
  schwartz-db
    check_theschwartz_queues

bugzilla-wild-zlb.vips.scl3.mozilla.com
  bugzilla-wild-vip
    check_https!bugzilla.mozilla.org!/ -p 443

wildcard.bugzilla.mozilla.org
api-dev.bugzilla.mozilla.org
  https-websites
    check_https_cert_only!14

bugzilla-zlb.vips.scl3.mozilla.com
  bugzilla-vips
    check_http!bugzilla.mozilla.org!/
    check_https!bugzilla.mozilla.org!/ -p 443

Node checks

These tend to be NRPE checks. See below for details on the generic and generic-preprod hostgroup checks.

bugzillaadm.private.scl3.mozilla.com
  generic
  bug-checks (alerting for blocker bugs)
    check_bugzilla_push, "Bugzilla - Push Backlog"
    check_serverops_bugs
    check_netops_bugs
    check_infra_bugs
    check_infra_vpn_acl_bugs
    check_infra_vpn_support_bugs
    check_devservices_bugs
    check_database_bugs
    check_desktop_bugs
    check_webops_bugs
    check_telecom_bugs
    check_mdn_bugs
    check_moc_incident_bugs
    check_moc_problem_bugs
    check_moc_project_bugs
    check_moc_service_bugs
    check_releng_loaner_bugs
    check_releng_bugs
  no_swap
    check_mem!10!6

jobqueue[12].bugs.scl3.mozilla.com
  generic
  bugs-admin
    check_nrpe_procs!/usr/libexec/postfix/master!1!1
    check_mailq_postfix!1000!1500
  no_swap
    check_mem!10!6

web[12345].bugs.scl3.mozilla.com
  generic
  bugs-web
    check_http_string!bugzilla.mozilla.org!/!Mozilla
    check_http_string!bugzilla.mozilla.org!/!Mozilla -p 443
    check_nfs_mounts!/mnt/bugzilla_prod
  zeus-scl3-https-443
    check_http_string!bugzilla.mozilla.org!/!Mozilla
    check_http_string!bugzilla.mozilla.org!/!Mozilla -p 443
    check_zeus_svc!443!zlb1.ops.scl3.mozilla.com
    check_max_clients

web[1345].stage.bugs.scl3.mozilla.com
memcache[12].stage.bugs.scl3.mozilla.com
  generic-preprod

push1.bugs.scl3.mozilla.com
  generic
  bugs-push
    check_nrpe_procs_regex!bugzilla-pushd.pl!1!1

memcache[12].bugs.scl3.mozilla.com
esfrontline1.bugs.scl3.mozilla.com
etl1.bugs.scl3.mozilla.com
  generic

elasticsearch1.bugs.scl3.mozilla.com
  generic
    check_elasticsearch_nodes
    check_nrpe_procs_regex!org.elasticsearch.bootstrap.ElasticSearch!1!1
    check_disk_all_early!70%!95%

Generic nagios checks

The major difference between these hostgroups is that generic-preprod does NOT page and checks only happen during the work week.

generic
  check_disk_all
  check_ssh!22
  check_swap!50%!25%
  check_nrpe_load!20,25,40!25,50,50
  check_ntp_time_multi!20!60
  check_dnsmasq!mozilla.com
  check_puppet_freshness!7200
  check_puppet_catalog
  check_collectd_log
  check_logwarn!/var/log/messages!Out of memory
  check_auditd
  check_uptime!15
  check_nrpe_procs_regex!collectd!1!32
  check_nrpe_procs_regex!crond!1!32
  check_nrpe_procs_regex!atd!1!6
  check_file_age!1500!1800!/var/lib/mig/mig-agent.ok

generic-preprod
  check_disk_all
  check_ssh!22
  check_swap!60%!20%
  check_nrpe_load!20,25,40!25,50,50
  check_ntp_time_multi!20!60
  check_uptime!15
  check_nrpe_procs_regex!crond!1!5
  check_nrpe_procs_regex!atd!1!6
  check_file_age!1500!1800!/var/lib/mig/mig-agent.ok
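
For anyone porting these lists to another monitoring system, it may help to split each entry into its command name and arguments. Nagios separates the command name and its arguments with `!` in a check definition. The sketch below does that split and then compares the generic and generic-preprod lists from the description. The names `Check`, `parse_check`, and `diff_hostgroups` are hypothetical helpers made up for this illustration. They are not part of Nagios or of any Mozilla tooling, and the sketch does not interpret what the individual thresholds mean.

```python
from typing import List, NamedTuple, Tuple


class Check(NamedTuple):
    command: str            # e.g. "check_swap"
    args: Tuple[str, ...]   # e.g. ("50%", "25%")


def parse_check(definition: str) -> Check:
    """Split a Nagios-style check string on '!' into command and arguments."""
    command, *args = definition.strip().split("!")
    return Check(command, tuple(args))


# Copied from the generic and generic-preprod lists in the description.
GENERIC = [
    "check_disk_all",
    "check_ssh!22",
    "check_swap!50%!25%",
    "check_nrpe_load!20,25,40!25,50,50",
    "check_ntp_time_multi!20!60",
    "check_dnsmasq!mozilla.com",
    "check_puppet_freshness!7200",
    "check_puppet_catalog",
    "check_collectd_log",
    "check_logwarn!/var/log/messages!Out of memory",
    "check_auditd",
    "check_uptime!15",
    "check_nrpe_procs_regex!collectd!1!32",
    "check_nrpe_procs_regex!crond!1!32",
    "check_nrpe_procs_regex!atd!1!6",
    "check_file_age!1500!1800!/var/lib/mig/mig-agent.ok",
]

GENERIC_PREPROD = [
    "check_disk_all",
    "check_ssh!22",
    "check_swap!60%!20%",
    "check_nrpe_load!20,25,40!25,50,50",
    "check_ntp_time_multi!20!60",
    "check_uptime!15",
    "check_nrpe_procs_regex!crond!1!5",
    "check_nrpe_procs_regex!atd!1!6",
    "check_file_age!1500!1800!/var/lib/mig/mig-agent.ok",
]


def diff_hostgroups(a: List[str], b: List[str]) -> Tuple[List[Check], List[Check]]:
    """Return (checks only in a, checks only in b), comparing command and args together."""
    parsed_a = [parse_check(c) for c in a]
    parsed_b = [parse_check(c) for c in b]
    only_a = [c for c in parsed_a if c not in parsed_b]
    only_b = [c for c in parsed_b if c not in parsed_a]
    return only_a, only_b


if __name__ == "__main__":
    only_generic, only_preprod = diff_hostgroups(GENERIC, GENERIC_PREPROD)
    for check in only_generic:
        print("generic only:        ", check.command, list(check.args))
    for check in only_preprod:
        print("generic-preprod only:", check.command, list(check.args))
```

Because the comparison includes the arguments, a check that exists in both hostgroups with different thresholds (such as check_swap) shows up on both sides, which makes threshold differences visible alongside checks that are missing entirely.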
Assignee: nobody → lypulong
We need to wait until the blockers to the DR environment (email, attachments, and file system) are cleared. r2, can you link the bugs for those three items to this one?
Flags: needinfo?(riweiss)
Depends on: 1296872
Flags: needinfo?(riweiss)
Depends on: 1298076
Status: NEW → ASSIGNED
Linda, are we killing this bug with bmo going to CloudOps?
Flags: needinfo?(lypulong)
Yes, both the production and failover sites are moving to CloudOps in Oct.
Flags: needinfo?(lypulong)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX