Move nagios CI checks out of scl3
Categories
(Release Engineering :: General, enhancement, P1)
Tracking
(Not tracked)
People
(Reporter: nthomas, Assigned: ciduty)
Attachments
(19 files, 15 obsolete files)
11.34 KB, patch | ryanc: review+
4.35 KB, patch | ryanc: review+
16.88 KB, patch | ryanc: review+
2.03 KB, patch | ryanc: review+
6.60 KB, patch | fubar: review+, ryanc: review+
8.46 KB, patch | mozilla: review+
196.55 KB, patch | nthomas: review+
22.28 KB, patch | dividehex: review+
58.17 KB, patch | dividehex: review+
15.59 KB, patch | dividehex: review+
16.80 KB, patch | dividehex: review+
39.60 KB, patch | dividehex: review+
42.97 KB, patch | dividehex: review+
37.55 KB, patch | nthomas: review+, dhouse: review+
1.87 KB, patch
6.47 KB, patch | nthomas: review+
15.06 KB, patch | garbas: review+
607 bytes, patch | dlabici: review+
2.67 KB, patch | garbas: review+
Comment 97•6 years ago
|
||
(In reply to Ryan C [:ryanc] (UTC-4) from comment #96)
Comment on attachment 9034275 [details] [diff] [review]
comment70.patch
Since this host is in AWS use1, it has no relation to ESX in mdc1. The
parents should be the firewalls in mdc1 because monitoring depends on
those for links to use1 etc.
- 'releng-puppet1.srv.releng.use1.mozilla.com' => {
parents => 'esx-cluster1.ops.mdc1.mozilla.com',
Are these the mdc1 firewalls that I should use for the puppet servers and the log aggregation instances?
parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
If I use traceroute, it reports a different 'fw1' node from the ones above for both puppet servers and all of the log aggregation instances, although it is the same 'fw1' node for every host in mdc1, use1 and usw2.
I'm confused because the puppet servers and the log aggregator in mdc1 (already present in the Nagios configuration) have 'esx-cluster1.ops.mdc1.mozilla.com' as their parent host:
'releng-puppet1.srv.releng.mdc1.mozilla.com' => {
parents => 'esx-cluster1.ops.mdc1.mozilla.com',
contact_groups => 'build',
hostgroups => [
'puppetagain-masters'
]
},
Thank you.
Comment 98•6 years ago
|
||
Outside of mdc[12], use the firewalls.
Looking in vcenter, "releng-puppet1.srv.releng.mdc1.mozilla.com" is indeed a VM and should use "esx-cluster1.ops.mdc1.mozilla.com" as the parent.
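The rule above can be sketched as two host entries side by side. This is only an illustration that reuses values quoted in this bug: the `$example_hosts` variable name is hypothetical, and the `contact_groups` and `hostgroups` of the use1 entry are assumed to match the mdc1 entry.

```puppet
# Illustrative only: $example_hosts is a hypothetical variable name.
$example_hosts = {
  # VM hosted in mdc1: the parent is the ESX cluster it runs on.
  'releng-puppet1.srv.releng.mdc1.mozilla.com' => {
    parents        => 'esx-cluster1.ops.mdc1.mozilla.com',
    contact_groups => 'build',
    hostgroups     => ['puppetagain-masters'],
  },
  # Host in AWS use1: monitoring reaches it through the mdc1 firewalls.
  # contact_groups and hostgroups are assumed to match the mdc1 entry.
  'releng-puppet1.srv.releng.use1.mozilla.com' => {
    parents        => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
    contact_groups => 'build',
    hostgroups     => ['puppetagain-masters'],
  },
}
```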
Comment 99•6 years ago
|
||
added mdc1 firewalls as parents
parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
Comment 100•6 years ago
|
||
Comment on attachment 9035166 [details] [diff] [review]
comment70.patch
LGTM
Comment 101•6 years ago
|
||
I've replaced the parent hosts with the mdc1 firewalls, since the releng services hosts are located outside of mdc.
Reporter | ||
Comment 102•6 years ago
|
||
Comment on attachment 9035166 [details] [diff] [review]
comment70.patch
diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
...
'syslog-open-connections-1514' => {
...
hostgroups => $nagiosbot ? {
'nagios-releng-mdc1' => [
'open-tcp-1514',
],
default => [
Nit, the indentation on default should be fixed up when landing.
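As a rough illustration of the nit, a selector with `default` aligned to the other case might look like the sketch below. The `$hostgroups` variable and the example value of `$nagiosbot` are assumptions made so the fragment stands alone, and the contents of the `default` branch are not shown in the quoted patch, so they are left out here too.

```puppet
# $nagiosbot is set elsewhere in the real manifests; this value is an example.
$nagiosbot = 'nagios-releng-mdc1'

# Both cases sit at the same indentation level inside the selector.
$hostgroups = $nagiosbot ? {
  'nagios-releng-mdc1' => [
    'open-tcp-1514',
  ],
  default              => [
    # entries not shown in the quoted patch
  ],
}
```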
(In reply to Radu Iman[:riman] from comment #91)
- log aggregation loadbalancers
log-aggregator.srv.releng.use1.mozilla.com
log-aggregator.srv.releng.usw2.mozilla.com
These two hosts are not present in AWS, should we add or ignore them?
These are AWS load balancers rather than EC2 instances, and seem to be in use. Could we migrate them to mdc1 too, unless it's going to be painful to provide a netflow? A separate patch would be fine.
Reporter | ||
Comment 103•6 years ago
|
||
Comment on attachment 9035182 [details] [diff] [review]
comment65.patch
diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
...
'https-checks-sni-only' => {
service_description => "HTTPS",
check_command => 'check_https_sni_only!/',
check_interval => 60,
contact_groups => 'shipitalerts',
This patch looks great, except that this contact group appears to be used everywhere but maps to an empty IRC channel of the same name. Let's send it to #platform-ops-alerts (and maybe #release-services) by editing mozilla/contactgroups.pp. Bonus points for renaming the group to something like release-services-alerts.
Comment 104•6 years ago
|
||
Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different for each service (e.g. treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html).
Reporter | ||
Comment 105•6 years ago
|
||
I'm not 100% sure, but looking at modules/nagios4/templates/prod/nagios-service.cfg.erb it seems that the m.mozilla.org links come from notes_url, which defaults to using service_description. If we set info_url instead, we could use "https://docs.mozilla-releng.net" for all the services checked by https-checks-sni-only, and more specific URLs for the treestatus and tooltool checks. I can't see any examples of info_url being set to be sure, and there are other files for host/hostgroup/servicegroup etc.
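A rough, untested sketch of that idea, assuming the template really does pick up an info_url key from the service hash. The `$example_services` name is hypothetical; the other keys are copied from the check quoted in comment 103.

```puppet
# Untested sketch: $example_services is a hypothetical name, and whether the
# template honours info_url this way has not been confirmed in this bug.
$example_services = {
  'https-checks-sni-only' => {
    service_description => 'HTTPS',
    check_command       => 'check_https_sni_only!/',
    check_interval      => 60,
    contact_groups      => 'shipitalerts',
    info_url            => 'https://docs.mozilla-releng.net',
  },
}
```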
Related, if I run ./please tools nagios-config, and grep away the testing.mozilla-releng.net, then it's a different list of hosts now:
-'archiver.staging.mozilla-releng.net' => {
-'coalesce.mozilla-releng.net' => {
+'api.shipit.staging.mozilla-releng.net' => {
'docs.mozilla-releng.net' => {
'docs.staging.mozilla-releng.net' => {
+'identity.notification.mozilla-releng.net' => {
+'identity.notification.staging.mozilla-releng.net' => {
+'mapper.mozilla-releng.net' => {
'mapper.staging.mozilla-releng.net' => {
'mozilla-releng.net' => {
-'pipeline.shipit.staging.mozilla-releng.net' => {
+'policy.notification.mozilla-releng.net' => {
+'policy.notification.staging.mozilla-releng.net' => {
+'shipit-api.mozilla-releng.net' => {
'shipit.mozilla-releng.net' => {
'shipit.staging.mozilla-releng.net' => {
-'signoff.shipit.staging.mozilla-releng.net' => {
'staging.mozilla-releng.net' => {
-'taskcluster.shipit.staging.mozilla-releng.net' => {
+'tokens.mozilla-releng.net' => {
+'tokens.staging.mozilla-releng.net' => {
'tooltool.mozilla-releng.net' => {
'tooltool.staging.mozilla-releng.net' => {
'treestatus.mozilla-releng.net' => {
Comment 106•6 years ago
|
||
:nthomas those "custom" links would be great to have, so that whoever responds to an alert knows how to troubleshoot and who to escalate to.
The ./please tools nagios-config script got a bit out of sync. Once we know what we need to generate, we can change the template for the script.
Most important is that tooltool and treestatus get their checks back.
Comment 107•6 years ago
|
||
-added the load balancers
-fixed the indentation
Comment 108•6 years ago
|
||
-renamed the contact group
I believe I edited it to send to the #platform-ops-alerts channel.
Comment 111•6 years ago
|
||
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #109)
Comment on attachment 9036859 [details] [diff] [review]
comment70.patch
diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
- 'log-aggregator.srv.releng.use1.mozilla.com' => {
parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
contact_groups => 'build',
hostgroups => [
'log-aggregator-lb',
'rsyslog-tcp-1514'
]
- },
I was surprised to find that there were no checks configured for the
log-aggregator-lb hostgroup in scl3, so we only got the syslog-tcp-1514
service check via the rsyslog-tcp-1514 hostgroup. Let's just match that,
which means removing all the references to log-aggregator-lb.
-removed all the references to log-aggregator-lb
Comment 112•6 years ago
|
||
-removed the whitespaces
Comment 114•6 years ago
|
||
(In reply to Rok Garbas [:garbas] from comment #113)
Comment on attachment 9036871 [details] [diff] [review]
comment65.patch
Review of attachment 9036871 [details] [diff] [review]:
:radu
regarding alert groups for release-services we need to have 2 groups:
- one which will alert #ci and #release-services when production services
(domains without staging and testing in it) are down
- and another alert which sends alerts to #release-services once for
non-production services (domains with staging and testing in it)
Does the above make sense?
The current Nagios configuration uses two contact groups:
[1] - 'build' group for domains without staging and testing in it
[2] - 'release-services-alerts' group for domains with staging and testing in it
Both of them send the alerts to the #platform-ops-alerts channel.
As you mentioned above, we want to send [1] to #ci and #release-services and [2] only to #release-services. Am I right?
If yes:
Considering that the 'build' group is also used for other hosts, we have to create a new group which will alert #ci and use it for domains without staging and testing in it (or we can keep sending those alerts to #platform-ops-alerts).
Regarding the 'release-services-alerts' group, we can change the channel where the alerts are sent (#platform-ops-alerts => #release-services) and use it for all the domains.
Is that how things should work?
Also, for now include a link to https://docs.mozilla-releng.net for all
release-service projects.
Comment 115•6 years ago
|
||
I have created the following contact-groups:
- 'release-production-services' -> for domains without staging and testing in it
- 'release-non-production-services' -> for domains with staging and testing in it
The alerts from the production services will be sent to #platform-ops-alerts and #release-services, and the alerts from the non-production services will only be sent to #release-services.
Hello Rok, could you have a look and review the patch please?
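The split between the two groups could be expressed roughly as follows. This is a minimal sketch, not the landed patch: `$domain` and `$contact_group` are hypothetical variable names, and only the two group names come from this comment.

```puppet
# Hypothetical example domain; real domains come from the host list.
$domain = 'treestatus.mozilla-releng.net'

# Domains containing "staging" or "testing" are treated as non-production.
$contact_group = $domain ? {
  /(staging|testing)/ => 'release-non-production-services',
  default             => 'release-production-services',
}
```

With the example domain above, the selector falls through to the default branch and picks the production group.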
Comment 116•6 years ago
|
||
After cleaning the trailing whitespace in patch: https://bug1484880.bmoattachments.org/attachment.cgi?id=9037445
dlabici successfully landed the patch with commit: 2208837e0dc31ebd1d98adf21ea0560f25fde6df
We will keep an eye on it to see if anything goes wrong.
Comment 117•6 years ago
|
||
All newly added domains are present and UP in Nagios ( https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=all&style=overview )
It seems that every new log-aggregator has the TCP-1514 issue. Everything else seems to be green.
Comment 118•6 years ago
|
||
-added a small fix for the last patch landed
dlabici, please take a look and land it
Comment 120•6 years ago
|
||
dlabici successfully landed the fix with commit: 21965da0ca3b50777d2c0c38d6b3ef7c108a64f7
Thank you!
Comment 121•6 years ago
|
||
for less noise on #platform-ops-alerts, I've set ack on log-aggregator2.srv.releng.usw2.mozilla.com:open syslog TCP connections is UNKNOWN: NRPE: Unable to read output
Comment 123•6 years ago
|
||
I have landed the patch including the requested email.
Commit sha: e0bc63e9b5bcad1d5d5b0ca2ceea493b476f8bd5
What remains to do here is to figure out how we can provide custom docs link(s) instead of the automatically generated one.
Comment 124•6 years ago
|
||
for less noise, I've set ack for the following alerts :
treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.
treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory
treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees is CRITICAL: CRITICAL - Cannot make SSL connection. (http://m.mozilla.org/HTTP+Status+-+https://treestatus.mozilla-releng.net/trees)
treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory
tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-1 is CRITICAL: CRITICAL - Cannot make SSL connection.
tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-2 is CRITICAL: CRITICAL - Cannot make SSL connection.
Reporter | ||
Comment 125•6 years ago
|
||
(In reply to Adrian Pop from comment #124)
for less noise, I've set ack for the following alerts :
treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.
I'm not sure what's up with these. Might be to do with configuration or the use of Let's Encrypt as the SSL CA.
treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory
I think you'll need to modify manifests/nodes/nagios.pp so that nagios1.private.releng.mdc1.mozilla.com also has realize(Nrpe::Plugin["check_json"]). See the other usage in the same file for examples.
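As a sketch of that suggestion, assuming the host has its own node definition in manifests/nodes/nagios.pp (the real definition will contain more than this and may be structured differently):

```puppet
# Sketch only: the real node definition has more content than shown here.
node 'nagios1.private.releng.mdc1.mozilla.com' {
  # Realize the virtual NRPE plugin resource, which is presumably what
  # installs check_json.pl on this host.
  realize(Nrpe::Plugin['check_json'])
}
```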
There seems to be another class of errors - staging sites that aren't in DNS:
- archiver.staging.mozilla-releng.net
- pipeline.shipit.staging.mozilla-releng.net
- signoff.shipit.staging.mozilla-releng.net
- taskcluster.shipit.staging.mozilla-releng.net
Rok, what do you want to do about them ?
Comment 126•6 years ago
|
||
:nthomas The services you listed are no longer working; they can be removed.
Reporter | ||
Comment 127•6 years ago
|
||
Over to CIDuty for that.
Now that bug 1525365 is resolved, please check for any acknowledged checks that are still failing. eg https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/extinfo.cgi?type=2&host=log-aggregator2.srv.releng.usw2.mozilla.com&service=open+syslog+TCP+connections
Comment 128•6 years ago
|
||
Everything seems to be green.
We will watch the hosts over the weekend to see whether the situation stays the same.
Comment 129•6 years ago
|
||
I've removed the following services since they are no longer working (comment 126):
- archiver.staging.mozilla-releng.net
- pipeline.shipit.staging.mozilla-releng.net
- signoff.shipit.staging.mozilla-releng.net
- taskcluster.shipit.staging.mozilla-releng.net
Comment 131•6 years ago
|
||
(In reply to Rok Garbas [:garbas] from comment #104)
Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different link for each services. (eg: treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html)
ryanc, would you be able to help us here? I think nagios and IT configuration knowledge is limited so any help would be appreciated.
Comment 132•6 years ago
|
||
Jordan, we currently provide this functionality. For example, the "Disk - All" service documentation lives here,
https://mana.mozilla.org/wiki/display/NAGIOS/Disk+-+All
Let me know if this is sufficient or if it must link somewhere else.
Comment 133•6 years ago
|
||
@ryanc: We need to provide custom links. Not the ones that get automatically generated.
For example, we need this link when an alert comes up: https://docs.mozilla-releng.net
Comment 134•6 years ago
|
||
Landed patch in comment 129.
Revision: b337a3122752e4478221acce80db5e4f398d42e2
Comment 135•6 years ago
|
||
(In reply to Danut Labici [:dlabici] from comment #133)
@ryanc: We need to provide custom links. Not the ones that get automatically
generated.
For example, we need this link when an alert comes up:
https://docs.mozilla-releng.net
Alright, would these links always start with that domain name? e.g.
https://docs.mozilla-releng.net/treestatus
https://docs.mozilla-releng.net/otherstatus
...
Comment 136•6 years ago
|
||
:ryanc as a first try the link should always be the same, https://docs.mozilla-releng.net, but there should be an easy way to update the links later on.
I'm reviewing all the documentation for relengapi services by the end of this quarter, and once I'm done the links for these services will follow the pattern https://docs.mozilla-releng.net/projects/<project>.html (e.g. the already existing https://docs.mozilla-releng.net/projects/treestatus.html and https://docs.mozilla-releng.net/projects/tooltool.html).
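The link pattern described above can be sketched like this. The variable names are hypothetical, and the list of documented projects only contains the two pages mentioned in this comment; everything else falls back to the top-level docs URL.

```puppet
# Hypothetical inputs, used only to show the pattern.
$project             = 'treestatus'
$documented_projects = ['treestatus', 'tooltool']

if $project in $documented_projects {
  # Per-project page, following the pattern from this comment.
  $docs_url = "https://docs.mozilla-releng.net/projects/${project}.html"
} else {
  # Everything else points at the top-level docs for now.
  $docs_url = 'https://docs.mozilla-releng.net'
}
```

Keeping the list in one place would make it easy to update links later as more project pages are written.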
Comment 137•6 years ago
|
||
This is outside the scope of this bug. I will try to get around to this as soon as I can.
Comment 138•5 years ago
|
||
We don't use nagios anymore, since we migrated to GCP.