Closed Bug 1484880 Opened 4 years ago Closed 3 years ago

Move nagios CI checks out of scl3

Categories

(Release Engineering :: General, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Assigned: ciduty)

References

Details

Attachments

(19 files, 15 obsolete files)

11.34 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
4.35 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
16.88 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
2.03 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
6.60 KB, patch
fubar
: review+
ryanc
: review+
Details | Diff | Splinter Review
8.46 KB, patch
aki
: review+
Details | Diff | Splinter Review
196.55 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
22.28 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
58.17 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
15.59 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
16.80 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
39.60 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
42.97 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
37.55 KB, patch
nthomas
: review+
dhouse
: review+
Details | Diff | Splinter Review
1.87 KB, patch
Details | Diff | Splinter Review
6.47 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
15.06 KB, patch
garbas
: review+
Details | Diff | Splinter Review
607 bytes, patch
dlabici
: review+
Details | Diff | Splinter Review
2.67 KB, patch
garbas
: review+
Details | Diff | Splinter Review
https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.scl3.mozilla.com lists out lots of checks which we should keep after SCL3 goes away.

At a quick glance
* HPKP Expiration - Beta and 3 friends
* Pending Builds, Pending Tests
* balrog scriptworker queue and 4 similar checks on scriptworkers
* ping cluster for any hardware which will persist

Danut, would ciduty be able to move these to mdc1/mdc2 as appropriate ?
Flags: needinfo?(dlabici)
Flags: needinfo?(ciduty)
Attached patch hpkp_expiration_patch (obsolete) — Splinter Review
The patch to move all HPKP Expiration checks (Beta, ESR, Nightly and Release) from scl3 to mdc1.
 
:nthomas, :jlund could you take a look, please?
Assignee: nobody → riman
Attachment #9003008 - Flags: review?(nthomas)
Attachment #9003008 - Flags: review?(jlund)
Comment on attachment 9003008 [details] [diff] [review]
hpkp_expiration_patch

I'm going to pass my review on to Jake because I know just enough to read the code and make some guesses, but not enough to know where the man traps are.
Attachment #9003008 - Flags: review?(nthomas) → review?(jwatkins)
Thank you :nthomas 

I have a question regarding:
> ... to move these to mdc1/mdc2 as appropriate ?
in which cases do we choose mdc1 vs mdc2 when moving the checks from scl3?
Flags: needinfo?(nthomas)
That'd be a good question for RelOps too.
Flags: needinfo?(nthomas)
Attached patch pending-checks-patch (obsolete) — Splinter Review
Patch to move the pending builds & pending tests checks, from scl3 to mdc1.
Attachment #9003017 - Flags: review?(jwatkins)
Attachment #9003017 - Flags: review?(jlund)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #4)
> That'd be a good question for RelOps too.

++
Attachment #9003008 - Flags: review?(jlund)
Attachment #9003017 - Flags: review?(jlund)
@fubar - is there someone from relops who has moved nagios checks from scl3 to mdc1 and can help guide ciduty here?
Flags: needinfo?(klibby)
Comment on attachment 9003008 [details] [diff] [review]
hpkp_expiration_patch

Review of attachment 9003008 [details] [diff] [review]:
-----------------------------------------------------------------

looks fine to me
Attachment #9003008 - Flags: review?(jwatkins) → review+
Attachment #9003017 - Flags: review?(jwatkins) → review+
Since nagios is managed by IT, I would recommend having someone from MOC review these also.  Maybe :ashish?
Flags: needinfo?(klibby) → needinfo?(ashish)
(In reply to Radu Iman[:riman] from comment #3)
> Thank you :nthomas 
> 
> I have a question regarding:
> > ... to move these to mdc1/mdc2 as appropriate ?
> depending on which case do we choose mdc1 or mdc2 to move the checks from
> scl3?

I think moving them to MDC1 is fine.
Nagios is with the MOC. Passing the ni? to :ryanc.
Flags: needinfo?(ashish) → needinfo?(rchilds)
LGTM
Flags: needinfo?(rchilds)
(In reply to Radu Iman[:riman] from comment #3)
> 
> I have a question regarding:
> > ... to move these to mdc1/mdc2 as appropriate ?
> depending on which case do we choose mdc1 or mdc2 to move the checks from
> scl3?

If we're checking services that are outside of the data centers, then we should probably have a chat with the MOC about the best way to do this. For now MDC1 is fine, but we don't want to lose monitoring of those services if we have an outage there; at the same time, we (probably) don't want to get alerted twice by having the same checks run from both MDC1 and MDC2.


(In reply to Jordan Lund (:jlund) from comment #7)
> @fubar - is there someone from relops that have moved nagios checks from
> scl3 to mdc1 that can help guide ciduty here?

With Dave on PTO, Jake can help; I bet the MOC would be willing to help, as well.
Attached patch scriptworker-checks.txt (obsolete) — Splinter Review
Patch to move the scriptworker checks from SCL3 to MDC1:

-signing_scriptworker_gpg_lock
-signing_scriptworker_gpg_rebuild_log
-scriptworker_log_age
-signing_scriptworker_queue_size
-depsigning_scriptworker_queue_size
-balrog_scriptworker_queue_size
-beetmover_scriptworker_queue_size
-pushapk_scriptworker_queue_size
Attachment #9004913 - Flags: review?(rchilds)
Attachment #9004913 - Flags: review?(rchilds) → review+
Comment on attachment 9004913 [details] [diff] [review]
scriptworker-checks.txt

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>+            hostgroups => $nagiosbot ? {
>+                'nagios-releng' => [

Just checking - there's lots of 'nagios-releng-mdc1' in this file already but no 'nagios-releng'. Does it matter ?
Regarding the last check mentioned by Nick Thomas:

> * ping cluster for any hardware which will persist

Checking this list [1], I've noticed that the only ping cluster we need to keep is for the t-yosemite-r7 machines.
According to [2], we already have a ping cluster check for the t-yosemite-r7 machines in MDC1.

[1] https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/
[2] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1.mozilla.com

Therefore: 

1. Do these three patches cover all the checks that we should move from SCL3 ? 
2. Is the next step to push these patches to the repo?
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #15)
> >diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
> >+            hostgroups => $nagiosbot ? {
> >+                'nagios-releng' => [
> 
> Just checking - there's lots of 'nagios-releng-mdc1' in this file already
> but no 'nagios-releng'. Does it matter ?

ryanc, do you know the answer to this ? Seems like there are datacenter specific nagios bots, and nagios-releng is in scl3 ?
Flags: needinfo?(rchilds)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #17)
> (In reply to Nick Thomas [:nthomas] (UTC+12) from comment #15)
> > >diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
> > >+            hostgroups => $nagiosbot ? {
> > >+                'nagios-releng' => [
> > 
> > Just checking - there's lots of 'nagios-releng-mdc1' in this file already
> > but no 'nagios-releng'. Does it matter ?

It does matter. Please add to mdc[12] manifests.

> Seems like there are datacenter specific nagios bots, and nagios-releng is in scl3 ?

Correct.
Flags: needinfo?(rchilds)
This would start failing when secure.pub.build.mozilla.org went away so remove buildbot support from the script.
Attachment #9007648 - Flags: review?(rchilds)
Comment on attachment 9007648 [details] [diff] [review]
Remove buildbot support from check_pending_jobs

WFM
Attachment #9007648 - Flags: review?(rchilds) → review+
(In reply to Radu Iman[:riman] from comment #16)
> Therefore: 
> 
> 1. Do these three patches cover all the checks that we should move from SCL3 ? 

I'm not knowledgeable enough to tell for sure, but I don't think so. 

Looking around, there's modules/nagios4/manifests/prod/releng/scl3.pp with many mappings of hosts into hostgroups (which your earlier patches are using), eg beetmoverworker-1.srv.releng.use1 into beetmover-scriptworkers. There's no equivalent definition in the mdc1.pp file.

I think the idea is that anything we're monitoring from scl3, and which isn't in scl3, needs to be moved to mdc1. That means AWS instances in use1 or usw2 like the various scriptworker families, signingworker-N, signing-linux-N, etc. Of the buildbot-masters only buildbot-master01 and buildbot-master77 in use1 need to be moved, as the rest will be disabled. There's a bunch of checks on web sites on the mozilla-releng.net domain too.

Jake, could you confirm/deny the above ? 

> 2. Is the next step to push these patches to the repo?

Comment #18 will need to be incorporated at least.
Flags: needinfo?(jwatkins)
Attached patch bouncer-check-patch (obsolete) — Splinter Review
Made a patch to move the bouncer checks to mdc1. :nthomas, can you have a look please?
Attachment #9007658 - Flags: review?(nthomas)
Much like in bug 1479620#c6, you'll need to move more than just the check. The bouncer-checks hostgroup is tied to bm81 in SCL3, so you'll also need to move it to another host (presumably bm01 or bm77 in USE1).
Comment on attachment 9007658 [details] [diff] [review]
bouncer-check-patch

Obsoleted by comment #23 (clearing the ni on dividehex too).

Please also change "nagios-releng" to "nagios-releng-mdc1" when you copy and paste the service definitions.
Flags: needinfo?(jwatkins)
Attachment #9007658 - Flags: review?(nthomas)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #21)

> 
> Jake, could you confirm/deny the above ? 


I would agree with that assessment. There are many checks for services/hosts that are outside of SCL3, so they need to exist past the scl3 shutdown, and mdc1 nagios is the best place to shift them.
This patch is:
- to move bm01 and bm77 from releng/scl3 to releng/mdc1 
- to move bouncer-checks hostgroup from bm81 to bm01
- to change nagios-releng to nagios-releng-mdc1 in rmutter's patch
Attachment #9007658 - Attachment is obsolete: true
Attachment #9007967 - Flags: review?(rchilds)
to gather the following checks in one patch:
-HPKP expiration checks
-pending checks
-script worker checks

and to change "nagios-releng" to "nagios-releng-mdc1".
Attachment #9003008 - Attachment is obsolete: true
Attachment #9003017 - Attachment is obsolete: true
Attachment #9004913 - Attachment is obsolete: true
Attachment #9007972 - Flags: review?(rchilds)
Comment on attachment 9007972 [details] [diff] [review]
bug-1484880.patch

LGTM
Attachment #9007972 - Flags: review?(rchilds) → review+
Attachment #9007967 - Flags: review?(rchilds) → review+
All 3 patches are landed. Here is the revision: 61467e1b35..20efa70256
After the patches landed, the following alert fired:

nagios-releng-mdc1> Thu 14:15:19 UTC [8047] [] buildbot-master01.bb.releng.use1.mozilla.com:bouncer is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. (http://m.mozilla.org/bouncer)

As @fubar said: it seems like Nagios can't get to bm01 to run the check. The build-cloud-tools repository is used to generate all of the AWS security groups. To fix it we need to remove the SCL3 nagios hosts and add mdc1/2.
Removed the nagios scl3 hosts and added nagios MDC2 hosts.
Attachment #9008987 - Flags: review?(rchilds)
Hmm, that's unfortunate. Looks like nagios1.private.releng.mdc1.mozilla.com is already in the nagios-nrpe block in your patch, so the masters should already allow that on port 5666. I've verified the AWS console matches that. So unless nagios is using a different port it might be some other firewall/system blocking the traffic. I'm not able to connect to the nagios server to investigate further.

In other news, I tried running the check on bm01 and the 30s timeout isn't long enough there. 600ms response from bouncer, then 1200+ ms to hit download-installer.cdn.mozilla.net; both seem to be routing via our west coast network then back out to Amazon. I can look at the routing next week. Also wondering about writing this as a python3 async script that can run quickly on the nagios server itself.
Comment on attachment 9008987 [details] [diff] [review]
scl3_to_mdc2.patch

LGTM
Attachment #9008987 - Flags: review?(rchilds) → review+
https://github.com/mozilla-releng/build-puppet/pull/213 will probably fix bm01 to accept nrpe calls from mdc1+2.
I will be gone till 4th October. @ciduty, can you please check and see what's left to do?
Flags: needinfo?(dlabici)
Somewhere along the way we've accidentally removed some important checks (or they weren't moved over to new hosts in mdc1/2), eg https://mana.mozilla.org/wiki/display/NAGIOS/Pending+Scriptworker+Tasks which should have caught the slowness issues we were seeing the last couple days. CIDuty, can you please track these down and add them?
The patch moves the scriptworkers-pending-tasks checks from /clusterchecks/scl3.pp to /clusterchecks/mdc1.pp and from /servicegroups/scl3.pp to /servicegroups/mdc1.pp.

Could you take a look, please?
Attachment #9010831 - Flags: review?(rchilds)
Attachment #9010831 - Flags: review?(klibby)
Comment on attachment 9010831 [details] [diff] [review]
pending-scriptworker-tasks.patch

LGTM
Attachment #9010831 - Flags: review?(rchilds) → review+
Attachment #9010831 - Flags: review?(klibby) → review+
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #34)
> https://github.com/mozilla-releng/build-puppet/pull/213 will probably fix
> bm01 to accept nrpe calls from mdc1+2.

The check on bm01 for bouncer is still timing out, not sure what is blocking the traffic. In better news, the slower running of the check (see comment #32) has been resolved by the routing changes in bug 1491948.
CIDuty does not have access to push to this repository. Could someone push this patch or give us write access to the nagios module?

Patch: https://bug1484880.bmoattachments.org/attachment.cgi?id=9010831

Thank you !
Everyone in CIDuty (and RelEng) should have write access to the nagios4 module in IT puppet. If folks are unable, I suspect it's either a git/ssh config or VPN issue. Git should be configured to use ssh://gitolite3@git-internal.mozilla.org/sysadmins/puppet.git

If you can pull that then you should be fine to push; if not, then it's a VPN issue which I think I can fix for folks.
Comment on attachment 9010831 [details] [diff] [review]
pending-scriptworker-tasks.patch

Pushed as 955e4dcd5a3e79c596a9eb148b5601b826e4a07c, with some bracket fix ups.

Previously I hadn't been able to push so thanks for fixing that up.
The checks all have results like
  CLUSTER OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical
which I think is incorrect.

In modules/nagios4/manifests/prod/releng/scl3.pp there are lots of machine definitions where they're added to scriptworker hostgroups, but this is missing in mdc1.pp. Let's bring over all the scriptworkers, but drop signingworker-[1234] (deprecated).
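For illustration, a hypothetical sketch of the kind of host-to-hostgroup mapping that scl3.pp carries and mdc1.pp is missing (hostnames follow the examples in this bug; the exact resource syntax of the nagios4 module may differ):

```puppet
# Hypothetical sketch only -- modelled on the scl3.pp pattern described
# above, not copied from the real module. Each scriptworker instance is
# declared as a nagios host and assigned to its pool's hostgroup, so the
# hostgroup-level service checks (queue size, log age, etc.) apply to it.
class nagios4::prod::releng::mdc1 {
    $releng_hosts = {
        'balrogworker-1.srv.releng.use1.mozilla.com' => {
            hostgroups => ['balrog-scriptworkers'],
        },
        'beetmoverworker-1.srv.releng.use1.mozilla.com' => {
            hostgroups => ['beetmover-scriptworkers'],
        },
        # ...one entry per scriptworker being brought over from scl3
    }
}
```

Without an entry like this, the hostgroup-scoped service checks never attach to the instance, which is consistent with the cluster checks reporting only '1 unknown'.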
Created a patch where I have added all the scriptworkers.

Can you please check and review it ?

Thank You
Depends on: 149591
No longer blocks: 1478215
No longer depends on: 149591
Depends on: 1495917
The patch has been landed. Afterwards there were a lot of alerts from broken checks. Downtime has been set for all the affected services, and another bug has been opened with Netops (bug 1495920) to resolve the nrpe checks that are blocked by the firewall.
Status now that bug 1495920 resolved the network flows from mdc1/2 to use1/w2:

Fixed:
* hpkp checks were already working
* pending builds/tests alerts were already working
* l10n bumper checks are now working and green (bug 1479620)
* bouncer check is now working (here and bug 1495920)
* ping/Scriptworker gpg rebuild log age/Scriptworker gpg_homedirs.lock age/Scriptworker log age are all green on scriptworker instances now

Not finished:
* clear downtimes set for last point in Fixed, eg [1], so that they alert if they start failing
* there are pending tasks checks on individual script workers in state WARNING with 'NRPE: Unable to read output', which I suspect leads to ...
* all the scriptworker queue alerts at [1] still have status like "CLUSTER OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"


[1] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=beetmoverworker-2.srv.releng.usw2
[2] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1
Depends on: 1479620
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #47)
> * there are pending tasks checks on individual script workers in state
> WARNING with 'NRPE: Unable to read output', which I suspect leads to ...

Broke this out to 1498374.
Depends on: 1498374
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #47)
> Not finished:
> * clear downtimes set for last point in Fixed, eg [1], so that they alert if
> they start failing

Done. https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/extinfo.cgi?type=6 is very handy to see what is downtimed; that's System > Downtime on the sidebar of the main UI.

> * there are pending tasks checks on individual script workers in state
> WARNING with 'NRPE: Unable to read output', which I suspect leads to ...

Fixed by 1498374. It shows a bit of a signing queue backlog ~1110 tasks, critical threshold is set to 400; other queues OK.

> * all the scriptworker queue alerts at [1] still have status like "CLUSTER
> OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"

Waiting to see if this starts working.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #49)
> > * all the scriptworker queue alerts at [1] still have status like "CLUSTER
> > OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"
> 
> Waiting to see if this starts working.

It did not start working - they all still say 1 unknown and 0 for others. This is very odd because the configuration seems to be the same as when it worked in scl3: 
* each scriptworker instance is assigned into a hostgroup (modules/nagios4/manifests/prod/releng/mdc1.pp), eg balrogworker-1.srv -> balrog-scriptworkers
* a service check for queue length is set up for that hostgroup (.../releng/services/mdc1.pp) and assigned to a servicegroup, eg balrog_scriptworker_queue_size, which has servicegroups => 'balrog-scriptworkers-pending-tasks'
* a cluster check is defined on the service group (.../releng/clusterchecks/mdc1.pp), eg "balrog scriptworker queue"

This also shows up properly in https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/config.cgi AFAICT, comparing to "t-yosemite-r7-machines-ping cluster" which is getting a sensible status, so that mostly rules out a silly typo. The cluster checks are definitely running regularly, and aren't acked/downtimed. Nagios has restarted since the underlying checks were fixed.

dividehex, can you see what's wrong ?
Flags: needinfo?(jwatkins)
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #50)
> 
> dividehex, can you see what's wrong ?

I don't see anything returning unknown so I'm assuming this cleared up.  NI me if I missed something.
Flags: needinfo?(jwatkins)
Steps to reproduce:
1. Load https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1
2. Search for 'scriptworker queue'

Actual results:
In the rightmost column, all the statuses end in a nonsensical '0 ok, 0 warning, 1 unknown, 0 critical'. Overall check status is OK.

Expected results:
Checks aggregate properly, so balrog would have '2 ok, 0 warning, 0 unknown, 0 critical'. The groups are visible on https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?servicegroup=all&style=overview.

I'm working from memory here so perhaps my expectation is wrong, but last week when the signing queues were long (and the queue checks on the individual scriptworkers were CRITICAL) the cluster checks didn't respond at all.
Flags: needinfo?(jwatkins)
Ah! Ok, I see it now. I think that error comes from the servicegroups being set as an array instead of a string. I have no idea why it worked previously in scl3 and not now in mdc1. Anyway, I've corrected the config and tested it. The scriptworker queue checks all look good now.

diff --git a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
index 6601823a88..74ae7193b5 100644
--- a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
+++ b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
@@ -11,7 +11,6 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           critical_percentage => 75,
           stalking_options => 'w,c',
           servicegroups => 'mdc1-t-yosemite-r7-ping',
-          # hostgroups => ['t-yosemite-r7-machines'],
         },
         'signing scriptworker queue' => {
           cluster_description => 'signing scriptworker queue',
@@ -19,7 +18,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 5,
           critical_percentage => 7,
           stalking_options => 'w,c',
-          servicegroups => ["signing-scriptworkers-pending-tasks"],
+          servicegroups => 'signing-scriptworkers-pending-tasks',
         },
         'depsigning scriptworker queue' => {
           cluster_description => 'depsigning scriptworker queue',
@@ -27,7 +26,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 5,
           critical_percentage => 7,
           stalking_options => 'w,c',
-          servicegroups => ["depsigning-scriptworkers-pending-tasks"],
+          servicegroups => 'depsigning-scriptworkers-pending-tasks',
         },
         'balrog scriptworker queue' => {
           cluster_description => 'balrog scriptworker queue',
@@ -35,7 +34,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 1,
           critical_percentage => 2,
           stalking_options => 'w,c',
-          servicegroups => ["balrog-scriptworkers-pending-tasks"],
+          servicegroups => 'balrog-scriptworkers-pending-tasks',
         },
         'beetmover scriptworker queue' => {
           cluster_description => 'beetmover scriptworker queue',
@@ -43,7 +42,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 2,
           critical_percentage => 4,
           stalking_options => 'w,c',
-          servicegroups => ["beetmover-scriptworkers-pending-tasks"],
+          servicegroups => 'beetmover-scriptworkers-pending-tasks',
         },
         'pushapk scriptworker queue' => {
           cluster_description => 'pushapk scriptworker queue',
@@ -51,7 +50,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 1,
           critical_percentage => 1,
           stalking_options => 'w,c',
-          servicegroups => ["pushapk-scriptworkers-pending-tasks"],
+          servicegroups => 'pushapk-scriptworkers-pending-tasks',
         },
       }
     }
diff --git a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
index 55af86dd67..b7bcfd863d 100644
--- a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
+++ b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
@@ -11,7 +11,6 @@ class nagios4::prod::releng::clusterchecks::mdc2 {
           critical_percentage => 75,
           stalking_options => 'w,c',
           servicegroups => 'mdc2-t-yosemite-r7-ping',
-          # hostgroups => ['t-yosemite-r7-machines'],
         },
       }
     }
Flags: needinfo?(jwatkins)
Ah, that array vs string thing was the issue. Thanks Jake.

Everything in comment #47 is now done. Planning to take one last look through the scl3 configs to see if we missed anything.
Our [ciduty] part is pretty much done here, so I will remove the NI for ciduty.
There is still some work to be done in bug 1495917, but apart from that everything seems to be ok.

If someone notices something odd that needs patch/fix and we can do it, please NI ciduty again.
Flags: needinfo?(ciduty)
I took another look at the scl3 configs to see if we missed anything (regrets, I have a few) and found several checks of note:

modules/nagios4/manifests/prod/releng/scl3.pp:
* hostgroup releng-apps with checks for sites coalesce.mozilla-releng.net, archiver.staging.mozilla-releng.net, docs.mozilla-releng.net, treestatus, tooltool, shipit v2, and others. Generally releng services, which I heard Rok is working on elsewhere too (ni? for more on that)
* puppet servers in AWS, releng-puppet1.srv.releng.use1 & releng-puppet1.srv.releng.usw2
* log aggregators in AWS, log-aggregator[12].srv.releng.use1, log-aggregator[12].srv.releng.usw2

modules/nagios4/manifests/prod/releng/services/scl3.pp:
* scriptworker-procs

(possibly more here, I ran out of steam).

I'd like for ciduty to work on patches (where Rok isn't already). To make it easier I've got some patches which remove deprecated hosts and their associated configuration, which I'll attach here. That'll make it much more obvious what is still left to consider.
Flags: needinfo?(rgarbas)
Attachment #9021998 - Flags: review?(jwatkins)
Attached patch Cleanup buildbotSplinter Review
Removes
* checks for buildbot and associated processes (eg command/pulse queue, mysql)
* node definitions for deleted buildbot masters
* disk checks for separate /builds and /var partitions (which aren't used on bm01/77)

Fixes
* whitespace on bouncer and l10n_bumper_lock

Restores
* checks for swap, load, free space on /, ntp on l10n-bumper-servers (bm01/bm77)

I wondered if cross-data center ntp will be flaky. AFAIK time is not super important to the l10n bumper or bouncer checks.
Attachment #9022011 - Flags: review?(jwatkins)
Attachment #9022014 - Flags: review?(jwatkins)
signingworker-N are the old funsize signing systems, now removed. The partner repack hosts went away in bug 1478977 and bug 1500323.
Attachment #9022018 - Flags: review?(jwatkins)
Attachment #9021998 - Flags: review?(jwatkins) → review+
Attachment #9022011 - Flags: review?(jwatkins) → review+
Attachment #9022012 - Flags: review?(jwatkins) → review+
Attachment #9022014 - Flags: review?(jwatkins) → review+
Attachment #9022018 - Flags: review?(jwatkins) → review+
Landed those five attachments as 69db12b66a..0f461850c5.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #57)
> I took another look at the scl3 configs too see if we missed anything
> (regrets, I have a few) and found a several checks of note:
> 
> modules/nagios4/manifests/prod/releng/scl3.pp:
> * hostgroup releng-apps with checks for sites coalesce.mozilla-releng.net,
> archiver.staging.mozilla-releng.net, docs.mozilla-releng.net, treestatus,
> tooltool, shipit v2, and others. Generally releng services, which I heard
> Rok is working on elswhere too (ni? for more on that)
> * puppet servers in AWS, releng-puppet1.srv.releng.use1 &
> releng-puppet1.srv.releng.usw2
> * log aggregators in AWS, log-aggregator[12].srv.releng.use1,
> log-aggregator[12].srv.releng.usw2
> 
> modules/nagios4/manifests/prod/releng/services/scl3.pp:
> * scriptworker-procs
> 
> (possibly more here, I ran out of steam).
> 
> I'd like for ciduty to work on patches (where Rok isn't already). To make it
> easier I've got some patches which remove deprecated hosts and their
> associated configuration, which I'll attach here. That'll make it much more
> obvious what is still left to consider.

:nthomas, sorry for the late reply, busy week.

We have 2 tickets open [1][2] so that we remember to do this in the hopefully near future, but currently other work is more important and we're postponing this. It would be awesome if we could get some help getting this done.

Because there are many services coming and going in release-services we had (until scl3 was up) nagios configuration generated by some script[3].

If you get somebody to help us, please let them ping me in #release-services to explain more details.

[1] https://github.com/mozilla/release-services/issues/267
[2] https://github.com/mozilla/release-services/issues/1205
[3] https://github.com/mozilla/release-services/blob/master/lib/please_cli/please_cli/nagios_config.py
Flags: needinfo?(rgarbas)
* adds newer members of balrogworker and beetmover pools to mdc1
* removes scriptworker hosts from scl3, ensuring the same checks run in mdc1. We were missing quite a few (load, disk usage, scriptworker-procs, time, puppet freshness, gpg checks on depsigning workers) but they seem sensible checks, so it doesn't seem deliberate
* finishes cleaning up signing-servers and mac-signing-servers in scl3, ensuring the same checks run in mdc1
* removes check_disk_10_5_signing since it's the same as check_disk_build_10_5

There's a bunch of other scriptworker pools that we're not covering at all (eg pushsnap, treescript, addonworker, bouncer, mobile and tb variations of the same) but that's for another patch.
Attachment #9023556 - Flags: review?(jwatkins)
Comment on attachment 9023556 [details] [diff] [review]
Finish up signing; handle scriptworkers

Review of attachment 9023556 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #9023556 - Flags: review?(jwatkins) → review+
Landed as dddad86241fd3303ce23d1256e7b1b74b5de4a65. Needed a fix (df1567a9be) to disable signing_scriptworker_gpg_lock and signing_scriptworker_gpg_rebuild_log on depsigning-worker, as they don't enable chain of trust.
What's the overall status of this? I'd like to cleanup anything related to scl3 from the nagios module in bug 1447892.
Status: NEW → ASSIGNED
We're getting close to having everything moved over to mdc1 or ignored because of decommissioning. Still to move to mdc1 config and verify checks are the same:
* puppet servers
    releng-puppet1.srv.releng.use1.mozilla.com
    releng-puppet1.srv.releng.usw2.mozilla.com
* log aggregation instances
    log-aggregator1.srv.releng.use1.mozilla.com
    log-aggregator2.srv.releng.use1.mozilla.com
    log-aggregator1.srv.releng.usw2.mozilla.com
    log-aggregator2.srv.releng.usw2.mozilla.com
* log aggregation loadbalancers
    log-aggregator.srv.releng.use1.mozilla.com
    log-aggregator.srv.releng.usw2.mozilla.com
* release-services host checks (see comment #65)

Feel free to jump in with any of that if you want. Once we're done there I'd have no objections to removing
  modules/nagios4/manifests/prod/releng/scl3.pp
  modules/nagios4/manifests/prod/releng/servicegroups/scl3.pp
  modules/nagios4/manifests/prod/releng/services/scl3.pp
  modules/nagios4/manifests/prod/releng/clusterchecks/scl3.pp
and whatever refers to those.

Separately, we should still ....
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #66)
> There's a bunch of other scriptworker pools that we're not covering at all
> (eg pushsnap, treescript, addonworker, bouncer, mobile and tb variations of
> the same) but that's for another patch.

Radu, do you have time to look at this part ?
For the last, I'd suggest looking at https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types to find the pools, but ignore dev. You'll need
* to track down the host names using the AWS console and DNS, and add them to mdc1.pp, assigning them to a new hostgroup named after that pool
* setup the hostgroup in hostgroups.pp
* add the hostgroup to the services in services/mdc1.pp
* add a clustercheck for the queue length in clusterchecks/mdc1.pp, and an alias for that in servicechecks/mdc1.pp

balrog-scriptworkers would be a useful template to follow, except treat tb-depsigning, dep-pushapk, and dep-pushsnap like depsigning (doesn't do gpg checks, see comment #68 for names).
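
For a concrete picture, the pieces for one pool might look roughly like the sketch below. This is illustrative only — the 'treescript-scriptworkers' hostgroup name and the worker hostname are made up, not taken from the repo; follow the existing balrog-scriptworkers entries for the real pattern and names.

```puppet
# hostgroups.pp -- declare the pool (name and alias here are hypothetical)
'treescript-scriptworkers' => {
  alias => 'treescript scriptworkers',
},

# mdc1.pp -- one entry per host, tracked down via the AWS console and DNS
'treescriptworker-1.srv.releng.use1.mozilla.com' => {
  parents        => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
  contact_groups => 'build',
  hostgroups     => [
    'treescript-scriptworkers'
  ]
},
```

The same hostgroup name then gets referenced from services/mdc1.pp and clusterchecks/mdc1.pp as described above.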
Attached patch hostgroup.patch (obsolete) — Splinter Review
I added a hostgroup for each of the following pools:
 * addon-v1
 * balrogworker-v1
 * beetmoverworker-v1
 * bouncer-v1
 * dep-pushapk
 * dep-pushsnap
 * depsigning
 * mobile-beetmover-v1
 * mobile-pushapk-v1
 * mobile-signing-v1
 * pushapk-v1
 * pushsnap-v1
 * shipit-v1
 * signing-linux-v1
 * tb-balrog-v1
 * tb-beetmover-v1
 * tb-bouncer-v1
 * tb-depsigning
 * tb-shipit-v1
 * tb-signing-v1
 * tb-treescript-comm-v1
 * treescript-v1

I'm not sure about the alias. Please let me know if I should change it or add something there.

I'll go ahead and add each host from these pools to mdc1.pp
Attachment #9024691 - Flags: review?(nthomas)
Comment on attachment 9024691 [details] [diff] [review]
hostgroup.patch

The taskcluster naming doesn't match the puppet naming. For example, 'balrogworker-v1' in tc becomes 'balrog-scriptworkers' in puppet, and already exists. Please stick to the existing naming style in puppet by translating tc names, and keep all the scriptworker definitions near each other in the file.

For the aliases I'd suggest 'thunderbird balrog scriptworkers', 'thunderbird treescript scriptworkers', 'treescript scriptworkers' etc. These show up as the table headings on https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=all&style=overview
Attachment #9024691 - Flags: review?(nthomas) → review-
Attached patch hostgroups.patch (obsolete) — Splinter Review
I've added the scriptworkers hostgroups that have not already been defined in hostgroups.pp.

nthomas, thank you for your help. Please have a look at this new patch.
Attachment #9024691 - Attachment is obsolete: true
Attachment #9025213 - Flags: review?(nthomas)
Attached patch add-hosts-to-mdc1.patch (obsolete) — Splinter Review
I've added each host name from the scriptworker pools that were not covered to mdc1.pp.

Used the following template:

"FQDN" => {
  hostgroups => [
    'hostgroup'
  ]
}
Attachment #9025253 - Flags: review?(nthomas)
Comment on attachment 9025213 [details] [diff] [review]
hostgroups.patch

>diff --git a/modules/nagios4/manifests/prod/releng/hostgroups.pp b/modules/nagios4/manifests/prod/releng/hostgroups.pp
>+        'depsigning-pushapk-scriptworkers' => {
>+          alias => 'depsigning pushapk scriptworkers',
>+        },
>+        'depsigning-pushsnap-scriptworkers' => {
>+          alias => 'depsigning pushsnap scriptworkers',

I'd just leave the names as dep-pushapk and dep-pushsnap as they're not doing any signing.

I think you're on the right track with this file, but I'll f+ instead of r+ for a couple of reasons. Firstly, someone like dividehex should do the actual review once we've got a good patch ready; I can help you get to that point. Secondly, it makes more sense to me to have one patch that handles all the files in one go, rather than N patches for N files. They logically go together, and may need to land all together. If that seems like a big hill to climb then let's pick one class of scriptworkers and modify all the files for that class.
Attachment #9025213 - Flags: review?(nthomas) → feedback+
Comment on attachment 9025253 [details] [diff] [review]
add-hosts-to-mdc1.patch

The hosts are all present, woo!

>diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
>+    'tb-beetmover-6.srv.releng.usw2.mozilla.com' => {
>+        contact_groups => 'build',

Please make sure that contact_groups is defined for each host. Also see comments on previous patch about dep-pushapk and dep-pushsnap, and creating patches.
Attachment #9025253 - Flags: review?(nthomas) → feedback-
Thank you for your suggestions. I've made the changes you mentioned in the previous comments. I'm moving forward to prepare the big patch.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #71)
> For the last, I'd suggest looking at
> https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types
> to find the pools, but ignore dev. You'll need
> * to track down the host names using the AWS console and DNS, and add them
> to mdc1.pp, assigning them to a new hostgroup named after that pool
DONE ^
> * setup the hostgroup in hostgroups.pp
DONE ^
> * add the hostgroup to the services in services/mdc1.pp
TODO ^
> * add a clustercheck for the queue length in clusterchecks/mdc1.pp and an alias for that in servicechecks/mdc1.pp
TODO ^
* setup a host group for each scriptworker pool in hostgroups.pp 
* added the hosts from each pool in mdc1.pp
* added the hostgroups to the services in services/mdc1.pp
* added the clusterchecks for the queue length in clusterchecks/mdc1.pp
* set an alias for each clustercheck in servicechecks/mdc1.pp

:dividehex, could you have a look at this patch please?
Attachment #9025213 - Attachment is obsolete: true
Attachment #9025253 - Attachment is obsolete: true
Attachment #9030182 - Flags: review?(jwatkins)
Attachment #9030182 - Flags: review?(jwatkins) → review?(dhouse)
:riman, could you get a final review from nthomas? I'm not familiar with the scriptworkers. The nagios syntax looks right, but I'd like to have a sign-off that this addressed the requests.
Also, could you remove/hide the patches that are not relevant or mark the ones that are committed? It looks like all but one are r+'d but that none are committed in the repo.
Flags: needinfo?(riman)
To speed up the process, I will add nick for the review, :riman will be back at work in 48-ish hours.
But anyone in the team can land it if it's okay.

@Nick: Can you please review the attachment in comment https://bugzilla.mozilla.org/show_bug.cgi?id=1484880#c79
Flags: needinfo?(nthomas)
(In reply to Danut Labici [:dlabici] from comment #81)
> To speed up the process, I will add nick for the review, :riman will be back
> at work in 48-ish hours.
> But anyone in the team can land it if its okay.
> 
> @Nick: Can you please review the attachment in comment
> https://bugzilla.mozilla.org/show_bug.cgi?id=1484880#c79

Thank you, Danut!
Attachment #9030182 - Flags: review?(nthomas)
Attachment #9030182 - Flags: review?(dhouse)
Attachment #9030182 - Flags: review+
Comment on attachment 9030182 [details] [diff] [review]
add scriptworker pools that were not covered

r+, landed as 9b4bc441a6.
Flags: needinfo?(riman)
Flags: needinfo?(nthomas)
Attachment #9030182 - Flags: review?(nthomas) → review+
Attached patch Fixes (landed)Splinter Review
We needed these fixes to attachment 9030182 [details] [diff] [review] to generate a good nagios config. Super hard to spot that sort of thing in a giant patch.
We discovered some problems as a result of the checks
* tb-depsigning-worker1.srv.releng.use1.mozilla.com was not properly puppetized. I mostly followed https://github.com/mozilla-releng/scriptworker/blob/master/docs/new_instance.md#3-puppetize-the-instance, except it already accepted my own ssh key
* tb-depsigning-worker6.srv.releng.use2.mozilla.com was off in AWS, restarted it and made sure puppet ran successfully
* tb-bouncer-1.srv.releng.use1.mozilla.com was pinned to an environment, unpinned and made sure puppet ran
* tb-depsigning-worker1.srv.releng.use1.mozilla.com - a trailing newline on the comm_thunderbird_dep_signing_scriptworker_taskcluster_access_token secret caused the nagios check to be malformed

That wraps up the scriptworkers, assuming nothing was added in the meantime.

Remaining to do
* via comment #70 - AWS puppet and log-aggregator servers
* via comment #65 - checks on release-services websites (would be worth checking in with Rok if anything changed in the meantime)
Adding NI? to ciduty so we can keep a close eye on this.
Assignee: riman → ciduty
Priority: -- → P1
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #85)

MOC is going away and we will no longer get coverage for our services starting Jan 1st. We should close out the open bugs that are on their radar and fix state to match MDC1/2 infra.

> Remaining to do
> * via comment #70 - AWS puppet and log-aggregator servers

@zsolt, is this something CIDuty can do?

> * via comment #65 - checks on release-services websites (would be worth
> checking in with Rok if anything changed in the meantime)

@garbas, we need to wrap up this bug before end of year. I know you are focused on shipit v2 so perhaps you could file a bug to fix up release-services and nagios separately?

Do any release-services (tooltool, treestatus, etc) have nagios checks working? Or should we remove them and any mention of scl3?
Flags: needinfo?(zfay)
Flags: needinfo?(rgarbas)
I'm meeting with :zfay today or tomorrow. We will solve this by tomorrow evening, since then I'm on PTO.
Flags: needinfo?(rgarbas)
Meeting has been set for 19 Dec 10:00 CET. From our side, Apop will be on duty.
(In reply to Rok Garbas [:garbas] from comment #88)
> I'm meeting with :zfay today or tomorrow. We will solve this by tomorrow
> evening, since then I'm on PTO.

Any updates here?
Flags: needinfo?(rgarbas)
Attached patch comment70.patch (obsolete) — Splinter Review
comment #70 - AWS puppet and log-aggregator servers

> * puppet servers
>  releng-puppet1.srv.releng.use1.mozilla.com
>  releng-puppet1.srv.releng.usw2.mozilla.com

 Added the hosts to mdc1.pp. The hostgroups were added to services/mdc1.pp

>* log aggregation instances
>  log-aggregator1.srv.releng.use1.mozilla.com
>  log-aggregator2.srv.releng.use1.mozilla.com
>  log-aggregator1.srv.releng.usw2.mozilla.com
>  log-aggregator2.srv.releng.usw2.mozilla.com

 Added the hosts to mdc1.pp and the hostgroups to services/mdc1.pp


>* log aggregation loadbalancers
>   log-aggregator.srv.releng.use1.mozilla.com 
>   log-aggregator.srv.releng.usw2.mozilla.com

These two hosts are not present in AWS, should we add or ignore them?

Please have a look at this patch.
Thank you.
Attachment #9033557 - Flags: review?(rchilds)
Attachment #9033557 - Flags: review?(nthomas)
Attached patch comment65.patch (obsolete) — Splinter Review
comment #65 - checks on release-services websites

This patch moves all the release-services website checks from scl3 to mdc1.

I don't know the parent/child relationships between the hosts. I've used parent [1], which has parents [2] and [3]. Please have a look at this part.


[1] esx-cluster1.ops.mdc1.mozilla.com
[2] mgmt.fw1a.private.mdc1.mozilla.net
[3] mgmt.fw1b.private.mdc1.mozilla.net
Attachment #9033861 - Flags: review?(rchilds)
Attachment #9033861 - Flags: review?(nthomas)
Comment on attachment 9033861 [details] [diff] [review]
comment65.patch

(In reply to Radu Iman[:riman] from comment #92)
> I don't know which are the parent/child relationships between hosts. I've
> used parent [1] which has parents [2], [3]. Please have a look to this part.
> 
> 
> [1] esx-cluster1.ops.mdc1.mozilla.com
> [2] mgmt.fw1a.private.mdc1.mozilla.net
> [3] mgmt.fw1b.private.mdc1.mozilla.net

This seems fine besides the parenting. Most of these sites don't seem to be in the DC, e.g.

⋊> ~ host treestatus.mozilla-releng.net                                                                                                                                                             15:44:34
treestatus.mozilla-releng.net is an alias for treestatus.mozilla-releng.net.herokudns.com.
treestatus.mozilla-releng.net.herokudns.com has address 52.22.34.127
treestatus.mozilla-releng.net.herokudns.com has address 52.201.75.180
treestatus.mozilla-releng.net.herokudns.com has address 52.4.95.48
treestatus.mozilla-releng.net.herokudns.com has address 52.202.60.111
treestatus.mozilla-releng.net.herokudns.com has address 52.2.175.150
treestatus.mozilla-releng.net.herokudns.com has address 52.86.186.182
treestatus.mozilla-releng.net.herokudns.com has address 52.3.53.115
treestatus.mozilla-releng.net.herokudns.com has address 52.55.191.55
Attachment #9033861 - Flags: review?(rchilds)
Comment on attachment 9033557 [details] [diff] [review]
comment70.patch

The parenting doesn't make sense considering we're out of scl3 entirely, e.g.

+    'releng-puppet1.srv.releng.use1.mozilla.com' => {
+      parents => 'fw1.private.releng.scl3.mozilla.net',
+      contact_groups => 'build',
+      hostgroups => [
+        'puppetagain-masters'
+      ]
+    },

If you grep around that manifest, there's examples like this for each corresponding DC (which it should be),

> parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
Attachment #9033557 - Flags: review?(rchilds)
Attached patch comment70.patch (obsolete) — Splinter Review
replaced with 
> parents => 'esx-cluster1.ops.mdc1.mozilla.com',


Thanks for the help, Ryan C.
Attachment #9033557 - Attachment is obsolete: true
Attachment #9033557 - Flags: review?(nthomas)
Attachment #9034275 - Flags: review?(rchilds)
Attachment #9034275 - Flags: review?(nthomas)
Comment on attachment 9034275 [details] [diff] [review]
comment70.patch

Since this host is in AWS use1, it has no relation to ESX in mdc1. The parents should be the firewalls in mdc1 because monitoring depends on those for links to use1 etc.

+    'releng-puppet1.srv.releng.use1.mozilla.com' => {
+      parents => 'esx-cluster1.ops.mdc1.mozilla.com',
Attachment #9034275 - Flags: review?(rchilds) → review-
Flags: needinfo?(rgarbas)

(In reply to Ryan C [:ryanc] (UTC-4) from comment #96)
> Comment on attachment 9034275 [details] [diff] [review]
> comment70.patch
> 
> Since this host is in AWS use1, it has no relation to ESX in mdc1. The
> parents should be the firewalls in mdc1 because monitoring depends on
> those for links to use1 etc.
> 
> +    'releng-puppet1.srv.releng.use1.mozilla.com' => {
> +      parents => 'esx-cluster1.ops.mdc1.mozilla.com',


Are these the firewalls in mdc1 that I should use for the puppet servers and the log aggregation instances ?

parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',

If I use traceroute, it returns a different 'fw1' node from the above for both of the puppet servers and all of the log aggregation instances, but the same 'fw1' node for all of mdc1, use1 and usw2.
I'm confused because for the puppet servers and the log aggregator from MDC1 (already present in the Nagios configuration) the parent host is 'esx-cluster1.ops.mdc1.mozilla.com'

'releng-puppet1.srv.releng.mdc1.mozilla.com' => {
  parents => 'esx-cluster1.ops.mdc1.mozilla.com',
  contact_groups => 'build',
  hostgroups => [
    'puppetagain-masters'
  ]
},

Thank you.

Flags: needinfo?(zfay) → needinfo?(rchilds)

Outside of mdc[12], use the firewalls.

Looking in vcenter, "releng-puppet1.srv.releng.mdc1.mozilla.com" is indeed a VM and should use "esx-cluster1.ops.mdc1.mozilla.com" as the parent.

Flags: needinfo?(rchilds)
Attached patch comment70.patch (obsolete) — Splinter Review

added mdc1 firewalls as parents

parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',

Attachment #9034275 - Attachment is obsolete: true
Attachment #9034275 - Flags: review?(nthomas)
Attachment #9035166 - Flags: review?(rchilds)
Attachment #9035166 - Flags: review?(rchilds) → review+
Attached patch comment65.patch (obsolete) — Splinter Review

I've replaced the parent hosts with the mdc1 firewalls since the releng services hosts are out of mdc.

Attachment #9033861 - Attachment is obsolete: true
Attachment #9033861 - Flags: review?(nthomas)
Attachment #9035182 - Flags: review?(rchilds)
Attachment #9035182 - Flags: review?(rchilds) → review+

Comment on attachment 9035166 [details] [diff] [review]
comment70.patch

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>...
>+  'syslog-open-connections-1514'  => {
>...
>+    hostgroups => $nagiosbot ? {
>+      'nagios-releng-mdc1' => [
>+        'open-tcp-1514',
>+            ],
>+            default => [

Nit, the indentation on default should be fixed up when landing.

(In reply to Radu Iman[:riman] from comment #91)
> * log aggregation loadbalancers
>     log-aggregator.srv.releng.use1.mozilla.com
>     log-aggregator.srv.releng.usw2.mozilla.com
> 
> These two hosts are not present in AWS, should we add or ignore them?

These are AWS load balancers rather than EC2 instances, and seem to be in use. Could we migrate them to mdc1 too, unless it's going to be painful to provide a netflow. A separate patch would be fine.

Comment on attachment 9035182 [details] [diff] [review]
comment65.patch

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>...
>+  'https-checks-sni-only' => {
>+    service_description => "HTTPS",
>+    check_command => 'check_https_sni_only!/',
>+    check_interval => 60,
>+    contact_groups => 'shipitalerts',

This patch looks great except this contact_group appears to be used everywhere, but maps to an empty IRC channel of the same name. Let's send it to #platform-ops-alerts (and maybe #release-services) by editing mozilla/contactgroups.pp. Bonus points for renaming the group to something like release-services-alerts.
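
A minimal sketch of that contact group edit, assuming the general shape of entries in mozilla/contactgroups.pp (the member name below is hypothetical, not copied from the repo):

```puppet
# mozilla/contactgroups.pp -- renamed from 'shipitalerts'; routes alerts to
# the #platform-ops-alerts bot instead of the empty channel of the same name
'release-services-alerts' => {
  alias   => 'release-services alerts',
  members => 'platform-ops-alerts-bot',
},
```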

Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different for each service. (eg: treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html)

I'm not 100% sure, but looking at modules/nagios4/templates/prod/nagios-service.cfg.erb it seems that the m.mozilla.org links come from notes_url, which defaults to using service_description. If we set info_url instead we could use "https://docs.mozilla-releng.net" for all the services checked by https-checks-sni-only, and more specific urls for the treestatus and tooltool checks. I can't see any examples of info_url being set to be sure, and there are other files for host/hostgroup/servicegroup etc.
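
In stock nagios object configuration the link ends up in the notes_url directive of the generated service definition, roughly like the sketch below. How the erb template would map an info_url override into that directive is an assumption on my part, not verified against the template:

```cfg
define service {
    host_name            treestatus.mozilla-releng.net
    service_description  JSON String - https://treestatus.mozilla-releng.net/trees
    # default: notes_url derived from service_description, producing the
    # http://m.mozilla.org/... link seen in alerts today.
    # with info_url set, the template would presumably emit instead:
    notes_url            https://docs.mozilla-releng.net/projects/treestatus.html
}
```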

Related, if I run ./please tools nagios-config, and grep away the testing.mozilla-releng.net, then it's a different list of hosts now:

-'archiver.staging.mozilla-releng.net' => {
-'coalesce.mozilla-releng.net' => {
+'api.shipit.staging.mozilla-releng.net' => {
 'docs.mozilla-releng.net' => {
 'docs.staging.mozilla-releng.net' => {
+'identity.notification.mozilla-releng.net' => {
+'identity.notification.staging.mozilla-releng.net' => {
+'mapper.mozilla-releng.net' => {
 'mapper.staging.mozilla-releng.net' => {
 'mozilla-releng.net' => {
-'pipeline.shipit.staging.mozilla-releng.net' => {
+'policy.notification.mozilla-releng.net' => {
+'policy.notification.staging.mozilla-releng.net' => {
+'shipit-api.mozilla-releng.net' => {
 'shipit.mozilla-releng.net' => {
 'shipit.staging.mozilla-releng.net' => {
-'signoff.shipit.staging.mozilla-releng.net' => {
 'staging.mozilla-releng.net' => {
-'taskcluster.shipit.staging.mozilla-releng.net' => {
+'tokens.mozilla-releng.net' => {
+'tokens.staging.mozilla-releng.net' => {
 'tooltool.mozilla-releng.net' => {
 'tooltool.staging.mozilla-releng.net' => {
 'treestatus.mozilla-releng.net' => {

:nthomas those "custom" links would be great to have, so that whoever responds can know how to troubleshoot and who to escalate to.

That ./please tools nagios-config script got a bit out of sync. Once we know what we need to generate, we can change the template for the script.

Most importantly, tooltool and treestatus need to get checks back.

Attached patch comment70.patch (obsolete) — Splinter Review

-added the load balancers
-fixed the indentation

Attachment #9035166 - Attachment is obsolete: true
Attachment #9036859 - Flags: review?(nthomas)
Attached patch comment65.patch (obsolete) — Splinter Review

-renamed the contact group
I believe I edited it to send to the #platform-ops-alerts channel.

Attachment #9035182 - Attachment is obsolete: true
Attachment #9036871 - Flags: review?(nthomas)
Comment on attachment 9036859 [details] [diff] [review]
comment70.patch

>diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
>+    'log-aggregator.srv.releng.use1.mozilla.com' => {
>+      parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
>+      contact_groups => 'build',
>+      hostgroups => [
>+        'log-aggregator-lb',
>+        'rsyslog-tcp-1514'
>+      ]
>+    },

I was surprised to find that there were no checks configured for the log-aggregator-lb hostgroup in scl3, so we only got the syslog-tcp-1514 service check via the rsyslog-tcp-1514 hostgroup. Let's just match that, which means removing all the references to log-aggregator-lb.
Attachment #9036859 - Flags: review?(nthomas) → review-
Comment on attachment 9036871 [details] [diff] [review]
comment65.patch

lgtm. I'm going to leave it to Rok to resolve that we use build for some domains/checks and release-services-alerts for others; it's all the same at the moment.
Attachment #9036871 - Flags: review?(nthomas) → review+
Attached patch comment70.patch (obsolete) — Splinter Review

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #109)
> Comment on attachment 9036859 [details] [diff] [review]
> comment70.patch
> 
> >diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
> >+    'log-aggregator.srv.releng.use1.mozilla.com' => {
> >+      parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
> >+      contact_groups => 'build',
> >+      hostgroups => [
> >+        'log-aggregator-lb',
> >+        'rsyslog-tcp-1514'
> >+      ]
> >+    },
> 
> I was surprised to find that there were no checks configured for the
> log-aggregator-lb hostgroup in scl3, so we only got the syslog-tcp-1514
> service check via the rsyslog-tcp-1514 hostgroup. Let's just match that,
> which means removing all the references to log-aggregator-lb.

-removed all the references to log-aggregator-lb

Attachment #9036859 - Attachment is obsolete: true
Attachment #9037440 - Flags: review?(nthomas)
Attached patch comment70.patchSplinter Review

-removed the whitespaces

Attachment #9037440 - Attachment is obsolete: true
Attachment #9037440 - Flags: review?(nthomas)
Attachment #9037445 - Flags: review?(nthomas)
Attachment #9037445 - Flags: review?(nthomas) → review+
Comment on attachment 9036871 [details] [diff] [review]
comment65.patch

Review of attachment 9036871 [details] [diff] [review]:
-----------------------------------------------------------------

:radu

regarding alert groups for release-services we need to have 2 groups:
- one which will alert #ci and #release-services when production services (domains without staging and testing in it) are down
- and another alert which sends alerts to #release-services once for non-production services (domains with staging and testing in it)

Does the above make sense?
Also, for now include a link to https://docs.mozilla-releng.net for all release-services projects.

(In reply to Rok Garbas [:garbas] from comment #113)
> Comment on attachment 9036871 [details] [diff] [review]
> comment65.patch
> 
> Review of attachment 9036871 [details] [diff] [review]:
> 
> :radu
> 
> regarding alert groups for release-services we need to have 2 groups:
> - one which will alert #ci and #release-services when production services
>   (domains without staging and testing in it) are down
> - and another alert which sends alerts to #release-services once for
>   non-production services (domains with staging and testing in it)
> 
> Does the above make sense?

The current Nagios configuration uses two contact groups:
[1] - 'build' group for domains without staging and testing in it
[2] - 'release-services-alerts' group for domains with staging and testing in it
Both of them send the alerts to the #platform-ops-alert channel.

As you mentioned above, we want to send [1] to #ci and #release-services and [2] only to #release-services. Am I right?
If yes:
Considering that the 'build' group is also used for other hosts, we have to create a new group which will alert #ci and use it for domains without staging and testing in it (or we can still send the alerts to #platform-ops-alerts).
Regarding the 'release-services-alerts' group, we can change the channel where the alerts will be sent (#platform-ops-alerts => #release-services) and use it for all the domains.

Is that how things should work?

> Also, for now include link to https://docs.mozilla-releng.net for all
> release-service projects.

Flags: needinfo?(rgarbas)
Attached patch comment65.patchSplinter Review

I have created the following contact-groups:

  • 'release-production-services' -> for domains without staging and testing in it
  • 'release-non-production-services' -> for domains with staging and testing in it

The alerts from the production services will be sent to #platform-ops-alerts and #release-services, and the alerts from the non-production services will only be sent to #release-services.

Hello Rok, could you have a look and review the patch please?

Attachment #9036871 - Attachment is obsolete: true
Attachment #9041425 - Flags: review?(rgarbas)

After cleaning the trailing whitespace in patch: https://bug1484880.bmoattachments.org/attachment.cgi?id=9037445
dlabici successfully landed the patch with commit: 2208837e0dc31ebd1d98adf21ea0560f25fde6df

We will keep an eye out to see if anything goes wrong.

All new added domains are present and UP in Nagios ( https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=all&style=overview )

Seems that all the new log-aggregators have the TCP-1514 issue. Everything else seems to be green.

-added a small fix for the last patch landed

dlabici, please take a look and land it

Attachment #9041488 - Flags: review?(dlabici)
Comment on attachment 9041488 [details] [diff] [review]
fix-comment70.patch

Review of attachment 9041488 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the catch and fix.
Attachment #9041488 - Flags: review?(dlabici) → review+

dlabici successfully landed the fix with commit: 21965da0ca3b50777d2c0c38d6b3ef7c108a64f7
Thank you!

for less noise on #platform-ops-alerts, I've set ack on log-aggregator2.srv.releng.usw2.mozilla.com:open syslog TCP connections is UNKNOWN: NRPE: Unable to read output

Comment on attachment 9041425 [details] [diff] [review]
comment65.patch

Review of attachment 9041425 [details] [diff] [review]:
-----------------------------------------------------------------

:riman looks good, I only added a minor comment. The only thing I'm missing is a link to https://docs.mozilla-releng.net when an alert happens. Is this possible to do?

::: modules/nagios4/manifests/prod/mozilla/contacts.pp
@@ +100,5 @@
>          },
> +        'releaseservicesalerts' => {
> +            contactname => 'release-services IRC channel',
> +            pagertype => 'email',
> +            pageremail => 'nobody@mozilla.org'

you can use release-services@mozilla.com email here
Attachment #9041425 - Flags: review?(rgarbas) → review+

I have landed the patch including the requested email.
Commit sha: e0bc63e9b5bcad1d5d5b0ca2ceea493b476f8bd5

What remains to do here is figure out how we can provide custom docs link(s) and not use the automatically generated one.

for less noise, I've set ack for the following alerts :

treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.

treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees is CRITICAL: CRITICAL - Cannot make SSL connection. (http://m.mozilla.org/HTTP+Status+-+https://treestatus.mozilla-releng.net/trees)

treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-1 is CRITICAL: CRITICAL - Cannot make SSL connection.

tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-2 is CRITICAL: CRITICAL - Cannot make SSL connection.

(In reply to Adrian Pop from comment #124)
> for less noise, I've set ack for the following alerts:
> 
> treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.

I'm not sure what's up with these. Might be to do with configuration or the use of Let's Encrypt as the SSL CA.

> treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

I think you'll need to modify manifests/nodes/nagios.pp so that nagios1.private.releng.mdc1.mozilla.com also has realize(Nrpe::Plugin["check_json"]). See the other usage in the same file for examples.
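
The change would be a sketch like the following in manifests/nodes/nagios.pp (surrounding node contents elided; the exact shape of the existing node block is assumed, so copy from the nagios1.private.releng.scl3 definition in the same file rather than this verbatim):

```puppet
# manifests/nodes/nagios.pp -- make the check_json.pl plugin available
# on the mdc1 nagios host, mirroring the old scl3 node definition
node 'nagios1.private.releng.mdc1.mozilla.com' {
    # ... existing includes and configuration ...
    realize(Nrpe::Plugin['check_json'])
}
```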

There seems to be another class of errors - staging sites that aren't in DNS:

* archiver.staging.mozilla-releng.net
* pipeline.shipit.staging.mozilla-releng.net
* signoff.shipit.staging.mozilla-releng.net
* taskcluster.shipit.staging.mozilla-releng.net

Rok, what do you want to do about them ?

:nthomas Those services you listed are no longer working; they can be removed.

Flags: needinfo?(rgarbas)

Everything seems to be green.
We will be watching the hosts over the weekend to see whether the situation remains the same.

Depends on: 1530085

I've removed the following services since they are no longer working (comment 126):

- archiver.staging.mozilla-releng.net
- pipeline.shipit.staging.mozilla-releng.net
- signoff.shipit.staging.mozilla-releng.net
- taskcluster.shipit.staging.mozilla-releng.net
Attachment #9047253 - Flags: review?(rgarbas)
Comment on attachment 9047253 [details] [diff] [review]
removed-services.patch

:riman thank you
Attachment #9047253 - Flags: review?(rgarbas) → review+

(In reply to Rok Garbas [:garbas] from comment #104)
> Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different for each service. (eg: treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html)

ryanc, would you be able to help us here? I think our nagios and IT configuration knowledge is limited, so any help would be appreciated.

Flags: needinfo?(rchilds)

Jordan, we currently provide this functionality. For example, the "Disk - All" service documentation lives here,

https://mana.mozilla.org/wiki/display/NAGIOS/Disk+-+All

Let me know if this is sufficient or if it must link somewhere else.

Flags: needinfo?(rchilds)

@ryanc: We need to provide custom links. Not the ones that get automatically generated.
For example, we need this link when an alert comes up: https://docs.mozilla-releng.net

Flags: needinfo?(rchilds)

Landed patch in comment 129.
Revision: b337a3122752e4478221acce80db5e4f398d42e2

(In reply to Danut Labici [:dlabici] from comment #133)
> @ryanc: We need to provide custom links. Not the ones that get automatically
> generated.
> For example, we need this link when an alert comes up:
> https://docs.mozilla-releng.net

Alright, would these links always start with that domain name? e.g.

https://docs.mozilla-releng.net/treestatus
https://docs.mozilla-releng.net/otherstatus
...

Flags: needinfo?(rchilds) → needinfo?(dlabici)

:ryanc as a first try the link should always be the same, and that is https://docs.mozilla-releng.net, but there should be an easy way to update them later on.

I'm reviewing all the documentation for relengapi services by the end of this Q, and once I'm done the links for these services will follow the pattern https://docs.mozilla-releng.net/projects/<project>.html (eg. already existing https://docs.mozilla-releng.net/projects/treestatus.html, https://docs.mozilla-releng.net/projects/tooltool.html)

Flags: needinfo?(dlabici) → needinfo?(rchilds)

This is outside the scope of this bug. I will try to get around to this as soon as I can.

Flags: needinfo?(rchilds)

We don't use nagios anymore since we migrated to GCP.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.