Closed Bug 1484880 Opened 4 years ago Closed 3 years ago

Move nagios CI checks out of scl3

Categories

(Release Engineering :: General, enhancement, P1)

enhancement

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: nthomas, Assigned: ciduty)

References

Details

Attachments

(19 files, 15 obsolete files)

11.34 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
4.35 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
16.88 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
2.03 KB, patch
ryanc
: review+
Details | Diff | Splinter Review
6.60 KB, patch
fubar
: review+
ryanc
: review+
Details | Diff | Splinter Review
8.46 KB, patch
aki
: review+
Details | Diff | Splinter Review
196.55 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
22.28 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
58.17 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
15.59 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
16.80 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
39.60 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
42.97 KB, patch
dividehex
: review+
Details | Diff | Splinter Review
37.55 KB, patch
nthomas
: review+
dhouse
: review+
Details | Diff | Splinter Review
1.87 KB, patch
Details | Diff | Splinter Review
6.47 KB, patch
nthomas
: review+
Details | Diff | Splinter Review
15.06 KB, patch
garbas
: review+
Details | Diff | Splinter Review
607 bytes, patch
dlabici
: review+
Details | Diff | Splinter Review
2.67 KB, patch
garbas
: review+
Details | Diff | Splinter Review
https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.scl3.mozilla.com lists out lots of checks which we should keep after SCL3 goes away.

At a quick glance
* HPKP Expiration - Beta and 3 friends
* Pending Builds, Pending Tests
* balrog scriptworker queue and 4 similar checks on scriptworkers
* ping cluster for any hardware which will persist

Danut, would ciduty be able to move these to mdc1/mdc2 as appropriate ?
Flags: needinfo?(dlabici)
Flags: needinfo?(ciduty)
Attached patch hpkp_expiration_patch (obsolete) — Splinter Review
The patch to move all HPKP Expiration checks (Beta, ESR, Nightly and Release) from scl3 to mdc1.
 
:nthomas, :jlund could you take a look, please?
Assignee: nobody → riman
Attachment #9003008 - Flags: review?(nthomas)
Attachment #9003008 - Flags: review?(jlund)
Comment on attachment 9003008 [details] [diff] [review]
hpkp_expiration_patch

I'm going to pass my review on to Jake because I know just enough to read the code and make some guesses, but not enough to know where the man traps are.
Attachment #9003008 - Flags: review?(nthomas) → review?(jwatkins)
Thank you :nthomas 

I have a question regarding:
> ... to move these to mdc1/mdc2 as appropriate ?
in which cases do we choose mdc1 vs mdc2 when moving the checks from scl3?
Flags: needinfo?(nthomas)
That'd be a good question for RelOps too.
Flags: needinfo?(nthomas)
Attached patch pending-checks-patch (obsolete) — Splinter Review
Patch to move the pending builds & pending tests checks, from scl3 to mdc1.
Attachment #9003017 - Flags: review?(jwatkins)
Attachment #9003017 - Flags: review?(jlund)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #4)
> That'd be a good question for RelOps too.

++
Attachment #9003008 - Flags: review?(jlund)
Attachment #9003017 - Flags: review?(jlund)
@fubar - is there someone from relops who has moved nagios checks from scl3 to mdc1 and can help guide ciduty here?
Flags: needinfo?(klibby)
Comment on attachment 9003008 [details] [diff] [review]
hpkp_expiration_patch

Review of attachment 9003008 [details] [diff] [review]:
-----------------------------------------------------------------

looks fine to me
Attachment #9003008 - Flags: review?(jwatkins) → review+
Attachment #9003017 - Flags: review?(jwatkins) → review+
Since nagios is managed by IT, I would recommend having someone from MOC review these also.  Maybe :ashish?
Flags: needinfo?(klibby) → needinfo?(ashish)
(In reply to Radu Iman[:riman] from comment #3)
> Thank you :nthomas 
> 
> I have a question regarding:
> > ... to move these to mdc1/mdc2 as appropriate ?
> depending on which case do we choose mdc1 or mdc2 to move the checks from
> scl3?

I think moving them to MDC1 is fine.
Nagios is with the MOC. Passing the ni? to :ryanc.
Flags: needinfo?(ashish) → needinfo?(rchilds)
LGTM
Flags: needinfo?(rchilds)
(In reply to Radu Iman[:riman] from comment #3)
> 
> I have a question regarding:
> > ... to move these to mdc1/mdc2 as appropriate ?
> depending on which case do we choose mdc1 or mdc2 to move the checks from
> scl3?

If we're checking services that are outside of the data centers, then we should probably have a chat with the MOC about the best way to do this. For now MDC1 is fine, but we don't want to lose monitoring of those services if we have an outage there; at the same time, we (probably) don't want to get alerted twice by having the same checks run from both MDC1 and MDC2.


(In reply to Jordan Lund (:jlund) from comment #7)
> @fubar - is there someone from relops that have moved nagios checks from
> scl3 to mdc1 that can help guide ciduty here?

With Dave on PTO, Jake can help; I bet the MOC would be willing to help, as well.
Attached patch scriptworker-checks.txt (obsolete) — Splinter Review
Patch to move the scriptworker checks from SCL3 to MDC1:

-signing_scriptworker_gpg_lock
-signing_scriptworker_gpg_rebuild_log
-scriptworker_log_age
-signing_scriptworker_queue_size
-depsigning_scriptworker_queue_size
-balrog_scriptworker_queue_size
-beetmover_scriptworker_queue_size
-pushapk_scriptworker_queue_size
Attachment #9004913 - Flags: review?(rchilds)
Attachment #9004913 - Flags: review?(rchilds) → review+
Comment on attachment 9004913 [details] [diff] [review]
scriptworker-checks.txt

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>+            hostgroups => $nagiosbot ? {
>+                'nagios-releng' => [

Just checking - there's lots of 'nagios-releng-mdc1' in this file already but no 'nagios-releng'. Does it matter ?
Regarding the last check mentioned by Nick Thomas:

> * ping cluster for any hardware which will persist

Checking this list [1], I've noticed that the only ping cluster we need to keep is for the t-yosemite-r7 machines.
According to [2], we already have a ping cluster check for the t-yosemite-r7 machines in MDC1.

[1] https://nagios1.private.releng.scl3.mozilla.com/releng-scl3/
[2] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1.mozilla.com

Therefore: 

1. Do these three patches cover all the checks that we should move from SCL3 ? 
2. Is the next step to push these patches to the repo?
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #15)
> >diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
> >+            hostgroups => $nagiosbot ? {
> >+                'nagios-releng' => [
> 
> Just checking - there's lots of 'nagios-releng-mdc1' in this file already
> but no 'nagios-releng'. Does it matter ?

ryanc, do you know the answer to this ? Seems like there are datacenter specific nagios bots, and nagios-releng is in scl3 ?
Flags: needinfo?(rchilds)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #17)
> (In reply to Nick Thomas [:nthomas] (UTC+12) from comment #15)
> > >diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
> > >+            hostgroups => $nagiosbot ? {
> > >+                'nagios-releng' => [
> > 
> > Just checking - there's lots of 'nagios-releng-mdc1' in this file already
> > but no 'nagios-releng'. Does it matter ?

It does matter. Please add to mdc[12] manifests.

> Seems like there are datacenter specific nagios bots, and nagios-releng is in scl3 ?

Correct.
Flags: needinfo?(rchilds)
This would start failing when secure.pub.build.mozilla.org went away so remove buildbot support from the script.
Attachment #9007648 - Flags: review?(rchilds)
Comment on attachment 9007648 [details] [diff] [review]
Remove buildbot support from check_pending_jobs

WFM
Attachment #9007648 - Flags: review?(rchilds) → review+
(In reply to Radu Iman[:riman] from comment #16)
> Therefore: 
> 
> 1. Do these three patches cover all the checks that we should move from SCL3 ? 

I'm not knowledgeable enough to tell for sure, but I don't think so. 

Looking around, there's modules/nagios4/manifests/prod/releng/scl3.pp with many mappings of hosts into hostgroups (which your earlier patches are using), eg beetmoverworker-1.srv.releng.use1 into beetmover-scriptworkers. There's no equivalent definition in the mdc1.pp file.

I think the idea is that anything we're monitoring from scl3, and which isn't in scl3, needs to be moved to mdc1. That means AWS instances in use1 or usw2 like the various scriptworker families, signingworker-N, signing-linux-N, etc. Of the buildbot-masters only buildbot-master01 and buildbot-master77 in use1 need to be moved, as the rest will be disabled. There's a bunch of checks on web sites on the mozilla-releng.net domain too.

Jake, could you confirm/deny the above ? 

> 2. Is the next step to push these patches to the repo?

Comment #18 will need to be incorporated at least.
Flags: needinfo?(jwatkins)
Attached patch bouncer-check-patch (obsolete) — Splinter Review
Made a patch to move the bouncer checks to mdc1. :nthomas, can you have a look please?
Attachment #9007658 - Flags: review?(nthomas)
Much like in bug 1479620#c6, you'll need to move more than just the check. The bouncer-checks hostgroup is tied to bm81 in SCL3, so you'll also need to move it to another host (presumably bm01 or bm77 in USE1).
Comment on attachment 9007658 [details] [diff] [review]
bouncer-check-patch

Obsoleted by comment #23 (clearing the ni on dividehex too).

Please also change "nagios-releng" to "nagios-releng-mdc1" when you copy and paste the service definitions.
Flags: needinfo?(jwatkins)
Attachment #9007658 - Flags: review?(nthomas)
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #21)

> 
> Jake, could you confirm/deny the above ? 


I would agree with that assessment. There are many checks for services/hosts that are outside of SCL3, so they need to exist past the scl3 shutdown, and mdc1 nagios is the best place to shift them.
This patch is:
- to move bm01 and bm77 from releng/scl3 to releng/mdc1 
- to move bouncer-checks hostgroup from bm81 to bm01
- to change nagios-releng to nagios-releng-mdc1 in rmutter's patch
Attachment #9007658 - Attachment is obsolete: true
Attachment #9007967 - Flags: review?(rchilds)
to gather the following checks in one patch:
-HPKP expiration checks
-pending checks
-script worker checks

and to change "nagios-releng" to "nagios-releng-mdc1".
Attachment #9003008 - Attachment is obsolete: true
Attachment #9003017 - Attachment is obsolete: true
Attachment #9004913 - Attachment is obsolete: true
Attachment #9007972 - Flags: review?(rchilds)
Comment on attachment 9007972 [details] [diff] [review]
bug-1484880.patch

LGTM
Attachment #9007972 - Flags: review?(rchilds) → review+
Attachment #9007967 - Flags: review?(rchilds) → review+
All 3 patches are landed. Here is the revision: 61467e1b35..20efa70256
After the patches landed, the following alert fired:

nagios-releng-mdc1> Thu 14:15:19 UTC [8047] [] buildbot-master01.bb.releng.use1.mozilla.com:bouncer is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds. (http://m.mozilla.org/bouncer)

As @fubar said: it seems like Nagios can't get to bm01 to run the check. The build-cloud-tools repository is used to generate all of the AWS security groups. To fix it we need to remove the SCL3 nagios hosts and add mdc1/2.
Removed the nagios scl3 hosts and added nagios MDC2 hosts.
Attachment #9008987 - Flags: review?(rchilds)
Hmm, that's unfortunate. Looks like nagios1.private.releng.mdc1.mozilla.com is already in the nagios-nrpe block in your patch, so the masters should already allow that on port 5666. I've verified the AWS console matches that. So unless nagios is using a different port it might be some other firewall/system blocking the traffic. I'm not able to connect to the nagios server to investigate further.

In other news, I tried running the check on bm01 and the 30s timeout isn't long enough there. 600ms response from bouncer, then 1200+ ms to hit download-installer.cdn.mozilla.net; both seem to be routing via our west coast network then back out to Amazon. I can look at the routing next week. Also wondering about writing this as a python3 async script that can run quickly on the nagios server itself.
Comment on attachment 9008987 [details] [diff] [review]
scl3_to_mdc2.patch

LGTM
Attachment #9008987 - Flags: review?(rchilds) → review+
https://github.com/mozilla-releng/build-puppet/pull/213 will probably fix bm01 to accept nrpe calls from mdc1+2.
I will be gone till 4th October. @ciduty, can you please check and see what's left to do?
Flags: needinfo?(dlabici)
Somewhere along the way we've accidentally removed some important checks (or they weren't moved over to new hosts in mdc1/2), eg https://mana.mozilla.org/wiki/display/NAGIOS/Pending+Scriptworker+Tasks which should have caught the slowness issues we were seeing the last couple days. CIDuty, can you please track these down and add them?
The patch moves the scriptworkers-pending-tasks checks from /clusterchecks/scl3.pp to /clusterchecks/mdc1.pp and from /servicegroups/scl3.pp to /servicegroups/mdc1.pp.

Could you take a look, please?
Attachment #9010831 - Flags: review?(rchilds)
Attachment #9010831 - Flags: review?(klibby)
Comment on attachment 9010831 [details] [diff] [review]
pending-scriptworker-tasks.patch

LGTM
Attachment #9010831 - Flags: review?(rchilds) → review+
Attachment #9010831 - Flags: review?(klibby) → review+
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #34)
> https://github.com/mozilla-releng/build-puppet/pull/213 will probably fix
> bm01 to accept nrpe calls from mdc1+2.

The check on bm01 for bouncer is still timing out, not sure what is blocking the traffic. In better news, the slower running of the check (see comment #32) has been resolved by the routing changes in bug 1491948.
CIDuty does not have access to push to this repository. Could someone push this patch or give us write access to the nagios module?

Patch: https://bug1484880.bmoattachments.org/attachment.cgi?id=9010831

Thank you !
Everyone in CIDuty (and RelEng) should have write access to the nagios4 module in IT puppet. If folks are unable, I suspect it's either a git/ssh config or VPN issue. Git should be configured to use ssh://gitolite3@git-internal.mozilla.org/sysadmins/puppet.git

If you can pull that then you should be fine to push; if not, then it's a VPN issue which I think I can fix for folks.
Comment on attachment 9010831 [details] [diff] [review]
pending-scriptworker-tasks.patch

Pushed as 955e4dcd5a3e79c596a9eb148b5601b826e4a07c, with some bracket fix ups.

Previously I hadn't been able to push so thanks for fixing that up.
The checks all have results like
  CLUSTER OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical
which I think is incorrect.

In modules/nagios4/manifests/prod/releng/scl3.pp there are lots of machine definitions where they're added to scriptworker hostgroups, but this is missing in mdc1.pp. Let's bring over all the scriptworkers, but drop signingworker-[1234] (deprecated).
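For illustration, a hypothetical sketch of the kind of host-to-hostgroup mapping that scl3.pp carries and mdc1.pp is missing (hostnames follow the examples in this bug; the exact resource syntax of the nagios4 module may differ):

```puppet
# Hypothetical sketch only -- modelled on the scl3.pp pattern described
# above, not copied from the real module. Each scriptworker instance is
# declared as a nagios host and assigned to its pool's hostgroup, so the
# hostgroup-level service checks (queue size, log age, etc.) apply to it.
class nagios4::prod::releng::mdc1 {
    $releng_hosts = {
        'balrogworker-1.srv.releng.use1.mozilla.com' => {
            hostgroups => ['balrog-scriptworkers'],
        },
        'beetmoverworker-1.srv.releng.use1.mozilla.com' => {
            hostgroups => ['beetmover-scriptworkers'],
        },
        # ...one entry per scriptworker being brought over from scl3
    }
}
```

Without an entry like this, the hostgroup-scoped service checks never attach to the instance, which is consistent with the cluster checks reporting only '1 unknown'.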
Created a patch where I have added all the scriptworkers.

Can you please check and review it ?

Thank You
Depends on: 149591
No longer blocks: 1478215
No longer depends on: 149591
Depends on: 1495917
The patch has been landed. Afterwards there were a lot of alerts from broken checks. Downtime has been set for all the affected services, and another bug has been opened with Netops (bug 1495920) to resolve the nrpe checks that are blocked by the firewall.
Status now that bug 1495920 resolved the network flows from mdc1/2 to use1/w2:

Fixed:
* hpkp checks were already working
* pending builds/tests alerts were already working
* l10n bumper checks are now working and green (bug 1479620)
* bouncer check is now working (here and bug 1495920)
* ping/Scriptworker gpg rebuild log age/Scriptworker gpg_homedirs.lock age/Scriptworker log age are all green on scriptworker instances now

Not finished:
* clear downtimes set for last point in Fixed, eg [1], so that they alert if they start failing
* there are pending tasks checks on individual script workers in state WARNING with 'NRPE: Unable to read output', which I suspect leads to ...
* all the scriptworker queue alerts at [1] still have status like "CLUSTER OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"


[1] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=beetmoverworker-2.srv.releng.usw2
[2] https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1
Depends on: 1479620
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #47)
> * there are pending tasks checks on individual script workers in state
> WARNING with 'NRPE: Unable to read output', which I suspect leads to ...

Broke this out to 1498374.
Depends on: 1498374
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #47)
> Not finished:
> * clear downtimes set for last point in Fixed, eg [1], so that they alert if
> they start failing

Done. https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/extinfo.cgi?type=6 is very handy to see what is downtimed; that's System > Downtime on the sidebar of the main UI.

> * there are pending tasks checks on individual script workers in state
> WARNING with 'NRPE: Unable to read output', which I suspect leads to ...

Fixed by 1498374. It shows a bit of a signing queue backlog ~1110 tasks, critical threshold is set to 400; other queues OK.

> * all the scriptworker queue alerts at [1] still have status like "CLUSTER
> OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"

Waiting to see if this starts working.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #49)
> > * all the scriptworker queue alerts at [1] still have status like "CLUSTER
> > OK: balrog scriptworker queue: 0 ok, 0 warning, 1 unknown, 0 critical"
> 
> Waiting to see if this starts working.

It did not start working - they all still say 1 unknown and 0 for others. This is very odd because the configuration seems to be the same as when it worked in scl3: 
* each scriptworker instance is assigned into a hostgroup (modules/nagios4/manifests/prod/releng/mdc1.pp), eg balrogworker-1.srv -> balrog-scriptworkers
* a service check for queue length is set up for that hostgroup (.../releng/services/mdc1.pp) and assigned to a servicegroup, eg balrog_scriptworker_queue_size, which has servicegroups => 'balrog-scriptworkers-pending-tasks'
* a cluster check is defined on the service group (.../releng/clusterchecks/mdc1.pp), eg "balrog scriptworker queue"

This also shows up properly in https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/config.cgi AFAICT, comparing to "t-yosemite-r7-machines-ping cluster" which is getting a sensible status, so that mostly rules out a silly typo. The cluster checks are definitely running regularly, and aren't acked/downtimed. Nagios has restarted since the underlying checks were fixed.

dividehex, can you see what's wrong ?
Flags: needinfo?(jwatkins)
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #50)
> 
> dividehex, can you see what's wrong ?

I don't see anything returning unknown so I'm assuming this cleared up.  NI me if I missed something.
Flags: needinfo?(jwatkins)
Steps to reproduce:
1. Load https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.mdc1
2. Search for 'scriptworker queue'

Actual results:
In the rightmost column, all the statuses end in a nonsensical '0 ok, 0 warning, 1 unknown, 0 critical'. Overall check status is OK.

Expected results:
Checks aggregate properly, so balrog would have '2 ok, 0 warning, 0 unknown, 0 critical'. The groups are visible on https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?servicegroup=all&style=overview.

I'm working from memory here so perhaps my expectation is wrong, but last week when the signing queues were long (and the queue checks on the individual scriptworkers were CRITICAL) the cluster checks didn't respond at all.
Flags: needinfo?(jwatkins)
Ah! Ok, I see it now. I think that error comes from the servicegroups being set as an array instead of a string. I have no idea why it worked previously in scl3 and not now in mdc1. Anyway, I've corrected the config and tested it. The scriptworker queue checks all look good now.

diff --git a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
index 6601823a88..74ae7193b5 100644
--- a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
+++ b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc1.pp
@@ -11,7 +11,6 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           critical_percentage => 75,
           stalking_options => 'w,c',
           servicegroups => 'mdc1-t-yosemite-r7-ping',
-          # hostgroups => ['t-yosemite-r7-machines'],
         },
         'signing scriptworker queue' => {
           cluster_description => 'signing scriptworker queue',
@@ -19,7 +18,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 5,
           critical_percentage => 7,
           stalking_options => 'w,c',
-          servicegroups => ["signing-scriptworkers-pending-tasks"],
+          servicegroups => 'signing-scriptworkers-pending-tasks',
         },
         'depsigning scriptworker queue' => {
           cluster_description => 'depsigning scriptworker queue',
@@ -27,7 +26,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 5,
           critical_percentage => 7,
           stalking_options => 'w,c',
-          servicegroups => ["depsigning-scriptworkers-pending-tasks"],
+          servicegroups => 'depsigning-scriptworkers-pending-tasks',
         },
         'balrog scriptworker queue' => {
           cluster_description => 'balrog scriptworker queue',
@@ -35,7 +34,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 1,
           critical_percentage => 2,
           stalking_options => 'w,c',
-          servicegroups => ["balrog-scriptworkers-pending-tasks"],
+          servicegroups => 'balrog-scriptworkers-pending-tasks',
         },
         'beetmover scriptworker queue' => {
           cluster_description => 'beetmover scriptworker queue',
@@ -43,7 +42,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 2,
           critical_percentage => 4,
           stalking_options => 'w,c',
-          servicegroups => ["beetmover-scriptworkers-pending-tasks"],
+          servicegroups => 'beetmover-scriptworkers-pending-tasks',
         },
         'pushapk scriptworker queue' => {
           cluster_description => 'pushapk scriptworker queue',
@@ -51,7 +50,7 @@ class nagios4::prod::releng::clusterchecks::mdc1 {
           warning_percentage => 1,
           critical_percentage => 1,
           stalking_options => 'w,c',
-          servicegroups => ["pushapk-scriptworkers-pending-tasks"],
+          servicegroups => 'pushapk-scriptworkers-pending-tasks',
         },
       }
     }
diff --git a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
index 55af86dd67..b7bcfd863d 100644
--- a/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
+++ b/modules/nagios4/manifests/prod/releng/clusterchecks/mdc2.pp
@@ -11,7 +11,6 @@ class nagios4::prod::releng::clusterchecks::mdc2 {
           critical_percentage => 75,
           stalking_options => 'w,c',
           servicegroups => 'mdc2-t-yosemite-r7-ping',
-          # hostgroups => ['t-yosemite-r7-machines'],
         },
       }
     }
Flags: needinfo?(jwatkins)
Ah, that array vs string thing was the issue. Thanks Jake.

Everything in comment #47 is now done. Planning to take one last look through the scl3 configs to see if we missed anything.
Our [ciduty] part is pretty much done here, so I will remove the NI for ciduty.
There is still some work to be done in bug 1495917, but apart from that everything seems to be ok.

If someone notices something odd that needs patch/fix and we can do it, please NI ciduty again.
Flags: needinfo?(ciduty)
I took another look at the scl3 configs to see if we missed anything (regrets, I have a few) and found several checks of note:

modules/nagios4/manifests/prod/releng/scl3.pp:
* hostgroup releng-apps with checks for sites coalesce.mozilla-releng.net, archiver.staging.mozilla-releng.net, docs.mozilla-releng.net, treestatus, tooltool, shipit v2, and others. Generally releng services, which I heard Rok is working on elsewhere too (ni? for more on that)
* puppet servers in AWS, releng-puppet1.srv.releng.use1 & releng-puppet1.srv.releng.usw2
* log aggregators in AWS, log-aggregator[12].srv.releng.use1, log-aggregator[12].srv.releng.usw2

modules/nagios4/manifests/prod/releng/services/scl3.pp:
* scriptworker-procs

(possibly more here, I ran out of steam).

I'd like for ciduty to work on patches (where Rok isn't already). To make it easier I've got some patches which remove deprecated hosts and their associated configuration, which I'll attach here. That'll make it much more obvious what is still left to consider.
Flags: needinfo?(rgarbas)
Attachment #9021998 - Flags: review?(jwatkins)
Attached patch Cleanup buildbotSplinter Review
Removes
* checks for buildbot and associated processes (eg command/pulse queue, mysql)
* node definitions for deleted buildbot masters
* disk checks for separate /builds and /var partitions (which aren't used on bm01/77)

Fixes
* whitespace on bouncer and l10n_bumper_lock

Restores
* checks for swap, load, free space on /, ntp on l10n-bumper-servers (bm01/bm77)

I wondered if cross-data center ntp will be flaky. AFAIK time is not super important to the l10n bumper or bouncer checks.
Attachment #9022011 - Flags: review?(jwatkins)
Attachment #9022014 - Flags: review?(jwatkins)
signingworker-N are the old funsize signing systems, now removed. The partner repack hosts went away in bug 1478977 and bug 1500323.
Attachment #9022018 - Flags: review?(jwatkins)
Attachment #9021998 - Flags: review?(jwatkins) → review+
Attachment #9022011 - Flags: review?(jwatkins) → review+
Attachment #9022012 - Flags: review?(jwatkins) → review+
Attachment #9022014 - Flags: review?(jwatkins) → review+
Attachment #9022018 - Flags: review?(jwatkins) → review+
Landed those five attachments as 69db12b66a..0f461850c5.
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #57)
> I took another look at the scl3 configs too see if we missed anything
> (regrets, I have a few) and found a several checks of note:
> 
> modules/nagios4/manifests/prod/releng/scl3.pp:
> * hostgroup releng-apps with checks for sites coalesce.mozilla-releng.net,
> archiver.staging.mozilla-releng.net, docs.mozilla-releng.net, treestatus,
> tooltool, shipit v2, and others. Generally releng services, which I heard
> Rok is working on elswhere too (ni? for more on that)
> * puppet servers in AWS, releng-puppet1.srv.releng.use1 &
> releng-puppet1.srv.releng.usw2
> * log aggregators in AWS, log-aggregator[12].srv.releng.use1,
> log-aggregator[12].srv.releng.usw2
> 
> modules/nagios4/manifests/prod/releng/services/scl3.pp:
> * scriptworker-procs
> 
> (possibly more here, I ran out of steam).
> 
> I'd like for ciduty to work on patches (where Rok isn't already). To make it
> easier I've got some patches which remove deprecated hosts and their
> associated configuration, which I'll attach here. That'll make it much more
> obvious what is still left to consider.

:nthomas, sorry for the late reply, busy week.

We have 2 tickets open [1][2] so that we remember to do this in the hopefully near future, but currently other work is more important and we're postponing this. It would be awesome if we could get some help getting this done.

Because there are many services coming and going in release-services we had (until scl3 was up) nagios configuration generated by some script[3].

If you get somebody to help us, please let them ping me in #release-services to explain more details.

[1] https://github.com/mozilla/release-services/issues/267
[2] https://github.com/mozilla/release-services/issues/1205
[3] https://github.com/mozilla/release-services/blob/master/lib/please_cli/please_cli/nagios_config.py
Flags: needinfo?(rgarbas)
* adds newer members of balrogworker and beetmover pools to mdc1
* removes scriptworker hosts from scl3, ensuring the same checks run in mdc1. We were missing quite a few (load, disk usage, scriptworker-procs, time, puppet freshness, gpg checks on depsigning workers) but they seem sensible checks, so it doesn't seem deliberate
* finishes cleaning up signing-servers and mac-signing-servers in scl3, ensuring the same checks run in mdc1
* removes check_disk_10_5_signing since it's the same as check_disk_build_10_5

There's a bunch of other scriptworker pools that we're not covering at all (eg pushsnap, treescript, addonworker, bouncer, mobile and tb variations of the same) but that's for another patch.
Attachment #9023556 - Flags: review?(jwatkins)
Comment on attachment 9023556 [details] [diff] [review]
Finish up signing; handle scriptworkers

Review of attachment 9023556 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #9023556 - Flags: review?(jwatkins) → review+
Landed as dddad86241fd3303ce23d1256e7b1b74b5de4a65. Needed a fix (df1567a9be) to disable signing_scriptworker_gpg_lock and signing_scriptworker_gpg_rebuild_log on depsigning-worker, as they don't enable chain of trust.
What's the overall status of this? I'd like to cleanup anything related to scl3 from the nagios module in bug 1447892.
Status: NEW → ASSIGNED
We're getting close to having everything moved over to mdc1 or ignored because of decommissioning. Still to move to mdc1 config and verify checks are the same:
* puppet servers
    releng-puppet1.srv.releng.use1.mozilla.com
    releng-puppet1.srv.releng.usw2.mozilla.com
* log aggregation instances
    log-aggregator1.srv.releng.use1.mozilla.com
    log-aggregator2.srv.releng.use1.mozilla.com
    log-aggregator1.srv.releng.usw2.mozilla.com
    log-aggregator2.srv.releng.usw2.mozilla.com
* log aggregation loadbalancers
    log-aggregator.srv.releng.use1.mozilla.com
    log-aggregator.srv.releng.usw2.mozilla.com
* release-services host checks (see comment #65)

Feel free to jump in with any of that if you want. Once we're done there I'd have no objections to removing
  modules/nagios4/manifests/prod/releng/scl3.pp
  modules/nagios4/manifests/prod/releng/servicegroups/scl3.pp
  modules/nagios4/manifests/prod/releng/services/scl3.pp
  modules/nagios4/manifests/prod/releng/clusterchecks/scl3.pp
and whatever refers to those.

Separately, we should still ....
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #66)
> There's a bunch of other scriptworker pools that we're not covering at all
> (eg pushsnap, treescript, addonworker, bouncer, mobile and tb variations of
> the same) but that's for another patch.

Radu, do you have time to look at this part ?
For the last, I'd suggest looking at https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types to find the pools, but ignore dev. You'll need
* to track down the host names using the AWS console and DNS, and add them to mdc1.pp, assigning them to a new hostgroup named after that pool
* setup the hostgroup in hostgroups.pp
* add the hostgroup to the services in services/mdc1.pp
* add a clustercheck for the queue length in clusterchecks/mdc1.pp, and an alias for that in servicechecks/mdc1.pp

balrog-scriptworkers would be a useful template to follow, except treat tb-depsigning, dep-pushapk, and dep-pushsnap like depsigning (doesn't do gpg checks, see comment #68 for names).
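
For a concrete picture, the pieces for one pool might look roughly like the sketch below. This is illustrative only — the 'treescript-scriptworkers' hostgroup name and the worker hostname are made up, not taken from the repo; follow the existing balrog-scriptworkers entries for the real pattern and names.

```puppet
# hostgroups.pp -- declare the pool (name and alias here are hypothetical)
'treescript-scriptworkers' => {
  alias => 'treescript scriptworkers',
},

# mdc1.pp -- one entry per host, tracked down via the AWS console and DNS
'treescriptworker-1.srv.releng.use1.mozilla.com' => {
  parents        => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
  contact_groups => 'build',
  hostgroups     => [
    'treescript-scriptworkers'
  ]
},
```

The same hostgroup name then gets referenced from services/mdc1.pp and clusterchecks/mdc1.pp as described above.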
Attached patch hostgroup.patch (obsolete) — Splinter Review
I added a hostgroup for each of the following pools:
 * addon-v1
 * balrogworker-v1
 * beetmoverworker-v1
 * bouncer-v1
 * dep-pushapk
 * dep-pushsnap
 * depsigning
 * mobile-beetmover-v1
 * mobile-pushapk-v1
 * mobile-signing-v1
 * pushapk-v1
 * pushsnap-v1
 * shipit-v1
 * signing-linux-v1
 * tb-balrog-v1
 * tb-beetmover-v1
 * tb-bouncer-v1
 * tb-depsigning
 * tb-shipit-v1
 * tb-signing-v1
 * tb-treescript-comm-v1
 * treescript-v1

I'm not sure about the alias. Please let me know if I should change it or add something there.

I'll go ahead and add each host from these pools to mdc1.pp
Attachment #9024691 - Flags: review?(nthomas)
Comment on attachment 9024691 [details] [diff] [review]
hostgroup.patch

The taskcluster naming doesn't match the puppet naming. For example, 'balrogworker-v1' in tc becomes 'balrog-scriptworkers' in puppet, and already exists. Please stick to the existing naming style in puppet by translating tc names, and keep all the scriptworker definitions near each other in the file.

For the aliases I'd suggest 'thunderbird balrog scriptworkers', 'thunderbird treescript scriptworkers', 'treescript scriptworkers' etc. These show up as the table headings on https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=all&style=overview
Attachment #9024691 - Flags: review?(nthomas) → review-
Attached patch hostgroups.patch (obsolete) — Splinter Review
I've added the scriptworkers hostgroups that have not already been defined in hostgroups.pp.

nthomas, thank you for your help. Please have a look at this new patch.
Attachment #9024691 - Attachment is obsolete: true
Attachment #9025213 - Flags: review?(nthomas)
Attached patch add-hosts-to-mdc1.patch (obsolete) — Splinter Review
I've added each host name from the scriptworker pools that were not covered to mdc1.pp.

Used the following template:

"FQDN" => {
  hostgroups => [
    'hostgroup'
  ]
}
Attachment #9025253 - Flags: review?(nthomas)
Comment on attachment 9025213 [details] [diff] [review]
hostgroups.patch

>diff --git a/modules/nagios4/manifests/prod/releng/hostgroups.pp b/modules/nagios4/manifests/prod/releng/hostgroups.pp
>+        'depsigning-pushapk-scriptworkers' => {
>+          alias => 'depsigning pushapk scriptworkers',
>+        },
>+        'depsigning-pushsnap-scriptworkers' => {
>+          alias => 'depsigning pushsnap scriptworkers',

I'd just leave the names as dep-pushapk and dep-pushsnap as they're not doing any signing.

I think you're on the right track with this file, but I'll f+ instead of r+ for a couple of reasons. Firstly, someone like dividehex should do the actual review once we've got a good patch ready; I can help you get to that point. Secondly, it makes more sense to me to have one patch that handles all the files in one go, rather than N patches for N files. They logically go together, and may need to land all together. If that seems like a big hill to climb then let's pick one class of scriptworkers and modify all the files for that class.
Attachment #9025213 - Flags: review?(nthomas) → feedback+
Comment on attachment 9025253 [details] [diff] [review]
add-hosts-to-mdc1.patch

The hosts are all present, woo!

>diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
>+    'tb-beetmover-6.srv.releng.usw2.mozilla.com' => {
>+        contact_groups => 'build',

Please make sure that contact_groups is defined for each host. Also see comments on previous patch about dep-pushapk and dep-pushsnap, and creating patches.
Attachment #9025253 - Flags: review?(nthomas) → feedback-
Thank you for your suggestions. I've made the changes you mentioned in the previous comments. I'm moving forward to prepare the big patch.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #71)
> For the last, I'd suggest looking at
> https://tools.taskcluster.net/provisioners/scriptworker-prov-v1/worker-types
> to find the pools, but ignore dev. You'll need
> * to track down the host names using the AWS console and DNS, and add them
> to mdc1.pp, assigning them to a new hostgroup named after that pool
DONE ^
> * setup the hostgroup in hostgroups.pp
DONE ^
> * add the hostgroup to the services in services/mdc1.pp
TODO ^
> * add a clustercheck for the queue length in clusterchecks/mdc1.pp and an alias for that in servicechecks/mdc1.pp
TODO ^
* setup a host group for each scriptworker pool in hostgroups.pp 
* added the hosts from each pool in mdc1.pp
* added the hostgroups to the services in services/mdc1.pp
* added the clusterchecks for the queue length in clusterchecks/mdc1.pp
* set an alias for each clustercheck in servicechecks/mdc1.pp

:dividehex, could you have a look at this patch please?
Attachment #9025213 - Attachment is obsolete: true
Attachment #9025253 - Attachment is obsolete: true
Attachment #9030182 - Flags: review?(jwatkins)
Attachment #9030182 - Flags: review?(jwatkins) → review?(dhouse)
:riman, could you get a final review from nthomas? I'm not familiar with the scriptworkers. The nagios syntax looks right, but I'd like to have a sign-off that this addressed the requests.
Also, could you remove/hide the patches that are not relevant or mark the ones that are committed? It looks like all but one are r+'d but that none are committed in the repo.
Flags: needinfo?(riman)
To speed up the process, I will add nick for the review, :riman will be back at work in 48-ish hours.
But anyone in the team can land it if it's okay.

@Nick: Can you please review the attachment in comment https://bugzilla.mozilla.org/show_bug.cgi?id=1484880#c79
Flags: needinfo?(nthomas)
(In reply to Danut Labici [:dlabici] from comment #81)
> To speed up the process, I will add nick for the review, :riman will be back
> at work in 48-ish hours.
> But anyone in the team can land it if its okay.
> 
> @Nick: Can you please review the attachment in comment
> https://bugzilla.mozilla.org/show_bug.cgi?id=1484880#c79

Thank you, Danut!
Attachment #9030182 - Flags: review?(nthomas)
Attachment #9030182 - Flags: review?(dhouse)
Attachment #9030182 - Flags: review+
Comment on attachment 9030182 [details] [diff] [review]
add scriptworker pools that were not covered

r+, landed as 9b4bc441a6.
Flags: needinfo?(riman)
Flags: needinfo?(nthomas)
Attachment #9030182 - Flags: review?(nthomas) → review+
Attached patch Fixes (landed)Splinter Review
We needed these fixes to attachment 9030182 [details] [diff] [review] to generate a good nagios config. Super hard to spot that sort of thing in a giant patch.
We discovered some problems as a result of the checks
* tb-depsigning-worker1.srv.releng.use1.mozilla.com was not properly puppetized. I mostly followed https://github.com/mozilla-releng/scriptworker/blob/master/docs/new_instance.md#3-puppetize-the-instance, except it already accepted my own ssh key
* tb-depsigning-worker6.srv.releng.use2.mozilla.com was off in AWS, restarted it and made sure puppet ran successfully
* tb-bouncer-1.srv.releng.use1.mozilla.com was pinned to an environment, unpinned and made sure puppet ran
* tb-depsigning-worker1.srv.releng.use1.mozilla.com - a trailing newline on the comm_thunderbird_dep_signing_scriptworker_taskcluster_access_token secret caused the nagios check to be malformed

That wraps up the scriptworkers, assuming nothing was added in the meantime.

Remaining to do
* via comment #70 - AWS puppet and log-aggregator servers
* via comment #65 - checks on release-services websites (would be worth checking in with Rok if anything changed in the meantime)
Adding NI? to ciduty so we can keep a close eye on this.
Assignee: riman → ciduty
Priority: -- → P1
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #85)

MOC is going away and we will no longer get coverage for our services starting Jan 1st. We should close out the open bugs that are on their radar and fix state to match MDC1/2 infra.

> Remaining to do
> * via comment #70 - AWS puppet and log-aggregator servers

@zsolt, is this something CIDuty can do?

> * via comment #65 - checks on release-services websites (would be worth
> checking in with Rok if anything changed in the meantime)

@garbas, we need to wrap up this bug before end of year. I know you are focused on shipit v2 so perhaps you could file a bug to fix up release-services and nagios separately?

Do any release-services (tooltool, treestatus, etc) have nagios checks working? Or should we remove them and any mention of scl3?
Flags: needinfo?(zfay)
Flags: needinfo?(rgarbas)
I'm meeting with :zfay today or tomorrow. We will solve this by tomorrow evening, since then I'm on PTO.
Flags: needinfo?(rgarbas)
Meeting has been set for 19 Dec 10:00 CET. From our side, Apop will be on duty.
(In reply to Rok Garbas [:garbas] from comment #88)
> I'm meeting with :zfay today or tomorrow. We will solve this by tomorrow
> evening, since then I'm on PTO.

Any updates here?
Flags: needinfo?(rgarbas)
Attached patch comment70.patch (obsolete) — Splinter Review
comment #70 - AWS puppet and log-aggregator servers

> * puppet servers
>  releng-puppet1.srv.releng.use1.mozilla.com
>  releng-puppet1.srv.releng.usw2.mozilla.com

 Added the hosts to mdc1.pp. The hostgroups were added to services/mdc1.pp

>* log aggregation instances
>  log-aggregator1.srv.releng.use1.mozilla.com
>  log-aggregator2.srv.releng.use1.mozilla.com
>  log-aggregator1.srv.releng.usw2.mozilla.com
>  log-aggregator2.srv.releng.usw2.mozilla.com

 Added the hosts to mdc1.pp and the hostgroups to services/mdc1.pp


>* log aggregation loadbalancers
>   log-aggregator.srv.releng.use1.mozilla.com 
>   log-aggregator.srv.releng.usw2.mozilla.com

These two hosts are not present in AWS, should we add or ignore them?

Please have a look at this patch.
Thank you.
Attachment #9033557 - Flags: review?(rchilds)
Attachment #9033557 - Flags: review?(nthomas)
Attached patch comment65.patch (obsolete) — Splinter Review
comment #65 - checks on release-services websites

This patch moves all the release-services website checks from scl3 to mdc1.

I don't know the parent/child relationships between the hosts. I've used parent [1], which has parents [2] and [3]. Please have a look at this part.


[1] esx-cluster1.ops.mdc1.mozilla.com
[2] mgmt.fw1a.private.mdc1.mozilla.net
[3] mgmt.fw1b.private.mdc1.mozilla.net
Attachment #9033861 - Flags: review?(rchilds)
Attachment #9033861 - Flags: review?(nthomas)
Comment on attachment 9033861 [details] [diff] [review]
comment65.patch

(In reply to Radu Iman[:riman] from comment #92)
> I don't know which are the parent/child relationships between hosts. I've
> used parent [1] which has parents [2], [3]. Please have a look to this part.
> 
> 
> [1] esx-cluster1.ops.mdc1.mozilla.com
> [2] mgmt.fw1a.private.mdc1.mozilla.net
> [3] mgmt.fw1b.private.mdc1.mozilla.net

This seems fine besides the parenting. Most of these sites don't seem to be in the DC, e.g.

⋊> ~ host treestatus.mozilla-releng.net                                                                                                                                                             15:44:34
treestatus.mozilla-releng.net is an alias for treestatus.mozilla-releng.net.herokudns.com.
treestatus.mozilla-releng.net.herokudns.com has address 52.22.34.127
treestatus.mozilla-releng.net.herokudns.com has address 52.201.75.180
treestatus.mozilla-releng.net.herokudns.com has address 52.4.95.48
treestatus.mozilla-releng.net.herokudns.com has address 52.202.60.111
treestatus.mozilla-releng.net.herokudns.com has address 52.2.175.150
treestatus.mozilla-releng.net.herokudns.com has address 52.86.186.182
treestatus.mozilla-releng.net.herokudns.com has address 52.3.53.115
treestatus.mozilla-releng.net.herokudns.com has address 52.55.191.55
Attachment #9033861 - Flags: review?(rchilds)
Comment on attachment 9033557 [details] [diff] [review]
comment70.patch

The parenting doesn't make sense considering we're out of scl3 entirely, e.g.

+    'releng-puppet1.srv.releng.use1.mozilla.com' => {
+      parents => 'fw1.private.releng.scl3.mozilla.net',
+      contact_groups => 'build',
+      hostgroups => [
+        'puppetagain-masters'
+      ]
+    },

If you grep around that manifest, there's examples like this for each corresponding DC (which it should be),

> parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
Attachment #9033557 - Flags: review?(rchilds)
Attached patch comment70.patch (obsolete) — Splinter Review
replaced with 
> parents => 'esx-cluster1.ops.mdc1.mozilla.com',


Thanks for the help, Ryan C.
Attachment #9033557 - Attachment is obsolete: true
Attachment #9033557 - Flags: review?(nthomas)
Attachment #9034275 - Flags: review?(rchilds)
Attachment #9034275 - Flags: review?(nthomas)
Comment on attachment 9034275 [details] [diff] [review]
comment70.patch

Since this host is in AWS use1, it has no relation to ESX in mdc1. The parents should be the firewalls in mdc1 because monitoring depends on those for links to use1 etc.

+    'releng-puppet1.srv.releng.use1.mozilla.com' => {
+      parents => 'esx-cluster1.ops.mdc1.mozilla.com',
Attachment #9034275 - Flags: review?(rchilds) → review-
Flags: needinfo?(rgarbas)

(In reply to Ryan C [:ryanc] (UTC-4) from comment #96)
> Comment on attachment 9034275 [details] [diff] [review]
> comment70.patch
> 
> Since this host is in AWS use1, it has no relation to ESX in mdc1. The
> parents should be the firewalls in mdc1 because monitoring depends on
> those for links to use1 etc.
> 
> +    'releng-puppet1.srv.releng.use1.mozilla.com' => {
> +      parents => 'esx-cluster1.ops.mdc1.mozilla.com',


Are these the firewalls in mdc1 that I should use for the puppet servers and the log aggregation instances ?

parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',

If I use traceroute, it returns a different 'fw1' node from the above for both of the puppet servers and all of the log aggregation instances, but the same 'fw1' node for all of mdc1, use1 and usw2.
I'm confused because for the puppet servers and the log aggregator from MDC1 (already present in the Nagios configuration) the parent host is 'esx-cluster1.ops.mdc1.mozilla.com'

'releng-puppet1.srv.releng.mdc1.mozilla.com' => {
  parents => 'esx-cluster1.ops.mdc1.mozilla.com',
  contact_groups => 'build',
  hostgroups => [
    'puppetagain-masters'
  ]
},

Thank you.

Flags: needinfo?(zfay) → needinfo?(rchilds)

Outside of mdc[12], use the firewalls.

Looking in vcenter, "releng-puppet1.srv.releng.mdc1.mozilla.com" is indeed a VM and should use "esx-cluster1.ops.mdc1.mozilla.com" as the parent.

Flags: needinfo?(rchilds)
Attached patch comment70.patch (obsolete) — Splinter Review

added mdc1 firewalls as parents

parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',

Attachment #9034275 - Attachment is obsolete: true
Attachment #9034275 - Flags: review?(nthomas)
Attachment #9035166 - Flags: review?(rchilds)
Attachment #9035166 - Flags: review?(rchilds) → review+
Attached patch comment65.patch (obsolete) — Splinter Review

I've replaced the parent hosts with the mdc1 firewalls since the releng services hosts are out of mdc.

Attachment #9033861 - Attachment is obsolete: true
Attachment #9033861 - Flags: review?(nthomas)
Attachment #9035182 - Flags: review?(rchilds)
Attachment #9035182 - Flags: review?(rchilds) → review+

Comment on attachment 9035166 [details] [diff] [review]
comment70.patch

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>...
>+  'syslog-open-connections-1514'  => {
>...
>+    hostgroups => $nagiosbot ? {
>+      'nagios-releng-mdc1' => [
>+        'open-tcp-1514',
>+            ],
>+            default => [

Nit, the indentation on default should be fixed up when landing.

(In reply to Radu Iman[:riman] from comment #91)
> * log aggregation loadbalancers
>     log-aggregator.srv.releng.use1.mozilla.com
>     log-aggregator.srv.releng.usw2.mozilla.com
> 
> These two hosts are not present in AWS, should we add or ignore them?

These are AWS load balancers rather than EC2 instances, and seem to be in use. Could we migrate them to mdc1 too, unless it's going to be painful to provide a netflow. A separate patch would be fine.

Comment on attachment 9035182 [details] [diff] [review]
comment65.patch

>diff --git a/modules/nagios4/manifests/prod/releng/services/mdc1.pp b/modules/nagios4/manifests/prod/releng/services/mdc1.pp
>...
>+  'https-checks-sni-only' => {
>+    service_description => "HTTPS",
>+    check_command => 'check_https_sni_only!/',
>+    check_interval => 60,
>+    contact_groups => 'shipitalerts',

This patch looks great except this contact_group appears to be used everywhere, but maps to an empty IRC channel of the same name. Let's send it to #platform-ops-alerts (and maybe #release-services) by editing mozilla/contactgroups.pp. Bonus points for renaming the group to something like release-services-alerts.
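
A minimal sketch of that contact group edit, assuming the general shape of entries in mozilla/contactgroups.pp (the member name below is hypothetical, not copied from the repo):

```puppet
# mozilla/contactgroups.pp -- renamed from 'shipitalerts'; routes alerts to
# the #platform-ops-alerts bot instead of the empty channel of the same name
'release-services-alerts' => {
  alias   => 'release-services alerts',
  members => 'platform-ops-alerts-bot',
},
```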

Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different for each service. (eg: treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html)

I'm not 100% sure, but looking at modules/nagios4/templates/prod/nagios-service.cfg.erb it seems that the m.mozilla.org links come from notes_url, which defaults to using service_description. If we set info_url instead we could use "https://docs.mozilla-releng.net" for all the services checked by https-checks-sni-only, and more specific urls for the treestatus and tooltool checks. I can't see any examples of info_url being set to be sure, and there are other files for host/hostgroup/servicegroup etc.
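
In stock nagios object configuration the link ends up in the notes_url directive of the generated service definition, roughly like the sketch below. How the erb template would map an info_url override into that directive is an assumption on my part, not verified against the template:

```cfg
define service {
    host_name            treestatus.mozilla-releng.net
    service_description  JSON String - https://treestatus.mozilla-releng.net/trees
    # default: notes_url derived from service_description, producing the
    # http://m.mozilla.org/... link seen in alerts today.
    # with info_url set, the template would presumably emit instead:
    notes_url            https://docs.mozilla-releng.net/projects/treestatus.html
}
```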

Related, if I run ./please tools nagios-config, and grep away the testing.mozilla-releng.net, then it's a different list of hosts now:

-'archiver.staging.mozilla-releng.net' => {
-'coalesce.mozilla-releng.net' => {
+'api.shipit.staging.mozilla-releng.net' => {
 'docs.mozilla-releng.net' => {
 'docs.staging.mozilla-releng.net' => {
+'identity.notification.mozilla-releng.net' => {
+'identity.notification.staging.mozilla-releng.net' => {
+'mapper.mozilla-releng.net' => {
 'mapper.staging.mozilla-releng.net' => {
 'mozilla-releng.net' => {
-'pipeline.shipit.staging.mozilla-releng.net' => {
+'policy.notification.mozilla-releng.net' => {
+'policy.notification.staging.mozilla-releng.net' => {
+'shipit-api.mozilla-releng.net' => {
 'shipit.mozilla-releng.net' => {
 'shipit.staging.mozilla-releng.net' => {
-'signoff.shipit.staging.mozilla-releng.net' => {
 'staging.mozilla-releng.net' => {
-'taskcluster.shipit.staging.mozilla-releng.net' => {
+'tokens.mozilla-releng.net' => {
+'tokens.staging.mozilla-releng.net' => {
 'tooltool.mozilla-releng.net' => {
 'tooltool.staging.mozilla-releng.net' => {
 'treestatus.mozilla-releng.net' => {

:nthomas those "custom" links would be great to have, so that whoever responds can know how to troubleshoot and who to escalate to.

That ./please tools nagios-config script got a bit out of sync. Once we know what we need to generate, we can change the template for the script.

Most importantly, tooltool and treestatus need to get checks back.

Attached patch comment70.patch (obsolete) — Splinter Review

-added the load balancers
-fixed the indentation

Attachment #9035166 - Attachment is obsolete: true
Attachment #9036859 - Flags: review?(nthomas)
Attached patch comment65.patch (obsolete) — Splinter Review

-renamed the contact group
I believe I edited it to send to the #platform-ops-alerts channel.

Attachment #9035182 - Attachment is obsolete: true
Attachment #9036871 - Flags: review?(nthomas)
Comment on attachment 9036859 [details] [diff] [review]
comment70.patch

>diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
>+    'log-aggregator.srv.releng.use1.mozilla.com' => {
>+      parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
>+      contact_groups => 'build',
>+      hostgroups => [
>+        'log-aggregator-lb',
>+        'rsyslog-tcp-1514'
>+      ]
>+    },

I was surprised to find that there were no checks configured for the log-aggregator-lb hostgroup in scl3, so we only got the syslog-tcp-1514 service check via the rsyslog-tcp-1514 hostgroup. Let's just match that, which means removing all the references to log-aggregator-lb.
Attachment #9036859 - Flags: review?(nthomas) → review-
Comment on attachment 9036871 [details] [diff] [review]
comment65.patch

lgtm. I'm going to leave it to Rok to resolve that we use build for some domains/checks and release-services-alerts for others; it's all the same at the moment.
Attachment #9036871 - Flags: review?(nthomas) → review+
Attached patch comment70.patch (obsolete) — Splinter Review

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #109)
> Comment on attachment 9036859 [details] [diff] [review]
> comment70.patch
> 
> >diff --git a/modules/nagios4/manifests/prod/releng/mdc1.pp b/modules/nagios4/manifests/prod/releng/mdc1.pp
> >+    'log-aggregator.srv.releng.use1.mozilla.com' => {
> >+      parents => 'mgmt.fw1a.private.mdc1.mozilla.net, mgmt.fw1b.private.mdc1.mozilla.net',
> >+      contact_groups => 'build',
> >+      hostgroups => [
> >+        'log-aggregator-lb',
> >+        'rsyslog-tcp-1514'
> >+      ]
> >+    },
> 
> I was surprised to find that there were no checks configured for the
> log-aggregator-lb hostgroup in scl3, so we only got the syslog-tcp-1514
> service check via the rsyslog-tcp-1514 hostgroup. Let's just match that,
> which means removing all the references to log-aggregator-lb.

-removed all the references to log-aggregator-lb

Attachment #9036859 - Attachment is obsolete: true
Attachment #9037440 - Flags: review?(nthomas)
Attached patch comment70.patchSplinter Review

-removed the whitespaces

Attachment #9037440 - Attachment is obsolete: true
Attachment #9037440 - Flags: review?(nthomas)
Attachment #9037445 - Flags: review?(nthomas)
Attachment #9037445 - Flags: review?(nthomas) → review+
Comment on attachment 9036871 [details] [diff] [review]
comment65.patch

Review of attachment 9036871 [details] [diff] [review]:
-----------------------------------------------------------------

:radu

regarding alert groups for release-services we need to have 2 groups:
- one which will alert #ci and #release-services when production services (domains without staging and testing in it) are down
- and another alert which sends alerts to #release-services once for non-production services (domains with staging and testing in it)

Does the above make sense?
Also, for now include a link to https://docs.mozilla-releng.net for all release-services projects.

(In reply to Rok Garbas [:garbas] from comment #113)
> Comment on attachment 9036871 [details] [diff] [review]
> comment65.patch
> 
> Review of attachment 9036871 [details] [diff] [review]:
> 
> :radu
> 
> regarding alert groups for release-services we need to have 2 groups:
> - one which will alert #ci and #release-services when production services
>   (domains without staging and testing in it) are down
> - and another alert which sends alerts to #release-services once for
>   non-production services (domains with staging and testing in it)
> 
> Does the above make sense?

The current Nagios configuration uses two contact groups:
[1] - 'build' group for domains without staging and testing in it
[2] - 'release-services-alerts' group for domains with staging and testing in it
Both of them send the alerts to the #platform-ops-alert channel.

As you mentioned above, we want to send [1] to #ci and #release-services and [2] only to #release-services. Am I right?
If yes:
Considering that the 'build' group is also used for other hosts, we have to create a new group which will alert #ci and use it for domains without staging and testing in it (or we can still send the alerts to #platform-ops-alerts).
Regarding the 'release-services-alerts' group, we can change the channel where the alerts will be sent (#platform-ops-alerts => #release-services) and use it for all the domains.

Is that how things should work?

> Also, for now include link to https://docs.mozilla-releng.net for all
> release-service projects.

Flags: needinfo?(rgarbas)
Attached patch comment65.patchSplinter Review

I have created the following contact-groups:

  • 'release-production-services' -> for domains without staging and testing in it
  • 'release-non-production-services' -> for domains with staging and testing in it

The alerts from the production services will be sent to #platform-ops-alerts and #release-services, and the alerts from the non-production services will only be sent to #release-services.

Hello Rok, could you have a look and review the patch please?

Attachment #9036871 - Attachment is obsolete: true
Attachment #9041425 - Flags: review?(rgarbas)

After cleaning the trailing whitespace in patch: https://bug1484880.bmoattachments.org/attachment.cgi?id=9037445
dlabici successfully landed the patch with commit: 2208837e0dc31ebd1d98adf21ea0560f25fde6df

We will keep an eye out to see if anything goes wrong.

All new added domains are present and UP in Nagios ( https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=all&style=overview )

Seems that all the new log-aggregators have the TCP-1514 issue. Everything else seems to be green.

-added a small fix for the last patch landed

dlabici, please take a look and land it

Attachment #9041488 - Flags: review?(dlabici)
Comment on attachment 9041488 [details] [diff] [review]
fix-comment70.patch

Review of attachment 9041488 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the catch and fix.
Attachment #9041488 - Flags: review?(dlabici) → review+

dlabici successfully landed the fix with commit: 21965da0ca3b50777d2c0c38d6b3ef7c108a64f7
Thank you!

for less noise on #platform-ops-alerts, I've set ack on log-aggregator2.srv.releng.usw2.mozilla.com:open syslog TCP connections is UNKNOWN: NRPE: Unable to read output

Comment on attachment 9041425 [details] [diff] [review]
comment65.patch

Review of attachment 9041425 [details] [diff] [review]:
-----------------------------------------------------------------

:riman looks good, I only added a minor comment. The only thing I'm missing is a link to https://docs.mozilla-releng.net when an alert happens. Is this possible to do?

::: modules/nagios4/manifests/prod/mozilla/contacts.pp
@@ +100,5 @@
>          },
> +        'releaseservicesalerts' => {
> +            contactname => 'release-services IRC channel',
> +            pagertype => 'email',
> +            pageremail => 'nobody@mozilla.org'

you can use release-services@mozilla.com email here
Attachment #9041425 - Flags: review?(rgarbas) → review+

I have landed the patch including the requested email.
Commit sha: e0bc63e9b5bcad1d5d5b0ca2ceea493b476f8bd5

What remains to do here is figure out how we can provide custom docs link(s) and not use the automatically generated one.

for less noise, I've set ack for the following alerts :

treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.

treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees is CRITICAL: CRITICAL - Cannot make SSL connection. (http://m.mozilla.org/HTTP+Status+-+https://treestatus.mozilla-releng.net/trees)

treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-1 is CRITICAL: CRITICAL - Cannot make SSL connection.

tooltool.mozilla-releng.net:HTTP Status - https://tooltool.mozilla-releng.net/sha512-2 is CRITICAL: CRITICAL - Cannot make SSL connection.

(In reply to Adrian Pop from comment #124)
> for less noise, I've set ack for the following alerts:
> 
> treestatus.mozilla-releng.net:HTTP Status - https://treestatus.mozilla-releng.net/trees/mozilla-beta is CRITICAL: CRITICAL - Cannot make SSL connection.

I'm not sure what's up with these. Might be to do with configuration or the use of Let's Encrypt as the SSL CA.

> treestatus.mozilla-releng.net:JSON String - https://treestatus.mozilla-releng.net/trees is CRITICAL: (No output on stdout) stderr: execvp(/usr/lib64/nagios/plugins/custom/check_json.pl, ...) failed. errno is 2: No such file or directory

I think you'll need to modify manifests/nodes/nagios.pp so that nagios1.private.releng.mdc1.mozilla.com also has realize(Nrpe::Plugin["check_json"]). See the other usage in the same file for examples.
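
The change would be a sketch like the following in manifests/nodes/nagios.pp (surrounding node contents elided; the exact shape of the existing node block is assumed, so copy from the nagios1.private.releng.scl3 definition in the same file rather than this verbatim):

```puppet
# manifests/nodes/nagios.pp -- make the check_json.pl plugin available
# on the mdc1 nagios host, mirroring the old scl3 node definition
node 'nagios1.private.releng.mdc1.mozilla.com' {
    # ... existing includes and configuration ...
    realize(Nrpe::Plugin['check_json'])
}
```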

There seems to be another class of errors - staging sites that aren't in DNS:

* archiver.staging.mozilla-releng.net
* pipeline.shipit.staging.mozilla-releng.net
* signoff.shipit.staging.mozilla-releng.net
* taskcluster.shipit.staging.mozilla-releng.net

Rok, what do you want to do about them ?

:nthomas Those services you listed are no longer working; they can be removed.

Flags: needinfo?(rgarbas)

Everything seems to be green.
We will be watching the hosts over the weekend to see whether the situation remains the same.

Depends on: 1530085

I've removed the following services since they are no longer working (comment 126):

- archiver.staging.mozilla-releng.net
- pipeline.shipit.staging.mozilla-releng.net
- signoff.shipit.staging.mozilla-releng.net
- taskcluster.shipit.staging.mozilla-releng.net
Attachment #9047253 - Flags: review?(rgarbas)
Comment on attachment 9047253 [details] [diff] [review]
removed-services.patch

:riman thank you
Attachment #9047253 - Flags: review?(rgarbas) → review+

(In reply to Rok Garbas [:garbas] from comment #104)
> Is it possible to provide a link with the alert for mozilla-releng.net services? The link would be different for each service. (eg: treestatus.mozilla-releng.net -> https://docs.mozilla-releng.net/projects/treestatus.html)

ryanc, would you be able to help us here? I think our nagios and IT configuration knowledge is limited, so any help would be appreciated.

Flags: needinfo?(rchilds)

Jordan, we currently provide this functionality. For example, the "Disk - All" service documentation lives here,

https://mana.mozilla.org/wiki/display/NAGIOS/Disk+-+All

Let me know if this is sufficient or if it must link somewhere else.

Flags: needinfo?(rchilds)

@ryanc: We need to provide custom links. Not the ones that get automatically generated.
For example, we need this link when an alert comes up: https://docs.mozilla-releng.net

Flags: needinfo?(rchilds)

Landed patch in comment 129.
Revision: b337a3122752e4478221acce80db5e4f398d42e2

(In reply to Danut Labici [:dlabici] from comment #133)
> @ryanc: We need to provide custom links. Not the ones that get automatically
> generated.
> For example, we need this link when an alert comes up:
> https://docs.mozilla-releng.net

Alright, would these links always start with that domain name? e.g.

https://docs.mozilla-releng.net/treestatus
https://docs.mozilla-releng.net/otherstatus
...

Flags: needinfo?(rchilds) → needinfo?(dlabici)

:ryanc as a first try the link should always be the same, and that is https://docs.mozilla-releng.net, but there should be an easy way to update them later on.

I'm reviewing all the documentation for relengapi services by the end of this Q, and once I'm done the links for these services will follow the pattern https://docs.mozilla-releng.net/projects/<project>.html (eg. already existing https://docs.mozilla-releng.net/projects/treestatus.html, https://docs.mozilla-releng.net/projects/tooltool.html)

Flags: needinfo?(dlabici) → needinfo?(rchilds)

This is outside the scope of this bug. I will try to get around to this as soon as I can.

Flags: needinfo?(rchilds)

We don't use nagios anymore since we migrated to GCP.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.