Closed Bug 1479620 Opened 7 years ago Closed 6 years ago

Move l10n nagios checks from scl3 to mdc

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dlabici, Assigned: dlabici)

Attachments

(2 files)

With BuildBot and SCL3 EOL approaching, we will need to move l10n bumper from bm71 & bm01. Based on the emails between Jordan and Nick, we will need to fix up the nagios configuration in the IT puppet repo to point to the new datacenter and the appropriate host locations. @fubar Do you or the team happen to know where the services/checks will be living after BB/SCL3 dies?
Flags: needinfo?(klibby)
Also, I think it's worth mentioning that, since we are doing this move, it would also be a good idea to remove the buildbot-specific checks from nagios. For example: PING for all the IX machines and the buildbot master checks (command queue, mysql connectivity, puppet freshness, buildbot master age and process, and many more). Is there a bug where all of the things mentioned above are being tracked?
Luckily bm01 & bm71 are both in use1, so they could live on for a time once scl3 expires. In the long run it would make sense to move l10n-bumper into https://github.com/mozilla-releng/treescript, since that's our modern scriptworker approach to pushing things into the tree. Lando might change that, tbd!

(In reply to Bogdan Crisan [:bcrisan] (UTC +3, EEST) from comment #2)
> Also I think that it's worth mentioning that since we are doing this move,
> would also be a good idea to remove the buildbot specific checks from nagios?
>
> For example PING for all the IX machines, buildbot masters checks (command
> queue, mysql connectivity, puppet freshness, buildbot masters age and
> process, and many more..)

Yes, there's a lot of teardown to be done, but I don't know of any bug tracking that yet. We've kind of been using bug 1478215 even though that's a specific service.
FTR, bug 1488913 tracks turning off buildbot. I don't see a patch on bug 1484880 for l10n bumper's nagios check (l10n_bumper_lock). Could ciduty work one up ASAP?
Assignee: nobody → dlabici
Attachment #9007699 - Flags: review?(nthomas)
(In reply to Danut Labici [:dlabici] from comment #5)
> Created attachment 9007699 [details] [diff] [review]
> l10n_bumper_check.patch

I think what we need to do is to move the l10n-bumper-servers hostgroup from the scl3 configs to mdc1, move bm01 and bm77 from releng/scl3.pp to releng/mdc1.pp, and also move the l10n_bumper_lock check from releng/services/scl3.pp to releng/services/mdc1.pp.
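A rough way to sanity-check the three moves described above is to look at which manifest files still mention each identifier. The following is only a sketch under stated assumptions: the helper names are hypothetical, the sample manifest contents are placeholders rather than real Puppet code from the IT puppet repo, and a plain substring search is an assumed stand-in for actually parsing the manifests.

```python
from typing import Dict, List

# Placeholder contents only: these strings are NOT taken from the real repo.
# The paths mirror the ones named in the comment above.
SAMPLE_MANIFESTS: Dict[str, str] = {
    "releng/scl3.pp": "# placeholder: nothing relevant left here",
    "releng/mdc1.pp": "# placeholder mentioning l10n-bumper-servers bm01 bm77",
    "releng/services/scl3.pp": "# placeholder mentioning l10n_bumper_lock",
    "releng/services/mdc1.pp": "# placeholder: check not moved yet",
}


def locate_identifiers(
    manifests: Dict[str, str], identifiers: List[str]
) -> Dict[str, List[str]]:
    """Map each identifier to the sorted manifest paths whose text mentions it."""
    return {
        ident: sorted(path for path, text in manifests.items() if ident in text)
        for ident in identifiers
    }


def leftover_in_old_dc(
    found: Dict[str, List[str]], old_dc: str = "scl3"
) -> List[str]:
    """Return identifiers still mentioned in a file named after the old datacenter."""
    return sorted(
        ident
        for ident, paths in found.items()
        if any(old_dc in path.rsplit("/", 1)[-1] for path in paths)
    )


if __name__ == "__main__":
    found = locate_identifiers(
        SAMPLE_MANIFESTS, ["l10n-bumper-servers", "l10n_bumper_lock"]
    )
    print(found)
    print(leftover_in_old_dc(found))
```

With the placeholder data, the check is reported as still living in the scl3 services file, which corresponds to the part of the move that had not been patched yet at this point in the bug.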
adding myself to NI so I have a reminder for tomorrow.
Flags: needinfo?(dlabici)
Comment on attachment 9007699 [details] [diff] [review]
l10n_bumper_check.patch

Obsoleted by fubar's comment. I'm not a good reviewer for those changes, suggest RelOps/Moc instead.
Attachment #9007699 - Flags: review?(nthomas)
agree; :ryanc reviewed the other checks, so if he's amenable to doing these that'd be great. otherwise jake or I could do it.
I will be on PTO and I somehow missed this bug. @ciduty, can you please check what the status is?
Flags: needinfo?(dlabici) → needinfo?(ciduty)
This is the patch for moving the l10n-bumper-lock check from releng/services/scl3.pp to releng/services/mdc1.pp. I've also checked the following:
- bm01 and bm77 have been moved into releng/mdc1.pp (bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1484880#c26)
- the l10n-bumper-servers hostgroup is set to bm01 and bm77 in mdc1
Could you take a look, please?
Attachment #9010841 - Flags: review?(klibby)
Comment on attachment 9010841 [details] [diff] [review]
l10n-bumper-lock-check.patch

Review of attachment 9010841 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, other than that extra blank line!

::: modules/nagios4/manifests/prod/releng/services/mdc1.pp
@@ +857,5 @@
> + default => [
> + ]
> + }
> + },
> +

Extra blank line here, to remove
Attachment #9010841 - Flags: review?(klibby) → review+
FYI, when this check lands it's likely to hit the same 'CHECK_NRPE STATE CRITICAL: Socket timeout after 60 seconds' error that the bouncer check on bm01 is hitting over in bug 1484880.
CIDuty does not have access to push to this repository. Could someone push this patch or give us write access to the nagios module? Patch: https://bug1479620.bmoattachments.org/attachment.cgi?id=9010841 Thank you!
Did this get landed?
The patch has landed: commit a9250d0c17f73d5ebb0820e074d986945c01d974
This alert came from bm77 and bm01 after the patch was landed. I have acknowledged it: bug 1495920

Fri 23:20:29 UTC [8499] [] buildbot-master77.bb.releng.use1.mozilla.com:L10n bumper lock age is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 15 seconds
Fri 23:24:49 UTC [8500] [] buildbot-master01.bb.releng.use1.mozilla.com:L10n bumper lock age is CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 15 seconds.
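For anyone triaging similar alerts, the two lines above can be split into host, service, state and message. This is a minimal sketch, assuming the line layout seen in those two samples (day, time, `UTC`, a bracketed alert number, a second bracketed field, then `host:service is STATE: message`); the `Alert` type and `parse_alert` function are hypothetical names, not part of nagios or of any bot.

```python
import re
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    alert_id: int
    host: str
    service: str
    state: str
    message: str


# Pattern inferred from the two sample lines in this bug only (an assumption);
# other alert formats may not match.
ALERT_RE = re.compile(
    r"^\w{3} \d{2}:\d{2}:\d{2} UTC "
    r"\[(?P<alert_id>\d+)\] \[[^\]]*\] "
    r"(?P<host>[^:\s]+):(?P<service>.+?) is "
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN): "
    r"(?P<message>.*)$"
)


def parse_alert(line: str) -> Optional[Alert]:
    """Parse one alert line; return None if it does not fit the assumed layout."""
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    return Alert(
        alert_id=int(match.group("alert_id")),
        host=match.group("host"),
        service=match.group("service"),
        state=match.group("state"),
        message=match.group("message"),
    )


if __name__ == "__main__":
    sample = (
        "Fri 23:20:29 UTC [8499] [] "
        "buildbot-master77.bb.releng.use1.mozilla.com:"
        "L10n bumper lock age is CRITICAL: "
        "CHECK_NRPE STATE CRITICAL: Socket timeout after 15 seconds"
    )
    print(parse_alert(sample))
```

Grouping parsed alerts by `message` makes it easier to see that both masters fail with the same NRPE socket timeout instead of a real lock-age problem.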
Bug 1495920 fixed this.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(ciduty)
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard