Closed Bug 1321513 Opened 3 years ago Closed 3 years ago

Add nagios checks for pushapk scriptworkers

Categories

(Release Engineering :: General, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: jlorenzo)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

Nagios config files have been created on the nodes sides[1] (added in bug 1307826). Pushapkworker doesn't rely on the scriptworker shared module yet (introduced in bug 1316702). If I understand bug 1317783 comment 3 correctly, this may not happen immediately (the scriptworker module enables chain of trust AFAUI).

Therefore, I believe we can turn on nagios just like Aki did in bug 1295196

[1] https://hg.mozilla.org/build/puppet/file/8fbdd01cb464/modules/pushapkworker/templates/nagios.cfg.erb
Things have changed since bug 1295196 comment 1, no more SVN repo is required. Releng folks have access to the git repos listed at [1] (and more precisely, the nagios part of it)

[1] https://mana.mozilla.org/wiki/display/SYSADMIN/Git
The patch is totally similar to bug 1295196. Before asking for a formal review to the folks in #moc, I have a question for you, Aki. I'm not sure I need to add the resource "pushapk-scriptworker-procs", because "scriptworker-procs" already exist. Should I create it anyway, or is it better to just add a hostgroup to "scriptworker-procs"?
Attachment #8816127 - Flags: feedback?(aki)
Comment on attachment 8816127 [details] [diff] [review]
add_pushapk_scriptworker_in_nagios.diff

The current idea is to have each scriptworker instance use scriptworker-procs, rather than have one *-procs check per scriptworker instance type, especially since I believe they'll look exactly the same other than the name.  If there isn't a strong reason to add another check, let's reuse the check.
Attachment #8816127 - Flags: feedback?(aki) → feedback+
Here's a new revision where pushapk-scriptworkers is now listed under "scriptworker-procs" instead of a newly created entry. Aki, do you have the rights to review and merge my patch? Or should I ask Amy?
Attachment #8816127 - Attachment is obsolete: true
Attachment #8823219 - Flags: review?(aki)
Oops, I should've added --no-color when generating the diff.
Attachment #8823219 - Attachment is obsolete: true
Attachment #8823219 - Flags: review?(aki)
Attachment #8823221 - Flags: review?(aki)
For the record, the patch has also been rebased on top of the latest master.
Comment on attachment 8823221 [details] [diff] [review]
add_pushapk_scriptworker_in_nagios_rev3.diff

I think :arr is a better reviewer here.
Attachment #8823221 - Flags: review?(aki) → review?(arich)
Comment on attachment 8823221 [details] [diff] [review]
add_pushapk_scriptworker_in_nagios_rev3.diff

This looks fine to me. I presume that /builds/scriptworker/bin/scriptworker already exists on the client and that the nagios config has been modified in releng puppet to include the nrpe check?
Attachment #8823221 - Flags: review?(arich) → review+
The releng had been modified to support the nrpe check[1]. However, thanks for calling out `/builds/scriptworker`!

I noticed pushapk-scriptworker and signing-scriptworker don't share the same root[2]. Therefore, something needs to be changed:
a) either on the releng puppet; we make the root pushapk at /build/scriptworker
b) or on the current nagios config; we extend the regex to support other roots.

I see drawbacks in both scenarios: 
a) if you're logged on a machine, you may not know what type of scriptworker instance you have, unless you dig into its configuration
b) the regex might become a nightmare after several instances being added.

On the other had, having consistency across scriptworker instances is better, and will help setting up new ones. Moreover, if you're logged on a machine, there's great probability its FQDN refers to what type of scriptworker instance you're dealing with. Then, I'll change the puppet configuration.

[1] https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/modules/pushapkworker/manifests/init.pp#l63
[2] Pushapk being defined as: https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/manifests/moco-config.pp#l453
Signing-scriptworker as: https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/modules/signing_scriptworker/manifests/settings.pp#l2
Comment on attachment 8823758 [details]
Bug 1321513 - Make pushapkworker live under /build/scriptworker

https://reviewboard.mozilla.org/r/102242/#review102620

At some point, we should get pushapkworker using modules/scriptworker, and we can make sure all workers using modules/scriptworker have a /builds/scriptworker root.

I think this works until that point.
Attachment #8823758 - Flags: review?(aki) → review+
Even though the puppet run was passing yesterday, a nit slipped through the patch. I updated it.

The latest revision of that patch passed lint tests there: https://github.com/mozilla/build-puppet/pull/24
I tested the changes against dev-linux64-ec2-jlorenzo.dev.releng.use1.m.c:
* /build/pushapkworker doesn't exist anymore
* /build/scriptworker does and contains all the files
* `supervisorctl status pushapkworker` says the process is running
* /builds/scriptworker/logs/worker.log shows the process does what it needs to do (polling TC's queue)

I think we're good to merge and deploy attachment 8823758 [details] (which is necessary to land attachment 8823221 [details] [diff] [review]).
Landed on default [1] and merged in production [2]. Puppet agent applied the changes restarted the process (which restarted properly)[3].

On the production machine, I see now /builds/scriptworker/ but not /build/pushapkworker/ anymore. This complies with [4].

[1] https://hg.mozilla.org/build/puppet/rev/1a429ca163d9
[2] https://hg.mozilla.org/build/puppet/rev/b4d2f4f777e4
[3] https://papertrailapp.com/systems/526654063/events?focus=753674615796166664
[4] https://papertrailapp.com/events?q=program%3Apuppet-agent&focus=753673798523449394
Comment on attachment 8823221 [details] [diff] [review]
add_pushapk_scriptworker_in_nagios_rev3.diff

This patch was landed as revision 568ad3460d032bf498f7bf0e9239938244f64784 on master in the git repo (linked in comment 1).
Attachment #8823221 - Flags: checked-in+
I connected to the releng nagios instance[1] and looked for pushapkwoker. The entry there shows:
* 7 checks (PING + the 6 new checks added with attachment 8823221 [details] [diff] [review]).
* /builds/scriptworker/bin/scriptworker is declared as running. I can tell it's also working, because this job[2] has just finished and it ran fine.
* Every check is green.

Looks like we're all set. Time to close this bug! Thanks for the help Ami and Aki!

[1] Listed at https://mana.mozilla.org/wiki/display/SYSADMIN/Nagios#Nagios-Hardware 
[2] https://tools.taskcluster.net/task-inspector/#Tmlx52SQT9uwVhi9oEqagg/
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
(In reply to Aki Sasaki [:aki] from comment #12)
> At some point, we should get pushapkworker using modules/scriptworker, and
> we can make sure all workers using modules/scriptworker have a
> /builds/scriptworker root.

I agree. I looked more into using this module. I get the feeling I need a greater version than 0.7.1. That is to say, Chain of Trust has to be enabled. Hence, I guess the module can be added alongside bug 1317783.
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.