Closed Bug 1314840 Opened 8 years ago Closed 6 years ago

additional nagios checks for signing scriptworkers

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: sfraser)

Attachments

(3 files)

We already have nagios checks for signing scriptworkers (num processes) [1].  We'd like to add a few more.

First, we want to alert when the number of pending tasks gets too high.  We can do this by calling the pendingTasks endpoint [2].
The credentials, provisionerId, and workerType are all in the config file.  (For signing scriptworker 0.9.0 this is /builds/scriptworker/scriptworker.json; when we move to 1.0.0 this will be /builds/scriptworker/scriptworker.yaml)
We can either write a standalone script or a separate scriptworker endpoint.  For the latter, we'd need to update setup.py [3] and write a function like this [4]:

    context, _ = get_context_from_cmdln(sys.argv[1:])
    loop = asyncio.get_event_loop()
    result = loop.run_until_complete(context.queue.pendingTasks(context.config['provisioner_id'], context.config['worker_type']))

Second, I'd love to have file age checks.  These three files should exist and not be too old (maybe changed in the last hour?):

    /builds/scriptworker/logs/create_initial_gpg_homedirs.log
    /builds/scriptworker/logs/rebuild_gpg_homedirs.log
    /builds/scriptworker/logs/worker.log

And this file should either not exist or be less than an hour old:

    /builds/scriptworker/.gpg_homedirs.lock
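
A minimal sketch of the file-age logic described above, assuming thresholds in seconds and the standard nagios exit codes (0 OK, 1 WARNING, 2 CRITICAL). The function name `check_file_age` and its `optional` flag are hypothetical; this is not the script that was later attached to this bug.

```python
import os
import time

OK, WARNING, CRITICAL = 0, 1, 2  # standard nagios plugin exit codes


def check_file_age(path, warning, critical, optional=False, now=None):
    """Return (exit_code, message) for one file.

    warning/critical are maximum ages in seconds. With optional=True a
    missing file is OK (the .gpg_homedirs.lock case); otherwise a missing
    file is CRITICAL (the log file case).
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        if optional:
            return OK, "%s does not exist (optional)" % path
        return CRITICAL, "%s does not exist" % path
    if age >= critical:
        return CRITICAL, "%s is %d seconds old" % (path, age)
    if age >= warning:
        return WARNING, "%s is %d seconds old" % (path, age)
    return OK, "%s is %d seconds old" % (path, age)
```

The `now` parameter exists only so the age comparison can be exercised without waiting; a real check would leave it unset.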


[1] https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/templates/nagios.cfg.erb
[2] https://docs.taskcluster.net/reference/platform/queue/api-docs#pendingTasks
[3] https://github.com/mozilla-releng/scriptworker/blob/03eb5b8/setup.py#L68-L69
[4] https://github.com/mozilla-releng/scriptworker/blob/03eb5b8/scriptworker/gpg.py#L1511-L1532
Assignee: nobody → sfraser
Comment on attachment 8815712 [details]
Add signing scriptworker monitoring for Bug 1314840

https://reviewboard.mozilla.org/r/96556/#review96870

::: modules/scriptworker/files/nagios_pending_tasks.py:73
(Diff revision 1)
> +    report to nagios
> +    """
> +    args = get_args()
> +
> +    context, credentials = get_context_from_cmdln(sys.argv[1:])
> +    cleanup(context)

Oops, I guess I missed these earlier.

These look problematic... have you tried them?

1. This may be wrong, but sys.argv will have the --warning and --critical options, if specified.  I think you're going to fail out with unrecognized arguments.
2. cleanup(context) nukes the work_dir, artifact_dir, and task_log_dir.  Nagios is going to be running in the background, and tasks are going to be running in scriptworker.  The tasks rely on the work_dir, artifact_dir, and task_log_dir, so every time nagios runs this check while a task is running, we're going to break the task.
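
One way to sidestep the concern in point 1, sketched under assumptions: the body of `get_args()` is not shown in the diff, so the `split_args` helper below is hypothetical, and the defaults of 5 and 10 are placeholders. It uses `argparse`'s `parse_known_args()`, which hands back unrecognised arguments instead of exiting on them.

```python
import argparse


def split_args(argv):
    """Separate the check's own options from everything else.

    parse_known_args() returns (namespace, leftover_list) rather than
    failing on arguments this parser does not know about.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--warning", type=int, default=5)    # placeholder default
    parser.add_argument("--critical", type=int, default=10)  # placeholder default
    return parser.parse_known_args(argv)


args, remaining = split_args(["--warning", "3", "--critical", "7", "config.yaml"])
print(args.warning, args.critical, remaining)
```

Only `remaining` would then be passed on to any second parser, so the threshold options never reach it.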

I think you're mainly creating the context to get the taskcluster queue object to run `queue.pendingTasks`.  However, there's an easier way.

Puppet already knows the taskcluster client id and access token: https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/manifests/init.pp#l67

You can import and call taskcluster.async Queue directly: https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/context.py#L111-L113

You already have your session from aiohttp, and you already have the credentials from puppet.  If you don't want to populate them from puppet, you can read them from /builds/scriptworker/scriptworker.yaml : https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/templates/scriptworker.yaml.erb#l6

I think that'll be cleaner.  Let me know if you have questions?
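
A rough sketch of the threshold step once the queue has been called directly. It assumes the `pendingTasks` response is a dict carrying an integer `pendingTasks` field, as described in the API docs linked from comment 0; the `pending_status` name is hypothetical, and the network call itself is deliberately left out.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard nagios exit codes


def pending_status(response, warning, critical):
    """Map a pendingTasks response dict to (exit_code, message)."""
    try:
        pending = int(response["pendingTasks"])
    except (KeyError, TypeError, ValueError):
        # A malformed response should not look like a healthy queue.
        return UNKNOWN, "UNKNOWN: could not read pendingTasks from response"
    if pending >= critical:
        return CRITICAL, "CRITICAL: %d pending tasks" % pending
    if pending >= warning:
        return WARNING, "WARNING: %d pending tasks" % pending
    return OK, "OK: %d pending tasks" % pending
```

Keeping this part free of any scriptworker context means the check never needs `cleanup(context)` or the worker's directories.
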
Attachment #8815712 - Flags: review?(aki) → review-
Comment on attachment 8815712 [details]
Add signing scriptworker monitoring for Bug 1314840

https://reviewboard.mozilla.org/r/96556/#review96870

> Oops, I guess I missed these earlier.
> 
> These look problematic... have you tried them?
> 
> 1. This may be wrong, but sys.argv will have the --warning and --critical options, if specified.  I think you're going to fail out with unrecognized arguments.
> 2. cleanup(context) nukes the work_dir, artifact_dir, and task_log_dir.  Nagios is going to be running in the background, and tasks are going to be running in scriptworker.  The tasks rely on the work_dir, artifact_dir, and task_log_dir, so every time nagios runs this check while a task is running, we're going to break the task.
> 
> I think you're mainly creating the context to get the taskcluster queue object to run `queue.pendingTasks`.  However, there's an easier way.
> 
> Puppet already knows the taskcluster client id and access token: https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/manifests/init.pp#l67
> 
> You can import and call taskcluster.async Queue directly: https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/context.py#L111-L113
> 
> You already have your session from aiohttp, and you already have the credentials from puppet.  If you don't want to populate them from puppet, you can read them from /builds/scriptworker/scriptworker.yaml : https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/templates/scriptworker.yaml.erb#l6
> 
> I think that'll be cleaner.  Let me know if you have questions?

The argument parsing seemed to work, but you're right it does seem more sensible to remove any potential conflict. I'll look at the changes tomorrow.
Attachment #8816105 - Flags: review?(aki)
Comment on attachment 8816105 [details]
Update the pending tasks check to no longer use/wipe the local context

https://reviewboard.mozilla.org/r/96900/#review97174

Thank you!
Attachment #8816105 - Flags: review?(aki) → review+
Changes pushed
Hey Simon,
Is this bug ready to be closed?
I'm wondering if you've seen any nagios alerts about the queue?  We'll probably see more load in January once we start signing all mozilla-central pushes, but verifying this works beforehand may help avoid last minute fixes.
I've not seen any alerts, but I'm also not sure I'm set up to receive any. Where would they go by default?
Probably to #platform-ops-alerts, which is pw protected, or #buildduty.
I don't see your gpg key info in the private releng git repo; have you added yourself? https://mana.mozilla.org/wiki/display/RelEng/Passwords
I dug through the alerts in #platform-ops-alerts and don't see any that match 'signing-' or 'scriptworker'... it's possible we just haven't hit the alert threshold yet.
mana page is not found, presumably I need to be in an extra group there, as I can't access /RelEng/
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #11)
> mana page is not found, presumably I need to be in an extra group there, as
> I can't access /RelEng/

at https://ldapadmin1.private.scl3.mozilla.com , do you have the following groups?
cn=IntranetWiki,ou=groups,dc=mozilla
cn=RelEngWiki,ou=groups,dc=mozilla
cn=vpn_relengwiki,ou=groups,dc=mozilla

You need to vpn in to access that site.

I'm going to guess it's the 2nd one.  We can file a bug to add you.  If we're missing the correct group on the Day 1 checklist, let's update it :)
https://wiki.mozilla.org/ReleaseEngineering/Day_1_Checklist#LDAP.2C_SSH.2C_VPN
Flags: needinfo?(sfraser)
Have filed bug 1328233, cn=RelEngWiki isn't on the day 1 checklist. Once I can confirm it gives me access, I'll add it.
Flags: needinfo?(sfraser)
Simon: are we unblocked here, i.e. do you have mana access now?
Flags: needinfo?(sfraser)
We're unblocked, I can see the IRC channel although I've not yet seen any alerts go past. Is there a history in the nagios web interface that can show this?
Flags: needinfo?(sfraser)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #15)
> We're unblocked, I can see the IRC channel although I've not yet seen any
> alerts go past. Is there a history in the nagios web interface that can show
> this?

I don't see anything in https://nagios.mozilla.org/releng-scl3/cgi-bin/history.cgi?host=all
ok, I will poke at it tomorrow and try to force an alert
I was able to force an alert for signing-linux-1 by killing the scriptworker process. That's working, but the queue + file ages aren't.

I'm now guessing we haven't added the alert to the sysadmin puppet module [1].  I think we need to patch modules/nagios/manifests/releng/services.pp in there.  If we're able to get the same alerts for all scriptworker instances that use the shared scriptworker puppet module [2], that would be ideal.  If we only add it to the signing scriptworkers, that's a good start.

Do you have time+headspace to keep looking at this?  If you need me to help or take over, let me know.  Otherwise I'm going to keep pointing you in what is hopefully the right direction :)

[1] https://mana.mozilla.org/wiki/display/SYSADMIN/Git#Git-Production
[2] https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker
Flags: needinfo?(sfraser)
I can keep working on it, I just didn't have the access to test that it triggered.
Flags: needinfo?(sfraser)
Was it this check_procs that alerted?

https://hg.mozilla.org/build/puppet/file/tip/modules/signingworker/templates/nagios.cfg.erb

Or this one from sysadmin puppet?

},
        'signing-worker-procs' => {
            service_description => 'procs - signing-worker',
            contact_groups => 'build',
            check_command => 'check_nrpe_procs_regex!/builds/signingworker/bin/signing-worker!1!1',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-workers'
                ],
                default => [
                ]
            }
        },
(there's an equivalent for this for signing-scriptworkers, which also has these age/pending checks configured in nagios.cfg)
How does the following look?

I'm nervous about the literal paths for /builds/scriptworker/ but the config definition for those is in a separate puppet

        "service_file_age" => {
            service_description => "Signing Scriptworker optional file ages",
            check_command       => 'nagios_check_file_ages.py!45!60!--optional!--from-file /builds/scriptworker/file_age_check_optionals.txt',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers',
                ],
                default => [
                ]
            }
        },

        "service_file_age" => {
            service_description => "Signing Scriptworker file ages",
            check_command       => 'nagios_check_file_ages.py!45!60!--from-file /builds/scriptworker/file_age_check_required.txt',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers',
                ],
                default => [
                ]
            }
        },

        "service_queue_age" => {
            service_description => "Pending Scriptworker Tasks",
            check_command       => 'nagios_pending_tasks.py!5!10',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers'
                ],
                default => [
                ]
            }
        },
Flags: needinfo?(aki)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #20)
> Was it this check_procs that alerted?
> 
> https://hg.mozilla.org/build/puppet/file/tip/modules/signingworker/templates/nagios.cfg.erb
> 
> Or this one from sysadmin puppet?
> 
> },
>         'signing-worker-procs' => {
>             service_description => 'procs - signing-worker',
>             contact_groups => 'build',
>             check_command => 'check_nrpe_procs_regex!/builds/signingworker/bin/signing-worker!1!1',
>             hostgroups => $::fqdn ? {
>                 'nagios1.private.releng.scl3.mozilla.com' => [
>                     'signing-workers'
>                 ],
>                 default => [
>                 ]
>             }
>         },
> (there's an equivalent for this for signing-scriptworkers, which also has
> these age/pending checks configured in nagios.cfg)

This is the alert I saw...  I'm going to guess it's the sysadmin puppet one.  Which makes me wonder if we can remove the nagios erb's from the various scriptworker modules.

[2017-01-17 20:31:51] <nagios-releng> Tue 12:31:51 PST [4070] signing-linux-1.srv.releng.use1.mozilla.com:procs - scriptworker is CRITICAL: PROCS CRITICAL: 0 processes with regex args /builds/scriptworker/bin/scriptworker (http://m.mozilla.org/procs+-+scriptworker)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #21)
> How does the following look?
> 
> I'm nervous about the literal paths for /builds/scriptworker/ but the config
> definition for those is in a separate puppet

I'm currently solving that by requiring all production scriptworker instances install into /builds/scriptworker.

> 
>         "service_file_age" => {
>             service_description => "Signing Scriptworker optional file ages",

Maybe s,Signing ,, in all the descriptions.
Ideally we don't have to specify these checks for each instance type.  As long as balrog, beetmover, pushapk, etc scriptworkers live in /builds/scriptworker, these checks should work for each.

Same goes for the other checks.  Otherwise, lgtm, though maybe :arr or other relops person should review.  Thank you!
Flags: needinfo?(aki)
updated, and requesting input from relops:

        "service_optional_file_age" => {
            service_description => "Scriptworker optional file ages",
            check_command       => 'nagios_check_file_ages.py!45!60!--optional!--from-file /builds/scriptworker/file_age_check_optionals.txt',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers',
                ],
                default => [
                ]
            }
        },

        "service_file_age" => {
            service_description => "Scriptworker file ages",
            check_command       => 'nagios_check_file_ages.py!45!60!--from-file /builds/scriptworker/file_age_check_required.txt',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers',
                ],
                default => [
                ]
            }
        },

        "service_queue_age" => {
            service_description => "Pending Scriptworker Tasks",
            check_command       => 'nagios_pending_tasks.py!5!10',
            hostgroups => $::fqdn ? {
                'nagios1.private.releng.scl3.mozilla.com' => [
                    'signing-scriptworkers'
                ],
                default => [
                ]
            }
        },
Flags: needinfo?(jwatkins)
For clarity, just want to make sure that the above will do what we expect (that is, run the commands with the given arguments, report errors)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #24)
> updated, and requesting input from relops:
> 
>         "service_optional_file_age" => {
>             service_description => "Scriptworker optional file ages",
>             check_command       => 'nagios_check_file_ages.py!45!60!--optional!--from-file /builds/scriptworker/file_age_check_optionals.txt',


AFAIK, this will not work.  You will need to also define a 'check_command' under puppet/modules/nagios/manifests/mozilla/checkcommands.pp which passes over to nrpe.  The check_command under services.pp should specify the check (as defined in checkcommands.pp) and any variable arguments to be passed.

eg. 'nagios_check_file_ages!$ABS_PATH_TO_FILE!$WARNING_THRESHOLD_INT!$ERROR_THRESHOLD_INT'

Those args are then interpolated with the string defined in checkcommands.pp which should then call nrpe with the passed arguments, including the name of the check as defined on the host itself ( https://hg.mozilla.org/build/puppet/file/tip/modules/nrpe/manifests/check.pp#l19 )
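
To illustrate the interpolation described above: nagios splits the service's `check_command` on `!`, takes the first piece as the command name, and substitutes the rest for `$ARG1$`, `$ARG2$`, and so on in that command's definition. The sketch below only mimics that substitution; the command line in `command_lines` is a made-up stand-in, not the real contents of checkcommands.pp.

```python
def expand_check_command(check_command, command_lines):
    """Mimic nagios $ARGn$ substitution for a 'name!arg1!arg2' string."""
    name, *args = check_command.split("!")
    line = command_lines[name]
    for index, value in enumerate(args, start=1):
        line = line.replace("$ARG%d$" % index, value)
    return line


# Hypothetical command definition, for illustration only.
command_lines = {
    "nagios_check_file_ages":
        "$USER1$/check_nrpe -H $HOSTADDRESS$ -c check_file_age -a $ARG1$ $ARG2$ $ARG3$",
}

print(expand_check_command(
    "nagios_check_file_ages!/builds/scriptworker/logs/worker.log!2700!3600",
    command_lines,
))
```

Other macros such as `$USER1$` and `$HOSTADDRESS$` are left untouched here; nagios fills those in from its own configuration.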

If you want to check the nrpe response from the host before modifying nagios, you can log into nagios1.private.releng.scl3.mozilla.com and exec /usr/lib64/nagios/plugins/check_nrpe directly
Flags: needinfo?(jwatkins)
I don't have access to nagios1.private.releng, and the rest of the team appears not to, either.

I suspect that the commands to try are these:

/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 30 -c nagios_pending_tasks.py -w 0 -c 1
/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 15 -c check_file_age -a "-w 2700 -c 3600 -W -0 -C -0 -f /builds/scriptworker/logs/worker.log"
/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 15 -c check_file_age_ok_not_exists -a 2700 3600 /builds/scriptworker/.gpg_homedirs.lock
A few things:

1) I forgot that check_nrpe is installed on all hosts, so you can test calling your check through check_nrpe directly on localhost
2) The /etc/nagios/nrpe.d/signingworker.cfg is missing ']' on a couple lines. You should also use $ARG1$ $ARG2$ etc in that file rather than hardcoding threshold numbers
3) The nagios_pending_tasks.py should be world exec (0755)
4) You should be able to run the check script by hand and see the formatted output nagios is expecting.

eg. ./check_ntp_peer -H localhost
NTP OK: Offset 0.011208 secs|offset=0.011208s;60.000000;120.000000;

5) It looks like the ./nagios_pending_tasks.py script is missing the asyncio python lib.   I didn't look at the other script.
Traceback (most recent call last):
  File "./nagios_pending_tasks.py", line 18, in <module>
    import asyncio
ImportError: No module named asyncio

I believe you can use a virtualenv but that will need to be sourced properly on the [check] line in /etc/nagios/nrpe.d/check_ntp_peer.cfg file
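
For point 4, the output nagios expects is one line of status text, optionally followed by `|` and performance data in `label=value;warn;crit;` form, as in the check_ntp_peer example. A small sketch of building such a line; the `plugin_output` helper and the "PENDING" service name are hypothetical.

```python
def plugin_output(service, status, text, label, value, warning, critical):
    """Format one nagios plugin line: 'SERVICE STATUS: text|label=value;warn;crit;'."""
    return "%s %s: %s|%s=%s;%s;%s;" % (
        service, status, text, label, value, warning, critical,
    )


print(plugin_output("PENDING", "OK", "2 pending tasks", "pending", 2, 5, 10))
```

The script would print this line and then exit with the matching code (0, 1, or 2) so that nagios can read both.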


Hope that helps!
Depends on: 1332640
(In reply to Jake Watkins [:dividehex] from comment #28)
> 5) It looks like the ./nagios_pending_tasks.py script is missing the asyncio
> python lib.   I didn't look at the other script.
> Traceback (most recent call last):
>   File "./nagios_pending_tasks.py", line 18, in <module>
>     import asyncio
> ImportError: No module named asyncio
> 
> I believe you can use a virtualenv but that will need to be sourced properly
> on the [check] line in /etc/nagios/nrpe.d/check_ntp_peer.cfg file

The scriptworker venv in /builds/scriptworker/ should have py35, which has asyncio bundled.  That would be /builds/scriptworker/bin/python, though we should use ${basedir}/bin/python like https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/manifests/instance.pp#l12 .
I think we can call this fixed, no? Thank you!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: General Automation → General