Closed Bug 1496112 Opened 6 years ago Closed 6 years ago

add nagios check for taskcluster worker processes

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

(Blocks 1 open bug)

Details

User Story

nagios service check configuration options: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#service

Attachments

(5 files, 1 obsolete file)

The taskcluster worker processes can be stopped without rebooting the machine. This can happen on OSX and Windows because there is no task isolation and the test processes can kill the worker. Nagios's nrpe can execute a process check to alert if the worker processes are not running.
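As a rough sketch of how that nrpe process check could be wired up (the command name "check_gw_procs" and the check_nrpe plugin path are illustrative assumptions, not taken from our puppet config):
```
# Hypothetical nrpe command definition on the worker (would live in nrpe.cfg;
# "check_gw_procs" is a placeholder name):
#   command[check_gw_procs]=/usr/local/libexec/check_procs -C generic-worker -c 1:1
#
# The nagios server would then poll it with the standard check_nrpe plugin
# (plugin path varies by install):
/usr/lib/nagios/plugins/check_nrpe -H t-yosemite-r7-100.test.releng.mdc2.mozilla.com -c check_gw_procs
```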
First, osx: I'll start by adding nrpe onto a staging osx machine and configuring the nrpe call from nagios to trigger the process check. Then we can run some tests through the staging/beta pool and verify that the time nrpe takes to execute does not affect test times.
I pinned t-yosemite-r7-100 to my puppet environment with nrpe added for generic-worker testers, and then checked that the nrpe "check_procs" test can find the correct processes:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -C bash -a "run-generic-worker.sh" -c 1:1
PROCS OK: 1 process with command name 'bash', args 'run-generic-worker.sh'
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -C generic-worker -c 1:1
PROCS OK: 1 process with command name 'generic-worker'
```
Maybe there is a better way to check for the parent "run-generic-worker.sh" and its children, generic-worker and the log capture:
```
[root@t-yosemite-r7-098.test.releng.mdc2.mozilla.com ~]# ps -ef|grep worker
  28   319     1   0  9:20AM ??   0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
  28   321   319   0  9:20AM ??   0:00.71 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
  28   322   319   0  9:20AM ??   0:00.02 logger -t generic-worker -s
```
Checking the modified age on the ~cltbld/tasks/ directory may also be helpful because it shows that the worker is active:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_file_age -w 14400 -c 86400 -f ~cltbld/tasks
FILE_AGE WARNING: /Users/cltbld/tasks is 81554 seconds old and 102 bytes
```
If ~cltbld/tasks/task_*/logs exists, then that is the active task's logs; otherwise, the worker is idle.
```
#idle:
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# ls -la ~cltbld/tasks/task*
total 0
drwxr-xr-x  3 cltbld  staff  102 Oct  2 10:41 .
drwxr-xr-x  3 cltbld  staff  102 Oct  2 10:41 ..
drwxr-xr-x  2 cltbld  staff   68 Oct  2 10:41 generic-worker

#active task:
[root@t-yosemite-r7-098.test.releng.mdc2.mozilla.com ~]# ls -la ~cltbld/tasks/task*
total 144792
drwxr-xr-x   7 cltbld  staff       238 Oct  3 09:27 .
drwxr-xr-x   3 cltbld  staff       102 Oct  3 09:27 ..
drwxr-xr-x   5 cltbld  staff       170 Oct  3 09:27 build
drwxr-xr-x   3 cltbld  staff       102 Oct  3 09:27 generic-worker
-rw-r--r--   1 cltbld  staff  74130844 Oct  3 09:27 installer.dmg
drwxr-xr-x   9 cltbld  staff       306 Oct  3 09:27 logs
drwxr-xr-x  22 cltbld  staff       748 Oct  3 09:27 mozharness
```
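As a rough sketch of that logs-directory heuristic (not a deployed check, just the idle/active distinction described above scripted out):
```
#!/bin/bash
# If the current task directory contains a logs/ subdirectory, a task is
# running; otherwise the worker is idle.
if ls -d ~cltbld/tasks/task_*/logs >/dev/null 2>&1; then
    echo "worker active: task logs present under ~cltbld/tasks/"
else
    echo "worker idle: no task logs under ~cltbld/tasks/"
fi
```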
And eventually, since we will have nrpe, we could try checking puppet freshness:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_puppet_freshness -t 160000
Puppet agent 3.7.0 running catalog "b1b3c1999b37+"
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_puppet_freshness -t 10000
Last run was 82497 seconds ago
```
It may need to have a threshold of 4 days+ for linux (until/unless reboots happen between tasks again).
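For reference, a 4-day threshold for the same check would be roughly (the value is just arithmetic here, not a deployed setting):
```
# 4 days expressed in seconds: 4 * 24 * 3600 = 345600
/usr/local/libexec/check_puppet_freshness -t 345600
```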
Attached file GitHub Pull Request (obsolete) —
The historical reason we don't have nrpe running on testers is the same reason we don't have puppet run as a daemon: it affects tests. We should probably discuss whether that still stands true or not, and if so, how we can work around it.
(In reply to Jake Watkins [:dividehex] from comment #6)
> The historical reason we don't have nrpe running on testers is the same
> reason we don't have puppet run as a daemon, it affects tests. We should
> probably discuss whether that still stands true or not. And if so, how can
> we work around it.

Sounds good. If we do use nrpe, maybe we can make the interval for the check wide enough to have minimal impact (like once per hour?). The individual nrpe tests may have no impact (such as for a process check), but I think the nrpe service could use a percentage (need to test).

Maybe we can find an alternative, like checking the taskcluster queue to see when a worker last reported (and have nagios check those or something). Or, if the nrpe service is heavy, maybe we just trigger a process check through cron that reports to ___ (papertrail?).
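To make the cron alternative concrete, a hypothetical /etc/cron.d-style entry could look like the following (the schedule, log tag, and destination are assumptions, not a deployed config; papertrail would pick the line up from syslog if we went that route):
```
# Run the process check every 20 minutes and report the result to syslog via
# logger, where an aggregator such as papertrail could collect it.
*/20 * * * * root /usr/local/libexec/check_procs -C generic-worker -c 1:1 2>&1 | logger -t worker-proc-check
```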
There is one proc check set up in the IT puppet for the releng signing servers:

modules/nagios4/manifests/prod/releng/services/mdc2.pp:
```
    'child_procs_regex' => {
        service_description  => 'procs - signing-server',
        check_interval       => 5,
        retry_interval       => 3,
        max_check_attempts   => 4,
        notification_options => 'w,c,r,u',
        contact_groups       => 'build',
        check_command        => 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
        hostgroups           => $nagiosbot ? {
            'nagios-releng-mdc2' => [ 'mac-signing-servers', 'signing-servers' ],
            default              => [ ]
        }
    },
```
(In reply to Jake Watkins [:dividehex] from comment #6)
> The historical reason we don't have nrpe running on testers is the same
> reason we don't have puppet run as a daemon, it affects tests. We should
> probably discuss whether that still stands true or not. And if so, how can
> we work around it.

Yes, and... we don't need to run the check every minute. NRPE running on its own shouldn't affect tests any more than any other daemon, though it would be nice to know for certain.
modules/nagios4/manifests/prod/releng/services/mdc2.pp:
```
    check_command => 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
```

modules/nrpe/manifests/check/child_procs_regex.pp:
```
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
class nrpe::check::child_procs_regex {
    include nrpe::settings
    $plugins_dir = $nrpe::settings::plugins_dir

    nrpe::check {
        'check_child_procs_regex':
            cfg => "${plugins_dir}/check_procs -c \$ARG3\$:\$ARG4\$ --ereg-argument-array=\$ARG1\$ -p \$ARG2\$";
    }
}
```

check_procs usage:
```
Usage:
 check_procs -w <range> -c <range> [-m metric] [-s state] [-p ppid]
   [-u user] [-r rss] [-z vsz] [-P %cpu] [-a argument-array]
   [-C command] [-t timeout] [-v]

Options:
 -h, --help
    Print detailed help screen
 -V, --version
    Print version information
 -w, --warning=RANGE
    Generate warning state if metric is outside this range
 -c, --critical=RANGE
    Generate critical state if metric is outside this range
 -m, --metric=TYPE
    Check thresholds against metric. Valid types:
    PROCS   - number of processes (default)
    VSZ     - virtual memory size
    RSS     - resident set memory size
    CPU     - percentage CPU
 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
 -v, --verbose
    Extra information. Up to 3 verbosity levels

Filters:
 -s, --state=STATUSFLAGS
    Only scan for processes that have, in the output of `ps`, one or
    more of the status flags you specify (for example R, Z, S, RS,
    RSZDT, plus others based on the output of your 'ps' command).
 -p, --ppid=PPID
    Only scan for children of the parent process ID indicated.
 -z, --vsz=VSZ
    Only scan for processes with VSZ higher than indicated.
 -r, --rss=RSS
    Only scan for processes with RSS higher than indicated.
 -P, --pcpu=PCPU
    Only scan for processes with PCPU higher than indicated.
 -u, --user=USER
    Only scan for processes with user name or ID indicated.
 -a, --argument-array=STRING
    Only scan for processes with args that contain STRING.
 --ereg-argument-array=STRING
    Only scan for processes with args that contain the regex STRING.
 -C, --command=COMMAND
    Only scan for exact matches of COMMAND (without path).
```

Testing with a regex count of child procs:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -v -c 1:3 --ereg-argument-array=run-generic-worker.sh\|generic-worker
PROCS OK: 3 processes with regex args 'run-generic-worker.sh,generic-worker'
```
(In reply to Dave House [:dhouse] from comment #10)
[...]
> 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
[...]
> cfg => "${plugins_dir}/check_procs -c \$ARG3\$:\$ARG4\$ --ereg-argument-array=\$ARG1\$ -p \$ARG2\$";
[...]
> -p, --ppid=PPID
>    Only scan for children of the parent process ID indicated.
[...]
> [root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -v -c 1:3 --ereg-argument-array=run-generic-worker.sh\|generic-worker
> PROCS OK: 3 processes with regex args 'run-generic-worker.sh,generic-worker'

The "-p" is hardcoded in, so I cannot make it count children except of a known pid. But for a draft test, it will work to find the run-generic-worker.sh process.
Draft for adding beta osx releng workers for IT-puppet nagios nrpe process check.
Attachment #9014218 - Flags: review?(rchilds)
Attachment #9014218 - Flags: review?(jwatkins)
Attachment #9014218 - Flags: review?(dcrisan)
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

LGTM
Attachment #9014218 - Flags: review?(rchilds) → review+
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

Review of attachment 9014218 [details] [diff] [review]:
-----------------------------------------------------------------

r+ but see inline comments

::: modules/nagios4/manifests/prod/releng/hostgroups.pp
@@ +110,5 @@
>         },
>         't-yosemite-r7-machines' => {
>             alias => "OS X 10.10 talos servers"
>         },
> +       't-yosemite-r7-machines-beta' => {

Is there a reason we are calling the staging minis, beta? Are these exclusive to Firefox Beta? If not, we should probably rename them to staging.
Attachment #9014218 - Flags: review?(jwatkins) → review+
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

LGTM
Attachment #9014218 - Flags: review?(dcrisan) → review+
(In reply to Jake Watkins [:dividehex] from comment #14)
> Comment on attachment 9014218 [details] [diff] [review]
> add nrpe process check for osx staging workers
>
> Review of attachment 9014218 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> r+ but see inline comments
>
> ::: modules/nagios4/manifests/prod/releng/hostgroups.pp
> @@ +110,5 @@
> >         },
> >         't-yosemite-r7-machines' => {
> >             alias => "OS X 10.10 talos servers"
> >         },
> > +       't-yosemite-r7-machines-beta' => {
>
> Is there a reason we are calling the staging minis, beta? Are these
> exclusive to Firefox Beta? If not, we should probably rename them to
> staging.

This matches the staging worker pool name "gecko-t-osx-1010-beta". I'll name it "-staging" instead since that's what we call the pool.
Attachment #9014218 - Flags: checked-in+
I'm not yet merging the nrpe install/setup for the releng-puppet test/worker type since I don't have that restricted to the staging group. Instead I'll keep it pinned to my env, and only merge it if we want nrpe on all the osx workers.

I'm expecting the new proc check to show up for the t-yosemite-r7-100 entry (once the nagios config change is applied from the IT puppet repo):
https://nagios1.private.releng.mdc2.mozilla.com/releng-mdc2/cgi-bin/status.cgi?host=t-yosemite-r7-100.test.releng.mdc2.mozilla.com
We can also collect stats about the processes. This might allow us to set up a dashboard showing machines not running the worker, running multiple tasks, and so on:

statsd -> collectd -> graphite host (https://graphite-mdc1.mozilla.org/dashboard/)

We already had statsd+collectd on every worker posting host data (cpu, etc.), so adding a report of the process or tasks would add little overhead (less than nagios, since we do not have nrpe on the workers). I tested this and was able to send arbitrary stats, which appeared namespaced under the host I posted them from (hosts.t-linux64-ms-280_test_releng_mdc1_mozilla_com.statsd.gauge.).
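A sketch of what sending such a stat could look like from a worker (the metric name and the localhost:8125 statsd address are placeholders; the real collectd/statsd setup on the workers may listen elsewhere):
```
# Count processes whose command line mentions generic-worker and push the
# value as a statsd gauge over UDP.
count=$(pgrep -f generic-worker | wc -l | tr -d ' ')
echo "generic_worker.procs:${count}|g" | nc -u -w1 localhost 8125
```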
Joel, can you check if there is a performance impact with nagios checking the mac workers?

I put nrpe on the four osx staging workers, polling 1/min. And I ran the tests over these machines yesterday:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=267c79d06585c9af2c5c1a84a2e00f08173baa1d

I ran on try twice with the same commit, but perfherder doesn't show a comparison for me.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bc0383a538c7d86972ae620340da89c4974a094
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1c50247aa9037267775c29e91785b7dfc9b7cca
Flags: needinfo?(jmaher)
(In reply to Dave House [:dhouse] from comment #19)
> I put nrpe on the four osx staging workers, polling 1/min. And I ran the

I set the polling for the worker process check to 1/min because that is the most frequent interval I think nagios can check at, and I wanted to make any performance impact large enough to notice. In actual use, we will not check this frequently; I think we might only check for the worker processes every 20 minutes.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #20)
> we need more data points to determine a change, I retriggered more on:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1c50247aa9037267775c29e91785b7dfc9b7cca

Thank you!
this took a while to analyze since there was a bug in perfherder compare. I fixed this locally and analyzed. I find the noise is a bit higher with the try push, focused more heavily on startup tests (which are already noisy to begin with, especially on osx).

The rest of the results are reasonable - I am more concerned about noise than raw results. I assume a different cycle, >1 minute and closer to 20 minutes, would be just fine.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #23)
> this took a while to analyze since there was a bug in perfherder compare. I
> fixed this locally and analyzed. I find the noise is a bit higher with the
> try push, focused more heavily on startup tests (which are already noisy to
> begin with, especially on osx).
>
> The rest of the results are reasonable- I am more concerned about noise
> rather than raw results. I assume a different cycle >1 minute and closer to
> 20 minutes, would be just fine.

Thank you! Would you like me to test with a 20 minute cycle in staging? Or are you okay with my rolling this out to production?
Flags: needinfo?(jmaher)
Let's test with 20 if that is what you plan to roll out.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #25) > lets test with 20 if that is what you plan to roll out Thanks! I've increased the check interval to 20min. Once I see the change applied in nagios, I'll push some tests to try for production and staging.
User Story: (updated)
(In reply to Dave House [:dhouse] from comment #26)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #25)
> > lets test with 20 if that is what you plan to roll out
>
> Thanks! I've increased the check interval to 20min.
>
> Once I see the change applied in nagios, I'll push some tests to try for
> production and staging.

I ran this on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f64aa6f1f7cc999f62501845d0f79303d3377ec
and on staging (with a 20 minute nagios check cycle):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=40eade083ad55944a19db930f81623d57159d70d

The staging one has not completed yet (at 29% after 6 hours, so it may not be finished until tomorrow afternoon).
we have more data and here are the results:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2f64aa6f1f7c&newProject=try&newRevision=40eade083ad55944a19db930f81623d57159d70d&framework=1

you can see some improvements on:
cpstartup
sessionrestore_many_windows (but not other session restore)

and oddly enough tp6 seems to have increased a bit. Noise levels are the same if not slightly lower overall; this is good.

I am not sure why we have improvements that are measured; it looks like we have bimodal data that is now a single mode - not sure how nagios is affecting that.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #28)
> we have more data and here are the results:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2f64aa6f1f7c&newProject=try&newRevision=40eade083ad55944a19db930f81623d57159d70d&framework=1
>
> you can see some improvements on:
> cpstartup
> sessionrestore_many_windows (but not other session restore)
>
> and oddly enough tp6 seems to have increased a bit. Noise levels are the
> same if not slightly lower overall, this is good.
>
> I am not sure why we have improvements that are measured, it looks like we
> have bimodal data that is no a single mode- not sure how nagios is affecting
> that.

Two (half) of the staging minis are running the new firmware, so that could be the cause of the improvements. Or maybe having the nagios daemon running changed the scheduling and memory management slightly. I'm curious to see if the improvement persists when we run with nagios on more machines.

Are you okay with my adding the nagios check to the production machines? If a gradual roll-out sounds good to you, I'll start with 10%, then wait 2 days to watch for issues, and then we can increase to 50% if things look good. After a few more days I'll increase to all of the production pool.
Flags: needinfo?(jmaher)
I would prefer to go to 100% instead of gradual - with a gradual roll-out, perf improvements show up as odd staggered bi-modal improvements. If we are concerned, I would rather do 10% of production, then 100%, ideally on a fixed schedule (Nov 5th 10%, Nov 12th 100%).
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #30)
> I would prefer to go to 100% instead of gradual- then for perf improvements
> we have odd staggered bi-modal improvements. If we are concerned, I would
> rather do 10% of production, then 100%, ideally on a fixed schedule (Nov 5th
> 10%, Nov 12th 100%)

I understand. Let's go for 100% then; I'm confident we won't see any problems. I'll prepare the patch for the change and plan to apply it tomorrow.
Attachment #9021907 - Flags: review?(jwatkins)
Attachment #9021907 - Flags: review?(jwatkins) → review+
Attachment #9021905 - Flags: checked-in+
Attachment #9014077 - Attachment is obsolete: true
Attachment #9021907 - Flags: checked-in+
I've installed nrpe onto the linux staging workers (280, 394, 395; skipped 240 as it looks like someone is testing on it (systemd?)).
Attachment #9025463 - Flags: checked-in+
Blocks: 1518914
Attachment #9025453 - Flags: checked-in+

These checks have been running, and CIDuty has pointed me to them when there are problems.
example for linux in mdc1: https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-linux64-moonshot&style=overview

I will leave them as not alerting via irc for every worker for now; there is some noise in them because nagios does not know about quarantined workers or other activity, so they are not helpful as an alert, but they are helpful for checking state and monitoring.
If we keep using nagios long-term, and do not rely on other monitoring for alerts on workers, then it may be useful to change these alerts to post to irc and notify the production group. Or it may be better to change them to a percentage alert, and alarm when N% of the workers are down.
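A rough sketch of what such a percentage-style check could look like (the host list, threshold, plugin path, and the check_gw_procs command name are illustrative assumptions, not a deployed config):
```
#!/bin/bash
# Alarm (nagios CRITICAL, exit 2) when at least THRESHOLD_PCT of the listed
# workers fail the worker-process check over nrpe.
HOSTS=(t-yosemite-r7-001.test.releng.mdc2.mozilla.com
       t-yosemite-r7-002.test.releng.mdc2.mozilla.com)
THRESHOLD_PCT=20
CHECK_NRPE=/usr/lib/nagios/plugins/check_nrpe

down=0
for h in "${HOSTS[@]}"; do
    "$CHECK_NRPE" -H "$h" -c check_gw_procs >/dev/null 2>&1 || down=$((down + 1))
done

pct=$((100 * down / ${#HOSTS[@]}))
if [ "$pct" -ge "$THRESHOLD_PCT" ]; then
    echo "CRITICAL: ${down}/${#HOSTS[@]} workers (${pct}%) not running the worker process"
    exit 2
fi
echo "OK: ${down}/${#HOSTS[@]} workers (${pct}%) not running the worker process"
exit 0
```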

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED