Closed Bug 1496112 Opened 6 years ago Closed 6 years ago

add nagios check for taskcluster worker processes

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Assigned: dhouse)

References

(Blocks 1 open bug)

Details

User Story

nagios service check configuration options: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html#service

Attachments

(5 files, 1 obsolete file)

The taskcluster worker processes can be stopped without rebooting the machine. This can happen on OSX and Windows because there is no task isolation and the test processes can kill the worker. Nagios's nrpe can execute a process check to alert if the worker processes are not running.
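As a rough sketch of how that nrpe process check could be wired up (the command name "check_gw_procs" and the check_nrpe plugin path are illustrative assumptions, not taken from our puppet config):
```
# Hypothetical nrpe command definition on the worker (would live in nrpe.cfg;
# "check_gw_procs" is a placeholder name):
#   command[check_gw_procs]=/usr/local/libexec/check_procs -C generic-worker -c 1:1
#
# The nagios server would then poll it with the standard check_nrpe plugin
# (plugin path varies by install):
/usr/lib/nagios/plugins/check_nrpe -H t-yosemite-r7-100.test.releng.mdc2.mozilla.com -c check_gw_procs
```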
First, osx: I'll start by adding nrpe onto a staging osx machine and configuring the nrpe call from nagios to trigger the process check. Then we can run some tests through the staging/beta pool and verify that the time nrpe takes to execute does not affect test times.
I pinned t-yosemite-r7-100 to my puppet environment with nrpe added for generic-worker testers, and then checked that the nrpe "check_procs" test can find the correct processes:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -C bash -a "run-generic-worker.sh" -c 1:1
PROCS OK: 1 process with command name 'bash', args 'run-generic-worker.sh'
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -C generic-worker -c 1:1
PROCS OK: 1 process with command name 'generic-worker'
```
Maybe there is a better way to check for the parent "run-generic-worker.sh" and its children, generic-worker and the log capture:
```
[root@t-yosemite-r7-098.test.releng.mdc2.mozilla.com ~]# ps -ef|grep worker
  28   319     1   0  9:20AM ??   0:00.00 /bin/bash /usr/local/bin/run-generic-worker.sh run --config /etc/generic-worker.config
  28   321   319   0  9:20AM ??   0:00.71 /usr/local/bin/generic-worker run --config /etc/generic-worker.config
  28   322   319   0  9:20AM ??   0:00.02 logger -t generic-worker -s
```
Checking the modified age on the ~cltbld/tasks/ directory may also be helpful because it shows that the worker is active:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_file_age -w 14400 -c 86400 -f ~cltbld/tasks
FILE_AGE WARNING: /Users/cltbld/tasks is 81554 seconds old and 102 bytes
```
If ~cltbld/tasks/task_*/logs exists, then that is the active task's logs; otherwise, the worker is idle.
```
#idle:
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# ls -la ~cltbld/tasks/task*
total 0
drwxr-xr-x  3 cltbld  staff  102 Oct  2 10:41 .
drwxr-xr-x  3 cltbld  staff  102 Oct  2 10:41 ..
drwxr-xr-x  2 cltbld  staff   68 Oct  2 10:41 generic-worker

#active task:
[root@t-yosemite-r7-098.test.releng.mdc2.mozilla.com ~]# ls -la ~cltbld/tasks/task*
total 144792
drwxr-xr-x   7 cltbld  staff       238 Oct  3 09:27 .
drwxr-xr-x   3 cltbld  staff       102 Oct  3 09:27 ..
drwxr-xr-x   5 cltbld  staff       170 Oct  3 09:27 build
drwxr-xr-x   3 cltbld  staff       102 Oct  3 09:27 generic-worker
-rw-r--r--   1 cltbld  staff  74130844 Oct  3 09:27 installer.dmg
drwxr-xr-x   9 cltbld  staff       306 Oct  3 09:27 logs
drwxr-xr-x  22 cltbld  staff       748 Oct  3 09:27 mozharness
```
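As a rough sketch of that logs-directory heuristic (not a deployed check, just the idle/active distinction described above scripted out):
```
#!/bin/bash
# If the current task directory contains a logs/ subdirectory, a task is
# running; otherwise the worker is idle.
if ls -d ~cltbld/tasks/task_*/logs >/dev/null 2>&1; then
    echo "worker active: task logs present under ~cltbld/tasks/"
else
    echo "worker idle: no task logs under ~cltbld/tasks/"
fi
```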
And eventually, since we will have nrpe, we could try checking puppet freshness:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_puppet_freshness -t 160000
Puppet agent 3.7.0 running catalog "b1b3c1999b37+"
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_puppet_freshness -t 10000
Last run was 82497 seconds ago
```
It may need to have a threshold of 4 days+ for linux (until/unless reboots happen between tasks again).
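For reference, a 4-day threshold for the same check would be roughly (the value is just arithmetic here, not a deployed setting):
```
# 4 days expressed in seconds: 4 * 24 * 3600 = 345600
/usr/local/libexec/check_puppet_freshness -t 345600
```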
Attached file GitHub Pull Request (obsolete) —
The historical reason we don't have nrpe running on testers is the same reason we don't have puppet run as a daemon: it affects tests. We should probably discuss whether that still stands true or not, and if so, how we can work around it.
(In reply to Jake Watkins [:dividehex] from comment #6)
> The historical reason we don't have nrpe running on testers is the same
> reason we don't have puppet run as a daemon, it affects tests. We should
> probably discuss whether that still stands true or not. And if so, how can
> we work around it.

Sounds good. If we do use nrpe, maybe we can make the interval for the check wide enough to have minimal impact (like once per hour?). The individual nrpe tests may have no impact (such as for a process check), but I think the nrpe service could use a percentage (need to test).

Maybe we can find an alternative, like checking the taskcluster queue to see when a worker last reported (and have nagios check those or something). Or, if the nrpe service is heavy, maybe we just trigger a process check through cron that reports to ___ (papertrail?).
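To make the cron alternative concrete, a hypothetical /etc/cron.d-style entry could look like the following (the schedule, log tag, and destination are assumptions, not a deployed config; papertrail would pick the line up from syslog if we went that route):
```
# Run the process check every 20 minutes and report the result to syslog via
# logger, where an aggregator such as papertrail could collect it.
*/20 * * * * root /usr/local/libexec/check_procs -C generic-worker -c 1:1 2>&1 | logger -t worker-proc-check
```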
There is one proc check set up in the IT puppet for the releng signing servers:

modules/nagios4/manifests/prod/releng/services/mdc2.pp:
```
    'child_procs_regex' => {
        service_description  => 'procs - signing-server',
        check_interval       => 5,
        retry_interval       => 3,
        max_check_attempts   => 4,
        notification_options => 'w,c,r,u',
        contact_groups       => 'build',
        check_command        => 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
        hostgroups           => $nagiosbot ? {
            'nagios-releng-mdc2' => [ 'mac-signing-servers', 'signing-servers' ],
            default              => [ ]
        }
    },
```
(In reply to Jake Watkins [:dividehex] from comment #6)
> The historical reason we don't have nrpe running on testers is the same
> reason we don't have puppet run as a daemon, it affects tests. We should
> probably discuss whether that still stands true or not. And if so, how can
> we work around it.

Yes, and... we don't need to run the check every minute. NRPE running on its own shouldn't affect tests any more than any other daemon, though it would be nice to know for certain.
modules/nagios4/manifests/prod/releng/services/mdc2.pp:
```
    check_command => 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
```

modules/nrpe/manifests/check/child_procs_regex.pp:
```
# This Source Code Form is subject to the terms of the Mozilla Public
# License, v. 2.0. If a copy of the MPL was not distributed with this
# file, You can obtain one at http://mozilla.org/MPL/2.0/.
class nrpe::check::child_procs_regex {
    include nrpe::settings
    $plugins_dir = $nrpe::settings::plugins_dir

    nrpe::check {
        'check_child_procs_regex':
            cfg => "${plugins_dir}/check_procs -c \$ARG3\$:\$ARG4\$ --ereg-argument-array=\$ARG1\$ -p \$ARG2\$";
    }
}
```

check_procs usage:
```
Usage:
 check_procs -w <range> -c <range> [-m metric] [-s state] [-p ppid]
   [-u user] [-r rss] [-z vsz] [-P %cpu] [-a argument-array]
   [-C command] [-t timeout] [-v]

Options:
 -h, --help
    Print detailed help screen
 -V, --version
    Print version information
 -w, --warning=RANGE
    Generate warning state if metric is outside this range
 -c, --critical=RANGE
    Generate critical state if metric is outside this range
 -m, --metric=TYPE
    Check thresholds against metric. Valid types:
    PROCS   - number of processes (default)
    VSZ     - virtual memory size
    RSS     - resident set memory size
    CPU     - percentage CPU
 -t, --timeout=INTEGER
    Seconds before connection times out (default: 10)
 -v, --verbose
    Extra information. Up to 3 verbosity levels

Filters:
 -s, --state=STATUSFLAGS
    Only scan for processes that have, in the output of `ps`, one or
    more of the status flags you specify (for example R, Z, S, RS,
    RSZDT, plus others based on the output of your 'ps' command).
 -p, --ppid=PPID
    Only scan for children of the parent process ID indicated.
 -z, --vsz=VSZ
    Only scan for processes with VSZ higher than indicated.
 -r, --rss=RSS
    Only scan for processes with RSS higher than indicated.
 -P, --pcpu=PCPU
    Only scan for processes with PCPU higher than indicated.
 -u, --user=USER
    Only scan for processes with user name or ID indicated.
 -a, --argument-array=STRING
    Only scan for processes with args that contain STRING.
 --ereg-argument-array=STRING
    Only scan for processes with args that contain the regex STRING.
 -C, --command=COMMAND
    Only scan for exact matches of COMMAND (without path).
```

Testing with a regex count of child procs:
```
[root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -v -c 1:3 --ereg-argument-array=run-generic-worker.sh\|generic-worker
PROCS OK: 3 processes with regex args 'run-generic-worker.sh,generic-worker'
```
(In reply to Dave House [:dhouse] from comment #10)
[...]
> 'check_nrpe_child_procs_regex!tools/release/signing/signing-server.py!1!3!3',
[...]
> cfg => "${plugins_dir}/check_procs -c \$ARG3\$:\$ARG4\$ --ereg-argument-array=\$ARG1\$ -p \$ARG2\$";
[...]
> -p, --ppid=PPID
>    Only scan for children of the parent process ID indicated.
[...]
> [root@t-yosemite-r7-100.test.releng.mdc2.mozilla.com ~]# /usr/local/libexec/check_procs -v -c 1:3 --ereg-argument-array=run-generic-worker.sh\|generic-worker
> PROCS OK: 3 processes with regex args 'run-generic-worker.sh,generic-worker'

The "-p" is hardcoded in, so I cannot make it count children except of a known pid. But for a draft test, it will work to find the run-generic-worker.sh process.
Draft for adding beta osx releng workers for IT-puppet nagios nrpe process check.
Attachment #9014218 - Flags: review?(rchilds)
Attachment #9014218 - Flags: review?(jwatkins)
Attachment #9014218 - Flags: review?(dcrisan)
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

LGTM
Attachment #9014218 - Flags: review?(rchilds) → review+
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

Review of attachment 9014218 [details] [diff] [review]:
-----------------------------------------------------------------

r+ but see inline comments

::: modules/nagios4/manifests/prod/releng/hostgroups.pp
@@ +110,5 @@
>         },
>         't-yosemite-r7-machines' => {
>             alias => "OS X 10.10 talos servers"
>         },
> +       't-yosemite-r7-machines-beta' => {

Is there a reason we are calling the staging minis, beta? Are these exclusive to Firefox Beta? If not, we should probably rename them to staging.
Attachment #9014218 - Flags: review?(jwatkins) → review+
Comment on attachment 9014218 [details] [diff] [review]
add nrpe process check for osx staging workers

LGTM
Attachment #9014218 - Flags: review?(dcrisan) → review+
(In reply to Jake Watkins [:dividehex] from comment #14)
> Comment on attachment 9014218 [details] [diff] [review]
> add nrpe process check for osx staging workers
>
> Review of attachment 9014218 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> r+ but see inline comments
>
> ::: modules/nagios4/manifests/prod/releng/hostgroups.pp
> @@ +110,5 @@
> >         },
> >         't-yosemite-r7-machines' => {
> >             alias => "OS X 10.10 talos servers"
> >         },
> > +       't-yosemite-r7-machines-beta' => {
>
> Is there a reason we are calling the staging minis, beta? Are these
> exclusive to Firefox Beta? If not, we should probably rename them to
> staging.

This matches the staging worker pool name "gecko-t-osx-1010-beta". I'll name it "-staging" instead since that's what we call the pool.
Attachment #9014218 - Flags: checked-in+
I'm not yet merging the nrpe install/setup for the releng-puppet test/worker type since I don't have that restricted to the staging group. Instead I'll keep it pinned to my env, and only merge it if we want nrpe on all the osx workers.

I'm expecting the new proc check to show up for the t-yosemite-r7-100 entry (once the nagios config change is applied from the IT puppet repo):
https://nagios1.private.releng.mdc2.mozilla.com/releng-mdc2/cgi-bin/status.cgi?host=t-yosemite-r7-100.test.releng.mdc2.mozilla.com
We can also collect stats about the processes. This might allow us to set up a dashboard showing machines not running the worker, running multiple tasks, and so on:

statsd -> collectd -> graphite host (https://graphite-mdc1.mozilla.org/dashboard/)

We already had statsd+collectd on every worker posting host data (cpu, etc.), so adding a report of the process or tasks would add little overhead (less than nagios, since we do not have nrpe on the workers). I tested this and was able to send arbitrary stats, which appeared namespaced under the host I posted them from (hosts.t-linux64-ms-280_test_releng_mdc1_mozilla_com.statsd.gauge.).
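A sketch of what sending such a stat could look like from a worker (the metric name and the localhost:8125 statsd address are placeholders; the real collectd/statsd setup on the workers may listen elsewhere):
```
# Count processes whose command line mentions generic-worker and push the
# value as a statsd gauge over UDP.
count=$(pgrep -f generic-worker | wc -l | tr -d ' ')
echo "generic_worker.procs:${count}|g" | nc -u -w1 localhost 8125
```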
Joel, can you check if there is a performance impact with nagios checking the mac workers?

I put nrpe on the four osx staging workers, polling 1/min. And I ran the tests over these machines yesterday:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=267c79d06585c9af2c5c1a84a2e00f08173baa1d

I ran on try twice with the same commit, but perfherder doesn't show a comparison for me.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8bc0383a538c7d86972ae620340da89c4974a094
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1c50247aa9037267775c29e91785b7dfc9b7cca
Flags: needinfo?(jmaher)
(In reply to Dave House [:dhouse] from comment #19)
> I put nrpe on the four osx staging workers, polling 1/min. And I ran the

I set the polling for the worker process check to 1/min because that is the most frequent interval I think nagios can check at, and I wanted to make any performance impact large enough to notice. In actual use, we will not check this frequently; I think we might only check for the worker processes every 20 minutes.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #20)
> we need more data points to determine a change, I retriggered more on:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1c50247aa9037267775c29e91785b7dfc9b7cca

Thank you!
this took a while to analyze since there was a bug in perfherder compare. I fixed this locally and analyzed. I find the noise is a bit higher with the try push, focused more heavily on startup tests (which are already noisy to begin with, especially on osx).

The rest of the results are reasonable - I am more concerned about noise than raw results. I assume a different cycle, >1 minute and closer to 20 minutes, would be just fine.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #23)
> this took a while to analyze since there was a bug in perfherder compare. I
> fixed this locally and analyzed. I find the noise is a bit higher with the
> try push, focused more heavily on startup tests (which are already noisy to
> begin with, especially on osx).
>
> The rest of the results are reasonable- I am more concerned about noise
> rather than raw results. I assume a different cycle >1 minute and closer to
> 20 minutes, would be just fine.

Thank you! Would you like me to test with a 20 minute cycle in staging? Or are you okay with my rolling this out to production?
Flags: needinfo?(jmaher)
Let's test with 20 if that is what you plan to roll out.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #25) > lets test with 20 if that is what you plan to roll out Thanks! I've increased the check interval to 20min. Once I see the change applied in nagios, I'll push some tests to try for production and staging.
User Story: (updated)
(In reply to Dave House [:dhouse] from comment #26)
> (In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #25)
> > lets test with 20 if that is what you plan to roll out
>
> Thanks! I've increased the check interval to 20min.
>
> Once I see the change applied in nagios, I'll push some tests to try for
> production and staging.

I ran this on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f64aa6f1f7cc999f62501845d0f79303d3377ec
and on staging (with a 20 minute nagios check cycle):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=40eade083ad55944a19db930f81623d57159d70d

The staging one has not completed yet (at 29% after 6 hours, so it may not be finished until tomorrow afternoon).
we have more data and here are the results:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2f64aa6f1f7c&newProject=try&newRevision=40eade083ad55944a19db930f81623d57159d70d&framework=1

you can see some improvements on:
cpstartup
sessionrestore_many_windows (but not other session restore)

and oddly enough tp6 seems to have increased a bit. Noise levels are the same if not slightly lower overall; this is good.

I am not sure why we have improvements that are measured; it looks like we have bimodal data that is now a single mode - not sure how nagios is affecting that.
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #28)
> we have more data and here are the results:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2f64aa6f1f7c&newProject=try&newRevision=40eade083ad55944a19db930f81623d57159d70d&framework=1
>
> you can see some improvements on:
> cpstartup
> sessionrestore_many_windows (but not other session restore)
>
> and oddly enough tp6 seems to have increased a bit. Noise levels are the
> same if not slightly lower overall, this is good.
>
> I am not sure why we have improvements that are measured, it looks like we
> have bimodal data that is no a single mode- not sure how nagios is affecting
> that.

Two (half) of the staging minis are running the new firmware, so that could be the cause of the improvements. Or maybe having the nagios daemon running changed the scheduling and memory management slightly. I'm curious to see if the improvement persists when we run with nagios on more machines.

Are you okay with my adding the nagios check to the production machines? If a gradual roll-out sounds good to you, I'll start with 10%, then wait 2 days to watch for issues, and then we can increase to 50% if things look good. After a few more days I'll increase to all of the production pool.
Flags: needinfo?(jmaher)
I would prefer to go to 100% instead of gradual - with a gradual roll-out, perf improvements show up as odd staggered bi-modal improvements. If we are concerned, I would rather do 10% of production, then 100%, ideally on a fixed schedule (Nov 5th 10%, Nov 12th 100%).
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #30)
> I would prefer to go to 100% instead of gradual- then for perf improvements
> we have odd staggered bi-modal improvements. If we are concerned, I would
> rather do 10% of production, then 100%, ideally on a fixed schedule (Nov 5th
> 10%, Nov 12th 100%)

I understand. Let's go for 100% then; I'm confident we won't see any problems. I'll prepare the patch for the change and plan to apply it tomorrow.
Attachment #9021907 - Flags: review?(jwatkins)
Attachment #9021907 - Flags: review?(jwatkins) → review+
Attachment #9021905 - Flags: checked-in+
Attachment #9014077 - Attachment is obsolete: true
Attachment #9021907 - Flags: checked-in+
I've installed nrpe onto the linux staging workers (280, 394, 395; skipped 240 as it looks like someone is testing on it (systemd?)).
Attachment #9025463 - Flags: checked-in+
Blocks: 1518914
Attachment #9025453 - Flags: checked-in+

These checks have been running, and CIDuty has pointed me to them when there are problems.
example for linux in mdc1: https://nagios1.private.releng.mdc1.mozilla.com/releng-mdc1/cgi-bin/status.cgi?hostgroup=t-linux64-moonshot&style=overview

I will leave them as not alerting via irc for every worker for now; there is some noise in them because nagios does not know about quarantined workers or other activity, so they are not helpful as an alert, but they are helpful for checking state and monitoring.
If we keep using nagios long-term, and do not rely on other monitoring for alerts on workers, then it may be useful to change these alerts to post to irc and notify the production group. Or it may be better to change them to a percentage alert, and alarm when N% of the workers are down.
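A rough sketch of what such a percentage-style check could look like (the host list, threshold, plugin path, and the check_gw_procs command name are illustrative assumptions, not a deployed config):
```
#!/bin/bash
# Alarm (nagios CRITICAL, exit 2) when at least THRESHOLD_PCT of the listed
# workers fail the worker-process check over nrpe.
HOSTS=(t-yosemite-r7-001.test.releng.mdc2.mozilla.com
       t-yosemite-r7-002.test.releng.mdc2.mozilla.com)
THRESHOLD_PCT=20
CHECK_NRPE=/usr/lib/nagios/plugins/check_nrpe

down=0
for h in "${HOSTS[@]}"; do
    "$CHECK_NRPE" -H "$h" -c check_gw_procs >/dev/null 2>&1 || down=$((down + 1))
done

pct=$((100 * down / ${#HOSTS[@]}))
if [ "$pct" -ge "$THRESHOLD_PCT" ]; then
    echo "CRITICAL: ${down}/${#HOSTS[@]} workers (${pct}%) not running the worker process"
    exit 2
fi
echo "OK: ${down}/${#HOSTS[@]} workers (${pct}%) not running the worker process"
exit 0
```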

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED