nagios1.private.releng.scl3 nagios problems

RESOLVED FIXED

Status

Product: mozilla.org Graveyard
Component: Server Operations
Status: RESOLVED FIXED
Reported: 5 years ago
Modified: 3 years ago

People

(Reporter: ericz, Assigned: ashish)

(Reporter)

Description

5 years ago
Nagios is having problems on nagios1.private.releng.scl3 and alerted via email:

NAGIOS WARNING: 1 process, status log updated 1289 seconds ago
oncall has been paged

Apart from sending these emails, it isn't actually paging the oncall.

The nagios log indicated that nagios could not fork any more processes. arr indicated this had also happened earlier today. Restarting nagios made the emails stop.
(Assignee)

Comment 1

5 years ago
[1352304412] Warning: The check of host 'switch1.r101-7.ops.scl1.mozilla.net' could not be performed due to a fork() error: 'Resource temporarily unavailable'.
[1352304412] Warning: The check of host 'switch1.r101-8.ops.scl1.mozilla.net' could not be performed due to a fork() error: 'Resource temporarily unavailable'.
[1352304412] Warning: The check of host 'switch1.r101-9.ops.scl1.mozilla.net' could not be performed due to a fork() error: 'Resource temporarily unavailable'.
[1352304412] Warning: The check of host 'switch1.r102-2.ops.scl1.mozilla.net' could not be performed due to a fork() error: 'Resource temporarily unavailable'.
[1352304412] Warning: The check of host 'switch2.r102-4.ops.scl1.mozilla.net' could not be performed due to a fork() error: 'Resource temporarily unavailable'.
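For anyone triaging a log like this later, here is a minimal sketch of how such lines could be summarized: it converts the epoch timestamp to UTC and counts fork() failures per host. The regular expression and function names are illustrative only and not part of any Nagios tooling. The pattern assumes the message format quoted in this bug.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Illustrative pattern for the fork() warnings quoted in this bug. It is an
# assumption based on these sample lines, not an official Nagios log grammar.
FORK_ERROR = re.compile(
    r"^\[(?P<epoch>\d+)\] Warning: The check of "
    r"(?:service '(?P<service>[^']*)' on )?host '(?P<host>[^']+)' "
    r"could not be performed due to a fork\(\) error"
)


def parse_fork_errors(lines):
    """Yield (utc_datetime, host, service_or_None) for each fork() warning."""
    for line in lines:
        match = FORK_ERROR.match(line.strip())
        if match:
            when = datetime.fromtimestamp(int(match["epoch"]), tz=timezone.utc)
            yield when, match["host"], match["service"]


def count_by_host(lines):
    """Count fork() failures per host."""
    return Counter(host for _, host, _ in parse_fork_errors(lines))


if __name__ == "__main__":
    sample = [
        "[1352304412] Warning: The check of host "
        "'switch1.r101-7.ops.scl1.mozilla.net' could not be performed "
        "due to a fork() error: 'Resource temporarily unavailable'.",
    ]
    for when, host, service in parse_fork_errors(sample):
        print(when.isoformat(), host, service or "(host check)")
```

With the timestamp above, 1352304412 works out to 2012-11-07 16:06:52 UTC, so all five warnings were logged in the same second.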
What's the fix here? what's it running out of?
(Assignee)

Comment 3

5 years ago
(In reply to Shyam Mani [:fox2mike] from comment #2)
> What's the fix here? what's it running out of?

Lack of datapoints to figure out root cause. However, there are some suggested fixes with reasonable assumptions.

Jabba - please r? http://ashish.pastebin.mozilla.org/1925374
Assignee: server-ops → ashish
Status: NEW → ASSIGNED
Flags: needinfo?(jdow)

Comment 4

5 years ago
patch looks good. I'd probably not put the limits file in files/etc-nagios/ but that is just cosmetic.
Flags: needinfo?(jdow)
(Assignee)

Comment 5

5 years ago
jabba: thanks, I've committed the patch in r52250.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Assignee)

Comment 6

5 years ago
For future reference, the patch made the following changes:

* Added limits for nagios user
* Turned on use_large_installation_tweaks - See http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
This happened again today:

[1353758902] Warning: The check of service 'disk - /var' on host 'buildbot-master04.build.scl1.mozilla.com' could not be performed due to a fork() error: 'Resource temporarily unavailable'.  The check will be rescheduled.

Is this on different hardware than the rest of the nagios machines? 

(In reply to Ashish Vijayaram [:ashish] from comment #3)
> (In reply to Shyam Mani [:fox2mike] from comment #2)
> > What's the fix here? what's it running out of?
> 
> Lack of datapoints to figure out root cause. However, there are some
> suggested fixes with reasonable assumptions.

Did we figure out how to fix that?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
strace shows the following:

[root@nagios1.private.releng.scl3 tmp]# strace -p 22816
Process 22816 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
wait4(-1, NULL, WNOHANG, NULL)          = -1 ECHILD (No child processes)
nanosleep({0, 250000000}, NULL)         = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
wait4(-1, NULL, WNOHANG, NULL)          = -1 ECHILD (No child processes)
nanosleep({0, 250000000}, NULL)         = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
wait4(-1, NULL, WNOHANG, NULL)          = -1 ECHILD (No child processes)
nanosleep({0, 250000000}, NULL)

in a loop. The box seems to be using 5 GB of its 7 GB of memory, but this seems to be fairly standard (nagios1.private.scl3 seems to be using 21 GB of 24 GB). I'm wondering if it's running into limits there. These servers monitor a similar number of resources, yet nagios1.private.releng is nowhere near as powerful as nagios1.private.scl3:

[06:23:23] <nagios-rele> | fox2mike: Status file is 8107 seconds stale
[06:23:24] <nagios-rele> | fox2mike: Hosts Total/Up/Warning/Down
[06:23:24] <nagios-rele> | fox2mike:       1607/350/835/422
[06:23:24] <nagios-rele> | fox2mike: Services Total/Up/Warning/Down
[06:23:24] <nagios-rele> | fox2mike:          3918/2016/28/1836
[06:23:34] <@   fox2mike> | nagios-scl3: status
[06:23:38] <nagios-scl3> | fox2mike: Status file is 0 seconds stale
[06:23:39] <nagios-scl3> | fox2mike: Hosts Total/Up/Warning/Down
[06:23:39] <nagios-scl3> | fox2mike:       609/606/3/0
[06:23:39] <nagios-scl3> | fox2mike: Services Total/Up/Warning/Down
[06:23:39] <nagios-scl3> | fox2mike:          4572/4495/4/70
(Assignee)

Comment 9

5 years ago
Fixed this more thoroughly by raising the limits for the nagios user. Reopen if this recurs.

Before:
[ashish@nagios1.private.releng.scl3 nagios]$ sudo su - nagios
-sh-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62835
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Now:
[ashish@nagios1.private.releng.scl3 nagios]$ sudo su - nagios
-sh-4.1$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62835
max locked memory       (kbytes, -l) 128
max memory size         (kbytes, -m) unlimited
open files                      (-n) 4096
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 20480
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
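
For future reference: fork() can fail with 'Resource temporarily unavailable' (EAGAIN) when the calling user has reached its max user processes limit, among other causes. It may therefore be worth checking the limits a process actually sees, not only those of a fresh login shell. Below is a minimal sketch using Python's standard `resource` module, assuming a Linux host. The helper names are made up for illustration. It reports limits for the process that runs it, so it would need to be run as the nagios user.

```python
import resource

# The four limits that differ between the "Before" and "Now" listings above.
# Note: getrlimit() reports stack and locked-memory sizes in bytes, whereas
# `ulimit -a` prints them in kbytes.
LIMITS = {
    "max user processes (-u)": resource.RLIMIT_NPROC,
    "open files (-n)": resource.RLIMIT_NOFILE,
    "stack size (-s)": resource.RLIMIT_STACK,
    "max locked memory (-l)": resource.RLIMIT_MEMLOCK,
}


def fmt(value):
    """Render a limit value the way ulimit does for 'no limit'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label:26} soft={fmt(soft)} hard={fmt(hard)}")
```

The soft value is the one that is enforced. An unprivileged process can raise its soft limit only as far as the hard limit.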
Product: mozilla.org → mozilla.org Graveyard