[log-aggregator2.srv.releng.usw2.mozilla.com] open syslog TCP connections is UNKNOWN
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
People
(Reporter: zfay, Assigned: riman)
Attachments
(3 files)
1.11 KB, patch | dhouse: review+
1.90 KB, patch
4.79 KB, patch | dhouse: review+
In response to comment #127 in bug 1484880:
log aggregators started continuously alerting "open syslog TCP connections is UNKNOWN: NRPE: Unable to read output" so I've re-acknowledged them.
Comment 1•6 years ago
Seems to just be log-aggregator2.srv.releng.usw2.mozilla.com. log-aggregator1 in usw2 and both of them in use1 have an OK status with ~40 open connections.
As the rest of the box seems fine I suggest talking with RelOps about debugging the rsyslog process, or just restarting it.
Comment 2•6 years ago (Reporter)
Hey Jake, can you help us out with this one? ^
Comment 3•6 years ago
It looks as though the nagios check is configured with too high a threshold, or something similar. The host is listening on port 1514 and I see valid connections, but the NRPE check reports a parsing failure.
When I manually run this same check, it does not fail:
[root@log-aggregator2.srv.releng.usw2.mozilla.com ~]# sudo /usr/lib64/nagios/plugins/check_open_tcp -p 1514 -w 500 -c 400
OK: 3 open tcp connections on port 1514. below threshold: 500
I rebooted the machine and verified that the rsyslog service is running and listening.
I think the nagios check needs to be modified so that it does not alert in this state.
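For readers unfamiliar with how a threshold check like this reports its state, here is a minimal Python sketch. It is not the source of check_open_tcp: the function name, the comparison logic and the message wording are assumptions for illustration. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention. The thresholds in the usage lines are the 500/800 pair from the nagios configuration, not the values typed in the manual run above.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_open_connections(count, warn, crit, port):
    """Return (exit_code, message) for a connection count.

    Hypothetical logic, assumed for illustration only.
    """
    if count is None:
        # Always print a message: NRPE typically reports
        # "Unable to read output" when a plugin prints nothing at all.
        return UNKNOWN, f"UNKNOWN: could not count connections on port {port}"
    if count >= crit:
        return CRITICAL, f"CRITICAL: {count} open tcp connections on port {port}"
    if count >= warn:
        return WARNING, f"WARNING: {count} open tcp connections on port {port}"
    return OK, f"OK: {count} open tcp connections on port {port}"


# Values taken from the thread: 3 connections after the reboot, 59 on a healthy host.
after_reboot = evaluate_open_connections(3, warn=500, crit=800, port=1514)
healthy_host = evaluate_open_connections(59, warn=500, crit=800, port=1514)
no_reading = evaluate_open_connections(None, warn=500, crit=800, port=1514)
```

A plugin that exits without printing anything, for example because it cannot run under the NRPE user, would surface as the "Unable to read output" state even though a manual run as root succeeds.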
Comment 4•6 years ago
Here's the strange thing: there are two of these machines in each of use1 and usw2, all of which got added to nagios, but just one of them has a broken check.
- log-aggregator2.srv.releng.usw2 services -
open syslog TCP connections: NRPE: Unable to read output
- log-aggregator1.srv.releng.usw2 services and 2x use1 -
OK: 59 open tcp connections on port 1514. below threshold: 500
In the nagios configuration (in the web UI) the checks are different:
- log-aggregator2.srv.releng.usw2 has
check_open_tcp!1000!1600!514
- log-aggregator1.srv.releng.usw2 and the use1 boxes have
check_open_tcp!500!800!1514
i.e. a different port, with doubled thresholds, on the failing check.
OK, so what does the puppet config for nagios say? All four boxes have the same node definition in mdc1.pp, except for the hostname. I'm wondering if there's a sort of race condition in puppet+nagios. Using mdc1.pp and services/mdc1.pp, it looks like the log-aggregator-servers hostgroup pulls in syslog-open-connections-514 (the failing check), while open-tcp-1514 overwrites it with syslog-open-connections-1514 (which is OK).
Over in the scl3 config, the log-aggregator-servers hostgroup doesn't have any syslog port checks; they're added explicitly to the node definitions.
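To make the difference between the two check definitions explicit, here is a small Python sketch that splits them apart. Nagios separates a command name from its arguments with `!`. The argument order (warning, critical, port) is an assumption inferred from the "below threshold: 500" output and the port numbers above, and the helper names are hypothetical.

```python
def describe_check(check_command):
    """Split a Nagios check_command string into its parts.

    Assumes three arguments in the order warning, critical, port.
    """
    name, *args = check_command.split("!")
    warn, crit, port = (int(arg) for arg in args)
    return {"command": name, "warn": warn, "crit": crit, "port": port}


failing = describe_check("check_open_tcp!1000!1600!514")
working = describe_check("check_open_tcp!500!800!1514")

# Keep only the fields whose values differ between the two definitions.
differences = {
    key: (failing[key], working[key])
    for key in failing
    if failing[key] != working[key]
}
```

Under that assumed ordering, `differences` holds the warning threshold, the critical threshold and the port, which matches the observation that the failing check uses a different port with doubled thresholds.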
Comment 5•6 years ago (Assignee)
- Removed the 'log-aggregator-servers' hostgroup from 'syslog-open-connections-514'
- Removed the 'syslog-tcp-514' port check for the 'log-aggregator-servers' hostgroup
Nick, can you please review the attachment?
Comment 6•6 years ago
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #4)
Thank you for the investigation in puppet!
Comment 10•6 years ago
FTR, port 514 is the standard syslog port, which was fine when deploying to the datacenters (scl3, mdc1/2). When the log aggregators were deployed to AWS, though, we ran into problems using that port on EC2 instances. I'm not sure if that limitation still exists, but at the time it was decided to use port 1514 in AWS and continue using port 514 in the datacenters. So for the nagios checks there should be two different port checks: one for the standard port 514 on the mdc1/2 log aggregators and one for the non-standard port 1514 on the AWS EC2 log aggregators.
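The port rule described here can be written down as a short Python sketch. It assumes that the `use1` and `usw2` labels in a hostname identify AWS hosts, which the thread implies but does not state outright. The function name and the mdc1 hostname in the usage lines are hypothetical.

```python
STANDARD_SYSLOG_PORT = 514   # datacenters (scl3, mdc1/2)
AWS_SYSLOG_PORT = 1514       # AWS EC2 log aggregators

# Assumption: these hostname labels mark AWS regions.
AWS_REGION_LABELS = {"use1", "usw2"}


def expected_syslog_port(hostname):
    """Return the syslog port a log aggregator is expected to listen on."""
    labels = set(hostname.split("."))
    if labels & AWS_REGION_LABELS:
        return AWS_SYSLOG_PORT
    return STANDARD_SYSLOG_PORT


aws_port = expected_syslog_port("log-aggregator2.srv.releng.usw2.mozilla.com")
# Hypothetical datacenter hostname, used only to exercise the other branch.
dc_port = expected_syslog_port("log-aggregator1.srv.releng.mdc1.mozilla.com")
```

A check pinned to port 514 on a host where `expected_syslog_port` gives 1514 would be probing a port nothing listens on, which is consistent with the one alerting host.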
Comment 11•6 years ago (Assignee)
Following the information from comment 4 and comment 10, I've created two different variants of the patch.
- log-aggregator-servers-v1.patch - removed the log-aggregator-servers hostgroup from the port checks and added the port checks to the node
- log-aggregator-servers-v2.patch - used a different hostgroup for each log-aggregator type and added the hostgroups to the port checks:
  - mdc1-log-aggregator-servers - 'open-tcp-514', 'rsyslog-tcp-514'
  - log-aggregator-instances - 'rsyslog-tcp-1514', 'open-tcp-1514'
  - log-aggregator-loadbalancers - 'rsyslog-tcp-1514'
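The v2 layout can be sanity-checked with a short Python sketch that flags any hostgroup whose checks target more than one port. This is an illustration, not part of either patch: the function names are hypothetical, the port is read from the numeric suffix of each check name, the load balancer hostgroup name is assumed to begin with `log-`, and the "before" mapping is a simplified reading of comment 4.

```python
import re


def ports_by_hostgroup(checks_by_hostgroup):
    """Map each hostgroup to the set of ports named in its checks."""
    result = {}
    for hostgroup, checks in checks_by_hostgroup.items():
        ports = set()
        for check in checks:
            # Take the trailing number of the check name as the port.
            match = re.search(r"-(\d+)$", check)
            if match:
                ports.add(int(match.group(1)))
        result[hostgroup] = ports
    return result


def conflicting_hostgroups(checks_by_hostgroup):
    """Return hostgroups whose checks refer to more than one port."""
    ports = ports_by_hostgroup(checks_by_hostgroup)
    return sorted(hg for hg, found in ports.items() if len(found) > 1)


# Simplified reading of the situation described in comment 4.
before = {
    "log-aggregator-servers": ["syslog-open-connections-514", "open-tcp-1514"],
}
# Hostgroups and checks as listed for the v2 patch.
v2 = {
    "mdc1-log-aggregator-servers": ["open-tcp-514", "rsyslog-tcp-514"],
    "log-aggregator-instances": ["rsyslog-tcp-1514", "open-tcp-1514"],
    "log-aggregator-loadbalancers": ["rsyslog-tcp-1514"],
}

conflicts_before = conflicting_hostgroups(before)
conflicts_v2 = conflicting_hostgroups(v2)
```

With these inputs the shared hostgroup is flagged before the change and no hostgroup is flagged in the v2 layout, since each hostgroup then maps to a single port.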
Comment 12•6 years ago (Assignee)
- log-aggregator-servers-v1.patch - removed log-aggregator-servers hostgroup from the port checks and added the port checks to the node
Comment 13•6 years ago (Assignee)
- log-aggregator-servers-v2.patch - used a different hostgroup for each log-aggregator type and added the hostgroups to the port checks:
  - mdc1-log-aggregator-servers - 'open-tcp-514', 'rsyslog-tcp-514'
  - log-aggregator-instances - 'rsyslog-tcp-1514', 'open-tcp-1514'
  - log-aggregator-loadbalancers - 'rsyslog-tcp-1514'
Comment 14•6 years ago
@dhouse: What :riman forgot to say is that the 2 patches in comment #12 and comment #13 are two different implementations, and you should feel free to choose one of them.
Comment 15•6 years ago
Comment 16•6 years ago (Assignee)
Thank you :dhouse
:dlabici, the patch is ready for landing
https://bug1530085.bmoattachments.org/attachment.cgi?id=9049183
Comment 17•6 years ago
Landed the v2 patch under changeset: e3ae8d1f51631bc16fadfb015f3d7a8d70fa40a8
Leaving this open for a few hours to see how things evolve.
Comment 18•6 years ago
Added 3 missing commas in services/mdc1.pp
Waiting for the change to propagate and will post updates here.
Comment 19•6 years ago
All good after the patch.