Closed Bug 1530085 Opened 5 years ago Closed 5 years ago

[log-aggregator2.srv.releng.usw2.mozilla.com] open syslog TCP connections is UNKNOWN

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Assigned: riman)

Attachments

(3 files)

In response to comment #127 in bug 1484880:
The log aggregators started continuously alerting "open syslog TCP connections is UNKNOWN: NRPE: Unable to read output", so I've re-acknowledged them.

It seems to be just log-aggregator2.srv.releng.usw2.mozilla.com. log-aggregator1 in usw2 and both aggregators in use1 have an OK status with ~40 open connections.

As the rest of the box seems fine, I suggest talking with RelOps about debugging the rsyslog process, or just restarting it.

Summary: [log aggregator] open syslog TCP connections is UNKNOWN → [log-aggregator2.srv.releng.usw2.mozilla.com] open syslog TCP connections is UNKNOWN

Hey Jake, can you help us out with this one? ^

Flags: needinfo?(jwatkins)

It looks like the nagios check is configured with too high a threshold, or something similar. The host is listening on port 1514, and I see valid connections. However, the nrpe check reports a parsing failure.
When I manually run this same check, it does not fail:

[root@log-aggregator2.srv.releng.usw2.mozilla.com ~]# sudo /usr/lib64/nagios/plugins/check_open_tcp -p 1514 -w 500 -c 400
OK: 3 open tcp connections on port 1514. below threshold: 500

I rebooted the machine and verified that the rsyslog service is running and listening.

I think the nagios check needs to be modified to not alert in this state.
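
For readers less familiar with how Nagios interprets a plugin result, the sketch below illustrates the general convention: a plugin prints one status line and exits with 0, 1, 2 or 3 (OK, WARNING, CRITICAL, UNKNOWN). This is not the source of check_open_tcp. The function name `evaluate_open_connections`, the "alert when the count reaches a threshold" comparison, and the message wording are all assumptions made for illustration. A check that yields no readable output is modeled here as UNKNOWN, which is roughly the state the nrpe check reported in this bug.

```python
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_open_connections(
    count: Optional[int], warn: int, crit: int
) -> Tuple[int, str]:
    """Map a connection count to a Nagios-style (exit_code, message) pair.

    Assumption: the check alerts when the count reaches a threshold.
    count=None stands for "the plugin produced no usable output".
    """
    if count is None:
        return UNKNOWN, "UNKNOWN: no output from plugin"
    if count >= crit:
        return CRITICAL, f"CRITICAL: {count} open tcp connections (threshold {crit})"
    if count >= warn:
        return WARNING, f"WARNING: {count} open tcp connections (threshold {warn})"
    return OK, f"OK: {count} open tcp connections"


if __name__ == "__main__":
    # Same numbers as the manual run above: 3 connections, -w 500 -c 400.
    code, message = evaluate_open_connections(3, warn=500, crit=400)
    print(code, message)
```

With the values from the manual run (3 connections against thresholds in the hundreds), this model stays in the OK branch, so the thresholds alone would not explain an UNKNOWN result.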

Here's the strange thing: there are two of these machines in each of use1 and usw2, all of which were added to nagios, but just one of them has a broken check.

In the nagios configuration (in the web UI) the checks are different:

i.e. a different port, with doubled thresholds, on the failing check.

OK, so what does the puppet config for nagios say? All four boxes have the same node definition in mdc1.pp, except for the hostname. I'm wondering if there's a sort of race condition in puppet+nagios. Using mdc1.pp and services/mdc1.pp, it looks like the log-aggregator-servers hostgroup pulls in syslog-open-connections-514 (the failing check), while open-tcp-1514 overwrites it with syslog-open-connections-1514 (which is OK).

Over in the scl3 config, the log-aggregator-servers hostgroup doesn't have any syslog port checks; they're added explicitly to the node definitions.

Flags: needinfo?(riman)
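
One way to picture the situation described in the comment above is a toy model in which a host's services come from two places: the hostgroups it belongs to and its own node definition. The sketch below is not Puppet and does not reproduce mdc1.pp or services/mdc1.pp. The dictionary layout and the `effective_services` helper are hypothetical, and only the hostgroup and service names are taken from the comment. It also does not model the suspected race or overwrite; it only shows how an AWS host in the log-aggregator-servers hostgroup can end up with a port 514 check it cannot satisfy.

```python
from typing import Dict, List

# Hypothetical, simplified view of the configuration described above:
# services attached to a hostgroup are inherited by every member host.
HOSTGROUP_SERVICES: Dict[str, List[str]] = {
    "log-aggregator-servers": ["syslog-open-connections-514"],
}


def effective_services(hostgroups: List[str], node_services: List[str]) -> List[str]:
    """Return hostgroup-inherited services followed by node-level ones, deduplicated."""
    result: List[str] = []
    for group in hostgroups:
        for service in HOSTGROUP_SERVICES.get(group, []):
            if service not in result:
                result.append(service)
    for service in node_services:
        if service not in result:
            result.append(service)
    return result


# An AWS aggregator that is in the hostgroup and also has a node-level 1514 check.
aws_host_services = effective_services(
    ["log-aggregator-servers"], ["syslog-open-connections-1514"]
)
print(aws_host_services)
```

In this model the AWS host carries both the 514 and the 1514 service, which matches the observation that two checks were competing for the same host.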

-removed the 'log-aggregator-servers' hostgroup from 'syslog-open-connections-514'
-removed the 'syslog-tcp-514' port check for 'log-aggregator-servers' hostgroup

Nick, can you please review the attachment?

Assignee: nobody → riman
Flags: needinfo?(riman)
Attachment #9047251 - Flags: review?(nthomas)
Blocks: 1484880
Comment on attachment 9047251 [details] [diff] [review]
log-aggregator-servers.patch

I don't think I should review this, much as I have opinions. Passing it to dhouse.

I bet the log-aggregator-servers hostgroup existed before the AWS servers were added, so the port 514 check is presumably intended for servers actually in mdc1.
Attachment #9047251 - Flags: review?(nthomas) → review?(dhouse)

Adding myself to NI so I don't forget about this.

Flags: needinfo?(dlabici)

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #4)

Thank you for the investigation in puppet!

Comment on attachment 9047251 [details] [diff] [review]
log-aggregator-servers.patch

Looks good. Nick found that the two checks were both attempting to check the port and this entry was failing. So I think it is valid to remove it.
Attachment #9047251 - Flags: review?(dhouse) → review+

FTR, port 514 is the standard syslog port, which was fine when deploying to the datacenters (scl3, mdc1/2), but when the log aggregators were deployed to AWS, we ran into problems using that port on ec2 instances. I'm not sure if that limitation still exists, but at the time it was decided to use port 1514 in aws and continue using port 514 in the data centers. So for the nagios checks, there should be two different port checks: one for the standard port 514 on the mdc1/2 log aggregators and one for the non-standard port 1514 on the aws ec2 log aggregators.

Flags: needinfo?(jwatkins)
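
The port split described above can be written down as a small lookup. This is only a sketch: the function name `expected_syslog_port` is made up, and it assumes hostnames follow the `<host>.srv.releng.<site>.mozilla.com` shape seen in this bug's summary, so that the site is the fourth dot-separated label. The site groupings (scl3, mdc1, mdc2 as datacenters; use1, usw2 as AWS) are the ones mentioned in this thread.

```python
# Sites mentioned in this bug, grouped as described in the comment above.
DATACENTER_SITES = {"scl3", "mdc1", "mdc2"}
AWS_REGIONS = {"use1", "usw2"}


def expected_syslog_port(fqdn: str) -> int:
    """Return the syslog TCP port a log aggregator is expected to listen on.

    Assumes names of the form <host>.srv.releng.<site>.mozilla.com,
    i.e. the site is the fourth dot-separated label.
    """
    labels = fqdn.split(".")
    if len(labels) < 4:
        raise ValueError(f"cannot find a site label in {fqdn!r}")
    site = labels[3]
    if site in AWS_REGIONS:
        return 1514  # non-standard port used on ec2 instances
    if site in DATACENTER_SITES:
        return 514  # standard syslog port used in the datacenters
    raise ValueError(f"unknown site {site!r} in {fqdn!r}")


if __name__ == "__main__":
    print(expected_syslog_port("log-aggregator2.srv.releng.usw2.mozilla.com"))
```

A nagios check generated from a rule like this would never point the port 514 check at an AWS aggregator, which is the mismatch this bug ran into.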

Following the information from comment 4 and comment 10, I've created two different variants of the patch.

  • log-aggregator-servers-v1.patch - removed log-aggregator-servers hostgroup from the port checks and added the port checks to the node

  • log-aggregator-servers-v2.patch - used a different hostgroup for each log-aggregator type and added the hostgroups to the port checks:

    • mdc1-log-aggregator-servers - 'open-tcp-514', 'rsyslog-tcp-514'
    • log-aggregator-instances - 'rsyslog-tcp-1514', 'open-tcp-1514'
    • log-aggregator-loadbalancers - 'rsyslog-tcp-1514'
Attachment #9049182 - Flags: review?(dhouse)
Attachment #9049183 - Flags: review?(dhouse)
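
The v2 layout listed above can be sanity-checked with a short script. The sketch below is not the patch itself: it copies the hostgroup-to-check mapping from the comment into a plain dictionary and verifies that no hostgroup mixes checks for different ports. The helper names are hypothetical, and the code assumes every check name ends in `-<port>`.

```python
from typing import Dict, List, Set

# Hostgroup -> checks, as listed for the v2 patch in the comment above.
# The third hostgroup name is assumed to start with "log-", matching the others.
V2_HOSTGROUP_CHECKS: Dict[str, List[str]] = {
    "mdc1-log-aggregator-servers": ["open-tcp-514", "rsyslog-tcp-514"],
    "log-aggregator-instances": ["rsyslog-tcp-1514", "open-tcp-1514"],
    "log-aggregator-loadbalancers": ["rsyslog-tcp-1514"],
}


def ports_checked(hostgroup: str) -> Set[int]:
    """Extract the port numbers from a hostgroup's check names ('...-<port>')."""
    return {
        int(check.rsplit("-", 1)[1]) for check in V2_HOSTGROUP_CHECKS[hostgroup]
    }


def mixed_port_hostgroups() -> List[str]:
    """Return hostgroups whose checks target more than one port."""
    return sorted(
        group for group in V2_HOSTGROUP_CHECKS if len(ports_checked(group)) > 1
    )


if __name__ == "__main__":
    for group in V2_HOSTGROUP_CHECKS:
        print(group, sorted(ports_checked(group)))
    print("mixed:", mixed_port_hostgroups())
```

Because each hostgroup in this layout maps to exactly one port, a host only inherits checks for the port it actually listens on, provided it is placed in the right hostgroup.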

@dhouse: What :riman forgot to say is that the 2 patches in comment #12 and comment #13 are two different implementations, and you should feel free to choose one of them.

Flags: needinfo?(dlabici)
Comment on attachment 9049183 [details] [diff] [review]
log-aggregator-servers-v2.patch

Let's go with v2. Based on Jake's explanation for port 1514 being used in aws, this solution looks the best.
Attachment #9049183 - Flags: review?(dhouse) → review+

Thank you :dhouse

:dlabici, the patch is ready for landing

https://bug1530085.bmoattachments.org/attachment.cgi?id=9049183

Flags: needinfo?(dlabici)

Landed the v2 patch under changeset: e3ae8d1f51631bc16fadfb015f3d7a8d70fa40a8

Leaving this open for a few hours to see how things evolve.

Flags: needinfo?(dlabici)

Added 3 missing commas in services/mdc1.pp
Waiting for the change to propagate and will post updates here.

All good after the patch.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Attachment #9049182 - Flags: review?(dhouse)
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard