Bugzilla

Nick Thomas [:nthomas] (UTC+12)

Comment 3

•

6 years ago

This looks like the nagios check is configured with too high a threshold, or something similar. The host is listening on port 1514, and I see valid connections. However, the nrpe check reports a parsing failure.
When I run the same check manually, it does not fail:

[root@log-aggregator2.srv.releng.usw2.mozilla.com ~]# sudo /usr/lib64/nagios/plugins/check_open_tcp -p 1514 -w 500 -c 400
OK: 3 open tcp connections on port 1514. below threshold: 500
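
The plugin's internals are not shown in this bug, so the following is only a rough sketch of how a connection-count check of this kind might map a count onto Nagios states. The function name `classify_open_connections` and the comparison rules are assumptions; the exit codes (0 OK, 1 WARNING, 2 CRITICAL) follow the usual Nagios plugin convention, and the OK message imitates the output above.

```python
# Hypothetical sketch; this is NOT the real check_open_tcp implementation.
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def classify_open_connections(count, warn, crit, port):
    """Return (exit_code, message) for a number of open TCP connections."""
    # Assumption: the critical threshold is tested first, then the warning one.
    if count >= crit:
        return CRITICAL, (
            f"CRITICAL: {count} open tcp connections on port {port}. "
            f"above threshold: {crit}"
        )
    if count >= warn:
        return WARNING, (
            f"WARNING: {count} open tcp connections on port {port}. "
            f"above threshold: {warn}"
        )
    return OK, (
        f"OK: {count} open tcp connections on port {port}. "
        f"below threshold: {warn}"
    )


if __name__ == "__main__":
    # Same numbers as the manual run: 3 connections, -w 500 -c 400, port 1514.
    code, message = classify_open_connections(3, warn=500, crit=400, port=1514)
    print(code, message)
```

Under these assumptions a count of 3 is far below either threshold, so the check itself would not be expected to alert; the failure reported by nrpe is more likely about which check is being run than about the thresholds.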

I rebooted the machine and verified that the rsyslog service is running and listening.
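
For reference, one way to confirm from a script that something accepts TCP connections on a port is sketched below. It is a minimal example using Python's standard `socket` module; the helper name `is_listening` and the use of localhost are assumptions, and it says nothing about whether the listener is actually rsyslog.

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 1514 is the port mentioned above; "127.0.0.1" is an assumption.
    print(is_listening("127.0.0.1", 1514))
```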

I think the nagios check needs to be modified so it does not alert in this state.

Comment 4

•

6 years ago

Here's the strange thing: there are two of these machines in each of use1 and usw2 that got added to nagios, but just one of them has a broken check.

log-aggregator2.srv.releng.usw2 services - open syslog TCP connections is NRPE: Unable to read output
log-aggregator1.srv.releng.usw2 services and 2x use1 - OK: 59 open tcp connections on port 1514. below threshold: 500

In the nagios configuration (in the web UI) the checks are different:

log-aggregator2.srv.releng.usw2 has check_open_tcp!1000!1600!514
log-aggregator1.srv.releng.usw2 and the use1 boxes have check_open_tcp!500!800!1514

i.e. a different port, with doubled thresholds, on the failing check.
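
To make the comparison concrete, here is a small sketch that splits the two check strings on `!` (the separator Nagios uses between a command name and its arguments) and reports which fields differ. The mapping of the three arguments to warning, critical and port is an assumption inferred from the "below threshold: 500" output, and the names `OpenTcpCheck`, `parse_check_command` and `diff_checks` are made up for illustration.

```python
from typing import NamedTuple


class OpenTcpCheck(NamedTuple):
    command: str
    warn: int
    crit: int
    port: int


def parse_check_command(definition):
    """Split a Nagios check_command string on '!' into command and arguments."""
    command, *args = definition.split("!")
    # Assumption: ARG1 = warning, ARG2 = critical, ARG3 = port.
    warn, crit, port = (int(a) for a in args)
    return OpenTcpCheck(command, warn, crit, port)


def diff_checks(a, b):
    """Return {field: (a_value, b_value)} for every field that differs."""
    return {
        field: (getattr(a, field), getattr(b, field))
        for field in a._fields
        if getattr(a, field) != getattr(b, field)
    }


if __name__ == "__main__":
    failing = parse_check_command("check_open_tcp!1000!1600!514")
    working = parse_check_command("check_open_tcp!500!800!1514")
    print(diff_checks(failing, working))
```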

OK, so what does the puppet config for nagios say? All four boxes have the same node definition in mdc1.pp, except for the hostname. I'm wondering if there's a sort of race condition in puppet+nagios. Going by mdc1.pp and services/mdc1.pp, it looks like the log-aggregator-servers hostgroup pulls in syslog-open-connections-514 (the failing check), while open-tcp-1514 overwrites it with syslog-open-connections-1514 (which is OK).

Over in the scl3 config, the log-aggregator-servers hostgroup doesn't have any syslog port checks; they're added explicitly to the node definitions.
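
The suspected overwrite can be illustrated with a toy model. This is not how puppet or nagios actually resolves duplicate services; it only shows that if two definitions share the same host and service description and the last one applied wins, the surviving check depends on ordering. The function name `resolve_services` and the assumption that both checks use the description "open syslog TCP connections" are hypothetical.

```python
def resolve_services(definitions):
    """Apply (host, description, check_command) tuples in order; later entries win."""
    resolved = {}
    for host, description, check_command in definitions:
        resolved[(host, description)] = check_command
    return resolved


if __name__ == "__main__":
    host = "log-aggregator2.srv.releng.usw2"
    description = "open syslog TCP connections"
    # Assumption: one definition comes from the hostgroup, one from the node.
    from_hostgroup = (host, description, "check_open_tcp!1000!1600!514")
    from_node = (host, description, "check_open_tcp!500!800!1514")

    # Same two definitions, different application order, different survivor.
    print(resolve_services([from_hostgroup, from_node])[(host, description)])
    print(resolve_services([from_node, from_hostgroup])[(host, description)])
```

If something like this is happening, removing one of the two colliding definitions makes the outcome independent of ordering.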

Updated

•

6 years ago

Flags: needinfo?(riman)

Assignee

Comment 5

•

6 years ago

Attached patch: log-aggregator-servers.patch

- Removed the 'log-aggregator-servers' hostgroup from 'syslog-open-connections-514'
- Removed the 'syslog-tcp-514' port check for the 'log-aggregator-servers' hostgroup

Nick, can you please review the attachment?

Assignee: nobody → riman

Flags: needinfo?(riman)

Attachment #9047251 - Flags: review?(nthomas)

Nick Thomas [:nthomas] (UTC+12)

Assignee

Updated

•

6 years ago

Blocks: 1484880

Comment 6

•

6 years ago

•

Edited

Comment on attachment 9047251 (log-aggregator-servers.patch):

I don't think I should review this, much as I have opinions. Passing it to dhouse. I bet the log-aggregator-servers hostgroup existed before the AWS servers were added, so the port 514 check is presumably intended for servers actually in mdc1.

Attachment #9047251 - Flags: review?(nthomas) → review?(dhouse)

Comment 7

•

6 years ago

Adding myself to NI so I don't forget about this.

Flags: needinfo?(dlabici)

Comment 8

•

6 years ago

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #4)

Thank you for the investigation in puppet!

Jake Watkins [:dividehex]

Comment 9

•

6 years ago

Comment on attachment 9047251 (log-aggregator-servers.patch):

Looks good. Nick found that two checks were both attempting to check the port, and this entry was the one failing. So I think it is valid to remove it.

Attachment #9047251 - Flags: review?(dhouse) → review+

Comment 10

•

6 years ago

FTR, port 514 is the standard syslog port, which was fine when deploying to the datacenters (scl3, mdc1/2). When the log aggregators were deployed to AWS, we ran into problems using that port on EC2 instances. I'm not sure if that limitation still exists, but at the time it was decided to use port 1514 in AWS and continue using port 514 in the datacenters. So in the case of nagios checks, there should be two different port checks: one for the standard 514 port on the mdc1/2 log aggregators and one for the non-standard 1514 port on the AWS EC2 log aggregators.
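
The rule described here could be written down roughly as follows. This is only a sketch: the site labels come from this bug (scl3, mdc1, mdc2 as datacenters; use1, usw2 as AWS), while the function name `expected_syslog_port` and the idea of deriving the site from a hostname label are assumptions.

```python
DATACENTER_SITES = {"scl3", "mdc1", "mdc2"}  # standard syslog port 514
AWS_SITES = {"use1", "usw2"}                 # non-standard port 1514


def expected_syslog_port(hostname):
    """Return the syslog TCP port expected for a log aggregator hostname."""
    # Assumption: the site appears as one dot-separated label of the hostname.
    for label in hostname.split("."):
        if label in AWS_SITES:
            return 1514
        if label in DATACENTER_SITES:
            return 514
    raise ValueError(f"cannot determine site for {hostname!r}")


if __name__ == "__main__":
    print(expected_syslog_port("log-aggregator2.srv.releng.usw2.mozilla.com"))
    # The mdc1 hostname below is hypothetical, for illustration only.
    print(expected_syslog_port("log-aggregator1.srv.releng.mdc1.mozilla.com"))
```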

Flags: needinfo?(jwatkins)

Assignee

Comment 11

•

6 years ago

Following the information from comment 4 and comment 10, I've created two different variants of the patch.

log-aggregator-servers-v1.patch - removed log-aggregator-servers hostgroup from the port checks and added the port checks to the node
log-aggregator-servers-v2.patch - used a different hostgroup for each log-aggregator type and added the hostgroups to the port checks:
- mdc1-log-aggregator-servers - 'open-tcp-514', 'rsyslog-tcp-514'
- log-aggregator-instances - 'rsyslog-tcp-1514', 'open-tcp-1514'
- log-aggregator-loadbalancers - 'rsyslog-tcp-1514'
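
As a sanity check on the v2 layout, the hostgroup-to-check mapping can be expressed as data and tested so that no hostgroup mixes ports. This is an illustrative sketch, not the puppet code from the patch; the hostgroup and check names are taken from the list above (reading the last hostgroup as log-aggregator-loadbalancers), and the helper names are hypothetical.

```python
import re

# Hostgroup -> port checks, as listed for the v2 patch.
HOSTGROUP_CHECKS = {
    "mdc1-log-aggregator-servers": ["open-tcp-514", "rsyslog-tcp-514"],
    "log-aggregator-instances": ["rsyslog-tcp-1514", "open-tcp-1514"],
    "log-aggregator-loadbalancers": ["rsyslog-tcp-1514"],
}


def ports_for(checks):
    """Extract the trailing port number from each check name."""
    return {int(re.search(r"-(\d+)$", name).group(1)) for name in checks}


def mixed_port_hostgroups(mapping):
    """Return the hostgroups whose checks reference more than one port."""
    return [group for group, checks in mapping.items() if len(ports_for(checks)) > 1]


if __name__ == "__main__":
    # An empty list means every hostgroup checks exactly one port.
    print(mixed_port_hostgroups(HOSTGROUP_CHECKS))
```

Keeping one port per hostgroup is what avoids the original problem, where a hostgroup-level 514 check was applied to hosts listening on 1514.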

Assignee

Comment 12

•

6 years ago

Attached patch: log-aggregator-servers-v1.patch

log-aggregator-servers-v1.patch - removed log-aggregator-servers hostgroup from the port checks and added the port checks to the node

Attachment #9049182 - Flags: review?(dhouse)

Assignee

Comment 13

•

6 years ago

Attached patch: log-aggregator-servers-v2.patch

log-aggregator-servers-v2.patch - used a different hostgroup for each log-aggregator type and added the hostgroups to the port checks:
- mdc1-log-aggregator-servers - 'open-tcp-514', 'rsyslog-tcp-514'
- log-aggregator-instances - 'rsyslog-tcp-1514', 'open-tcp-1514'
- log-aggregator-loadbalancers - 'rsyslog-tcp-1514'

Attachment #9049183 - Flags: review?(dhouse)

Comment 14

•

6 years ago

•

Edited

@dhouse: What :riman forgot to say is that the 2 patches in comment #12 and comment #13 are two different implementations, and you should feel free to choose one of them.

Flags: needinfo?(dlabici)

Comment 15

•

6 years ago

Comment on attachment 9049183 (log-aggregator-servers-v2.patch):

Let's go with v2. Based on Jake's explanation for port 1514 being used in AWS, this solution looks the best.

Attachment #9049183 - Flags: review?(dhouse) → review+

https://bug1530085.bmoattachments.org/attachment.cgi?id=9049183

Assignee

Comment 16

•

6 years ago

Thank you :dhouse

:dlabici, the patch is ready for landing

Flags: needinfo?(dlabici)

Comment 17

•

6 years ago

Landed the v2 patch under changeset: e3ae8d1f51631bc16fadfb015f3d7a8d70fa40a8

Leaving this open for a few hours to see how things evolve.

Flags: needinfo?(dlabici)

Comment 18

•

6 years ago

Added 3 missing commas in services/mdc1.pp.
Waiting for the change to propagate; I will post updates here.

Attila Craciun [:arny]

Comment 19

•

6 years ago

All good after the patch.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED