nagios-releng bot does not report correct host status information

RESOLVED FIXED

People

(Reporter: arich, Assigned: rtucker)

Attachments

(1 obsolete attachment)

(Reporter)

Description

3 years ago
I get inconsistent, incorrect results each time I run nagios-releng: status

arr: Status file is 3 seconds stale
arr: Hosts Total/Up/Warning/Down
arr:       153/153/0/0
arr: Services Total/Up/Warning/Down
arr:          3399/3321/5/73


arr: Status file is 9 seconds stale
arr: Hosts Total/Up/Warning/Down
arr:       1426/1396/30/0
arr: Services Total/Up/Warning/Down
arr:          146/146/0/0

arr: Status file is 6 seconds stale
arr: Hosts Total/Up/Warning/Down
arr:       1425/1396/29/0
arr: Services Total/Up/Warning/Down
arr:          1082/1071/3/8

We actually had 2121 hosts and 3399 services at the time these commands were run. Most but not all were up in both cases.
:rtucker do you know what's up here?
Flags: needinfo?(rtucker)
(Assignee)

Comment 2

3 years ago
This could be because of a restart. It could also be from mklivestatus taking some time to populate.
Flags: needinfo?(rtucker)
(Reporter)

Comment 3

3 years ago
These were run within a few seconds of each other, so it seems weird that it would be a population issue.
Assignee: nobody → rchilds
Status: NEW → ASSIGNED
Created attachment 8650919
nagiosbot

This did it.

01:25:23 <ryanc> nagios-scl3: status
01:25:23 <@nagios-scl3> ryanc: Status file is 5 seconds stale
01:25:23 <@nagios-scl3> ryanc: Hosts Total/Up/Warning/Down
01:25:23 <@nagios-scl3> ryanc:       1335/1329/6/0
01:25:23 <@nagios-scl3> ryanc: Services Total/Up/Warning/Down
01:25:23 <@nagios-scl3> ryanc:          11472/11353/65/44
Attachment #8650919 - Flags: review?(ashish)
(Assignee)

Comment 5

3 years ago
Tracked this down to the receive buffer not being large enough to handle the amount of data coming back from mklivestatus. Will implement a fix and push tomorrow after a bit more testing.
Assignee: rchilds → rtucker
Attachment #8650919 - Flags: review?(ashish) → review-
Attachment #8650919 - Attachment is obsolete: true
(Assignee)

Comment 6

3 years ago
Simply increasing the buffer size didn't really help.
Instead, I made the buffer smaller and moved the read inside a loop that keeps receiving until it reaches the end of the stream.

https://github.com/rtucker-mozilla/mozilla-nagios-bot/commit/572e3a536a6f8cc45b36dbdab48052306a61c513
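
For anyone hitting the same symptom, here is a minimal sketch of the read-until-end-of-stream pattern described above. It is not the code from the commit. The names recv_all and demo are hypothetical, and socket.socketpair() only stands in for the bot's connection to mklivestatus. The sketch assumes the peer closes the connection once the whole response has been sent. A single recv() on a stream socket may return fewer bytes than the full response, which would be consistent with the totals changing from run to run.

```python
import socket
import threading


def recv_all(sock, chunk_size=4096):
    """Read from sock until the peer closes the connection.

    recv() returns at most chunk_size bytes per call and may return
    fewer, so keep reading until it returns an empty bytes object,
    which signals end of stream.
    """
    chunks = []
    while True:
        chunk = sock.recv(chunk_size)
        if not chunk:  # empty result: the peer has closed the stream
            break
        chunks.append(chunk)
    return b"".join(chunks)


def demo(payload_size=200000):
    """Send payload_size bytes over a local socket pair and count what arrives."""
    # Assumption: socketpair() stands in for the real mklivestatus socket.
    client, server = socket.socketpair()
    payload = b"x" * payload_size

    def serve():
        server.sendall(payload)
        server.close()  # closing is what ends the client's read loop

    worker = threading.Thread(target=serve)
    worker.start()
    data = recv_all(client)
    worker.join()
    client.close()
    return len(data)
```

The chunk size only controls how much is read per call. Correctness comes from the loop, which is why a small buffer works as well as a large one.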

Going to wait for the bots to restart and confirm this before closing.
(Assignee)

Comment 7

3 years ago
After further testing, this looks good now.
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED