tinderbox queue monitoring in nagios flapping because of "NRPE: Unable to read output"

RESOLVED DUPLICATE of bug 690673

Status

Product: Infrastructure & Operations
Component: RelOps
Opened: 6 years ago
Updated: 4 years ago

People

(Reporter: joduinn, Unassigned)

Details

Nagios hits an error but then clears itself soon after, only to repeat a few hours later. This means we're getting spammed by nagios, which makes it harder to see real problems.

This has been going on for weeks, but is now more visible as the queue backlog has reduced.

One of 7 examples yesterday is:

-------- Original Message --------
Subject: ** PROBLEM alert - dm-webtools02/Tinderbox Queue is WARNING **
Date: Thu, 29 Sep 2011 23:29:39 -0700 (PDT)
From: nagios@dm-nagios01.mozilla.org (nagios)

***** Nagios  *****
Notification Type: PROBLEM
Service: Tinderbox Queue
Host: dm-webtools02
Address: 10.2.74.14
State: WARNING
Date/Time: 09-29-2011 23:29:39
Additional Info:
NRPE: Unable to read output

 

...followed soon after by:

-------- Original Message --------
Subject: ** RECOVERY alert - dm-webtools02/Tinderbox Queue is OK **
Date: Thu, 29 Sep 2011 23:39:40 -0700 (PDT)
From: nagios@dm-nagios01.mozilla.org (nagios)

***** Nagios  *****
Notification Type: RECOVERY
Service: Tinderbox Queue
Host: dm-webtools02
Address: 10.2.74.14
State: OK
Date/Time: 09-29-2011 23:39:40
Additional Info:
OK: 1 Messages Queued. Oldest is 0 minutes old
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 690673
(In reply to Zandr Milewski [:zandr] from comment #1)
> 
> *** This bug has been marked as a duplicate of bug 690673 ***

Of course that bug is protected (probably due to internal data that can't be shared easily). Any chance I can get a summary of what was going wrong here, as an interested 3rd party?
It's quite literally a duplicate, but in the server ops queue.  It's as-yet unsolved, and AIUI this is a known, long-term bug in nagios.  I'll copy you on it, at any rate.
Is there anything sensitive in bug 690673? If not, could we remove the infra bit?

Don't suppose that the check fails if nothing is pending, does it? Kinda uncharted territory for us. :-)
cleared
(In reply to Nick Thomas [:nthomas] from comment #4)

> Don't suppose that the check fails if nothing is pending, does it ? Kinda
> uncharted territory for us. :-)

Umm... that's an interesting thought, actually.
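
For anyone following along: as far as I know, NRPE reports "Unable to read output" when the plugin it runs prints nothing on stdout, so an empty-queue code path that exits without printing a status line would produce exactly this symptom. The actual check script is not shown in this bug, so the following is only a minimal sketch of a check that always emits a line. The function name `queue_status` and the threshold values are hypothetical; only the OK message format is taken from the recovery alert above.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def queue_status(count, oldest_minutes, warn_minutes=30, crit_minutes=60):
    """Return (exit_code, status_line). The status line is never empty.

    count: number of queued messages, or None if the queue could not be read.
    oldest_minutes: age of the oldest message (ignored when count is 0).
    warn_minutes / crit_minutes: hypothetical thresholds, not from the bug.
    """
    if count is None:
        # Report a read failure explicitly instead of exiting silently.
        return UNKNOWN, "UNKNOWN: could not read queue"
    if count == 0:
        # Empty queue: still print something so NRPE has output to relay.
        return OK, "OK: 0 Messages Queued."
    if oldest_minutes >= crit_minutes:
        code, label = CRITICAL, "CRITICAL"
    elif oldest_minutes >= warn_minutes:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    line = f"{label}: {count} Messages Queued. Oldest is {oldest_minutes} minutes old"
    return code, line


def main(count, oldest_minutes):
    """Print the status line and return the exit code for sys.exit()."""
    code, line = queue_status(count, oldest_minutes)
    print(line)
    return code


if __name__ == "__main__":
    # Example invocation with fixed values; a real check would read the queue.
    sys.exit(main(1, 0))
```

If the real check turns out to have a silent path for the empty queue, giving that path an explicit status line like the `count == 0` branch here would be the thing to try.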
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations