Closed Bug 625867 Opened 13 years ago Closed 13 years ago

investigate why a few slaves are causing nagios PING CRITICAL alerts

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bear, Assigned: zandr)

it's not often, but they come in clumps:

[09] talos-r3-fed64-033.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[11] talos-r3-fed64-017.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[15] talos-r3-fed-008.build:PING is CRITICAL: PING CRITICAL - Packet loss = 100%
[16] linux-ix-slave20.build.scl1:PING is CRITICAL: PING CRITICAL - Packet loss = 100%

When I'm slow to react and then look, the slaves are fine. Only once have I caught one while it was truly offline.

I also never see the follow-up alert saying that they are OK.
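To see which hosts show up in a clump like the one above, the alert lines could be parsed with something along these lines. This is only a rough sketch: the regular expression is inferred from the four lines quoted in this comment, the meaning of the leading bracketed number is not stated here (it is kept as an opaque `tag`), and the names `parse_ping_alert` and `hosts_fully_unreachable` are made up for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Pattern inferred from the alert lines quoted above; other alert types
# will not match and are ignored.
ALERT_RE = re.compile(
    r"^\[(?P<tag>\d+)\]\s+(?P<host>[^:\s]+):PING is (?P<state>\w+): "
    r"PING \w+ - Packet loss = (?P<loss>\d+)%"
)


class PingAlert(NamedTuple):
    tag: str          # bracketed prefix, meaning not known from this bug
    host: str
    state: str
    packet_loss: int  # percent


def parse_ping_alert(line: str) -> Optional[PingAlert]:
    """Return the parsed alert, or None if the line is not a PING alert."""
    m = ALERT_RE.match(line.strip())
    if m is None:
        return None
    return PingAlert(m["tag"], m["host"], m["state"], int(m["loss"]))


def hosts_fully_unreachable(lines: Iterable[str]) -> List[str]:
    """Sorted, de-duplicated hosts that reported 100% packet loss."""
    alerts = (parse_ping_alert(line) for line in lines)
    return sorted({a.host for a in alerts if a and a.packet_loss == 100})
```

Feeding it the four lines above would give one entry per slave, which makes it easier to cross-check each one by hand.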
Assignee: server-ops-releng → zandr
Maybe they took slightly longer than usual to reboot after completing a job and tripped the nagios threshold on ping? (We had a similar problem before with the n810s, hence the idea.)
That's one theory.  I don't know what the thresholds are right now.

I'll be taking a deeper look at this on Monday, and will reassign to myself at that point unless zandr has different plans.
OK, here's a funny one.  I'm logged into linux-ix-slave08 right now, and it's pingable.  However, Nagios has

Current Status:	CRITICAL (for 60d 19h 48m 33s)  (Has been acknowledged)
Last Check Time: 01-15-2011 10:33:57

Two problems here:
 * somehow nagios has been pinging this box for two months and not getting a response
 * PING failure is not causing Nagios to think that the host is down (which would appear as a host problem, not a service problem, and would silence all service alerts for the host).  In fact, despite a lot of known-down hosts, our "hosts" list is entirely green!
It *does* look like Nagios is avoiding checking the other services when PING is down - it hasn't looked at the number of running buildbot processes since November, either.

I submitted a passive "OK" result for the host - let's see if that gets it back into action.
It did not get it back into action.  For a while, the service was green/OK, but now it's CRITICAL again (although with no alert to #build).  The host has been up for almost 2h, and it's not been 2h since I posted the last comment, so in principle it should have been pingable for that entire time.

I'm tcpdumping ICMP messages on the machine now.  I see pings from nagios to other hosts, but none to linux-ix-slave08 yet.
OK, I tried twice, and I don't see any pings to linux-ix-slave08 via tcpdump on that host[1], yet nagios reverts to marking the PING service on that slave as CRITICAL.

Maybe it's time to punt this to netops to see if the pings are blocked for some reason?  We could also do a simple test by running 'ping' on the command line on bm-admin01 while running tcpdump on the slave.

[1] note that I do see a lot of pings to various talos-* slaves, but not to any non-talos* slaves
The linux-ix-slave08 case will be fallout from it moving to SCL without the nagios config getting regenerated (it caches IP addresses or something). The move happened in bug 612299 comment #14. IT can fix that up pretty quickly.
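If the cached-address theory is right, a check along these lines could flag other hosts in the same state. It is a sketch under assumptions: `configured` stands for a hostname-to-address mapping pulled out of the generated nagios config by some means not shown here, `find_stale_addresses` is a made-up name, and the addresses in the usage note are documentation-range placeholders, not real slave IPs.

```python
import socket
from typing import Callable, Dict, Optional, Tuple


def find_stale_addresses(
    configured: Dict[str, str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each mismatched host to (configured address, current address).

    The current address is None when the name no longer resolves.
    """
    stale = {}
    for host, configured_ip in configured.items():
        try:
            current_ip = resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            current_ip = None
        if current_ip != configured_ip:
            stale[host] = (configured_ip, current_ip)
    return stale
```

For example, `find_stale_addresses({"linux-ix-slave08": "192.0.2.108"})` would report the host if DNS now returns anything other than the placeholder address. The `resolve` parameter is there only so the lookup can be swapped out for testing.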
(In reply to comment #0)
> it's not often, but they come in clumps:

Is this different from the usual sequence: a talos box falls over, the nagios report isn't actioned (i.e. nobody adds the box to the reboot bug and acks nagios), and nagios repeats the alert every 2 hours?
E.g. for talos-r3-fed64-033:

* the last job it did ended at 2011-01-12 04:59:45 (status db) [all times in PDT]
* the PING alert started failing 2011-01-12 05:02, which will be the reboot after that last job
* nagios thinks it made these notifications:
https://nagios.mozilla.org/nagios/cgi-bin/notifications.cgi?host=talos-r3-fed64-033.build&service=PING
* if you step back in time on that there are *no* OK notifications between Jan 12 and now
* therefore this box has been down for 4 days

So what am I missing? This looks just like the usual 'talos box failing to reboot' issue.
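The reasoning in the list above (no OK notification since the first failure, so the box has been down the whole time) can be written out as a small sketch. The `(timestamp, state)` tuples are an assumed representation of the notification history, not the actual output format of notifications.cgi, and `down_since` is a made-up name.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple


def down_since(
    notifications: Iterable[Tuple[datetime, str]],
    now: datetime,
) -> Optional[timedelta]:
    """Time since the first non-OK notification that follows the last OK.

    Returns None if the most recent notification is OK or there are none.
    """
    outage_start = None
    for timestamp, state in sorted(notifications):
        if state == "OK":
            outage_start = None       # recovery ends any outage in progress
        elif outage_start is None:
            outage_start = timestamp  # first failure of a new outage
    return None if outage_start is None else now - outage_start
```

With a first CRITICAL at 2011-01-12 05:02, repeats every 2 hours, and no OK in between, a `now` exactly four days later returns `timedelta(days=4)`.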
bug 626872 filed to reset the IP cache as nthomas mentioned in comment #8 (oops, I blamed that comment on zandr, sorry!)

I'm keeping this open to try to find an example that's not linux-ix-slave08.
OK, no luck - all of the other ping failures are either transient (by design) or real.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations