Closed Bug 810188 Opened 12 years ago Closed 12 years ago

nagiosbot arbitrarily doesn't display notifications on IRC

Categories

(Infrastructure & Operations :: Infrastructure: Other, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ashish, Assigned: rtucker)

Details

Over the past few days nagios-releng, nagios-phx1 and nagios-scl3 have stopped relaying nagios notifications to IRC. The nagiosbot logs show no sign of the bot picking up the alerts. Here is one example that nagios-scl3 did not relay to #sysadmins:

[1352433489] HOST ALERT: nagios1.private.corp.phx1.mozilla.com;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: nmdashnagios;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: oncall;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-sms;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms

The only one of these that should have made it to IRC is this line:

[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
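
To make the point above concrete, here is a minimal sketch of splitting the notification lines quoted in this bug into fields and deciding which ones would be relayed. It is not nagiosbot's code. The field order is read off the log lines in this bug, and the names `Notification`, `parse_notification`, `irc_channel_for` and the contact-to-channel map are hypothetical, inferred only from which alerts the comments say should have reached #sysadmins and #buildduty.

```python
from typing import NamedTuple, Optional


class Notification(NamedTuple):
    timestamp: int           # epoch seconds from the [..] prefix
    kind: str                # "HOST" or "SERVICE"
    contact: str
    host: str
    service: Optional[str]   # None for host notifications
    state: str
    command: str
    output: str


# Assumption: inferred from this bug's comments, not from the bot's config.
CHANNEL_FOR_CONTACT = {
    "sysalertslist": "#sysadmins",
    "buildteam": "#buildduty",
}


def parse_notification(line: str) -> Optional[Notification]:
    """Return a Notification for HOST/SERVICE NOTIFICATION lines, else None."""
    line = line.strip()
    if not line.startswith("["):
        return None
    stamp, _, rest = line[1:].partition("] ")
    label, sep, payload = rest.partition(": ")
    if not sep or not stamp.isdigit():
        return None
    if label == "HOST NOTIFICATION":
        # contact;host;state;command;output (output may itself contain ';')
        parts = payload.split(";", 4)
        if len(parts) != 5:
            return None
        contact, host, state, command, output = parts
        return Notification(int(stamp), "HOST", contact, host, None,
                            state, command, output)
    if label == "SERVICE NOTIFICATION":
        # contact;host;service;state;command;output
        parts = payload.split(";", 5)
        if len(parts) != 6:
            return None
        contact, host, service, state, command, output = parts
        return Notification(int(stamp), "SERVICE", contact, host, service,
                            state, command, output)
    return None  # e.g. "HOST ALERT" lines are not notifications


def irc_channel_for(note: Notification) -> Optional[str]:
    """Channel the notification would be relayed to, or None if not relayed."""
    return CHANNEL_FOR_CONTACT.get(note.contact)
```

Run over the four lines in the first excerpt, this drops the `HOST ALERT` line, maps the `sysalertslist` notification to #sysadmins, and returns no channel for the `nmdashnagios` and `oncall` contacts.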

Was this an alert generated by the host running the bot itself?

A few more alerts that should have gone to #buildduty but didn't:

[1352448654] SERVICE NOTIFICATION: buildteam;bld-lion-r5-075.build.releng.scl3.mozilla.com;buildbot;WARNING;notify-by-email;PROCS WARNING: 0 processes with command name python, args buildbot.tac
[1352449175] HOST NOTIFICATION: buildteam;w64-ix-slave09.winbuild.scl1.mozilla.com;DOWN;host-notify-by-email;PING CRITICAL - Packet loss = 100%

Timestamps of those two alerts are:
Fri Nov  9 00:10:54 PST 2012
Fri Nov  9 00:19:35 PST 2012
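
For anyone cross-checking the log excerpts against the IRC log, the bracketed number at the start of each line is a Unix epoch timestamp. A small sketch of the conversion, assuming a fixed UTC-8 offset for PST (which holds for these November 2012 dates); the function name `nagios_epoch_to_pst` is only illustrative:

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-8 offset, valid because the alerts fall outside DST.
PST = timezone(timedelta(hours=-8), "PST")


def nagios_epoch_to_pst(epoch: int) -> datetime:
    """Convert the [epoch] prefix of a nagios log line to a PST datetime."""
    return datetime.fromtimestamp(epoch, tz=PST)


for epoch in (1352448654, 1352449175):
    print(epoch, nagios_epoch_to_pst(epoch).isoformat())
# 1352448654 2012-11-09T00:10:54-08:00
# 1352449175 2012-11-09T00:19:35-08:00
```

Both results fall a few minutes before the 00:27 quit/join of nagios-releng shown in the #buildduty log.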

Corresponding (lack of) activity in #buildduty:

--- Day changed Fri Nov 09 2012
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has quit [Input/output error]
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has joined #buildduty

FWIW, bouncing the bot usually fixes the issue, as noted in the oncall notes:

November 8 US
* I got pages from phx1 that weren't showing up in #sysadmins.  Restarting the nagiosbot-python service on nagios1.private.phx1 fixed it.  Logs looked ok.

I'll try to post the corresponding alerts in a bit.

A few more alerts that missed #sysadmins and only paged the oncall:

[1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)

Timestamps:
Thu Nov  8 09:24:21 PST 2012
Thu Nov  8 09:24:31 PST 2012
Thu Nov  8 09:24:31 PST 2012

[1352395851] SERVICE NOTIFICATION: sysalertslist;tp-bugs01-master01.phx.mozilla.com;Schwartz Queue;CRITICAL;notify-by-email;CRITICAL: Bugzilla::Job::Mailer: ct=2275 max=776.

Thu Nov  8 09:30:51 PST 2012

(In reply to Ashish Vijayaram [:ashish] from comment #4)
> A few more alerts that missed #sysadmins and only paged the oncall:
> 
> [1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)
> 
> Timestamps:
> Thu Nov  8 09:24:21 PST 2012
> Thu Nov  8 09:24:31 PST 2012
> Thu Nov  8 09:24:31 PST 2012

I added support to the bot for the UNREACHABLE host state.
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: shyam → jdow
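
The UNREACHABLE fix above can be illustrated with a small sketch of how a relay that only recognises some host states would silently skip such a line. This is a guess at the failure mode, not the bot's actual code: `build_host_pattern` and the state sets are hypothetical, and it would not explain the missed UP, DOWN, WARNING and CRITICAL alerts reported earlier in this bug.

```python
import re


def build_host_pattern(states):
    """Compile a matcher for HOST NOTIFICATION lines limited to known states."""
    alternation = "|".join(re.escape(s) for s in sorted(states))
    return re.compile(
        r"^\[(\d+)\] HOST NOTIFICATION: ([^;]+);([^;]+);("
        + alternation
        + r");([^;]+);(.*)$"
    )


# Log line taken from this bug.
LINE = ("[1352395461] HOST NOTIFICATION: sysalertslist;"
        "bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;"
        "host-notify-by-email;PING CRITICAL - Packet loss = 100%")

# Hypothetical "before": UNREACHABLE is not a known state, so nothing matches.
before = build_host_pattern({"UP", "DOWN"})
print(before.match(LINE))  # None

# Hypothetical "after": all three Nagios host states are recognised.
after = build_host_pattern({"UP", "DOWN", "UNREACHABLE"})
print(after.match(LINE).group(4))  # UNREACHABLE
```

A matcher like this gives no error on an unknown state; the line simply never reaches IRC, which would be consistent with logs that "looked ok".
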
Could this be a log rotation issue?
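
If log rotation were the cause, the usual mechanism is that the reader keeps its handle on the renamed file and never sees lines written to the new one, which would also fit the observation that restarting the bot helps. A minimal sketch of a reader that notices rotation by comparing inodes follows. How nagiosbot actually reads the log is not stated in this bug, so the class `RotationAwareReader` and its behaviour are assumptions for illustration only.

```python
import os


class RotationAwareReader:
    """Return new lines from a log file, reopening it after a rotation."""

    def __init__(self, path):
        self.path = path
        self._fh = None
        self._inode = None

    def _open(self):
        self._fh = open(self.path, "r")
        self._inode = os.fstat(self._fh.fileno()).st_ino

    def read_new_lines(self):
        if self._fh is None:
            self._open()
        # Drain whatever is left in the file we currently hold open.
        lines = self._fh.readlines()
        try:
            current_inode = os.stat(self.path).st_ino
        except FileNotFoundError:
            # Rotated away and not yet recreated; try again on the next call.
            return lines
        if current_inode != self._inode:
            # The path now points at a different file: switch to it.
            self._fh.close()
            self._open()
            lines.extend(self._fh.readlines())
        return lines
```

Calling `read_new_lines()` in a polling loop picks up lines from the old file first and then from the replacement, so nothing written around the rotation is skipped. A reader without the inode check would keep returning an empty list after the rename.
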
Is this still happening? I think it was fixed by adding UNREACHABLE support to the bot.
Assignee: server-ops-infra → rtucker
Going to close this out. No activity in a month.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations