nagiosbot arbitrarily doesn't display notifications on IRC

RESOLVED FIXED

Status

RESOLVED FIXED
6 years ago
5 years ago

People

(Reporter: ashish, Assigned: rtucker)

(Reporter)

Description

6 years ago
Over the past few days nagios-releng, nagios-phx1 and nagios-scl3 have stopped relaying nagios notifications to IRC. The nagiosbot logs show no sign of the bot picking up the alerts. Here is one example that nagios-scl3 did not propagate to #sysadmins:

[1352433489] HOST ALERT: nagios1.private.corp.phx1.mozilla.com;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: nmdashnagios;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: oncall;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-sms;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
(Assignee)

Comment 1

6 years ago
The only one of these that should have made it to IRC is this line:

[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms

Was this an alert generated by the host running the bot itself?
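
For reference, a minimal sketch of how a relay could pick out that one line. It assumes the standard Nagios log layout (`contact;host;state;command;output` for host notifications, with an extra `service` field for service notifications). The `IRC_CONTACTS` set and the names `parse_notification` and `should_relay` are hypothetical and are not taken from nagiosbot itself.

```python
import re
from typing import NamedTuple, Optional


class Notification(NamedTuple):
    epoch: int
    kind: str               # "HOST" or "SERVICE"
    contact: str
    host: str
    service: Optional[str]  # None for host notifications
    state: str
    command: str
    output: str


# Only "... NOTIFICATION:" lines match; "HOST ALERT:" lines are ignored.
LINE_RE = re.compile(r"^\[(\d+)\] (HOST|SERVICE) NOTIFICATION: (.*)$")

# Assumption: the relay forwards only notifications sent to these contacts.
IRC_CONTACTS = {"sysalertslist"}


def parse_notification(line: str) -> Optional[Notification]:
    """Split one Nagios log line into fields, or return None if it is not a notification."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    epoch, kind, rest = int(m.group(1)), m.group(2), m.group(3)
    if kind == "HOST":
        parts = rest.split(";", 4)
        if len(parts) != 5:
            return None
        contact, host, state, command, output = parts
        service = None
    else:
        parts = rest.split(";", 5)
        if len(parts) != 6:
            return None
        contact, host, service, state, command, output = parts
    return Notification(epoch, kind, contact, host, service, state, command, output)


def should_relay(line: str) -> bool:
    """True if the line is a notification addressed to a relayed contact."""
    n = parse_notification(line)
    return n is not None and n.contact in IRC_CONTACTS
```

Under these assumptions, the `nmdashnagios` and `oncall` lines from the description are skipped and only the `sysalertslist` line is relayed.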
(Reporter)

Comment 2

6 years ago
A few more alerts that should have gone to #buildduty but didn't:

[1352448654] SERVICE NOTIFICATION: buildteam;bld-lion-r5-075.build.releng.scl3.mozilla.com;buildbot;WARNING;notify-by-email;PROCS WARNING: 0 processes with command name python, args buildbot.tac
[1352449175] HOST NOTIFICATION: buildteam;w64-ix-slave09.winbuild.scl1.mozilla.com;DOWN;host-notify-by-email;PING CRITICAL - Packet loss = 100%

Timestamps of those two alerts are:
Fri Nov  9 00:10:54 PST 2012
Fri Nov  9 00:19:35 PST 2012
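
The bracketed values in the log lines are Unix epoch seconds. A small sketch of the conversion used for the timestamps above, assuming a fixed UTC-8 offset for PST (no DST handling) and a hypothetical helper name `epoch_to_pst`:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset labelled "PST"; daylight saving is not handled.
PST = timezone(timedelta(hours=-8), "PST")


def epoch_to_pst(epoch: int) -> str:
    """Format Unix epoch seconds in the `date`-style layout used in this bug."""
    dt = datetime.fromtimestamp(epoch, tz=PST)
    # The day of month is space-padded to width 2, as in `date` output.
    return f"{dt:%a %b} {dt.day:2d} {dt:%H:%M:%S %Z %Y}"


print(epoch_to_pst(1352448654))  # Fri Nov  9 00:10:54 PST 2012
print(epoch_to_pst(1352449175))  # Fri Nov  9 00:19:35 PST 2012
```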

Corresponding (lack of) activity in #buildduty:

--- Day changed Fri Nov 09 2012
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has quit [Input/output error]
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has joined #buildduty
(Reporter)

Comment 3

6 years ago
FWIW bouncing the bot usually fixes the issues, as noted in the oncall notes:

November 8 US
* I got pages from phx1 that weren't showing up in #sysadmins.  Restarting the nagiosbot-python service on nagios1.private.phx1 fixed it.  Logs looked ok.

I'll try to post the corresponding alerts in a bit.
(Reporter)

Comment 4

6 years ago
A few more alerts that never reached #sysadmins and only paged the oncall:

[1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)

Timestamps:
Thu Nov  8 09:24:21 PST 2012
Thu Nov  8 09:24:31 PST 2012
Thu Nov  8 09:24:31 PST 2012
(Reporter)

Comment 5

6 years ago
[1352395851] SERVICE NOTIFICATION: sysalertslist;tp-bugs01-master01.phx.mozilla.com;Schwartz Queue;CRITICAL;notify-by-email;CRITICAL: Bugzilla::Job::Mailer: ct=2275 max=776.

Thu Nov  8 09:30:51 PST 2012
(Assignee)

Comment 6

6 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #4)
> A few more alerts that never reached #sysadmins and only paged the oncall:
> 
> [1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)
> 
> Timestamps:
> Thu Nov  8 09:24:21 PST 2012
> Thu Nov  8 09:24:31 PST 2012
> Thu Nov  8 09:24:31 PST 2012

I added support to the bot for the UNREACHABLE host state.
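
To illustrate how a missing state could cause a notification to be dropped without any log noise, here is a purely hypothetical sketch. The bot's actual code is not shown in this bug; the state sets and the `host_state` helper below are assumptions made for illustration.

```python
from typing import FrozenSet, Optional

# Hypothetical state lists: before and after UNREACHABLE support was added.
HOST_STATES_BEFORE = frozenset({"UP", "DOWN"})
HOST_STATES_AFTER = HOST_STATES_BEFORE | {"UNREACHABLE"}

MARKER = "HOST NOTIFICATION: "


def host_state(line: str, known_states: FrozenSet[str]) -> Optional[str]:
    """Return the host state if it is a known one, else None (line is skipped)."""
    _, found, rest = line.partition(MARKER)
    if not found:
        return None
    # Fields after the marker: contact;host;state;command;output
    fields = rest.split(";", 4)
    if len(fields) < 5:
        return None
    state = fields[2]
    return state if state in known_states else None


line = ("[1352395461] HOST NOTIFICATION: sysalertslist;"
        "bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;"
        "host-notify-by-email;PING CRITICAL - Packet loss = 100%")
print(host_state(line, HOST_STATES_BEFORE))  # None
print(host_state(line, HOST_STATES_AFTER))   # UNREACHABLE
```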

Updated

6 years ago
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: shyam → jdow

Comment 7

6 years ago
Could this be a log rotation issue?
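
If rotation is the cause, a reader that keeps its original file handle open would go on watching the rotated-away file and never see new lines. A minimal sketch of a rotation-aware reader follows, assuming the bot tails the Nagios log by path. The helpers `rotated` and `read_new_lines` are hypothetical, not the bot's real functions.

```python
import os
from typing import BinaryIO, List, Optional, Tuple


def rotated(path: str, fh: BinaryIO) -> bool:
    """True if `path` no longer refers to the file that `fh` has open."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        return True  # moved away and not recreated yet
    opened = os.fstat(fh.fileno())
    if (on_disk.st_dev, on_disk.st_ino) != (opened.st_dev, opened.st_ino):
        return True  # renamed and a new file created in its place
    # Truncated in place to fewer bytes than we have already read.
    return on_disk.st_size < fh.tell()


def read_new_lines(path: str, fh: Optional[BinaryIO]) -> Tuple[Optional[BinaryIO], List[str]]:
    """Return (handle to keep, lines appended since the last call)."""
    raw: List[bytes] = []
    if fh is not None and rotated(path, fh):
        raw += fh.readlines()  # drain anything written before the rotation
        fh.close()
        fh = None
    if fh is None:
        try:
            fh = open(path, "rb")
        except FileNotFoundError:
            fh = None
    if fh is not None:
        raw += fh.readlines()
    return fh, [b.decode("utf-8", "replace").rstrip("\n") for b in raw]
```

Calling `read_new_lines` in a polling loop and passing back the returned handle each time would keep the reader attached to the current log file across rotations.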
(Assignee)

Comment 8

6 years ago
Is this still happening? I think it was fixed by adding support for UNREACHABLE to the bot.
(Assignee)

Updated

6 years ago
Assignee: server-ops-infra → rtucker
(Assignee)

Comment 9

6 years ago
Going to close this out. No activity in a month.
Status: NEW → RESOLVED
Last Resolved: 6 years ago
Resolution: --- → FIXED
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations