Bug 810188
Opened 12 years ago
Closed 12 years ago
nagiosbot arbitrarily doesn't display notifications on IRC
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: ashish, Assigned: rtucker)
Details
Over the past few days nagios-releng, nagios-phx1 and nagios-scl3 have stopped relaying nagios notifications to IRC. nagiosbot logs do not show any sign of the bot picking up the alerts. Here is one such example that nagios-scl3 didn't relay to #sysadmins:

[1352433489] HOST ALERT: nagios1.private.corp.phx1.mozilla.com;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: nmdashnagios;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: oncall;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-sms;PING OK - Packet loss = 0%, RTA = 20.58 ms
[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms
Comment 1 (Assignee) • 12 years ago
The only one of these that should have made it to IRC is this line:

[1352433489] HOST NOTIFICATION: sysalertslist;nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;PING OK - Packet loss = 0%, RTA = 20.58 ms

Was this an alert generated by the host running the bot itself?
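To make the point in comment 1 concrete, here is a minimal sketch of selecting which log lines are candidates for IRC, keyed on the notification's contact name. It is not the bot's actual code: `IRC_CONTACTS`, `NOTIFICATION_RE` and `irc_channel_for` are hypothetical names, and the contact-to-channel pairs are only inferred from this bug (sysalertslist for #sysadmins, buildteam for #buildduty).

```python
import re

# Assumption: contact names that map to an IRC channel. The two pairs below
# are inferred from this bug; the bot's real configuration is not shown here.
IRC_CONTACTS = {"sysalertslist": "#sysadmins", "buildteam": "#buildduty"}

# Matches "[epoch] HOST NOTIFICATION: ..." and "[epoch] SERVICE NOTIFICATION: ..."
NOTIFICATION_RE = re.compile(
    r"^\[(?P<epoch>\d+)\] (?P<kind>HOST|SERVICE) NOTIFICATION: (?P<fields>.*)$"
)


def irc_channel_for(line):
    """Return the channel a log line would be relayed to, or None."""
    match = NOTIFICATION_RE.match(line)
    if match is None:
        return None  # e.g. "HOST ALERT" lines are not notifications
    # The first semicolon-separated field of a notification is the contact.
    contact = match.group("fields").split(";", 1)[0]
    return IRC_CONTACTS.get(contact)


if __name__ == "__main__":
    sample = (
        "[1352433489] HOST NOTIFICATION: sysalertslist;"
        "nagios1.private.corp.phx1.mozilla.com;UP;host-notify-by-email;"
        "PING OK - Packet loss = 0%, RTA = 20.58 ms"
    )
    print(irc_channel_for(sample))
```

Under these assumptions, the HOST ALERT line and the nmdashnagios and oncall notifications from the description map to no channel, which matches the statement that only the sysalertslist line was expected on IRC.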
Comment 2 (Reporter) • 12 years ago
A few more alerts that should have gone to #buildduty but didn't:

[1352448654] SERVICE NOTIFICATION: buildteam;bld-lion-r5-075.build.releng.scl3.mozilla.com;buildbot;WARNING;notify-by-email;PROCS WARNING: 0 processes with command name python, args buildbot.tac
[1352449175] HOST NOTIFICATION: buildteam;w64-ix-slave09.winbuild.scl1.mozilla.com;DOWN;host-notify-by-email;PING CRITICAL - Packet loss = 100%

Timestamps of those two alerts are:
Fri Nov 9 00:10:54 PST 2012
Fri Nov 9 00:19:35 PST 2012

Corresponding (lack of) activity in #buildduty:
--- Day changed Fri Nov 09 2012
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has quit [Input/output error]
00:27 -!- nagios-releng [nagios-rel@moz-539655E7.fw1.releng.scl3.mozilla.net] has joined #buildduty
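The bracketed number at the start of each nagios log line is a Unix epoch timestamp, and the readable times in this comment are the same values rendered in PST. A small sketch of that conversion follows; `PST` and `log_time` are names chosen for illustration, and a fixed UTC-8 offset is assumed, which holds for these dates because US daylight saving time had already ended on Nov 4, 2012.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-8 offset labelled "PST". This is only valid for
# timestamps outside daylight saving time, as is the case for this bug.
PST = timezone(timedelta(hours=-8), "PST")


def log_time(epoch):
    """Convert the [epoch] prefix of a nagios log line to a PST datetime."""
    return datetime.fromtimestamp(epoch, tz=PST)


if __name__ == "__main__":
    for epoch in (1352448654, 1352449175):
        # Note: %d zero-pads the day, so Nov 9 is printed as "Nov 09".
        print(epoch, log_time(epoch).strftime("%a %b %d %H:%M:%S %Z %Y"))
```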
Comment 3 (Reporter) • 12 years ago
FWIW bouncing the bot usually fixes the issues, as noted in the oncall notes:

November 8 US
* I got pages from phx1 that weren't showing up in #sysadmins. Restarting the nagiosbot-python service on nagios1.private.phx1 fixed it. Logs looked ok.

I'll try to post the corresponding alerts in a bit.
Comment 4 (Reporter) • 12 years ago
A few more alerts that missed #sysadmins and only paged the oncall:

[1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
[1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)

Timestamps:
Thu Nov 8 09:24:21 PST 2012
Thu Nov 8 09:24:31 PST 2012
Thu Nov 8 09:24:31 PST 2012
Comment 5 (Reporter) • 12 years ago
[1352395851] SERVICE NOTIFICATION: sysalertslist;tp-bugs01-master01.phx.mozilla.com;Schwartz Queue;CRITICAL;notify-by-email;CRITICAL: Bugzilla::Job::Mailer: ct=2275 max=776.

Thu Nov 8 09:30:51 PST 2012
Comment 6 (Assignee) • 12 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #4)
> Few more alerts that did missed alert #sysadmins but only paged the oncall:
>
> [1352395461] HOST NOTIFICATION: sysalertslist;bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;bouncer1.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;PING CRITICAL - Packet loss = 100%
> [1352395471] HOST NOTIFICATION: sysalertslist;ns2.private.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;CRITICAL - Host Unreachable (10.8.75.22)
>
> Timestamps:
> Thu Nov 8 09:24:21 PST 2012
> Thu Nov 8 09:24:31 PST 2012
> Thu Nov 8 09:24:31 PST 2012

I added support for the bot to understand what UNREACHABLE is.
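As an illustration of how a missing host state can cause lines to be dropped without any error, here is a hedged sketch. The bot's real parsing code is not shown in this bug, so `host_notification_re` and the state lists are hypothetical; the only assumption is that the parser matches HOST NOTIFICATION lines against a fixed set of known states.

```python
import re


def host_notification_re(states):
    """Build a matcher for HOST NOTIFICATION lines limited to known states."""
    alternation = "|".join(re.escape(state) for state in states)
    return re.compile(
        r"^\[\d+\] HOST NOTIFICATION: "
        r"(?P<contact>[^;]+);(?P<host>[^;]+);(?P<state>" + alternation + r");"
        r"(?P<command>[^;]+);(?P<output>.*)$"
    )


# One of the lines quoted above, split only to keep the source readable.
line = (
    "[1352395461] HOST NOTIFICATION: sysalertslist;"
    "bouncer2.webapp.phx1.mozilla.com;UNREACHABLE;host-notify-by-email;"
    "PING CRITICAL - Packet loss = 100%"
)

# Hypothetical "before" and "after" state lists.
before = host_notification_re(["UP", "DOWN"])
after = host_notification_re(["UP", "DOWN", "UNREACHABLE"])

if __name__ == "__main__":
    # With UNREACHABLE missing, the pattern simply fails to match, so a
    # parser built this way would skip the line rather than raise an error.
    print(before.match(line))
    print(after.match(line).group("state"))
```

A parser of this shape would explain UNREACHABLE notifications going missing, but not the UP, DOWN and WARNING notifications reported earlier in this bug, which already carry states such a pattern would accept.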
Updated • 12 years ago
Assignee: server-ops → server-ops-infra
Component: Server Operations → Server Operations: Infrastructure
QA Contact: shyam → jdow
Comment 7 • 12 years ago
Could this be a log rotation issue?
Comment 8 (Assignee) • 12 years ago
Is this still happening? I'm thinking it was fixed by adding support for UNREACHABLE to the bot.
Updated (Assignee) • 12 years ago
Assignee: server-ops-infra → rtucker
Comment 9 (Assignee) • 12 years ago
Going to close this out. No activity in a month.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Component: Server Operations: Infrastructure → Infrastructure: Other
Product: mozilla.org → Infrastructure & Operations