Bug 627126 - (tracker) passive nagios check for all desktop slaves
Status: RESOLVED FIXED (opened 14 years ago, closed 13 years ago)
Component: Release Engineering :: General (defect, P2)
Tracking: not tracked
Reporter: dustin; Assigned: dustin

Description
Talos machines should run a passive check at startup that tells nagios, "Hey! I started up!"
Then nagios should be configured to interpret a lack of such pings as a hung slave, with approximately the same time scale as the existing hung-slave checks.
I think we can do the slave side of this in Python without installing Nagios - IIRC the protocol is dead simple. How about the nagios side? Does this sound doable?
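For illustration, the slave side could be as small as a startup hook that feeds one line to the stock send_nsca client. A minimal sketch, assuming send_nsca is installed on the slave; the nagios host and config path are placeholders, and the service name matches the buildbot-start service defined later in this bug:

# Hypothetical startup hook: report "I just started" to nagios as a passive
# check result.  Assumes the stock send_nsca client is installed; the nagios
# host and config path are placeholders, not real deployment values.
import socket
import subprocess

NAGIOS_HOST = "nagios.build.example.com"     # placeholder
SEND_NSCA_CFG = "/etc/nagios/send_nsca.cfg"  # placeholder

def report_startup(service="buildbot-start"):
    # send_nsca reads "host<TAB>service<TAB>return_code<TAB>plugin_output"
    line = "%s\t%s\t0\tbuildslave started\n" % (socket.gethostname(), service)
    proc = subprocess.Popen(["send_nsca", "-H", NAGIOS_HOST, "-c", SEND_NSCA_CFG],
                            stdin=subprocess.PIPE)
    proc.communicate(line)
    return proc.returncode == 0

if __name__ == "__main__":
    report_startup()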
Assignee | Comment 1 • 14 years ago
zandr, joduinn - this is the bug I mentioned while we were talking about nagios and talos. This, alone, will go a long way toward keeping our talos slaves up.
I wrote up the big picture here:
https://wiki.mozilla.org/User:Djmitche/Slave_Wrangling_with_Nagios
my writing skills are at low ebb today, so hopefully that's somewhat sensible.
I'm making this a tracker and will try to block out the course to completion using other bugs.
Priority: -- → P2
Summary: passive nagios check for talos machines → (tracker) passive nagios check for talos machines
Assignee | Updated • 14 years ago
Assignee | Updated • 14 years ago
Summary: (tracker) passive nagios check for talos machines → (tracker) passive nagios check for all desktop slaves
Assignee | Updated • 14 years ago
No longer blocks: releng-nagios
Assignee | Comment 2 • 14 years ago
Update: the concept is proven - it only remains to implement it. The nagios checks are proceeding nicely, but the deployment of idleizer on Windows is still a big unknown.
Assignee | Comment 3 • 13 years ago
I'm deploying idleizer on the following POSIX build hosts:
linux-hgwriter-slave01
linux-hgwriter-slave02
linux-hgwriter-slave03
linux-hgwriter-slave04
linux-ix-slave03
linux-ix-slave04
linux64-ix-slave01
linux64-ix-slave02
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux-slave04
moz2-linux-slave10
moz2-linux-slave51
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01
and rebooting where they aren't busy. Let's see how things turn out!
Assignee | Comment 4 • 13 years ago
Landed with the following buildbot.tac template:
from twisted.application import service
from buildbot.slave.bot import BuildSlave
from twisted.python.logfile import LogFile
from twisted.python.log import ILogObserver, FileLogObserver

# %(...)r placeholders are filled in per-slave when the template is rendered
maxdelay = 300
buildmaster_host = %(buildmaster_host)r
passwd = %(passwd)r
maxRotatedFiles = None
basedir = %(basedir)r
umask = 002
slavename = %(slavename)r
usepty = False
rotateLength = 1000000
port = %(port)r
keepalive = None

application = service.Application("buildslave")
logfile = LogFile.fromFullPath("twistd.log", rotateLength=rotateLength,
                               maxRotatedFiles=maxRotatedFiles)
application.setComponent(ILogObserver, FileLogObserver(logfile).emit)

s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)

# enable idleizer: reboot the slave after 7h idle or 1h disconnected
from buildslave import idleizer
idlz = idleizer.Idleizer(s, max_idle_time=3600*7, max_disconnected_time=3600*1)
idlz.setServiceParent(application)
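For context, a sketch of how a %-template like this could be rendered into a concrete buildbot.tac; the host and port values below come from the log excerpt in comment 8, while the template file name, password, and rendering code are hypothetical, not the actual deployment tooling:

# Hypothetical rendering step, for illustration only: fill the %(...)r
# placeholders to produce a concrete buildbot.tac.
values = {
    "buildmaster_host": "preproduction-master.build.sjc1.mozilla.com",
    "port": 9010,
    "slavename": "linux-ix-slave04",
    "passwd": "not-the-real-password",   # placeholder
    "basedir": "/builds/slave",
}

template = open("buildbot.tac.template").read()
# %r renders each value as a Python literal, so the generated file is
# itself valid Python when twistd loads it.
open("buildbot.tac", "w").write(template % values)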
Comment 5 • 13 years ago
What happens with Win32 machines that get re-imaged and have buildbot 0.7 marked to be installed instead of the 0.8.x version?
Assignee | Comment 6 • 13 years ago
That's bug 662853, not this one.
Assignee | Comment 7 • 13 years ago
Sorry, bug 661758. TOO MANY BUGZ!
Assignee | Comment 8 • 13 years ago
On linux-ix-slave04, I'm seeing
2011-06-08 17:10:51-0700 [-] Log opened.
2011-06-08 17:10:51-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.6.5) starting up.
# slave starts at 17:10
...
2011-06-08 17:11:00-0700 [Broker,client] Connected to preproduction-master.build.sjc1.mozilla.com:9010; slave is ready
2011-06-08 17:11:00-0700 [Broker,client] SlaveBuilder.remote_print(Firefox mozilla-1.9.1 linux l10n nightly): message from master: ping
# build starts immediately
...
2011-06-08 18:33:31-0700 [Broker,client] I have a leftover directory 'rel-192-lnx-update-verify-1' that is not being used by the buildmaster: you can delete it now
# reconfig occurs
...
2011-06-08 19:11:03-0700 [Broker,client] argv: ['bash', '-c', '/builds/slave/2.0-lnx-l10n-ntly/build/mozilla-2.0/tools/update-packaging/unwrap_full_update.pl ../dist/update/previous.mar']
# commands still running
...
2011-06-08 19:11:03-0700 [Broker,client] using PTY: False
2011-06-08 19:11:06-0700 [-] Received SIGTERM, shutting down.
2011-06-08 19:11:06-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x9aa528c>
2011-06-08 19:11:06-0700 [-] command interrupted, attempting to kill
2011-06-08 19:11:06-0700 [-] trying to kill process group 3736
2011-06-08 19:11:06-0700 [-] signal 9 sent successfully
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote
# and a SIGTERM.
A few things to note:
* SIGTERM is approximately 2h after startup, but the idleizer timeouts in buildbot.tac are currently set to 7h idle and 1h disconnected.
* I do see someone logging in just before this:
cltbld pts/0 bm-vpn01.build.s Wed Jun 8 19:07 - 09:38 (14:31)
So I'm going to assume this wasn't idleizer.
Assignee | Comment 9 • 13 years ago
This check is running now, although it's still downtime'd for most hosts; the bug is waiting on a complete idleizer rollout before it can be marked FIXED.
Assignee | Updated • 13 years ago
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Assignee | Comment 10 • 13 years ago
A bunch of the staging slaves are dead for unrelated reasons (full VMs, iX reimages). However, the following idleizer'd slaves have been restarted with 0.8.4-pre-moz2:
linux-ix-slave03
linux-ix-slave04
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01
I've also dialed the idleizer timeouts back to 5 and 35 minutes, in hopes of triggering any idleizer-related failures earlier in the overnight.
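For illustration, the shortened test timeouts would presumably correspond to something like the following in the tac template's idleizer section; which value is the disconnected timeout and which the idle timeout is my assumption, since the comment doesn't say:

# Assumed mapping for the 5/35-minute test settings: 5 min disconnected,
# 35 min idle (the comment above doesn't state which is which).
from buildslave import idleizer
idlz = idleizer.Idleizer(s, max_idle_time=35*60, max_disconnected_time=5*60)
idlz.setServiceParent(application)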
Assignee | Comment 11 • 13 years ago
I saw a number of connected-but-idle hosts in the list above reboot. Success!
I've turned on idleizer for the following dev/pp talos systems, too:
talos-r3-fed-001
talos-r3-fed-002
talos-r3-fed-010
talos-r3-fed64-010
talos-r3-leopard-001
talos-r3-leopard-002
talos-r3-leopard-010
talos-r3-snow-010
talos-r3-fed64-001
talos-r3-snow-001
talos-r3-snow-002
and rebooted them (they were all idle or locked to a disabled master)
Assignee | Comment 12 • 13 years ago
I disabled idleizer for linux64-ix-slave01 and linux64-ix-slave02 since they are connected to the production puppet servers and thus don't have 0.8.4-pre-moz2.
Assignee | Comment 13 • 13 years ago
I've seen this on two machines now:
2011-07-24 20:02:18-0700 [Broker,client] While trying to connect:
Traceback from remote host -- Traceback (most recent call last):
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
d = self.portal.login(self, mind, IPerspective)
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
).addCallback(self.realm.requestAvatar, mind, *interfaces
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
callbackKeywords=kw)
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
self.result = callback(self.result, *args, **kw)
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
p = self.botmaster.getPerspective(mind, avatarID)
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
_name, args, kw)
File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker
which is occurring because they're reconnecting *really* quickly (I have the timeout set to 39s for testing, which is faster than most masters can reply). This *could* still occur with a longer timeout, but I think it would not be nearly so common.
The fix is to improve the master side's handling of DeadReferenceError, using something like what's available in newer versions of Buildbot - see the sketch below.
I'll dial back the idleizer timeout to 1h (disconnected) and 7h (idle) while I roll out 0.8.4-pre-moz2 to the other slaves.
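As a rough sketch only (not the patch that actually landed upstream), the master could tolerate the stale broker by catching DeadReferenceError around the callRemote to the old connection; the callRemote and its message string come from the 0.8.2 getPerspective() frame in the traceback above, while the surrounding handling is assumed:

# Sketch, not the actual upstream fix: if the old connection's broker is
# already dead, callRemote raises DeadReferenceError synchronously; treat
# that as "the old slave connection is gone" instead of failing the new login.
from twisted.spread.pb import DeadReferenceError

try:
    d = sl.slave.callRemote("print",
            "master got a duplicate connection; keeping this one")
except DeadReferenceError:
    sl.slave = None   # forget the stale connection; let the new one register
    d = None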
Assignee | Comment 14 • 13 years ago
Stale broker errors deferred to bug 668237
Assignee | Comment 15 • 13 years ago
The service definition now looks like this:
define service {
    use                       generic-service
    host_name                 replace_with_host_name
    service_description       buildbot-start
    servicegroups             scl1-default
    contact_groups            build
    active_checks_enabled     0
    passive_checks_enabled    1
    check_freshness           1
    normal_check_interval     1
    max_check_attempts        1
    freshness_threshold       56000   ; ~15.5h
    notifications_enabled     1
    notification_options      w,u,c,r
    check_command             notify-no-buildbot-start
    notification_period       24x7
    notification_interval     120
}
The ~15.5h freshness threshold is 7h (idle timeout) + 8h (max build time), plus a little slack.
Ideally, this will mark the service critical if no passive (OK) result is received in that window, and will notify every 2h thereafter. Amy suggests that the notifications will not repeat if the result has not been updated, which would mean we only get notified every ~16h.
Assignee | Comment 16 • 13 years ago
I'm going to test with linux64-ix-slave14 (which is dead, see bug 678907): I'll send a passive check result now and monitor the results over the next 20h or more.
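For reference, "sending a passive check result" by hand from the nagios server can be done through the external command file; a minimal sketch, with the command-file path assumed rather than confirmed for this installation:

# Sketch: inject a passive buildbot-start result via nagios's external
# command file.  The command-file path is an assumption, not the confirmed
# path on this nagios host.
import time

CMD_FILE = "/var/spool/nagios/nagios.cmd"  # assumed path

def submit_ok(host, service="buildbot-start", output="buildslave started"):
    # External command format:
    # [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;0;%s\n" % (
        int(time.time()), host, service, output)
    open(CMD_FILE, "a").write(line)

submit_ok("linux64-ix-slave14")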
Assignee | Comment 17 • 13 years ago
Oops - that host is dead, so the PING failure will hide everything else. I'll use moz2-linux-slave03 instead, which started on 09-06-2011 12:52:46 and is not running buildslave. Let's see when it alerts.
Assignee | Comment 18 • 13 years ago
The host's last startup was ~1100 PDT yesterday. It alerted at 4:31 this morning, but did not re-alert at 6:31. Amy found mailing-list posts pointing out that nagios will not re-alert if no further information arrives after the first alert.
For now, I think we can live with this - it means notifications every 15h instead of every 2h. If that's a problem, let's open a new bug to figure out how to solve it. (I'll leave that to releng.)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated • 11 years ago
Product: mozilla.org → Release Engineering