Closed Bug 627126 Opened 14 years ago Closed 13 years ago

(tracker) passive nagios check for all desktop slaves

Categories: Release Engineering :: General, defect, P2

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: dustin, Assigned: dustin

Talos machines should run a passive check at startup that tells nagios, "Hey! I started up!" Then nagios should be configured to interpret a lack of such pings as a hung slave, with approximately the same time scale as the existing hung-slave checks. I think we can do the slave side of this in Python without installing Nagios - IIRC the protocol is dead simple. How about the nagios side? Does this sound doable?
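
A minimal sketch of what that slave-side startup ping could look like, assuming the slave submits it through an NSCA daemon on the nagios server using a locally installed send_nsca; the nagios hostname, send_nsca config path, service name, and message text below are placeholder assumptions, not values from this bug:

# hedged sketch: report "buildslave started" as a passive nagios check result
# via send_nsca (hostname, config path, and service name are assumptions)
import socket
import subprocess

NAGIOS_HOST = "nagios.example.mozilla.com"      # assumed NSCA listener
SERVICE = "buildbot-start"                      # assumed service_description

def send_startup_ping():
    hostname = socket.gethostname().split(".")[0]
    # send_nsca reads "host<TAB>service<TAB>return_code<TAB>output\n" on stdin
    line = "%s\t%s\t0\tbuildslave started\n" % (hostname, SERVICE)
    proc = subprocess.Popen(
        ["send_nsca", "-H", NAGIOS_HOST, "-c", "/etc/nagios/send_nsca.cfg"],
        stdin=subprocess.PIPE)
    proc.communicate(line)
    return proc.returncode == 0

if __name__ == "__main__":
    send_startup_ping()

Running something like this from an init script or startup item would produce one OK result per boot, which is what the nagios-side freshness check discussed later in this bug keys off of.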
zandr, joduinn - this is the bug I mentioned while we were talking about nagios and talos. This, alone, will go a long way toward keeping our talos slaves up. I wrote up the big picture here: https://wiki.mozilla.org/User:Djmitche/Slave_Wrangling_with_Nagios (my writing skills are at a low ebb today, so hopefully that's somewhat sensible). I'm making this a tracker and will try to block out the course to completion using other bugs.
Priority: -- → P2
Summary: passive nagios check for talos machines → (tracker) passive nagios check for talos machines
Depends on: 629694
Depends on: 629701, 565397
Blocks: 617166
Summary: (tracker) passive nagios check for talos machines → (tracker) passive nagios check for all desktop slaves
Depends on: 631851
Depends on: 637347
No longer blocks: releng-nagios
Update: the concept is proven; it only remains to implement it. The nagios checks are proceeding nicely, but the deployment of idleizer on Windows is still a big unknown.
I'm deploying idleizer on the following POSIX build hosts:

linux-hgwriter-slave01
linux-hgwriter-slave02
linux-hgwriter-slave03
linux-hgwriter-slave04
linux-ix-slave03
linux-ix-slave04
linux64-ix-slave01
linux64-ix-slave02
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux-slave04
moz2-linux-slave10
moz2-linux-slave51
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01

and rebooting where they aren't busy. Let's see how things turn out!
Landed with the following buildbot.tac template:

from twisted.application import service
from buildbot.slave.bot import BuildSlave
from twisted.python.logfile import LogFile
from twisted.python.log import ILogObserver, FileLogObserver

maxdelay = 300
buildmaster_host = %(buildmaster_host)r
passwd = %(passwd)r
maxRotatedFiles = None
basedir = %(basedir)r
umask = 002
slavename = %(slavename)r
usepty = False
rotateLength = 1000000
port = %(port)r
keepalive = None

application = service.Application("buildslave")
logfile = LogFile.fromFullPath("twistd.log", rotateLength=rotateLength,
                               maxRotatedFiles=maxRotatedFiles)
application.setComponent(ILogObserver, FileLogObserver(logfile).emit)
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)

# enable idleizer
from buildslave import idleizer
idlz = idleizer.Idleizer(s,
                         max_idle_time=3600*7,
                         max_disconnected_time=3600*1)
idlz.setServiceParent(application)
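
To make the tac snippet above easier to follow, here is an illustrative sketch only (not the actual buildslave.idleizer module; its internals aren't shown in this bug) of the kind of watchdog being enabled: a Twisted service that forces a reboot if the slave has been idle or disconnected longer than its timeout. All names below are assumptions.

# illustrative idle-watchdog sketch; not the real buildslave.idleizer
from twisted.application import service
from twisted.internet import reactor
from twisted.python import log
import os

class IdleWatchdog(service.Service):
    def __init__(self, max_idle_time, max_disconnected_time):
        self.max_idle_time = max_idle_time
        self.max_disconnected_time = max_disconnected_time
        self._timer = None

    def startService(self):
        service.Service.startService(self)
        self.reset(self.max_idle_time)

    def reset(self, timeout):
        # call this on any slave activity (or on disconnect, with the shorter
        # disconnected timeout) to restart the countdown
        if self._timer is not None and self._timer.active():
            self._timer.cancel()
        self._timer = reactor.callLater(timeout, self.reboot)

    def reboot(self):
        log.msg("slave idle/disconnected too long; forcing a reboot")
        os.system("shutdown -r now")   # platform-specific in practice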
What happens with Win32 machines that get re-imaged and have buildbot 0.7 marked to be installed instead of the 0.8.x version?
That's bug 662853, not this one.
Sorry, bug 661758. TOO MANY BUGZ!
On linux-ix-slave04, I'm seeing:

2011-06-08 17:10:51-0700 [-] Log opened.
2011-06-08 17:10:51-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.6.5) starting up.

# slave starts at 17:10
...

2011-06-08 17:11:00-0700 [Broker,client] Connected to preproduction-master.build.sjc1.mozilla.com:9010; slave is ready
2011-06-08 17:11:00-0700 [Broker,client] SlaveBuilder.remote_print(Firefox mozilla-1.9.1 linux l10n nightly): message from master: ping

# build starts immediately
...

2011-06-08 18:33:31-0700 [Broker,client] I have a leftover directory 'rel-192-lnx-update-verify-1' that is not being used by the buildmaster: you can delete it now

# reconfig occurs
...

2011-06-08 19:11:03-0700 [Broker,client] argv: ['bash', '-c', '/builds/slave/2.0-lnx-l10n-ntly/build/mozilla-2.0/tools/update-packaging/unwrap_full_update.pl ../dist/update/previous.mar']

# commands still running
...

2011-06-08 19:11:03-0700 [Broker,client] using PTY: False
2011-06-08 19:11:06-0700 [-] Received SIGTERM, shutting down.
2011-06-08 19:11:06-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x9aa528c>
2011-06-08 19:11:06-0700 [-] command interrupted, attempting to kill
2011-06-08 19:11:06-0700 [-] trying to kill process group 3736
2011-06-08 19:11:06-0700 [-] signal 9 sent successfully
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote

# and a SIGTERM

A few things to note:

* SIGTERM is approximately 2h after startup, but the idleizer timeouts in buildbot.tac are currently set to 7h idle and 1h disconnected.
* I do see someone logging in just before this:
  cltbld   pts/0   bm-vpn01.build.s   Wed Jun 8 19:07 - 09:38 (14:31)

So I'm going to assume this wasn't idleizer.
Depends on: 663399
Depends on: 665254
This check is running now, although it's still downtime'd for most hosts; it's waiting on a complete idleizer rollout before this bug can be marked FIXED.
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
A bunch of the staging slaves are dead for unrelated reasons (full VMs, iX reimages). However, the following idleizer'd slaves have been restarted with 0.8.4-pre-moz2:

linux-ix-slave03
linux-ix-slave04
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01

I've also dialed the idleizer timeouts back to 5 and 35 minutes, in hopes of triggering any idleizer-related failures earlier in the overnight.
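
In tac terms, that change would look something like the following; mapping the 5-minute value to max_disconnected_time and the 35-minute value to max_idle_time is an assumption, inferred from the 1h/7h ordering in the template above:

idlz = idleizer.Idleizer(s,
                         max_idle_time=35*60,            # assumed: 35 min idle
                         max_disconnected_time=5*60)     # assumed: 5 min disconnected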
I saw a number of connected-but-idle hosts in the list above reboot. Success! I've turned on idleizer for the following dev/pp talos systems, too: talos-r3-fed-001 talos-r3-fed-002 talos-r3-fed-010 talos-r3-fed64-010 talos-r3-leopard-001 talos-r3-leopard-002 talos-r3-leopard-010 talos-r3-snow-010 talos-r3-fed64-001 talos-r3-snow-001 talos-r3-snow-002 and rebooted them (they were all idle or locked to a disabled master)
I disabled idleizer for linux64-ix-slave01 and linux64-ix-slave02 since they are connected to the production puppet servers and thus don't have 0.8.4-pre-moz2.
I've seen this on two machines now:

2011-07-24 20:02:18-0700 [Broker,client] While trying to connect: Traceback from remote host -- Traceback (most recent call last):
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
    _name, args, kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
    raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker

This is occurring because they're reconnecting *really* quickly (I have the timeout set to 39s for testing, which is faster than most masters can reply). This *could* still occur with a longer timeout, but I think it would not be nearly so common. The real fix is to improve the master side's handling of DeadReferenceError, using something like what's available in newer versions of Buildbot.

I'll dial the idleizer timeouts back to 1h (disconnected) and 7h (idle) while I roll out 0.8.4-pre-moz2 to the other slaves.
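
A hedged sketch of the master-side handling described above (this is not the actual Buildbot patch; the helper name and structure are assumptions): when kicking a duplicate connection, a DeadReferenceError from the old, stale broker should be swallowed instead of being allowed to fail the new slave's login.

# hedged sketch: tolerate a stale broker when disconnecting a duplicate slave
from twisted.internet import defer
from twisted.spread import pb

def disconnect_old_slave(old_remote):
    """Tell the previously connected slave to go away; if its broker is
    already dead, just carry on with the new connection."""
    try:
        d = old_remote.callRemote(
            "print", "master got a duplicate connection; keeping this one")
    except pb.DeadReferenceError:
        # the old connection is already gone; nothing to notify
        return defer.succeed(None)
    # a failed remote print on a dying connection isn't fatal either
    d.addErrback(lambda f: f.trap(pb.DeadReferenceError))
    return d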
Stale broker errors deferred to bug 668237
Depends on: 683927
The service now looks like this:

define service {
        use                        generic-service
        host_name                  replace_with_host_name
        service_description        buildbot-start
        servicegroups              scl1-default
        contact_groups             build
        active_checks_enabled      0
        passive_checks_enabled     1
        check_freshness            1
        normal_check_interval      1
        max_check_attempts         1
        freshness_threshold        56000    ; ~15.5h
        notifications_enabled      1
        notification_options       w,u,c,r
        check_command              notify-no-buildbot-start
        notification_period        24x7
        notification_interval      120
}

The ~15.5h is 7h (idle timeout) + 8h (max build time), plus a little slack. Ideally, this will mark the service critical if no passive (OK) result is received within that window, and will notify every 2h thereafter. Amy suggests that the notifications will not repeat if the result has not been updated, which would mean we only get notified every 16h.
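
For reference, the passive results that feed this service are ordinary nagios passive service check results. A sketch of what one looks like when written to the nagios external command file (the command-file path, timestamp, and message text are assumptions, not taken from this bug):

# appended to nagios's external command pipe, e.g. /var/nagios/rw/nagios.cmd (path is an assumption)
[1317196300] PROCESS_SERVICE_CHECK_RESULT;moz2-linux-slave03;buildbot-start;0;buildslave started

Each OK result like this resets the service's freshness timer; if none arrives within freshness_threshold seconds, nagios runs notify-no-buildbot-start and the service should go critical as described above.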
I'm going to test with linux64-ix-slave14 (which is dead, see bug 678907). I'm going to send a passive check result now, and monitor the results over the next 20h or more.
Oops, that host is dead so the PING failure will hide everything else. I'll use moz2-linux-slave03, which started on 09-06-2011 12:52:46, and is not running buildslave. Let's see when it alerts.
Host's last startup was ~1100 PDT yesterday. It alerted at 4:31 this morning, but did not re-alert at 6:31. Amy uncovered the reason in some mailing-list posts: nagios will not re-alert if no further information is available after the first alert. For now, I think we can live with this - it means notifications every 15h instead of every 2h. If that's a problem, let's open a new bug to figure out how to solve it. (I'll leave that to releng.)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering