Closed Bug 627126 Opened 14 years ago Closed 13 years ago

(tracker) passive nagios check for all desktop slaves

Categories: Release Engineering :: General, defect, P2

Tracking: Not tracked

Status: RESOLVED FIXED

People: Reporter: dustin, Assigned: dustin

Talos machines should run a passive check at startup that tells nagios, "Hey! I started up!" Then nagios should be configured to interpret a lack of such pings as a hung slave, with approximately the same time scale as the existing hung-slave checks. I think we can do the slave side of this in Python without installing Nagios - IIRC the protocol is dead simple. How about the nagios side? Does this sound doable?
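
A minimal sketch of what that slave-side startup ping could look like, assuming the slave submits it through an NSCA daemon on the nagios server using a locally installed send_nsca; the nagios hostname, send_nsca config path, service name, and message text below are placeholder assumptions, not values from this bug:

# hedged sketch: report "buildslave started" as a passive nagios check result
# via send_nsca (hostname, config path, and service name are assumptions)
import socket
import subprocess

NAGIOS_HOST = "nagios.example.mozilla.com"      # assumed NSCA listener
SERVICE = "buildbot-start"                      # assumed service_description

def send_startup_ping():
    hostname = socket.gethostname().split(".")[0]
    # send_nsca reads "host<TAB>service<TAB>return_code<TAB>output\n" on stdin
    line = "%s\t%s\t0\tbuildslave started\n" % (hostname, SERVICE)
    proc = subprocess.Popen(
        ["send_nsca", "-H", NAGIOS_HOST, "-c", "/etc/nagios/send_nsca.cfg"],
        stdin=subprocess.PIPE)
    proc.communicate(line)
    return proc.returncode == 0

if __name__ == "__main__":
    send_startup_ping()

Running something like this from an init script or startup item would produce one OK result per boot, which is what the nagios-side freshness check discussed later in this bug keys off of.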
zandr, joduinn - this is the bug I mentioned while we were talking about nagios and talos. This, alone, will go a long way toward keeping our talos slaves up. I wrote up the big picture here: https://wiki.mozilla.org/User:Djmitche/Slave_Wrangling_with_Nagios (my writing skills are at a low ebb today, so hopefully that's somewhat sensible). I'm making this a tracker and will try to block out the course to completion using other bugs.
Priority: -- → P2
Summary: passive nagios check for talos machines → (tracker) passive nagios check for talos machines
Depends on: 629694
Depends on: 629701, 565397
Blocks: 617166
Summary: (tracker) passive nagios check for talos machines → (tracker) passive nagios check for all desktop slaves
Depends on: 631851
Depends on: 637347
No longer blocks: releng-nagios
Update: the concept is proven; it only remains to implement it. The nagios checks are proceeding nicely, but the deployment of idleizer on Windows is still a big unknown.
I'm deploying idleizer on the following POSIX build hosts:

linux-hgwriter-slave01
linux-hgwriter-slave02
linux-hgwriter-slave03
linux-hgwriter-slave04
linux-ix-slave03
linux-ix-slave04
linux64-ix-slave01
linux64-ix-slave02
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux-slave04
moz2-linux-slave10
moz2-linux-slave51
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01

and rebooting where they aren't busy. Let's see how things turn out!
Landed with the following buildbot.tac template:

from twisted.application import service
from buildbot.slave.bot import BuildSlave
from twisted.python.logfile import LogFile
from twisted.python.log import ILogObserver, FileLogObserver

maxdelay = 300
buildmaster_host = %(buildmaster_host)r
passwd = %(passwd)r
maxRotatedFiles = None
basedir = %(basedir)r
umask = 002
slavename = %(slavename)r
usepty = False
rotateLength = 1000000
port = %(port)r
keepalive = None

application = service.Application("buildslave")
logfile = LogFile.fromFullPath("twistd.log", rotateLength=rotateLength,
                               maxRotatedFiles=maxRotatedFiles)
application.setComponent(ILogObserver, FileLogObserver(logfile).emit)
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive, usepty, umask=umask, maxdelay=maxdelay)
s.setServiceParent(application)

# enable idleizer
from buildslave import idleizer
idlz = idleizer.Idleizer(s,
                         max_idle_time=3600*7,
                         max_disconnected_time=3600*1)
idlz.setServiceParent(application)
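
To make the tac snippet above easier to follow, here is an illustrative sketch only (not the actual buildslave.idleizer module; its internals aren't shown in this bug) of the kind of watchdog being enabled: a Twisted service that forces a reboot if the slave has been idle or disconnected longer than its timeout. All names below are assumptions.

# illustrative idle-watchdog sketch; not the real buildslave.idleizer
from twisted.application import service
from twisted.internet import reactor
from twisted.python import log
import os

class IdleWatchdog(service.Service):
    def __init__(self, max_idle_time, max_disconnected_time):
        self.max_idle_time = max_idle_time
        self.max_disconnected_time = max_disconnected_time
        self._timer = None

    def startService(self):
        service.Service.startService(self)
        self.reset(self.max_idle_time)

    def reset(self, timeout):
        # call this on any slave activity (or on disconnect, with the shorter
        # disconnected timeout) to restart the countdown
        if self._timer is not None and self._timer.active():
            self._timer.cancel()
        self._timer = reactor.callLater(timeout, self.reboot)

    def reboot(self):
        log.msg("slave idle/disconnected too long; forcing a reboot")
        os.system("shutdown -r now")   # platform-specific in practice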
What happens with Win32 machines that get re-imaged and have buildbot 0.7 marked to be installed instead of the 0.8.x version?
That's bug 662853, not this one.
Sorry, bug 661758. TOO MANY BUGZ!
On linux-ix-slave04, I'm seeing:

2011-06-08 17:10:51-0700 [-] Log opened.
2011-06-08 17:10:51-0700 [-] twistd 10.2.0 (/tools/buildbot-0.8.4-pre-moz1/bin/python 2.6.5) starting up.

# slave starts at 17:10
...

2011-06-08 17:11:00-0700 [Broker,client] Connected to preproduction-master.build.sjc1.mozilla.com:9010; slave is ready
2011-06-08 17:11:00-0700 [Broker,client] SlaveBuilder.remote_print(Firefox mozilla-1.9.1 linux l10n nightly): message from master: ping

# build starts immediately
...

2011-06-08 18:33:31-0700 [Broker,client] I have a leftover directory 'rel-192-lnx-update-verify-1' that is not being used by the buildmaster: you can delete it now

# reconfig occurs
...

2011-06-08 19:11:03-0700 [Broker,client] argv: ['bash', '-c', '/builds/slave/2.0-lnx-l10n-ntly/build/mozilla-2.0/tools/update-packaging/unwrap_full_update.pl ../dist/update/previous.mar']

# commands still running
...

2011-06-08 19:11:03-0700 [Broker,client] using PTY: False
2011-06-08 19:11:06-0700 [-] Received SIGTERM, shutting down.
2011-06-08 19:11:06-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x9aa528c>
2011-06-08 19:11:06-0700 [-] command interrupted, attempting to kill
2011-06-08 19:11:06-0700 [-] trying to kill process group 3736
2011-06-08 19:11:06-0700 [-] signal 9 sent successfully
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote
2011-06-08 19:11:06-0700 [Broker,client] lost remote

# and a SIGTERM

A few things to note:

* SIGTERM is approximately 2h after startup, but the idleizer timeouts in buildbot.tac are currently set to 7h idle and 1h disconnected.
* I do see someone logging in just before this:
  cltbld   pts/0   bm-vpn01.build.s   Wed Jun 8 19:07 - 09:38 (14:31)

So I'm going to assume this wasn't idleizer.
Depends on: 663399
Depends on: 665254
This check is running now, although it's still downtime'd for most hosts; it's waiting on a complete idleizer rollout before this bug can be marked FIXED.
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
A bunch of the staging slaves are dead for unrelated reasons (full VMs, iX reimages). However, the following idleizer'd slaves have been restarted with 0.8.4-pre-moz2:

linux-ix-slave03
linux-ix-slave04
moz2-darwin10-slave01
moz2-darwin10-slave03
moz2-darwin10-slave04
moz2-darwin10-slave10
moz2-darwin9-slave03
moz2-darwin9-slave08
moz2-darwin9-slave10
moz2-darwin9-slave68
moz2-linux64-slave07
moz2-linux64-slave10
mv-moz2-linux-ix-slave01

I've also dialed the idleizer timeouts back to 5 and 35 minutes, in hopes of triggering any idleizer-related failures earlier in the overnight.
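
In tac terms, that change would look something like the following; mapping the 5-minute value to max_disconnected_time and the 35-minute value to max_idle_time is an assumption, inferred from the 1h/7h ordering in the template above:

idlz = idleizer.Idleizer(s,
                         max_idle_time=35*60,            # assumed: 35 min idle
                         max_disconnected_time=5*60)     # assumed: 5 min disconnected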
I saw a number of connected-but-idle hosts in the list above reboot. Success! I've turned on idleizer for the following dev/pp talos systems, too: talos-r3-fed-001 talos-r3-fed-002 talos-r3-fed-010 talos-r3-fed64-010 talos-r3-leopard-001 talos-r3-leopard-002 talos-r3-leopard-010 talos-r3-snow-010 talos-r3-fed64-001 talos-r3-snow-001 talos-r3-snow-002 and rebooted them (they were all idle or locked to a disabled master)
I disabled idleizer for linux64-ix-slave01 and linux64-ix-slave02 since they are connected to the production puppet servers and thus don't have 0.8.4-pre-moz2.
I've seen this on two machines now:

2011-07-24 20:02:18-0700 [Broker,client] While trying to connect: Traceback from remote host -- Traceback (most recent call last):
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 1346, in remote_respond
    d = self.portal.login(self, mind, IPerspective)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/cred/portal.py", line 116, in login
    ).addCallback(self.realm.requestAvatar, mind, *interfaces
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 474, in requestAvatar
    p = self.botmaster.getPerspective(mind, avatarID)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/buildbot-0.8.2_hg_3dc678eecd11_production_0.8-py2.6.egg/buildbot/master.py", line 344, in getPerspective
    d = sl.slave.callRemote("print", "master got a duplicate connection; keeping this one")
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 328, in callRemote
    _name, args, kw)
  File "/builds/buildbot/tests-master/sandbox/lib/python2.6/site-packages/twisted/spread/pb.py", line 807, in _sendMessage
    raise DeadReferenceError("Calling Stale Broker")
twisted.spread.pb.DeadReferenceError: Calling Stale Broker

This is occurring because they're reconnecting *really* quickly (I have the timeout set to 39s for testing, which is faster than most masters can reply). This *could* still occur with a longer timeout, but I think it would not be nearly so common. The real fix is to improve the master side's handling of DeadReferenceError, using something like what's available in newer versions of Buildbot.

I'll dial the idleizer timeouts back to 1h (disconnected) and 7h (idle) while I roll out 0.8.4-pre-moz2 to the other slaves.
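
A hedged sketch of the master-side handling described above (this is not the actual Buildbot patch; the helper name and structure are assumptions): when kicking a duplicate connection, a DeadReferenceError from the old, stale broker should be swallowed instead of being allowed to fail the new slave's login.

# hedged sketch: tolerate a stale broker when disconnecting a duplicate slave
from twisted.internet import defer
from twisted.spread import pb

def disconnect_old_slave(old_remote):
    """Tell the previously connected slave to go away; if its broker is
    already dead, just carry on with the new connection."""
    try:
        d = old_remote.callRemote(
            "print", "master got a duplicate connection; keeping this one")
    except pb.DeadReferenceError:
        # the old connection is already gone; nothing to notify
        return defer.succeed(None)
    # a failed remote print on a dying connection isn't fatal either
    d.addErrback(lambda f: f.trap(pb.DeadReferenceError))
    return d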
Stale broker errors deferred to bug 668237
Depends on: 683927
The service now looks like this:

define service {
        use                        generic-service
        host_name                  replace_with_host_name
        service_description        buildbot-start
        servicegroups              scl1-default
        contact_groups             build
        active_checks_enabled      0
        passive_checks_enabled     1
        check_freshness            1
        normal_check_interval      1
        max_check_attempts         1
        freshness_threshold        56000    ; ~15.5h
        notifications_enabled      1
        notification_options       w,u,c,r
        check_command              notify-no-buildbot-start
        notification_period        24x7
        notification_interval      120
}

The ~15.5h is 7h (idle timeout) + 8h (max build time), plus a little slack. Ideally, this will mark the service critical if no passive (OK) result is received within that window, and will notify every 2h thereafter. Amy suggests that the notifications will not repeat if the result has not been updated, which would mean we only get notified every 16h.
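
For reference, the passive results that feed this service are ordinary nagios passive service check results. A sketch of what one looks like when written to the nagios external command file (the command-file path, timestamp, and message text are assumptions, not taken from this bug):

# appended to nagios's external command pipe, e.g. /var/nagios/rw/nagios.cmd (path is an assumption)
[1317196300] PROCESS_SERVICE_CHECK_RESULT;moz2-linux-slave03;buildbot-start;0;buildslave started

Each OK result like this resets the service's freshness timer; if none arrives within freshness_threshold seconds, nagios runs notify-no-buildbot-start and the service should go critical as described above.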
I'm going to test with linux64-ix-slave14 (which is dead, see bug 678907). I'm going to send a passive check result now, and monitor the results over the next 20h or more.
Oops, that host is dead so the PING failure will hide everything else. I'll use moz2-linux-slave03, which started on 09-06-2011 12:52:46, and is not running buildslave. Let's see when it alerts.
Host's last startup was ~1100 PDT yesterday. It alerted at 4:31 this morning, but did not re-alert at 6:31. Amy uncovered the reason in some mailing-list posts: nagios will not re-alert if no further information is available after the first alert. For now, I think we can live with this - it means notifications every 15h instead of every 2h. If that's a problem, let's open a new bug to figure out how to solve it. (I'll leave that to releng.)
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering