Closed Bug 619082 Opened 14 years ago Closed 13 years ago

slave alloc double-starting slavebuildslave processes?

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Attachments

(1 file)

From nthomas in bug 616003:

Might have a regression from this - nagios is reporting some linux-ix-slaveN with 2 buildbot processes running, e.g. linux-ix-slave19 has an (abridged) 'ps xf' of:
  PID TTY      STAT   TIME COMMAND
 3170 ?        Ss     0:00 /tools/python/bin/python /usr/local/bin/runslave.py
 3198 ?        S      0:00  \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
 3199 ?        Z      0:00      \_ [buildbot] <defunct>
 3202 ?        Sl     0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
Ranking P2 until we see this affecting more slaves.
Priority: -- → P2
I can replicate this on talos-r3-leopard-001.  It looks like launchd is running runslave.py twice at the same time, but puppet is only running once.  I suspect that this has something to do with puppet trying to start/enable the service, followed by touching the puppet.finished file.

This doesn't explain the linux-ix- problem, but it may be unrelated.
The leopard problem is separate, because it's launchd related.  It's also not landed yet, so nothing to worry about.

On linux-ix-slave40, we see:

root      1552  0.0  0.0   4492  1312 ?        Ss   08:27   0:00 /bin/bash /etc/rc.d/rc 3
root      3168  0.0  0.0   4488  1172 ?        S    08:27   0:00  \_ /bin/bash /etc/rc3.d/S99buildbot start
root      3169  0.0  0.0   4792  1152 ?        S    08:27   0:00      \_ su - cltbld sh -c /tools/python/bin/python /usr/local/bin/runslave.py
cltbld    3170  0.0  0.1  10240  5472 ?        Ss   08:27   0:00          \_ /tools/python/bin/python /usr/local/bin/runslave.py
cltbld    3198  0.0  0.2  13760  8308 ?        S    08:27   0:00              \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
cltbld    3199  0.0  0.0      0     0 ?        Z    08:28   0:00                  \_ [buildbot] <defunct>
...
cltbld    3202  0.0  0.2  26736 10072 ?        Sl   08:28   0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave

Looking at an strace, pid 3198 is hung waiting for bytes from fd 5, which is the read end of its signal socket, so it's waiting for SIGCHLD.  Pid 3202 is a fully functioning buildslave instance.

I think that this is a bug in the odd way that 'buildbot start' forks and internally emulates twistd.  It prints a loud warning about not seeing the child process exit, which is presumably what's happened here.  What's odd is that, even if it doesn't see the child process exit, it should quit after 10 seconds.
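
The behaviour expected here, where the launcher stops waiting after a fixed timeout even if it never sees the child exit, can be sketched roughly as below. This is only an illustration and not buildbot's code: `wait_for_child` and the example commands are hypothetical, and unlike 'buildbot start' the sketch kills the child on timeout purely so that the example cleans up after itself.

```python
import subprocess
import sys


def wait_for_child(argv, timeout):
    """Start argv and wait at most `timeout` seconds for it to exit.

    Returns the child's exit code, or None if the deadline passed first.
    Hypothetical helper; buildbot's own launcher is structured differently.
    """
    child = subprocess.Popen(argv)
    try:
        return child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The parent stops waiting instead of hanging forever. The child
        # is killed here only so the example leaves nothing running.
        child.kill()
        child.wait()
        return None


if __name__ == "__main__":
    quick = [sys.executable, "-c", "pass"]
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(wait_for_child(quick, timeout=10))   # child exits well before the deadline
    print(wait_for_child(slow, timeout=0.5))   # deadline passes first
```

In the hung case above, pid 3198 never reaches the equivalent of the timeout branch, which is why it lingers alongside the working buildslave.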

So I believe this is a generic problem, rather than one caused by runslave.py being run twice or otherwise malfunctioning.  I'll keep this ticket open to track it.
This is still harmless, but it is causing a lot of spurious nagios warnings and is definitely a bug.
Priority: P2 → P1
Catlee and I agree that the fix is to start the buildslave using twistd directly, rather than via 'buildbot start'.  This will make the process signature different, though, so that will need to change at the same time.  The nagios check is in nrpe.cfg on the slaves, which is (one hopes!) deployed via puppet.
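
As a rough sketch of what starting the slave via twistd might look like, the helper below builds a twistd command line for a slave directory. The function name `twistd_argv` and the pidfile name are assumptions for illustration; the flags are standard twistd options, but the exact set used by the real runslave.py is not shown in this bug.

```python
import os


def twistd_argv(basedir, twistd="twistd"):
    """Build a twistd command line for the slave in `basedir`.

    Hypothetical helper: the real runslave.py may pass different options.
    """
    return [
        twistd,
        "--no_save",                                          # do not save state on shutdown
        "--rundir=" + basedir,                                # change to this directory before running
        "--pidfile=" + os.path.join(basedir, "twistd.pid"),   # assumed pidfile name
        "--python=" + os.path.join(basedir, "buildbot.tac"),  # the slave's .tac file
    ]


if __name__ == "__main__":
    print(" ".join(twistd_argv("/builds/slave")))
```

A process started this way no longer has /tools/buildbot/bin/buildbot in its command line, which is why the process signature seen by nagios changes.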
This is running in staging and has been tested on
 talos-r3-fed-001
 linux-ix-slave01
 moz2-darwin10-slave03
 moz2-darwin9-slave68
 talos-r3-leopard-001

This will need to be landed together with the following changes to nrpe.cfg in the unversioned puppet files:

Old:

N/production/darwin9-i386/build/usr/local/nagios/etc/nrpe.cfg:
command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot

N/production/darwin10-i386/build/usr/local/nagios/etc/nrpe.cfg:
command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot

N/production/centos5-i686/build/etc/nagios/nrpe.cfg:
command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot

N/production/centos5-x86_64/build/etc/nagios/nrpe.cfg:
command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot

New:

darwin*:
/usr/local/nagios/plugins/check_procs -w 1:1 -C python --argument-array=buildbot.tac

centos5*:
/usr/lib/nagios/plugins/check_procs -w 1:1 -C twistd --argument-array=buildbot.tac

(the checks are different because twistd cannot rename the process on mac)
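
To make the effect of the old and new checks concrete, here is a simplified emulation of the matching they rely on: filter by command name (-C), filter by argument substring (-a / --argument-array), then warn when the count falls outside the -w range. This is a sketch and not the plugin's implementation; the function name `check_procs` and the sample process tables are made up for illustration.

```python
def check_procs(processes, warn=(1, 1), command=None, arg_substring=None):
    """Count matching processes and compare the count to a warning range.

    processes:     list of (command_name, argument_string) pairs
    warn:          inclusive (low, high) range, like `-w 1:1`
    command:       exact command name to match, like `-C`
    arg_substring: substring the arguments must contain, like `-a`
    """
    matches = 0
    for name, args in processes:
        if command is not None and name != command:
            continue
        if arg_substring is not None and arg_substring not in args:
            continue
        matches += 1
    low, high = warn
    status = "OK" if low <= matches <= high else "WARNING"
    return status, matches


if __name__ == "__main__":
    # Illustrative process tables, not captured from a real slave.
    double_start = [
        ("buildbot", "/tools/buildbot/bin/buildbot start /builds/slave"),
        ("buildbot", "/tools/buildbot/bin/buildbot start /builds/slave"),
    ]
    via_twistd = [("twistd", "twistd --no_save --python=buildbot.tac")]

    print(check_procs(double_start, command="buildbot"))
    print(check_procs(via_twistd, command="twistd", arg_substring="buildbot.tac"))
    print(check_procs(double_start, command="twistd", arg_substring="buildbot.tac"))
```

The last call shows why the nrpe.cfg change has to land together with the manifest change: a slave still running 'buildbot start' matches zero processes under the new check.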
Attachment #502089 - Flags: review?(bhearsum)
Comment on attachment 502089 [details] [diff] [review]
m619082-puppet-manifests.patch

Seems OK to me.
Attachment #502089 - Flags: review?(bhearsum) → review+
ba2999d082ff

Landed on all puppet masters with the changes in comment 7.  I've monitored a few slaves that have gotten configuration from the various masters, and all seem to have started up successfully.

Because the puppet-files changes were deployed separately from the puppet-manifests changes, there are machines running 'buildbot start' on which nagios is looking for 'twistd'.  This will settle down slowly.

19:49 < nagios> [55] try-linux-slave29.build:buildbot is WARNING: PROCS WARNING: 0 processes with command name twistd, args buildbot.tac

I'll continue to monitor things.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Aside from the pain with the puppet deployment, this has gone smoothly.
Product: mozilla.org → Release Engineering