From nthomas in bug 616003:

Might have a regression from this - nagios is reporting some linux-ix-slaveN with 2 buildbot processes running. E.g. linux-ix-slave19 has an (abridged) 'ps xf' of

  PID TTY STAT TIME COMMAND
 3170 ?   Ss   0:00 /tools/python/bin/python /usr/local/bin/runslave.py
 3198 ?   S    0:00  \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
 3199 ?   Z    0:00      \_ [buildbot] <defunct>
 3202 ?   Sl   0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
Ranking P2 until we see this affecting more slaves.
Priority: -- → P2
I can replicate this on talos-r3-leopard-001. It looks like launchd is running runslave.py twice at the same time, but puppet is only running once. I suspect that this has something to do with puppet trying to start/enable the service, followed by touching the puppet.finished file. This doesn't explain the linux-ix- problem, but it may be unrelated.
The leopard problem is separate, because it's launchd related. It's also not landed yet, so nothing to worry about.

On linux-ix-slave40, we see:

root    1552 0.0 0.0  4492  1312 ? Ss 08:27 0:00 /bin/bash /etc/rc.d/rc 3
root    3168 0.0 0.0  4488  1172 ? S  08:27 0:00  \_ /bin/bash /etc/rc3.d/S99buildbot start
root    3169 0.0 0.0  4792  1152 ? S  08:27 0:00      \_ su - cltbld sh -c /tools/python/bin/python /usr/local/bin/runslave.py
cltbld  3170 0.0 0.1 10240  5472 ? Ss 08:27 0:00          \_ /tools/python/bin/python /usr/local/bin/runslave.py
cltbld  3198 0.0 0.2 13760  8308 ? S  08:27 0:00              \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
cltbld  3199 0.0 0.0     0     0 ? Z  08:28 0:00                  \_ [buildbot] <defunct>
...
cltbld  3202 0.0 0.2 26736 10072 ? Sl 08:28 0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave

Looking at an strace, pid 3198 is hung waiting for bytes from fd 5, which is the read end of its signal socket. So it's waiting for SIGCHLD. pid 3202 is a fully-functioning buildslave instance.

I think that this is a bug in the odd way that 'buildbot start' forks and internally emulates twistd. It prints a loud warning about not seeing the child process exit, which is presumably what's happened here. What's odd is that, even if it doesn't see the child process exit, it should quit after 10 seconds.

So I believe this is a generic problem, rather than one caused by runslave.py being run twice or otherwise malfunctioning. I'll keep this ticket open to track it.
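For anyone trying to spot this state on other slaves, here is a rough sketch (not part of any patch on this bug) that scans an abridged 'ps xf' listing for lingering 'buildbot start' processes and defunct children. The function name summarize and the sample text are illustrative assumptions; the sample lines are the linux-ix-slave19 listing quoted from bug 616003.

```python
# Abridged 'ps xf' lines for linux-ix-slave19, as quoted from bug 616003.
PS_LINES = r"""
3170 ? Ss 0:00 /tools/python/bin/python /usr/local/bin/runslave.py
3198 ? S 0:00 \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
3199 ? Z 0:00 \_ [buildbot] <defunct>
3202 ? Sl 0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
"""


def summarize(ps_text):
    """Return (pids running 'buildbot start', pids of defunct processes).

    Hypothetical helper: assumes each line has the columns
    PID TTY STAT TIME COMMAND, with no header row.
    """
    starters, zombies = [], []
    for line in ps_text.strip().splitlines():
        # COMMAND may contain spaces, so split into at most five fields.
        pid, _tty, stat, _time, command = line.split(None, 4)
        if stat.startswith("Z"):
            zombies.append(int(pid))
        elif "buildbot start" in command:
            starters.append(int(pid))
    return starters, zombies


if __name__ == "__main__":
    starters, zombies = summarize(PS_LINES)
    print("buildbot start pids:", starters)
    print("defunct pids:", zombies)
```

Two 'buildbot start' pids plus a defunct child is the signature described above: one hung launcher and one working buildslave.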
Duplicate of this bug: 623828
This is still harmless, but is causing a lot of spurious nagios warnings, and is definitely a bug.
Priority: P2 → P1
Catlee and I agree that the fix is to start the buildslave using twistd directly, rather than via 'buildbot start'. This will make the process signature different, though, so that will need to change at the same time. The nagios check is in nrpe.cfg on the slaves, which is (one hopes!) deployed via puppet.
Created attachment 502089 [details] [diff] [review]
m619082-puppet-manifests.patch

This is running in staging and has been tested on

  talos-r3-fed-001
  linux-ix-slave01
  moz2-darwin10-slave03
  moz2-darwin9-slave68
  talos-r3-leopard-001

This will need to be landed together with the following changes to nrpe.cfg in the unversioned puppet files:

Old:
  N/production/darwin9-i386/build/usr/local/nagios/etc/nrpe.cfg:
    command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot
  N/production/darwin10-i386/build/usr/local/nagios/etc/nrpe.cfg:
    command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot
  N/production/centos5-i686/build/etc/nagios/nrpe.cfg:
    command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot
  N/production/centos5-x86_64/build/etc/nagios/nrpe.cfg:
    command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot

New:
  darwin*:
    /usr/local/nagios/plugins/check_procs -w 1:1 -C python --argument-array=buildbot.tac
  centos5*:
    /usr/lib/nagios/plugins/check_procs -w 1:1 -C twistd --argument-array=buildbot.tac

(the checks are different because twistd cannot rename the process on mac)
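To make the intent of the new checks concrete, here is a small sketch of the matching logic as I understand it: count processes whose command name matches and whose arguments contain buildbot.tac, and warn when the count falls outside 1:1. This is not the check_procs implementation, only an approximation; the function name check_buildbot and the sample twistd argument string are assumptions made up for illustration.

```python
def check_buildbot(processes, command_name, arg_substring, low=1, high=1):
    """Approximate 'check_procs -w low:high -C NAME --argument-array=ARGS'.

    processes is an iterable of (command_name, argument_string) pairs.
    Returns (status, count), where status is "OK" or "WARNING".
    """
    count = sum(
        1
        for name, args in processes
        if name == command_name and arg_substring in args
    )
    status = "OK" if low <= count <= high else "WARNING"
    return status, count


if __name__ == "__main__":
    # Hypothetical process table for a slave started via twistd directly;
    # the exact twistd arguments are an assumption, not taken from the patch.
    new_style = [("twistd", "twistd --python=buildbot.tac")]
    # A slave still running the old 'buildbot start' command line.
    old_style = [("buildbot", "/tools/buildbot/bin/buildbot start /builds/slave")]

    print(check_buildbot(new_style, "twistd", "buildbot.tac"))
    print(check_buildbot(old_style, "twistd", "buildbot.tac"))
```

The second case is what a slave looks like if it gets the new nrpe.cfg while still running 'buildbot start': zero matches, so the check warns until the slave restarts with the new command line.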
Attachment #502089 - Flags: review?(bhearsum)
Comment on attachment 502089 [details] [diff] [review]
m619082-puppet-manifests.patch

Seems OK to me.
Attachment #502089 - Flags: review?(bhearsum) → review+
ba2999d082ff

Landed on all puppet masters with the changes in comment 7. I've monitored a few slaves that have gotten configuration from the various masters, and all seem to have started up successfully.

Because the puppet-files changes were deployed separately from the puppet-manifests changes, there are machines running 'buildbot start' on which nagios is looking for 'twistd'. This will settle down slowly.

  19:49 < nagios> try-linux-slave29.build:buildbot is WARNING: PROCS WARNING: 0 processes with command name twistd, args buildbot.tac

I'll continue to monitor things.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → FIXED
Aside from the pain with the puppet deployment, this has gone smoothly.
Product: mozilla.org → Release Engineering