Closed Bug 619082 Opened 15 years ago Closed 15 years ago

slave alloc double-starting slavebuildslave processes?

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(1 file)

m619082-puppet-manifests.patch (15 years ago, Dustin J. Mitchell [:dustin] (he/him), 2.42 KB, patch; bhearsum: review+)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Description

15 years ago

From nthomas in bug 616003:

Might have a regression from this - nagios is reporting some linux-ix-slaveN with 2 buildbot processes running. eg linux-ix-slave19 has an (abridged) 'ps xf' of

  PID TTY STAT TIME COMMAND
  3170 ? Ss 0:00 /tools/python/bin/python /usr/local/bin/runslave.py
  3198 ? S 0:00  \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
  3199 ? Z 0:00      \_ [buildbot] <defunct>
  3202 ? Sl 0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 1

15 years ago

Ranking P2 until we see this affecting more slaves.

Priority: -- → P2

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 2

15 years ago

I can replicate this on talos-r3-leopard-001. It looks like launchd is running runslave.py twice at the same time, but puppet is only running once. I suspect that this has something to do with puppet trying to start/enable the service, followed by touching the puppet.finished file. This doesn't explain the linux-ix- problem, but it may be unrelated.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 3

15 years ago

The leopard problem is separate, because it's launchd related. It's also not landed yet, so nothing to worry about.

On linux-ix-slave40, we see:

  root 1552 0.0 0.0 4492 1312 ? Ss 08:27 0:00 /bin/bash /etc/rc.d/rc 3
  root 3168 0.0 0.0 4488 1172 ? S 08:27 0:00  \_ /bin/bash /etc/rc3.d/S99buildbot start
  root 3169 0.0 0.0 4792 1152 ? S 08:27 0:00      \_ su - cltbld sh -c /tools/python/bin/python /usr/local/bin/runslave.py
  cltbld 3170 0.0 0.1 10240 5472 ? Ss 08:27 0:00          \_ /tools/python/bin/python /usr/local/bin/runslave.py
  cltbld 3198 0.0 0.2 13760 8308 ? S 08:27 0:00              \_ /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave
  cltbld 3199 0.0 0.0 0 0 ? Z 08:28 0:00                  \_ [buildbot] <defunct>
  ...
  cltbld 3202 0.0 0.2 26736 10072 ? Sl 08:28 0:00 /tools/buildbot-0.8.0/bin/python /tools/buildbot/bin/buildbot start /builds/slave

Looking at an strace, pid 3198 is hung waiting for bytes from fd 5, which is the read end of its signal socket. So it's waiting for SIGCHLD. pid 3202 is a fully-functioning buildslave instance.

I think that this is a bug in the odd way that 'buildbot start' forks and internally emulates twistd. It prints a loud warning about not seeing the child process exit, which is presumably what's happened here. What's odd is that, even if it doesn't see the child process exit, it should quit after 10 seconds.

So I believe this is a generic problem, rather than one caused by runslave.py being run twice or otherwise malfunctioning. I'll keep this ticket open to track it.
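For illustration only, here is a minimal sketch of the behaviour the comment above expects from a launcher: wait for the child to exit, but give up after a bounded time instead of blocking indefinitely on a signal. This is not buildbot's code; the function name `wait_for_child` and the polling approach are assumptions made for the example, and only the 10-second figure comes from the comment.

```python
import subprocess
import sys
import time


def wait_for_child(proc, timeout=10.0, poll_interval=0.1):
    """Return the child's exit code, or None if it is still running
    once `timeout` seconds have passed.

    Hypothetical helper: it polls instead of blocking on a signal pipe,
    so a missed SIGCHLD cannot hang the caller forever.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        code = proc.poll()  # non-blocking; reaps the child if it has exited
        if code is not None:
            return code
        time.sleep(poll_interval)
    return proc.poll()  # one last check at the deadline


def launch_and_wait(argv, timeout=10.0):
    """Start `argv` and report whether it exited within `timeout`."""
    proc = subprocess.Popen(argv)
    code = wait_for_child(proc, timeout=timeout)
    if code is None:
        print("child %d still running after %.1fs; giving up" % (proc.pid, timeout),
              file=sys.stderr)
    return proc, code
```

With a bounded wait like this, a launcher in the state shown for pid 3198 would exit on its own rather than lingering as a second process for nagios to count.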

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 5

15 years ago

This is still harmless, but it is causing a lot of spurious nagios warnings, and is definitely a bug.

Priority: P2 → P1

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 6

15 years ago

Catlee and I agree that the fix is to start the buildslave using twistd directly, rather than via 'buildbot start'. This will make the process signature different, though, so that will need to change at the same time. The nagios check is in nrpe.cfg on the slaves, which is (one hopes!) deployed via puppet.
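As a rough sketch of what "start the buildslave using twistd directly" could look like from a Python launcher, the helper below builds a twistd command line. The actual command used in the patch is not shown in this bug, so everything here is an assumption: the function name `twistd_command`, the bare `twistd` executable name, and the `twistd.pid` / `twistd.log` file names are illustrative. The options themselves (`--no_save`, `--rundir`, `--pidfile`, `--logfile`, `--python`) are standard twistd options.

```python
import posixpath


def twistd_command(basedir, twistd="twistd"):
    """Build an argv list that runs the buildbot.tac in `basedir` under twistd.

    Hypothetical helper; paths and file names are assumptions, not taken
    from the patch attached to this bug.
    """
    return [
        twistd,
        "--no_save",                                          # do not write a shutdown .tap file
        "--rundir=" + basedir,                                # chdir here before running
        "--pidfile=" + posixpath.join(basedir, "twistd.pid"),
        "--logfile=" + posixpath.join(basedir, "twistd.log"),
        "--python=" + posixpath.join(basedir, "buildbot.tac"),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to execute.
    print(" ".join(twistd_command("/builds/slave")))
```

The resulting command line contains `buildbot.tac` rather than `buildbot start`, which is why the process signature, and therefore the nagios check, has to change at the same time.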

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 7

15 years ago

Attached patch m619082-puppet-manifests.patch

This is running in staging and has been tested on

  talos-r3-fed-001
  linux-ix-slave01
  moz2-darwin10-slave03
  moz2-darwin9-slave68
  talos-r3-leopard-001

This will need to be landed together with the following changes to nrpe.cfg in the unversioned puppet files:

Old:

  N/production/darwin9-i386/build/usr/local/nagios/etc/nrpe.cfg:
    command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot
  N/production/darwin10-i386/build/usr/local/nagios/etc/nrpe.cfg:
    command[check_buildbot]=/usr/local/nagios/plugins/check_procs -w 1:1 -a /tools/buildbot/bin/buildbot
  N/production/centos5-i686/build/etc/nagios/nrpe.cfg:
    command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot
  N/production/centos5-x86_64/build/etc/nagios/nrpe.cfg:
    command[check_buildbot]=/usr/lib/nagios/plugins/check_procs -w 1:1 -C buildbot

New:

  darwin*: /usr/local/nagios/plugins/check_procs -w 1:1 -C python --argument-array=buildbot.tac
  centos5*: /usr/lib/nagios/plugins/check_procs -w 1:1 -C twistd --argument-array=buildbot.tac

(the checks are different because twistd cannot rename the process on mac)
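To make the effect of the check change concrete, here is a small Python model of the matching that these check_procs invocations perform: filter by command name (`-C`), filter by a substring of the arguments (`-a` / `--argument-array`), then warn if the count falls outside the `-w 1:1` range. It is a simplified model, not the plugin's implementation, and the process lists in the usage note are made-up samples rather than output from a real slave.

```python
def count_matching(processes, command=None, argument=None):
    """Count (command_name, args) pairs that pass both filters.

    command  -- exact match on the command name, like check_procs -C
    argument -- substring match on the argument string, like -a
    """
    count = 0
    for name, args in processes:
        if command is not None and name != command:
            continue
        if argument is not None and argument not in args:
            continue
        count += 1
    return count


def warn_status(count, low=1, high=1):
    """Model of '-w low:high': warn when the count is outside the range."""
    return "OK" if low <= count <= high else "WARNING"


def check_buildbot(processes, command, argument):
    """Return (status, count) for one modelled check."""
    count = count_matching(processes, command=command, argument=argument)
    return warn_status(count), count
```

Under this model, a Linux slave with one process named `twistd` whose arguments include `buildbot.tac` passes the new centos5 check, while a slave still running `buildbot start` yields a count of 0 against it and therefore a warning until its configuration catches up.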

Attachment #502089 - Flags: review?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Comment 8

15 years ago

Comment on attachment 502089: m619082-puppet-manifests.patch

Seems OK to me.

Attachment #502089 - Flags: review?(bhearsum) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 9

15 years ago

ba2999d082ff

Landed on all puppet masters with the changes in comment 7. I've monitored a few slaves that have gotten configuration from the various masters, and all seem to have started up successfully.

Because the puppet-files changes were deployed separately from the puppet-manifests changes, there are machines running 'buildbot start' on which nagios is looking for 'twistd'. This will settle down slowly.

  19:49 < nagios> [55] try-linux-slave29.build:buildbot is WARNING: PROCS WARNING: 0 processes with command name twistd, args buildbot.tac

I'll continue to monitor things.

Status: NEW → RESOLVED

Closed: 15 years ago

Resolution: --- → FIXED

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 10

15 years ago

Aside from the pain with the puppet deployment, this has gone smoothly.

Nobody; OK to take it and work on it

Updated

12 years ago

Product: mozilla.org → Release Engineering

Categories

(Release Engineering :: General, defect, P1)
