Closed Bug 1038006 Opened 11 years ago Closed 11 years ago

command_runner service doesn't always restart properly

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

x86
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: massimo)

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] )

Attachments

(3 files, 4 obsolete files)

I deployed bug 8453895 and a half dozen of the masters ended up with: Sun 16:43:25 PDT [4547] buildbot-master111.srv.releng.scl3.mozilla.com:procs - command_runner is CRITICAL: PROCS CRITICAL: 2 processes with regex args command_runner.py From the puppet log: Jul 13 15:59:03 buildbot-master111 puppet-agent[32267]: Enabling Puppet. Jul 13 16:00:03 buildbot-master111 puppet-agent[32275]: (/Stage[main]/Buildmaster::Queue/File[/builds/buildbot/queue/run_command_runner.sh]/content) content changed '{md5}baa8a83a87fcbfa8e0b735460b164dc8' to '{md5}a8176cac5386d5fae4325f3051da83c4' Jul 13 16:00:04 buildbot-master111 puppet-agent[32275]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]) Triggered 'refresh' from 1 events Jul 13 16:00:06 buildbot-master111 puppet-agent[32275]: Finished catalog run in 43.10 seconds Jul 13 16:29:01 buildbot-master111 puppet-agent[1976]: Enabling Puppet. Jul 13 16:29:35 buildbot-master111 puppet-agent[1982]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running' Jul 13 16:29:36 buildbot-master111 puppet-agent[1982]: Finished catalog run in 21.41 seconds And sure enough there are command_runner processes started at 1600 and 1629. The /builds/buildbot/queue/command_runner.pid file corresponds to the later one. I'm guessing the the service script isn't handling that file properly, so that a later do_status finds that the process no longer exists.
.. so puppet starts up another one.
Summary: command_runner service doesn't restart properly → command_runner service doesn't always restart properly
I wonder if this is related to bug 1037860, where slaveapi didn't restart correctly?
Assignee: relops → mgervasini
(Assigned to :mgerva since hg log suggests he was the original author)
Hi Dustin, In command_runner, we are writing the exit status into a variable called "$ret" and then we return "$?". This might be the cause of the problem, in any case "$ret" should not be there.
Attachment #8455674 - Flags: review?(dustin)
Comment on attachment 8455674 [details] [diff] [review] [puppet] Bug 1038006 - command_runner service doesn't always restart properly.patch Indeed, variable assignment resets $?: (sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ false (sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ x=10 (sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ echo $? 0
Attachment #8455674 - Flags: review?(dustin) → review+
Attachment #8455674 - Flags: checked-in+
Dustin did this patch fix the problem?
I don't know -- probably the best way to test would be to restart command_runner on a master.
Triggering this problem wasn't 100% reliable when I filed it, ie all the masters had command_runner restarted but only six of them ended up alerting in nagios. So several tests would be a good idea, using the same call as puppet would make.
I am not sure this is fixed: [root@buildbot-master115.srv.releng.usw2.mozilla.com ~]# service command_runner restart ... seemed to restart /etc/rc.d/init.d/command_runner # exists /builds/buildbot/queue/command_runner.pid # did not initially I then saw in /var/log/messages: Jul 24 16:13:05 buildbot-master115 puppet-agent[9370]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running' /builds/buildbot/queue/command_runner.pid # then existed I am not sure if it just took a while but it seems like this was similar behavior to comment 0
I think I have found the problem: command_runner and pulse_publisher use kill -0 $PID to terminate the process but "kill -0" is not killing the process so trap "rm -f command_runner.pid" EXIT, never gets called I have wrote a simple script to test this: #!/bin/bash MY_PID=$$ MY_PID_FILE="$0.pid" echo "$MY_PID" > "$MY_PID_FILE" trap "rm -f $MY_PID_FILE; echo 'trapping stuff'" EXIT while true; do echo "." sleep 300 done test 1: kill $PID $ bash do_nothing.sh & [1] 17874 . $ ps -ef | grep [d]o_nothing cltbld 17874 15168 0 08:52 pts/0 00:00:00 bash do_nothing.sh (process is running) $ kill 17874 trapping stuff [1]+ Terminated bash do_nothing.sh $ this works as expected, pid file gets removed. test 2: kill -0 $PID bash do_nothing.sh & [1] 18031 . ps -ef | grep [d]o_nothing cltbld 18031 15168 0 09:00 pts/0 00:00:00 bash do_nothing.sh (process is running) $ kill -0 18031 $ ps -ef | grep [d]o_nothing cltbld 18031 15168 0 09:00 pts/0 00:00:00 bash do_nothing.sh (process is still running) nothing happened, the process is still alive and the pid file is still there. This patch removes the -0 option from kill and "ret=$?" from buildmaster/templates/pulse_publisher.initd.erb too (as we did for buildmaster/templates/command_runner.initd.erb)
Attachment #8455674 - Attachment is obsolete: true
Attachment #8475994 - Flags: review?(dustin)
Comment on attachment 8475994 [details] [diff] [review] [puppet] Bug 1038006 - command_runner service doesn't always restart properly - kill not trapped.patch http://stackoverflow.com/questions/11012527/what-does-kill-0-pid-in-a-shell-script-do kill -0 is a way to test if a process is running, which is what it's being used for here. I don't think the 'status' command should go around actually terminating processes!
Attachment #8475994 - Flags: review?(dustin) → review-
You're right, Dustin. This patch follows the same pattern as modules/mozpool/templates/dbcron.sh.erb and traps the following signals: SIGHUP SIGINT SIGTERM EXIT
Attachment #8475994 - Attachment is obsolete: true
Attachment #8487075 - Flags: review?(dustin)
Attachment #8487075 - Flags: review?(dustin) → review+
Attachment #8487075 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Experienced issues today in #buildduty: /builds/buildbot/queue/pulse_publisher.pid was out of sync with the cltbld process after nick manually ran: service pulse_publisher restart This caused puppet to think the service was not running, so started a second instance, e.g.: Thu 00:32:42 PDT [4158] buildbot-master01.srv.releng.use1.mozilla.com:procs - pulse_publisher is CRITICAL: PROCS CRITICAL: 2 processes with regex args pulse_publisher.py (http://m.mozilla.org/procs+-+pulse_publisher) It looks like, therefore, that service pulse_publisher restart is still not working correctly.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
"service command_runner start" does not check if the process is already running (same for pulse_publisher)
Attachment #8498744 - Flags: review?(pmoore)
Comment on attachment 8498744 [details] [diff] [review] [puppet] Bug 1038006 - command_runner and pulse_publisher check service is not running before starting it.patch Some questions! =) 1) What testing have you done? For example, have you tried stopping/starting/restarting a few times on a machine with these changes? 2) I don't see anywhere where the pid file is deleted - where does this happen? 3) I think we should also handle stale pid files - i.e., if we find a pid file, we should check if that process exists in the process table, and that it is indeed the service - if and only if all these conditions are met, should it fail and exit - if the pid is there but is either a non-existent process, or the process exists, but is something else (e.g. the same process id has been assigned to a different process) then I think it is fair to assume it was a stale pid, report on this, and then continue with the start up sequence.
Hi Pete, following your questions, I have updated the start up scripts. We are already sourcing /etc/init.d/functions, this library provides some useful commands to start/stop/restart a long running process. Some of the scripts in /etc/init.d/ (puppet, npre) use the daemon provided by functions. This patch extends the same mechanism for command_runner and pulse_publisher. If we want to use this mechanism, we can delete (requires a separate patch): * modules/buildmaster/templates/run_command_runner.sh.erb * modules/buildmaster/templates/run_pulse_publisher.sh.erb
Attachment #8498744 - Attachment is obsolete: true
Attachment #8498744 - Flags: review?(pmoore)
Attachment #8500515 - Flags: review?(pmoore)
Awesome! I'll try to look at this first thing in the morning. Pete
Comment on attachment 8500515 [details] [diff] [review] [puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner.patch Review of attachment 8500515 [details] [diff] [review]: ----------------------------------------------------------------- Hi Massimo, Thanks for the new patch. What testing have you done? For example, have you tried stopping/starting/restarting a few times on a machine with these changes? We can step through this together if you like on vidyo. Thanks, Pete ::: modules/buildmaster/templates/command_runner.initd.erb @@ +3,5 @@ > # License, v. 2.0. If a copy of the MPL was not distributed with this > # file, You can obtain one at http://mozilla.org/MPL/2.0/. > +# command_runner Start up the command_runner daemon > +# chkconfig: 2345 50 08 > +# description: command_runner I think this change (lines 5,6 -> lines 5,6,7) was added by mistake - can you revert? (e.g. lines 6 and 9 are now the same) @@ +24,5 @@ > > USER=<%=scope.lookupvar('users::builder::username')%> > +BASE_DIR=<%=scope.lookupvar('buildmaster::settings::queue_dir')%> > +PYTHON="$BASE_DIR/bin/python" > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py" Previously we were calling run_command_runner.sh. It looks like now we are calling command_runner.py - therefore I think we need to delete run_command_runner.sh.erb in this patch. @@ +27,5 @@ > +PYTHON="$BASE_DIR/bin/python" > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py" > +LOGFILE="$BASE_DIR/command_runner.log" > +LOCKFILE="/var/lock/subsys/command_runner" > +PIDFILE="$BASE_DIR/command_runner.pid" Are you sure we need a lockfile and a pid file? @@ +39,5 @@ > start_agent() { > + echo -n "Starting $DESC" > + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &" > + RETVAL=$? > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) can this grep not also return the grep command itself? can you use pidofproc function from /etc/init.d/functions? @@ +43,5 @@ > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) > + if [ -n "$PID" ]; then > + echo $PID > "$PIDFILE" > + touch "$LOCKFILE" > + fi Is this not already handled by daemon call in /etc/init.d/functions? @@ +87,5 @@ > > do_status () { > + status -p $PIDFILE > + RETVAL=$? > + echo is this echo intentional?
Attachment #8500515 - Flags: review?(pmoore) → review-
what about: echo $! > "${PIDFILE}" instead of: PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) if [ -n "$PID" ]; then echo $PID > "$PIDFILE" touch "$LOCKFILE" fi ?
Thanks Pete, > ::: modules/buildmaster/templates/command_runner.initd.erb > @@ +3,5 @@ > > # License, v. 2.0. If a copy of the MPL was not distributed with this > > # file, You can obtain one at http://mozilla.org/MPL/2.0/. > > +# command_runner Start up the command_runner daemon > > +# chkconfig: 2345 50 08 > > +# description: command_runner > > I think this change (lines 5,6 -> lines 5,6,7) was added by mistake - can > you revert? (e.g. lines 6 and 9 are now the same) ops... thanks! > > @@ +24,5 @@ > > > > USER=<%=scope.lookupvar('users::builder::username')%> > > +BASE_DIR=<%=scope.lookupvar('buildmaster::settings::queue_dir')%> > > +PYTHON="$BASE_DIR/bin/python" > > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py" > > Previously we were calling run_command_runner.sh. > > It looks like now we are calling command_runner.py - therefore I think we > need to delete run_command_runner.sh.erb in this patch. This was done on purpose. I think we need the startup files until all the command runner/pulse publisher are restarted with the new command. I am not sure what does puppet when the init.d/ script changes. If it stops the service and then replaces the script, we can remove the old files with a single patch, if not, it's probably better to remove them in a second moment, after all the command_runner/pulse_publisher services have been restarted. > @@ +27,5 @@ > > +PYTHON="$BASE_DIR/bin/python" > > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py" > > +LOGFILE="$BASE_DIR/command_runner.log" > > +LOCKFILE="/var/lock/subsys/command_runner" > > +PIDFILE="$BASE_DIR/command_runner.pid" > > Are you sure we need a lockfile and a pid file? following the Centos sysvinitfile skeleton: (/usr/share/doc/initscripts-9.03.27/sysvinitfiles): start() { echo -n "Starting <servicename>: " <start daemons, perhaps with the daemon function> touch /var/lock/subsys/<servicename> return <return code of starting daemon> } the pid file is the mechanism used by daemon() to avoid multiple processes of the same service. The lock file is used by "service", on shutdown to ensure all the running services are closed properly. Anyway I think we should move the pid file in standard location as /var/run/command_runner.pid > @@ +39,5 @@ > > start_agent() { > > + echo -n "Starting $DESC" > > + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &" > > + RETVAL=$? > > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) > > can this grep not also return the grep command itself? can you use pidofproc > function from /etc/init.d/functions? I've done some tests, if does not work as expected: running pidofproc -p $PIDFILE $SCRIPT, it returns nothing. > @@ +43,5 @@ > > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) > > + if [ -n "$PID" ]; then > > + echo $PID > "$PIDFILE" > > + touch "$LOCKFILE" > > + fi > > Is this not already handled by daemon call in /etc/init.d/functions? I think the lock file is not managed by daemon(). I have checked on some slaves, there are no lock.*file references in daemon() (CentOS release 6.2) > @@ +87,5 @@ > > > > do_status () { > > + status -p $PIDFILE > > + RETVAL=$? > > + echo > > is this echo intentional? yes, without this, service command_runner status, prints ]] on the prompt. Some utilities in /etc/init.d/function use special chars to print a colorful output on the command line, this ugly echo saves us from an even uglier prompt. Thanks again, Pete, I'll post an updated patch in the next days.
(In reply to Pete Moore [:pete][:pmoore] from comment #19) > what about: > > echo $! > "${PIDFILE}" > > instead of: > > PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) > if [ -n "$PID" ]; then > echo $PID > "$PIDFILE" > touch "$LOCKFILE" > fi > > ? It would be nice, but I have done some tests and PID="$!" is just and empty string: ... daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &" PID=$! PID2=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) PID3=$(pidofproc -p $PIDFILE $SCRIPT) echo "PID: $PID; PID2: $PID2; PID3: $PID3" > /root/COMMAND_RUNNER ... cat /root/COMMAND_RUNNER PID: ; PID2: 4165; PID3: I am open on any alternative idea on how to get the PID of the running process
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/319]
Needinfo'ing myself so I don't forget to look at this later
Flags: needinfo?(pmoore)
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/319] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] [kanban:engops:https://kanbanize.com/ctrl_board/6/319]
(In reply to Massimo Gervasini [:mgerva] from comment #20) > Thanks again, Pete, I'll post an updated patch in the next days. Hi Massimo, I went to look at this again, but I think from your last comment you were going to post another patch to review. Is that right? Just want to make sure we are not deadlocked, but waiting on each other! =) Pete
Flags: needinfo?(pmoore) → needinfo?(mgervasini)
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] [kanban:engops:https://kanbanize.com/ctrl_board/6/319] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716]
Hey Pete, here's the new version of the start up/stop script. This version is not very different from the previous patch: * removed the extra: # chkconfig: 2345 50 08 lines * the scripts have been simplified replacing *_agent, do_* with a single function;
Attachment #8500515 - Attachment is obsolete: true
Flags: needinfo?(mgervasini)
Attachment #8519804 - Flags: review?(pmoore)
Comment on attachment 8519804 [details] [diff] [review] [puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner II.patch Review of attachment 8519804 [details] [diff] [review]: ----------------------------------------------------------------- Thanks Massimo! ::: modules/buildmaster/templates/command_runner.initd.erb @@ +37,5 @@ > +start() { > + echo -n "Starting $DESC" > + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &" > + RETVAL=$? > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) grep -F ::: modules/buildmaster/templates/pulse_publisher.initd.erb @@ +37,5 @@ > +start() { > + echo -n "Starting $DESC" > + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &" > + RETVAL=$? > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'}) grep -F
Attachment #8519804 - Flags: review?(pmoore) → review+
Comment on attachment 8519804 [details] [diff] [review] [puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner II.patch Updated as requested in comment #25 and landed.
Attachment #8519804 - Flags: checked-in+
We have successfully restarted command_runner and pulse_publisher on all the masters. Now it's safe to remove the obsolete files: run_pulse_publisher.sh.erb and run_command_runner.sh.erb
Attachment #8524568 - Flags: review?(pmoore)
Comment on attachment 8524568 [details] [diff] [review] [puppet] Bug 1038006 - remove obsolete files.patch Review of attachment 8524568 [details] [diff] [review]: ----------------------------------------------------------------- Nice work - thanks for taking the time to do a clean up! r+ with my changes below. ::: modules/buildmaster/manifests/queue.pp @@ -23,5 @@ > - "${buildmaster::settings::queue_dir}/run_command_runner.sh": > - require => Python::Virtualenv["${buildmaster::settings::queue_dir}"], > - content => template("buildmaster/run_command_runner.sh.erb"), > - notify => Service["command_runner"], > - mode => 755; Instead of deleting this file entry, it might be better to keep it, but change the content to: ensure => absent; See for example: http://hg.mozilla.org/build/puppet/file/856b2eb87907/modules/vnc/manifests/init.pp#l69 This will mean that puppet will make sure that the file gets deleted, if it is already on the machine. Otherwise machines where it already exists will keep it. @@ -28,5 @@ > - "${buildmaster::settings::queue_dir}/run_pulse_publisher.sh": > - require => Python::Virtualenv["${buildmaster::settings::queue_dir}"], > - content => template("buildmaster/run_pulse_publisher.sh.erb"), > - notify => Service["pulse_publisher"], > - mode => 755; same as above @@ -43,4 @@ > hasstatus => true, > require => [ > File["/etc/init.d/command_runner"], > - File["${buildmaster::settings::queue_dir}/run_command_runner.sh"], I'm guessing we don't need to do anything special here to make sure file is absent, since it should be handled by above change - but you might want to double check! :) @@ -52,4 @@ > hasstatus => true, > require => [ > File["/etc/init.d/pulse_publisher"], > - File["${buildmaster::settings::queue_dir}/run_pulse_publisher.sh"], Same here.
Attachment #8524568 - Flags: review?(pmoore) → review+
Comment on attachment 8524568 [details] [diff] [review] [puppet] Bug 1038006 - remove obsolete files.patch Thanks Pete! Landed with few changes: >- "${buildmaster::settings::queue_dir}/run_command_runner.sh": >- require => Python::Virtualenv["${buildmaster::settings::queue_dir}"], >- content => template("buildmaster/run_command_runner.sh.erb"), >- notify => Service["command_runner"], >- mode => 755; >- "${buildmaster::settings::queue_dir}/run_pulse_publisher.sh": >- require => Python::Virtualenv["${buildmaster::settings::queue_dir}"], >- content => template("buildmaster/run_pulse_publisher.sh.erb"), >- notify => Service["pulse_publisher"], >- mode => 755; using ensure => absent; (comment #28) (r+ via irc)
Attachment #8524568 - Flags: checked-in+
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: