Closed
Bug 1038006
Opened 11 years ago
Closed 11 years ago
command_runner service doesn't always restart properly
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: massimo)
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] )
Attachments
(3 files, 4 obsolete files)
2.05 KB,
patch
|
dustin
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
6.20 KB,
patch
|
pmoore
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
3.45 KB,
patch
|
pmoore
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
I deployed bug 8453895 and a half dozen of the masters ended up with:
Sun 16:43:25 PDT [4547] buildbot-master111.srv.releng.scl3.mozilla.com:procs - command_runner is CRITICAL: PROCS CRITICAL: 2 processes with regex args command_runner.py
From the puppet log:
Jul 13 15:59:03 buildbot-master111 puppet-agent[32267]: Enabling Puppet.
Jul 13 16:00:03 buildbot-master111 puppet-agent[32275]: (/Stage[main]/Buildmaster::Queue/File[/builds/buildbot/queue/run_command_runner.sh]/content) content changed '{md5}baa8a83a87fcbfa8e0b735460b164dc8' to '{md5}a8176cac5386d5fae4325f3051da83c4'
Jul 13 16:00:04 buildbot-master111 puppet-agent[32275]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]) Triggered 'refresh' from 1 events
Jul 13 16:00:06 buildbot-master111 puppet-agent[32275]: Finished catalog run in 43.10 seconds
Jul 13 16:29:01 buildbot-master111 puppet-agent[1976]: Enabling Puppet.
Jul 13 16:29:35 buildbot-master111 puppet-agent[1982]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running'
Jul 13 16:29:36 buildbot-master111 puppet-agent[1982]: Finished catalog run in 21.41 seconds
And sure enough there are command_runner processes started at 1600 and 1629. The /builds/buildbot/queue/command_runner.pid file corresponds to the later one. I'm guessing the the service script isn't handling that file properly, so that a later do_status finds that the process no longer exists.
Reporter | ||
Comment 1•11 years ago
|
||
.. so puppet starts up another one.
Reporter | ||
Updated•11 years ago
|
Summary: command_runner service doesn't restart properly → command_runner service doesn't always restart properly
Comment 2•11 years ago
|
||
I wonder if this is related to bug 1037860, where slaveapi didn't restart correctly?
Assignee: relops → mgervasini
Comment 3•11 years ago
|
||
(Assigned to :mgerva since hg log suggests he was the original author)
Assignee | ||
Comment 4•11 years ago
|
||
Hi Dustin,
In command_runner, we are writing the exit status into a variable called "$ret" and then we return "$?". This might be the cause of the problem, in any case "$ret" should not be there.
Attachment #8455674 -
Flags: review?(dustin)
Comment 5•11 years ago
|
||
Comment on attachment 8455674 [details] [diff] [review]
[puppet] Bug 1038006 - command_runner service doesn't always restart properly.patch
Indeed, variable assignment resets $?:
(sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ false
(sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ x=10
(sandbox)dustin@euclid ~/code/relengapi/t/relengapi [master] $ echo $?
0
Attachment #8455674 -
Flags: review?(dustin) → review+
Assignee | ||
Updated•11 years ago
|
Attachment #8455674 -
Flags: checked-in+
Assignee | ||
Comment 6•11 years ago
|
||
Dustin did this patch fix the problem?
Comment 7•11 years ago
|
||
I don't know -- probably the best way to test would be to restart command_runner on a master.
Reporter | ||
Comment 8•11 years ago
|
||
Triggering this problem wasn't 100% reliable when I filed it, ie all the masters had command_runner restarted but only six of them ended up alerting in nagios. So several tests would be a good idea, using the same call as puppet would make.
Comment 9•11 years ago
|
||
I am not sure this is fixed:
[root@buildbot-master115.srv.releng.usw2.mozilla.com ~]# service command_runner restart
... seemed to restart
/etc/rc.d/init.d/command_runner # exists
/builds/buildbot/queue/command_runner.pid # did not initially
I then saw in /var/log/messages:
Jul 24 16:13:05 buildbot-master115 puppet-agent[9370]: (/Stage[main]/Buildmaster::Queue/Service[command_runner]/ensure) ensure changed 'stopped' to 'running'
/builds/buildbot/queue/command_runner.pid # then existed
I am not sure if it just took a while but it seems like this was similar behavior to comment 0
Assignee | ||
Comment 10•11 years ago
|
||
I think I have found the problem:
command_runner and pulse_publisher use kill -0 $PID to terminate the process but "kill -0" is not killing the process so trap "rm -f command_runner.pid" EXIT, never gets called
I have wrote a simple script to test this:
#!/bin/bash
MY_PID=$$
MY_PID_FILE="$0.pid"
echo "$MY_PID" > "$MY_PID_FILE"
trap "rm -f $MY_PID_FILE; echo 'trapping stuff'" EXIT
while true; do
echo "."
sleep 300
done
test 1: kill $PID
$ bash do_nothing.sh &
[1] 17874
.
$ ps -ef | grep [d]o_nothing
cltbld 17874 15168 0 08:52 pts/0 00:00:00 bash do_nothing.sh (process is running)
$ kill 17874
trapping stuff
[1]+ Terminated bash do_nothing.sh
$
this works as expected, pid file gets removed.
test 2: kill -0 $PID
bash do_nothing.sh &
[1] 18031
.
ps -ef | grep [d]o_nothing
cltbld 18031 15168 0 09:00 pts/0 00:00:00 bash do_nothing.sh (process is running)
$ kill -0 18031
$ ps -ef | grep [d]o_nothing
cltbld 18031 15168 0 09:00 pts/0 00:00:00 bash do_nothing.sh (process is still running)
nothing happened, the process is still alive and the pid file is still there.
This patch removes the -0 option from kill and "ret=$?" from buildmaster/templates/pulse_publisher.initd.erb too (as we did for buildmaster/templates/command_runner.initd.erb)
Attachment #8455674 -
Attachment is obsolete: true
Attachment #8475994 -
Flags: review?(dustin)
Comment 11•11 years ago
|
||
Comment on attachment 8475994 [details] [diff] [review]
[puppet] Bug 1038006 - command_runner service doesn't always restart properly - kill not trapped.patch
http://stackoverflow.com/questions/11012527/what-does-kill-0-pid-in-a-shell-script-do
kill -0 is a way to test if a process is running, which is what it's being used for here. I don't think the 'status' command should go around actually terminating processes!
Attachment #8475994 -
Flags: review?(dustin) → review-
Assignee | ||
Comment 12•11 years ago
|
||
You're right, Dustin.
This patch follows the same pattern as modules/mozpool/templates/dbcron.sh.erb
and traps the following signals: SIGHUP SIGINT SIGTERM EXIT
Attachment #8475994 -
Attachment is obsolete: true
Attachment #8487075 -
Flags: review?(dustin)
Updated•11 years ago
|
Attachment #8487075 -
Flags: review?(dustin) → review+
Assignee | ||
Updated•11 years ago
|
Attachment #8487075 -
Flags: checked-in+
Assignee | ||
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 13•11 years ago
|
||
Experienced issues today in #buildduty:
/builds/buildbot/queue/pulse_publisher.pid was out of sync with the cltbld process after nick manually ran:
service pulse_publisher restart
This caused puppet to think the service was not running, so started a second instance, e.g.:
Thu 00:32:42 PDT [4158] buildbot-master01.srv.releng.use1.mozilla.com:procs - pulse_publisher is CRITICAL: PROCS CRITICAL: 2 processes with regex args pulse_publisher.py (http://m.mozilla.org/procs+-+pulse_publisher)
It looks like, therefore, that service pulse_publisher restart is still not working correctly.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 14•11 years ago
|
||
"service command_runner start" does not check if the process is already running (same for pulse_publisher)
Attachment #8498744 -
Flags: review?(pmoore)
Comment 15•11 years ago
|
||
Comment on attachment 8498744 [details] [diff] [review]
[puppet] Bug 1038006 - command_runner and pulse_publisher check service is not running before starting it.patch
Some questions! =)
1) What testing have you done? For example, have you tried stopping/starting/restarting a few times on a machine with these changes?
2) I don't see anywhere where the pid file is deleted - where does this happen?
3) I think we should also handle stale pid files - i.e., if we find a pid file, we should check if that process exists in the process table, and that it is indeed the service - if and only if all these conditions are met, should it fail and exit - if the pid is there but is either a non-existent process, or the process exists, but is something else (e.g. the same process id has been assigned to a different process) then I think it is fair to assume it was a stale pid, report on this, and then continue with the start up sequence.
Assignee | ||
Comment 16•11 years ago
|
||
Hi Pete,
following your questions, I have updated the start up scripts.
We are already sourcing /etc/init.d/functions, this library provides some useful commands to start/stop/restart a long running process. Some of the scripts in /etc/init.d/ (puppet, npre) use the daemon provided by functions. This patch extends the same mechanism for command_runner and pulse_publisher.
If we want to use this mechanism, we can delete (requires a separate patch):
* modules/buildmaster/templates/run_command_runner.sh.erb
* modules/buildmaster/templates/run_pulse_publisher.sh.erb
Attachment #8498744 -
Attachment is obsolete: true
Attachment #8498744 -
Flags: review?(pmoore)
Attachment #8500515 -
Flags: review?(pmoore)
Comment 17•11 years ago
|
||
Awesome! I'll try to look at this first thing in the morning.
Pete
Comment 18•11 years ago
|
||
Comment on attachment 8500515 [details] [diff] [review]
[puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner.patch
Review of attachment 8500515 [details] [diff] [review]:
-----------------------------------------------------------------
Hi Massimo,
Thanks for the new patch. What testing have you done? For example, have you tried stopping/starting/restarting a few times on a machine with these changes?
We can step through this together if you like on vidyo.
Thanks,
Pete
::: modules/buildmaster/templates/command_runner.initd.erb
@@ +3,5 @@
> # License, v. 2.0. If a copy of the MPL was not distributed with this
> # file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +# command_runner Start up the command_runner daemon
> +# chkconfig: 2345 50 08
> +# description: command_runner
I think this change (lines 5,6 -> lines 5,6,7) was added by mistake - can you revert? (e.g. lines 6 and 9 are now the same)
@@ +24,5 @@
>
> USER=<%=scope.lookupvar('users::builder::username')%>
> +BASE_DIR=<%=scope.lookupvar('buildmaster::settings::queue_dir')%>
> +PYTHON="$BASE_DIR/bin/python"
> +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py"
Previously we were calling run_command_runner.sh.
It looks like now we are calling command_runner.py - therefore I think we need to delete run_command_runner.sh.erb in this patch.
@@ +27,5 @@
> +PYTHON="$BASE_DIR/bin/python"
> +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py"
> +LOGFILE="$BASE_DIR/command_runner.log"
> +LOCKFILE="/var/lock/subsys/command_runner"
> +PIDFILE="$BASE_DIR/command_runner.pid"
Are you sure we need a lockfile and a pid file?
@@ +39,5 @@
> start_agent() {
> + echo -n "Starting $DESC"
> + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &"
> + RETVAL=$?
> + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
can this grep not also return the grep command itself? can you use pidofproc function from /etc/init.d/functions?
@@ +43,5 @@
> + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
> + if [ -n "$PID" ]; then
> + echo $PID > "$PIDFILE"
> + touch "$LOCKFILE"
> + fi
Is this not already handled by daemon call in /etc/init.d/functions?
@@ +87,5 @@
>
> do_status () {
> + status -p $PIDFILE
> + RETVAL=$?
> + echo
is this echo intentional?
Attachment #8500515 -
Flags: review?(pmoore) → review-
Comment 19•11 years ago
|
||
what about:
echo $! > "${PIDFILE}"
instead of:
PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
if [ -n "$PID" ]; then
echo $PID > "$PIDFILE"
touch "$LOCKFILE"
fi
?
Assignee | ||
Comment 20•11 years ago
|
||
Thanks Pete,
> ::: modules/buildmaster/templates/command_runner.initd.erb
> @@ +3,5 @@
> > # License, v. 2.0. If a copy of the MPL was not distributed with this
> > # file, You can obtain one at http://mozilla.org/MPL/2.0/.
> > +# command_runner Start up the command_runner daemon
> > +# chkconfig: 2345 50 08
> > +# description: command_runner
>
> I think this change (lines 5,6 -> lines 5,6,7) was added by mistake - can
> you revert? (e.g. lines 6 and 9 are now the same)
ops... thanks!
>
> @@ +24,5 @@
> >
> > USER=<%=scope.lookupvar('users::builder::username')%>
> > +BASE_DIR=<%=scope.lookupvar('buildmaster::settings::queue_dir')%>
> > +PYTHON="$BASE_DIR/bin/python"
> > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py"
>
> Previously we were calling run_command_runner.sh.
>
> It looks like now we are calling command_runner.py - therefore I think we
> need to delete run_command_runner.sh.erb in this patch.
This was done on purpose. I think we need the startup files until all the command runner/pulse publisher are restarted with the new command.
I am not sure what does puppet when the init.d/ script changes. If it stops the service and then replaces the script, we can remove the old files with a single patch, if not, it's probably better to remove them in a second moment, after all the command_runner/pulse_publisher services have been restarted.
> @@ +27,5 @@
> > +PYTHON="$BASE_DIR/bin/python"
> > +SCRIPT="$BASE_DIR/tools/buildbot-helpers/command_runner.py"
> > +LOGFILE="$BASE_DIR/command_runner.log"
> > +LOCKFILE="/var/lock/subsys/command_runner"
> > +PIDFILE="$BASE_DIR/command_runner.pid"
>
> Are you sure we need a lockfile and a pid file?
following the Centos sysvinitfile skeleton: (/usr/share/doc/initscripts-9.03.27/sysvinitfiles):
start() {
echo -n "Starting <servicename>: "
<start daemons, perhaps with the daemon function>
touch /var/lock/subsys/<servicename>
return <return code of starting daemon>
}
the pid file is the mechanism used by daemon() to avoid multiple processes of the same service.
The lock file is used by "service", on shutdown to ensure all the running services are closed properly.
Anyway I think we should move the pid file in standard location as /var/run/command_runner.pid
> @@ +39,5 @@
> > start_agent() {
> > + echo -n "Starting $DESC"
> > + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &"
> > + RETVAL=$?
> > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
>
> can this grep not also return the grep command itself? can you use pidofproc
> function from /etc/init.d/functions?
I've done some tests, if does not work as expected: running pidofproc -p $PIDFILE $SCRIPT, it returns nothing.
> @@ +43,5 @@
> > + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
> > + if [ -n "$PID" ]; then
> > + echo $PID > "$PIDFILE"
> > + touch "$LOCKFILE"
> > + fi
>
> Is this not already handled by daemon call in /etc/init.d/functions?
I think the lock file is not managed by daemon().
I have checked on some slaves, there are no lock.*file references in daemon() (CentOS release 6.2)
> @@ +87,5 @@
> >
> > do_status () {
> > + status -p $PIDFILE
> > + RETVAL=$?
> > + echo
>
> is this echo intentional?
yes, without this, service command_runner status, prints ]] on the prompt.
Some utilities in /etc/init.d/function use special chars to print a colorful output on the command line, this ugly echo saves us from an even uglier prompt.
Thanks again, Pete, I'll post an updated patch in the next days.
Assignee | ||
Comment 21•11 years ago
|
||
(In reply to Pete Moore [:pete][:pmoore] from comment #19)
> what about:
>
> echo $! > "${PIDFILE}"
>
> instead of:
>
> PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
> if [ -n "$PID" ]; then
> echo $PID > "$PIDFILE"
> touch "$LOCKFILE"
> fi
>
> ?
It would be nice, but I have done some tests and PID="$!" is just and empty string:
...
daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &"
PID=$!
PID2=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
PID3=$(pidofproc -p $PIDFILE $SCRIPT)
echo "PID: $PID; PID2: $PID2; PID3: $PID3" > /root/COMMAND_RUNNER
...
cat /root/COMMAND_RUNNER
PID: ; PID2: 4165; PID3:
I am open on any alternative idea on how to get the PID of the running process
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/319]
Comment 22•11 years ago
|
||
Needinfo'ing myself so I don't forget to look at this later
Flags: needinfo?(pmoore)
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/319] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] [kanban:engops:https://kanbanize.com/ctrl_board/6/319]
Comment 23•11 years ago
|
||
(In reply to Massimo Gervasini [:mgerva] from comment #20)
> Thanks again, Pete, I'll post an updated patch in the next days.
Hi Massimo,
I went to look at this again, but I think from your last comment you were going to post another patch to review. Is that right? Just want to make sure we are not deadlocked, but waiting on each other! =)
Pete
Flags: needinfo?(pmoore) → needinfo?(mgervasini)
Updated•11 years ago
|
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716] [kanban:engops:https://kanbanize.com/ctrl_board/6/319] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/716]
Assignee | ||
Comment 24•11 years ago
|
||
Hey Pete,
here's the new version of the start up/stop script. This version is not very different from the previous patch:
* removed the extra: # chkconfig: 2345 50 08 lines
* the scripts have been simplified replacing *_agent, do_* with a single function;
Attachment #8500515 -
Attachment is obsolete: true
Flags: needinfo?(mgervasini)
Attachment #8519804 -
Flags: review?(pmoore)
Comment 25•11 years ago
|
||
Comment on attachment 8519804 [details] [diff] [review]
[puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner II.patch
Review of attachment 8519804 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks Massimo!
::: modules/buildmaster/templates/command_runner.initd.erb
@@ +37,5 @@
> +start() {
> + echo -n "Starting $DESC"
> + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &"
> + RETVAL=$?
> + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
grep -F
::: modules/buildmaster/templates/pulse_publisher.initd.erb
@@ +37,5 @@
> +start() {
> + echo -n "Starting $DESC"
> + daemon --user $USER --pidfile $PIDFILE "$PYTHON $SCRIPT $OPTIONS &"
> + RETVAL=$?
> + PID=$(ps -f -U "$USER" | grep "$SCRIPT" | awk {'print $2'})
grep -F
Attachment #8519804 -
Flags: review?(pmoore) → review+
Assignee | ||
Comment 26•11 years ago
|
||
Comment on attachment 8519804 [details] [diff] [review]
[puppet] Bug 1038006 - using dameon to manage pulse_publisher and command_runner II.patch
Updated as requested in comment #25 and landed.
Attachment #8519804 -
Flags: checked-in+
Assignee | ||
Comment 27•11 years ago
|
||
We have successfully restarted command_runner and pulse_publisher on all the masters.
Now it's safe to remove the obsolete files: run_pulse_publisher.sh.erb and run_command_runner.sh.erb
Attachment #8524568 -
Flags: review?(pmoore)
Comment 28•11 years ago
|
||
Comment on attachment 8524568 [details] [diff] [review]
[puppet] Bug 1038006 - remove obsolete files.patch
Review of attachment 8524568 [details] [diff] [review]:
-----------------------------------------------------------------
Nice work - thanks for taking the time to do a clean up!
r+ with my changes below.
::: modules/buildmaster/manifests/queue.pp
@@ -23,5 @@
> - "${buildmaster::settings::queue_dir}/run_command_runner.sh":
> - require => Python::Virtualenv["${buildmaster::settings::queue_dir}"],
> - content => template("buildmaster/run_command_runner.sh.erb"),
> - notify => Service["command_runner"],
> - mode => 755;
Instead of deleting this file entry, it might be better to keep it, but change the content to:
ensure => absent;
See for example:
http://hg.mozilla.org/build/puppet/file/856b2eb87907/modules/vnc/manifests/init.pp#l69
This will mean that puppet will make sure that the file gets deleted, if it is already on the machine. Otherwise machines where it already exists will keep it.
@@ -28,5 @@
> - "${buildmaster::settings::queue_dir}/run_pulse_publisher.sh":
> - require => Python::Virtualenv["${buildmaster::settings::queue_dir}"],
> - content => template("buildmaster/run_pulse_publisher.sh.erb"),
> - notify => Service["pulse_publisher"],
> - mode => 755;
same as above
@@ -43,4 @@
> hasstatus => true,
> require => [
> File["/etc/init.d/command_runner"],
> - File["${buildmaster::settings::queue_dir}/run_command_runner.sh"],
I'm guessing we don't need to do anything special here to make sure file is absent, since it should be handled by above change - but you might want to double check! :)
@@ -52,4 @@
> hasstatus => true,
> require => [
> File["/etc/init.d/pulse_publisher"],
> - File["${buildmaster::settings::queue_dir}/run_pulse_publisher.sh"],
Same here.
Attachment #8524568 -
Flags: review?(pmoore) → review+
Assignee | ||
Comment 29•11 years ago
|
||
Comment on attachment 8524568 [details] [diff] [review]
[puppet] Bug 1038006 - remove obsolete files.patch
Thanks Pete!
Landed with few changes:
>- "${buildmaster::settings::queue_dir}/run_command_runner.sh":
>- require => Python::Virtualenv["${buildmaster::settings::queue_dir}"],
>- content => template("buildmaster/run_command_runner.sh.erb"),
>- notify => Service["command_runner"],
>- mode => 755;
>- "${buildmaster::settings::queue_dir}/run_pulse_publisher.sh":
>- require => Python::Virtualenv["${buildmaster::settings::queue_dir}"],
>- content => template("buildmaster/run_pulse_publisher.sh.erb"),
>- notify => Service["pulse_publisher"],
>- mode => 755;
using ensure => absent; (comment #28) (r+ via irc)
Attachment #8524568 -
Flags: checked-in+
Assignee | ||
Updated•11 years ago
|
Status: REOPENED → RESOLVED
Closed: 11 years ago → 11 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•