Improvements to management of foopy builtbot slaves (watch_devices.sh)

RESOLVED FIXED

Status

RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: pmoore, Assigned: pmoore)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments, 5 obsolete attachments)

2.31 KB, patch
Callek
: review-
Details | Diff | Splinter Review
1.21 KB, patch
Callek
: review+
Details | Diff | Splinter Review
752 bytes, patch
Callek
: review+
Details | Diff | Splinter Review
1.77 KB, patch
Callek
: review+
Details | Diff | Splinter Review
6.23 KB, patch
Callek
: review+
Details | Diff | Splinter Review
3.01 KB, patch
Callek
: review+
Details | Diff | Splinter Review
(Assignee)

Description

5 years ago
(1.1) change watch_devices.sh to use flock instead of lockfile
(1.2) change watch_devices.sh to no longer use a persistent looping invocation, and instead run based on a cron
(1.3) make watch_devices.sh logs no longer infinitely growing, my thought is logrotate.d but open to other choices here.
(1.4) [other improvement ideas welcome]

Once catlee's change lands, things to do

(2) - Make [android] jobs *always* do a slave-initiated-shutdown for *every* job. Such that after the job is done, buildbot on the foopy will exit, rather than have watch_devices need to do it.
-- (a) this step will _not_ eliminate the code in watch_devices that does the exiting
-- (b) testing should find some way to ensure that it does indeed exit
-- (c) implementation is possible either in buildbot steps or in watch_devices

(3) should 2 prove fruitful, we can eliminate the error.flg-kills-buildbot code from watch_devices, and in-theory eliminate many of our persistent oranges (related to twistd) right here.

(4) an improvement after that is to reduce the amount of testing we do in verify.py during the build-step. and/or cat a log of the prior-to-buildbot-start verify.py during the build log, etc. steps here and after are less clear with regard to any change from the generic sut_ cleanup I had planned.
(Assignee)

Comment 1

5 years ago
1.1 done, just testing now
(Assignee)

Comment 2

5 years ago
Created attachment 748572 [details] [diff] [review]
Test patch to demonstrate proposed changes

Hey Callek,

I've provided a patch containing all the enhancements you suggested. Please note, there is some additional setup required (e.g. setup of crontab, and creation of new files/symbolic links under /builds). I provided a script in this patch for this so you can try it out (buildfarm/mobile/setup_logrotate.sh) but I would suggest for the final delivery, this is moved over to some puppet config.

Give it a bash, and tell me what you think.

Please I had several issues with flock whereby it would not release file handles, meaning that dead processes would retain locks, so I moved to a simpler locking mechanism using mkdir - this is an atomic operation for checking for the existance of a directory and creating it if it doesn't exist - since it is atomic, it is a safe locking mechanism to use across multiple processes. It also removes the dependency on the mail package for lockfile.

Thanks,
Pete
Attachment #748572 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 3

5 years ago
sed 's/Please I had/Please note I had/'
(Assignee)

Comment 4

5 years ago
Comment on attachment 748572 [details] [diff] [review]
Test patch to demonstrate proposed changes

As per vidyo discussion, splitting this now into individual patches for each part.
Attachment #748572 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 5

5 years ago
Created attachment 748914 [details] [diff] [review]
1.1) tools repo patch

This patch uses mkdir rather than flock for synchronising execution of watch_devices.sh. The reasons for choosing mkdir instead of flock are:

1) Flock was unreliable during the test phase of development. Processes that had been killed, that no longer existing in the process table, could still hold write locks on file descriptors, preventing new instances of watch_devices.sh from obtaining the lock.

2) The current implementation using "lockfile" requires the package procmail to be installed (a mail package). Using mkdir requires no additional packages.

3) Please note mkdir checks for the availability of the directory and creates it in a single, atomic step. This makes it a safe mechanism to use for synchronising the execution of parallel processes.



A problem with the current "lockfile" implementation
====================================================

The current "lockfile" implementation is not safe if the exit trap of watch_devices.sh does not get called for any reason (e.g. process crash, out-of-memory error, physical power loss, kernel panic, ....). In this situation, the lockfile remains in place indefinitely, blocking future runs watch_devices.sh. The blocking lockfile must first manually be discovered, and then manually removed to fix the problem. This can be fixed by adding -l 3600 as a switch to the lockfile command.

Please note this patch using mkdir also has contains this functionality.


I do not see any significant advantages in taking the patch, I simply provided it as an alternative implementation was requested in the ticket, but bearing in mind the -l 3600 option to fix the open issue with lockfile, and the fact procmail already exists on the foopies, there is a strong argument not to change the locking implementation.
Attachment #748572 - Attachment is obsolete: true
Attachment #748914 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 6

5 years ago
Created attachment 748919 [details] [diff] [review]
1.2) tools repo patch
Attachment #748919 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 7

5 years ago
Created attachment 748920 [details] [diff] [review]
1.2) puppet repo patch
(Assignee)

Comment 8

5 years ago
Created attachment 748921 [details] [diff] [review]
1.3) puppet repo patch
(Assignee)

Updated

5 years ago
Attachment #748921 - Attachment description: 1.3) tools puppet patch → 1.3) puppet repo patch
(Assignee)

Comment 9

5 years ago
Created attachment 748922 [details] [diff] [review]
1.4) tools repo patch
(Assignee)

Comment 10

5 years ago
So now all changes are separated into patches as requested.

I've named the patches in accordance with which repo needs updating, and the reference number from the description of this bug (1.1, 1.2, 1.3, 1.4).

1.1 change is detailed heavily in comment 5 above.
Changes 1.2 and 1.3 do not really require any further additional explanation.

1.4 is the collection of all changes I would propose to make to improve the code. These are the changes I made as I went along. In general these are simplifications, removing unused code, providing unique exit codes for different types of errors, etc. Feel free to ask any questions you may have about any of the changes.
(Assignee)

Updated

5 years ago
Attachment #748920 - Flags: review?(bugspam.Callek)
(Assignee)

Updated

5 years ago
Attachment #748921 - Flags: review?(bugspam.Callek)
(Assignee)

Updated

5 years ago
Attachment #748922 - Flags: review?(bugspam.Callek)
(Assignee)

Updated

5 years ago
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Status: NEW → RESOLVED
Resolution: --- → FIXED
???
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 748914 [details] [diff] [review]
1.1) tools repo patch

Review of attachment 748914 [details] [diff] [review]:
-----------------------------------------------------------------

> I do not see any significant advantages in taking the patch, I simply provided it as an alternative implementation
> was requested in the ticket, but bearing in mind the -l 3600 option to fix the open issue with lockfile,
> and the fact procmail already exists on the foopies, there is a strong argument not to change the locking implementation.

Based directly on these reasons, lets not bother with this part. We can revisit doing the -l option later.
Attachment #748914 - Flags: review?(bugspam.Callek) → review-

Updated

5 years ago
Attachment #748919 - Flags: review?(bugspam.Callek) → review+

Updated

5 years ago
Attachment #748920 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 748921 [details] [diff] [review]
1.3) puppet repo patch

Review of attachment 748921 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/foopy/templates/foopy.logrotate.erb
@@ +3,5 @@
> +daily
> +rotate 14
> +missingok
> +dateext
> +dateformat -%Y%m%d

we should use |copytruncate| otherwise we'll move away the /builds/watcher.log and then the cltbld user will be unable to recreate it.
Attachment #748921 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 748922 [details] [diff] [review]
1.4) tools repo patch

Review of attachment 748922 [details] [diff] [review]:
-----------------------------------------------------------------

My only outstand Q not listed below (mostly because I'm curious) how did you pick these exit codes, and is it for any reason other than your opinion of "exit with a different code for each logical reason we would exit" ?

r- due to the qty of fixups I request, but this looks pretty good overall.

::: buildfarm/mobile/watch_devices.sh
@@ +23,5 @@
>    fi
>    local expected_pid=`cat /builds/$device/twistd.pid`
> +  log "buildbot pid is $expected_pid"
> +  kill -0 $expected_pid >/dev/null 2>&1
> +  return $?

Soooo, we shouldn't do this part.

set -e
kill -0 <non-existant-pid>
--->aborts

the || allows us to not need to unset -e and still properly use this.

@@ +33,1 @@
>  }

ahhh I just realized (during review) why I did $1 vs ${device}... primarily because below I set $device as local, effectively I'm treating $device as *always* a local var elsewhere here

@@ +37,4 @@
>    export PYTHONPATH=/builds/sut_tools
>    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> +  log ""
> +  log ""

not a fan of blank log lines.

@@ +38,5 @@
>    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> +  log ""
> +  log ""
> +  log "Starting cycle for our device ($device = $deviceIP) now"
> +  if ! check_buildbot_running "${device}"; then

I remember why I did this, but said why isn't really relevant to block this (cleaner) change. Why is because I had anticipated wanting to use the var elsewhere here. But the fact is we don't and I can't think of when we would want to given current knowledge.

@@ +64,5 @@
> +       else
> +           log "Verify problem discovered:"
> +           while read LINE; do
> +               log "${LINE}"
> +           done < "/builds/${device}/error.flg"

please remove this else block, verify.py itself prints out the error flag reasons. Its only when there is no error flag that we need extra logging here.

@@ +69,4 @@
>         fi
> +       death "Exiting due to verify failure" 66
> +    fi
> +    log "starting buildbot slave for device $device (IP=$deviceIP)"

nit we already say device and ip in the log above, so we can remove "for device..." here, since each device splits between different logs

@@ +84,2 @@
>          # Stop.py should really do foopy cleanups and not touch device
> +        log "Cleaning up device $device..."

nit: attempting cleanup of device $device...

since if we failed out due to being unable to connect to the device, cleanup will fail :-)

@@ +126,5 @@
>    fi
>    # we've definitely acquired the lock at this point, so create the trap as soon as possible...
>    trap "device_check_exit ${device}" EXIT
> +  device_check "${device}" >>"/builds/${device}/watcher.log" 2>&1
> +fi

this hunk will need fixing when you rip out the 1.1 stuff
Attachment #748922 - Flags: review?(bugspam.Callek) → review+
Landing/Deploy plan:

* 1.1 --> do not deploy

* 1.2 tools -- land
* 1.2 tools -- deploy to all foopies

* 1.2 puppet \__do together
* 1.3 puppet /
  ** land to puppet repo
  ** communicate with :dustin and request he merge it to puppet320
  ** watch the releng-shared mailbox [1] to be sure there are no problems related to this deploy

* [2]

* manually go through all foopies and kill any watch_devices sessions running in a screen session

* Monitor with buildduty and sheriff help for any issues.

--
[1] https://intranet.mozilla.org/RelEngWiki/index.php/How_To/Read_Releng-Shared_Emails
[2] This is ok because we have lockfile checking on the bulk of the work, the watch_devices that doesn't auto-exit will still be running, and the watch_devices that exits itself will be running from cron, and we'll never dive into the device-specific checks until the lock is released.
(Assignee)

Comment 16

5 years ago
(In reply to Justin Wood (:Callek) from comment #11)
> Status: NEW → RESOLVED
> Resolution: --- → FIXED
> ???

Hey Callek,
I assumed (whoops) that I should change the status and resolution after uploading the patch - but if the workflow is different, sorry about that. Is there a link that explains the correct workflow to use?

Thanks,
Pete
(Assignee)

Comment 17

5 years ago
(In reply to Justin Wood (:Callek) from comment #14)
> Comment on attachment 748922 [details] [diff] [review]
> 1.4) tools repo patch
> 
> Review of attachment 748922 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> My only outstand Q not listed below (mostly because I'm curious) how did you
> pick these exit codes, and is it for any reason other than your opinion of
> "exit with a different code for each logical reason we would exit" ?

Good question! :)
1) yes, I wanted unique codes for all reasons, so it can be distinguished from exit code what the reason for exit was
2) choosing codes starting from 64 was based on this: http://tldp.org/LDP/abs/html/exitcodes.html
(Assignee)

Comment 18

5 years ago
(In reply to Justin Wood (:Callek) from comment #13)
> Comment on attachment 748921 [details] [diff] [review]
> 1.3) puppet repo patch
> 
> Review of attachment 748921 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: modules/foopy/templates/foopy.logrotate.erb
> @@ +3,5 @@
> > +daily
> > +rotate 14
> > +missingok
> > +dateext
> > +dateformat -%Y%m%d
> 
> we should use |copytruncate| otherwise we'll move away the
> /builds/watcher.log and then the cltbld user will be unable to recreate it.

From IRC:
<Callek> pmoore: sooo in a brief look at the rotated logs, two things (a) I was wrong about *requiring* copytruncate ;-) since it either does that already or the move/recreate works fine

No worries, then I'll leave it like it is, if you are happy with it.
(Assignee)

Comment 19

5 years ago
Comment on attachment 748922 [details] [diff] [review]
1.4) tools repo patch

Review of attachment 748922 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/mobile/watch_devices.sh
@@ +23,5 @@
>    fi
>    local expected_pid=`cat /builds/$device/twistd.pid`
> +  log "buildbot pid is $expected_pid"
> +  kill -0 $expected_pid >/dev/null 2>&1
> +  return $?

I disagree! :)

pmoore@hazel:~ $ cat test_kill.sh 
#!/bin/bash -eu

function test_bad_kill {
    kill -0 123412341234 >/dev/null 2>&1
    return $?
}

if ! test_bad_kill; then
    echo "Kill command was unsuccessful"
fi
pmoore@hazel:~ $ ./test_kill.sh 
Kill command was unsuccessful
pmoore@hazel:~ $

@@ +33,1 @@
>  }

Sure, quite right! :)

@@ +37,4 @@
>    export PYTHONPATH=/builds/sut_tools
>    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> +  log ""
> +  log ""

I love them. :)

I can remove them if necessary, but personally I find it makes the log file more humanly readable, which should ultimately aid troubleshooting. The reason for the break here is that there are typically around 60 lines of output for each iteration of the device check when it fails, and putting just a little bit of space between makes it very easy to visually separate each iteration, and see which log messages relate to a single iteration, so that they do not jumble together. I can't see any major disadvantages of separating the log iterations, only the advantage of better readability. But if you really don't want it, I can (begrudgingly) get rid of it. ;)

@@ +38,5 @@
>    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> +  log ""
> +  log ""
> +  log "Starting cycle for our device ($device = $deviceIP) now"
> +  if ! check_buildbot_running "${device}"; then

cool, I'll leave it as it is then. :)

@@ +64,5 @@
> +       else
> +           log "Verify problem discovered:"
> +           while read LINE; do
> +               log "${LINE}"
> +           done < "/builds/${device}/error.flg"

Quite right! Whoops. :)

@@ +69,4 @@
>         fi
> +       death "Exiting due to verify failure" 66
> +    fi
> +    log "starting buildbot slave for device $device (IP=$deviceIP)"

ok

@@ +84,2 @@
>          # Stop.py should really do foopy cleanups and not touch device
> +        log "Cleaning up device $device..."

totally agreed :)

@@ +126,5 @@
>    fi
>    # we've definitely acquired the lock at this point, so create the trap as soon as possible...
>    trap "device_check_exit ${device}" EXIT
> +  device_check "${device}" >>"/builds/${device}/watcher.log" 2>&1
> +fi

sure, no worries
(Assignee)

Comment 20

5 years ago
From IRC:

<Callek> pmoore: (b) it looks like the daily rotation actually happens on some skewed time, since it seems to happen at like 3am PT, so if there is a way to specify the tz or specific time to do it, great. -- http://linux.die.net/man/8/logrotate (is my ref that doesn't seem to have a way)
<Callek> pmoore: thats more of a minor nit though
<Callek> anyway, thanks just wanted to say those while it was in my mind
<pmoore> Callek: i'll investigate the tz issue
<Callek> pmoore: its a VERY low prior issue, so if you can't solve it in 5-10 minutes do not worry about it
<pmoore> Callek: and provide new patches/responses to your comments in the bug
<pmoore> sure

Hi Callek,

The jobs in /etc/cron.daily are handled by anacron, and all run between 03:00 and 03:45...

[cltbld@foopy57 builds]$ more /etc/anacrontab 
# /etc/anacrontab: configuration file for anacron

# See anacron(8) and anacrontab(5) for details.

SHELL=/bin/sh
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
# the maximal random delay added to the base delay of the jobs
RANDOM_DELAY=45
# the jobs will be started during the following hours only
START_HOURS_RANGE=3-22

#period in days   delay in minutes   job-identifier   command
1	5	cron.daily		nice run-parts /etc/cron.daily
7	25	cron.weekly		nice run-parts /etc/cron.weekly
@monthly 45	cron.monthly		nice run-parts /etc/cron.monthly

The first hourly job of the day kick starts the daily job, and this is defined to be within 45 mins after 03:00AM.

Please note that all logrotate jobs on the foopies run at the same time (puppet, psacct, syslog, yum, dracut, watch_devices) since they are all triggered the same way. And not only logrotate daily jobs, also the other /etc/cron.daily jobs (makewhatis, mlocate, prelink, readahead, tmpwatch). So this should be standard at least across the system.
(Assignee)

Comment 21

5 years ago
Created attachment 750649 [details] [diff] [review]
1.4) tools repo patch - manage_buildslave.sh

Hey Callek,

I just went to create the new patch, and realised I had forgotten to attach my changes to manage_buildslave.sh - so adding this too.

Nothing major, just distinct exit codes, a bit neater handling of the input options (rather than changing value of input parameter opt from "restart" to "start" during processing I suggest using a case statement). Just cosmetic really. :)
Attachment #750649 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 22

5 years ago
Created attachment 750723 [details] [diff] [review]
1.4) tools repo patch - watch_devices.sh

Hey Callek,

I had to rebase this patch since we decided to drop patch 1.1, so I have attached the new version. Please note there are still two changes in here that you did not approve in the last review, but I responded to your points inline in the previous review (which you should be able to see if you view the obselete attachments of this bug) - maybe you can take a look and give me your feedback based to my comments. In short, the kill should not cause the script to prematurely exit (I provided a test case to demonstrate), and I gave a justification for the blank logging lines, which I still believe offer an improvement.

Of course you can reject my counter pleas, and I won't put up a fight, as these are both extremely small changes. :)

Please note change 1.2 does not need to be rebased as there were no merge conflicts when pulling out change 1.1.

Thanks!
Pete
Attachment #748922 - Attachment is obsolete: true
Attachment #750723 - Flags: review?(bugspam.Callek)
(In reply to Pete Moore [:pete][:pmoore] from comment #16)
> (In reply to Justin Wood (:Callek) from comment #11)
> > Status: NEW → RESOLVED
> > Resolution: --- → FIXED
> > ???
> 
> Hey Callek,
> I assumed (whoops) that I should change the status and resolution after
> uploading the patch - but if the workflow is different, sorry about that. Is
> there a link that explains the correct workflow to use?
> 

I'm not sure of a good link, but we typically resolve the bug when the primary work items in the bug are complete (and live). We'll chat more next week about this if it helps.

(In reply to Pete Moore [:pete][:pmoore] from comment #18)
> (In reply to Justin Wood (:Callek) from comment #13)
> No worries, then I'll leave it like it is, if you are happy with it.

Since it works as is, I have no objection.

(In reply to Pete Moore [:pete][:pmoore] from comment #19)
> Comment on attachment 748922 [details] [diff] [review]
> 1.4) tools repo patch
> 
> Review of attachment 748922 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: buildfarm/mobile/watch_devices.sh
> @@ +23,5 @@
> >    fi
> >    local expected_pid=`cat /builds/$device/twistd.pid`
> > +  log "buildbot pid is $expected_pid"
> > +  kill -0 $expected_pid >/dev/null 2>&1
> > +  return $?
> 
> I disagree! :)
> 
> pmoore@hazel:~ $ cat test_kill.sh 
...

Fwiw, While I get that an error return code is ok within a test, I don't like having a function in a set -e script that can give an error return code from a command it runs as utility. Which will exit the script if it is executed.

the fact that the function itself returns unfavorably in same cases is fine for the test below, but not for commands in the function itself (imo).


> @@ +37,4 @@
> >    export PYTHONPATH=/builds/sut_tools
> >    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> > +  log ""
> > +  log ""
> 
> I love them. :)
> 

compromise? 

+ log "============================="
+ log "=  Beginning new iteration  ="
+ log "============================="

- I don't mind visual seperation I more minded blank lines which are hard to comprehend when staring at code as a human. and not always clear when looking at log files.

> 
> @@ +38,5 @@
> >    deviceIP=`python -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> > +  log ""
> > +  log ""
> > +  log "Starting cycle for our device ($device = $deviceIP) now"
> > +  if ! check_buildbot_running "${device}"; then
> 
> cool, I'll leave it as it is then. :)

Heh as said the "why" I chose doesn't directly apply anymore, and imo your change is better at the moment since I can't think of a case in the short term which would make my use-case relevant :-)

(In reply to Pete Moore [:pete][:pmoore] from comment #20)
> The jobs in /etc/cron.daily are handled by anacron, and all run between
> 03:00 and 03:45...

Ahh-ha! that explains it. ok, it's fine to leave as is, not worth trying to mess with anacron settings here.
Comment on attachment 750649 [details] [diff] [review]
1.4) tools repo patch - manage_buildslave.sh

Review of attachment 750649 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits

::: buildfarm/mobile/manage_buildslave.sh
@@ +17,5 @@
>  
> +if [ -z "$opt" ]; then
> +    echo "You can call this script with the following options:" >&2
> +    echo "$OPTIONS" >&2
> +    exit 64

nit: if you're changing exit code here, exit 2 makes more sense, since its a failure in args passed to script

@@ +22,4 @@
>  else
>      if [ -z $2 ]; then
> +        echo "You have to specify which device to manage." >&2
> +        exit 65

same here re: exit code ;-) (use exit 2)

@@ +29,3 @@
>      if [ ! -d "$DEVICE_PATH" ]; then
> +        echo "Directory '$DEVICE_PATH' does not exist. Try again with the correct device name." >&2
> +        exit 66

I'd be inclined to think exit 2 *also* makes sense here, but this one is more grey, so whatever is fine

@@ +42,5 @@
> +        ;;
> +    restart)
> +        "${BB_PATH}" stop "${DEVICE_PATH}"
> +        ;&
> +    start)

I've never seen the ;& syntax before, can you at least do a # Fall through comment after it? To make it clear to future readers of the code.

(Found my sanity re: it in http://stackoverflow.com/questions/12010686/case-statement-fallthrough fwiw)

@@ +45,5 @@
> +        ;&
> +    start)
> +        echo "We want to always start buildbot through twistd"
> +        echo "We will run with the twistd command instead of calling buildslave"
> +        cd $DEVICE_PATH

please don't do cd, pushd and popd are what we want (as in orig code).

Otherwise if a script calls this we risk changing its cwd depending on *how* we're called.
Attachment #750649 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 750723 [details] [diff] [review]
1.4) tools repo patch - watch_devices.sh

Review of attachment 750723 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits

::: buildfarm/mobile/watch_devices.sh
@@ +23,5 @@
>    fi
>    local expected_pid=`cat /builds/$device/twistd.pid`
> +  log "buildbot pid is $expected_pid"
> +  kill -0 $expected_pid >/dev/null 2>&1
> +  return $?

please see my other comment, I still prefer the || retcode=.. here

@@ +38,3 @@
>    export PYTHONPATH=/builds/sut_tools
> +  deviceIP=`"${PYTHON}" -c "import sut_lib;print sut_lib.getIPAddress('$device')" 2> /dev/null`
> +  if ! lockfile -l 3600 -r0 "/builds/${device}/watcher.lock" >/dev/null 2>&1; then

can we (for now) not do the -l 3600 I'd rather approach that in a followup at a later time. since if we get a lockfile stuck around for a long time, its almost always a reason for a human to peek anyway.

Very few lockfile-is-stuck is cause for no-human, at least for now.

@@ +42,3 @@
>    fi
> +  log ""
> +  log ""

see my other comment, I prefer the compromise vs blank lines ;-)
Attachment #750723 - Flags: review?(bugspam.Callek) → review+
(Assignee)

Comment 26

5 years ago
> (In reply to Pete Moore [:pete][:pmoore] from comment #19)
> > Comment on attachment 748922 [details] [diff] [review]
> > 1.4) tools repo patch
> > 
> > Review of attachment 748922 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: buildfarm/mobile/watch_devices.sh
> > @@ +23,5 @@
> > >    fi
> > >    local expected_pid=`cat /builds/$device/twistd.pid`
> > > +  log "buildbot pid is $expected_pid"
> > > +  kill -0 $expected_pid >/dev/null 2>&1
> > > +  return $?
> > 
> > I disagree! :)
> > 
> > pmoore@hazel:~ $ cat test_kill.sh 
> ...
> 
> Fwiw, While I get that an error return code is ok within a test, I don't
> like having a function in a set -e script that can give an error return code
> from a command it runs as utility. Which will exit the script if it is
> executed.
> 
> the fact that the function itself returns unfavorably in same cases is fine
> for the test below, but not for commands in the function itself (imo).

Hi Callek,

My change has no bearing on this. You can see this with the following example. This is equivalent to the original version before my change:

pmoore@hazel:~ $ cat test_retcode.sh 
#!/bin/bash -eu

function bad_return_code {
   echo "this statement exits normally"
   return 1
}

bad_return_code
echo "do not reach this line"
pmoore@hazel:~ $ ./test_retcode.sh 
this statement exits normally
pmoore@hazel:~ $ 

As you see, if you call a function and get a non-zero return statement, and you have the bash -e option set at that point, the script will exit. So my change has no bearing on this.
Comment on attachment 750723 [details] [diff] [review]
1.4) tools repo patch - watch_devices.sh

Review of attachment 750723 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/mobile/watch_devices.sh
@@ +23,5 @@
>    fi
>    local expected_pid=`cat /builds/$device/twistd.pid`
> +  log "buildbot pid is $expected_pid"
> +  kill -0 $expected_pid >/dev/null 2>&1
> +  return $?

sooo.. me and pete chatted in IRC on this, and he convinced me his change is actually better than my change. since any call of check_buildbot_running will work fine in any case where we are able to properly check its exitcode (e.g. check..running || retcode=$? or test ... )
(Assignee)

Comment 28

5 years ago
Created attachment 751223 [details] [diff] [review]
1.4) tools repo patch - manage_buildslave.sh
Attachment #750649 - Attachment is obsolete: true
Attachment #751223 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 29

5 years ago
Created attachment 751224 [details] [diff] [review]
1.4) tools repo patch - watch_devices.sh
Attachment #750723 - Attachment is obsolete: true
Attachment #751224 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 30

5 years ago
Created attachment 751227 [details] [diff] [review]
1.4) tools repo patch - manage_buildslave.sh
Attachment #751223 - Attachment is obsolete: true
Attachment #751223 - Flags: review?(bugspam.Callek)
Attachment #751227 - Flags: review?(bugspam.Callek)
(Assignee)

Comment 31

5 years ago
Hey Callek,

Everything should be as desired now. It has been quite a delicate operation adapting the set of patches which are intended to be serially applied, but I'm confident I have done it without mistakes. Of course I'm happy for you to skim over them to make sure you agree.

Let's apply them together when we are together next week. Have a good weekend!

Thanks,
Pete

Updated

5 years ago
Attachment #751224 - Flags: review?(bugspam.Callek) → review+

Updated

5 years ago
Attachment #751227 - Flags: review?(bugspam.Callek) → review+
(Assignee)

Updated

5 years ago
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago5 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.