Closed Bug 830796 Opened 13 years ago Closed 13 years ago

Run tegras without cp

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86
macOS

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: Callek)

References

Details

Attachments

(4 files, 2 obsolete files)

I will take few staging tegras and test them without clientproxy.
Priority: -- → P2
I'm working with tegra-010.
Assignee: armenzg → bugspam.Callek
Summary: test tegras without cp → Run tegras without cp
Attached patch [tools] v0.5 (obsolete) — Splinter Review
So guys, sorry for the later than expected review. I was looking today intending to ping you for feedback since I didn't see review granted/denied mail on this. Only to realize I never posted it. I do almost *expect* stuff to change to come out of what you guys see. My rough plan, as written up at our team week is in https://releng.etherpad.mozilla.org/kill-clientproxy But that is not intended to be needed for understanding this bash script. Ok, r? from armen for the logic and what I'm doing (and not doing) in this round of the work. r? from hal as the best bash-syntax guru in the team (afaik) and f? from coop for the overall approach and if any of the deficiencies are a blocker to the work. (Please let me know if any of the below are blockers for initial landing/deployment) Notes: * Does not do anything about logrotate yet. --- just writes to a single .log file per device --- a global .log file in /builds to identify if/when the watcher is working * Does not modify tegra [soon mobile] dashboard to cope with loss of CP -- e.g. check.sh and friends --- Still preserves error.flg mapping --- Still will check for buildbot --- Until then most data/stats from the dashboard will be bad * Does not yet support a good Graceful model --- CP does not support that either --- Current (and with this incarnation) model would still be whole buildbot graceful --- Making a single-device/foopy graceful will be MUCH easier once this lands/deploys * any "max tries" logic on the error.flg purging --- Right now, it will remove error.flg after 1 hour until we mark it disabled manually --- After an hour it will then try again to run verify.py, and if successful will launch buildbot * Does not yet use slavealloc --- Should be easy to add gettac in manually, or make the manage_buildslave use it directly ---* does not use slavealloc notion of disabled * Sub Process logging happens, but is not prefixed with a timestamp by us --- Any subprocess launches (verify.py, stop.py, buildslave startup, etc) needs its own date logging, or we get no timestamp for those lines * No staggering of launches for devices --- I opted for all watch_device launches for a specific device to happen every 5 minutes, rather than wait a second or two between each launch and then wait 5 minutes --- I feel this is WONTFIX since clientproxy did a LOT of launching at once anyway, and used up 2 procs per device as overhead, this is 1 proc per system + 1 proc periodically per device. * There are lots of cleanups that can be made in sut_tools once this lands/deploys --- Not the least of which is stop.py/cleanup.py --- I felt that changing fewer pieces at once is better, and this works, but if any of the stuff you guys feel needs changing now, please feel free to tell me and I can do so.
Attachment #707887 - Flags: review?(hwine)
Attachment #707887 - Flags: review?(armenzg)
Attachment #707887 - Flags: feedback?(coop)
Comment on attachment 707887 [details] [diff] [review] [tools] v0.5 r- due to the error/confusion in handling of $device cited below. Rest are style/nits >@@ -1,7 +1,13 @@ > #!/bin/bash >-BB_PATH=/tools/buildbot/bin/buildslave >-BB_PYTHON=/tools/buildbot/bin/python2.7 >-BB_TWISTD=/tools/buildbot/bin/twistd >+if [ -d "/tools/buildbot/bin" ]; then >+ BB_PATH=/tools/buildbot/bin/buildslave >+ BB_PYTHON=/tools/buildbot/bin/python2.7 >+ BB_TWISTD=/tools/buildbot/bin/twistd >+else >+ BB_PATH=`which buildslave` >+ BB_PYTHON=`which python` >+ BB_TWISTD=`which twistd` >+fi There are a lot of assumptions above - it might be worthwhile double checking everything is okay: hash $BB_PATH $BB_PYTHON $BB_TWISTD || die "Can't find all executables" That's also why I'm a fan of "fail early, fail often", and use: set -eu at the top of scripts > opt="$1" One side effect of the 'set -u' is that the above will error if no option was given to the script. >+#!/bin/bash >+PYTHON=`which python` >+ >+# boilerplate >+warn() { echo "$@" 1>&2; } >+die() { warn "$@" >2 ; exit 1 ; } >+usage() { warn "$@" "${USAGE:-}" ; test $# -eq 0; exit $?; } So, these functions operate slightly differently than the ones I use (mine consume 'per line strings', yours 'words'). Just a style difference, but having different behavior under the same name could be confusing as folks maintain multiple scripts. Maybe borrow from perl 5 & use 'carp' and 'croak'? >+function log_output_to() { >+ # $1 is "where to >+ # $2 (if present) is "hide output to invoker" >+ if [ ! -z "$2" ]; then If you do choose to use 'set -u', then the above needs to be '-z "${2:-}"' <= the one hassle of 'set -u' >+function check_buildbot_running() { >+ # returns success if running >+ # !0 if not running >+ device=$1 I'm a fan of using 'local' for function arguments - as written, it's a global and used in several functions. If one ever calls the other - nasty debugging >+ expected_pid=`cat /builds/$device/twistd.pid` >+ log pid is $expected_pid >+ ps -ax | grep -v grep | grep $expected_pid 2>&1 > /dev/null Since you're not checking if the pid is for the same named process, this can be simplified to 'kill -0 $expected_pid' >+function kill_lockfile() { >+ rm -f /builds/$device/watcher.lock >+} >+ >+function device_check_exit() { >+ kill_lockfile $1 Error: you pass $1, but use the global '$device' in the called function. Are they the same? Are you feeling lucky? Especially since device_check_exit is called asynchronously. A singleton context object (aka global) is useful in passing state to cleanup routines, but then it should only be defined & set once. >+ # Trap here, not earlier so that if lockfile fails, we don't clear the lock >+ # From another process >+ trap "device_check_exit $device" EXIT I think, but am not positive, that you're evaluating the value of '$device' (as the arg to device_check_exit) at the time this statement is executed. Any future change to '$device' would not change the arg. But using such fancy binding techniques (I forget the technical name) is a subtle language quirk that may make the code harder to maintain. >+ log "Sleeping for 200 sec after startup, to prevent premature flag killing" >+ sleep 200 # wait a bit before checking for an error flag or otherwise nit: '200' is a magic number, and may change as we tune things. Consider a local defining it, so the log message and the actual call stay in sync. >+ >+function watch_launcher(){ >+ while [ 1 ]; do nit: I prefer 'while true; do', as the above implies '[ 0 ]' would evaluate to false. It does not. >+ for d in `ls -d tegra-* panda-* 2> /dev/null`; do >+ log ..checking $d >+ $0 $d & I haven't analyzed the script to ensure it terminates when called re-entrantly. If it doesn't, this could be a slow fork bomb. I do note that, as written, the wait time of the child script could exceed the cycle time of the parent. While the lockfile should handle that, I'd like to see some comments about that, so a maintainer knows to be very careful when messing with the lockfile, etc. nit: '$0' is brittle, as if it's a relative path and anyone adds a 'cd' call, it'll go bad. There are two ways I use. If I'm only going to call this script, I use: # get full abs path to our script for nested call script_abs_path=$(cd $(dirname $0);/bin/pwd)/${0##*/} If I'm going to call other scripts that I expect in the same directory as this script I use progDir=$(cd $(dirname $0);/bin/pwd) export PATH=$progDir:$PATH Which allows for easy testing of a new version of the suite, as all related utilities will come from the same directory. And, yes, it needs to be that complicated :)
Attachment #707887 - Flags: review?(hwine) → review-
(In reply to Justin Wood (:Callek) from comment #2) > > (Please let me know if any of the below are blockers for initial > landing/deployment) > > Notes: > * Does not do anything about logrotate yet. > --- just writes to a single .log file per device > --- a global .log file in /builds to identify if/when the watcher is working A simple monthly cron job may be enough here, but not a blocker > * Does not modify tegra [soon mobile] dashboard to cope with loss of CP > -- e.g. check.sh and friends > --- Still preserves error.flg mapping > --- Still will check for buildbot > --- Until then most data/stats from the dashboard will be bad Your call, imo, as you have to live with it. However, I'd make this first enhancement. > * Sub Process logging happens, but is not prefixed with a timestamp by us > --- Any subprocess launches (verify.py, stop.py, buildslave startup, etc) > needs its own date logging, or we get no timestamp for those lines I had the same issue with vcs-sync - I ended up wrapping such calls from my script with time stamped "starting foo" and "ending foo". Seems to work well enough. Again, not a blocker, but maybe an easy win.
Comment on attachment 707887 [details] [diff] [review] [tools] v0.5 Review of attachment 707887 [details] [diff] [review]: ----------------------------------------------------------------- I like the suggestions by hwine. The patches and the logic looks simple but effective. Callek, this looks great!
Attachment #707887 - Flags: review?(armenzg)
Please feel free to ask for another review once you address the comments by hwine.
Comment on attachment 707887 [details] [diff] [review] [tools] v0.5 Review of attachment 707887 [details] [diff] [review]: ----------------------------------------------------------------- Direction is good, provided Hal's comments are addressed. WRT the dashboard, so long as the CP status still defaults to 'available' so that the active bar calculation can just (essentially) ignore that data, I'm fine with it.
Attachment #707887 - Flags: feedback?(coop) → feedback+
Attached patch [tools] v1Splinter Review
I think this address [most] comments, and cleans up a bit from my own findings/readability/English. I'd want armen and (hal || coop) to r+ this before I think of deploying, also I plan to run this on staging for most of the week before I deploy, so earliest Production deploy is likely Mon-Feb-11 if no more code modifications here, now. (In reply to Hal Wine [:hwine] from comment #3) > >@@ -1,7 +1,13 @@ > >+ BB_TWISTD=`which twistd` > >+fi > > There are a lot of assumptions above - it might be worthwhile double > checking everything is okay: > hash $BB_PATH $BB_PYTHON $BB_TWISTD || die "Can't find all executables" Good idea, had to quote vars or this passes if all 3 are not found though and this script doesn't have die. I'm explicitly not doing much in this pre-existing bash script though right now, so won't be adding set -eu to it > >+# boilerplate > >+warn() { echo "$@" 1>&2; } > >+die() { warn "$@" >2 ; exit 1 ; } > >+usage() { warn "$@" "${USAGE:-}" ; test $# -eq 0; exit $?; } > > So, these functions operate slightly differently than the ones I use (mine > consume 'per line strings', yours 'words'). Just a style difference, but > having different behavior under the same name could be confusing as folks > maintain multiple scripts. Maybe borrow from perl 5 & use 'carp' and 'croak'? I always forget which is which with perl carp/croak and don't like the naming. The only reason I did different with regard to warn vs yours is, afaict yours has many times gotten confused with things that are not quoted and output one word per line, I personally like my way better and if its not a sticking point I'm ok with the difference. (yours as it stands, for readers benefit is) warn() { for m; do echo "$m"; done 1>&2; } > >+function log_output_to() { > >+ # $1 is "where to > >+ # $2 (if present) is "hide output to invoker" > >+ if [ ! -z "$2" ]; then > > If you do choose to use 'set -u', then the above needs to be '-z "${2:-}"' > <= the one hassle of 'set -u' That is fine by me fwiw ;-) > >+function check_buildbot_running() { > >+ # returns success if running > >+ # !0 if not running > >+ device=$1 > > I'm a fan of using 'local' for function arguments - as written, it's a > global and used in several functions. If one ever calls the other - nasty > debugging I'm sold, local all the things > > >+ expected_pid=`cat /builds/$device/twistd.pid` > >+ log pid is $expected_pid > >+ ps -ax | grep -v grep | grep $expected_pid 2>&1 > /dev/null > > Since you're not checking if the pid is for the same named process, this can > be simplified to 'kill -0 $expected_pid' Maybe I'm not oldschool enough :-) but I never quite trusted kill with the null signal, but changed anyway. > >+function kill_lockfile() { > >+ rm -f /builds/$device/watcher.lock > >+} > >+ > >+function device_check_exit() { > >+ kill_lockfile $1 > > Error: you pass $1, but use the global '$device' in the called function. good catch, thanks. > >+ # Trap here, not earlier so that if lockfile fails, we don't clear the lock > >+ # From another process > >+ trap "device_check_exit $device" EXIT > > I think, but am not positive, that you're evaluating the value of '$device' > (as the arg to device_check_exit) at the time this statement is executed. If its evaluated at trap setup or exit time should have no bearing on the logic puzzle here, its one-device-per-invocation of script. But thanks for the warning. > >+ log "Sleeping for 200 sec after startup, to prevent premature flag killing" > >+ sleep 200 # wait a bit before checking for an error flag or otherwise > > nit: '200' is a magic number, and may change as we tune things. Consider a > local defining it, so the log message and the actual call stay in sync. Fixed > >+function watch_launcher(){ > >+ while [ 1 ]; do > > nit: I prefer 'while true; do', as the above implies '[ 0 ]' would evaluate > to false. It does not. done. > >+ for d in `ls -d tegra-* panda-* 2> /dev/null`; do > >+ log ..checking $d > >+ $0 $d & > > I haven't analyzed the script to ensure it terminates when called > re-entrantly. It does > I do note that, as written, the wait time of the child script could exceed the cycle time of > the parent. While the lockfile should handle that, I'd like to see some > comments about that, so a maintainer knows to be very careful when messing > with the lockfile, etc. I didn't add comments yet, but I'm open to suggestions on where/what specifically to comment. I'm conscious of messy code being harder to maintain. The overlap is intentional at the moment. > nit: '$0' is brittle, as if it's a relative path and anyone adds a 'cd' > call, it'll go bad. There are two ways I use. If I'm only going to call this > script, I use: > # get full abs path to our script for nested call > script_abs_path=$(cd $(dirname $0);/bin/pwd)/${0##*/} went with this one (In reply to Chris Cooper [:coop] from comment #7) > WRT the dashboard, so long as the CP status still defaults to 'available' so > that the active bar calculation can just (essentially) ignore that data, I'm > fine with it. I went ahead and hardcoded CP to default to 'yes I'm active' feel free to give this patch another look
Attachment #707887 - Attachment is obsolete: true
Attachment #708986 - Flags: review?(hwine)
Attachment #708986 - Flags: review?(armenzg)
Attachment #708986 - Flags: feedback?(coop)
Attached patch [interdiff] tools-v1 interdiff (obsolete) — Splinter Review
interdiff between v0.5 and v1 incase it helps with reviewing.
Comment on attachment 708986 [details] [diff] [review] [tools] v1 Review of attachment 708986 [details] [diff] [review]: ----------------------------------------------------------------- I like this a lot. Are growing logs going to be a concern? Who sets disabled.flg? Should adding disabled.flg the standard procedure to disable a tegra? I have not been doing that in the past. Perfect would be to check slavealloc's state but I don't want it to become the enemy of this. ::: buildfarm/mobile/watch_devices.sh @@ +111,5 @@ > + if [ -f /builds/$device/error.flg -o -f /builds/$device/disabled.flg ]; then > + log "Something wants us to kill buildbot..." > + set +e # These steps are ok to fail, not a good thing but not critical > + cp error.flg error.flg.bak 2>/dev/null # stop.py will remove error flag o_O > + python /builds/sut_tools/stop.py --device $device Would we prefer to stop buildbot with this instead? /builds/tools/buildfarm/mobile/manage_buildslave.sh stop $device @@ +131,5 @@ > + # get full abs path to our script for nested call > + local script_abs_path=$(cd $(dirname $0);/bin/pwd)/${0##*/} > + while true; do > + log "Watcher is still alive" > + for d in `ls -d tegra-* panda-* 2> /dev/null`; do This makes me a little uncomfortable for directories like this "tegra-123_disabled" (when we have renamed a slave to disable it). Do you think it will be a problem?
Attachment #708986 - Flags: review?(armenzg) → review+
Comment on attachment 708986 [details] [diff] [review] [tools] v1 r+ assuming you'll fix the one error noted below. Rest of the comments are just informational notes. One general note - I'm still uncomfortable with a different definition of 'warn' - it looks like you've quoted all your words into lines, so there would now be no difference. But I won't push that. [I will note this is a lesson to me to repair the "broken windows" - I left off one set of quotes on one call in an email you got a lot of, and you ended up with a bad impression of the approach. Believe it or not, that was the first time it happened to me. :/ ] >diff --git a/buildfarm/mobile/manage_buildslave.sh b/buildfarm/mobile/manage_buildslave.sh >--- a/buildfarm/mobile/manage_buildslave.sh >+++ b/buildfarm/mobile/manage_buildslave.sh >@@ -1,7 +1,16 @@ ... >+hash "$BB_PATH" "$BB_PYTHON" "$BB_TWISTD" 2>/dev/null || \ >+ (echo "Can't find all needed executables to manage buildslave"; exit 1) nit: error output should go to stderr: 1>&2 is your friend >diff --git a/buildfarm/mobile/watch_devices.sh b/buildfarm/mobile/watch_devices.sh >new file mode 100644 >--- /dev/null >+++ b/buildfarm/mobile/watch_devices.sh >@@ -0,0 +1,148 @@ >+#!/bin/bash -eu ... >+die() { warn "$@" >2 ; exit 1 ; } error: stdout is going to file '2', not stderr >+function check_buildbot_running() { ... >+ set +e # return code is legit able to fail >+ kill -0 $expected_pid 2>&1 > /dev/null >+ local retcode=$? >+ set -e # re-enable stop on failure fwiw - the above is correct. Another way to code this, which avoids setting/resetting the -e flag (and thus keeps the code neater and less brittle) is: ec=0 command_that_might_fail || ec=$? # do whatever you'd like with $ec fwiw, I have been burned so many times by $? changing (add a pipe step, echo statement, log call, etc) that I never use it except as above. Similarly, if you do not care about the exit code at all: command_that_might_fail || : yes, that is an idiom >+function device_check() { ... >+ lockfile -r0 /builds/$device/watcher.lock 2>&1 > /dev/null fyi: that can be shortened to '&>/dev/null' to redirect both stdout & stderr >+function watch_launcher(){ ... >+ for d in `ls -d tegra-* panda-* 2> /dev/null`; do to address Armen's comment, you could just append '[0-9]' to each glob, to ensure it ends in a digit. (if someone renames a directory 'tegra-057.disabled_3' they get what they deserve.)
Attachment #708986 - Flags: review?(hwine) → review+
(In reply to Armen Zambrano G. [:armenzg] from comment #10) > Are growing logs going to be a concern? Yes, though the (linux at least) foopies have tons of disk space and are less of a concern than anything else here, so I opted for followup, as described above. If you disagree let me know. > Who sets disabled.flg? We do. > Should adding disabled.flg the standard procedure to disable a tegra? I have > not been doing that in the past. It will be the standard "shut down" mechanic, its not been used in the past since we haven't been able to shutdown clientproxy without the kill signals it was sending (cp would hang due to the multiproc stuff and and error or two lurking in there) > Perfect would be to check slavealloc's state but I don't want it to become > the enemy of this. Agreed, but followup. (We also don't want slavealloc to set disabled.flg [which does exactly what stop_cp does now], per chat with coop at workweek. We would love it to graceful the device though.) > ::: buildfarm/mobile/watch_devices.sh > @@ +111,5 @@ > > + if [ -f /builds/$device/error.flg -o -f /builds/$device/disabled.flg ]; then > > + log "Something wants us to kill buildbot..." > > + set +e # These steps are ok to fail, not a good thing but not critical > > + cp error.flg error.flg.bak 2>/dev/null # stop.py will remove error flag o_O > > + python /builds/sut_tools/stop.py --device $device > > Would we prefer to stop buildbot with this instead? > /builds/tools/buildfarm/mobile/manage_buildslave.sh stop $device Possibly, but I do want to be able to kill all the foopy stray procs as well, so I think stop.py is better going forward, but have it call manage_buildslave for the buildslave directly, and then also call out to cleanup, etc. > @@ +131,5 @@ > > + # get full abs path to our script for nested call > > + local script_abs_path=$(cd $(dirname $0);/bin/pwd)/${0##*/} > > + while true; do > > + log "Watcher is still alive" > > + for d in `ls -d tegra-* panda-* 2> /dev/null`; do > > This makes me a little uncomfortable for directories like this > "tegra-123_disabled" (when we have renamed a slave to disable it). > Do you think it will be a problem? In theory, now we will only be setting a disabled.flg not moving directories around, the only reason we were moving directories is that a start_cp was actually bringing up dead (yet to be reimaged) devices. (In reply to Hal Wine [:hwine] from comment #11) > One general note - I'm still uncomfortable with a different definition of > 'warn' - it looks like you've quoted all your words into lines, so there > would now be no difference. But I won't push that. Since I quote my lines now, I decided to go with your def of warn, I still feel its much too easy to accidentally have 50 lines when you have 50 short words for 1 line, but I decided to go with it. (note my log() also quotes the entire arg list anyway :-) ) > >diff --git a/buildfarm/mobile/manage_buildslave.sh b/buildfarm/mobile/manage_buildslave.sh > nit: error output should go to stderr: 1>&2 is your friend Done > >diff --git a/buildfarm/mobile/watch_devices.sh b/buildfarm/mobile/watch_devices.sh > >+die() { warn "$@" >2 ; exit 1 ; } > > error: stdout is going to file '2', not stderr fixed > >+function check_buildbot_running() { > ... > >+ set +e # return code is legit able to fail > >+ kill -0 $expected_pid 2>&1 > /dev/null > >+ local retcode=$? > >+ set -e # re-enable stop on failure > > fwiw - the above is correct. Another way to code this, which avoids > setting/resetting the -e flag (and thus keeps the code neater and less > brittle) is: > ec=0 > command_that_might_fail || ec=$? > # do whatever you'd like with $ec > fwiw, I have been burned so many times by $? changing (add a pipe step, echo > statement, log call, etc) that I never use it except as above. I like your style! :-) done (and similar) > >+ lockfile -r0 /builds/$device/watcher.lock 2>&1 > /dev/null > > fyi: that can be shortened to '&>/dev/null' to redirect both stdout & stderr For this I've always been a fan of listing explicitly, &>/path/2/file has always seemed too magic and easy to confuse at a glance. > >+ for d in `ls -d tegra-* panda-* 2> /dev/null`; do > > to address Armen's comment, you could just append '[0-9]' to each glob, to Given the change in process, I'm not sold on needing this, but given its simple-to-do and makes sense, done.
Comment on attachment 708986 [details] [diff] [review] [tools] v1 Review of attachment 708986 [details] [diff] [review]: ----------------------------------------------------------------- Thought I'd already rubberstamped this after Hal's more thorough review. Sorry for the delay.
Attachment #708986 - Flags: feedback?(coop) → feedback+
This is the final patch, as landed http://hg.mozilla.org/build/tools/rev/f6912b16c5b6 Then a chmod fixup, (stupid windows) [as checked in from a foopy directly] http://hg.mozilla.org/build/tools/rev/a35b3aa87519
Attachment #712726 - Flags: review+
So this patch does two minor followups: The mac foopy I went to deploy this on complained about syntax of `exec &>>$1` Changing to `exec >>$1 2>&1` is equivalent and works Also I noticed that restarting watch_devices while something is still writing to the device log can cause the lockfile error to appear interlaced... So I am opting for lockfile errors/warnings to appear in the main screen/script session, and not in the log itself... we can always redirect to /dev/null instead if you like. If you consider that second part a bad idea, I'd like r+ on at least first hunk and land that, and we can approach second part in another bug.
Attachment #708987 - Attachment is obsolete: true
Attachment #712793 - Flags: review?(hwine)
Attachment #712793 - Flags: review?(hwine) → review+
Comment on attachment 712793 [details] [diff] [review] [tools] minor followup 1 forgot to add, and redirecting to terminal is fine.
this is to automate the needed bits on Linux (I'll still need to go in and run watch_devices.sh manually)
Attachment #713647 - Flags: review?(hwine)
Comment on attachment 713647 [details] [diff] [review] [puppet] linux foopy needs Review of attachment 713647 [details] [diff] [review]: ----------------------------------------------------------------- lgtm
Attachment #713647 - Flags: review?(hwine) → review+
This evening I deployed this to all android foopies, which means we're now running without clientproxy everywhere! I'll leave this bug open for a few days to verify sanity (incase we need to rollback) but until then I declare victory.
Blocks: 840343
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: