Closed Bug 888835 Opened 11 years ago Closed 11 years ago

Manage android with slavealloc

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

x86_64
All

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

References

Details

Attachments

(4 files, 2 obsolete files)

We should start to manage android jobs with slavealloc.

This requires us to somehow get slavealloc giving out buildbot.tacs that know to watch for the shutdown.stamp.

And also to audit slavealloc to make sure enabled/disabled state is accurate.

We have two options here to either lock all devices to specific masters like currently done by hand, or allow them to dynamically pick a master and use "pools" like we do for other slave types, I'm not particular on this point but I'll likely go with pool's.
This adjusts slavealloc slightly in the way we'll need for pandas/tegras to use slavealloc!

The caveat to call out is this adds |allow_shutdown="file"| to *all* slaves, which shouldn't have any affect, since we're not setting the shutdown file on any other slaves, (though we could start once this deploys, if we wanted)
Attachment #780823 - Flags: review?(coop)
Attachment #780823 - Flags: review?(coop) → review+
This bug affects more than one host. I'm moving it to Platform Support. Please let me know if you believe that this should live in Machine Management.
Component: Release Engineering: Machine Management → Release Engineering: Platform Support
QA Contact: armenzg → coop
Attached patch [tools] actually use slavealloc (obsolete) — Splinter Review
So, this code is *untested* so far, but I think this is all we need.

What it changes:
* disabled.flg is no longer checked for when to start buildbot or not, we merely used enabled/disabled in slavealloc
* disabled.flg is STILL available as a "I need to force this slave to shut down"
* adds retry to gettac wget in currently-unused method of manage_buildslave
* uses a tight retry loop before falling back on the existing buildbot.tac (changed to never fully remove the .tac)
* Uses new slavealloc url (from Bug 863266 so we don't hurt ourselves by forgetting)

What I will [need to] do before deploy:
* TEST
* Audit all devices enabled/disabled state and correct in slavealloc
* Test!!!!!
* Lock devices to their current master [Not necessary for future, but just for fewer-moving-parts]
* Did I say test?
* Update Slavealloc notes with any notes currently in disabled.flg
* BE SURE YOU TEST CALLEK!
* Inform sheriffs/buildduty-folk of coming change.
* One final test!

What I will do after I deploy:
* Update docs
* Remind people of the change
* Have someone else look over the docs to be sure they are understandable.
Attachment #783571 - Flags: review?(pmoore)
Attachment #783571 - Flags: feedback?(coop)
Depends on: 899899
Comment on attachment 780823 [details] [diff] [review]
[tools] part 1 - slavealloc

https://hg.mozilla.org/build/tools/rev/0efee119d4f6 and asked for deploy to new cluster in Bug 899899
Attachment #780823 - Flags: checked-in+
Comment on attachment 783571 [details] [diff] [review]
[tools] actually use slavealloc

Review of attachment 783571 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/mobile/manage_buildslave.sh
@@ +41,5 @@
>  
>  case "${opt}" in
>      gettac)
> +        ${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \
> +                wget -q "-O${DEVICE_PATH}/buildbot.tac.new" "http://slavealloc.pvt.build.mozilla.org/gettac/${DEVICE}"

Thank you for using the new URL without making me ask. ;)
Attachment #783571 - Flags: feedback?(coop) → feedback+
Comment on attachment 783571 [details] [diff] [review]
[tools] actually use slavealloc

Review of attachment 783571 [details] [diff] [review]:
-----------------------------------------------------------------

Looks almost perfect - just a couple of lines that I think might not be 100% quite right.

::: buildfarm/mobile/manage_buildslave.sh
@@ +2,5 @@
> +pushd `dirname $0` &> /dev/null
> +MY_DIR=$(pwd)
> +popd &> /dev/null
> +retry="$MY_DIR/../utils/retry.py -s 1 -r 2 -t 30 -m 20"
> +

Just a small point - you could simplify these four lines to one:

retry="$(dirname "${0}")/../utils/retry.py -s 1 -r 2 -t 30 -m 20"

Please note, this mechanism of using a "retry" fails if there is a space in the full path of the script's parent directory. On the foopies, this is not the case, so it is ok - but not ideal, in case directory names are ever changed.

Instead I would prefer to inline the retry command in line 44, with something like:
"$(dirname "${0}")/../utils/retry.py" -s 1 -r 2 -t 30 -m 20 --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \

This also works, if there is a space in the directory path.

@@ +16,5 @@
>  hash "$BB_PATH" "$BB_PYTHON" "$BB_TWISTD" 2>/dev/null || \
>      (echo "Can't find all needed executables to manage buildslave" >&2; exit 1)
>  
> +retry=
> +

Was this left in intentionally, or was this something from testing? Unsetting will make line 44 break, won't it?

@@ +41,5 @@
>  
>  case "${opt}" in
>      gettac)
> +        ${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \
> +                wget -q "-O${DEVICE_PATH}/buildbot.tac.new" "http://slavealloc.pvt.build.mozilla.org/gettac/${DEVICE}"

wget is fine to use, but just a note that it is not installed by default on mac - so if you wanted a mac/linux compatible option you could use curl. Since all the foopies are now linux, this shouldn't be a problem.

See above - I believe "retry" is an empty string at this point, so I believe this won't work...

@@ +47,5 @@
> +            rm -f ${DEVICE_PATH}/buildbot.tac
> +            mv ${DEVICE_PATH}/buildbot.tac.new ${DEVICE_PATH}/buildbot.tac
> +        else
> +            echo "Failed to get buildbot.tac" >&2
> +        fi

For "cleanness" it might be nice to quote the strings, i.e. "${DEVICE_PATH}/buildbot.tac.new" since if the DEVICE_PATH were to contain spaces, the current version would not work. Since DEVICE_PATH doesn't actually contain spaces ("/builds/${DEVICE}") it should work fine - but for readability or to make it future-proof, might be nice to get those quotes in, so you can be sure the shell will interpret the path as a single word/token.

::: buildfarm/mobile/watch_devices.sh
@@ +97,5 @@
>          sleep ${FAIL_WAIT} # Wait a while before allowing us to turn buildbot back on
>      fi
>    fi
> +  rm -f disabled.flg # Force disable only once. If we got this far, then we
> +                     # did all we must. Don't act on it next cycle.

shouldn't this be the following?

rm -f "/builds/${device}/disabled.flg"
Attachment #783571 - Flags: review?(pmoore) → review-
Just a note about the above review - you can choose to ignore any of my "preferences" - I don't want to block on any of these. The only things I wanted you to really double-check are the last line:

rm disabled.flg => rm -f "/builds/${device}/disabled.flg"

and also removing the line:
retry=

You can read my suggestions for the other lines, and take them or leave them. Only those two above are the reasons for holding off landing the patch until they've been clarified.

Thanks Callek!
Product: mozilla.org → Release Engineering
(In reply to Pete Moore [:pete][:pmoore] from comment #6)
> Comment on attachment 783571 [details] [diff] [review]
> [tools] actually use slavealloc
> 
> Review of attachment 783571 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks almost perfect - just a couple of lines that I think might not be 100%
> quite right.
> 
> ::: buildfarm/mobile/manage_buildslave.sh
> @@ +2,5 @@
> > +pushd `dirname $0` &> /dev/null
> > +MY_DIR=$(pwd)
> > +popd &> /dev/null
> > +retry="$MY_DIR/../utils/retry.py -s 1 -r 2 -t 30 -m 20"
> > +
>
> Just a small point - you could simplify these four lines to one:

I have a love for this current syntax (we have used it elsewhere), that said I won't argue if you really want me to change it.

> Please note, this mechanism of using a "retry" fails if there is a space in
> the full path of the script's parent directory. On the foopies, this is not
> the case, so it is ok - but not ideal, in case directory names are ever
> changed.

I added some quotes in the retry setting that I believe will fix this concern, again I haven't yet tested this so will need verification (that it at least works for current)

> @@ +16,5 @@
> >  hash "$BB_PATH" "$BB_PYTHON" "$BB_TWISTD" 2>/dev/null || \
> >      (echo "Can't find all needed executables to manage buildslave" >&2; exit 1)
> >  
> > +retry=
> > +
> 
> Was this left in intentionally, or was this something from testing?
> Unsetting will make line 44 break, won't it?

Nope ripped out.

> @@ +41,5 @@
> >  
> >  case "${opt}" in
> >      gettac)
> > +        ${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \
> > +                wget -q "-O${DEVICE_PATH}/buildbot.tac.new" "http://slavealloc.pvt.build.mozilla.org/gettac/${DEVICE}"
> 
> wget is fine to use, but just a note that it is not installed by default on
> mac - so if you wanted a mac/linux compatible option you could use curl.
> Since all the foopies are now linux, this shouldn't be a problem.
> 
> See above - I believe "retry" is an empty string at this point, so I believe
> this won't work...

Yea I think we can use wget here safely, we don't expect to run on mac for this, and even on mac on our infra we make sure wget is installed.

> @@ +47,5 @@
> > +            rm -f ${DEVICE_PATH}/buildbot.tac
> > +            mv ${DEVICE_PATH}/buildbot.tac.new ${DEVICE_PATH}/buildbot.tac
> > +        else
> > +            echo "Failed to get buildbot.tac" >&2
> > +        fi
> 
> For "cleanness" it might be nice to quote the strings, i.e.
> "${DEVICE_PATH}/buildbot.tac.new" since if the DEVICE_PATH were to contain
> spaces, the current version would not work. Since DEVICE_PATH doesn't
> actually contain spaces ("/builds/${DEVICE}") it should work fine - but for
> readability or to make it future-proof, might be nice to get those quotes
> in, so you can be sure the shell will interpret the path as a single
> word/token.

With the way we treat "DEVICE" at present I don't ever expect this to have a spaces-problem, that said quotes won't hurt at all, added.
 
> ::: buildfarm/mobile/watch_devices.sh
> @@ +97,5 @@
> >          sleep ${FAIL_WAIT} # Wait a while before allowing us to turn buildbot back on
> >      fi
> >    fi
> > +  rm -f disabled.flg # Force disable only once. If we got this far, then we
> > +                     # did all we must. Don't act on it next cycle.
> 
> shouldn't this be the following?
> 
> rm -f "/builds/${device}/disabled.flg"

Great catch!
(attaching new patch next)
Attachment #783571 - Attachment is obsolete: true
Attachment #790012 - Flags: review?(pmoore)
Comment on attachment 790012 [details] [diff] [review]
[tools] actually use slavealloc - v2

Review of attachment 790012 [details] [diff] [review]:
-----------------------------------------------------------------

Hey Callek,

Thanks for the changes - looking really good. I'm really sorry, I have to give an r- because I believe your update to retry has now broken the code, also if no spaces are in file name. However, I've proposed a fix, and then I think it is good to go. That is the only thing blocking (since I believe the script won't work at all in its current state) - the rest is optional.

Cheers!
Pete

::: buildfarm/mobile/manage_buildslave.sh
@@ +1,5 @@
>  #!/bin/bash
> +pushd `dirname $0` &> /dev/null
> +MY_DIR="$(pwd)"
> +popd &> /dev/null
> +retry="\"$MY_DIR/../utils/retry.py\" -s 1 -r 2 -t 30 -m 20"

Unfortunately, with the \" above, I think this is now broken.

These are the three options I see available. Number 1 is my favourite, number 2 next, number 3 last.


1) Use an array:

line 5:
retry=("$MY_DIR/../utils/retry.py" -s 1 -r 2 -t 30 -m 20) # safe when $MY_DIR contains spaces

line 42:
"${retry[@]}" --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \


2) Restore your old version, but provide a comment that it only works if there are no spaces in the directory path

line 5:
retry="$MY_DIR/../utils/retry.py -s 1 -r 2 -t 30 -m 20" # only works when $MY_DIR contains no spaces!

line 42:
${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \


3) Keep line 5 as you have it now, but use eval in lines 42/43 to fix it:

line 5:
retry="\"$MY_DIR/../utils/retry.py\" -s 1 -r 2 -t 30 -m 20"

line 42/43:
eval "${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match wget -q '-O${DEVICE_PATH}/buildbot.tac.new' 'http://slavealloc.pvt.build.mozilla.org/gettac/${DEVICE}'"

@@ +40,5 @@
>  case "${opt}" in
>      gettac)
> +        ${retry} --stderr-regexp 'ERROR 404: Not Found' --fail-if-match \
> +                wget -q "-O${DEVICE_PATH}/buildbot.tac.new" "http://slavealloc.pvt.build.mozilla.org/gettac/${DEVICE}"
> +        if [ -s ${DEVICE_PATH}/buildbot.tac.new ]; then

Suggestion: you may as well use

if [ -s "${DEVICE_PATH}/buildbot.tac.new" ]; then

since you have now quoted file paths everywhere else. Optional (since $DEVICE_PATH does not contain spaces) - in my opinion safer to always quote file paths, so that spaces in file paths are always supported.
Attachment #790012 - Flags: review?(pmoore) → review-
Comment on attachment 794444 [details]
raw SQL to bring slavealloc to current sanity

I ran my eye over this and will put a f+ on it. Slaves like tegra-015 they'll end up moving from staging to prod, but with a staging comment still, so that suggests there are some edge cases to still sort out. Suggest you get someone like kmoir to do the review.
Attachment #794444 - Flags: feedback+
Comment on attachment 794444 [details]
raw SQL to bring slavealloc to current sanity

Hey Coop,

I was hoping to coerce nick into reviewing this, he decided to pass though.

One thing he points out is I am marking some devices (e.g. tegra-015) as production, even though the note still says staging.

The device is technically attached to a loaned foopy atm, which is why my update's based on current state didn't grab it, so when we enable it after this we'd want to double check.

And alternative choice would be to mark envid where we lock masters, and set to prod for NOT IN (3,5) (staging, decomm)

I'm ok with either.
Attachment #794444 - Flags: review? → review?(coop)
I'm hoping to deploy "today" (friday) so whoever gets to this first.

caveat emptor: my removal of "completed..." log line is because the trap also has that line, and currently it is output twice for every job run. I considered it a suitable mini cleanup while I was here.
Attachment #790012 - Attachment is obsolete: true
Attachment #794517 - Flags: review?(pmoore)
Attachment #794517 - Flags: review?(coop)
Comment on attachment 794517 [details] [diff] [review]
[tools] actually use slavealloc - v3

Review of attachment 794517 [details] [diff] [review]:
-----------------------------------------------------------------

::: buildfarm/mobile/watch_devices.sh
@@ +52,4 @@
>    if ! check_buildbot_running "${device}"; then
>      log "Buildbot is not running"
> +    /builds/tools/buildfarm/mobile/manage_buildslave.sh gettac $device
> +    if grep -q "SLAVE DISABLED" /builds/$device/buildbot.tac; then

Should this create a disabled.flg file in any cases? Just wondering.
Attachment #794517 - Flags: review?(coop) → review+
Comment on attachment 794444 [details]
raw SQL to bring slavealloc to current sanity

(In reply to Nick Thomas [:nthomas] from comment #12) 
> I ran my eye over this and will put a f+ on it. Slaves like tegra-015
> they'll end up moving from staging to prod, but with a staging comment
> still, so that suggests there are some edge cases to still sort out. Suggest
> you get someone like kmoir to do the review.

How accurate are the comments in slavealloc? Has anyone made any effort to keep them accurate since I slammed the then-current foopy info into the notes field months ago? 

I guess I'm saying that I don't particularly trust the notes field for tegras. Someone should take an action item to audit those Notes at some ponit.

Kim is out until next week, so I'll take a stab at the review here.

> # envid=3 for dev/pp
> # Set environment to production but do not change any decomm
> UPDATE slaves SET envid=2 WHERE name (name LIKE 'tegra-%' OR name LIKE 'panda-%' ) AND NOT envid=5;

Confused about which env you're putting these into. Comment says 3, but then you update the slaves to envid=2. Also, not sure about your WHERE syntax. Should probably be |WHERE (name LIKE 'tegra-%' OR name LIKE 'panda-%' ) AND envid!=5;|

Re: basedir changes - I'm not a huge fan of using IGNORE in queries *unless* we capture the output and act on it somehow, i.e. determine why those entries are missing and whether we care.

Re: Dev Master Allocations - consider using the WHERE IN syntax instead of chained ORs, e.g. |WHERE name in ('panda-0068','panda-0071',...)| <- easier to read and also makes it easier to add other conditionals.
Attachment #794444 - Flags: review?(coop) → review+
(In reply to Chris Cooper [:coop] from comment #15)
> Comment on attachment 794517 [details] [diff] [review]
> [tools] actually use slavealloc - v3
> 
> Review of attachment 794517 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: buildfarm/mobile/watch_devices.sh
> @@ +52,4 @@
> >    if ! check_buildbot_running "${device}"; then
> >      log "Buildbot is not running"
> > +    /builds/tools/buildfarm/mobile/manage_buildslave.sh gettac $device
> > +    if grep -q "SLAVE DISABLED" /builds/$device/buildbot.tac; then
> 
> Should this create a disabled.flg file in any cases? Just wondering.

No, we preserve disabled.flg for forced job killing, but don't want buildbot.tac to do that for us.

https://hg.mozilla.org/build/tools/rev/b1482a207041
This is changes against the attached SQL as I ran it on friday, this is just syntax mistakes I did.
This deployed on friday.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment on attachment 794517 [details] [diff] [review]
[tools] actually use slavealloc - v3

Sorry I only just spotted you asked me to review this. Since it has already landed, I'll give it the r+ !! :D
Attachment #794517 - Flags: review?(pmoore) → review+
Moved here from bug 929823:

(In reply to Ed Morley [:edmorley UTC+1] from bug 929823 comment #4)
> Callek, this panda is still taking jobs even though it is disabled in
> slavealloc:
> https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.
> html?name=panda-0843
> 
> I thought slavealloc was supposed to work for pandas now?

(In reply to Justin Wood (:Callek) from comment #5)
> Ugh apparantly buildbot has been running for this panda since July first!
> Which means it didn't get the change that allowed it to automatically shut
> down after every job, so it wasn't using any new code for slavealloc.
> 
> I killed all buildbot jobs on this foopy, since they were all running for
> that long. (And could explain some of the retry amounts for these pandas)

(In reply to Ed Morley [:edmorley UTC+1] from comment #6)
> Great - thank you :-)
> Could we check that the same hasn't occurred on any of the other foopies?

(In reply to Ed Morley [:edmorley UTC+1] from comment #7)
> At first glance, at least the following seem to be exhibiting the same
> problem (found by selecting a handful of disabled tegras at random and
> seeing if they are still taking jobs + what master they are on):
> foopy89
> foopy91
> foopy92
> foopy94
> foopy96
> foopy97
> foopy98
> 
> Think we may need to check them all sadly.
Status: RESOLVED → REOPENED
Flags: needinfo?(bugspam.Callek)
Resolution: FIXED → ---
Indeed there was more foopies than I expected in this state... I suspect what happened was we added foopies to production _after_ I updated them all, but since devices.json didn't point to them the tools repos on these foopies were still (initially) older than this code...

But either way it has been too long for me to know WHY I missed these in the first pass, but with sheriff's approval I did a kill on all the buildbot procs that have been running for > 1 week to get this working again. all in all about 10-15 foopies worth, with off-the-cuff average of ~7 devices in said state per foopy
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Flags: needinfo?(bugspam.Callek)
Resolution: --- → FIXED
Thank you Callek! :-)
Depends on: 1091309
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: