Closed Bug 914764 Opened 6 years ago Closed 6 years ago

replace kittenherder's rebooting with slaveapi

Categories: Release Engineering :: General, defect
Platform: x86_64 Linux
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bhearsum, Assigned: bhearsum
Attachments: 6 files, 3 obsolete files

The quick and dirty way to do this is to feed slaveapi the list of slaves from http://builddata.pub.build.mozilla.org/reports/slaves_needing_reboot.txt.

The better way is to have a script that analyzes the slaves itself (through nagios and other things) and then asks slaveapi to reboot them.
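For illustration, the quick and dirty version is only a few lines. This is a sketch: the slaveapi base URL is a placeholder, and the reboot endpoint shape is assumed from the patch attached later in this bug.

import requests

SLAVEAPI = "http://slaveapi-example.build.mozilla.org:8080"  # placeholder host/port
REPORT = "http://builddata.pub.build.mozilla.org/reports/slaves_needing_reboot.txt"

def reboot_listed_slaves():
    for slave in requests.get(REPORT).text.splitlines():
        slave = slave.strip()
        if not slave:
            continue
        # POST to slaveapi's reboot action for this slave
        requests.post("%s/slaves/%s/actions/reboot" % (SLAVEAPI, slave))

if __name__ == "__main__":
    reboot_listed_slaves()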
Depends on: 922858
Based on my reading of https://hg.mozilla.org/users/coop_mozilla.com/last-job-per-slave/file/db156044ac28/gen_last_build_html.py, it appears that we currently reboot slaves that meet all of the following conditions:
* Slave is not considered "staging", "missing", or "decomm" (I think this is sourced from slavealloc...)
* Not disabled in slavealloc
* "notes" field is empty in slavealloc
* 5 hours have elapsed since the last job
* We can find at least one recent job for it
* Is not a tegra

Most of the information above can be gleaned from slavealloc, and the remainder can be found in buildapi. We should also ignore pandas until bug 921067 is fixed.
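A sketch of that filtering logic, for reference. The field names here ("environment", "enabled", "notes", "recent_jobs", "endtime") and the name prefixes are assumptions based on the rest of this bug, not the real slavealloc/buildapi schemas:

import time

IDLE_THRESHOLD = 5 * 60 * 60  # seconds a slave must have been idle

def should_reboot(slave, now=None):
    now = now or time.time()
    if slave.get("environment") != "prod":        # skips staging/missing/decomm
        return False
    if not slave.get("enabled"):                  # disabled in slavealloc
        return False
    if slave.get("notes"):                        # "notes" field is not empty
        return False
    if slave["name"].startswith(("tegra", "panda")):  # no tegras; no pandas until bug 921067
        return False
    recent = slave.get("recent_jobs") or []       # recent job info from buildapi
    if not recent:                                # need at least one recent job
        return False
    return (now - recent[0]["endtime"]) > IDLE_THRESHOLD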
(In reply to Ben Hearsum [:bhearsum] from comment #1)
> Based on my reading of
> https://hg.mozilla.org/users/coop_mozilla.com/last-job-per-slave/file/
> db156044ac28/gen_last_build_html.py it appears that we currently reboot
> slaves that meet all of the following conditions:
> * Slave is not considered "staging", "missing", or "decomm" (I think this is
> sourced from slavealloc...)
> * Not disabled in slavealloc

I believe that this request gets the slaves that meet the above conditions:
/slaves?environment=prod&enabled=1

The patch that adds this is headed to bug 922858 soon...

> * "notes" field is empty in slavealloc

This one might need to be parsed by the client...

> * 5 hours have elapsed since the last job
> * We can find at least one recent job for it

These come from buildapi

> * Is not a tegra

Hardcode this in the client script? Not sure yet...
(In reply to Ben Hearsum [:bhearsum] from comment #2)
> (In reply to Ben Hearsum [:bhearsum] from comment #1)
> > * "notes" field is empty in slavealloc
> 
> This one might need to be parsed by the client...

16:18 < bhearsum> why do we have slaves enabled in slavealloc that say "not enough load (ignore)" in the notes?
16:22 < Callek|Buildduty> bhearsum: because we were going through last-job-per-slave as our primary info at one point and those notes were helping to prioritize before we had a better list
16:22 < Callek|Buildduty> e.g. old cent5 builders
16:23 < bhearsum> ok
16:23 < bhearsum> i think i'm going to stop ignoring slaves with notes in the kittenherder replacement
16:23 < bhearsum> i don't see any reason why an automated system should ignore hosts like that...
Depends on: 932392
Going to need to fix bug 943508 before we can finish this...graceful shutdowns aren't working on Windows slaves right now because of it...
Depends on: 943508
I've been running this with -x tegra -x panda -x ec2 -x spot to exclude slave classes we can't/don't want to deal with. If you're going to run this yourself I advise against setting -w higher than 4, I managed to kill buildapi by setting it too high earlier this week.
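For reference, the exclusions behave roughly like the sketch below (the real is_excluded() is in the attached patch; substring matching here is an assumption, and the worker cap reflects the buildapi overload mentioned above):

EXCLUDES = ["tegra", "panda", "ec2", "spot"]  # the -x values above
MAX_WORKERS = 4  # -w values above 4 have overloaded buildapi in practice

def is_excluded(name, excludes=EXCLUDES):
    return any(pattern in name for pattern in excludes)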
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8339503 - Flags: feedback?(jhopkins)
Attachment #8339503 - Flags: feedback?(coop)
Comment on attachment 8339503 [details] [diff] [review]
first pass on slave rebooting script

+def get_slave(slaveapi, slave):
+    url = furl(slaveapi)
+    url.path.add("slaves").add(slave)
+    return requests.get(str(url)).json()

Should attempt to retry on failure?
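For reference, a retry here could be as small as the sketch below (retry count and delay are arbitrary; this is not part of the attached patch):

import time
import requests

def get_json_with_retries(url, attempts=3, delay=5):
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=60)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if i == attempts - 1:
                raise
            time.sleep(delay)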

+def process_slave(slaveapi, slave):
+    info = get_slave(slaveapi, name)
+    # Ignore slaves without recent job information
+    if not info["recent_jobs"]:
+        log.info("%s - Skipping reboot because no recent jobs found" % name)
+        return

Could still check twistd.log on the slave for freshness.  In what situations could a slave get into this state and would it be "falling through the cracks" of our automated bug filing if we don't address it here?

+    last_job_time = datetime.fromtimestamp(info["recent_jobs"][0]["endtime"])
+    # And also slaves that haven't been idle for at least 5 hours
+    if not (now - last_job_time).total_seconds() > FIVE_HOURS:
+        log.info("%s - Skipping reboot because last job ended at %s" % (name, get_formatted_time(last_job_time)))
+        return

Clarify intention - s/ended at/ended recently at/

+    log.info("Graceful shutdown finished, rebooting slave.")
+    url = furl(slaveapi)
+    url.path.add("slaves").add(slave).add("actions").add("reboot")
+    requestid = requests.post(str(url))
+    # Because SlaveAPI fully escalates reboots (all the way to IT bug filing),
+    # there's no reason for us to watch for it to complete.
+    log.info("Reboot queued.")

Do we need to make note of the reboot so this script doesn't attempt to reboot it again too soon? (eg. while it's busy building after the reboot)
If not, where should that happen?

+    workers = {}
+
+    try:
+        for slave in get_production_slaves(slaveapi):
+            name = slave["name"]
+            if is_excluded(name):
+                log.debug("Excluding '%s' because it matches an excluded pattern.", name)
+                continue
+            while len(workers) >= max_workers:
+                for wname, w in workers.items():
+                    if not w.is_alive():
+                        del workers[wname]

Should have a 1 or 2 millisecond sleep in this loop to avoid consuming too much CPU.
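Concretely, that suggestion amounts to something like this (sketch; only the sleep is new, and the items() copy just makes the delete-while-iterating safe):

import time

while len(workers) >= max_workers:
    for wname, w in list(workers.items()):  # copy so we can delete entries safely
        if not w.is_alive():
            del workers[wname]
    time.sleep(0.002)  # 1-2ms, per the comment above, so the poll loop yields the CPU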

+            t = Thread(target=process_slave, args=(slaveapi, name))
+            t.start()
+            workers[name] = t
+
+        # Wait for all of the workers to finish before exiting.
+        for w in workers.values():
+            w.join()
+    except KeyboardInterrupt:
+        raise

What happens if an exception is raised while operating on a slave - do the rest of the slaves get processed?
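(For reference: exceptions inside a Thread target don't stop the main loop, but they are swallowed silently. A thin wrapper like the sketch below, roughly what the v2 patch ends up doing, logs them with the slave name; process_slave here is the function from the attachment above.)

import logging
from threading import Thread

log = logging.getLogger(__name__)

def process_slave_safely(slaveapi, name):
    try:
        process_slave(slaveapi, name)  # the worker function from the patch
    except Exception:
        log.exception("%s - Error while processing slave", name)

# then start the thread with the wrapper instead:
# t = Thread(target=process_slave_safely, args=(slaveapi, name))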
Attachment #8339503 - Flags: feedback?(jhopkins) → feedback-
Comment on attachment 8339503 [details] [diff] [review]
first pass on slave rebooting script

Review of attachment 8339503 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Ben Hearsum [:bhearsum] from comment #5)
> If you're going to run this
> yourself I advise against setting -w higher than 4, I managed to kill
> buildapi by setting it too high earlier this week.

Since we know that, can we set a configurable upper threshold for that in-code so we don't shoot ourselves in the feet?

(In reply to John Hopkins (:jhopkins) from comment #6) 
> Should attempt to retry on failure?

A good question. How do we want to bulletproof this tool against network issues, etc? FWIW, I'm happy to land a first pass at the script that isn't bulletproof, provided we have stubs or comments indicating where we want to add such things in the future.
 
> Could still check twistd.log on the slave for freshness.  In what situations
> could a slave get into this state and would it be "falling through the
> cracks" of our automated bug filing if we don't address it here?

I think we might actually want a separate script that we could run daily to find lost slaves of this type. I'd prefer smaller parts as opposed to one monolithic script that tries to address all problems.
 
> +    last_job_time =
> datetime.fromtimestamp(info["recent_jobs"][0]["endtime"])
> +    # And also slaves that haven't been idle for at least 5 hours
> +    if not (now - last_job_time).total_seconds() > FIVE_HOURS:
> +        log.info("%s - Skipping reboot because last job ended at %s" %
> (name, get_formatted_time(last_job_time)))
> +        return
> 
> Clarify intention - s/ended at/ended recently at/

Meh, I think it's clear enough as is, although it might be worth reporting the threshold we're checking against. 

Related: can we change "FIVE_HOURS" to be something like "THRESHOLD_IN_SECS" so we can adjust it as necessary and not have it be blatantly wrong?

Pretty happy with this overall if jhopkins' nits are addressed.
Attachment #8339503 - Flags: feedback?(coop) → feedback+
Thanks for the good feedback both of you!

(In reply to Chris Cooper [:coop] from comment #7)
> Comment on attachment 8339503 [details] [diff] [review]
> first pass on slave rebooting script
> 
> Review of attachment 8339503 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> (In reply to Ben Hearsum [:bhearsum] from comment #5)
> > If you're going to run this
> > yourself I advise against setting -w higher than 4, I managed to kill
> > buildapi by setting it too high earlier this week.
> 
> Since we know that, can we set a configurable upper threshold for that
> in-code so we don't shoot ourselves in the feet?

Will do.

> (In reply to John Hopkins (:jhopkins) from comment #6) 
> > Should attempt to retry on failure?
> 
> A good question. How do we want to bulletproof this tool against network
> issues, etc? FWIW, I'm happy to land a first pass at the script that isn't
> bulletproof, provided we have stubs or comments indicating where we want to
> add such things in the future.

It's pretty easy to add a simple retry in, so I'll do that.

> > Could still check twistd.log on the slave for freshness.  In what situations
> > could a slave get into this state and would it be "falling through the
> > cracks" of our automated bug filing if we don't address it here?
> 
> I think we might actually want a separate script that we could run daily to
> find lost slaves of this type. I'd prefer smaller parts as opposed to one
> monolithic script that tries to address all problems.

+1. One of the places that kittenherder fell down was being so monolithic.

I'd want to investigate whether or not that makes sense to do, too. I can't think of a case where twistd.log would be stale and this script wouldn't find it through checking for idleness.

> > +    last_job_time =
> > datetime.fromtimestamp(info["recent_jobs"][0]["endtime"])
> > +    # And also slaves that haven't been idle for at least 5 hours
> > +    if not (now - last_job_time).total_seconds() > FIVE_HOURS:
> > +        log.info("%s - Skipping reboot because last job ended at %s" %
> > (name, get_formatted_time(last_job_time)))
> > +        return
> > 
> > Clarify intention - s/ended at/ended recently at/
> 
> Meh, I think it's clear enough as is, although it might be worth reporting
> the threshold we're checking against. 

Reporting the threshold time seems to address both nits, so I'll do that.

> Related: can we change "FIVE_HOURS" to be something like "THRESHOLD_IN_SECS"
> so we can adjust it as necessary and not have it be blatantly wrong?

Will do.

(In reply to John Hopkins (:jhopkins) from comment #6)
> +    log.info("Graceful shutdown finished, rebooting slave.")
> +    url = furl(slaveapi)
> +    url.path.add("slaves").add(slave).add("actions").add("reboot")
> +    requestid = requests.post(str(url))
> +    # Because SlaveAPI fully escalates reboots (all the way to IT bug
> filing),
> +    # there's no reason for us to watch for it to complete.
> +    log.info("Reboot queued.")
> 
> Do we need to make note of the reboot so this script doesn't attempt to
> reboot it again too soon? (eg. while it's busy building after the reboot)
> If not, where should that happen?

This is something that we should eventually handle on the SlaveAPI side through some sort of locking or limits on how many simultaneous actions of the same type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks this.) When we have something like that the server could even return the existing requestid rather than a new one. I don't think this is a condition we'll hit often, but if you really think it's important to handle I might be able to do something on the client side in the meantime.

> +    workers = {}
> +
> +    try:
> +        for slave in get_production_slaves(slaveapi):
> +            name = slave["name"]
> +            if is_excluded(name):
> +                log.debug("Excluding '%s' because it matches an excluded
> pattern.", name)
> +                continue
> +            while len(workers) >= max_workers:
> +                for wname, w in workers.items():
> +                    if not w.is_alive():
> +                        del workers[wname]
> 
> Should have a 1 or 2 millisecond sleep in this loop to avoid consuming too
> much CPU.

Good point, will do.

> +            t = Thread(target=process_slave, args=(slaveapi, name))
> +            t.start()
> +            workers[name] = t
> +
> +        # Wait for all of the workers to finish before exiting.
> +        for w in workers.values():
> +            w.join()
> +    except KeyboardInterrupt:
> +        raise
> 
> What happens if an exception is raised while operating on a slave - do the
> rest of the slaves get processed?

Exceptions that happen in threads get eaten IIRC, but I'll test this more thoroughly.
(In reply to Ben Hearsum [:bhearsum] from comment #8)
>> Do we need to make note of the reboot so this script doesn't attempt to
>> reboot it again too soon? (eg. while it's busy building after the reboot)
>> If not, where should that happen?
>
>This is something that we should eventually handle on the SlaveAPI side through
>some sort of locking or limits on how many simultaneous actions of the same
>type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks
>this.) When we have something like that the server could even return the
>existing requestid rather than a new one. I don't think this is a condition
>we'll hit often, but if you really think it's important to handle I might be
>able to do something on the client side in the meantime.

I do think it's an important safeguard because of the potential for rebooting machines while they are running jobs.  To phrase the request another way, the script should be idempotent: we should be able to run it once, then re-run it immediately and not reboot machines that were just rebooted.  If it's as simple as keeping a local timestamp and bailing out of the script with "hey, we've just run; wait another 5 hours before running again to avoid repeat reboots or specify --force to override" I'd be ok with that as a start.
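(A rough sketch of the kind of guard being suggested here; the stamp file location and threshold are placeholders, and as the following comments show, this approach wasn't ultimately adopted.)

import os
import sys
import time

STAMP = "/var/tmp/slaverebooter.last-run"  # placeholder location
MIN_INTERVAL = 5 * 60 * 60                 # seconds

def check_last_run(force=False):
    if not force and os.path.exists(STAMP):
        age = time.time() - os.path.getmtime(STAMP)
        if age < MIN_INTERVAL:
            sys.exit("Last run was %d minutes ago; wait, or pass --force to override." % (age / 60))
    with open(STAMP, "w") as f:  # record this run for next time
        f.write(str(time.time()))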
(In reply to John Hopkins (:jhopkins) from comment #9)
> (In reply to Ben Hearsum [:bhearsum] from comment #8)
> >> Do we need to make note of the reboot so this script doesn't attempt to
> >> reboot it again too soon? (eg. while it's busy building after the reboot)
> >> If not, where should that happen?
> >
> >This is something that we should eventually handle on the SlaveAPI side through
> >some sort of locking or limits on how many simultaneous actions of the same
> >type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks
> >this.) When we have something like that the server could even return the
> >existing requestid rather than a new one. I don't think this is a condition
> >we'll hit often, but if you really think it's important to handle I might be
> >able to do something on the client side in the meantime.
> 
> I do think it's an important safeguard because of the potential for
> rebooting machines while they are running jobs.

Unless I'm misunderstanding, this is already guaranteed not to happen because of the graceful shutdown (+ verification of it) we do before asking for the reboot.

> To phrase the request
> another way, the script should be idempotent: we should be able to run it
> once, then re-run it immediately and not reboot machines that were just
> rebooted.

I can see why it's inconvenient to reboot machines repeatedly like this, but I don't think there's any real risk to it.

> If it's as simple as keeping a local timestamp and bailing out of
> the script with "hey, we've just ran; wait another 5 hours before running
> again to avoid repeat reboots or specify --force to override" I'd be ok with
> that as a start.

I really, really don't want to introduce artificial delays like this. A slave can legitimately get stuck multiple times in a 5 hour window, and adding something like this means we don't fix it as quickly as we could.

I think we can do better, though I'm not sure how much I want to lump into this first pass. For example, maybe we shouldn't reboot a slave at all if it's connected to a master (which we might be able to find out from looking at its process list/twistd.log, but it would probably be much easier to do this on a modern buildbot, where there's a JSON API).
John and I talked about this a bit more today. He pointed out an edge case where we wouldn't reboot slaves. Specifically, if a slave hasn't run a job recently and we're not able to ssh to it to finish a graceful shutdown, we'll never request a reboot. This is bad because it means that any slaves that we regularly have to IPMI or PDU reboot won't be automatically fixed.

In order to fix this safely we need to be able to check if a slave is currently running a job without ssh. I'm going to look at adding this to buildapi, though I'm not sure if it'll be possible.
(In reply to Ben Hearsum [:bhearsum] from comment #11)
> John and I talked about this a bit more today. He pointed out an edge case
> where we wouldn't reboot slaves. Specifically, if a slave hasn't run a job
> recently and we're not able to ssh to it to finish a graceful shutdown,
> we'll never request a reboot. This is bad because it means that any slaves
> that we regularly have to IPMI or PDU reboot won't be automatically fixed.
> 
> In order to fix this safely we need to be able to check if a slave is
> currently running a job without ssh. I'm going to look at adding this to
> buildapi, though I'm not sure if it'll be possible.

Started looking into this, quick brain dump after short irc conversation with nick and john:
- Slave name is not directly accessible in schedulerdb
- Statusdb is useless for this because builds don't show up until they finish
- Slave name _is_ available in buildset_properties in schedulerdb:
mysql> select * from buildset_properties where property_name='slavename' limit 5;
+------------+---------------+---------------------------------------+
| buildsetid | property_name | property_value                        |
+------------+---------------+---------------------------------------+
|    7687585 | slavename     | ["w64-ix-slave43", "BuildSlave"]      |
|    7687574 | slavename     | ["w64-ix-slave132", "BuildSlave"]     |
|    7687573 | slavename     | ["w64-ix-slave132", "BuildSlave"]     |
|    7929082 | slavename     | ["bld-linux64-ec2-457", "BuildSlave"] |
|    7928927 | slavename     | ["bld-linux64-ec2-468", "BuildSlave"] |
+------------+---------------+---------------------------------------+

If it's not too expensive, we can select that for every running build (and even add it to https://secure.pub.build.mozilla.org/buildapi/running!). If it's expensive, we can add a new endpoint to make it retrievable, or make it optionally retrievable on an existing endpoint.
OTOH:
16:48 <@catlee-mtg> bhearsum: looks at schedulerdb_requests I think?
16:49 < bhearsum> that's a table or a db?
16:49 <@catlee-mtg> table
16:49 <@catlee-mtg> in ...statusdb?
16:49 < bhearsum> oh
16:49 < bhearsum> statusdb is useless to me
16:49 < bhearsum> i need to find the slave of a currently running build
16:49 < bhearsum> which i think i can get from buildset_properties in schedulerdb
16:50 <@catlee-mtg> nope
16:50 <@catlee-mtg> no way to get it from teh db
16:50 < bhearsum> really?
16:50 <@catlee-mtg> while it's running
16:50 < bhearsum> mysql> select * from buildset_properties where property_name='slavename' limit 5;
16:50 < bhearsum> +------------+---------------+---------------------------------------+
16:50 < bhearsum> | buildsetid | property_name | property_value                        |
16:50 < bhearsum> +------------+---------------+---------------------------------------+
16:50 < bhearsum> |    7687585 | slavename     | ["w64-ix-slave43", "BuildSlave"]      |
16:50 < bhearsum> |    7687574 | slavename     | ["w64-ix-slave132", "BuildSlave"]     |
16:50 < bhearsum> i guess i didn't check to see if those are complete....
16:50 <@catlee-mtg> I bet those are from retries...
16:51 <@catlee-mtg> if those are for all builds, I would be happily surprised!
16:51 < bhearsum> i'll dig into that tomorrow
16:52 < bhearsum> https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/db/connector.py#706 is encouraging, i think
16:52 < bhearsum> oh
16:52 < bhearsum> i see what you're saying
16:52 < bhearsum> that normally, "slavename" isn't a thing in buildset properties
16:52 < bhearsum> because you don't find it out until the build starts
16:54 <@catlee-mtg> yeah
16:54 <@nthomas> why do you need the slave name for a running job ?
16:54 < bhearsum> nthomas: was hoping to address the edge case for the kittenherder replacement: https://bugzilla.mozilla.org/show_bug.cgi?id=914764#c11
16:55 <@nthomas> I'm not sure I'd trust the result of a job if I can't ssh to the slave
16:57 < bhearsum> so it might be fine/better just to reboot the slave without checking?
16:58 <@nthomas> maaaaybe, I don't have all the context on this
16:58 < bhearsum> heheh
16:58 < bhearsum> i appreciate your input nonetheless
16:59 < bhearsum> i also assume i can put your name on the blame for this particular part now too...
16:59 <@nthomas> that's how we roll
16:59 <@nthomas> dammit
17:00 < bhearsum> hehe
17:00  * bhearsum pastes this conversation, will dive into it more tomorrow
Indeed, those are for rebuilds only:
|    7928927 | slavename     | ["bld-linux64-ec2-468", "BuildSlave"] |
mysql> select id, reason, from_unixtime(submitted_at) from buildsets where id=7928927;
+---------+-------------------------------------------------------------+-----------------------------+
| id      | reason                                                      | from_unixtime(submitted_at) |
+---------+-------------------------------------------------------------+-----------------------------+
| 7928927 | The web-page 'rebuild' button was pressed by '<unknown>':  | 2013-11-20 06:04:17         |
+---------+-------------------------------------------------------------+-----------------------------+

So, as catlee said, there's no way to get the slave for a running job from the db.
John, given that we can't find out if a slave is running a job without ssh, how do you feel about assuming it's not, and just rebooting in a case like that? It seems pretty unlikely to me that it would be running a job. Nick made a related point yesterday, too:
16:55 <@nthomas> I'm not sure I'd trust the result of a job if I can't ssh to the slave
Flags: needinfo?(jhopkins)
If there are no recent jobs and we can't SSH and we know it's been > 5h since the reboot script was last run then I'm ok with just rebooting.

If there are no recent jobs and we can't SSH but we have no idea when the reboot script was last run then I'd rather just file a bug in bugzilla to look into it and not reboot the machine.

I'll leave the final call to you, but those are my thoughts.

Re: not trusting a slave's result that we can't SSH into, depends what the failure modes are of the SSH daemon.  In the past, we've had the sshd banning IPs which prevented connections.  We've also had out of date passwords on some hosts which prevented logins.  Neither of those impact the build result.
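(Sketch of the policy described above, purely to make the two cases explicit; the bug-filing path would be SlaveAPI's existing escalation mentioned earlier, and the helper name is made up.)

FIVE_HOURS = 5 * 60 * 60

def action_for_unreachable_idle_slave(secs_since_last_script_run):
    # Only called when a slave has no recent jobs and SSH to it is failing.
    if secs_since_last_script_run is not None and secs_since_last_script_run > FIVE_HOURS:
        return "reboot"
    return "file a bug and leave the slave alone"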
Flags: needinfo?(jhopkins)
(In reply to John Hopkins (:jhopkins) from comment #16)
> If there are no recent jobs and we can't SSH and we know it's been > 5h
> since the reboot script was last run then I'm ok with just rebooting.
> 
> If there are no recent jobs and we can't SSH but we have no idea when the
> reboot script was last run then I'd rather just file a bug in bugzilla to
> look into it and not reboot the machine.

Perhaps the reboot action should distinguish between "connection timed out" and other connection failures? How would you feel if we only did IPMI/PDU reboots when the ssh connection times out (instead of now, where we do them if the reboot via ssh doesn't work in any way)?
Flags: needinfo?(jhopkins)
As discussed in IRC, let's start simple + conservative and file bugs for non-standard cases.
Also, we should ping first and not attempt to SSH if the ping is unsuccessful.
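(A minimal ping check of the sort being suggested; sketch only, since in practice this belongs in SlaveAPI's reboot action, which is what bug 950774 ends up covering.)

import subprocess

def is_pingable(host, count=4):
    # ping -c <count> returns 0 if at least one reply came back (Linux/OS X ping)
    return subprocess.call(["ping", "-c", str(count), host]) == 0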
Flags: needinfo?(jhopkins)
Depends on: 950774
(In reply to John Hopkins (:jhopkins) from comment #18)
> As discussed in IRC, let's start simple + conservative and file bugs for
> non-standard cases.

I thought we agreed on IRC that we weren't making any changes to existing logic...which non-standard cases are you talking about here?

> Also, we should ping first and not attempt to SSH if the ping is
> unsuccessful.

This is a change to the reboot action, so I'm separating it out into bug 950774.
In IRC you said:

> i'm on board with keeping things simple for now and improving later

I think we're saying the same thing.  Simple logic isn't going to catch all the edge cases; we can file bugs for those edge cases.
(In reply to John Hopkins (:jhopkins) from comment #20)
> In IRC you said:
> 
> > i'm on board with keeping things simple for now and improving later
> 
> I think we're saying the same thing.  Simple logic isn't going to catch all
> the edge cases; we can file bugs for those edge cases.

Ah, I see! I thought you were saying slaveapi / this new script would file bugs for all of these edge cases. I see now that you're saying we'll file follow-up bugs as we find them.
Attached patch v2Splinter Review
I think I've addressed all the review comments; here's a summary of the changes:
* Add verbose and dry run options
* Retry HTTP requests
* Catch exceptions in threads so we can print out the slave name associated with them
* Make log messages more consistent
* Don't attempt to reboot if the graceful shutdown failed
* Make ^C work by using .join(1) (join() ignores KeyboardInterrupt, so we need to call it repeatedly to give the script a chance to catch it)
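(Concretely, the ^C change amounts to looping on a short join timeout instead of a bare join(), roughly like this sketch:)

for w in workers.values():
    while w.is_alive():
        # join() with no timeout blocks uninterruptibly; a 1-second timeout
        # lets the main thread wake up and see a pending KeyboardInterrupt.
        w.join(1)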
Attachment #8339503 - Attachment is obsolete: true
Attachment #8349010 - Flags: review?(jhopkins)
Comment on attachment 8349010 [details] [diff] [review]
v2

Looks good!
Attachment #8349010 - Flags: review?(jhopkins) → review+
Comment on attachment 8349010 [details] [diff] [review]
v2

I pushed this. It's not running anywhere yet. Need to write puppet manifests to deploy it somewhere (location TBD). We're still talking about whether or not reboot history on slave health is a blocker. If it is, bug 913601 needs to be fixed before this is deployed and kittenherder is shut off.
Attachment #8349010 - Flags: checked-in+
Attached patch slaverebooter puppet manifests (obsolete) — Splinter Review
I used releaserunner as a rough model for this. The main difference is that this isn't a long running process, so I'm setting up a cron entry instead of supervisord.

I haven't figured out where it makes sense to run this yet...maybe on one of the slaveapi servers?
Attachment #8349467 - Flags: feedback?(dustin)
Isn't this part of slaveapi? Or is it a new, distinct tool?  It seems like this would just be a crontask within slaveapi.

This deployment violates the best practices
  "Install Python apps as pip-installable tarballs, with semantic versions, rather than from hg repositories."
and
  "Minimize the complexity of the deployment by making simple interfaces. For example, if the app needs a crontask installed, create a simple script with setuptools' entry-points that takes no or very few arguments, and gets its config from a config file."
> I haven't figured out where it makes sense to run this yet...maybe on one of the slaveapi servers?

My impression is that slaveapi nodes should eventually be treated like a cluster: if one node goes down, it's not a huge deal.  If we install a singleton service on one of the cluster nodes then we have to treat that node as special from the other nodes and take care that it doesn't go down.  For that reason, it might be better to run the reboot script on a server that is understood to be a singleton (and could have a standby server waiting to be activated in case the first server goes offline).  In other words, a different availability model.
Run-once tasks like this are difficult to cluster, and releng has lots of them.  It'd be great to have a common, resilient way to handle this.  We talked a little bit about such a thing during the workweek - using carrot or some similar job-execution system (SQS is an option, too) to execute the jobs, and some kind of distributed scheduler to insert the jobs.
We've been using Buildbot Master machines for things like this (releaserunner, self serve agent, etc.). I can't think of anywhere else to put this....it's certainly not going on cruncher.
Dustin and I talked about his feedback on the Puppet patch. Summary:
- I'll make the script use a config file to avoid the lengthy command line arguments
- I'll make the script get packaged as part of build/tools' setup.py -- then we can pip install it rather than clone.
I left dryrun as an argument because it's an option that's just there for testing purposes.
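(A sketch of the config-file plus --dry-run split described above; the section and option names are made up, and the real layout lands with the puppet patch below.)

from argparse import ArgumentParser
from ConfigParser import RawConfigParser  # Python 2, matching the era of this script

def load_options():
    parser = ArgumentParser()
    parser.add_argument("-c", "--config", required=True)
    parser.add_argument("--dry-run", dest="dryrun", action="store_true", default=False)
    args = parser.parse_args()

    cfg = RawConfigParser()
    cfg.read(args.config)
    return {
        "slaveapi": cfg.get("main", "slaveapi_url"),
        "max_workers": cfg.getint("main", "max_workers"),
        "excludes": cfg.get("main", "excludes").split(","),
        "dryrun": args.dryrun,
    }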
Attachment #8349542 - Flags: review?(jhopkins)
Attachment #8349542 - Flags: review?(jhopkins) → review+
Attachment #8349542 - Flags: checked-in+
As to where to deploy, I think that slaveapi1 makes more sense than a buildmaster.
Attached patch v2 of puppet manifests (obsolete) — Splinter Review
Here's new Puppet manifests that use the buildtools package (which I have yet to upload) and a config instead of command line arguments.
Attachment #8349467 - Attachment is obsolete: true
Attachment #8349467 - Flags: feedback?(dustin)
Attachment #8349655 - Flags: feedback?(dustin)
Comment on attachment 8349655 [details] [diff] [review]
v2 of puppet manifests

Review of attachment 8349655 [details] [diff] [review]:
-----------------------------------------------------------------

Don't forget to document this module.  And, how will you include it in the node definition?  Including the class directly?
Attachment #8349655 - Flags: feedback?(dustin) → feedback+
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #36)
> And, how will you include it in the
> node definition?  Including the class directly?

That was my thinking. I don't know of any other way to include it only on a single host, but I'm open to ideas.
Comment on attachment 8349646 [details] [diff] [review]
include reboot script + deps in buildtools package

per discussion in irc, the two are roughly equivalent, so ..
Attachment #8349646 - Flags: review- → review+
I think that including the class in the node def makes sense for now, but it exposes the same code-smell as we were discussing above, where this *particular* host is responsible for something, rather than all hosts in a particular class being responsible for it.
Attachment #8349646 - Flags: checked-in+
Attachment #8349655 - Attachment is obsolete: true
Attachment #8350072 - Flags: review?(dustin)
Attachment #8350072 - Flags: review?(dustin) → review+
Comment on attachment 8350072 [details] [diff] [review]
install slaverebooter on bm65

I disabled the kittenherder cronjobs for everything except tegras (pandas were already disabled - don't know why). Once this syncs out with Puppet, do some verification before the cronjob runs for the first time.
Attachment #8350072 - Flags: checked-in+
silly mistakes
Attachment #8356178 - Flags: review?(dustin)
Attachment #8356178 - Flags: review?(dustin) → review+
Attachment #8356178 - Flags: checked-in+
Silly mistake!
Attachment #8356185 - Flags: review?(jhopkins)
Attachment #8356185 - Flags: review?(jhopkins) → review+
Comment on attachment 8356185 [details] [diff] [review]
fix incomplete log message

Landed w/ a version bump in setup.py.
Attachment #8356185 - Flags: checked-in+
This seems to be working OK. I'm calling this fixed, and we can file follow-up bugs if issues arise.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Duplicate of this bug: 897783
Component: Tools → General