Closed Bug 914764 Opened 12 years ago Closed 11 years ago

replace kittenherder's rebooting with slaveapi

Categories

(Release Engineering :: General, defect)

Product:

Component:

Platform:

x86_64

Linux

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(6 files, 3 obsolete files)

first pass on slave rebooting script 11 years ago bhearsum@mozilla.com (:bhearsum) 4.22 KB, patch	jhopkins : feedback- coop : feedback+	Details \| Diff \| Splinter Review
v2 11 years ago bhearsum@mozilla.com (:bhearsum) 5.59 KB, patch	jhopkins : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
slaverebooter puppet manifests 11 years ago bhearsum@mozilla.com (:bhearsum) 5.08 KB, patch		Details \| Diff \| Splinter Review
use config file for most args 11 years ago bhearsum@mozilla.com (:bhearsum) 3.11 KB, patch	jhopkins : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
include reboot script + deps in buildtools package 11 years ago bhearsum@mozilla.com (:bhearsum) 428 bytes, patch	dustin : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
v2 of puppet manifests 11 years ago bhearsum@mozilla.com (:bhearsum) 5.64 KB, patch	dustin : feedback+	Details \| Diff \| Splinter Review
install slaverebooter on bm65 11 years ago bhearsum@mozilla.com (:bhearsum) 6.52 KB, patch	dustin : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
fix errors in cron 11 years ago bhearsum@mozilla.com (:bhearsum) 1.29 KB, patch	dustin : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review
fix incomplete log message 11 years ago bhearsum@mozilla.com (:bhearsum) 633 bytes, patch	jhopkins : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review

bhearsum@mozilla.com (:bhearsum)

Assignee

Description

•

12 years ago

The quick and dirty way to do this is to feed slaveapi the list of slaves from http://builddata.pub.build.mozilla.org/reports/slaves_needing_reboot.txt. The better way is to have a script that analyzes the slaves itself (through nagios and other things) and then asks slaveapi to reboot them.

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

12 years ago

Depends on: 922858

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 1

•

12 years ago

Based on my reading of https://hg.mozilla.org/users/coop_mozilla.com/last-job-per-slave/file/db156044ac28/gen_last_build_html.py it appears that we currently reboot slaves that meeting all of the following conditions: * Slave is not considered "staging", "missing", or "decomm" (I think this is sourced from slavealloc...) * Not disabled in slavealloc * "notes" field is empty in slavealloc * 5 hours have elapsed since the last job * We can find at least one recent job for it * Is not a tegra Most of the information above can be gleaned from slavealloc, and the remainder can be found in buildapi. We should also ignore pandas until bug 921067 is fixed.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 2

•

12 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #1) > Based on my reading of > https://hg.mozilla.org/users/coop_mozilla.com/last-job-per-slave/file/ > db156044ac28/gen_last_build_html.py it appears that we currently reboot > slaves that meeting all of the following conditions: > * Slave is not considered "staging", "missing", or "decomm" (I think this is > sourced from slavealloc...) > * Not disabled in slavealloc I believe that this request gets the slaves that meet the above conditions: /slaves?environment=prod&enabled=1 That patch that adds this is headed to bug 922858 soon... > * "notes" field is empty in slavealloc This one might need to be parsed by the client... > * 5 hours have elapsed since the last job > * We can find at least one recent job for it These come from buildapi > * Is not a tegra Hardcode this in the client script? Not sure yet...

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 3

•

12 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #2) > (In reply to Ben Hearsum [:bhearsum] from comment #1) > > * "notes" field is empty in slavealloc > > This one might need to be parsed by the client... 16:18 < bhearsum> why do we have slaves enabled in slavealloc that say "not enough load (ignore)" in the notes? 16:22 < Callek|Buildduty> bhearsum: because we were going through last-job-per-slave as our primary info at one point and those notes were helping to prioritize before we had a better list 16:22 < Callek|Buildduty> e.g. old cent5 builders 16:23 < bhearsum> ok 16:23 < bhearsum> i think i'm going to stop ignoring slaves with notes in the kittenherder replacement 16:23 < bhearsum> i don't see any reason why an automated system should ignore hosts like that...

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

11 years ago

Depends on: 932392

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 4

•

11 years ago

Going to need to fix bug 943508 before we can finish this...graceful shutdowns aren't working on Windows slaves right now because of it...

Depends on: 943508

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 5

•

11 years ago

Attached patch first pass on slave rebooting script (obsolete) — Details — Splinter Review

I've been running this with -x tegra -x panda -x ec2 -x spot to exclude slave classes we can't/don't want to deal with. If you're going to run this yourself I advise against setting -w higher than 4, I managed to kill buildapi by setting it too high earlier this week.

Assignee: nobody → bhearsum

Status: NEW → ASSIGNED

Attachment #8339503 - Flags: feedback?(jhopkins)

Attachment #8339503 - Flags: feedback?(coop)

John Hopkins (:jhopkins)

Comment 6

•

11 years ago

Comment on attachment 8339503 [details] [diff] [review] first pass on slave rebooting script +def get_slave(slaveapi, slave): + url = furl(slaveapi) + url.path.add("slaves").add(slave) + return requests.get(str(url)).json() Should attempt to retry on failure? +def process_slave(slaveapi, slave): + info = get_slave(slaveapi, name) + # Ignore slaves without recent job information + if not info["recent_jobs"]: + log.info("%s - Skipping reboot because no recent jobs found" % name) + return Could still check twistd.log on the slave for freshness. In what situations could a slave get into this state and would it be "falling through the cracks" of our automated bug filing if we don't address it here? + last_job_time = datetime.fromtimestamp(info["recent_jobs"][0]["endtime"]) + # And also slaves that haven't been idle for at least 5 hours + if not (now - last_job_time).total_seconds() > FIVE_HOURS: + log.info("%s - Skipping reboot because last job ended at %s" % (name, get_formatted_time(last_job_time))) + return Clarify intention - s/ended at/ended recently at/ + log.info("Graceful shutdown finished, rebooting slave.") + url = furl(slaveapi) + url.path.add("slaves").add(slave).add("actions").add("reboot") + requestid = requests.post(str(url)) + # Because SlaveAPI fully escalates reboots (all the way to IT bug filing), + # there's no reason for us to watch for it to complete. + log.info("Reboot queued.") Do we need to make note of the reboot so this script doesn't attempt to reboot it again too soon? (eg. while it's busy building after the reboot) If not, where should that happen? + workers = {} + + try: + for slave in get_production_slaves(slaveapi): + name = slave["name"] + if is_excluded(name): + log.debug("Excluding '%s' because it matches an excluded pattern.", name) + continue + while len(workers) >= max_workers: + for wname, w in workers.items(): + if not w.is_alive(): + del workers[wname] Should have a 1 or 2 millisecond sleep in this loop to avoid consuming too much CPU. + t = Thread(target=process_slave, args=(slaveapi, name)) + t.start() + workers[name] = t + + # Wait for all of the workers to finish before exiting. + for w in workers.values(): + w.join() + except KeyboardInterrupt: + raise What happens if an exception is raised while operating on a slave - do the rest of the slaves get processed?

Attachment #8339503 - Flags: feedback?(jhopkins) → feedback-

Chris Cooper [:coop] (he/him)

Comment 7

•

11 years ago

Comment on attachment 8339503 [details] [diff] [review] first pass on slave rebooting script Review of attachment 8339503 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Ben Hearsum [:bhearsum] from comment #5) > If you're going to run this > yourself I advise against setting -w higher than 4, I managed to kill > buildapi by setting it too high earlier this week. Since we know that, can we set a configurable upper threshold for that in-code so we don't shoot ourselves in the feet? (In reply to John Hopkins (:jhopkins) from comment #6) > Should attempt to retry on failure? A good question. How do we want to bulletproof this tool against network issues, etc? FWIW, I'm happy to land a first pass at the script that isn't bulletproof, provided we have stubs or comments indicating where we want to add such things in the future. > Could still check twistd.log on the slave for freshness. In what situations > could a slave get into this state and would it be "falling through the > cracks" of our automated bug filing if we don't address it here? I think we might actually want a separate script that we could run daily to find lost slaves of this type. I'd prefer smaller parts as opposed to one monolithic script that tries to address all problems. > + last_job_time = > datetime.fromtimestamp(info["recent_jobs"][0]["endtime"]) > + # And also slaves that haven't been idle for at least 5 hours > + if not (now - last_job_time).total_seconds() > FIVE_HOURS: > + log.info("%s - Skipping reboot because last job ended at %s" % > (name, get_formatted_time(last_job_time))) > + return > > Clarify intention - s/ended at/ended recently at/ Meh, I think it's clear enough as is, although it might be worth reporting the threshold we're checking against. Related: can we change "FIVE_HOURS" to be something like "THRESHOLD_IN_SECS" so we can adjust it as necessary and not have it be blatantly wrong? Pretty happy with this overall if jhopkins' nits are addressed.

Attachment #8339503 - Flags: feedback?(coop) → feedback+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 8

•

11 years ago

Thanks for the good feedback both of you! (In reply to Chris Cooper [:coop] from comment #7) > Comment on attachment 8339503 [details] [diff] [review] > first pass on slave rebooting script > > Review of attachment 8339503 [details] [diff] [review]: > ----------------------------------------------------------------- > > (In reply to Ben Hearsum [:bhearsum] from comment #5) > > If you're going to run this > > yourself I advise against setting -w higher than 4, I managed to kill > > buildapi by setting it too high earlier this week. > > Since we know that, can we set a configurable upper threshold for that > in-code so we don't shoot ourselves in the feet? Will do. > (In reply to John Hopkins (:jhopkins) from comment #6) > > Should attempt to retry on failure? > > A good question. How do we want to bulletproof this tool against network > issues, etc? FWIW, I'm happy to land a first pass at the script that isn't > bulletproof, provided we have stubs or comments indicating where we want to > add such things in the future. It's pretty easy to add a simple retry in, so I'll do that. > > Could still check twistd.log on the slave for freshness. In what situations > > could a slave get into this state and would it be "falling through the > > cracks" of our automated bug filing if we don't address it here? > > I think we might actually want a separate script that we could run daily to > find lost slaves of this type. I'd prefer smaller parts as opposed to one > monolithic script that tries to address all problems. +1. One of the places that kittenherder fell down was being so monolithic. I'd want to investigate whether or not that makes sense to do, too. I can't think of a case where twistd.log would be still and this script wouldn't find it through checking for idleness. > > + last_job_time = > > datetime.fromtimestamp(info["recent_jobs"][0]["endtime"]) > > + # And also slaves that haven't been idle for at least 5 hours > > + if not (now - last_job_time).total_seconds() > FIVE_HOURS: > > + log.info("%s - Skipping reboot because last job ended at %s" % > > (name, get_formatted_time(last_job_time))) > > + return > > > > Clarify intention - s/ended at/ended recently at/ > > Meh, I think it's clear enough as is, although it might be worth reporting > the threshold we're checking against. Reporting the threshold time seems to address both nits, so I'll do that. > Related: can we change "FIVE_HOURS" to be something like "THRESHOLD_IN_SECS" > so we can adjust it as necessary and not have it be blatantly wrong? Will do. (In reply to John Hopkins (:jhopkins) from comment #6) > + log.info("Graceful shutdown finished, rebooting slave.") > + url = furl(slaveapi) > + url.path.add("slaves").add(slave).add("actions").add("reboot") > + requestid = requests.post(str(url)) > + # Because SlaveAPI fully escalates reboots (all the way to IT bug > filing), > + # there's no reason for us to watch for it to complete. > + log.info("Reboot queued.") > > Do we need to make note of the reboot so this script doesn't attempt to > reboot it again too soon? (eg. while it's busy building after the reboot) > If not, where should that happen? This is something that we should eventually handle on the SlaveAPI side through some sort of locking or limits on how many simultaneous actions of the same type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks this.) When we have something like that the server could even return the existing requestid rather than a new one. I don't think this is a condition we'll hit often, but if you really think it's important to handle I might be able to do something on the client side in the meantime. > + workers = {} > + > + try: > + for slave in get_production_slaves(slaveapi): > + name = slave["name"] > + if is_excluded(name): > + log.debug("Excluding '%s' because it matches an excluded > pattern.", name) > + continue > + while len(workers) >= max_workers: > + for wname, w in workers.items(): > + if not w.is_alive(): > + del workers[wname] > > Should have a 1 or 2 millisecond sleep in this loop to avoid consuming too > much CPU. Good point, will do. > + t = Thread(target=process_slave, args=(slaveapi, name)) > + t.start() > + workers[name] = t > + > + # Wait for all of the workers to finish before exiting. > + for w in workers.values(): > + w.join() > + except KeyboardInterrupt: > + raise > > What happens if an exception is raised while operating on a slave - do the > rest of the slaves get processed? Exceptions that happen in threads get eaten IIRC, but I'll test this more thoroughly.

John Hopkins (:jhopkins)

Comment 9

•

11 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #8) >> Do we need to make note of the reboot so this script doesn't attempt to >> reboot it again too soon? (eg. while it's busy building after the reboot) >> If not, where should that happen? > >This is something that we should eventually handle on the SlaveAPI side through >some sort of locking or limits on how many simultaneous actions of the same >type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks >this.) When we have something like that the server could even return the >existing requestid rather than a new one. I don't think this is a condition >we'll hit often, but if you really think it's important to handle I might be >able to do something on the client side in the meantime. I do think it's an important safeguard because of the potential for rebooting machines while they are running jobs. To phrase the request another way, the script should be idempotent: we should be able to run it once, then re-run it immediately and not reboot machines that were just rebooted. If it's as simple as keeping a local timestamp and bailing out of the script with "hey, we've just ran; wait another 5 hours before running again to avoid repeat reboots or specify --force to override" I'd be ok with that as a start.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 10

•

11 years ago

(In reply to John Hopkins (:jhopkins) from comment #9) > (In reply to Ben Hearsum [:bhearsum] from comment #8) > >> Do we need to make note of the reboot so this script doesn't attempt to > >> reboot it again too soon? (eg. while it's busy building after the reboot) > >> If not, where should that happen? > > > >This is something that we should eventually handle on the SlaveAPI side through > >some sort of locking or limits on how many simultaneous actions of the same > >type can be run on a slave (for reboots, that would be "1"). (bug 939886 tracks > >this.) When we have something like that the server could even return the > >existing requestid rather than a new one. I don't think this is a condition > >we'll hit often, but if you really think it's important to handle I might be > >able to do something on the client side in the meantime. > > I do think it's an important safeguard because of the potential for > rebooting machines while they are running jobs. Unless I'm misunderstanding, this is already guaranteed not to happen because of the graceful shutdown (+ verification of it) we do before asking for the reboot. > To phrase the request > another way, the script should be idempotent: we should be able to run it > once, then re-run it immediately and not reboot machines that were just > rebooted. I can see why it's inconvenient to reboot machines repeatedly like this, but I don't think there's any real risk to it. > If it's as simple as keeping a local timestamp and bailing out of > the script with "hey, we've just ran; wait another 5 hours before running > again to avoid repeat reboots or specify --force to override" I'd be ok with > that as a start. I really, really don't want to introduce artificial delays like this. A slave can legitimately get stuck multiple times in a 5 hour window, and adding something like this means we don't fix it as quickly as we could. I think we can do better, though I'm not sure how much I want to lump into this first pass. For example, maybe we shouldn't reboot a slave at all if it's connected to a master (which we might be able to find out from looking at its process list/twistd.log, but it would probably be much easier to do this on a modern buildbot, where there's a JSON API).

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 11

•

11 years ago

John and I talked about this a bit more today. He pointed out an edge case where we wouldn't reboot slaves. Specifically, if a slave hasn't run a job recently and we're not able to ssh to it to finish a graceful shutdown, we'll never request a reboot. This is bad because it means that any slaves that we regularly have to IPMI or PDU reboot won't be automatically fixed. In order to fix this safely we need to be able to check if a slave is currently running a job without ssh. I'm going to look at adding this to buildapi, though I'm not sure if it'll be possible.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 12

•

11 years ago

(In reply to Ben Hearsum [:bhearsum] from comment #11) > John and I talked about this a bit more today. He pointed out an edge case > where we wouldn't reboot slaves. Specifically, if a slave hasn't run a job > recently and we're not able to ssh to it to finish a graceful shutdown, > we'll never request a reboot. This is bad because it means that any slaves > that we regularly have to IPMI or PDU reboot won't be automatically fixed. > > In order to fix this safely we need to be able to check if a slave is > currently running a job without ssh. I'm going to look at adding this to > buildapi, though I'm not sure if it'll be possible. Started looking into this, quick brain dump after short irc conversation with nick and john: - Slave name is not directly accessible in schedulerdb - Statusdb is useless for this because builds don't show up until they finish - Slave name _is_ available in buildset_properties in schedulerdb: mysql> select * from buildset_properties where property_name='slavename' limit 5; +------------+---------------+---------------------------------------+ | buildsetid | property_name | property_value | +------------+---------------+---------------------------------------+ | 7687585 | slavename | ["w64-ix-slave43", "BuildSlave"] | | 7687574 | slavename | ["w64-ix-slave132", "BuildSlave"] | | 7687573 | slavename | ["w64-ix-slave132", "BuildSlave"] | | 7929082 | slavename | ["bld-linux64-ec2-457", "BuildSlave"] | | 7928927 | slavename | ["bld-linux64-ec2-468", "BuildSlave"] | +------------+---------------+---------------------------------------+ If it's not too expensive, we can select that for every running build (and even add it to https://secure.pub.build.mozilla.org/buildapi/running!). If it's expensive, we can add a new endpoint to make it retrieveable, or make it optionally retrievable on an existing endpoint.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 13

•

11 years ago

OTOH: 16:48 <@catlee-mtg> bhearsum: looks at schedulerdb_requests I think? 16:49 < bhearsum> that's a table or a db? 16:49 <@catlee-mtg> table 16:49 <@catlee-mtg> in ...statusdb? 16:49 < bhearsum> oh 16:49 < bhearsum> statusdb is useless to me 16:49 < bhearsum> i need to find the slave of a currently running build 16:49 < bhearsum> which i think i can get from buildset_properties in schedulerdb 16:49 -!- sfink [chatzilla@moz-F811F28B.dsl.pltn13.sbcglobal.net] has joined #releng 16:50 <@catlee-mtg> nope 16:50 <@catlee-mtg> no way to get it from teh db 16:50 < bhearsum> really? 16:50 <@catlee-mtg> while it's running 16:50 -!- Irssi: Pasting 5 lines to #releng. Press Ctrl-K if you wish to do this or Ctrl-C to cancel. 16:50 < bhearsum> mysql> select * from buildset_properties where property_name='slavename' limit 5; 16:50 < bhearsum> +------------+---------------+---------------------------------------+ 16:50 < bhearsum> | buildsetid | property_name | property_value | 16:50 < bhearsum> +------------+---------------+---------------------------------------+ 16:50 < bhearsum> | 7687585 | slavename | ["w64-ix-slave43", "BuildSlave"] | 16:50 < bhearsum> | 7687574 | slavename | ["w64-ix-slave132", "BuildSlave"] | 16:50 < bhearsum> i guess i didn't check to see if those are complete.... 16:50 <@catlee-mtg> I bet those are from retries... 16:51 <@catlee-mtg> if those are for all builds, I would be happily surprised! 16:51 < bhearsum> i'll dig into that tomorrow 16:52 < bhearsum> https://mxr.mozilla.org/build-central/source/buildbot/master/buildbot/db/connector.py#706 is encouraging, i think 16:52 < bhearsum> oh 16:52 < bhearsum> i see what you're saying 16:52 < bhearsum> that normally, "slavename" isn't a thing in buildset properties 16:52 < bhearsum> because you don't find it out until the build starts 16:54 <@catlee-mtg> yeah 16:54 <@nthomas> why do you need the slave name for a running job ? 16:54 < bhearsum> nthomas: was hoping to address the edge case for the kittenherder replacement: https://bugzilla.mozilla.org/show_bug.cgi?id=914764#c11 16:55 <@nthomas> I'm not sure I'd trust the result of a job if I can't ssh to the slave 16:56 -!- sfink [chatzilla@moz-F811F28B.dsl.pltn13.sbcglobal.net] has quit [Ping timeout] 16:57 -!- armenzg_buildduty is now known as armenzg_brb 16:57 < bhearsum> so it might be fine/better just to reboot the slave without checking? 16:58 <@nthomas> maaaaybe, I don't have all the context on this 16:58 < bhearsum> heheh 16:58 -!- ehsan is now known as ehsan|mtg 16:58 < bhearsum> i appreciate your input nonetheless 16:58 -!- jrmuizel [jrmuizel@13F2CEC5.7672369.D8E68FF6.IP] has joined #releng 16:59 < bhearsum> i also assume i can put your name on the blame for this particular part now too... 16:59 <@nthomas> that's how we roll 16:59 <@nthomas> dammit 17:00 < bhearsum> hehe 17:00 * bhearsum pastes this conversation, will dive into it more tomorrow

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 14

•

11 years ago

Indeed, those are for rebuilds only: | 7928927 | slavename | ["bld-linux64-ec2-468", "BuildSlave"] | mysql> select id, reason, from_unixtime(submitted_at) from buildsets where id=7928927; +---------+-------------------------------------------------------------+-----------------------------+ | id | reason | from_unixtime(submitted_at) | +---------+-------------------------------------------------------------+-----------------------------+ | 7928927 | The web-page 'rebuild' button was pressed by '<unknown>': | 2013-11-20 06:04:17 | +---------+-------------------------------------------------------------+-----------------------------+ So, as catlee said, there's no way to get the slave for a running job from the db.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 15

•

11 years ago

John, given that we can't find out if a slave is running a job without ssh, how do you feel about assuming it's not, and just rebooting in a case like that? It seems pretty unlikely to me that it would be running a job. Nick made a related point yesterday, too: 16:55 <@nthomas> I'm not sure I'd trust the result of a job if I can't ssh to the slave

Flags: needinfo?(jhopkins)

John Hopkins (:jhopkins)

Comment 16

•

11 years ago

If there are no recent jobs and we can't SSH and we know it's been > 5h since the reboot script was last run then I'm ok with just rebooting. If there are no recent jobs and we can't SSH but we have no idea when the reboot script was last run then I'd rather just file a bug in bugzilla to look into it and not reboot the machine. I'll leave the final call to you, but those are my thoughts. Re: not trusting a slave's result that we can't SSH into, depends what the failure modes are of the SSH daemon. In the past, we've had the sshd banning IPs which prevented connections. We've also had out of date passwords on some hosts which prevented logins. Neither of those impact the build result.

Flags: needinfo?(jhopkins)

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 17

•

11 years ago

(In reply to John Hopkins (:jhopkins) from comment #16) > If there are no recent jobs and we can't SSH and we know it's been > 5h > since the reboot script was last run then I'm ok with just rebooting. > > If there are no recent jobs and we can't SSH but we have no idea when the > reboot script was last run then I'd rather just file a bug in bugzilla to > look into it and not reboot the machine. Perhaps the reboot action should distinguish between "connection timed out" and other connection failures? How would you feel if we only did IPMI/PDU reboots when the ssh connection times out (instead of now, where we do them if the reboot via ssh doesn't work in any way).

Flags: needinfo?(jhopkins)

John Hopkins (:jhopkins)

Comment 18

•

11 years ago

As discussed in IRC, let's start simple + conservative and file bugs for non-standard cases. Also, we should ping first and not attempt to SSH if the ping is unsuccessful.

Flags: needinfo?(jhopkins)

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

11 years ago

Depends on: 950774

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 19

•

11 years ago

(In reply to John Hopkins (:jhopkins) from comment #18) > As discussed in IRC, let's start simple + conservative and file bugs for > non-standard cases. I thought we agreed on IRC that we weren't making any changes to existing logic...which non-standard cases are you talking about here? > Also, we should ping first and not attempt to SSH if the ping is > unsuccessful. This is a change to the reboot action, so I'm separating it out into bug 950774.

John Hopkins (:jhopkins)

Comment 20

•

11 years ago

In IRC you said: > i'm on board with keeping things simple for now and improving later I think we're saying the same thing. Simple logic isn't going to catch all the edge cases; we can file bugs for those edge cases.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 21

•

11 years ago

(In reply to John Hopkins (:jhopkins) from comment #20) > In IRC you said: > > > i'm on board with keeping things simple for now and improving later > > I think we're saying the same thing. Simple logic isn't going to catch all > the edge cases; we can file bugs for those edge cases. Ah, I see! I thought you were saying slaveapi / this new script would file bugs for all of these edge case. I see now that you're saying we'll file follow-up bugs as we find them.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 22

•

11 years ago

Attached patch v2 — Details — Splinter Review

I think I've addressed all the review comments, here's a summary of the changes: * Add verbose and dry run options * Retry HTTP requests * Catch exceptions in threads so we can print out the slave name associated with them * Make log messages more consistent * Don't attempt to reboot if the graceful shutdown failed * Make ^C work by using .join(1) (join() ignores KeyboardInterrupt, so we need to call it repeatedly to give the script a chance to catch it)

Attachment #8339503 - Attachment is obsolete: true

Attachment #8349010 - Flags: review?(jhopkins)

John Hopkins (:jhopkins)

Comment 23

•

11 years ago

Comment on attachment 8349010 [details] [diff] [review] v2 Looks good!

Attachment #8349010 - Flags: review?(jhopkins) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 24

•

11 years ago

Comment on attachment 8349010 [details] [diff] [review] v2 I pushed this. It's not running anywhere yet. Need to write puppet manifests to deploy it somewhere (location TBD). We're still talking about whether or not reboot history on slave health is a blocker. If it is, bug 913601 needs to be fixed before this is deployed and kittenherder is shut off.

Attachment #8349010 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 25

•

11 years ago

Attached patch slaverebooter puppet manifests (obsolete) — Details — Splinter Review

I used releaserunner as a rough model for this. The main difference is that this isn't a long running process, so I'm setting up a cron entry instead of supervisord. I haven't figured out where it makes sense to run this yet...maybe on one of the slaveapi servers?

Attachment #8349467 - Flags: feedback?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 26

•

11 years ago

Isn't this part of slaveapi? Or is it a new, distinct tool? It seems like this would just be a crontask within slaveapi. This deployment violates the best practices "Install Python apps as pip-installable tarballs, with semantic versions, rather than from hg repositories." and "Minimize the complexity of the deployment by making simple interfaces. For example, if the app needs a crontask installed, create a simple script with setuptools' entry-points that takes no or very few arguments, and gets its config from a config file."

John Hopkins (:jhopkins)

Comment 27

•

11 years ago

> I haven't figured out where it makes sense to run this yet...maybe on one of the slaveapi servers? My impression is slaveapi nodes eventually should be treated like a cluster: if one node goes down, not a huge deal. If we install a singleton service on one of the cluster nodes then we have to treat that node as special from the other nodes and take care that it doesn't go down. For that reason, it might be better to run the reboot script on a server that is understood to be a singleton (and could have a standby server waiting to be activated in case the first server goes offline). In other words, a different availability model.

Dustin J. Mitchell [:dustin] (he/him)

Comment 28

•

11 years ago

Run-once tasks like this are difficult to cluster, and releng has lots of them. It'd be great to have a common, resilient way to handle this. We talked a little bit about such a thing during the workweek - using carrot or some similar job-execution system (SQS is an option, too) to execute the jobs, and some kind of distributed scheduler to insert the jobs.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 29

•

11 years ago

We've been using Buildbot Master machines for things like this (releaserunner, self serve agent, etc.). I can't think of anywhere else to put this....it's certainly not going on cruncher.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 30

•

11 years ago

Dustin and I talked about his feedback on the Puppet patch. Summary: - I'll make the script use a config file to avoid the lengthy command line arguments - I'll make the script get packaged as part of build/tools' setup.py -- then we can pip install it rather than clone.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 31

•

11 years ago

Attached patch use config file for most args — Details — Splinter Review

I left dryrun as an argument because it's an option that's just there for testing purposes.

Attachment #8349542 - Flags: review?(jhopkins)

John Hopkins (:jhopkins)

Updated

•

11 years ago

Attachment #8349542 - Flags: review?(jhopkins) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

11 years ago

Attachment #8349542 - Flags: checked-in+

Dustin J. Mitchell [:dustin] (he/him)

Comment 32

•

11 years ago

As to where to deploy, I think that slaveapi1 makes more sense than a buildmaster.

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 33

•

11 years ago

Attached patch include reboot script + deps in buildtools package — Details — Splinter Review

Attachment #8349646 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 34

•

11 years ago

Comment on attachment 8349646 [details] [diff] [review] include reboot script + deps in buildtools package I think you want to use entry_points, not scripts: http://stackoverflow.com/questions/4359553/how-to-install-python-cli-scripts-with-setuptools-without-duplicates http://pythonhosted.org/setuptools/setuptools.html#automatic-script-creation

Attachment #8349646 - Flags: review?(dustin) → review-

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 35

•

11 years ago

Attached patch v2 of puppet manifests (obsolete) — Details — Splinter Review

Here's new Puppet manifests that use the buildtools package (which I have yet to upload) and a config instead of command line arguments.

Attachment #8349467 - Attachment is obsolete: true

Attachment #8349467 - Flags: feedback?(dustin)

Attachment #8349655 - Flags: feedback?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 36

•

11 years ago

Comment on attachment 8349655 [details] [diff] [review] v2 of puppet manifests Review of attachment 8349655 [details] [diff] [review]: ----------------------------------------------------------------- Don't forget to document this module. And, how will you include it in the node definition? Including the class directly?

Attachment #8349655 - Flags: feedback?(dustin) → feedback+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 37

•

11 years ago

(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #36) > And, how will you include it in the > node definition? Including the class directly? That was my thinking. I don't know of any other way to include it only on a single host, but I'm open to ideas.

Dustin J. Mitchell [:dustin] (he/him)

Comment 38

•

11 years ago

Comment on attachment 8349646 [details] [diff] [review] include reboot script + deps in buildtools package per discussion in irc, the two are roughly equivalent, so ..

Attachment #8349646 - Flags: review- → review+

Dustin J. Mitchell [:dustin] (he/him)

Comment 39

•

11 years ago

I think that including the class in the node def makes sense for now, but it exposes the same code-smell as we were discussing above, where this *particular* host is responsible for something, rather than all hosts in a particular class being responsible for it.

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

11 years ago

Attachment #8349646 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 40

•

11 years ago

Attached patch install slaverebooter on bm65 — Details — Splinter Review

Attachment #8349655 - Attachment is obsolete: true

Attachment #8350072 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

11 years ago

Attachment #8350072 - Flags: review?(dustin) → review+

Chris Cooper [:coop] (he/him)

Updated

•

11 years ago

Depends on: 952264

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 41

•

11 years ago

Comment on attachment 8350072 [details] [diff] [review] install slaverebooter on bm65 I disabled the kittenherder cronjobs for everything except tegras (pandas were already disabled - don't know why). Once this syncs out with Puppet do some verification before the cronjob runs for the first time.

Attachment #8350072 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 42

•

11 years ago

Attached patch fix errors in cron — Details — Splinter Review

silly mistakes

Attachment #8356178 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

11 years ago

Attachment #8356178 - Flags: review?(dustin) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Updated

•

11 years ago

Attachment #8356178 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 43

•

11 years ago

Attached patch fix incomplete log message — Details — Splinter Review

Silly mistake!

Attachment #8356185 - Flags: review?(jhopkins)

John Hopkins (:jhopkins)

Updated

•

11 years ago

Attachment #8356185 - Flags: review?(jhopkins) → review+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 44

•

11 years ago

Comment on attachment 8356185 [details] [diff] [review] fix incomplete log message Landed w/ a version bump in setup.py.

Attachment #8356185 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Assignee

Comment 45

•

11 years ago

This seems to be working OK. I'm calling this fixed, and we can file follow-up bugs if issues arise.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

8 years ago

Component: Tools → General

You need to log in before you can comment on or make changes to this bug.