Closed Bug 1173942 Opened 10 years ago Closed 9 years ago

Stop rebooting machines from idleizer (allow halting instead)

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mrrrgn, Assigned: mrrrgn)

References

Details

Attachments

(9 files, 3 obsolete files)

Instead it should exit buildbot, and write to a log file
Assignee: nobody → winter2718
Depends on: 1173940
Blocks: 1173945
No longer depends on: 1173940
Attached patch idelizer-reboots.diff (obsolete) — Splinter Review
Attached patch idelizer-reboots.diff (obsolete) — Splinter Review
Attachment #8638771 - Flags: review?(dustin)
Attachment #8624227 - Attachment is obsolete: true
Background: We would like finer grained control over stopping machines that are idling often. To pass this information to other apps, I'm adding a message into the log when an idle out happens so that we can count them up. I also added the ability to shut down idleizer machine reboots via an environment variable. I chose to make this configurable, because we'll still have some non-runner test machines in the wild when this lands.
Comment on attachment 8638771 [details] [diff] [review] idelizer-reboots.diff Review of attachment 8638771 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, to the extent that I remember this code :)
Attachment #8638771 - Flags: review?(dustin) → review+
Attached patch idleizer-reboots.diff (obsolete) — Splinter Review
Here we use environment variables to control how the machine behaves when the idle timer runs out. By default we will reboot; but by setting IDLEIZER_HALT_ON_IDLE=true we will get a full shutdown. Shutdowns will terminate spot instances in aws, so this serves as a handy mechanism for dealing with the problem of wasting billing hours on idle time.
Attachment #8638771 - Attachment is obsolete: true
Attachment #8640663 - Flags: review?(rthijssen)
Shown to be working. Puppet patch coming soon. https://pastebin.mozilla.org/8841014
Attached patch idleizer7.diffSplinter Review
Show to work on windows. Bumps idleizer to the new version that supports terminating instances.
Attachment #8640787 - Flags: review?(rthijssen)
Attachment #8640663 - Attachment is obsolete: true
Attachment #8640663 - Flags: review?(rthijssen)
Attachment #8640794 - Flags: review?(rthijssen)
Comment on attachment 8640787 [details] [diff] [review] idleizer7.diff Review of attachment 8640787 [details] [diff] [review]: ----------------------------------------------------------------- Spreading out the reviews
Attachment #8640787 - Flags: review?(rthijssen) → review?(bugspam.Callek)
Attached patch haltidle.diffSplinter Review
So, this will terminate the instances; but I'm not sure we want it for all the windows instances. If not, we'll need to use puppet to distinguish between the instance types and only set this in the ones we want terminated.
Attachment #8640797 - Flags: review?(mcornmesser)
Comment on attachment 8640794 [details] [diff] [review] idleizer-reboots.diff Review of attachment 8640794 [details] [diff] [review]: ----------------------------------------------------------------- On lines 183 & 185, it would be nice to get a shutdown/restart reason/comment in event log/papertrail. (reason codes (p:4:1): http://ss64.com/nt/shutdown.html) eg: 182 if not halt: 183 os.system("shutdown -f -r -t 0 -c 'idleizer restart' -d 'p:4:1'") 184 else: 185 os.system("shutdown -f -s -t 0 -c 'idleizer shutdown' -d 'p:4:1'")
Attachment #8640794 - Flags: review?(rthijssen) → review+
Comment on attachment 8640797 [details] [diff] [review] haltidle.diff Review of attachment 8640797 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. This should also be added to EC2_start-buildbot.bat.erb.
Attachment #8640797 - Flags: review?(mcornmesser) → review+
Comment on attachment 8640787 [details] [diff] [review] idleizer7.diff Review of attachment 8640787 [details] [diff] [review]: ----------------------------------------------------------------- thought I marked this *days* ago, sorry for the delay.
Attachment #8640787 - Flags: review?(bugspam.Callek) → review+
Summary: Stop rebooting machines from idleizer → Stop rebooting machines from idleizer (allow halting instead)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I'd really like to get a papertrail message indicating the reason for the restart/shutdown. It would allow us to properly analyse our spot instance spawning efficiency. This patch introduces the arguments required to provide this level of verbosity. The syntax is documented at: http://ss64.com/nt/shutdown.html
Attachment #8661128 - Flags: review?(dustin)
Attachment #8661128 - Flags: review?(dustin) → review+
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This is a rework of the patch that attempted to configure reboot/halt behavior via environment variables. Here, the option is passed into the class at instantiation. The idea is that this can be configured by modifying the buildbot.tac template in slavealoc.
Attachment #8682172 - Flags: review?(dustin)
Comment on attachment 8682172 [details] [diff] [review] idleizer-halt-argument.diff Review of attachment 8682172 [details] [diff] [review]: ----------------------------------------------------------------- Looks good - minor notes. ::: slave/buildslave/idleizer.py @@ +27,5 @@ > L{buildslave.bot.BuildSlave} instance created in C{buildbot.tac}. > C{max_idle_time} is the maximum time, in seconds, that the slave will > be allowed to idle before it is restarted. Similarly, > C{max_disconnected_time} is the maximum time it will be allowed to go > + without a connection to a master. If C{halt} is true a shutdown will nit: wrong parameter name @@ +113,5 @@ > > def maybeReboot(self): > + if self._reboot_or_halt and self._halt_on_shutdown: > + self.halt() > + elif self.reboot_or_halt: This would read better with two levels of conditional: if reboot_or_halt: if halt_on_shutdown: halt else: reboot @@ +199,5 @@ > + def halt(self): > + log.msg("Invoking platform-specific shutdown command") > + > + if sys.platform in ('darwin', 'linux2'): > + os.system("sudo -S shutdown -h now < /dev/null") Is this already in sudoers?
Attachment #8682172 - Flags: review?(dustin) → review+
(In reply to Dustin J. Mitchell [:dustin] from comment #17) > Comment on attachment 8682172 [details] [diff] [review] > idleizer-halt-argument.diff > > Review of attachment 8682172 [details] [diff] [review]: > ----------------------------------------------------------------- > > Looks good - minor notes. > > ::: slave/buildslave/idleizer.py > @@ +27,5 @@ > > L{buildslave.bot.BuildSlave} instance created in C{buildbot.tac}. > > C{max_idle_time} is the maximum time, in seconds, that the slave will > > be allowed to idle before it is restarted. Similarly, > > C{max_disconnected_time} is the maximum time it will be allowed to go > > + without a connection to a master. If C{halt} is true a shutdown will > > nit: wrong parameter name > > @@ +113,5 @@ > > > > def maybeReboot(self): > > + if self._reboot_or_halt and self._halt_on_shutdown: > > + self.halt() > > + elif self.reboot_or_halt: > > This would read better with two levels of conditional: > > if reboot_or_halt: > if halt_on_shutdown: > halt > else: > reboot > > @@ +199,5 @@ > > + def halt(self): > > + log.msg("Invoking platform-specific shutdown command") > > + > > + if sys.platform in ('darwin', 'linux2'): > > + os.system("sudo -S shutdown -h now < /dev/null") > > Is this already in sudoers? Thanks ! Sending another patch to fix this up.
s/this/the sudoers up/
Attached file buildslave-bump.diff
Just bumps the buildslave version.
Attachment #8682708 - Flags: review?(dustin)
This will make it so that all non-scl3 machines halt when they idle for 90 minutes, which seems reasonable in a cloud environment. I understand if you object to this patch because of the hard coding. I suppose the only other option would be to modify the db schema and then update everything by hand?
Attachment #8682962 - Flags: review?(dustin)
(In reply to Morgan Phillips [:mrrrgn] from comment #21) > Created attachment 8682962 [details] [diff] [review] > slavealloc-tac.diff > > This will make it so that all non-scl3 machines halt when they idle for 90 > minutes, which seems reasonable in a cloud environment. I understand if you > object to this patch because of the hard coding. I suppose the only other > option would be to modify the db schema and then update everything by hand? Okay, I'm just going to add a custom template on second though.
Attachment #8682962 - Flags: review?(dustin)
Attached image tac-template.png
I've added a new template that will turn on halt on reboot, and decrease the idle timeout from 7 hours down to 90 minutes. You can select it by hand for now to text, then we can apply it to all of the windows hosts whenever you're ready. The template is called: "halt-on-shutdown." Please check in with me on irc before testing, since I still need to bump the buildslave version in puppet (waiting on review).
Attachment #8682975 - Flags: feedback?(rthijssen)
Attachment #8682975 - Flags: feedback?(mcornmesser)
Attachment #8682708 - Flags: review?(dustin) → review?(mcornmesser)
(In reply to Morgan Phillips [:mrrrgn] from comment #23) > Created attachment 8682975 [details] > tac-template.png > > I've added a new template that will turn on halt on reboot, and decrease the > idle timeout from 7 hours down to 90 minutes. You can select it by hand for > now to text, then we can apply it to all of the windows hosts whenever > you're ready. The template is called: "halt-on-shutdown." > > Please check in with me on irc before testing, since I still need to bump > the buildslave version in puppet (waiting on review). s/to text//g <- every other post I write is a facepalm :)
Attachment #8682708 - Flags: review?(mcornmesser) → review+
Attachment #8682975 - Flags: feedback?(rthijssen) → feedback+
Attachment #8682975 - Flags: feedback?(mcornmesser) → feedback+
I'd like to understand what I should look for on the Windows build slaves to validate that idleizer is installed/active. - Today at 13:00 I disabled all y-2008-spot-* instances in slavealloc. There were 93 running instances of this type in EC2. - At 18:00 I could see that these slaves were no longer associated with bb masters and none had running jobs. All were still up and running. I don't know whether we haven't got idleizer installed correctly or what to look for to establish why the instances are still running. Is there a log file I can look at on the instances that might contain clues? There was nothing pertaining to idleizer in c:\opt\runner\runner.log but I don't know if that's the right place to look. I'm going to go ahead and manually terminate all but one instance. I'll leave y-2008-spot-126 (10.132.64.101) up and running so that if somebody knows what to look for, we can have a look at that instances logs to see why it hasn't idleizer terminated.
Actually the last lines from runner.log might be relevant: Error sending notice to nagios (ignored) Traceback (most recent call last): File "C:\opt\runner\Scripts\runner-script.py", line 9, in <module> load_entry_point('runner==2.0', 'console_scripts', 'runner')() File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 256, in main runner(config, args.taskdir, args.times) File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 224, in runner if not process_taskdir(config, taskdir): File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 138, in process_taskdir r = run_task(task_cmd, env, max_time=task_config['max_time']) File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 23, in run_task proc = subprocess.Popen(t, stdin=open(os.devnull, 'r'), env=env) File "C:\mozilla-build\python27\Lib\subprocess.py", line 711, in __init__ errread, errwrite) File "C:\mozilla-build\python27\Lib\subprocess.py", line 948, in _execute_child startupinfo) WindowsError: [Error 193] %1 is not a valid Win32 application
Blocks: 1222915
there's a small typo in the code: http://hg.mozilla.org/build/buildbot/file/5fa6d3543fc9/slave/buildslave/idleizer.py#l118 def maybeReboot(self): if self._reboot_or_halt: if self._halt_on_shutdown: self.halt() elif self.reboot_or_halt: self.reboot() That last 'elif self.reboot_or_halt' should be changed to just an 'else:' I think. 'self.reboot_or_halt' doesn't exist, and is causing bug 1222915
Agreed. I think my review comment caused that particular error :(
Pushing a fix this. I'm not touching idleizer anymore.
Attached file durr.diff
Nothing makes me want to jump off a cliff quite like the number of bugs I've created touching idleizer. 0/10, hope to not do again.
Attachment #8684994 - Flags: review?(dustin)
Attachment #8684994 - Flags: review?(dustin) → review+
Reported working in production.
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → FIXED
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: