Closed
Bug 1173942
Opened 10 years ago
Closed 9 years ago
Stop rebooting machines from idleizer (allow halting instead)
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mrrrgn, Assigned: mrrrgn)
References
Details
Attachments
(9 files, 3 obsolete files)
2.39 KB,
patch
|
Callek
:
review+
|
Details | Diff | Splinter Review |
13.15 KB,
patch
|
grenade
:
review+
|
Details | Diff | Splinter Review |
1.85 KB,
patch
|
markco
:
review+
|
Details | Diff | Splinter Review |
1.07 KB,
patch
|
dustin
:
review+
|
Details | Diff | Splinter Review |
8.55 KB,
patch
|
dustin
:
review+
|
Details | Diff | Splinter Review |
989 bytes,
text/plain
|
dustin
:
review+
|
Details |
4.61 KB,
patch
|
Details | Diff | Splinter Review | |
118.67 KB,
image/png
|
markco
:
feedback+
grenade
:
feedback+
|
Details |
1.12 KB,
text/plain
|
dustin
:
review+
|
Details |
Instead it should exit buildbot, and write to a log file
Assignee | ||
Comment 1•10 years ago
|
||
Assignee | ||
Comment 2•10 years ago
|
||
Attachment #8638771 -
Flags: review?(dustin)
Assignee | ||
Updated•10 years ago
|
Attachment #8624227 -
Attachment is obsolete: true
Assignee | ||
Comment 3•10 years ago
|
||
Background: We would like finer grained control over stopping machines that are idling often. To pass this information to other apps, I'm adding a message into the log when an idle out happens so that we can count them up. I also added the ability to shut down idleizer machine reboots via an environment variable. I chose to make this configurable, because we'll still have some non-runner test machines in the wild when this lands.
Comment 4•10 years ago
|
||
Comment on attachment 8638771 [details] [diff] [review]
idelizer-reboots.diff
Review of attachment 8638771 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good, to the extent that I remember this code :)
Attachment #8638771 -
Flags: review?(dustin) → review+
Assignee | ||
Comment 5•10 years ago
|
||
Here we use environment variables to control how the machine behaves when the idle timer runs out. By default we will reboot; but by setting IDLEIZER_HALT_ON_IDLE=true we will get a full shutdown. Shutdowns will terminate spot instances in aws, so this serves as a handy mechanism for dealing with the problem of wasting billing hours on idle time.
Attachment #8638771 -
Attachment is obsolete: true
Attachment #8640663 -
Flags: review?(rthijssen)
Assignee | ||
Comment 6•10 years ago
|
||
Shown to be working. Puppet patch coming soon. https://pastebin.mozilla.org/8841014
Assignee | ||
Comment 7•10 years ago
|
||
Show to work on windows. Bumps idleizer to the new version that supports terminating instances.
Attachment #8640787 -
Flags: review?(rthijssen)
Assignee | ||
Comment 8•10 years ago
|
||
Attachment #8640663 -
Attachment is obsolete: true
Attachment #8640663 -
Flags: review?(rthijssen)
Attachment #8640794 -
Flags: review?(rthijssen)
Assignee | ||
Comment 9•10 years ago
|
||
Comment on attachment 8640787 [details] [diff] [review]
idleizer7.diff
Review of attachment 8640787 [details] [diff] [review]:
-----------------------------------------------------------------
Spreading out the reviews
Attachment #8640787 -
Flags: review?(rthijssen) → review?(bugspam.Callek)
Assignee | ||
Comment 10•10 years ago
|
||
So, this will terminate the instances; but I'm not sure we want it for all the windows instances. If not, we'll need to use puppet to distinguish between the instance types and only set this in the ones we want terminated.
Attachment #8640797 -
Flags: review?(mcornmesser)
Comment 11•10 years ago
|
||
Comment on attachment 8640794 [details] [diff] [review]
idleizer-reboots.diff
Review of attachment 8640794 [details] [diff] [review]:
-----------------------------------------------------------------
On lines 183 & 185, it would be nice to get a shutdown/restart reason/comment in event log/papertrail. (reason codes (p:4:1): http://ss64.com/nt/shutdown.html) eg:
182 if not halt:
183 os.system("shutdown -f -r -t 0 -c 'idleizer restart' -d 'p:4:1'")
184 else:
185 os.system("shutdown -f -s -t 0 -c 'idleizer shutdown' -d 'p:4:1'")
Attachment #8640794 -
Flags: review?(rthijssen) → review+
Comment 12•10 years ago
|
||
Comment on attachment 8640797 [details] [diff] [review]
haltidle.diff
Review of attachment 8640797 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good. This should also be added to EC2_start-buildbot.bat.erb.
Attachment #8640797 -
Flags: review?(mcornmesser) → review+
Comment 13•10 years ago
|
||
Comment on attachment 8640787 [details] [diff] [review]
idleizer7.diff
Review of attachment 8640787 [details] [diff] [review]:
-----------------------------------------------------------------
thought I marked this *days* ago, sorry for the delay.
Attachment #8640787 -
Flags: review?(bugspam.Callek) → review+
Assignee | ||
Updated•10 years ago
|
Summary: Stop rebooting machines from idleizer → Stop rebooting machines from idleizer (allow halting instead)
Assignee | ||
Updated•10 years ago
|
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 14•9 years ago
|
||
I'd really like to get a papertrail message indicating the reason for the restart/shutdown. It would allow us to properly analyse our spot instance spawning efficiency. This patch introduces the arguments required to provide this level of verbosity. The syntax is documented at: http://ss64.com/nt/shutdown.html
Attachment #8661128 -
Flags: review?(dustin)
Updated•9 years ago
|
Attachment #8661128 -
Flags: review?(dustin) → review+
Comment 15•9 years ago
|
||
Assignee | ||
Updated•9 years ago
|
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 16•9 years ago
|
||
This is a rework of the patch that attempted to configure reboot/halt behavior via environment variables. Here, the option is passed into the class at instantiation. The idea is that this can be configured by modifying the buildbot.tac template in slavealoc.
Attachment #8682172 -
Flags: review?(dustin)
Comment 17•9 years ago
|
||
Comment on attachment 8682172 [details] [diff] [review]
idleizer-halt-argument.diff
Review of attachment 8682172 [details] [diff] [review]:
-----------------------------------------------------------------
Looks good - minor notes.
::: slave/buildslave/idleizer.py
@@ +27,5 @@
> L{buildslave.bot.BuildSlave} instance created in C{buildbot.tac}.
> C{max_idle_time} is the maximum time, in seconds, that the slave will
> be allowed to idle before it is restarted. Similarly,
> C{max_disconnected_time} is the maximum time it will be allowed to go
> + without a connection to a master. If C{halt} is true a shutdown will
nit: wrong parameter name
@@ +113,5 @@
>
> def maybeReboot(self):
> + if self._reboot_or_halt and self._halt_on_shutdown:
> + self.halt()
> + elif self.reboot_or_halt:
This would read better with two levels of conditional:
if reboot_or_halt:
if halt_on_shutdown:
halt
else:
reboot
@@ +199,5 @@
> + def halt(self):
> + log.msg("Invoking platform-specific shutdown command")
> +
> + if sys.platform in ('darwin', 'linux2'):
> + os.system("sudo -S shutdown -h now < /dev/null")
Is this already in sudoers?
Attachment #8682172 -
Flags: review?(dustin) → review+
Assignee | ||
Comment 18•9 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> Comment on attachment 8682172 [details] [diff] [review]
> idleizer-halt-argument.diff
>
> Review of attachment 8682172 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Looks good - minor notes.
>
> ::: slave/buildslave/idleizer.py
> @@ +27,5 @@
> > L{buildslave.bot.BuildSlave} instance created in C{buildbot.tac}.
> > C{max_idle_time} is the maximum time, in seconds, that the slave will
> > be allowed to idle before it is restarted. Similarly,
> > C{max_disconnected_time} is the maximum time it will be allowed to go
> > + without a connection to a master. If C{halt} is true a shutdown will
>
> nit: wrong parameter name
>
> @@ +113,5 @@
> >
> > def maybeReboot(self):
> > + if self._reboot_or_halt and self._halt_on_shutdown:
> > + self.halt()
> > + elif self.reboot_or_halt:
>
> This would read better with two levels of conditional:
>
> if reboot_or_halt:
> if halt_on_shutdown:
> halt
> else:
> reboot
>
> @@ +199,5 @@
> > + def halt(self):
> > + log.msg("Invoking platform-specific shutdown command")
> > +
> > + if sys.platform in ('darwin', 'linux2'):
> > + os.system("sudo -S shutdown -h now < /dev/null")
>
> Is this already in sudoers?
Thanks ! Sending another patch to fix this up.
Assignee | ||
Comment 19•9 years ago
|
||
s/this/the sudoers up/
Assignee | ||
Comment 20•9 years ago
|
||
Just bumps the buildslave version.
Attachment #8682708 -
Flags: review?(dustin)
Assignee | ||
Comment 21•9 years ago
|
||
This will make it so that all non-scl3 machines halt when they idle for 90 minutes, which seems reasonable in a cloud environment. I understand if you object to this patch because of the hard coding. I suppose the only other option would be to modify the db schema and then update everything by hand?
Attachment #8682962 -
Flags: review?(dustin)
Assignee | ||
Comment 22•9 years ago
|
||
(In reply to Morgan Phillips [:mrrrgn] from comment #21)
> Created attachment 8682962 [details] [diff] [review]
> slavealloc-tac.diff
>
> This will make it so that all non-scl3 machines halt when they idle for 90
> minutes, which seems reasonable in a cloud environment. I understand if you
> object to this patch because of the hard coding. I suppose the only other
> option would be to modify the db schema and then update everything by hand?
Okay, I'm just going to add a custom template on second though.
Assignee | ||
Updated•9 years ago
|
Attachment #8682962 -
Flags: review?(dustin)
Assignee | ||
Comment 23•9 years ago
|
||
I've added a new template that will turn on halt on reboot, and decrease the idle timeout from 7 hours down to 90 minutes. You can select it by hand for now to text, then we can apply it to all of the windows hosts whenever you're ready. The template is called: "halt-on-shutdown."
Please check in with me on irc before testing, since I still need to bump the buildslave version in puppet (waiting on review).
Attachment #8682975 -
Flags: feedback?(rthijssen)
Attachment #8682975 -
Flags: feedback?(mcornmesser)
Assignee | ||
Updated•9 years ago
|
Attachment #8682708 -
Flags: review?(dustin) → review?(mcornmesser)
Assignee | ||
Comment 24•9 years ago
|
||
(In reply to Morgan Phillips [:mrrrgn] from comment #23)
> Created attachment 8682975 [details]
> tac-template.png
>
> I've added a new template that will turn on halt on reboot, and decrease the
> idle timeout from 7 hours down to 90 minutes. You can select it by hand for
> now to text, then we can apply it to all of the windows hosts whenever
> you're ready. The template is called: "halt-on-shutdown."
>
> Please check in with me on irc before testing, since I still need to bump
> the buildslave version in puppet (waiting on review).
s/to text//g <- every other post I write is a facepalm :)
Updated•9 years ago
|
Attachment #8682708 -
Flags: review?(mcornmesser) → review+
Updated•9 years ago
|
Attachment #8682975 -
Flags: feedback?(rthijssen) → feedback+
Updated•9 years ago
|
Attachment #8682975 -
Flags: feedback?(mcornmesser) → feedback+
Comment 25•9 years ago
|
||
I'd like to understand what I should look for on the Windows build slaves to validate that idleizer is installed/active.
- Today at 13:00 I disabled all y-2008-spot-* instances in slavealloc. There were 93 running instances of this type in EC2.
- At 18:00 I could see that these slaves were no longer associated with bb masters and none had running jobs. All were still up and running.
I don't know whether we haven't got idleizer installed correctly or what to look for to establish why the instances are still running. Is there a log file I can look at on the instances that might contain clues? There was nothing pertaining to idleizer in c:\opt\runner\runner.log but I don't know if that's the right place to look.
I'm going to go ahead and manually terminate all but one instance. I'll leave y-2008-spot-126 (10.132.64.101) up and running so that if somebody knows what to look for, we can have a look at that instances logs to see why it hasn't idleizer terminated.
Comment 26•9 years ago
|
||
Actually the last lines from runner.log might be relevant:
Error sending notice to nagios (ignored)
Traceback (most recent call last):
File "C:\opt\runner\Scripts\runner-script.py", line 9, in <module>
load_entry_point('runner==2.0', 'console_scripts', 'runner')()
File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 256, in main
runner(config, args.taskdir, args.times)
File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 224, in runner
if not process_taskdir(config, taskdir):
File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 138, in process_taskdir
r = run_task(task_cmd, env, max_time=task_config['max_time'])
File "C:\opt\runner\lib\site-packages\runner\__init__.py", line 23, in run_task
proc = subprocess.Popen(t, stdin=open(os.devnull, 'r'), env=env)
File "C:\mozilla-build\python27\Lib\subprocess.py", line 711, in __init__
errread, errwrite)
File "C:\mozilla-build\python27\Lib\subprocess.py", line 948, in _execute_child
startupinfo)
WindowsError: [Error 193] %1 is not a valid Win32 application
Comment 27•9 years ago
|
||
there's a small typo in the code:
http://hg.mozilla.org/build/buildbot/file/5fa6d3543fc9/slave/buildslave/idleizer.py#l118
def maybeReboot(self):
if self._reboot_or_halt:
if self._halt_on_shutdown:
self.halt()
elif self.reboot_or_halt:
self.reboot()
That last 'elif self.reboot_or_halt' should be changed to just an 'else:' I think. 'self.reboot_or_halt' doesn't exist, and is causing bug 1222915
Comment 28•9 years ago
|
||
Agreed. I think my review comment caused that particular error :(
Assignee | ||
Comment 29•9 years ago
|
||
Pushing a fix this. I'm not touching idleizer anymore.
Assignee | ||
Comment 30•9 years ago
|
||
Nothing makes me want to jump off a cliff quite like the number of bugs I've created touching idleizer. 0/10, hope to not do again.
Attachment #8684994 -
Flags: review?(dustin)
Updated•9 years ago
|
Attachment #8684994 -
Flags: review?(dustin) → review+
Assignee | ||
Comment 31•9 years ago
|
||
Reported working in production.
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → FIXED
Updated•8 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•