Closed
Bug 987158
Opened 11 years ago
Closed 9 years ago
Add Windows support to aws_stop_idle.py
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: jhopkins, Unassigned)
Details
Attachments
(2 files, 3 obsolete files)
|
6.49 KB,
patch
|
bhearsum
:
review+
jhopkins
:
checked-in+
|
Details | Diff | Splinter Review |
|
7.15 KB,
patch
|
jhopkins
:
review+
jhopkins
:
checked-in+
|
Details | Diff | Splinter Review |
Add Windows support to aws_stop_idle.py. May be implemented independently in slaveapi and then called from aws_stop_idle.py.
| Reporter | ||
Comment 1•11 years ago
|
||
Attachment #8397521 -
Flags: review?(bhearsum)
Comment 2•11 years ago
|
||
Comment on attachment 8397521 [details] [diff] [review]
[slaveapi] add 'uptime' action
Review of attachment 8397521 [details] [diff] [review]:
-----------------------------------------------------------------
Seems fine overall, just a few minor points:
::: setup.py
@@ +28,5 @@
> "gevent_subprocess==0.1.1",
> "furl>=0.3.5",
> "orderedmultidict>=0.7.1",
> "pytz>=2013.7",
> + "python-dateutil>=1.5",
Have you checked to see if pytz supports the things you need? If it does, I'd rather not add this as a dependency.
::: slaveapi/actions/buildslave_uptime.py
@@ +26,5 @@
> + match = re.match('Statistics since (.+)', line)
> + if match:
> + _timedelta = datetime.today() - dateutil.parser.parse(match.group(1))
> + return int(_timedelta.total_seconds())
> + return None
Is None a valid return value for this function? I see that you're running "return SUCCESS, get_windows_uptime(...)" later, which means you could be setting success without returning an uptime. Whether it's doing something different here or down there to prevent it, that doesn't seem like a case that you'd want to it.
@@ +77,5 @@
> +
> + if not ping(slave.fqdn):
> + return FAILURE, "%s - Slave is offline, cannot get uptime!" % name
> +
> + is_unix = True
Is something that should go in Slave or another class? I'm fine to keep it here for now, but it's likely to be duplicated later.
::: slaveapi/web/__init__.py
@@ +9,5 @@
> app.add_url_rule("/results", view_func=Results.as_view("results"), methods=["GET"])
> app.add_url_rule("/slaves", view_func=Slaves.as_view("slaves"), methods=["GET"])
> app.add_url_rule("/slaves/<slave>", view_func=Slave.as_view("slave"), methods=["GET"])
> app.add_url_rule("/slaves/<slave>/actions/reboot", view_func=Reboot.as_view("reboot"), methods=["GET", "POST"])
> +app.add_url_rule("/slaves/<slave>/actions/uptime", view_func=Uptime.as_view("uptime"), methods=["GET", "POST"])
This is a bit of a nit, but I think it would be good to call this "get_uptime" or something else with a verb in it. A action whose name is just a noun is a bit weird.
Attachment #8397521 -
Flags: review?(bhearsum) → feedback+
| Reporter | ||
Comment 3•11 years ago
|
||
Addressed review feedback. Note:
- pytz does not provide a date parsing function so I am continuing to use python-dateutil
- i've chosen not to wrap the check_ping call in a separate function
Attachment #8397521 -
Attachment is obsolete: true
Attachment #8399542 -
Flags: review?(bhearsum)
Comment 4•11 years ago
|
||
Comment on attachment 8399542 [details] [diff] [review]
[slaveapi] add get_uptime action
Review of attachment 8399542 [details] [diff] [review]:
-----------------------------------------------------------------
::: slaveapi/web/slave.py
@@ +38,5 @@
> def __init__(self, *args, **kwargs):
> self.action = shutdown_buildslave
> ActionView.__init__(self, *args, **kwargs)
> +
> +class Uptime(ActionView):
Since the endpoint has been renamed, should this be renamed, too? Feel free to change that on landing.
Attachment #8399542 -
Flags: review?(bhearsum) → review+
| Reporter | ||
Comment 5•11 years ago
|
||
Comment on attachment 8399542 [details] [diff] [review]
[slaveapi] add get_uptime action
http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=5c8499468a25aa523513ea5b064a01bedad89c99
Attachment #8399542 -
Flags: checked-in+
Comment 6•11 years ago
|
||
(In reply to John Hopkins (:jhopkins) from comment #5)
> Comment on attachment 8399542 [details] [diff] [review]
> [slaveapi] add get_uptime action
>
> http://git.mozilla.org/?p=build/slaveapi.git;a=commit;
> h=5c8499468a25aa523513ea5b064a01bedad89c99
didn't land right, note you're missing the new file
| Reporter | ||
Comment 7•11 years ago
|
||
Thanks, Callek - fixed.
| Reporter | ||
Comment 8•11 years ago
|
||
Example usage:
$ curl -dwaittime=30 http://cruncher.srv.releng.scl3.mozilla.com:9876/slaves/bld-centos6-hp-003/actions/get_last_activity
Attachment #8403997 -
Flags: review?(bugspam.Callek)
Comment 9•11 years ago
|
||
Comment on attachment 8403997 [details] [diff] [review]
[slaveapi] add get_last_activity
Review of attachment 8403997 [details] [diff] [review]:
-----------------------------------------------------------------
I'm skeptical on the need/format for the many debug()'s sprinkled here, fwiw. That said I won't block on them.
::: slaveapi/actions/buildslave_last_activity.py
@@ +13,5 @@
> +
> +
> +import re
> +import dateutil.parser
> +from datetime import datetime
nit: I like the py-wide includes above the slaveapi ones.
@@ +19,5 @@
> +
> +def buildslave_last_activity(name):
> + """Get the build slave last activity time (in seconds).
> + Other values are "booting" and "stopped".
> + """
There is no talk about the return values (e.g. uptime) how to use/retrieve them, etc.
@@ +30,5 @@
> + return rc, uptime
> + if uptime < 3 * 60:
> + # Assume we're still booting
> + log.debug("%s - uptime is %.2f; assuming we're still booting up", name,
> + uptime)
NIT: you have longer lines than this, so just make uptime) on the same line
@@ +33,5 @@
> + log.debug("%s - uptime is %.2f; assuming we're still booting up", name,
> + uptime)
> + return SUCCESS, { "state": "booting", "last_activity": 0 }
> +
> + console = get_console(slave, usebuildbotslave=False)
you *certainly* want usebuildbotslave=True here. Since you're using the console to check twistd.log
@@ +40,5 @@
> + log.debug("slave.basedir='%s'" % slave.basedir)
> + # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found
> + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })
> + except RemoteCommandError:
> + return FAILURE, "%s - failed to tail twistd.log:" % name
Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88
(e.g. I'd argue "last_state" == booting vs FAIL )
@@ +44,5 @@
> + return FAILURE, "%s - failed to tail twistd.log:" % name
> +
> + last_activity = None
> + running_command = False
> + t = time.time()
t is unused
@@ +57,5 @@
> + # Not sure what to do with this line...
> + continue
> +
> + # uncomment to dump out ALL the lines
> + #log.debug("%s - %s", name, line.strip())
I don't feel this commented-out code is worthy here at this stage.
@@ +69,5 @@
> +
> + if "Shut Down" in line:
> + # Check if this happened before we booted, i.e. we're still booting
> + # up
> + if last_activity_seconds > uptime:
Note that uptime is cached as of when we called it, so we need to base our last_activity_seconds delta *here* based on that moment otherwise this fails to be an accurate measurement. (specifically it would drift last_activity_time more and more as this runs.)
We need to cache time.time() before THIS method calls get_console (because getting a console can be a relatively lengthy wait, e.g. multiple passwords/attempts)
@@ +72,5 @@
> + # up
> + if last_activity_seconds > uptime:
> + log.debug(
> + "%s - shutdown line (%s) is older than uptime (%s); assuming we're still booting %s",
> + name, last_activity_seconds, uptime, line.strip())
nit: typo; ("shutdown line")
@@ +79,5 @@
> + last_state = "stopped"
> + elif "I have a leftover directory" in line:
> + # Ignore this, it doesn't indicate anything
> + continue
> + elif "slave is ready" in line:
I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause?
@@ +84,5 @@
> + if last_activity_seconds < uptime:
> + last_state = "ready"
> + elif running_command:
> + # We're in the middle of running something, so say that our last
> + # activity is now (0 seconds ago)
we're not actually setting the last_activity to 0 here. (maybe intentional, but not sure if thats a good thing. either way the comment is wrong)
::: slaveapi/web/slave.py
@@ +52,5 @@
> +class GetLastActivity(ActionView):
> + """Request the last activity age (in seconds). See
> + :py:class:`slaveapi.web.action_base.ActionView` for details on GET and POST
> + methods. See :py:func:`slaveapi.actions.buildslave_last_activity.buildslave_last_activity`
> + for details on how LastActivity is retrieved."""
This sure does have a __doc__ string, but no edits to docs/ in this patch.
Attachment #8403997 -
Flags: review?(bugspam.Callek) → review-
Comment 10•11 years ago
|
||
Comment on attachment 8403997 [details] [diff] [review]
[slaveapi] add get_last_activity
Review of attachment 8403997 [details] [diff] [review]:
-----------------------------------------------------------------
::: slaveapi/actions/buildslave_last_activity.py
@@ +38,5 @@
> + try:
> + log.debug("%s - tailing twistd.log" % name)
> + log.debug("slave.basedir='%s'" % slave.basedir)
> + # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found
> + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })
ooo additionally, after we have the console's output here, its worth doing a console.disconnect() so we don't hold the socket open during this whole functions run.
@@ +48,5 @@
> + t = time.time()
> + line = ""
> + last_activity_seconds = None
> + last_state = "unknown"
> + for line in output.splitlines():
for niceness of gevent I'd also love a time.sleep(0) here, so that we have the chance to yield to other "threads"
| Reporter | ||
Comment 11•11 years ago
|
||
> Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-> early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88
>
> (e.g. I'd argue "last_state" == booting vs FAIL )
RemoteCommandError should only fire on unexpected problems, like a command timeout or an exception while running the command. It's not based on the exit code of the command run. Does that answer your question?
> I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause?
Yes - I found that a machine with "slave is ready" as the last line could be erroneously considered "running_command" unless I added that clause.
Attachment #8403997 -
Attachment is obsolete: true
Attachment #8404853 -
Flags: review?(bugspam.Callek)
Comment 12•11 years ago
|
||
Comment on attachment 8404853 [details] [diff] [review]
[slaveapi] add get_last_activity
Review of attachment 8404853 [details] [diff] [review]:
-----------------------------------------------------------------
r- ONLY for the unaddressed issue (the last_activity_time drifting as we go along the loop, but the uptime staying static and confusing the checks)
::: slaveapi/actions/buildslave_last_activity.py
@@ +31,5 @@
> + rc, uptime = buildslave_uptime(name)
> + if rc != SUCCESS:
> + return rc, uptime
> + t = time.time()
> + time.sleep(0)
not really worried about time.sleep() on this line
@@ +37,5 @@
> + # Assume we're still booting
> + log.debug("%s - uptime is %.2f; assuming we're still booting up", name, uptime)
> + return SUCCESS, { "state": "booting", "last_activity": 0 }
> +
> + console = get_console(slave, usebuildbotslave=False)
usebuildbotslave=True (as per our IRC convo on it)
@@ +46,5 @@
> + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })
> + except RemoteCommandError:
> + return FAILURE, "%s - failed to tail twistd.log:" % name
> + console.disconnect()
> + time.sleep(0)
not really worried about time.sleep() here either
@@ +55,5 @@
> + running_command = False
> + line = ""
> + last_activity_seconds = None
> + last_state = "unknown"
> + for line in output.splitlines():
time.sleep was more for within this for loop, so the parsing of the 100 twistd.log lines can yield.
@@ +72,5 @@
> + running_command = False
> +
> + if "Shut Down" in line:
> + # Check if this happened before we booted, i.e. we're still booting up
> + if last_activity_seconds > uptime:
Unaddressed:
(In reply to Justin Wood (:Callek) from comment #9)
> Note that uptime is cached as of when we called it, so we need to base our
> last_activity_seconds delta *here* based on that moment otherwise this fails
> to be an accurate measurement. (specifically it would drift
> last_activity_time more and more as this runs.)
>
> We need to cache time.time() before THIS method calls get_console (because
> getting a console can be a relatively lengthy wait, e.g. multiple
> passwords/attempts)
Specifically see http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#132
Attachment #8404853 -
Flags: review?(bugspam.Callek) → review-
Comment 13•11 years ago
|
||
Exact same patch as attachment 8404853 [details] [diff] [review] -- except adjusted to account for my concerns
(done by me as per IRC convo)
Attachment #8404853 -
Attachment is obsolete: true
Attachment #8405456 -
Flags: review?(jhopkins)
| Reporter | ||
Updated•11 years ago
|
Attachment #8405456 -
Flags: review?(jhopkins) → review+
| Reporter | ||
Updated•11 years ago
|
Attachment #8405456 -
Flags: checked-in+
| Reporter | ||
Updated•11 years ago
|
Assignee: jhopkins → george.miroshnykov
Comment 14•11 years ago
|
||
Does this bug block win64-on-hardware, or only win64-on-VMs?
Flags: needinfo?(hwine)
Updated•11 years ago
|
Assignee: gmiroshnykov → nobody
Comment 16•9 years ago
|
||
catlee: is this still needed? Are we handling this some other way in AWS?
Flags: needinfo?(catlee)
Comment 17•9 years ago
|
||
I think the machines shut themselves down; this isn't needed on Windows.
Flags: needinfo?(catlee)
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Comment 18•8 years ago
|
||
There's some error case which leaves windows spot instances that are just sitting around:
use1 - b-2008 - 25
use1 - y-2008 - 42
uws2 - b-2008 - 81
usw2 - y-2008 - 84
Plus similar numbers of t-w732/g-w732. The few build workers I looked at didn't have a terminal open; they were mostly launched January 4th or February 11th (last TCW).
It would be good to make sure it's on buildduty's list of jobs to do once or twice a week, and/or modify our plans TCW to account for these zombies.
Comment 19•8 years ago
|
||
Checked this week for these spot instances that are siting around on both use1 and usw2 but only found 1 instance from 14 January that needed to be terminated, the rest have the "Launch time" no older than 20 February,maybe someone already made the cleaning.Anyway we will keep a close watch on them.
| Assignee | ||
Updated•8 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•