Closed Bug 987158 Opened 11 years ago Closed 9 years ago

Add Windows support to aws_stop_idle.py

Categories

(Release Engineering :: General, defect)

x86
Windows Server 2008
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jhopkins, Unassigned)

Details

Attachments

(2 files, 3 obsolete files)

Add Windows support to aws_stop_idle.py. May be implemented independently in slaveapi and then called from aws_stop_idle.py.
Attached patch [slaveapi] add 'uptime' action (obsolete) — Splinter Review
Attachment #8397521 - Flags: review?(bhearsum)
Comment on attachment 8397521 [details] [diff] [review] [slaveapi] add 'uptime' action Review of attachment 8397521 [details] [diff] [review]: ----------------------------------------------------------------- Seems fine overall, just a few minor points: ::: setup.py @@ +28,5 @@ > "gevent_subprocess==0.1.1", > "furl>=0.3.5", > "orderedmultidict>=0.7.1", > "pytz>=2013.7", > + "python-dateutil>=1.5", Have you checked to see if pytz supports the things you need? If it does, I'd rather not add this as a dependency. ::: slaveapi/actions/buildslave_uptime.py @@ +26,5 @@ > + match = re.match('Statistics since (.+)', line) > + if match: > + _timedelta = datetime.today() - dateutil.parser.parse(match.group(1)) > + return int(_timedelta.total_seconds()) > + return None Is None a valid return value for this function? I see that you're running "return SUCCESS, get_windows_uptime(...)" later, which means you could be setting success without returning an uptime. Whether it's doing something different here or down there to prevent it, that doesn't seem like a case that you'd want to it. @@ +77,5 @@ > + > + if not ping(slave.fqdn): > + return FAILURE, "%s - Slave is offline, cannot get uptime!" % name > + > + is_unix = True Is something that should go in Slave or another class? I'm fine to keep it here for now, but it's likely to be duplicated later. ::: slaveapi/web/__init__.py @@ +9,5 @@ > app.add_url_rule("/results", view_func=Results.as_view("results"), methods=["GET"]) > app.add_url_rule("/slaves", view_func=Slaves.as_view("slaves"), methods=["GET"]) > app.add_url_rule("/slaves/<slave>", view_func=Slave.as_view("slave"), methods=["GET"]) > app.add_url_rule("/slaves/<slave>/actions/reboot", view_func=Reboot.as_view("reboot"), methods=["GET", "POST"]) > +app.add_url_rule("/slaves/<slave>/actions/uptime", view_func=Uptime.as_view("uptime"), methods=["GET", "POST"]) This is a bit of a nit, but I think it would be good to call this "get_uptime" or something else with a verb in it. A action whose name is just a noun is a bit weird.
Attachment #8397521 - Flags: review?(bhearsum) → feedback+
Addressed review feedback. Note: - pytz does not provide a date parsing function so I am continuing to use python-dateutil - i've chosen not to wrap the check_ping call in a separate function
Attachment #8397521 - Attachment is obsolete: true
Attachment #8399542 - Flags: review?(bhearsum)
Comment on attachment 8399542 [details] [diff] [review] [slaveapi] add get_uptime action Review of attachment 8399542 [details] [diff] [review]: ----------------------------------------------------------------- ::: slaveapi/web/slave.py @@ +38,5 @@ > def __init__(self, *args, **kwargs): > self.action = shutdown_buildslave > ActionView.__init__(self, *args, **kwargs) > + > +class Uptime(ActionView): Since the endpoint has been renamed, should this be renamed, too? Feel free to change that on landing.
Attachment #8399542 - Flags: review?(bhearsum) → review+
(In reply to John Hopkins (:jhopkins) from comment #5) > Comment on attachment 8399542 [details] [diff] [review] > [slaveapi] add get_uptime action > > http://git.mozilla.org/?p=build/slaveapi.git;a=commit; > h=5c8499468a25aa523513ea5b064a01bedad89c99 didn't land right, note you're missing the new file
Thanks, Callek - fixed.
Comment on attachment 8403997 [details] [diff] [review] [slaveapi] add get_last_activity Review of attachment 8403997 [details] [diff] [review]: ----------------------------------------------------------------- I'm skeptical on the need/format for the many debug()'s sprinkled here, fwiw. That said I won't block on them. ::: slaveapi/actions/buildslave_last_activity.py @@ +13,5 @@ > + > + > +import re > +import dateutil.parser > +from datetime import datetime nit: I like the py-wide includes above the slaveapi ones. @@ +19,5 @@ > + > +def buildslave_last_activity(name): > + """Get the build slave last activity time (in seconds). > + Other values are "booting" and "stopped". > + """ There is no talk about the return values (e.g. uptime) how to use/retrieve them, etc. @@ +30,5 @@ > + return rc, uptime > + if uptime < 3 * 60: > + # Assume we're still booting > + log.debug("%s - uptime is %.2f; assuming we're still booting up", name, > + uptime) NIT: you have longer lines than this, so just make uptime) on the same line @@ +33,5 @@ > + log.debug("%s - uptime is %.2f; assuming we're still booting up", name, > + uptime) > + return SUCCESS, { "state": "booting", "last_activity": 0 } > + > + console = get_console(slave, usebuildbotslave=False) you *certainly* want usebuildbotslave=True here. Since you're using the console to check twistd.log @@ +40,5 @@ > + log.debug("slave.basedir='%s'" % slave.basedir) > + # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found > + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir }) > + except RemoteCommandError: > + return FAILURE, "%s - failed to tail twistd.log:" % name Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 (e.g. I'd argue "last_state" == booting vs FAIL ) @@ +44,5 @@ > + return FAILURE, "%s - failed to tail twistd.log:" % name > + > + last_activity = None > + running_command = False > + t = time.time() t is unused @@ +57,5 @@ > + # Not sure what to do with this line... > + continue > + > + # uncomment to dump out ALL the lines > + #log.debug("%s - %s", name, line.strip()) I don't feel this commented-out code is worthy here at this stage. @@ +69,5 @@ > + > + if "Shut Down" in line: > + # Check if this happened before we booted, i.e. we're still booting > + # up > + if last_activity_seconds > uptime: Note that uptime is cached as of when we called it, so we need to base our last_activity_seconds delta *here* based on that moment otherwise this fails to be an accurate measurement. (specifically it would drift last_activity_time more and more as this runs.) We need to cache time.time() before THIS method calls get_console (because getting a console can be a relatively lengthy wait, e.g. multiple passwords/attempts) @@ +72,5 @@ > + # up > + if last_activity_seconds > uptime: > + log.debug( > + "%s - shutdown line (%s) is older than uptime (%s); assuming we're still booting %s", > + name, last_activity_seconds, uptime, line.strip()) nit: typo; ("shutdown line") @@ +79,5 @@ > + last_state = "stopped" > + elif "I have a leftover directory" in line: > + # Ignore this, it doesn't indicate anything > + continue > + elif "slave is ready" in line: I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause? @@ +84,5 @@ > + if last_activity_seconds < uptime: > + last_state = "ready" > + elif running_command: > + # We're in the middle of running something, so say that our last > + # activity is now (0 seconds ago) we're not actually setting the last_activity to 0 here. (maybe intentional, but not sure if thats a good thing. either way the comment is wrong) ::: slaveapi/web/slave.py @@ +52,5 @@ > +class GetLastActivity(ActionView): > + """Request the last activity age (in seconds). See > + :py:class:`slaveapi.web.action_base.ActionView` for details on GET and POST > + methods. See :py:func:`slaveapi.actions.buildslave_last_activity.buildslave_last_activity` > + for details on how LastActivity is retrieved.""" This sure does have a __doc__ string, but no edits to docs/ in this patch.
Attachment #8403997 - Flags: review?(bugspam.Callek) → review-
Comment on attachment 8403997 [details] [diff] [review] [slaveapi] add get_last_activity Review of attachment 8403997 [details] [diff] [review]: ----------------------------------------------------------------- ::: slaveapi/actions/buildslave_last_activity.py @@ +38,5 @@ > + try: > + log.debug("%s - tailing twistd.log" % name) > + log.debug("slave.basedir='%s'" % slave.basedir) > + # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found > + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir }) ooo additionally, after we have the console's output here, its worth doing a console.disconnect() so we don't hold the socket open during this whole functions run. @@ +48,5 @@ > + t = time.time() > + line = "" > + last_activity_seconds = None > + last_state = "unknown" > + for line in output.splitlines(): for niceness of gevent I'd also love a time.sleep(0) here, so that we have the chance to yield to other "threads"
Attached patch [slaveapi] add get_last_activity (obsolete) — Splinter Review
> Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-> early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 > > (e.g. I'd argue "last_state" == booting vs FAIL ) RemoteCommandError should only fire on unexpected problems, like a command timeout or an exception while running the command. It's not based on the exit code of the command run. Does that answer your question? > I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause? Yes - I found that a machine with "slave is ready" as the last line could be erroneously considered "running_command" unless I added that clause.
Attachment #8403997 - Attachment is obsolete: true
Attachment #8404853 - Flags: review?(bugspam.Callek)
Comment on attachment 8404853 [details] [diff] [review] [slaveapi] add get_last_activity Review of attachment 8404853 [details] [diff] [review]: ----------------------------------------------------------------- r- ONLY for the unaddressed issue (the last_activity_time drifting as we go along the loop, but the uptime staying static and confusing the checks) ::: slaveapi/actions/buildslave_last_activity.py @@ +31,5 @@ > + rc, uptime = buildslave_uptime(name) > + if rc != SUCCESS: > + return rc, uptime > + t = time.time() > + time.sleep(0) not really worried about time.sleep() on this line @@ +37,5 @@ > + # Assume we're still booting > + log.debug("%s - uptime is %.2f; assuming we're still booting up", name, uptime) > + return SUCCESS, { "state": "booting", "last_activity": 0 } > + > + console = get_console(slave, usebuildbotslave=False) usebuildbotslave=True (as per our IRC convo on it) @@ +46,5 @@ > + rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir }) > + except RemoteCommandError: > + return FAILURE, "%s - failed to tail twistd.log:" % name > + console.disconnect() > + time.sleep(0) not really worried about time.sleep() here either @@ +55,5 @@ > + running_command = False > + line = "" > + last_activity_seconds = None > + last_state = "unknown" > + for line in output.splitlines(): time.sleep was more for within this for loop, so the parsing of the 100 twistd.log lines can yield. @@ +72,5 @@ > + running_command = False > + > + if "Shut Down" in line: > + # Check if this happened before we booted, i.e. we're still booting up > + if last_activity_seconds > uptime: Unaddressed: (In reply to Justin Wood (:Callek) from comment #9) > Note that uptime is cached as of when we called it, so we need to base our > last_activity_seconds delta *here* based on that moment otherwise this fails > to be an accurate measurement. (specifically it would drift > last_activity_time more and more as this runs.) > > We need to cache time.time() before THIS method calls get_console (because > getting a console can be a relatively lengthy wait, e.g. multiple > passwords/attempts) Specifically see http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#132
Attachment #8404853 - Flags: review?(bugspam.Callek) → review-
Exact same patch as attachment 8404853 [details] [diff] [review] -- except adjusted to account for my concerns (done by me as per IRC convo)
Attachment #8404853 - Attachment is obsolete: true
Attachment #8405456 - Flags: review?(jhopkins)
Attachment #8405456 - Flags: review?(jhopkins) → review+
Attachment #8405456 - Flags: checked-in+
Assignee: jhopkins → george.miroshnykov
Does this bug block win64-on-hardware, or only win64-on-VMs?
Flags: needinfo?(hwine)
Flags: needinfo?(hwine) → needinfo?(catlee)
QA Contact: hwine → catlee
Only blocks win64-on-VMs.
Flags: needinfo?(catlee)
No longer blocks: 880004
Assignee: gmiroshnykov → nobody
catlee: is this still needed? Are we handling this some other way in AWS?
Flags: needinfo?(catlee)
I think the machines shut themselves down; this isn't needed on Windows.
Flags: needinfo?(catlee)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
There's some error case which leaves windows spot instances that are just sitting around: use1 - b-2008 - 25 use1 - y-2008 - 42 uws2 - b-2008 - 81 usw2 - y-2008 - 84 Plus similar numbers of t-w732/g-w732. The few build workers I looked at didn't have a terminal open; they were mostly launched January 4th or February 11th (last TCW). It would be good to make sure it's on buildduty's list of jobs to do once or twice a week, and/or modify our plans TCW to account for these zombies.
Checked this week for these spot instances that are siting around on both use1 and usw2 but only found 1 instance from 14 January that needed to be terminated, the rest have the "Launch time" no older than 20 February,maybe someone already made the cleaning.Anyway we will keep a close watch on them.
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: