Add Windows support to aws_stop_idle.py

RESOLVED WONTFIX

Status

Release Engineering
General
RESOLVED WONTFIX
3 years ago
3 months ago

People

(Reporter: jhopkins, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 3 obsolete attachments)

(Reporter)

Description

3 years ago
Add Windows support to aws_stop_idle.py.  May be implemented independently in slaveapi and then called from aws_stop_idle.py.
(Reporter)

Comment 1

3 years ago
Created attachment 8397521 [details] [diff] [review]
[slaveapi] add 'uptime' action
Attachment #8397521 - Flags: review?(bhearsum)
Comment on attachment 8397521 [details] [diff] [review]
[slaveapi] add 'uptime' action

Review of attachment 8397521 [details] [diff] [review]:
-----------------------------------------------------------------

Seems fine overall, just a few minor points:

::: setup.py
@@ +28,5 @@
>          "gevent_subprocess==0.1.1",
>          "furl>=0.3.5",
>          "orderedmultidict>=0.7.1",
>          "pytz>=2013.7",
> +        "python-dateutil>=1.5",

Have you checked to see if pytz supports the things you need? If it does, I'd rather not add this as a dependency.

::: slaveapi/actions/buildslave_uptime.py
@@ +26,5 @@
> +        match = re.match('Statistics since (.+)', line)
> +        if match:
> +            _timedelta = datetime.today() - dateutil.parser.parse(match.group(1))
> +            return int(_timedelta.total_seconds())
> +    return None

Is None a valid return value for this function? I see that you're running "return SUCCESS, get_windows_uptime(...)" later, which means you could be setting success without returning an uptime. Whether it's doing something different here or down there to prevent it, that doesn't seem like a case that you'd want to it.

@@ +77,5 @@
> +
> +    if not ping(slave.fqdn):
> +        return FAILURE, "%s - Slave is offline, cannot get uptime!" % name
> +
> +    is_unix = True

Is something that should go in Slave or another class? I'm fine to keep it here for now, but it's likely to be duplicated later.

::: slaveapi/web/__init__.py
@@ +9,5 @@
>  app.add_url_rule("/results", view_func=Results.as_view("results"), methods=["GET"])
>  app.add_url_rule("/slaves", view_func=Slaves.as_view("slaves"), methods=["GET"])
>  app.add_url_rule("/slaves/<slave>", view_func=Slave.as_view("slave"), methods=["GET"])
>  app.add_url_rule("/slaves/<slave>/actions/reboot", view_func=Reboot.as_view("reboot"), methods=["GET", "POST"])
> +app.add_url_rule("/slaves/<slave>/actions/uptime", view_func=Uptime.as_view("uptime"), methods=["GET", "POST"])

This is a bit of a nit, but I think it would be good to call this "get_uptime" or something else with a verb in it. A action whose name is just a noun is a bit weird.
Attachment #8397521 - Flags: review?(bhearsum) → feedback+
(Reporter)

Comment 3

3 years ago
Created attachment 8399542 [details] [diff] [review]
[slaveapi] add get_uptime action

Addressed review feedback.  Note:
- pytz does not provide a date parsing function so I am continuing to use python-dateutil
- i've chosen not to wrap the check_ping call in a separate function
Attachment #8397521 - Attachment is obsolete: true
Attachment #8399542 - Flags: review?(bhearsum)
Comment on attachment 8399542 [details] [diff] [review]
[slaveapi] add get_uptime action

Review of attachment 8399542 [details] [diff] [review]:
-----------------------------------------------------------------

::: slaveapi/web/slave.py
@@ +38,5 @@
>      def __init__(self, *args, **kwargs):
>          self.action = shutdown_buildslave
>          ActionView.__init__(self, *args, **kwargs)
> +
> +class Uptime(ActionView):

Since the endpoint has been renamed, should this be renamed, too? Feel free to change that on landing.
Attachment #8399542 - Flags: review?(bhearsum) → review+
(Reporter)

Comment 5

3 years ago
Comment on attachment 8399542 [details] [diff] [review]
[slaveapi] add get_uptime action

http://git.mozilla.org/?p=build/slaveapi.git;a=commit;h=5c8499468a25aa523513ea5b064a01bedad89c99
Attachment #8399542 - Flags: checked-in+
(In reply to John Hopkins (:jhopkins) from comment #5)
> Comment on attachment 8399542 [details] [diff] [review]
> [slaveapi] add get_uptime action
> 
> http://git.mozilla.org/?p=build/slaveapi.git;a=commit;
> h=5c8499468a25aa523513ea5b064a01bedad89c99

didn't land right, note you're missing the new file
(Reporter)

Comment 7

3 years ago
Thanks, Callek - fixed.
(Reporter)

Comment 8

3 years ago
Created attachment 8403997 [details] [diff] [review]
[slaveapi] add get_last_activity

Example usage:

$ curl -dwaittime=30 http://cruncher.srv.releng.scl3.mozilla.com:9876/slaves/bld-centos6-hp-003/actions/get_last_activity
Attachment #8403997 - Flags: review?(bugspam.Callek)
Comment on attachment 8403997 [details] [diff] [review]
[slaveapi] add get_last_activity

Review of attachment 8403997 [details] [diff] [review]:
-----------------------------------------------------------------

I'm skeptical on the need/format for the many debug()'s sprinkled here, fwiw. That said I won't block on them.

::: slaveapi/actions/buildslave_last_activity.py
@@ +13,5 @@
> +
> +
> +import re
> +import dateutil.parser
> +from datetime import datetime

nit: I like the py-wide includes above the slaveapi ones.

@@ +19,5 @@
> +
> +def buildslave_last_activity(name):
> +    """Get the build slave last activity time (in seconds).
> +    Other values are "booting" and "stopped".
> +    """

There is no talk about the return values (e.g. uptime) how to use/retrieve them, etc.

@@ +30,5 @@
> +        return rc, uptime
> +    if uptime < 3 * 60:
> +        # Assume we're still booting
> +        log.debug("%s - uptime is %.2f; assuming we're still booting up", name,
> +                  uptime)

NIT: you have longer lines than this, so just make uptime) on the same line

@@ +33,5 @@
> +        log.debug("%s - uptime is %.2f; assuming we're still booting up", name,
> +                  uptime)
> +        return SUCCESS, { "state": "booting", "last_activity": 0 }
> +
> +    console = get_console(slave, usebuildbotslave=False)

you *certainly* want usebuildbotslave=True here. Since you're using the console to check twistd.log

@@ +40,5 @@
> +        log.debug("slave.basedir='%s'" % slave.basedir)
> +        # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found
> +        rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })
> +    except RemoteCommandError:
> +        return FAILURE, "%s - failed to tail twistd.log:" % name

Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88

(e.g. I'd argue "last_state" == booting vs FAIL )

@@ +44,5 @@
> +        return FAILURE, "%s - failed to tail twistd.log:" % name
> +
> +    last_activity = None
> +    running_command = False
> +    t = time.time()

t is unused

@@ +57,5 @@
> +            # Not sure what to do with this line...
> +            continue
> +
> +        # uncomment to dump out ALL the lines
> +        #log.debug("%s - %s", name, line.strip())

I don't feel this commented-out code is worthy here at this stage.

@@ +69,5 @@
> +
> +        if "Shut Down" in line:
> +            # Check if this happened before we booted, i.e. we're still booting
> +            # up
> +            if last_activity_seconds > uptime:

Note that uptime is cached as of when we called it, so we need to base our last_activity_seconds delta *here* based on that moment otherwise this fails to be an accurate measurement. (specifically it would drift last_activity_time more and more as this runs.)

We need to cache time.time() before THIS method calls get_console (because getting a console can be a relatively lengthy wait, e.g. multiple passwords/attempts)

@@ +72,5 @@
> +            # up
> +            if last_activity_seconds > uptime:
> +                log.debug(
> +                    "%s - shutdown line (%s) is older than uptime (%s); assuming we're still booting %s",
> +                    name, last_activity_seconds, uptime, line.strip())

nit: typo; ("shutdown line")

@@ +79,5 @@
> +                last_state = "stopped"
> +        elif "I have a leftover directory" in line:
> +            # Ignore this, it doesn't indicate anything
> +            continue
> +        elif "slave is ready" in line:

I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause?

@@ +84,5 @@
> +            if last_activity_seconds < uptime:
> +                last_state = "ready"
> +        elif running_command:
> +            # We're in the middle of running something, so say that our last
> +            # activity is now (0 seconds ago)

we're not actually setting the last_activity to 0 here. (maybe intentional, but not sure if thats a good thing. either way the comment is wrong)

::: slaveapi/web/slave.py
@@ +52,5 @@
> +class GetLastActivity(ActionView):
> +    """Request the last activity age (in seconds).  See
> +    :py:class:`slaveapi.web.action_base.ActionView` for details on GET and POST
> +    methods. See :py:func:`slaveapi.actions.buildslave_last_activity.buildslave_last_activity`
> +    for details on how LastActivity is retrieved."""

This sure does have a __doc__ string, but no edits to docs/ in this patch.
Attachment #8403997 - Flags: review?(bugspam.Callek) → review-
Comment on attachment 8403997 [details] [diff] [review]
[slaveapi] add get_last_activity

Review of attachment 8403997 [details] [diff] [review]:
-----------------------------------------------------------------

::: slaveapi/actions/buildslave_last_activity.py
@@ +38,5 @@
> +    try:
> +        log.debug("%s - tailing twistd.log" % name)
> +        log.debug("slave.basedir='%s'" % slave.basedir)
> +        # we'll disregard the return code b/c it will be non-zero if twistd.log.1 is not found
> +        rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })

ooo additionally, after we have the console's output here, its worth doing a console.disconnect() so we don't hold the socket open during this whole functions run.

@@ +48,5 @@
> +    t = time.time()
> +    line = ""
> +    last_activity_seconds = None
> +    last_state = "unknown"
> +    for line in output.splitlines():

for niceness of gevent I'd also love a time.sleep(0) here, so that we have the chance to yield to other "threads"
(Reporter)

Comment 11

3 years ago
Created attachment 8404853 [details] [diff] [review]
[slaveapi] add get_last_activity

> Is this worth it if we get a successful uptime above? -- especially because the reference implementation doesn't fail-> early in this case? http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88
>
> (e.g. I'd argue "last_state" == booting vs FAIL )

RemoteCommandError should only fire on unexpected problems, like a command timeout or an exception while running the command.  It's not based on the exit code of the command run.  Does that answer your question?

> I don't see this in http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#88 --- is this a better thing than the previous "else:" clause?

Yes - I found that a machine with "slave is ready" as the last line could be erroneously considered "running_command" unless I added that clause.
Attachment #8403997 - Attachment is obsolete: true
Attachment #8404853 - Flags: review?(bugspam.Callek)
Comment on attachment 8404853 [details] [diff] [review]
[slaveapi] add get_last_activity

Review of attachment 8404853 [details] [diff] [review]:
-----------------------------------------------------------------

r- ONLY for the unaddressed issue (the last_activity_time drifting as we go along the loop, but the uptime staying static and confusing the checks)

::: slaveapi/actions/buildslave_last_activity.py
@@ +31,5 @@
> +    rc, uptime = buildslave_uptime(name)
> +    if rc != SUCCESS:
> +        return rc, uptime
> +    t = time.time()
> +    time.sleep(0)

not really worried about time.sleep() on this line

@@ +37,5 @@
> +        # Assume we're still booting
> +        log.debug("%s - uptime is %.2f; assuming we're still booting up", name, uptime)
> +        return SUCCESS, { "state": "booting", "last_activity": 0 }
> +
> +    console = get_console(slave, usebuildbotslave=False)

usebuildbotslave=True (as per our IRC convo on it)

@@ +46,5 @@
> +        rc, output = console.run_cmd("tail -n 100 %(basedir)s/twistd.log.1 %(basedir)s/twistd.log" % { 'basedir': slave.basedir })
> +    except RemoteCommandError:
> +        return FAILURE, "%s - failed to tail twistd.log:" % name
> +    console.disconnect()
> +    time.sleep(0)

not really worried about time.sleep() here either

@@ +55,5 @@
> +    running_command = False
> +    line = ""
> +    last_activity_seconds = None
> +    last_state = "unknown"
> +    for line in output.splitlines():

time.sleep was more for within this for loop, so the parsing of the 100 twistd.log lines can yield.

@@ +72,5 @@
> +            running_command = False
> +
> +        if "Shut Down" in line:
> +            # Check if this happened before we booted, i.e. we're still booting up
> +            if last_activity_seconds > uptime:

Unaddressed:
(In reply to Justin Wood (:Callek) from comment #9)
> Note that uptime is cached as of when we called it, so we need to base our
> last_activity_seconds delta *here* based on that moment otherwise this fails
> to be an accurate measurement. (specifically it would drift
> last_activity_time more and more as this runs.)
> 
> We need to cache time.time() before THIS method calls get_console (because
> getting a console can be a relatively lengthy wait, e.g. multiple
> passwords/attempts)

Specifically see http://mxr.mozilla.org/build/source/cloud-tools/scripts/aws_stop_idle.py#132
Attachment #8404853 - Flags: review?(bugspam.Callek) → review-
Created attachment 8405456 [details] [diff] [review]
[slaveapi] v3 - adjusted with my concerns

Exact same patch as attachment 8404853 [details] [diff] [review] -- except adjusted to account for my concerns

(done by me as per IRC convo)
Attachment #8404853 - Attachment is obsolete: true
Attachment #8405456 - Flags: review?(jhopkins)
(Reporter)

Updated

3 years ago
Attachment #8405456 - Flags: review?(jhopkins) → review+
(Reporter)

Updated

3 years ago
Attachment #8405456 - Flags: checked-in+
(Reporter)

Updated

3 years ago
Assignee: jhopkins → george.miroshnykov
Does this bug block win64-on-hardware, or only win64-on-VMs?
Flags: needinfo?(hwine)
Flags: needinfo?(hwine) → needinfo?(catlee)
QA Contact: hwine → catlee
Only blocks win64-on-VMs.
Flags: needinfo?(catlee)
No longer blocks: 880004
Assignee: gmiroshnykov → nobody

Comment 16

9 months ago
catlee: is this still needed? Are we handling this some other way in AWS?
Flags: needinfo?(catlee)
I think the machines shut themselves down; this isn't needed on Windows.
Flags: needinfo?(catlee)

Updated

9 months ago
Status: NEW → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → WONTFIX
There's some error case which leaves windows spot instances that are just sitting around:

use1 - b-2008 - 25
use1 - y-2008 - 42
uws2 - b-2008 - 81
usw2 - y-2008 - 84

Plus similar numbers of t-w732/g-w732. The few build workers I looked at didn't have a terminal open; they were mostly launched January 4th or February 11th (last TCW).

It would be good to make sure it's on buildduty's list of jobs to do once or twice a week, and/or modify our plans TCW to account for these zombies.
Checked this week for these spot instances that are siting around on both use1 and usw2 but only found 1 instance from 14 January that needed to be terminated, the rest have the "Launch time" no older than 20 February,maybe someone already made the cleaning.Anyway we will keep a close watch on them.
(Assignee)

Updated

3 months ago
Component: Tools → General
Product: Release Engineering → Release Engineering
You need to log in before you can comment on or make changes to this bug.