Closed Bug 962190 Opened 8 years ago Closed 8 years ago

aws sanity check shouldn't report instances that are actively doing work as long running

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: massimo)

Details

Attachments

(2 files, 1 obsolete file)

It seems to get confused about spot instances that are deleted and then a new one created with the same name. eg, the sanity check e-mail will report it as up for 20 hours, but when I log in I see less than an hour of uptime, and a job running.

This is dangerous, because if someone shuts it down without thinking, it will interrupt a job.
I think the report is still correct, but the message is confusing. The script doesn't use uptime(1)'s output, instead it uses instances' launch time, see http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/aws_sanity_checker.py#l100
(In reply to Rail Aliiev [:rail] from comment #1)
> I think the report is still correct, but the message is confusing. The
> script doesn't use uptime(1)'s output, instead it uses instances' launch
> time, see
> http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/
> aws_sanity_checker.py#l100

OK, that's good to know. Rewording the summary.

Maybe the report generator can look to buildapi to see if a long running instance has recently taken a job? Or maybe looking at its uptime would be better? Not sure about the latter because I think slave's will reboot on their own if they haven't taken a job in awhile...
Summary: aws sanity check tells lies about spot instances → aws sanity check shouldn't report instances that are actively doing work as long running
Assignee: nobody → mgervasini
Attached patch 962190.patch (obsolete) — Splinter Review
Hi,

This patch updates the aws_sanity_checker.py output. For long running slaves it reports the time of the last completed job (from buildapi).

Instances that need special care are at the very top of the report.

I have created a support module aws_slaves.py so the report can be extended in the future to include data from other source as slavehealt/slaveapi/... 
Other changes are in the aws_sanity_checker.py script, everything is managed by the generate_report method and the output can be updated and extended easily (we could generate an html report for example)
Attachment #8374142 - Flags: review?(rail)
Comment on attachment 8374142 [details] [diff] [review]
962190.patch

Review of attachment 8374142 [details] [diff] [review]:
-----------------------------------------------------------------

Almost there! I like the patch, thanks for organizing this braindump. :)

In overall the patch looks good. The only blocker for me is the fact that it doesn't time out HTTP requests and may sit forever blocked. And some nits.

::: aws/aws_slaves.py
@@ +9,5 @@
> +LOG = logging.getLogger(__name__)
> +LOG.setLevel(logging.INFO)
> +BUILDAPI_URL_JSON = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}?format=json"
> +BUILDAPI_URL = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}"
> +SLAVEHEALTH_URL = "https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class={slave_class}&type={slave_type}&name={slave_name}"

This is unused, can be deleted.

@@ +95,5 @@
> +            return self.last_job_endtime
> +        url = self.get_buildapi_json_url()
> +        endtime = self.now
> +        try:
> +            json_data = urllib2.urlopen(url)

Adding timeout here is a good idea. Feel free to use requests module instead.

@@ +138,5 @@
> +
> +    def is_loaned(self):
> +        """returns True if the slave is loaned"""
> +        # not implemented yet
> +        return self.get_loaned_string()

Not implemented? :) The following should behave exactly how the docstring says.

return self.get_loaned_string() is not None
Attachment #8374142 - Flags: review?(rail) → review-
Attached patch 962190.patchSplinter Review
Hi Rail,

sorry for another big patch.

This patch fixes the problems with the old patch (new timeout for connections defaults to 60s) and creates a new generic class, Instance,  which is the base class for Slave and Master. I have added this extra abstraction because it might me useful for bug 897002 and any follow ups.
Attachment #8374142 - Attachment is obsolete: true
Attachment #8374978 - Flags: review?(rail)
Comment on attachment 8374978 [details] [diff] [review]
962190.patch

Review of attachment 8374978 [details] [diff] [review]:
-----------------------------------------------------------------

In overall it looks good. Good progress!

However, there are one thing regarding the "impaired" status. See my comments below.

::: aws/aws_instances.py
@@ +96,5 @@
> +        return self._get_state() == 'stopped'
> +
> +    def is_impaired(self):
> +        """returns True if instance is impaired"""
> +        return self._get_state() == 'impaired'

This won't work because "impaired" is status, not state. You have to query that explicitly, see http://hg.mozilla.org/build/cloud-tools/file/b08c1cdf88a8/aws/aws_stop_idle.py#l280 as an example. You probably shouldn't call this more than once per region to be faster.

This status is not something that makes you 100% sure that the instance is down. In aws_stop_idle.py we use it only when ssh doesn't work.

@@ +194,5 @@
> +
> +
> +
> +class Master(Instance):
> +    """AWS master"""

Hmm, this class does noting... Do you really need it?
Attachment #8374978 - Flags: review?(rail) → review-
Attached patch 962190.patchSplinter Review
Hi rail, thanks for the review.

This patch updates the is_imparied() method.

A note about this new patch; there are few methods that are not currently not used (e.g. *_message) and some constant (e.g. EXPECTED_MAX_UPTIME) are both in aws_instance.py and aws_sanity_checker.py 
I plan to remove the duplicate code and make full use of the methods in a following patch, this patch is an intermediate step to make aws_sanity_checker more flexible.
Attachment #8377598 - Flags: review?(rail)
Comment on attachment 8377598 [details] [diff] [review]
962190.patch

Review of attachment 8377598 [details] [diff] [review]:
-----------------------------------------------------------------

Great! Just a couple of nits before you push this:

::: aws/aws_instances.py
@@ +165,5 @@
> +    def is_stale(self):
> +        """returns True if a running instance is running for a longer time than
> +           expected (EXPECTED_MAX_UPTIME); for stopped instances, returns True
> +           if the instance is stopped for more than EXPECTED_MAX_DOWNTIME
> +           Loaned instances and moy-type with timeout == meh cannot be stale"""

s/moy-type/moz-type/

@@ +316,5 @@
> +           up for 147 hours BUILDAPI_INFO"""
> +        message = self.running_message()
> +        if message:
> +            message = "{0} ({1} since last build, source: {2} )".format(
> +                message, self.when_last_job_ended(), self.get_buildapi_url())

I think you can remove the ", source: http://..." part here to make the message less spammy. We know how to get there! :)

::: aws/aws_sanity_checker.py
@@ +167,5 @@
> +        if slave.is_long_running():
> +            last_job_ended = slave.when_last_job_ended()
> +            buildapi_url = slave.get_buildapi_url()
> +            message = "{0} ({1} since last build, source: {2} )".format(message,
> +                      last_job_ended, buildapi_url)

The same here.

@@ +189,5 @@
> +            last_job_ended = slave.when_last_job_ended()
> +            if last_job_ended != '0h:0m':
> +                buildapi_url = slave.get_buildapi_url()
> +                message = "{0} ({1} since last build, source: {2} )".format(message,
> +                          last_job_ended, buildapi_url)

And here
Attachment #8377598 - Flags: review?(rail) → review+
Comment on attachment 8377598 [details] [diff] [review]
962190.patch

I have removed the buildapi url from the output messages and fixed the typo in comments.
Attachment #8377598 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Thanks for cleaning this report up.
I have some more improvements to mention; would you like us to use this bug? or another one?

Here:
* We have dozens of "None" lines [1] (282 out of 376 lines)
* We have 90+ tst-linux64-spot-XXX instances that have been up for more than 12 hours (up to 356 hours)
** However, they have been doing jobs all this time
** only some of them had not done jobs recently (and they appeared on the "lazy instances" section - bug 984893)

[1] 196 None (i-19533b11, us-west-2): up for 7 hours (no info from buildapi)
Can we send the following sections to a separate audience?
* ==== Instances with unknown type ====
* ==== Instances with unknown state ====
* ==== Instances stopped for a while ====
* ==== Not attached volumes ====

They're not actionable for buildduty.
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #11)
> Can we send the following sections to a separate audience?
> * ==== Instances with unknown type ====
> * ==== Instances with unknown state ====
> * ==== Instances stopped for a while ====
> * ==== Not attached volumes ====
> 
> They're not actionable for buildduty.

It'd probably be better to take these to a new bug.
Filed bug 986059.
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.