Closed
Bug 962190
Opened 12 years ago
Closed 12 years ago
aws sanity check shouldn't report instances that are actively doing work as long running
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: massimo)
Details
Attachments
(2 files, 1 obsolete file)
14.36 KB,
patch
|
rail
:
review-
|
Details | Diff | Splinter Review |
18.32 KB,
patch
|
rail
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
It seems to get confused about spot instances that are deleted and then a new one created with the same name. eg, the sanity check e-mail will report it as up for 20 hours, but when I log in I see less than an hour of uptime, and a job running.
This is dangerous, because if someone shuts it down without thinking, it will interrupt a job.
![]() |
||
Comment 1•12 years ago
|
||
I think the report is still correct, but the message is confusing. The script doesn't use uptime(1)'s output, instead it uses instances' launch time, see http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/aws_sanity_checker.py#l100
Reporter | ||
Comment 2•12 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #1)
> I think the report is still correct, but the message is confusing. The
> script doesn't use uptime(1)'s output, instead it uses instances' launch
> time, see
> http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/
> aws_sanity_checker.py#l100
OK, that's good to know. Rewording the summary.
Maybe the report generator can look to buildapi to see if a long running instance has recently taken a job? Or maybe looking at its uptime would be better? Not sure about the latter because I think slave's will reboot on their own if they haven't taken a job in awhile...
Summary: aws sanity check tells lies about spot instances → aws sanity check shouldn't report instances that are actively doing work as long running
![]() |
||
Updated•12 years ago
|
Assignee: nobody → mgervasini
![]() |
Assignee | |
Comment 3•12 years ago
|
||
Hi,
This patch updates the aws_sanity_checker.py output. For long running slaves it reports the time of the last completed job (from buildapi).
Instances that need special care are at the very top of the report.
I have created a support module aws_slaves.py so the report can be extended in the future to include data from other source as slavehealt/slaveapi/...
Other changes are in the aws_sanity_checker.py script, everything is managed by the generate_report method and the output can be updated and extended easily (we could generate an html report for example)
Attachment #8374142 -
Flags: review?(rail)
![]() |
||
Comment 4•12 years ago
|
||
Comment on attachment 8374142 [details] [diff] [review]
962190.patch
Review of attachment 8374142 [details] [diff] [review]:
-----------------------------------------------------------------
Almost there! I like the patch, thanks for organizing this braindump. :)
In overall the patch looks good. The only blocker for me is the fact that it doesn't time out HTTP requests and may sit forever blocked. And some nits.
::: aws/aws_slaves.py
@@ +9,5 @@
> +LOG = logging.getLogger(__name__)
> +LOG.setLevel(logging.INFO)
> +BUILDAPI_URL_JSON = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}?format=json"
> +BUILDAPI_URL = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}"
> +SLAVEHEALTH_URL = "https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class={slave_class}&type={slave_type}&name={slave_name}"
This is unused, can be deleted.
@@ +95,5 @@
> + return self.last_job_endtime
> + url = self.get_buildapi_json_url()
> + endtime = self.now
> + try:
> + json_data = urllib2.urlopen(url)
Adding timeout here is a good idea. Feel free to use requests module instead.
@@ +138,5 @@
> +
> + def is_loaned(self):
> + """returns True if the slave is loaned"""
> + # not implemented yet
> + return self.get_loaned_string()
Not implemented? :) The following should behave exactly how the docstring says.
return self.get_loaned_string() is not None
Attachment #8374142 -
Flags: review?(rail) → review-
![]() |
Assignee | |
Comment 5•12 years ago
|
||
Hi Rail,
sorry for another big patch.
This patch fixes the problems with the old patch (new timeout for connections defaults to 60s) and creates a new generic class, Instance, which is the base class for Slave and Master. I have added this extra abstraction because it might me useful for bug 897002 and any follow ups.
Attachment #8374142 -
Attachment is obsolete: true
Attachment #8374978 -
Flags: review?(rail)
![]() |
||
Comment 6•12 years ago
|
||
Comment on attachment 8374978 [details] [diff] [review]
962190.patch
Review of attachment 8374978 [details] [diff] [review]:
-----------------------------------------------------------------
In overall it looks good. Good progress!
However, there are one thing regarding the "impaired" status. See my comments below.
::: aws/aws_instances.py
@@ +96,5 @@
> + return self._get_state() == 'stopped'
> +
> + def is_impaired(self):
> + """returns True if instance is impaired"""
> + return self._get_state() == 'impaired'
This won't work because "impaired" is status, not state. You have to query that explicitly, see http://hg.mozilla.org/build/cloud-tools/file/b08c1cdf88a8/aws/aws_stop_idle.py#l280 as an example. You probably shouldn't call this more than once per region to be faster.
This status is not something that makes you 100% sure that the instance is down. In aws_stop_idle.py we use it only when ssh doesn't work.
@@ +194,5 @@
> +
> +
> +
> +class Master(Instance):
> + """AWS master"""
Hmm, this class does noting... Do you really need it?
Attachment #8374978 -
Flags: review?(rail) → review-
![]() |
Assignee | |
Comment 7•12 years ago
|
||
Hi rail, thanks for the review.
This patch updates the is_imparied() method.
A note about this new patch; there are few methods that are not currently not used (e.g. *_message) and some constant (e.g. EXPECTED_MAX_UPTIME) are both in aws_instance.py and aws_sanity_checker.py
I plan to remove the duplicate code and make full use of the methods in a following patch, this patch is an intermediate step to make aws_sanity_checker more flexible.
Attachment #8377598 -
Flags: review?(rail)
![]() |
||
Comment 8•12 years ago
|
||
Comment on attachment 8377598 [details] [diff] [review]
962190.patch
Review of attachment 8377598 [details] [diff] [review]:
-----------------------------------------------------------------
Great! Just a couple of nits before you push this:
::: aws/aws_instances.py
@@ +165,5 @@
> + def is_stale(self):
> + """returns True if a running instance is running for a longer time than
> + expected (EXPECTED_MAX_UPTIME); for stopped instances, returns True
> + if the instance is stopped for more than EXPECTED_MAX_DOWNTIME
> + Loaned instances and moy-type with timeout == meh cannot be stale"""
s/moy-type/moz-type/
@@ +316,5 @@
> + up for 147 hours BUILDAPI_INFO"""
> + message = self.running_message()
> + if message:
> + message = "{0} ({1} since last build, source: {2} )".format(
> + message, self.when_last_job_ended(), self.get_buildapi_url())
I think you can remove the ", source: http://..." part here to make the message less spammy. We know how to get there! :)
::: aws/aws_sanity_checker.py
@@ +167,5 @@
> + if slave.is_long_running():
> + last_job_ended = slave.when_last_job_ended()
> + buildapi_url = slave.get_buildapi_url()
> + message = "{0} ({1} since last build, source: {2} )".format(message,
> + last_job_ended, buildapi_url)
The same here.
@@ +189,5 @@
> + last_job_ended = slave.when_last_job_ended()
> + if last_job_ended != '0h:0m':
> + buildapi_url = slave.get_buildapi_url()
> + message = "{0} ({1} since last build, source: {2} )".format(message,
> + last_job_ended, buildapi_url)
And here
Attachment #8377598 -
Flags: review?(rail) → review+
![]() |
Assignee | |
Comment 9•12 years ago
|
||
Comment on attachment 8377598 [details] [diff] [review]
962190.patch
I have removed the buildapi url from the output messages and fixed the typo in comments.
Attachment #8377598 -
Flags: checked-in+
![]() |
Assignee | |
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 10•12 years ago
|
||
Thanks for cleaning this report up.
I have some more improvements to mention; would you like us to use this bug? or another one?
Here:
* We have dozens of "None" lines [1] (282 out of 376 lines)
* We have 90+ tst-linux64-spot-XXX instances that have been up for more than 12 hours (up to 356 hours)
** However, they have been doing jobs all this time
** only some of them had not done jobs recently (and they appeared on the "lazy instances" section - bug 984893)
[1] 196 None (i-19533b11, us-west-2): up for 7 hours (no info from buildapi)
Comment 11•12 years ago
|
||
Can we send the following sections to a separate audience?
* ==== Instances with unknown type ====
* ==== Instances with unknown state ====
* ==== Instances stopped for a while ====
* ==== Not attached volumes ====
They're not actionable for buildduty.
Reporter | ||
Comment 12•12 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #11)
> Can we send the following sections to a separate audience?
> * ==== Instances with unknown type ====
> * ==== Instances with unknown state ====
> * ==== Instances stopped for a while ====
> * ==== Not attached volumes ====
>
> They're not actionable for buildduty.
It'd probably be better to take these to a new bug.
Comment 13•12 years ago
|
||
Filed bug 986059.
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•