Closed
Bug 962190
Opened 10 years ago
Closed 10 years ago
aws sanity check shouldn't report instances that are actively doing work as long running
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: bhearsum, Assigned: massimo)
Details
Attachments
(2 files, 1 obsolete file)
14.36 KB,
patch
|
rail
:
review-
|
Details | Diff | Splinter Review |
18.32 KB,
patch
|
rail
:
review+
massimo
:
checked-in+
|
Details | Diff | Splinter Review |
It seems to get confused about spot instances that are deleted and then a new one created with the same name. eg, the sanity check e-mail will report it as up for 20 hours, but when I log in I see less than an hour of uptime, and a job running. This is dangerous, because if someone shuts it down without thinking, it will interrupt a job.
Comment 1•10 years ago
|
||
I think the report is still correct, but the message is confusing. The script doesn't use uptime(1)'s output, instead it uses instances' launch time, see http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/aws_sanity_checker.py#l100
Reporter | ||
Comment 2•10 years ago
|
||
(In reply to Rail Aliiev [:rail] from comment #1) > I think the report is still correct, but the message is confusing. The > script doesn't use uptime(1)'s output, instead it uses instances' launch > time, see > http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/ > aws_sanity_checker.py#l100 OK, that's good to know. Rewording the summary. Maybe the report generator can look to buildapi to see if a long running instance has recently taken a job? Or maybe looking at its uptime would be better? Not sure about the latter because I think slave's will reboot on their own if they haven't taken a job in awhile...
Summary: aws sanity check tells lies about spot instances → aws sanity check shouldn't report instances that are actively doing work as long running
Updated•10 years ago
|
Assignee: nobody → mgervasini
Assignee | ||
Comment 3•10 years ago
|
||
Hi, This patch updates the aws_sanity_checker.py output. For long running slaves it reports the time of the last completed job (from buildapi). Instances that need special care are at the very top of the report. I have created a support module aws_slaves.py so the report can be extended in the future to include data from other source as slavehealt/slaveapi/... Other changes are in the aws_sanity_checker.py script, everything is managed by the generate_report method and the output can be updated and extended easily (we could generate an html report for example)
Attachment #8374142 -
Flags: review?(rail)
Comment 4•10 years ago
|
||
Comment on attachment 8374142 [details] [diff] [review] 962190.patch Review of attachment 8374142 [details] [diff] [review]: ----------------------------------------------------------------- Almost there! I like the patch, thanks for organizing this braindump. :) In overall the patch looks good. The only blocker for me is the fact that it doesn't time out HTTP requests and may sit forever blocked. And some nits. ::: aws/aws_slaves.py @@ +9,5 @@ > +LOG = logging.getLogger(__name__) > +LOG.setLevel(logging.INFO) > +BUILDAPI_URL_JSON = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}?format=json" > +BUILDAPI_URL = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}" > +SLAVEHEALTH_URL = "https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class={slave_class}&type={slave_type}&name={slave_name}" This is unused, can be deleted. @@ +95,5 @@ > + return self.last_job_endtime > + url = self.get_buildapi_json_url() > + endtime = self.now > + try: > + json_data = urllib2.urlopen(url) Adding timeout here is a good idea. Feel free to use requests module instead. @@ +138,5 @@ > + > + def is_loaned(self): > + """returns True if the slave is loaned""" > + # not implemented yet > + return self.get_loaned_string() Not implemented? :) The following should behave exactly how the docstring says. return self.get_loaned_string() is not None
Attachment #8374142 -
Flags: review?(rail) → review-
Assignee | ||
Comment 5•10 years ago
|
||
Hi Rail, sorry for another big patch. This patch fixes the problems with the old patch (new timeout for connections defaults to 60s) and creates a new generic class, Instance, which is the base class for Slave and Master. I have added this extra abstraction because it might me useful for bug 897002 and any follow ups.
Attachment #8374142 -
Attachment is obsolete: true
Attachment #8374978 -
Flags: review?(rail)
Comment 6•10 years ago
|
||
Comment on attachment 8374978 [details] [diff] [review] 962190.patch Review of attachment 8374978 [details] [diff] [review]: ----------------------------------------------------------------- In overall it looks good. Good progress! However, there are one thing regarding the "impaired" status. See my comments below. ::: aws/aws_instances.py @@ +96,5 @@ > + return self._get_state() == 'stopped' > + > + def is_impaired(self): > + """returns True if instance is impaired""" > + return self._get_state() == 'impaired' This won't work because "impaired" is status, not state. You have to query that explicitly, see http://hg.mozilla.org/build/cloud-tools/file/b08c1cdf88a8/aws/aws_stop_idle.py#l280 as an example. You probably shouldn't call this more than once per region to be faster. This status is not something that makes you 100% sure that the instance is down. In aws_stop_idle.py we use it only when ssh doesn't work. @@ +194,5 @@ > + > + > + > +class Master(Instance): > + """AWS master""" Hmm, this class does noting... Do you really need it?
Attachment #8374978 -
Flags: review?(rail) → review-
Assignee | ||
Comment 7•10 years ago
|
||
Hi rail, thanks for the review. This patch updates the is_imparied() method. A note about this new patch; there are few methods that are not currently not used (e.g. *_message) and some constant (e.g. EXPECTED_MAX_UPTIME) are both in aws_instance.py and aws_sanity_checker.py I plan to remove the duplicate code and make full use of the methods in a following patch, this patch is an intermediate step to make aws_sanity_checker more flexible.
Attachment #8377598 -
Flags: review?(rail)
Comment 8•10 years ago
|
||
Comment on attachment 8377598 [details] [diff] [review] 962190.patch Review of attachment 8377598 [details] [diff] [review]: ----------------------------------------------------------------- Great! Just a couple of nits before you push this: ::: aws/aws_instances.py @@ +165,5 @@ > + def is_stale(self): > + """returns True if a running instance is running for a longer time than > + expected (EXPECTED_MAX_UPTIME); for stopped instances, returns True > + if the instance is stopped for more than EXPECTED_MAX_DOWNTIME > + Loaned instances and moy-type with timeout == meh cannot be stale""" s/moy-type/moz-type/ @@ +316,5 @@ > + up for 147 hours BUILDAPI_INFO""" > + message = self.running_message() > + if message: > + message = "{0} ({1} since last build, source: {2} )".format( > + message, self.when_last_job_ended(), self.get_buildapi_url()) I think you can remove the ", source: http://..." part here to make the message less spammy. We know how to get there! :) ::: aws/aws_sanity_checker.py @@ +167,5 @@ > + if slave.is_long_running(): > + last_job_ended = slave.when_last_job_ended() > + buildapi_url = slave.get_buildapi_url() > + message = "{0} ({1} since last build, source: {2} )".format(message, > + last_job_ended, buildapi_url) The same here. @@ +189,5 @@ > + last_job_ended = slave.when_last_job_ended() > + if last_job_ended != '0h:0m': > + buildapi_url = slave.get_buildapi_url() > + message = "{0} ({1} since last build, source: {2} )".format(message, > + last_job_ended, buildapi_url) And here
Attachment #8377598 -
Flags: review?(rail) → review+
Assignee | ||
Comment 9•10 years ago
|
||
Comment on attachment 8377598 [details] [diff] [review] 962190.patch I have removed the buildapi url from the output messages and fixed the typo in comments.
Attachment #8377598 -
Flags: checked-in+
Assignee | ||
Updated•10 years ago
|
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 10•10 years ago
|
||
Thanks for cleaning this report up. I have some more improvements to mention; would you like us to use this bug? or another one? Here: * We have dozens of "None" lines [1] (282 out of 376 lines) * We have 90+ tst-linux64-spot-XXX instances that have been up for more than 12 hours (up to 356 hours) ** However, they have been doing jobs all this time ** only some of them had not done jobs recently (and they appeared on the "lazy instances" section - bug 984893) [1] 196 None (i-19533b11, us-west-2): up for 7 hours (no info from buildapi)
Comment 11•10 years ago
|
||
Can we send the following sections to a separate audience? * ==== Instances with unknown type ==== * ==== Instances with unknown state ==== * ==== Instances stopped for a while ==== * ==== Not attached volumes ==== They're not actionable for buildduty.
Reporter | ||
Comment 12•10 years ago
|
||
(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #11) > Can we send the following sections to a separate audience? > * ==== Instances with unknown type ==== > * ==== Instances with unknown state ==== > * ==== Instances stopped for a while ==== > * ==== Not attached volumes ==== > > They're not actionable for buildduty. It'd probably be better to take these to a new bug.
Comment 13•10 years ago
|
||
Filed bug 986059.
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•