Closed Bug 962190 Opened 12 years ago Closed 12 years ago

aws sanity check shouldn't report instances that are actively doing work as long running

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: massimo)

Details

Attachments

(2 files, 1 obsolete file)

962190.patch 12 years ago Massimo Gervasini [:massimo] 13.19 KB, patch	rail : review-	Details \| Diff \| Splinter Review
962190.patch 12 years ago Massimo Gervasini [:massimo] 14.36 KB, patch	rail : review-	Details \| Diff \| Splinter Review
962190.patch 12 years ago Massimo Gervasini [:massimo] 18.32 KB, patch	rail : review+ massimo : checked-in+	Details \| Diff \| Splinter Review

bhearsum@mozilla.com (:bhearsum)

Reporter

Description

•

12 years ago

It seems to get confused about spot instances that are deleted and then a new one created with the same name. eg, the sanity check e-mail will report it as up for 20 hours, but when I log in I see less than an hour of uptime, and a job running. This is dangerous, because if someone shuts it down without thinking, it will interrupt a job.

Rail Aliiev [:rail]

Comment 1

•

12 years ago

I think the report is still correct, but the message is confusing. The script doesn't use uptime(1)'s output, instead it uses instances' launch time, see http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/aws_sanity_checker.py#l100

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 2

•

12 years ago

(In reply to Rail Aliiev [:rail] from comment #1) > I think the report is still correct, but the message is confusing. The > script doesn't use uptime(1)'s output, instead it uses instances' launch > time, see > http://hg.mozilla.org/build/cloud-tools/file/9ce063fb875d/aws/ > aws_sanity_checker.py#l100 OK, that's good to know. Rewording the summary. Maybe the report generator can look to buildapi to see if a long running instance has recently taken a job? Or maybe looking at its uptime would be better? Not sure about the latter because I think slave's will reboot on their own if they haven't taken a job in awhile...

Summary: aws sanity check tells lies about spot instances → aws sanity check shouldn't report instances that are actively doing work as long running

Chris AtLee [:catlee]

Updated

•

12 years ago

Assignee: nobody → mgervasini

Massimo Gervasini [:massimo]

Assignee

Comment 3

•

12 years ago

Attached patch 962190.patch (obsolete) — Details — Splinter Review

Hi, This patch updates the aws_sanity_checker.py output. For long running slaves it reports the time of the last completed job (from buildapi). Instances that need special care are at the very top of the report. I have created a support module aws_slaves.py so the report can be extended in the future to include data from other source as slavehealt/slaveapi/... Other changes are in the aws_sanity_checker.py script, everything is managed by the generate_report method and the output can be updated and extended easily (we could generate an html report for example)

Attachment #8374142 - Flags: review?(rail)

Rail Aliiev [:rail]

Comment 4

•

12 years ago

Comment on attachment 8374142 [details] [diff] [review] 962190.patch Review of attachment 8374142 [details] [diff] [review]: ----------------------------------------------------------------- Almost there! I like the patch, thanks for organizing this braindump. :) In overall the patch looks good. The only blocker for me is the fact that it doesn't time out HTTP requests and may sit forever blocked. And some nits. ::: aws/aws_slaves.py @@ +9,5 @@ > +LOG = logging.getLogger(__name__) > +LOG.setLevel(logging.INFO) > +BUILDAPI_URL_JSON = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}?format=json" > +BUILDAPI_URL = "http://buildapi.pvt.build.mozilla.org/buildapi/recent/{slave_name}" > +SLAVEHEALTH_URL = "https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?class={slave_class}&type={slave_type}&name={slave_name}" This is unused, can be deleted. @@ +95,5 @@ > + return self.last_job_endtime > + url = self.get_buildapi_json_url() > + endtime = self.now > + try: > + json_data = urllib2.urlopen(url) Adding timeout here is a good idea. Feel free to use requests module instead. @@ +138,5 @@ > + > + def is_loaned(self): > + """returns True if the slave is loaned""" > + # not implemented yet > + return self.get_loaned_string() Not implemented? :) The following should behave exactly how the docstring says. return self.get_loaned_string() is not None

Attachment #8374142 - Flags: review?(rail) → review-

Massimo Gervasini [:massimo]

Assignee

Comment 5

•

12 years ago

Attached patch 962190.patch — Details — Splinter Review

Hi Rail, sorry for another big patch. This patch fixes the problems with the old patch (new timeout for connections defaults to 60s) and creates a new generic class, Instance, which is the base class for Slave and Master. I have added this extra abstraction because it might me useful for bug 897002 and any follow ups.

Attachment #8374142 - Attachment is obsolete: true

Attachment #8374978 - Flags: review?(rail)

Rail Aliiev [:rail]

Comment 6

•

12 years ago

Comment on attachment 8374978 [details] [diff] [review] 962190.patch Review of attachment 8374978 [details] [diff] [review]: ----------------------------------------------------------------- In overall it looks good. Good progress! However, there are one thing regarding the "impaired" status. See my comments below. ::: aws/aws_instances.py @@ +96,5 @@ > + return self._get_state() == 'stopped' > + > + def is_impaired(self): > + """returns True if instance is impaired""" > + return self._get_state() == 'impaired' This won't work because "impaired" is status, not state. You have to query that explicitly, see http://hg.mozilla.org/build/cloud-tools/file/b08c1cdf88a8/aws/aws_stop_idle.py#l280 as an example. You probably shouldn't call this more than once per region to be faster. This status is not something that makes you 100% sure that the instance is down. In aws_stop_idle.py we use it only when ssh doesn't work. @@ +194,5 @@ > + > + > + > +class Master(Instance): > + """AWS master""" Hmm, this class does noting... Do you really need it?

Attachment #8374978 - Flags: review?(rail) → review-

Massimo Gervasini [:massimo]

Assignee

Comment 7

•

12 years ago

Attached patch 962190.patch — Details — Splinter Review

Hi rail, thanks for the review. This patch updates the is_imparied() method. A note about this new patch; there are few methods that are not currently not used (e.g. *_message) and some constant (e.g. EXPECTED_MAX_UPTIME) are both in aws_instance.py and aws_sanity_checker.py I plan to remove the duplicate code and make full use of the methods in a following patch, this patch is an intermediate step to make aws_sanity_checker more flexible.

Attachment #8377598 - Flags: review?(rail)

Rail Aliiev [:rail]

Comment 8

•

12 years ago

Comment on attachment 8377598 [details] [diff] [review] 962190.patch Review of attachment 8377598 [details] [diff] [review]: ----------------------------------------------------------------- Great! Just a couple of nits before you push this: ::: aws/aws_instances.py @@ +165,5 @@ > + def is_stale(self): > + """returns True if a running instance is running for a longer time than > + expected (EXPECTED_MAX_UPTIME); for stopped instances, returns True > + if the instance is stopped for more than EXPECTED_MAX_DOWNTIME > + Loaned instances and moy-type with timeout == meh cannot be stale""" s/moy-type/moz-type/ @@ +316,5 @@ > + up for 147 hours BUILDAPI_INFO""" > + message = self.running_message() > + if message: > + message = "{0} ({1} since last build, source: {2} )".format( > + message, self.when_last_job_ended(), self.get_buildapi_url()) I think you can remove the ", source: http://..." part here to make the message less spammy. We know how to get there! :) ::: aws/aws_sanity_checker.py @@ +167,5 @@ > + if slave.is_long_running(): > + last_job_ended = slave.when_last_job_ended() > + buildapi_url = slave.get_buildapi_url() > + message = "{0} ({1} since last build, source: {2} )".format(message, > + last_job_ended, buildapi_url) The same here. @@ +189,5 @@ > + last_job_ended = slave.when_last_job_ended() > + if last_job_ended != '0h:0m': > + buildapi_url = slave.get_buildapi_url() > + message = "{0} ({1} since last build, source: {2} )".format(message, > + last_job_ended, buildapi_url) And here

Attachment #8377598 - Flags: review?(rail) → review+

Massimo Gervasini [:massimo]

Assignee

Comment 9

•

12 years ago

Comment on attachment 8377598 [details] [diff] [review] 962190.patch I have removed the buildapi url from the output messages and fixed the typo in comments.

Attachment #8377598 - Flags: checked-in+

Massimo Gervasini [:massimo]

Assignee

Updated

•

12 years ago

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Comment 10

•

12 years ago

Thanks for cleaning this report up. I have some more improvements to mention; would you like us to use this bug? or another one? Here: * We have dozens of "None" lines [1] (282 out of 376 lines) * We have 90+ tst-linux64-spot-XXX instances that have been up for more than 12 hours (up to 356 hours) ** However, they have been doing jobs all this time ** only some of them had not done jobs recently (and they appeared on the "lazy instances" section - bug 984893) [1] 196 None (i-19533b11, us-west-2): up for 7 hours (no info from buildapi)

Armen [:armenzg]

Comment 11

•

12 years ago

Can we send the following sections to a separate audience? * ==== Instances with unknown type ==== * ==== Instances with unknown state ==== * ==== Instances stopped for a while ==== * ==== Not attached volumes ==== They're not actionable for buildduty.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 12

•

12 years ago

(In reply to Armen Zambrano [:armenzg] (Release Engineering) (EDT/UTC-4) from comment #11) > Can we send the following sections to a separate audience? > * ==== Instances with unknown type ==== > * ==== Instances with unknown state ==== > * ==== Instances stopped for a while ==== > * ==== Not attached volumes ==== > > They're not actionable for buildduty. It'd probably be better to take these to a new bug.

Armen [:armenzg]

Comment 13

•

12 years ago

Filed bug 986059.

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.

Bugzilla

aws sanity check shouldn't report instances that are actively doing work as long running

Categories

(Release Engineering :: General, defect)

Tracking

(Not tracked)

People

(Reporter: bhearsum, Assigned: massimo)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(2 files, 1 obsolete file)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Updated

Comment 10

Comment 11

Comment 12

Comment 13

Updated

Attachment

General

Description

File Name

Content Type