Closed Bug 936222 Opened 12 years ago Closed 11 years ago

nextSlave should take into account retries and spot instances

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

Attachments

(4 files)

remove fastRegexes from buildbot-configs 12 years ago Chris AtLee [:catlee] 4.36 KB, patch	rail : review+ catlee : checked-in+	Details \| Diff \| Splinter Review
remove old nextSlave functions 12 years ago Chris AtLee [:catlee] 19.28 KB, patch	rail : review+ catlee : checked-in+	Details \| Diff \| Splinter Review
spot-buildbotcustom.diff 12 years ago Chris AtLee [:catlee] 27.41 KB, patch	rail : review+ nthomas : review+ catlee : checked-in+	Details \| Diff \| Splinter Review
Handle case where there are no pending requests 12 years ago Chris AtLee [:catlee] 5.88 KB, patch	bhearsum : review+	Details \| Diff \| Splinter Review

Chris AtLee [:catlee]

Assignee

Description

•

12 years ago

Jobs running on spot instances can be killed at any time. Thanks to Ben's recent buildbot changes, all of these should be automatically retried. Ideally we want any retried jobs to end up on ondemand instances where they're more likely to finish. Probably the best place for this is nextSlave functions for buildbot.

Chris AtLee [:catlee]

Assignee

Updated

•

12 years ago

Assignee: nobody → catlee

Priority: -- → P2

Chris AtLee [:catlee]

Assignee

Comment 1

•

12 years ago

Attached patch remove fastRegexes from buildbot-configs — Details — Splinter Review

I'm going to be cleaning up the nextSlave functions in buildbotcustom. I don't think we need the fast/slow distinction anymore, so I'm removing the regexes here.

Attachment #829354 - Flags: review?(rail)

Rail Aliiev [:rail]

Updated

•

12 years ago

Attachment #829354 - Flags: review?(rail) → review+

Chris AtLee [:catlee]

Assignee

Updated

•

12 years ago

Attachment #829354 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Comment 2

•

12 years ago

in production

Chris AtLee [:catlee]

Assignee

Comment 3

•

12 years ago

Attached patch remove old nextSlave functions — Details — Splinter Review

Attachment #831038 - Flags: review?(rail)

Rail Aliiev [:rail]

Comment 4

•

12 years ago

Comment on attachment 831038 [details] [diff] [review] remove old nextSlave functions lgtm and passes test-masters

Attachment #831038 - Flags: review?(rail) → review+

Chris AtLee [:catlee]

Assignee

Updated

•

12 years ago

Attachment #831038 - Flags: checked-in+

John Hopkins (:jhopkins)

Comment 5

•

12 years ago

in production

Chris AtLee [:catlee]

Assignee

Comment 6

•

12 years ago

Attached patch spot-buildbotcustom.diff — Details — Splinter Review

Don't be afraid! There are a few major bits to this patch: - safeNextSlave is a decorator that factors out some of the boiler plate for catching exceptions and returning random slaves in case something goes wrong in the real nextSlave function - _getRetries is the scary piece. Its job is to find out if any of the pending requests for the given builder have been retried. It has to do some mildly scary stack walking to do the DB call, but other than that is pretty straightforward. - _classifyAWSSlaves splits a list of slaves into three buckets: in-house, ondemand (AWS), and spot (AWS) - _nextAWSSlave is what pulls these together. We always want to run on in-house slaves if available. After that, our behaviour depends on if there are previous retries for this builder, and if we have a aws_wait timeout configured. If we have retries, then we won't run on spot instances. If aws_wait is set, then we'll wait up to that many seconds for an in-house slave to pick up the job. Most of the build builders can use _nextAWSSlave that waits for in-house machines and sorts by which slaves have most recently done the job. Test builders can use _nextAWSSlave that doesn't wait for in-house machines and doesn't do slave sorting.

Attachment #8341230 - Flags: review?(rail)

Attachment #8341230 - Flags: review?(nthomas)

Rail Aliiev [:rail]

Comment 7

•

12 years ago

Comment on attachment 8341230 [details] [diff] [review] spot-buildbotcustom.diff ship it!

Attachment #8341230 - Flags: review?(rail) → review+

Nick Thomas [:nthomas] (UTC+12)

Comment 8

•

12 years ago

Sorry for the delay in reviewing, I'm aiming to get it done in the next day.

Nick Thomas [:nthomas] (UTC+12)

Comment 9

•

12 years ago

Comment on attachment 8341230 [details] [diff] [review] spot-buildbotcustom.diff Review of attachment 8341230 [details] [diff] [review]: ----------------------------------------------------------------- ::: misc.py @@ +278,5 @@ > +def _getRetries(builder): > + """Returns the pending build requests for this builder and the number of > + previous builds for these build requests.""" > + frame = inspect.currentframe() > + # Walk up the stack until we find 't' Which is a db transaction object ? A little more commentary would be great. @@ +296,5 @@ > + # Figure out if any of these requests have been retried > + # Do this by looking for existing builds corresponding to these > + # requests > + q = "SELECT count(*) FROM builds WHERE brid IN " + \ > + builder.db.parmlist(len(request_ids)) Looks like 'retried' will be non-zero if any build row is found, rather than the RETRY result. Was that the intention ? Perhaps I've forgotten what happens in the buildrequests table when we retry-on-disconnect or a rebuild is requested via self-serve (which might make the query OK). r+ because the worst that can happen is that we don't use ondemand instead of spot instances, but still curious. @@ +417,5 @@ > + > + > +@safeNextSlave > +def _nextSlave(builder, available_slaves): > + # Choose the slow slave that was most recently on this builder Nit, remove 'slow'. @@ +443,5 @@ > + valid = [] > + for s in available_slaves: > + if 'panda-' in s.slave.slavename: > + # excempt Panda's from this foolishness > + valid.append(s) Nit, lets remove the c in excempt, while you're here.

Attachment #8341230 - Flags: review?(nthomas) → review+

Chris AtLee [:catlee]

Assignee

Comment 10

•

12 years ago

Comment on attachment 8341230 [details] [diff] [review] spot-buildbotcustom.diff Nick, I think I addressed all your concerns here: https://hg.mozilla.org/build/buildbotcustom/rev/178badbaba06

Attachment #8341230 - Flags: checked-in+

Chris AtLee [:catlee]

Assignee

Comment 11

•

12 years ago

Some small fallout from this: 2013-12-13 07:02:31-0800 [-] Error choosing next slave for builder 'Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-3', choosing randomly instead 2013-12-13 07:02:31-0800 [-] Unhandled Error Traceback (most recent call last): File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext return func(*args,**kw) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction result = interaction(trans, *args, **kw) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs sb = self._choose_slave(available_slaves) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 547, in _choose_slave return self.nextSlave(self, available_slaves) --- <exception caught here> --- File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave return func(builder, available_slaves) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 388, in _nextSlave requests, retried = _getRetries(builder) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 306, in _getRetries t.execute(q, request_ids) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/db/connector.py", line 63, in execute return self._cursor.execute(*args, **kwargs) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute self.errorhandler(self, exc, value) File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler raise errorclass, errorvalue _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1") This isn't fatal, the first line indicates that it chooses a random slave instead. I believe this is caused when there are no requests, and so the constructed query is "SELECT count(*) FROM builds WHERE brid IN ()". If I run this manually I get a similar error: mysql> SELECT count(*) from builds where brid IN (); ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1

Chris AtLee [:catlee]

Assignee

Comment 12

•

12 years ago

Attached patch Handle case where there are no pending requests — Details — Splinter Review

I'm not 100% sure how we end up calling getRetries if there are no pending requests. It's possible that the request has been picked up by another master by the time we call this code. In any case, this patch removes the assumption that we have pending requests for a builder we're trying to find a slave for.

Attachment #8347266 - Flags: review?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Updated

•

12 years ago

Attachment #8347266 - Flags: review?(bhearsum) → review+

Chris AtLee [:catlee]

Assignee

Comment 13

•

11 years ago

should have closed this a while ago.

Status: NEW → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.