Closed
Bug 936222
Opened 11 years ago
Closed 11 years ago
nextSlave should take into account retries and spot instances
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: catlee)
References
Details
Attachments
(4 files)
4.36 KB,
patch
|
rail
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
19.28 KB,
patch
|
rail
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
27.41 KB,
patch
|
rail
:
review+
nthomas
:
review+
catlee
:
checked-in+
|
Details | Diff | Splinter Review |
5.88 KB,
patch
|
bhearsum
:
review+
|
Details | Diff | Splinter Review |
Jobs running on spot instances can be killed at any time. Thanks to Ben's recent buildbot changes, all of these should be automatically retried. Ideally we want any retried jobs to end up on ondemand instances where they're more likely to finish.
Probably the best place for this is nextSlave functions for buildbot.
Assignee | ||
Updated•11 years ago
|
Assignee: nobody → catlee
Priority: -- → P2
Assignee | ||
Comment 1•11 years ago
|
||
I'm going to be cleaning up the nextSlave functions in buildbotcustom. I don't think we need the fast/slow distinction anymore, so I'm removing the regexes here.
Attachment #829354 -
Flags: review?(rail)
Updated•11 years ago
|
Attachment #829354 -
Flags: review?(rail) → review+
Assignee | ||
Updated•11 years ago
|
Attachment #829354 -
Flags: checked-in+
Comment 2•11 years ago
|
||
in production
Assignee | ||
Comment 3•11 years ago
|
||
Attachment #831038 -
Flags: review?(rail)
Comment 4•11 years ago
|
||
Comment on attachment 831038 [details] [diff] [review]
remove old nextSlave functions
lgtm and passes test-masters
Attachment #831038 -
Flags: review?(rail) → review+
Assignee | ||
Updated•11 years ago
|
Attachment #831038 -
Flags: checked-in+
Comment 5•11 years ago
|
||
in production
Assignee | ||
Comment 6•11 years ago
|
||
Don't be afraid!
There are a few major bits to this patch:
- safeNextSlave is a decorator that factors out some of the boiler plate for catching exceptions and returning random slaves in case something goes wrong in the real nextSlave function
- _getRetries is the scary piece. Its job is to find out if any of the pending requests for the given builder have been retried. It has to do some mildly scary stack walking to do the DB call, but other than that is pretty straightforward.
- _classifyAWSSlaves splits a list of slaves into three buckets: in-house, ondemand (AWS), and spot (AWS)
- _nextAWSSlave is what pulls these together. We always want to run on in-house slaves if available. After that, our behaviour depends on if there are previous retries for this builder, and if we have a aws_wait timeout configured. If we have retries, then we won't run on spot instances. If aws_wait is set, then we'll wait up to that many seconds for an in-house slave to pick up the job.
Most of the build builders can use _nextAWSSlave that waits for in-house machines and sorts by which slaves have most recently done the job.
Test builders can use _nextAWSSlave that doesn't wait for in-house machines and doesn't do slave sorting.
Attachment #8341230 -
Flags: review?(rail)
Attachment #8341230 -
Flags: review?(nthomas)
Comment 7•11 years ago
|
||
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff
ship it!
Attachment #8341230 -
Flags: review?(rail) → review+
Comment 8•11 years ago
|
||
Sorry for the delay in reviewing, I'm aiming to get it done in the next day.
Comment 9•11 years ago
|
||
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff
Review of attachment 8341230 [details] [diff] [review]:
-----------------------------------------------------------------
::: misc.py
@@ +278,5 @@
> +def _getRetries(builder):
> + """Returns the pending build requests for this builder and the number of
> + previous builds for these build requests."""
> + frame = inspect.currentframe()
> + # Walk up the stack until we find 't'
Which is a db transaction object ? A little more commentary would be great.
@@ +296,5 @@
> + # Figure out if any of these requests have been retried
> + # Do this by looking for existing builds corresponding to these
> + # requests
> + q = "SELECT count(*) FROM builds WHERE brid IN " + \
> + builder.db.parmlist(len(request_ids))
Looks like 'retried' will be non-zero if any build row is found, rather than the RETRY result. Was that the intention ? Perhaps I've forgotten what happens in the buildrequests table when we retry-on-disconnect or a rebuild is requested via self-serve (which might make the query OK).
r+ because the worst that can happen is that we don't use ondemand instead of spot instances, but still curious.
@@ +417,5 @@
> +
> +
> +@safeNextSlave
> +def _nextSlave(builder, available_slaves):
> + # Choose the slow slave that was most recently on this builder
Nit, remove 'slow'.
@@ +443,5 @@
> + valid = []
> + for s in available_slaves:
> + if 'panda-' in s.slave.slavename:
> + # excempt Panda's from this foolishness
> + valid.append(s)
Nit, lets remove the c in excempt, while you're here.
Attachment #8341230 -
Flags: review?(nthomas) → review+
Assignee | ||
Comment 10•11 years ago
|
||
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff
Nick, I think I addressed all your concerns here:
https://hg.mozilla.org/build/buildbotcustom/rev/178badbaba06
Attachment #8341230 -
Flags: checked-in+
Assignee | ||
Comment 11•11 years ago
|
||
Some small fallout from this:
2013-12-13 07:02:31-0800 [-] Error choosing next slave for builder 'Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-3', choosing randomly instead
2013-12-13 07:02:31-0800 [-] Unhandled Error
Traceback (most recent call last):
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
return func(*args,**kw)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
result = interaction(trans, *args, **kw)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs
sb = self._choose_slave(available_slaves)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 547, in _choose_slave
return self.nextSlave(self, available_slaves)
--- <exception caught here> ---
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave
return func(builder, available_slaves)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 388, in _nextSlave
requests, retried = _getRetries(builder)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 306, in _getRetries
t.execute(q, request_ids)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/db/connector.py", line 63, in execute
return self._cursor.execute(*args, **kwargs)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
self.errorhandler(self, exc, value)
File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
raise errorclass, errorvalue
_mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1")
This isn't fatal, the first line indicates that it chooses a random slave instead.
I believe this is caused when there are no requests, and so the constructed query is "SELECT count(*) FROM builds WHERE brid IN ()". If I run this manually I get a similar error:
mysql> SELECT count(*) from builds where brid IN ();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1
Assignee | ||
Comment 12•11 years ago
|
||
I'm not 100% sure how we end up calling getRetries if there are no pending requests. It's possible that the request has been picked up by another master by the time we call this code.
In any case, this patch removes the assumption that we have pending requests for a builder we're trying to find a slave for.
Attachment #8347266 -
Flags: review?(bhearsum)
Updated•11 years ago
|
Attachment #8347266 -
Flags: review?(bhearsum) → review+
Assignee | ||
Comment 13•11 years ago
|
||
should have closed this a while ago.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•