Closed Bug 936222 Opened 11 years ago Closed 10 years ago

nextSlave should take into account retries and spot instances

Categories

(Release Engineering :: General, defect, P2)

x86_64
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

Attachments

(4 files)

Jobs running on spot instances can be killed at any time. Thanks to Ben's recent buildbot changes, all of these should be automatically retried. Ideally we want any retried jobs to end up on ondemand instances where they're more likely to finish.

Probably the best place for this is nextSlave functions for buildbot.
Assignee: nobody → catlee
Priority: -- → P2
I'm going to be cleaning up the nextSlave functions in buildbotcustom. I don't think we need the fast/slow distinction anymore, so I'm removing the regexes here.
Attachment #829354 - Flags: review?(rail)
Attachment #829354 - Flags: review?(rail) → review+
Attachment #829354 - Flags: checked-in+
in production
Attachment #831038 - Flags: review?(rail)
Comment on attachment 831038 [details] [diff] [review]
remove old nextSlave functions

lgtm and passes test-masters
Attachment #831038 - Flags: review?(rail) → review+
Attachment #831038 - Flags: checked-in+
in production
Don't be afraid!

There are a few major bits to this patch:

- safeNextSlave is a decorator that factors out some of the boiler plate for catching exceptions and returning random slaves in case something goes wrong in the real nextSlave function

- _getRetries is the scary piece. Its job is to find out if any of the pending requests for the given builder have been retried. It has to do some mildly scary stack walking to do the DB call, but other than that is pretty straightforward.

- _classifyAWSSlaves splits a list of slaves into three buckets: in-house, ondemand (AWS), and spot (AWS)

- _nextAWSSlave is what pulls these together. We always want to run on in-house slaves if available. After that, our behaviour depends on if there are previous retries for this builder, and if we have a aws_wait timeout configured. If we have retries, then we won't run on spot instances. If aws_wait is set, then we'll wait up to that many seconds for an in-house slave to pick up the job.

Most of the build builders can use _nextAWSSlave that waits for in-house machines and sorts by which slaves have most recently done the job.

Test builders can use _nextAWSSlave that doesn't wait for in-house machines and doesn't do slave sorting.
Attachment #8341230 - Flags: review?(rail)
Attachment #8341230 - Flags: review?(nthomas)
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff

ship it!
Attachment #8341230 - Flags: review?(rail) → review+
Sorry for the delay in reviewing, I'm aiming to get it done in the next day.
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff

Review of attachment 8341230 [details] [diff] [review]:
-----------------------------------------------------------------

::: misc.py
@@ +278,5 @@
> +def _getRetries(builder):
> +    """Returns the pending build requests for this builder and the number of
> +    previous builds for these build requests."""
> +    frame = inspect.currentframe()
> +    # Walk up the stack until we find 't'

Which is a db transaction object ? A little more commentary would be great.

@@ +296,5 @@
> +    # Figure out if any of these requests have been retried
> +    # Do this by looking for existing builds corresponding to these
> +    # requests
> +    q = "SELECT count(*) FROM builds WHERE brid IN " + \
> +        builder.db.parmlist(len(request_ids))

Looks like 'retried' will be non-zero if any build row is found, rather than the RETRY result. Was that the intention ? Perhaps I've forgotten what happens in the buildrequests table when we retry-on-disconnect or a rebuild is requested via self-serve (which might make the query OK). 

r+ because the worst that can happen is that we don't use ondemand instead of spot instances, but still curious.

@@ +417,5 @@
> +
> +
> +@safeNextSlave
> +def _nextSlave(builder, available_slaves):
> +    # Choose the slow slave that was most recently on this builder

Nit, remove 'slow'.

@@ +443,5 @@
> +    valid = []
> +    for s in available_slaves:
> +        if 'panda-' in s.slave.slavename:
> +            # excempt Panda's from this foolishness
> +            valid.append(s)

Nit, lets remove the c in excempt, while you're here.
Attachment #8341230 - Flags: review?(nthomas) → review+
Comment on attachment 8341230 [details] [diff] [review]
spot-buildbotcustom.diff

Nick, I think I addressed all your concerns here:
https://hg.mozilla.org/build/buildbotcustom/rev/178badbaba06
Attachment #8341230 - Flags: checked-in+
Some small fallout from this:
2013-12-13 07:02:31-0800 [-] Error choosing next slave for builder 'Ubuntu VM 12.04 x64 mozilla-inbound debug test mochitest-3', choosing randomly instead
2013-12-13 07:02:31-0800 [-] Unhandled Error 
    Traceback (most recent call last):
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
        return func(*args,**kw)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
        result = interaction(trans, *args, **kw) 
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs
        sb = self._choose_slave(available_slaves)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/process/builder.py", line 547, in _choose_slave
        return self.nextSlave(self, available_slaves)
    --- <exception caught here> ---
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave
        return func(builder, available_slaves)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 388, in _nextSlave
        requests, retried = _getRetries(builder)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbotcustom/misc.py", line 306, in _getRetries
        t.execute(q, request_ids)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/buildbot-0.8.2_hg_12d5a63423f9_production_0.8-py2.7.egg/buildbot/db/connector.py", line 63, in execute
        return self._cursor.execute(*args, **kwargs)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/cursors.py", line 174, in execute
        self.errorhandler(self, exc, value)
      File "/builds/buildbot/tests1-linux/lib/python2.7/site-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler
        raise errorclass, errorvalue
    _mysql_exceptions.ProgrammingError: (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1")

This isn't fatal, the first line indicates that it chooses a random slave instead.

I believe this is caused when there are no requests, and so the constructed query is "SELECT count(*) FROM builds WHERE brid IN ()". If I run this manually I get a similar error: 

mysql> SELECT count(*) from builds where brid IN ();
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ')' at line 1
I'm not 100% sure how we end up calling getRetries if there are no pending requests. It's possible that the request has been picked up by another master by the time we call this code.

In any case, this patch removes the assumption that we have pending requests for a builder we're trying to find a slave for.
Attachment #8347266 - Flags: review?(bhearsum)
Attachment #8347266 - Flags: review?(bhearsum) → review+
should have closed this a while ago.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: