Closed Bug 636101 Opened 9 years ago Closed 9 years ago

Slow slaves get used for builders that ask for fast ones

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: catlee)

References

Details

(Whiteboard: [release][automation])

Attachments

(1 file)

I hit this for both Firefox and XULRunner source builders:
 ...
 62712 changesets found
 command timed out: 1200 seconds without output, killing pid 1535

The default timeout of 20 minutes may be too short as m-c continues to grow.

I hadn't set any reserved slaves on pm01, so feel free if that's the problem here. However, I think the xulrunner source is not set to use a fast slave so may always hit it there.
Morphing this to the root cause, which is that we're setting
  'nextSlave': _nextFastReservedSlave
on the builder but get allocated a VM. Saw other problems with 4.0b12 with getting a VM for l10n repack and Armen got a VM for final verify for 3.5.17.

catlee points about that 
 http://hg.mozilla.org/build/buildbotcustom/file/tip/misc.py#l311
has a bug at line 324 (elif not onlyFast and slow). There are several functions that would need a check too.
Assignee: nobody → catlee
Summary: Release source times out in create_bundle when run on VMs → Slow slaves get used for builders that ask for fast ones
Note that that priority only applies on a particular master, too - if a builder on a master with only slow slaves happens to claim the job, then it will run on the slow slave..
Comment on attachment 516702 [details] [diff] [review]
Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests!

This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being more careful to return a list from _recentSort and a slave or None from _nextFastSlave.  I'm not sure how the addition of 'not fast and only fast' helps: that would have gotten to the return [] anyway, right?

         if fast: # not taken
             return sorted(fast, _recentSort(builder))[-1]
         elif slow and not only_fast: # not taken
             return sorted(slow, _recentSort(builder))[-1]
         else:
             return []

so just changing the return [] to return None would be equivalent?

I don't see how this fixes the original issue, but the refactor looks fine.



# Choose the fast slave that was most recently on this builder

If this comment is true, then idle slaves shouldn't be that surprising - masters get an affinity for slaves they've used recently, so unused slaves never get contacted.  Right?
Attachment #516702 - Flags: review?(dustin) → review+
(In reply to comment #4)
> Comment on attachment 516702 [details] [diff] [review]
> Fixes for _nextFastReservedSlave, and checking the reserved file. Now with
> tests!
> 
> This looks like a refactor of _nextFastSlave and _nextFastReservedSlave, being
> more careful to return a list from _recentSort and a slave or None from
> _nextFastSlave.  I'm not sure how the addition of 'not fast and only fast'
> helps: that would have gotten to the return [] anyway, right?
> 
>          if fast: # not taken
>              return sorted(fast, _recentSort(builder))[-1]
>          elif slow and not only_fast: # not taken
>              return sorted(slow, _recentSort(builder))[-1]
>          else:
>              return []
> 
> so just changing the return [] to return None would be equivalent?

The old _nextFastReservedSlave looked like this:

        if fast: # not taken
            return sorted(fast, _recentSort(builder))[-1]
        elif not onlyFast and slow: # not taken
            return sorted(slow, _recentSort(builder))[-1]
        else: # oops!
            log.msg("No fast or slow slaves found for builder '%s', choosing randomly instead" % builder.name)
            return random.choice(available_slaves)

This the motivation for the refactor - it was too easy for the logic in the two
very similar functions to get out of sync.  The original version in _nextFast
should be equivalent to what the patch is doing now, a little less verbosely
though.

Returning None or [] should be equivalent, we use None elsewhere, and buildbot
expects the function to return a slave object so None makes more sense than []
for "no slave".

> I don't see how this fixes the original issue, but the refactor looks fine.
> 
> # Choose the fast slave that was most recently on this builder
> 
> If this comment is true, then idle slaves shouldn't be that surprising -
> masters get an affinity for slaves they've used recently, so unused slaves
> never get contacted.  Right?

Correct.  This is done by looking through the builds in the builder's build
cache (so we don't load pickles from disk), and ordering slaves by
time-since-last-build on that builder.
Cool - my r+ stands then, and thanks for the explanation.  My mental diff with _nextFastReservedSlave didn't catch that!
Attachment #516702 - Flags: review?(nrthomas) → review+
Comment on attachment 516702 [details] [diff] [review]
Fixes for _nextFastReservedSlave, and checking the reserved file. Now with tests!

http://hg.mozilla.org/build/buildbotcustom/rev/204086db1d2c
Attachment #516702 - Flags: checked-in+
Tests are failing:

Traceback (most recent call last):
  File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py", line 382, in loadTestsFromName
    addr.filename, addr.module)
  File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 39, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py", line 86, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py", line 132
    with mock.patch.object(time, 'time') as time_method:
            ^
SyntaxError: invalid syntax
(In reply to comment #8)
> Tests are failing:
> 
> Traceback (most recent call last):
>   File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/loader.py",
> line 382, in loadTestsFromName
>     addr.filename, addr.module)
>   File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py",
> line 39, in importFromPath
>     return self.importFromDir(dir_path, fqname)
>   File
> "/builds/buildbot/preproduction/sandbox/lib/python2.5/site-packages/nose/importer.py",
> line 86, in importFromDir
>     mod = load_module(part_fqname, fh, filename, desc)
>   File
> "/builds/buildbot/preproduction/slave/test-masters/buildbotcustom/test/test_misc_nextslaves.py",
> line 132
>     with mock.patch.object(time, 'time') as time_method:
>             ^
> SyntaxError: invalid syntax

Will fix in bug 643408
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.