Do not use spot instances for some builders

RESOLVED FIXED


People

(Reporter: rail, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 1 obsolete attachment)

We shouldn't use spot instances for some builders (PGO/release?) and/or branches (beta/release?).
Assignee: nobody → rail
Created attachment 8374131 [details] [diff] [review]
no_pgo_on_spots-buildbotcustom-2.diff

WCPGW?
Attachment #8374131 - Flags: review?(catlee)

Updated

5 years ago
Attachment #8374131 - Flags: review?(catlee) → review+
in production
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Created attachment 8374363 [details] [diff] [review]
tests.diff
Attachment #8374363 - Flags: review?(catlee)

Updated

5 years ago
Attachment #8374363 - Flags: review?(catlee) → review+
2014-02-12 16:05:26-0800 [-] Error choosing next slave for builder 'release-mozilla-release-linux_repack_9/10', choosing randomly instead
2014-02-12 16:05:26-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
            result = interaction(trans, *args, **kw)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs
            sb = self._choose_slave(available_slaves)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/process/builder.py", line 548, in _choose_slave
            return self.nextSlave(self, available_slaves)
        --- <exception caught here> ---
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave
            return func(builder, available_slaves)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 463, in _nextSlave_skip_spot
            valid.append(s)
        exceptions.IndexError: list index out of range

Additionally, it would be great to avoid running any release builds on spot instances, because there may be no chance to get to the slave to debug a failure.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 8375265 [details] [diff] [review]
nextSlave.diff

I think I found the issue. 

sorted(no_spot_slaves, _recentSort(builder))[-1] doesn't work when no_spot_slaves is [] (indexing an empty list raises IndexError), so it is better to return None earlier.
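
For illustration only, here is a minimal sketch of the early-return idea. It is not the code from nextSlave.diff. It assumes slaves are plain name strings, that spot instances can be recognised by "-spot-" in the name, and that a hypothetical last_used mapping stands in for whatever _recentSort(builder) compares. It also uses Python 3's key= argument rather than the Python 2 comparison function seen above.

```python
def is_spot(slave_name):
    # Assumption: spot instances carry "-spot-" in their slave name.
    return "-spot-" in slave_name


def next_slave_skip_spot(available_slaves, last_used):
    """Return the most recently used non-spot slave, or None if there is none.

    available_slaves: list of slave names (hypothetical stand-in for slave objects).
    last_used: dict of slave name -> time of its last job (hypothetical).
    """
    no_spot_slaves = [s for s in available_slaves if not is_spot(s)]
    if not no_spot_slaves:
        # Return early: sorted([])[-1] would raise IndexError.
        return None
    # Sort ascending by last-used time and take the last (most recent) entry.
    return sorted(no_spot_slaves, key=lambda s: last_used.get(s, 0))[-1]
```

With this guard, a builder whose only available slaves are spot instances gets None instead of an unhandled IndexError.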
Attachment #8375262 - Attachment is obsolete: true
Attachment #8375265 - Flags: review?(catlee)

Updated

5 years ago
Attachment #8375265 - Flags: review?(catlee) → review+

Comment 9

5 years ago
Live in production.
Created attachment 8375658 [details] [diff] [review]
non-unified.diff

+ non-unified
Attachment #8375658 - Flags: review?(catlee)

Updated

5 years ago
Attachment #8375658 - Flags: review?(catlee) → review+
In production
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Had to back this out. See https://bugzilla.mozilla.org/show_bug.cgi?id=980890#c11
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Created attachment 8391006 [details] [diff] [review]
kill-skip-spot.diff

Since the kill ratio for spot instances has been almost 0% since we landed the bidding improvements (see below), let's use spot instances everywhere except releases.

^bld-linux64
date, total jobs, jobs on spots, spot retries, on-demand retries
2014-03-01, 1725, 1356 (78%), 2 (0%), 2 (0%)
2014-03-02, 1036, 762 (73%), 1 (0%), 0 (0%)
2014-03-03, 2564, 2046 (79%), 68 (3%), 0 (0%)
2014-03-04, 3263, 2636 (80%), 27 (1%), 1 (0%)
2014-03-05, 2987, 2306 (77%), 38 (1%), 2 (0%)
2014-03-06, 3456, 2688 (77%), 29 (1%), 1 (0%)
2014-03-07, 3003, 2425 (80%), 10 (0%), 1 (0%)
2014-03-08, 1303, 951 (72%), 0 (0%), 0 (0%)
2014-03-09, 998, 685 (68%), 0 (0%), 0 (0%)
2014-03-10, 2282, 1966 (86%), 15 (0%), 0 (0%)
2014-03-11, 2730, 2385 (87%), 2 (0%), 0 (0%)
2014-03-12, 2883, 2616 (90%), 9 (0%), 0 (0%)
2014-03-13, 3109, 2728 (87%), 3 (0%), 0 (0%)
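
As a rough cross-check of the "almost 0%" claim, the daily counts above can be summed over the whole period. This sketch assumes that spot retries are measured against jobs on spots and that the remaining jobs (total minus spot) ran on on-demand instances. The function and variable names are made up for the example.

```python
# Columns: date, total jobs, jobs on spots, spot retries, on-demand retries
# (counts copied from the table above, percentages dropped).
ROWS = """\
2014-03-01, 1725, 1356, 2, 2
2014-03-02, 1036, 762, 1, 0
2014-03-03, 2564, 2046, 68, 0
2014-03-04, 3263, 2636, 27, 1
2014-03-05, 2987, 2306, 38, 2
2014-03-06, 3456, 2688, 29, 1
2014-03-07, 3003, 2425, 10, 1
2014-03-08, 1303, 951, 0, 0
2014-03-09, 998, 685, 0, 0
2014-03-10, 2282, 1966, 15, 0
2014-03-11, 2730, 2385, 2, 0
2014-03-12, 2883, 2616, 9, 0
2014-03-13, 3109, 2728, 3, 0
"""


def summarize(rows_text):
    """Sum the per-day counts and derive retry percentages for the period."""
    total = spot = spot_retries = od_retries = 0
    for line in rows_text.strip().splitlines():
        _date, t, s, sr, odr = [field.strip() for field in line.split(",")]
        total += int(t)
        spot += int(s)
        spot_retries += int(sr)
        od_retries += int(odr)
    on_demand = total - spot  # assumption: every non-spot job is on-demand
    return {
        "total": total,
        "spot": spot,
        "spot_retries": spot_retries,
        "od_retries": od_retries,
        "spot_retry_pct": 100.0 * spot_retries / spot,
        "od_retry_pct": 100.0 * od_retries / on_demand,
    }


if __name__ == "__main__":
    for name, value in summarize(ROWS).items():
        print(name, value)
```

Under those assumptions, the spot retry rate for the period comes out below 1%.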


It may sound blasphemous, but we can even reconsider our logic to avoid running retried jobs on spot instances! :)
Attachment #8391006 - Flags: review?(catlee)
Comment on attachment 8391006 [details] [diff] [review]
kill-skip-spot.diff

Review of attachment 8391006 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, we could perhaps change it to run on spot if num_retries <= 1 instead of num_retries == 0
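
A hedged sketch of what that relaxed check might look like follows. The function name and the max_retries_on_spot parameter are hypothetical and not taken from buildbotcustom.

```python
def may_run_on_spot(num_retries, max_retries_on_spot=1):
    """Decide whether a job that has been retried num_retries times may use a spot slave.

    max_retries_on_spot=0 reproduces the stricter "num_retries == 0" rule
    (for non-negative counts); the default of 1 is the relaxed
    "num_retries <= 1" variant suggested in the review.
    """
    return num_retries <= max_retries_on_spot
```

Keeping the threshold as a parameter would let the stricter behaviour be restored without touching the call sites.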
Attachment #8391006 - Flags: review?(catlee) → review+
In production
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Component: General Automation → General
Product: Release Engineering → Release Engineering