Closed Bug 970552 Opened 8 years ago Closed 8 years ago

Do not use spot instances for some builders

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: rail)

Attachments

(5 files, 1 obsolete file)

We shouldn't use spot instances for some builders (PGO/release?) and/or branches (beta/release?).
Assignee: nobody → rail
WCPGW?
Attachment #8374131 - Flags: review?(catlee)
Attachment #8374131 - Flags: review?(catlee) → review+
Comment on attachment 8374131 [details] [diff] [review]
no_pgo_on_spots-buildbotcustom-2.diff

https://hg.mozilla.org/build/buildbotcustom/rev/530a492013c9
Attachment #8374131 - Flags: checked-in+
in production
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Attached patch tests.diff — Splinter Review
Attachment #8374363 - Flags: review?(catlee)
Attachment #8374363 - Flags: review?(catlee) → review+
2014-02-12 16:05:26-0800 [-] Error choosing next slave for builder 'release-mozilla-release-linux_repack_9/10', choosing randomly instead
2014-02-12 16:05:26-0800 [-] Unhandled Error
        Traceback (most recent call last):
          File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/python/context.py", line 37, in callWithContext
            return func(*args,**kw)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/twisted/enterprise/adbapi.py", line 429, in _runInteraction
            result = interaction(trans, *args, **kw)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/process/builder.py", line 517, in _claim_buildreqs
            sb = self._choose_slave(available_slaves)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbot-0.8.2_hg_f23f5672becd_production_0.8-py2.7.egg/buildbot/process/builder.py", line 548, in _choose_slave
            return self.nextSlave(self, available_slaves)
        --- <exception caught here> ---
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 267, in _nextSlave
            return func(builder, available_slaves)
          File "/builds/buildbot/build1/lib/python2.7/site-packages/buildbotcustom/misc.py", line 463, in _nextSlave_skip_spot
            valid.append(s)
        exceptions.IndexError: list index out of range

Additionally, it would be great to avoid running any release builds on spot instances, because there may be no chance to get to the slave to debug a failure.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch [wip] nextSlave.diff (obsolete) — Splinter Review
Attached patch nextSlave.diff — Splinter Review
I think I found the issue. 

sorted(no_spot_slaves, _recentSort(builder))[-1] raises IndexError when no_spot_slaves is [], so it is better to return None earlier.
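A minimal sketch of that early return, not the actual buildbotcustom patch. It is written for Python 3, so the cmp-style comparator goes through functools.cmp_to_key; pick_most_recent and by_last_used are hypothetical names, and by_last_used only stands in for whatever _recentSort(builder) returns.

```python
from functools import cmp_to_key


def pick_most_recent(no_spot_slaves, compare):
    """Return the highest-ranked slave, or None if the list is empty.

    `compare` is a hypothetical cmp-style function standing in for
    _recentSort(builder); it is not the buildbotcustom implementation.
    """
    if not no_spot_slaves:
        # sorted([])[-1] would raise IndexError, so bail out early.
        return None
    return sorted(no_spot_slaves, key=cmp_to_key(compare))[-1]


def by_last_used(a, b):
    # Illustrative ordering only: a larger "last_used" value sorts later.
    return (a["last_used"] > b["last_used"]) - (a["last_used"] < b["last_used"])
```

With an empty list the function returns None instead of raising, which lets the caller decide what to do when no non-spot slave is available.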
Attachment #8375262 - Attachment is obsolete: true
Attachment #8375265 - Flags: review?(catlee)
Attachment #8375265 - Flags: review?(catlee) → review+
Live in production.
Attached patch non-unified.diff — Splinter Review
+ non-unified
Attachment #8375658 - Flags: review?(catlee)
Attachment #8375658 - Flags: review?(catlee) → review+
In production
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Had to back this out. See https://bugzilla.mozilla.org/show_bug.cgi?id=980890#c11
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The kill ratio for spot instances has been almost 0% since we landed the bidding improvements (see below), so let's use spot instances everywhere except releases.

^bld-linux64
date, total jobs, jobs on spots, spot retries, o-d retries
2014-03-01, 1725, 1356 (78%), 2 (0%), 2 (0%)
2014-03-02, 1036, 762 (73%), 1 (0%), 0 (0%)
2014-03-03, 2564, 2046 (79%), 68 (3%), 0 (0%)
2014-03-04, 3263, 2636 (80%), 27 (1%), 1 (0%)
2014-03-05, 2987, 2306 (77%), 38 (1%), 2 (0%)
2014-03-06, 3456, 2688 (77%), 29 (1%), 1 (0%)
2014-03-07, 3003, 2425 (80%), 10 (0%), 1 (0%)
2014-03-08, 1303, 951 (72%), 0 (0%), 0 (0%)
2014-03-09, 998, 685 (68%), 0 (0%), 0 (0%)
2014-03-10, 2282, 1966 (86%), 15 (0%), 0 (0%)
2014-03-11, 2730, 2385 (87%), 2 (0%), 0 (0%)
2014-03-12, 2883, 2616 (90%), 9 (0%), 0 (0%)
2014-03-13, 3109, 2728 (87%), 3 (0%), 0 (0%)
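
The percentages above appear to be truncated integer ratios, with "jobs on spots" measured against total jobs and "spot retries" measured against jobs on spots. That is inferred from the figures, not stated anywhere in this bug. A small sketch under that assumption (pct is a hypothetical helper):

```python
def pct(part, whole):
    """Integer percentage, truncated toward zero; 0 when whole is 0."""
    return part * 100 // whole if whole else 0


# (date, total jobs, jobs on spots, spot retries), copied from the table
rows = [
    ("2014-03-03", 2564, 2046, 68),
    ("2014-03-05", 2987, 2306, 38),
]

for date, total, on_spot, spot_retries in rows:
    print(date, pct(on_spot, total), pct(spot_retries, on_spot))
```

For these two rows the helper gives 79 and 3, then 77 and 1, which matches the table.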


It may sound blasphemous, but we can even reconsider our logic to avoid running retried jobs on spot instances! :)
Attachment #8391006 - Flags: review?(catlee)
Comment on attachment 8391006 [details] [diff] [review]
kill-skip-spot.diff

Review of attachment 8391006 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, we could perhaps change it to run on spot if num_retries <= 1 instead of num_retries == 0
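
A rough sketch of that suggestion, assuming the decision reduces to a single comparison on the retry count. may_use_spot and max_retries_on_spot are hypothetical names, not taken from the patch.

```python
def may_use_spot(num_retries, max_retries_on_spot=1):
    """Return True if a job with this many retries may run on a spot instance.

    max_retries_on_spot=0 reproduces the stricter rule (num_retries == 0,
    given a non-negative count); the default of 1 is the relaxed rule
    (num_retries <= 1).
    """
    return num_retries <= max_retries_on_spot
```

Under the relaxed rule a job that has been retried once can still land on a spot instance, and only a second retry forces it elsewhere.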
Attachment #8391006 - Flags: review?(catlee) → review+
In production
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: General Automation → General