Status

Release Engineering
Other
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: (dormant account), Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

3 years ago
Amazon spot prices work best if we are flexible about where we run our nodes and what types of nodes we spin up.

We need a spot bidding lib to help us figure out what to bid on. I think we should specify acceptable configurations as a list:
whatShouldIBidFor([[c3.xlarge, 0.250, 1], [m3.large, 0.150, 0.6], [c3.2xlarge, 0.250, 1.2], [c3.xlarge, 0.300, ondemand]], ['us-east-1', 'us-west-2'])
where the format of each entry is [nodetype, max_price, performance_constant, spot|ondemand]

The library would then look at the current spot prices and return the [region, node_type, max_price] that offers the best value, where value = price/performance_constant (lower is better).

performance_constant helps us decide when to bid on more powerful nodes, e.g. if c3.xlarge costs the same as c3.2xlarge (or even 20% more), we should bid on c3.2xlarge.

This way we can encode preferences for instances up to c3.4xlarge, etc., and even choose to bid on crappier instances in extreme situations.
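
A minimal sketch of the selection rule described above, assuming current spot prices are already available as a dictionary keyed by (region, node_type). The function name `what_should_i_bid_for`, the tuple layout, and the price dictionary are illustrative assumptions, not an existing library API, and on-demand entries are left out for simplicity.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical shapes, chosen only for this illustration.
Config = Tuple[str, float, float]          # (node_type, max_price, performance_constant)
SpotPrices = Dict[Tuple[str, str], float]  # (region, node_type) -> current spot price


def what_should_i_bid_for(
    configs: List[Config],
    regions: List[str],
    spot_prices: SpotPrices,
) -> Optional[Tuple[str, str, float]]:
    """Return (region, node_type, max_price) with the lowest price/performance_constant."""
    best = None
    best_value = float("inf")
    for node_type, max_price, perf in configs:
        for region in regions:
            price = spot_prices.get((region, node_type))
            # Skip combinations with no known price or a price above our limit.
            if price is None or price > max_price:
                continue
            value = price / perf  # cost per unit of performance, lower is better
            if value < best_value:
                best_value = value
                best = (region, node_type, max_price)
    return best
```

With made-up prices of 0.10 for c3.xlarge and 0.11 for c3.2xlarge in the same region, the performance constants 1 and 1.2 give values of 0.10 and roughly 0.092, so the sketch picks c3.2xlarge even though its raw price is higher.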



As a follow-up we should add a similar API call for deciding when to shut down overpriced instances and spin them up in a cheaper region.
I've been thinking about this too recently. Some ideas below.

* The closer the market price is to the max price, the higher the risk of being killed by price. We may need to introduce some risk ratio into the equation. Something like:

market price  safe price  max price    risk
      10           15         20          0%
      15           15         20          0%
      17           15         20         40%
      19           15         20         80%
      25           15         20        100%

Where "safe price" is some limit determined by practice (we'll need to analyze when the spot world starts collapsing). Using this ratio we can correct the instance type distribution.

* Not related to bidding, but instead of starting 100% of the capacity we need, we can look at the capacity we already have and assume that some of it will be available soon to cover part of the request. We can use pulse events to get a better idea of how many running instances will be available in the next N minutes.
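
The risk column in the table above is consistent with a linear ramp from the safe price to the max price, clamped to the range 0% to 100%. The sketch below is one way to compute it; the linear shape is inferred from the sample rows, and the function name `price_risk` is a hypothetical choice.

```python
def price_risk(market_price: float, safe_price: float, max_price: float) -> float:
    """Risk of being outbid, as a fraction between 0.0 and 1.0.

    Assumes risk grows linearly once the market price exceeds the safe
    price and reaches 1.0 at the max price.
    """
    if max_price <= safe_price:
        raise ValueError("max_price must be greater than safe_price")
    ratio = (market_price - safe_price) / (max_price - safe_price)
    # Clamp so that prices below the safe price or above the max stay in range.
    return min(1.0, max(0.0, ratio))
```

For example, with a safe price of 15 and a max price of 20, a market price of 17 sits two fifths of the way up the ramp, which matches the 40% row.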
(Reporter)

Comment 2

3 years ago
(In reply to Rail Aliiev [:rail] from comment #1)
> I've been thinking about this too recently. Some ideas below.
> 
> * The closer the market price to the max price, the higher risk of being
> killed by price. We may need to introduce some risk ratio into the equation.
> Something like:
> 
> market price  safe price  max price    risk
>       10           15         20          0%
>       15           15         20          0%
>       17           15         20         40%
>       19           15         20         80%
>       25           15         20        100%
> 
> Where "safe price" is some limit determined by practice (we'll need to
> analyze when the spot world starts collapsing). Using this ratio we can
> correct the instance type distribution.

I think risk is a good thing, cos it gives us free execution. We should just make sure that we can start running jobs ASAP and that they run quickly. sccache ensures that we don't repeat too much of the build process.
I would prefer to deal with high-risk jobs by starting them on 2 parallel instances in different regions.

Though until we have that, we should probably add a "low-risk" flag (low risk would be spot prices that are at their long-term average price, and we can prototype it by using a price buffer as you suggest).

> 
> * Not related to bidding, but instead of starting 100% of the capacity we
> need, we can look at the capacity we already have and assume that some of it
> will be available soon to cover part of the request. We can use pulse
> events to get a better idea of how many running instances will be available
> in the next N minutes.

Good idea. We can add a build step that says "almost done" or something. We can then count those as 'available' instances.
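
As a rough illustration of counting "almost done" instances toward a capacity request, the sketch below assumes each running instance reports a state string and that "almost_done" is the marker emitted by the proposed build step. Both the state name and the function `instances_to_start` are hypothetical.

```python
from typing import Iterable


def instances_to_start(requested: int, running_states: Iterable[str]) -> int:
    """Number of new instances to start after crediting soon-to-be-free ones."""
    # Instances flagged "almost_done" are treated as available capacity.
    soon_available = sum(1 for state in running_states if state == "almost_done")
    return max(0, requested - soon_available)
```

A request for 10 instances with 3 running builds flagged "almost_done" would then start only 7 new ones.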
(Reporter)

Comment 3

3 years ago
So a couple of things after playing with this:
a) mixing on-demand logic into this doesn't make sense. This should be part of the 'nothing good to bid on' fallback case
b) risk assessment should also be a separate module that looks at answers and drops ones it considers unacceptable for any specific risk criteria
c) we should specify the AZ when bidding (once we have this stuff). Otherwise Amazon can put us into AZs with higher prices that are still under our max

Here is a prototype that makes an ordered list of what to bid on:
https://github.com/tarasglek/spotbidagent/blob/master/sample_out.txt

Looks like switching to this model will drop our spot prices significantly (2-4x) by maximizing our chances of hitting a cheap AZ.
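
Points (a) to (c) could be structured roughly as below. This is an independent sketch, not the code from the linked prototype: ranking is done per availability zone, risk filtering is a separate pluggable step, and an empty result signals the caller to use its own on-demand fallback. All names and data shapes here are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Config = Tuple[str, float, float]     # (node_type, max_price, performance_constant)
Bid = Tuple[str, str, float, float]   # (availability_zone, node_type, spot_price, value)


def rank_bids(configs: List[Config], az_prices: Dict[Tuple[str, str], float]) -> List[Bid]:
    """Ordered list of candidate bids, cheapest price per unit of performance first."""
    bids = []
    for node_type, max_price, perf in configs:
        for (az, priced_type), price in az_prices.items():
            if priced_type == node_type and price <= max_price:
                bids.append((az, node_type, price, price / perf))
    return sorted(bids, key=lambda bid: bid[3])


def drop_risky(bids: List[Bid], is_acceptable: Callable[[Bid], bool]) -> List[Bid]:
    """Separate risk step: keep only bids the given criterion accepts."""
    return [bid for bid in bids if is_acceptable(bid)]


def choose_bid(
    configs: List[Config],
    az_prices: Dict[Tuple[str, str], float],
    is_acceptable: Callable[[Bid], bool],
) -> Optional[Bid]:
    """Best acceptable bid, or None when the caller should fall back to on-demand."""
    acceptable = drop_risky(rank_bids(configs, az_prices), is_acceptable)
    return acceptable[0] if acceptable else None
```

Keeping the risk criterion as a callable means different job classes can apply different thresholds without touching the ranking logic.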
(Reporter)

Comment 4

3 years ago
Looks like we can squeeze a lot more capacity out of us-east-1 by allowing more node types. Seems like older nodes are the cheapest.

https://github.com/tarasglek/spotbidagent/blob/master/sample_out.txt
(Reporter)

Updated

3 years ago
Blocks: 974727
(Reporter)

Comment 5

3 years ago
Pretty happy with how https://github.com/tarasglek/spotbidagent/ worked out.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED