Closed Bug 978971 Opened 10 years ago Closed 10 years ago

Spot bidding should be able to filter out AZs with bad conditions

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: rail)

Details

This could be due to:
   - problems getting spot nodes
   - nodes not being able to connect to buildbot
   - jacuzzi logic going crazy on the masters

This caused tree closures today, and sheriffs report that it is a regular occurrence.

Today's incident resolved itself. The root cause is unknown.

This bug serves to find that cause.

See Bug 978956 for how to catch this when it happens. Much of it goes unnoticed because we are not seeing a high pending count, only a few machines pending for too long (over 1 hour).
We hit the following scenario here:

- the bidding library gives us a list of choices and we use the cheapest one
- the availability zone does not have enough capacity to serve the spot requests
- the spot sanity scripts see the "capacity-oversubscribed" results and cancel them

We should enhance the bidding library so we can filter out choices based on some condition, such as too many failures in the last N minutes.
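The proposed filter could look roughly like the sketch below. It is illustrative only and is not the cloud-tools code: `SpotChoice`, `too_many_failures` and `cheapest_acceptable` are hypothetical names, and the failure timestamps are assumed to be collected elsewhere (for example, by the spot sanity scripts).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Sequence


@dataclass(frozen=True)
class SpotChoice:
    """Hypothetical stand-in for one entry returned by the bidding library."""
    availability_zone: str
    instance_type: str
    price: float


def too_many_failures(failure_times: Sequence[float], now: float,
                      window_seconds: float, max_failures: int) -> bool:
    """Return True if more than max_failures occurred within the window."""
    recent = [t for t in failure_times if 0 <= now - t <= window_seconds]
    return len(recent) > max_failures


def cheapest_acceptable(choices: Iterable[SpotChoice],
                        is_acceptable: Callable[[SpotChoice], bool]
                        ) -> Optional[SpotChoice]:
    """Pick the cheapest choice that passes the filter, or None if none do."""
    candidates = [c for c in choices if is_acceptable(c)]
    return min(candidates, key=lambda c: c.price, default=None)
```

Passing the condition in as a callable keeps the "cheapest first" selection separate from whatever health check is used, so the check can change without touching the selection logic.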
Assignee: nobody → rail
Summary: we are hitting long pending wait times before machines pick up jobs for aws machines, regularly → Spot bidding should be able to filter out AZs with bad conditions
I landed the following to see how it behaves on the weekend:
http://hg.mozilla.org/build/cloud-tools/rev/3429306e6bd2

What it does:

* It checks whether the market price is higher than 80% of our bid price. If it is, we don't try to request spot instances.

* It checks recent (last 15 minutes) spot requests for the requested instance type in the requested AZ. If more than 10% of them show instances being killed, or spot requests not fulfilled due to low price or oversubscribed capacity, then we skip this spot choice.
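A minimal sketch of these two checks follows. It is not the landed patch (see the changeset linked above for that): `SpotRequest`, `should_skip` and the other names are hypothetical, the request list is assumed to be already limited to the requested instance type and AZ, and the set of "bad" status codes is an assumption based on the EC2 spot request status codes.

```python
from dataclasses import dataclass
from typing import Sequence

MAX_PRICE_RATIO = 0.8      # skip if market price exceeds 80% of our bid
MAX_FAILURE_RATIO = 0.1    # skip if more than 10% of recent requests failed
WINDOW_SECONDS = 15 * 60   # only look at the last 15 minutes

# Assumed mapping of "killed / low price / oversubscribed" to status codes.
BAD_STATUS_CODES = frozenset({
    "capacity-oversubscribed",
    "price-too-low",
    "instance-terminated-by-price",
    "instance-terminated-no-capacity",
})


@dataclass(frozen=True)
class SpotRequest:
    """Hypothetical record of one spot request in a given AZ."""
    created_at: float   # Unix timestamp, in seconds
    status_code: str


def recent_failure_ratio(requests: Sequence[SpotRequest], now: float) -> float:
    """Fraction of requests in the window that ended in a bad status."""
    recent = [r for r in requests if 0 <= now - r.created_at <= WINDOW_SECONDS]
    if not recent:
        return 0.0  # no recent history, so nothing counts against the choice
    bad = sum(1 for r in recent if r.status_code in BAD_STATUS_CODES)
    return bad / len(recent)


def should_skip(market_price: float, bid_price: float,
                requests: Sequence[SpotRequest], now: float) -> bool:
    """Return True if this spot choice should not be used."""
    if market_price > bid_price * MAX_PRICE_RATIO:
        return True
    return recent_failure_ratio(requests, now) > MAX_FAILURE_RATIO
```

Treating an empty history as healthy is a design choice in this sketch: it lets a previously skipped AZ be tried again once its failures age out of the 15-minute window.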
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Tools → General