Closed Bug 1309901 Opened 8 years ago Closed 8 years ago

Use spot instances for ad-hoc clusters

Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

We should use spot instances for Spark worker nodes of ad-hoc clusters. I can confirm that a running Spark job handles the death of a worker node without any particular user-visible effects besides a small lag.

Looking at the spot instance history for c3.4xlarge in our region, the spot price has only very rarely been higher than the on-demand price, so I would suggest we start with a higher bid price and go from there. I think the cost reduction outweighs the harm caused by a small chance of losing all worker nodes.
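
One way to check the "only very rarely" claim against price history is sketched below. It is a minimal illustration, not the analysis actually done for this bug: the function name is hypothetical, and it assumes the history has already been fetched as a plain list of prices in dollars per hour.

```python
from typing import Iterable


def fraction_above_on_demand(spot_prices: Iterable[float], on_demand_price: float) -> float:
    """Return the share of observed spot prices that exceed the on-demand price.

    Assumption: spot_prices is a flat list of historical prices (USD/hour)
    for one instance type in one region, obtained elsewhere.
    """
    prices = list(spot_prices)
    if not prices:
        raise ValueError("no spot price observations given")
    # Count the observations where spot was more expensive than on-demand.
    above = sum(1 for price in prices if price > on_demand_price)
    return above / len(prices)
```

A value close to zero would support bidding at or near the on-demand price, since the bid would rarely be outpriced.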

Attachments

(2 files)

User Story: (updated)
Assignee: nobody → fbertsch
Initially I'm going to just have all spot instances be TASK instances (they do not run the HDFS DataNode daemon, which we don't use anyway, right?).

We could also consider having 1 master, then the next X nodes on-demand as well, and the remaining nodes (num_instances - X - 1) as spot instances.

I'm going to set the initial bid to $0.84, but this is configurable.
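
The layout described above (1 master, X on-demand nodes, the rest spot TASK nodes) could be sketched as follows. This is an illustration under stated assumptions, not the code that landed for this bug: the function name and parameters are hypothetical, the on-demand workers are assumed to be CORE nodes, and the dictionaries follow the shape of the InstanceGroups entries that EMR's RunJobFlow API accepts. The sketch does not check whether EMR accepts a cluster with task nodes but no core nodes.

```python
from typing import Dict, List


def build_instance_groups(
    num_instances: int,
    num_on_demand_workers: int,
    instance_type: str = "c3.4xlarge",
    bid_price: str = "0.84",  # configurable bid in USD/hour, passed as a string
) -> List[Dict[str, object]]:
    """Build an EMR-style InstanceGroups list for a mixed on-demand/spot cluster."""
    # Spot nodes are whatever is left after the master and on-demand workers.
    num_spot = num_instances - num_on_demand_workers - 1
    if num_on_demand_workers < 0 or num_spot < 0:
        raise ValueError("num_instances must cover the master and on-demand workers")

    groups: List[Dict[str, object]] = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if num_on_demand_workers > 0:
        groups.append({
            "Name": "on-demand workers",
            "InstanceRole": "CORE",  # assumption: on-demand workers are CORE nodes
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": num_on_demand_workers,
        })
    if num_spot > 0:
        groups.append({
            "Name": "spot workers",
            "InstanceRole": "TASK",  # TASK nodes do not run the HDFS DataNode daemon
            "Market": "SPOT",
            "BidPrice": bid_price,
            "InstanceType": instance_type,
            "InstanceCount": num_spot,
        })
    return groups
```

For example, build_instance_groups(10, 2) yields one master, two on-demand workers, and seven spot workers.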
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
How much longer does it take on average for a cluster to bootstrap now? Note that cluster bootstrap times can afaik be highly volatile with spot instances. How do you plan to deal with that if users complain that clusters don't start up?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Flags: needinfo?(fbertsch)
Unfortunately this is hard to test, because it is so variable. I've launched a few clusters and they all bootstrapped in about the same amount of time as I've seen before, e.g. "Waiting" status within 15 minutes. 

In general, I don't think we're going to run into an issue of getting kicked off machines or unable to start spot machines, since our bid is so high right now. However, we may run into periods where it happens. What if we have a config that we can use to switch off using spot clusters?
Flags: needinfo?(fbertsch)
(In reply to Frank Bertsch [:frank] from comment #4)
> Unfortunately this is hard to test, because it is so variable. I've launched
> a few clusters and they all bootstrapped in about the same amount of time as
> I've seen before, e.g. "Waiting" status within 15 minutes. 
> 
> In general, I don't think we're going to run into an issue of getting kicked
> off machines or unable to start spot machines, since our bid is so high
> right now. However, we may run into periods where it happens. What if we
> have a config that we can use to switch off using spot clusters?

Yeah, that makes sense. It would be nice to have a checkbox to disable spot instances when launching a new cluster.
I was hesitant to let users decide not to use spot instances, since so many people consider their workloads the most important and may not understand that losing a spot instance does not mean failures for their jobs. I was thinking this would be more an admin thing, though I'm open to considering otherwise.
(In reply to Frank Bertsch [:frank] from comment #6)
> I was hesitant to let users decide not to use spot instances, since so many
> people consider their workloads the most important and may not understand
> that losing a spot instance does not mean failures for their jobs. I was
> thinking this would be more an admin thing, though I'm open to considering
> otherwise.

Agreed.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> (In reply to Frank Bertsch [:frank] from comment #6)
> > I was hesitant to let users decide not to use spot instances, since so many
> > people consider their workloads the most important and may not understand
> > that losing a spot instance does not mean failures for their jobs. I was
> > thinking this would be more an admin thing, though I'm open to considering
> > otherwise.
> 
> Agreed.

Okay, I'm going to implement it this way, with a switch just for admins.
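
An admin-only switch of this kind could look roughly like the sketch below. It is only an illustration: the setting name USE_SPOT_INSTANCES and the function are hypothetical, and it assumes the switch lives in server-side configuration (here, an environment variable) that users launching clusters cannot change.

```python
import os
from typing import Mapping, Optional


def worker_market(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "SPOT" unless the admin-only setting disables spot instances.

    USE_SPOT_INSTANCES is a hypothetical setting name; spot is the default.
    """
    env = os.environ if env is None else env
    value = env.get("USE_SPOT_INSTANCES", "true").strip().lower()
    # Any explicit "off" value falls back to on-demand workers.
    return "ON_DEMAND" if value in {"0", "false", "no", "off"} else "SPOT"
```

Keeping spot as the default means admins only need to touch the setting during periods when spot capacity is unavailable or clusters fail to start.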
Points: --- → 3
Priority: -- → P1
Status: REOPENED → RESOLVED
Closed: 8 years ago → 8 years ago
Resolution: --- → FIXED
Product: Cloud Services → Cloud Services Graveyard