Closed Bug 1309901 Opened 8 years ago Closed 8 years ago

Use spot instances for ad-hoc clusters

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: rvitillo, Assigned: frank)

References

Details

User Story

We should use spot instances for Spark worker nodes of ad-hoc clusters. I can confirm that a running Spark job handles the death of a worker node without any particular user-visible effects besides a small lag.

Looking at the spot price history for c3.4xlarge in our region, the spot price only very rarely exceeded the on-demand price, so I suggest we start with a relatively high bid price and go from there. I think the cost reduction outweighs the harm caused by a small chance of losing all worker nodes.
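As a rough illustration of the price-history check described above, here is a minimal sketch that computes how often a series of spot price samples exceeded the on-demand price. The function name, the sample values, and the on-demand figure are all hypothetical and not taken from real AWS data. Samples are weighted equally, whereas a real analysis would probably weight each price by how long it was in effect.

```python
from typing import Iterable


def fraction_above_on_demand(spot_prices: Iterable[float],
                             on_demand_price: float) -> float:
    """Return the share of spot price samples above the on-demand price."""
    prices = list(spot_prices)
    if not prices:
        raise ValueError("no spot price samples given")
    above = sum(1 for price in prices if price > on_demand_price)
    return above / len(prices)


# Hypothetical samples in USD per hour (illustrative only, not real history).
samples = [0.15, 0.18, 0.16, 0.95, 0.17]
# Hypothetical on-demand price used purely for the comparison.
print(fraction_above_on_demand(samples, on_demand_price=0.84))
```

A low fraction would support bidding at or near the on-demand price, since the bid would then rarely be outbid.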

Attachments

(2 files)

[telemetry-analysis-service] fbertsch:spots > mozilla:master 8 years ago GitHub Autolander Bot 61 bytes, text/x-github-pull-request		Details \| Review
[telemetry-analysis-service] fbertsch:spots > mozilla:master 8 years ago GitHub Autolander Bot 61 bytes, text/x-github-pull-request		Details \| Review

Roberto Agostino Vitillo (:rvitillo)

Reporter

Description

•

8 years ago

      No description provided.

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

8 years ago

User Story: (updated)

Frank Bertsch [:frank]

Assignee

Updated

•

8 years ago

Assignee: nobody → fbertsch

Frank Bertsch [:frank]

Assignee

Comment 1

•

8 years ago

Initially I'm going to just make all spot instances TASK instances (they do not run the HDFS DataNode daemon, which we don't use anyway, right?).

We could also consider having 1 master, with the next X nodes on-demand as well, and the remaining nodes (num_instances - X - 1) as spot instances.

I'm going to set the initial bid to $0.84, but this is configurable.
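The layout described in this comment could be sketched roughly as follows. This is not the code from the attached pull request: the function name and parameters are hypothetical, and putting the X on-demand workers in a CORE group is an assumption. The dictionary keys follow the `InstanceGroups` structure accepted by boto3's EMR `run_job_flow`, but the sketch only builds the data and makes no AWS call.

```python
def build_instance_groups(num_instances, on_demand_workers=0,
                          instance_type="c3.4xlarge", bid_price="0.84"):
    """Build an EMR-style instance group list (hypothetical helper).

    One on-demand master, `on_demand_workers` on-demand CORE nodes (the
    CORE role is an assumption), and the remaining
    num_instances - on_demand_workers - 1 nodes as spot TASK instances.
    """
    if num_instances < 1:
        raise ValueError("a cluster needs at least the master node")
    workers = num_instances - 1
    if not 0 <= on_demand_workers <= workers:
        raise ValueError("on_demand_workers must be between 0 and num_instances - 1")

    groups = [{
        "Name": "master",
        "InstanceRole": "MASTER",
        "Market": "ON_DEMAND",
        "InstanceType": instance_type,
        "InstanceCount": 1,
    }]
    if on_demand_workers:
        groups.append({
            "Name": "on-demand workers",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": on_demand_workers,
        })
    spot_workers = workers - on_demand_workers
    if spot_workers:
        groups.append({
            "Name": "spot workers",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": bid_price,  # maximum hourly price in USD, as a string
            "InstanceType": instance_type,
            "InstanceCount": spot_workers,
        })
    return groups


# Example: 10 nodes in total, 2 on-demand workers, 7 spot TASK workers.
for group in build_instance_groups(10, on_demand_workers=2):
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

With `on_demand_workers=0` this reduces to the initial plan of one master plus spot TASK instances only.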

GitHub Autolander Bot

Comment 2

•

8 years ago

Attached file [telemetry-analysis-service] fbertsch:spots > mozilla:master — Details

Frank Bertsch [:frank]

Assignee

Updated

•

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 3

•

8 years ago

How much longer does it take on average for a cluster to bootstrap now? Note that cluster bootstrap times can afaik be highly volatile with spot instances. How do you plan to deal with that if users complain that clusters don't start up?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Roberto Agostino Vitillo (:rvitillo)

Reporter

Updated

•

8 years ago

Flags: needinfo?(fbertsch)

Frank Bertsch [:frank]

Assignee

Comment 4

•

8 years ago

Unfortunately this is hard to test, because it is so variable. I've launched a few clusters and they all bootstrapped in about the same amount of time as I've seen before, e.g. "Waiting" status within 15 minutes. 

In general, I don't think we're going to run into an issue of getting kicked off machines or unable to start spot machines, since our bid is so high right now. However, we may run into periods where it happens. What if we have a config that we can use to switch off using spot clusters?

Flags: needinfo?(fbertsch)

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 5

•

8 years ago

(In reply to Frank Bertsch [:frank] from comment #4)
> Unfortunately this is hard to test, because it is so variable. I've launched
> a few clusters and they all bootstrapped in about the same amount of time as
> I've seen before, e.g. "Waiting" status within 15 minutes. 
> 
> In general, I don't think we're going to run into an issue of getting kicked
> off machines or unable to start spot machines, since our bid is so high
> right now. However, we may run into periods where it happens. What if we
> have a config that we can use to switch off using spot clusters?

Yeah, that makes sense. It would be nice to have a checkbox to disable spot instances when launching a new cluster.

Frank Bertsch [:frank]

Assignee

Comment 6

•

8 years ago

I was hesitant to let users decide not to use spot instances, since so many people consider their workloads the most important and may not understand that losing a spot instance does not mean failures for their jobs. I was thinking this would be more an admin thing, though I'm open to considering otherwise.

Roberto Agostino Vitillo (:rvitillo)

Reporter

Comment 7

•

8 years ago

(In reply to Frank Bertsch [:frank] from comment #6)
> I was hesitant to let users decide not to use spot instances, since so many
> people consider their workloads the most important and may not understand
> that losing a spot instance does not mean failures for their jobs. I was
> thinking this would be more an admin thing, though I'm open to considering
> otherwise.

Agreed.

Frank Bertsch [:frank]

Assignee

Comment 8

•

8 years ago

(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #7)
> (In reply to Frank Bertsch [:frank] from comment #6)
> > I was hesitant to let users decide not to use spot instances, since so many
> > people consider their workloads the most important and may not understand
> > that losing a spot instance does not mean failures for their jobs. I was
> > thinking this would be more an admin thing, though I'm open to considering
> > otherwise.
> 
> Agreed.

Okay, I'm going to implement it this way, with a switch just for admins.
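An admin-only switch of this kind might look something like the sketch below. The environment variable name `USE_SPOT_INSTANCES`, the helper names, and the default bid are hypothetical and are not taken from the attached pull request. The point is only that the choice of market lives in deployment configuration rather than in the cluster launch form.

```python
import os

# Hypothetical setting name; an admin would set it in the service's environment.
SPOT_SETTING = "USE_SPOT_INSTANCES"
TRUE_VALUES = {"1", "true", "yes", "on"}


def spot_enabled(environ=None):
    """Return True unless an admin has switched spot instances off."""
    environ = os.environ if environ is None else environ
    return environ.get(SPOT_SETTING, "true").strip().lower() in TRUE_VALUES


def worker_group(count, environ=None, bid_price="0.84"):
    """Describe the worker group as spot or on-demand, per the admin switch."""
    group = {"InstanceRole": "TASK", "InstanceCount": count}
    if spot_enabled(environ):
        group["Market"] = "SPOT"
        group["BidPrice"] = bid_price  # only meaningful for spot requests
    else:
        group["Market"] = "ON_DEMAND"
    return group


# Spot is the default; an admin can fall back to on-demand workers.
print(worker_group(4, environ={}))
print(worker_group(4, environ={SPOT_SETTING: "false"}))
```

Defaulting to spot keeps the cost saving in place, while the fallback covers periods when spot capacity is unavailable or clusters are slow to start.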

GitHub Autolander Bot

Comment 9

•

8 years ago

Attached file [telemetry-analysis-service] fbertsch:spots > mozilla:master — Details

Thomas Huelbert

Updated

•

8 years ago

Points: --- → 3

Priority: -- → P1

Jannis Leidel [:jezdez]

Updated

•

8 years ago

Status: REOPENED → RESOLVED

Closed: 8 years ago → 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

6 years ago

Product: Cloud Services → Cloud Services Graveyard


Categories

(Cloud Services Graveyard :: Metrics: Pipeline, defect, P1)
