Closed Bug 1306167 Opened 8 years ago Closed 8 years ago

provisioning failing because Client.VolumeLimitExceeded: Volume limit exceeded

Categories

(Taskcluster :: Services, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

      No description provided.
This has to do with EBS volume usage

usage:

windows builders: 60 GB gp2 root volume
linux builders: 8 GB gp2 root volume, 120 GB standard additional volume

limits:

gp2: 200TB / region
standard: 20TB / region

So it's a pretty easy guess that the standard volumes are hitting the limit -- at 120 GB each, 20 TB is only about 170 instances.  I count 150 running in us-east-1, but it's possible there are other things using standard storage too.
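To check what else is consuming the 'standard' quota, something like this boto3 sketch would tally usage per region (a hypothetical helper, not part of the provisioner; assumes credentials with ec2:DescribeVolumes):

  # Tally 'standard' (magnetic) EBS usage in a region.
  import boto3

  def standard_volume_usage(region="us-east-1"):
      ec2 = boto3.client("ec2", region_name=region)
      paginator = ec2.get_paginator("describe_volumes")
      count, total_gib = 0, 0
      for page in paginator.paginate(
              Filters=[{"Name": "volume-type", "Values": ["standard"]}]):
          for vol in page["Volumes"]:
              count += 1
              total_gib += vol["Size"]  # Size is reported in GiB
      return count, total_gib

  count, total_gib = standard_volume_usage()
  print(f"{count} standard volumes, {total_gib} GiB used of the ~20 TiB limit")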

Remedies:
 - ask for a limit increase
 - reduce disk size in the workerType def
   - and possibly kill running instances
I think we *really* don't want to be using standard.

  https://aws.amazon.com/ebs/previous-generation/
  http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

"Use Cases	Workloads where data is infrequently accessed"
"Magnetic volumes are backed by magnetic drives and are suited for workloads where data is accessed infrequently, and scenarios where low-cost storage for small volume sizes is important. These volumes deliver approximately 100 IOPS on average, with burst capability of up to hundreds of IOPS, and they can range in size from 1 GiB to 1 TiB."
$0.05/GB-month
$0.05/million I/O

So if we're using 20 TB for a month, that's $1,000 in storage alone, but the I/O costs could get pretty high (I don't have an I/O estimate).

gp2 is $0.10/GB-month, with no I/O costs
st1 is $0.045/GB-month, with no I/O costs

So gp2 is very likely to be cheaper than standard, since I'm sure we do a lot more I/O than that.  st1 would definitely be cheaper, but it's on HDDs and the AWS pages seem to suggest it's best suited to streaming, which I take to mean sequential access.
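As a rough sanity check on that comparison, here's a back-of-the-envelope sketch; the I/O count below is a made-up placeholder, not a measurement:

  # Rough monthly cost comparison for ~20 TB of build scratch space.
  GB_IN_USE = 20_000               # ~20 TB across all builders
  ASSUMED_IO_REQUESTS = 40e9       # hypothetical I/O request count per month

  standard = GB_IN_USE * 0.05 + (ASSUMED_IO_REQUESTS / 1e6) * 0.05
  gp2 = GB_IN_USE * 0.10           # flat per-GB price, no I/O charge

  print(f"standard: ${standard:,.0f}/mo   gp2: ${gp2:,.0f}/mo")

Under that placeholder, standard already comes out more expensive; the break-even is around 20 billion I/O requests per month across the fleet.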

Jonas is changing all of the existing instances to use gp2 volumes right now.
We were using magnetic storage (the old 'standard' type) because VolumeType wasn't specified.

I've changed gecko-b-*-linux to use:
  "VolumeType": "gp2"

@gps, I think you configured these workerTypes with EBS, I assume we wanted to use gp2, right?
I think this happened because we've locked the API version to 2014-01-01:
https://github.com/taskcluster/aws-provisioner/blob/f20b4fbdb136caed17151b3cacca166b727a0a31/config.yml#L60


Otherwise, we would probably have gotten gp2, since the docs say it's the default type.
Flags: needinfo?(gps)
I'm, uh, surprised we weren't using gp2. We should definitely use gp2 over st1.

I don't remember making a gp2 versus st1 decision when configuring these worker types. I think everyone assumed that the default EBS volumes were just fine. We did see a speedup from moving from c3.2xlarge to c4.4xlarge. We thought the I/O slowness was mostly due to bug 1291940; I certainly have enough data from my own instances (outside of TC, which were almost certainly using gp2) to confirm that. I guess st1 contributed as well :/
Flags: needinfo?(gps)
...and the pending count is now zero!

Items for followup:
 - look for other instance types using 'standard' volumes
 - (new bug) try using a newer API version in the provisioner.  The docs list "gp2" as the default volume type, but perhaps
   at API version 2014-01-01 what is now called "standard" was the default; we may be missing other advantages of a newer API version, too!
   https://github.com/taskcluster/aws-provisioner/blob/master/config.yml#L60

I closed the limit-increase request before AWS replied, so the volume limits have not changed.
NOTE: st1 is not the same as standard!!!
jmaher: this may result in some Perfherder alerts. Since this change is being made out of band from source control, the Perfherder alerts will get assigned to a "random" commit.
This appeared to make Linux PGO builds 2x faster: https://treeherder.mozilla.org/perf.html#/alerts?id=3452.

Slow I/O kills.
           This is your brain.

T h i s  i s  y o u r  b r a i n  o n  E B S.

                               Any questions?
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: AWS-Provisioner → Services