Closed Bug 1415725 Opened 7 years ago Closed 7 years ago

Switch to C5 AWS instances

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

(Blocks 2 open bugs)

Details

Amazon announced C5 AWS instance types this week:

https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-amazon-ec2-c5-instances-the-next-generation-of-compute-optimized-instances/
https://aws.amazon.com/blogs/aws/now-available-compute-intensive-c5-instances-for-amazon-ec2/

These are Skylake based Xeons. Amazon claims 25% price/performance benefit over C4's. If we replace all our AWS workers with C5's, everything should be faster and cheaper. Firefox build tasks should hopefully speed up by a few minutes across the board. PGO builds should speed up significantly due to higher MHz of these Skylake Xeons. I wouldn't be surprised to see a 10+ minute win there.

Per the blog posts, it looks like we'll need to roll new AMIs to support C5's. That will require some TC platform support.

Needinfo on garndt so it shows up on his radar.
Flags: needinfo?(garndt)
I created some instances tonight and did some crude comparisons. Sample size of 1, but better than no measurements. Operations were performed on a 200 GB gp2 SSD EBS volume formatted with ext4, using an Ubuntu 17.10 AMI. No Docker. Used `mach bootstrap` and system packages (e.g. gcc 7.2, llvm/clang 5.0 for bindgen). All times are wall time unless reported otherwise.

$ hg clone -U https://hg.mozilla.org/mozilla-unified firefox
c4.4xlarge:  64.275s
c5.4xlarge:  50.232s
c5.18xlarge: 48.828s

$ hg update central # revision 933f9cd9b3b9
c4.4xlarge:  19.980s
c5.4xlarge:  16.061s
c5.18xlarge: 12.794s

$ ./mach configure
c4.4xlarge:  22.428s
c5.4xlarge:  19.881s
c5.18xlarge: 20.962s

$ ./mach build # after configure
c4.4xlarge
real    20m1.067s
user    246m1.867s
sys     10m15.577s
Overall system resources - Wall time: 1199s; CPU: 80%; Read bytes: 880640; Write bytes: 8272633856; Read time: 0; Write time: 202152

c5.4xlarge
real    16m27.772s
user    202m28.660s
sys     8m37.696s
Overall system resources - Wall time: 987s; CPU: 80%; Read bytes: 2277376; Write bytes: 7847968768; Read time: 44; Write time: 147136

c5.18xlarge
real    8m18.963s
user    213m39.447s
sys     12m25.144s
Overall system resources - Wall time: 497s; CPU: 38%; Read bytes: 1761280; Write bytes: 9741787136; Read time: 24; Write time: 239804

$ ./mach build # --disable-stylo, after configure
c4.4xlarge
real    16m13.741s
user    234m1.152s
sys     10m5.434s
Overall system resources - Wall time: 972s; CPU: 94%; Read bytes: 0; Write bytes: 7148986368; Read time: 0; Write time: 158932

c5.4xlarge
real    13m33.330s
user    192m48.991s
sys     8m28.318s
Overall system resources - Wall time: 812s; CPU: 93%; Read bytes: 8192; Write bytes: 7172489216; Read time: 0; Write time: 126424

c5.18xlarge
real    5m15.341s
user    206m4.466s
sys     12m16.506s
Overall system resources - Wall time: 314s; CPU: 58%; Read bytes: 8192; Write bytes: 7415091200; Read time: 0; Write time: 158116

$ ./mach build # --disable-stylo --disable-webrender, after configure
c4.4xlarge
real    15m50.554s
user    231m21.840s
sys     10m4.136s
Overall system resources - Wall time: 949s; CPU: 96%; Read bytes: 323584; Write bytes: 6778892288; Read time: 4; Write time: 151200

c5.4xlarge
real    13m18.626s
user    190m51.830s
sys     8m26.482s
Overall system resources - Wall time: 798s; CPU: 94%; Read bytes: 12288; Write bytes: 6755299328; Read time: 0; Write time: 121776

c5.18xlarge
real    4m21.206s
user    203m15.109s
sys     12m19.041s
Overall system resources - Wall time: 260s; CPU: 69%; Read bytes: 0; Write bytes: 7246147584; Read time: 0; Write time: 155848

$ ./mach build # sccache enabled, fresh cache, after configure
c5.18xlarge
real    8m54.079s
user    5m0.639s
sys     0m44.794s
Overall system resources - Wall time: 532s; CPU: 38%; Read bytes: 16384; Write bytes: 12460142592; Read time: 0; Write time: 332032

$ ./mach build # sccache enabled, populated cache, after configure
c5.18xlarge
real    2m14.839s
user    3m56.964s
sys     0m37.009s
Overall system resources - Wall time: 134s; CPU: 11%; Read bytes: 8192; Write bytes: 7984001024; Read time: 4; Write time: 247308

As we can see, the equivalent c5 is a bit faster: ~3.5 minutes faster for a regular `mach build`. And, the c5's are cheaper. On demand pricing in usw2 is currently $0.796/hr versus $0.680/hr for the c4.4xlarge and c5.4xlarge, respectively. The c5's are a win-win.
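As a rough sanity check, combining the on-demand prices with the regular `mach build` wall times measured above gives an approximate cost per build (illustrative arithmetic only, not a full cost model; it assumes 100% instance utilization):

```python
# Rough price/performance comparison of c4.4xlarge vs c5.4xlarge using
# the on-demand usw2 prices and `mach build` wall times reported above.

def cost_per_build(hourly_rate, wall_seconds):
    """Approximate on-demand cost of one build, assuming the instance
    is fully occupied by the build for its wall time."""
    return hourly_rate * wall_seconds / 3600.0

# c4.4xlarge: $0.796/hr, 20m1s wall; c5.4xlarge: $0.680/hr, 16m28s wall
c4 = cost_per_build(0.796, 20 * 60 + 1)
c5 = cost_per_build(0.680, 16 * 60 + 28)

print(f"c4.4xlarge: ${c4:.3f}/build")
print(f"c5.4xlarge: ${c5:.3f}/build")
print(f"savings: {100 * (1 - c5 / c4):.0f}%")
```

By this crude measure the c5.4xlarge is roughly 30% cheaper per build on on-demand pricing, on top of finishing sooner.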

The c5.18xlarge is just insane. It took ~18s to get through export. It finished C++ compiling around the 3:30 mark AFAICT. The rest of a regular build (~half the wall time) was Rust. That's why the --disable-stylo/--disable-webrender builds were significantly faster. At only 38% CPU utilization for the default build configuration, it is probably a waste for us to use this instance type for Firefox builds. But once Rust isn't the long pole and we can get closer to 100% CPU utilization...

Also, apparently you may not need special AMIs for the c5's after all. According to the "Which operating systems/AMIs are supported on C5 Instances" question at https://aws.amazon.com/ec2/faqs/, pretty much all modern AMIs support the c5's. I was able to use a standard Ubuntu AMI from https://cloud-images.ubuntu.com/locator/ec2/. So I would think us switching to the c5's should be pretty turnkey. It may not even need that much cooperation from TC people - just someone with permissions to modify workers in the AWS provisioner...
As you've mentioned, if this is working with the newer ubuntu AMIs, then swapping out our base ami should hopefully be rather painless.

While the on-demand prices might show a price decrease for the c5.4xlarge over the c4.4xlarge, the spot market does not reflect this when I look at the spot pricing history, and we currently only utilize the spot market (likely to change in 2018). The c5's are roughly $0.03/hr more in us-east-1, and looking at our usage from last month, that would not be a significant change month over month.

Right now the market seems a little volatile for c5's, but that is most likely because it's a new platform and people are testing the waters; it should stabilize over time. For now we will be limiting ourselves to provisioning only in us-east-1. Overall its AZs are less volatile than us-west-2's, and c5's are not offered in us-west-1 or eu-central-1.
Flags: needinfo?(garndt)
Buried in comment #1 is the fact that c5's should be supported on newer AMIs. So this might all "just work."

Regarding the spot pricing, yeah, that's a bit unfortunate. Hopefully it calms down soon.

Even if the c5's are slightly more expensive, they should be a net win because instances are faster.

The AWS provisioner does support multiple EC2 instance types per worker. Today, the gecko-<N>-b-linux workers are both c4.4xlarge and m4.4xlarge, for example. Once we know the c5's work, we should be able to add those instance types to the worker definition. We establish a max bid price in the worker definition, so if c5's get too costly, presumably we'll stop provisioning them and fall back to the c4's and m4's. What I'm not sure about is the algorithm the provisioner uses to pick an instance type. It might require tweaking the worker definition so the decision to use a c5 takes into account that it is a faster instance and thus worth paying a few pennies more for.
John, bringing this to your attention since you could comment directly about the provisioning algorithm and how this will play out.
Flags: needinfo?(jhford)
We can use what's called the utility factor to skew towards C5s.  The bid we make is roughly $ec2-hourly-rate * capacity / utilityFactor.  This can be set per instance type.  If we set the C5 utility factor to 1.2 instead of 1, we will pay up to 20% more for a C5.  This functionality has been around for quite some time, so it should work well for us.
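The effect of the utility factor can be sketched as follows (an illustrative model, not the actual provisioner code; the function and dict names are hypothetical): dividing each instance type's spot price by its utility factor makes a higher-utility type look cheaper per unit of useful work, so the provisioner is effectively willing to pay up to that premium for it.

```python
# Illustrative sketch of utility-factor-weighted spot bidding.
# A utility factor of 1.2 on the c5 means we treat it as 20% more
# valuable, so it can cost up to 20% more and still win the comparison.

def effective_price(spot_price, capacity, utility_factor):
    """Price per unit of utility-weighted capacity."""
    return spot_price / (capacity * utility_factor)

# Hypothetical spot prices; both types provide capacity 1 per instance.
offers = {
    "c4.4xlarge": {"spot": 0.25, "capacity": 1, "utility": 1.0},
    "c5.4xlarge": {"spot": 0.28, "capacity": 1, "utility": 1.2},
}

best = min(
    offers,
    key=lambda t: effective_price(
        offers[t]["spot"], offers[t]["capacity"], offers[t]["utility"]
    ),
)
print(best)
```

Here the c5 wins despite a ~12% higher spot price, because its utility-weighted price ($0.28 / 1.2 ≈ $0.233) is below the c4's ($0.25).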

Otherwise, there's nothing from the provisioner which would be impacted by a new instance type being available.
Flags: needinfo?(jhford)
I created a one-off worker type (https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux-gps/) configured to use a c5.4xlarge. Aside from the worker name and spot bid price max, its worker definition is copied from gecko-1-b-linux. I then hacked up taskgraph to use this worker (https://hg.mozilla.org/try/rev/769625b1b7c5314d7a28af9a1df5af6ffba99b9d) and pushed that to Try.

According to dustin, the provisioner choked:

Nov 17 18:48:44 ec2-manager app/web.1: InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.

Looks like we'll need a new AMI for the c5's after all. I couldn't find a record of the AMI being used in the public domain. My guess is it is produced by the TaskCluster team.

jonas: are there any newer AMIs we could use for docker-worker? https://aws.amazon.com/ec2/faqs/ has more info about OS compatibility.
Flags: needinfo?(jopsen)
Thanks for teaching me to fish!

https://github.com/taskcluster/docker-worker/pull/337

Someone with privileges to build new AMIs will need to complete the process though. I reckon we can discuss details in the PR.
Flags: needinfo?(jopsen)
The docker-worker changes to support c5 instances (and to move from AUFS to overlayfs) have landed!

https://tools.taskcluster.net/task-inspector/#HOHw-WaHQMe3LXHvR0RBdQ is a Firefox build task running on a c5.4xlarge. That was triggered from a Try push using a custom worker type configured to use c5s.

So, it appears we "just" need to update existing worker definitions to add c5 instances to the mix. I'll try to get that ball rolling later today...
Spot pricing reveals that c5.4xlarge has stabilized since ~November 15. It still periodically spikes, but for the majority of the time it seems similar to or cheaper than c4.4xlarge prices.

I performed a Try push using a custom worker type using c5.4xlarge. Everything seems to have "just worked." So I'm going forward with updating worker configs to start utilizing c5s.

gecko-1-b-linux, gecko-1-b-macosx64, gecko-2-b-linux, and gecko-3-b-linux are now configured to use c5.4xlarge instances with utility 1.2. c4.4xlarge and m4.4xlarge are still configured with utility 1.0. It will likely take some time before we start seeing c5 instances provisioned in the wild.
Assignee: nobody → gps
Status: NEW → ASSIGNED
After automation was pretty happy for a few hours, I went through all gecko-N workers running c4's and enabled c5's across the board.

Utility factor of 1.2 to tell the scheduler we're willing to pay a 20% premium for these instances. This /may/ be too high given our workload. But let's wait for some more Perfherder data to come in to drive any further tweaking.

While I was updating worker definitions, I noticed that the gecko-*-android workers were using 2xlarge instead of 4xlarge. Those tasks are pretty heavily CPU bound. So I changed them to 4xlarge. They are now consistent with other workers.

I've also been looking at Perfherder data. The m4's are consistently a bit slower than the c4's and cost about the same. I changed the utility of the m4's to 0.9 so we prefer them less than the c4 (at utility 1.0). This should result in fewer m4's being provisioned and tasks executing faster overall.
We had to back out the c5 changes due to an issue with older instance types with the AMI changes we made to move away from AUFS. We'll be moving back to c5 in the next ~24 hours.
We're still on the c4 and m4 instances because of the Greg Arndt Memory Outage last week. We didn't want to pile on any other changes to CI while we were dealing with instance provisioning issues. And wcosta is still dealing with a potential race condition in instance startup. We appear to be on track for using the c5's in the next day or two.
I just mass updated worker definitions to add c5 and m5 instances again. Utility of 1.1 on both for now. We'll likely want to tweak the utility a bit. But let's get data first and see what that tells us we should do.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: TaskCluster → Firefox Build System
Blocks: 1455706