Closed
Bug 1415725
Opened 7 years ago
Closed 7 years ago
Switch to C5 AWS instances
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gps, Assigned: gps)
References
(Blocks 2 open bugs)
Details
Amazon announced C5 AWS instance types this week:

https://aws.amazon.com/about-aws/whats-new/2017/11/introducing-amazon-ec2-c5-instances-the-next-generation-of-compute-optimized-instances/
https://aws.amazon.com/blogs/aws/now-available-compute-intensive-c5-instances-for-amazon-ec2/

These are Skylake-based Xeons. Amazon claims a 25% price/performance benefit over C4's. If we replace all our AWS workers with C5's, everything should be faster and cheaper. Firefox build tasks should hopefully speed up by a few minutes across the board. PGO builds should speed up significantly due to the higher clock speeds of these Skylake Xeons. I wouldn't be surprised to see a 10+ minute win there.

Per the blog posts, it looks like we'll need to roll new AMIs to support C5's. That will require some TC platform support.

Needinfo on garndt so it shows up on his radar.
Flags: needinfo?(garndt)
Assignee
Comment 1•7 years ago
I created some instances tonight and did some crude comparisons. Sample size of 1, but better than no measurements. Operations were performed on a 200 GB SSD EBS gp2 volume formatted with ext4, using an Ubuntu 17.10 AMI. No Docker. Used `mach bootstrap` and system packages (e.g. gcc 7.2, llvm/clang 5.0 for bindgen). All times are wall clock unless reported otherwise.

$ hg clone -U https://hg.mozilla.org/mozilla-unified firefox

c4.4xlarge:  64.275s
c5.4xlarge:  50.232s
c5.18xlarge: 48.828s

$ hg update central # revision 933f9cd9b3b9

c4.4xlarge:  19.980s
c5.4xlarge:  16.061s
c5.18xlarge: 12.794s

$ ./mach configure

c4.4xlarge:  22.428s
c5.4xlarge:  19.881s
c5.18xlarge: 20.962s

$ ./mach build # after configure

c4.4xlarge: real 20m1.067s; user 246m1.867s; sys 10m15.577s
  Overall system resources - Wall time: 1199s; CPU: 80%; Read bytes: 880640; Write bytes: 8272633856; Read time: 0; Write time: 202152
c5.4xlarge: real 16m27.772s; user 202m28.660s; sys 8m37.696s
  Overall system resources - Wall time: 987s; CPU: 80%; Read bytes: 2277376; Write bytes: 7847968768; Read time: 44; Write time: 147136
c5.18xlarge: real 8m18.963s; user 213m39.447s; sys 12m25.144s
  Overall system resources - Wall time: 497s; CPU: 38%; Read bytes: 1761280; Write bytes: 9741787136; Read time: 24; Write time: 239804

$ ./mach build # --disable-stylo, after configure

c4.4xlarge: real 16m13.741s; user 234m1.152s; sys 10m5.434s
  Overall system resources - Wall time: 972s; CPU: 94%; Read bytes: 0; Write bytes: 7148986368; Read time: 0; Write time: 158932
c5.4xlarge: real 13m33.330s; user 192m48.991s; sys 8m28.318s
  Overall system resources - Wall time: 812s; CPU: 93%; Read bytes: 8192; Write bytes: 7172489216; Read time: 0; Write time: 126424
c5.18xlarge: real 5m15.341s; user 206m4.466s; sys 12m16.506s
  Overall system resources - Wall time: 314s; CPU: 58%; Read bytes: 8192; Write bytes: 7415091200; Read time: 0; Write time: 158116

$ ./mach build # --disable-stylo --disable-webrender, after configure

c4.4xlarge: real 15m50.554s; user 231m21.840s; sys 10m4.136s
  Overall system resources - Wall time: 949s; CPU: 96%; Read bytes: 323584; Write bytes: 6778892288; Read time: 4; Write time: 151200
c5.4xlarge: real 13m18.626s; user 190m51.830s; sys 8m26.482s
  Overall system resources - Wall time: 798s; CPU: 94%; Read bytes: 12288; Write bytes: 6755299328; Read time: 0; Write time: 121776
c5.18xlarge: real 4m21.206s; user 203m15.109s; sys 12m19.041s
  Overall system resources - Wall time: 260s; CPU: 69%; Read bytes: 0; Write bytes: 7246147584; Read time: 0; Write time: 155848

$ ./mach build # sccache enabled, fresh cache, after configure

c5.18xlarge: real 8m54.079s; user 5m0.639s; sys 0m44.794s
  Overall system resources - Wall time: 532s; CPU: 38%; Read bytes: 16384; Write bytes: 12460142592; Read time: 0; Write time: 332032

$ ./mach build # sccache enabled, populated cache, after configure

c5.18xlarge: real 2m14.839s; user 3m56.964s; sys 0m37.009s
  Overall system resources - Wall time: 134s; CPU: 11%; Read bytes: 8192; Write bytes: 7984001024; Read time: 4; Write time: 247308

As we can see, the equivalent c5 is a bit faster: ~3.5 minutes faster for a regular `mach build`. And the c5's are cheaper: on-demand pricing in usw2 is currently $0.796/hr for the c4.4xlarge versus $0.680/hr for the c5.4xlarge. The c5's are a win-win.

The c5.18xlarge is just insane. It took ~18s to get through export. It finished C++ compiling around the 3:30 mark AFAICT. The rest of a regular build (~half the wall time) was Rust. That's why the --disable-stylo/--disable-webrender builds were significantly faster. At only ~38% CPU utilization for the default build configuration, it is probably a waste for us to use this instance type for Firefox builds. But once Rust isn't the long pole and we can get closer to 100% CPU utilization...

Also, apparently we may not need special AMIs for the c5's after all. According to the "Which operating systems/AMIs are supported on C5 Instances" question at https://aws.amazon.com/ec2/faqs/, pretty much all modern AMIs support the c5's. I was able to use a standard Ubuntu AMI from https://cloud-images.ubuntu.com/locator/ec2/. So I would think switching to the c5's should be pretty turnkey. It may not even need that much cooperation from TC people - just someone with permissions to modify workers in the AWS provisioner...
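A quick back-of-the-envelope sanity check on the "win-win" claim, using only the on-demand prices and wall times measured above (assumption: on-demand pricing, ignoring EBS cost and billing granularity):

```python
# Cost per full `mach build` from the numbers in this comment:
# c4.4xlarge: $0.796/hr, 20m1s wall; c5.4xlarge: $0.680/hr, 16m28s wall.
instances = {
    # name: (USD per hour, build wall time in seconds)
    "c4.4xlarge": (0.796, 20 * 60 + 1),
    "c5.4xlarge": (0.680, 16 * 60 + 28),
}

for name, (rate, secs) in instances.items():
    cost = rate * secs / 3600
    print(f"{name}: ${cost:.3f} per build ({secs}s)")
```

So the c5.4xlarge is both ~3.5 minutes faster and roughly 30% cheaper per build at on-demand rates.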
Comment 2•7 years ago
As you've mentioned, if this is working with the newer Ubuntu AMIs, then swapping out our base AMI should hopefully be rather painless.

While the on-demand prices show a decrease for the c5.4x over the c4.4x, the spot market does not reflect this when I look at the spot pricing history, and we currently only utilize the spot market (likely to change in 2018). The c5's are roughly $0.03/hr more in us-east-1, and looking at our usage from last month, that would not be a significant change month over month. Right now the market seems a little volatile for c5's, but that is most likely because it's a new platform and people are testing the waters; it should stabilize over time.

Right now we will be limiting ourselves to provisioning only in us-east-1. Its AZs are overall less volatile than us-west-2's, and c5's are not offered in us-west-1 and eu-central-1.
Flags: needinfo?(garndt)
Assignee
Comment 3•7 years ago
Buried in comment #1 is the fact that c5's should be supported on newer AMIs. So this might all "just work."

Regarding the spot pricing, yeah, that's a bit unfortunate. Hopefully it calms down soon. Even if the c5's are slightly more expensive, they should be a net win because the instances are faster.

The AWS provisioner does support multiple EC2 instance types per worker. Today, the gecko-<N>-b-linux workers are both c4.4xlarge and m4.4xlarge, for example. Once we know the c5's work, we should be able to throw those instance types into the worker definition. We establish a max bid price in the worker definition, so if c5's get too costly, presumably we'll stop provisioning them and go with the c4's and m4's.

What I'm not sure about is the algorithm the provisioner uses to pick an instance type. It might require tweaking the worker definition so the decision to use a c5 takes into account that it is a faster instance and thus worth paying a few pennies more for.
Comment 4•7 years ago
John, bringing this to your attention since you could comment directly about the provisioning algorithm and how this will play out.
Flags: needinfo?(jhford)
Comment 5•7 years ago
We can use what's called the utility factor to skew towards C5s. The bid we make is roughly $ec2-hourly-rate * capacity / utilityFactor. This can be set per instance type. If we set the C5 utility factor to 1.2 instead of 1, we will pay up to 20% more for a C5.

This functionality has been around for quite some time, so it should work well for us. Otherwise, there's nothing in the provisioner that would be impacted by a new instance type being available.
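One way to read the formula above: the provisioner compares candidate instance types on utility-normalized price, so a utility factor of 1.2 makes an instance look 20% cheaper and we will accept up to a 20% premium for it. A minimal sketch under that reading (function names and comparison logic are illustrative, not the actual aws-provisioner code):

```python
# Sketch of utility-factor bidding as described in this comment.
# capacity = how many tasks an instance can run concurrently.

def normalized_price(spot_price, capacity, utility_factor):
    # Divide by utility: higher-utility instances look cheaper, so we
    # are willing to pay proportionally more for them in real dollars.
    return spot_price * capacity / utility_factor

def choose(candidates):
    # Pick the candidate with the lowest utility-normalized price.
    return min(candidates, key=lambda c: normalized_price(*c[1:]))

candidates = [
    # (name, spot price USD/hr, capacity, utility factor) - prices hypothetical
    ("c4.4xlarge", 0.30, 1, 1.0),
    ("c5.4xlarge", 0.33, 1, 1.2),  # ~10% pricier, but 20% utility bonus
]
print(choose(candidates)[0])  # → c5.4xlarge
```

Here the c5 wins despite a higher spot price because 0.33 / 1.2 = 0.275 beats the c4's 0.30.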
Flags: needinfo?(jhford)
Assignee
Comment 6•7 years ago
I created a one-off worker type (https://tools.taskcluster.net/aws-provisioner/gecko-1-b-linux-gps/) configured to use a c5.4xlarge. Aside from the worker name and spot bid price max, its worker definition is copied from gecko-1-b-linux. I then hacked up taskgraph to use this worker (https://hg.mozilla.org/try/rev/769625b1b7c5314d7a28af9a1df5af6ffba99b9d) and pushed that to Try.

According to dustin, the provisioner choked:

Nov 17 18:48:44 ec2-manager app/web.1: InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.

Looks like we'll need a new AMI for the c5's after all. I couldn't find a record of the AMI being used in the public domain. My guess is it is produced by the TaskCluster team.

jonas: are there any newer AMIs we could use for docker-worker? https://aws.amazon.com/ec2/faqs/ has more info about OS compatibility.
Flags: needinfo?(jopsen)
Comment 7•7 years ago
AMI generation for docker-worker is here -- https://github.com/taskcluster/docker-worker/tree/master/deploy
Assignee
Comment 8•7 years ago
Thanks for teaching me to fish! https://github.com/taskcluster/docker-worker/pull/337 Someone with privileges to build new AMIs will need to complete the process though. I reckon we can discuss details in the PR.
Flags: needinfo?(jopsen)
Assignee
Comment 9•7 years ago
The docker-worker changes to support c5 instances (and to move from AUFS to overlayfs) have landed!

https://tools.taskcluster.net/task-inspector/#HOHw-WaHQMe3LXHvR0RBdQ is a Firefox build task running on a c5.4xlarge. That was triggered from a Try push using a custom worker type configured to use c5s.

So it appears we "just" need to update existing worker definitions to add c5 instances to the mix. I'll try to get that ball rolling later today...
Assignee
Comment 10•7 years ago
Spot pricing reveals that c5.4xlarge has stabilized since ~November 15. It still periodically spikes, but the majority of the time it seems similar to or cheaper than c4.4xlarge prices.

I performed a Try push using a custom worker type using c5.4xlarge. Everything seems to have "just worked." So I'm going forward with updating worker configs to start utilizing c5s.

gecko-1-b-linux, gecko-1-b-macosx64, gecko-2-b-linux, and gecko-3-b-linux are now configured to use c5.4xlarge instances with utility 1.2. c4.4xlarge and m4.4xlarge are still configured with utility 1.0. It will likely take some time before we start seeing c5 instances provisioned in the wild.
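For reference, the instance-type portion of a worker definition along these lines might look roughly like the following. This is a hedged sketch: the field names are my best guess at the aws-provisioner worker type schema, not copied from the actual gecko-1-b-linux config.

```json
{
  "instanceTypes": [
    { "instanceType": "c4.4xlarge", "capacity": 1, "utility": 1.0 },
    { "instanceType": "m4.4xlarge", "capacity": 1, "utility": 1.0 },
    { "instanceType": "c5.4xlarge", "capacity": 1, "utility": 1.2 }
  ]
}
```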
Assignee: nobody → gps
Status: NEW → ASSIGNED
Assignee
Comment 11•7 years ago
After automation was pretty happy for a few hours, I went through all gecko-N workers running c4's and enabled c5's across the board, with a utility factor of 1.2 to tell the scheduler we're willing to pay a 20% premium for these instances. This /may/ be too high given our workload. But let's wait for some more Perfherder data to come in to drive any further tweaking.

While I was updating worker definitions, I noticed that the gecko-*-android workers were using 2xlarge instead of 4xlarge instances. Those tasks are pretty heavily CPU bound, so I changed them to 4xlarge. They are now consistent with the other workers.

I've also been looking at Perfherder data. The m4's are consistently a bit slower than the c4's and cost about the same. I changed the utility of the m4's to 0.9 so we prefer them less than the c4's (at utility 1.0). This should result in fewer m4's being provisioned and tasks executing faster overall.
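With the utilities set above (c5: 1.2, c4: 1.0, m4: 0.9), the resulting preference order can be illustrated like this. The $0.30/hr equal spot price and the normalized-price comparison rule are assumptions for illustration, not the provisioner's exact behavior:

```python
# Rank instance types by utility-normalized price, assuming all three
# happen to have the same $0.30/hr spot price. Higher utility divides
# the price, so higher-utility types sort first.
utilities = {"c5.4xlarge": 1.2, "c4.4xlarge": 1.0, "m4.4xlarge": 0.9}
spot = 0.30

ranked = sorted(utilities, key=lambda t: spot / utilities[t])
print(ranked)  # → ['c5.4xlarge', 'c4.4xlarge', 'm4.4xlarge']
```

At equal prices we prefer c5 over c4 over m4; an m4 would only win if its spot price dropped more than ~10% below the c4's.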
Assignee
Comment 12•7 years ago
We had to back out the c5 changes due to an issue with older instance types with the AMI changes we made to move away from AUFS. We'll be moving back to c5 in the next ~24 hours.
Assignee
Comment 13•7 years ago
We're still on the c4 and m4 instances because of the Greg Arndt Memory Outage last week. We didn't want to pile on any other changes to CI while we were dealing with instance provisioning issues. And wcosta is still dealing with a potential race condition in instance startup. We appear to be on track for using the c5's in the next day or two.
Assignee
Comment 14•7 years ago
I just mass updated worker definitions to add c5 and m5 instances again, with a utility of 1.1 on both for now. We'll likely want to tweak the utility a bit, but let's get data first and see what it tells us we should do.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•6 years ago
Product: TaskCluster → Firefox Build System