Closed Bug 1469398 Opened 6 years ago Closed 6 years ago

Windows AMI support for c5 instances

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Amazon c5 and m5 instances have faster CPUs and are cheaper. The new c5d and m5d variants add local NVMe storage whose I/O is substantially faster than what you can get via EBS. We want to use these 5th generation instances throughout CI, and on Linux we already do.

However, if you attempt to run a c5 or m5 instance with the latest Windows AMIs we use for build tasks (I believe these are based on Windows Server), you receive the following error message:

  InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.

This error is likely due to the AMI not having the ENA driver installed. Enabling ENA support should hopefully be as simple as installing the driver from the zip file linked from https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/enhanced-networking-ena.html.
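
If I understand the docs correctly, installing the driver isn't quite the whole story: the AMI (or the instance it gets created from) also needs its enaSupport attribute set. A rough boto3 sketch of checking and flipping that flag (the IDs below are placeholders, not our real AMIs/instances):

  # Rough sketch (boto3). AMI and instance IDs are placeholders.
  import boto3

  ec2 = boto3.client("ec2")

  # Does an existing build AMI already advertise ENA support?
  image = ec2.describe_images(ImageIds=["ami-00000000000000000"])["Images"][0]
  print("EnaSupport:", image.get("EnaSupport", False))

  # After installing the ENA driver on a (stopped) builder instance, flag the
  # instance for ENA so AMIs created from it inherit the attribute.
  ec2.modify_instance_attribute(
      InstanceId="i-00000000000000000",
      EnaSupport={"Value": True},
  )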

Once instances have ENA support, it is quite possible that they'll fail to mount EBS volumes, because c5 instances expose EBS volumes as NVMe devices (at least on Linux; I'm not sure about Windows).
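
If that turns out to be an issue, one way to see what's going on (on Linux at least) is to read the NVMe controllers' serial numbers, which, if my understanding of the Nitro NVMe driver is right, carry the EBS volume IDs. A rough sketch, with that assumption:

  # Sketch for Linux: map NVMe controllers back to EBS volume IDs. Assumes the
  # NVMe device serial carries the volume id ("vol..."), which is my reading of
  # the AWS docs; I don't know what Windows reports.
  import glob
  import os

  for serial_path in sorted(glob.glob("/sys/class/nvme/nvme*/serial")):
      ctrl = os.path.basename(os.path.dirname(serial_path))
      with open(serial_path) as f:
          serial = f.read().strip()
      # EBS volumes should show a serial like "vol0123456789abcdef"; c5d/m5d
      # instance-store devices report something else.
      print(ctrl, serial)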

Anyway, I'm filing this bug against Taskcluster Operations so we have a dedicated bug on file to support the 5th generation instance types with our Windows AMIs.

needinfo to fubar to triage the request.
Flags: needinfo?(klibby)
Adding needinfo on coop in case triage falls on his team. (Ownership of OpenCloudConfig is a bit nebulous.)

Not running c5 and c5d instances on Windows is likely costing us *minutes* per build in Firefox CI due to I/O. That alone likely translates to a few thousand dollars in EC2 costs per year due to extra machine time. When you factor in that c5 instances are cheaper than c4 instances *and* have faster CPUs, we're likely talking several thousand to $10,000+ in annual savings from this switch. And there's a good chance this bug is the lone blocker.
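
A back-of-envelope version of that estimate, just to show the shape of the math; every number below is an assumption, not a measurement:

  # Back-of-envelope only: every number below is an assumption, not a measurement.
  builds_per_day = 500           # assumed Windows build tasks per day across trees
  minutes_saved_per_build = 2    # assumed CPU + I/O win per build
  cost_per_instance_hour = 0.40  # assumed average 4xlarge spot price in USD

  hours_saved_per_year = builds_per_day * minutes_saved_per_build / 60 * 365
  dollars_saved_per_year = hours_saved_per_year * cost_per_instance_hour
  print(f"~{hours_saved_per_year:,.0f} instance-hours/year, ~${dollars_saved_per_year:,.0f}/year")
  # -> roughly 6,000 instance-hours and a couple of thousand dollars per year,
  #    before counting the lower per-hour price of c5 vs c4.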
Flags: needinfo?(coop)
It's not coop, it's on us, sorry. We're working on a new process to repeatably build AMIs, but have been stalled because of issues in MDC2.
Flags: needinfo?(coop)
Thanks for the update!

Correct me if I'm wrong, but aren't there 2 phases to AMI generation (base and top)? If OCC runs as part of the top-most AMI layer, presumably it could install the ENA driver there. As long as AMI generation isn't running on a c5 instance (it shouldn't need to), I would think we could install the ENA driver in the top-most AMI and this may "just work."
Rob, what do you think? Can you take a look and see if you can make this go in our current state?
Flags: needinfo?(klibby) → needinfo?(rthijssen)
yep. i'll have a go.
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Component: Operations → Relops: OpenCloudConfig
Flags: needinfo?(rthijssen)
Product: Taskcluster → Infrastructure & Operations
QA Contact: rthijssen
if we've no issues over the next few hours, i'll deploy to l3 builders tomorrow morning (utc).
https://github.com/mozilla-releng/OpenCloudConfig/pull/159
(In reply to Rob Thijssen (:grenade UTC+2) from comment #9)
> if we've no issues over the next few hours, i'll deploy to l3 builders
> tomorrow morning (utc).
> https://github.com/mozilla-releng/OpenCloudConfig/pull/159

Reminder that the US is on holiday tomorrow; I don't think that should necessarily change your plan, but it won't be a normal weekday in many regards.
we had problems with provisioning. it seems we can't get enough c5.4xlarge spot instances (this instance type is also used by the linux builders). i have reverted the aws provisioner config for gecko-1-b-win2012 to c4.4xlarge instance types (while keeping the new amis with the ena drivers installed) until the provisioning issue is resolved.
On the Linux builders, we define a mix of instance types with different utility factors:

c5d.4xlarge: 1.3
m5d.4xlarge: 1.2
c5.4xlarge: 1.1
m5.4xlarge: 1.1
c4.4xlarge: 1.0
m4.4xlarge: 0.9

From https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1444817,1,2&series=autoland,1582195,1,2&series=autoland,1691781,1,2&series=autoland,1444828,1,2&series=autoland,1618745,1,2&series=autoland,1697674,1,2, you can see that a healthy mix of these instances is provisioned in the wild.

Using a mix of instance types gives the provisioner access to additional instance types in case one instance type is not available or is prohibitively expensive. May I suggest applying this approach to the Windows workers as well?

Note that the c5d and m5d instance types have local/ephemeral NVMe storage available as a single drive, which replaces the use of an EBS volume. However, the Windows builders are using 2 EBS volumes and tasks actively use both. To fully realize the benefits of the c5d and m5d instance types, we'll want tasks to use a single volume for task data and caches, which I believe will require a bit of work. So it's probably best to ignore the c5d and m5d instance types for Windows builders at this time; we can sort that out in bug 1462528 or in a derivative of it. In other words, we'll want the Windows builders to be a mix of c5.4xlarge, m5.4xlarge, c4.4xlarge, and m4.4xlarge (sketched below).
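
As a concrete sketch of that suggested mix, the instanceTypes fragment of the Windows worker type could look something like this (utilities copied from the Linux config above minus the *d types; all other per-type fields elided):

  # Sketch only: the suggested Windows builder mix as the "instanceTypes"
  # fragment of a worker type definition. Utilities are copied from the Linux
  # workers listed above (minus the *d types); all other per-type fields elided.
  windows_instance_types = [
      {"instanceType": "c5.4xlarge", "utility": 1.1},
      {"instanceType": "m5.4xlarge", "utility": 1.1},
      {"instanceType": "c4.4xlarge", "utility": 1.0},
      {"instanceType": "m4.4xlarge", "utility": 0.9},
  ]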
Regarding the performance of the c5.4xlarge instance type... it is only 1 sample point, but it looks real.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=try,1713271,1,2&series=try,1460207,1,2

Previously, the best time for a Windows 64 opt build was ~2000s. This build clocked in at ~1875s. So roughly a 2 minute speedup.

Honestly, this isn't as significant as I was hoping for. And there appeared to be no improvement with Mercurial operation times either. This is... disappointing. Especially the lack of improvement with I/O as measured by Mercurial. I thought for sure we'd see a significant win there, as the c5 instances are supposed to have better EBS performance.

Despite the apparently lackluster performance gains, the 5th generation instances are faster and cheaper, so it is still a good idea to press on. And having a compatible AMI will enable us to experiment with c5d and m5d instances, which will almost certainly demonstrate significant I/O wins.
(In reply to Gregory Szorc [:gps] from comment #12)
> Using a mix of instance types gives the provisioner access to additional
> instance types in case one instance type is not available or is
> prohibitively expensive. May I suggest applying this approach to the Windows
> workers as well?

yes, sounds very reasonable. i hope i've understood the configuration properly. does this look sane:

  "instanceTypes": [
    {
      "instanceType": "c4.4xlarge",
      "utility": 0.9,
      ...
    },
    {
      "instanceType": "c5.4xlarge",
      "utility": 1,
      ...
    }
  ],

i'm hoping that the provisioner will see `utility: 1` and spot request c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get enough of the ena instances. or have i misinterpreted how that works?

i'm in the process of rewriting https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh (bug 1441402, bug 1460535) as a python module. this script is what currently manages the windows worker type configurations. as a bash script, it's difficult to support multiple instance types with readable code, so hopefully in python, it will be easier to support all of the available instance types.
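
purely as a sketch of the shape the python module might take (the file layout and function name are guesses at how the rewrite might look, not the actual module):

  # Sketch of the python direction only: merge extra instance types into an
  # existing worker type definition on disk.
  import json

  def add_instance_types(config_path, new_types):
      with open(config_path) as f:
          worker_type = json.load(f)
      existing = {t["instanceType"] for t in worker_type.get("instanceTypes", [])}
      for entry in new_types:
          if entry["instanceType"] not in existing:
              worker_type.setdefault("instanceTypes", []).append(entry)
      with open(config_path, "w") as f:
          json.dump(worker_type, f, indent=2, sort_keys=True)

  # e.g.: add_instance_types("gecko-1-b-win2012.json",
  #                          [{"instanceType": "c5.4xlarge", "utility": 1}])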
(In reply to Rob Thijssen (:grenade UTC+2) from comment #14)
> i'm hoping that the provisioner will see `utility: 1` and spot request
> c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get
> enough of the ena instances. or have i misinterpreted how that works?

seems to work. the jobs on gecko-1-b-win2012-beta triggered instantiation of both c4.4xlarge & c5.4xlarge instances.
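
for what it's worth, my mental model of the utility factor (an assumption; i haven't checked it against the provisioner source) is "price per unit of utility", roughly:

  # assumption: the provisioner ranks spot bids by price per unit of utility, so
  # at equal spot prices the higher-utility type wins. illustrative only; this
  # is not the provisioner's actual code.
  def pick_instance_type(spot_prices, utilities):
      return min(spot_prices, key=lambda t: spot_prices[t] / utilities[t])

  print(pick_instance_type(
      {"c5.4xlarge": 0.30, "c4.4xlarge": 0.30},
      {"c5.4xlarge": 1.0, "c4.4xlarge": 0.9},
  ))  # -> c5.4xlarge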
Are there plans to finish this rollout?

FWIW Perfherder build metrics show a clear performance win with c5 instances:

https://treeherder.mozilla.org/perf.html#/graphs?series=try,1460207,1,2&series=try,1713271,1,2 (~300s faster on average)

I'm keen to realize these speedups outside of Try!
Flags: needinfo?(rthijssen)
thanks for the follow up.
the gecko-3-b-win2012 rollout is now complete.
https://github.com/mozilla-releng/OpenCloudConfig/commit/d1969a3a
https://tools.taskcluster.net/groups/dKeYmTbpScev4M034CDh9A/tasks/BuE3yhxbRdq9nPRtDkKCDQ/runs/0/logs/public%2Flogs%2Flive.log

i'll leave the bug open until i've updated occ to also update the aws provisioner config (this is currently handled manually) with the extra instance types.
Flags: needinfo?(rthijssen)
Thank you, Rob!

We have limited data, but the build time improvements are already pretty clear.

A ~300s speedup in opt builds:

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460301,1,2&series=autoland,1724153,1,2

An ~800s speedup in pgo builds:

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460459,1,2&series=autoland,1724384,1,2

(This graph is wonky because we rolled out c5's around the same time thinLTO landed and thinLTO made builds much slower. But it looks like the c5's effectively offset the build time loss from thinLTO!)

Developers and sheriffs should notice these speedups. Especially people who have tight iteration loops.
the occ ci updates were dealt with in bug 1480108 and occ commit https://github.com/mozilla-releng/OpenCloudConfig/commit/331e1ff7bbfd2f309d7520be7833792af7db2eee
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED