1469398 - Windows AMI support for c5 instances

Reporter

Description

•

6 years ago

Amazon c5 and m5 instances have faster CPU and are cheaper. The new c5d and m5d variants have local NVMe storage and I/O is substantially faster than what you can get via EBS. We want to use these 5th generation instances throughout CI. And on Linux we are.

However, if you attempt to run a c5 or m5 instance with the latest Windows AMIs we use for build tasks (I believe these are based on Windows Server), you receive the following error message:

  InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.

This error is likely due to the AMI not having the ENA driver installed. Enabling the ENA driver should hopefully be as simple as installing the driver from the zip file linked from https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/enhanced-networking-ena.html.

Once instances have ENA support, it is quite possible that they'll fail to mount EBS volumes. This is because c5 instances treat EBS volumes as NVMe devices. At least on Linux. Not sure about Windows.

Anyway, I'm filing this bug against Taskcluster Operations so we have a dedicated bug on file to support the 5th generation instance types with our Windows AMIs.

needinfo to fubar to triage the request.

Flags: needinfo?(klibby)

Gregory Szorc [:gps]

Reporter

Comment 1

•

6 years ago

Adding needinfo on coop in case triage falls on his team. (Ownership of OpenCloudConfig is a bit nebulous.)

Not running c5 and c5d instances on Windows is likely costing us *minutes* per build in Firefox CI due to I/O. That itself likely translates to a few thousand dollars in EC2 costs per year due to extra machine time. When you factor in that c5 instances are cheaper than c4 instances *and* have faster CPU, we're likely talking several to $10,000+ annual savings from this switch. And there's a good change this bug is the lone blocker.

Flags: needinfo?(coop)

Kendall Libby [:fubar] (he/him)

Comment 2

•

6 years ago

It's not coop, it's on us, sorry. We're working on a new process to repeatably build AMIs, but have been stalled because of issues in MDC2.

Flags: needinfo?(coop)

Gregory Szorc [:gps]

Reporter

Comment 3

•

6 years ago

Thanks for the update!

Correct me if I'm wrong, but aren't there 2 phases to AMI generation (base and top). And if OCC runs as part of the top-most AMI layer, presumably this could install the ENA driver. As long as AMI generation isn't running on a c5 instance (it shouldn't need to), I would think we could install the ENA driver in the top-most AMI and this may "just work."

Kendall Libby [:fubar] (he/him)

Comment 4

•

6 years ago

Rob, what do you think? Can you take a look and see if you can make this go in our current state?

Flags: needinfo?(klibby) → needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 5

•

6 years ago

yep. i'll have a go.

Assignee: nobody → rthijssen

Status: NEW → ASSIGNED

Component: Operations → Relops: OpenCloudConfig

Flags: needinfo?(rthijssen)

Product: Taskcluster → Infrastructure & Operations

QA Contact: rthijssen

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 6

•

6 years ago

testing on gecko-1-b-win2012-beta:
https://github.com/mozilla-releng/OpenCloudConfig/commit/0f2a565

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 7

•

6 years ago

testing firefox build on c5.4xlarge workers:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce8c079

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 8

•

6 years ago

beta builds succeeded.
promoting to gecko-1-b-win2012 & gecko-2-b-win2012:
https://github.com/mozilla-releng/OpenCloudConfig/commit/562946e
https://tools.taskcluster.net/groups/DlYOexCVRFCgkztF9et0xA

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 9

•

6 years ago

if we've no issues over the next few hours, i'll deploy to l3 builders tomorrow morning (utc).
https://github.com/mozilla-releng/OpenCloudConfig/pull/159

Kendall Libby [:fubar] (he/him)

Comment 10

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #9)
> if we've no issues over the next few hours, i'll deploy to l3 builders
> tomorrow morning (utc).
> https://github.com/mozilla-releng/OpenCloudConfig/pull/159

Reminder that the US is in holiday tomorrow; I don't think that should necessarily change your plan, but it won't be a normal weekday in many regards.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 11

•

6 years ago

we had problems with provisioning. it seems we can't get enough c5.4xlarge spot instances (this instance type is also used by linux builders). i have reverted the aws provisioner config for gecko-1-b-win2012 to use the new amis (with ena drivers installed) but to instead use c4.4xlarge instance types until the provisioning issue is resolved.

Gregory Szorc [:gps]

Reporter

Comment 12

•

6 years ago

On the Linux builders, we define a mix of instance types with different utility factor:

c5d.4xlarge: 1.3
m5d.4xlarge: 1.2
c5.4xlarge: 1.1
m5.4xlarge: 1.1
c4.4xlarge: 1.0
m4.4xlarge: 0.9

From https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1444817,1,2&series=autoland,1582195,1,2&series=autoland,1691781,1,2&series=autoland,1444828,1,2&series=autoland,1618745,1,2&series=autoland,1697674,1,2, you can see that a healthy mix of these instances is provisioned in the wild.

Using a mix of instance types gives the provisioner access to additional instance types in case one instance type is not available or is prohibitively expensive. May I suggest applying this approach to the Windows workers as well?

Note that the c5d and m5d instance types have local/ephemeral NVMe storage available as a single drive. That replaces the use of an EBS volume. However, the Windows builders are using 2 EBS volumes and tasks actively use both volumes. To fully realize the benefits of the c5d and m5d instance types, we'll want tasks to use a single volume for task data and caches. I believe this will require a bit of work. So it's probably best to ignore the c5d and m5d instance types for Windows builders at this time. We can sort that out in bug 1462528 or in a derivative of it. i.e. we'll want the Windows builders to be a mix of c5.4xlarge, m5.4xlarge, c4.4xlarge, and m4.4xlarge.

Gregory Szorc [:gps]

Reporter

Comment 13

•

6 years ago

Regarding the performance of the c5.4xlarge instance type... it is only 1 sample point, but it looks real.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=try,1713271,1,2&series=try,1460207,1,2

Previously, the best time for a Windows 64 opt build was ~2000s. This build clocked in at ~1875s. So roughly a 2 minute speedup.

Honestly, this isn't as significant as I was hoping for. And there appeared to be no improvement with Mercurial operation times either. This is... disappointing. Especially the lack of improvement with I/O as measured by Mercurial. I thought for sure we'd see a significant win there, as the c5 instances are supposed to have better EBS performance.

Despite the apparent lackluster performance gains, the 5th generation instances are faster and cheaper, so it is still a good idea to press on. And having the compatible AMI will enable us to experiment with c5d and m5d instances, which will almost certainly demonstrate significant I/O wins.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 14

•

6 years ago

(In reply to Gregory Szorc [:gps] from comment #12)
> Using a mix of instance types gives the provisioner access to additional
> instance types in case one instance type is not available or is
> prohibitively expensive. May I suggest applying this approach to the Windows
> workers as well?

yes, sounds very reasonable. i hope i've understood the configuration properly. does this look sane:

  "instanceTypes": [
    {
      "instanceType": "c4.4xlarge",
      "utility": 0.9,
      ...
    },
    {
      "instanceType": "c5.4xlarge",
      "utility": 1,
      ...
    }
  ],

i'm hoping that the provisioner will see `utility: 1` and spot request c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get enough of the ena instances. or have i misinterpreted how that works?

i'm in the process of rewriting https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh (bug 1441402, bug 1460535) as a python module. this script is what currently manages the windows worker type configurations. as a bash script, it's difficult to support multiple instance types with readable code, so hopefully in python, it will be easier to support all of the available instance types.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 15

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=19cabd6e0be0373538392fd93f52b0fd51025645

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 16

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2145cdfb4160ec9e97e09ceeff96b24fe959898d

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 17

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #14)
> i'm hoping that the provisioner will see `utility: 1` and spot request
> c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get
> enough of the ena instances. or have i misinterpreted how that works?

seems to work. the jobs on gecko-1-b-win2012-beta triggered instantiation of both c4.4xlarge & c5.4xlarge instances.

Gregory Szorc [:gps]

Reporter

Comment 18

•

6 years ago

Are there plans to finish this rollout?

FWIW Perfherder build metrics show a clear performance win with c5 instances:

https://treeherder.mozilla.org/perf.html#/graphs?series=try,1460207,1,2&series=try,1713271,1,2 (~300s faster on average)

I'm keen to realize these speedups outside of Try!

Flags: needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 19

•

6 years ago

thanks for the follow up.
the gecko-3-b-win2012 rollout now complete.
https://github.com/mozilla-releng/OpenCloudConfig/commit/d1969a3a
https://tools.taskcluster.net/groups/dKeYmTbpScev4M034CDh9A/tasks/BuE3yhxbRdq9nPRtDkKCDQ/runs/0/logs/public%2Flogs%2Flive.log

i'll leave the bug open until i've updated occ to also update the aws provisioner config (this is being handled manually now) with the extra instance types.

Flags: needinfo?(rthijssen)

Gregory Szorc [:gps]

Reporter

Comment 20

•

6 years ago

Thank you, Rob!

We have limited data, but the build time improvements are already pretty clear.

A ~300s speedup in opt builds:

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460301,1,2&series=autoland,1724153,1,2

An ~800s speedup in pgo builds:

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460459,1,2&series=autoland,1724384,1,2

(This graph is wonky because we rolled out c5's around the same time thinLTO landed and thinLTO made builds much slower. But it looks like the c5's effectively offset the build time loss from thinLTO!)

Developers and sheriffs should notice these speedups. Especially people who have tight iteration loops.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 21

•

6 years ago

the occ ci updates were dealt with in bug 1480108 and occ commit https://github.com/mozilla-releng/OpenCloudConfig/commit/331e1ff7bbfd2f309d7520be7833792af7db2eee

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Bugzilla

Quick Search

Windows AMI support for c5 instances

Categories

(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)

Tracking

(Not tracked)

People

(Reporter: gps, Assigned: grenade)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Comment 16

Comment 17

Comment 18

Comment 19

Comment 20

Comment 21