Open Bug 1455706 Opened 4 years ago Updated 3 years ago

Switch to C5/M5 AWS instances on Windows builders

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

+++ This bug was initially created as a clone of Bug #1415725 +++

gecko-<L>-b-win2012 are only running c4.4xlarge instance types. Linux is running c4.4xlarge, m4.4xlarge, c5.4xlarge, and m5.4xlarge.

I don't think there is a good reason to limit the Windows builders to c4.4xlarge. The c5 and m5 instances should hopefully "just work." And offering multiple instance types will increase our ability to bid for the cheapest spot instance at a given point in time.

I'm capable of updating the worker configuration. But a) this shouldn't go out on a Friday b) the change should be made under the supervision of the Taskcluster team.

To whoever implements this, we've adjusted the "utility" factor on the Linux worker types to prioritize the c5 and m5 instance types. The same should be done for Windows.

coop: could you please find an owner for this?
Flags: needinfo?(coop)
FWIW, I've been hacking on "no cache" builds that are a build configuration that doesn't use any caching.

https://treeherder.mozilla.org/perf.html#/graphs?series=try,1678626,1,2&series=try,1678618,1,2&series=try,1678976,1,2&series=try,1678650,1,2&series=try,1678602,1,2&zoom=1524245633312.1296,1524245697000,1000,2039.0480999873755 contains a number of triggers for such a build configuration.

The data clearly shows that c5 and m5 4xlarge instances are faster than c4 and m4 4xlarge instances.

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1618745,1,2&series=autoland,1444817,1,2&series=autoland,1582195,1,2&series=autoland,1444828,1,2 is a similar view for builds on Autoland that use sccache. And https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1444874,1,2&series=autoland,1582199,1,2&series=autoland,1444873,1,2&series=autoland,1618749,1,2 isolated just the compilation phase of the build. These don't show as much separation between the [cm]4 and [cm]5 instances. That's because the sccache hit rate is so high that raw CPU speed doesn't come into play as much.

But that doesn't mean we won't see a perf win from the [cm]5 instances: linking and PGO are very CPU heavy and will benefit from the faster CPU cores in the 5th generation instances. Plus, having spot instance bidding flexibility is worth it by itself IMO.
Looping in both :grenade and :jhford to provide insight.

John: is there anything preventing us from adding multiple instanceTypes to https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/view like we do for linux? Sounds like we could give the c5s a slightly better utility score to favor them slightly.

Checking the OCC repo, I see the following:

https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh#L128

Rob: each tc_worker_type in that file only accepts a single aws_instance_type. Could they accept more, i.e. will OCC deal with a variable instance type right now?
Flags: needinfo?(rthijssen)
Flags: needinfo?(jhford)
Flags: needinfo?(coop)
from what i can see, the linux implementation just uses multiple taskcluster worker types (eg: gecko-1-b-linux/c5.4xlarge, gecko-1-b-linux-large/m4.10xlarge, gecko-1-b-linux-xlarge/m5.24xlarge) to achieve multiple ec2 instance types.
if the windows implementation is to mimic linux, then yes, occ can support this easily. we only need to add the worker types to the provisioner and update the occ ci to start building the extra worker types.

i imagine there's a fair bit of in-tree code that would need to be updated to make use of the new worker types though when generating task graphs and i know very little about how all that works but i'm happy to add the worker types to occ if it's wanted.
Flags: needinfo?(rthijssen)
Linux doesn't do that for builds -- the rationale for doing that for tests is to only pay for the instance size we need, since we do not want to run multiple tests in parallel on the same host.

For builds, we have multiple instance types for the same workerType, and aws-provisioner picks between them using the best price/performance ratio.  That's what's being suggested here for Windows.
We have -large and -xlarge variants for Windows for "special" tasks (that can scale well up to 36+ vCPUs). I don't think we're there yet on Windows, so we don't have those worker types. But we may get there some day.

What I'd like to see in this bug is for the -b-win2012 workers to match their -b-linux equivalents: specify m4.4xlarge, c4.4xlarge, m5.4xlarge, and c5.4xlarge as candidate instance types with appropriate utility factors so we favor the 5th generation instance types. The AWS provisioner will then pick an appropriate instance type depending on price, availability, etc.
(In reply to Gregory Szorc [:gps] from comment #5)
> What I'd like to see in this bug is for the -b-win2012 workers to match
> their -b-linux equivalents: specify m4.4xlarge, c4.4xlarge, m5.4xlarge, and
> c5.4xlarge as candidate instance types with appropriate utility factors so
> we favor the 5th generation instance types. The AWS provisioner will then
> pick an appropriate instance type depending on price, availability, etc.

understood. the occ changes required are somewhat trivial. we just need to change the way we generate the aws provisioner config to include the extra instance types and associated block device mappings.

having said that, it's currently done in a bash script that has grown convoluted over time. a better approach might be to refactor the bash script into some python which would help us solve more than one problem.

bug 1441402 wants us to store default/base provisioner config in occ rather than mutate existing provisioner config. implementing that first gives us a cleaner platform to add the extra instance types to.
Depends on: 1441402
(In reply to Chris Cooper [:coop] from comment #2)
> John: is there anything preventing us from adding multiple instanceTypes to
> https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/view like we
> do for linux? Sounds like we could give the c5s a slightly better utility
> score to favor them slightly.

No reason we cannot.  The caveat is that the provisioner will run any configured instance type for that worker type.  This means that the pool of jobs configured for c5.xlarge and m3.xlarge must all work on both of those instance types.
Flags: needinfo?(jhford)
We should probably not mix generations before 4 and after 4 in the same worker type because 4 requires the use of EBS volumes and prior generations used instance volumes. So it is much easier if you limit the mix to e.g. c3+m3+etc or c4+m4+c5+m5+etc.

Mixing c4/m4/c5/m5 in the same worker type should be fine. We do this for the Linux builders, decision workers. As long as the AMI supports the 5th generation instance types, it should "just work."

Please note that I'm only requesting the build workers be changed: changing the instance type of test workers is much more dangerous because tests are very sensitive to their instance type (ask jmaher).
Is there an ETA on this?
I hacked around with a worker type config today and AWS Provisioner choked trying to provision a c5.4xlarge with the current AMIs:

InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.

So it looks like this needs more work beyond updating the worker configs :/
Depends on: 1469398
You need to log in before you can comment on or make changes to this bug.