Closed Bug 1321168 Opened 8 years ago Closed 7 years ago

Release builds could be dozens of minutes faster if EC2 instance type is changed

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P5)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We're currently building Nightly and other release builds on [cmr]3.xlarge instances. In TaskCluster, we build on [cm]4.4xlarge instances. Not only do the c4's and m4's use a more modern, more efficient (read: faster) CPU architecture, but the 4xlarge's also have 16 vCPUs as opposed to the xlarge's 4. That 4x processing power matters.

According to https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,2734d6cd30666b1026c68e6de1d6dd3189caaee9,1,2%5D&series=%5Bmozilla-central,0e64c3fa34b8d9a4f3c692a1d9c1fe2592db0882,1,2%5D&series=%5Bmozilla-central,ad6f63d6218ea702d11ccc29c5571bd099019be1,1,2%5D&series=%5Bmozilla-central,95db184c04b0d2d1597df8c7b2fd42dbadc57f62,1,2%5D&series=%5Bmozilla-central,fcd6fc7acbc1283393a276626d07cd6356be4ecc,1,2%5D this difference translates to ~45 minutes of wall time when compiling Firefox for PGO builds on Linux.

For the entirety of the build, https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bmozilla-central,001a79f961b8ceee66f307da0c4f577c7a3afc9f,1,2%5D&series=%5Bmozilla-central,03a7eb986f6e7b8b9cf876d55af9fb8a4af6b646,1,2%5D&series=%5Bmozilla-central,67170a2fcaa34388c058ef57e744687a517009d6,1,2%5D&series=%5Bmozilla-central,cad007c7fb7ba97dfc929873b7e10267a6b0ddb8,1,2%5D&series=%5Bmozilla-central,22bae2c8cab25fd67543be49a731115f3e15c3b2,1,2%5D says the difference is over 100 minutes.

The choice of xlarge instances is artificially limiting our release turnaround time. I think we should consider bumping the EC2 instance type so we can turn around release builds faster.
The RelEng AWS bill seems to indicate we are running a bunch of c3.2xlarge instances: ~313k instance-hours in October, compared with 71.6k for m3.xlarge, 34.0k for r3.xlarge, and 90.6k for c3.xlarge. Not sure why these release jobs all seem to be running on the slower instances.
Moving to buildduty to investigate. It would be good to look at more recent data in Treeherder and see if this is still the case, since we recently moved our release and Nightly Linux builds to TC.
Component: General Automation → Buildduty
QA Contact: catlee → bugspam.Callek
For Linux, all builds are running on TaskCluster under m4.4xlarge. For Windows and OS X we have some builds which run on b-2008-spot instances, which are c3.2xlarge (see more details in the links below):

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=build&selectedJob=86623903

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=build&filter-build_system_type=Buildbot&fromchange=517c553ad64746c479456653ce11b04ab8e4977f

Gregory, should we also change the machine type for b-2008-spot, or is it fine with the current configuration?
Flags: needinfo?(gps)
To be conservative, I'd have release builds match what non-release builds are using. I suspect there are reasons that the b-2008-spot instances are still using c3.2xlarge. We can worry about getting those switched in another bug.

Also, TC uses a combo of m4 and c4 instances. It just so happens that spot bidding often delivers m4's more often. The c4's are a few minutes faster than m4's. So if you want to prioritize wall time of release builds over a marginal cost premium (which I think you do since release builds are important), you should go with the c4's. https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-inbound,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bmozilla-inbound,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D
Flags: needinfo?(gps)
Did some investigation here but I couldn't find the place where the configuration for the c4.4xlarge/m4.4xlarge AWS instances is made. Found some old tests and configuration in Bug 1287604 and Bug 1290282 (https://bug1290282.bmoattachments.org/attachment.cgi?id=8778997#), but things look different now.

Also, based on the discussions in Bug 1287604, are we sure that we want to change the configuration from a combo of m4 and c4 instances to c4 instances only?
Flags: needinfo?(gps)
I cannot recall where the buildbot EC2 instance types are defined. catlee?
Flags: needinfo?(gps) → needinfo?(catlee)
Andrei,

Further to our discussion regarding this bug this morning, this is a request to change the instance type on release branches + m-c for Nightly only.
So from what I understand, for now we want to change the TaskCluster configuration so that builds no longer use a combo of m4.4xlarge and c4.4xlarge instances but only c4.4xlarge, based on https://bugzilla.mozilla.org/show_bug.cgi?id=1321168#c4.
For that I will need to know where we define the characteristics of our TC worker types, particularly the instance type we are going to use. For example, gecko-1-b-linux uses both c4.4xlarge and m4.4xlarge (https://tools.taskcluster.net/aws-provisioner/#gecko-1-b-linux/view). This is not found in watch_pending.cfg; there we only set the instance type for spot instances, and c4.4xlarge and m4.4xlarge are not even in the list.
Dustin, do you know where we define the characteristics of our TC worker types?

Gregory, do we want to also change the b-2008-spot instance type to c4.4xlarge, so that all builds on release branches + m-c use c4.4xlarge? We currently use c3.4xlarge there (https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg#L69).
Flags: needinfo?(dustin)
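For context, the workerType definitions in the AWS provisioner are where the allowed instance types for a worker type are listed. A rough sketch of the relevant fragment for something like gecko-1-b-linux, assuming a schema roughly like the one the provisioner used at the time (the field names and utility values here are assumptions from memory, not copied from the real definition):

    "instanceTypes": [
        {"instanceType": "m4.4xlarge", "capacity": 1, "utility": 1.0},
        {"instanceType": "c4.4xlarge", "capacity": 1, "utility": 1.0}
    ]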
It doesn't seem this bug is about Taskcluster.

The instance types are controlled by the AWS provisioner.  But the workerTypes used for releases are the same as those used for CI builds, so this particular change is impossible.  I would want :garndt's feedback before we change all workers' workerType -- there's a cost at which "dozens of minutes" might not be worthwhile.
Flags: needinfo?(dustin)
> The instance types are controlled by the AWS provisioner.  But the
> workerTypes used for releases are the same as those used for CI builds, so
> this particular change is impossible.  I would want :garndt's feedback
> before we change all workers' workerType -- there's a cost at which "dozens
> of minutes" might not be worthwhile.
Flags: needinfo?(garndt)
How did we start talking about Taskcluster? I think Taskcluster's instance selection is fine. It is buildbot that is severely lagging behind the times.

Yes, we would ideally be building only on c4's instead of m4's in Taskcluster. But it isn't worth the complexity and cost at this time. Let's focus on getting buildbot release builds to something modern and we can worry about micro-optimizing release builds in TC to use c4 another time.
Based on Greg's comment above, I'm going to remove ni?
Flags: needinfo?(garndt)
Not sure if there is something else that needs to be changed.
Attachment #8857968 - Flags: feedback?(catlee)
Comment on attachment 8857968 [details]
changing instance type to c4.4xlarge for b-2008 and y-2008

Looks fine, but I'm not sure if we require local instance storage or not for the Windows builds. c4 instances are EBS-only whereas c3 have local SSD storage.

Rail, grenade, do you know if we require local instance storage for Windows instances?
Attachment #8857968 - Flags: feedback?(rthijssen)
Attachment #8857968 - Flags: feedback?(rail)
Attachment #8857968 - Flags: feedback?(catlee)
Attachment #8857968 - Flags: feedback+
Comment on attachment 8857968 [details]
changing instance type to c4.4xlarge for b-2008 and y-2008

TBH, I'm not sure how the lack of instance store may affect Windows builds. As a possible test-in-production we can land the patch and watch the builds.
Attachment #8857968 - Flags: feedback?(rail) → feedback+
Comment on attachment 8857968 [details]
changing instance type to c4.4xlarge for b-2008 and y-2008

should be fine. on tc we use c4s with instance storage on extra ebs drives configured into the ami. but pretty sure bb just builds on the c: drive so should be no problems.
Attachment #8857968 - Flags: feedback?(rthijssen) → feedback+
(In reply to Rob Thijssen (:grenade - CEST) from comment #17)
> Comment on attachment 8857968 [details]
> changing instance type to c4.4xlarge for b-2008 and y-2008
> 
> should be fine. on tc we use c4s with instance storage on extra ebs drives
> configured into the ami. but pretty sure bb just builds on the c: drive so
> should be no problems.

If builds are on c: then bug 1305174 may come into play and this may make builds significantly slower. Builds should be performed on EBS volumes that aren't initialized from an AMI to avoid this problem.
This was merged today.
Things look good, b-2008-spot and y-2008-spot instances are now c4.4xlarge.
Since only these builds are executed by Buildbot, and we don't want to make changes in TaskCluster yet, I think we can change the status to fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Before we close this out, we should compare before/after times to make sure that we didn't make things worse instead of better (per comment 19).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
As I suspected, since these only have the single disk, many times are increasing, not decreasing. I doubt that the aggregate of those jobs that are marginally faster makes up for the drastic increases on freshly booted instances.
This has caused a significant increase in build times as per the graph. We should back this out.
Flags: needinfo?(aobreja)
Is it too much work to add EBS volumes and perform work on them? We could still use existing paths that reference C:\ if an NTFS junction point is used to map an EBS-backed drive to a directory on C:\.
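To make the junction idea concrete, here is a minimal sketch (not the actual patch), assuming the extra EBS volume is already formatted and mounted as Z: and that the build machinery expects C:\builds:

    # Assumes the EBS data volume is online, formatted, and assigned drive letter Z:,
    # and that C:\builds does not already exist on the root volume.
    New-Item -ItemType Directory -Path 'Z:\builds' | Out-Null
    # NTFS junction: anything referencing C:\builds transparently lands on the EBS-backed Z:\builds.
    cmd /c mklink /J C:\builds Z:\builds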
The reason I'd like to use the C4's is that the C5's should be out any month now and they should offer a substantial speedup over C4's. So I see the choices as "stick with C3's until this automation is retired [and replaced by TaskCluster]" or move forward to C4's so we can swap in C5's whenever they are ready.
Between datacenter work, security work, and trying to get things working for taskcluster and windows 10, we don't have time to prioritize optimization at the moment. If someone outside of relops wants to make modifications, that'd be great.
Not having consistency between TC and BB annoys me from the perspective of someone who optimizes the build system and version control. I am currently working on some patches to build-cloud-tools to enable EBS volumes on these instances and to install junctions in the appropriate locations.
Assignee: nobody → gps
Status: REOPENED → ASSIGNED
My reading of the existing code in Ec2UserdataUtils.psm1 for provisioning a new spot instance indicates that everything here should have "just worked." However, the problem appears to be that the b-2008 and y-2008 configs still thought they had ephemeral drives:

        "device_map": {
            "/dev/sda1": {
                "delete_on_termination": true,
                "skip_resize": true,
                "volume_type": "gp2",
                "size": 120,
                "instance_dev": "C:"
            },
            "/dev/sdb": {
                "ephemeral_name": "ephemeral0",
                "instance_dev": "/dev/xvdb",
                "skip_resize": true,
                "delete_on_termination": false
            },
            "/dev/sdc": {
                "ephemeral_name": "ephemeral1",
                "instance_dev": "/dev/xvdc",
                "skip_resize": true,
                "delete_on_termination": false
            }
        },

I /think/ the creation of those ephemeral volumes is silently failing. Ec2UserdataUtils.psm1 treats that as "no extra volumes, no work to do," so c:\builds ends up on the root EBS volume and bad performance ensues. I'm optimistic that reworking the device_map to properly list a gp2 volume will "just work." However, the code in Ec2UserdataUtils.psm1 is a bit fragile, so I'm going to refactor that as well. PR coming shortly.
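For illustration, a sketch of how the device_map could be reworked to request a real gp2 data volume instead of the ephemeral entries, reusing the field names from the snippet above; the second device's size and Windows drive letter are placeholders, not values taken from the actual PR:

        "device_map": {
            "/dev/sda1": {
                "delete_on_termination": true,
                "skip_resize": true,
                "volume_type": "gp2",
                "size": 120,
                "instance_dev": "C:"
            },
            "/dev/sdb": {
                "delete_on_termination": true,
                "skip_resize": true,
                "volume_type": "gp2",
                "size": 120,
                "instance_dev": "Z:"
            }
        },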
Flags: needinfo?(aobreja)
I backed this out since it seems to line up with a spike in failures of bug 1147271.
Note explaining the priority level: P5 doesn't mean we've lowered the priority; on the contrary. We're aligning these levels to the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.
Priority: -- → P5
We're not going to make any more optimization changes to buildbot since we're well into the migration to taskcluster.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard