Experiment with different AWS instance types for TC linux64 builds

RESOLVED FIXED

Status

Firefox
Build Config
P2
normal
10 months ago
9 months ago

People

(Reporter: jgriffin, Assigned: jgriffin)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

10 months ago
Linux64 builds in TaskCluster are currently built using a blend of m3/c3/r3.2xlarge instances, depending on pricing and availability.

We'd like to experiment with AWS instance types that have more RAM and/or cores, to evaluate whether faster E2E build times in automation justify the higher cost.

Comment 1

10 months ago
It is more important to scale cores than RAM. As long as we have 1+ GB/core, we should be fine. A little less is probably OK. Depends on platform though.

I've built in 4-5 minutes on a c4.8xlarge. That was only the build; symbol generation, packaging, etc. take several minutes longer. But the C++ in the build system does scale out to dozens of cores pretty well.
P2 as we would get to it as time permits.
Priority: -- → P2
(Assignee)

Comment 3

9 months ago
Some numbers:

type          compile (min)   build (min)   price (cents/hr)
m3.2xlarge         32             40              13.5
m4.4xlarge         14             22              20.6
c4.2xlarge         24             33              15.3
c4.4xlarge         13             20              22.9
r3.4xlarge         17             25              25.3

This suggests we should consider switching to m4/c4.4xlarge for linux builds; this would shave about 20 minutes off the build time at a cost delta of around 7 to 9 cents an hour. Since we'd be using the instance for about 20 minutes less per build, the real delta per build is only about 5 or 6 cents. This is a tiny cost compared to the development velocity improvements we could achieve by reducing build times, especially on Try.
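As a rough cross-check of the per-build reasoning, here is a minimal sketch that prices each build from the table. It assumes a deliberately simple model that is not stated in this bug: the instance is billed only for the build's wall-clock time at the listed spot price. The function names are made up for illustration.

```python
# Figures copied from the table in this comment:
# (compile minutes, build minutes, spot price in cents per hour).
INSTANCES = {
    "m3.2xlarge": (32, 40, 13.5),
    "m4.4xlarge": (14, 22, 20.6),
    "c4.2xlarge": (24, 33, 15.3),
    "c4.4xlarge": (13, 20, 22.9),
    "r3.4xlarge": (17, 25, 25.3),
}


def cost_per_build_cents(build_minutes, cents_per_hour):
    """Assumed model: pay only for the minutes the build actually runs."""
    return cents_per_hour * build_minutes / 60.0


def summarize(baseline="m3.2xlarge"):
    """Return (type, minutes saved vs. baseline, cents per build) rows."""
    base_build = INSTANCES[baseline][1]
    rows = []
    for name, (_compile, build, price) in INSTANCES.items():
        cost = round(cost_per_build_cents(build, price), 2)
        rows.append((name, base_build - build, cost))
    return rows


if __name__ == "__main__":
    for name, saved, cost in summarize():
        print(f"{name:<12} saves {saved:>2} min  ~{cost:.2f} cents/build")
```

Under this model a 40-minute m3.2xlarge build comes to about 9.0 cents and a 20-minute c4.4xlarge build to about 7.6 cents, so the per-build delta depends heavily on what overhead is counted. The sketch ignores idle time, provisioning, and spot price fluctuation.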

I haven't run experiments on 8xlarge instances yet; comment 1 suggests this would result in additional speed increases, but they would come at greater cost. Currently a c4.8xlarge spot instance costs 45.7 cents/hr.

Comment 4

9 months ago
https://tools.taskcluster.net/aws-provisioner/ says we have ~100 instances of {dbg,opt}-linux-{32,64}. Assuming we run 100 instances 24/7, multiply the cost per hour by 74,400 (100 instances x 24 hours x 31 days) to get our monthly cost, e.g.:

100x m3.2xlarge @ $0.135: $10,044
100x c4.2xlarge @ $0.153: $11,383
100x c4.4xlarge @ $0.229: $17,037

The jump from m3.2xl to c4.2xl for a little over $1,000/mo is a no-brainer IMO.
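The monthly figures above can be reproduced with a few lines of arithmetic. This is a minimal sketch under the same assumptions as the comment (100 instances running 24/7 for a 31-day month at a flat hourly price); the function name is made up for illustration.

```python
HOURS_PER_MONTH = 24 * 31   # 744 hours in a 31-day month
INSTANCE_COUNT = 100        # assumed fleet size, per the provisioner estimate

# Hourly prices in dollars, taken from the list in this comment.
PRICES_USD_PER_HOUR = {
    "m3.2xlarge": 0.135,
    "c4.2xlarge": 0.153,
    "c4.4xlarge": 0.229,
}


def monthly_cost_usd(dollars_per_hour,
                     instances=INSTANCE_COUNT,
                     hours=HOURS_PER_MONTH):
    """Flat-rate monthly cost: price x instance count x hours."""
    return dollars_per_hour * instances * hours


if __name__ == "__main__":
    baseline = monthly_cost_usd(PRICES_USD_PER_HOUR["m3.2xlarge"])
    for name, price in PRICES_USD_PER_HOUR.items():
        cost = monthly_cost_usd(price)
        print(f"100x {name} @ ${price:.3f}: ${cost:,.2f} "
              f"(+${cost - baseline:,.2f} vs m3.2xlarge)")
```

The dollar amounts listed in the comment appear to be truncated to whole dollars, so printed values may differ from them by up to a dollar.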

Considering build jobs are the long pole in automation, I think throwing thousands of dollars at the problem per month is warranted.

jgriffin: did you test a full build job (symbol generation and all)? Or is this just the `mach build` piece?
Flags: needinfo?(jgriffin)
(Assignee)

Comment 5

9 months ago
I ran the entire build, using TaskCluster's build.sh script. So, the raw data I was looking at was something like this:

PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": [{"subtests": [{"name": "libxul.so", "value": 119959088}], "name": "installer size", "value": 66849994, "alertThreshold": 0.25}, {"subtests": [{"name": "configure", "value": 25.377415895462036}, {"name": "pre-export", "value": 0.4210519790649414}, {"name": "export", "value": 26.041933059692383}, {"name": "compile", "value": 773.1008520126343}, {"name": "misc", "value": 2.1479151248931885}, {"name": "libs", "value": 9.281205177307129}, {"name": "tools", "value": 0.4877140522003174}, {"name": "package-tests", "value": 111.22820997238159}, {"name": "buildsymbols", "value": 204.79079699516296}, {"name": "package", "value": 44.21001100540161}, {"name": "upload", "value": 8.736287832260132}], "name": "build times", "value": 1207.6935601234436}]}

(this is for a c4.4xlarge instance)
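For anyone repeating the measurement, a log line like the one above can be turned into a per-phase breakdown with a short script. This is a minimal sketch: the `phase_seconds` helper is hypothetical, and the sample line is a shortened copy of the data above (three subtests only, values rounded), not a complete log.

```python
import json

PREFIX = "PERFHERDER_DATA: "

# Shortened sample for illustration: a subset of the subtests from the
# log line in this comment, with values rounded to two decimals.
SAMPLE_LINE = (
    'PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": ['
    '{"name": "build times", "value": 1207.69, "subtests": ['
    '{"name": "compile", "value": 773.10}, '
    '{"name": "package-tests", "value": 111.23}, '
    '{"name": "buildsymbols", "value": 204.79}]}]}'
)


def phase_seconds(line, suite_name="build times"):
    """Return ({phase: seconds}, total seconds) for the named suite."""
    if not line.startswith(PREFIX):
        raise ValueError("not a PERFHERDER_DATA line")
    payload = json.loads(line[len(PREFIX):])
    for suite in payload["suites"]:
        if suite["name"] == suite_name:
            phases = {t["name"]: t["value"] for t in suite["subtests"]}
            return phases, suite["value"]
    raise KeyError(suite_name)


if __name__ == "__main__":
    phases, total = phase_seconds(SAMPLE_LINE)
    # Largest phases first, reported in minutes and as a share of the total.
    for name, secs in sorted(phases.items(), key=lambda kv: -kv[1]):
        print(f"{name:<14} {secs / 60:5.1f} min  {100 * secs / total:4.1f}%")
```

Reading the numbers this way makes the link to the table in comment 3 visible: 773 s of compile is roughly 13 minutes and 1207 s of total build time is roughly 20 minutes.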
Flags: needinfo?(jgriffin)

Comment 6

9 months ago
773s for a compile on a c4.4xlarge seems a bit long since the c4.4xlarge has 16 vCPUs. I would expect the compile tier to take 300-450s on that instance type.

I wonder if ccache or sccache could be interfering here. Also, in the c4 series, everything except the c4.8xlarge is shared hardware. So if there are other instances on the same physical machine, you'll be competing for system resources.

Also, slow I/O due to e.g. EBS could be slowing things down as well.
(Assignee)

Updated

9 months ago
Status: NEW → ASSIGNED
(Assignee)

Updated

9 months ago
Blocks: 1290282
(Assignee)

Updated

9 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 9 months ago
Resolution: --- → FIXED