Bug 1287604 - Experiment with different AWS instance types for TC linux64 builds
Status: RESOLVED FIXED
Product: Firefox
Classification: Client Software
Component: Build Config
Version: unspecified
Platform: Unspecified Unspecified
Importance: P2 normal
Assigned To: Jonathan Griffin (:jgriffin)
: Gregory Szorc [:gps] (away until 2017-03-20)
Blocks: thunder-try, 1290282
Reported: 2016-07-18 14:11 PDT by Jonathan Griffin (:jgriffin)
Modified: 2016-08-09 10:56 PDT

Description Jonathan Griffin (:jgriffin) 2016-07-18 14:11:51 PDT
Linux64 builds in TaskCluster are currently built using a blend of m3/c3/r3.2xlarge instances, depending on pricing and availability.

We'd like to experiment with AWS instance types that have more RAM and/or cores, in order to evaluate the cost/benefit tradeoff of faster end-to-end build times in automation against the higher instance cost.
Comment 1 Gregory Szorc [:gps] (away until 2017-03-20) 2016-07-18 14:21:59 PDT
It is more important to scale cores than RAM. As long as we have 1+ GB/core, we should be fine. A little less is probably OK. Depends on platform though.

I've built in 4-5 minutes on a c4.8xlarge. That was only the build itself; symbol generation, packaging, etc. take several minutes longer. But the C++ in the build does scale out to dozens of cores pretty well.
Comment 2 Armen Zambrano - Back on March 27th [:armenzg] (EDT/UTC-4) 2016-07-19 05:19:21 PDT
P2, as we would get to it as time permits.
Comment 3 Jonathan Griffin (:jgriffin) 2016-07-25 16:09:49 PDT
Some numbers:

type          compile (min)   build (min)   price (cents/hr)
m3.2xlarge         32              40             13.5
m4.4xlarge         14              22             20.6
c4.2xlarge         24              33             15.3
c4.4xlarge         13              20             22.9
r3.4xlarge         17              25             25.3

This suggests we should consider switching to m4/c4.4xlarge for linux builds; this would shave about 20 minutes off the build time at a cost delta of around 7 to 9 cents an hour. Since we'd be using the instance for about 20 minutes less per build, the real delta per build is only about 5 or 6 cents. This is a tiny cost compared to the development velocity improvements we could achieve by reducing build times, especially on Try.

I haven't run experiments on 8xlarge instances yet; comment 1 suggests this would yield additional speed increases, but at greater cost. Currently a c4.8xlarge spot instance costs 45.7 cents/hr.
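As a rough sketch (assuming the build column above is wall-clock minutes, the price column is the spot price in cents per hour, and the instance is only billed while the build runs), the per-build cost works out roughly like this:

# Illustrative sketch: per-build cost from the (build minutes, cents/hr) table above.
instances = {
    "m3.2xlarge": (40, 13.5),
    "m4.4xlarge": (22, 20.6),
    "c4.2xlarge": (33, 15.3),
    "c4.4xlarge": (20, 22.9),
    "r3.4xlarge": (25, 25.3),
}

for name, (build_min, cents_per_hr) in sorted(instances.items()):
    cost_per_build = cents_per_hr * build_min / 60.0
    print("%-11s %2d min  %4.1f cents/build" % (name, build_min, cost_per_build))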
Comment 4 Gregory Szorc [:gps] (away until 2017-03-20) 2016-07-26 09:33:12 PDT
https://tools.taskcluster.net/aws-provisioner/ says we have ~100 instances of {dbg,opt}-linux-{32,64}. Assuming we run 100 instances 24/7, multiply the cost per hour by 74,400 (100 instances x 744 hours in a 31-day month) to get our monthly cost, e.g.

100x m3.2xlarge @ $0.135: $10,044
100x c4.2xlarge @ $0.153: $11,383
100x c4.4xlarge @ $0.229: $17,037
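A minimal sketch of that arithmetic, assuming 100 instances running 24/7 over a 31-day month (744 hours each, i.e. 74,400 instance-hours total):

# Sketch: monthly cost = instance count * hours per month * spot price per hour.
INSTANCE_COUNT = 100
HOURS_PER_MONTH = 24 * 31  # 744 hours in a 31-day month

spot_price_usd_per_hr = {
    "m3.2xlarge": 0.135,
    "c4.2xlarge": 0.153,
    "c4.4xlarge": 0.229,
}

for name, price in spot_price_usd_per_hr.items():
    monthly_cost = INSTANCE_COUNT * HOURS_PER_MONTH * price
    print("{:<11} ${:,.0f}/month".format(name, monthly_cost))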

The jump from m3.2xl to c4.2xl for a little over $1,000/mo is a no-brainer IMO.

Considering build jobs are the long pole in automation, I think throwing thousands of dollars per month at the problem is warranted.

jgriffin: did you test a full build job (symbol generation and all)? Or is this just the `mach build` piece?
Comment 5 Jonathan Griffin (:jgriffin) 2016-07-26 10:50:11 PDT
I ran the entire build, using TaskCluster's build.sh script. So, the raw data I was looking at was something like this:

PERFHERDER_DATA: {"framework": {"name": "build_metrics"}, "suites": [
  {"subtests": [{"name": "libxul.so", "value": 119959088}],
   "name": "installer size", "value": 66849994, "alertThreshold": 0.25},
  {"subtests": [
     {"name": "configure", "value": 25.377415895462036},
     {"name": "pre-export", "value": 0.4210519790649414},
     {"name": "export", "value": 26.041933059692383},
     {"name": "compile", "value": 773.1008520126343},
     {"name": "misc", "value": 2.1479151248931885},
     {"name": "libs", "value": 9.281205177307129},
     {"name": "tools", "value": 0.4877140522003174},
     {"name": "package-tests", "value": 111.22820997238159},
     {"name": "buildsymbols", "value": 204.79079699516296},
     {"name": "package", "value": 44.21001100540161},
     {"name": "upload", "value": 8.736287832260132}],
   "name": "build times", "value": 1207.6935601234436}]}

(this is for a c4.4xlarge instance)
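A small sketch of one way to pull the per-tier timings out of a PERFHERDER_DATA line like the one above (the tier_times helper and the log_line argument are just illustrative names, not part of the build tooling):

import json

def tier_times(log_line):
    # log_line is assumed to hold the full "PERFHERDER_DATA: {...}" line.
    payload = json.loads(log_line.split("PERFHERDER_DATA:", 1)[1])
    for suite in payload["suites"]:
        if suite["name"] == "build times":
            # Report seconds as minutes, for comparison with the table in
            # comment 3 (e.g. compile: 773.1 s is roughly 12.9 minutes).
            return {t["name"]: t["value"] / 60.0 for t in suite["subtests"]}
    return {}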
Comment 6 Gregory Szorc [:gps] (away until 2017-03-20) 2016-07-26 11:21:56 PDT
773s for a compile on a c4.4xlarge seems a bit long, since the c4.4xlarge has 16 vCPUs. I would expect the compile tier to take 300-450s on that instance type.

I wonder if ccache or sccache could be interfering here. Also, in the c4 series, everything except the c4.8xlarge is shared hardware. So if there are other instances on the same physical machine, you'll be competing for system resources.

Slow I/O, e.g. from EBS, could also be slowing things down.
