Closed Bug 1290282 Opened 8 years ago Closed 7 years ago

Move linux64 builds to c4.4xlarge/m4.4xlarge AWS instances

Categories: Taskcluster :: General, defect
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jgriffin; Assigned: gps
References: Blocks 1 open bug
Attachments: 3 files
We currently run linux64 builds on a mix of 2xlarge instances, according to garndt. I've done some experiments in bug 1287604 with more powerful instance types, and found we can shave about 20 minutes per build by moving to c4.4xlarge or m4.4xlarge instances. The cost differential is not large (~$1k/month, according to gps' estimate), so I think we should go ahead and make this change.
Reporter
Comment 1•8 years ago
To ensure the new instance types don't result in different intermittent rates, we should do some testing on try first. Greg, can you help set this up?
Flags: needinfo?(garndt)
Reporter
Updated•8 years ago
Blocks: thunder-try
Reporter
Comment 2•8 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
> The cost differential is not large (~$1k/month, according to gps' estimate),
> so I think we should go ahead and make this change.

Sorry, that should be $7k/month to switch to c4.4xlarge. Still probably a reasonable cost for the productivity gain.
Assignee
Comment 3•8 years ago
My cost estimates were based on N instances running 24/7. I'm fairly certain that's not an accurate way to model TaskCluster usage.

If we don't want to commit to this change immediately, we could spin up a new worker type, copy the task definitions, and have them run side-by-side to flush out any weirdness in the switch to c4. Will cost a bit more to run double. But there won't be any risks to existing automation since the existing tasks will still run on the existing worker types.
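To make the 24/7 caveat concrete, here is a rough sketch of how such an estimate scales with utilization. The instance count, price delta, and utilization below are made-up illustrative values, not figures from this bug:

```python
# Back-of-the-envelope cost model. All numbers here are hypothetical, chosen
# only to show how the "N instances running 24/7" assumption inflates the
# estimate relative to an autoscaled pool that idles down between tasks.
HOURS_PER_MONTH = 31 * 24  # 744

def monthly_cost_delta(num_instances, price_delta_per_hour, utilization=1.0):
    """Extra monthly spend for a pool at a given hourly price difference.

    utilization=1.0 corresponds to the always-on (24/7) assumption.
    """
    return num_instances * price_delta_per_hour * HOURS_PER_MONTH * utilization

print(monthly_cost_delta(50, 0.25))       # ~9300 USD/month if always on
print(monthly_cost_delta(50, 0.25, 0.4))  # ~3720 USD/month at 40% utilization
```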
Assignee
Comment 4•8 years ago
This bug should excite some build peeps :)
Comment 5•8 years ago
> Will cost a bit more to run double.
Considering all the tasks we've been running double or triple for months... it would be a drop in the bucket.
Comment 6•8 years ago
I don't see a reason not to try it out. Ideally we could increase the task concurrency of these instances too, so that they can run more than one task at a time.
Comment 7•8 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> I don't see a reason not to try it out. Ideally we could increase the task
> concurrency of these instances too, so that they can run more than one task
> at a time.

You mean run more than one build on an instance concurrently? That doesn't seem like a good idea.
Comment 8•8 years ago
In different docker containers, yes.
Comment 9•8 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #1)
> To ensure the new instance types don't result in different intermittent
> rates, we should do some testing on try first. Greg, can you help set this
> up?

Typically what we've done in the past is create a test worker type that's identical to the existing ones, with the exception of the EC2 instance type. Then it's just a matter of updating the in-tree tasks to use that new worker type for the builds and retriggering as necessary on try until you reach the level of confidence you're seeking.

Let me know if you would like me to create a worker type. If so, could we use the same worker type for opt/debug 32/64 builds, at least for testing purposes, so I don't have to create a bunch of new test worker types?
Flags: needinfo?(garndt)
Reporter
Comment 10•8 years ago
That sounds perfect, except these are builds, so we'd only need a new build worker type (I think), which would kick off tests afterwards using the existing test workers. Using the same worker type for linux64 opt/debug is fine.
Assignee
Comment 11•8 years ago
Apparently we can't use c4 instance types yet because TC provisioning can't handle EBS storage yet (or something like that).

In related news, Jonas gave me access to create worker types named gps-*. I've created gps-c3-4xl and gps-c3-8xl worker types that are exactly what you think they are. You can configure Try tasks to use the worker type:

c3.2xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=72beda80fb4b1ca917506ed38f57bfe92260e011
c3.4xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0be7b0cc76657d60d4b4271257d3d38c7222c067
c3.8xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=74fb3ac57c6e0179800bdb032d0f6594d5683869

As you can see, build tasks get a nice speed-up from the 4xlarge. However, the jump from 4xl to 8xl isn't that significant, and I'm kinda puzzled as to why. Jonas pulled up the AWS console metrics and they show CPU usage didn't hit 100% on the c3.8xlarge instance when it did on the c3.4xlarge. It's possible I/O is limiting us somehow. I've definitely seen a c4.8xlarge max out all CPU cores during a build, so something weird is going on. Maybe I got a bad worker or something.

I just triggered a number of retries using up to 3 instances of each worker type, so we should have some hopefully more promising data in the next 30 minutes or so. I may also schedule a few Try runs for all jobs on the new worker types to verify nothing breaks from using c3.4xlarge.
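For reference, pointing a build task at one of these experimental workers only requires changing the worker type in the task definition. A minimal, hypothetical fragment follows; the real in-tree task definitions carry many more fields, and the provisioner ID shown is an assumption about the AWS provisioner in use at the time:

```python
# Sketch of the relevant slice of a TaskCluster task definition.
# provisionerId/workerType are real task-definition fields; everything else
# is elided. The specific provisionerId value is an assumption.
task_fragment = {
    "provisionerId": "aws-provisioner-v1",  # assumed provisioner for these workers
    "workerType": "gps-c3-4xl",             # experimental worker type from this bug
    # ... payload, scopes, routes, etc. unchanged from the normal build task ...
}
```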
Comment 12•8 years ago
A c4 can be used, but a launchSpec needs to be defined within the worker type definition to set up the EBS volume. Here is an example pulled from the gecko-talos-c4large worker type:

"launchSpec": {
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvdb",
      "Ebs": {
        "DeleteOnTermination": true,
        "VolumeSize": 60
      }
    }
  ]
}
Assignee
Comment 13•8 years ago
Thanks for that info, Greg! It looks like I was able to spawn a c4.8xl and have a build job on Try now \o/

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2d88bfe8e54714211f6d6873dae844d8589b3c37

I also did a much fuller try push with the c3.4xl instances:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a924703c319813f4113997a2f6f9a1c58e665195
Assignee
Comment 14•8 years ago
Something is really wonky with these c3.8xl and c4.8xl instances. On the c4.8xl, configure took ages to run and was emitting output at like 1 line a second. It was horrible. Things sped up during the build, but Greg pulled dstat output during the compile tier and it's not good:

ubuntu@ip-172-31-0-146:~$ dstat
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 10   5  83   3   0   0| 705k   16M|   0     0 |   0     0 | 18k   43k
 39  15  42   4   0   0|   0    25M|  42k   44k|   0     0 | 52k  106k
 36   8  29  28   0   0|4096B   21M|  35k   34k|   0     0 | 38k   68k
 47   9  43   1   0   0|   0  4728k|  54k   56k|   0     0 | 48k   96k
 45  14  41   0   0   0|   0    28k|  25k   25k|   0     0 | 46k   93k
 42  14  44   0   0   0|   0    20k|  74k   74k|   0     0 | 53k  104k
 34  20  46   0   0   0|   0    36k|  77k   76k|   0     0 | 57k  111k
 35  19  45   0   0   0|   0  1376k|  53k   52k|   0     0 | 53k  104k
 37  20  43   0   0   0|   0  9456k|  79k   82k|   0     0 | 54k  104k
 28  27  46   0   0   0|   0    36k|  82k   81k|   0     0 | 54k  105k
 31  24  45   0   0   0|   0  8192B|  41k   41k|   0     0 | 51k   99k
 40  18  42   0   0   0|   0    28k|  44k   43k|   0     0 | 50k   97k
 36  21  42   0   0   0|   0  1444k| 135k  139k|   0     0 | 45k   88k
 33  21  46   1   0   0|   0    21M| 105k  100k|   0     0 | 47k   92k
 27  26  47   0   0   0|   0  2020k| 129k  132k|   0     0 | 52k  103k
 38  19  43   0   0   0|   0    32k|  90k   92k|   0     0 | 50k   96k

Only ~55% CPU utilization, and very high sys time and context switches. This didn't happen when I tried building inside an on-demand c4.8xl a few months ago. Something really wonky is going on.
Assignee
Comment 15•7 years ago
Here is dstat output for a c4.8xlarge on-demand instance running Ubuntu 16.04:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 93   6   1   0   0   0|   0    41M| 660B  832B|   0     0 |9801  4182
 92   7   1   0   0   0|   0  7320k|2016B   13k|   0     0 |9596  7047
 93   7   1   0   0   0|   0  8888k|2310B   10k|   0     0 |9567  4825
 91   8   1   0   0   0|   0  9248k|2574B   11k|   0     0 |9798  6004
 93   7   1   0   0   0|   0  7464k| 924B 3044B|   0     0 |9530  4706
 94   6   0   0   0   0|   0   102M| 198B  730B|   0     0 | 10k  4105
 94   5   1   0   0   0|   0   105M|1386B 8132B|   0     0 | 12k  7619
 92   7   1   0   0   0|   0    16M|1716B 6180B|   0     0 |9710  8397
 93   6   0   0   0   0|   0  4096k| 396B 1220B|   0     0 |9570  5332
 93   7   1   0   0   0|   0     0 | 594B 1718B|   0     0 |9620  7620
 93   6   1   0   0   0|   0    11M| 792B 2352B|   0     0 |9681  6475
 93   7   0   0   0   0|   0    47M|1980B 8540B|   0     0 | 10k  6181
 92   8   0   0   0   0|   0  4096k| 330B 1110B|   0     0 |9511  6513
 92   7   1   0   0   0|   0  4096k| 528B 1624B|   0     0 |9565  5617
 93   6   1   0   0   0|   0     0 | 762B 2038B|   0     0 |9502  5408
 93   7   1   0   0   0|   0    12M| 330B 1616B|   0     0 |9610  5814
 92   7   1   0   0   0|   0    55M|1320B 4682B|   0     0 | 10k  7978
 91   8   1   0   0   0|   0    13M|1122B 3206B|   0     0 |9763  6553
 93   6   1   0   0   0|   0     0 | 264B  832B|   0     0 |9558  5728
 92   7   1   0   0   0|   0  4096k| 726B 1882B|   0     0 |9574  6183
 92   7   1   0   0   0|   0  9552k|1650B 5352B|   0     0 |9686  6569
 93   7   1   0   0   0|   0  4096k|4290B   20k|   0     0 |9774  6831
 92   7   1   0   0   0|   0    42M|2244B 8866B|   0     0 | 10k  7844
 92   7   1   0   0   0|   0     0 |1026B 5634B|   0     0 |9606  6329
 92   7   1   0   0   0|   0     0 |3432B   16k|   0     0 |9883  9022
 92   7   1   0   0   0|   0  6616k|1914B   10k|   0     0 |9665  6986
 92   7   1   0   0   0|   0     0 |  33k  262k|   0     0 | 15k   13k
 93   6   1   0   0   0|   0    15M|7524B   56k|   0     0 | 11k  6725
 93   6   1   0   0   0|   0     0 | 198B  754B|   0     0 |9555  6455
 93   6   1   0   0   0|   0     0 |1950B   17k|   0     0 |9756  7345
 93   6   1   0   0   0|   0  4368k|1188B 9116B|   0     0 |9714  7858
 93   7   1   0   0   0|   0     0 | 594B 1804B|   0     0 |9586  7807
 92   7   1   0   0   0|   0    63M| 660B 1996B|   0     0 |9888  7645
 93   6   1   0   0   0|   0     0 | 264B  848B|   0     0 |9504  5586

This is what things should look like: 1% CPU idle, <10k context switches per second, no I/O wait. Aside from I/O write bursting every several seconds (likely the page cache flushing), it looks like this for the duration of the "compile" tier, which is the build system phase/tier that can consume pretty much any number of CPUs you throw at it. FWIW, `mach build` completes in 351s. The compile tier took 215s.

I'm going to try to build inside a Docker container and see if anything changes.
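For anyone wanting to reduce these two dumps to numbers, here is a small sketch that averages the CPU and context-switch figures out of pasted `dstat -cdngy` rows. It assumes the exact column layout shown above:

```python
def parse_size(tok):
    """Convert dstat tokens like '10k', '4096B', '21M', or '0' to a float."""
    mult = {"B": 1, "k": 1e3, "M": 1e6, "G": 1e9}
    if tok[-1] in mult:
        return float(tok[:-1]) * mult[tok[-1]]
    return float(tok)

def summarize(dstat_text):
    """Average usr/sys/idle CPU and context switches from pasted dstat rows."""
    usr = sys_ = idl = csw = 0.0
    rows = 0
    for line in dstat_text.splitlines():
        # Data rows contain '|' field separators; header lines contain 'usr'.
        if "|" not in line or "usr" in line:
            continue
        cpu, _dsk, _net, _paging, system = [f.split() for f in line.split("|")]
        usr += float(cpu[0]); sys_ += float(cpu[1]); idl += float(cpu[2])
        csw += parse_size(system[1])  # second column of ---system-- is csw
        rows += 1
    return {"usr": usr / rows, "sys": sys_ / rows,
            "idl": idl / rows, "csw/s": csw / rows}

# summarize(taskcluster_c4_8xl_dump) vs summarize(on_demand_c4_8xl_dump)
# reproduces the contrast above: roughly 50-55% vs ~99% CPU busy, and
# roughly 95k vs under 10k context switches per second.
```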
Assignee
Comment 16•7 years ago
Building inside a Docker container using a container-local aufs-based filesystem yielded similar behavior to the non-Docker build, so whatever funkiness we're seeing appears specific to the TaskCluster workers rather than Docker itself. I'd look into the slowdowns, but I really need root on the host instance (not a container) to do serious forensic work.

I don't think this blocks changing workers to beefier instances, because we do see some modest time wins. But something is definitely funky in TaskCluster on 8xlarge instances *and* I wouldn't be surprised if 4xlarge instances were under-performing for similar reasons. I think we should move ahead with the c[34].4xlarge changes and investigate the TaskCluster inefficiencies in parallel.
Comment 17•7 years ago
Also, whatever inefficiency there is for those beefy instances might also apply to the currently used smaller ones.
Assignee
Comment 18•7 years ago
Here is my proposed worker type. I basically copied the existing opt-linux-* worker type, dropped the r3 instance type, changed m3/c3.2xlarge to m4/c4.4xlarge, and added 60 GB EBS volumes. I'm not sure whether 60 GB is enough. I'm also not sure whether the 0.6 spot price limit is appropriate; https://ec2price.com/?product=Linux/UNIX&type=c4.4xlarge&region=us-west-2&window=60 suggests it should be.

I'm also not sure how we'll define the worker types. I'm tempted to create a giant pool. Then again, copying the scheme of the existing worker types ({dbg,opt}-linux-{32,64}) seems the least invasive. There's also the question of whether we update the existing worker types or create new ones. Benefit of new is this rides the trains. Downside of new is we have to maintain old worker types forever in order to run old revisions in automation.
Comment 19•7 years ago
Bug 1220686 has some guidance on naming workerTypes. I'd argue for creating new ones, using those names. In practice, "forever" is about 30 days.
Comment 20•7 years ago
Comment on attachment 8778997 [details]
Proposed worker type

Looks good to me...

> I'm tempted to create a giant pool

Me too. Besides, we're supposed to not fear breaking stuff :) And it forces us to ride the trains (or backport automation changes). If we do create a giant pool, will 60 GB be sufficient? I suspect we still want separate caches to avoid poisoning.

Note: 60 GB of EBS by the hour is (0.1 USD/GB-month / (31 days/month * 24 hours/day)) * 60 GB ≈ 0.0080645 USD/hour. With a spot price for c4.4xlarge around 0.191 USD/hour (us-west-1), I think we can easily afford to add more EBS.

> Downside of new is we have to maintain old worker types forever

I think this largely amounts to not accidentally deleting the AMIs :) Assuming the other branches are low-traffic, we probably don't care too much if provisioning on older branches takes a long time.
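Spelled out as a quick calculation, using only the prices quoted above:

```python
EBS_USD_PER_GB_MONTH = 0.10   # per-GB-month EBS rate used in the note above
HOURS_PER_MONTH = 31 * 24     # 744

ebs_usd_per_hour = EBS_USD_PER_GB_MONTH / HOURS_PER_MONTH * 60  # 60 GB volume
spot_usd_per_hour = 0.191     # c4.4xlarge spot price quoted for us-west-1

print(ebs_usd_per_hour)                      # ~0.00806 USD/hour
print(ebs_usd_per_hour / spot_usd_per_hour)  # EBS adds ~4% to the instance cost
```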
Attachment #8778997 - Flags: review?(jopsen) → review+
Comment hidden (mozreview-request)
Comment 22•7 years ago
mozreview-review
Comment on attachment 8779031 [details]
Bug 1290282 - Switch to 16 vCPU instances for build tasks;

https://reviewboard.mozilla.org/r/70096/#review67318
Attachment #8779031 - Flags: review?(dustin) → review+
Comment 23•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe048eedd71a
Switch to 16 vCPU instances for build tasks; r=dustin
Comment hidden (mozreview-request)
Comment 25•7 years ago
mozreview-review
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67430

::: taskcluster/ci/legacy/tasks/builds/base_linux32.yml:15
(Diff revision 1)
>   - 'index.buildbot.revisions.{{head_rev}}.{{project}}.linux'
>
>  scopes:
>   - 'docker-worker:cache:tooltool-cache'
>   - 'docker-worker:relengapi-proxy:tooltool.download.public'
>   - 'docker-worker:cache:level-{{level}}-{{project}}-build-linux32-workspace'

You'll need to update the scope to match the cache folder.
Attachment #8779101 - Flags: review?(jopsen) → review-
Comment 26•7 years ago
mozreview-review-reply
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67430

> you'll need to update the scope to match the cache folder

The tree will have `assume:moz-tree:level:X`, which gives it the scope `docker-worker:cache:level-X-*`. When we pass the scope `docker-worker:cache:level-{{level}}-{{project}}-build-linux32-workspace`, that is a subscope of `docker-worker:cache:level-X-*`, which is fine. It's just that when we change the cache name, we also have to change the scope in task.scopes. No permission bits need to be changed, as the decision task has `docker-worker:cache:level-X-*`.
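The subscope relationship described here follows TaskCluster's usual rule that a scope ending in `*` covers anything sharing that prefix. A tiny sketch of that check, using a hypothetical renamed cache (the exact new cache name isn't spelled out in this comment):

```python
def satisfies(granted, required):
    """True if a granted scope covers a required scope: exact match, or
    prefix match when the granted scope ends with '*'."""
    if granted.endswith("*"):
        return required.startswith(granted[:-1])
    return granted == required

granted = "docker-worker:cache:level-3-*"  # from assume:moz-tree:level:3
old_cache = "docker-worker:cache:level-3-autoland-build-linux32-workspace"
# Hypothetical renamed cache with the build type added:
new_cache = "docker-worker:cache:level-3-autoland-build-linux32-opt-workspace"

assert satisfies(granted, old_cache)
assert satisfies(granted, new_cache)  # renaming the cache stays within level-3-*
```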
Comment hidden (mozreview-request)
Comment 28•7 years ago
mozreview-review
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67438
Attachment #8779101 - Flags: review?(jopsen) → review+
Comment 29•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/147b245de122
Add build type to cache name; r=jonasfj
Assignee
Comment 30•7 years ago
Perfherder has already issued alerts for this change: https://treeherder.mozilla.org/perf.html#/alerts?id=2331

Linux 64 debug: 2650.79 -> 2046.36
Linux 64 ASAN:  2225.96 -> 1320.64
Linux 64 opt:   1706.22 -> 972.77

The subtests are interesting. https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=9daeb2123b995f4543df67c705ac629d9e76e4fa&newProject=autoland&newRevision=fc6ed18f76e16d3e2a392c93fd77b867e12955e4&originalSignature=1d8205fa40e1c5a68ccbf35ddd6310b15b6d4945&newSignature=1d8205fa40e1c5a68ccbf35ddd6310b15b6d4945&framework=2 shows a clear win in the compile tier: 1770.11 -> 709.49. However, there are some significant regressions in other tiers like export and misc (which do a lot of I/O). This shouldn't be aufs (bug 1291940) since the workspace is a host ext4 volume mount. Perhaps this boils down to EBS being a bit slower than instance storage :/

Still, we're faster overall. But it is unfortunate to have any regressions.
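For context, the same numbers expressed as relative improvements (straight arithmetic on the figures above):

```python
# Before/after build times (seconds) from the Perfherder alerts above.
before_after = {
    "linux64 debug": (2650.79, 2046.36),
    "linux64 asan":  (2225.96, 1320.64),
    "linux64 opt":   (1706.22, 972.77),
    "compile tier":  (1770.11, 709.49),
}
for name, (old, new) in before_after.items():
    print(f"{name}: {100 * (old - new) / old:.0f}% shorter")
# roughly 23%, 41%, 43% and 60% reductions in time, respectively
```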
Comment 31•7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/fe048eedd71a
https://hg.mozilla.org/mozilla-central/rev/147b245de122
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED