Closed Bug 1290282 Opened 4 years ago Closed 4 years ago

Move linux64 builds to c4.4xlarge/m4.4xlarge AWS instances

Categories

(Taskcluster :: General, defect)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jgriffin, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

We currently run linux64 builds on a mix of 2xlarge instances, according to garndt. I've done some experiments in bug 1287604 with more powerful instance types, and found we can shave about 20 minutes per build by moving to c4.4xlarge or m4.4xlarge instances.

The cost differential is not large (~$1k/month, according to gps' estimate), so I think we should go ahead and make this change.
To ensure the new instance types don't result in different intermittent rates, we should do some testing on try first. Greg, can you help set this up?
Flags: needinfo?(garndt)
Blocks: thunder-try
(In reply to Jonathan Griffin (:jgriffin) from comment #0)
> 
> The cost differential is not large (~$1k/month, according to gps' estimate),
> so I think we should go ahead and make this change.

Sorry, that should be $7k/month to switch to c4.4xlarge. Still probably a reasonable cost for the productivity gain.
My cost estimates were based on N instances running 24/7. I'm fairly certain that's not an accurate way to measure TC.

If we don't want to commit to this change immediately, we could spin up a new worker type, copy task definitions, and have them run side-by-side to flush out any weirdness in the switch to c4. Will cost a bit more to run double. But there won't be any risks to existing automation since the existing tasks will still run on the existing worker types.
This bug should excite some build peeps :)
> Will cost a bit more to run double.

Considering all the tasks we've been running double or triple for months... it would be a drop in the bucket.
I don't see a reason not to try it out.  Ideally we could increase the task concurrency of these instances too, so that they can run more than one task at a time.
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> I don't see a reason not to try it out.  Ideally we could increase the task
> concurrency of these instances too, so that they can run more than one task
> at a time.

You mean run more than one build on an instance concurrently? That doesn't seem like a good idea.
In different docker containers, yes.
(In reply to Jonathan Griffin (:jgriffin) from comment #1)
> To ensure the new instance types don't result in different intermittent
> rates, we should do some testing on try first. Greg, can you help set this
> up?

Typically what we've done in the past is create a test worker type that's identical to the existing ones with the exception of different EC2 instance types.  Then it's just a matter of updating the in-tree tasks to use that new worker type for the builds and retriggering as necessary on try until you have the level of confidence you're seeking.

Let me know if you would like me to create a worker type, and if so, could we use the same worker type for opt/debug 32/64 builds, at least for testing purposes, so I don't have to create a bunch of new test worker types?
Flags: needinfo?(garndt)
That sounds perfect, except these are builds so we'd only need a new build worker type (I think), that would kick off tests afterwards using the existing test workers. Using the same worker type for linux64 opt/debug is fine.
Apparently we can't use c4 instance types yet because TC provisioning can't handle EBS storage yet (or something like that).

In related news, Jonas gave me access to create worker types named gps-*. I've created a gps-c3-4xl and gps-c3-8xl worker type that are exactly what you think they are. You can configure Try tasks to use the worker type:

c3.2xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=72beda80fb4b1ca917506ed38f57bfe92260e011
c3.4xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0be7b0cc76657d60d4b4271257d3d38c7222c067
c3.8xlarge: https://treeherder.mozilla.org/#/jobs?repo=try&revision=74fb3ac57c6e0179800bdb032d0f6594d5683869

As you can see, build tasks get a nice speed-up from the 4xlarge. However, the jump from 4xl to 8xl isn't that significant. And, uh, I'm kinda puzzled as to why. Jonas pulled up the AWS console metrics and it shows CPU usage didn't hit 100% on the c3.8xlarge instance when it did for c3.4xlarge. It's possible I/O is limiting us somehow. I've definitely seen a c4.8xlarge max out all CPU cores during a build. So something weird is going on. Maybe I got a bad worker or something.

I just triggered a number of retries using up to 3 instances of each worker, so we should have some hopefully more promising data in the next 30 minutes or so. I may also schedule a few Try runs for all jobs on the new worker types to verify nothing breaks from using c3.4xlarge.
A c4 can be used, but a launch spec needs to be defined within the worker type definition to set up the EBS volume.  Here is an example pulled from the gecko-talos-c4large worker type:

      "launchSpec": {
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvdb",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 60
            }
          }
        ]
      }
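For reference, the fragment above can be sanity-checked programmatically. This is a minimal sketch (the `launchSpec` contents are copied from the worker type excerpt above; the assertions are illustrative, not an official schema check):

```python
import json

# launchSpec fragment as shown above (gecko-talos-c4large worker type)
launch_spec = json.loads("""
{
  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvdb",
      "Ebs": {
        "DeleteOnTermination": true,
        "VolumeSize": 60
      }
    }
  ]
}
""")

for mapping in launch_spec["BlockDeviceMappings"]:
    ebs = mapping["Ebs"]
    # EBS volumes attached to spot instances should not linger after the
    # instance goes away, or they keep accruing cost.
    assert ebs["DeleteOnTermination"], "spot EBS volumes must be cleaned up"
    assert ebs["VolumeSize"] >= 60, "build workspace wants at least 60 GB"
```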
Thanks for that info, Greg! It looks like I was able to spawn a c4.8xl and have a build job on Try now \o/

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2d88bfe8e54714211f6d6873dae844d8589b3c37

I also did a much fuller try push with the c3.4xl instances:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a924703c319813f4113997a2f6f9a1c58e665195
Something is really wonky with these c3.8xl and c4.8xl instances.

On the c4.8xl, configure took ages to run and was running at like 1 line a second. It was horrible.

Things sped up during the build. But Greg pulled dstat output during the compile tier and it's not good:

ubuntu@ip-172-31-0-146:~$ dstat
You did not select any stats, using -cdngy by default.
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 10   5  83   3   0   0| 705k   16M|   0     0 |   0     0 |  18k   43k
 39  15  42   4   0   0|   0    25M|  42k   44k|   0     0 |  52k  106k
 36   8  29  28   0   0|4096B   21M|  35k   34k|   0     0 |  38k   68k
 47   9  43   1   0   0|   0  4728k|  54k   56k|   0     0 |  48k   96k
 45  14  41   0   0   0|   0    28k|  25k   25k|   0     0 |  46k   93k
 42  14  44   0   0   0|   0    20k|  74k   74k|   0     0 |  53k  104k
 34  20  46   0   0   0|   0    36k|  77k   76k|   0     0 |  57k  111k
 35  19  45   0   0   0|   0  1376k|  53k   52k|   0     0 |  53k  104k
 37  20  43   0   0   0|   0  9456k|  79k   82k|   0     0 |  54k  104k
 28  27  46   0   0   0|   0    36k|  82k   81k|   0     0 |  54k  105k
 31  24  45   0   0   0|   0  8192B|  41k   41k|   0     0 |  51k   99k
 40  18  42   0   0   0|   0    28k|  44k   43k|   0     0 |  50k   97k
 36  21  42   0   0   0|   0  1444k| 135k  139k|   0     0 |  45k   88k
 33  21  46   1   0   0|   0    21M| 105k  100k|   0     0 |  47k   92k
 27  26  47   0   0   0|   0  2020k| 129k  132k|   0     0 |  52k  103k
 38  19  43   0   0   0|   0    32k|  90k   92k|   0     0 |  50k   96k

Only ~55% CPU utilization. And very high sys time and context switches. This didn't happen when I tried building inside an on-demand c4.8xl a few months ago. Something really wonky is going on.
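That ~55% utilization figure can be eyeballed from the table by averaging the `idl` column. A quick sketch, using the first few CPU-column rows from the dstat output above (the first row is dstat's since-boot average, so the numbers are only a rough estimate):

```python
# First six columns of a few dstat rows from above: usr sys idl wai hiq siq
samples = """\
 10   5  83   3   0   0
 39  15  42   4   0   0
 36   8  29  28   0   0
 47   9  43   1   0   0
 45  14  41   0   0   0
""".splitlines()

# "idl" is the third field; average it to estimate unused CPU.
idle = [int(line.split()[2]) for line in samples]
avg_idle = sum(idle) / len(idle)
print(f"average idle: {avg_idle:.1f}%")  # prints "average idle: 47.6%"
```

Across the full table the idle column averages roughly 45%, i.e. only ~55% of the CPU is doing work during what should be a fully parallel compile tier.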
Here is dstat output for a c4.8xlarge on-demand instance running Ubuntu 16.04:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 93   6   1   0   0   0|   0    41M| 660B  832B|   0     0 |9801  4182
 92   7   1   0   0   0|   0  7320k|2016B   13k|   0     0 |9596  7047
 93   7   1   0   0   0|   0  8888k|2310B   10k|   0     0 |9567  4825
 91   8   1   0   0   0|   0  9248k|2574B   11k|   0     0 |9798  6004
 93   7   1   0   0   0|   0  7464k| 924B 3044B|   0     0 |9530  4706
 94   6   0   0   0   0|   0   102M| 198B  730B|   0     0 |  10k 4105
 94   5   1   0   0   0|   0   105M|1386B 8132B|   0     0 |  12k 7619
 92   7   1   0   0   0|   0    16M|1716B 6180B|   0     0 |9710  8397
 93   6   0   0   0   0|   0  4096k| 396B 1220B|   0     0 |9570  5332
 93   7   1   0   0   0|   0     0 | 594B 1718B|   0     0 |9620  7620
 93   6   1   0   0   0|   0    11M| 792B 2352B|   0     0 |9681  6475
 93   7   0   0   0   0|   0    47M|1980B 8540B|   0     0 |  10k 6181
 92   8   0   0   0   0|   0  4096k| 330B 1110B|   0     0 |9511  6513
 92   7   1   0   0   0|   0  4096k| 528B 1624B|   0     0 |9565  5617
 93   6   1   0   0   0|   0     0 | 762B 2038B|   0     0 |9502  5408
 93   7   1   0   0   0|   0    12M| 330B 1616B|   0     0 |9610  5814
 92   7   1   0   0   0|   0    55M|1320B 4682B|   0     0 |  10k 7978
 91   8   1   0   0   0|   0    13M|1122B 3206B|   0     0 |9763  6553
 93   6   1   0   0   0|   0     0 | 264B  832B|   0     0 |9558  5728
 92   7   1   0   0   0|   0  4096k| 726B 1882B|   0     0 |9574  6183
 92   7   1   0   0   0|   0  9552k|1650B 5352B|   0     0 |9686  6569
 93   7   1   0   0   0|   0  4096k|4290B   20k|   0     0 |9774  6831
 92   7   1   0   0   0|   0    42M|2244B 8866B|   0     0 |  10k 7844
 92   7   1   0   0   0|   0     0 |1026B 5634B|   0     0 |9606  6329
 92   7   1   0   0   0|   0     0 |3432B   16k|   0     0 |9883  9022
 92   7   1   0   0   0|   0  6616k|1914B   10k|   0     0 |9665  6986
 92   7   1   0   0   0|   0     0 |  33k  262k|   0     0 |  15k   13k
 93   6   1   0   0   0|   0    15M|7524B   56k|   0     0 |  11k 6725
 93   6   1   0   0   0|   0     0 | 198B  754B|   0     0 |9555  6455
 93   6   1   0   0   0|   0     0 |1950B   17k|   0     0 |9756  7345
 93   6   1   0   0   0|   0  4368k|1188B 9116B|   0     0 |9714  7858
 93   7   1   0   0   0|   0     0 | 594B 1804B|   0     0 |9586  7807
 92   7   1   0   0   0|   0    63M| 660B 1996B|   0     0 |9888  7645
 93   6   1   0   0   0|   0     0 | 264B  848B|   0     0 |9504  5586

This is what things should look like: 1% CPU idle, <10k context switches per second, no I/O wait. Aside from I/O write bursting every several seconds (likely the page cache flushing), it looks like this for the duration of the "compile" tier, which is the build system phase/tier that can consume pretty much any number of CPUs you throw at it.

FWIW, `mach build` completes in 351s. The compile tier took 215s.

I'm going to try building inside a Docker container and see if anything changes.
Building inside a Docker container using a container-local aufs-based filesystem yielded similar behavior as the non-Docker build.

Whatever funkiness we're seeing appears specific to TaskCluster. I'd look into the slowdowns, but I really need root on the host instance (not a container) to do serious forensic work.

I don't think this blocks changing workers to beefier instances because we do see some modest time wins. But something is definitely funky in TaskCluster on 8xlarge instances *and* I wouldn't be surprised if 4xlarge instances were under-performing for similar reasons. I think we should move ahead with c[34].4xlarge changes and investigate TaskCluster inefficiencies in parallel.
Also, whatever inefficiency there is for those beefy instances might also apply to the currently used smaller ones.
Depends on: 1291940
Attached file Proposed worker type
Here is my proposed worker type. I basically copied the existing opt-linux-* worker type, dropped the r3 instance type, changed m3/c3.2xlarge to m4/c4.4xlarge, and added 60 GB EBS volumes.

Not sure if that 60 GB is enough. I'm also not sure if the 0.6 spot price limit is proper. https://ec2price.com/?product=Linux/UNIX&type=c4.4xlarge&region=us-west-2&window=60 looks like it should be, however.

I'm also not sure how we'll define the worker types. I'm tempted to create a giant pool. Then again, copying the scheme of the existing worker types ({dbg,opt}-linux-{32,64}) seems the least invasive.

There's also the question of whether we update the existing worker types or create new ones. Benefit of new is this rides the trains. Downside of new is we have to maintain old worker types forever in order to run old revisions in automation.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Attachment #8778997 - Flags: review?(jopsen)
Bug 1220686 has some guidance on naming workerTypes.  I'd argue for creating new, using those names.

In practice, "forever" is about 30 days.
Comment on attachment 8778997 [details]
Proposed worker type

Looks good to me...

> I'm tempted to create a giant pool
Me too - Besides we're supposed to not fear breaking stuff :)
And it forces us to ride the trains (or backporting automation changes).

If we do create a giant pool, will 60 GB be sufficient?
I suspect we still want separate caches to avoid poisoning.

Note: (60 GB EBS, priced by the hour)
  (0.10 USD/GB-month / (31 days/month * 24 hours/day)) * 60 GB = 0.0080645161 USD/hour
  With a spot price for c4.4xlarge around 0.191 USD/hour (us-west-1),
  I think we can easily afford to add more EBS.
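The arithmetic above checks out. A quick sketch, assuming the ~0.10 USD per GB-month EBS price used in the comment:

```python
# EBS cost per hour for a 60 GB volume at ~0.10 USD/GB-month
ebs_usd_per_gb_month = 0.10
hours_per_month = 31 * 24  # 744
ebs_usd_per_hour = ebs_usd_per_gb_month / hours_per_month * 60
print(f"{ebs_usd_per_hour:.10f} USD/hour")  # prints "0.0080645161 USD/hour"

# Compare against the c4.4xlarge spot price quoted above (us-west-1)
spot_usd_per_hour = 0.191
print(f"EBS adds {ebs_usd_per_hour / spot_usd_per_hour:.1%} "
      f"on top of the spot price")  # prints "EBS adds 4.2% ..."
```

So the EBS volume is noise next to the instance cost, which is why doubling its size would be an easy call.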

> Downside of new is we have to maintain old worker types forever
I think this largely amounts to not accidentally deleting the AMIs :)
Assuming the other branches are low-traffic, we probably don't care too much
if provisioning on older branches takes a long time.
Attachment #8778997 - Flags: review?(jopsen) → review+
Comment on attachment 8779031 [details]
Bug 1290282 - Switch to 16 vCPU instances for build tasks;

https://reviewboard.mozilla.org/r/70096/#review67318
Attachment #8779031 - Flags: review?(dustin) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe048eedd71a
Switch to 16 vCPU instances for build tasks; r=dustin
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67430

::: taskcluster/ci/legacy/tasks/builds/base_linux32.yml:15
(Diff revision 1)
>      - 'index.buildbot.revisions.{{head_rev}}.{{project}}.linux'
>  
>    scopes:
>      - 'docker-worker:cache:tooltool-cache'
>      - 'docker-worker:relengapi-proxy:tooltool.download.public'
>      - 'docker-worker:cache:level-{{level}}-{{project}}-build-linux32-workspace'

you'll need to update the scope to match the cache folder
Attachment #8779101 - Flags: review?(jopsen) → review-
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67430

> you'll need to update the scope to match the cache folder

the tree will have: `assume:moz-tree:level:X`
which gives it scope: `docker-worker:cache:level-X-*`

But when we pass the scope: `docker-worker:cache:level-{{level}}-{{project}}-build-linux32-workspace`
That is a subscope of: `docker-worker:cache:level-X-*`
Which is fine... 
It's just that when we change the cache name, we have to also change the scope in task.scopes.

No permission bits need to be changed as the decision tasks has: `docker-worker:cache:level-X-*`
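The subscope relationship described above can be modeled as simple prefix matching. This is a simplified sketch of star-terminated scope satisfaction, not the real taskcluster implementation, and the cache names below are illustrative (filled in with level 1 and project `try`; the exact renamed cache name depends on the patch):

```python
def satisfies(granted: str, required: str) -> bool:
    """Return True if the granted scope covers the required scope.

    A scope ending in '*' matches any scope that starts with the text
    before the '*'; otherwise scopes must match exactly.
    """
    if granted.endswith("*"):
        return required.startswith(granted[:-1])
    return granted == required

# The decision task assumes docker-worker:cache:level-X-*, so any cache
# renamed under that prefix is still covered -- only task.scopes needs
# updating, no permission changes.
granted = "docker-worker:cache:level-1-*"
old_cache = "docker-worker:cache:level-1-try-build-linux32-workspace"
new_cache = "docker-worker:cache:level-1-try-build-linux32-opt-workspace"
assert satisfies(granted, old_cache) and satisfies(granted, new_cache)
```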
Comment on attachment 8779101 [details]
Bug 1290282 - Add build type to cache name;

https://reviewboard.mozilla.org/r/70132/#review67438
Attachment #8779101 - Flags: review?(jopsen) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/147b245de122
Add build type to cache name; r=jonasfj
Perfherder has already issued alerts for this change:

https://treeherder.mozilla.org/perf.html#/alerts?id=2331

Linux 64 debug 2650.79 > 2046.36
Linux 64 ASAN  2225.96 > 1320.64
Linux 64 opt   1706.22 > 972.77

The subtests are interesting. https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=autoland&originalRevision=9daeb2123b995f4543df67c705ac629d9e76e4fa&newProject=autoland&newRevision=fc6ed18f76e16d3e2a392c93fd77b867e12955e4&originalSignature=1d8205fa40e1c5a68ccbf35ddd6310b15b6d4945&newSignature=1d8205fa40e1c5a68ccbf35ddd6310b15b6d4945&framework=2 shows a clear win in the compile tier. 1770.11 -> 709.49. However, there are some significant regressions in other tiers like export and misc (which do a lot of I/O). This shouldn't be aufs (bug 1291940) since the workspace is a host ext4 volume mount. Perhaps this boils down to EBS being a bit slower than instance storage :/

Still, we're faster overall. But it is unfortunate to have any regressions.
https://hg.mozilla.org/mozilla-central/rev/fe048eedd71a
https://hg.mozilla.org/mozilla-central/rev/147b245de122
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Blocks: 1293717