Bug 1430878 (Closed) · Opened 7 years ago · Closed 7 years ago

Add worker type with >30 CPU cores to support some toolchain tasks

Categories

(Firefox Build System :: Task Configuration, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla60

People

(Reporter: gps, Assigned: gps)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We currently use the standard builder worker type for toolchain tasks. It is backed by [cm][45].4xlarge instance types, which have 16 vCPUs.

Some of the toolchain tasks (notably Clang and GCC builds) are heavily CPU bound and should be able to scale up to take advantage of >>30 vCPUs.

I think we should add a new worker type backed by c5.18xlarge (72 vCPU), m5.24xlarge (96 vCPU), etc., and move CPU-bound toolchain tasks to it. This *should* shave minutes off individual toolchain tasks. And it could potentially shave dozens of minutes off the end-to-end times for rebuilding toolchains.

I /think/ this should be relatively low-effort:

1) Define the new worker type(s)
2) Perform Try pushes to find the sweet spot for the underlying EC2 instance type, so we're not throwing money at cores we don't use
3) Switch tasks to the new worker type

Strictly speaking, we don't need to define the new worker type to test things: I can provision new AWS instances at will now and run things in Docker containers manually. But it's certainly easier to do it in the context of TC.

Also, we may want to wait for bug 1424376 so we don't waste money on idle instances that linger because the worker logic still assumes AWS bills hourly.
From a c5.18xlarge worker building Clang 6:

[task 2018-01-16T21:35:04.398Z] -- Configuring done
[task 2018-01-16T21:36:25.043Z] -- Generating done
[task 2018-01-16T21:36:25.127Z] CMake Warning:
[task 2018-01-16T21:36:25.127Z]   Manually-specified variables were not used by the project:
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.127Z]     LIBCXX_LIBCPPABI_VERSION
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.129Z] -- Build files have been written to: /builds/worker/workspace/moz-toolchain/build/stage1/build
[task 2018-01-16T21:36:25.759Z] cd "/builds/worker/workspace/build/src/build/build-clang"
[task 2018-01-16T21:36:25.759Z] cd "/builds/worker/workspace/moz-toolchain/build/stage1/build"
[task 2018-01-16T21:36:25.759Z] ninja install
[task 2018-01-16T21:36:26.954Z] [1/3524] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/APInt.cpp.o
[task 2018-01-16T21:38:14.234Z] [3000/3524] Building CXX object tools/clang/lib/Driver/CMakeFiles/clangDriver.dir/ToolChains/Haiku.cpp.o
[task 2018-01-16T21:39:35.046Z] [3524/3524] Install the project...
[task 2018-01-16T21:39:35.052Z] -- Install configuration: "Release"

So, stage1 took ~190s. It was at ~100% CPU usage for most of that, although it definitely trailed off towards the end (likely as it finished .o generation and moved on to linking).
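
For reference, the durations here are just deltas between the [task ...] timestamps. A throwaway snippet like this computes them (the log lines are pasted in as literals):

# Throwaway helper: elapsed seconds between two "[task ...]" log lines.
from datetime import datetime

def task_ts(line):
    stamp = line.split("[task ", 1)[1].split("]", 1)[0]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")

start = task_ts("[task 2018-01-16T21:36:25.759Z] ninja install")
end = task_ts("[task 2018-01-16T21:39:35.046Z] [3524/3524] Install the project...")
print((end - start).total_seconds())  # ~190s for stage1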

Compare to https://public-artifacts.taskcluster.net/Jb_6wdWUTDqxnbaHvp1x0g/0/public/logs/live_backing.log:

[task 2018-01-16T19:39:24.099Z] -- Build files have been written to: /builds/worker/workspace/moz-toolchain/build/stage1/build
[task 2018-01-16T19:39:24.657Z] cd "/builds/worker/workspace/build/src/build/build-clang"
[task 2018-01-16T19:39:24.657Z] cd "/builds/worker/workspace/moz-toolchain/build/stage1/build"
[task 2018-01-16T19:39:24.657Z] ninja install
[task 2018-01-16T19:39:26.252Z] [1/3524] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/APInt.cpp.o
[task 2018-01-16T19:50:00.299Z] [3000/3524] Building CXX object tools/clang/lib/Frontend/CMakeFiles/clangFrontend.dir/CacheTokens.cpp.o
[task 2018-01-16T19:53:22.800Z] [3524/3524] Install the project...

That took ~840s. So only an ~11-minute speedup.

And that's only for stage 1.

Stage 2 appears to take a similar amount of wall time. It completes at:

[task 2018-01-16T21:44:03.156Z] [3083/3083] Install the project...

And stage 3 at:

[task 2018-01-16T21:48:23.858Z] [3083/3083] Install the project...

It's worth noting that there are some very slow single core bottlenecks in this task:

* Subversion operations against llvm.org (glandium wants to "cache" source archives in another task's artifact)
* cmake takes ~90s per invocation to generate backend files (after the "configure"-like phase). I'm not sure why it takes so long. That's painful enough that it might be worth our time to profile it and report/fix the problem upstream. Maybe upgrading cmake will magically make it faster?
* fetching and xz decompression of other toolchains
* xz compression for ourselves

But even with those bottlenecks, the wall time savings over the [cm][45].4xlarge are massive. On this c5.18xlarge, the task took ~25 minutes instead of ~60 minutes. We can probably shave another 10 minutes by improving the single core bottlenecks.
GCC 6's build system is quite obviously not as efficient as Clang's:

real    24m15.192s
user    149m1.928s
sys     5m15.660s

This is end-to-end time, including a Mercurial clone, toolchain download, xz compression, etc.

Chalk the CPU inefficiency up to a poor implementation of a GNU make backend versus cmake+ninja. It's quite obvious from looking at CPU load that GCC's build system does a lot of per-directory traversal. You see a bunch of processes get spawned and CPU jumps up. Then it slowly tails off to ~0%. Then a new batch comes in.
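
(You can watch that sawtooth with a trivial sampler running alongside the build -- a sketch assuming the third-party psutil package, not anything in the task itself:)

# Sample system-wide CPU usage once a second while the build runs in
# another shell. Requires the third-party psutil package.
import time
import psutil

for _ in range(1800):  # ~30 minutes
    pct = psutil.cpu_percent(interval=1)  # blocks ~1s, returns overall %
    print(time.strftime("%H:%M:%S"), "cpu=%.1f%%" % pct)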

By contrast, an existing task (https://tools.taskcluster.net/task-inspector/#QRxcSFCTRzuVyE9ZcnO9Tg) took ~34 minutes.

While the c5.18xlarge is faster, it is a waste for GCC 6 tasks because too many cores remain idle too much of the time. I rarely saw CPU usage approach 100%. I don't think it hit 90% once. Clang, by contrast, was at ~100% for dozens of seconds multiple times in its build. A c5.9xlarge would likely be the sweet spot for GCC tasks.
And with Clang 3.9:

real    19m1.485s
user    301m7.308s
sys     15m53.840s

~6 minutes of that was xz.

By comparison, https://tools.taskcluster.net/task-inspector/#SOKOIfXtS6urJ--jtWpOAw took 34 minutes. So not as big a win as Clang 6. But once our single-threaded bottlenecks are optimized, Clang 3.9 appears to be highly CPU efficient like Clang 6, and worth throwing cores at.
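
(For the record, user/real from time(1) approximates the average number of busy cores, which is how I'm judging CPU efficiency here:)

# user/real from time(1) ~= average number of busy cores over the task.
real = 19 + 1.485 / 60    # 19m1.485s
user = 301 + 7.308 / 60   # 301m7.308s
print(user / real)        # ~15.8 cores busy on average, dragged down by
                          # xz and the other serial phases noted above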
The docker-worker changes to move us off hourly billing have now been deployed. So idle instances will linger for 15 minutes before terminating. That should reduce our exposure to being billed for expensive instances when idle.

dustin: would you be willing to create a new AWS worker type for us?

I'm not sure of the names, but I think we'll want two flavors: beefy and super beefy.

beefy: c4.8xlarge, c5.9xlarge. Maybe add m4.10xlarge and m5.12xlarge with a higher utility factor (more cores but more expensive)

super beefy: m4.16xlarge, c5.18xlarge. Maybe add m5.24xlarge with a higher utility factor.

We'll likely only use "beefy" for now. I'd like to have a worker type with 64 cores available to facilitate testing if nothing else. Although post bug 1392370, we could likely deploy Clang toolchain tasks to 64 core machines and not feel too guilty about wasting cores...
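
To make the utility factor idea concrete, here is roughly the shape I have in mind for "beefy" -- illustrative only: the field names are from memory of the AWS Provisioner config, and the capacity/utility values are made up:

# Illustrative sketch of a "beefy" worker type for the AWS Provisioner.
# Field names follow the provisioner's general shape from memory; the
# capacity and utility values are made up.
beefy = {
    "minCapacity": 0,
    "maxCapacity": 20,
    "instanceTypes": [
        {"instanceType": "c4.8xlarge", "capacity": 1, "utility": 1.0},
        {"instanceType": "c5.9xlarge", "capacity": 1, "utility": 1.0},
        # More cores but more expensive, hence a higher utility factor:
        {"instanceType": "m4.10xlarge", "capacity": 1, "utility": 1.2},
        {"instanceType": "m5.12xlarge", "capacity": 1, "utility": 1.4},
    ],
}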
Depends on: 1392370
Flags: needinfo?(dustin)
I don't think I'm the best person for that -- I'm a little out of touch with how the deployments work.  Maybe Wander can help?
Flags: needinfo?(dustin)
gecko-L-toolchain and gecko-L-toolchain-huge were created.
(In reply to Gregory Szorc [:gps] from comment #1)
> * fetching and xz decompression of other toolchains

I don't know if toolchain tasks use `mach artifact toolchain`, but I filed bug 1421734 not long ago when looking at the setup overhead of some task. It didn't look like it'd be hard to make that code work in parallel, which should be a decent win when we have enough cores to decompress all the packages at once. (I assume we can probably download as much as we want in parallel from S3 without maxing anything out.)
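
Something along these lines is what I have in mind -- not the actual `mach artifact toolchain` code; the URLs and destination directory are placeholders:

# Fetch and decompress several .tar.xz toolchains in parallel; lzma
# handles the xz stream and tarfile extracts it as it downloads.
import lzma
import tarfile
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_and_extract(url, destdir):
    with urllib.request.urlopen(url) as resp:
        with lzma.open(resp) as xz:
            with tarfile.open(fileobj=xz, mode="r|") as tar:
                tar.extractall(destdir)

urls = [
    "https://example.com/toolchains/clang.tar.xz",  # placeholder
    "https://example.com/toolchains/gcc.tar.xz",    # placeholder
]
with ThreadPoolExecutor(max_workers=len(urls)) as pool:
    list(pool.map(lambda u: fetch_and_extract(u, "/builds/worker"), urls))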
(In reply to Gregory Szorc [:gps] from comment #1)
> * Subversion operations against llvm.org (glandium wants to "cache" source
> archives in another task's artifact)

I think we've discussed this before, but in general removing dependencies on external resources would be great for reproducibility and reliability. I remember someone (grenade?) saying that our OpenCloudConfig repo had CI that would take URLs mentioned, upload them to a blob store, and then commit the resulting hash to the repo so the file could be fetched from there instead. Something like that preserves nice developer usability (you just put the upstream URL in the config) but also removes the external dependency.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #8)
> (In reply to Gregory Szorc [:gps] from comment #1)
> > * Subversion operations against llvm.org (glandium wants to "cache" source
> > archives in another task's artifact)
> 
> I think we've discussed this before, but in general removing dependencies on
> external resources would be great for reproducibility and reliability. I
> remember someone (grenade?) saying that our OpenCloudConfig repo had CI that
> would take URLs mentioned, upload them to a blob store, and then commit the
> resulting hash to the repo so the file could be fetched from there instead.
> Something like that preserves nice developer usability (you just put the
> upstream URL in the config) but also removes the external dependency.

That's basically tooltool :)

We can also create tasks that securely download these artifacts and make them TaskCluster artifacts. If we e.g. change the revision of Clang we build, we'll store clang.tar.xz (or whatever) as an artifact the first time we schedule things.
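
A minimal sketch of that "fetch once, verify, re-expose as an artifact" idea -- the URL is just an example upstream archive, and the pinned digest is a placeholder that would be committed to the in-tree config:

# Download an upstream archive, verify it against a pinned sha256, and
# only then hand it off to be uploaded as a task artifact.
import hashlib
import urllib.request

UPSTREAM = "https://releases.llvm.org/6.0.0/llvm-6.0.0.src.tar.xz"  # example
PINNED_SHA256 = "<digest committed to the repo>"  # placeholder

def fetch_verified(url, expected_sha256, dest):
    h = hashlib.sha256()
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        for chunk in iter(lambda: resp.read(1 << 20), b""):
            h.update(chunk)
            out.write(chunk)
    if h.hexdigest() != expected_sha256:
        raise RuntimeError("digest mismatch for %s: %s" % (url, h.hexdigest()))
    return dest  # upload this as the task artifact afterwards

fetch_verified(UPSTREAM, PINNED_SHA256, "llvm-6.0.0.src.tar.xz")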
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> gecko-L-toolchain and gecko-L-toolchain-huge were created.

The "L" here is supposed to be a placeholder for {1, 2, 3}. So we actually need 3 variations of each worker type.

Also, we may not use these worker types for just toolchain tasks. We currently run these tasks on gecko-{{level}}-b-linux.

How about the following for the worker names:

  gecko-{{level}}-b-xlarge
  gecko-{{level}}-b-xxlarge

Also, I ran into scopes problems with a try task on these worker types, e.g. https://public-artifacts.taskcluster.net/D0o-KF3iSXacESfcO7i4Tg/0/public/logs/live_backing.log. If we name the workers gecko-{{level}}-*, I /think/ the scopes might magically sort themselves out?
Flags: needinfo?(wcosta)
I created gecko-1-b-xlarge and gecko-1-b-xxlarge. If they are ok, I will create levels 2 and 3 on Monday.
Flags: needinfo?(wcosta)
Thanks for the worker definitions!

I should have said that "linux" belongs in the worker name somewhere, e.g. gecko-{level}-b-linux-xlarge. Sorry about that.

Anyway, tasks ran on these worker types properly! See try pushes at https://treeherder.mozilla.org/#/jobs?repo=try&revision=607cbc9486b19ae97f0264470c0c01a13cced215 and https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a06fbe140a11f722f95e6245dbed316b9601866. So, I think we can proceed with creating workers for other levels.

Our final names should be:

gecko-1-b-linux-xlarge
gecko-1-b-linux-xxlarge
gecko-2-b-linux-xlarge
gecko-2-b-linux-xxlarge
gecko-3-b-linux-xlarge
gecko-3-b-linux-xxlarge
Flags: needinfo?(wcosta)
I created them as gecko-L-b-linux-[x]large because using xx exceeds the maximum length for worker type names.

Also created a PR [1] to add these new worker types to the docker-worker update list.

[1] https://github.com/taskcluster/docker-worker/pull/376
Flags: needinfo?(wcosta)
Comment on attachment 8952509 [details]
Bug 1430878 - Use larger EC2 instances for Clang toolchain tasks;

https://reviewboard.mozilla.org/r/221724/#review227734
Attachment #8952509 - Flags: review?(mh+mozilla) → review+
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0e351b32622
Use larger EC2 instances for Clang toolchain tasks; r=glandium
wcosta: We had a worker misnamed in AWS Provisioner: gecko-3-linux-b-xlarge was created instead of gecko-3-b-linux-large. Jonas created a new worker as a copy last night. While he said he would clean up the old worker today, I figured I'd needinfo you in case you want to perform any additional auditing.
Flags: needinfo?(wcosta)
https://hg.mozilla.org/mozilla-central/rev/f0e351b32622
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
(In reply to Gregory Szorc [:gps] from comment #19)
> wcosta: We had a worker misnamed in AWS Provisioner: gecko-3-linux-b-xlarge
> was created instead of gecko-3-b-linux-large. Jonas created a new worker as
> a copy last night. While he said he would clean up the old worker today, I
> figured I'd needinfo you in case you want to perform any additional auditing.

It feels like everything is good, thanks for the heads up
Flags: needinfo?(wcosta)
Product: TaskCluster → Firefox Build System
Assignee: nobody → gps