In bug 1290282, I was investigating some oddities where c4.8xlarge instances barely had any performance wins over c3.4xlarge instances on TC Firefox build tasks despite having 2x the number of CPU cores. I knew something was wrong after garndt posted dstat output while `mach build` was doing some heavy compilation: ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 10 5 83 3 0 0| 705k 16M| 0 0 | 0 0 | 18k 43k 39 15 42 4 0 0| 0 25M| 42k 44k| 0 0 | 52k 106k 36 8 29 28 0 0|4096B 21M| 35k 34k| 0 0 | 38k 68k 47 9 43 1 0 0| 0 4728k| 54k 56k| 0 0 | 48k 96k 45 14 41 0 0 0| 0 28k| 25k 25k| 0 0 | 46k 93k 42 14 44 0 0 0| 0 20k| 74k 74k| 0 0 | 53k 104k 34 20 46 0 0 0| 0 36k| 77k 76k| 0 0 | 57k 111k 35 19 45 0 0 0| 0 1376k| 53k 52k| 0 0 | 53k 104k 37 20 43 0 0 0| 0 9456k| 79k 82k| 0 0 | 54k 104k 28 27 46 0 0 0| 0 36k| 82k 81k| 0 0 | 54k 105k 31 24 45 0 0 0| 0 8192B| 41k 41k| 0 0 | 51k 99k 40 18 42 0 0 0| 0 28k| 44k 43k| 0 0 | 50k 97k 36 21 42 0 0 0| 0 1444k| 135k 139k| 0 0 | 45k 88k 33 21 46 1 0 0| 0 21M| 105k 100k| 0 0 | 47k 92k 27 26 47 0 0 0| 0 2020k| 129k 132k| 0 0 | 52k 103k 38 19 43 0 0 0| 0 32k| 90k 92k| 0 0 | 50k 96k Note the very high sys (kernel) CPU usage and number of context switches. For comparison, here's what I got on a non-TC c4.8xlarge instance doing the same operation: ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 93 6 1 0 0 0| 0 41M| 660B 832B| 0 0 |9801 4182 92 7 1 0 0 0| 0 7320k|2016B 13k| 0 0 |9596 7047 93 7 1 0 0 0| 0 8888k|2310B 10k| 0 0 |9567 4825 91 8 1 0 0 0| 0 9248k|2574B 11k| 0 0 |9798 6004 93 7 1 0 0 0| 0 7464k| 924B 3044B| 0 0 |9530 4706 94 6 0 0 0 0| 0 102M| 198B 730B| 0 0 | 10k 4105 94 5 1 0 0 0| 0 105M|1386B 8132B| 0 0 | 12k 7619 92 7 1 0 0 0| 0 16M|1716B 6180B| 0 0 |9710 8397 93 6 0 0 0 0| 0 4096k| 396B 1220B| 0 0 |9570 5332 The sys numbers are still a bit high for my liking. But context switches are *way* down, dare I say reasonable. The TC team stood up a docker worker and gave me SSH access to investigate. I initially thought OS and kernel differences might be to blame. My instance was running Ubuntu 16.04 (Linux 4.4). TC was running Ubuntu 14.04 (Linux 3.13) and had a number of outdated system packages. Upgrading system packages (including the kernel) and rebooting appeared to make things a bit faster. Although I would have to measure again to confirm. I built Firefox from source outside of Docker and both my instance and TC instance had remarkably similar times: 5:04 wall vs 5:03 wall 143:28 user vs 143:08 user 9:18 sys vs 9:29 sys Inside Docker was a different story. Building from a host-mounted ext4 volume was slightly slower: 5:42 wall 136:49 user 13:54 sys But building from a Docker aufs volume was much slower: 8:38 wall 132:19 user 75:03 sys dstat output during parts of the build was horrendous: ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system-- usr sys idl wai hiq siq| read writ| recv send| in out | int csw 9 82 9 0 0 0| 0 72k|3300B 15k| 0 0 | 20k 27k 7 83 11 0 0 0| 0 64k|3366B 12k| 0 0 | 19k 29k 3 88 9 0 0 0| 0 68k|1188B 5074B| 0 0 | 20k 29k 5 86 9 0 0 0| 0 64k| 66B 358B| 0 0 | 25k 25k 12 78 10 0 0 0| 0 576k|2574B 10k| 0 0 | 18k 27k 6 80 14 0 0 0| 0 30M| 990B 3684B| 0 0 | 19k 28k 4 91 4 0 0 0| 0 64k| 198B 722B| 0 0 | 21k 22k 5 93 1 0 0 0| 0 64k| 132B 874B| 0 0 | 22k 21k 9 85 6 0 0 0| 0 92k|2310B 8290B| 0 0 | 19k 26k 4 87 9 0 0 0| 0 1316k| 198B 866B| 0 0 | 18k 25k 3 95 3 0 0 0| 0 18M| 198B 714B| 0 0 | 21k 21k 7 90 3 0 0 0| 0 64k| 990B 4310B| 0 0 | 17k 22k 95% CPU time in the kernel!!! So it appears aufs under high load is a steaming pile. This shouldn't come as a surprise - aufs is an abandoned project and overlayfs is the equivalent technology that's part of the Linux kernel. So where does this leave us? Well, TaskCluster is using aufs heavily in production. My understanding is that unless a cache is declared (caches use volumes mounted from the host and are backed by ext4), your I/O inside a TC task will be using aufs. And if aufs has performance problems, well, TC tasks will be slow. The good news is a *lot* of Firefox automation uses caches (read: no aufs), so the negative impact of aufs is mitigated a bit. However, caches are stripped on most Try tasks because of concerns over cache poisoning. So e.g. a TC Firefox build will use ext4 on mozilla-central but aufs on Try. That means Try tasks will be slower than non-Try tasks. In theory, this applies to non-build tasks as well. A potential silver lining is that the aufs performance degradation might only be for large CPU counts driving high concurrent I/O loads. In bug 1290282, c3.4xlarge instances were faster than c3.2xlarge instances (not linear as expected but still pretty good) whereas c3.8xlarge hardly had any advantage over c3.4xlarge despite double the cores. I'll have to run some tests on less beefy instances to see if aufs overhead is measurable on e.g. 1, 2, and 4 core instances, which make up most of the instances in TC. If it is, then this performance problem could result in thousands of hours of wasted machine time every day. Per discussion in #taskcluster, jonasfj and garndt have ideas for addressing this at the worker level. I'll let them chime in.
We currently build on c3.2xlarge instances on an Ubuntu 14.04 host. I decided to compare the performance of a build in a c4.2xlarge (8 vCPUs) instance from at the host level and inside a Docker aufs volume to quantify the impact aufs has on TC tasks today. Host build (ext3): 19:30 wall 138:54 user 7:23 sys Docker build (aufs): 20:15 wall 138:50 user 9:33 sys There appears to be slight overhead to aufs on a c4.2xlarge (8 vCPUs). Although this was a sampling size of 1, so results could be within variance. FWIW, I did witness rather high sys CPU time during the "export" tier and during parts of the "compile" tier. But it wasn't as high as it is in the c4.8xlarge. (I suspect "export" tier is bad because that tier creates thousands of files and symlinks and is effectively a stress test for I/O writing.) On a c4.4xlarge (16 vCPUs): Host build (ext3): 10:59 wall 141:54 user 7:28 sys Docker build (aufs): 11:47 wall 140:32 user 15:00 sys sys time is higher with aufs. dstat output also shows sys CPU time was higher throughout this build compared to the c4.2xlarge. Still not as bad as c4.8xlarge. But noticeably worse than c4.2xlarge. The good news is aufs impact on c4.2xlarge instances appears minimal (the 45s gap between native and aufs remained intact - although not sure how that happened with ~5 minutes more kernel time). But the impact isn't horrendous like with c4.8xlarge instances. Since we don't run many instances with more than 8 cores in TC (do we run any at all?), this hopefully means we don't have major, widespread performance issues on TC due to aufs. The bad news is that until we address this underlying problem with aufs, running high vCPU instances in TC will have diminishing returns for many workloads. We'll get *some* wins. But they won't be as large as expected because aufs will eat the extra CPU. Put another way, running high vCPU instances is, to some extent, throwing money away. c4.4xlarge has an arguably tolerable amount of aufs overhead. But we should definitely avoid c4.8xlarge instances for the time being, as aufs could eat a lot of their extra CPU.
In #taskcluster yesterday, I stated that the excessive kernel time appears to have gone away in Ubuntu 16.04. I was wrong. Kernel overhead on a c4.8xlarge on Ubuntu 16.04 is *worse* than 14.04! Ubuntu 14.04 aufs (from comment #0): 8:38 wall 132:19 user 75:03 sys Ubuntu 16.04 aufs: 20:53 wall 103:03 user 461:28 sys !!! Ubuntu 16.04 overlayfs: 5:49 wall 135:32 user 9:39 sys Building outside of Docker yielded the same time as within Docker on overlayfs. That's an encouraging sign for overlayfs.
Did you use the overlay or overlay2 docker storage driver?
overlay with docker 1.10 (the Docker that ships with Ubuntu 16.04).
There has been some talk of mitigating the impact of aufs by using aggressively using Docker volumes (which are essentially bind mounted from the host filesystem). This is a good idea in theory. So let's put that to the test. On the same c4.8xlarge Ubuntu 16.04 instance used in comment #2, building Firefox with a source directory and object directory volume bind mounted from the host but still using aufs for the Docker container: Ubuntu 16.04 host volume + aufs container: 10:09 wall 111:27 user 117:41 sys As you can see, we're somewhere between native/overlayfs and full aufs despite avoiding aufs for the Firefox source and output. What gives? My theory is that the *reads* to the aufs container filesystem by processes like python and gcc loading standard library modules and headers is triggering aufs overhead. So even though we do hardly any write I/O to aufs, aufs is still slowing us down by interfering with the read path. FWIW, dstat reported little to no disk reads during the build. This is presumably because everything was in the page cache. But apparently the page cache isn't good enough for aufs: it still interferes with reading. It's looking more and more like avoiding aufs entirely is the only path forward.
I built with overlay2 and it has nearly identical performance as overlay.
We have what looks like to be an intermittent failure related to aufs in bug 1295474. garndt: given the performance and stability implications of running aufs, what is the timeline for not using aufs?
Also note that bug 1256004 was also likely due to aufs, although we were able to work around it.
I don't have a timeline really, but i can say I can start looking into this soon with others. I think making the changes is easy enough, it's then validating that everything continues running as expected that might take a little time.
FWIW, I tried switching my local dev environment to overlay2 and I'm running into what looks like bugs due to the filesystem not behaving like a filesystem should. So it looks like that storage driver is still buggy :/ I've used devicemapper with some success. However, it is *really* slow at creating new image layers compared to aufs and overlayfs, which is a big deal for my local development environment which reminds tons of Docker images frequently. However, I don't think this is a major issue for TC because it treats Docker images as tarballs and doesn't do any of the fancy layering foo. The matrix at https://docs.docker.com/engine/userguide/storagedriver/selectadriver/ also seems to indicate devicemapper might be the least crappy storage driver.
that's really good to know. Is it worth it to try overlay2 on the TC instances or is it just way too far gone to be of use?
Everything coming out of Docker seems to indicate overlayfs/overlay2 is the future. But if there are bugs in the filesystem, I'm inclined to write it off. I should clarify my comment about layer creation being slow. It's like 1-3s instead of say 100ms. This likely won't matter unless you are building or pulling images with many layers.
Another option is btrfs.
I upgraded to Docker 1.12 and haven't yet reproduced the overlayfs bug I was seeing before. It's possible I wasn't actually running overlay2 on the machine encountering the issues referenced in comment #10. Regarding btrfs, I'll have to give it another try. I think its layer generation is also a bit slow. But, again, we shouldn't care about this so much.
Ahh - the issue I was having with overlay was related to hard links (some processes were confused by the presence of hard links where hard links should not have been present). The docs at https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ indicate that "overlay" relied heavily on hard links and "overlay2" does not. FWIW, the bottom of that page contains some nuggets about overlayfs, notably that open(2) doesn't conform to POSIX standards and rename(2) isn't supported either. Good times. A "real" filesystem like btrfs keeps sounding better and better.
UNIX domain sockets don't work with the overlay filesystem: https://github.com/docker/docker/issues/12080
garndt: I keep seeing people bit by AUFS bugs or performance issues. The problems aren't obvious because AUFS "just works" most of the time. I suspect the use of AUFS is a "silent killer" of overall performance on TaskCluster, especially on large-core instances (read: we're wasting money due to sub-optimal performance). What's the story on transitioning off AUFS? Depending on the timeline, it may justify someone doing work to have tasks alert when they are doing obviously sub-optimal things on AUFS like not having the workspace be a cache. But that won't save us: frequently accessed files baked into the Docker image will still kill perf because they are hosted on AUFS. And Try tasks can't use caches easily, meaning they always get AUFS and are therefore slower.
Sorry for the delayed response, I just returned from PTO and will be looking at this once I get caught up. I will have to see where we are with other priority projects on our plate. Last time I looked at this, I think I hit a rabbit hole of issues when trying to switch the fs drivers, but we can investigate with a new set of eyes. I'm going to leave the ni? at least for a couple of days to keep it on my radar.
Switching ni? to wander, he will start looking into this.
Flags: needinfo?(garndt) → needinfo?(wcosta)
Assignee: garndt → wcosta
Status: NEW → ASSIGNED
While I was hacking on bug 1415725, I cranked out a patch to switch docker-worker to overlay2. PR at https://github.com/taskcluster/docker-worker/pull/337.
Commit pushed to master at https://github.com/taskcluster/docker-worker https://github.com/taskcluster/docker-worker/commit/feb14875f6dbee6cdbca619a280f4b56deeb398b Upgrade to kernel 4.4 and use overlay2 storage driver AUFS is the source of numerous performance problems and filesystem consistency issues. Bug 1291940 has plenty of details. The tl;dr is AUFS takes out a global kernel lock for pretty much all I/O operations and concurrent I/O grinds to a halt as you add more concurrency. Firefox builds actually get slower as you use more cores. In addition, there are consistency issues with AUFS that result in files randomly not being evailable after write. In this commit, I bump the kernel version from 3.13 to 4.4 (so overlayfs is usable), add support for the overlay2 storage driver (added in Docker 1.12 - which we conveniently use), and switch the default storage driver to overlay2. The version of v4l2loopback was also bumped to the latest in order to pull in fixes so it compiles with Linux 4.4.
wcosta just pushed this out ~2 hours ago. So all new worker instances should be using the new AMIs with Linux 4.4 and overlay2 storage driver. I'm optimistic we'll see perfherder alerts as a result of this change. Could take a while for existing workers running AUFS to self-terminate, however.
Status: ASSIGNED → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → FIXED
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1445527,1,2&series=mozilla-inbound,1582370,1,2&series=mozilla-inbound,1446792,1,2 is showing a ~4 minute Linux64 opt build time drop for c4.4xlarge today. This will almost certainly generate a Perfherder alert once enough data comes in.
I rolled back AMIs today after a report of docker daemon failing to start in some instances. Connecting to a machine docker.log showed a failure to create logical volumes: ls: cannot access /dev/nvme*n*: No such file or directory umount: /dev/xvdb: not mounted No physical volume label read from /dev/xvdb Can't open /dev/xvdb exclusively. Mounted filesystem? Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'. No physical volume label read from /dev/xvdb Can't open /dev/xvdb exclusively. Mounted filesystem? Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'. No volume groups found Creating logical volume 'instance_storage' ls: cannot access /dev/nvme*n*: No such file or directory umount: /dev/xvdb: not mounted Tracking down the issue, the root cause was vgcreate trying to open /dev/xvdb exclusively and failing: open /dev/xvdb exclusively open("/dev/xvdb", O_RDWR|O_EXCL|O_DIRECT|O_NOATIME) = -1 EBUSY (Device or resource busy)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Today a made a new rolled out the images after some tests yesterday. The gecko-t-linux-xlarge it is only work type that showed the problem so far.
* Today I rolled out
Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due to using an older, buggy AMI. The new AMIs appear to "just work." wcosta: can we resolve this bug now?
(In reply to Gregory Szorc [:gps] from comment #27) > Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due > to using an older, buggy AMI. The new AMIs appear to "just work." > > wcosta: can we resolve this bug now? I found one case with newest AMI where the error happened. I connected to the machine, and after I rebooted it, the error was gone. It feels like there is a very subtle race condition on docker daemon initialization that produces the wrong behavior once in a while. Not sure if we should close this bug and investigate in a follow-up one, as it seems not to affect production. The only downside is Papertrail backlog.
If the incidence rate is low, let's get a new bug on file. FWIW, I think the way we create the volume (as part of docker's initialization routine) is kinda wonky. IMO this should be a one-time operation performed as part of launching the EC2 instance. But we can discuss this in another bug.
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago → 2 years ago
Resolution: --- → FIXED
The problem seems to get a lot more frequent once I added c5 instances.
Rolled back AMIs, as we won't take advantage of c5 instances and there still occasional bustage with other instances. Didn't reverted the patches, however.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
New AMIs got rolled out today. Workers are still recycling. But it looks like things will stick this time.
Status: REOPENED → RESOLVED
Last Resolved: 2 years ago → 2 years ago
Resolution: --- → FIXED
Component: Docker-Worker → Workers
Product: Taskcluster → Taskcluster
You need to log in before you can comment on or make changes to this bug.