Bug 1384029 (Open): Opened 8 years ago, Updated 2 years ago

Something is wrong with automated build times

Categories

(Firefox Build System :: General, defect)

Tracking

(firefox57 wontfix)


People

(Reporter: glandium, Unassigned)

References

(Blocks 1 open bug)

Details

I'm filing this under core build config because it is likely to have the best audience for this bug, although it might end up in ops or something.

Anyway, this all started when I was looking at build times for rust code. To avoid C++ compilation having an influence, as well as other rust compilations, I made everything -j1 (in make *and* cargo) and disabled sccache. The surprising result was that, while I was expecting a build time somewhere between 4 and 5 hours, based on the build times without sccache and the number of concurrent processes (16 on the c4.4xlarge instances we use), the actual build time was around... 2 hours and a half. Yeah, that's almost twice as fast as I was expecting.

So I did a bunch of try pushes of the same changeset with different values of -j, also using rust 1.20, which has support for the make jobserver. I then looked at the time between the first compilation that isn't conftest.c and isn't related to nsinstall, and the last invocation of the compiler, which is an approximation of the time spent in the compile tier. I also looked at the cumulative time spent running compilers, and specifically the rust compiler.

The raw results look like this (I did 5 builds for each value of -j, except -j1; all times are in seconds):

       compile tier   rust + C++ (cumulative)   rust (cumulative)
-j1    8227.03        8052.4                    1199.12
-j2    4795.49        8213.71                   1206.12
       4746.43        8119.95                   1203.9
       4719.81        8085.48                   1196.87
       4743.63        8126.83                   1192.4
       4748.13        8134.85                   1203.6
-j4    2364.73        8344.84                   1239.2
       2354.11        8304.01                   1223.67
       2349.38        8297.78                   1225.87
       2351.07        8278.86                   1221.82
       2362.16        8288.92                   1228.59
-j6    1630.49        8438.88                   1246.57
       1590.31        8376.64                   1246.09
       1592.43        8377.73                   1238.07
       1609.46        8373.17                   1242.59
       1595.93        8427.91                   1244.53
-j8    1245.43        8718.88                   1290.91
       1239.96        8624.64                   1264.66
       1251.5         8738.63                   1282.37
       1236.5         8599.81                   1269.5
       1236.9         8647.37                   1272.79
-j10   1197.35        9772.46                   1451.06
       1240.19        9660.43                   1458.26
       1204.86        9676.35                   1408.74
       1225.08        9685.52                   1441.98
       1201.43        9738.65                   1419.01
-j12   1301.48        11057.9                   1529.92
       1313.56        11241.1                   1555.68
       1263.15        11068                     1525.22
       1265.28        11027.4                   1538.98
       1273.23        11108.6                   1533.25
-j14   1352.7         12265.4                   1629.85
       1384.47        12305.1                   1603.2
       1429.01        12410.1                   1622.43
       1369.63        12466                     1656.87
       1317.07        12392.8                   1668.83
-j16   1417.22        13508.1                   1713.36
       1445.59        13515.7                   1690.81
       1460.44        13502                     1677.32
       1477.34        13487.4                   1694.41
       1435.85        13515.9                   1690.55

The interesting result here is that the -j10 builds are the fastest. Another interesting observation: looking at the moment where the build ends up waiting only for rust to finish (I only did that manually, so I don't have numbers for all builds), the -j16 builds *do* start waiting on rust *before* the -j10 builds. Which means the time spent waiting on rust alone is longer on the -j16 builds than on the -j10 builds.

Now, looking at compile tier times on 100% cache hit sccache builds:

-j1    849.865
-j2    597.698  569.392  565.827  568.894  570.469
-j4    313.802  273.125  291.405  294.441  292.237
-j6    205.728  195.197  196.91   191.198  193.427
-j8    165.014  161.326  161.011  159.317  159.558
-j10   142.863  140.097  142.925  142.204  151.374
-j12   138.346  134.539  143.829  135.732  136.463
-j14   134.316  131.747  132.072  131.801  129.853
-j16   135.905  136.372  139.103  134.587  133.024

As it stands, the above says that -j10 is the sweet spot and would make builds faster overall. Note this is all on try, where the I/O sucks, and this /could/ all be related to that.
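For reference, here is a minimal mozconfig sketch for pinning the job count and leaving sccache out of the picture. This is illustrative only; the exact options used for these pushes may have differed, and whether cargo honors CARGO_BUILD_JOBS depends on the toolchain in the tree.

# mozconfig sketch (illustrative, not the exact config used for these pushes)
# Cap make's parallelism at a given -j value:
mk_add_options MOZ_MAKE_FLAGS="-j10"
# Cap cargo's parallelism as well; CARGO_BUILD_JOBS is cargo's standard knob:
mk_add_options "export CARGO_BUILD_JOBS=10"
# sccache is simply left unconfigured, so compilations are not wrapped by it.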
(BTW, those are all cross-osx builds)
I suspect bug 1291940 is playing a part. AUFS has a global lock on *reads*. So any open on a file baked into the Docker image obtains a lock. Files like C/C++ headers are accessed a *ton*. Past some point of concurrency, lock contention starts adding non-trivial overhead and makes builds slower despite having more physical cores and CPU capacity present! I'd encourage you to obtain your own EC2 instance w/o TC and reproduce results. If you can blame this on AUFS, it is more ammo for fixing bug 1291940. For reference, our current Firefox builders use a 120gb gp2 EBS volume. If you provision one of that size, it should have the same IOPS as a TC worker.
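For anyone wanting to reproduce this outside of TC, a rough sketch of the checks and setup involved; the availability zone below is illustrative, not prescriptive:

# Confirm which storage driver the worker's Docker daemon is using (AUFS is the suspect):
docker info | grep 'Storage Driver'
# Provision a 120GB gp2 EBS volume so the IOPS budget matches a TC worker
# (gp2 baseline IOPS scale with volume size):
aws ec2 create-volume --size 120 --volume-type gp2 --availability-zone us-east-1a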
Blocks: fastci
To give some perspective on how many I/O system calls are needed to build Firefox, I performed an `strace -f -e trace=file -c ./mach build` against a fully populated sccache. Essentially, I was asking for a count of all file-related system calls in the `mach` process tree. Here is the result:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.02    0.171470           0   4417368   3797168 open
 34.69    0.123857           0   4799144   1201886 stat
 13.44    0.048000         134       359           rename
  1.49    0.005317           0     12882      9478 mkdir
  1.21    0.004326           0     55616      8328 lstat
  0.70    0.002501           0    122182    112351 access
  0.34    0.001200           0     10439           symlink
  0.05    0.000169           0       730           utimes
  0.03    0.000118           0     18654      6110 readlink
  0.02    0.000075           0     26422           getcwd
  0.02    0.000054           0      3998      2546 unlink
  0.00    0.000000           0     62824     44658 execve
  0.00    0.000000           0      3241           chdir
  0.00    0.000000           0        70           rmdir
  0.00    0.000000           0        14           link
  0.00    0.000000           0       337           chmod
  0.00    0.000000           0        12           chown
  0.00    0.000000           0        79           utime
  0.00    0.000000           0      3118      3113 statfs
  0.00    0.000000           0       235           openat
  0.00    0.000000           0        24           mkdirat
  0.00    0.000000           0      1265       814 newfstatat
  0.00    0.000000           0      1087       814 unlinkat
  0.00    0.000000           0        37           fchmodat
  0.00    0.000000           0        26         5 faccessat
  0.00    0.000000           0       743           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.357087               9540906   5187271 total

~9.5M system calls is arguably significant. But that's just the `mach` process tree. Here's what the sccache daemon process tree contributes (again, with a fully populated local disk cache):

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.96    0.476053           0  23465064    823728 stat
 31.69    0.251570           0   6911030   5599710 open
  4.44    0.035251           0   1159010       280 readlink
  1.45    0.011537           3      3465           rename
  1.38    0.010970           3      3292           utimes
  0.56    0.004427           0    130767    110043 access
  0.26    0.002076           1      3596           chdir
  0.12    0.000922           0      3363           chmod
  0.11    0.000875           5       188           unlink
  0.03    0.000236           1       282       102 mkdir
  0.00    0.000029           0     18678           lstat
  0.00    0.000014           0      4300           getcwd
  0.00    0.000000           0      7192           execve
  0.00    0.000000           0       180           rmdir
------ ----------- ----------- --------- --------- ----------------
100.00    0.793960              31710407   6533863 total

~32M system calls from sccache! While there is likely a minefield of clownshoes resulting in excessive I/O system calls, that normally isn't a major problem. Even the strace timings show the ~23M stat calls for sccache taking only ~0.5s! But if AUFS is introducing locks around these operations and critical I/O paths slow down by just 100ns, at the volume of tens of millions of I/O system calls, we're talking seconds added to build times. I'm not saying these numbers are indicative of reality - I just wanted to put the scale of the issue into perspective. FWIW, I remember my results corroborating https://sthbrx.github.io/blog/2015/10/30/docker-just-stop-using-aufs/, where builds on AUFS spend a lot of time with kernel/aufs lock-related frames on the stack. And it gets worse as you increase concurrency.
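For completeness, this is roughly how the two traces above can be gathered; the pid lookup for the sccache server is illustrative:

# Count file-related syscalls for the mach process tree (fully populated cache):
strace -f -e trace=file -c ./mach build
# Attach to the already-running sccache server and count its process tree separately:
strace -f -e trace=file -c -p "$(pgrep -o sccache)"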
I just did a build with -j16 in Docker with overlay on a c4.4xlarge instance with 120GB of gp2, and got timings consistent with what I got on try (compile tier / cumulative rust + C++ / cumulative rust, in seconds): 1462.18 13531.1 1749.23
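For reference, a sketch of how to get an overlay-backed build like this one instead of AUFS; the daemon flag and image name are illustrative:

# Run the Docker daemon with the overlay2 (or overlay) storage driver:
dockerd --storage-driver=overlay2 &
# Then run the build inside the usual builder image, e.g.:
docker run -ti --rm <builder-image> ./mach build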
(also, those were totally skipping sccache, even the ones on try)
Product: Core → Firefox Build System
Severity: normal → S3