Bug 1384029: Something is wrong with automated build times
Status: NEW (open) • Opened 8 years ago • Updated 2 years ago
Component: Firefox Build System :: General (defect)
Tracking: firefox57 wontfix
Reporter: glandium • Assignee: Unassigned
References: Blocks 1 open bug
I'm filing this under core build config because it's likely to have the best audience for this bug, although it might end up in ops or something.
Anyway, this all started when I was looking at build times for Rust code. To keep C++ compilation, as well as other concurrent Rust compilations, from having an influence, I made everything -j1 (in make *and* cargo) and disabled sccache.
The surprising result was that, while I was expecting a build time somewhere between 4 and 5 hours, considering the build times without sccache and the number of concurrent processes (16 on the c4.4xlarge instances we use), the actual build time was around... two and a half hours. Yeah, that's almost twice as fast as I was expecting.
So I did a bunch of try pushes of the same changeset with different values of -j, using Rust 1.20, which supports the make jobserver. I then measured the time between the first compilation that isn't conftest.c or related to nsinstall and the last invocation of the compiler, which approximates the time spent in the compile tier. I also looked at the cumulative time spent running compilers, and specifically the Rust compiler.
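For reference, here's a rough sketch (not the exact tooling behind the numbers below) of how that compile-tier approximation can be computed. It assumes a build log where each line starts with an epoch timestamp and where compiler invocations mention clang/clang++/rustc; the log format and regexes are assumptions for illustration:

import re
import sys

COMPILER = re.compile(r'\b(clang\+\+|clang|rustc)\b')
IGNORE = re.compile(r'conftest\.c|nsinstall')

def compile_tier_seconds(log_path):
    # Seconds between the first and last compiler invocation, skipping
    # conftest.c and nsinstall-related lines.
    first = last = None
    with open(log_path) as fh:
        for line in fh:
            ts, _, rest = line.partition(' ')
            if IGNORE.search(rest) or not COMPILER.search(rest):
                continue
            try:
                t = float(ts)
            except ValueError:
                continue
            if first is None:
                first = t
            last = t
    return None if first is None else last - first

if __name__ == '__main__':
    print(compile_tier_seconds(sys.argv[1]))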
The raw results look like this (I did 5 builds for each -j value, except only one for -j1):
(columns are in seconds; order is compile tier, cumulative Rust + C++, cumulative Rust)
-j1
8227.03 8052.4 1199.12
-j2
4795.49 8213.71 1206.12
4746.43 8119.95 1203.9
4719.81 8085.48 1196.87
4743.63 8126.83 1192.4
4748.13 8134.85 1203.6
-j4
2364.73 8344.84 1239.2
2354.11 8304.01 1223.67
2349.38 8297.78 1225.87
2351.07 8278.86 1221.82
2362.16 8288.92 1228.59
-j6
1630.49 8438.88 1246.57
1590.31 8376.64 1246.09
1592.43 8377.73 1238.07
1609.46 8373.17 1242.59
1595.93 8427.91 1244.53
-j8
1245.43 8718.88 1290.91
1239.96 8624.64 1264.66
1251.5 8738.63 1282.37
1236.5 8599.81 1269.5
1236.9 8647.37 1272.79
-j10
1197.35 9772.46 1451.06
1240.19 9660.43 1458.26
1204.86 9676.35 1408.74
1225.08 9685.52 1441.98
1201.43 9738.65 1419.01
-j12
1301.48 11057.9 1529.92
1313.56 11241.1 1555.68
1263.15 11068 1525.22
1265.28 11027.4 1538.98
1273.23 11108.6 1533.25
-j14
1352.7 12265.4 1629.85
1384.47 12305.1 1603.2
1429.01 12410.1 1622.43
1369.63 12466 1656.87
1317.07 12392.8 1668.83
-j16
1417.22 13508.1 1713.36
1445.59 13515.7 1690.81
1460.44 13502 1677.32
1477.34 13487.4 1694.41
1435.85 13515.9 1690.55
The interesting result here is that the -j10 builds are the fastest.
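As a quick sanity check, averaging the compile-tier samples above for -j10 and -j16 (numbers copied from the table; the Python below is just doing the arithmetic):

from statistics import mean

tier_seconds = {
    10: [1197.35, 1240.19, 1204.86, 1225.08, 1201.43],
    16: [1417.22, 1445.59, 1460.44, 1477.34, 1435.85],
}
for j, samples in tier_seconds.items():
    print(f"-j{j}: mean {mean(samples):.1f}s")

That gives ~1213.8s for -j10 vs. ~1447.3s for -j16, i.e. the -j16 compile tier is about 19% slower on average.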
Another interesting observation: when looking at the moment where the build starts waiting only for Rust to finish (I only checked that manually, so I don't have numbers for all builds), the -j16 builds *do* start waiting on Rust *before* the -j10 builds do. That means the time spent waiting on Rust alone is longer for the -j16 builds than for the -j10 builds.
Now, looking at compile tier times on 100% cache hit sccache builds:
-j1
849.865
-j2
597.698
569.392
565.827
568.894
570.469
-j4
313.802
273.125
291.405
294.441
292.237
-j6
205.728
195.197
196.91
191.198
193.427
-j8
165.014
161.326
161.011
159.317
159.558
-j10
142.863
140.097
142.925
142.204
151.374
-j12
138.346
134.539
143.829
135.732
136.463
-j14
134.316
131.747
132.072
131.801
129.853
-j16
135.905
136.372
139.103
134.587
133.024
At the moment, the above suggests that -j10 is the sweet spot and would make builds faster overall.
Note this is all on try, where the I/O sucks, and this /could/ all be related to that.
Comment 1 • glandium (Reporter) • 8 years ago
(BTW, those are all cross-compiled OS X builds)
Comment 2 • 8 years ago
I suspect bug 1291940 is playing a part.
AUFS has a global lock on *reads*. So any open on a file baked into the Docker image obtains a lock. Files like C/C++ headers are accessed a *ton*. Past some point of concurrency, lock contention starts adding non-trivial overhead and makes builds slower despite having more physical cores and CPU capacity present!
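To make that contention hypothesis easy to test, here is a minimal micro-benchmark sketch (my own, not from this bug; the header path and iteration count are arbitrary) that times concurrent open()/close() of a file from the image layer. Under AUFS the per-open cost should grow noticeably with the process count; under overlayfs it should stay roughly flat:

import os
import time
from concurrent.futures import ProcessPoolExecutor

PATH = "/usr/include/stdio.h"   # any file baked into the lower (image) layer
ITERATIONS = 50_000

def worker(_):
    start = time.perf_counter()
    for _ in range(ITERATIONS):
        os.close(os.open(PATH, os.O_RDONLY))  # open + close, nothing else
    return time.perf_counter() - start

if __name__ == "__main__":
    for nproc in (1, 4, 8, 16):
        with ProcessPoolExecutor(nproc) as pool:
            slowest = max(pool.map(worker, range(nproc)))
        print(f"{nproc:2d} procs: {slowest / ITERATIONS * 1e6:.2f} us per open/close (slowest worker)")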
I'd encourage you to obtain your own EC2 instance without TC and reproduce the results. If you can blame this on AUFS, it is more ammo for fixing bug 1291940. For reference, our current Firefox builders use a 120 GB gp2 EBS volume. If you provision one of that size, it should have the same IOPS as a TC worker.
Comment 3 • 8 years ago
To give some perspective on how many I/O system calls are needed to build Firefox, I performed an `strace -f -e trace=file -c ./mach build` against a fully populated sccache. Essentially, I was asking for a count of all file-related system calls in the `mach` process tree. Here is the result:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
48.02 0.171470 0 4417368 3797168 open
34.69 0.123857 0 4799144 1201886 stat
13.44 0.048000 134 359 rename
1.49 0.005317 0 12882 9478 mkdir
1.21 0.004326 0 55616 8328 lstat
0.70 0.002501 0 122182 112351 access
0.34 0.001200 0 10439 symlink
0.05 0.000169 0 730 utimes
0.03 0.000118 0 18654 6110 readlink
0.02 0.000075 0 26422 getcwd
0.02 0.000054 0 3998 2546 unlink
0.00 0.000000 0 62824 44658 execve
0.00 0.000000 0 3241 chdir
0.00 0.000000 0 70 rmdir
0.00 0.000000 0 14 link
0.00 0.000000 0 337 chmod
0.00 0.000000 0 12 chown
0.00 0.000000 0 79 utime
0.00 0.000000 0 3118 3113 statfs
0.00 0.000000 0 235 openat
0.00 0.000000 0 24 mkdirat
0.00 0.000000 0 1265 814 newfstatat
0.00 0.000000 0 1087 814 unlinkat
0.00 0.000000 0 37 fchmodat
0.00 0.000000 0 26 5 faccessat
0.00 0.000000 0 743 utimensat
------ ----------- ----------- --------- --------- ----------------
100.00 0.357087 9540906 5187271 total
~9.5M system calls is arguably significant. But that's just for the `mach` process tree.
Here's what the sccache daemon process tree contributes (again - a fully populated local disk cache):
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
59.96 0.476053 0 23465064 823728 stat
31.69 0.251570 0 6911030 5599710 open
4.44 0.035251 0 1159010 280 readlink
1.45 0.011537 3 3465 rename
1.38 0.010970 3 3292 utimes
0.56 0.004427 0 130767 110043 access
0.26 0.002076 1 3596 chdir
0.12 0.000922 0 3363 chmod
0.11 0.000875 5 188 unlink
0.03 0.000236 1 282 102 mkdir
0.00 0.000029 0 18678 lstat
0.00 0.000014 0 4300 getcwd
0.00 0.000000 0 7192 execve
0.00 0.000000 0 180 rmdir
------ ----------- ----------- --------- --------- ----------------
100.00 0.793960 31710407 6533863 total
~32M system calls from sccache!
While there is likely a minefield of clownshoes resulting in excessive I/O system calls, it normally isn't a major problem. Even the strace timings show the ~23M stat calls for sccache taking only ~0.5s! But if AUFS is introducing locks around these operations and critical I/O paths slow down by just 100ns, at the volume of tens of millions of I/O system calls, we're talking seconds added to build times. I'm not saying these numbers are indicative of reality - I just wanted to put the scale of the issue into perspective.
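For scale, the arithmetic behind that, using the call totals from the two strace summaries above and a hypothetical fixed per-call penalty:

mach_calls = 9_540_906       # total from the mach process tree summary
sccache_calls = 31_710_407   # total from the sccache daemon summary
for penalty_ns in (100, 500, 1_000):
    added = (mach_calls + sccache_calls) * penalty_ns * 1e-9
    print(f"{penalty_ns:>5} ns/call -> {added:.1f} s added")

A 100ns penalty per call already adds ~4s of cumulative syscall time; 1us per call adds ~41s.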
FWIW, I remember my results corroborating https://sthbrx.github.io/blog/2015/10/30/docker-just-stop-using-aufs/, where builds on AUFS spend a lot of time with kernel/aufs lock-related frames on the stack. And it gets worse as you increase concurrency.
Comment 4 • glandium (Reporter) • 8 years ago
I just did a -j16 build in Docker with the overlay storage driver, on a c4.4xlarge instance with 120 GB of gp2, and got timings consistent with what I got on try:
1462.18 13531.1 1749.23
Comment 5 • glandium (Reporter) • 8 years ago
(also, those were totally skipping sccache, even the ones on try)
Updated • 7 years ago: status-firefox57: --- → wontfix
Updated • 7 years ago: Product: Core → Firefox Build System
Updated • 2 years ago: Severity: normal → S3