Open Bug 1384029 Opened 4 years ago Updated 4 years ago

Something is wrong with automated build times

Categories: Firefox Build System :: General (defect)
Priority: Not set
Severity: normal
Tracking: firefox57 --- wontfix
People: Reporter: glandium; Assignee: Unassigned
References: Blocks 1 open bug

Details

I'm filing this under core build config because it's likely to have the best audience for this bug, although this might end up in ops or something.

Anyway, this all started when I was looking at build times for rust code. To avoid C++ compilation, as well as other rust compilations, having an influence, I made everything -j1 (in make *and* cargo) and disabled sccache.

The surprising result was that, while I was shooting for a build time somewhere between 4 and 5 hours (considering the build times without sccache and the number of concurrent processes, 16 on the c4.4xlarge instances we use), the actual build time was around... two and a half hours. Yeah, that's almost twice as fast as I was expecting.

So I did a bunch of try pushes of the same changeset with different values of -j, also using rust 1.20, which has support for the make jobserver. I then looked at the time between the first compilation that isn't conftest.c or nsinstall-related and the last invocation of the compiler, which approximates the time spent in the compile tier. I also looked at the cumulative time spent running compilers, and specifically the rust compiler.
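
(For reference, the measurement boils down to something like the sketch below; the timestamp format and the compiler/exclusion patterns are illustrative assumptions, not the actual script.)

# Sketch: approximate the compile tier duration from a timestamped build
# log. The "YYYY-MM-DDTHH:MM:SS.mmmZ" line prefix and the compiler and
# exclusion patterns below are assumptions for illustration.
import re
import sys
from datetime import datetime

TS = re.compile(r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z')
COMPILER = re.compile(r'\b(clang\+\+|clang|gcc|g\+\+|rustc)\b')
EXCLUDE = re.compile(r'conftest\.c|nsinstall')

first = last = None
for line in open(sys.argv[1]):
    m = TS.match(line)
    if not m or not COMPILER.search(line):
        continue
    t = datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S.%f')
    # first compiler invocation that isn't conftest.c/nsinstall-related
    if first is None and not EXCLUDE.search(line):
        first = t
    # last compiler invocation of any kind
    last = t

if first and last:
    print('approx. compile tier: %.2fs' % (last - first).total_seconds())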

The raw results look like this (I did 5 builds for each value, except for -j1):

(columns are in seconds; order is compile tier, cumulative rust + C++, cumulative rust)

-j1
8227.03 8052.4 1199.12

-j2
4795.49 8213.71 1206.12
4746.43 8119.95 1203.9
4719.81 8085.48 1196.87
4743.63 8126.83 1192.4
4748.13 8134.85 1203.6

-j4
2364.73 8344.84 1239.2
2354.11 8304.01 1223.67
2349.38 8297.78 1225.87
2351.07 8278.86 1221.82
2362.16 8288.92 1228.59

-j6
1630.49 8438.88 1246.57
1590.31 8376.64 1246.09
1592.43 8377.73 1238.07
1609.46 8373.17 1242.59
1595.93 8427.91 1244.53

-j8
1245.43 8718.88 1290.91
1239.96 8624.64 1264.66
1251.5 8738.63 1282.37
1236.5 8599.81 1269.5
1236.9 8647.37 1272.79

-j10
1197.35 9772.46 1451.06
1240.19 9660.43 1458.26
1204.86 9676.35 1408.74
1225.08 9685.52 1441.98
1201.43 9738.65 1419.01

-j12
1301.48 11057.9 1529.92
1313.56 11241.1 1555.68
1263.15 11068 1525.22
1265.28 11027.4 1538.98
1273.23 11108.6 1533.25

-j14
1352.7 12265.4 1629.85
1384.47 12305.1 1603.2
1429.01 12410.1 1622.43
1369.63 12466 1656.87
1317.07 12392.8 1668.83

-j16
1417.22 13508.1 1713.36
1445.59 13515.7 1690.81
1460.44 13502 1677.32
1477.34 13487.4 1694.41
1435.85 13515.9 1690.55

The interesting result here is that the -j10 builds are the fastest.
Now, an interesting observation: when looking at the moment where the build starts waiting for rust to finish (I only did that manually, so I don't have numbers for all builds), the -j16 builds *do* start waiting on rust *before* the -j10 builds. That means the time spent waiting on rust alone is longer on the -j16 builds than on the -j10 builds.
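
(To make that explicit, here is the compile tier column from the raw results above averaged per -j value, for the -j8 to -j16 builds:)

# Mean compile tier time (seconds) per -j value; the numbers are the
# compile tier column from the raw results above, -j8 through -j16.
data = {
    8:  [1245.43, 1239.96, 1251.5, 1236.5, 1236.9],
    10: [1197.35, 1240.19, 1204.86, 1225.08, 1201.43],
    12: [1301.48, 1313.56, 1263.15, 1265.28, 1273.23],
    14: [1352.7, 1384.47, 1429.01, 1369.63, 1317.07],
    16: [1417.22, 1445.59, 1460.44, 1477.34, 1435.85],
}
for j, times in sorted(data.items()):
    print('-j%-2d mean: %.2fs' % (j, sum(times) / len(times)))
# -j10 averages ~1214s, against ~1242s at -j8 and ~1447s at -j16.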

Now, looking at compile tier times on 100% cache hit sccache builds:

-j1
849.865

-j2
597.698
569.392
565.827
568.894
570.469

-j4
313.802
273.125
291.405
294.441
292.237

-j6
205.728
195.197
196.91
191.198
193.427

-j8
165.014
161.326
161.011
159.317
159.558


-j10
142.863
140.097
142.925
142.204
151.374

-j12
138.346
134.539
143.829
135.732
136.463

-j14
134.316
131.747
132.072
131.801
129.853

-j16
135.905
136.372
139.103
134.587
133.024

At the moment, the above suggests that -j10 is the sweet spot and would make builds faster overall.

Note this is all on try, where the I/O sucks, and this /could/ all be related to that.
(BTW, those are all cross-osx builds)
I suspect bug 1291940 is playing a part.

AUFS has a global lock on *reads*. So any open on a file baked into the Docker image obtains a lock. Files like C/C++ headers are accessed a *ton*. Past some point of concurrency, lock contention starts adding non-trivial overhead and makes builds slower despite having more physical cores and CPU capacity present!

I'd encourage you to obtain your own EC2 instance without TC and reproduce the results. If you can blame this on AUFS, it's more ammo for fixing bug 1291940. For reference, our current Firefox builders use a 120 GB gp2 EBS volume. If you provision one of that size, it should have the same IOPS as a TC worker.
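
(Something along these lines should get a comparable instance; this is a sketch using boto3, and the AMI ID, key name, security group and region are placeholders, not our actual worker configuration.)

# Sketch: provision a c4.4xlarge with a 120 GB gp2 root volume, roughly
# matching the current Firefox builders. The AMI ID, key name, security
# group and region below are placeholders.
import boto3

ec2 = boto3.client('ec2', region_name='us-west-2')
resp = ec2.run_instances(
    ImageId='ami-xxxxxxxx',            # placeholder AMI
    InstanceType='c4.4xlarge',
    KeyName='my-key',                  # placeholder key pair
    SecurityGroupIds=['sg-xxxxxxxx'],  # placeholder security group
    MinCount=1,
    MaxCount=1,
    BlockDeviceMappings=[{
        'DeviceName': '/dev/sda1',     # assumes an AMI with /dev/sda1 as root device
        'Ebs': {'VolumeSize': 120, 'VolumeType': 'gp2'},
    }],
)
print(resp['Instances'][0]['InstanceId'])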
Blocks: fastci
To give some perspective on how many I/O system calls are needed to build Firefox, I performed an `strace -f -e trace=file -c ./mach build` against a fully populated sccache. Essentially, I was asking for a count of all file-related system calls in the `mach` process tree. Here is the result:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.02    0.171470           0   4417368   3797168 open
 34.69    0.123857           0   4799144   1201886 stat
 13.44    0.048000         134       359           rename
  1.49    0.005317           0     12882      9478 mkdir
  1.21    0.004326           0     55616      8328 lstat
  0.70    0.002501           0    122182    112351 access
  0.34    0.001200           0     10439           symlink
  0.05    0.000169           0       730           utimes
  0.03    0.000118           0     18654      6110 readlink
  0.02    0.000075           0     26422           getcwd
  0.02    0.000054           0      3998      2546 unlink
  0.00    0.000000           0     62824     44658 execve
  0.00    0.000000           0      3241           chdir
  0.00    0.000000           0        70           rmdir
  0.00    0.000000           0        14           link
  0.00    0.000000           0       337           chmod
  0.00    0.000000           0        12           chown
  0.00    0.000000           0        79           utime
  0.00    0.000000           0      3118      3113 statfs
  0.00    0.000000           0       235           openat
  0.00    0.000000           0        24           mkdirat
  0.00    0.000000           0      1265       814 newfstatat
  0.00    0.000000           0      1087       814 unlinkat
  0.00    0.000000           0        37           fchmodat
  0.00    0.000000           0        26         5 faccessat
  0.00    0.000000           0       743           utimensat
------ ----------- ----------- --------- --------- ----------------
100.00    0.357087               9540906   5187271 total

~9.5M system calls is arguably significant. But that's just for the `mach` process tree.

Here's what the sccache daemon process tree contributes (again - a fully populated local disk cache):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.96    0.476053           0  23465064    823728 stat
 31.69    0.251570           0   6911030   5599710 open
  4.44    0.035251           0   1159010       280 readlink
  1.45    0.011537           3      3465           rename
  1.38    0.010970           3      3292           utimes
  0.56    0.004427           0    130767    110043 access
  0.26    0.002076           1      3596           chdir
  0.12    0.000922           0      3363           chmod
  0.11    0.000875           5       188           unlink
  0.03    0.000236           1       282       102 mkdir
  0.00    0.000029           0     18678           lstat
  0.00    0.000014           0      4300           getcwd
  0.00    0.000000           0      7192           execve
  0.00    0.000000           0       180           rmdir
------ ----------- ----------- --------- --------- ----------------
100.00    0.793960              31710407   6533863 total

~32M system calls from sccache!

While there is likely a minefield of clownshoes resulting in excessive I/O system calls, it normally isn't a major problem. Even the strace timings show the ~23M stat calls for sccache taking only ~0.5s! But if AUFS is introducing locks around these operations and critical I/O paths slow down by just 100ns, at the volume of tens of millions of I/O system calls, we're talking seconds added to build times. I'm not saying these numbers are indicative of reality - I just wanted to put the scale of the issue into perspective.
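
(As a back-of-the-envelope illustration of that scaling; the call counts come from the strace output above, the per-call overheads are made up:)

# Back-of-the-envelope: wall time added if every file-related system call
# picks up a fixed extra latency (e.g. from lock contention). Call counts
# are from the strace output above; overheads are arbitrary examples.
calls = 9540906 + 31710407   # mach tree + sccache tree

for overhead_ns in (100, 1000, 10000):
    print('%5d ns/call -> %6.1f s added' % (overhead_ns, calls * overhead_ns * 1e-9))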

FWIW, I remember my results corroborating https://sthbrx.github.io/blog/2015/10/30/docker-just-stop-using-aufs/, where builds on AUFS spend a lot of time with kernel/aufs lock-related frames on the stack. And it gets worse as you increase concurrency.
I just did a -j16 build in Docker with the overlay storage driver on a c4.4xlarge instance with 120G of gp2, and got timings consistent with what I got on try:
1462.18 13531.1 1749.23
(also, those were totally skipping sccache, even the ones on try)
Product: Core → Firefox Build System