Bug 901384 (Open) · Opened 11 years ago · Updated 2 years ago

Investigate memory mapped files with gold

Categories

(Firefox Build System :: General, defect)

x86_64
Linux
defect

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

Details

I was looking into the excessive linking times on my Linux machine (10GB of memory, a mechanical HD, and a non-debug build with Clang SVN tip and a vanilla mozconfig) and noticed the seemingly absurd amount of memory ld.gold is using: nearing 6GB RSS. This is enough to cause page cache eviction and lots of I/O during libxul linking.

Passing --stats to ld.gold revealed that it is mmap'ing nearly 5.7GB of input files! Looking into the docs, it appears ld.gold will mmap() entire input files by default on 64-bit, and will only map the "relevant" parts on 32-bit. You can also get the latter behavior by passing --no-map-whole-files. In addition, the output file appears to be mmap'd by default; this can be disabled with --no-mmap-output-file.
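(For anyone reproducing this: a minimal sketch of forwarding those options through the compiler driver; the actual libxul link command line is far longer and is omitted here:)

# -fuse-ld=gold assumes gold isn't already the default linker on your system.
clang++ -fuse-ld=gold -shared -o libxul.so <objects and libs omitted> \
  -Wl,--stats -Wl,--no-map-whole-files -Wl,--no-mmap-output-file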

I invoked the linker manually with different options to see if there was any effect. I ran the linker multiple times to ensure the page cache was in a consistent state, and I'm reporting the final values from each attempt.

Default options (full input and output file mmap):

/usr/bin/ld.gold.real: initial tasks run time: (user: 1.680000 sys: 0.320000 wall: 4.540000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 12.660000 sys: 2.510000 wall: 32.610000)
/usr/bin/ld.gold.real: final tasks run time: (user: 12.240000 sys: 14.350000 wall: 103.650000)
/usr/bin/ld.gold.real: total run time: (user: 26.580000 sys: 17.180000 wall: 140.800000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1388871680 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5695366687
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 5695366687
/usr/bin/ld.gold.real: output file size: 1389299848 bytes

--no-map-whole-files and --no-mmap-output-file:

/usr/bin/ld.gold.real: initial tasks run time: (user: 2.240000 sys: 0.970000 wall: 10.680000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 13.550000 sys: 2.700000 wall: 34.830000)
/usr/bin/ld.gold.real: final tasks run time: (user: 12.200000 sys: 10.060000 wall: 58.540000)
/usr/bin/ld.gold.real: total run time: (user: 27.990000 sys: 13.730000 wall: 104.050000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1391321088 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5024207517
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 451734053
/usr/bin/ld.gold.real: output file size: 1389299848 bytes

That's a healthy 36s (25%) decrease in wall time! The difference appears to be I/O wait. As I said, I'm on a mechanical HD. According to --stats, --no-map-whole-files decreases mapped input file bytes by ~700MB (5695366687 vs 5024207517). My HD does 25-40MB/s reads during the link, so that comes out to 17-28s less read time! Even more impressive is the "maximum bytes mapped for read at one time": 5.7GB vs 450MB!!! I suspect most of the reads with partial mmap were serviced by the page cache.

I haven't isolated the impact of mmap'ing the output file.

I'm not an expert on gold and/or mmap, but it certainly appears that gold's whole-file memory mapping is causing more read I/O than "relevant"-parts mapping, and is thus slowing down linking.

My first question: why do we need to mmap input/output files at all? Presumably the page cache will perform in-memory caching, yielding fast access. In the case where we don't have enough physical memory to hold the input file set, either gold will swap (due to the mmap requirements) or the page cache will be evicted. Where swap is backed by a disk, I think the end result is mostly the same as page cache eviction: a lot of I/O. Is whole-file mmap a reasonable default given our large input set size? Should we disable it on 64-bit builds? Should we dynamically change the value based on the amount of available memory?
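(One cheap way to watch this happen, a sketch using standard procps tools while the link runs in another terminal:)

vmstat -S M 1   # watch the "cache" column shrink and "bi" (blocks read in) spike
free -m         # before/after snapshots of free memory and page cache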

Next, is there a good way to decrease the size of the input files? 5-6GB seems excessive. That is data that mostly needs to be brought into memory for linking. Unless you have tons of memory (it appears 11-12GB on my Linux install), there will be page cache eviction during builds and the linker will need to go to the HD. If we can decrease the input file size, we decrease page cache eviction along with read I/O wait times. I'd go so far as to suggest that we should optimize the build defaults for faster build times/smaller input file sizes, even if it means giving up a common developer requirement (e.g. debug symbols). Gecko developers may not like this, but casual/non-C++ developers will.
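(To put numbers on your own tree, a quick sketch, assuming GNU find/du and substituting your actual objdir path:)

find objdir -name '*.o' -print0 | du -ch --files0-from=- | tail -n1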

I concede the issue of I/O wait isn't as bad on SSDs. But, a lot of people (notably community contributors and our automation build infrastructure) aren't on SSDs, lack gobs of memory, and would likely benefit from improvements to libxul link times via lower memory usage.
--disable-debug-symbols and --disable-debug along with --no-map-whole-files and --no-mmap-output-file yielded:

/usr/bin/ld.gold.real: initial tasks run time: (user: 1.560000 sys: 0.370000 wall: 1.940000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 1.220000 sys: 0.070000 wall: 1.290000)
/usr/bin/ld.gold.real: final tasks run time: (user: 0.890000 sys: 0.080000 wall: 0.970000)
/usr/bin/ld.gold.real: total run time: (user: 3.670000 sys: 0.520000 wall: 4.200000)
/usr/bin/ld.gold.real: total space allocated by malloc: 573362176 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 757900552
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 340709441
/usr/bin/ld.gold.real: output file size: 82426368 bytes

Net reduction of 100s and nearly 5GB of mapped input files by disabling debug info. Good grief. I know Gecko/C++ devs may not like it, but can we make debug info disabled by default for local builds?
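(For reference, a minimal mozconfig sketch for this configuration, using the standard ac_add_options syntax; the gold flags above were passed by hand:)

ac_add_options --disable-debug
ac_add_options --disable-debug-symbols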
(In reply to Gregory Szorc [:gps] from comment #1)
> Net reduction of 100s and nearly 5GB of mapped input files by disabling
> debug info. Good grief. I know Gecko/C++ devs may not like it, but can we
> make debug info disabled by default for local builds?

If --enable-debug-symbols is not implied by --enable-debug, a rabid pack of angry developers will hunt you down and eat your flesh.

I know also that there are proposals afoot to fission off the debug information: <http://gcc.gnu.org/wiki/DebugFission>. We can try using that where supported...
For a clobber build with populated ccache and warm page cache:

No debug: 3:47 wall; 1062 MB in objdir, 272 MB of that in .o files
Debug:    11:02 wall; 9024 MB in objdir, 5932 MB of that in .o files

Debug info bloats object files, which increases I/O overhead. Furthermore, it increases page cache eviction, since the RSS of the build output is larger, in turn causing more I/O pressure and slowing the build down even more.

This is all on an i7-2600K with 10GB of RAM and a decent spinning HD (Western Digital Black 7200RPM) - not a shabby development machine IMO.

Keep in mind this is with Clang, which I believe uses more data for debug info than GCC.
Builds are optimized and non-debug by default, so I'm not entirely certain what you're proposing to change.
(In reply to Joshua Cranmer [:jcranmer] from comment #2)
> (In reply to Gregory Szorc [:gps] from comment #1)
> > Net reduction of 100s and nearly 5GB of mapped input files by disabling
> > debug info. Good grief. I know Gecko/C++ devs may not like it, but can we
> > make debug info disabled by default for local builds?
> 
> If --enable-debug-symbols is not implied by --enable-debug, a rabid pack of
> angry developers will hunt you down and eat your flesh.
>
> I know also that there are proposals afoot to fission off the debug
> information: <http://gcc.gnu.org/wiki/DebugFission>. We can try using that
> where supported...

As I mentioned in the dev-platform thread, this doesn't work well with ccache. It also wouldn't work on the buildbot builds, and would probably not be very helpful for distributors, either.
This is the kind of thing that probably requires a divergence between (local) developer builds and release builds.

Also note that -gsplit-dwarf makes the entire objdir smaller, because there are fewer relocations in the externalized DWARF info than there would be if it were in the object files. The downside is that, once combined, the externalized DWARF info takes more space than it would have inside libxul.so without -gsplit-dwarf, but one doesn't need to relink the DWARF info for it to work in gdb. A recent gdb is needed, though.
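(A minimal sketch of the workflow, assuming a GCC with -gsplit-dwarf support and a gold with --gdb-index:)

g++ -g -gsplit-dwarf -c foo.cpp      # DWARF goes into foo.dwo; foo.o stays small
g++ -fuse-ld=gold -Wl,--gdb-index -shared -o libfoo.so foo.o   # index speeds up gdb startup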

I had data that I wanted to blog, but I can't find it anymore :(
(In reply to Gregory Szorc [:gps] from comment #0)
> Passing --stats to ld.gold revealed that it is mmap'ing nearly 5.7GB of
> input files! Looking into the docs, it appears ld.gold will mmap() entire
> input files by default on 64-bit, and will only map the "relevant" parts on
> 32-bit. You can also get the latter behavior by passing --no-map-whole-files.
> In addition, the output file appears to be mmap'd by default; this can be
> disabled with --no-mmap-output-file.

The reason for the discrepancy between the 32-bit and 64-bit defaults is probably virtual address space. The 32-bit linker would undoubtedly hit the 2, 3, or 4GB wall (on a 32-bit OS, with PAE, or on a 64-bit OS, respectively) when linking big projects such as Firefox, while the 64-bit linker wouldn't have this problem. Note that disabling input mmap shouldn't change the amount of data read overall. So if it does make a difference, it means gold is doing things in a very inefficient way when it maps everything (like reading from everywhere several times), which kind of sounds like a bad idea.
You should get this data to gold upstream and just make them drop the mmap-all-input-files thing entirely. I doubt there's a noticeable difference in link time for small projects.
Another path to explore is incremental linking.
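(Gold does ship an incremental mode; a sketch, assuming a gold built with incremental support - it has restrictions on output types and options:)

g++ -fuse-ld=gold -Wl,--incremental-full -o prog main.o      # full link, padded for future updates
g++ -fuse-ld=gold -Wl,--incremental-update -o prog main.o    # later: patch only changed inputs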
GCC 4.7 results:

No debug info: 3:55 wall; 1069 MB in objdir, 277 MB of that in .o files
Debug:         12:02 wall; 7555 MB in objdir, 4129 MB of that in .o files

Here are the results for partial input mmap and disabled output mmap (warm ccache, fastest run):

/usr/bin/ld.gold.real: initial tasks run time: (user: 2.970000 sys: 2.990000 wall: 31.740000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 10.740000 sys: 4.130000 wall: 47.060000)
/usr/bin/ld.gold.real: final tasks run time: (user: 12.020000 sys: 8.240000 wall: 63.000000)
/usr/bin/ld.gold.real: total run time: (user: 25.730000 sys: 15.360000 wall: 142.130000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1916047360 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 4202271220
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 462871108
/usr/bin/ld.gold.real: output file size: 1188512016 bytes

Interestingly, gold explicitly allocated more memory with GCC than with Clang. Also, gold used significantly more user and sys CPU with GCC! I'm no expert on the differences in how debug info is stored between the two, but there clearly are differences.
Clang has less support for new DWARF instructions than GCC has.
(In reply to Mike Hommey [:glandium] from comment #9)
> Clang has less support for new DWARF instructions than GCC has.

That can go both ways, can't it?

1) Clang has less support for new DWARF instructions, so it includes less debug information that could be communicated by those new instructions;
2) Clang has less support for new DWARF instructions, so its debug information is larger (presumably because the new instructions can reduce the size of the debug information).

Which sense are you intending?
Also, binutils threads where relevant options were added:

http://sourceware.org/ml/binutils/2009-10/msg00426.html (mmap'ing whole files)
http://sourceware.org/ml/binutils/2012-06/msg00051.html (mmap'ing output file)
On my machine (8-core Xeon 5550, 12GB RAM, objdir on mechanical hard drive), linking with --no-map-whole-files (configured with --enable-optimize --disable-debug) is a good bit slower with warm cache:

default (--map-whole-files):

/usr/bin/ld: total run time: (user: 21.110000 sys: 2.270000 wall: 24.540000)
/usr/bin/ld: maximum bytes mapped for read at one time: 3798355323

--no-map-whole-files:

/usr/bin/ld: total run time: (user: 23.710000 sys: 16.460000 wall: 41.950000)
/usr/bin/ld: maximum bytes mapped for read at one time: 457300289

Looks like we're spending more time mapping and unmapping things. I do see the total RAM requirements drop quite a bit when not mapping whole files. Not sure why there's a discrepancy between what gps measured and what I see here.
That is interesting.

It's possible your extra 2GB of RAM is enough to prevent page cache eviction? My numbers came from removing libxul.so in the objdir, running |make -C toolkit/library libxul.so|, and taking the fastest measurement. The numbers during an actual clobber build were much different (they were slower due to more read I/O).

If I had a machine with gobs of memory (16+ GB), I'd create Linux control groups (cgroups) with different memory limits and test at which point performance goes to crap due to I/O overhead. This would also let us isolate the amount of I/O going through the I/O subsystem rather than what's occurring on disk and/or reported by gold.
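(A sketch of that experiment with the cgroup v1 memory controller, assuming it's mounted at the usual path and you're root; the group name is arbitrary:)

mkdir /sys/fs/cgroup/memory/linktest
echo 4G > /sys/fs/cgroup/memory/linktest/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/linktest/tasks   # move this shell into the group
make -C toolkit/library libxul.so                # then link under the limit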
(In reply to Gregory Szorc [:gps] from comment #13)
> That is interesting.
> 
> It's possible your extra 2GB of RAM is enough to prevent page cache
> eviction? My numbers came from removing libxul.so in the objdir and running
> |make -C toolkit/library libxul.so|, and taking the fastest measurement. The
> numbers during an actual clobber build were much different (they were slower
> due to more read I/O).

I was not removing libxul.so, I'll try that tomorrow.

It's possible, though I'm wondering instead how much stuff you have running on the machine you're testing on.  The only significant thing I have running on this machine is my 540M resident emacs process (710M virtual); the next closest thing (the X server) has 11M resident.  So I have gobs of memory both for the actual link and for the kernel's page cache.
My system boots with about 550MB virtual in use. After I run |hg pull| and |hg up|, launch vim, and run configure, it goes up to ~1GB. Figure 6GB for object files + 1.4GB for gold's explicit allocations + 1.4GB for the output file, and I'm already looking at 8.8GB right there. No clue if all of that is simultaneously resident, but the number is high enough that it certainly seems plausible I'm flirting with a memory ceiling.

I'll put a few more GB in this machine and see what happens.
13 GB of memory makes a world of difference compared to 10 GB! On subsequent linker runs, there is no read I/O. There is a burst of 70+ MB/s of writing (presumably writing out libxul.so).

With debug symbols and mmap whole input and output:

/usr/bin/ld.gold.real: initial tasks run time: (user: 1.690000 sys: 0.150000 wall: 1.840000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 11.990000 sys: 0.050000 wall: 12.070000)
/usr/bin/ld.gold.real: final tasks run time: (user: 15.910000 sys: 2.150000 wall: 23.000000)
/usr/bin/ld.gold.real: total run time: (user: 29.590000 sys: 2.350000 wall: 36.910000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1359491072 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5686750375
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 5686750375

With --no-map-whole-files:

/usr/bin/ld.gold.real: initial tasks run time: (user: 1.760000 sys: 0.350000 wall: 2.120000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 12.560000 sys: 0.200000 wall: 12.780000)
/usr/bin/ld.gold.real: final tasks run time: (user: 15.340000 sys: 2.190000 wall: 20.320000)
/usr/bin/ld.gold.real: total run time: (user: 29.660000 sys: 2.740000 wall: 35.220000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1361821696 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5017981283
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 447062915

Times are very similar regardless of whole input file mmap. Interesting. Perhaps this is because all reads are serviced by the page cache? I do note that my system still shows a few hundred MB free during the throes of linking.

Let's try killing the page cache.

echo 3 > /proc/sys/vm/drop_caches
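(Writing 3 drops the page cache plus dentries and inodes; echo 1 would drop the page cache only. It needs root, and running sync first makes the drop more thorough.)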

whole file mmap:

/usr/bin/ld.gold.real: initial tasks run time: (user: 2.900000 sys: 3.650000 wall: 80.990000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 16.240000 sys: 4.280000 wall: 61.970000)
/usr/bin/ld.gold.real: final tasks run time: (user: 13.520000 sys: 13.240000 wall: 86.040000)
/usr/bin/ld.gold.real: total run time: (user: 32.660000 sys: 21.170000 wall: 229.000000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1359486976 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5686750375
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 5686750375

--no-map-whole-files:

/usr/bin/ld.gold.real: initial tasks run time: (user: 3.020000 sys: 4.150000 wall: 79.050000)
/usr/bin/ld.gold.real: middle tasks run time: (user: 16.290000 sys: 4.190000 wall: 60.440000)
/usr/bin/ld.gold.real: final tasks run time: (user: 10.800000 sys: 10.840000 wall: 73.410000)
/usr/bin/ld.gold.real: total run time: (user: 30.110000 sys: 19.180000 wall: 212.900000)
/usr/bin/ld.gold.real: total space allocated by malloc: 1361817600 bytes
/usr/bin/ld.gold.real: total bytes mapped for read: 5017981283
/usr/bin/ld.gold.real: maximum bytes mapped for read at one time: 447062915

Again, --no-map-whole-files is slightly faster on my machine, even with 13GB RAM (and no memory pressure). This is seemingly in contrast to Nathan's observation.
I want this to be a serious question: do we need debug symbols for local builds? I know Gecko/C++ developers do. But there are a number of people who I don't believe would miss debug symbols. These include Firefox developers who pretty much only touch JS.

What happens when local builds don't have debug symbols? Is the run-time "profile" of the application changed significantly enough that only a fool would want to develop without debug symbols? Or is it mostly harmless? Do we risk pushing more bad patches to try/inbound because we don't have debug symbols locally?

What I'm getting at is that if debug symbols can result in such a horrible developer experience (which I believe the data clearly shows - we're just arguing over the mmap impact), I believe the build tools should either warn when such a bad experience can be avoided or just default to avoiding it in the first place. I expect C++ developers to cringe at the possibility of having debug symbols disabled by default. But if you go on straight numbers, I'm pretty sure we have more non-C++ developers committing to mozilla-central than C++ developers these days. Why does a large group have to suffer for the convenience of a smaller group? Why do first-time contributors have to suffer through an ever-worse first-time build experience if it can be avoided?
Then why do JS developers build C++ at all? If you want to disable something, disable that, not debug info. --disable-compile-environment is currently pretty useless, but it could be made useful.
Not only would that make builds even faster for those developers, because they wouldn't have to build C++ code *at all*, but it would also spare them from having to install a compiler they don't even need.

IOW, we should add a flag that takes the path to a nightly, and the build system would unpack it for its libs and executables, and use the source tree content for the rest.
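(Roughly, in mozconfig terms; --disable-compile-environment exists today, while the nightly-path option is hypothetical:)

ac_add_options --disable-compile-environment
ac_add_options --with-prebuilt-nightly=/path/to/nightly   # hypothetical flag, per the above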
I filed bug 901840 to track a better development environment with no compilation. I don't expect to see movement on that soon - at least this quarter. Although, there is significant potential there, so who knows.

I asked the question mainly because I'm looking for cheap wins. Disabling debug symbols at the expense of annoying the C++ developers seems pretty cheap compared to supporting a somewhat radical build config and creating the tooling to go with it.
Presumably, non C++ developers are not touching C++ code. So incremental builds for them shouldn't rebuild C++ code (or at least not libxul). If it does, there's a bug to file.
(In reply to Mike Hommey [:glandium] from comment #20)
> Presumably, non C++ developers are not touching C++ code. So incremental
> builds for them shouldn't rebuild C++ code (or at least not libxul). If it
> does, there's a bug to file.

All developers need to keep up with mainline; for me, that's a ~50 minute clobber build every morning.
We settled on enabling debug symbols by default because we had seen lots of new contributors doing a build and then wanting to debug, and realizing they didn't have debug symbols and would have to do an entirely new build. That seemed like a horrible failure mode.
(In reply to Nick Alexander :nalexander from comment #21)
> (In reply to Mike Hommey [:glandium] from comment #20)
> > Presumably, non C++ developers are not touching C++ code. So incremental
> > builds for them shouldn't rebuild C++ code (or at least not libxul). If it
> > does, there's a bug to file.
> 
> All developers need to keep up with mainline; for me, that's a ~50 minute
> clobber build every morning.

Unfortunately I think your best option may be to just not update that frequently; I was under the impression people tend to pull more like weekly. Given the amount of stuff that lands in a day, chances are that even with a far better #include story than we have today, you'd still have to rebuild most everything when updating over a day's worth of changes, and at that point debug symbols are probably fairly small potatoes.
Hi, somewhere, someone (maybe Mike Hommey?) pointed out that the current version of ccache doesn't grok -gsplit-dwarf (and doesn't handle the additional .dwo file), which GCC 4.8.x supports.

I can't find exactly where that was mentioned or by whom. I tried to find the post but could not after a 10-minute search, so I am posting what I have done here.

I looked at the ccache source code and decided to hack it so that it supports -gsplit-dwarf properly, since the lure of short link times looks so attractive if what people say about -gsplit-dwarf is even half true :-)
 
I wonder if interested people could hammer on the version available at
https://bitbucket.org/zephyrus00jp/ccache-gsplit-dwarf-support

ccache's failure to support -gsplit-dwarf was also mentioned by Mike in

https://bugzilla.samba.org/show_bug.cgi?id=10005

which is why I thought ccache's failure to handle .dwo files correctly had been mentioned by Mike on a Mozilla mailing list (somewhere?).

TIA
Additional comments for the ccache that supports -gsplit-dwarf:
gcc needs to be a 4.8.x version.
ld needs to be ld.gold, I think.
objcopy needs to be a version that supports --extract-dwo, etc.
The error messages will guide you in upgrading the necessary packages.
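(For reference, a minimal sketch of the split step that ccache has to replicate, assuming a binutils recent enough to have the dwo options:)

objcopy --extract-dwo foo.o foo.dwo   # copy the DWARF sections out into a .dwo file
objcopy --strip-dwo foo.o             # then drop those sections from the object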

TIA
Additional random observations, building desktop with -g -O,
resulting in an approximately 1.2-1.3GB libxul.so:

With 8GB in the machine, a mechanical disk, and using
GNU gold (version 2.22.52.0.1-10.fc17 20120131) 1.11, linking is
completely disk bound and takes many (10+) minutes.

With 12GB ram, deleting libxul.so and then rerunning
'make -f client.mk build -j4' (to get everything re-statted and then
re-link libxul) takes < 2 minutes in the best case, when the
caches are warm and nothing else is going on.  But if I use the
machine for anything else at all, performance falls badly.

With 16GB ram, I can consistently get re-stat re-link times of
around 1m30 (wallclock), and ld.gold runs at 100% cpu utilisation
for the whole 22 or so user seconds that it needs.  This is
true even when using the machine for realistic other stuff
at the same time (thunderbird, firefox, emacs, irc client).

So it seems like 16GB ram is a realistic minimum requirement for a
productively-responsive developer machine, in this case.
(In reply to Julian Seward from comment #26)

> With 8GB in the machine, a mechanical disk, and using
> GNU gold (version 2.22.52.0.1-10.fc17 20120131) 1.11, linking is
> completely disk bound and takes many (10+) minutes.
> 

I found that a computer with an i3-3240 (3.4GHz) can run VirtualBox under 64-bit Win7, hosting Debian GNU/Linux with 8GB of memory (only 6GB or so assigned to the VBox image), and there, linking thunderbird after one file is changed and compiled can be done in 4min 30sec or so (usually faster, depending on how the disk cache is populated, I think) with ld.gold (using gcc's -gsplit-dwarf and -Wl,--gdb-index options).

On a much slower CPU, but with 16GB of memory, I can still compile and run thunderbird in a similar time range in Debian GNU/Linux inside VMplayer under 64-bit Windows.
(My slower PC has a Pentium G640 at 2.8GHz, which has two cores but no hyperthreading; it is based on an architecture from two generations ago, and it shows in the performance figures.)

So it is a matter of various factors:
- how fast one aims to build, and at what cost
- CPU speed (and cache size, the number of cores, and hyperthreading; these seem important when -jN options are used)
- memory amount and its bus speed.

As a matter of fact, the PC with the i3-3240 runs |make mozmill| on a full DEBUG BUILD of TB in slightly over 20 minutes (while I record the session using the |script| command), while the slower PC takes 35 minutes or so.

To me, at the moment, the total execution time of |make mozmill| is what counts, so the PC with the i3-3240 is very handy (but it does not have ECC memory, while the slower PC does).

Also, the problem of using gdb against a binary linked with -gsplit-dwarf has forced me to disable -gsplit-dwarf, which lengthened the link time to close to 10 minutes :-(
Being able to use gdb to obtain detailed stack traces and source line printing is important, so I would appreciate it if someone could figure out the answer to the problems raised in bug 905646.
(Correctness counts over speed, of course.)

I think it is important to describe the CPU model, its clock frequency, the memory speed, etc. to make meaningful comparisons. As is often said in the performance analysis community, you can't compare apples and oranges.
Product: Core → Firefox Build System
Severity: normal → S3