Linker section GC is throwing away lots of PGO data
Categories
(Firefox Build System :: Toolchains, defect)
Tracking
(firefox82 fixed)
Tracking | Status | |
---|---|---|
firefox82 | --- | fixed |
People
(Reporter: away, Assigned: away)
References
Details
Attachments
(2 files)
While investigating clang-10 performance regressions, I noticed that a large number of functions are missing from our merged.profdata
, even though those functions are clearly instrumented in disassembly, and we definitely execute them during training.
After stepping through llvm-profdata show
and deciphering the file format from InstrProfData.inc
, I found that our missing functions are dumping their counter arrays to the .profraw
files as expected. But the tooling does not iterate over those arrays directly. Instead, the master "table of contents" is the list of __llvm_profile_data
objects at the beginning of the file, which have pointers to the counter arrays and other per-function information.
Those data objects were missing because their corresponding __profd_
entries in the compiled object files were dropped by linker GC. So we ended up with a bunch of data in our profiles that was unreachable from the ToC.
Disabling --gc-sections
during MOZ_PROFILE_GENERATE
results in huge improvements. Using the webaudio benchmark, which has tight C++ loops and is therefore very sensitive to compiler optimization, the previous clang-10 regressions turn into solid improvements.
I tried to run this under lld because I already have a debug build of that, but it turns out that the problem doesn't reproduce there. The excessive GC seems to be something specific to ld, that was made worse by clang-10.
Here's the line count of -Wl,--print-gc-sections
output:
clang-9 + ld: 580200
clang-9 + gold: 5463
clang-9 + lld: 5467
clang-10 + ld: 857670
clang-10 + gold: 5460
clang-10 + lld: 5464
Given that this isn't a problem under lld, I'm much less inclined to debug it any further, and more inclined to just take the --no-gc-sections
workaround.
Comment 2•5 years ago
|
||
Have you tried to get a small reproduction? That would be useful in order to file an upstream bug (and also to check if it's still an issue in newer versions of binutils, I'm not sure we're on the last release).
I did, using a stripped-down version of the GainNodeEngine
code from webaudio that first caught our attention in perfherder. I tried to match the code structure as closely as possible in case linkage and visibility of the functions make a difference. No luck. At this point I can't justify spending more time on it.
Comment 4•5 years ago
|
||
Do you have a list of the GCed sections? (from -Wl,--print-gc-sections)
For purposes of this bug, the removals that really hurt are the profd
symbols.
Comment 6•5 years ago
|
||
So... I took a quick look and ... I can't even figure out why some of those profd
symbols were not GC'ed before or why lld wouldn't GC them.
, because it looks to me they should be. Like, if I look at __profd__ZNK7mozilla3dom14GainNodeEngine19SizeOfExcludingThisEPFmPKvE.12884901887
, I don't see it used anywhere that should prevent the linker from GC'ing it, and more importantly, I don't see what the difference is between the clang 9 and clang 10 output that makes ld GC it now but not previously.
For not-well-understood reasons, ld's --gc-sections
discards a large number of the PGO bookkeeping structures that enable us to keep track of function counters, and the effect gets worse in object files generated by clang-10.
As much as I'd like to understand this better, the investigations take way too much time. As a path of least resistance, we can disable --gc-sections
for the instrumentation phase of PGO builds. It won't harm anything since users never see those builds, and it will improve the performance of the optimized phase greatly.
Comment 8•4 years ago
|
||
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:dmajor, could you have a look please?
For more information, please visit auto_nag documentation.
Comment hidden (obsolete) |
Comment 11•4 years ago
|
||
Comment 12•4 years ago
|
||
bugherder |
Comment 13•4 years ago
|
||
Backed out for causing Btime failures on Android 7.0 (bug 1660340)
Backout link: https://hg.mozilla.org/integration/autoland/rev/592e556af3ecee3700c80dac778585e93cbe755f
Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=314317947&repo=autoland&lineNumber=1864
Comment 14•4 years ago
|
||
Backout merged: https://hg.mozilla.org/mozilla-central/rev/592e556af3ec
Comment 15•4 years ago
|
||
Comment 16•4 years ago
|
||
bugherder |
Description
•