Closed Bug 1592981 Opened 6 years ago Closed 6 years ago

Use -import-instr-limit to mitigate size growth from ThinLTO

Categories: Firefox Build System :: Toolchains

Type: enhancement
Priority: Not set
Severity: normal

Tracking: RESOLVED FIXED, mozilla72 (firefox72 fixed)

People: Reporter: away, Assigned: away

Attachments: 1 file, 1 obsolete file

When we first enabled ThinLTO on our builds, we got great performance gains, but also large size increases due to aggressive inlining. There is an LLVM option called -import-instr-limit that caps the size (in instructions) of functions that may be imported across modules; PGO can adjust the threshold. Chromium found a good balance between binary size and performance by using a value of 10. In my initial testing, that value can save us many megabytes from libxul (9 MB on Windows/Linux, 17 MB on Mac!) without noticeable speed regressions.

It is taking me many attempts to get the spelling right across all our linkers. Currently testing https://hg.mozilla.org/try/rev/0626e2fbdd47cdacfd1735592c0636daf01e104e.
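
The spelling problem can be pictured as a small lookup from linker to flag form. This is only a sketch: the linker labels and the function name are hypothetical, and the lld-link and ld64 spellings are assumptions about how those linkers forward options to LLVM. Only the -Wl,-plugin-opt= form actually appears in this bug (in the configure log in a later comment, where gold is in use, and it is reported to work with clang/lld).

```python
def import_instr_limit_flag(linker: str, limit: int) -> str:
    """Return a link flag that forwards -import-instr-limit to LLVM.

    The linker labels used here are hypothetical names for this sketch.
    """
    llvm_opt = f"-import-instr-limit={limit}"
    if linker == "lld-link":
        # Assumption: lld-link forwards LLVM options via -mllvm:<opt>.
        return f"-mllvm:{llvm_opt}"
    if linker == "ld64":
        # Assumption: Apple's ld64 forwards LLVM options via -mllvm <opt>.
        return f"-Wl,-mllvm,{llvm_opt}"
    if linker in ("gold", "lld"):
        # This spelling is the one visible in the configure log in this bug.
        return f"-Wl,-plugin-opt={llvm_opt}"
    raise ValueError(f"no known spelling for linker {linker!r}")


if __name__ == "__main__":
    for name in ("lld-link", "ld64", "gold", "lld"):
        print(name, import_instr_limit_flag(name, 10))
```

The limit is a parameter rather than a constant because the right value may differ per platform.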

When we first enabled ThinLTO on our builds, we got great performance gains, but also large size increases due to aggressive inlining. There is an LLVM option called -import-instr-limit that caps the size (in instructions) of functions that may be imported across modules; PGO can adjust the threshold. Chromium found a good balance between binary size and performance by using a value of 10. In initial testing on Windows and Linux, that value can save us many megabytes from libxul without noticeable speed regressions. For Mac, which doesn't yet have PGO, we have to use a higher limit to avoid over-restricting the optimizer, which caused slowdowns on my try pushes.

In bug 1591725 I've been testing optimization options on Android builds. Setting -import-instr-limit leads to significant reductions in size:

Optimization        AArch  Size, MB (geckoview_example.apk)
-Oz                 32     44.0
-O2,instr-limit=0   32     49.2
-O2,instr-limit=1   32     49.2
-O2,instr-limit=3   32     49.5
-O2,instr-limit=5   32     50.2
-O2,instr-limit=10  32     50.9
-O2                 32     53.7
-O3                 32     54.9

-Oz                 64     50.2
-O2,instr-limit=0   64     55.5
-O2,instr-limit=1   64     55.5
-O2,instr-limit=3   64     55.7
-O2,instr-limit=5   64     56.2
-O2,instr-limit=10  64     57.0
-O2                 64     60.9
-O3                 64     62.6

So far, measured with Speedometer, I've only seen performance degrade relative to plain -O2 with import-instr-limit=1 and lower.
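
To make the savings in the table above easier to compare, here is a rough sketch that computes the size difference of each instr-limit setting against plain -O2. The numbers are copied from the table; the dictionary layout and function name are just illustrative choices.

```python
# geckoview_example.apk sizes in MB, copied from the table above.
SIZES_MB = {
    32: {"-O2": 53.7, "instr-limit=0": 49.2, "instr-limit=1": 49.2,
         "instr-limit=3": 49.5, "instr-limit=5": 50.2, "instr-limit=10": 50.9},
    64: {"-O2": 60.9, "instr-limit=0": 55.5, "instr-limit=1": 55.5,
         "instr-limit=3": 55.7, "instr-limit=5": 56.2, "instr-limit=10": 57.0},
}


def savings_vs_o2(sizes):
    """Map each instr-limit setting to (MB saved, percent saved) vs. plain -O2."""
    baseline = sizes["-O2"]
    return {
        name: (baseline - mb, 100.0 * (baseline - mb) / baseline)
        for name, mb in sizes.items()
        if name != "-O2"
    }


if __name__ == "__main__":
    for arch, sizes in SIZES_MB.items():
        for name, (mb, pct) in savings_vs_o2(sizes).items():
            print(f"{arch}-bit {name}: {mb:.1f} MB smaller ({pct:.1f}%)")
```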

Pushed by dmajor@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/749fec0af516 Use -import-instr-limit to mitigate size growth from ThinLTO r=froydnj
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Assignee: nobody → dmajor

== Change summary for alert #23712 (as of Tue, 05 Nov 2019 09:25:17 GMT) ==

Improvements:

8% raptor-webaudio-firefox windows7-32-shippable opt 205.67 -> 189.92

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=23712

== Change summary for alert #23686 (as of Mon, 04 Nov 2019 22:48:29 GMT) ==

Improvements:

31% build times android-4-2-x86 opt taskcluster-m5.4xlarge 2,018.30 -> 1,399.07
29% build times android-4-0-armv7-api16 opt taskcluster-c5.4xlarge 1,806.51 -> 1,285.88
28% build times android-5-0-x86_64 opt taskcluster-c5.4xlarge 2,208.62 -> 1,591.75
28% build times linux64-shippable opt nightly taskcluster-c5d.4xlarge 3,827.68 -> 2,765.54
27% build times android-5-0-x86_64 opt taskcluster-m5.4xlarge 2,151.94 -> 1,566.25
26% build times android-5-0-aarch64 opt taskcluster-c5.4xlarge 1,662.64 -> 1,222.31
26% build times linux64-shippable opt nightly taskcluster-m5.4xlarge 4,068.36 -> 2,997.44
26% build times linux32-shippable opt nightly taskcluster-c5d.4xlarge 3,923.70 -> 2,897.06
26% build times linux32-shippable opt nightly taskcluster-m5.4xlarge 4,177.20 -> 3,089.26
25% build times windows2012-aarch64 opt aarch64-no-eme nightly taskcluster-c5.4xlarge 3,281.75 -> 2,454.48
25% build times linux32-shippable opt nightly taskcluster-c5.4xlarge 3,959.06 -> 2,977.81
24% build times linux64-shippable opt nightly taskcluster-c5.4xlarge 3,773.01 -> 2,866.45
23% build times windows2012-aarch64 opt aarch64-no-eme nightly taskcluster-c4.4xlarge 3,965.53 -> 3,036.61
23% build times windows2012-64-shippable opt nightly taskcluster-c5.4xlarge 4,001.90 -> 3,088.03
23% build times windows2012-32-shippable opt nightly taskcluster-c5.4xlarge 3,793.12 -> 2,935.06
22% build times windows2012-64-shippable opt nightly taskcluster-c4.4xlarge 4,783.91 -> 3,750.17
21% build times windows2012-32-shippable opt nightly taskcluster-c4.4xlarge 4,627.16 -> 3,634.70
18% build times windows2012-aarch64 opt aarch64 taskcluster-c4.4xlarge 3,965.57 -> 3,255.33
16% build times android-4-0-armv7-api16 pgo taskcluster-m5.4xlarge 2,624.75 -> 2,207.55
16% build times android-4-0-armv7-api16 pgo taskcluster-c5d.4xlarge 2,330.61 -> 1,967.63
14% build times android-5-0-aarch64 pgo taskcluster-c5d.4xlarge 2,282.39 -> 1,955.86
14% build times android-5-0-aarch64 pgo taskcluster-m5.4xlarge 2,506.19 -> 2,153.65
13% build times android-4-0-armv7-api16 pgo taskcluster-c5.4xlarge 2,403.27 -> 2,087.00
8% build times osx-shippable opt nightly taskcluster-c5d.4xlarge 3,601.65 -> 3,314.12

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=23686
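
For reference, the percentages in these alerts are consistent with a simple relative change between the before and after values. The helper below is a sketch of that arithmetic, not the alerting system's actual code.

```python
def percent_improvement(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Two rows from the build-times alert above (values in seconds as reported).
    print(round(percent_improvement(2018.30, 1399.07)))  # android-4-2-x86 opt
    print(round(percent_improvement(3601.65, 3314.12)))  # osx-shippable opt
```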

== Change summary for alert #23710 (as of Tue, 05 Nov 2019 09:08:27 GMT) ==

Improvements:

3% Explicit Memory windows7-32 opt 295,940,910.84 -> 285,632,278.54
3% Explicit Memory windows7-32-shippable opt 295,216,826.77 -> 285,805,301.59
2% Explicit Memory windows7-32-shippable opt stylo tp6 391,692,290.18 -> 383,544,099.35

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=23710

(In reply to Alexandru Ionescu :alexandrui from comment #8)

Judging by https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=autoland,1959367,1,4&timerange=1209600&zoom=1572921491070,1572922800207,283626021.2749412,303780436.2993224 , I believe this more rightly belongs to bug 1584101.

This seems to have regressed building Firefox with GCC when LTO is enabled.

It seems that the linker doesn't understand -import-instr-limit:

0:11.14 checking what kind of list files are supported by the linker... configure: error: Couldn't find one that works
0:11.15 DEBUG: <truncated - see config.log for full output>
0:11.15 DEBUG: configure:10778: /usr/bin/x86_64-pc-linux-gnu-g++ -o conftest -march=znver1 -pipe -flifetime-dse=1 -Wno-psabi -Wno-class-memaccess -Wno-int-in-bool-context -Wno-multistatement-macros -Wno-maybe-uninitialized -Wno-deprecated-declarations -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -fno-math-errno -pthread -lpthread -Wl,-O1 -Wl,--as-needed -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags -Wl,--compress-debug-sections=zlib -fuse-ld=gold -Wl,-z,noexecstack -Wl,-z,text -Wl,-z,relro -Wl,-z,nocopyreloc -Wl,-Bsymbolic-functions -Wl,--icf=safe conftest.C -ldl 1>&5
0:11.15 DEBUG: configure:10864: checking for -pipe support
0:11.15 DEBUG: configure:10891: checking what kind of list files are supported by the linker
0:11.15 DEBUG: configure:10896: /usr/bin/x86_64-pc-linux-gnu-gcc -std=gnu99 -o conftest.o -c -flto -flifetime-dse=1 -march=znver1 -pipe -fno-strict-aliasing -ffunction-sections -fdata-sections -fno-math-errno -pthread -fPIC -pipe conftest.c 1>&5
0:11.15 DEBUG: configure:10903: /usr/bin/x86_64-pc-linux-gnu-gcc -std=gnu99 -o conftest -flto=12 -flifetime-dse=1 -Wl,-plugin-opt=-import-instr-limit=10 -lpthread -Wl,-O1 -Wl,--as-needed -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags -Wl,--compress-debug-sections=zlib -fuse-ld=gold -Wl,-z,noexecstack -Wl,-z,text -Wl,-z,relro -Wl,-z,nocopyreloc -Wl,-Bsymbolic-functions -Wl,--icf=safe conftest.list -ldl 1>&5
0:11.15 DEBUG: x86_64-pc-linux-gnu-gcc: error: unrecognized command line option '-import-instr-limit=10'
0:11.15 DEBUG: lto-wrapper: fatal error: /usr/bin/x86_64-pc-linux-gnu-gcc returned 1 exit status
0:11.15 DEBUG: compilation terminated.
0:11.15 DEBUG: /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld.gold: fatal error: lto-wrapper failed
0:11.15 DEBUG: collect2: error: ld returned 1 exit status
0:11.15 DEBUG: configure:10907: /usr/bin/x86_64-pc-linux-gnu-gcc -std=gnu99 -o conftest -flto=12 -flifetime-dse=1 -Wl,-plugin-opt=-import-instr-limit=10 -lpthread -Wl,-O1 -Wl,--as-needed -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags -Wl,--compress-debug-sections=zlib -fuse-ld=gold -Wl,-z,noexecstack -Wl,-z,text -Wl,-z,relro -Wl,-z,nocopyreloc -Wl,-Bsymbolic-functions -Wl,--icf=safe -Wl,-filelist,conftest.list -ldl 1>&5
0:11.15 DEBUG: /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld.gold: fatal error: -f/--auxiliary may not be used without -shared
0:11.15 DEBUG: collect2: error: ld returned 1 exit status
0:11.15 DEBUG: configure:10909: /usr/bin/x86_64-pc-linux-gnu-gcc -std=gnu99 -o conftest -flto=12 -flifetime-dse=1 -Wl,-plugin-opt=-import-instr-limit=10 -lpthread -Wl,-O1 -Wl,--as-needed -Wl,-rpath=/usr/lib64/firefox,--enable-new-dtags -Wl,--compress-debug-sections=zlib -fuse-ld=gold -Wl,-z,noexecstack -Wl,-z,text -Wl,-z,relro -Wl,-z,nocopyreloc -Wl,-Bsymbolic-functions -Wl,--icf=safe @conftest.list -ldl 1>&5
0:11.15 DEBUG: x86_64-pc-linux-gnu-gcc: error: unrecognized command line option '-import-instr-limit=10'
0:11.15 DEBUG: lto-wrapper: fatal error: /usr/bin/x86_64-pc-linux-gnu-gcc returned 1 exit status
0:11.15 DEBUG: compilation terminated.
0:11.15 DEBUG: /usr/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../x86_64-pc-linux-gnu/bin/ld.gold: fatal error: lto-wrapper failed
0:11.15 DEBUG: collect2: error: ld returned 1 exit status
0:11.15 DEBUG: configure: error: Couldn't find one that works
0:11.15 ERROR: old-configure failed
0:11.19 *** Fix above errors and then restart with
0:11.19 "./mach build"
0:11.19 gmake: *** [client.mk:115: configure] Error 1

With clang/lld, there's no problem.

0:11.15 DEBUG: x86_64-pc-linux-gnu-gcc: error: unrecognized command line option '-import-instr-limit=10'

Looks like it needs to be clang-specific.
Could you please open a new bug?
Thanks

Reverting the patch from this bug makes the LTO wrappers usable again, so yes, this change should be clang-specific. I'm going to open a new bug and mark this one as the cause of the regression.

Attached file: Fix GCC LTO build break (obsolete)

GCC doesn't understand the import-instr-limit option.
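
A minimal sketch of what "clang-specific" could look like, assuming the build system knows which compiler is driving the link. The function name and the `compiler_type` labels are hypothetical, and this is not the actual patch; the flag spelling is the one shown in the configure log above.

```python
from typing import List


def import_limit_ldflags(compiler_type: str, limit: int = 10) -> List[str]:
    """Return the extra LTO link flags, or nothing for non-LLVM toolchains.

    `compiler_type` is a hypothetical label such as "clang" or "gcc".
    """
    if compiler_type != "clang":
        # GCC's lto-wrapper rejects LLVM options like -import-instr-limit,
        # so nothing is added for it.
        return []
    return [f"-Wl,-plugin-opt=-import-instr-limit={limit}"]


if __name__ == "__main__":
    print(import_limit_ldflags("clang"))
    print(import_limit_ldflags("gcc"))
```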

I'm sorry, I was confused by the discussion being in this bug. tt_1 opened bug 1602355 for the regression. I'll move the patch over there.

Comment on attachment 9114553 [details]
Fix GCC LTO build break

Revision D56366 was moved to bug 1602355. Setting attachment 9114553 [details] to obsolete.

Attachment #9114553 - Attachment is obsolete: true
Regressions: 1602355
See Also: → 1832022