1700520 - GCC takes a long time to compile gl.cc

Reporter

Description

5 years ago

-ftime-report output at -O0:

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)    1381 kB (  0%)
 phase parsing                      :   2.68 (  0%)   6.41 ( 16%)  31.30 (  3%)  507182 kB (  6%)
 phase lang. deferred               :   0.37 (  0%)   0.23 (  1%)   0.61 (  0%)   37757 kB (  0%)
 phase opt and generate             :1098.11 (100%)  32.65 ( 83%)1134.78 ( 97%) 8287368 kB ( 93%)
 phase last asm                     :   2.13 (  0%)   0.08 (  0%)   5.25 (  0%)   57039 kB (  1%)
 |name lookup                       :   0.70 (  0%)   0.95 (  2%)  23.91 (  2%)   11029 kB (  0%)
 |overload resolution               :   0.81 (  0%)   1.39 (  4%)   2.21 (  0%)  178591 kB (  2%)
 garbage collection                 :   4.02 (  0%)   0.06 (  0%)   4.09 (  0%)       0 kB (  0%)
 dump files                         :   0.59 (  0%)   0.65 (  2%)   1.09 (  0%)       0 kB (  0%)
 callgraph construction             :   0.60 (  0%)   0.23 (  1%)   0.90 (  0%)  194877 kB (  2%)
 callgraph optimization             :   0.10 (  0%)   0.25 (  1%)   0.45 (  0%)       0 kB (  0%)
 callgraph ipa passes               :  14.80 (  1%)  27.79 ( 71%)  42.59 (  4%) 3386963 kB ( 38%)
 ipa dead code removal              :   0.26 (  0%)   0.00 (  0%)   0.25 (  0%)       0 kB (  0%)
 ipa inheritance graph              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      13 kB (  0%)
 ipa inlining heuristics            :   0.42 (  0%)   0.01 (  0%)   0.40 (  0%)       0 kB (  0%)
 ipa comdats                        :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa HSA                            :   0.05 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 ipa free lang data                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 ipa free inline summary            :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 cfg construction                   :   0.11 (  0%)   0.04 (  0%)   0.11 (  0%)    9811 kB (  0%)
 cfg cleanup                        :   0.61 (  0%)   0.09 (  0%)   0.66 (  0%)   27409 kB (  0%)
 trivially dead code                :   0.99 (  0%)   0.04 (  0%)   0.95 (  0%)       0 kB (  0%)
 df scan insns                      :   4.52 (  0%)   0.09 (  0%)   4.58 (  0%)     355 kB (  0%)
 df live regs                       :   1.77 (  0%)   0.03 (  0%)   1.92 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   2.49 (  0%)   0.04 (  0%)   2.45 (  0%)  166541 kB (  2%)
 register information               :   0.84 (  0%)   0.01 (  0%)   0.87 (  0%)       0 kB (  0%)
 alias analysis                     :   0.63 (  0%)   0.00 (  0%)   0.76 (  0%)   40190 kB (  0%)
 alias stmt walking                 :   0.01 (  0%)   0.03 (  0%)   0.01 (  0%)    1315 kB (  0%)
 rebuild jump labels                :   0.53 (  0%)   0.04 (  0%)   0.60 (  0%)      29 kB (  0%)
 preprocessing                      :   0.05 (  0%)   1.18 (  3%)   1.32 (  0%)    6073 kB (  0%)
 parser (global)                    :   0.14 (  0%)   1.31 (  3%)   1.39 (  0%)   42373 kB (  0%)
 parser struct body                 :   0.38 (  0%)   0.59 (  1%)  23.16 (  2%)   65937 kB (  1%)
 parser function body               :   0.05 (  0%)   0.17 (  0%)   0.21 (  0%)    6895 kB (  0%)
 parser inl. func. body             :   0.05 (  0%)   0.11 (  0%)   0.15 (  0%)    8859 kB (  0%)
 parser inl. meth. body             :   1.66 (  0%)   2.57 (  7%)   4.46 (  0%)  323556 kB (  4%)
 template instantiation             :   0.26 (  0%)   0.30 (  1%)   0.48 (  0%)   37749 kB (  0%)
 constant expression evaluation     :   0.17 (  0%)   0.31 (  1%)   0.43 (  0%)   20153 kB (  0%)
 early inlining heuristics          :   0.14 (  0%)   0.02 (  0%)   0.17 (  0%)   92308 kB (  1%)
 inline parameters                  :   0.69 (  0%)   0.05 (  0%)   0.75 (  0%)   41948 kB (  0%)
 integration                        :   6.35 (  1%)  14.93 ( 38%)  21.21 (  2%) 2610803 kB ( 29%)
 tree gimplify                      :   0.07 (  0%)   0.15 (  0%)   0.20 (  0%)   61878 kB (  1%)
 tree eh                            :   0.03 (  0%)   0.01 (  0%)   0.04 (  0%)    1672 kB (  0%)
 tree CFG construction              :   0.01 (  0%)   0.03 (  0%)   0.02 (  0%)   15077 kB (  0%)
 tree CFG cleanup                   :   0.98 (  0%)   0.12 (  0%)   1.30 (  0%)      24 kB (  0%)
 tree PHI insertion                 :   0.01 (  0%)   0.03 (  0%)   0.01 (  0%)    1973 kB (  0%)
 tree SSA rewrite                   :   0.68 (  0%)   0.09 (  0%)   0.85 (  0%)  253633 kB (  3%)
 tree SSA other                     :   0.07 (  0%)   0.27 (  1%)   0.32 (  0%)    1217 kB (  0%)
 tree SSA incremental               :   0.62 (  0%)   0.02 (  0%)   0.64 (  0%)    8843 kB (  0%)
 tree operand scan                  :   3.00 (  0%)  11.59 ( 29%)  14.61 (  1%)  241325 kB (  3%)
 tree switch lowering               :   0.06 (  0%)   0.00 (  0%)   0.13 (  0%)    2901 kB (  0%)
 dominance frontiers                :   0.12 (  0%)   0.05 (  0%)   0.05 (  0%)       0 kB (  0%)
 dominance computation              :   0.50 (  0%)   0.13 (  0%)   0.68 (  0%)       0 kB (  0%)
 out of ssa                         :   1.27 (  0%)   0.03 (  0%)   1.21 (  0%)    1278 kB (  0%)
 expand vars                        :1006.80 ( 91%)   1.05 (  3%)1008.34 ( 86%)  243594 kB (  3%)
 expand                             :   5.83 (  1%)   0.29 (  1%)   6.25 (  1%) 1696029 kB ( 19%)
 post expand cleanups               :   0.59 (  0%)   0.05 (  0%)   0.67 (  0%)   11146 kB (  0%)
 varconst                           :   0.05 (  0%)   0.05 (  0%)   0.05 (  0%)       7 kB (  0%)
 jump                               :   0.04 (  0%)   0.01 (  0%)   0.04 (  0%)       0 kB (  0%)
 loop init                          :   0.55 (  0%)   0.04 (  0%)   0.65 (  0%)    6471 kB (  0%)
 loop fini                          :   0.00 (  0%)   0.01 (  0%)   0.03 (  0%)       0 kB (  0%)
 mode switching                     :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 integrated RA                      :  18.87 (  2%)   0.51 (  1%)  19.35 (  2%)  984328 kB ( 11%)
 LRA non-specific                   :   7.62 (  1%)   0.20 (  1%)   7.70 (  1%)    9466 kB (  0%)
 LRA virtuals elimination           :   2.94 (  0%)   0.03 (  0%)   2.98 (  0%)  262009 kB (  3%)
 LRA reload inheritance             :   0.60 (  0%)   0.01 (  0%)   0.59 (  0%)     438 kB (  0%)
 LRA create live ranges             :   2.30 (  0%)   0.06 (  0%)   2.30 (  0%)     633 kB (  0%)
 LRA hard reg assignment            :   0.53 (  0%)   0.02 (  0%)   0.58 (  0%)       0 kB (  0%)
 reload                             :   0.14 (  0%)   0.01 (  0%)   0.14 (  0%)       0 kB (  0%)
 thread pro- & epilogue             :   2.53 (  0%)   0.07 (  0%)   2.72 (  0%)   11866 kB (  0%)
 machine dep reorg                  :   0.03 (  0%)   0.01 (  0%)   0.06 (  0%)       0 kB (  0%)
 shorten branches                   :   1.82 (  0%)   0.04 (  0%)   1.87 (  0%)       6 kB (  0%)
 reg stack                          :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 final                              :   7.01 (  1%)   0.38 (  1%)  10.83 (  1%)  220941 kB (  2%)
 symout                             :   3.84 (  0%)   0.15 (  0%)   7.05 (  1%)  862027 kB ( 10%)
 uninit var analysis                :   0.00 (  0%)   0.02 (  0%)   0.09 (  0%)       0 kB (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      12 kB (  0%)
 rest of compilation                :   5.17 (  0%)   0.65 (  2%)   5.53 (  0%)  293297 kB (  3%)
 unaccounted post reload            :   0.00 (  0%)   0.02 (  0%)   0.00 (  0%)       0 kB (  0%)
 unaccounted late compilation       :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 repair loop structures             :   0.04 (  0%)   0.02 (  0%)   0.08 (  0%)       0 kB (  0%)
 TOTAL                              :1103.29         39.37       1171.95        8890741 kB

Mike Hommey [:glandium]

Reporter

Comment 1

5 years ago

Actually, even with clang, it's taking a (relatively) long time (74s). Output of -O0 -ftime-report with clang 11:

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  33.9637 ( 97.3%)  37.7902 ( 99.9%)  71.7539 ( 98.7%)  71.7848 ( 98.7%)  Code Generation Time
   0.9260 (  2.7%)   0.0432 (  0.1%)   0.9692 (  1.3%)   0.9741 (  1.3%)  LLVM IR Generation Time
  34.8897 (100.0%)  37.8334 (100.0%)  72.7231 (100.0%)  72.7589 (100.0%)  Total

7 warnings generated.
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 16.7170 seconds (16.6894 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.9731 ( 20.2%)   0.3632 ( 18.3%)   3.3362 ( 20.0%)   3.3313 ( 20.0%)  Instruction Selection
   1.9257 ( 13.1%)   0.2700 ( 13.6%)   2.1957 ( 13.1%)   2.1907 ( 13.1%)  DAG Combining 2
   1.8347 ( 12.5%)   0.2232 ( 11.2%)   2.0579 ( 12.3%)   2.0546 ( 12.3%)  DAG Combining after legalize types
   1.6657 ( 11.3%)   0.3310 ( 16.7%)   1.9966 ( 11.9%)   2.0016 ( 12.0%)  DAG Combining 1
   1.4948 ( 10.1%)   0.1718 (  8.7%)   1.6666 ( 10.0%)   1.6627 ( 10.0%)  Type Legalization
   1.1966 (  8.1%)   0.1791 (  9.0%)   1.3758 (  8.2%)   1.3728 (  8.2%)  Instruction Scheduling
   1.1459 (  7.8%)   0.0753 (  3.8%)   1.2212 (  7.3%)   1.2193 (  7.3%)  DAG Combining after legalize vectors
   0.8639 (  5.9%)   0.1249 (  6.3%)   0.9888 (  5.9%)   0.9852 (  5.9%)  Instruction Creation
   0.7207 (  4.9%)   0.1078 (  5.4%)   0.8285 (  5.0%)   0.8253 (  4.9%)  DAG Legalization
   0.6559 (  4.5%)   0.1080 (  5.4%)   0.7640 (  4.6%)   0.7620 (  4.6%)  Vector Legalization
   0.1862 (  1.3%)   0.0126 (  0.6%)   0.1988 (  1.2%)   0.1980 (  1.2%)  Type Legalization 2
   0.0680 (  0.5%)   0.0189 (  1.0%)   0.0869 (  0.5%)   0.0859 (  0.5%)  Instruction Scheduling Cleanup
  14.7312 (100.0%)   1.9858 (100.0%)  16.7170 (100.0%)  16.6894 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 19.5061 seconds (19.5838 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.4645 ( 65.5%)   8.1807 ( 52.0%)  10.6452 ( 54.6%)  10.7012 ( 54.6%)  Debug Info Emission
   1.2884 ( 34.3%)   7.5648 ( 48.0%)   8.8533 ( 45.4%)   8.8749 ( 45.3%)  DWARF Exception Writer
   0.0077 (  0.2%)   0.0000 (  0.0%)   0.0077 (  0.0%)   0.0077 (  0.0%)  DWARF Debug Writer
   3.7606 (100.0%)  15.7456 (100.0%)  19.5061 (100.0%)  19.5838 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 69.2811 seconds (69.3056 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.3674 ( 19.7%)  32.6874 ( 88.5%)  39.0547 ( 56.4%)  39.0674 ( 56.4%)  X86 Assembly Printer
  17.9268 ( 55.4%)   2.7817 (  7.5%)  20.7085 ( 29.9%)  20.7183 ( 29.9%)  X86 DAG->DAG Instruction Selection
   4.7404 ( 14.7%)   0.1805 (  0.5%)   4.9210 (  7.1%)   4.9229 (  7.1%)  Inliner for always_inline functions
   0.8175 (  2.5%)   0.1491 (  0.4%)   0.9666 (  1.4%)   0.9663 (  1.4%)  Live DEBUG_VALUE analysis
   0.7007 (  2.2%)   0.0794 (  0.2%)   0.7801 (  1.1%)   0.7801 (  1.1%)  Prologue/Epilogue Insertion & Frame Finalization
   0.4281 (  1.3%)   0.0872 (  0.2%)   0.5152 (  0.7%)   0.5148 (  0.7%)  Fast Register Allocator
   0.3210 (  1.0%)   0.1400 (  0.4%)   0.4610 (  0.7%)   0.4608 (  0.7%)  Insert stack protectors
   0.1848 (  0.6%)   0.0390 (  0.1%)   0.2238 (  0.3%)   0.2237 (  0.3%)  Two-Address instruction pass
   0.0775 (  0.2%)   0.0662 (  0.2%)   0.1436 (  0.2%)   0.1436 (  0.2%)  Expand Atomic instructions
   0.0965 (  0.3%)   0.0273 (  0.1%)   0.1239 (  0.2%)   0.1236 (  0.2%)  Check CFA info and insert CFI instructions if needed
   0.0746 (  0.2%)   0.0192 (  0.1%)   0.0938 (  0.1%)   0.0936 (  0.1%)  Finalize ISel and expand pseudo-instructions
   0.0400 (  0.1%)   0.0332 (  0.1%)   0.0732 (  0.1%)   0.0729 (  0.1%)  Free MachineFunction
   0.0487 (  0.2%)   0.0238 (  0.1%)   0.0726 (  0.1%)   0.0725 (  0.1%)  MachineDominator Tree Construction
   0.0525 (  0.2%)   0.0180 (  0.0%)   0.0705 (  0.1%)   0.0704 (  0.1%)  Post-RA pseudo instruction expansion pass
   0.0516 (  0.2%)   0.0170 (  0.0%)   0.0686 (  0.1%)   0.0685 (  0.1%)  X86 pseudo instruction expansion pass
   0.0408 (  0.1%)   0.0259 (  0.1%)   0.0666 (  0.1%)   0.0666 (  0.1%)  Lower constant intrinsics
   0.0458 (  0.1%)   0.0190 (  0.1%)   0.0648 (  0.1%)   0.0648 (  0.1%)  X86 EFLAGS copy lowering
   0.0351 (  0.1%)   0.0220 (  0.1%)   0.0571 (  0.1%)   0.0569 (  0.1%)  Expand reduction intrinsics
   0.0339 (  0.1%)   0.0221 (  0.1%)   0.0560 (  0.1%)   0.0560 (  0.1%)  Scalarize Masked Memory Intrinsics
   0.0386 (  0.1%)   0.0006 (  0.0%)   0.0392 (  0.1%)   0.0392 (  0.1%)  CallGraph Construction
   0.0180 (  0.1%)   0.0161 (  0.0%)   0.0341 (  0.0%)   0.0340 (  0.0%)  Eliminate PHI nodes for register allocation
   0.0081 (  0.0%)   0.0220 (  0.1%)   0.0301 (  0.0%)   0.0301 (  0.0%)  Exception handling preparation
   0.0127 (  0.0%)   0.0143 (  0.0%)   0.0269 (  0.0%)   0.0269 (  0.0%)  Bundle Machine CFG Edges
   0.0084 (  0.0%)   0.0170 (  0.0%)   0.0254 (  0.0%)   0.0253 (  0.0%)  Remove unreachable blocks from the CFG
   0.0015 (  0.0%)   0.0202 (  0.1%)   0.0217 (  0.0%)   0.0217 (  0.0%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0070 (  0.0%)   0.0138 (  0.0%)   0.0209 (  0.0%)   0.0209 (  0.0%)  X86 Indirect Branch Tracking
   0.0068 (  0.0%)   0.0133 (  0.0%)   0.0201 (  0.0%)   0.0203 (  0.0%)  Insert fentry calls
   0.0044 (  0.0%)   0.0159 (  0.0%)   0.0203 (  0.0%)   0.0203 (  0.0%)  Expand indirectbr instructions
   0.0067 (  0.0%)   0.0132 (  0.0%)   0.0199 (  0.0%)   0.0200 (  0.0%)  Machine Optimization Remark Emitter
   0.0067 (  0.0%)   0.0133 (  0.0%)   0.0200 (  0.0%)   0.0200 (  0.0%)  Insert XRay ops
   0.0066 (  0.0%)   0.0131 (  0.0%)   0.0197 (  0.0%)   0.0198 (  0.0%)  X86 PIC Global Base Reg Initialization
   0.0065 (  0.0%)   0.0131 (  0.0%)   0.0196 (  0.0%)   0.0196 (  0.0%)  Implement the 'patchable-function' attribute
   0.0059 (  0.0%)   0.0137 (  0.0%)   0.0195 (  0.0%)   0.0196 (  0.0%)  Machine Optimization Remark Emitter #2
   0.0065 (  0.0%)   0.0130 (  0.0%)   0.0194 (  0.0%)   0.0196 (  0.0%)  Local Stack Slot Allocation
   0.0064 (  0.0%)   0.0131 (  0.0%)   0.0195 (  0.0%)   0.0196 (  0.0%)  StackMap Liveness Analysis
   0.0064 (  0.0%)   0.0130 (  0.0%)   0.0194 (  0.0%)   0.0195 (  0.0%)  Contiguously Lay Out Funclets
   0.0064 (  0.0%)   0.0129 (  0.0%)   0.0193 (  0.0%)   0.0195 (  0.0%)  Lazy Machine Block Frequency Analysis
   0.0057 (  0.0%)   0.0135 (  0.0%)   0.0193 (  0.0%)   0.0194 (  0.0%)  Lazy Machine Block Frequency Analysis #2
   0.0065 (  0.0%)   0.0129 (  0.0%)   0.0194 (  0.0%)   0.0194 (  0.0%)  X86 Indirect Thunks
   0.0064 (  0.0%)   0.0129 (  0.0%)   0.0193 (  0.0%)   0.0193 (  0.0%)  Analyze Machine Code For Garbage Collection
   0.0064 (  0.0%)   0.0128 (  0.0%)   0.0192 (  0.0%)   0.0193 (  0.0%)  X86 WinAlloca Expander
   0.0063 (  0.0%)   0.0127 (  0.0%)   0.0190 (  0.0%)   0.0193 (  0.0%)  X86 Speculative Execution Side Effect Suppression
   0.0064 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0193 (  0.0%)  Fixup Statepoint Caller Saved
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0192 (  0.0%)   0.0193 (  0.0%)  X86 FP Stackifier
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  X86 insert wait instruction
   0.0050 (  0.0%)   0.0140 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  Safe Stack instrumentation pass
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  X86 speculative load hardening
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0191 (  0.0%)  Compressing EVEX instrs to VEX encoding when possible
   0.0064 (  0.0%)   0.0127 (  0.0%)   0.0191 (  0.0%)   0.0191 (  0.0%)  X86 vzeroupper inserter
   0.0064 (  0.0%)   0.0126 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Load Value Injection (LVI) Ret-Hardening
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Discriminate Memory Operands
   0.0062 (  0.0%)   0.0127 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Insert Cache Prefetches
   0.0041 (  0.0%)   0.0149 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0041 (  0.0%)   0.0146 (  0.0%)   0.0187 (  0.0%)   0.0187 (  0.0%)  Lower Garbage Collection Instructions
   0.0041 (  0.0%)   0.0146 (  0.0%)   0.0186 (  0.0%)   0.0187 (  0.0%)  Shadow Stack GC Lowering
   0.0006 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)   0.0006 (  0.0%)  Assumption Cache Tracker
   0.0005 (  0.0%)   0.0000 (  0.0%)   0.0005 (  0.0%)   0.0005 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
  32.3536 (100.0%)  36.9274 (100.0%)  69.2811 (100.0%)  69.3056 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 74.2717 seconds (74.3039 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.3765 (100.0%)  37.8952 (100.0%)  74.2717 (100.0%)  74.3039 (100.0%)  Clang front-end timer
  36.3765 (100.0%)  37.8952 (100.0%)  74.2717 (100.0%)  74.3039 (100.0%)  Total

Note: in both cases (comment 0 and this one), the code is from the release branch.

Mike Hommey [:glandium]

Reporter

Comment 2

5 years ago

Edited

It was taking 37 seconds with GCC (at -O2) in FF86. Edit: scrap that, I was inadvertently using clang in the FF86 build.

Mike Hommey [:glandium]

Reporter

Comment 3

5 years ago

It was actually 1:20 in FF86, so it's still a regression compared to that version.

Mike Hommey [:glandium]

Reporter

Comment 4

5 years ago

Regressions:
1mn20 -> 1mn50: bug 1688820
1mn20 -> 3mn40: bug 1689933
3mn40 -> 24mn: bug 1674524
Improvement:
24mn -> 15mn: bug 1690967
Re-regression:
15mn -> 19mn25: bug 1678119 (interestingly, with an intermediate regression to 40mn when the last patch is left out)
19mn25 -> 24mn40: bug 1692731
I didn't look further to find where we went up to 27mn30 on the current release branch.

gl.o as built by GCC 10 on the current release branch contains 9.7MB of code, vs. 2.7MB as built by clang, which sounds like an awful lot, even for clang.
About 60% of the CPU time is spent in the assembler. So not only does GCC take a lot of time (and memory, > 15GB!) generating a large amount of code, but gas also takes a large amount of time (and memory, > 6GB) processing that code into an object file.
g++ -S generates the assembly in 10 minutes, and the resulting file is 1.37GB!! Removing -g brings that down to 5mn44 and 42.6MB.

Also worth noting that GCC 11 (gcc-11 (Debian 11-20210319-1) 11.0.1 20210319 (experimental) [master revision bd9b262fa92:73ac0472bc3:287e3e8466f44f9d395a2e4dcfcda56cc34ceb1c]) is even slower (cc1plus has used 85 minutes of CPU time so far, and it's still running).

It might be worth filing bugs against both GCC and binutils.

Regressed by: sw-wr-perf-mix-blend, 1689933, 1688820, sw-wr-perf-alpha

BMO Automation

Updated

5 years ago

Has Regression Range: --- → yes

BugBot [:suhaib / :marco/ :calixte]

Updated

5 years ago

Keywords: regression

Martin Stránský [:stransky] (ni? me)

Comment 5

5 years ago

Guys, any idea here? I see that on Fedora too.

Flags: needinfo?(mliska)

Flags: needinfo?(jh)

Andrew Osmond [:aosmond] (he/him)

Updated

5 years ago

Blocks: sw-wr

Severity: -- → S3

Priority: -- → P3

Martin Liška

Comment 6

5 years ago

@glandium: Can you please attach a pre-processed source file (ideally made from GCC 10)?

Mike Hommey [:glandium]

Reporter

Comment 7

5 years ago

Attached file gl.ii.gz

This is the preprocessed code that takes 27 minutes to build on my machine with GCC 10 and, wait for it, 4 hours(!) with GCC 11.

Build with g++ -o gl.o -c gl.ii -O2 -std=gnu++17. There are other flags on the actual command line building the file, but that's enough to take 5mn40. Add -g for > 10 minutes.

Note that because of the amount of memory this requires, this fails to build on Debian i386, which doesn't cross-compile.

Lee Salzman [:lsalzman]

Assignee

Comment 8

5 years ago

(In reply to Mike Hommey [:glandium] from comment #7)

> Created attachment 9211361 [details]
> gl.ii.gz
>
> This is the preprocessed code that takes 27 minutes to build on my machine with GCC 10 and, wait for it, 4 hours(!) with GCC 11.
>
> Build with g++ -o gl.o -c gl.ii -O2 -std=gnu++17. There are other flags on the actual command line building the file, but that's enough to take 5mn40. Add -g for > 10 minutes.
>
> Note that because of the amount of memory this requires, this fails to build on Debian i386, which doesn't cross-compile.

Just preliminarily playing around with that file: worst-case scenario, changing ALWAYS_INLINE to a plain inline when building with GCC would seemingly fix it. I'm not sure yet whether there is anything more targeted we could do.

Lee Salzman [:lsalzman]

Assignee

Comment 9

5 years ago

It seems like the ALWAYS_INLINE on blend_pixels() is one of the biggest offenders for compile times. The problem is that almost every pixel that gets blended to the screen goes through that path, and the call overhead of passing SIMD vectors around when that function is not inlined often shows up as a performance blocker in profiles in the wild. But if, worst-case scenario, falling back to normal inlining at least keeps GCC from freaking out, I guess it might be one solution. Perhaps there is some other function attribute we can use to signal to GCC to inline it more often than not, though __attribute__((hot)) did not seem to change much compared to the default inlining...
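
To make the trade-off concrete, here is a minimal sketch (not the SWGL code) of one small SIMD helper under the three inlining hints discussed in this comment. Float4, blend_forced, blend_hot and blend_default are hypothetical names chosen for illustration; vector_size, always_inline and hot are GCC/Clang extensions. The hints only influence the inliner, so all three variants compute the same values.

```cpp
// Illustrative only: Float4 and the blend_* helpers are made-up names,
// not the real SWGL blend_pixels().
typedef float Float4 __attribute__((vector_size(16)));  // 4 packed floats

// Forced inlining: the body is copied into every caller regardless of size.
// This avoids call overhead but multiplies the amount of generated code.
__attribute__((always_inline)) inline Float4 blend_forced(Float4 src,
                                                          Float4 dst,
                                                          Float4 inv_alpha) {
  return src + dst * inv_alpha;
}

// Hint only: marks the function as a hot spot. The compiler still decides
// for itself whether inlining is worthwhile at each call site.
__attribute__((hot)) inline Float4 blend_hot(Float4 src, Float4 dst,
                                             Float4 inv_alpha) {
  return src + dst * inv_alpha;
}

// Default heuristics: plain inline, no extra hint.
inline Float4 blend_default(Float4 src, Float4 dst, Float4 inv_alpha) {
  return src + dst * inv_alpha;
}
```

Which variant ends up inlined depends on the compiler version, optimization level and the size of the real function, so the effect has to be measured rather than assumed.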

Martin Liška

Comment 10

5 years ago

Thanks for the test-case. Please report the issue to GCC upstream bugzilla.
Btw. what is the purpose of the attached source code and why is it so huge?

Flags: needinfo?(mliska)

Lee Salzman [:lsalzman]

Assignee

Comment 11

5 years ago

(In reply to Martin Liška from comment #10)

> Thanks for the test-case. Please report the issue to GCC upstream bugzilla.
> Btw. what is the purpose of the attached source code and why is it so huge?

It's generating precompiled shaders and has to do so for all usable permutations up front, and as such, the amount of code that needs to be compiled in principle is just always going to be huge because we have so many shader permutations. The size of the file is a red herring, because even if you broke it up into multiple files, the amount of code it would have to compile (because of duplicating a bunch of preludes) would just increase.
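
As a rough illustration of why the amount of code grows with the number of permutations, here is a hedged sketch using templates. The feature flags and run_shader are hypothetical stand-ins, not the real shader set; the point is only that every distinct combination is a separate function the compiler has to generate, and each one carries its own copy of anything force-inlined into it.

```cpp
#include <cstddef>

// Hypothetical shader with two boolean features. The real shaders and their
// options are not shown in this bug.
template <bool kAlphaPass, bool kTextured>
int run_shader(int pixel) {
  int out = pixel;
  if (kTextured) out += 10;  // stand-in for texture sampling
  if (kAlphaPass) out *= 2;  // stand-in for a blend step
  return out;
}

// With n independent boolean features there are 2^n permutations to compile.
constexpr std::size_t permutation_count(std::size_t boolean_features) {
  return std::size_t{1} << boolean_features;
}

// Taking the address of each instantiation forces all of them to be emitted.
using ShaderFn = int (*)(int);
const ShaderFn kShaders[] = {
    run_shader<false, false>, run_shader<false, true>,
    run_shader<true, false>,  run_shader<true, true>,
};

static_assert(sizeof(kShaders) / sizeof(kShaders[0]) == permutation_count(2),
              "one table entry per permutation");
```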

Martin Liška

Comment 12

5 years ago

I see. So my first recommendation would be to remove the mentioned always_inline and inline keywords.
Can you please prepare preprocessed source files for these?

Lee Salzman [:lsalzman]

Assignee

Comment 13

5 years ago

Attached file Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

It seems like blend_pixels is the major offender in terms of slowing down GCC
compile times, so let's annotate such functions with PREFER_INLINE. Then, we ensure
that PREFER_INLINE disables always_inline for either GCC or debug builds so that
in cases where we know the compiler is sensitive to binary bloat, we aren't forcing
inlining.
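
The patch itself is not reproduced in this bug, so the following is only a sketch of what such a macro pair could look like. ALWAYS_INLINE and PREFER_INLINE are the names used above; the exact conditions, the use of NDEBUG as the debug-build switch, and the blend_over stand-in function are assumptions made for illustration.

```cpp
#include <cstdint>

// Sketch only: the conditions below are assumed, not copied from the patch.
#define ALWAYS_INLINE __attribute__((always_inline)) inline

// Clang also defines __GNUC__, so "really GCC" has to exclude __clang__.
// NDEBUG stands in for whatever debug-build macro the tree actually uses.
#if (defined(__GNUC__) && !defined(__clang__)) || !defined(NDEBUG)
#  define PREFER_INLINE inline         // leave the decision to the compiler
#else
#  define PREFER_INLINE ALWAYS_INLINE  // force inlining where it is cheap
#endif

// Hypothetical stand-in for a large hot function such as blend_pixels():
// premultiplied "source over destination" on one packed 8-bit RGBA pixel.
PREFER_INLINE uint32_t blend_over(uint32_t src, uint32_t dst) {
  uint32_t inv_alpha = 255u - (src >> 24);
  uint32_t out = 0;
  for (int shift = 0; shift < 32; shift += 8) {
    uint32_t s = (src >> shift) & 0xFFu;
    uint32_t d = (dst >> shift) & 0xFFu;
    uint32_t c = s + (d * inv_alpha + 127u) / 255u;  // rounded scale by 1-a
    out |= (c > 255u ? 255u : c) << shift;           // clamp to 8 bits
  }
  return out;
}
```

Call sites do not change: functions keep a single annotation, and only the macro definition decides whether inlining is forced for a given compiler and build type.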

Phabricator Automation

Updated

5 years ago

Assignee: nobody → lsalzman

Status: NEW → ASSIGNED

Lee Salzman [:lsalzman]

Assignee

Comment 14

5 years ago

Mike, I put up a patch to set up a PREFER_INLINE alias to address this. Is annotating blend_pixels good enough to alleviate the issue here, or are there other functions we need to annotate that you're seeing?

Also, would __attribute__((hot)) be appropriate for PREFER_INLINE, to still signal to GCC that even though we aren't forcing inlining, we would really like it to inline if it makes sense?

Flags: needinfo?(mh+mozilla)

Mike Hommey [:glandium]

Reporter

Comment 15

5 years ago

There is some chatter in the upstream bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99785

Flags: needinfo?(mh+mozilla)

Lee Salzman [:lsalzman]

Assignee

Comment 16

5 years ago

Edited

Did some experiments removing always_inline from blend_pixels to see what the performance difference currently is: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&newProject=try&newRevision=6a176b1e685fa622fb15be32afab1a6c008c211c&framework=1&selectedTimeRange=172800

Looks like there are performance regressions between 9% and 26% (tsvg, tscroll, etc.), even on the shippable builds. So this pretty much matches what I remember: not inlining this function really hurts us.

Pulsebot

Comment 17

5 years ago

Pushed by lsalzman@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/7256e03e71df Use PREFER_INLINE on large SWGL functions. r=jrmuizel

Alexandru Michis [:malexandru]

Comment 18

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/7256e03e71df

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

status-firefox89: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 89 Branch

Mike Hommey [:glandium]

Reporter

Comment 19

5 years ago

Can we get this on beta, and release, if there's going to be a dot-release?

Flags: needinfo?(lsalzman)

Ryan VanderMeulen [:RyanVM]

Updated

5 years ago

status-firefox86: --- → unaffected

status-firefox87: --- → fix-optional

status-firefox88: --- → fix-optional

status-firefox-esr78: --- → unaffected

tt_1

Comment 20

5 years ago

without the patch:

firefox-87.0 with gcc-9.3.0 is 40 minutes
firefox-87.0 with gcc-10.2.0 is 50 minutes

with the patch:

firefox-87.0 with gcc-9.3.0 is 33 minutes
firefox-87.0 with gcc-10.2.0 is 34 minutes

Also, this fixes the armv7 build, where gl.cc was seemingly unable to get enough RAM on a cross-compile host with 32 GB of it.
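
For readers who want the relative savings rather than the raw minutes, here is a small helper sketch; percent_saved is a hypothetical name, and the inputs are simply the build times quoted in this comment.

```cpp
#include <cassert>

// Percentage of build time saved when going from `before` to `after`.
// The units cancel, so minutes in gives a plain percentage out.
double percent_saved(double before, double after) {
  assert(before > 0.0);
  return 100.0 * (before - after) / before;
}
```

With the numbers above, percent_saved(40, 33) gives 17.5 for gcc-9.3.0 and percent_saved(50, 34) gives 32 for gcc-10.2.0.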

Lee Salzman [:lsalzman]

Assignee

Comment 21

5 years ago

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

Beta/Release Uplift Approval Request

User impact if declined: Slow GCC build times.
Is this code covered by automated tests?: Yes
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): This only affects GCC builds, and we mostly build with Clang.
String changes made/needed:

Flags: needinfo?(lsalzman)

Attachment #9211599 - Flags: approval-mozilla-release?

Attachment #9211599 - Flags: approval-mozilla-beta?

Ryan VanderMeulen [:RyanVM]

Comment 22

5 years ago

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

Approved for 88.0b6.

Attachment #9211599 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Ryan VanderMeulen [:RyanVM]

Comment 23

5 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/2520e467aa3d

status-firefox88: fix-optional → fixed

Ryan VanderMeulen [:RyanVM]

Comment 24

5 years ago

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

88 is in RC now and there are no plans for an 87 dot release.

Attachment #9211599 - Flags: approval-mozilla-release? → approval-mozilla-release-

Ryan VanderMeulen [:RyanVM]

Updated

5 years ago

status-firefox87: fix-optional → wontfix

Flags: needinfo?(jh)

gl.ii.gz (707.37 KB, application/gzip), Mike Hommey [:glandium], 5 years ago
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium (48 bytes, text/x-phabricator-request), Lee Salzman [:lsalzman], 5 years ago. Flags: RyanVM: approval-mozilla-beta+, RyanVM: approval-mozilla-release-