Closed Bug 1700520 Opened 7 months ago Closed 7 months ago

GCC takes a long time to compile gl.cc

Categories

(Core :: Graphics, defect, P3)

RESOLVED FIXED
89 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox86 --- unaffected
firefox87 --- wontfix
firefox88 --- fixed
firefox89 --- fixed

People

(Reporter: glandium, Assigned: lsalzman)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files)

-ftime-report output at -O0:

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)    1381 kB (  0%)
 phase parsing                      :   2.68 (  0%)   6.41 ( 16%)  31.30 (  3%)  507182 kB (  6%)
 phase lang. deferred               :   0.37 (  0%)   0.23 (  1%)   0.61 (  0%)   37757 kB (  0%)
 phase opt and generate             :1098.11 (100%)  32.65 ( 83%)1134.78 ( 97%) 8287368 kB ( 93%)
 phase last asm                     :   2.13 (  0%)   0.08 (  0%)   5.25 (  0%)   57039 kB (  1%)
 |name lookup                       :   0.70 (  0%)   0.95 (  2%)  23.91 (  2%)   11029 kB (  0%)
 |overload resolution               :   0.81 (  0%)   1.39 (  4%)   2.21 (  0%)  178591 kB (  2%)
 garbage collection                 :   4.02 (  0%)   0.06 (  0%)   4.09 (  0%)       0 kB (  0%)
 dump files                         :   0.59 (  0%)   0.65 (  2%)   1.09 (  0%)       0 kB (  0%)
 callgraph construction             :   0.60 (  0%)   0.23 (  1%)   0.90 (  0%)  194877 kB (  2%)
 callgraph optimization             :   0.10 (  0%)   0.25 (  1%)   0.45 (  0%)       0 kB (  0%)
 callgraph ipa passes               :  14.80 (  1%)  27.79 ( 71%)  42.59 (  4%) 3386963 kB ( 38%)
 ipa dead code removal              :   0.26 (  0%)   0.00 (  0%)   0.25 (  0%)       0 kB (  0%)
 ipa inheritance graph              :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      13 kB (  0%)
 ipa inlining heuristics            :   0.42 (  0%)   0.01 (  0%)   0.40 (  0%)       0 kB (  0%)
 ipa comdats                        :   0.02 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 ipa HSA                            :   0.05 (  0%)   0.00 (  0%)   0.05 (  0%)       0 kB (  0%)
 ipa free lang data                 :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 ipa free inline summary            :   0.01 (  0%)   0.00 (  0%)   0.01 (  0%)       0 kB (  0%)
 cfg construction                   :   0.11 (  0%)   0.04 (  0%)   0.11 (  0%)    9811 kB (  0%)
 cfg cleanup                        :   0.61 (  0%)   0.09 (  0%)   0.66 (  0%)   27409 kB (  0%)
 trivially dead code                :   0.99 (  0%)   0.04 (  0%)   0.95 (  0%)       0 kB (  0%)
 df scan insns                      :   4.52 (  0%)   0.09 (  0%)   4.58 (  0%)     355 kB (  0%)
 df live regs                       :   1.77 (  0%)   0.03 (  0%)   1.92 (  0%)       0 kB (  0%)
 df reg dead/unused notes           :   2.49 (  0%)   0.04 (  0%)   2.45 (  0%)  166541 kB (  2%)
 register information               :   0.84 (  0%)   0.01 (  0%)   0.87 (  0%)       0 kB (  0%)
 alias analysis                     :   0.63 (  0%)   0.00 (  0%)   0.76 (  0%)   40190 kB (  0%)
 alias stmt walking                 :   0.01 (  0%)   0.03 (  0%)   0.01 (  0%)    1315 kB (  0%)
 rebuild jump labels                :   0.53 (  0%)   0.04 (  0%)   0.60 (  0%)      29 kB (  0%)
 preprocessing                      :   0.05 (  0%)   1.18 (  3%)   1.32 (  0%)    6073 kB (  0%)
 parser (global)                    :   0.14 (  0%)   1.31 (  3%)   1.39 (  0%)   42373 kB (  0%)
 parser struct body                 :   0.38 (  0%)   0.59 (  1%)  23.16 (  2%)   65937 kB (  1%)
 parser function body               :   0.05 (  0%)   0.17 (  0%)   0.21 (  0%)    6895 kB (  0%)
 parser inl. func. body             :   0.05 (  0%)   0.11 (  0%)   0.15 (  0%)    8859 kB (  0%)
 parser inl. meth. body             :   1.66 (  0%)   2.57 (  7%)   4.46 (  0%)  323556 kB (  4%)
 template instantiation             :   0.26 (  0%)   0.30 (  1%)   0.48 (  0%)   37749 kB (  0%)
 constant expression evaluation     :   0.17 (  0%)   0.31 (  1%)   0.43 (  0%)   20153 kB (  0%)
 early inlining heuristics          :   0.14 (  0%)   0.02 (  0%)   0.17 (  0%)   92308 kB (  1%)
 inline parameters                  :   0.69 (  0%)   0.05 (  0%)   0.75 (  0%)   41948 kB (  0%)
 integration                        :   6.35 (  1%)  14.93 ( 38%)  21.21 (  2%) 2610803 kB ( 29%)
 tree gimplify                      :   0.07 (  0%)   0.15 (  0%)   0.20 (  0%)   61878 kB (  1%)
 tree eh                            :   0.03 (  0%)   0.01 (  0%)   0.04 (  0%)    1672 kB (  0%)
 tree CFG construction              :   0.01 (  0%)   0.03 (  0%)   0.02 (  0%)   15077 kB (  0%)
 tree CFG cleanup                   :   0.98 (  0%)   0.12 (  0%)   1.30 (  0%)      24 kB (  0%)
 tree PHI insertion                 :   0.01 (  0%)   0.03 (  0%)   0.01 (  0%)    1973 kB (  0%)
 tree SSA rewrite                   :   0.68 (  0%)   0.09 (  0%)   0.85 (  0%)  253633 kB (  3%)
 tree SSA other                     :   0.07 (  0%)   0.27 (  1%)   0.32 (  0%)    1217 kB (  0%)
 tree SSA incremental               :   0.62 (  0%)   0.02 (  0%)   0.64 (  0%)    8843 kB (  0%)
 tree operand scan                  :   3.00 (  0%)  11.59 ( 29%)  14.61 (  1%)  241325 kB (  3%)
 tree switch lowering               :   0.06 (  0%)   0.00 (  0%)   0.13 (  0%)    2901 kB (  0%)
 dominance frontiers                :   0.12 (  0%)   0.05 (  0%)   0.05 (  0%)       0 kB (  0%)
 dominance computation              :   0.50 (  0%)   0.13 (  0%)   0.68 (  0%)       0 kB (  0%)
 out of ssa                         :   1.27 (  0%)   0.03 (  0%)   1.21 (  0%)    1278 kB (  0%)
 expand vars                        :1006.80 ( 91%)   1.05 (  3%)1008.34 ( 86%)  243594 kB (  3%)
 expand                             :   5.83 (  1%)   0.29 (  1%)   6.25 (  1%) 1696029 kB ( 19%)
 post expand cleanups               :   0.59 (  0%)   0.05 (  0%)   0.67 (  0%)   11146 kB (  0%)
 varconst                           :   0.05 (  0%)   0.05 (  0%)   0.05 (  0%)       7 kB (  0%)
 jump                               :   0.04 (  0%)   0.01 (  0%)   0.04 (  0%)       0 kB (  0%)
 loop init                          :   0.55 (  0%)   0.04 (  0%)   0.65 (  0%)    6471 kB (  0%)
 loop fini                          :   0.00 (  0%)   0.01 (  0%)   0.03 (  0%)       0 kB (  0%)
 mode switching                     :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 integrated RA                      :  18.87 (  2%)   0.51 (  1%)  19.35 (  2%)  984328 kB ( 11%)
 LRA non-specific                   :   7.62 (  1%)   0.20 (  1%)   7.70 (  1%)    9466 kB (  0%)
 LRA virtuals elimination           :   2.94 (  0%)   0.03 (  0%)   2.98 (  0%)  262009 kB (  3%)
 LRA reload inheritance             :   0.60 (  0%)   0.01 (  0%)   0.59 (  0%)     438 kB (  0%)
 LRA create live ranges             :   2.30 (  0%)   0.06 (  0%)   2.30 (  0%)     633 kB (  0%)
 LRA hard reg assignment            :   0.53 (  0%)   0.02 (  0%)   0.58 (  0%)       0 kB (  0%)
 reload                             :   0.14 (  0%)   0.01 (  0%)   0.14 (  0%)       0 kB (  0%)
 thread pro- & epilogue             :   2.53 (  0%)   0.07 (  0%)   2.72 (  0%)   11866 kB (  0%)
 machine dep reorg                  :   0.03 (  0%)   0.01 (  0%)   0.06 (  0%)       0 kB (  0%)
 shorten branches                   :   1.82 (  0%)   0.04 (  0%)   1.87 (  0%)       6 kB (  0%)
 reg stack                          :   0.04 (  0%)   0.00 (  0%)   0.04 (  0%)       0 kB (  0%)
 final                              :   7.01 (  1%)   0.38 (  1%)  10.83 (  1%)  220941 kB (  2%)
 symout                             :   3.84 (  0%)   0.15 (  0%)   7.05 (  1%)  862027 kB ( 10%)
 uninit var analysis                :   0.00 (  0%)   0.02 (  0%)   0.09 (  0%)       0 kB (  0%)
 initialize rtl                     :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      12 kB (  0%)
 rest of compilation                :   5.17 (  0%)   0.65 (  2%)   5.53 (  0%)  293297 kB (  3%)
 unaccounted post reload            :   0.00 (  0%)   0.02 (  0%)   0.00 (  0%)       0 kB (  0%)
 unaccounted late compilation       :   0.00 (  0%)   0.00 (  0%)   0.02 (  0%)       0 kB (  0%)
 repair loop structures             :   0.04 (  0%)   0.02 (  0%)   0.08 (  0%)       0 kB (  0%)
 TOTAL                              :1103.29         39.37       1171.95        8890741 kB
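
For picking the dominant pass out of a report like the one above, here is a minimal sketch of a parser. It is not part of GCC or of the Firefox tree; `PassTime` and `parse_time_report` are hypothetical names, and the row layout is assumed to match the output shown here (name, colon, then usr/sys/wall each followed by a percentage).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

// One row of a GCC -ftime-report table (times in seconds).
struct PassTime {
  std::string name;
  double usr = 0, sys = 0, wall = 0;
};

// Hypothetical helper: parse rows shaped like
//   " expand vars   :1006.80 ( 91%)   1.05 (  3%)1008.34 ( 86%) ..."
// Columns can touch each other, as in the report above. Whitespace in a
// scanf format matches zero or more blanks, so touching columns still parse.
std::vector<PassTime> parse_time_report(const std::string& text) {
  std::vector<PassTime> rows;
  std::istringstream in(text);
  std::string line;
  while (std::getline(in, line)) {
    const auto colon = line.find(':');
    if (colon == std::string::npos) continue;  // header or blank line

    PassTime row;
    const int got = std::sscanf(line.c_str() + colon + 1,
                                " %lf ( %*d%%) %lf ( %*d%%) %lf",
                                &row.usr, &row.sys, &row.wall);
    if (got != 3) continue;  // e.g. the TOTAL row has no percentages

    // Trim the pass name that precedes the colon.
    const std::string raw = line.substr(0, colon);
    const auto first = raw.find_first_not_of(" \t");
    if (first == std::string::npos) continue;
    const auto last = raw.find_last_not_of(" \t");
    row.name = raw.substr(first, last - first + 1);
    rows.push_back(row);
  }
  // Slowest pass (by wall time) first.
  std::sort(rows.begin(), rows.end(),
            [](const PassTime& a, const PassTime& b) { return a.wall > b.wall; });
  return rows;
}
```

Note that the "phase ..." rows are aggregates of the individual passes, so a caller would probably want to filter them out before ranking.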

Actually, even with clang, it's taking a (relatively) long time (74s). -O0 -ftime-report output with clang 11:

===-------------------------------------------------------------------------===
                         Miscellaneous Ungrouped Timers
===-------------------------------------------------------------------------===

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  33.9637 ( 97.3%)  37.7902 ( 99.9%)  71.7539 ( 98.7%)  71.7848 ( 98.7%)  Code Generation Time
   0.9260 (  2.7%)   0.0432 (  0.1%)   0.9692 (  1.3%)   0.9741 (  1.3%)  LLVM IR Generation Time
  34.8897 (100.0%)  37.8334 (100.0%)  72.7231 (100.0%)  72.7589 (100.0%)  Total

7 warnings generated.
===-------------------------------------------------------------------------===
                      Instruction Selection and Scheduling
===-------------------------------------------------------------------------===
  Total Execution Time: 16.7170 seconds (16.6894 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.9731 ( 20.2%)   0.3632 ( 18.3%)   3.3362 ( 20.0%)   3.3313 ( 20.0%)  Instruction Selection
   1.9257 ( 13.1%)   0.2700 ( 13.6%)   2.1957 ( 13.1%)   2.1907 ( 13.1%)  DAG Combining 2
   1.8347 ( 12.5%)   0.2232 ( 11.2%)   2.0579 ( 12.3%)   2.0546 ( 12.3%)  DAG Combining after legalize types
   1.6657 ( 11.3%)   0.3310 ( 16.7%)   1.9966 ( 11.9%)   2.0016 ( 12.0%)  DAG Combining 1
   1.4948 ( 10.1%)   0.1718 (  8.7%)   1.6666 ( 10.0%)   1.6627 ( 10.0%)  Type Legalization
   1.1966 (  8.1%)   0.1791 (  9.0%)   1.3758 (  8.2%)   1.3728 (  8.2%)  Instruction Scheduling
   1.1459 (  7.8%)   0.0753 (  3.8%)   1.2212 (  7.3%)   1.2193 (  7.3%)  DAG Combining after legalize vectors
   0.8639 (  5.9%)   0.1249 (  6.3%)   0.9888 (  5.9%)   0.9852 (  5.9%)  Instruction Creation
   0.7207 (  4.9%)   0.1078 (  5.4%)   0.8285 (  5.0%)   0.8253 (  4.9%)  DAG Legalization
   0.6559 (  4.5%)   0.1080 (  5.4%)   0.7640 (  4.6%)   0.7620 (  4.6%)  Vector Legalization
   0.1862 (  1.3%)   0.0126 (  0.6%)   0.1988 (  1.2%)   0.1980 (  1.2%)  Type Legalization 2
   0.0680 (  0.5%)   0.0189 (  1.0%)   0.0869 (  0.5%)   0.0859 (  0.5%)  Instruction Scheduling Cleanup
  14.7312 (100.0%)   1.9858 (100.0%)  16.7170 (100.0%)  16.6894 (100.0%)  Total

===-------------------------------------------------------------------------===
                                 DWARF Emission
===-------------------------------------------------------------------------===
  Total Execution Time: 19.5061 seconds (19.5838 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   2.4645 ( 65.5%)   8.1807 ( 52.0%)  10.6452 ( 54.6%)  10.7012 ( 54.6%)  Debug Info Emission
   1.2884 ( 34.3%)   7.5648 ( 48.0%)   8.8533 ( 45.4%)   8.8749 ( 45.3%)  DWARF Exception Writer
   0.0077 (  0.2%)   0.0000 (  0.0%)   0.0077 (  0.0%)   0.0077 (  0.0%)  DWARF Debug Writer
   3.7606 (100.0%)  15.7456 (100.0%)  19.5061 (100.0%)  19.5838 (100.0%)  Total

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 69.2811 seconds (69.3056 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   6.3674 ( 19.7%)  32.6874 ( 88.5%)  39.0547 ( 56.4%)  39.0674 ( 56.4%)  X86 Assembly Printer
  17.9268 ( 55.4%)   2.7817 (  7.5%)  20.7085 ( 29.9%)  20.7183 ( 29.9%)  X86 DAG->DAG Instruction Selection
   4.7404 ( 14.7%)   0.1805 (  0.5%)   4.9210 (  7.1%)   4.9229 (  7.1%)  Inliner for always_inline functions
   0.8175 (  2.5%)   0.1491 (  0.4%)   0.9666 (  1.4%)   0.9663 (  1.4%)  Live DEBUG_VALUE analysis
   0.7007 (  2.2%)   0.0794 (  0.2%)   0.7801 (  1.1%)   0.7801 (  1.1%)  Prologue/Epilogue Insertion & Frame Finalization
   0.4281 (  1.3%)   0.0872 (  0.2%)   0.5152 (  0.7%)   0.5148 (  0.7%)  Fast Register Allocator
   0.3210 (  1.0%)   0.1400 (  0.4%)   0.4610 (  0.7%)   0.4608 (  0.7%)  Insert stack protectors
   0.1848 (  0.6%)   0.0390 (  0.1%)   0.2238 (  0.3%)   0.2237 (  0.3%)  Two-Address instruction pass
   0.0775 (  0.2%)   0.0662 (  0.2%)   0.1436 (  0.2%)   0.1436 (  0.2%)  Expand Atomic instructions
   0.0965 (  0.3%)   0.0273 (  0.1%)   0.1239 (  0.2%)   0.1236 (  0.2%)  Check CFA info and insert CFI instructions if needed
   0.0746 (  0.2%)   0.0192 (  0.1%)   0.0938 (  0.1%)   0.0936 (  0.1%)  Finalize ISel and expand pseudo-instructions
   0.0400 (  0.1%)   0.0332 (  0.1%)   0.0732 (  0.1%)   0.0729 (  0.1%)  Free MachineFunction
   0.0487 (  0.2%)   0.0238 (  0.1%)   0.0726 (  0.1%)   0.0725 (  0.1%)  MachineDominator Tree Construction
   0.0525 (  0.2%)   0.0180 (  0.0%)   0.0705 (  0.1%)   0.0704 (  0.1%)  Post-RA pseudo instruction expansion pass
   0.0516 (  0.2%)   0.0170 (  0.0%)   0.0686 (  0.1%)   0.0685 (  0.1%)  X86 pseudo instruction expansion pass
   0.0408 (  0.1%)   0.0259 (  0.1%)   0.0666 (  0.1%)   0.0666 (  0.1%)  Lower constant intrinsics
   0.0458 (  0.1%)   0.0190 (  0.1%)   0.0648 (  0.1%)   0.0648 (  0.1%)  X86 EFLAGS copy lowering
   0.0351 (  0.1%)   0.0220 (  0.1%)   0.0571 (  0.1%)   0.0569 (  0.1%)  Expand reduction intrinsics
   0.0339 (  0.1%)   0.0221 (  0.1%)   0.0560 (  0.1%)   0.0560 (  0.1%)  Scalarize Masked Memory Intrinsics
   0.0386 (  0.1%)   0.0006 (  0.0%)   0.0392 (  0.1%)   0.0392 (  0.1%)  CallGraph Construction
   0.0180 (  0.1%)   0.0161 (  0.0%)   0.0341 (  0.0%)   0.0340 (  0.0%)  Eliminate PHI nodes for register allocation
   0.0081 (  0.0%)   0.0220 (  0.1%)   0.0301 (  0.0%)   0.0301 (  0.0%)  Exception handling preparation
   0.0127 (  0.0%)   0.0143 (  0.0%)   0.0269 (  0.0%)   0.0269 (  0.0%)  Bundle Machine CFG Edges
   0.0084 (  0.0%)   0.0170 (  0.0%)   0.0254 (  0.0%)   0.0253 (  0.0%)  Remove unreachable blocks from the CFG
   0.0015 (  0.0%)   0.0202 (  0.1%)   0.0217 (  0.0%)   0.0217 (  0.0%)  Instrument function entry/exit with calls to e.g. mcount() (pre inlining)
   0.0070 (  0.0%)   0.0138 (  0.0%)   0.0209 (  0.0%)   0.0209 (  0.0%)  X86 Indirect Branch Tracking
   0.0068 (  0.0%)   0.0133 (  0.0%)   0.0201 (  0.0%)   0.0203 (  0.0%)  Insert fentry calls
   0.0044 (  0.0%)   0.0159 (  0.0%)   0.0203 (  0.0%)   0.0203 (  0.0%)  Expand indirectbr instructions
   0.0067 (  0.0%)   0.0132 (  0.0%)   0.0199 (  0.0%)   0.0200 (  0.0%)  Machine Optimization Remark Emitter
   0.0067 (  0.0%)   0.0133 (  0.0%)   0.0200 (  0.0%)   0.0200 (  0.0%)  Insert XRay ops
   0.0066 (  0.0%)   0.0131 (  0.0%)   0.0197 (  0.0%)   0.0198 (  0.0%)  X86 PIC Global Base Reg Initialization
   0.0065 (  0.0%)   0.0131 (  0.0%)   0.0196 (  0.0%)   0.0196 (  0.0%)  Implement the 'patchable-function' attribute
   0.0059 (  0.0%)   0.0137 (  0.0%)   0.0195 (  0.0%)   0.0196 (  0.0%)  Machine Optimization Remark Emitter #2
   0.0065 (  0.0%)   0.0130 (  0.0%)   0.0194 (  0.0%)   0.0196 (  0.0%)  Local Stack Slot Allocation
   0.0064 (  0.0%)   0.0131 (  0.0%)   0.0195 (  0.0%)   0.0196 (  0.0%)  StackMap Liveness Analysis
   0.0064 (  0.0%)   0.0130 (  0.0%)   0.0194 (  0.0%)   0.0195 (  0.0%)  Contiguously Lay Out Funclets
   0.0064 (  0.0%)   0.0129 (  0.0%)   0.0193 (  0.0%)   0.0195 (  0.0%)  Lazy Machine Block Frequency Analysis
   0.0057 (  0.0%)   0.0135 (  0.0%)   0.0193 (  0.0%)   0.0194 (  0.0%)  Lazy Machine Block Frequency Analysis #2
   0.0065 (  0.0%)   0.0129 (  0.0%)   0.0194 (  0.0%)   0.0194 (  0.0%)  X86 Indirect Thunks
   0.0064 (  0.0%)   0.0129 (  0.0%)   0.0193 (  0.0%)   0.0193 (  0.0%)  Analyze Machine Code For Garbage Collection
   0.0064 (  0.0%)   0.0128 (  0.0%)   0.0192 (  0.0%)   0.0193 (  0.0%)  X86 WinAlloca Expander
   0.0063 (  0.0%)   0.0127 (  0.0%)   0.0190 (  0.0%)   0.0193 (  0.0%)  X86 Speculative Execution Side Effect Suppression
   0.0064 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0193 (  0.0%)  Fixup Statepoint Caller Saved
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0192 (  0.0%)   0.0193 (  0.0%)  X86 FP Stackifier
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  X86 insert wait instruction
   0.0050 (  0.0%)   0.0140 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  Safe Stack instrumentation pass
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0192 (  0.0%)  X86 speculative load hardening
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0191 (  0.0%)   0.0191 (  0.0%)  Compressing EVEX instrs to VEX encoding when possible
   0.0064 (  0.0%)   0.0127 (  0.0%)   0.0191 (  0.0%)   0.0191 (  0.0%)  X86 vzeroupper inserter
   0.0064 (  0.0%)   0.0126 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Load Value Injection (LVI) Ret-Hardening
   0.0063 (  0.0%)   0.0128 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Discriminate Memory Operands
   0.0062 (  0.0%)   0.0127 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  X86 Insert Cache Prefetches
   0.0041 (  0.0%)   0.0149 (  0.0%)   0.0190 (  0.0%)   0.0191 (  0.0%)  Instrument function entry/exit with calls to e.g. mcount() (post inlining)
   0.0041 (  0.0%)   0.0146 (  0.0%)   0.0187 (  0.0%)   0.0187 (  0.0%)  Lower Garbage Collection Instructions
   0.0041 (  0.0%)   0.0146 (  0.0%)   0.0186 (  0.0%)   0.0187 (  0.0%)  Shadow Stack GC Lowering
   0.0006 (  0.0%)   0.0000 (  0.0%)   0.0006 (  0.0%)   0.0006 (  0.0%)  Assumption Cache Tracker
   0.0005 (  0.0%)   0.0000 (  0.0%)   0.0005 (  0.0%)   0.0005 (  0.0%)  Pre-ISel Intrinsic Lowering
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Rewrite Symbols
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Force set function attributes
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Assumption Cache Tracker #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Module Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Library Information #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Transform Information
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Profile summary info #2
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Machine Branch Probability Analysis
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Target Pass Configuration
   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)   0.0000 (  0.0%)  Create Garbage Collector Module Metadata
  32.3536 (100.0%)  36.9274 (100.0%)  69.2811 (100.0%)  69.3056 (100.0%)  Total

===-------------------------------------------------------------------------===
                          Clang front-end time report
===-------------------------------------------------------------------------===
  Total Execution Time: 74.2717 seconds (74.3039 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.3765 (100.0%)  37.8952 (100.0%)  74.2717 (100.0%)  74.3039 (100.0%)  Clang front-end timer
  36.3765 (100.0%)  37.8952 (100.0%)  74.2717 (100.0%)  74.3039 (100.0%)  Total

Note: in both cases (comment 0 and this one), the code is from the release branch.

It was taking 37 seconds with GCC (at -O2) in FF86. Edit: scrap that, I was inadvertently using clang in the FF86 build.

It was actually 1:20 in FF86, so it's still a regression compared to that version.

Regressions:
1mn20 -> 1mn50: bug 1688820
1mn20 -> 3mn40: bug 1689933
3mn40 -> 24mn: bug 1674524
Improvement:
24mn -> 15mn: bug 1690967
Re-regression:
15mn -> 19mn25: bug 1678119 (interestingly, with an intermediate regression to 40mn when the last patch is left out)
19mn25 -> 24mn40: bug 1692731
I didn't look further to find what brought us up to 27mn30 on the current release branch.

gl.o as built by GCC 10 on the current release branch contains 9.7MB of code, vs. 2.7MB as built by clang, which sounds like an awful lot, even for clang.
About 60% of the CPU time is spent in the assembler. So not only does GCC take a lot of time (and memory, > 15GB!) generating a large amount of code, but gas also takes a large amount of time (and memory, > 6GB) processing that code into an object file.
g++ -S generates the assembly in 10 minutes, and the resulting file is 1.37GB!! Removing -g brings that down to 5mn44 and 42.6MB.

Also worth noting: GCC 11 (gcc-11 (Debian 11-20210319-1) 11.0.1 20210319 (experimental) [master revision bd9b262fa92:73ac0472bc3:287e3e8466f44f9d395a2e4dcfcda56cc34ceb1c]) is even slower (cc1plus has used 85 minutes of CPU time so far, and it's still running).

It might be worth filing bugs against both GCC and binutils.

Guys, any idea here? I see that on Fedora too.

Flags: needinfo?(mliska)
Flags: needinfo?(jh)
Blocks: sw-wr
Severity: -- → S3
Priority: -- → P3

@glandium: Can you please attach a pre-processed source file (ideally made from GCC 10)?

Attached file gl.ii.gz

This is the preprocessed code that takes 27 minutes to build on my machine with GCC 10 and, wait for it, 4 hours(!) with GCC 11.

Build with g++ -o gl.o -c gl.ii -O2 -std=gnu++17. There are other flags on the actual command line building the file, but that's enough to take 5mn40. Add -g for > 10 minutes.

Note that because of the amount of memory this requires, this fails to build on Debian i386, which doesn't cross-compile.

(In reply to Mike Hommey [:glandium] from comment #7)

Just preliminarily playing around with that file: in the worst case, changing ALWAYS_INLINE to a plain inline for GCC would seemingly fix it... Not sure yet if there is anything more targeted we could do.

It seems like the ALWAYS_INLINE on blend_pixels() is one of the biggest offenders for compile times here. The problem is that almost every pixel that gets blended to the screen goes through that path, and the call overhead of passing SIMD vectors around when that function is not inlined often shows up as a performance bottleneck in profiles in the wild. But if, in the worst case, falling back to normal inlining for it at least keeps GCC from freaking out, I guess that might be one solution. Perhaps there is some other function attribute we can use to signal to GCC to inline it more often than not, though __attribute__((hot)) did not seem to change much compared to the default inlining...
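
To make the options concrete, here is a small sketch of the three annotations being weighed, applied to a toy per-lane blend. This is not the SWGL code: `U32x4`, `blend_forced`, `blend_hot` and `blend_plain` are made-up names, and the vector type assumes the GCC/Clang `vector_size` extension.

```cpp
#include <cassert>
#include <cstdint>

// Four 32-bit lanes, using the vector extension supported by GCC and Clang.
typedef uint32_t U32x4 __attribute__((vector_size(16)));

static const U32x4 kFull = {256, 256, 256, 256};
static const U32x4 kShift = {8, 8, 8, 8};

// 1. Forced: the body is inlined at every call site, whatever the compiler's
//    size heuristics say. Cheap calls, but code size grows with each caller.
__attribute__((always_inline)) inline U32x4 blend_forced(U32x4 src, U32x4 dst,
                                                         U32x4 alpha) {
  return (src * alpha + dst * (kFull - alpha)) >> kShift;
}

// 2. Hint only: marks the function as hot. Whether it gets inlined is still
//    the compiler's decision.
__attribute__((hot)) inline U32x4 blend_hot(U32x4 src, U32x4 dst, U32x4 alpha) {
  return (src * alpha + dst * (kFull - alpha)) >> kShift;
}

// 3. Plain inline: default heuristics. If the compiler declines to inline,
//    the vector arguments are passed through a real call.
inline U32x4 blend_plain(U32x4 src, U32x4 dst, U32x4 alpha) {
  return (src * alpha + dst * (kFull - alpha)) >> kShift;
}
```

All three compute the same result; they differ only in how strongly they push the inliner, which is the trade-off between call overhead and code growth discussed here.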

Thanks for the test-case. Please report the issue to GCC upstream bugzilla.
Btw. what is the purpose of the attached source code and why is it so huge?

Flags: needinfo?(mliska)

(In reply to Martin Liška from comment #10)

The file contains precompiled shaders, which have to be generated up front for all usable permutations, so the amount of code that needs to be compiled is in principle always going to be huge because we have so many shader permutations. The size of the file is a red herring: even if you broke it up into multiple files, the amount of code to compile would only increase, because a bunch of preludes would be duplicated.

I see. So my first recommendation would be to remove the mentioned always_inline and inline keywords.
Can you please prepare preprocessed source files for these?

It seems like blend_pixels is the major offender in terms of slowing down GCC
compile times, so let's annotate such functions with PREFER_INLINE. Then, we ensure
that PREFER_INLINE disables always_inline for either GCC or debug builds so that
in cases where we know the compiler is sensitive to binary bloat, we aren't forcing
inlining.
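
A minimal sketch of what such an alias could look like, going only by the description above. The exact conditions in the landed patch may differ: using `NDEBUG` to detect non-debug builds is an assumption, and `PREFER_INLINE_IS_FORCED`, `blend_channel` and `clamp_channel` are hypothetical names added for illustration.

```cpp
#include <cassert>
#include <cstdint>

#define ALWAYS_INLINE __attribute__((always_inline)) inline

// Assumed policy, following the description above: only force inlining for
// optimized Clang builds. For GCC or debug builds, fall back to plain
// `inline` and let the compiler's own heuristics decide.
#if defined(__clang__) && defined(NDEBUG)
#  define PREFER_INLINE ALWAYS_INLINE
#  define PREFER_INLINE_IS_FORCED 1
#else
#  define PREFER_INLINE inline
#  define PREFER_INLINE_IS_FORCED 0
#endif

// Hypothetical stand-in for a large, hot function such as blend_pixels().
PREFER_INLINE uint32_t blend_channel(uint32_t src, uint32_t dst,
                                     uint32_t alpha) {
  return (src * alpha + dst * (256 - alpha)) >> 8;
}

// Small helpers can keep forced inlining without much code-size risk.
ALWAYS_INLINE uint32_t clamp_channel(uint32_t v) { return v > 255 ? 255 : v; }
```

With this shape, call sites do not change; only the annotation on the function definition decides whether inlining is forced for a given compiler and build type.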

Assignee: nobody → lsalzman
Status: NEW → ASSIGNED

Mike, I put up a patch to set up a PREFER_INLINE alias to address this. Is annotating blend_pixels good enough to alleviate the issue here, or are there other functions we need to annotate that you're seeing?

Also, would __attribute__((hot)) be appropriate to use for PREFER_INLINE here, to still signal to GCC that even though we aren't forcing inlining, we would still really like it to inline if it makes sense?

Flags: needinfo?(mh+mozilla)

There is some chatter in the upstream bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99785

Flags: needinfo?(mh+mozilla)

Did some experiments removing always_inline from blend_pixels to see what the performance difference currently is: https://treeherder.mozilla.org/perfherder/compare?originalProject=try&newProject=try&newRevision=6a176b1e685fa622fb15be32afab1a6c008c211c&framework=1&selectedTimeRange=172800

Looks like there are performance regressions between 9% and 26% (tsvg, tscroll, etc), even on the shippable builds. So pretty much looks like what I remember. Not inlining this function really hurts us.

Pushed by lsalzman@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7256e03e71df
Use PREFER_INLINE on large SWGL functions. r=jrmuizel
Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

Can we get this on beta, and release, if there's going to be a dot-release?

Flags: needinfo?(lsalzman)

without the patch:

firefox-87.0 with gcc-9.3.0 is 40 minutes
firefox-87.0 with gcc-10.2.0 is 50 minutes

with the patch:

firefox-87.0 with gcc-9.3.0 is 33 minutes
firefox-87.0 with gcc-10.2.0 is 34 minutes

Also, this fixes the armv7 compile, where gl.cc seemingly was unable to get enough RAM on a cross-compile host with 32GB of it.

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

Beta/Release Uplift Approval Request

  • User impact if declined: Slow GCC build times.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This only affects GCC builds, and we mostly build with Clang.
  • String changes made/needed:
Flags: needinfo?(lsalzman)
Attachment #9211599 - Flags: approval-mozilla-release?
Attachment #9211599 - Flags: approval-mozilla-beta?

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

Approved for 88.0b6.

Attachment #9211599 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9211599 [details]
Bug 1700520 - Use PREFER_INLINE on large SWGL functions. r?jrmuizel,glandium

88 is in RC now and there are no plans for an 87 dot release.

Attachment #9211599 - Flags: approval-mozilla-release? → approval-mozilla-release-