Closed Bug 1543478 Opened 5 years ago Closed 3 years ago

The throbber animation heavily regresses performance on main-thread-heavy micro-benchmark

Categories: Core :: Graphics: WebRender, defect, P3

Status: RESOLVED WONTFIX

People: Reporter: emilio; Unassigned

References: Blocks 1 open bug

Attachments: 1 file

Attached file Testcase.

STR:

  • Enable WebRender.
  • Open the test-case attached to this bug.
  • Open attachment 9057328, which is attached to bug 1537903; it has the same test-case, but without spinning the throbber.

Expected results:

  • Comparable performance.

Actual results:

  • The number I get on my GPU (Mesa DRI Intel(R) HD Graphics P530 (Skylake GT2)) is much higher if the throbber is spinning. Performance is comparable in both test-cases if WebRender is disabled.

Diff between the two test-cases is just:

-const DELAY = true;
+const DELAY = false;
 window.onload = function() {
   DELAY ? setTimeout(runTests, 0) : runTests();
 }
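For reference, a hypothetical sketch of the kind of main-thread-heavy micro-benchmark the attachment presumably contains. Only `DELAY` and `runTests` come from the diff above; the work loop and the score format are made up for illustration:

```javascript
// Hypothetical sketch of a main-thread-heavy micro-benchmark like the
// attached test-case. Only `DELAY` and `runTests` come from the diff
// above; the body of the work loop is made up for illustration.
const DELAY = true;

function runTests() {
  const start = Date.now();
  let acc = 0;
  // Main-thread-heavy work; the reported number here is elapsed time,
  // so higher means worse (matching the comparisons in this bug).
  for (let i = 0; i < 5e6; i++) {
    acc += Math.sqrt(i);
  }
  const elapsed = Date.now() - start;
  console.log(`score: ${elapsed} (acc=${acc.toFixed(0)})`);
  return elapsed;
}

if (typeof window !== "undefined") {
  // In a page, DELAY=true defers the work past the load event, by which
  // point the tab's throbber has presumably stopped spinning;
  // DELAY=false runs it while the throbber is still animating.
  window.onload = () => (DELAY ? setTimeout(runTests, 0) : runTests());
} else {
  runTests(); // plain Node fallback for a quick sanity run
}
```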

This is with integrated Graphics on Linux.

I get roughly the same number on both attachments on Mac: 4041 vs. 3963. What numbers do you get?

Flags: needinfo?(emilio)

5845 vs 10184

Flags: needinfo?(emilio)

I'll get a pair of profiles.

WebRender disabled, throbber test-case: 6912, then 7275

WebRender disabled, no-throbber test-case: 7038, then 7055

WebRender enabled, throbber test-case: 20037, then 19430

WebRender enabled, no-throbber test-case: 7713, then 7850.

Note that WebRender's profiles are profiling a bunch more threads, so they may not be directly comparable with non-WR.

Also, this is all on a pristine profile, just created and installed the profiler add-on.

Can you reproduce the problem if you make the window as small as possible?

Flags: needinfo?(emilio)

Reducing the window size does definitely help a lot, yeah: 5522 vs. 5644.

While at it, I verified that the throbber is still visible (i.e. it doesn't get culled away).

Flags: needinfo?(emilio)

I'm guessing this is memory bandwidth starvation. What resolution are you running at?

Blocks: wr-intel

3840x2160

Can you run the benchmark at https://github.com/jrmuizel/memset-bench and post the results?

Flags: needinfo?(emilio)
171239584
 253290568        1 33177600 4.996732
 292047110        2 16588800 4.333633
 213503808        4 8294400 5.927880
  76897571        8 4147200 16.458582
  50981800       16 2073600 24.825036
  46075951       32 1036800 27.468234
  50430943       64 518400 25.096199
  53592175      128 259200 23.615854
  49455212      256 129600 25.591337
  51532981      512 64800 24.559515
  35984658     1024 32400 35.171239
  31182065     2048 16200 40.588236
  38009982     4096 8100 33.297174
  32214595     8192 4050 39.287317
  31485070    16384 2025 40.197624
  40086547    32768 1012 31.572313
  51698017    65536 506 24.481113
  45728613   131072 253 27.676873
Flags: needinfo?(emilio)
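Judging by its name and the output above (block sizes from 1 to 131072 bytes), memset-bench appears to measure fill throughput at varying block sizes. A minimal JavaScript sketch of that idea (not the actual memset-bench code; the function name and sizes are illustrative):

```javascript
// Hedged sketch (not the actual memset-bench code) of a fill-throughput
// benchmark: write a fixed total amount of memory using different block
// sizes and report bytes per second for each.
function fillThroughput(blockSize, totalBytes) {
  const buf = new Uint8Array(blockSize);
  const iterations = Math.floor(totalBytes / blockSize);
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    buf.fill(i & 0xff); // typed-array equivalent of memset()
  }
  const ns = Number(process.hrtime.bigint() - start);
  return (iterations * blockSize) / (ns / 1e9); // bytes per second
}

// Sweep a few block sizes over the same total amount of memory.
for (const blockSize of [1, 64, 4096, 65536]) {
  console.log(blockSize, fillThroughput(blockSize, 1 << 24).toExponential(2));
}
```

Small blocks pay per-call overhead while large blocks approach raw memory bandwidth, which is presumably why the columns above vary so much with block size.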

Can you also try running the benchmark with OpenGL layers instead of basic?

Flags: needinfo?(emilio)

5067 vs. 7135 on a maximized window, 5031 vs. 5188 on a minimally-sized window.

Flags: needinfo?(emilio)

And can you try running the benchmark while running "cargo run mem --release" from https://github.com/jrmuizel/jrmuizel-membench?

Flags: needinfo?(emilio)
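The membench tool presumably keeps the memory bus busy while the browser benchmark runs. A rough JavaScript illustration of that kind of stress load, assuming it simply streams reads and writes over a cache-defeating buffer (the real tool is a Rust program and may work differently):

```javascript
// Rough illustration (the real jrmuizel-membench is a Rust program and
// may work differently) of a memory-stress load: stream reads and
// writes over a buffer much larger than the last-level cache so that
// DRAM bandwidth is contended while the browser benchmark runs.
function stress(bytes, passes) {
  const buf = new Float64Array(bytes / 8);
  let sum = 0;
  for (let p = 0; p < passes; p++) {
    for (let i = 0; i < buf.length; i++) {
      sum += buf[i]; // sequential read pass
      buf[i] = sum;  // write pass doubles the traffic
    }
  }
  return sum; // returned so the loops aren't optimized away
}

stress(1 << 24, 2); // 16 MiB buffer, two passes; a real tool would loop forever
```

A 16 MiB buffer comfortably exceeds the 8 MiB L3 cache reported by lscpu below, so every pass has to go to DRAM, competing with the compositor for bandwidth.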

WR enabled: 7174 vs. 14101
OpenGL layers: 7619 vs. 12291
Basic layers: 7053 vs. 7981

This is on a maximized window; let me know if you also want me to report numbers with a minimized window.

Flags: needinfo?(emilio)

Setting P1 because I assume we want this fixed before shipping on Intel?

Blocks: wr-perf
Priority: -- → P1

So, to confirm: with basic layers, it only gets a little bit worse (7053 vs 7981) when running jrmuizel-membench?

Flags: needinfo?(emilio)

Yes, that's right.

Flags: needinfo?(emilio)

How many cores do you have in this machine?

Flags: needinfo?(emilio)

It's a Lenovo P50, with 8 logical, 4 physical:

$ lscpu                                                                                                                                                                                    
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Stepping:            3
CPU MHz:             800.032
CPU max MHz:         3700.0000
CPU min MHz:         800.0000
BogoMIPS:            5616.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
Flags: needinfo?(emilio)
Depends on: 1620076
No longer blocks: wr-intel

Sotaro, can you try running this on Windows 10 on your P50 with gfx.webrender.compositor on and off?

Flags: needinfo?(sotaro.ikeda.g)

When I tested the STR from comment 0 on a P50 on Win10 with a maximized window, I did not see such a difference, though the values varied from run to run.

  • With DC compositor: 4761 vs 4365
  • Without DC compositor: 4250 vs 4059
Flags: needinfo?(sotaro.ikeda.g)
Priority: P1 → P3

Wontfixing this one because we're having a hard time reproducing it, and there are so many actionable items in the queue that I'd like to get this one off the perf triage list.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
