Closed Bug 1543478 Opened 5 years ago Closed 3 years ago

The throbber animation heavily regresses performance on main-thread-heavy micro-benchmark

Categories: Core :: Graphics: WebRender, defect, P3

Status: RESOLVED WONTFIX

People: Reporter: emilio; Unassigned

References: Blocks 1 open bug

Attachments: 1 file

Attached file Testcase.

STR:

  • Enable WebRender.
  • Open the test-case attached to this bug.
  • Open attachment 9057328, which is attached to bug 1537903; it has the same test-case, but without spinning the throbber.

Expected results:

  • Comparable performance.

Actual results:

  • The number I get on my GPU (Mesa DRI Intel(R) HD Graphics P530 (Skylake GT2)) is much higher if the throbber is spinning. Performance is comparable in both test-cases if WebRender is disabled.

Diff between the two test-cases is just:

-const DELAY = true;
+const DELAY = false;
 window.onload = function() {
   DELAY ? setTimeout(runTests, 0) : runTests();
 }
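For reference, a hypothetical sketch of the kind of main-thread-heavy micro-benchmark the attachment presumably contains. Only `DELAY` and `runTests` come from the diff above; the work loop and the score format are made up for illustration:

```javascript
// Hypothetical sketch of a main-thread-heavy micro-benchmark like the
// attached test-case. Only `DELAY` and `runTests` come from the diff
// above; the body of the work loop is made up for illustration.
const DELAY = true;

function runTests() {
  const start = Date.now();
  let acc = 0;
  // Main-thread-heavy work; the reported number here is elapsed time,
  // so higher means worse (matching the comparisons in this bug).
  for (let i = 0; i < 5e6; i++) {
    acc += Math.sqrt(i);
  }
  const elapsed = Date.now() - start;
  console.log(`score: ${elapsed} (acc=${acc.toFixed(0)})`);
  return elapsed;
}

if (typeof window !== "undefined") {
  // In a page, DELAY=true defers the work past the load event, by which
  // point the tab's throbber has presumably stopped spinning;
  // DELAY=false runs it while the throbber is still animating.
  window.onload = () => (DELAY ? setTimeout(runTests, 0) : runTests());
} else {
  runTests(); // plain Node fallback for a quick sanity run
}
```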

This is with integrated Graphics on Linux.

I get roughly the same number on both attachments on Mac: 4041 vs. 3963. What numbers do you get?

Flags: needinfo?(emilio)

5845 vs 10184

Flags: needinfo?(emilio)

I'll get a pair of profiles.

WebRender disabled, throbber test-case: 6912, then 7275

WebRender disabled, no-throbber test-case: 7038, then 7055

WebRender enabled, throbber test-case: 20037, then 19430

WebRender enabled, no-throbber test-case: 7713, then 7850.

Note that WebRender's profiles are profiling a bunch more threads, so they may not be directly comparable with non-WR.

Also, this is all on a pristine profile, just created and installed the profiler add-on.

Can you reproduce the problem if you make the window as small as possible?

Flags: needinfo?(emilio)

Reducing the window size does definitely help a lot, yeah: 5522 vs. 5644.

While at it, I verified that the throbber is still visible (i.e. it doesn't get culled away).

Flags: needinfo?(emilio)

I'm guessing this is memory bandwidth starvation. What resolution are you running at?

Blocks: wr-intel

3840x2160

Can you run the benchmark at https://github.com/jrmuizel/memset-bench and post the results?

Flags: needinfo?(emilio)
171239584
 253290568        1 33177600 4.996732
 292047110        2 16588800 4.333633
 213503808        4 8294400 5.927880
  76897571        8 4147200 16.458582
  50981800       16 2073600 24.825036
  46075951       32 1036800 27.468234
  50430943       64 518400 25.096199
  53592175      128 259200 23.615854
  49455212      256 129600 25.591337
  51532981      512 64800 24.559515
  35984658     1024 32400 35.171239
  31182065     2048 16200 40.588236
  38009982     4096 8100 33.297174
  32214595     8192 4050 39.287317
  31485070    16384 2025 40.197624
  40086547    32768 1012 31.572313
  51698017    65536 506 24.481113
  45728613   131072 253 27.676873
Flags: needinfo?(emilio)
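Judging by its name and the output above (block sizes from 1 to 131072 bytes), memset-bench appears to measure fill throughput at varying block sizes. A minimal JavaScript sketch of that idea (not the actual memset-bench code; the function name and sizes are illustrative):

```javascript
// Hedged sketch (not the actual memset-bench code) of a fill-throughput
// benchmark: write a fixed total amount of memory using different block
// sizes and report bytes per second for each.
function fillThroughput(blockSize, totalBytes) {
  const buf = new Uint8Array(blockSize);
  const iterations = Math.floor(totalBytes / blockSize);
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    buf.fill(i & 0xff); // typed-array equivalent of memset()
  }
  const ns = Number(process.hrtime.bigint() - start);
  return (iterations * blockSize) / (ns / 1e9); // bytes per second
}

// Sweep a few block sizes over the same total amount of memory.
for (const blockSize of [1, 64, 4096, 65536]) {
  console.log(blockSize, fillThroughput(blockSize, 1 << 24).toExponential(2));
}
```

Small blocks pay per-call overhead while large blocks approach raw memory bandwidth, which is presumably why the columns above vary so much with block size.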

Can you also try running the benchmark with OpenGL layers instead of basic?

Flags: needinfo?(emilio)

5067 vs. 7135 on a maximized window, 5031 vs. 5188 on a minimally-sized window.

Flags: needinfo?(emilio)

And can you try running the benchmark while running "cargo run mem --release" from https://github.com/jrmuizel/jrmuizel-membench?

Flags: needinfo?(emilio)
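The membench tool presumably keeps the memory bus busy while the browser benchmark runs. A rough JavaScript illustration of that kind of stress load, assuming it simply streams reads and writes over a cache-defeating buffer (the real tool is a Rust program and may work differently):

```javascript
// Rough illustration (the real jrmuizel-membench is a Rust program and
// may work differently) of a memory-stress load: stream reads and
// writes over a buffer much larger than the last-level cache so that
// DRAM bandwidth is contended while the browser benchmark runs.
function stress(bytes, passes) {
  const buf = new Float64Array(bytes / 8);
  let sum = 0;
  for (let p = 0; p < passes; p++) {
    for (let i = 0; i < buf.length; i++) {
      sum += buf[i]; // sequential read pass
      buf[i] = sum;  // write pass doubles the traffic
    }
  }
  return sum; // returned so the loops aren't optimized away
}

stress(1 << 24, 2); // 16 MiB buffer, two passes; a real tool would loop forever
```

A 16 MiB buffer comfortably exceeds the 8 MiB L3 cache reported by lscpu below, so every pass has to go to DRAM, competing with the compositor for bandwidth.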

WR enabled: 7174 vs. 14101
OpenGL layers: 7619 vs. 12291
Basic layers: 7053 vs. 7981

This is on a maximized window; let me know if you also want me to report numbers with a minimized window.

Flags: needinfo?(emilio)

Setting P1 because I assume we want this fixed before shipping on Intel?

Blocks: wr-perf
Priority: -- → P1

So, to confirm: with basic layers, it only gets a little bit worse (7053 vs 7981) when running jrmuizel-membench?

Flags: needinfo?(emilio)

Yes, that's right.

Flags: needinfo?(emilio)

How many cores do you have in this machine?

Flags: needinfo?(emilio)

It's a Lenovo P50, with 8 logical, 4 physical:

$ lscpu                                                                                                                                                                                    
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
Address sizes:       39 bits physical, 48 bits virtual
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               94
Model name:          Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Stepping:            3
CPU MHz:             800.032
CPU max MHz:         3700.0000
CPU min MHz:         800.0000
BogoMIPS:            5616.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
Flags: needinfo?(emilio)
Depends on: 1620076
No longer blocks: wr-intel

Sotaro, can you try running this on Windows 10 on your P50 with gfx.webrender.compositor on and off?

Flags: needinfo?(sotaro.ikeda.g)

When I tested the STR from comment 0 on a P50 on Win10 with a maximized window, I did not see such a difference, though the values varied from run to run.

  • With DC compositor: 4761 vs 4365
  • Without DC compositor: 4250 vs 4059
Flags: needinfo?(sotaro.ikeda.g)
Priority: P1 → P3

Wontfixing this one because we're having a hard time reproducing it, and there are so many actionable items in the queue that I'd like to get this one off the perf triage list.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
