Closed Bug 596544 Opened 10 years ago Closed 10 years ago

These WebGL samples are slow on Firefox, fast on Chrome

Categories

(Core :: Canvas: WebGL, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -

People

(Reporter: paul, Assigned: bjacob)

References

()

Details

Attachments

(1 file)

How slow are they for you?
Here Aquarium runs at:
 - 40 FPS without GL layers
 - 70 FPS with GL layers
On my Core i7 + NVIDIA Quadro FX 880M laptop.

What's your platform, do you have accelerated layers enabled, and can you run any slow demo in a profiler to see where it's spending time? Finally, are you using a very recent build with Jaegermonkey?
Linux, ATI, nightly (don't know about accelerated layers)
What framerate do you get in Aquarium with a window of roughly 1000x1000? What's your CPU and graphics card? For accelerated layers, check the preferences layers.accelerate-all and layers.accelerate-none.
3 fps (with and without layers activated)
Ah, so we have a real problem here. So keep layers acceleration disabled, and if you have time, please profile it. Here's how you can do that if you have a recent kernel (linux 2.6.33 or newer). Install the 'perf' profiler package from your distro or get perf from
    https://perf.wiki.kernel.org/index.php/Main_Page

then follow these steps:

    1. launch firefox
    2. go to the demo page and wait until it's fully loaded
    3. Open a terminal
    4. Find the PID of your firefox. You can do:
           ps aux | grep firefox | grep -v grep
    5. Attach the perf profiler to the running firefox process:
           perf record -f -g -p THE_PID_OF_YOUR_FIREFOX
       replacing THE_PID_OF_YOUR_FIREFOX by the value you found in step 4.
    6. After like 10 seconds, interrupt profiling by hitting Ctrl+C in the terminal.
    7. To see profiler results you can then do:
           perf report
       and to get a quick summary:
           perf report --sort dso,symbol | head -20
       please paste that quick summary here. You can also compress the perf.data file and attach it here.
blocking2.0: --- → ?
I'm also seeing this on 2010-09-18 nightly and 4.0b6. 

Window size has no major effect on the speed so it doesn't seem to be compositing-bound.

On Chrome (CPU compositing) top shows 70% chrome, 20% GPU process, 12% Xorg.
On Firefox (ditto, accelerated comp doesn't work) top output is 40% firefox-bin, 80% Xorg. Even on a small window Xorg share is 70%.

Also, GL calls seem to take an avg 75 us in Firebug profile (Aquarium does ~2000 per frame = 150 ms).
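
As a rough check of those figures, the per-call cost can be turned into a ceiling on the framerate. This is only a sketch: it takes the ~75 us average and ~2000 calls per frame quoted above at face value, and it assumes GL call time is the only per-frame cost, so the real framerate would be lower still. The function names are made up for illustration.

```python
# Back-of-the-envelope frame budget from the Firebug figures quoted above.
# Assumption: GL call overhead is the only per-frame cost, so this is an
# optimistic upper bound on the framerate.

def gl_frame_cost_ms(calls_per_frame: int, avg_call_us: float) -> float:
    """Time per frame spent inside GL calls, in milliseconds."""
    return calls_per_frame * avg_call_us / 1000.0

def max_fps(frame_cost_ms: float) -> float:
    """Highest framerate possible if every frame costs at least frame_cost_ms."""
    return 1000.0 / frame_cost_ms

cost = gl_frame_cost_ms(2000, 75.0)
print(f"{cost:.0f} ms per frame, at most {max_fps(cost):.1f} FPS")
```

A ceiling in the single digits is consistent with the framerates reported in this bug.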

The high Xorg CPU use is weird.
Ah, that. OK. That could be **** XRender support in your drivers. It would have to be very, very ****, since as you said it isn't fillrate-bound anyway, but it's still worth considering. What driver are you using? I need to know the driver, not just "ATI".

Could you disable XRender acceleration in your xorg.conf? Not sure what the best way to do that is but you could always use the "vesa" or "fbdev" driver.

Could you run this in a profiler? See comment 5.
Sure, driver is Catalyst 10.9, x86 Linux, Radeon HD 4850, X.Org X Server 1.7.7 Release Date: 2010-05-04. I don't know off-hand how to disable XRender, but if I figure it out I'll post an update.


Perf results for running aquarium demo for ~10 seconds:

perf report --sort dso,symbol | head -20 


For Firefox 4.0b6:

# Samples: 36083
#
# Overhead                        Shared Object  Symbol
# ........  ...................................  ......
#
     9.01%  ./downloads/firefox/libxul.so        [.] 0x00000000f6177a
                |          
                |--1.60%-- 0xb72cc177
                |          0xb716e66a
                |          
                |--1.14%-- 0xb72c9b78
                |          0xb716e66a
                |          
                |--1.08%-- 0xb72c9bdc
                |          0xb716e66a
                |          
                |--1.01%-- 0xb72c76f7
                |          0xb716e66a
                |          
                |--0.98%-- 0xb72c9bc6



Xorg with Firefox:

# Samples: 73552
#
# Overhead                                  Shared Object  Symbol
# ........  .............................................  ......
#
    21.18%  /usr/lib/libpixman-1.so.0.16.4                 [.] 0x0000000004cbe6
                |          
                |--3.27%-- 0xb778e23a
                |          0xb7761dd9
...
                |          pixman_image_composite
                |          fbComposite
                |          0xb088e817
                |          0xb088d7a9
                |          0x810654b
                |          CompositePicture
... followed by repeated entries for pixman_image_composite

     8.91%  /usr/lib/libpixman-1.so.0.16.4                 [.] 0x00000000005b16
...
                |          pixman_image_composite
...
...
...
     0.58%  /lib/i686/cmov/libc-2.11.2.so                  [.] memcpy
...
     0.43%  /usr/lib/xorg/modules/glesx.so                 [.] 0x000000000212a3




Xorg output on Chrome with CPU compositing:

# Samples: 12844
#
# Overhead                                  Shared Object  Symbol
# ........  .............................................  ......
#
    46.71%  /lib/i686/cmov/libc-2.11.2.so                  [.] memcpy
                |          
                |--99.77%-- 0xb088cfab
                |          0x810654b
                |          CompositePicture
                |          0x80ff94d
                |          0x80fc623
                |          0x8080067
                |          0x806692a
                |          __libc_start_main
                |          0x8066511
                 --0.23%-- [...]

     6.84%  /usr/lib/xorg/modules/glesx.so                 [.] 0x000000000ecb29
                |          
                |--1.48%-- 0xb089122f
                |          0xb0892e5d
                |          esutDeleteSurf
                |          glesxDeleteSharedAccelSurf
                |          atiddxPixmapFreeGARTCacheable
                |          destroyPixmap
                |          0x810912c


Xorg output on Chrome with accelerated compositing:

# Samples: 2619
#
# Overhead                                  Shared Object  Symbol
# ........  .............................................  ......
#
     5.77%  [kernel]                                       [k] _ZN4Asic16Is_WPTR

     5.42%  [kernel]                                       [k] unix_poll

     4.51%  [kernel]                                       [k] do_select
                |          
                |--98.31%-- 0x807fda0
                |          0x806692a
                |          __libc_start_main
                |          0x8066511
                |          
                 --1.69%-- 0xb7830400
                           0x807fda0
                           0x806692a
                           __libc_start_main
                           0x8066511

     4.47%  [kernel]                                       [k] _ZN4Asic9WaitUnti

     4.43%  [kernel]                                       [k] _spin_lock_irqsav
OS: Linux → Windows Server 2003
(In reply to comment #8)
> perf report --sort dso,symbol | head -20 

Argh, sorry! This prints a call graph, so 20 lines are not enough. Can you please use this command instead:

perf report -g flat --sort dso,symbol | head -20
Also, your executables are missing a lot of symbols. Just firefox would be enough, but I'm afraid that in order to get perf to pick them up, they need to be in the firefox executable. Which means that you'd have to build firefox yourself, unless there's something I'm not aware of.
Alright, I'll roll my own build tomorrow.

I tested with some other full-window scenes. There is a clear connection between GL calls per frame and framerate. A scene with only one object runs smoothly (20-30 fps), a scene with a couple dozen objects runs at 10 fps, a scene with a hundred objects runs at 5 fps. 

Here's the symbol-poor output of the flat perf for Aquarium:

firefox-bin 4.0b6

# Samples: 63577
#
# Overhead                        Shared Object  Symbol
# ........  ...................................  ......
#
     9.03%  ./downloads/firefox/libxul.so        [.] 0x00000000eca012
     7.68%  /usr/lib/dri/fglrx_dri.so            [.] 0x000000003f20d7
             0.93%
                0xa1da699a

             0.63%
                0xa1da69c1

     6.58%  ./downloads/firefox/libxul.so        [.] gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
             6.57%
                gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)

     5.93%  ./downloads/firefox/libxul.so        [.] 0x0000000055184e
     2.75%  [kernel]                             [k] copy_from_user
             1.57%


For X.org

# Samples: 88543
#
# Overhead                                  Shared Object  Symbol
# ........  .............................................  ......
#
    18.79%  /usr/lib/libpixman-1.so.0.16.4                 [.] 0x0000000004023a
    11.84%  /usr/lib/libpixman-1.so.0.16.4                 [.] 0x00000000005b1a
             6.62%
                0xb7753b00
                0xb7761dd9
                0xb7787a6d
                0xb778f80a
                0xb7790a2a
                0xb7787763
                0xb7762953
                0xb778975e
                0xb7762953
                0xb7794b63
                0xb7762953
                0xb779b289
                0xb7762953
                pixman_image_composite
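
With so many unresolved addresses, it can help to total the flat report per shared object rather than per symbol. The sketch below does that for the top-level firefox-bin rows pasted above. The regular expression is an assumption about this particular `perf report` layout (percentage, shared object, `[.]` or `[k]` marker, symbol) and would not cope with object paths that contain spaces.

```python
import re
from collections import defaultdict

# Top-level rows copied from the firefox-bin 4.0b6 flat report above,
# with two call-graph lines left in to show that they are skipped.
SAMPLE = """\
     9.03%  ./downloads/firefox/libxul.so        [.] 0x00000000eca012
     7.68%  /usr/lib/dri/fglrx_dri.so            [.] 0x000000003f20d7
             0.93%
                0xa1da699a
     6.58%  ./downloads/firefox/libxul.so        [.] gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
     5.93%  ./downloads/firefox/libxul.so        [.] 0x0000000055184e
     2.75%  [kernel]                             [k] copy_from_user
"""

# Assumed row layout: "<pct>%  <shared object>  [.]|[k] <symbol>".
ROW = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\S+)\s+\[[.k]\]\s+(.+)$")

def overhead_by_dso(report: str) -> dict:
    """Sum the overhead percentages of top-level rows per shared object."""
    totals = defaultdict(float)
    for line in report.splitlines():
        match = ROW.match(line)
        if match:  # call-graph lines do not match and are ignored
            totals[match.group(2)] += float(match.group(1))
    return dict(totals)

totals = overhead_by_dso(SAMPLE)
for dso, pct in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{pct:6.2f}%  {dso}")
```

Only the rows shown are counted, so the totals cover the top of the profile, not the whole run.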
> Window size has no major effect on the speed so it doesn't seem to be
> compositing-bound.

The window size only matters when compositing to screen, since you can then optimize away everything outside the window.  When compositing to a canvas (as is the case here, right?), all that matters is the size of the canvas, not the size of the window.

So yeah, crappy Render seems like the most likely cause.  God, I hate XOrg.  :(
Sorry, I was being unclear. The canvas size changes with the window size in Aquarium. So window size = canvas size = compositing overhead.

If Nvidia drivers perform fine on Linux, that'd suggest that there's some slow path that is getting hammered on ATI. And because the amount of GL calls is correlated with the slowness, I'd guess that each GL call is triggering some extra bit of work that the Nvidia driver shrugs off but ATI relegates to a slow path.

My reasoning here is:
If Nvidia is slow as well, then it'd be something inherently slow in the JS->GL-path. But if not, then there must be something going on that ATI isn't handling well. And it's not unfixable because Chrome doesn't have the slowness. It also seems to be triggered on most GL calls. So it must be something fixable that's  affecting every GL call.

But yeah, I have no idea about the exact cause. Are GLXPixmaps accelerated?

Maybe crib this http://src.chromium.org/svn/trunk/src/app/gfx/gl/gl_context_linux.cc

They seem to be using a 1x1 pbuffer for context, fbo in the context for the render target, fallback to glxpixmaps where pbuffers are not supported.
*** SCROLL DOWN TO WHERE I MENTION MAKECURRENT FOR THE INTERESTING BIT ***


(In reply to comment #13)
> Sorry, I was being unclear. The canvas size changes with the window size in
> Aquarium. So window size = canvas size = compositing overhead.
> 
> If Nvidia drivers perform fine on Linux, that'd suggest that there's some slow
> path that is getting hammered on ATI.

Agree.

> And because the amount of GL calls is correlated with the slowness,

Thanks for finding out this new bit of information. It's very interesting because the amount of GL calls in a WebGL app is NOT correlated to XRender at all. So IF we are looking at ATI driver slowness here, that would be OpenGL implementation slowness, not XRender slowness.

This isn't per se going to be easy to measure because GL calls are asynchronous and return immediately. The GL debug mode I'm coding (argh, I have to finish this today) is calling glFinish() after every GL call, emulating synchronousness, but this is only for debug builds, which aren't very suitable for benchmarking. Still, the slowness that you're experiencing here is so extreme that I guess that profiling a debug build will still make the culprit show up. I'll let you know when it's available.
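
To illustrate the idea of finishing after every call so that the cost lands on the call that caused it, here is a toy model. It is not the debug mode mentioned above and does not touch real OpenGL: `FakeAsyncGL`, its made-up costs, and `FinishAfterEveryCall` are all hypothetical names invented for this sketch.

```python
from collections import defaultdict

class FakeAsyncGL:
    """Hypothetical stand-in for an asynchronous GL API: calls only queue work."""
    COST_US = {"uniform3fv": 5, "drawElements": 60}  # made-up costs, for illustration

    def __init__(self):
        self.queued_us = 0

    def uniform3fv(self):
        self.queued_us += self.COST_US["uniform3fv"]

    def drawElements(self):
        self.queued_us += self.COST_US["drawElements"]

    def finish(self):
        """Plays the role of glFinish(): drain the queue and report its cost."""
        done, self.queued_us = self.queued_us, 0
        return done

class FinishAfterEveryCall:
    """Debug proxy: forward each call, then finish, so the cost is charged to that call."""

    def __init__(self, gl):
        self._gl = gl
        self.cost_by_call = defaultdict(int)

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself, i.e. GL methods.
        method = getattr(self._gl, name)

        def wrapped(*args, **kwargs):
            result = method(*args, **kwargs)
            self.cost_by_call[name] += self._gl.finish()
            return result

        return wrapped

gl = FinishAfterEveryCall(FakeAsyncGL())
gl.uniform3fv()
gl.drawElements()
gl.drawElements()
print(dict(gl.cost_by_call))
```

Without the proxy, all the queued cost would surface at whatever later point happens to block, which is why a plain profile of asynchronous calls is hard to read.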

> I'd guess that each GL call is triggering some
> extra bit of work that the Nvidia driver shrugs off but ATI relegates to a slow path.

Either that, or some function is just hideously slow in the ATI implementation. Either way, we'll know with a profiling of firefox in GL debug mode.

> My reasoning here is:
> If Nvidia is slow as well, then it'd be something inherently slow in the
> JS->GL-path.

Again, NVIDIA is /not/ slow here. I get 70 FPS in the Aquarium demo full-screen with default settings.

> But if not, then there must be something going on that ATI isn't
> handling well. And it's not unfixable because Chrome doesn't have the slowness.

Indeed. When we were looking at XRender slowness, I was shrugging this off as we all know that Chrome does in plain optimized software rendering what we do with XRender. But now that we're looking at GL slowness, this is getting interesting.


> It also seems to be triggered on most GL calls. So it must be something fixable that's  affecting every GL call.

OOOOOOOhhhhhhhh I know. The only thing that's triggered by every GL call is GLX MakeCurrent(). Let me make you a patch !!

> But yeah, I have no idea about the exact cause. Are GLXPixmaps accelerated?

Yes, since GLXPixmaps live on the server.

> 
> Maybe crib this
> http://src.chromium.org/svn/trunk/src/app/gfx/gl/gl_context_linux.cc
> 
> They seem to be using a 1x1 pbuffer for context, fbo in the context for the
> render target, fallback to glxpixmaps where pbuffers are not supported.

Yeah, we just create a plain 1x1 window for the context, so we don't have to worry about whether pbuffers are supported. We use a FBO for the actual rendering.
Here's a patch making us try to avoid calling MakeCurrent.

Before someone says "**** linux graphics", keep in mind that we're already doing this on other platforms including on Windows (WGL) !
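
For readers who want the general idea, one common way to avoid redundant MakeCurrent calls is to remember which context was made current last and skip the platform call when it has not changed. The sketch below models that in plain Python. It is not the attached patch, and `FakePlatformGL` and `CachedContextSwitcher` are hypothetical names that stand in for the real GLX/WGL layer.

```python
class FakePlatformGL:
    """Hypothetical stand-in for GLX/WGL; counts how often MakeCurrent really runs."""

    def __init__(self):
        self.make_current_calls = 0

    def make_current(self, context_id):
        self.make_current_calls += 1

class CachedContextSwitcher:
    """Skips the platform MakeCurrent when the requested context is already current."""

    def __init__(self, platform):
        self._platform = platform
        self._current = None

    def make_current(self, context_id, force=False):
        """Return True if a real platform call was made, False if it was skipped."""
        if not force and self._current == context_id:
            return False
        self._platform.make_current(context_id)
        self._current = context_id
        return True

platform = FakePlatformGL()
switcher = CachedContextSwitcher(platform)

# Simulate one frame: ~2000 GL calls, each preceded by a make_current request.
for _ in range(2000):
    switcher.make_current("webgl-canvas")

print(platform.make_current_calls)
```

The `force` parameter is there because a cache like this goes stale if other code changes the current context behind its back; in that case the caller has to bypass the cache.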
Assignee: nobody → bjacob
Status: NEW → ASSIGNED
Attachment #476659 - Flags: review?(vladimir)
Ilmari / Paul: could you please try my patch? Would be good to know asap if it fixes it. Otherwise you'll have to wait until someone reviews this.
Yep, that did it. Aquarium with accelerated layers: 5 fps with HEAD, 60 fps with patch.
OS: Windows Server 2003 → All
Wow, good job guys :)

Ilmari, thanks a lot for the information!
Attachment #476659 - Flags: approval2.0+
http://hg.mozilla.org/mozilla-central/rev/f9728160d6e4
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
I just tried on latest hourly. http://hg.mozilla.org/mozilla-central/rev/901fd772c4da

1000 Fish in Aquarium, D2D ON

Fx4 with layers: 25
Fx4 without layers: 20
Chrome 7 dev with all HW acc enabled: 45

Is this performance issue related to this bug, or is this the current Fx speed?
With 1000 fish, you are really measuring JavaScript performance, not so much graphics performance. So 25:45 against Chrome 7 seems plausible; see http://www.arewefastyet.com/ and feel free to continue this discussion with the JavaScript people.

But could you try with fewer, e.g. 50 fish? If we're slower than Chrome with 50 fish, that's a graphics bug.
Ah, wait, something else --- we are currently using OpenGL for the WebGL rendering and D3D for layer acceleration. Once we switch to D3D for WebGL rendering via ANGLE, that will play much better with D3D layer acceleration. So expect a performance improvement in not too long. But again, 1000 fish is very JS intensive.
Fwiw, testing on more or less current trunk on Mac with GL layers, I get about 40fps on a sorta-recent laptop (though nothing actually paints).  About 32% of the time is spent in the JS engine with 1000 fish.  The rest is spent in GL code of various sorts, most prominently under DrawElements, Uniform3fv, and Uniform1 (those three are 51% between them).

So the time it takes to run the JS for 1000 fish on this machine is about 8ms (one third of the 25ms a 40Hz framerate leaves per frame).  If we're ending up at 25fps, that means 40ms per frame.  So either Csaba's machine is a lot slower than mine, or there's significant overhead in the non-JS stuff here, or both.
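
The arithmetic behind that estimate can be written out as follows. The 32% JS share and the 40 and 25 FPS figures come from the comments above. The last step assumes the JS cost per frame would be the same on both machines, which is exactly the open question, so treat the result as an illustration and not a measurement.

```python
def frame_time_ms(fps: float) -> float:
    """Time available per frame at a given framerate."""
    return 1000.0 / fps

JS_SHARE = 0.32  # share of time in the JS engine with 1000 fish, from the profile above

frame_at_40 = frame_time_ms(40)   # frame time on the Mac laptop
js_ms = JS_SHARE * frame_at_40    # JS cost per frame on that machine
frame_at_25 = frame_time_ms(25)   # frame time on the reporter's machine

# Assumption: JS costs the same per frame on both machines.
non_js_at_25 = frame_at_25 - js_ms

print(f"frame at 40 FPS: {frame_at_40:.0f} ms, of which JS: {js_ms:.0f} ms")
print(f"frame at 25 FPS: {frame_at_25:.0f} ms, leaving {non_js_at_25:.0f} ms outside JS")
```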
Ah, OK. I think I talked a bit too fast. Here I get:
  70 FPS with 50 fish
  40 FPS with 1000 fish

As for profiling, I am starting to wonder whether my profiling practices are really appropriate here. In both cases, I get above 90% of the time spent in JITted code, and the rest in the kernel. Could it simply be that all the time spent waiting on GPU rendering is attributed to the JIT code?! How then would I go about measuring the time spent waiting on the GPU?
Here's the perf output per-DSO:

[bjacob@cahouette 1000fish]$ perf report -g flat --sort dso | head -20
# Samples: 34786750860
#
# Overhead      Shared Object
# ........  .................
#
    92.77%       7f5ea521be7a
     7.21%  [kernel]         
     0.01%  [rfcomm]         
     0.01%  [nvidia]         
     0.01%  [cfg80211]

Here's the perf output per-symbol, showing the top 20 symbols:

[bjacob@cahouette 1000fish]$ perf report -g flat --sort dso,symbol | head -20
# Samples: 34786750860
#
# Overhead      Shared Object  Symbol
# ........  .................  ......
#
    92.77%       7f5ea521be7a  [.] 0x007f5ea521be7a
     0.55%  [kernel]           [k] hpet_next_event
     0.55%  [kernel]           [k] system_call
     0.53%  [kernel]           [k] audit_syscall_exit
     0.53%  [kernel]           [k] audit_syscall_entry
     0.47%  [kernel]           [k] system_call_after_swapgs
     0.39%  [kernel]           [k] pid_vnr
     0.37%  [kernel]           [k] unroll_tree_refs
     0.29%  [kernel]           [k] kfree
     0.28%  [kernel]           [k] audit_free_names
     0.25%  [kernel]           [k] sys_getpid
     0.25%  [kernel]           [k] sysret_check
     0.22%  [kernel]           [k] sysret_signal
     0.20%  [kernel]           [k] audit_get_context
     0.20%  [kernel]           [k] auditsys
(In reply to comment #21)
>
> but could you try with fewer, e.g. 50 fish ? If we're slower than Chrome with
> 50 fish, that's a graphics bug.

50 fish:

Fx4 with layers: 38 FPS
Fx4 without layers: 29 FPS
Chrome 7 dev with all HW acc: 52 FPS.

Maybe it's a graphics bug, too. 

And I have JM+TM enabled; I didn't mention that.

(In reply to comment #23)
>
> So either Csaba's machine is a lot slower
> than mine, or there's significant overhead in the non-JS stuff here, or both.

I tested on my laptop: Core2 T6600 2,2Ghz, 4GB RAM, Mob HD4330 512MB DDR2.

(In reply to comment #22)
>
> Once we switch to D3D for WebGL rendering via ANGLE, that will play much better
> with D3D layer acceleration.

I didn't know that Fx4 will use ANGLE, that's good news. Is there a tracking bug for that? Will it make it into Fx4.0?
(In reply to comment #27)
> (In reply to comment #21)
> >
> > but could you try with fewer, e.g. 50 fish ? If we're slower than Chrome with
> > 50 fish, that's a graphics bug.
> 
> 50 fish:
> 
> Fx4 with layers: 38 FPS
> Fx4 without layers: 29 FPS
> Chrome 7 dev with all HW acc: 52 FPS.
> 
> Maybe it's a graphics bug, too.

We'll need actually good profiler results to know... Boris just explained to me on IRC why my above results look strange; I had a couple of things wrong.
Okay. If you need some testing, tell! (and tell how to do it :) )
(In reply to comment #27)
> I didn't know that Fx4 will use ANGLE, that's good news. Is there a tracking bug
> for that? Will it make it into Fx4.0?

There's no tracking bug yet, and it probably won't make it into 4.0.0, although it should be done shortly thereafter. I have no idea whether we'll be allowed to backport it (the trick is to claim that it's not a feature, just a fix for Windows machines with **** GL drivers).
Thanks for the info!
blocking2.0: ? → -