Closed Bug 594920 Opened 9 years ago Closed 9 years ago

3D Pong spends 14.7% of its time in gfxUtils::PremultiplyImageSurface


(Core :: Canvas: WebGL, defect)

Not set





(Reporter: humph, Unassigned)




(Keywords: perf)

This WebGL/processing.js demo is spending most of its time (when not in CoreGraphics stuff) in gfxUtils::PremultiplyImageSurface

15.0%	15.0%	CoreGraphics	argb32_mark_pixelshape
14.7%	14.7%	XUL	gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
6.9%	6.9%	CoreGraphics	CGBlt_copyBytes
6.8%	6.8%	libSystem.B.dylib	__bzero
5.4%	5.4%	XUL	pixman_blt_sse2
Can you try with accelerated layers (preferences layers.accelerate_xxx) ?
I tried changing layers.accelerate_all to true, and it was the same.  I then thought I might need to restart, and my dialog box was a copy of the 3D canvas (i.e., no text or buttons, just shapes like in the canvas below).  Here's a screenshot, but it doesn't show it perfectly:

Maybe that's another bug.  I'll try restarting when I figure out how to close this browser.
On restart it's worse:

I either have this pref done incorrectly, or it's very, very broken.  Benoit, can I have my browser back please?
Can you try running in safe mode ( -safe-mode), go to about:config and restore this pref to default?

Sorry for making you do dangerous things. Yes, GL layers are... not ready yet :-)
I seem to recall that Jason's filed some bugs about speeding up the premultiplies here recently...
Indeed I have.  I'll take a look.
gfxUtils::(Un)PremultiplyImageSurface is copy-pasted from the premultiply code I wrote for canvas.

I don't immediately have any ideas about speeding this up further.  Maybe derf does?
A sketch in SSE2 (off the top of my head, untested):

pxor xmm7,xmm7
mov xmm6,0x80808080808080808080808080808080
mov xmm5,0x00FE00FE00FE00FE00FE00FE00FE00FE
mov xmm4,0xFF000000FF000000FF000000FF000000

movq xmm0,(src)
movq xmm1,(src+8)
punpcklbw xmm0,xmm7
punpcklbw xmm1,xmm7
pshuflw xmm2,xmm0,0xFF
pshufhw xmm2,xmm0,0xFF
pmullw xmm2,xmm0
pshuflw xmm3,xmm1,0xFF
pshufhw xmm3,xmm1,0xFF
pmullw xmm3,xmm1
packuswb xmm0,xmm1
pand xmm0,xmm4
paddw xmm2,xmm5
paddw xmm3,xmm5
pmulhuw xmm2,xmm6
pmulhuw xmm3,xmm6
movq xmm1,xmm4
psrlw xmm2,$7
psrlw xmm3,$7
packuswb xmm2,xmm3
pandn xmm1,xmm2
movdqa (dst),xmm1 (or movntdq)

That's 22 operations for 4 RGBA pixels, or 1.375/byte. The existing code requires at least 2.75 operations per byte (all of them load/stores), and kills L1 cache with the 64K lookup table, so this has a decent chance of being faster, even accounting for Core2's slow shuffle unit.
I was going to say the same thing as Tim, but he mid-aired me.
It's actually missing a por xmm1,xmm0 at the end before the store to actually put the original alpha values back. So 1.4375 ops/byte.
Just spent a few hours tonight getting CubicVR on trace, only to once again hit this bug.  Here's a totally different WebGL example which is gated on gfx::PremultiplyImageSurface (14.2% for me):
This shall all be fixed once we have properly working OpenGL accelerated layers. This is the real solution, more more than a SSE-optimized software path!
Even if SSE were 2x faster than what we have now, we'd still be spending way too long premultiplying.  So if you can do this off the CPU, that's awesome.
Confirming that the problem is fixed by OpenGL layers, here on linux x86-64. Thus, marking as RESOLVED DUPLICATE or the meta-bug for OpenGL layers on OSX: bug 580405.

Indeed, here are the results of a profiler runs on linux x86-64 with NVIDIA driver, using 'perf', with OpenGL layers enabled (preference layers.accelerate-all). The profiler has been attached to a firefox already running the scene, so it's not polluted by startup. Here 3000 is the PID.

    91.42%    :3000          386ece5e60  [.] 0x0000386ece5e60
     1.61%    :3000       [.] 0x007f1a94315096
     0.64%    :3000  [kernel]            [k] hpet_next_event
     0.35%    :3000  [kernel]            [k] audit_syscall_entry
     0.24%    :3000  [kernel]            [k] system_call
     0.23%    :3000  [kernel]            [k] audit_syscall_exit
     0.15%    :3000  [kernel]            [k] pid_vnr
     0.15%    :3000  [kernel]            [k] sysret_check
     0.14%    :3000  [kernel]            [k] avc_has_perm_noaudit
     0.14%    :3000  [kernel]            [k] sys_getpid
     0.14%    :3000  [kernel]            [k] get_page_from_freelist
     0.13%    :3000  [kernel]            [k] system_call_after_swapgs
     0.13%    :3000  [kernel]            [k] fget_light
     0.12%    :3000  [kernel]            [k] audit_get_context
     0.11%    :3000  [kernel]            [k] unroll_tree_refs
     0.10%    :3000  [kernel]            [k] handle_mm_fault

(Other symbols under 0.1%)

Interpretation: the top symbol at 91% is JITted javascript. The next symbol is profiler overhead. The subsequent symbols are the linux kernel, with hpet_next_event being a profiler-related function. Thus we are in the absolutely dream situation: all the time spent is spent in JITted code!!
Closed: 9 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: ogl-osx-beta
Sorry, this taking 1.61% is actually the Firefox code. Indeed I ran this against a nightly build that doesn't have symbols. Here's the table per DSO:

# Overhead       Shared Object
# ........  ..................
    91.42%          386ece5e60
     6.77%  [kernel]          
     0.11%  [nvidia]          
     0.09%  [sdhci]

So the big one is JITted JS, the second one is linux (I admit to be surprised to see 6.77% time spent in the kernel), the third one is Firefox itself. Anyhow, we see that the performance problem is fixed by GL layers.
> admit to be surprised to see 6.77% time spent in the kernel

Is there a lot of memory churn going on, perhaps?
(In reply to comment #16)
> > admit to be surprised to see 6.77% time spent in the kernel
> Is there a lot of memory churn going on, perhaps?

It's very possible. We know we have room for improvement in this area in WebGL (some WebGL funcs cause temporary buffers to be heap allocated and then deleted everytime), so I'll make a patch fixing the obvious issues and see if that makes a difference.
But wait... this is a nightly build and should thus be using JEMalloc, right? Isn't JEMalloc supposed to maintain its own memory pool so that most malloc/free calls, in case of heavy churn, should just never be seen at all by the kernel?
That would be true for any allocator, but depending on the size of the allocations, there could be mmap/munmap traffic and the like.  That's how you get into the kernel with allocators...
If you build yourself with --enable-profiling (bug 592923), you might be able to get more information out of perf.
You need to log in before you can comment on or make changes to this bug.