Closed
Bug 594920
Opened 14 years ago
Closed 14 years ago
3D Pong spends 14.7% of its time in gfxUtils::PremultiplyImageSurface
Categories
(Core :: Graphics: CanvasWebGL, defect)
RESOLVED
DUPLICATE
of bug 580405
People
(Reporter: humph, Unassigned)
Details
(Keywords: perf)
This WebGL/processing.js demo is spending most of its time (when not in CoreGraphics stuff) in gfxUtils::PremultiplyImageSurface:

http://pointmarker.com/webgl/test2.html

  15.0%  15.0%  CoreGraphics       argb32_mark_pixelshape
  14.7%  14.7%  XUL                gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
   6.9%   6.9%  CoreGraphics       CGBlt_copyBytes
   6.8%   6.8%  libSystem.B.dylib  __bzero
   5.4%   5.4%  XUL                pixman_blt_sse2
Comment 1•14 years ago
Can you try with accelerated layers (preferences layers.accelerate_xxx) ?
Reporter
Comment 2•14 years ago
I tried changing layers.accelerate_all to true, and it was the same. I then thought I might need to restart, and my dialog box was a copy of the 3D canvas (i.e., no text or buttons, just shapes like in the canvas below). Here's a screenshot, but it doesn't show it perfectly: http://www.flickr.com/photos/11280278@N04/4975110336/ Maybe that's another bug. I'll try restarting when I figure out how to close this browser.
Reporter
Comment 3•14 years ago
On restart it's worse: http://www.flickr.com/photos/11280278@N04/4974511199/ I either have this pref done incorrectly, or it's very, very broken. Benoit, can I have my browser back please?
Comment 4•14 years ago
Can you try running in safe mode (-safe-mode), going to about:config, and restoring this pref to its default? Sorry for making you do dangerous things. Yes, GL layers are... not ready yet :-)
Comment 5•14 years ago
I seem to recall that Jason's filed some bugs about speeding up the premultiplies here recently...
Comment 6•14 years ago
Indeed I have. I'll take a look.
Comment 7•14 years ago
gfxUtils::(Un)PremultiplyImageSurface is copy-pasted from the premultiply code I wrote for canvas. I don't immediately have any ideas about speeding this up further. Maybe derf does?
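[Editor's note: for readers unfamiliar with the operation being discussed, here is a minimal scalar sketch of what a premultiply pass over an ARGB32 surface does. The function name and the rounded `(c * a + 127) / 255` formulation are illustrative, not taken from the gfxUtils code.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of premultiplication: scale each color
   channel by the pixel's alpha (round(c * a / 255)), leaving alpha
   itself unchanged. Assumes 32-bit pixels with alpha in the top byte,
   as in Cairo's ARGB32 format. */
static void premultiply_argb32(uint32_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = pixels[i];
        uint32_t a = p >> 24;
        uint32_t r = (p >> 16) & 0xFF;
        uint32_t g = (p >> 8) & 0xFF;
        uint32_t b = p & 0xFF;
        /* (c * a + 127) / 255 is division by 255 with round-to-nearest */
        r = (r * a + 127) / 255;
        g = (g * a + 127) / 255;
        b = (b * a + 127) / 255;
        pixels[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The per-pixel multiplies and divides are what a lookup table or SIMD path would replace.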
Comment 8•14 years ago
A sketch in SSE2 (off the top of my head, untested):

  pxor      xmm7,xmm7
  mov       xmm6,0x80808080808080808080808080808080
  mov       xmm5,0x00FE00FE00FE00FE00FE00FE00FE00FE
  mov       xmm4,0xFF000000FF000000FF000000FF000000
  movq      xmm0,(src)
  movq      xmm1,(src+8)
  punpcklbw xmm0,xmm7
  punpcklbw xmm1,xmm7
  pshuflw   xmm2,xmm0,0xFF
  pshufhw   xmm2,xmm0,0xFF
  pmullw    xmm2,xmm0
  pshuflw   xmm3,xmm1,0xFF
  pshufhw   xmm3,xmm1,0xFF
  pmullw    xmm3,xmm1
  packuswb  xmm0,xmm1
  pand      xmm0,xmm4
  paddw     xmm2,xmm5
  paddw     xmm3,xmm5
  pmulhuw   xmm2,xmm6
  pmulhuw   xmm3,xmm6
  movq      xmm1,xmm4
  psrlw     xmm2,$7
  psrlw     xmm3,$7
  packuswb  xmm2,xmm3
  pandn     xmm1,xmm2
  movdqa    (dst),xmm1   (or movntdq)

That's 22 operations for 4 RGBA pixels, or 1.375/byte. The existing code requires at least 2.75 operations per byte (all of them loads/stores), and kills the L1 cache with the 64K lookup table, so this has a decent chance of being faster, even accounting for Core 2's slow shuffle unit.
Comment 9•14 years ago
I was going to say the same thing as Tim, but he mid-aired me.
Comment 10•14 years ago
It's actually missing a por xmm1,xmm0 at the end before the store to actually put the original alpha values back. So 1.4375 ops/byte.
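[Editor's note: the multiply-and-shift sequences in the sketch above are approximations of scalar division by 255. For reference, here is a classic exact scalar replacement for that division, of the kind widely used in pixel-blending code; the function name and test harness are illustrative, not from the tree.]

```c
#include <stdint.h>

/* Exact round-to-nearest division by 255 for t = c * a, with c and a
   in [0, 255], using only an add and two shift-adds: the quantity
   v + (v >> 8), with v = t + 128, lands in the right bucket for every
   16-bit product of two bytes. */
static inline uint8_t div255_round(uint16_t t)
{
    uint32_t v = (uint32_t)t + 128;
    return (uint8_t)((v + (v >> 8)) >> 8);
}
```

This avoids both a hardware divide and a lookup table, which is what makes the SIMD variants above attractive.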
Reporter
Comment 11•14 years ago
Just spent a few hours tonight getting CubicVR on trace, only to once again hit this bug. Here's a totally different WebGL example which is gated on gfx::PremultiplyImageSurface (14.2% for me): http://cubicvr.org/CubicVR.js/LightingTestMulti.html
Comment 12•14 years ago
This will all be fixed once we have properly working OpenGL-accelerated layers. That is the real solution, much more so than an SSE-optimized software path!
Comment 13•14 years ago
Even if SSE were 2x faster than what we have now, we'd still be spending way too long premultiplying. So if you can do this off the CPU, that's awesome.
Comment 14•14 years ago
Confirming that the problem is fixed by OpenGL layers, here on Linux x86-64. Thus, marking as RESOLVED DUPLICATE of the meta-bug for OpenGL layers on OSX: bug 580405.

Indeed, here are the results of a profiler run on Linux x86-64 with the NVIDIA driver, using 'perf', with OpenGL layers enabled (preference layers.accelerate-all). The profiler was attached to a Firefox already running the scene, so it's not polluted by startup. Here 3000 is the PID.

  91.42%  :3000  386ece5e60     [.] 0x0000386ece5e60
   1.61%  :3000  perf-3000.map  [.] 0x007f1a94315096
   0.64%  :3000  [kernel]       [k] hpet_next_event
   0.35%  :3000  [kernel]       [k] audit_syscall_entry
   0.24%  :3000  [kernel]       [k] system_call
   0.23%  :3000  [kernel]       [k] audit_syscall_exit
   0.15%  :3000  [kernel]       [k] pid_vnr
   0.15%  :3000  [kernel]       [k] sysret_check
   0.14%  :3000  [kernel]       [k] avc_has_perm_noaudit
   0.14%  :3000  [kernel]       [k] sys_getpid
   0.14%  :3000  [kernel]       [k] get_page_from_freelist
   0.13%  :3000  [kernel]       [k] system_call_after_swapgs
   0.13%  :3000  [kernel]       [k] fget_light
   0.12%  :3000  [kernel]       [k] audit_get_context
   0.11%  :3000  [kernel]       [k] unroll_tree_refs
   0.10%  :3000  [kernel]       [k] handle_mm_fault
  (Other symbols under 0.1%)

Interpretation: the top symbol at 91% is JITted JavaScript. The next symbol is profiler overhead. The subsequent symbols are the Linux kernel, with hpet_next_event being a profiler-related function. Thus we are in the absolute dream situation: all the time is spent in JITted code!!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Comment 15•14 years ago
Sorry, this perf-3000.map taking 1.61% is actually the Firefox code. Indeed, I ran this against a nightly build that doesn't have symbols. Here's the table per DSO:

  # Overhead  Shared Object
  # ........  ..............
    91.42%    386ece5e60
     6.77%    [kernel]
     1.61%    perf-3000.map
     0.11%    [nvidia]
     0.09%    [sdhci]

So the big one is JITted JS, the second one is Linux (I admit to be surprised to see 6.77% time spent in the kernel), and the third one is Firefox itself. Anyhow, we see that the performance problem is fixed by GL layers.
Comment 16•14 years ago
> admit to be surprised to see 6.77% time spent in the kernel
Is there a lot of memory churn going on, perhaps?
Comment 17•14 years ago
(In reply to comment #16)
> > admit to be surprised to see 6.77% time spent in the kernel
>
> Is there a lot of memory churn going on, perhaps?

It's very possible. We know we have room for improvement in this area in WebGL (some WebGL funcs cause temporary buffers to be heap-allocated and then deleted every time), so I'll make a patch fixing the obvious issues and see if that makes a difference.
Comment 18•14 years ago
But wait... this is a nightly build and should thus be using JEMalloc, right? Isn't JEMalloc supposed to maintain its own memory pool so that most malloc/free calls, in case of heavy churn, should just never be seen at all by the kernel?
Comment 19•14 years ago
That would be true for any allocator, but depending on the size of the allocations, there could be mmap/munmap traffic and the like. That's how you get into the kernel with allocators...
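[Editor's note: a hypothetical sketch of the pattern comment 19 describes. Allocators recycle small blocks from a user-space pool, but requests above a threshold (glibc's default M_MMAP_THRESHOLD is 128 KiB) are typically obtained with mmap and returned with munmap on free, so heavy churn of large temporary buffers shows up as kernel time under perf. The sizes and function name here are illustrative.]

```c
#include <stdlib.h>
#include <stddef.h>

/* Repeatedly allocate and free a buffer of the given size; returns the
   number of successful allocations. With a small size, the allocator
   usually satisfies every request from its own pool (no syscalls after
   warm-up); past the mmap threshold, each malloc/free pair is likely to
   become an mmap/munmap pair, i.e. two kernel round-trips. */
static int churn(size_t sz, int iters)
{
    int ok = 0;
    for (int i = 0; i < iters; i++) {
        void *p = malloc(sz);
        if (p) {
            free(p);
            ok++;
        }
    }
    return ok;
}
```

Running such a loop under perf with, say, 4 KiB versus 4 MiB buffers makes the difference in kernel-side overhead directly visible.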
Comment 20•14 years ago
If you build yourself with --enable-profiling (bug 592923), you might be able to get more information out of perf.