Closed
Bug 594920
Opened 14 years ago
Closed 14 years ago
3D Pong spends 14.7% of its time in gfxUtils::PremultiplyImageSurface
Categories
(Core :: Graphics: CanvasWebGL, defect)
RESOLVED
DUPLICATE
of bug 580405
People
(Reporter: humph, Unassigned)
Details
(Keywords: perf)
This WebGL/processing.js demo is spending most of its time (when not in CoreGraphics stuff) in gfxUtils::PremultiplyImageSurface:

http://pointmarker.com/webgl/test2.html

  15.0%  15.0%  CoreGraphics       argb32_mark_pixelshape
  14.7%  14.7%  XUL                gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
   6.9%   6.9%  CoreGraphics       CGBlt_copyBytes
   6.8%   6.8%  libSystem.B.dylib  __bzero
   5.4%   5.4%  XUL                pixman_blt_sse2
Comment 1•14 years ago
Can you try with accelerated layers (preferences layers.accelerate_xxx) ?
Reporter
Comment 2•14 years ago
I tried changing layers.accelerate_all to true, and it was the same. I then thought I might need to restart, and my dialog box was a copy of the 3D canvas (i.e., no text or buttons, just shapes like in the canvas below). Here's a screenshot, but it doesn't show it perfectly: http://www.flickr.com/photos/11280278@N04/4975110336/ Maybe that's another bug. I'll try restarting when I figure out how to close this browser.
Reporter
Comment 3•14 years ago
On restart it's worse: http://www.flickr.com/photos/11280278@N04/4974511199/ I either have this pref done incorrectly, or it's very, very broken. Benoit, can I have my browser back please?
Comment 4•14 years ago
Can you try running in safe mode (-safe-mode), going to about:config, and restoring this pref to its default? Sorry for making you do dangerous things. Yes, GL layers are... not ready yet :-)
Comment 5•14 years ago
I seem to recall that Jason's filed some bugs about speeding up the premultiplies here recently...
Comment 6•14 years ago
Indeed I have. I'll take a look.
Comment 7•14 years ago
gfxUtils::(Un)PremultiplyImageSurface is copy-pasted from the premultiply code I wrote for canvas. I don't immediately have any ideas about speeding this up further. Maybe derf does?
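[Editor's note: for readers unfamiliar with the operation being discussed, here is a minimal scalar sketch of what a premultiply pass over an ARGB32 surface does. The function name and the rounded `(c * a + 127) / 255` formulation are illustrative, not taken from the gfxUtils code.]

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar sketch of premultiplication: scale each color
   channel by the pixel's alpha (round(c * a / 255)), leaving alpha
   itself unchanged. Assumes 32-bit pixels with alpha in the top byte,
   as in Cairo's ARGB32 format. */
static void premultiply_argb32(uint32_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t p = pixels[i];
        uint32_t a = p >> 24;
        uint32_t r = (p >> 16) & 0xFF;
        uint32_t g = (p >> 8) & 0xFF;
        uint32_t b = p & 0xFF;
        /* (c * a + 127) / 255 is division by 255 with round-to-nearest */
        r = (r * a + 127) / 255;
        g = (g * a + 127) / 255;
        b = (b * a + 127) / 255;
        pixels[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The per-pixel multiplies and divides are what a lookup table or SIMD path would replace.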
Comment 8•14 years ago
A sketch in SSE2 (off the top of my head, untested):

  pxor      xmm7,xmm7
  mov       xmm6,0x80808080808080808080808080808080
  mov       xmm5,0x00FE00FE00FE00FE00FE00FE00FE00FE
  mov       xmm4,0xFF000000FF000000FF000000FF000000
  movq      xmm0,(src)
  movq      xmm1,(src+8)
  punpcklbw xmm0,xmm7
  punpcklbw xmm1,xmm7
  pshuflw   xmm2,xmm0,0xFF
  pshufhw   xmm2,xmm0,0xFF
  pmullw    xmm2,xmm0
  pshuflw   xmm3,xmm1,0xFF
  pshufhw   xmm3,xmm1,0xFF
  pmullw    xmm3,xmm1
  packuswb  xmm0,xmm1
  pand      xmm0,xmm4
  paddw     xmm2,xmm5
  paddw     xmm3,xmm5
  pmulhuw   xmm2,xmm6
  pmulhuw   xmm3,xmm6
  movq      xmm1,xmm4
  psrlw     xmm2,$7
  psrlw     xmm3,$7
  packuswb  xmm2,xmm3
  pandn     xmm1,xmm2
  movdqa    (dst),xmm1   (or movntdq)

That's 22 operations for 4 RGBA pixels, or 1.375/byte. The existing code requires at least 2.75 operations per byte (all of them loads/stores), and kills the L1 cache with the 64K lookup table, so this has a decent chance of being faster, even accounting for Core 2's slow shuffle unit.
Comment 9•14 years ago
I was going to say the same thing as Tim, but he mid-aired me.
Comment 10•14 years ago
It's actually missing a por xmm1,xmm0 at the end before the store to actually put the original alpha values back. So 1.4375 ops/byte.
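[Editor's note: the multiply-and-shift sequences in the sketch above are approximations of scalar division by 255. For reference, here is a classic exact scalar replacement for that division, of the kind widely used in pixel-blending code; the function name and test harness are illustrative, not from the tree.]

```c
#include <stdint.h>

/* Exact round-to-nearest division by 255 for t = c * a, with c and a
   in [0, 255], using only an add and two shift-adds: the quantity
   v + (v >> 8), with v = t + 128, lands in the right bucket for every
   16-bit product of two bytes. */
static inline uint8_t div255_round(uint16_t t)
{
    uint32_t v = (uint32_t)t + 128;
    return (uint8_t)((v + (v >> 8)) >> 8);
}
```

This avoids both a hardware divide and a lookup table, which is what makes the SIMD variants above attractive.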
Reporter
Comment 11•14 years ago
Just spent a few hours tonight getting CubicVR on trace, only to once again hit this bug. Here's a totally different WebGL example which is gated on gfx::PremultiplyImageSurface (14.2% for me): http://cubicvr.org/CubicVR.js/LightingTestMulti.html
Comment 12•14 years ago
This will all be fixed once we have properly working OpenGL-accelerated layers. That is the real solution, much more so than an SSE-optimized software path!
Comment 13•14 years ago
Even if SSE were 2x faster than what we have now, we'd still be spending way too long premultiplying. So if you can do this off the CPU, that's awesome.
Comment 14•14 years ago
Confirming that the problem is fixed by OpenGL layers, here on Linux x86-64. Thus, marking as RESOLVED DUPLICATE of the meta-bug for OpenGL layers on OSX: bug 580405.

Indeed, here are the results of a profiler run on Linux x86-64 with the NVIDIA driver, using 'perf', with OpenGL layers enabled (preference layers.accelerate-all). The profiler was attached to a Firefox already running the scene, so it's not polluted by startup. Here 3000 is the PID.

  91.42%  :3000  386ece5e60     [.] 0x0000386ece5e60
   1.61%  :3000  perf-3000.map  [.] 0x007f1a94315096
   0.64%  :3000  [kernel]       [k] hpet_next_event
   0.35%  :3000  [kernel]       [k] audit_syscall_entry
   0.24%  :3000  [kernel]       [k] system_call
   0.23%  :3000  [kernel]       [k] audit_syscall_exit
   0.15%  :3000  [kernel]       [k] pid_vnr
   0.15%  :3000  [kernel]       [k] sysret_check
   0.14%  :3000  [kernel]       [k] avc_has_perm_noaudit
   0.14%  :3000  [kernel]       [k] sys_getpid
   0.14%  :3000  [kernel]       [k] get_page_from_freelist
   0.13%  :3000  [kernel]       [k] system_call_after_swapgs
   0.13%  :3000  [kernel]       [k] fget_light
   0.12%  :3000  [kernel]       [k] audit_get_context
   0.11%  :3000  [kernel]       [k] unroll_tree_refs
   0.10%  :3000  [kernel]       [k] handle_mm_fault
  (Other symbols under 0.1%)

Interpretation: the top symbol at 91% is JITted JavaScript. The next symbol is profiler overhead. The subsequent symbols are the Linux kernel, with hpet_next_event being a profiler-related function. Thus we are in the absolute dream situation: all the time is spent in JITted code!!
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Comment 15•14 years ago
Sorry, this perf-3000.map taking 1.61% is actually the Firefox code. Indeed, I ran this against a nightly build that doesn't have symbols. Here's the table per DSO:

  # Overhead  Shared Object
  # ........  ..............
    91.42%    386ece5e60
     6.77%    [kernel]
     1.61%    perf-3000.map
     0.11%    [nvidia]
     0.09%    [sdhci]

So the big one is JITted JS, the second one is Linux (I admit to be surprised to see 6.77% time spent in the kernel), and the third one is Firefox itself. Anyhow, we see that the performance problem is fixed by GL layers.
Comment 16•14 years ago
> admit to be surprised to see 6.77% time spent in the kernel
Is there a lot of memory churn going on, perhaps?
Comment 17•14 years ago
(In reply to comment #16)
> > admit to be surprised to see 6.77% time spent in the kernel
>
> Is there a lot of memory churn going on, perhaps?

It's very possible. We know we have room for improvement in this area in WebGL (some WebGL funcs cause temporary buffers to be heap-allocated and then deleted every time), so I'll make a patch fixing the obvious issues and see if that makes a difference.
Comment 18•14 years ago
But wait... this is a nightly build and should thus be using JEMalloc, right? Isn't JEMalloc supposed to maintain its own memory pool so that most malloc/free calls, in case of heavy churn, should just never be seen at all by the kernel?
Comment 19•14 years ago
That would be true for any allocator, but depending on the size of the allocations, there could be mmap/munmap traffic and the like. That's how you get into the kernel with allocators...
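[Editor's note: a hypothetical sketch of the pattern comment 19 describes. Allocators recycle small blocks from a user-space pool, but requests above a threshold (glibc's default M_MMAP_THRESHOLD is 128 KiB) are typically obtained with mmap and returned with munmap on free, so heavy churn of large temporary buffers shows up as kernel time under perf. The sizes and function name here are illustrative.]

```c
#include <stdlib.h>
#include <stddef.h>

/* Repeatedly allocate and free a buffer of the given size; returns the
   number of successful allocations. With a small size, the allocator
   usually satisfies every request from its own pool (no syscalls after
   warm-up); past the mmap threshold, each malloc/free pair is likely to
   become an mmap/munmap pair, i.e. two kernel round-trips. */
static int churn(size_t sz, int iters)
{
    int ok = 0;
    for (int i = 0; i < iters; i++) {
        void *p = malloc(sz);
        if (p) {
            free(p);
            ok++;
        }
    }
    return ok;
}
```

Running such a loop under perf with, say, 4 KiB versus 4 MiB buffers makes the difference in kernel-side overhead directly visible.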
Comment 20•14 years ago
If you build yourself with --enable-profiling (bug 592923), you might be able to get more information out of perf.