This WebGL/processing.js demo is spending most of its time (when not in CoreGraphics stuff) in gfxUtils::PremultiplyImageSurface: http://pointmarker.com/webgl/test2.html

15.0%  15.0%  CoreGraphics      argb32_mark_pixelshape
14.7%  14.7%  XUL               gfxUtils::PremultiplyImageSurface(gfxImageSurface*, gfxImageSurface*)
 6.9%   6.9%  CoreGraphics      CGBlt_copyBytes
 6.8%   6.8%  libSystem.B.dylib __bzero
 5.4%   5.4%  XUL               pixman_blt_sse2
Can you try with accelerated layers (the layers.accelerate_xxx preferences)?
I tried changing layers.accelerate_all to true, and it was the same. I then thought I might need to restart, and my dialog box was a copy of the 3D canvas (i.e., no text or buttons, just shapes like in the canvas below). Here's a screenshot, but it doesn't show it perfectly: http://www.flickr.com/photos/11280278@N04/4975110336/ Maybe that's another bug. I'll try restarting when I figure out how to close this browser.
On restart it's worse: http://www.flickr.com/photos/11280278@N04/4974511199/ I either have this pref done incorrectly, or it's very, very broken. Benoit, can I have my browser back please?
Can you try running in safe mode (-safe-mode), going to about:config, and restoring this pref to its default? Sorry for making you do dangerous things. Yes, GL layers are... not ready yet :-)
I seem to recall that Jason's filed some bugs about speeding up the premultiplies here recently...
Indeed I have. I'll take a look.
gfxUtils::(Un)PremultiplyImageSurface is copy-pasted from the premultiply code I wrote for canvas. I don't immediately have any ideas about speeding this up further. Maybe derf does?
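For context, the scalar per-pixel work being profiled is roughly of this shape (an illustrative sketch, not the actual gfxUtils/canvas code, and the function name is made up; the shipping code avoids the division with a precomputed lookup table):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of scalar alpha premultiplication (not the actual
 * gfxUtils code): scale each color channel by alpha/255, rounding to the
 * nearest integer.  Alpha is the top byte of each little-endian 32-bit
 * RGBA pixel, i.e. byte 3 of every 4-byte group. */
static void premultiply_scalar(const uint8_t *src, uint8_t *dst,
                               size_t npixels)
{
    for (size_t i = 0; i < npixels; i++) {
        uint8_t a = src[4 * i + 3];
        for (int c = 0; c < 3; c++) {
            /* (x*a + 127) / 255 rounds x*a/255 to nearest */
            dst[4 * i + c] = (uint8_t)((src[4 * i + c] * a + 127) / 255);
        }
        dst[4 * i + 3] = a; /* alpha passes through unchanged */
    }
}
```

Replacing the multiply-and-divide with a table indexed by (alpha, color) removes the arithmetic but leaves the loop dominated by loads and stores, which is the cost profile described in the next comment.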
A sketch in SSE2 (off the top of my head, untested):

pxor      xmm7,xmm7
mov       xmm6,0x80808080808080808080808080808080
mov       xmm5,0x00FE00FE00FE00FE00FE00FE00FE00FE
mov       xmm4,0xFF000000FF000000FF000000FF000000
movq      xmm0,(src)
movq      xmm1,(src+8)
punpcklbw xmm0,xmm7
punpcklbw xmm1,xmm7
pshuflw   xmm2,xmm0,0xFF
pshufhw   xmm2,xmm0,0xFF
pmullw    xmm2,xmm0
pshuflw   xmm3,xmm1,0xFF
pshufhw   xmm3,xmm1,0xFF
pmullw    xmm3,xmm1
packuswb  xmm0,xmm1
pand      xmm0,xmm4
paddw     xmm2,xmm5
paddw     xmm3,xmm5
pmulhuw   xmm2,xmm6
pmulhuw   xmm3,xmm6
movq      xmm1,xmm4
psrlw     xmm2,$7
psrlw     xmm3,$7
packuswb  xmm2,xmm3
pandn     xmm1,xmm2
movdqa    (dst),xmm1  (or movntdq)

That's 22 operations for 4 RGBA pixels, or 1.375/byte. The existing code requires at least 2.75 operations per byte (all of them loads/stores), and kills the L1 cache with the 64K lookup table, so this has a decent chance of being faster, even accounting for Core2's slow shuffle unit.
I was going to say the same thing as Tim, but he mid-aired me.
It's actually missing a por xmm1,xmm0 at the end before the store to actually put the original alpha values back. So 1.4375 ops/byte.
Just spent a few hours tonight getting CubicVR on trace, only to once again hit this bug. Here's a totally different WebGL example which is gated on gfx::PremultiplyImageSurface (14.2% for me): http://cubicvr.org/CubicVR.js/LightingTestMulti.html
This shall all be fixed once we have properly working OpenGL-accelerated layers. That is the real solution, much more so than an SSE-optimized software path!
Even if SSE were 2x faster than what we have now, we'd still be spending way too long premultiplying. So if you can do this off the CPU, that's awesome.
Sorry, this perf-3000.map taking 1.61% is actually the Firefox code. Indeed, I ran this against a nightly build that doesn't have symbols. Here's the table per DSO:

# Overhead  Shared Object
# ........  ..................
#
   91.42%  386ece5e60
    6.77%  [kernel]
    1.61%  perf-3000.map
    0.11%  [nvidia]
    0.09%  [sdhci]

So the big one is JITted JS, the second one is Linux (I admit to be surprised to see 6.77% time spent in the kernel), and the third one is Firefox itself. Anyhow, we see that the performance problem is fixed by GL layers.
> admit to be surprised to see 6.77% time spent in the kernel Is there a lot of memory churn going on, perhaps?
(In reply to comment #16) > > admit to be surprised to see 6.77% time spent in the kernel > > Is there a lot of memory churn going on, perhaps? It's very possible. We know we have room for improvement in this area in WebGL (some WebGL funcs cause temporary buffers to be heap-allocated and then deleted every time), so I'll make a patch fixing the obvious issues and see if that makes a difference.
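The churn fix can be as simple as keeping one growable scratch buffer that is reused across calls, instead of heap-allocating and freeing a temporary on every call. A minimal sketch of that pattern (all names hypothetical, not the actual WebGL patch):

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical scratch-buffer pattern: grow-only storage reused across
 * calls, so steady-state use generates no malloc/free traffic at all
 * (and hence no mmap/munmap trips into the kernel). */
typedef struct {
    void  *data;
    size_t capacity;
} ScratchBuffer;

/* Return at least `size` bytes of reusable storage, or NULL on OOM. */
static void *scratch_ensure(ScratchBuffer *buf, size_t size)
{
    if (size > buf->capacity) {
        /* Grow geometrically so a sequence of slightly larger requests
         * doesn't reallocate every time. */
        size_t newcap = buf->capacity ? buf->capacity : 64;
        while (newcap < size)
            newcap *= 2;
        void *p = realloc(buf->data, newcap);
        if (!p)
            return NULL;
        buf->data = p;
        buf->capacity = newcap;
    }
    return buf->data;
}
```

As long as requests stay within the high-water mark, every call returns the same storage with no allocator involvement at all.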
But wait... this is a nightly build and should thus be using JEMalloc, right? Isn't JEMalloc supposed to maintain its own memory pool so that most malloc/free calls, in case of heavy churn, should just never be seen at all by the kernel?
That would be true for any allocator, but depending on the size of the allocations, there could be mmap/munmap traffic and the like. That's how you get into the kernel with allocators...
If you build yourself with --enable-profiling (bug 592923), you might be able to get more information out of perf.