Closed Bug 341645 Opened 18 years ago Closed 18 years ago

Special-case to avoid memcpy() calls

Categories: Core Graveyard :: GFX (defect)
Tracking: Not tracked
Status: RESOLVED WONTFIX
People: Reporter: swsnyder; Assignee: Unassigned
Keywords: perf
Attachments: 1 obsolete file
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.4) Gecko/20060527 SUSE/1.5.0.4-1.3 Firefox/1.5.0.4
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.4) Gecko/20060527 SUSE/1.5.0.4-1.3 Firefox/1.5.0.4

The functions SetImageData() and SetAlphaData() are often called with 4 bytes to copy. A performance gain can be had by doing these copies in a single instruction rather than calling memcpy().

Reproducible: Always

Steps to Reproduce:
1. Load a page containing graphics.

Actual Results:
Lots of code executed unnecessarily.

Expected Results:
The code should recognize common data sizes.

The two functions noted above call memcpy() for any non-zero quantity of graphics data to copy. When I load the pages of the 10 most popular websites (http://www.alexa.com/site/ds/top_sites?cc=US&ts_mode=country&lang=none), I see that 13% of the calls to memcpy() are to copy only 4 bytes of data. A 4-byte copy can be done in a single instruction. When building Mozilla with inline memcpy() (-O2 optimization with MSVC compilers), the inline code takes many more CPU cycles than the simple compare-and-move I am suggesting. The savings in execution time should be even starker when building with -O1 (no inlining of standard functions), as the change avoids the overhead of passing parameters on the stack.
Reporter | Comment 1 • 18 years ago
Reporter | Comment 2 • 18 years ago
Going to all of those top 10 most popular websites with SeaMonkey v1.0.2 involves a combined 25344 calls to memcpy() from SetImageData() and SetAlphaData(). Those calls break down like this:

    38407 mticks, sizeKB = 7510, 4byte = 3145, small = 5328, large = 16871

where the number of bytes to copy is:

    4byte = (data == 4 bytes)
    small = (4 < data < 64)
    large = (data >= 64)

and sizeKB is the total amount of data (expressed in KBytes) run through memcpy() for these calls. The exact numbers vary slightly from run to run because the advertisements shown on the pages vary. Roughly 13% of the calls to memcpy() are for 4-byte blocks. FYI.
Comment 3 • 18 years ago
==> GFX
Assignee: general → nobody
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: perf
Product: Mozilla Application Suite → Core
QA Contact: general → general
Version: unspecified → Trunk
Updated • 18 years ago
Component: General → GFX
Updated • 18 years ago
Assignee: nobody → general
QA Contact: general → ian
Comment 4 • 18 years ago
Steve, do you realize that this would only work on properly aligned data? This type of operation can be very slow on many processors, and is even forbidden on some. I've had algorithms run 10 times slower because of misalignment (on Alpha). That's the reason the compiler didn't take the obvious optimization. Did you test only on Intel or AMD? While this would work on some processors (mainly Intel), it can't be used on all.
Comment 5 • 18 years ago

memcpy is usually quite fast; I'd be surprised if you can do much better. Did you actually measure the performance of these changes?
Reporter | Comment 6 • 18 years ago
Jo,

Hmmm. I hadn't considered non-x86 processors. Alignment (on x86) isn't an issue. The memcpy() also copies the data 4 bytes at a time, without testing alignment, but it is much less efficient because it is generalized for any amount of data. Here is the code generated by VS2003/MSVC 7.1, compiled at -O2 to get the inlined memcpy():

            if (aLength == 4)
    01C69E5B cmp  esi,4
    01C69E5E jne  gfxImageFrame::SetAlphaData+0AAh (1C69E6Ah)
                *((PRUint32 *)(alphaData + offset)) = *((PRUint32 *)aData);
    01C69E60 mov  ecx,dword ptr [ecx]
    01C69E62 mov  edx,dword ptr [this]
    01C69E65 mov  dword ptr [edi+edx],ecx
            else
    01C69E68 jmp  gfxImageFrame::SetAlphaData+0C5h (1C69E85h)
                memcpy(alphaData + offset, aData, aLength);
    01C69E6A mov  eax,dword ptr [this]
    01C69E6D mov  ecx,esi
    01C69E6F mov  esi,dword ptr [aData]
    01C69E72 mov  edx,ecx
    01C69E74 shr  ecx,2
    01C69E77 add  edi,eax
    01C69E79 rep movs dword ptr [edi],dword ptr [esi]
    01C69E7B mov  ecx,edx
    01C69E7D and  ecx,3
    01C69E80 rep movs byte ptr [edi],byte ptr [esi]
Reporter | Updated • 18 years ago
Attachment #225725 - Attachment is obsolete: true
Reporter | Comment 7 • 18 years ago
I'm withdrawing this patch because it really doesn't provide the performance improvement I was hoping for. There is a gain by not calling memcpy() just for 4 bytes, but that gain seems to be negated by the new data size check needed to avoid the memcpy() call. Oh, well.
Status: NEW → RESOLVED
Closed: 18 years ago
Resolution: --- → WONTFIX
Updated•15 years ago
|
Product: Core → Core Graveyard