Special-case to avoid memcpy() calls

RESOLVED WONTFIX

Status

Core Graveyard
GFX
RESOLVED WONTFIX
12 years ago
10 years ago

People

(Reporter: Steve Snyder, Unassigned)

Tracking

({perf})

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 obsolete attachment)

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.4) Gecko/20060527 SUSE/1.5.0.4-1.3 Firefox/1.5.0.4
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.4) Gecko/20060527 SUSE/1.5.0.4-1.3 Firefox/1.5.0.4

Functions SetImageData() and SetAlphaData() are often called with 4 bytes to copy.  A performance gain can be gotten by doing these copies in a single instruction rather than calling memcpy().


Reproducible: Always

Steps to Reproduce:
1.Load page containing graphics.
2.
3.

Actual Results:  
Lots of code executed unnecessarily.


Expected Results:  
Code should recognize common data sets.


The 2 functions noted above call memcpy() for any non-zero quantity of graphics data to copy.

When I load the pages of the 10 most popular websites (http://www.alexa.com/site/ds/top_sites?cc=US&ts_mode=country&lang=none) I see that 13% of the calls to memcpy() are to copy only 4 bytes of data.  A 4-byte copy can be done in a single instruction.  

When building Mozilla with inline memcpy() (-O2 optimization with MSVC compilers) the inline code takes many more CPU cycles than the simple compare-and-move I am suggesting.  The savings in execution time should be even more stark when building with -O1 (no inlining of standard functions) as the change avoids overhead of passing params on the stack.
(Reporter)

Comment 1

12 years ago
Created attachment 225725 [details] [diff] [review]
Avoid memcpy() for 4 bytes of data
(Reporter)

Comment 2

12 years ago
Going to all of those top 10 most popular websites with Seamonkey v1.0.2 involves a combined 25344 calls to memcpy() from SetImageData() and SetAlphaData().  Those calls break down like this:

    38407 mticks, sizeKB = 7510, 4byte = 3145, small = 5328, large = 16871

where number of bytes to copy is:

  4byte = (data == 4 bytes)
  small = (4 < data < 64)
  large = (data >= 64)

and sizeKB is to total amount of data (expressed in KBytes) run through memcpy() for these calls.

The exact numbers vary slightly from run to run because the advertisements shown on the pages vary.  Roughly 13% of the calls to memcpy() are for 4-byte blocks.

FYI.

Comment 3

12 years ago
==> GFX
Assignee: general → nobody
Status: UNCONFIRMED → NEW
Component: General → General
Ever confirmed: true
Keywords: perf
Product: Mozilla Application Suite → Core
QA Contact: general → general
Version: unspecified → Trunk

Updated

12 years ago
Component: General → GFX

Updated

12 years ago
Assignee: nobody → general
QA Contact: general → ian

Comment 4

12 years ago
Steve, do you realize that this would only work on properly align data ? This type of operations can be very slow on many processors, and are even forbidden on some. I've had some algorithms runs 10 times slower because of misalignment (on Alpha). That's the reason that the compiler didn't take the obvious optimizations. Did you test only on Intel or AMD ?

While this would work on some processors (Intel mainly), it can't be used on all.
memcpy is usually quite fast; I'd be surprised if you can do much better.  Did you actually measure the performance of these changes?
(Reporter)

Comment 6

12 years ago
Jo,

Hmmm.  I hadn't considered non-x86 processors.

Alignment (on x86) isn't an issue.  The memcpy() also copies the data 4 bytes at a time, without testing alignment, but is much less efficient because it's generalized for any amount of data.  Here is the code generated by VS2003/MSVC 7.1.  It's compiled at -O2 get get the inlined memcpy.

     if (aLength == 4)
01C69E5B  cmp         esi,4
01C69E5E  jne         gfxImageFrame::SetAlphaData+0AAh (1C69E6Ah)
        *((PRUint32 *)(alphaData + offset)) = *((PRUint32 *)aData);
01C69E60  mov         ecx,dword ptr [ecx]
01C69E62  mov         edx,dword ptr [this]
01C69E65  mov         dword ptr [edi+edx],ecx
     else
01C69E68  jmp         gfxImageFrame::SetAlphaData+0C5h (1C69E85h)
        memcpy(alphaData + offset, aData, aLength);
01C69E6A  mov         eax,dword ptr [this]
01C69E6D  mov         ecx,esi
01C69E6F  mov         esi,dword ptr [aData]
01C69E72  mov         edx,ecx
01C69E74  shr         ecx,2
01C69E77  add         edi,eax
01C69E79  rep movs    dword ptr [edi],dword ptr [esi]
01C69E7B  mov         ecx,edx
01C69E7D  and         ecx,3
01C69E80  rep movs    byte ptr [edi],byte ptr [esi]

(Reporter)

Updated

12 years ago
Attachment #225725 - Attachment is obsolete: true
(Reporter)

Comment 7

12 years ago
I'm withdrawing this patch because it really doesn't provide the performance improvement I was hoping for.  There is a gain by not calling memcpy() just for 4 bytes, but that gain seems to be negated by the new data size check needed to avoid the memcpy() call.

Oh, well.
Status: NEW → RESOLVED
Last Resolved: 12 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.