Closed Bug 406580 Opened 17 years ago Closed 17 years ago

Faster copying of RGB pixel data

Categories

(Core :: Graphics, defect, P1)

x86
Windows XP
defect

Tracking

()

RESOLVED FIXED
mozilla1.9beta3

People

(Reporter: swsnyder, Assigned: swsnyder)

References

Details

(Keywords: perf)

Attachments

(3 files, 6 obsolete files)

A significant percentage of the time spent in graphics processing is copying the pixel data from the buffer where a given image is decoded to the display buffer for output.  Most of this slow processing is due to having to construct each pixel from 4 individual values (A,R,G,B).

In the case where the alpha value is 0xFF, no manipulation of the R,G,B values is needed, yet the pixel components are still handled as 4 values.  The R,G,B values are read from contiguous byte-sized memory locations then reassembled, with 0xFF, as a single 32-bit value.

Processing would be faster if those pixel components were kept contiguous, not read individually from the source buffer and re-assembled.
Reduces the time it takes to copy pixels on x86 platforms by 20% - 25%.

The performance improvement comes from reducing the number of memory accesses needed for each pixel.  The RGB bytes are gotten in a single 32-bit read from the source buffer, reordered within registers, and written to the destination buffer as a single 32-bit value.  On x86 platforms, hardware byte reordering is used, as provided by Microsoft's and GNU's compilers.
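As a sketch of the two strategies (illustrative only; the helper names here are not the actual tree macros, and a little-endian host is assumed, as on the x86 platforms under discussion):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Original approach: three byte-sized reads, reassembled with A=0xFF. */
static uint32_t pack_bytewise(const uint8_t *rgb)
{
    return (0xFFu << 24) | ((uint32_t)rgb[0] << 16) |
           ((uint32_t)rgb[1] << 8) | rgb[2];
}

/* Modified approach: one 32-bit read (R,G,B plus one stray byte from the
   next pixel), one in-register byte swap, one shift to drop the stray
   byte.  Compilers recognize the swap pattern as a single bswap on x86. */
static uint32_t pack_wordwise(const uint8_t *rgb)
{
    uint32_t x;
    memcpy(&x, rgb, 4);                      /* single 32-bit read */
    x = ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
        ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
    return (x >> 8) | 0xFF000000u;           /* force alpha to 0xFF */
}
```

Reading the stray fourth byte is safe mid-scanline because the next pixel's data follows; the trailing pixels have to be handled separately.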
The short version:

Original code:
  4.00 reads from memory per pixel
  2.00 writes to memory per pixel

Modified, using generic C byte-reordering:
  2.25 reads from memory per pixel
  1.75 writes to memory per pixel

Modified, using x86 bswap byte-reordering:
  1.25 reads from memory per pixel
  1.50 writes to memory per pixel

Details:

// original code from trunk: form pixel from individual bytes
//   instruction bytes in loop:   53
//   reads from memory per loop:  3 (data) + 1 (var)
//-> reads from memory per pixel: 4.00 = (4 / 1)
//   writes to memory per loop:   1 (pixels) + 1 (vars)
//-> writes to memory per pixel:  2.00 = (2 / 1)
//   times to copy all scanlines, rdtsc():
//     cosmos.jpg: 367277, 376664, 373971; avg: 372637
//     sk7.jpg:    104953, 106780, 106857; avg: 106193
// -------------------------------------------------

      for (PRUint32 i=mInfo.output_width; i>0; --i) {
0050300B  mov         ecx,dword ptr [esi+74h]
0050300E  test        ecx,ecx
00503010  jbe         nsJPEGDecoder::OutputScanlines+152h (503042h)
        *imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]);
00503012  movzx       edx,byte ptr [eax]
00503015  movzx       ebp,byte ptr [eax+1]
00503019  movzx       eax,byte ptr [eax+2]
0050301D  or          edx,0FFFFFF00h
00503023  shl         edx,8
00503026  or          edx,ebp
00503028  shl         edx,8
0050302B  or          edx,eax
0050302D  mov         dword ptr [edi],edx
        sampleRow += 3;
0050302F  mov         eax,dword ptr [esp+0Ch]
00503033  add         eax,3
00503036  add         edi,4
00503039  sub         ecx,1
0050303C  mov         dword ptr [esp+0Ch],eax
00503040  jne         nsJPEGDecoder::OutputScanlines+122h (503012h)


// mod: read contiguous bytes, generic shift/or byte re-ordering
//   instruction bytes in loop:   172
//   reads from memory per loop:  8 (data) + 1 (var)
//-> reads from memory per pixel: 2.25 = (9 / 4)
//   writes to memory per loop:   4 (pixels) + 3 (vars)
//-> writes to memory per pixel:  1.75 = (7 / 4)
//   times to copy all scanlines, rdtsc():
//     cosmos.jpg: not tested
//     sk7.jpg:    not tested
// -------------------------------------------------

      PRUint32 idx = mInfo.output_width;
0050305B  mov         edx,dword ptr [esi+74h]

      // bulk copy of pixels
      while (idx >= 4) {
0050305E  cmp         edx,4
00503061  mov         dword ptr [esp+14h],edx
00503065  jb          nsJPEGDecoder::OutputScanlines+1D8h (503118h)
0050306B  shr         edx,2
0050306E  mov         dword ptr [esp+18h],edx
        PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
        p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
        p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503072  mov         edx,dword ptr [eax+3]
00503075  movzx       esi,byte ptr [eax+5]
        p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503079  movzx       ebx,byte ptr [eax+8]
        p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
0050307D  movzx       ebp,byte ptr [eax+0Bh]
00503081  mov         ecx,edx
00503083  and         edx,0FF00h
00503089  or          ecx,0FFFFFF00h
0050308F  shl         ecx,10h
00503092  or          ecx,esi
00503094  mov         esi,dword ptr [eax+6]
00503097  or          ecx,edx
00503099  mov         edx,esi
0050309B  or          edx,0FFFFFF00h
005030A1  shl         edx,10h
005030A4  or          edx,ebx
005030A6  mov         ebx,dword ptr [eax+9]
005030A9  and         esi,0FF00h
005030AF  or          edx,esi
005030B1  mov         esi,ebx
005030B3  or          esi,0FFFFFF00h
005030B9  shl         esi,10h
005030BC  or          esi,ebp
005030BE  and         ebx,0FF00h
005030C4  or          esi,ebx
005030C6  mov         ebx,dword ptr [eax]
        imageRow[0] = p0; imageRow[1] = p1;
005030C8  movzx       eax,byte ptr [eax+2]
005030CC  mov         ebp,ebx
005030CE  or          ebp,0FFFFFF00h
005030D4  shl         ebp,10h
005030D7  or          ebp,eax
005030D9  and         ebx,0FF00h
        imageRow[2] = p2; imageRow[3] = p3;
005030DF  mov         dword ptr [edi+8],edx
        idx       -=  4;
005030E2  mov         edx,dword ptr [esp+14h]
005030E6  or          ebp,ebx
005030E8  mov         dword ptr [edi],ebp
005030EA  mov         dword ptr [edi+4],ecx
005030ED  mov         dword ptr [edi+0Ch],esi
        sampleRow += 12;
005030F0  mov         eax,dword ptr [esp+10h]
005030F4  sub         edx,4
005030F7  add         eax,0Ch
        imageRow  +=  4;
005030FA  add         edi,10h
005030FD  sub         dword ptr [esp+18h],1
00503102  mov         dword ptr [esp+14h],edx
00503106  mov         dword ptr [esp+10h],eax
0050310A  jne         nsJPEGDecoder::OutputScanlines+132h (503072h)


// mod: read contiguous bytes, use x86 bswap byte re-ordering
//   instruction bytes in loop:   105
//   reads from memory per loop:  4 (data) + 1 (var)
//-> reads from memory per pixel: 1.25 = (5 / 4)
//   writes to memory per loop:   4 (pixels) + 2 (vars)
//-> writes to memory per pixel:  1.50 = (6 / 4)
//   times to copy all scanlines, rdtsc():
//     cosmos.jpg: 294457, 299895, 301546; avg: 298632
//     sk7.jpg:     81303,  81471,  80976; avg:  81250
// -------------------------------------------------

      PRUint32 idx = mInfo.output_width;
0050300B  mov         ebp,dword ptr [ebx+74h]

      // bulk copy of pixels
      while (idx >= 4) {
0050300E  cmp         ebp,4
00503011  jb          nsJPEGDecoder::OutputScanlines+18Dh (50307Dh)
00503013  mov         ecx,ebp
00503015  shr         ecx,2
00503018  mov         dword ptr [esp+14h],ecx
0050301C  lea         esp,[esp]

        PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
        p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
        p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503020  mov         ecx,dword ptr [eax+3]
        p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503023  mov         edx,dword ptr [eax+6]
        p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00503026  mov         edi,dword ptr [eax+9]
00503029  mov         eax,dword ptr [eax]
0050302B  bswap       eax
0050302D  bswap       ecx
0050302F  bswap       edx
00503031  bswap       edi
00503033  shr         eax,8
00503036  shr         ecx,8
00503039  shr         edx,8
0050303C  shr         edi,8
0050303F  or          eax,0FF000000h
00503044  or          ecx,0FF000000h
0050304A  or          edx,0FF000000h
00503050  or          edi,0FF000000h
        imageRow[0] = p0; imageRow[1] = p1;
00503056  mov         dword ptr [esi],eax
00503058  mov         dword ptr [esi+4],ecx
        imageRow[2] = p2; imageRow[3] = p3;
0050305B  mov         dword ptr [esi+8],edx
0050305E  mov         dword ptr [esi+0Ch],edi
        idx       -=  4;
        sampleRow += 12;
00503061  mov         eax,dword ptr [esp+10h]
00503065  add         eax,0Ch
00503068  sub         ebp,4
        imageRow  +=  4;
0050306B  add         esi,10h
0050306E  sub         dword ptr [esp+14h],1
00503073  mov         dword ptr [esp+10h],eax
00503077  jne         nsJPEGDecoder::OutputScanlines+130h (503020h)
Assignee: swsnyder → nobody
Component: General → GFX: Thebes
Product: Mozilla Application Suite → Core
QA Contact: general → thebes
Attachment #291204 - Flags: review?(pavlov)
Attachment #291204 - Flags: review+
Attachment #291204 - Flags: approval1.9+
Keywords: checkin-needed
The patch needs to get an extra
    && !defined(XP_OS2)
at the end of the GCC version test line for the checkin. (For some reason the byte swapping stuff was never implemented in the OS/2 GCC...)
(In reply to comment #3)
> The patch needs to get an extra
>     && !defined(XP_OS2)
> at the end of the GCC version test line for the checkin. (For some reason the
> byte swapping stuff was never implemented in the OS/2 GCC...)

In attempting to confirm this I find that the link to the OS/2 build of GCC is dead.  At http://developer.mozilla.org/en/docs/OS/2_Build_Prerequisites#Compiler is a link to an FTP site (ftp.netlabs.org) that no longer has the GCC build.

FYI.
Assignee: nobody → swsnyder
I just updated that page a few days ago and thought that I had verified that link. Obviously not, sorry.

ftp://ftp.netlabs.org/pub/gcc/GCC-3.3.5-csd3.zip is the package that we use on OS/2. It does define some bswap stuff (in endian.h) but those don't seem to be backed up by symbols in the library...
I added the extra !defined(XP_OS2), as per Peter.

Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v  <--  gfxColor.h
new revision: 1.14; previous revision: 1.13
done
Checking in modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp,v  <--  nsGIFDecoder2.cpp
new revision: 1.90; previous revision: 1.89
done
Checking in modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp,v  <--  nsJPEGDecoder.cpp
new revision: 1.82; previous revision: 1.81
done
Checking in modules/libpr0n/decoders/png/nsPNGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/png/nsPNGDecoder.cpp,v  <--  nsPNGDecoder.cpp
new revision: 1.77; previous revision: 1.76
done
Status: NEW → RESOLVED
Closed: 17 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M11
The GCC check also doesn't compile on Mac, so I had to add a !defined(XP_MACOSX) to that check to fix bustage.
Depends on: 409381
This fixes the build on FreeBSD...  It will also be broken on any other GCC platform, since byteswap.h is a Linux only header.

  -Jeremy
Attachment #294278 - Flags: review?(pavlov)
Attachment #294278 - Flags: review?(pavlov) → review+
Attachment #294278 - Flags: approval1.9?
Attachment #294278 - Flags: approval1.9? → approval1.9+
Keywords: checkin-needed
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v  <--  gfxColor.h
new revision: 1.16; previous revision: 1.15
done
Keywords: checkin-needed
I'm fairly certain that this change has broken GIF rendering on PPC Macs.  See bug 409551.
Confirmed that reverting to nsGIFDecoder2.cpp, v.1.89 fixes the problem.  Suggest backing out patches related to this until successfully tested on both little and big endian machines.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
See Bug 409381 (patch 294371) for fix.

This fixes the GIF, JPG, and PNG color distortion seen on PowerPC platforms.

It's being dealt with in bug 409381. Re-resolving.
Status: REOPENED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
(In reply to comment #8)
> This fixes the build on FreeBSD...  It will also be broken on any other GCC
> platform, since byteswap.h is a Linux only header.

So, this would be why my Win XP build can't find byteswap.h? Or is that a cygwin/MinGW thing?
Comment on attachment 291204 [details] [diff] [review]
Where Alpha=0xFF, keep RGB values contiguous in pixel copying

>+#else
>+#  define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>+#  define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>+#endif
Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that don't give you access to the native byteswap opcodes?
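For reference, the generic fallback macros quoted above are easy to check standalone (copied from the patch hunk, with parentheses added around the inner `x >> 16` argument for macro safety):

```c
#include <assert.h>
#include <stdint.h>

/* Generic fallback byte-swap macros from the patch hunk quoted above. */
#define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
#define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16((x) >> 16) )
```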
(In reply to comment #15)
>(From update of attachment 291204 [details] [diff] [review])
>>+#else
>>+#  define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>>+#  define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>>+#endif
>Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that
>don't give you access to the native byteswap opcodes?
OK, so maybe it isn't, but how about
#define GFX_BYTESWAP24FF(x) \
  ((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
#define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
  (GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))
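A quick standalone check of the suggested macro (sketch; assumes x is the 32-bit little-endian load whose low three bytes are R, G, B, as in this bug's x86 builds):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Neil's suggested macro; x is a 32-bit little-endian load whose low
   three bytes are R, G, B (the top byte belongs to the next pixel).
   Note (x) << 16 also pushes G into the top byte, but the 0xff << 24
   OR hides it. */
#define GFX_BYTESWAP24FF(x) \
    ((0xffu << 24) | ((x) << 16) | ((x) & 0xff00u) | (((x) >> 16) & 0xffu))

/* Test helper: load 4 bytes and form the 0xFFRRGGBB pixel. */
static uint32_t pixel_from_rgb(const uint8_t *p)
{
    uint32_t x;
    memcpy(&x, p, 4);
    return GFX_BYTESWAP24FF(x);
}
```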
(In reply to comment #16)

> OK, so maybe it isn't, but how about
> OK, so maybe it isn't, but how about
> #define GFX_BYTESWAP24FF(x) \
>   ((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
> #define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
>   (GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))

That does look faster than the use of a generic byteswap, as it does fewer operations (3 vs. 5 shifts) on the 32-bit value read from memory.

The important thing is that those contiguous RGB values be gotten in a single read.  Reduction of operations on the value once it's in the register is icing on the cake.
FWIW, on a Pentium3/850Mhz (Win2k/SP4) machine.

The GFX_BYTESWAP24FF(x) suggested above, and the generic byteswap (not bswap()):

// 24FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines:    133126
(1) Total time to copy all scanlines:    131543
(2) Total time to copy all scanlines:    132112 <-- avg: 132,260

// 32FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines:    136260
(1) Total time to copy all scanlines:    137576
(2) Total time to copy all scanlines:    133686 <-- avg: 135,840

So on a P3, GFX_BYTESWAP24FF(x) is 2.7% faster on an x86 platform where the compiler doesn't support the bswap instruction.
I count only two effective shifts. The first shift is on a constant and the compiler should be able to turn that into 0xff000000, leaving just the shifts of R (<< 16) and B (>> 16).

So you go from 5 shifts, 5 &s, and 2 |s to 2 shifts, 2 &s, and 3 |s. Not bad.

However, looking at the assembly for one XBGR -> FRGB operation of the generic double swapping code:

00503072  mov         edx,dword ptr [eax+3]
00503081  mov         ecx,edx
00503083  and         edx,0FF00h
00503089  or          ecx,0FFFFFF00h
0050308F  shl         ecx,10h
00503075  movzx       esi,byte ptr [eax+5]
00503092  or          ecx,esi
00503097  or          ecx,edx

which is effectively 2 shifts, 2 &s and 3 |s:

(((x) | 0xffffff00) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff)

(it looks like the compiler replaced the last >> plus & with a load-byte op)

What's interesting is that this looks a lot like Neil's suggestion, except that the first two terms (0xff << 24 | (x) << 16) are combined: (((x) | 0xffffff00) << 16). Which makes me wonder: what does Neil's suggestion compile to, to make it 2.7% faster?
Something like this would avoid the extra memory access with two more instructions:

mov         edx,dword ptr [eax+3]
mov         ecx,edx    ; these 3 instructions replace movzx esi,byte ptr [eax+5]
and         ecx,0FF00h ;
mov         esi,ecx    ;
mov         ecx,edx
shr         edx,10h
and         edx,0FFh
or          ecx,0FFFFFF00h
shl         ecx,10h
or          ecx,esi
or          ecx,edx

Not nearly as nice as
mov         ecx,dword ptr [eax+3]
bswap       ecx
shr         ecx,8
or          ecx,0FF000000h
Hrm, and looking around on the net, it looks like for Mac OS X you can use:

#include <libkern/OSByteOrder.h>
#define GFX_BYTESWAP16(x) OSSwapInt16(x)
#define GFX_BYTESWAP32(x) OSSwapInt32(x)
With the above includes and defines I get this on Intel Mac (-O2, make nsPNGDecoder.s)

L299:
  movl  -112(%ebp), %eax
  movl  (%eax), %eax
  movl  %eax, -156(%ebp)
  bswap   %eax
  movl  %eax, -156(%ebp)
  movl  -112(%ebp), %ecx
  movl  3(%ecx), %edx
  bswap   %edx
  movl  %ecx, %esi
  movl  6(%ecx), %ecx
  bswap   %ecx
  movl  9(%esi), %esi
  bswap   %esi
  shrl  $8, -156(%ebp)
  orl $-16777216, -156(%ebp)
  movl  -156(%ebp), %eax
  movl  %eax, (%edi)
  shrl  $8, %edx
  orl $-16777216, %edx
  movl  %edx, 4(%edi)
  shrl  $8, %ecx
  orl $-16777216, %ecx
  movl  %ecx, 8(%edi)
  shrl  $8, %esi
  orl $-16777216, %esi
  movl  %esi, 12(%edi)
  subl  $4, -96(%ebp)
  addl  $12, -112(%ebp)
  addl  $16, %edi
  cmpl  $4, -96(%ebp)
  ja  L299
  jmp L300

which is pretty much the same as the bswap version in comment 2, except that GCC 4.0.1 isn't quite as good with registers.

The generic version boils down to the equivalent of:
(((((((x) & 0xFF) << 8) | (((x) >> 8) & 0xFF)) << 16) | (((x) >> 8) & 0xFF00)) >> 8) | 0xFF000000

From the .s file:
  movzbl  %dl,%eax   ; - (x) & 0xFF           ...R
  sall  $8, %eax                              ..R.
  movzbl  %dh, %ecx  ; - ((x) >> 8) & 0xFF         ...G
  orl %ecx, %eax                              ..RG
  sall  $16, %eax                             RG..
  shrl  $8, %edx                                        .ABG
  andl  $65280, %edx                                    ..B.
  orl %eax, %edx                              RGB.
  shrl  $8, %edx                              .RGB
  orl $-16777216, %edx                        FRGB

Repeat twice (once %esi, once %edi) with the first three lines replaced with:
  movl %esi,%eax     ; \
  andl $255,%eax     ; - (x) & 0xFF
  sall  $8, %eax
  movl  %esi, %edx   ; - ((x) >> 8) & 0xFF
  movzbl  %dh, %ecx  ; /

And repeat once more with:
  movzbl  -124(%ebp),%eax  ; - (x) & 0xFF
  sall  $8, %eax
  movl  -124(%ebp), %edx   ; - ((x) >> 8) & 0xFF
  movzbl  %dh, %ecx        ; /

While not being smart with registers, these variants avoid doing a second memory access.

Neil's version (slightly reordered):
((x) << 16) | ((x) & 0xFF00) | (((x) >> 16) & 0xFF) | 0xFF000000

From the .s file:
  movl  %edx, %eax
  sall  $16, %eax       ; (x) << 16      GR..
  movl  %edx, %ecx
  andl  $65280, %ecx    ; (x) & 0xFF00        ..G.
  orl %ecx, %eax                         GRG.
  shrl  $16, %edx       ; (x) >> 16                ..AB
  andl  $255, %edx      ;     & 0xFF               ...B
  orl %edx, %eax                         GRGB
  orl $-16777216, %eax                   FRGB

Repeat 3 more times with %edx in the first and third line replaced with %edi, %esi, and -124(%ebp) respectively.

So on Intel Mac Neil's version is slightly more efficient. It's pretty much what I had suggested in comment 20 (if I hadn't been a doofus and would just have done "mov esi,edx ; and esi,0FF00h"). Reorder things slightly et voila:

mov         ecx,edx
shl         ecx,10h
mov         esi,edx
and         esi,0FF00h
or          ecx,esi
shr         edx,10h
and         edx,0FFh
or          ecx,edx
or          ecx,0FF000000h

My guess is that not doing the second memory access gives Neil's version 2.7% over the generic version.
Hrm, I guess my As above should really have been Xs (well, B from the next pixel, but X for clarity)

Anyway, inspiration struck:
#  define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) & 0xffff00ff) | ((x) & 0xff00) | (0xff << 24) )

becomes:
  movl  %ecx, %eax      ; XBGR
  rorl  $16, %eax       ; GRXB
  xorb  %ah, %ah        ; GR.B
  andl  $65280, %ecx    ;      ..G.
  orl %ecx, %eax        ; GRGB
  orl $-16777216, %eax  ; FRGB

And it's a damn shame the compiler didn't do:
  movl  %ecx, %eax      ; XBGR
  rorl  $16, %eax       ; GRXB
  movb  %ch, %ah        ; GRGB
  orl $-16777216, %eax  ; FRGB

(prepend a "mov %esi, %ecx" for %esi, same for %edi)

Ah, but how about:
#  define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) | 0xff00ff00) & ((x) | 0xffff00ff) )

becomes:
  movl  %ecx, %eax      ; XBGR
  rorl  $16, %eax       ; GRXB
  orl $-16711936, %eax  ; FRFB
  orl $-65281, %ecx     ;      FFGF
  andl  %ecx, %eax      ; FRGB

with no messing around for %edi and %esi.
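Both variants from this comment check out against the expected 0xFFRRGGBB packing (standalone sketch; x is again the 32-bit little-endian R,G,B,X load):

```c
#include <assert.h>
#include <stdint.h>

/* The two GFX_BYTESWAP24FF variants from this comment; x is the 32-bit
   little-endian load R,G,B,X (X = first byte of the next pixel). */

/* Variant 1: rotate, mask off the stray X byte, merge G, force alpha. */
#define SWAP24FF_A(x) \
    ( ((((x) << 16) | ((x) >> 16)) & 0xffff00ffu) | ((x) & 0xff00u) | (0xffu << 24) )

/* Variant 2: rotate, then one OR and one AND mask do the rest. */
#define SWAP24FF_B(x) \
    ( ((((x) << 16) | ((x) >> 16)) | 0xff00ff00u) & ((x) | 0xffff00ffu) )
```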
Wheeee! Not sure about the GCC range there, let me know if that needs to change.
Attachment #294884 - Flags: review?
Attachment #294884 - Flags: review? → review?(pavlov)
One thing I'm wondering though. Do we need to worry about the cost of non-aligned memory access? Since we're already taking 4 * 3 bytes in the main loop, could we process this in 3 chunks of 4, and make sure that the pre-loop aligns us?
Re comment #25:

For x86, memory write alignment is more important for performance than read alignment.  Of course, optimal performance would be achieved if all memory accesses were aligned by data type.

In the case of the pixel copy, the destination is aligned 100% of the time and the source 25%, so it's not too bad.  Optimal would be a scheme in which 12 RGB source bytes were gotten in 32-bit-aligned reads and 4 ARGB pixels were emitted.  Some sort of SIMD (SSE2, etc.) sequence would do it, as would some careful ASM code, but either would be deemed too platform-specific for Mozilla.
Any way to do what you're suggesting with intrinsics?
(In reply to comment #27)
> Any way to do what you're suggesting with intrinsics?

You mean SIMD intrinsics? Certainly.  In my limited experience SIMD intrinsics are *better* than ASM code in that they are common to the Microsoft/GNU/Intel compilers and avoid the differing Intel/AT&T ASM formats (don't get me started!).

My SSE2 experience is too weak to provide off-the-cuff code, but a suitable sequence would be roughly:

pre: copy pixel data as bytes until 128-bit alignment is reached for both source and destination pointers.

1. read 16 bytes of RGB data from the 128-bit-aligned source into an XMM register
2. shuffle the 4 3-byte sequences into 32-bit position, discarding the high 4 bytes read
3. ADD or OR 4 0xFF bytes into the appropriate positions
4. write XMM register value (4 ARGB pixels) to 128-bit-aligned destination

post: copy remaining (n mod 4) pixels as bytes

The downsides: 

1. requires SSE2+ instruction set
2. might actually be slower for short scanlines due to set-up costs
3. so different from the integer code that you'd have to actually write the final code to know the relative performance
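For the record, the four steps above can be sketched with intrinsics. One caveat: the arbitrary byte shuffle of step 2 only became a single instruction with SSSE3 (pshufb), not plain SSE2, so this sketch assumes SSSE3; function and variable names are illustrative, and the pre/post scalar loops are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

/* Convert 4 packed RGB pixels (12 source bytes) into 4 ARGB pixels.
   Illustrative sketch only.  The load reads 4 bytes past the 12 pixel
   bytes, so the caller must guarantee they are readable (true
   mid-scanline).  Unaligned loads/stores are used, so the alignment
   pre-loop from the comment above is skipped here. */
__attribute__((target("ssse3")))
static void rgb_to_argb_x4(const uint8_t *src, uint32_t *dst)
{
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    /* Step 2: move each R,G,B triplet into B,G,R,0 dword byte order
       (little-endian memory order for a 0xFFRRGGBB pixel); -1 zeroes
       the alpha slot. */
    const __m128i shuf = _mm_set_epi8(-1, 9, 10, 11,   /* pixel 3 */
                                      -1, 6,  7,  8,   /* pixel 2 */
                                      -1, 3,  4,  5,   /* pixel 1 */
                                      -1, 0,  1,  2);  /* pixel 0 */
    v = _mm_shuffle_epi8(v, shuf);
    /* Step 3: OR the 0xFF alpha bytes into place. */
    v = _mm_or_si128(v, _mm_set1_epi32((int)0xFF000000u));
    /* Step 4: write 4 ARGB pixels in one store. */
    _mm_storeu_si128((__m128i *)dst, v);
}
```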
Comment on attachment 294884 [details] [diff] [review]
Add Mac OS X bswap support, speed up generic GFX_0XFF_PPIXEL_FROM_BPTR

> #elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_MACOSX) && !defined(XP_OS2)
> #  include <byteswap.h>
> #  define GFX_BYTESWAP16(x) bswap_16(x)
> #  define GFX_BYTESWAP32(x) bswap_32(x)
> #  define _GFX_USE_INTRIN_BYTESWAP_
>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_MACOSX)
>+#  include <libkern/OSByteOrder.h>
>+#  define GFX_BYTESWAP16(x) OSSwapInt16(x)
>+#  define GFX_BYTESWAP32(x) OSSwapInt32(x)
>+#  define _GFX_USE_INTRIN_BYTESWAP_

Can't you move the XP_MACOSX one before the one above it so we can drop the !defined(XP_MACOSX) check in the elif before it? Is there an XP_* define for just normal Linux?
reed: if the compiler requirements are the same for these two blocks, then yes. I considered it, but since I'm not quite sure if I'm doing the right version check, or need it at all, I just have it as a place-holder for now.
Steve: btw, why are you storing the results in local variables and then writing them all to the destination? That seems to require more opcodes than writing it out as you go. Compare grouped write vs. write-as-you-go:

L299:                         |L299: 
  movl  -112(%ebp), %eax      |  movl  -108(%ebp), %ecx
  movl  (%eax), %eax          |  movl  (%ecx), %eax
* movl  %eax, -156(%ebp)      |  bswap   %eax
  bswap   %eax                |  shrl  $8, %eax
  movl  %eax, -156(%ebp)      |  orl $-16777216, %eax
  movl  -112(%ebp), %ecx      |  movl  %eax, (%edi)
  movl  3(%ecx), %edx         |  movl  3(%ecx), %eax
  bswap   %edx                |  bswap   %eax
  movl  %ecx, %esi            |  shrl  $8, %eax
  movl  6(%ecx), %ecx         |  orl $-16777216, %eax
  bswap   %ecx                |  movl  %eax, 4(%edi)
  movl  9(%esi), %esi         |  movl  6(%ecx), %eax
  bswap   %esi                |  bswap   %eax
  shrl  $8, -156(%ebp)        |  shrl  $8, %eax
  orl $-16777216, -156(%ebp)  |  orl $-16777216, %eax
  movl  -156(%ebp), %eax      |  movl  %eax, 8(%edi)
  movl  %eax, (%edi)          |  movl  9(%ecx), %eax
  shrl  $8, %edx              |  bswap   %eax
  orl $-16777216, %edx        |  shrl  $8, %eax
  movl  %edx, 4(%edi)         |  orl $-16777216, %eax
  shrl  $8, %ecx              |  movl  %eax, 12(%edi)
  orl $-16777216, %ecx        |  subl  $4, %edx
  movl  %ecx, 8(%edi)         |  addl  $12, %ecx
  shrl  $8, %esi              |  movl  %ecx, -108(%ebp)
  orl $-16777216, %esi        |  addl  $16, %edi
  movl  %esi, 12(%edi)        |  cmpl  $4, %edx
  subl  $4, -96(%ebp)         |  ja  L299
  addl  $12, -112(%ebp)       |  jmp L300
  addl  $16, %edi             |
  cmpl  $4, -96(%ebp)         |
  ja  L299                    |
  jmp L300                    |

(*) compiler bug?

Of course, seeing how the assembly in comment 2 looks like it's ordered rather nicely, I guess I should go find a more recent compiler to play with :-)
BTW, how come in the PNG decoder RGB and BGR get the same pixel-copying treatment?
(In reply to comment #31)
> Steve: btw, why are you storing the results in local variables and then writing
> them all to the destination? That seems to require more opcodes than writing it
> out as you go. Compare grouped write vs. write-as-you-go:
[snip]

The local variables give the compiler a little wiggle room, although the
declared variables are never actually generated by the compiler.  The purpose
is to avoid exactly the kind of back-to-back register stalls shown in your
snippets.

Example:

    movl  3(%ecx), %eax
    bswap   %eax
    shrl  $8, %eax
    orl $-16777216, %eax

Note how the source for 3 operations in a row is the destination of the prior
instruction.  There will be a stall on each of the last 3 instructions because
the prior instruction has not yet completed.  Remember that multiple
instructions are in various stages of execution.  (Sadly, my Pentium4 has a
24-stage pipeline.  The last P4 models released had 30 stages!)

Then look at the ASM code I posted above:

00503020  mov         ecx,dword ptr [eax+3]
00503023  mov         edx,dword ptr [eax+6]
00503026  mov         edi,dword ptr [eax+9]
00503029  mov         eax,dword ptr [eax]
0050302B  bswap       eax
0050302D  bswap       ecx
0050302F  bswap       edx
00503031  bswap       edi
00503033  shr         eax,8
00503036  shr         ecx,8
00503039  shr         edx,8
0050303C  shr         edi,8
0050303F  or          eax,0FF000000h
00503044  or          ecx,0FF000000h
0050304A  or          edx,0FF000000h
00503050  or          edi,0FF000000h
00503056  mov         dword ptr [esi],eax
00503058  mov         dword ptr [esi+4],ecx
0050305B  mov         dword ptr [esi+8],edx
0050305E  mov         dword ptr [esi+0Ch],edi

I have to say that this is damned good code generation on Microsoft's part. 
There is only a single back-to-back use of a register (mov/bswap eax) in that
entire loop.

Presumably those local variables would actually be allocated and used if
compiler optimization was completely disabled.  As it is, the compiler uses
registers to contain the intermediate values.

Re GCC: The major change in GCC v4.0 was a lot of back-end enhancements, which
at the time were said to enable future optimizations.  They seem to be actually
implementing those optimizations now as the v4.1.2 I'm using (in Fedora7)
generates pretty good x86 code.  FYI.
Ah, ok, that's what I guessed you were talking about. The code in the left column is slightly better in that regard, but still nowhere near as pretty as Microsoft's.

I'm curious what Microsoft does with Neil's generic macro (comment 16) and with either of mine (comment 23).

Also, why did you pick __GNUC__ >= 2?
Comment #32:

I confess that I punted on the version.  I knew that bswap_xx() predated GCC v3.0, but not exactly in which version it was introduced.  I guess I could have just used v3 as a minimum since 

    http://developer.mozilla.org/en/docs/Linux_Build_Prerequisites#Build_Tools

says that GCC v3.2+ is "recommended".
Peter:

Pulled the trunk and built in Linux today.  This is what was generated for the inner loop of nsJPEGDecoder::OutputScanlines:

.L50:
        subl    $4, 24(%esp)
        movl    (%edi), %eax
        movl    3(%edi), %edx
        movl    6(%edi), %ecx
        movl    9(%edi), %esi
        addl    $12, %edi
        bswap   %eax
        bswap   %edx
        bswap   %ecx
        bswap   %esi
        shrl    $8, %eax
        shrl    $8, %edx
        orl     $-16777216, %eax	; 0xFF000000
        shrl    $8, %ecx
        orl     $-16777216, %edx        ; 0xFF000000
        shrl    $8, %esi
        orl     $-16777216, %ecx        ; 0xFF000000
        orl     $-16777216, %esi        ; 0xFF000000
        movl    %eax, (%ebp)
        movl    %edx, 4(%ebp)
        movl    %ecx, 8(%ebp)
        movl    %esi, 12(%ebp)
        addl    $16, %ebp
        cmpl    $4, 24(%esp)
        ja      .L50

Again, this is with GCC v4.1.2, built with -Os optimization.  Not too shabby.

(The "-Os" was Red Hat's choice.  For some reason they always build Mozilla products at -Os instead of their usual -O2.)

FYI.
Attachment #294884 - Flags: review?(pavlov) → review+
Attachment #294884 - Attachment is obsolete: true
To sorta summarize this post-fixed flurry of comments and patches, this patch (like the previous patch) gets us to where we use |bswap| on Intel Mac with GCC, and adds a faster generic implementation of GFX_0XFF_PPIXEL_FROM_BPTR().
Attachment #295008 - Flags: review?(pavlov)
Attachment #295008 - Flags: approval1.9?
Blocks: 410439
No longer blocks: 410439
Depends on: 410439
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.

>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_OS2)

In order to stop having to append !defined() things to the end of this line, could it be changed to this:

#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_UNIX)

Would that work to deal with problems like what have been mentioned above, plus things like bug 410439?
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.

Approvals after reviews, please. Also, I'd really prefer to have this be attached to a new bug, referencing this original, but whatever :)
Attachment #295008 - Flags: approval1.9?
Given the fact that these defines are conditioned on __i386__, and people have reported problems on PPC, is the use of bswapXX right, or should this be using ntohl(), which will take care of machine endian issues?  These are also much more likely to be optimized on all architectures, since they're critical to network performance.  Plus they're already in <prinet.h>, so all of the platform junk could be skipped...
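A sketch of what the ntohl route would look like (illustrative names; on POSIX systems ntohl lives in arpa/inet.h, which NSPR's prinet.h wraps per-platform):

```c
#include <arpa/inet.h>   /* ntohl(); NSPR's <prinet.h> wraps this */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Endian-neutral pixel packing via ntohl.  Network order is big-endian,
   so whatever the host byte order, ntohl puts the first byte read (R)
   into the most significant byte; the same expression works on x86
   and PPC alike. */
static uint32_t pixel_from_rgb_ntohl(const uint8_t *p)
{
    uint32_t x;
    memcpy(&x, p, 4);                       /* R,G,B plus one stray byte */
    return (ntohl(x) >> 8) | 0xFF000000u;   /* RGBX -> 0xFFRRGGBB */
}
```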
Attached patch Boring version (I like it!) (obsolete) — Splinter Review
Forest, trees, and way too much fun. Thanks, Jeremy.

This should do the right thing everywhere.
Attachment #295008 - Attachment is obsolete: true
Attachment #295486 - Flags: review?
Attachment #295008 - Flags: review?(pavlov)
Attachment #295486 - Flags: review? → review?(pavlov)
Depends on: 411055
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)

this is fine as long as ntohl is a macro everywhere and this doesn't turn into a function call.  Have you verified that?
ntohl should exist on all platforms supported by NSPR. I would be very surprised if it reduces to a function call (instead of some inlined assembly) on any serious platform.

Is it good enough if we verify Windows, Mac, Linux and some BSD?

I have verified that on Intel Mac ntohl gets inlined to bswap, and on a pretty standard FC2 Linux box (note: _old_) the ntohl call gets inlined to bswap where available, and rorw 8 + rorl 16 + rorw 8 where not. I imagine more up-to-date Linux distributions will do just as well.

Steve, can you check on Windows?
Jeremy, can you take FreeBSD?
(In reply to comment #44)
> Jeremy, can you take FreeBSD?

On FreeBSD, yes.  Unless optimization is turned off.

The mingw problems are because prinet.h doesn't include any headers, because windows.h includes winsock2.h, which defines ntohl, but the PNG encoder doesn't include windows.h...  This is a bug in NSPR - it's breaking the contract that prinet.h makes - that ntohl will be defined if it is included.
Attachment #295486 - Flags: review?(pavlov) → review+
Attachment #295486 - Flags: approval1.9?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
/me wonders what FreeBSD generates with optimizations turned off.

Is there a bug filed against NSPR on prinet.h not providing ntohl in a mingw environment?
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)

a+ - thanx this is awesome
Attachment #295486 - Flags: approval1.9? → approval1.9+
So before I can check this in we'll need some code for mingw in NSPR or update this patch to make an exception for that case for now, and I would still like some confirmation that we end up using bswap on Windows/VC++.
Ok, so since bug 411055 exists and this is already broken (bug 410439) I'm not going to wait for mingw (sorry) but I would still like confirmation that we generate bswap on Windows (very unlikely we don't, but ...)
With SSE2, a potential solution would be:

__declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

movd       xmm0, dword ptr source       RGB
pxor       xmm1, xmm1                   0 vector
movdqa     xmm2, xmmword ptr ff         FF, 0, 0, 0, 0, etc.
punpcklbw  xmm0, xmm1                   R0G0B0
pshuflw    xmm0, xmm0, encoding         (encoding is a byte which specifies how words are to
                                         be shuffled. In this case, we'd reorder RGB to BGR
                                         and shift it one byte)
                                        00B0G0R0
packuswb   xmm0, xmm0                   0BGR
por        xmm0, xmm2                   FBGR
movd       dword ptr target, xmm0

To do one pixel at a time in a macro.

Doing twelve bytes at a time (four pixels) is hard with SSE2 as SSE2 doesn't handle objects well that aren't sized at powers of two. And SSE2 doesn't have a packed byte shuffle which is why the above code converts the bytes to words, does a word shuffle and then converts the words back to bytes. To process multiple three-byte pixels, mask, shift and merge sequences would have to be done and it would be tight on registers to try to do four at a time. It would be an interesting exercise and it would have to be done in a piece of streaming code for efficiency. The above code does a vector read and a vector zero every time, which would normally be factored out of streaming code, but there shouldn't be much of a cost as the instructions without dependencies can run in parallel given enough execution resources in the CPU.

I don't know if it is faster or not but you wouldn't put it in a macro as you'd have to test for SSE2 availability every time and then fork to the appropriate code. With streaming code, you'd only do the test once.

However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction which would allow you to move any byte to any other location in the vector in one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one instruction, you can do:

RGBRGBRGBRGBRGBR ->  BGR BGR BGR BGR

Do a vector or with F   F   F   F    and you get

FBGRFBGRFBGRFBGR with two instructions and a latency of four or five.

In SSE4, there's an instruction called PBLENDVB (Variable Blend Packed Bytes) that will do the above in one instruction. Arguments are destination register, source register/memory and copy mask. The copy mask copies any byte from either register/memory to any other position (you can duplicate) in the destination register.

As you can imagine, a code loop to do 12 bytes at a time would be a very short piece of code.

The SSSE3 instruction PSHUFB has a huge amount of potential for improving byte-twiddling operations involving pixels, whether the size is three, four or more bytes. The downsides are:

- AMD doesn't support it in Barcelona. I expect them to support it in SSE5 but my guess is that they'll need two years for their next architecture.
- MSVC doesn't support SSSE3 and SSE4 instructions until VS2008. VS2008 is out and I have the free trial installed somewhere to play with SSSE3 but I haven't yet.
- I think that the current versions of GCC don't support SSSE3 and SSE4 yet for Mac OSX. I'm not sure where Linux is on these instruction sets.

On the other hand, ICC 10.1 supports SSSE3 and SSE4 as you'd expect. I have a free trial of ICC 10.1 and have built Firefox 2.0.0.11 but any compiler optimizations and the build is unusable (left click doesn't work and gmail runs like a dog). I was planning to work on getting optimized builds using ICC before the free trial expires but other stuff came up.

I'm not sure what the inputs to this macro are and where it is used but it might be possible to optimize the input phase with SSE2. I have SSE2 code in progress that outputs three-byte pixels and the output can be in RGB or BGR order. The output of the color code is actually in four-byte pixels with one byte as a zero so I have to mask and merge to get the output to three-byte pixels. This takes 30 instructions. Outputting four-byte instructions would take only six. Of course an SSSE3 implementation would chop that down to about six as well. The SSE2 is still 25% faster than the scalar code because it replaces scalar table lookups with parallel (8 operations per instruction) arithmetic operations.
> However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction
> which would allow you to move any byte to any other location in the vector in
> one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one
> instruction, you can do:

http://en.wikipedia.org/wiki/SSE3

Says a whole lot more processors have SSE3 - looks like anything that's dual core (including the Athlon 64 X2).  So I'd hazard a bet that a large portion of our users have SSE3 on their machines.
 
> I'm not sure what the inputs to this macro are and where it is used but it
> might be possible to optimize the input phase with SSE2. I have SSE2 code in
> progress that outputs three-byte pixels and the output can be in RGB or BGR
> order. The output of the color code is actually in four-byte pixels with one
> byte as a zero so I have to mask and merge to get the output to three-byte
> pixels. This takes 30 instructions. Outputting four-byte instructions would
> take only six. Of course an SSSE3 implementation would chop that down to about
> six as well. The SSE2 is still 25% faster than the scalar code because it
> replaces scalar table lookups with parallel (8 operations per instruction)
> arithmetic operations.
> 

That would be awesome (SSE2 version)
SSE3 is not the same as SSSE3. The closeness of the acronyms can be confusing.
Attached patch Boring version moved out (obsolete) — Splinter Review
Alright, so #include-ing winsock.h in gfxColor.h is painful since that gets included directly and indirectly in a lot of places in the tree, some of which seem rather unhappy to now have to deal with winsock.h. If I were a Windows hacker I'm sure I could figure out all the magic incantations to make it all work. Instead, I'm moving the macros out to a new header file which will only need to be included by the various decoders that need the macros. Shaves a bit off build time, and per TryServer this actually compiles.
Attachment #295486 - Attachment is obsolete: true
Attachment #296902 - Flags: review?(pavlov)
Comment on attachment 296902 [details] [diff] [review]
Boring version moved out

i'd prefer not to add new files.  especially for mingw
Attachment #296902 - Flags: review?(pavlov) → review-
Ah, but you see, I'm not adding new files for MinGW, I'm adding a new file for Windows so we won't end up #include-ing winsock.h in quite a few places.
Alternatively I could just stuff the #include in the decoders. A little uglier, but not too horrid.
(In reply to comment #50)
> With SSE2, a potential solution would be:
> 
> __declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0};
> 
> movd       xmm0, dword ptr source       RGB
> pxor       xmm1, xmm1                   0 vector
> movdqa     xmm2, xmmword ptr ff         FF, 0, 0, 0, 0, etc.
> punpcklbw  xmm0, xmm1                   R0G0B0
> pshuflw    xmm0, xmm0, encoding         (encoding is a byte which specifies how
> words are to
>                                          be shuffled. In this case, we'd
> reorder RGB to BGR
>                                          and shift it one byte)
>                                         00B0G0R0
> packuswb   xmm0, xmm0                   0BGR
> por        xmm0, xmm2                   FBGR
> movd       dword ptr target, xmm0
> 

Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as well as movd instruction. I.e. if the code needs to be the same for all platforms it's better to use bswap, which is known to work fast on all platforms and does not require SSE2.
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> as well as movd instruction. I.e. if the code needs to be the same for 
> all platforms it's better to use bswap, which is known to work fast on 
> all platforms and does not require SSE2.

pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
of the instruction is one instruction per clock which is the same as BSWAP
on K8. The thing about using SIMD instructions is that almost all of them
don't affect the status flags so they can run in some degree, in parallel
if there are no instruction dependencies.

The above code wouldn't be competitive with an unrolled bswap solution. But
perhaps something doing 48 bytes at a time using SSE2 would be where you are
working on different parts of the string in parallel. I have some other ideas
on this but am pretty busy with other work right now.
Don't really need it in XBM and BMP. Also, move the winsock include to the top to make MinGW happy.
Attachment #296957 - Attachment is obsolete: true
(In reply to comment #59)
> > Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> > as well as movd instruction. I.e. if the code needs to be the same for 
> > all platforms it's better to use bswap, which is known to work fast on 
> > all platforms and does not require SSE2.
> 
> pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
> of the instruction is one instruction per clock which is the same as BSWAP
> on K8.

The main problem is not clock cycles those insns take, nor their throughput. The
main problem is they are vector path, causing stalls in decoder and bubbles in
ROB. I.e. this in fact affects possibilities for out of order execution,
decreasing the parallelism possibilities.
> The main problem is not clock cycles those insns take, nor their 
> throughput. The main problem is they are vector path, causing stalls 
> in decoder and bubbles in ROB. I.e. this in fact affects possibilities 
> for out of order execution, decreasing the parallelism possibilities.

pshufd is VectorPath. pshuflw and pshufhw are DirectPath Double.

I have a link to a local copy of the K8 SOG on my homepage and it only takes me half a minute to look this stuff up if I don't already have it memorized.

A VectorPath instruction isn't the end of the world when considering the selection of instructions to use. It should be avoided when there are other
alternatives but not so if those alternatives are considerably more expensive.
After applying attachment 296978 [details] [diff] [review] I get subroutines, not inline instructions. Ouch!  With the value to be swapped passed on the stack.  Ouch again!

[ Below is a verbatim display from the debugger, but those calls actually refer to ntohl(). ]

From nsJPEGDecoder::OutputScanlines(), built with VS2005/SP1 on a Win2K machine: 

      // bulk copy of pixels.
      while (idx > 4) {          // >4 to avoid last 3 bytes in buffer
00542302  cmp         ecx,4 
00542305  mov         dword ptr [esp+10h],ecx 
00542309  jbe         nsJPEGDecoder::OutputScanlines+1DDh (5423ADh) 
0054230F  add         ecx,0FFFFFFFBh 
00542312  shr         ecx,2 
00542315  add         ecx,1 
00542318  mov         dword ptr [esp+14h],ecx 
0054231C  lea         esp,[esp] 
        PRUint32 p0, p1, p2, p3; // to avoid back-to-back register stalls
        p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
00542320  mov         eax,dword ptr [eax] 
00542322  push        eax  
00542323  call        imgRequest::NotifyProxyListener+254h (54D794h) 
        p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00542328  mov         ecx,dword ptr [esp+0Ch] 
0054232C  mov         edx,dword ptr [ecx+3] 
0054232F  mov         edi,eax 
00542331  shr         edi,8 
00542334  push        edx  
00542335  or          edi,0FF000000h 
0054233B  call        imgRequest::NotifyProxyListener+23Ch (54D77Ch) 
00542340  mov         ebx,eax 
        p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00542342  mov         eax,dword ptr [esp+0Ch] 
00542346  mov         ecx,dword ptr [eax+6] 
00542349  shr         ebx,8 
0054234C  push        ecx  
0054234D  or          ebx,0FF000000h 
00542353  call        imgRequest::NotifyProxyListener+224h (54D764h) 
        p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00542358  mov         edx,dword ptr [esp+0Ch] 
0054235C  shr         eax,8 
0054235F  or          eax,0FF000000h 
00542364  mov         dword ptr [esp+20h],eax 
00542368  mov         eax,dword ptr [edx+9] 
0054236B  push        eax  
0054236C  call        imgRequest::NotifyProxyListener+20Bh (54D74Bh) 
        imageRow[0] = p0; imageRow[1] = p1;
        imageRow[2] = p2; imageRow[3] = p3;
00542371  mov         ecx,dword ptr [esp+20h] 
        idx       -=  4;
00542375  sub         dword ptr [esp+10h],4 
0054237A  shr         eax,8 
0054237D  or          eax,0FF000000h 
00542382  mov         dword ptr [esi+0Ch],eax 
00542385  mov         dword ptr [esi],edi 
00542387  mov         dword ptr [esi+4],ebx 
0054238A  mov         dword ptr [esi+8],ecx 
        sampleRow += 12;
0054238D  mov         eax,dword ptr [esp+0Ch] 
00542391  add         eax,0Ch 
        imageRow  +=  4;
00542394  add         esi,10h 
00542397  sub         dword ptr [esp+14h],1 
0054239C  mov         dword ptr [esp+0Ch],eax 
005423A0  jne         nsJPEGDecoder::OutputScanlines+150h (542320h) 
005423A6  mov         edi,dword ptr [this] 
005423A9  mov         ecx,dword ptr [esp+10h] 
      }
[Continuation from above.]

Those calls to ntohl() execute this code:

75033304  mov         ecx,dword ptr [esp+4] 
75033308  mov         eax,ecx 
7503330A  mov         edx,ecx 
7503330C  shl         eax,10h 
7503330F  and         edx,0FF00h 
75033315  or          eax,edx 
75033317  mov         edx,ecx 
75033319  shr         edx,10h 
7503331C  and         ecx,0FF0000h 
75033322  or          edx,ecx 
75033324  shl         eax,8 
75033327  shr         edx,8 
7503332A  or          eax,edx 
7503332C  ret         4    

So even after taking the pain of the stack handling we still don't get the bswap instruction, just generic C code.  The only way this could be worse would be to individually arrange each of the 32 bits in the pixel.
Is this an optimized build? Or does MS' ntohl() really suck that hard?
(In reply to comment #65)
> Is this an optimized build? Or does MS' ntohl() really suck that hard?

Yes, it is an optimized build.

As for ntohl() sucking...  I don't know.  It doesn't make sense to me that it would be this bad, but I can't dispute the code that I'm seeing.
(In reply to comment #57)
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as
> well as movd instruction.

I checked the movd instruction and only the register versions are VectorPath
and slow. The memory versions are DirectPath Double and latencies of 2 and 4 depending on the direction.

That said, I took another stab at an SSE2 solution. The clock starts are listed
on the right. This looks to be about 1.5 clocks per byte on the K8 architecture.
I think that doing 24 bytes at a time would get it down to 1 clock per byte. The
ideal number is 48 bytes as there is no read waste (48 bytes=3 vector reads and
16 pixels). The scalar instructions for updating source and target pointers, and  doing the compare for loop end can be intermingled with the vector instructions and run in parallel.

__declspec(align(16)) const unsigned char ff000000[16] =
  {255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0};
/*
movdqa     xmm1, xmmword ptr source          ! 1
pxor       xmm0, xmm0                        ! 1
movdqa     xmm2, xmmword ptr ff000000        ! 2
movdqa     xmm3, xmm1                        ! 3
pshuflw    xmm1, [0,1,1,2]                   ! 3
psrldq     xmm3, 6                           ! 5
punpcklbw  xmm1, xmm0                        ! 5
pshuflw    xmm3, [0,1,1,2]                   ! 7
pshuflw    xmm1, [0,2,1,0]                   ! 7
pshufhw    xmm1, [0,3,2,1]                   ! 8
punpcklbw  xmm3, xmm0                        ! 9
pshuflw    xmm3, [0,2,1,0]                   ! 11
pshufhw    xmm3, [0,3,2,1]                   ! 11
packuswb   xmm1, xmm3                        ! 13
por        xmm1, xmm2                        ! 15
movlps     qword ptr target, xmm1            ! 17
movhps     qword ptr target+8, xmm1          ! 17
*/
This is from jdcolor.c for the most common type of jpeg. You can change the
byte order output by swapping the RGB_RED and RGB_BLUE tags. The same change
would have to be made for ycck_cmyk_convert. A small but different change would
have to be made for null_convert. gray_rgb_convert doesn't need to be changed.
The only routine that I'm not sure of is grayscale_convert, but nobody has ever complained about it in the three years it has been in my builds. I don't know how developers feel about making this kind of change in JPEG as it changes the functionality.

Some color output code also exists in jdmerge.c but I don't think that Mozilla uses this code.

Even more efficient would be to create four-byte pixel buffers which jdcolor could write into with ffBGR pixels. This would be cheap to do on the JPEG side;
much more expensive in nsJPEGDecoder.cpp.

ycc_rgb_convert (j_decompress_ptr cinfo,
                 JSAMPIMAGE input_buf, JDIMENSION input_row,
                 JSAMPARRAY output_buf, int num_rows)
{
  my_cconvert_ptr cconvert = (my_cconvert_ptr) cinfo->cconvert;
  register int y, cb, cr;
  register JSAMPROW outptr;
  register JSAMPROW inptr0, inptr1, inptr2;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->output_width;
  /* copy these pointers into registers if possible */
  register JSAMPLE * range_limit = cinfo->sample_range_limit;
  register int * Crrtab = cconvert->Cr_r_tab;
  register int * Cbbtab = cconvert->Cb_b_tab;
  register INT32 * Crgtab = cconvert->Cr_g_tab;
  register INT32 * Cbgtab = cconvert->Cb_g_tab;
  SHIFT_TEMPS

  while (--num_rows >= 0) {
    inptr0 = input_buf[0][input_row];
    inptr1 = input_buf[1][input_row];
    inptr2 = input_buf[2][input_row];
    input_row++;
    outptr = *output_buf++;
    for (col = 0; col < num_cols; col++) {
      y  = GETJSAMPLE(inptr0[col]);
      cb = GETJSAMPLE(inptr1[col]);
      cr = GETJSAMPLE(inptr2[col]);
      /* Range-limiting is essential due to noise introduced by DCT losses. */
      outptr[RGB_RED] =   range_limit[y + Crrtab[cr]];
      outptr[RGB_GREEN] = range_limit[y +
                              ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
                                                 SCALEBITS))];
      outptr[RGB_BLUE] =  range_limit[y + Cbbtab[cb]];
      outptr += RGB_PIXELSIZE;
    }
  }
}
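For reference, the knobs this comment refers to live in libjpeg's jmorecfg.h; a sketch of the swap (not a vetted patch, and as noted it changes the library's output contract):

```c
/* jmorecfg.h (libjpeg): offsets of each component in an output pixel.
 * Stock values are RED=0, GREEN=1, BLUE=2; swapping RED and BLUE makes
 * ycc_rgb_convert write B,G,R triples with no change to jdcolor.c. */
#define RGB_RED         2       /* was 0 */
#define RGB_GREEN       1
#define RGB_BLUE        0       /* was 2 */
#define RGB_PIXELSIZE   3       /* bytes per output pixel */
```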
(In reply to comment #67)
> (In reply to comment #57)
> That said, I took another stab at an SSE2 solution.
> movlps     qword ptr target, xmm1            ! 17
> movhps     qword ptr target+8, xmm1          ! 17
> */

That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture because of partial xmm register write stall. At the same time use of movdqa (or even movdqu) is slower on K8. However, not as slow as loading pair on Intel.

(In reply to comment #69)
> That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> because of partial xmm register write stall. At the same time use of movdqa (or
> even movdqu) is slower on K8. However, not as slow as loading pair on Intel.

I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many nuances in SIMD optimization but, in practice, I've found that optimizing for the K8 architecture usually provides for good or better results in Core 2 Duo processors. Or it's okay to give back a percent or two if you're improving performance by ten times that using SIMD.

I'm not familiar with the partial xmm register write stall. Could you point me to where it is in the Intel Core 2 Duo documentation?

Also, given your expertise in this area, do you have any interest in adding SSE2 optimization in the new PNG code? The old mmx code was removed due to maintenance headaches so the current code is now all scalar. The MMX code provided about a 25% performance improvement on my PNG benchmark suite. SSE2 should be able to do a little better.
(In reply to comment #70)
> (In reply to comment #69)
> > That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> > because of partial xmm register write stall. At the same time use of movdqa (or
> > even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
> 
> I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many
> nuances in SIMD optimization but, in practice, I've found that optimizing for
> the K8 architecture usually provides for good or better results in Core 2 Duo
> processors. Or it's okay to give back a percent or two if you're improving
> performance by ten times that using SIMD.
> 
> I'm not familiar with the partial xmm register write stall. Could you point me
> to where it is in the Intel Core 2 Duo documentation?

You can find it here:

http://www.intel.com/design/processor/manuals/248966.pdf

Chapter 3.5.2.4 Partial XMM Register Stalls

That's the first document Google gave me.

In practice I had many FP intensive applications suffering up to 50% because of movlp*/movhp*.

> Also, given your expertise in this area, do you have any interest in adding
> SSE2 optimization in the new PNG code? The old mmx code was removed due to
> maintenance headaches so the current code is now all scalar. The MMX code
> provided about a 25% performance improvement on my PNG benchmark suite. SSE2
> should be able to do a little better.

I would do that if only (or when) have enough time.

(In reply to comment #71)

> You can find it here:
> 
> http://www.intel.com/design/processor/manuals/248966.pdf
> 
> Chapter 3.5.2.4 Partial XMM Register Stalls
> 
> That's the first document Google gave me.
> 
> In practice I had many FP intensive applications suffering up to 50% because of
> movlp*/movhp*.

The documentation there only describes a problem with XMM WRITE stalls or SIMD loads. The code above does SIMD stores and I see no problems described in the Intel documentation with SIMD partial register stores.

Did you see application performance problems using movlps/hps xmm, qword ptr source or movlps/hps qword ptr target, xmm? If the latter, then perhaps Intel needs to update their SOG. If only the former, then my code should be fine, right?

Right ;) Sorry for the confusion. That is the big issue for loads. Must be the AT&T vs. Intel syntax mess in my head. I've just looked up the final decision we made some time ago: the optimal approach we found for blended code is single loads and split stores. So yes, the code should be good for an environment not known beforehand.
Attached patch Hybrid approachSplinter Review
Since ntohl() on Windows seems to suck, I'm switching back to _byteswap_ulong() for Windows, with a fallback to the fast generic macro for older VC++ and MinGW. The rest still uses ntohl(). Oh, and BIG_ENDIAN platforms just do a shift+or.

TryServer said this compiles on Linux, Mac, and Windows. Martijn let me know it compiles on MinGW (yay!), and per prinet.h ntohl() is available on OS/2 and Unix platforms.
Attachment #296902 - Attachment is obsolete: true
Attachment #296978 - Attachment is obsolete: true
Attachment #297255 - Flags: review?
Attachment #297255 - Flags: review? → review?(pavlov)
Attachment #297255 - Flags: review?(pavlov) → review+
Attachment #297255 - Flags: approval1.9?
Moving to blocking list so you are good to land.
Flags: blocking1.9+
Priority: -- → P1
Attachment #297255 - Flags: approval1.9?
Attachment #297255 - Flags: approval1.9?
Comment on attachment 297255 [details] [diff] [review]
Hybrid approach

I'll land this when I get home tonight.
Attachment #297255 - Flags: approval1.9?
Checking in gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v  <--  gfxColor.h
new revision: 1.18; previous revision: 1.17
Status: REOPENED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Depends on: 413143
Blocks: 414947