Closed Bug 413143 Opened 12 years ago Closed 12 years ago

trunk is broken(sigbus) on SPARC since 20071221


(Core :: ImageLib, defect)

Not set





(Reporter: armin76, Assigned: jag+mozilla)




(3 files, 2 obsolete files)

[Thread debugging using libthread_db enabled]
[New Thread 0xf6ddbb70 (LWP 5697)]
[New Thread 0xf5967b90 (LWP 5700)]
[New Thread 0xf50d3b90 (LWP 5701)]
[New Thread 0xf4773b90 (LWP 5702)]
[Thread 0xf4773b90 (LWP 5702) exited]
[New Thread 0xf3f73b90 (LWP 5703)]
[Thread 0xf3f73b90 (LWP 5703) exited]
[New Thread 0xf3f73b90 (LWP 5704)]
[New Thread 0xf4773b90 (LWP 5708)]
[Thread 0xf4773b90 (LWP 5708) exited]
[New Thread 0xf3753b90 (LWP 5709)]
[New Thread 0xf4773b90 (LWP 5710)]
[New Thread 0xf2de7b90 (LWP 5711)]

Program received signal SIGBUS, Bus error.
[Switching to Thread 0xf6ddbb70 (LWP 5697)]
ConvertColormap (aColormap=0x7b94bc, aColors=256) at nsGIFDecoder2.cpp:701
701	nsGIFDecoder2.cpp: No such file or directory.
	in nsGIFDecoder2.cpp
#0  ConvertColormap (aColormap=0x7b94bc, aColors=256) at nsGIFDecoder2.cpp:701
#1  0xf736f658 in nsGIFDecoder2::GifWrite (this=0x7b7308, 
    buf=0x74a8cd "!ù\004\001", len=295) at nsGIFDecoder2.cpp:862
#2  0xf73700f4 in nsGIFDecoder2::ProcessData (this=0x7b7308, 
    data=0x74a5c0 "GIF89a\020", count=1076, _retval=0xfff7c9dc)
    at nsGIFDecoder2.cpp:243
#3  0xf7370178 in ReadDataOut (in=0x746570, closure=0x7b7308, 
    fromRawSegment=0x74a5c0 "GIF89a\020", toOffset=0, count=1076, 
    writeCount=0xfff7c9dc) at nsGIFDecoder2.cpp:190
#4  0xf7c6dbdc in nsPipeInputStream::ReadSegments (this=0x746570, 
    writer=0xf7370160 <ReadDataOut>, closure=0x7b7308, count=1076, 
    readCount=0xfff7cbdc) at nsPipe3.cpp:799
#5  0xf736ef7c in nsGIFDecoder2::WriteFrom (this=0x7b7308, inStr=0x0, 
    count=1076, _retval=0xfff7cbdc) at nsGIFDecoder2.cpp:262
#6  0xf7369cd0 in imgRequest::OnDataAvailable (this=0x7464d8, 
    aRequest=0x6c2914, ctxt=0x2800, inStr=0x746570, 
    sourceOffset=<value optimized out>, count=1076) at imgRequest.cpp:861
#7  0xf73662e0 in ProxyListener::OnDataAvailable (this=<value optimized out>, 
    aRequest=0x6c2914, ctxt=0x2800, inStr=0x746570, sourceOffset=1076, 
    count=4294429652) at imgLoader.cpp:877
#8  0xf72345b0 in nsBaseChannel::OnDataAvailable (this=0x6c28e8, 
    request=0x7467d8, ctxt=0x0, stream=0x746570, offset=0, count=1076)
    at nsBaseChannel.cpp:647
#9  0xf723d708 in nsInputStreamPump::OnStateTransfer (this=0x7467d8)
    at nsInputStreamPump.cpp:508
#10 0xf723d898 in nsInputStreamPump::OnInputStreamReady (this=0x7467d8, 
    stream=0x746570) at nsInputStreamPump.cpp:398
#11 0xf7c6fbdc in nsInputStreamReadyEvent::Run (this=<value optimized out>)
    at nsStreamUtils.cpp:111
#12 0xf7c8dac0 in nsThread::ProcessNextEvent (this=0x41f68, mayWait=1, 
    result=0xfff7ceec) at nsThread.cpp:510
#13 0xf7c482e8 in NS_ProcessNextEvent_P (thread=0x41f68, mayWait=1)
    at nsThreadUtils.cpp:227
#14 0xf7b98aa4 in nsBaseAppShell::Run (this=0xab258) at nsBaseAppShell.cpp:154
#15 0xf7a01b18 in nsAppStartup::Run (this=0xf6150) at nsAppStartup.cpp:181
#16 0xf71d85e0 in XRE_main (argc=<value optimized out>, 
    argv=<value optimized out>, aAppData=<value optimized out>)
    at nsAppRunner.cpp:3207
#17 0x00010cd0 in main (argc=1, argv=0xf7c7dda0) at nsBrowserApp.cpp:158

Seems like nsGIFDecoder2.cpp is affected as well
Component: GFX: Thebes → ImageLib
QA Contact: thebes → imagelib
That kinda sucks. I guess we could have the old code run when we're on a platform that doesn't support unaligned memory access, or maybe there's some way to force the compiler to generate code to support it on those platforms (bound to be slow though).

I actually was looking at how we could do 3 memory reads of 4 bytes each, and still efficiently do all the right byte swapping etc. Let me see if I still have that lying around somewhere and if it'd be worth using that.

CC'ing Steve and Pavlov for their input.
The destination of the ARGB writes are already 32-bit-aligned.  Therefore it must be the reads that the SPARC is unhappy about.  We make no attempt to align the R,G,B source buffer, so on average it is only aligned 1/4 of the time.  (Or 3/4 fatal to SPARC it seems.)

The most straightfoerward way to fix this is to read bytes until the source pointer is aligned then continue as we do now.  Alignment of source would improve performance on x86 as well.
Something like this TOTALLY UNTESTED code:

diff -U5 mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp.original mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp
--- mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp.original    2008-01-21 04:48:18.000000000 -0500
+++ mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp     2008-01-21 13:56:50.000000000 -0500
@@ -715,10 +715,17 @@

       // counter for while() loops below
       PRUint32 idx = mInfo.output_width;

+      // copy as bytes until source pointer is 32-bit-aligned
+      while ( idx && (sampleRow & 0x3) )
+      {
+        *imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]);
+        sampleRow += 3; idx--;
+      }
       // bulk copy of pixels.
       while (idx > 4) {          // >4 to avoid last 3 bytes in buffer
         PRUint32 p0, p1, p2, p3; // to avoid back-to-back register stalls
         p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
         p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
I'll test it, but the same code should be applied to nsGIFDecoder.cpp and nsPNGDecoder.cpp, no?
(In reply to comment #5)
> I'll test it, but the same code should be applied to nsGIFDecoder.cpp and
> nsPNGDecoder.cpp, no?

Yes, if we cannot reliably assume that the source pointer is already aligned.

We might need something like what's used in comment 4, but remember that we're currently doing unaligned memory reads when we do:


I now have some code in place that will avoid that (and as a bonus it does 3 memory reads instead of 4). Looking at the generated assembly (GCC, you suck!) when we have bswap we use one less opcode (though we might be hitting register stalls, I'd love to see what MSVC++ makes of my new code), when we don't have bswap we really prefer the unaligned access code but the aligned access code isn't half bad.

I'll attach a patch shortly. In the mean time, do we have anything to tell us which platforms don't support unaligned access?
Attached patch Something like this? (obsolete) — Splinter Review
Obviously this'll need another iteration, for one to add the code that properly defines UNALIGNED_MEMORY_ACCESS_OK (or its opposite, WANT_ALIGNED_MEMORY_ACCESS), and perhaps different different names for either of those defines ;-)

Steve, could you have a look at the assembly MSVC++ generates for this?
c-67-170-198-205:~/moz-src/mozilla jag$ gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5250)

unaligned bswap                |aligned bswap
L299:                          |L299:
  movl  -112(%ebp), %eax       |  movl  -112(%ebp), %eax
  movl  (%eax), %eax           |  movl  (%eax), %edx
  movl  %eax, -156(%ebp)       |  bswap   %edx
  bswap   %eax                 |  movl  4(%eax), %ecx
  movl  %eax, -156(%ebp)       |  bswap   %ecx
  movl  -112(%ebp), %ecx       |  movl  8(%eax), %esi
  movl  3(%ecx), %edx          |  bswap   %esi
  bswap   %edx                 |  movl  %edx, %eax
  movl  %ecx, %esi             |  shrl  $8, %eax
  movl  6(%ecx), %ecx          |  orl $-16777216, %eax
  bswap   %ecx                 |  movl  %eax, (%edi)
  movl  9(%esi), %esi          |  sall  $16, %edx
  bswap   %esi                 |  movl  %ecx, %eax
  shrl  $8, -156(%ebp)         |  shrl  $16, %eax
  orl $-16777216, -156(%ebp)   |  orl %eax, %edx
  movl  -156(%ebp), %eax       |  orl $-16777216, %edx
  movl  %eax, (%edi)           |  movl  %edx, 4(%edi)
  shrl  $8, %edx               |  sall  $8, %ecx
  orl $-16777216, %edx         |  movl  %esi, %eax
  movl  %edx, 4(%edi)          |  shrl  $24, %eax
  shrl  $8, %ecx               |  orl %eax, %ecx
  orl $-16777216, %ecx         |  orl $-16777216, %ecx
  movl  %ecx, 8(%edi)          |  movl  %ecx, 8(%edi)
  shrl  $8, %esi               |  orl $-16777216, %esi
  orl $-16777216, %esi         |  movl  %esi, 12(%edi)
  movl  %esi, 12(%edi)         |  subl  $4, -96(%ebp)
  subl  $4, -96(%ebp)          |  addl  $12, -112(%ebp)
  addl  $12, -112(%ebp)        |  addl  $16, %edi
  addl  $16, %edi              |  cmpl  $4, -96(%ebp)
  cmpl  $4, -96(%ebp)          |  ja  L299
  ja  L299                     |  jmp L300
  jmp L300                     |

BTW, GCC sometimes generated fewer instructions when I didn't use intermediary PRUint32s. Of course, it tended to reuse %eax quite a bit.
Patch applied, compiled with VS2005/SP1:

      // copy as bytes until source pointer is 32-bit-aligned
      while (idx-- && (NS_PTR_TO_UINT32(sampleRow) & 0x3)) {
        *imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]);
00549072  test        edi,edi 
00549074  je          nsJPEGDecoder::OutputScanlines+16Ch (5490ACh) 
00549076  sub         edi,1 
00549079  test        dl,3 
0054907C  je          nsJPEGDecoder::OutputScanlines+16Fh (5490AFh) 
0054907E  movzx       eax,byte ptr [edx] 
00549081  movzx       ecx,byte ptr [edx+1] 
00549085  movzx       edx,byte ptr [edx+2] 
00549089  or          eax,0FFFFFF00h 
0054908E  shl         eax,8 
00549091  or          eax,ecx 
00549093  shl         eax,8 
00549096  or          eax,edx 
00549098  mov         dword ptr [esi],eax 
        sampleRow += 3;
0054909A  mov         edx,dword ptr [esp+0Ch] 
0054909E  add         edx,3 
005490A1  add         esi,4 
005490A4  test        edi,edi 
005490A6  mov         dword ptr [esp+0Ch],edx 
005490AA  jne         nsJPEGDecoder::OutputScanlines+136h (549076h) 
005490AC  sub         edi,1 

      // bulk copy of pixels.
      while (idx > 4) {          // >4 to avoid last 3 bytes in buffer
005490AF  cmp         edi,4 
005490B2  jbe         nsJPEGDecoder::OutputScanlines+1E5h (549125h) 
005490B4  lea         eax,[edi-5] 
005490B7  shr         eax,2 
005490BA  add         eax,1 
005490BD  mov         dword ptr [esp+10h],eax 

        GFX_RGB_TO_FRGB(sampleRow, imageRow);
005490C1  mov         eax,dword ptr [edx] 
005490C3  mov         ecx,dword ptr [edx+4] 
005490C6  mov         edx,dword ptr [edx+8] 
005490C9  bswap       eax  
005490CB  mov         ebx,eax 
005490CD  shr         ebx,8 
005490D0  or          ebx,0FF000000h 
005490D6  bswap       ecx  
005490D8  mov         dword ptr [esi],ebx 
005490DA  or          eax,0FFFFFF00h 
005490DF  mov         ebx,ecx 
005490E1  shl         eax,10h 
005490E4  shr         ebx,10h 
005490E7  or          eax,ebx 
005490E9  bswap       edx  
005490EB  mov         dword ptr [esi+4],eax 
005490EE  mov         eax,edx 
005490F0  or          ecx,0FFFF0000h 
005490F6  shl         ecx,8 
005490F9  shr         eax,18h 
005490FC  or          edx,0FF000000h 
00549102  or          ecx,eax 
00549104  mov         dword ptr [esi+0Ch],edx 
00549107  mov         dword ptr [esi+8],ecx 
        idx       -=  4;
        sampleRow += 12;
0054910A  mov         edx,dword ptr [esp+0Ch] 
0054910E  add         edx,0Ch 
00549111  sub         edi,4 
        imageRow  +=  4;
00549114  add         esi,10h 
00549117  sub         dword ptr [esp+10h],1 
0054911C  mov         dword ptr [esp+0Ch],edx 
00549120  jne         nsJPEGDecoder::OutputScanlines+181h (5490C1h) 
Attached image ffsparc.png
With patch from comment 8, this is how it looks like.
Heh, I wouldn't be surprised if that's Sparc only. I forgot to mention "untested". I'll go test right now (since with that patch as-is everyone's going through the "must be aligned" path).

Steve, it looks like where we can we should prefer unaligned memory access (shorter path, less back-to-back register use), unless doing only three memory reads, instead of four, makes up for that.
(In reply to comment #13)
> Steve, it looks like where we can we should prefer unaligned memory access
> (shorter path, less back-to-back register use), unless doing only three memory
> reads, instead of four, makes up for that.

It's not just the reduction of a memory access, but also that the 3 remaining accesses are now %100 aligned.

That said, I can't confidently predict that one way is better than the other.  My unsupported guess is that the small relative performance difference varies by x86 platform.  (I.e. the relative benchmark results on Pentium3/SDRAM systems will differ from Pentium4/DDR, and both will differ from the latest Intel C2D/DDR3.  All due to differing instruction and CPU cache timing differences.)

Even if my benchmarking shows that one scheme is clearly better than the other, I'd be reluctant to assume that's the case for anybody else's Win32 machine.
Steve: alright, point taken. I'm thinking of reducing the number of different cases. If I can always go with aligned memory access and there's no loss in performance, I'd prefer that.

Raúl: the fix for the striping (and crashing!) is pretty simple:

-      while (idx-- && (NS_PTR_TO_UINT32(sampleRow) & 0x3)) {
+      while ((NS_PTR_TO_UINT32(sampleRow) & 0x3) && idx--) {

I'll attach a new patch for people to play with.
Still looking for a better name for GFX_RGB_TO_FRGB. Maybe GFX_BLOCK_RGB_TO_FRGB? Or GFX_4_PIXEL_RGB_TO_FRGB?

Also, is there a better test for alignment? Something that doesn't assume 4 byte alignment? And I still need a test to see if alignment is needed. Unless we just go ahead and always use aligned reads.

I don't have the time to do performance tests of aligned vs unaligned, I'd appreciate it if someone can help me out there.
Assignee: nobody → jag
Attachment #298405 - Attachment is obsolete: true
(In reply to comment #16)
> I don't have the time to do performance tests of aligned vs unaligned, I'd
> appreciate it if someone can help me out there.

Done on a dual-Pentium3/850MHz ("Coppermine") with 100MHz SDRAM:

jenniferconnelly.jpg (stored locally)
Trunk 21 Jan (SeaMonkey): load once, reload twice

before patch
(0) Total time to decode jpeg image:     740263
(0) Total time to copy all scanlines:    113643
(1) Total time to decode jpeg image:     725490
(1) Total time to copy all scanlines:    111146
(2) Total time to decode jpeg image:     726999 <-- avg: 730,917
(2) Total time to copy all scanlines:    112370 <-- avg: 112,386

after patch
(0) Total time to decode jpeg image:     698394
(0) Total time to copy all scanlines:     98211
(1) Total time to decode jpeg image:     689328
(1) Total time to copy all scanlines:    101617
(2) Total time to decode jpeg image:     706283 <-- avg: 698,001
(2) Total time to copy all scanlines:     98770 <-- avg:  99,532
Works fine on SPARC, thanks!
Attachment #298499 - Attachment is obsolete: true
Attachment #298731 - Flags: superreview?(pavlov)
Attachment #298731 - Flags: review?(swsnyder)
(In reply to comment #19)
> Created an attachment (id=298731) [details]
> Always use aligned reads, fix generic GFX_NTOHL impl. (oops!)

I'm not authorized to do formal reviews, so you should pick someone else as reviewer.  That said, I do have a small nit to pick.

   // copy remaining pixel(s)
+  // NB: can't use 32-bit reads, they might read off the end of the buffer
   while (c--) {
     from -= 3;
-    *--to = GFX_0XFF_PPIXEL_FROM_BPTR(from);
+    *--to = GFX_PACKED_PIXEL(0xFF, from[0], from[1], from[2]);

That 2nd occurrance of the comment "NB: can't use 32-bit reads, they might read off the end of the buffer" in nsGIFDecoder2.cpp is inaccurate.  Because the GIF copy is working *backwards* through the buffer, that comment is correct for the 1st pixel read (i.e. that final set of RGB bytes in the buffer).  The code above reads the remaining bytes from the lowest buffer addresses.  It can't overflow, nor (for single RGB sets read) can it underflow.  This is different from JPG and PNG which copy from low addresses to high.

What if aColors == 3 and the last pixel is aligned? That means we'll skip over the first loop, we'll skip over the second loop, and we'll end up at the third. We'll still need to make sure we don't read 4 bytes for the last pixel.
OK, you're right.  I hereby leave that nit unpicked.
Builds on TryServer.

Steve, last I heard anyone can r=, as long as a module owner or peer gives additional r= or sr=. Since you wrote the patch that started this all, I think you're qualified to give r= on my patch, as long as you feel comfortable doing so.
Attachment #298731 - Flags: review?(swsnyder) → review?(pavlov)
Attachment #298731 - Flags: superreview?(pavlov)
Attachment #298731 - Flags: superreview+
Attachment #298731 - Flags: review?(pavlov)
Attachment #298731 - Flags: review+
Flags: blocking1.9?
Attachment #298731 - Flags: approval1.9?
Comment on attachment 298731 [details] [diff] [review]
Always use aligned reads, fix generic GFX_NTOHL impl. (oops!)

Attachment #298731 - Flags: approval1.9? → approval1.9+
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v  <--  gfxColor.h
new revision: 1.19; previous revision: 1.18
Checking in modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp,v  <--  nsGIFDecoder2.cpp
new revision: 1.97; previous revision: 1.96
Checking in modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp,v  <--  nsJPEGDecoder.cpp
new revision: 1.89; previous revision: 1.88
Checking in modules/libpr0n/decoders/png/nsPNGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/png/nsPNGDecoder.cpp,v  <--  nsPNGDecoder.cpp
new revision: 1.78; previous revision: 1.77
Closed: 12 years ago
Resolution: --- → FIXED
This replaces GFX_ROL32 with PR_ROTATE_LEFT32 and PR_ROTATE_RIGHT32 (recently added), and |while (idx > 4)| can safely become |while (idx >= 4)| because we're no longer reading one byte past the 4th pixel when doing the block copy.
Depends on: 414854
Depends on: 414971
Comment on attachment 300284 [details] [diff] [review]
What I ended up checking in

>+  #else
>+    // A reasonably fast generic little-endian implementation.
>+    #define GFX_NTOHL(x) \
>+         ( (PR_ROTATE_RIGHT32((x),8) & 0xFF00FF00) | \
>+           (PR_ROTATE_LEFT32((x),8)  & 0x00FF00FF) )
Sorry to spoil your fun, but you need to #include "prbit.h" for these...
(In reply to comment #27)
> Sorry to spoil your fun, but you need to #include "prbit.h" for these...

That's bug 414971.

You need to log in before you can comment on or make changes to this bug.