Closed
Bug 413143
Opened 17 years ago
Closed 17 years ago
trunk is broken(sigbus) on SPARC since 20071221
Categories
(Core :: Graphics: ImageLib, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: armin76, Assigned: jag+mozilla)
References
Details
Attachments
(3 files, 2 obsolete files)
39.06 KB,
image/png
|
Details | |
9.72 KB,
patch
|
pavlov
:
review+
pavlov
:
superreview+
beltzner
:
approval1.9+
|
Details | Diff | Splinter Review |
9.82 KB,
patch
|
Details | Diff | Splinter Review |
Due to the check-in in bug 406580, exactly http://bonsai.mozilla.org/cvsview2.cgi?diff_mode=context&whitespace_mode=show&subdir=mozilla/modules/libpr0n/decoders/png&command=DIFF_FRAMESET&file=nsPNGDecoder.cpp&rev1=1.76&rev2=1.77&root=/cvsroot gives sigbus(segfault) on sparc due to unaligned access.
Reporter | ||
Comment 1•17 years ago
|
||
[Thread debugging using libthread_db enabled] [New Thread 0xf6ddbb70 (LWP 5697)] [New Thread 0xf5967b90 (LWP 5700)] [New Thread 0xf50d3b90 (LWP 5701)] [New Thread 0xf4773b90 (LWP 5702)] [Thread 0xf4773b90 (LWP 5702) exited] [New Thread 0xf3f73b90 (LWP 5703)] [Thread 0xf3f73b90 (LWP 5703) exited] [New Thread 0xf3f73b90 (LWP 5704)] [New Thread 0xf4773b90 (LWP 5708)] [Thread 0xf4773b90 (LWP 5708) exited] [New Thread 0xf3753b90 (LWP 5709)] [New Thread 0xf4773b90 (LWP 5710)] [New Thread 0xf2de7b90 (LWP 5711)] Program received signal SIGBUS, Bus error. [Switching to Thread 0xf6ddbb70 (LWP 5697)] ConvertColormap (aColormap=0x7b94bc, aColors=256) at nsGIFDecoder2.cpp:701 701 nsGIFDecoder2.cpp: No such file or directory. in nsGIFDecoder2.cpp #0 ConvertColormap (aColormap=0x7b94bc, aColors=256) at nsGIFDecoder2.cpp:701 #1 0xf736f658 in nsGIFDecoder2::GifWrite (this=0x7b7308, buf=0x74a8cd "!ù\004\001", len=295) at nsGIFDecoder2.cpp:862 #2 0xf73700f4 in nsGIFDecoder2::ProcessData (this=0x7b7308, data=0x74a5c0 "GIF89a\020", count=1076, _retval=0xfff7c9dc) at nsGIFDecoder2.cpp:243 #3 0xf7370178 in ReadDataOut (in=0x746570, closure=0x7b7308, fromRawSegment=0x74a5c0 "GIF89a\020", toOffset=0, count=1076, writeCount=0xfff7c9dc) at nsGIFDecoder2.cpp:190 #4 0xf7c6dbdc in nsPipeInputStream::ReadSegments (this=0x746570, writer=0xf7370160 <ReadDataOut>, closure=0x7b7308, count=1076, readCount=0xfff7cbdc) at nsPipe3.cpp:799 #5 0xf736ef7c in nsGIFDecoder2::WriteFrom (this=0x7b7308, inStr=0x0, count=1076, _retval=0xfff7cbdc) at nsGIFDecoder2.cpp:262 #6 0xf7369cd0 in imgRequest::OnDataAvailable (this=0x7464d8, aRequest=0x6c2914, ctxt=0x2800, inStr=0x746570, sourceOffset=<value optimized out>, count=1076) at imgRequest.cpp:861 #7 0xf73662e0 in ProxyListener::OnDataAvailable (this=<value optimized out>, aRequest=0x6c2914, ctxt=0x2800, inStr=0x746570, sourceOffset=1076, count=4294429652) at imgLoader.cpp:877 #8 0xf72345b0 in nsBaseChannel::OnDataAvailable (this=0x6c28e8, request=0x7467d8, ctxt=0x0, stream=0x746570, offset=0, count=1076) at nsBaseChannel.cpp:647 #9 0xf723d708 in nsInputStreamPump::OnStateTransfer (this=0x7467d8) at nsInputStreamPump.cpp:508 #10 0xf723d898 in nsInputStreamPump::OnInputStreamReady (this=0x7467d8, stream=0x746570) at nsInputStreamPump.cpp:398 #11 0xf7c6fbdc in nsInputStreamReadyEvent::Run (this=<value optimized out>) at nsStreamUtils.cpp:111 #12 0xf7c8dac0 in nsThread::ProcessNextEvent (this=0x41f68, mayWait=1, result=0xfff7ceec) at nsThread.cpp:510 #13 0xf7c482e8 in NS_ProcessNextEvent_P (thread=0x41f68, mayWait=1) at nsThreadUtils.cpp:227 #14 0xf7b98aa4 in nsBaseAppShell::Run (this=0xab258) at nsBaseAppShell.cpp:154 #15 0xf7a01b18 in nsAppStartup::Run (this=0xf6150) at nsAppStartup.cpp:181 #16 0xf71d85e0 in XRE_main (argc=<value optimized out>, argv=<value optimized out>, aAppData=<value optimized out>) at nsAppRunner.cpp:3207 #17 0x00010cd0 in main (argc=1, argv=0xf7c7dda0) at nsBrowserApp.cpp:158 Seems like nsGIFDecoder2.cpp is affected as well
Updated•17 years ago
|
Component: GFX: Thebes → ImageLib
QA Contact: thebes → imagelib
Assignee | ||
Comment 2•17 years ago
|
||
That kinda sucks. I guess we could have the old code run when we're on a platform that doesn't support unaligned memory access, or maybe there's some way to force the compiler to generate code to support it on those platforms (bound to be slow though). I actually was looking at how we could do 3 memory reads of 4 bytes each, and still efficiently do all the right byte swapping etc. Let me see if I still have that lying around somewhere and if it'd be worth using that. CC'ing Steve and Pavlov for their input.
Comment 3•17 years ago
|
||
The destination of the ARGB writes are already 32-bit-aligned. Therefore it must be the reads that the SPARC is unhappy about. We make no attempt to align the R,G,B source buffer, so on average it is only aligned 1/4 of the time. (Or 3/4 fatal to SPARC it seems.) The most straightfoerward way to fix this is to read bytes until the source pointer is aligned then continue as we do now. Alignment of source would improve performance on x86 as well.
Comment 4•17 years ago
|
||
Something like this TOTALLY UNTESTED code: diff -U5 mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp.original mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp --- mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp.original 2008-01-21 04:48:18.000000000 -0500 +++ mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp 2008-01-21 13:56:50.000000000 -0500 @@ -715,10 +715,17 @@ } // counter for while() loops below PRUint32 idx = mInfo.output_width; + // copy as bytes until source pointer is 32-bit-aligned + while ( idx && (sampleRow & 0x3) ) + { + *imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]); + sampleRow += 3; idx--; + } + // bulk copy of pixels. while (idx > 4) { // >4 to avoid last 3 bytes in buffer PRUint32 p0, p1, p2, p3; // to avoid back-to-back register stalls p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0); p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
Reporter | ||
Comment 5•17 years ago
|
||
I'll test it, but the same code should be applied to nsGIFDecoder.cpp and nsPNGDecoder.cpp, no?
Comment 6•17 years ago
|
||
(In reply to comment #5) > I'll test it, but the same code should be applied to nsGIFDecoder.cpp and > nsPNGDecoder.cpp, no? Yes, if we cannot reliably assume that the source pointer is already aligned.
Assignee | ||
Comment 7•17 years ago
|
||
We might need something like what's used in comment 4, but remember that we're currently doing unaligned memory reads when we do: *((PRUInt32*)(sampleRow+3)) I now have some code in place that will avoid that (and as a bonus it does 3 memory reads instead of 4). Looking at the generated assembly (GCC, you suck!) when we have bswap we use one less opcode (though we might be hitting register stalls, I'd love to see what MSVC++ makes of my new code), when we don't have bswap we really prefer the unaligned access code but the aligned access code isn't half bad. I'll attach a patch shortly. In the mean time, do we have anything to tell us which platforms don't support unaligned access?
Assignee | ||
Comment 8•17 years ago
|
||
Obviously this'll need another iteration, for one to add the code that properly defines UNALIGNED_MEMORY_ACCESS_OK (or its opposite, WANT_ALIGNED_MEMORY_ACCESS), and perhaps different different names for either of those defines ;-) Steve, could you have a look at the assembly MSVC++ generates for this?
Assignee | ||
Comment 9•17 years ago
|
||
c-67-170-198-205:~/moz-src/mozilla jag$ gcc --version i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5250) unaligned bswap |aligned bswap L299: |L299: movl -112(%ebp), %eax | movl -112(%ebp), %eax movl (%eax), %eax | movl (%eax), %edx movl %eax, -156(%ebp) | bswap %edx bswap %eax | movl 4(%eax), %ecx movl %eax, -156(%ebp) | bswap %ecx movl -112(%ebp), %ecx | movl 8(%eax), %esi movl 3(%ecx), %edx | bswap %esi bswap %edx | movl %edx, %eax movl %ecx, %esi | shrl $8, %eax movl 6(%ecx), %ecx | orl $-16777216, %eax bswap %ecx | movl %eax, (%edi) movl 9(%esi), %esi | sall $16, %edx bswap %esi | movl %ecx, %eax shrl $8, -156(%ebp) | shrl $16, %eax orl $-16777216, -156(%ebp) | orl %eax, %edx movl -156(%ebp), %eax | orl $-16777216, %edx movl %eax, (%edi) | movl %edx, 4(%edi) shrl $8, %edx | sall $8, %ecx orl $-16777216, %edx | movl %esi, %eax movl %edx, 4(%edi) | shrl $24, %eax shrl $8, %ecx | orl %eax, %ecx orl $-16777216, %ecx | orl $-16777216, %ecx movl %ecx, 8(%edi) | movl %ecx, 8(%edi) shrl $8, %esi | orl $-16777216, %esi orl $-16777216, %esi | movl %esi, 12(%edi) movl %esi, 12(%edi) | subl $4, -96(%ebp) subl $4, -96(%ebp) | addl $12, -112(%ebp) addl $12, -112(%ebp) | addl $16, %edi addl $16, %edi | cmpl $4, -96(%ebp) cmpl $4, -96(%ebp) | ja L299 ja L299 | jmp L300 jmp L300 |
Assignee | ||
Comment 10•17 years ago
|
||
BTW, GCC sometimes generated fewer instructions when I didn't use intermediary PRUint32s. Of course, it tended to reuse %eax quite a bit.
Comment 11•17 years ago
|
||
Patch applied, compiled with VS2005/SP1: #ifndef UNALIGNED_MEMORY_ACCESS_OK // copy as bytes until source pointer is 32-bit-aligned while (idx-- && (NS_PTR_TO_UINT32(sampleRow) & 0x3)) { *imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]); 00549072 test edi,edi 00549074 je nsJPEGDecoder::OutputScanlines+16Ch (5490ACh) 00549076 sub edi,1 00549079 test dl,3 0054907C je nsJPEGDecoder::OutputScanlines+16Fh (5490AFh) 0054907E movzx eax,byte ptr [edx] 00549081 movzx ecx,byte ptr [edx+1] 00549085 movzx edx,byte ptr [edx+2] 00549089 or eax,0FFFFFF00h 0054908E shl eax,8 00549091 or eax,ecx 00549093 shl eax,8 00549096 or eax,edx 00549098 mov dword ptr [esi],eax sampleRow += 3; 0054909A mov edx,dword ptr [esp+0Ch] 0054909E add edx,3 005490A1 add esi,4 005490A4 test edi,edi 005490A6 mov dword ptr [esp+0Ch],edx 005490AA jne nsJPEGDecoder::OutputScanlines+136h (549076h) 005490AC sub edi,1 #endif // bulk copy of pixels. while (idx > 4) { // >4 to avoid last 3 bytes in buffer 005490AF cmp edi,4 005490B2 jbe nsJPEGDecoder::OutputScanlines+1E5h (549125h) 005490B4 lea eax,[edi-5] 005490B7 shr eax,2 005490BA add eax,1 005490BD mov dword ptr [esp+10h],eax GFX_RGB_TO_FRGB(sampleRow, imageRow); 005490C1 mov eax,dword ptr [edx] 005490C3 mov ecx,dword ptr [edx+4] 005490C6 mov edx,dword ptr [edx+8] 005490C9 bswap eax 005490CB mov ebx,eax 005490CD shr ebx,8 005490D0 or ebx,0FF000000h 005490D6 bswap ecx 005490D8 mov dword ptr [esi],ebx 005490DA or eax,0FFFFFF00h 005490DF mov ebx,ecx 005490E1 shl eax,10h 005490E4 shr ebx,10h 005490E7 or eax,ebx 005490E9 bswap edx 005490EB mov dword ptr [esi+4],eax 005490EE mov eax,edx 005490F0 or ecx,0FFFF0000h 005490F6 shl ecx,8 005490F9 shr eax,18h 005490FC or edx,0FF000000h 00549102 or ecx,eax 00549104 mov dword ptr [esi+0Ch],edx 00549107 mov dword ptr [esi+8],ecx idx -= 4; sampleRow += 12; 0054910A mov edx,dword ptr [esp+0Ch] 0054910E add edx,0Ch 00549111 sub edi,4 imageRow += 4; 00549114 add esi,10h 00549117 sub dword ptr [esp+10h],1 0054911C mov dword ptr [esp+0Ch],edx 00549120 jne nsJPEGDecoder::OutputScanlines+181h (5490C1h) }
Reporter | ||
Comment 12•17 years ago
|
||
With patch from comment 8, this is how it looks like.
Assignee | ||
Comment 13•17 years ago
|
||
Heh, I wouldn't be surprised if that's Sparc only. I forgot to mention "untested". I'll go test right now (since with that patch as-is everyone's going through the "must be aligned" path). Steve, it looks like where we can we should prefer unaligned memory access (shorter path, less back-to-back register use), unless doing only three memory reads, instead of four, makes up for that.
Comment 14•17 years ago
|
||
(In reply to comment #13) > Steve, it looks like where we can we should prefer unaligned memory access > (shorter path, less back-to-back register use), unless doing only three memory > reads, instead of four, makes up for that. It's not just the reduction of a memory access, but also that the 3 remaining accesses are now %100 aligned. That said, I can't confidently predict that one way is better than the other. My unsupported guess is that the small relative performance difference varies by x86 platform. (I.e. the relative benchmark results on Pentium3/SDRAM systems will differ from Pentium4/DDR, and both will differ from the latest Intel C2D/DDR3. All due to differing instruction and CPU cache timing differences.) Even if my benchmarking shows that one scheme is clearly better than the other, I'd be reluctant to assume that's the case for anybody else's Win32 machine.
Assignee | ||
Comment 15•17 years ago
|
||
Steve: alright, point taken. I'm thinking of reducing the number of different cases. If I can always go with aligned memory access and there's no loss in performance, I'd prefer that. Raúl: the fix for the striping (and crashing!) is pretty simple: - while (idx-- && (NS_PTR_TO_UINT32(sampleRow) & 0x3)) { + while ((NS_PTR_TO_UINT32(sampleRow) & 0x3) && idx--) { I'll attach a new patch for people to play with.
Assignee | ||
Comment 16•17 years ago
|
||
Still looking for a better name for GFX_RGB_TO_FRGB. Maybe GFX_BLOCK_RGB_TO_FRGB? Or GFX_4_PIXEL_RGB_TO_FRGB? Also, is there a better test for alignment? Something that doesn't assume 4 byte alignment? And I still need a test to see if alignment is needed. Unless we just go ahead and always use aligned reads. I don't have the time to do performance tests of aligned vs unaligned, I'd appreciate it if someone can help me out there.
Comment 17•17 years ago
|
||
(In reply to comment #16) > I don't have the time to do performance tests of aligned vs unaligned, I'd > appreciate it if someone can help me out there. Done on a dual-Pentium3/850MHz ("Coppermine") with 100MHz SDRAM: jenniferconnelly.jpg (stored locally) ------------------------------------- Trunk 21 Jan (SeaMonkey): load once, reload twice before patch ------------ (0) Total time to decode jpeg image: 740263 (0) Total time to copy all scanlines: 113643 (1) Total time to decode jpeg image: 725490 (1) Total time to copy all scanlines: 111146 (2) Total time to decode jpeg image: 726999 <-- avg: 730,917 (2) Total time to copy all scanlines: 112370 <-- avg: 112,386 after patch ----------- (0) Total time to decode jpeg image: 698394 (0) Total time to copy all scanlines: 98211 (1) Total time to decode jpeg image: 689328 (1) Total time to copy all scanlines: 101617 (2) Total time to decode jpeg image: 706283 <-- avg: 698,001 (2) Total time to copy all scanlines: 98770 <-- avg: 99,532 http://img261.imageshack.us/my.php?image=jenniferconnellymm1.jpg
Reporter | ||
Comment 18•17 years ago
|
||
Works fine on SPARC, thanks!
Assignee | ||
Comment 19•17 years ago
|
||
Attachment #298499 -
Attachment is obsolete: true
Attachment #298731 -
Flags: superreview?(pavlov)
Attachment #298731 -
Flags: review?(swsnyder)
Comment 20•17 years ago
|
||
(In reply to comment #19) > Created an attachment (id=298731) [details] > Always use aligned reads, fix generic GFX_NTOHL impl. (oops!) I'm not authorized to do formal reviews, so you should pick someone else as reviewer. That said, I do have a small nit to pick. // copy remaining pixel(s) + // NB: can't use 32-bit reads, they might read off the end of the buffer while (c--) { from -= 3; - *--to = GFX_0XFF_PPIXEL_FROM_BPTR(from); + *--to = GFX_PACKED_PIXEL(0xFF, from[0], from[1], from[2]); } That 2nd occurrance of the comment "NB: can't use 32-bit reads, they might read off the end of the buffer" in nsGIFDecoder2.cpp is inaccurate. Because the GIF copy is working *backwards* through the buffer, that comment is correct for the 1st pixel read (i.e. that final set of RGB bytes in the buffer). The code above reads the remaining bytes from the lowest buffer addresses. It can't overflow, nor (for single RGB sets read) can it underflow. This is different from JPG and PNG which copy from low addresses to high.
Assignee | ||
Comment 21•17 years ago
|
||
What if aColors == 3 and the last pixel is aligned? That means we'll skip over the first loop, we'll skip over the second loop, and we'll end up at the third. We'll still need to make sure we don't read 4 bytes for the last pixel.
Comment 22•17 years ago
|
||
OK, you're right. I hereby leave that nit unpicked.
Assignee | ||
Comment 23•17 years ago
|
||
Builds on TryServer. Steve, last I heard anyone can r=, as long as a module owner or peer gives additional r= or sr=. Since you wrote the patch that started this all, I think you're qualified to give r= on my patch, as long as you feel comfortable doing so.
Assignee | ||
Updated•17 years ago
|
Attachment #298731 -
Flags: review?(swsnyder) → review?(pavlov)
Updated•17 years ago
|
Attachment #298731 -
Flags: superreview?(pavlov)
Attachment #298731 -
Flags: superreview+
Attachment #298731 -
Flags: review?(pavlov)
Attachment #298731 -
Flags: review+
Updated•17 years ago
|
Flags: blocking1.9?
Updated•17 years ago
|
Attachment #298731 -
Flags: approval1.9?
Comment 24•17 years ago
|
||
Comment on attachment 298731 [details] [diff] [review] Always use aligned reads, fix generic GFX_NTOHL impl. (oops!) a=beltzner
Attachment #298731 -
Flags: approval1.9? → approval1.9+
Assignee | ||
Comment 25•17 years ago
|
||
Checking in gfx/thebes/public/gfxColor.h; /cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h new revision: 1.19; previous revision: 1.18 done Checking in modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp; /cvsroot/mozilla/modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp,v <-- nsGIFDecoder2.cpp new revision: 1.97; previous revision: 1.96 done Checking in modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp; /cvsroot/mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp,v <-- nsJPEGDecoder.cpp new revision: 1.89; previous revision: 1.88 done Checking in modules/libpr0n/decoders/png/nsPNGDecoder.cpp; /cvsroot/mozilla/modules/libpr0n/decoders/png/nsPNGDecoder.cpp,v <-- nsPNGDecoder.cpp new revision: 1.78; previous revision: 1.77 done
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 26•17 years ago
|
||
This replaces GFX_ROL32 with PR_ROTATE_LEFT32 and PR_ROTATE_RIGHT32 (recently added), and |while (idx > 4)| can safely become |while (idx >= 4)| because we're no longer reading one byte past the 4th pixel when doing the block copy.
Comment 27•17 years ago
|
||
Comment on attachment 300284 [details] [diff] [review] What I ended up checking in >+ #else >+ // A reasonably fast generic little-endian implementation. >+ #define GFX_NTOHL(x) \ >+ ( (PR_ROTATE_RIGHT32((x),8) & 0xFF00FF00) | \ >+ (PR_ROTATE_LEFT32((x),8) & 0x00FF00FF) ) Sorry to spoil your fun, but you need to #include "prbit.h" for these...
Comment 28•17 years ago
|
||
(In reply to comment #27) > Sorry to spoil your fun, but you need to #include "prbit.h" for these... That's bug 414971.
You need to log in
before you can comment on or make changes to this bug.
Description
•