Closed Bug 242145 Opened 20 years ago Closed 14 years ago

Optimization of Inverse Discrete Cosine Transform for SSE2

Categories

(Core :: Graphics: ImageLib, defect)

x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 573948

People

(Reporter: mmoy, Assigned: mmoy)

References

Details

(Keywords: perf)

Attachments

(2 files, 5 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040427 Firefox/0.8.0+ (mmoy-O2-GL7-SSE2-crc32-quek01235)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040427 Firefox/0.8.0+ (mmoy-O2-GL7-SSE2-crc32-quek01235)

~50% code path reduction of the IDCT SSE2 code in jidctint.c

Reproducible: Always
Steps to Reproduce:
1.
2.
3.
jddctmgr.c patch
Attachment #147363 - Attachment is patch: true
Attached patch jidctint.c patch (obsolete) — Splinter Review
The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
to set the review flag. I can see the Flags: and Requestee: tags but there's no
place to fill them in.

One other thing is that the jddctmgr.c patch generates a compiler warning on
msvc++ in the cast to an unsigned integer in the SSE2 varianted code. I tried a 
variety of things to get a clean compile but didn't come up with anything.
(In reply to comment #3)
> The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
> to set the review flag. I can see the Flags: and Requestee: tags but there's no
> place to fill them in.
make sure to select "?" (review, superreview), then the field becomes editable.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Here is what I see in my screen:

http://mysite.verizon.net/vze6w3dz/bugzilla.JPG

The only place that I see a ? option is in the blocking 1.7 and 1.8a dropdown
boxes and that doesn't seem like the right place.

http://mysite.verizon.net/vze6w3dz/bugzilla.JPG(In reply to comment #4)
> (In reply to comment #3)
> > The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
> > to set the review flag. I can see the Flags: and Requestee: tags but there's no
> > place to fill them in.
> make sure to select "?" (review, superreview), then the field becomes editable.

Isn't that uint cast going to break on a 64bit machine?

And, you need to edit the attachment to see the review boxes.
(In reply to comment #6)
> Isn't that uint cast going to break on a 64bit machine?

No. The SSE2 code is varianted on XP_WIN32 (see jmorecfg.h and jdapimin.c).

Even if it weren't varianted, I think the cast would be okay. The & would be a
problem over 4 GB of memory though. A (x / 16) * 16 would probably be better but
I was looking for something with a fast execution time.

> And, you need to edit the attachment to see the review boxes.

I'll go take a look.
Attachment #147363 - Flags: review?(pavlov)
Attachment #147364 - Flags: review?(pavlov)
Have you got some perf numbers ?
Is there a test suite for JPEG on the internet ?
what will happen when running with this patch on a CPU that doesn't support SSE2?
Keywords: perf
(In reply to comment #9)
> what will happen when running with this patch on a CPU that doesn't support SSE2?

The C code path will be used. The code is conditionally compiled for Windows.
If Windows is found, then a CPUID check is run on jpeg initialization which
determines the capabilities of the machine. Currently flags are set for MMX
and/or SSE2 if the CPU has this capability. The cpuid instruction is used to
determine this. The code is at the bottom of jdapimin.c:

#ifdef HAVE_SSE2_INTEL_MNEMONICS

static int sse2support()
{
	int sse2available = 0;
	int my_edx;
	_asm
	{
		mov eax, 01                       
		cpuid                                    
		mov my_edx, edx    
	}
	if (my_edx & (0x1 << 26)) 
		sse2available = 1; 
	else sse2available = 2;

	return sse2available;
}

#endif
(In reply to comment #8)
> Have you got some perf numbers ?

No. I do have some performance tests that I ran when I filed the bug to turn on
the SSE2 code that eliminated disk and network variances by using a RamDisk for
the software and the images. It loaded three 10 MB files and used a JavaScript
timer to try to remove the human aspect. The big problem is to get Windows to
quiet down enough to get a decent test.

I have received anecdotal comments from users that have downloaded my latest
and intermediate builds and I didn't see any negative comments on performance
or rendering.

If you would like to give it a try, the Pentium4 build is available at
http://www.pryan.org/firefox/mmoy/ The build there will only run on SSE2
enabled machines, though, as the optimization flags include arch:SSE2.

My intermediate builds have cosine, cosine-b and cosine-c in them. You can see
the download statistics on these builds at http://www.pryan.org/stats/

> Is there a test suite for JPEG on the internet ?

Not that I know of. If there were one, I'd expect that it would be something that you want to run locally.
(In reply to comment #6)
> Isn't that uint cast going to break on a 64bit machine?

I think that I'm going to explictly put the conditional compilation code in for
jddctmgr to only appear on Windows so that it won't even get compiled on 64-bit
and non-Windows platforms to eliminate potential compilation problems. I'll make
the changes and run a full build which will probably be done this afternoon.
Attached file Update jddctmgr.c patch (obsolete) —
Added conditional compilation for alignment code.
Attachment #147363 - Attachment is obsolete: true
Attachment #147497 - Flags: review?(pavlov)
Comment on attachment 147363 [details] [diff] [review]
jddctmgr.c patch (to align a structure)

clearing obsolete review request
Attachment #147363 - Flags: review?(pavlov)
I ran my JPEG performance test for my 5/1 build against the official nightly 5/2
build and found a 15% performance improvement.

The test uses a Javascript timer and loads three 10 MB images in one page. The
Program Directory is stored on a ramdisk as are the three 10 MB images along with
the html webpage. The profile used is stored on disk so differences in disk
latency are not minimized by using the ramdisk.

The IDCT is part of JPEG decompression and other parts of the JPEG code may be
worth a look too.

I have an Intel CD-DROM with VTUNE on it that came with an Intel Optimization
book but I haven't tried it out yet. It would probably provide more accurate
statistics in comparing the performance of the current IDCT SSE2 code with 
my version. The version of VTUNE that I have isn't a full version and I don't
know if I can do this analysis or not.
Another builder reported a crash on startup with this code on Athlon XP though it works fine on P4s. I requested the assembler code around the crash point to debug the platform difference. I've had several hundred downloads and this is the first
report of a crash problem. I'll report back when I get more information.
There's a one-line cut/paste bug in the jddctmgr.c patch. I'll upload a new one
Saturday night.
Attached patch Update jddctmgr patch (obsolete) — Splinter Review
Attachment #147497 - Attachment is obsolete: true
Attachment #147497 - Flags: review?(pavlov)
Attachment #148023 - Attachment is obsolete: true
Attachment #148024 - Flags: review?(pavlov)
Is there a possibility to port your optimizations to SSE? Users of slower
systems with SSE support could probably need better performance rather than
those with already fast SSE2 systems (of course better performance is always
appreciated, doesn't really matter how fast your system already is ;-))
(In reply to comment #20)
> Is there a possibility to port your optimizations to SSE? Users of slower
> systems with SSE support could probably need better performance rather than
> those with already fast SSE2 systems (of course better performance is always
> appreciated, doesn't really matter how fast your system already is ;-))

I have an SSE port that's about 80% done. When I'm done, I'll do some performance
comparisons to see if it's worthwhile trying to get it into the code base.

The SSE port was considerably more work than the SSE2 port as there isn't nearly
as much to work with in SSE compared to SSE2 when you're working with integers.
The SSE floating point support is very good but an FP approach that I tried
proved to only provide very small returns. I think that the high cost of
integer to FP and back to integer may have negated a little less work in scaling
and descaling.
15% perf increase looks very promising, thanks !
I'll consider changing my motherboard and CPU ;-)

Just one more question:
Are you sure that the new code is absolutely equivalent to the old code?

I'm thinking about using both old and new code to decompress some images and
compare the uncompressed bitmaps (possibly convert jpeg to uncompressed bmp
files) to prove that they are the same.
I can't test this because I don't have SSE2 :-(
(In reply to comment #22)
> Just one more question:
> Are you sure that the new code is absolutely equivalent to the old code?

Well, that's like asking if someone is sure that there are absolutely no bugs
in their code. I laid out what I did to go from the old code to the new code
and was fairly careful at each stage. Is it possible that I made an error in
my understanding of one or two of the SSE2 instructions? Of course.

In general, I have a bunch of ~1 mb images that I used to verify the code from
one stage to the next and you are somewhat dependent on my coding abilities and
my eyesight.

One thing that I'll mention though. When you make a mistake in coding with this
stuff, 19 times out of 20, you know it very quickly.

There have been about 600 downloads of the code at various levels of porting and
I haven't received one report of rendering problems though.

> I'm thinking about using both old and new code to decompress some images and
> compare the uncompressed bitmaps (possibly convert jpeg to uncompressed bmp
> files) to prove that they are the same.
> I can't test this because I don't have SSE2 :-(

I'd welcome such a test. With current review methods, I think reviewers are limited to checking the code which can be quite time-consuming if you're not conversant with these esoteric assembler instructions.
When you made your comment about comparing bitmaps, I thought to myself how
would you do that? You'd have to write code to intercept the output from the
JPEG code and write it out to a file.

Of course it struck me later that one could simply do a COPY IMAGE and then 
paste it into PAINT and then save it as a bitmap (that's probably the default 
anyways).

I may give this a try later this week.

I also finished the SSE port this evening and it's 19% faster than the official
nightly using my large JPEG rendering test.
You do realize that the jpeg directory is just a modified version of the IJG
codebase, which contains a command line decoder (djpeg.c) that can be used
for timing/testing, right?
(In reply to comment #25)
> You do realize that the jpeg directory is just a modified version of the IJG
> codebase, which contains a command line decoder (djpeg.c) that can be used
> for timing/testing, right?

Yes to the former, no to the latter. I'll have to take a look at that.
(In reply to comment #20)
> Is there a possibility to port your optimizations to SSE? Users of slower
> systems with SSE support could probably need better performance rather than
> those with already fast SSE2 systems (of course better performance is always
> appreciated, doesn't really matter how fast your system already is ;-))

Give this build a try:

http://www.pryan.org/mozilla/firefox/mmoy/mmoy-05-03-2004-Pentium3G.exe
Out of curiousity, how long does it take to get something reviewed? My other two changes got a lot of attention pretty quickly (turning on SSE2 and fixing the
SSE2 alignment problem) but this one seems to be taking a while.
(In reply to comment #28)
> Out of curiousity, how long does it take to get something reviewed?

You have to take the bug (assign to yourself), go on IRC (chatzilla), find a
suitable reviewer. Ultimately mail drivers@mozilla.org to find one.

Attached patch Updated jidctint.c (obsolete) — Splinter Review
Added some small improvements.
Attachment #147364 - Attachment is obsolete: true
Attachment #149128 - Flags: review?(pavlov)
can you post a diff of jidcint.c instead of the full file?
Attached patch jidctint.c patchSplinter Review
Prev file was the c file; not the patch
Attachment #149128 - Attachment is obsolete: true
Attachment #149128 - Flags: review?(pavlov)
Attachment #149139 - Flags: review?(pavlov)
Attachment #147364 - Flags: review?(pavlov)
This website consistently crashes FF on Linux. It's intermittant on Windows but
crashes just about everytime I go to that site on Linux. I don't have a debug
environment on my Linux box though nor do I have a build environment.
(In reply to comment #33)
> This website consistently crashes FF on Linux. It's intermittant on Windows but
> crashes just about everytime I go to that site on Linux. I don't have a debug
> environment on my Linux box though nor do I have a build environment.

Please ignore this post, it's meant for another bug.
Blocks: 203448
Assignee: jdunn → nobody
QA Contact: imagelib
stuart, any chance you could take a look at this for 1.9?
Flags: blocking1.9?
I don't know intel SSE2 so I'm not really qualified to review this.  Not sure who is.
Alfred, do you know SSE2?
You could ask Makoto Kato as he's generally good with assembler.
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]
Comment on attachment 149139 [details] [diff] [review]
jidctint.c patch

i just tried running with this patch and I crash loading images on many common websites.

in jpeg_idct_islow_sse2
doing:
pmullw      xmm0, QWORD PTR [edx+0]
Attachment #149139 - Flags: review?(pavlov) → review-
Assignee: nobody → mmoy
Did you also apply the jddctmgr.c patch? That patch should align the structure that the pmullw is crashing on. That was a long time ago when I wasn't all that
good at doing patches for multiple modules in one file.
i'm not sure that I applied all the correct patches.  Is there any way you can put them together in a single patch?  We're trying to find someone good to review the patches and would like to get them in for Firefox 3 if possible.
Flags: wanted1.9+
Whiteboard: [wanted-1.9]
Michael - once the smoke clears on the other JPEG patches we can try for this one...
Could it be that this bug is superceeded by bug 477728?
The patches look more or less the same.
Adding dependency link to signal relationship, but I think one is a DUP of another...
Depends on: 477728
This is subsumed by bug 573948: Switch to libjpeg-turbo.  I'm going to dupe; feel free to undupe if you think this is wrong.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Attachment #148024 - Flags: review?(pavlov)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: