Optimization of Inverse Discrete Cosine Transform for SSE2

RESOLVED DUPLICATE of bug 573948

Status

()

Core
ImageLib
RESOLVED DUPLICATE of bug 573948
13 years ago
5 years ago

People

(Reporter: Michael Moy, Assigned: Michael Moy)

Tracking

({perf})

Trunk
x86
Windows XP
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9 -
wanted1.9 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments, 5 obsolete attachments)

(Assignee)

Description

13 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040427 Firefox/0.8.0+ (mmoy-O2-GL7-SSE2-crc32-quek01235)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7b) Gecko/20040427 Firefox/0.8.0+ (mmoy-O2-GL7-SSE2-crc32-quek01235)

~50% code path reduction of the IDCT SSE2 code in jidctint.c

Reproducible: Always
Steps to Reproduce:
1.
2.
3.
(Assignee)

Comment 1

13 years ago
Created attachment 147363 [details] [diff] [review]
jddctmgr.c patch (to align a structure)

jddctmgr.c patch
(Assignee)

Updated

13 years ago
Attachment #147363 - Attachment is patch: true
(Assignee)

Comment 2

13 years ago
Created attachment 147364 [details] [diff] [review]
jidctint.c patch
(Assignee)

Comment 3

13 years ago
The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
to set the review flag. I can see the Flags: and Requestee: tags but there's no
place to fill them in.

One other thing is that the jddctmgr.c patch generates a compiler warning on
msvc++ in the cast to an unsigned integer in the SSE2 varianted code. I tried a 
variety of things to get a clean compile but didn't come up with anything.

Comment 4

13 years ago
(In reply to comment #3)
> The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
> to set the review flag. I can see the Flags: and Requestee: tags but there's no
> place to fill them in.
make sure to select "?" (review, superreview), then the field becomes editable.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Assignee)

Comment 5

13 years ago
Here is what I see in my screen:

http://mysite.verizon.net/vze6w3dz/bugzilla.JPG

The only place that I see a ? option is in the blocking 1.7 and 1.8a dropdown
boxes and that doesn't seem like the right place.

http://mysite.verizon.net/vze6w3dz/bugzilla.JPG(In reply to comment #4)
> (In reply to comment #3)
> > The patches should be reviewed by pavlov@pavlov.net but I can't figure out how
> > to set the review flag. I can see the Flags: and Requestee: tags but there's no
> > place to fill them in.
> make sure to select "?" (review, superreview), then the field becomes editable.

Comment 6

13 years ago
Isn't that uint cast going to break on a 64bit machine?

And, you need to edit the attachment to see the review boxes.
(Assignee)

Comment 7

13 years ago
(In reply to comment #6)
> Isn't that uint cast going to break on a 64bit machine?

No. The SSE2 code is varianted on XP_WIN32 (see jmorecfg.h and jdapimin.c).

Even if it weren't varianted, I think the cast would be okay. The & would be a
problem over 4 GB of memory though. A (x / 16) * 16 would probably be better but
I was looking for something with a fast execution time.

> And, you need to edit the attachment to see the review boxes.

I'll go take a look.
(Assignee)

Updated

13 years ago
Attachment #147363 - Flags: review?(pavlov)
(Assignee)

Updated

13 years ago
Attachment #147364 - Flags: review?(pavlov)

Comment 8

13 years ago
Have you got some perf numbers ?
Is there a test suite for JPEG on the internet ?
what will happen when running with this patch on a CPU that doesn't support SSE2?
Keywords: perf
(Assignee)

Comment 10

13 years ago
(In reply to comment #9)
> what will happen when running with this patch on a CPU that doesn't support SSE2?

The C code path will be used. The code is conditionally compiled for Windows.
If Windows is found, then a CPUID check is run on jpeg initialization which
determines the capabilities of the machine. Currently flags are set for MMX
and/or SSE2 if the CPU has this capability. The cpuid instruction is used to
determine this. The code is at the bottom of jdapimin.c:

#ifdef HAVE_SSE2_INTEL_MNEMONICS

static int sse2support()
{
	int sse2available = 0;
	int my_edx;
	_asm
	{
		mov eax, 01                       
		cpuid                                    
		mov my_edx, edx    
	}
	if (my_edx & (0x1 << 26)) 
		sse2available = 1; 
	else sse2available = 2;

	return sse2available;
}

#endif
(Assignee)

Comment 11

13 years ago
(In reply to comment #8)
> Have you got some perf numbers ?

No. I do have some performance tests that I ran when I filed the bug to turn on
the SSE2 code that eliminated disk and network variances by using a RamDisk for
the software and the images. It loaded three 10 MB files and used a JavaScript
timer to try to remove the human aspect. The big problem is to get Windows to
quiet down enough to get a decent test.

I have received anecdotal comments from users that have downloaded my latest
and intermediate builds and I didn't see any negative comments on performance
or rendering.

If you would like to give it a try, the Pentium4 build is available at
http://www.pryan.org/firefox/mmoy/ The build there will only run on SSE2
enabled machines, though, as the optimization flags include arch:SSE2.

My intermediate builds have cosine, cosine-b and cosine-c in them. You can see
the download statistics on these builds at http://www.pryan.org/stats/

> Is there a test suite for JPEG on the internet ?

Not that I know of. If there were one, I'd expect that it would be something that you want to run locally.
(Assignee)

Comment 12

13 years ago
(In reply to comment #6)
> Isn't that uint cast going to break on a 64bit machine?

I think that I'm going to explictly put the conditional compilation code in for
jddctmgr to only appear on Windows so that it won't even get compiled on 64-bit
and non-Windows platforms to eliminate potential compilation problems. I'll make
the changes and run a full build which will probably be done this afternoon.
(Assignee)

Comment 13

13 years ago
Created attachment 147497 [details]
Update jddctmgr.c patch

Added conditional compilation for alignment code.
(Assignee)

Updated

13 years ago
Attachment #147363 - Attachment is obsolete: true
(Assignee)

Updated

13 years ago
Attachment #147497 - Flags: review?(pavlov)
Comment on attachment 147363 [details] [diff] [review]
jddctmgr.c patch (to align a structure)

clearing obsolete review request
Attachment #147363 - Flags: review?(pavlov)
(Assignee)

Comment 15

13 years ago
I ran my JPEG performance test for my 5/1 build against the official nightly 5/2
build and found a 15% performance improvement.

The test uses a Javascript timer and loads three 10 MB images in one page. The
Program Directory is stored on a ramdisk as are the three 10 MB images along with
the html webpage. The profile used is stored on disk so differences in disk
latency are not minimized by using the ramdisk.

The IDCT is part of JPEG decompression and other parts of the JPEG code may be
worth a look too.

I have an Intel CD-DROM with VTUNE on it that came with an Intel Optimization
book but I haven't tried it out yet. It would probably provide more accurate
statistics in comparing the performance of the current IDCT SSE2 code with 
my version. The version of VTUNE that I have isn't a full version and I don't
know if I can do this analysis or not.
(Assignee)

Comment 16

13 years ago
Another builder reported a crash on startup with this code on Athlon XP though it works fine on P4s. I requested the assembler code around the crash point to debug the platform difference. I've had several hundred downloads and this is the first
report of a crash problem. I'll report back when I get more information.
(Assignee)

Comment 17

13 years ago
There's a one-line cut/paste bug in the jddctmgr.c patch. I'll upload a new one
Saturday night.
(Assignee)

Comment 18

13 years ago
Created attachment 148023 [details] [diff] [review]
Update jddctmgr patch
Attachment #147497 - Attachment is obsolete: true
(Assignee)

Updated

13 years ago
Attachment #147497 - Flags: review?(pavlov)
(Assignee)

Comment 19

13 years ago
Created attachment 148024 [details] [diff] [review]
Corrected jddctmgr.c patch
Attachment #148023 - Attachment is obsolete: true
(Assignee)

Updated

13 years ago
Attachment #148024 - Flags: review?(pavlov)
Is there a possibility to port your optimizations to SSE? Users of slower
systems with SSE support could probably need better performance rather than
those with already fast SSE2 systems (of course better performance is always
appreciated, doesn't really matter how fast your system already is ;-))
(Assignee)

Comment 21

13 years ago
(In reply to comment #20)
> Is there a possibility to port your optimizations to SSE? Users of slower
> systems with SSE support could probably need better performance rather than
> those with already fast SSE2 systems (of course better performance is always
> appreciated, doesn't really matter how fast your system already is ;-))

I have an SSE port that's about 80% done. When I'm done, I'll do some performance
comparisons to see if it's worthwhile trying to get it into the code base.

The SSE port was considerably more work than the SSE2 port as there isn't nearly
as much to work with in SSE compared to SSE2 when you're working with integers.
The SSE floating point support is very good but an FP approach that I tried
proved to only provide very small returns. I think that the high cost of
integer to FP and back to integer may have negated a little less work in scaling
and descaling.

Comment 22

13 years ago
15% perf increase looks very promising, thanks !
I'll consider changing my motherboard and CPU ;-)

Just one more question:
Are you sure that the new code is absolutely equivalent to the old code?

I'm thinking about using both old and new code to decompress some images and
compare the uncompressed bitmaps (possibly convert jpeg to uncompressed bmp
files) to prove that they are the same.
I can't test this because I don't have SSE2 :-(
(Assignee)

Comment 23

13 years ago
(In reply to comment #22)
> Just one more question:
> Are you sure that the new code is absolutely equivalent to the old code?

Well, that's like asking if someone is sure that there are absolutely no bugs
in their code. I laid out what I did to go from the old code to the new code
and was fairly careful at each stage. Is it possible that I made an error in
my understanding of one or two of the SSE2 instructions? Of course.

In general, I have a bunch of ~1 mb images that I used to verify the code from
one stage to the next and you are somewhat dependent on my coding abilities and
my eyesight.

One thing that I'll mention though. When you make a mistake in coding with this
stuff, 19 times out of 20, you know it very quickly.

There have been about 600 downloads of the code at various levels of porting and
I haven't received one report of rendering problems though.

> I'm thinking about using both old and new code to decompress some images and
> compare the uncompressed bitmaps (possibly convert jpeg to uncompressed bmp
> files) to prove that they are the same.
> I can't test this because I don't have SSE2 :-(

I'd welcome such a test. With current review methods, I think reviewers are limited to checking the code which can be quite time-consuming if you're not conversant with these esoteric assembler instructions.
(Assignee)

Comment 24

13 years ago
When you made your comment about comparing bitmaps, I thought to myself how
would you do that? You'd have to write code to intercept the output from the
JPEG code and write it out to a file.

Of course it struck me later that one could simply do a COPY IMAGE and then 
paste it into PAINT and then save it as a bitmap (that's probably the default 
anyways).

I may give this a try later this week.

I also finished the SSE port this evening and it's 19% faster than the official
nightly using my large JPEG rendering test.

Comment 25

13 years ago
You do realize that the jpeg directory is just a modified version of the IJG
codebase, which contains a command line decoder (djpeg.c) that can be used
for timing/testing, right?
(Assignee)

Comment 26

13 years ago
(In reply to comment #25)
> You do realize that the jpeg directory is just a modified version of the IJG
> codebase, which contains a command line decoder (djpeg.c) that can be used
> for timing/testing, right?

Yes to the former, no to the latter. I'll have to take a look at that.
(Assignee)

Comment 27

13 years ago
(In reply to comment #20)
> Is there a possibility to port your optimizations to SSE? Users of slower
> systems with SSE support could probably need better performance rather than
> those with already fast SSE2 systems (of course better performance is always
> appreciated, doesn't really matter how fast your system already is ;-))

Give this build a try:

http://www.pryan.org/mozilla/firefox/mmoy/mmoy-05-03-2004-Pentium3G.exe
(Assignee)

Comment 28

13 years ago
Out of curiousity, how long does it take to get something reviewed? My other two changes got a lot of attention pretty quickly (turning on SSE2 and fixing the
SSE2 alignment problem) but this one seems to be taking a while.

Comment 29

13 years ago
(In reply to comment #28)
> Out of curiousity, how long does it take to get something reviewed?

You have to take the bug (assign to yourself), go on IRC (chatzilla), find a
suitable reviewer. Ultimately mail drivers@mozilla.org to find one.

(Assignee)

Comment 30

13 years ago
Created attachment 149128 [details] [diff] [review]
Updated jidctint.c

Added some small improvements.
Attachment #147364 - Attachment is obsolete: true
(Assignee)

Updated

13 years ago
Attachment #149128 - Flags: review?(pavlov)

Comment 31

13 years ago
can you post a diff of jidcint.c instead of the full file?
(Assignee)

Comment 32

13 years ago
Created attachment 149139 [details] [diff] [review]
jidctint.c patch

Prev file was the c file; not the patch
Attachment #149128 - Attachment is obsolete: true
(Assignee)

Updated

13 years ago
Attachment #149128 - Flags: review?(pavlov)
(Assignee)

Updated

13 years ago
Attachment #149139 - Flags: review?(pavlov)
(Assignee)

Updated

13 years ago
Attachment #147364 - Flags: review?(pavlov)
(Assignee)

Comment 33

13 years ago
This website consistently crashes FF on Linux. It's intermittant on Windows but
crashes just about everytime I go to that site on Linux. I don't have a debug
environment on my Linux box though nor do I have a build environment.
(Assignee)

Comment 34

13 years ago
(In reply to comment #33)
> This website consistently crashes FF on Linux. It's intermittant on Windows but
> crashes just about everytime I go to that site on Linux. I don't have a debug
> environment on my Linux box though nor do I have a build environment.

Please ignore this post, it's meant for another bug.

Updated

13 years ago
Blocks: 203448
Assignee: jdunn → nobody
QA Contact: imagelib
stuart, any chance you could take a look at this for 1.9?
Flags: blocking1.9?

Comment 36

10 years ago
I don't know intel SSE2 so I'm not really qualified to review this.  Not sure who is.
Alfred, do you know SSE2?
(Assignee)

Comment 38

10 years ago
You could ask Makoto Kato as he's generally good with assembler.
Blocks: 331298
Flags: blocking1.9? → blocking1.9-
Whiteboard: [wanted-1.9]

Comment 39

10 years ago
Comment on attachment 149139 [details] [diff] [review]
jidctint.c patch

i just tried running with this patch and I crash loading images on many common websites.

in jpeg_idct_islow_sse2
doing:
pmullw      xmm0, QWORD PTR [edx+0]
Attachment #149139 - Flags: review?(pavlov) → review-
Assignee: nobody → mmoy
(Assignee)

Comment 40

10 years ago
Did you also apply the jddctmgr.c patch? That patch should align the structure that the pmullw is crashing on. That was a long time ago when I wasn't all that
good at doing patches for multiple modules in one file.

Comment 41

10 years ago
i'm not sure that I applied all the correct patches.  Is there any way you can put them together in a single patch?  We're trying to find someone good to review the patches and would like to get them in for Firefox 3 if possible.
Flags: wanted1.9+
Whiteboard: [wanted-1.9]

Comment 42

10 years ago
Michael - once the smoke clears on the other JPEG patches we can try for this one...

Comment 43

8 years ago
Could it be that this bug is superceeded by bug 477728?
The patches look more or less the same.
Adding dependency link to signal relationship, but I think one is a DUP of another...
Depends on: 477728
This is subsumed by bug 573948: Switch to libjpeg-turbo.  I'm going to dupe; feel free to undupe if you think this is wrong.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 573948

Updated

5 years ago
Attachment #148024 - Flags: review?(pavlov)
You need to log in before you can comment on or make changes to this bug.