Bug 243642: Optimize abs, min, max by removing branches
Opened 20 years ago
Closed 11 months ago
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WONTFIX
People: Reporter: caillon; Assignee: Unassigned
Keywords: perf
Attachments (3 files)
- patch, 2.52 KB
- patch, 213.95 KB
- patch, 721 bytes
Most CPUs use branch prediction, which works well with loops and with branches where the probability of one branch being taken is quite high. For branches where each side is about equally likely to be taken, there is some overhead. The Intel and AMD optimization guidelines recommend rewriting abs, min, and max to avoid branching entirely. We can save a small amount of code size (only several KB), and I would imagine we would get some performance improvement out of this.

I don't have a good box to run performance tests on (my slowest box is a P4 3.0 GHz which is having trouble starting X at the moment, and my other box is a dual Xeon 2.4 GHz box), or I'd throw out some numbers. If someone wants to test, that would be awesome. Given the places that make extensive use of these functions (table layout and font metrics are the two heavy hitters here), this should be worthwhile for page loading.

Note: I did not change the NSPR versions (PR_ABS, PR_MIN, PR_MAX) because, as written, they allow objects to be passed in (as long as the objects passed in define the right operators), which my new version would break. I could not think of any other decent place to put these, so I stuck them in nscore.h -- if anyone can think of a better place for this stuff, I have no problem with moving it all.
Comment 1 (Reporter) • 20 years ago
Assignee: general → caillon
Status: NEW → ASSIGNED
Comment 2 (Reporter) • 20 years ago
Comment 3 • 20 years ago
NS_MIN and NS_MAX are already defined in xpcom/string/public/nsCharTraits.h, but maybe those can be renamed for use in the string library, or maybe they can leverage these routines.
Comment 4 (Reporter) • 20 years ago
Oh, hm. I didn't see that in my initial search for some reason. Yeah, we can rename that if we choose to use NS_MAX/NS_MIN for this. Would renaming the existing versions to NS_STRING_MIN/NS_STRING_MAX be acceptable?
Comment 5 • 20 years ago
We're talking about xpcom/string/public/nsAlgorithm.h, right?
Comment 6 • 20 years ago
Note that the only callers seem to be passing in ints, so in fact just eliminating those NS_MIN macros may be appropriate...
Comment 7 • 20 years ago
(In reply to comment #6)
> Note that the only callers seem to be passing in ints, so in fact just
> eliminating those NS_MIN macros may be appropriate...

Agreed... and, yeah, I mistyped.
Comment 8 (Reporter) • 20 years ago
bz, yeah, that would be even more ideal. I'll make sure that doesn't break anything.
Comment 9 (Reporter) • 20 years ago
(And I realized that my initial codesighs tests were using -Os. I'm rebuilding both before and after with -O1 now....)
Comment 10 (Reporter) • 20 years ago
Okay, the codesize win is still fairly negligent for me (4k), though I really suspect that the big win isn't going to be in codesize anyway, but in performance.
Comment 11 (Reporter) • 20 years ago
Comment 12 • 20 years ago
4k is not *negligible* -- we would be *negligent* to turn our noses up at it ;-). But I'm keener to see the cycle savings. /be
Comment 13 • 20 years ago
Did anyone do any timing tests? On my Pentium 4 using gcc 3.4.0 and tuning for the P4 I find absolutely no time difference. Using other optimizations I see at most a savings of 6 clock cycles. That's less than 2.5 ns. If I'm right then it'll be quite a while before the development time is recovered. :-)
Comment 14 • 20 years ago
I'm wondering if there are some CPUs where abs, min, and max are part of the instruction set? (I couldn't find them for the Intel P4.)
Comment 15 (Reporter) • 20 years ago
(In reply to comment #13)
> Did anyone do any timing tests? On my Pentium 4 using gcc 3.4.0 and tuning for
> the P4 I find absolutely no time difference. Using other optimizations I see
> at most a savings of 6 clock cycles. That's less than 2.5 ns.

Luckily, your CPU is not the only machine I care about. And if that is indeed 6 cycles per call, then this seems surely worth it. It's not like these functions have been taking thousands of cycles to begin with. We do make fair use of these functions during table layout and font rendering.

I think the bottom line is "actual mileage may vary." I'm more keen to see the cumulative results on the page load test, with real-world implications on different platforms, than to see the individual call results here, but I think it fair to say that if we can save cycles with an easy patch such as this, let's do it.

(In reply to comment #14)
> I'm wondering if there are some CPU where abs, min, max is part of their
> instruction set ? (I couldn't find them for Intel P4)

I don't believe there are any, but I could be wrong.
Comment 16 • 20 years ago
> Luckily, your CPU is not the only machine I care about ...
Exactly what CPU are you optimizing for? All Intel CPUs from the P6 on have a
cmov instruction which is designed to handle precisely the case you describe
here. P6s are very old, so is there a compiler in common use that doesn't
understand cmov? I believe Motorola chips have equivalent instructions.
I did some more testing. Building and tuning for a P6 or above finds zero
difference between the usual code and your code. That's because gcc uses
cmov in these cases. Building and tuning for a 586 makes gcc use branching
but I still find zero difference. I believe that's because the branches are
short, so speculative execution eliminates any differences. Only if I build
and tune for a 486 or below do I see any differences.
I agree that the only things that matter are test results but I don't think
that using the mozilla trunk as a test rig for something like this is
appropriate. I would want a proof of concept demo. And that's what I
constructed. I got unexpected (based on your statements) results which is
why I asked if anyone else had run tests. It appears I'm the only one who
did.
I suspect the only differences you might see would be due to different
compilers. You are implicitly assuming the compiler will optimize your code
reasonably. I'm not sure you should assume that. Also note that compilers
often have special optimization routines for the common programming idioms.
Doing something tricky may well cause you problems.
Comment 17 (Reporter) • 20 years ago
(In reply to comment #16)
> > Luckily, your CPU is not the only machine I care about ...
> Exactly what cpu are optimizing for? All Intel cpus

Don't forget that we also care about AMD, G4/G5, PPC, etc. CPUs.

> I did some more testing. Building and tuning for a P6 or above finds
> zero difference between the usual code and your code.

Interesting. On a P4 running Rawhide, I see the following differences using -march=pentium4:

    old_abs:
        pushl %ebp
        movl %esp, %ebp
        subl $4, %esp
        movl 8(%ebp), %eax
        movl %eax, -4(%ebp)
        cmpl $0, -4(%ebp)
        jns .L5
        negl -4(%ebp)
    .L5:
        movl -4(%ebp), %eax
        leave
        ret
        .size old_abs, .-old_abs

    new_abs:
        pushl %ebp
        movl %esp, %ebp
        subl $4, %esp
        movl 8(%ebp), %eax
        sarl $31, %eax
        movl %eax, -4(%ebp)
        movl -4(%ebp), %eax
        xorl 8(%ebp), %eax
        subl -4(%ebp), %eax
        leave
        ret

Are you using |abs| instead of explicitly using the old code? Even so, abs is defined to __builtin_abs() for Pentium 4 on gcc and uses the exact same code with the jump, AFAICT.

The new code is a ~9% win for me on Pentium 4, using Rawhide without optimizations, and a ~6% win with optimizations. I also ran on a PPC64 box with gcc 3.3. There, the code for abs is the same, and it executes with no noticeable difference in time before or after. However, PPC64 still generates jumps for min and max with the old code, and I see a ~7% performance improvement using min and max with my new code and -O2.

> I agree that the only things that matter are test results but I don't
> think that using the mozilla trunk as a test rig for something like this
> is appropriate.

Test rig for what? The new code is known to be optimal. It is copied pretty much verbatim from the AMD and Intel optimization guideline manuals. I made relevant changes to use NSPR types and added comments. That's all.

Some CPU and compiler configurations might already be doing just fine with the old code. Others may not. Please don't think that a Pentium 4 using gcc 3.4 (does anyone ship gcc 3.4 as the default compiler?) with Pentium 4 tuning is how every Mozilla will be built. Some distributions still cater to older machines. Again, "actual mileage may vary." If you are getting optimal code already, I'll drink a beer for you tonight. And then I'll drink a beer for those who aren't getting optimal code yet but would like to.

(In reply to comment #13)
> If I'm right then it'll be quite a while before the development time is
> recovered. :-)

I was reading optimization guides, and I found this tip, which could apply to Mozilla and contained the code I am now using, so I wrote a quick patch set. I did not expect to cut down Mozilla's overall Tinderbox test times by 10%; I expected to help out a little bit, where every little bit helps. If we can give optimal code to those previously getting suboptimal code, and let those already getting optimal code continue to do so, I think it is worth doing.

I spent about a minute porting the code found in the Intel and AMD manuals to NSPR types, about 10 minutes writing the comments, and maybe 5 minutes doing a global search and replace, generating the second patch, and cutting out the directories which can't rely on XPCOM. I have spent close to an hour over the past few days replying to your comments, which you yourself have shown could help depending on some variables (for you, it either had no effect or a positive effect). For me, on two different machines and architectures, it has positive effects. I appreciate the help in increasing the development time.
Comment 18 • 20 years ago
Comment on attachment 148531 [details] [diff] [review] (Optimized branchless versions of abs,min,max):

Why don't we use the C library's abs?
Comment 19 • 20 years ago
glibc uses builtins for abs/min/max if on gcc >= 2.97, else it uses ASM. Plus, CVS gcc (maybe earlier versions too) knows how to do range propagation based on the result of abs being >= 0.
Comment 20 (Reporter) • 20 years ago
It's my understanding though that even the latest glibc uses the jump version, as does the builtin in gcc. That notwithstanding, it would be nice to offer this to people who aren't running the latest versions of glibc or gcc. And what's the story with say MSVC?
Comment 21 • 20 years ago
Hey caillon, in comment #17 you mention performance results, but I didn't see where you mentioned the test you used to gather those results. Are you using the standard Tp test, or some custom test?
Comment 22 • 20 years ago
> Don't forget that we also care about AMD, G4/5, PPC, etc. CPUs.

Yes, that's why I mentioned Motorola.

> Interesting. On a P4 running Rawhide, I see the following differences
> using -march=pentium4

So here are two versions of an abs function:

    static inline PRInt32 my_abs1 (PRInt32 x) { return (x < 0) ? -x : x; }
    static inline PRInt32 my_abs2 (PRInt32 x) { return abs(x); }

Using my gcc-3.4.0 built from a pristine GNU source tarball and the command line "gcc -S -fverbose-asm -fmarch=pentium4 -O", my two versions and your code produce the _same_ result (modulo a choice of registers):

    movl %edi, %eax    # x, tmp67
    sarl $31, %eax     #, tmp67
    movl %eax, %ecx    # tmp67, <anonymous>
    xorl %edi, %ecx    # x, <anonymous>
    subl %eax, %ecx    # tmp67, <anonymous>

Using -O2, my_abs2 and your code still produce the same output, but my_abs1 becomes:

    movl %edi, %ecx    # x, tmp67
    negl %ecx          # tmp67
    cmpl $-1, %edi     #, x
    cmovg %edi, %ecx   # tmp67,, x, <anonymous>

Comparing this function

    static inline PRInt32 my_min (PRInt32 x, PRInt32 y) { return (x < y) ? x : y; }

and yours, using -O, my_min becomes

    cmpl 8(%ebp), %edi   # x, y
    movl 8(%ebp), %ecx   # x, <anonymous>
    cmovle %edi, %ecx    # y,, <anonymous>

and your function becomes

    movl 8(%ebp), %eax      # x, x
    subl %edi, %eax         # y, x
    movl %eax, %ecx         # x, tmp67
    sarl $31, %ecx          #, tmp67
    andl %eax, %ecx         # x, tmp68
    leal (%edi,%ecx), %ecx  #, y

Using -O2 produces the same results. No jumps anywhere. Obviously the gcc people read the same stuff you do.

> Test rig for what? The new code is known to be optimal.

Whatever Intel says doesn't necessarily apply to other CPUs. "Optimal" doesn't count; the only thing that matters is actual timing tests. So far you see a difference on your machine and I don't on mine. The two sample points differ. I would say that's a good reason for caution. It's very difficult to determine whether a proposal really is better or whether there's some other interaction with the particular compiler that makes it appear better.

> Some CPU and compiler configurations might already be doing just fine with
> the old code. Others may not.

But this is no reason to change the code for everyone. If you know there are problems with a particular compiler, you can always select different code when you build. Otherwise, using generic code is usually best.
Comment 23 (Reporter) • 20 years ago
Darin, it was just a custom test on the functions, iterating from int32_min to int32_max.
Comment 24 (Reporter) • 20 years ago
(In reply to comment #22)
> Using my gcc-3.4.0 built from a pristine GNU source tarball and the command
> line "gcc -S -fverbose-asm -fmarch=pentium4 -O" my two versions and your
> code produce the _same_ result (modulo a choice of registers):

That's good if you are building Mozilla for yourself on your own machine. Distributors (save for Gentoo) do not have the luxury of building for the exact hardware everyone is running; they must be more generic. If you can get the same code by tuning for i686, I would be more interested in your argument. Also, does gcc 3.3 produce the same code? That is the current default install for distributions, AFAIK, because of ABI issues in gcc 3.4. Fedora Core 2, which will be out shortly, will have gcc 3.4 available, but it will not be the default gcc.
Comment 25 • 20 years ago
tenthumbs makes the same excellent point proven repeatedly over the years on comp.arch, comp.compilers, etc. The compiler should select the best instructions; if we hand-select, we'll be doing wrong in the long run, and on other platforms than the one we test on. Is the gcc ABI changing *again*? I give up. /be
Comment 26 (Reporter) • 20 years ago
Well, let's get a decision made. Do we want this patchset or not? If so, who should do the reviews? I don't care to spend too much more time here. If nothing else, this was a fun academic exercise, but I have more pressing things to work on.
Comment 27 • 20 years ago
I think comment 0 is in part invalid. We need to solve specific problems on certain OS/compiler target platforms with a patch such as this bug's. A general claim that we should code in C to avoid branches in abs/min/max is not valid in the modern age, with optimizing back ends targeting super-scalar CPUs where the costs and instruction-scheduling challenges are complex. In the '80s, with the primitive Portable C Compiler, maybe -- but branches weren't as costly then. The analogue from that era might be explicit register storage class usage, which we don't want or need today.

So without specific numbers on specific platforms, I'd say we don't push this patch; leave it here in case it becomes important, at the most. If we *do* have a list of specific targets that benefit from this patch, then let's configure the hand-selected instructions only for those targets.

/be
Comment 28 • 20 years ago
Brendan is far more eloquent than I am. I agree with him. Tuning the code for some processor or compiler seems like a mistake; it's better to get a new compiler.

Brendan asked: "Is the gcc ABI changing *again*?" Maybe. The name mangling is the same, but they changed the soname on libstdc++, so there will probably be problems eventually.

Even gcc-2.95.3 can generate jump-less code if you allow it, so gcc has known what to do for quite a while. Be that as it may, the significant speedup reported makes me wonder if we've found a different problem. What happens if you use this code in the patch?

    static inline PRInt32 ns_abs (PRInt32 x) { return (x < 0) ? -x : x; }
    static inline PRInt32 ns_min (PRInt32 x, PRInt32 y) { return (x < y) ? x : y; }
    static inline PRInt32 ns_max (PRInt32 x, PRInt32 y) { return (x > y) ? x : y; }

If there's still a performance win, then something else is happening. Perhaps templates are an issue.
Comment 29 • 20 years ago
I should also point out that the original attachment actually is NOT functionally equivalent to a normal min or max function. In particular, it will fail if the difference between the two values is larger than (1<<31). For example, for ns_min, x=0x7FFFFFFF and y=-1 yields 0x7FFFFFFF instead of the expected answer, -1. The code (x<y?x:y), however, gets the answer right.

If you haven't done the code audit to ensure this could never happen, I wouldn't apply the patch. Even if you do, I'd document the deficiency, lest unwary programmers assume that it really does work correctly in all cases.
Updated • 20 years ago
Product: Browser → Seamonkey
Updated • 16 years ago
Assignee: caillon → nobody
Status: ASSIGNED → NEW
OS: Linux → All
Product: Mozilla Application Suite → Core
QA Contact: general → general
Hardware: PC → All
Updated • 2 years ago
Severity: normal → S3
Comment 30 • 11 months ago
Is this still an issue?
Severity: S3 → --
Component: General → JavaScript Engine
OS: All → Unspecified
Hardware: All → Unspecified
Version: Trunk → unspecified
Comment 31 • 11 months ago
Not sure why this ended up in the JS component, but since it's here: this is the sort of thing that's much better to leave up to the build compiler. The alternative is generally going to be that we tune it once for whatever hardware we have available and then forget about it for years.
Status: NEW → RESOLVED
Closed: 11 months ago
Resolution: --- → WONTFIX