Optimize abs,min,max by removing branches

People

(Reporter: Christopher Aillon (sabbatical, not receiving bugmail), Unassigned)


Most CPUs use branch prediction, which works well with loops and branches where
the probability of one branch being taken is quite high.  For branches where
there is about an equal chance of one branch being taken, there is some overhead.

The Intel and AMD optimization guidelines recommend re-writing abs, min, and max
to avoid branching at all.
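For reference, the branchless idioms in question build an all-ones or all-zeros mask from the sign bit and combine it arithmetically. A minimal C sketch of the guides' approach (the `branchless_*` names are illustrative, not the patch's identifiers; see the caveats in the comments):

```c
#include <stdint.h>

/* Sketch of the branchless idioms from the Intel/AMD optimization guides.
 * Assumes arithmetic right shift of negative values (implementation-defined
 * in C, but what mainstream compilers do on two's-complement targets). */
static inline int32_t branchless_abs(int32_t x)
{
    int32_t mask = x >> 31;     /* all ones if x < 0, else all zeros */
    return (x ^ mask) - mask;   /* conditional two's-complement negate */
}

static inline int32_t branchless_min(int32_t x, int32_t y)
{
    /* Caveat: (x - y) can overflow when the operands are far apart. */
    return y + ((x - y) & ((x - y) >> 31));
}

static inline int32_t branchless_max(int32_t x, int32_t y)
{
    return x - ((x - y) & ((x - y) >> 31));
}
```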

We can save a small amount of code size (only several k) but I would imagine we
would get some performance improvement out of this.  I don't have a good box to
run performance tests on (my slowest box is a P4 3.0 GHz which is having issues
starting X at the moment; and my other box is a dual Xeon 2.4 box) or I'd throw
out some numbers.  If someone wants to test, that would be awesome.

Given the places that make extensive use of these functions (table layout and
font metrics are the two heavy hitters here), this should be worthwhile for
page loading.

Note: I did not change the NSPR versions (PR_ABS,PR_MIN,PR_MAX) because as
written, they could allow for objects to be passed in (as long as the objects
passed in define the right operators) which my new version would break.  I could
not think of any other decent place to put these, so I stuck them in nscore.h --
if anyone can think of a better place for this stuff, I have no problem with
moving it all.
Created attachment 148531 [details] [diff] [review]
Optimized branchless versions of abs,min,max
Assignee: general → caillon
Status: NEW → ASSIGNED
Created attachment 148532 [details] [diff] [review]
Start using the new version

Comment 3

14 years ago
NS_MIN and NS_MAX are already defined in xpcom/string/public/nsCharTraits.h, but
maybe those can be renamed for use in the string library, or maybe they can
leverage these routines.
Oh, hm.  I didn't see that in my initial search for some reason.  Yeah, we can
rename that if we choose to use NS_MAX/NS_MIN for this.  Would renaming the
existing versions to NS_STRING_MIN/NS_STRING_MAX be acceptable?
We're talking about xpcom/string/public/nsAlgorithm.h, right?
Note that the only callers seem to be passing in ints, so in fact just
eliminating those NS_MIN macros may be appropriate...

Comment 7

14 years ago
(In reply to comment #6)
> Note that the only callers seem to be passing in ints, so in fact just
> eliminating those NS_MIN macros may be appropriate...

agreed.... and, yeah i mistyped.
bz, yeah, that would be even more ideal.  I'll make sure that doesn't break
anything.
(And I realized that my initial codesighs tests were using -Os.  I'm rebuilding
both before and after with -O1 now....)
Okay, the codesize win is still fairly negligent for me (4k), though I really
suspect that the big win isn't going to be in codesize anyway, but in performance.
Created attachment 148538 [details] [diff] [review]
Remove the existing NS_* macros
4k is not *negligible* -- we would be *negligent* to turn our noses up at it
;-).  But I'm keener to see the cycle savings.

/be

Comment 13

14 years ago
Did anyone do any timing tests? On my Pentium 4 using gcc 3.4.0 and tuning for
the P4 I find absolutely no time difference. Using other optimizations I see
at most a savings of 6 clock cycles. That's less than 2.5 ns. If I'm right
then it'll be quite a while before the development time is recovered. :-)

Comment 14

14 years ago
I'm wondering if there are some CPUs where abs, min, and max are part of the
instruction set?  (I couldn't find them for the Intel P4)
(In reply to comment #13)
> Did anyone do any timing tests? On my Pentium 4 using gcc 3.4.0 and tuning for
> the P4 I find absolutely no time difference. Using other optimizations I see
> at most a savings of 6 clock cycles. That's less than 2.5 ns.

Luckily, your CPU is not the only machine I care about.  And if that is indeed 6
cycles per call, then this seems surely worth it.  It's not like these functions
have been taking thousands of cycles to begin with.  We do make fair usage of
these functions during table layout and font rendering.  I think the bottom line
is "actual mileage may vary."  I'm more keen to see the cumulative results on
the page load test with real world implications on different platforms than to
see the individual call results here, but I think it fair to say that if we can
save cycles with an easy patch such as this, let's do it.


(In reply to comment #14)
> I'm wondering if there are some CPU where abs, min, max is part of their
> instruction set ? (I couldn't find them for Intel P4)

I don't believe there are any, but could be wrong.

Comment 16

14 years ago
> Luckily, your CPU is not the only machine I care about ...

Exactly what CPU are you optimizing for? All Intel CPUs from the P6 on have a
cmov instruction which is designed to handle precisely the case you describe
here. P6s are very old, so is there a compiler in common use that doesn't
understand cmov? I believe Motorola chips have equivalent instructions.

I did some more testing. Building and tuning for a P6 or above finds zero
difference between the usual code and your code. That's because gcc uses
cmov in these cases. Building and tuning for a 586 makes gcc use branching
but I still find zero difference. I believe that's because the branches are
short so speculative execution eliminates any differences. Only if I build
and tune for a 486 or below do I see any differences.

I agree that the only things that matter are test results but I don't think
that using the mozilla trunk as a test rig for something like this is
appropriate. I would want a proof of concept demo. And that's what I
constructed. I got unexpected (based on your statements) results which is
why I asked if anyone else had run tests. It appears I'm the only one who
did.

I suspect the only differences you might see would be due to different
compilers. You are implicitly assuming the compiler will optimize your code
reasonably. I'm not sure you should assume that. Also note that compilers
often have special optimization routines for the common programming idioms.
Doing something tricky may well cause you problems.
(In reply to comment #16)
> > Luckily, your CPU is not the only machine I care about ...
> 
> Exactly what cpu are optimizing for?  All Intel cpus 

Don't forget that we also care about AMD, G4/5, PPC, etc. CPUs.

> I did some more testing. Building and tuning for a P6 or above finds
> zero difference between the usual code and your code.

Interesting.  On a P4 running Rawhide, I see the following differences using
-march=pentium4

old_abs:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movl    8(%ebp), %eax
        movl    %eax, -4(%ebp)
        cmpl    $0, -4(%ebp)
        jns     .L5
        negl    -4(%ebp)
.L5:
        movl    -4(%ebp), %eax
        leave
        ret
        .size   old_abs, .-old_abs

new_abs:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $4, %esp
        movl    8(%ebp), %eax
        sarl    $31, %eax
        movl    %eax, -4(%ebp)
        movl    -4(%ebp), %eax
        xorl    8(%ebp), %eax
        subl    -4(%ebp), %eax
        leave
        ret


Are you using |abs| instead of explicitly using the old code?  Even so, abs is
defined to __builtin_abs() for pentium 4 on gcc and uses the exact same code
with the jump AFAICT.

The new code is a ~9% win for me on Pentium 4, using Rawhide without
optimizations, and a ~6% win with optimizations.  I also ran on a PPC64 box
with gcc 3.3.  There, the code for abs is the same, and executes with no noticeable
difference in time before or after.  However, PPC64 still generates jumps for
min and max with the old code, and I see a ~7% performance improvement using min
and max with my new code and -O2.

>I agree that the only things that matter are test results but I don't
>think that using the mozilla trunk as a test rig for something like this
>is appropriate.

Test rig for what?  The new code is known to be optimal.  It is copied pretty
much verbatim from the AMD and Intel optimization guideline manuals.  I made
relevant changes to use NSPR types and added comments.  That's all.

Some CPU and compiler configurations might already be doing just fine with the
old code.  Others may not.  Please don't assume that a Pentium 4 built with gcc 3.4
(does anyone ship gcc 3.4 as the default compiler?) and Pentium 4 tuning is
how every Mozilla will be built.  Some distributions still cater to older
machines.  Again, "actual mileage may vary."  If you are getting optimal code
already, I'll drink a beer for you tonight.  And then I'll drink a beer for
those who aren't getting optimal code yet but would like to.

(In reply to comment #13)
>If I'm right then it'll be quite a while before the development time is
>recovered. :-)

I was reading optimization guides, and I found this tip which could apply to
Mozilla and contained the code I am now using.  I wrote a quick patch set to do
so.  I did not expect to cut down Mozilla's overall tinderbox test times by 10%.
 I expected to help out a little bit where every little bit helps.  If we can
give optimal code to those previously using suboptimal code, and let those
already using optimal code continue to do so, I think it is worth doing.

I spent about a minute porting the code found in the Intel and AMD manuals to
NSPR types, about 10 minutes writing the comments, and maybe 5 minutes doing a
global search and replace, generating the second patch, and cutting out the
directories which can't rely on XPCOM.  I have spent close to an hour over the
past few days replying to your comments, even though your own tests show the
change can help depending on the configuration (for you, it either had no
effect or a positive effect).  For me, on two different machines and
architectures, it has had positive effects.  I appreciate the help in
increasing the development time.
Comment on attachment 148531 [details] [diff] [review]
Optimized branchless versions of abs,min,max

Why don't we use the c library's abs?
glibc uses builtins for abs/min/max if on gcc >= 2.97, else it uses ASM. Plus
CVS gcc (maybe earlier versions too) knows how to do range propagation based on
the result of abs being >= 0.
It's my understanding though that even the latest glibc uses the jump version,
as does the builtin in gcc.  That notwithstanding, it would be nice to offer
this to people who aren't running the latest versions of glibc or gcc.  And
what's the story with say MSVC?

Comment 21

14 years ago
Hey caillon, in Comment #17 you mention performance results, but I didn't see
where you mentioned the test that you are using to gather those results.  Are
you using the standard Tp test?  Or some custom test?

Comment 22

14 years ago
> Don't forget that we also care about AMD, G4/5, PPC, etc. CPUs.

Yes, that's why I mentioned Motorola. 

> Interesting.  On a P4 running Rawhide, I see the following differences 
> using -march=pentium4

So here's two versions of an abs function.

static inline PRInt32 my_abs1 (PRInt32 x)
{
    return (x < 0) ? -x : x;
}

static inline PRInt32 my_abs2 (PRInt32 x)
{
    return abs(x);
}

Using my gcc-3.4.0 built from a pristine GNU source tarball and the command 
line "gcc -S -fverbose-asm -march=pentium4 -O" my two versions and your 
code produce the _same_ result (modulo a choice of registers):

        movl    %edi, %eax      # x, tmp67
        sarl    $31, %eax       #, tmp67
        movl    %eax, %ecx      # tmp67, <anonymous>
        xorl    %edi, %ecx      # x, <anonymous>
        subl    %eax, %ecx      # tmp67, <anonymous>

Using -O2 my_abs2 and your code still produce the same output but my_abs1
becomes:

        movl    %edi, %ecx      # x, tmp67
        negl    %ecx    # tmp67
        cmpl    $-1, %edi       #, x
        cmovg   %edi, %ecx      # tmp67,, x, <anonymous>

Comparing this function

static inline PRInt32 my_min (PRInt32 x, PRInt32 y)
{
    return (x < y) ? x : y;
}

and yours, using -O my_min becomes

        cmpl    8(%ebp), %edi   # x, y
        movl    8(%ebp), %ecx   # x, <anonymous>
        cmovle  %edi, %ecx      # y,, <anonymous>

and your function becomes

        movl    8(%ebp), %eax   # x, x
        subl    %edi, %eax      # y, x
        movl    %eax, %ecx      # x, tmp67
        sarl    $31, %ecx       #, tmp67
        andl    %eax, %ecx      # x, tmp68
        leal    (%edi,%ecx), %ecx       #, y

Using -O2 produces the same results. No jumps anywhere. Obviously the gcc 
people read the same stuff you do.

> Test rig for what?  The new code is known to be optimal.

Whatever Intel says doesn't necessarily apply to other cpus.

Optimal doesn't count. The only thing that matters is actual timing tests. 
So far you see a difference on your machine and I don't on my machine. The 
two sample points differ. I would say that's a good reason for caution. It's 
very difficult to determine if a proposal really is better or if there's 
some other interaction with the particular compiler that makes it appear 
better.

> Some CPU and compiler configurations might already be doing just fine with 
> the old code.  Others may not.

But this is no reason to change the code for everyone. If you know there are 
problems with a particular compiler you can always select different code 
when you build. Otherwise, using generic code is usually best.
Darin, it was just a custom test on the functions, iterating from int32_min to
int32_max.
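That kind of exhaustive comparison can be sketched as follows (a hypothetical reconstruction, since the original test program is not attached). Note that INT32_MIN has no representable absolute value, so any sweep has to stop short of it:

```c
#include <stdint.h>

/* Hypothetical reconstruction of the test described above: walk a range of
 * inputs and check that the branchless abs agrees with the ternary version.
 * The full INT32_MIN+1 .. INT32_MAX sweep is slow; a sub-range smoke-tests it. */
static int32_t branchless_abs(int32_t x)
{
    int32_t mask = x >> 31;     /* arithmetic shift assumed */
    return (x ^ mask) - mask;
}

static int32_t ternary_abs(int32_t x)
{
    return (x < 0) ? -x : x;
}

/* Returns 1 if the two versions agree on every value in [lo, hi). */
static int check_range(int32_t lo, int32_t hi)
{
    for (int32_t i = lo; i < hi; i++)
        if (branchless_abs(i) != ternary_abs(i))
            return 0;   /* mismatch found */
    return 1;
}
```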
(In reply to comment #22)
> Using my gcc-3.4.0 built from a pristine GNU source tarball and the command 
> line "gcc -S -fverbose-asm -fmarch=pentium4 -O" my two versions and your 
> code produce the _same_ result (modulo a choice of registers):

That's good if you are building Mozilla for yourself on your own machine. 
Distributors (save for Gentoo) do not have the luxury of building for the exact
hardware everyone is running.  They must be more generic.  If you can get the
same code by tuning for i686, I would be more interested in your argument.

Also, does gcc 3.3 produce the same code?  That is the current default install
for distributions AFAIK because of ABI issues in gcc 3.4.  Fedora Core 2, which
will be out shortly, will have gcc 3.4 available, but it will not be the default gcc.
tenthumbs makes the same excellent point proven repeatedly over the years on
comp.arch, comp.compilers, etc.  The compiler should select the best
instructions; if we hand-select, we'll be doing wrong in the long run, and on
other platforms than the one we test on.

Is the gcc ABI changing *again*?  I give up.

/be
Well, let's get a decision made.  Do we want this patchset or not?  If so, who
should do the reviews?  I don't care to spend too much more time here.  If
nothing else, this was a fun academic exercise, but I have more pressing things
to work on.
I think comment 0 is in part invalid.  We need to solve specific problems on
certain OS/compiler target platforms with a patch such as this bug's.  A general
claim that we should code in C to avoid branches in abs/min/max is not valid in
the modern age, with optimizing back ends targeting super-scalar CPUs where the
costs and instruction scheduling challenges are complex.

In the '80s, with the primitive Portable C Compiler, maybe -- but branches
weren't as costly then.  The analogue from that era might be explicit register
storage class usage, which we don't want or need today.

So without specific numbers on specific platforms, I'd say we don't push this
patch; leave it here in case it becomes important, at the most.  If we *do* have
a list of specific targets that benefit from this patch, then let's configure
the hand-selected instructions only for those targets.

/be

Comment 28

14 years ago
Brendan is far more eloquent than I am. I agree with him. Tuning the code for
some processor or compiler seems like a mistake. It's better to get a new compiler.

Brendan asked: "Is the gcc ABI changing *again*?"  Maybe. The name mangling 
is the same but they changed the soname on libstdc++ so there will probably 
be problems eventually.

Even gcc-2.95.3 can generate jump-less code if you allow it so gcc has known 
what to do for quite a while.

Be that as it may, the significant speedup reported makes me wonder if we've 
found a different problem. What happens if you use this code in the patch? 

static inline PRInt32 ns_abs (PRInt32 x)
{
    return (x < 0) ? -x : x;
}

static inline PRInt32 ns_min (PRInt32 x, PRInt32 y)
{
    return (x < y) ? x : y;
}

static inline PRInt32 ns_max (PRInt32 x, PRInt32 y)
{
    return (x > y) ? x : y;
}

If there's still a performance win then something else is happening. Perhaps 
templates are an issue.
I should also point out that the original attachment actually is NOT
functionally equivalent to a normal min or max function.

In particular, it will fail whenever the difference between the two values does
not fit in a signed 32-bit integer (i.e., is at least 1<<31). e.g., for ns_min:

x=0x7FFFFFFF
y=-1

yields 0x7FFFFFFF instead of the expected answer -1.

The code (x<y?x:y), however, gets the answer right. If you haven't done the code
audit to ensure this could never happen, I wouldn't apply the patch. Even if you
do, I'd document the deficiency, lest unwary programmers assume that it really
does work correctly in all cases.
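The failure is easy to demonstrate. The sketch below reproduces the branchless min using explicitly wrapping unsigned arithmetic, mirroring what the machine code does (in plain C, the signed subtraction would be undefined behavior on overflow):

```c
#include <stdint.h>

/* Branchless min in the style of the patch, written with unsigned wrap-around
 * so the overflow is well-defined and the bug is reproducible.  When x - y
 * overflows, the sign-bit mask comes out wrong and the wrong operand wins. */
static int32_t wrapping_branchless_min(int32_t x, int32_t y)
{
    uint32_t d = (uint32_t)x - (uint32_t)y;       /* wraps when x - y overflows */
    uint32_t mask = (uint32_t)((int32_t)d >> 31); /* all ones if d "negative" */
    return (int32_t)((uint32_t)y + (d & mask));
}

static int32_t ternary_min(int32_t x, int32_t y)
{
    return (x < y) ? x : y;
}
```

With x = 0x7FFFFFFF and y = -1, the difference 0x80000000 wraps to a negative value, so the mask selects the wrong operand and the branchless version returns 0x7FFFFFFF while the ternary version correctly returns -1.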
Product: Browser → Seamonkey
Assignee: caillon → nobody
Status: ASSIGNED → NEW
Component: General → General
OS: Linux → All
Product: Mozilla Application Suite → Core
QA Contact: general → general
Hardware: PC → All