412972 - Optimize int/uint comparisons

Reporter

Description

•

17 years ago

Currently, when doing an unsigned integer vs. integer comparison in the JITed code, Tamarin emits two coercions to the Number type and does the comparison using SSE registers.  This is the ugly MIR_u2d sequence:

MOV (disp+4, ESP, 0);			// high 32 bits = 0
MOV (disp, ESP, (Register)(vReg&0x7)); // low 32 bits = unsigned value FILDQ(disp, ESP); // load 64bit int (won't lose precision) FSTPQ(disp, ESP); // store + pop as 64bit double MOVSD(r, disp, ESP); // load double into XMM

Does anyone have a better/faster/cleverer idea to compare an unsigned integer vs. an integer in registers?  Perhaps checking the upper bit of the unsigned value first and then doing the standard signed integer compare?

Jason Orendorff [:jorendorff]

Comment 1

•

17 years ago

Suppose A is int32 and B is uint32.

Let UA = (uint32) A; that is, the same bit sequence, but treat it as a uint32.

Then A <=> B is equivalent to UA & ((UA >> 31) - 1) <=> B

I don't know if it'd be faster or not, but it's all 32-bit integer operations and avoids branches.

Jason Orendorff [:jorendorff]

Comment 2

•

17 years ago

Argh, that's incorrect in the case that B == 0 and A == 0x80000000.  :-P

David Mandelin [:dmandelin]

Comment 3

•

17 years ago

gcc compiles "return a >= 0 && a == b" into something like (assuming a in eax and b in ebx, result goes in eax):

movl %eax %edx
notl %edx
shr $31, %edx
cmpl %ebx, %eax
andl %ecx %eax

They're all integer ops and the critical path is 4 instructions long. There might be a shorter way, though.

David Mandelin [:dmandelin]

Comment 4

•

17 years ago

I forgot to mention, in my example I was assuming that a is 'int' and b is 'unsigned int', so 'a >= 0 && a == b' is then a semantic equality test.

Jason Orendorff [:jorendorff]

Comment 5

•

17 years ago

dmandelin's gcc is smarter than mine.

"return a > 0 && ((unsigned int) a) > b;" produces:

        movl    A, %eax
        testl   %eax, %eax
        setg    %dl
        cmpl    B, %eax
        seta    %al
        movzbl  %al, %eax
        andl    %edx, %eax

after which, if A > B, then EAX is 1 and the zero bit is clear; otherwise A <= B, EAX is 0, and the zero bit is set.

Didn't try it with "a >= 0 && ((unsigned int) a) >= b".

Werner Sharp

Comment 6

•

17 years ago

There are a lot of MIR instructions that need to use the result of the comparison:

MIR_eq/ne/lt/le are different than MIR_jeq/jne/jlt/jle/jnle/jnlt

Some notes from an email discussion:

"MIR-eq/ne/etc do a compare then SETcc, whereas MIR_jeq/jne/etc do compare followed by branch.  the compare's job is to set the condition codes directly"

Whatever ASM we generate, it has to work for all cases obviously.  My hand coded version for a loop was something this:

> var limit:uint = 10000000;
> var i:int = 0;
> for (i = 0; i < limit; i++)
> {
> }

> eax = limit
> ebx = i
> 
> loop2:
> cmp eax, 0x7fffffff
> jg  cmp
> cmp eax, ebx
> cmp:
> jle done
> add ebx, i
> jmp loop2:
> done:

Ed Smith said that the ideal location for this optimization would be in the cmpOptimization where we check types of operands to short circuit using floating point values.

David Mandelin [:dmandelin]

Comment 7

•

17 years ago

Your version seems to be about 60% faster than the version I got from gcc, for this loop at least. (I'm not sure how to tell gcc to fully optimize an empty loop body without just removing it, but if I could, it might be enlightening.) That surprised me, because the critical path length is the same.
 
After some digging, I found that Intel says: "Use the 
setcc and cmov instructions to eliminate unpredictable conditional branches 
where possible. Do not do this for predictable branches. ...converting 
conditional branches to cmovs or setcc trades off control flow dependence for 
data dependence and restricts the capability of the out of order engine." So at least Intel wouldn't be surprised that the branch code is superior for a loop. But it might be the other way around for unpredictable tests.

I notice that in your loop example, 'limit' is unsigned int but never written to, so it could be converted to signed int at the beginning as a no-op, allowing the sign test to be removed. I wonder if there are enough opportunities for once-at-the top conversions like that for it to be worth doing in the optimizer.

Werner Sharp

Comment 8

•

17 years ago

My loop is a bit of a bad example for fixing this bug in the MIR generation code.  When "cmpOptimization" is called, the state information we have is just the type of both operands.  We could emit optimized code here (like we do currently with MIR_ucmp instead of number coercion and fcmpIns) or we could try to fix this performance issue higher up in cmpLt, etc.  We don't know how the condition codes are going to be used (without looking ahead in our instruction stream) so we really need to focus on the right ASM code to to just set the condition codes.  

Optimizing the 'limit' check outside of the loop is a wonderful goal but much harder to accomplish in the current codebase.

Jason Orendorff [:jorendorff]

Comment 9

•

17 years ago

How about this:

  mov eax, A
  mov edx, B
  mov ecx, eax
  and ecx, 0x80000000
  por edx, ecx          ; edx = B | (A & signbit)
  and eax, 0x7fffffff   ; eax = A & ~signbit
  cmp eax, edx

after which, the OF and SF bits will be set up for a conditional jump just as after a normal compare of two *unsigned* operands.

What this does is force B to be greater than A if A's sign bit is set.  Then compare as normal.  (Or, that's what I intended it to do.)

Jason Orendorff [:jorendorff]

Comment 10

•

17 years ago

I think I should have said "the ZF and CF bits".

Jim Blandy :jimb

Comment 11

•

17 years ago

If A is signed and B unsigned, you can compute A <= B like this:

        tst A, A
        cmovs A, B
        cmp A, B

cmovs is a conditional move: if A is negative, we set A equal to B, so that the comparison reports them as equal.  Since the condition we actually want is <=, this is just as good.  With the right choice of branch instruction, this could handle other operators.

B could be a memory reference.

Edwin Smith

Reporter

Updated

•

16 years ago

Priority: -- → P2

Lars T Hansen

Updated

•

16 years ago

Summary: Optimize int/uint comparisons (TC and TT) → Optimize int/uint comparisons

Brent Baker

Comment 12

•

16 years ago

triaging old issues and wondering if this issue also applies to LIR?

Flags: flashplayer-triage+

Flags: flashplayer-qrb?

Brent Baker

Updated

•

16 years ago

Flags: flashplayer-triage+

Flags: flashplayer-qrb?

Edwin Smith

Reporter

Comment 13

•

16 years ago

yes, still applicable to LIR and open

Brent Baker

Updated

•

16 years ago

Flags: flashplayer-triage+

Flags: flashplayer-qrb?

Dan Smith

Updated

•

15 years ago

Flags: flashplayer-qrb? → flashplayer-qrb+

Target Milestone: --- → Future

Lars T Hansen

Updated

•

15 years ago

Priority: P2 → P5

Target Milestone: Future → flash10.1

Dan Smith

Updated

•

15 years ago

Target Milestone: flash10.1 → flash10.2

Lars T Hansen

Updated

•

14 years ago

Whiteboard: PACMAN

Edwin Smith

Reporter

Updated

•

14 years ago

Hardware: x86 → All

William Maddox

Updated

•

14 years ago

Assignee: nobody → wmaddox

William Maddox

Comment 14

•

14 years ago

Attached file Data on int/uinit comparisons in performance test suite — Details

We currently optimize on 64-bit platforms by converting 32-bit operands to 64-bit and doing a signed compare.  On 32-bit platforms, we optimize in the case that the signed operand is an immediate with non-negative value.  I ran an experiment to see how often an int/uint or uint/int comparison was falling back to a floating-point compare on i386.  The results look like this:

TEST sunspider/as3/date-format-tofte.abc
uint/int and int/uint compare [static]: total = 1, unoptimized = 1
jit_num_num_compare 1 [1 : 1] 242000 242000 
jit_int_uint_compare 1 [1 : 1] 227500 227500 
jit_int_uint_compare_unoptimized 1 [1 : 1] 227500 227500

The second line shows static counts computed by ad-hoc instrumentation, and the following lines show dynamic counts collected via vprof instrumentation.
"optimized" means that a non-FP comparision was used.

William Maddox

Updated

•

14 years ago

Assignee: wmaddox → nobody

Dan Smith

Updated

•

13 years ago

Flags: flashplayer-injection-

Flags: flashplayer-bug-

Priority: P5 → --

Target Milestone: Q3 11 - Serrano → Future

Dan Smith

Updated

•

13 years ago

Priority: -- → P3

Target Milestone: Future → Q2 12 - Cyril

Sylvestre Ledru [:Sylvestre]

Comment 15

•

6 years ago

Tamarin is a dead project now. Mass WONTFIX.

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → WONTFIX

Sylvestre Ledru [:Sylvestre]

Comment 16

•

6 years ago

Tamarin isn't maintained anymore. WONTFIX remaining bugs.