Optimize int/uint comparisons

Status: NEW
Assignee: Unassigned
Priority: P3
Severity: normal
Reported: 11 years ago
Modified: 7 years ago

People

(Reporter: edwsmith, Unassigned)

Tracking

Version: unspecified
Target Milestone: Q2 12 - Cyril
Hardware: All
OS: Windows XP
Bug Flags:
flashplayer-injection -
flashplayer-qrb +
flashplayer-bug -
flashplayer-triage +

Details

(Whiteboard: PACMAN)

Attachments

(1 attachment)

(Reporter)

Description

11 years ago
Currently, when doing an unsigned integer vs. integer comparison in the JITed code, Tamarin emits two coercions to the Number type and does the comparison using SSE registers.  This is the ugly MIR_u2d sequence:

MOV (disp+4, ESP, 0);                   // high 32 bits = 0
MOV (disp, ESP, (Register)(vReg&0x7));  // low 32 bits = unsigned value
FILDQ(disp, ESP);                       // load 64-bit int (won't lose precision)
FSTPQ(disp, ESP);                       // store + pop as 64-bit double
MOVSD(r, disp, ESP);                    // load double into XMM

Does anyone have a better/faster/cleverer idea to compare an unsigned integer vs. an integer in registers?  Perhaps checking the upper bit of the unsigned value first and then doing the standard signed integer compare?
Suppose A is int32 and B is uint32.

Let UA = (uint32) A; that is, the same bit sequence, but treat it as a uint32.

Then A <=> B is equivalent to UA & ((UA >> 31) - 1) <=> B

I don't know if it'd be faster or not, but it's all 32-bit integer operations and avoids branches.
Argh, that's incorrect in the case that B == 0 and A == 0x80000000.  :-P
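The trick and its failure case are easy to check in plain C. A minimal sketch (the helper names are mine, not Tamarin code):

```c
#include <stdint.h>

/* Proposed branchless trick: if A's sign bit is set, the mask
   (UA >> 31) - 1 becomes 0 and zeroes UA, intending "smaller than
   any B".  That breaks when B == 0, since 0 < 0 is false. */
static int trick_less(int32_t a, uint32_t b) {
    uint32_t ua = (uint32_t)a;
    uint32_t masked = ua & ((ua >> 31) - 1u);
    return masked < b;
}

/* Reference semantics: mathematical A < B for int32 vs uint32. */
static int ref_less(int32_t a, uint32_t b) {
    return a < 0 || (uint32_t)a < b;
}
```

For A = 0x80000000 (INT32_MIN) and B = 0, trick_less reports false even though A really is below B.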
gcc compiles "return a >= 0 && a == b" into something like (assuming a in eax and b in ebx, result goes in eax):

movl %eax, %edx
notl %edx
shrl $31, %edx
cmpl %ebx, %eax
sete %al
movzbl %al, %eax
andl %edx, %eax

They're all integer ops and the critical path is 4 instructions long. There might be a shorter way, though.
I forgot to mention, in my example I was assuming that a is 'int' and b is 'unsigned int', so 'a >= 0 && a == b' is then a semantic equality test.
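In C terms the test being compiled is the following; a minimal sketch (the helper name is mine):

```c
#include <stdint.h>

/* Semantic equality of a signed and an unsigned 32-bit value: the
   sign check guards the bit-pattern comparison, since a negative A
   can never equal any uint32. */
static int eq_int_uint(int32_t a, uint32_t b) {
    return a >= 0 && (uint32_t)a == b;
}
```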
dmandelin's gcc is smarter than mine.

"return a > 0 && ((unsigned int) a) > b;" produces:

        movl    A, %eax
        testl   %eax, %eax
        setg    %dl
        cmpl    B, %eax
        seta    %al
        movzbl  %al, %eax
        andl    %edx, %eax

after which, if A > B, then EAX is 1 and the zero flag (ZF) is clear; otherwise A <= B, EAX is 0, and ZF is set.

Didn't try it with "a >= 0 && ((unsigned int) a) >= b".
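Both forms have straightforward C equivalents; a sketch (helper names are mine):

```c
#include <stdint.h>

/* A > B: only possible when A is positive, and then the bit patterns
   can safely be compared as unsigned. */
static int gt_int_uint(int32_t a, uint32_t b) {
    return a > 0 && (uint32_t)a > b;
}

/* A >= B: a negative A is always below any uint32, so the guard
   becomes a >= 0 instead of a > 0. */
static int ge_int_uint(int32_t a, uint32_t b) {
    return a >= 0 && (uint32_t)a >= b;
}
```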

Comment 6

11 years ago
There are a lot of MIR instructions that need to use the result of the comparison:

MIR_eq/ne/lt/le are different than MIR_jeq/jne/jlt/jle/jnle/jnlt

Some notes from an email discussion:

"MIR-eq/ne/etc do a compare then SETcc, whereas MIR_jeq/jne/etc do compare followed by branch.  the compare's job is to set the condition codes directly"

Whatever ASM we generate, it has to work for all cases obviously.  My hand-coded version for a loop was something like this:

> var limit:uint = 10000000;
> var i:int = 0;
> for (i = 0; i < limit; i++)
> {
> }

> eax = limit
> ebx = i
> 
> loop2:
> cmp eax, 0x7fffffff
> jg  cmp
> cmp eax, ebx
> cmp:
> jle done
> add ebx, 1
> jmp loop2
> done:

Ed Smith said that the ideal location for this optimization would be in cmpOptimization, where we check the operand types to short-circuit the coercion to floating-point values.
Your version seems to be about 60% faster than the version I got from gcc, for this loop at least. (I'm not sure how to tell gcc to fully optimize an empty loop body without just removing it, but if I could, it might be enlightening.) That surprised me, because the critical path length is the same.
 
After some digging, I found that Intel says: "Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. ...converting conditional branches to cmovs or setcc trades off control flow dependence for data dependence and restricts the capability of the out of order engine." So at least Intel wouldn't be surprised that the branch code is superior for a loop. But it might be the other way around for unpredictable tests.

I notice that in your loop example, 'limit' is unsigned int but never written to, so it could be converted to signed int at the beginning as a no-op, allowing the sign test to be removed. I wonder if there are enough opportunities for once-at-the top conversions like that for it to be worth doing in the optimizer.

Comment 8

11 years ago
My loop is a bit of a bad example for fixing this bug in the MIR generation code.  When "cmpOptimization" is called, the only state information we have is the type of each operand.  We could emit optimized code here (as we currently do with MIR_ucmp instead of a Number coercion and fcmpIns), or we could try to fix this performance issue higher up in cmpLt, etc.  We don't know how the condition codes are going to be used (without looking ahead in our instruction stream), so we really need to focus on the right ASM code to just set the condition codes.

Optimizing the 'limit' check outside of the loop is a wonderful goal but much harder to accomplish in the current codebase.
How about this:

  mov eax, A
  mov edx, B
  mov ecx, eax
  and ecx, 0x80000000
  or edx, ecx           ; edx = B | (A & signbit)
  and eax, 0x7fffffff   ; eax = A & ~signbit
  cmp eax, edx

after which, the OF and SF bits will be set up for a conditional jump just as after a normal compare of two *unsigned* operands.

What this does is force B to be greater than A if A's sign bit is set.  Then compare as normal.  (Or, that's what I intended it to do.)
I think I should have said "the ZF and CF bits".
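The masking scheme can be sanity-checked in C: compare A & ~signbit against B | (A & signbit) as unsigned values (helper names are mine):

```c
#include <stdint.h>

/* When A >= 0 this degenerates to a plain unsigned compare; when
   A < 0 the right-hand side gains the sign bit, so it always exceeds
   the left-hand side and A compares strictly below B, as it should. */
static int masked_lt(int32_t a, uint32_t b) {
    uint32_t lhs = (uint32_t)a & 0x7fffffffu;
    uint32_t rhs = b | ((uint32_t)a & 0x80000000u);
    return lhs < rhs;
}

static int masked_eq(int32_t a, uint32_t b) {
    uint32_t lhs = (uint32_t)a & 0x7fffffffu;
    uint32_t rhs = b | ((uint32_t)a & 0x80000000u);
    return lhs == rhs;
}
```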

Comment 11

11 years ago
If A is signed and B unsigned, you can compute A <= B like this:

        tst A, A
        cmovs A, B
        cmp A, B

cmovs is a conditional move: if A is negative, we set A equal to B, so that the comparison reports them as equal.  Since the condition we actually want is <=, this is just as good.  With the right choice of branch instruction, this could handle other operators.

B could be a memory reference.
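As branch-free C, the cmovs sequence amounts to the following sketch; the ternary models the conditional move (a compiler may or may not actually emit cmov for it), and the helper name is mine:

```c
#include <stdint.h>

/* A <= B for int32 A, uint32 B: if A is negative, substitute B for A
   so the compare reports "equal", which still satisfies <=.  Otherwise
   a plain unsigned compare of the bit patterns is correct. */
static int le_cmovs(int32_t a, uint32_t b) {
    uint32_t ua = (a < 0) ? b : (uint32_t)a;  /* models: tst A,A / cmovs A,B */
    return ua <= b;
}
```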
(Reporter)

Updated

11 years ago
Priority: -- → P2

Updated

10 years ago
Summary: Optimize int/uint comparisons (TC and TT) → Optimize int/uint comparisons

Comment 12

10 years ago
triaging old issues and wondering if this issue also applies to LIR?
Flags: flashplayer-triage+
Flags: flashplayer-qrb?

(Reporter)

Comment 13

10 years ago
yes, still applicable to LIR and open


Updated

9 years ago
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future

Updated

9 years ago
Priority: P2 → P5
Target Milestone: Future → flash10.1

Updated

9 years ago
Target Milestone: flash10.1 → flash10.2

Updated

9 years ago
Whiteboard: PACMAN
(Reporter)

Updated

8 years ago
Hardware: x86 → All

Updated

8 years ago
Assignee: nobody → wmaddox

Comment 14

8 years ago
Created attachment 473238 [details]
Data on int/uint comparisons in performance test suite

We currently optimize on 64-bit platforms by converting 32-bit operands to 64-bit and doing a signed compare.  On 32-bit platforms, we optimize in the case that the signed operand is an immediate with non-negative value.  I ran an experiment to see how often an int/uint or uint/int comparison was falling back to a floating-point compare on i386.  The results look like this:

TEST sunspider/as3/date-format-tofte.abc
uint/int and int/uint compare [static]: total = 1, unoptimized = 1
jit_num_num_compare 1 [1 : 1] 242000 242000 
jit_int_uint_compare 1 [1 : 1] 227500 227500 
jit_int_uint_compare_unoptimized 1 [1 : 1] 227500 227500

The second line shows static counts computed by ad-hoc instrumentation, and the following lines show dynamic counts collected via vprof instrumentation.
"optimized" means that a non-FP comparision was used.

Updated

8 years ago
Assignee: wmaddox → nobody

Updated

8 years ago
Flags: flashplayer-injection-
Flags: flashplayer-bug-
Priority: P5 → --
Target Milestone: Q3 11 - Serrano → Future

Updated

7 years ago
Priority: -- → P3
Target Milestone: Future → Q2 12 - Cyril