Closed Bug 412972 Opened 18 years ago Closed 7 years ago

Optimize int/uint comparisons

Categories

(Tamarin Graveyard :: Virtual Machine, defect, P3)

All
Windows XP
defect

Tracking

(Not tracked)

RESOLVED WONTFIX
Q2 12 - Cyril

People

(Reporter: edwsmith, Unassigned)

Details

(Whiteboard: PACMAN)

Attachments

(1 file)

Currently, when doing an unsigned integer vs. integer comparison in the JITed code, Tamarin emits two coercions to the Number type and does the comparison using SSE registers. This is the ugly MIR_u2d sequence: MOV (disp+4, ESP, 0); // high 32 bits = 0 MOV (disp, ESP, (Register)(vReg&0x7)); // low 32 bits = unsigned value FILDQ(disp, ESP); // load 64bit int (won't lose precision) FSTPQ(disp, ESP); // store + pop as 64bit double MOVSD(r, disp, ESP); // load double into XMM Does anyone have a better/faster/cleverer idea to compare an unsigned integer vs. an integer in registers? Perhaps checking the upper bit of the unsigned value first and then doing the standard signed integer compare?
Suppose A is int32 and B is uint32. Let UA = (uint32) A; that is, the same bit sequence, but treat it as a uint32. Then A <=> B is equivalent to UA & ((UA >> 31) - 1) <=> B I don't know if it'd be faster or not, but it's all 32-bit integer operations and avoids branches.
Argh, that's incorrect in the case that B == 0 and A == 0x80000000. :-P
gcc compiles "return a >= 0 && a == b" into something like (assuming a in eax and b in ebx, result goes in eax): movl %eax %edx notl %edx shr $31, %edx cmpl %ebx, %eax andl %ecx %eax They're all integer ops and the critical path is 4 instructions long. There might be a shorter way, though.
I forgot to mention, in my example I was assuming that a is 'int' and b is 'unsigned int', so 'a >= 0 && a == b' is then a semantic equality test.
dmandelin's gcc is smarter than mine. "return a > 0 && ((unsigned int) a) > b;" produces: movl A, %eax testl %eax, %eax setg %dl cmpl B, %eax seta %al movzbl %al, %eax andl %edx, %eax after which, if A > B, then EAX is 1 and the zero bit is clear; otherwise A <= B, EAX is 0, and the zero bit is set. Didn't try it with "a >= 0 && ((unsigned int) a) >= b".
There are a lot of MIR instructions that need to use the result of the comparison: MIR_eq/ne/lt/le are different than MIR_jeq/jne/jlt/jle/jnle/jnlt Some notes from an email discussion: "MIR-eq/ne/etc do a compare then SETcc, whereas MIR_jeq/jne/etc do compare followed by branch. the compare's job is to set the condition codes directly" Whatever ASM we generate, it has to work for all cases obviously. My hand coded version for a loop was something this: > var limit:uint = 10000000; > var i:int = 0; > for (i = 0; i < limit; i++) > { > } > eax = limit > ebx = i > > loop2: > cmp eax, 0x7fffffff > jg cmp > cmp eax, ebx > cmp: > jle done > add ebx, i > jmp loop2: > done: Ed Smith said that the ideal location for this optimization would be in the cmpOptimization where we check types of operands to short circuit using floating point values.
Your version seems to be about 60% faster than the version I got from gcc, for this loop at least. (I'm not sure how to tell gcc to fully optimize an empty loop body without just removing it, but if I could, it might be enlightening.) That surprised me, because the critical path length is the same. After some digging, I found that Intel says: "Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. ...converting conditional branches to cmovs or setcc trades off control flow dependence for data dependence and restricts the capability of the out of order engine." So at least Intel wouldn't be surprised that the branch code is superior for a loop. But it might be the other way around for unpredictable tests. I notice that in your loop example, 'limit' is unsigned int but never written to, so it could be converted to signed int at the beginning as a no-op, allowing the sign test to be removed. I wonder if there are enough opportunities for once-at-the top conversions like that for it to be worth doing in the optimizer.
My loop is a bit of a bad example for fixing this bug in the MIR generation code. When "cmpOptimization" is called, the state information we have is just the type of both operands. We could emit optimized code here (like we do currently with MIR_ucmp instead of number coercion and fcmpIns) or we could try to fix this performance issue higher up in cmpLt, etc. We don't know how the condition codes are going to be used (without looking ahead in our instruction stream) so we really need to focus on the right ASM code to to just set the condition codes. Optimizing the 'limit' check outside of the loop is a wonderful goal but much harder to accomplish in the current codebase.
How about this: mov eax, A mov edx, B mov ecx, eax and ecx, 0x80000000 por edx, ecx ; edx = B | (A & signbit) and eax, 0x7fffffff ; eax = A & ~signbit cmp eax, edx after which, the OF and SF bits will be set up for a conditional jump just as after a normal compare of two *unsigned* operands. What this does is force B to be greater than A if A's sign bit is set. Then compare as normal. (Or, that's what I intended it to do.)
I think I should have said "the ZF and CF bits".
If A is signed and B unsigned, you can compute A <= B like this: tst A, A cmovs A, B cmp A, B cmovs is a conditional move: if A is negative, we set A equal to B, so that the comparison reports them as equal. Since the condition we actually want is <=, this is just as good. With the right choice of branch instruction, this could handle other operators. B could be a memory reference.
Priority: -- → P2
Summary: Optimize int/uint comparisons (TC and TT) → Optimize int/uint comparisons
triaging old issues and wondering if this issue also applies to LIR?
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
yes, still applicable to LIR and open
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future
Priority: P2 → P5
Target Milestone: Future → flash10.1
Target Milestone: flash10.1 → flash10.2
Whiteboard: PACMAN
Hardware: x86 → All
Assignee: nobody → wmaddox
We currently optimize on 64-bit platforms by converting 32-bit operands to 64-bit and doing a signed compare. On 32-bit platforms, we optimize in the case that the signed operand is an immediate with non-negative value. I ran an experiment to see how often an int/uint or uint/int comparison was falling back to a floating-point compare on i386. The results look like this: TEST sunspider/as3/date-format-tofte.abc uint/int and int/uint compare [static]: total = 1, unoptimized = 1 jit_num_num_compare 1 [1 : 1] 242000 242000 jit_int_uint_compare 1 [1 : 1] 227500 227500 jit_int_uint_compare_unoptimized 1 [1 : 1] 227500 227500 The second line shows static counts computed by ad-hoc instrumentation, and the following lines show dynamic counts collected via vprof instrumentation. "optimized" means that a non-FP comparision was used.
Assignee: wmaddox → nobody
Flags: flashplayer-injection-
Flags: flashplayer-bug-
Priority: P5 → --
Target Milestone: Q3 11 - Serrano → Future
Priority: -- → P3
Target Milestone: Future → Q2 12 - Cyril
Tamarin is a dead project now. Mass WONTFIX.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Tamarin isn't maintained anymore. WONTFIX remaining bugs.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: