Closed
Bug 412972
Opened 17 years ago
Closed 6 years ago
Optimize int/uint comparisons
Categories
(Tamarin Graveyard :: Virtual Machine, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
Q2 12 - Cyril
People
(Reporter: edwsmith, Unassigned)
Details
(Whiteboard: PACMAN)
Attachments
(1 file)
55.89 KB,
text/plain
Currently, when doing an unsigned integer vs. integer comparison in the JITed code, Tamarin emits two coercions to the Number type and does the comparison using SSE registers. This is the ugly MIR_u2d sequence:

    MOV (disp+4, ESP, 0);                   // high 32 bits = 0
    MOV (disp, ESP, (Register)(vReg&0x7));  // low 32 bits = unsigned value
    FILDQ(disp, ESP);                       // load 64-bit int (won't lose precision)
    FSTPQ(disp, ESP);                       // store + pop as 64-bit double
    MOVSD(r, disp, ESP);                    // load double into XMM

Does anyone have a better/faster/cleverer idea to compare an unsigned integer vs. an integer in registers? Perhaps checking the upper bit of the unsigned value first and then doing the standard signed integer compare?
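For reference, the double-coercion path described above has these semantics when modeled in C (the helper name is made up): every 32-bit value, signed or unsigned, is exactly representable in a 64-bit double, so the compare is correct — just slow.

```c
#include <stdint.h>

/* Sketch of what the current MIR_u2d path computes: widen both
   operands to double (lossless for any 32-bit value) and compare. */
static int lt_via_double(uint32_t a, int32_t b) {
    return (double)a < (double)b;
}
```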
Comment 1•17 years ago
Suppose A is int32 and B is uint32. Let UA = (uint32) A; that is, the same bit sequence, but treated as a uint32. Then A <=> B is equivalent to

    UA & ((UA >> 31) - 1) <=> B

I don't know if it'd be faster or not, but it's all 32-bit integer operations and avoids branches.
Comment 2•17 years ago
Argh, that's incorrect in the case that B == 0 and A == 0x80000000. :-P
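The counterexample can be checked directly. A minimal C sketch (helper names invented), comparing Comment 1's masked test against the true semantics of an int/uint equality:

```c
#include <stdint.h>

/* Comment 1's proposal: mask UA to zero when its sign bit is set,
   then compare against B. */
static int trick_equal(int32_t a, uint32_t b) {
    uint32_t ua = (uint32_t)a;
    uint32_t masked = ua & ((ua >> 31) - 1); /* zeroes ua if sign bit set */
    return masked == b;
}

/* True semantics: a signed int equals an unsigned int only if it is
   non-negative and the bit patterns match. */
static int true_equal(int32_t a, uint32_t b) {
    return a >= 0 && (uint32_t)a == b;
}
```

For A = 0x80000000 (INT32_MIN) and B = 0, the masked value collapses to 0, so trick_equal wrongly reports equality while true_equal does not.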
Comment 3•17 years ago
gcc compiles "return a >= 0 && a == b" into something like this (assuming a in eax and b in ebx, result goes in eax):

    movl %eax, %edx
    notl %edx
    shr  $31, %edx
    cmpl %ebx, %eax
    andl %edx, %eax

They're all integer ops and the critical path is 4 instructions long. There might be a shorter way, though.
Comment 4•17 years ago
I forgot to mention, in my example I was assuming that a is 'int' and b is 'unsigned int', so 'a >= 0 && a == b' is then a semantic equality test.
Comment 5•17 years ago
dmandelin's gcc is smarter than mine. "return a > 0 && ((unsigned int) a) > b;" produces:

    movl   A, %eax
    testl  %eax, %eax
    setg   %dl
    cmpl   B, %eax
    seta   %al
    movzbl %al, %eax
    andl   %edx, %eax

after which, if A > B, then EAX is 1 and the zero bit is clear; otherwise A <= B, EAX is 0, and the zero bit is set. Didn't try it with "a >= 0 && ((unsigned int) a) >= b".
Comment 6•17 years ago
There are a lot of MIR instructions that need to use the result of the comparison: MIR_eq/ne/lt/le are different than MIR_jeq/jne/jlt/jle/jnle/jnlt. Some notes from an email discussion: "MIR_eq/ne/etc do a compare then SETcc, whereas MIR_jeq/jne/etc do a compare followed by a branch. The compare's job is to set the condition codes directly." Whatever ASM we generate, it has to work for all cases, obviously. My hand-coded version for a loop was something like this:

> var limit:uint = 10000000;
> var i:int = 0;
> for (i = 0; i < limit; i++)
> {
> }
>
> eax = limit
> ebx = i
>
> loop2:
>   cmp eax, 0x7fffffff
>   jg cmp
>   cmp eax, ebx
> cmp:
>   jle done
>   add ebx, 1
>   jmp loop2
> done:

Ed Smith said that the ideal location for this optimization would be in cmpOptimization, where we check the types of the operands to short-circuit using floating-point values.
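The branchy scheme in the hand-coded loop can be expressed in C (a sketch with an invented name): if the unsigned limit does not fit in int32 it exceeds every possible signed i, so the loop test is trivially true; otherwise both values fit in int32 and an ordinary signed compare is correct.

```c
#include <stdint.h>

/* Branchy "signed i < unsigned limit", modeling the loop test above. */
static int lt_int_uint_branchy(int32_t i, uint32_t limit) {
    if (limit > 0x7fffffffu)   /* limit exceeds INT32_MAX...          */
        return 1;              /* ...so any int32 is strictly below it */
    return i < (int32_t)limit; /* both fit: plain signed compare       */
}
```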
Comment 7•17 years ago
Your version seems to be about 60% faster than the version I got from gcc, for this loop at least. (I'm not sure how to tell gcc to fully optimize an empty loop body without just removing it, but if I could, it might be enlightening.) That surprised me, because the critical path length is the same.

After some digging, I found that Intel says: "Use the setcc and cmov instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. ...converting conditional branches to cmovs or setcc trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine." So at least Intel wouldn't be surprised that the branch code is superior for a loop. But it might be the other way around for unpredictable tests.

I notice that in your loop example, 'limit' is unsigned int but never written to, so it could be converted to signed int at the beginning as a no-op, allowing the sign test to be removed. I wonder if there are enough opportunities for once-at-the-top conversions like that for it to be worth doing in the optimizer.
Comment 8•17 years ago
My loop is a bit of a bad example for fixing this bug in the MIR generation code. When cmpOptimization is called, the state information we have is just the types of the two operands. We could emit optimized code here (like we do currently with MIR_ucmp instead of number coercion and fcmpIns), or we could try to fix this performance issue higher up in cmpLt, etc. We don't know how the condition codes are going to be used (without looking ahead in our instruction stream), so we really need to focus on the right ASM code to just set the condition codes. Optimizing the 'limit' check outside of the loop is a wonderful goal but much harder to accomplish in the current codebase.
Comment 9•17 years ago
How about this:

    mov eax, A
    mov edx, B
    mov ecx, eax
    and ecx, 0x80000000
    or  edx, ecx         ; edx = B | (A & signbit)
    and eax, 0x7fffffff  ; eax = A & ~signbit
    cmp eax, edx

after which, the OF and SF bits will be set up for a conditional jump just as after a normal compare of two *unsigned* operands. What this does is force B to be greater than A if A's sign bit is set. Then compare as normal. (Or, that's what I intended it to do.)
Comment 10•17 years ago
I think I should have said "the ZF and CF bits".
Comment 11•17 years ago
If A is signed and B unsigned, you can compute A <= B like this:

    test  A, A
    cmovs A, B
    cmp   A, B

cmovs is a conditional move: if A is negative, we set A equal to B, so that the comparison reports them as equal. Since the condition we actually want is <=, this is just as good. With the right choice of branch instruction, this could handle other operators. B could be a memory reference.
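A C model of the cmov sequence (helper name invented): a negative A gets replaced by B before the unsigned compare, so "below or equal" then holds exactly when the semantic A <= B holds.

```c
#include <stdint.h>

/* Models: test A,A / cmovs A,B / cmp A,B followed by a jbe-style test. */
static int le_cmov_model(int32_t a, uint32_t b) {
    uint32_t ua = (a < 0) ? b : (uint32_t)a; /* cmovs: A = B if A negative */
    return ua <= b;                          /* unsigned compare, "jbe"    */
}
```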
Reporter
Updated•16 years ago
Priority: -- → P2
Updated•16 years ago
Summary: Optimize int/uint comparisons (TC and TT) → Optimize int/uint comparisons
Comment 12•16 years ago
Triaging old issues; wondering if this issue also applies to LIR?
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Updated•16 years ago
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Reporter
Comment 13•16 years ago
Yes, still applicable to LIR, and still open.
Updated•16 years ago
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Flags: flashplayer-qrb? → flashplayer-qrb+
Target Milestone: --- → Future
Updated•15 years ago
Priority: P2 → P5
Target Milestone: Future → flash10.1
Updated•14 years ago
Whiteboard: PACMAN
Reporter
Updated•14 years ago
Hardware: x86 → All
Updated•14 years ago
Assignee: nobody → wmaddox
Comment 14•14 years ago
We currently optimize on 64-bit platforms by converting 32-bit operands to 64-bit and doing a signed compare. On 32-bit platforms, we optimize in the case that the signed operand is an immediate with a non-negative value. I ran an experiment to see how often an int/uint or uint/int comparison was falling back to a floating-point compare on i386. The results look like this:

    TEST sunspider/as3/date-format-tofte.abc
    uint/int and int/uint compare [static]: total = 1, unoptimized = 1
    jit_num_num_compare              1 [1 : 1] 242000 242000
    jit_int_uint_compare             1 [1 : 1] 227500 227500
    jit_int_uint_compare_unoptimized 1 [1 : 1] 227500 227500

The second line shows static counts computed by ad-hoc instrumentation, and the following lines show dynamic counts collected via vprof instrumentation. "Optimized" means that a non-FP comparison was used.
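The 64-bit strategy described here is simple to state in C (sketch, invented name): both int32 and uint32 embed losslessly in int64, so a single signed compare gives the right answer for every operand pair.

```c
#include <stdint.h>

/* 64-bit path: widen both sides to int64 (exact for any 32-bit value)
   and use one ordinary signed compare. */
static int lt_widen64(int32_t a, uint32_t b) {
    return (int64_t)a < (int64_t)b;
}
```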
Updated•14 years ago
Assignee: wmaddox → nobody
Flags: flashplayer-injection-
Flags: flashplayer-bug-
Priority: P5 → --
Target Milestone: Q3 11 - Serrano → Future
Comment 15•6 years ago
Tamarin is a dead project now. Mass WONTFIX.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Comment 16•6 years ago
Tamarin isn't maintained anymore. WONTFIX remaining bugs.