Closed Bug 885186 Opened 11 years ago Closed 11 years ago

Optimize MoveEmitter for x86/x64

Categories

(Core :: JavaScript Engine, enhancement)

Platform: x86_64 All
Type: enhancement
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
mozilla25

People

(Reporter: sunfish, Unassigned)

References

Details

Attachments

(3 files, 1 obsolete file)

x86/x64 offer several fun ways to implement MoveGroups: xchg can swap two GPRs directly, push+pop can implement moves without requiring a temporary register, and xorpd can be used to perform an xor-swap on xmm registers without requiring stack storage.
Depends on: 885183
Attached patch a proposed fix (obsolete) — Splinter Review
This patch contains some changes which are not yet tested on x86 (32-bit), though I will do that soon.
Attachment #765184 - Flags: review?(jdemooij)
Attached patch an updated patch — Splinter Review
And it turns out that there was a bug in the untested code. The pop instruction computes the effective address of its memory operand *after* incrementing the stack pointer, so the code in toOperand needs to adjust the offset when the result is the destination of a pop.

The attached patch adds code to account for this.
Attachment #765184 - Attachment is obsolete: true
Attachment #765184 - Flags: review?(jdemooij)
Attachment #765514 - Flags: review?(jdemooij)
Comment on attachment 765514 [details] [diff] [review]
an updated patch

Review of attachment 765514 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. Good to see spilledReg_ go, it could be pretty confusing.

::: js/src/ion/shared/MoveEmitter-x86-shared.cpp
@@ +127,5 @@
> +            // Otherwise use the stack.
> +            breakCycle(to, move.kind());
> +            inCycle_ = true;
> +        }
> +    

Nit: some trailing whitespace

::: js/src/ion/shared/MoveEmitter-x86-shared.h
@@ +37,3 @@
>  
> +    size_t characterizeCycle(const MoveResolver &moves, size_t i,
> +                             bool &allGeneralRegs, bool &allFloatRegs);

Use pointers for the booleans here, to make it clear they are outparams.
Attachment #765514 - Flags: review?(jdemooij) → review+
https://hg.mozilla.org/mozilla-central/rev/eb8d520f9954
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
This appears to have slowed down the asm primes benchmark on awfy by 10%. The regression range on awfy is invalid, so I bisected manually on my machine.
The two codegen changes from this patch on the asm primes benchmark are these:

-[Codegen] push       %rdx
-[Codegen] leaq       0x28(%r10), %rdx
-[Codegen] movq       %rdx, 0x8(%rsp)
+[Codegen] leaq       0x28(%r10), %r11
+[Codegen] movq       %r11, 0x0(%rsp)
 [Codegen] leaq       0x20(%r10), %r9
 [Codegen] leaq       0x18(%r10), %r8
 [Codegen] movq       0x10(%r10), %rcx
 [Codegen] movq       0x8(%r10), %rdx
 [Codegen] movq       0x0(%r10), %rsi
-[Codegen] addq       $0x8, %rsp

which looks good, and several instances of diffs like this:

-[Codegen] subq       $0x8, %rsp
-[Codegen] movq       %rax, 0x0(%rsp)
-[Codegen] movq       %rcx, %rax
-[Codegen] movq       0x0(%rsp), %rcx
-[Codegen] addq       $0x8, %rsp
+[Codegen] xchgq      %rax, %rcx

which is good as long as reg-to-reg xchgq isn't slow. According to Agner Fog's instruction tables, it's at most 3 cycles on all the CPUs listed.

On my Sandy Bridge, microbenchmarks of the swap using xchgq are twice as fast as the swap using sub, store, mov, load, add.

What CPU are you seeing the slowdowns on? I'll continue to look into this.
Attached file regressing benchmarks
My local machine runs 64-bit Linux and reports |Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz|. I also see that zlib slows down by 1.5% with this commit (less than on awfy; that's the second of two regressions that landed at the same time).

I did build the benchmarks locally; I doubt the difference matters, but I've attached them in case you want to run exactly the same ones as me.

Your diffs there do look good, so perhaps the benefits of this patch outweigh the downsides?
Attached patch nops for SPEED — Splinter Review
I can reproduce it, but only with the attached testcases. With the precompiled versions in the AWFY repository, I see the opposite results - the MoveEmitter patch speeds up primes by 10%, and zlib is sped up slightly.

The primes regression appears to be entirely due to code alignment or some other microarchitectural quirk. The attached patch simply inserts 16 nops after every xchg, so it occupies the same code size as the sequence it replaces. With this patch, primes speeds back up by 10%.
Interesting. I guess the regression is not significant then.
