Bug 885186: Optimize MoveEmitter for x86/x64
Status: RESOLVED FIXED (Closed)
Opened 11 years ago; closed 11 years ago
Category: Core :: JavaScript Engine (enhancement)
Target Milestone: mozilla25
Reporter: sunfish; Assignee: Unassigned
Attachments: 3 files, 1 obsolete file
Attachment: patch, 21.48 KB (jandem: review+)
Attachment: application/x-bzip, 82.43 KB
Attachment: patch, 740 bytes
x86/x64 offer several fun ways to implement MoveGroups: xchg can swap two GPRs directly, push+pop can implement moves without requiring a temporary register, and xorpd can be used to perform an xor-swap on xmm registers without requiring stack storage.
Comment 1 • Reporter (sunfish) • 11 years ago
This patch contains some changes which are not yet tested on x86 (32-bit), though I will do that soon.
Attachment #765184 - Flags: review?(jdemooij)
Comment 2 • Reporter (sunfish) • 11 years ago
And it turns out that there was a bug in the untested code. The pop instruction computes the effective address for its output *after* incrementing the stack pointer, so the code in toOperand needs to add an offset if the result is to be used by a pop. The attached patch adds code to account for this.
Attachment #765184 - Attachment is obsolete: true
Attachment #765184 - Flags: review?(jdemooij)
Attachment #765514 - Flags: review?(jdemooij)
Comment 3 • 11 years ago
Comment on attachment 765514 [details] [diff] [review]
an updated patch

Review of attachment 765514 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. Good to see spilledReg_ go, it could be pretty confusing.

::: js/src/ion/shared/MoveEmitter-x86-shared.cpp
@@ +127,5 @@
> +        // Otherwise use the stack.
> +        breakCycle(to, move.kind());
> +        inCycle_ = true;
> +    }
> + 

Nit: some trailing whitespace

::: js/src/ion/shared/MoveEmitter-x86-shared.h
@@ +37,3 @@
> 
> +    size_t characterizeCycle(const MoveResolver &moves, size_t i,
> +                             bool &allGeneralRegs, bool &allFloatRegs);

Use pointers for the booleans here, to make it clear they are outparams.
Attachment #765514 - Flags: review?(jdemooij) → review+
Comment 4 • 11 years ago
https://hg.mozilla.org/mozilla-central/rev/eb8d520f9954
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
Comment 5 • 11 years ago
This appears to have slowed down the asm primes benchmark on AWFY by 10%. The regression range on AWFY is invalid, so I bisected manually on my machine.
Comment 6 • Reporter (sunfish) • 11 years ago
The two codegen changes from this patch on the asm primes benchmark are this:

-[Codegen] push %rdx
-[Codegen] leaq 0x28(%r10), %rdx
-[Codegen] movq %rdx, 0x8(%rsp)
+[Codegen] leaq 0x28(%r10), %r11
+[Codegen] movq %r11, 0x0(%rsp)
 [Codegen] leaq 0x20(%r10), %r9
 [Codegen] leaq 0x18(%r10), %r8
 [Codegen] movq 0x10(%r10), %rcx
 [Codegen] movq 0x8(%r10), %rdx
 [Codegen] movq 0x0(%r10), %rsi
-[Codegen] addq $0x8, %rsp

which looks good, and several instances of diffs like this:

-[Codegen] subq $0x8, %rsp
-[Codegen] movq %rax, 0x0(%rsp)
-[Codegen] movq %rcx, %rax
-[Codegen] movq 0x0(%rsp), %rcx
-[Codegen] addq $0x8, %rsp
+[Codegen] xchgq %rax, %rcx

which is good as long as reg-to-reg xchgq isn't slow. According to Agner's instruction tables, it's at most 3 cycles across all the CPUs. On my Sandy Bridge, microbenchmarks of the swap using xchgq are twice as fast as the swap using sub, store, mov, load, add. What CPU are you seeing the slowdowns on? I'll continue to look into this.
Comment 7 • 11 years ago
My local machine runs 64-bit Linux and reports |Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz|. I also see that zlib slows down by 1.5% with this commit (less than on AWFY; that's the second of two regressions that happened at the same time). I built the benchmarks locally; I doubt the difference matters, but I've attached them in case you want to run on the exact same ones as me. Your diffs there do look good, so perhaps the benefits of this patch outweigh the downsides?
Comment 8 • Reporter (sunfish) • 11 years ago
I can reproduce it, but only with the attached testcases. With the precompiled versions in the AWFY repository, I see the opposite results: the MoveEmitter patch speeds up primes by 10%, and zlib is sped up slightly. The primes regression appears to be entirely due to code alignment or some other microarchitectural quirk. The attached patch simply inserts 16 nops after every xchg, making it occupy the same code size as the sequence it replaces. With this patch, primes speeds back up by 10%.
Comment 9 • 11 years ago
Interesting. I guess the regression is not significant then.