Closed
Bug 586240
Opened 14 years ago
Closed 14 years ago
JM: Investigate Value loading/saving on x64
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: sstangl, Unassigned)
References
Details
Attachments
(3 files)
13.51 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
1.19 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
1.44 KB,
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
The performance penalties on the x64 port most likely derive from additional work performed relative to the x86 version to satisfy the boxing format. There are potentially some tricks we can do to make the penalty less severe. Consider the case of loading both a type register as well as a data register. Currently, the pseudocode for this is: > move(valueOf(address), typeReg) > move(typeReg, payloadReg) > move(JSVAL_TYPE_MASK, scratchReg) > and(scratchReg, typeReg) > move(JSVAL_PAYLOAD_MASK, scratchReg) > and(scratchReg, payloadReg) Surprisingly, this code is pretty bad. In fact, it turns out that the brain-dead version that loads from memory twice is faster: > move(valueOf(address), typeReg) > move(JSVAL_TYPE_MASK, scratchReg) > and(scratchReg, typeReg) > move(valueOf(address), payloadReg) > move(JSVAL_PAYLOAD_MASK, scratchReg) > and(scratchReg, typeReg) The best explanation I have is that the second version pipelines better. Indeed, a third version that uses a base register gets the best performance of all: > move(valueOf(address), valueReg) > move(valueReg, typeReg) > move(valueReg, dataReg) > move(JSVAL_TYPE_MASK, scratchReg) > and(scratchReg, typeReg) > move(JSVAL_PAYLOAD_MASK, scratchReg) > and(scratchReg, payloadReg) The attached patch makes a loadValueAsComponents() function, reducing duplication of logic. This function then implements the third piece of code described above. We also get rid of loadValueThenPayload() and loadValueThenType() in the process, which is nice. Performance is not much affected; the noise on SS is too much. However, access-nbody notably improves: the minimum execution time goes from 20ms to 17ms. But getting all the logic in one place is a good thing.
Attachment #464778 -
Flags: review?(dvander)
Comment on attachment 464778 [details] [diff] [review] Implement loadValueAsComponents(). >+ * Loading into the ValueRegister is better than loading directly into type: >+ * with this code, there are no interdependencies between type and payload, >+ * allowing the instructions to be better pipelined. This is X64-specific. >+ */ >+ loadValue(address, Registers::ValueReg); >+ Label l = label(); >+ >+ move(Registers::ValueReg, type); >+ move(Registers::ValueReg, payload); >+ move(Imm64(JSVAL_PAYLOAD_MASK), Registers::ValueReg); Is it possible to load directly into type or payload rather than the to-be-clobbered ValueReg?
Attachment #464778 -
Flags: review?(dvander) → review+
Reporter | ||
Comment 2•14 years ago
|
||
(In reply to comment #1) > Is it possible to load directly into type or payload rather than the > to-be-clobbered ValueReg? Yes, that was what I was doing originally, since it's 'obviously better'. Testing showed that doing so actually gives significantly worse performance, to which I wave my hands and say 'pipelining'. The comment above the block tries to solidify the experience.
Comment 3•14 years ago
|
||
Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine code sequences. /be
Reporter | ||
Comment 4•14 years ago
|
||
http://hg.mozilla.org/projects/jaegermonkey/rev/7dd58614272f
Reporter | ||
Comment 5•14 years ago
|
||
(In reply to comment #3) > Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine > code sequences. We have a 64-bit fatval, and want to split it into a 17-bit type tag and 47-bit payload, with the type tag occupying the upper bits. Both of these are then loaded into separate registers. We have three different implementations of this process, presented below, in order of measured slowest to measured fastest. Note that when these code examples occur they are often appended to each other, bearing variations in addresses and type and payload registers. %r15 is the ValueReg, a scratch register. %r11 is the JSC::scratchRegister, a scratch register used to load constants. %rbx is the JSFrameReg, from which argument slots are indexed. $0x7fffffffffff is the JSVAL_PAYLOAD_MASK. $0xffff800000000000 is the JSVAL_TYPE_MASK. Version 1 (slowest): Load address into type, move into payload, mask both: > mov 0x98(%rbx),%r14 > mov %r14,%r13 > movabs $0xffff800000000000,%r11 > and %r11,%r14 > movabs $0x7fffffffffff,%r11 > and %r11,%r13 Version 2 (faster): Load address into type, load address into payload, mask: > mov 0x98(%rbx),%r14 > movabs $0xffff800000000000,%r11 > and %r11,%r14 > mov 0x98(%rbx),%r13 > movabs $0x7fffffffffff,%r11 > and %r11,%r13 Version 3 (fastest): Load address into ValueReg, disperse, mask: > mov 0x98(%rbx),%r15 > mov %r15,%r14 > mov %r15,%r13 > movabs $0x7fffffffffff,%r15 > movabs $0xffff800000000000,%r11 > and %r11,%r14 > and %r15,%r13 Any advice on how to improve this would be greatly appreciated.
Reporter | ||
Comment 6•14 years ago
|
||
Jan de Mooij compiled the above logic using GCC, and I adapted loadValueAsComponents() to do the same thing. It looks good:
> mov 0xa0(%rbx),%r15
> movabs $0x7fffffffffff,%r13
> movabs $0xffff800000000000,%r14
> and %r15,%r13
> and %r15,%r14
Attachment #465007 -
Flags: review?(dvander)
Updated•14 years ago
|
Attachment #465007 -
Flags: review?(dvander) → review+
Comment 7•14 years ago
|
||
test -- did my last suggestion about xor make it through? (Trying to send from blackberry...)
Comment 8•14 years ago
|
||
Sorry for the spam. Suggestion was to use the fact the value and payload masks are compliments to save a register and an instruction. Like Mov value, r14 Movabs MASK, r13 And r14,r13 Xor r13,r14
Reporter | ||
Comment 9•14 years ago
|
||
We could make use of the masks being compliments and eliminate the second 'movabs' by xor'ing %r13 with a sign-extended (-1). Unfortunately, it turns out that this is slower than just doing the move. It's a clever idea, though.
Comment 10•14 years ago
|
||
Just to verify -- it sounds like you may have misread my idea. Did you try it as:
> mov 0xa0(%rbx), %r14
> movabs $0x7fffffffffff, %r13
> and %r14, %r13
> xor %r13, %r14
I guess this doesn't pipeline as well as the gcc version, but I would have thought the reduced instruction count would matter more.
Reporter | ||
Comment 11•14 years ago
|
||
(In reply to comment #10) > Just to verify -- it sounds like you may have misread my idea. Did you try it Ah, you're right! My apologies. The version you suggested looks very good. I tried it out, and it looked like it was actually slower (at least, SunSpider's bits-in-byte gets 10-15% slower consistently), but is certainly much more elegant. And it gets us one step closer to eliminating the ValueReg, which is a great thing. I'm attaching this patch so that dvander can test for speed on the mac mini that runs arewefastyet.com. Thanks, Travis.
Looks like maybe 1% on SS (6ms) and <1% on v8 (~30ms),
Comment 13•14 years ago
|
||
Travis's solution looks good. I tried all four versions in a micro test on a Core i7 system. The performance of all versions seems to be very close, with Travis's version perhaps slightly faster. If I could come up with something better, I'll post.
Reporter | ||
Updated•14 years ago
|
Attachment #465888 -
Flags: review?(dvander)
Updated•14 years ago
|
Attachment #465888 -
Flags: review?(dvander) → review+
Reporter | ||
Comment 14•14 years ago
|
||
http://hg.mozilla.org/projects/jaegermonkey/rev/89b775191b9d
Reporter | ||
Updated•14 years ago
|
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•