Closed Bug 586240 Opened 14 years ago Closed 14 years ago

JM: Investigate Value loading/saving on x64

Categories

(Core :: JavaScript Engine, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: sstangl, Unassigned)

References

Details

Attachments

(3 files)

The performance penalties on the x64 port most likely derive from additional work performed relative to the x86 version to satisfy the boxing format. There are potentially some tricks we can do to make the penalty less severe.

Consider the case of loading both a type register as well as a data register. Currently, the pseudocode for this is:

> move(valueOf(address), typeReg)
> move(typeReg, payloadReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, payloadReg)

Surprisingly, this code is pretty bad. In fact, it turns out that the brain-dead version that loads from memory twice is faster:

> move(valueOf(address), typeReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(valueOf(address), payloadReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, typeReg)

The best explanation I have is that the second version pipelines better. Indeed, a third version that uses a base register gets the best performance of all:

> move(valueOf(address), valueReg)
> move(valueReg, typeReg)
> move(valueReg, dataReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, payloadReg)

The attached patch makes a loadValueAsComponents() function, reducing duplication of logic. This function then implements the third piece of code described above. We also get rid of loadValueThenPayload() and loadValueThenType() in the process, which is nice.

Performance is not much affected; the noise on SS is too much. However, access-nbody notably improves: the minimum execution time goes from 20ms to 17ms. But getting all the logic in one place is a good thing.
Attachment #464778 - Flags: review?(dvander)
Comment on attachment 464778 [details] [diff] [review]
Implement loadValueAsComponents().


>+         * Loading into the ValueRegister is better than loading directly into type:
>+         * with this code, there are no interdependencies between type and payload,
>+         * allowing the instructions to be better pipelined. This is X64-specific.
>+         */
>+        loadValue(address, Registers::ValueReg);
>+        Label l = label();
>+        
>+        move(Registers::ValueReg, type);
>+        move(Registers::ValueReg, payload);
>+        move(Imm64(JSVAL_PAYLOAD_MASK), Registers::ValueReg);

Is it possible to load directly into type or payload rather than the to-be-clobbered ValueReg?
Attachment #464778 - Flags: review?(dvander) → review+
(In reply to comment #1)
> Is it possible to load directly into type or payload rather than the
> to-be-clobbered ValueReg?

Yes, that was what I was doing originally, since it's 'obviously better'. Testing showed that doing so actually gives significantly worse performance, to which I wave my hands and say 'pipelining'. The comment above the block tries to solidify the experience.
Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine code sequences.

/be
(In reply to comment #3)
> Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine
> code sequences.

We have a 64-bit fatval, and want to split it into a 17-bit type tag and 47-bit payload, with the type tag occupying the upper bits. Both of these are then loaded into separate registers.

We have three different implementations of this process, presented below, in order of measured slowest to measured fastest. Note that when these code examples occur they are often appended to each other, bearing variations in addresses and type and payload registers.

%r15 is the ValueReg, a scratch register.
%r11 is the JSC::scratchRegister, a scratch register used to load constants.
%rbx is the JSFrameReg, from which argument slots are indexed.
$0x7fffffffffff is the JSVAL_PAYLOAD_MASK.
$0xffff800000000000 is the JSVAL_TYPE_MASK.

Version 1 (slowest): Load address into type, move into payload, mask both:

> mov    0x98(%rbx),%r14
> mov    %r14,%r13
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> movabs $0x7fffffffffff,%r11
> and    %r11,%r13

Version 2 (faster): Load address into type, load address into payload, mask:

> mov    0x98(%rbx),%r14
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> mov    0x98(%rbx),%r13
> movabs $0x7fffffffffff,%r11
> and    %r11,%r13

Version 3 (fastest): Load address into ValueReg, disperse, mask:

> mov    0x98(%rbx),%r15
> mov    %r15,%r14
> mov    %r15,%r13
> movabs $0x7fffffffffff,%r15
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> and    %r15,%r13

Any advice on how to improve this would be greatly appreciated.
Jan de Mooij compiled the above logic using GCC, and I adapted loadValueAsComponents() to do the same thing. It looks good:

> mov    0xa0(%rbx),%r15
> movabs $0x7fffffffffff,%r13
> movabs $0xffff800000000000,%r14
> and    %r15,%r13
> and    %r15,%r14
Attachment #465007 - Flags: review?(dvander)
Attachment #465007 - Flags: review?(dvander) → review+
test -- did my last suggestion about xor make it through?  (Trying to send from blackberry...)
Sorry for the spam.  Suggestion was to use the fact the value and payload masks are compliments to save a register and an instruction.  Like

Mov value, r14
Movabs MASK, r13
And r14,r13
Xor r13,r14
We could make use of the masks being compliments and eliminate the second 'movabs' by xor'ing %r13 with a sign-extended (-1). Unfortunately, it turns out that this is slower than just doing the move. It's a clever idea, though.
Just to verify -- it sounds like you may have misread my idea.  Did you try it as:

> mov 0xa0(%rbx), %r14
> movabs $0x7fffffffffff, %r13
> and %r14, %r13
> xor %r13, %r14

I guess this doesn't pipeline as well as the gcc version, but I would have thought the reduced instruction count would matter more.
Attached patch Use xor method.Splinter Review
(In reply to comment #10)
> Just to verify -- it sounds like you may have misread my idea.  Did you try it

Ah, you're right! My apologies. The version you suggested looks very good. I tried it out, and it looked like it was actually slower (at least, SunSpider's bits-in-byte gets 10-15% slower consistently), but is certainly much more elegant. And it gets us one step closer to eliminating the ValueReg, which is a great thing.

I'm attaching this patch so that dvander can test for speed on the mac mini that runs arewefastyet.com.

Thanks, Travis.
Looks like maybe 1% on SS (6ms) and <1% on v8 (~30ms),
Travis's solution looks good. I tried all four versions in a micro test on a Core i7 system. The performance of all versions seems to be very close, with Travis's version perhaps slightly faster. If I could come up with something better, I'll post.
Attachment #465888 - Flags: review?(dvander)
Attachment #465888 - Flags: review?(dvander) → review+
Depends on: 587444
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: