Closed Bug 586240 Opened 14 years ago Closed 14 years ago

JM: Investigate Value loading/saving on x64

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: sstangl, Unassigned)

References

Details

Attachments

(3 files)

Implement loadValueAsComponents(). 14 years ago Sean Stangl [:sstangl] 13.51 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Mimic GCC output. 14 years ago Sean Stangl [:sstangl] 1.19 KB, patch	dvander : review+	Details \| Diff \| Splinter Review
Use xor method. 14 years ago Sean Stangl [:sstangl] 1.44 KB, patch	dvander : review+	Details \| Diff \| Splinter Review

Sean Stangl [:sstangl]

Reporter

Description

•

14 years ago

Attached patch Implement loadValueAsComponents(). — Details — Splinter Review

The performance penalties on the x64 port most likely derive from additional work performed relative to the x86 version to satisfy the boxing format. There are potentially some tricks we can do to make the penalty less severe.

Consider the case of loading both a type register as well as a data register. Currently, the pseudocode for this is:

> move(valueOf(address), typeReg)
> move(typeReg, payloadReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, payloadReg)

Surprisingly, this code is pretty bad. In fact, it turns out that the brain-dead version that loads from memory twice is faster:

> move(valueOf(address), typeReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(valueOf(address), payloadReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, typeReg)

The best explanation I have is that the second version pipelines better. Indeed, a third version that uses a base register gets the best performance of all:

> move(valueOf(address), valueReg)
> move(valueReg, typeReg)
> move(valueReg, dataReg)
> move(JSVAL_TYPE_MASK, scratchReg)
> and(scratchReg, typeReg)
> move(JSVAL_PAYLOAD_MASK, scratchReg)
> and(scratchReg, payloadReg)

The attached patch makes a loadValueAsComponents() function, reducing duplication of logic. This function then implements the third piece of code described above. We also get rid of loadValueThenPayload() and loadValueThenType() in the process, which is nice.

Performance is not much affected; the noise on SS is too much. However, access-nbody notably improves: the minimum execution time goes from 20ms to 17ms. But getting all the logic in one place is a good thing.

Attachment #464778 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 1

•

14 years ago

Comment on attachment 464778 [details] [diff] [review]
Implement loadValueAsComponents().


>+         * Loading into the ValueRegister is better than loading directly into type:
>+         * with this code, there are no interdependencies between type and payload,
>+         * allowing the instructions to be better pipelined. This is X64-specific.
>+         */
>+        loadValue(address, Registers::ValueReg);
>+        Label l = label();
>+        
>+        move(Registers::ValueReg, type);
>+        move(Registers::ValueReg, payload);
>+        move(Imm64(JSVAL_PAYLOAD_MASK), Registers::ValueReg);

Is it possible to load directly into type or payload rather than the to-be-clobbered ValueReg?

Attachment #464778 - Flags: review?(dvander) → review+

Sean Stangl [:sstangl]

Reporter

Comment 2

•

14 years ago

(In reply to comment #1)
> Is it possible to load directly into type or payload rather than the
> to-be-clobbered ValueReg?

Yes, that was what I was doing originally, since it's 'obviously better'. Testing showed that doing so actually gives significantly worse performance, to which I wave my hands and say 'pipelining'. The comment above the block tries to solidify the experience.

Brendan Eich [:brendan]

Comment 3

•

14 years ago

Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine code sequences.

/be

Sean Stangl [:sstangl]

Reporter

Comment 4

•

14 years ago

http://hg.mozilla.org/projects/jaegermonkey/rev/7dd58614272f

Sean Stangl [:sstangl]

Reporter

Comment 5

•

14 years ago

(In reply to comment #3)
> Cc'ing Moh, but dvander or sstangl should lay out example fast and slow machine
> code sequences.

We have a 64-bit fatval, and want to split it into a 17-bit type tag and 47-bit payload, with the type tag occupying the upper bits. Both of these are then loaded into separate registers.

We have three different implementations of this process, presented below, in order of measured slowest to measured fastest. Note that when these code examples occur they are often appended to each other, bearing variations in addresses and type and payload registers.

%r15 is the ValueReg, a scratch register.
%r11 is the JSC::scratchRegister, a scratch register used to load constants.
%rbx is the JSFrameReg, from which argument slots are indexed.
$0x7fffffffffff is the JSVAL_PAYLOAD_MASK.
$0xffff800000000000 is the JSVAL_TYPE_MASK.

Version 1 (slowest): Load address into type, move into payload, mask both:

> mov    0x98(%rbx),%r14
> mov    %r14,%r13
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> movabs $0x7fffffffffff,%r11
> and    %r11,%r13

Version 2 (faster): Load address into type, load address into payload, mask:

> mov    0x98(%rbx),%r14
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> mov    0x98(%rbx),%r13
> movabs $0x7fffffffffff,%r11
> and    %r11,%r13

Version 3 (fastest): Load address into ValueReg, disperse, mask:

> mov    0x98(%rbx),%r15
> mov    %r15,%r14
> mov    %r15,%r13
> movabs $0x7fffffffffff,%r15
> movabs $0xffff800000000000,%r11
> and    %r11,%r14
> and    %r15,%r13

Any advice on how to improve this would be greatly appreciated.

Sean Stangl [:sstangl]

Reporter

Comment 6

•

14 years ago

Attached patch Mimic GCC output. — Details — Splinter Review

Jan de Mooij compiled the above logic using GCC, and I adapted loadValueAsComponents() to do the same thing. It looks good:

> mov    0xa0(%rbx),%r15
> movabs $0x7fffffffffff,%r13
> movabs $0xffff800000000000,%r14
> and    %r15,%r13
> and    %r15,%r14

Attachment #465007 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

14 years ago

Attachment #465007 - Flags: review?(dvander) → review+

travis fisher

Comment 7

•

14 years ago

test -- did my last suggestion about xor make it through?  (Trying to send from blackberry...)

travis fisher

Comment 8

•

14 years ago

Sorry for the spam.  Suggestion was to use the fact the value and payload masks are compliments to save a register and an instruction.  Like

Mov value, r14
Movabs MASK, r13
And r14,r13
Xor r13,r14

Sean Stangl [:sstangl]

Reporter

Comment 9

•

14 years ago

We could make use of the masks being compliments and eliminate the second 'movabs' by xor'ing %r13 with a sign-extended (-1). Unfortunately, it turns out that this is slower than just doing the move. It's a clever idea, though.

travis fisher

Comment 10

•

14 years ago

Just to verify -- it sounds like you may have misread my idea.  Did you try it as:

> mov 0xa0(%rbx), %r14
> movabs $0x7fffffffffff, %r13
> and %r14, %r13
> xor %r13, %r14

I guess this doesn't pipeline as well as the gcc version, but I would have thought the reduced instruction count would matter more.

Sean Stangl [:sstangl]

Reporter

Comment 11

•

14 years ago

Attached patch Use xor method. — Details — Splinter Review

(In reply to comment #10)
> Just to verify -- it sounds like you may have misread my idea.  Did you try it

Ah, you're right! My apologies. The version you suggested looks very good. I tried it out, and it looked like it was actually slower (at least, SunSpider's bits-in-byte gets 10-15% slower consistently), but is certainly much more elegant. And it gets us one step closer to eliminating the ValueReg, which is a great thing.

I'm attaching this patch so that dvander can test for speed on the mac mini that runs arewefastyet.com.

Thanks, Travis.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 12

•

14 years ago

Looks like maybe 1% on SS (6ms) and <1% on v8 (~30ms),

Moh Haghighat

Comment 13

•

14 years ago

Travis's solution looks good. I tried all four versions in a micro test on a Core i7 system. The performance of all versions seems to be very close, with Travis's version perhaps slightly faster. If I could come up with something better, I'll post.

Sean Stangl [:sstangl]

Reporter

Updated

•

14 years ago

Attachment #465888 - Flags: review?(dvander)

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

14 years ago

Attachment #465888 - Flags: review?(dvander) → review+

Sean Stangl [:sstangl]

Reporter

Comment 14

•

14 years ago

http://hg.mozilla.org/projects/jaegermonkey/rev/89b775191b9d

Sean Stangl [:sstangl]

Reporter

Updated

•

14 years ago

Depends on: 587444

Sean Stangl [:sstangl]

Reporter

Updated

•

14 years ago

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.