Closed Bug 579618 Opened 14 years ago Closed 14 years ago

JM: Use %ebp for JSFrameReg instead of %ebx.

Status:

RESOLVED WONTFIX

People

(Reporter: sstangl, Unassigned)

Details

Attachments

(2 files)

EBP patch. 14 years ago Chris Leary [:cdleary] (not checking bugmail) 1.58 KB, patch
V8 wins on my machine. 14 years ago Chris Leary [:cdleary] (not checking bugmail) 994 bytes, patch

Sean Stangl [:sstangl]

Reporter

Description

•

14 years ago

A few moments ago, I was discussing timing of x86 instructions with cdleary -- notably, that cmp %eax was faster than cmp %ebp, and that it makes sense to have the register allocator know these things.

We then had a moment of collective realization that %ebx is being used for JSFrameReg.

"But isn't %ebp normally used for the stack? If Intel is giving registers different behavior based on their use, isn't it sensible that dereferencing off of %ebp is faster than dereferencing off of %ebx?"

"That would be very silly if it were true."

So we measured. It turns out it's true. CC'ing cdleary, who ran the perf tests on his computer.

Chris Leary [:cdleary] (not checking bugmail)

Comment 1

•

14 years ago

Attached patch: EBP patch.

Adds ebx to the set of temporary registers as well, which is another effect.

Chris Leary [:cdleary] (not checking bugmail)

Comment 2

•

14 years ago

Attached patch: V8 wins on my machine.

Luke Wagner [:luke]

Comment 3

•

14 years ago

Out of curiosity, what was the speedup of just $ebp vs. $ebx?  I assume comment 2 includes the benefit of having the extra register.

Another question: if you break in gdb in C++ code called from method-jitted code, does this screw up the backtrace?

Chris Leary [:cdleary] (not checking bugmail)

Comment 4

•

14 years ago

My machine is apparently a fanciful unicorn. We don't see the same results on other people's machines...

Good point in comment 3, sstangl said he'd test out the backtrace issue. That may be why it was excluded in the first place.

Sean Stangl [:sstangl]

Reporter

Comment 5

•

14 years ago

Yes, it screws up the backtrace. Good point.

Chris Leary [:cdleary] (not checking bugmail)

Comment 6

•

14 years ago

(In reply to comment #3)
> Out of curiosity, what was the speedup of just $ebp vs. $ebx?  I assume comment
> 2 includes the benefit of having the extra register.

Re-running it with just the switch to ebp shows 1.2%, but again, unicornocity must be taken into account.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WONTFIX

Luke Wagner [:luke]

Comment 7

•

14 years ago

Oops, didn't mean to sound so somber; I applaud your micro-architectural zeal!

Andreas Gal :gal

Comment 8

•

14 years ago

I am looking at the core i7 architecture diagrams. I can't see any reason why ebp and ebx should behave differently. On older core 1 architectures esp was a bit funky (it was co-located with pc, which is rarely read directly) and the optimization manual advises against using it as a general purpose register, but ebp and ebx should be identical. They are both renamed to internal registers randomly as instructions get decoded. But then, it's x86. Anything is possible.

Luke Wagner [:luke]

Comment 9

•

14 years ago

(In reply to comment #8)
> On older core 1 architectures esp was a
> bit funky (it was co-located with pc, which is rarely read directly) and the
> optimization manual advises against using it as a general purpose register,

How much older?  Like Core 2 and Pentium 4, or stuff nobody uses?  Because, assuming esp is a fixed offset from ebp (which I vaguely remember it being) perhaps we could commandeer esp instead of ebp.

Chris Leary [:cdleary] (not checking bugmail)

Comment 10

•

14 years ago

(In reply to comment #9)

All I can find in the Intel Opt Ref Manual is "When the ESP register is not used as the destination of an instruction (explicit ESP updates), an implicit ESP update will occur with instructions like PUSH, POP, CALL, RETURN. Mixing explicit ESP updates and implicit ESP updates will also lead to dependency between address generation and data execution." (12.3.2.2) Maybe gal can cite sources?

Chris Leary [:cdleary] (not checking bugmail)

Comment 11

•

14 years ago

(In reply to comment #8)
> I am looking at the core i7 architecture diagrams. I can't see any reason why
> ebp and ebx should behave differently.

I felt similarly, having studied RATs and ROBs, but sstangl's flashy demo was fairly convincing.

Some reflection: TEST is recommended for RAX comparisons with an immediate constant in Intel's Opt Ref Manual 3.5.17 (which is what sstangl showed me), so they may have a smaller uop for comparison against rax.

Would be cool to get an Intel processor optimization pro to come talk.

Andreas Gal :gal

Comment 12

•

14 years ago

3.4.2.6 Scheduling Rules for the Pentium M Processor Decoder. "Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET)."

B.5.2.3 talks about how to meter ESP synchronization events. I think the processor tracks the state of ESP implicitly in order to eliminate redundant push pop instructions. I can't find a source for this off hand.

I also just randomly ran across a note that on Intel Atom you are not supposed to do ADD/SUB on ESP and use LEA instead (E.5).

Anyway, we should measure this stuff. ESP is a bit magic. I saw all sorts of weird effects when I added ESP-relative stack addressing to nanojit (we started losing a bit of perf because ESP-relative address modes are slightly longer than EBP-relative ones).

If I remember the ESP-nanojit bug correctly, it ended up being a wash on core 2 (my machine) but a slight slowdown on older machines.
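The remark about ESP-relative address modes being longer can be illustrated with the 32-bit ModRM/SIB encoding rules: an ESP base always needs an extra SIB byte, and an EBP base cannot use the zero-displacement form. The following is a minimal sketch, not nanojit or JaegerMonkey code; the helper name `encode_load` is hypothetical, and it only handles a plain `mov disp(%base), %dst` load with no index register.

```python
import struct

# Register numbers as used in the ModRM and SIB fields (32-bit mode).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_load(dst, base, disp=0):
    """Encode `mov disp(%base), %dst` (opcode 8B /r), hypothetical helper."""
    reg, rm = REG[dst], REG[base]
    if disp == 0 and base != "ebp":
        # mod=00: no displacement. Not available for ebp, because
        # mod=00 with rm=101 means "disp32 with no base register".
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, struct.pack("<b", disp)   # 8-bit displacement
    else:
        mod, disp_bytes = 0b10, struct.pack("<i", disp)   # 32-bit displacement
    modrm = (mod << 6) | (reg << 3) | rm
    # rm=100 (esp) means "SIB byte follows"; 0x24 = no index, base esp.
    sib = bytes([0x24]) if base == "esp" else b""
    return bytes([0x8B, modrm]) + sib + disp_bytes

for base, disp in [("ebx", 8), ("ebp", 8), ("esp", 8), ("ebx", 0), ("ebp", 0)]:
    code = encode_load("eax", base, disp)
    print(f"mov {disp}(%{base}), %eax: {code.hex(' ')} ({len(code)} bytes)")
```

With a small non-zero displacement, ebx and ebp produce loads of the same length, while esp costs one more byte. Only the zero-displacement case differs between ebx and ebp, and then it is ebp that is one byte longer.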

Chris Leary [:cdleary] (not checking bugmail)

Comment 13

•

14 years ago

(In reply to comment #12)
> I think the
> processor tracks the state of ESP implicitly in order to eliminate redundant
> push pop instructions. I can't find a source for this off hand.

Yeah, there's a section on this in the manual as well, called "ESP folding".

Sean Stangl [:sstangl]

Reporter

Comment 14

•

14 years ago

cmp $0xf, %eax has the following encoding:
4004d6:       3d 0f 00 00 00          cmp    $0xf,%eax

cmp $0xf, %edi has the following encoding:
400514:       81 ff 0f 00 00 00       cmp    $0xf,%edi

The timing benchmark does 128 of these operations in a loop a few million times. So immediate comparisons with %eax/%rax should be expected to be faster, even without looking into microcode.
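The one-byte difference comes from x86 having a dedicated short-form opcode (`3D`) for comparing the accumulator with an immediate, while every other register goes through the general `81 /7` form with a ModRM byte. A minimal sketch that reproduces the two encodings quoted above, assuming 32-bit operands and a full 32-bit immediate; the helper name `encode_cmp_imm32` is made up for illustration and does not come from any assembler.

```python
import struct

# Register numbers as used in the ModRM r/m field (32-bit registers).
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
       "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_cmp_imm32(reg, imm):
    """Encode `cmp $imm, %reg` with a 32-bit immediate (hypothetical helper)."""
    imm_bytes = struct.pack("<I", imm & 0xFFFFFFFF)
    if reg == "eax":
        # Short form: opcode 3D, accumulator implied, no ModRM byte.
        return bytes([0x3D]) + imm_bytes
    # General form: opcode 81 /7, ModRM = mod 11, reg field 7, rm = register.
    modrm = 0xC0 | (7 << 3) | REG[reg]
    return bytes([0x81, modrm]) + imm_bytes

eax_form = encode_cmp_imm32("eax", 0xF)
edi_form = encode_cmp_imm32("edi", 0xF)
print(eax_form.hex(" "), len(eax_form))   # 3d 0f 00 00 00 5
print(edi_form.hex(" "), len(edi_form))   # 81 ff 0f 00 00 00 6

# Extra code bytes in a loop body that contains 128 such comparisons.
print(128 * (len(edi_form) - len(eax_form)))   # 128
```

So the non-accumulator loop body is 128 bytes larger, which on its own could plausibly affect fetch and decode. Note that when the constant fits in a signed byte, an assembler can also use the sign-extended form `83 /7 ib`, which has the same length for every register; the disassembly above shows the 32-bit immediate form.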

Brendan Eich [:brendan]

Comment 15

•

14 years ago

Moh, any insights you can share would be greatly appreciated.

/be

Moh Haghighat

Comment 16

•

14 years ago

I'll investigate and get back to you.

-moh

Moh Haghighat

Comment 17

•

14 years ago

Is it possible to get the code snippet that shows the performance difference between dereferencing based on ebp vs. ebx?

Chris Leary [:cdleary] (not checking bugmail)

Comment 18

•

14 years ago

(In reply to comment #17)
> Is it possible to get the code snippet that shows the performance difference
> between dereferencing based on ebp vs. ebx?

I don't have a representative snippet for that difference, unfortunately. It was a change to all of our jit-emitted code running on a big benchmark that saw the aggregate results; plus, after running it a bunch more times on my machine it appears the 1.2% I observed in comment 6 may be in the noise.

If you say there's no reason for a difference I will readily believe it. :-)
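One rough way to judge whether a small aggregate change such as the 1.2% from comment 6 is "in the noise" is to compare it against the run-to-run spread of repeated benchmark runs. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the inputs are lists of total benchmark times from repeated runs (at least two per build), and the check is a crude standard-deviation comparison rather than a proper significance test.

```python
import statistics

def relative_speedup(baseline_ms, patched_ms):
    """Return (speedup, noise), both as fractions of the baseline mean.

    speedup > 0 means the patched build was faster on average.
    noise is the larger of the two sample standard deviations.
    Each list needs at least two timings.
    """
    base_mean = statistics.mean(baseline_ms)
    patched_mean = statistics.mean(patched_ms)
    speedup = (base_mean - patched_mean) / base_mean
    noise = max(statistics.stdev(baseline_ms),
                statistics.stdev(patched_ms)) / base_mean
    return speedup, noise

def looks_like_noise(baseline_ms, patched_ms):
    """True when the measured change is no larger than the run-to-run spread."""
    speedup, noise = relative_speedup(baseline_ms, patched_ms)
    return abs(speedup) <= noise
```

Feeding it several runs of the unpatched and patched shells on the same machine would show whether a change of about 1% exceeds the variation between identical runs.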

Moh Haghighat

Comment 19

•

14 years ago

I checked with some of our architects. The expectation is that there should be no performance difference between ebp and ebx based referencing, provided that all references are changed correspondingly. This was also observed by gal in Comment #8. However, if "use of ebp instead of ebx" means freeing up ebp and using it as a general-purpose register (JSFrameReg), then, of course, you have an additional register, potentially fewer spills, etc., hence some performance improvement. But then, you'd need to take care of backtrace (as pointed out by Luke in Comment #3). For this, we've had bug 473494 ("TM: Spill ESP-relative and free up EBP for general use").

Nonetheless, if you have a case that shows strange performance behavior, let me know, and I'd be happy to help.

Nicholas Nethercote [inactive]

Comment 20

•

14 years ago

(In reply to comment #19)
> I checked with some of our architects. The expectation is that there should be
> no performance difference between ebp and ebx based referencing, provided that
> all references are changed correspondingly.

Good.  I was disturbed by the thought of it being otherwise.
