JM: Use %ebp for JSFrameReg instead of %ebx.

RESOLVED WONTFIX

Status

RESOLVED WONTFIX
8 years ago
8 years ago

People

(Reporter: sstangl, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

(Reporter)

Description

8 years ago
A few moments ago, I was discussing timing of x86 instructions with cdleary -- notably, that cmp %eax was faster than cmp %ebp, and that it makes sense to have the register allocator know these things.

We then had a moment of collective realization that %ebx is being used for JSFrameReg.

"But isn't %ebp normally used for the stack? If Intel is giving registers different behavior based on their use, isn't it sensible that dereferencing off of %ebp is faster than dereferencing off of %ebx?"

"That would be very silly if it were true."

So we measured. It turns out it's true. CC'ing cdleary, who ran the perf tests on his computer.
Created attachment 458059 [details] [diff] [review]
EBP patch.

Adds ebx to the set of temporary registers as well, which is another effect.

Comment 3

8 years ago
Out of curiosity, what was the speedup of just $ebp vs. $ebx?  I assume comment 2 includes the benefit of having the extra register.

Another question: if you break in gdb in C++ code called from method-jitted code, does this screw up the backtrace?

Comment 4

My machine is apparently a fanciful unicorn. We don't see the same results on other people's machines...

Good point in comment 3, sstangl said he'd test out the backtrace issue. That may be why it was excluded in the first place.
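
A toy model of why the backtrace is at risk, assuming the debugger falls back on the conventional saved-%ebp chain for JIT code that has no unwind info (an assumption about the unwinder, not something measured in this bug). All addresses, and the `walk_frame_chain` helper itself, are hypothetical and for illustration only.

```python
def walk_frame_chain(memory, ebp, max_frames=16):
    """Follow saved-%ebp links and collect the return addresses found."""
    return_addresses = []
    while ebp in memory and len(return_addresses) < max_frames:
        saved_ebp = memory[ebp]                    # [ebp]   = caller's %ebp
        return_addresses.append(memory[ebp + 4])   # [ebp+4] = return address
        if saved_ebp == 0:                         # 0 marks the outermost frame here
            break
        ebp = saved_ebp
    return return_addresses

# Conventional stack: every frame saved its caller's %ebp (made-up addresses).
stack = {
    0x1010: 0x1020, 0x1014: 0xCCCC,   # innermost frame
    0x1020: 0x1030, 0x1024: 0xBBBB,
    0x1030: 0x0000, 0x1034: 0xAAAA,   # outermost frame
}

# JIT code kept a JSFrame pointer (a heap address, 0x8000 here) in %ebp, so the
# C++ callee's prologue saved that value instead of a stack-frame address.
jit_stack = dict(stack)
jit_stack.update({0x1000: 0x8000, 0x1004: 0xDDDD})

print([hex(a) for a in walk_frame_chain(stack, 0x1010)])
print([hex(a) for a in walk_frame_chain(jit_stack, 0x1000)])
```

In the second walk the chain stops after the first frame because the saved value does not point at a stack frame; a real unwinder would more likely read whatever happens to live at that address.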
(Reporter)

Comment 5

8 years ago
Yes, it screws up the backtrace. Good point.

Comment 6

(In reply to comment #3)
> Out of curiosity, what was the speedup of just $ebp vs. $ebx?  I assume comment
> 2 includes the benefit of having the extra register.

Re-running it with just the switch to ebp shows 1.2%, but again, unicornocity must be taken into account.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WONTFIX

Comment 7

8 years ago
Oops, didn't mean to sound so somber; I applaud your micro-architectural zeal!

Comment 8

8 years ago
I am looking at the core i7 architecture diagrams. I can't see any reason why ebp and ebx should behave differently. On older core 1 architectures esp was a bit funky (it was co-located with pc, which is rarely read directly) and the optimization manual advises against using it as a general purpose register, but ebp and ebx should be identical. They are both renamed to internal registers randomly as instructions get decoded. But then, it's x86. Anything is possible.

Comment 9

8 years ago
(In reply to comment #8)
> On older core 1 architectures esp was a
> bit funky (it was co-located with pc, which is rarely read directly) and the
> optimization manual advises against using it as a general purpose register,

How much older?  Like Core 2 and Pentium 4, or stuff nobody uses?  Because, assuming esp is a fixed offset from ebp (which I vaguely remember it being) perhaps we could commandeer esp instead of ebp.

Comment 10

(In reply to comment #9)

All I can find in the Intel Opt Ref Manual is "When the ESP register is not used as the destination of an instruction (explicit ESP updates), an implicit ESP update will occur with instructions like PUSH, POP, CALL, RETURN. Mixing explicit ESP updates and implicit ESP updates will also lead to dependency between address generation and data execution." (12.3.2.2) Maybe gal can cite sources?

Comment 11

(In reply to comment #8)
> I am looking at the core i7 architecture diagrams. I can't see any reason why
> ebp and ebx should behave differently.

I felt similarly, having studied RATs and ROBs, but sstangl's flashy demo was fairly convincing.

Some reflection: TEST is recommended for RAX comparisons with an immediate constant in Intel's Opt Ref Manual 3.5.17 (which is what sstangl showed me), so they may have a smaller uop for comparison against rax.

Would be cool to get an Intel processor optimization pro to come talk.

Comment 12

8 years ago
3.4.2.6 Scheduling Rules for the Pentium M Processor Decoder. "Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET)."

B.5.2.3 talks about how to meter ESP synchronization events. I think the processor tracks the state of ESP implicitly in order to eliminate redundant push/pop instructions. I can't find a source for this offhand.

I also just randomly ran across a note that on Intel Atom you are not supposed to do ADD/SUB on ESP and use LEA instead (E.5).

Anyway, we should measure this stuff. ESP is a bit magic. I saw all sorts of weird effects when I added ESP-relative stack addressing to nanojit (we started losing a bit of perf because ESP-relative address modes are slightly longer than EBP-relative ones).
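
The size difference can be sketched with a small encoder for 32-bit loads of the form `mov disp(%base), %dst`. This is a minimal illustration of the standard ModRM/SIB rules, not nanojit's assembler; the `encode_load` helper is hypothetical. It only shows instruction length and says nothing about timing.

```python
# 3-bit register numbers used in x86 ModRM bytes.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_load(dst, base, disp=0):
    """Encode `mov disp(%base), %dst` (opcode 8B /r) with the shortest displacement."""
    if disp == 0 and base != "ebp":
        # mod=00 means "no displacement", except rm=101, which means disp32
        # with no base, so (%ebp) always needs an explicit displacement.
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        mod, disp_bytes = 0b01, disp.to_bytes(1, "little", signed=True)
    else:
        mod, disp_bytes = 0b10, disp.to_bytes(4, "little", signed=True)
    modrm = (mod << 6) | (REG32[dst] << 3) | REG32[base]
    # rm=100 does not select ESP directly; it announces a SIB byte.
    # SIB 0x24 = scale 1, no index, base ESP.
    sib = b"\x24" if base == "esp" else b""
    return bytes([0x8B, modrm]) + sib + disp_bytes

for base in ("ebx", "ebp", "esp"):
    for disp in (0, 8):
        code = encode_load("eax", base, disp)
        print(f"mov {disp}(%{base}),%eax: {code.hex(' ')} ({len(code)} bytes)")
```

With a small nonzero displacement, %ebx and %ebp encode to the same length and %esp costs one extra byte for the SIB. With a zero displacement, %ebp is one byte longer than %ebx.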

If I remember the ESP-nanojit bug correctly, it ended up being a wash on core 2 (my machine) but a slight slowdown on older machines.

Comment 13

(In reply to comment #12)
> I think the
> processor tracks the state of ESP implicitly in order to eliminate redundant
> push pop instructions. I can't find a source for this off hand.

Yeah, there's a section on this in the manual as well, called "ESP folding".
(Reporter)

Comment 14

8 years ago
cmp $0xf,%eax has the following machine-code encoding:
4004d6:       3d 0f 00 00 00          cmp    $0xf,%eax

cmp $0xf,%edi has the following machine-code encoding:
400514:       81 ff 0f 00 00 00       cmp    $0xf,%edi
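
The two encodings above can be reproduced with a short sketch of the relevant rules: opcode 3D compares a 32-bit immediate against EAX implicitly, while every other register goes through the general 81 /7 form and needs a ModRM byte. The `encode_cmp_imm32` helper is hypothetical and always emits a 32-bit immediate to match the disassembly; an assembler is also free to pick the shorter sign-extended 8-bit form (83 /7) for a constant as small as 0xf.

```python
import struct

# 3-bit register numbers used in x86 ModRM bytes.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_cmp_imm32(reg, imm):
    """Encode `cmp $imm, %reg` using a full 32-bit immediate."""
    imm_bytes = struct.pack("<I", imm & 0xFFFFFFFF)
    if reg == "eax":
        # Short form: opcode 3D takes EAX implicitly, so there is no ModRM byte.
        return bytes([0x3D]) + imm_bytes
    # General form: opcode 81 /7, ModRM = mod 11, reg field 7 (CMP), rm = register.
    modrm = 0xC0 | (7 << 3) | REG32[reg]
    return bytes([0x81, modrm]) + imm_bytes

for reg in ("eax", "edi"):
    code = encode_cmp_imm32(reg, 0xF)
    print(f"cmp $0xf,%{reg}: {code.hex(' ')} ({len(code)} bytes)")
```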

The timing benchmark does 128 of these operations in a loop a few million times. So immediate comparisons with %eax/%rax should be expected to be faster, even without looking into microcode.

Comment 15

Moh, any insights you can share would be greatly appreciated.

/be

Comment 16

8 years ago
I'll investigate and get back to you.

-moh

Comment 17

8 years ago
Is it possible to get the code snippet that shows the performance difference between dereferencing based on ebp vs. ebx?

Comment 18

(In reply to comment #17)
> Is it possible to get the code snippet that shows the performance difference
> between dereferencing based on ebp vs. ebx?

I don't have a representative snippet for that difference, unfortunately. It was a change to all of our jit-emitted code running on a big benchmark that saw the aggregate results; plus, after running it a bunch more times on my machine it appears the 1.2% I observed in comment 6 may be in the noise.

If you say there's no reason for a difference I will readily believe it. :-)

Comment 19

8 years ago
I checked with some of our architects. The expectation is that there should be no performance difference between ebp and ebx based referencing, provided that all references are changed correspondingly. This was also observed by gal in Comment #8. However, if "use of ebp instead of ebx" means freeing up ebp and using it as a general-purpose register (JSFrameReg), then, of course, you have an additional register, potentially fewer spills, etc. - hence some performance improvement. But then, you'd need to take care of the backtrace (as pointed out by Luke in Comment #3). For this, we've had bug 473494 ("TM: Spill ESP-relative and free up EBP for general use").

Nonetheless, if you have a case that shows strange performance behavior, let me know, and I'd be happy to help.

Comment 20

(In reply to comment #19)
> I checked with some of our architects. The expectation is that there should be
> no performance difference between ebp and ebx based referencing, provided that
> all references are changed correspondingly.

Good.  I was disturbed by the thought of it being otherwise.