Closed Bug 537842 Opened 15 years ago Closed 13 years ago

TM: sink stack stores to side exits where possible

Categories

(Core :: JavaScript Engine, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

(Reporter: n.nethercote, Unassigned)

References

Details

This is a follow-on from bug 536630.  Quoth dvander:

"But yeah, in general all these stores to the stack on the fast path are just awful. We should be sinking stores to side exits or figuring out how to spill to only one stack and not two."

Quoth dmandelin:

"The idea has come up several times but we haven't done anything with
it yet. I think the argument against is that it will increase code size.
Whether the perf gains are worth the increase is something we probably can't
figure out until we actually do it."
Gregor, can you ask mbebenita to comment on this bug, and link some of the work he did for Hotpath? I remember he had a bunch of data and work on compressing the necessary side-exit metadata.
Blocks: 536630
I've tried all sorts of things to optimize these side exits. I'm not familiar with how they are implemented in TraceMonkey, so I'll describe what I've tried in Hotpath. In my world I have what are called Transfer instructions. These are treated like any other instruction, and can appear anywhere in the IR. Transfer instructions contain a snapshot of the stack state so they can transfer execution to the interpreter. In my case, I use an inline-threaded interpreter, so this may be a little different from TraceMonkey. My first attempt at compiling transfer instructions was to generate a transfer table during compilation. At runtime, the transfer instructions would call a transfer routine, which inspected the transfer table and restored the state of the interpreter. This seemed a bit slow, probably partly due to the overhead of calling the transfer routine and the unpredictable branch at the end of the transfer.
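
A minimal sketch of that table-driven scheme (all names here are invented for illustration; Hotpath's actual structures aren't shown in this bug):

    #include <cstdint>
    #include <vector>

    // Hypothetical value and machine-state types standing in for the
    // real ones; 'interpret' resumes the inline-threaded interpreter.
    struct Value {};
    struct MachineState { Value read(int32_t machineLoc) const; };
    void interpret(const uint8_t* pc, Value* vmStack);

    // One entry per live stack slot in a snapshot: where the value
    // lives in compiled code, and which VM stack slot it belongs in.
    struct TransferEntry {
        int32_t machineLoc;   // register number or spill-slot offset
        int32_t vmStackSlot;  // target slot on the interpreter stack
    };

    // Snapshot carried by one Transfer instruction, plus the
    // interpreter pc to resume at.
    struct TransferRecord {
        const uint8_t* resumePc;
        std::vector<TransferEntry> entries;
    };

    // The shared transfer routine: walk the table, rebuild the
    // interpreter's stack, then resume. The final dispatch is the
    // unpredictable branch mentioned above.
    void transferToInterpreter(const TransferRecord& rec,
                               const MachineState& ms,
                               Value* vmStack) {
        for (const TransferEntry& e : rec.entries)
            vmStack[e.vmStackSlot] = ms.read(e.machineLoc);
        interpret(rec.resumePc, vmStack);
    }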

My next attempt was to simply emit machine code at each transfer that restored the state. This generates quite a bit of code: 80% of the total code turned out to be transfer code. The amount of generated code is a constant multiple of what the table approach needs, since the generated code is simply the transfer table in an executable format.
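
In sketch form, the per-transfer code is just the table walk above unrolled into straight-line loads and stores (the Assembler interface here is invented; types come from the previous sketch):

    // Hypothetical assembler interface for these sketches.
    struct CodePtr { void* p; };
    struct Assembler {
        void loadFromMachineLoc(int32_t machineLoc, int scratchReg);
        void storeToVMStack(int scratchReg, int32_t vmStackSlot);
        void jumpToInterpreter(const uint8_t* resumePc);
        void jump(CodePtr target);
        void call(CodePtr target);
        CodePtr finalize();
    };
    constexpr int ScratchReg = 0;

    // Uncompressed transfer code: the same information as the transfer
    // table, emitted as executable instructions. Every snapshot entry
    // becomes a load/store pair, which is why transfer code came to
    // dominate total code size.
    void emitTransferStub(Assembler& masm, const TransferRecord& rec) {
        for (const TransferEntry& e : rec.entries) {
            masm.loadFromMachineLoc(e.machineLoc, ScratchReg);
            masm.storeToVMStack(ScratchReg, e.vmStackSlot);
        }
        masm.jumpToInterpreter(rec.resumePc);
    }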

To improve compilation time and code size I then looked at compressing the bailout code. My first attempt was to take advantage of the commonalities between adjacent transfer instructions' stack states. There is usually only a small difference between the two. This is especially true of transfers that appear in inlined methods. To compress transfer code, I treated transfer stack states as vectors of varying lengths: T1: <a, b, c>, T2: <a, b, d>, T3: <a, b, c, d>, etc. You'll notice that you can compress the transfer code of T2 by executing the transfer code for T1, then executing the delta transfer code for T2: <_, _, d>. I then treated these transfer states as nodes in a graph, and connected them with directed edges indicating the order of transfer code execution (T2 -> T1 in this example). The graph is computed using the Levenshtein distance, node distance, and other heuristics. This does a decent job compressing large transfer states, but since it introduces lots of calls in transfer code it can increase transfer code size for small transfer states. The calls also introduce an overhead; at the end of the day, it's a tradeoff between code size and performance. This is also somewhat complicated, hard to debug, and it takes time to compress the bailouts.
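
A sketch of the delta scheme for the T1/T2 example, assuming stubs are split so the restore part is separately callable (the graph construction via Levenshtein distance, which picks 'base', is not shown):

    // Compressed transfer code for 'cur' that reuses the stub already
    // generated for 'base' (e.g. T2 reusing T1). Execution order
    // matches the description above: run base's restore code first,
    // then overwrite/extend with the delta, e.g. <_, _, d>.
    void emitDeltaStub(Assembler& masm,
                       const TransferRecord& cur,
                       const TransferRecord& base,
                       CodePtr baseRestoreStub) {
        masm.call(baseRestoreStub);  // restores the shared part, e.g. <a, b, c>
        for (size_t i = 0; i < cur.entries.size(); i++) {
            bool coveredByBase =
                i < base.entries.size() &&
                base.entries[i].machineLoc == cur.entries[i].machineLoc &&
                base.entries[i].vmStackSlot == cur.entries[i].vmStackSlot;
            if (coveredByBase)
                continue;  // base's stub already restored this slot
            masm.loadFromMachineLoc(cur.entries[i].machineLoc, ScratchReg);
            masm.storeToVMStack(ScratchReg, cur.entries[i].vmStackSlot);
        }
        masm.jumpToInterpreter(cur.resumePc);
    }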

My next approach was to guess which transfers are frequent and which ones are not. Those that were unlikely to happen were handled by a transfer routine, and those that were frequent were handled using uncompressed transfer code. Of course, you can't always guess which are frequent and which are not, so I used some heuristics and counters.
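
The guess itself could be as simple as the following (the counter, the loop-exit bit, and the threshold are all invented stand-ins; the actual heuristics aren't described here):

    // Hypothetical per-transfer profiling state and a crude hotness
    // test. Cold transfers fall back to transferToInterpreter(); hot
    // ones get uncompressed transfer code via emitTransferStub().
    struct TransferProfile {
        uint32_t takenCount = 0;  // bumped each time the transfer fires
        bool isLoopExit = false;  // static hint: loop exits tend to be hot
    };
    constexpr uint32_t kHotTransferThreshold = 16;  // made-up number

    bool wantsInlineTransferCode(const TransferProfile& prof) {
        return prof.isLoopExit || prof.takenCount >= kHotTransferThreshold;
    }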

The current technique is to have only uncompressed transfer code for transfers. This is because I have join nodes in my CFG and I don't use trace trees, so I have a lot fewer transfers in my IR. If I were to do it yet again, I think I would try a transfer code cache. Each transfer would first be handled by a transfer routine, which would then generate transfer code (if the transfer is frequent enough) and patch the trace code to jump directly to the generated transfer code. This way you could minimize the amount of generated transfer code.
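
A sketch of that transfer-code cache, stitched together from the pieces above (patchGuardTarget and the stub/guardSite fields are invented):

    // Lazily compiled transfer code: every transfer starts out routed
    // through the slow table-driven routine; once hot, we generate a
    // dedicated stub and patch the trace's guard to jump straight to
    // it, so only hot transfers pay the code-size cost.
    struct CachedTransfer {
        TransferRecord rec;
        TransferProfile prof;
        CodePtr stub{nullptr};
        void* guardSite = nullptr;  // patchable jump in the trace
    };
    void patchGuardTarget(void* guardSite, CodePtr newTarget);

    void onTransfer(CachedTransfer& t, const MachineState& ms, Value* vmStack) {
        t.prof.takenCount++;
        if (!t.stub.p && wantsInlineTransferCode(t.prof)) {
            Assembler masm;
            emitTransferStub(masm, t.rec);          // from the earlier sketch
            t.stub = masm.finalize();
            patchGuardTarget(t.guardSite, t.stub);  // bypass this routine next time
        }
        transferToInterpreter(t.rec, ms, vmStack);  // slow path this time
    }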
Just to explain things a bit more, for my future reference if no-one else's...

We have to push lots of values to the stack (the positive-indexed part of it that corresponds to the VM stack) even though they're generally not needed by the LIR (because the value is in a LIR temporary, i.e. a register or spill slot).  We have to do this because if we take a side-exit we have to have the relevant values on the VM stack so that the interpreter can continue on with things.  But it's silly to always do these pushes when we don't take the side-exit.
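
A toy version of the proposed transformation (this is not TraceMonkey LIR; Ins and Trace are invented, and real code would need per-slot liveness tracking rather than the blanket "every later exit replays every pending store" rule used here):

    #include <vector>

    // Minimal IR: just enough to show stores moving off the fast path.
    struct Ins {
        enum Kind { StackStore, Guard, Other } kind;
        int exitId = -1;  // for Guard: which side exit it targets
        int slot = -1;    // for StackStore: which VM stack slot it writes
    };

    struct Trace {
        std::vector<Ins> body;                    // the fast path
        std::vector<std::vector<Ins>> exitStubs;  // one stub per side exit
    };

    // Sink VM-stack stores off the fast path: defer each store, and
    // replay all deferred stores at the head of every later exit's
    // stub, since any of those exits may need the value on the VM
    // stack before the interpreter resumes.
    void sinkStackStores(Trace& t) {
        std::vector<Ins> newBody;
        std::vector<Ins> pending;  // stores deferred so far
        for (const Ins& ins : t.body) {
            if (ins.kind == Ins::StackStore) {
                pending.push_back(ins);  // off the fast path
                continue;
            }
            if (ins.kind == Ins::Guard) {
                auto& stub = t.exitStubs[ins.exitId];
                stub.insert(stub.begin(), pending.begin(), pending.end());
            }
            newBody.push_back(ins);
        }
        t.body = std::move(newBody);
    }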

StackFilter gets rid of a lot of these, but there are still plenty left.  crypto-md5 is a particularly good example: it does lots of arithmetic and the stack gets very big -- we get up to sp[296].
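
For reference, the general shape of that kind of filtering (a from-memory sketch of the technique, not nanojit's StackFilter; it reuses the toy Ins type above): scanning backwards, a store to sp[k] is dead if the same slot is stored again before any guard can observe it.

    #include <set>

    // Backwards scan: 'overwritten' holds slots that a later store
    // already covers. A guard resets the set, because a side exit
    // may observe every slot.
    void filterDeadStackStores(std::vector<Ins>& body) {
        std::set<int> overwritten;
        std::vector<bool> dead(body.size(), false);
        for (size_t i = body.size(); i-- > 0; ) {
            Ins& ins = body[i];
            if (ins.kind == Ins::Guard)
                overwritten.clear();
            else if (ins.kind == Ins::StackStore &&
                     !overwritten.insert(ins.slot).second)
                dead[i] = true;  // shadowed by a later store to the same slot
        }
        std::vector<Ins> kept;
        for (size_t i = 0; i < body.size(); i++)
            if (!dead[i])
                kept.push_back(body[i]);
        body = std::move(kept);
    }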

w.r.t. the idea that this might increase code size -- I expect code size won't change.  The same number of stack stores will occur; this change would just get them off the fast path.  Hmm, but it will add register pressure because all of these stack values will be kept live longer -- until the next side-exit.
(In reply to comment #3)
> 
> w.r.t. the idea that this might increase code size -- I expect code size won't
> change.  The same number of stack stores will occur; this change would just get
> them off the fast path.  Hmm, but it will add register pressure because all of
> these stack values will be kept live longer -- until the next side-exit.

Ah, I missed a crucial point.  The same store may be live for multiple side-exits.
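To make that concrete with the toy sinking pass from comment 3 (same invented types): one store followed by two guards ends up copied into both exit stubs.

    // One on-trace store, two later side exits: after sinking, each
    // exit stub carries its own copy of the store, so the total store
    // count -- and hence code size -- can grow.
    void exampleStoreLiveAtTwoExits() {
        Trace t;
        t.body = { {Ins::StackStore, -1, /*slot=*/8},
                   {Ins::Guard, /*exitId=*/0},
                   {Ins::Guard, /*exitId=*/1} };
        t.exitStubs.resize(2);
        sinkStackStores(t);
        // t.exitStubs[0] and t.exitStubs[1] each now contain the store.
    }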
Obsolete with the removal of tracejit.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX