Status: RESOLVED WONTFIX (defect)
Opened: 10 years ago; last updated: 5 years ago

People: (Reporter: gal, Assigned: dvander)

Tracking: Blocks 1 bug; Trunk

Firefox Tracking Flags: (Not tracked)

Attachments: (2 attachments, 5 obsolete attachments)

Reporter

Description

10 years ago
We are currently unable to trace loops with a very branchy interior. Example:

for (...)
  if (...)
    if (...)
      if (...)
        if (...)
          if (...)

Such if cascades cause a combinatorial explosion through the inherent tail splitting of trace recording.

The proposed solution is to estimate when a loop becomes "too branchy" by counting the number of merge points we have come across. We already track merges in cfgMerges whenever we hit an if block (which indicates the merge point in its source note). When we actually hit the merge point, we bump a counter. Between branch traces the counter is handed over through the side exit. If the counter hits a threshold, we terminate the loop with endLoop as if we hit a break; statement, and then immediately restart recording. We call this new form of trace a merge trace.
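As a rough illustration of the counting policy described above (this is a hypothetical sketch, not the actual TraceMonkey recorder; `MERGE_THRESHOLD` and `segmentTrace` are invented names for illustration):

```javascript
// Hypothetical sketch of the merge-counting policy -- not real TraceMonkey
// code. `events` is the stream of points the recorder passes over; once the
// merge counter reaches the threshold we terminate the current trace (as if
// we hit a break;) and immediately restart recording a new merge trace.
const MERGE_THRESHOLD = 3; // assumed cut-off; tunable in practice

function segmentTrace(events) {
  const traces = [];
  let current = [];
  let merges = 0; // conceptually handed over through the side exit
  for (const ev of events) {
    current.push(ev);
    if (ev === "merge" && ++merges >= MERGE_THRESHOLD) {
      traces.push(current); // terminate the trace here...
      current = [];         // ...and restart recording from this point
      merges = 0;
    }
  }
  if (current.length) traces.push(current);
  return traces;
}
```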

During recording, whenever we hit a merge point, we check whether we already have an existing merge trace with matching types, and if so we terminate the trace we are recording, and link it to the merge trace.
(In reply to comment #0)
> The proposed solution is to estimate when a loop becomes "too branchy" by
> counting the number of merge points we have come across. 

This seems like a reasonable strategy but the explanation is kind of unclear to me. In particular, a loop can become too branchy without having any merge points at all (nested ifs 10 level deep -> 1024 paths, with only one merge point). Merge points seem to be more precisely an indicator of how much tail duplication is happening.

> We already track merges in cfgMerges whenever we hit an if block (which
> indicates the merge point in its source note). When we actually hit the merge
> point, we bump a counter. Between branch traces the counter is handed over 
> through the side exit. If the counter hits a threshold, we terminate the loop 
> with endLoop as if we hit a break; statement, and then immediately restart 
> recording. We call these new form of trace a merge traces.

Restating the above (let me know if I got it wrong): we have a positive integer parameter M. We track the number of unique merge points that were passed in the recording of each tree. When recording hits the Mth (or greater) merge point for a tree, recording stops and merge traces are started from that point.

Now for some math. The key construct that leads to tail duplication is just this:

  if (...) {
    ... [A] // straight-line code
  }
  ... [B]   // straight-line code

For [A], we can record 2 traces. If we have both, then in the current design we record 2 copies of [B]. If we have 2 simple if statements following each other, then we get 2 copies of the second if and 4 of the following code. More generally, we have:

  branch [N1,S1] (...) {
  }
  branch [N2,S2] (...) {
  }
  ...

"branch [N,S]" means a block of code that has N paths through it that reach a common point at the end, with the total size of the N paths being S (in arbitrary size units). Currently, the size of the traces recorded here is:

  S1 + N1*S2 + N1*N2*S3 + ... + N1*...*N(k-1)*Sk

Presumably the last term often dominates, but variable values of S could change that. 
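The size expression above can be sketched as a small helper (illustrative only; `recordedSize` is an invented name, and `blocks` uses the "branch [N,S]" notation as pairs):

```javascript
// Illustrative calculator for the trace-size expression above:
//   S1 + N1*S2 + N1*N2*S3 + ... + N1*...*N(k-1)*Sk
// `blocks` is a list of [N, S] pairs in the "branch [N,S]" notation.
function recordedSize(blocks) {
  let total = 0;
  let copies = 1; // running product N1*...*N(i-1): copies of block i recorded
  for (const [n, s] of blocks) {
    total += copies * s;
    copies *= n;
  }
  return total;
}
```

For three simple ifs of unit size (N=2, S=1 each) this gives 1 + 2 + 4 = 7 size units, showing how the last term comes to dominate.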

In the strategy above, we limit the number of terms we get in a row. So if M is 3, for example, we have:

  (S1 + N1*S2 + N1*N2*S3) + (S4 + N4*S5 + N4*N5*S6) + ...

If (N1*N2*S3) (and related terms) are "small enough", then this produces acceptable results. But it seems to me this strategy could fall down if the Ns or the Ss are highly variable. If they are highly variable, I think it would be better to consider instead a measure like this:

  (number of paths coming in to merge point - 1) * (size of paths going out)

This is an estimate of how much tail-duplication gain we achieve by merging at this point. Note that both inputs also have to be estimated: we don't know how many traces we will end up recording that reach the merge point, and we don't necessarily know how long the traces going out are.
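Written out, the proposed measure is trivially small (a sketch; `mergeGain` is an invented name, and both arguments would have to be estimated in a real recorder):

```javascript
// Rough estimate (as proposed above) of the tail duplication saved by
// merging at a given point. Without merging, each of the pathsIn traces
// carries its own copy of the outgoing tail; merging keeps exactly one.
function mergeGain(pathsIn, sizeOut) {
  return (pathsIn - 1) * sizeOut;
}
```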

It would be nice to have some samples of extra branchy stuff to look at when designing the algorithm. That way, we could mathematically model the effect of different strategies relatively easily.

There are a couple of less fundamental points I also wonder about:

 - Merge points at the top level (not contained in any other branch nest) seem different from nested merge points. It's easy to understand the benefit of merging at top-level merge points and easy to see how they will look in the trace tree; nested merge points seem more confusing. I would suggest at first restricting ourselves to top-level merge points, unless someone has an easy way of understanding them.

 - Do we plan to select merge points (for starting merge traces) that already have traces going through them, or will we select merge points only on the first time recording reaches them?
I wonder whether there are cases (e.g. switch) where it might be worthwhile setting M = 1.  That is, always merge after a switch.  I can file a separate bug on this if desired...  It seems like it might be more tractable than the general case and might give us implementation experience and immediate wins on some of these emulator "benchmarks"...
Oh, and one more thing regarding the measure used to determine gain....  What matters is not only the codesize reduction from eliminating duplication but also the branch number reduction (due to the existence of MAX_BRANCHES).  So even if there's not that much code in the tails, if they're executed often we might have a big win from merging just due to being able to trace all the codepaths to start with.
I am thinking aloud in order to clarify what this bug means; let me know if I'm wrong.

The idea, if I get it, has not so much to do with handling *nested* ifs (which only produce depth-of-nesting copies of the tail) but, as dmandelin points out, *sequential* ifs, that produce exponential number of copies of the tail.

(Though if so I do not see why he says "nested ifs 10 level deep -> 1024 paths", as a 10-level-deep nested if block seems to me to only produce 10 copies of its tail; you can only enter the Nth nesting if the (N-1)th nesting was taken. Sequential branching causes *both* branches to get a copy.)

Anyway, assuming that was a typo and we're really only concerned about exponential explosion due to *sequential* branches, in the following code:

if (A) { P }
if (B) { Q }
if (C) { R }
S

We're trying to avoid making 2^3=8 copies of S:

   S
  RS
 P S
 QRS
P  S
P RS
PQ S
PQRS
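
The eight copies above are just the 2^3 subsets of {P, Q, R}, each followed by S. A tiny enumerator (illustrative only; `tailCopies` is an invented name) makes that concrete:

```javascript
// Illustrative: enumerate the tail copies produced by k sequential ifs.
// Each path through `if (A){P} if (B){Q} if (C){R} S` picks a subset of
// [P, Q, R] and always ends in S, giving 2^3 = 8 distinct traces.
function tailCopies(bodies, tail) {
  const paths = [];
  for (let mask = 0; mask < (1 << bodies.length); mask++) {
    let path = "";
    bodies.forEach((b, i) => { if (mask & (1 << i)) path += b; });
    paths.push(path + tail);
  }
  return paths;
}
```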

And instead come up with a means of emitting:

Loop-header +       Merge-point +
branch-traces       branch-traces
------------------------------------
| P  |               | R S |
| Q  | ---exit--->   | S   |
| PQ |

IOW, bound the sequential-branchiness (exponential-size) of any path at some (small) number and *convert* it to nested-branchiness (linear-size) of linked "merge traces".

If so, is the idea that "a tree" will now consist of the 'main' trace, plus all of its direct branch fragments, plus all of the 'main' merge-point fragments, plus all of the branch fragments leading off all the merge-points? You say "call closeLoop and immediately resume recording", but you can't really mean "closeLoop", because it does a whole mess of type-stabilization stuff and it assumes it's actually at a loop edge. Something like closeLoop, though, that produces a new root for subsequent branch traces, but is considered a part of the main tree? What happens on that 'exit' edge? Is it essentially the same as a branch exit?

If so, I'm confused. In the diagram above, I glossed over the fact that only *one side* of a trace like PQ is actually "straight-line" (on-trace) code; the other side P(!Q) actually consists of guard on Q and a branch exit already. If a merge-point-exit is going to be, more or less, identical to a branch exit ... why are we inventing a new idea?

IOW, why manufacture a new mechanism of tracking "merge points" when it's perfectly well represented already by the symmetrical concept "branch points"? Every branch point implies a merge point, right? Are we (logically) doing anything other than "forcing every Nth branch to be a branch-exit on *both* its taken- and not-taken edges"?
I don't understand how merging traces, which happens after the branchiness, will help us avoid hitting our heads on MAX_BRANCHES, which would happen at the branchiness.  Are we proposing to do something at the _head_ of traces, as well as merging the tails?  Am I totally missing it?
Reporter

Comment 6

10 years ago
if 
  if
    (*) if
          a
        else
          b

Think of it as forming trees that do not start in a loop header but inside a loop. From the original tree header individual traces form a tree, but if they touch (*) they just end and cross-jump to that tree. Hence the innermost if with its individual possible branches no longer causes additional tail duplication.
OK, yeah, that's what I thought.  But I don't get comment #3 then, still, because MAX_BRANCHES cares about profusion of "trace heads", not "trace tails", right? For a given location, this merging won't keep us from hitting MAX_BRANCHES where we would otherwise, will it?
Reporter

Comment 8

10 years ago
It will. For the case above, you can fully cover the tree with maxbranches=4, since you have 4 in the tree to the left and 2 in the tree to the right. Otherwise you would need 8.
Mike, simple example of how this can help with max branches is code like this:

switch (cond) {
  /* 10 cases */
}
switch (cond) {
 /* 10 cases */
}

If we merge after the first switch, we get a maximum of 20 branches here, right?  If we don't we get 100.
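The branch-count arithmetic here is just multiply-versus-add (an illustrative sketch; `branchesWithoutMerge` and `branchesWithMerge` are invented names):

```javascript
// Illustrative branch-count arithmetic for the two-switch example above.
// Without merging, every path through switch #1 duplicates switch #2, so
// counts multiply; merging after each switch makes them add instead.
function branchesWithoutMerge(caseCounts) {
  return caseCounts.reduce((acc, n) => acc * n, 1);
}
function branchesWithMerge(caseCounts) {
  return caseCounts.reduce((acc, n) => acc + n, 0);
}
```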
Hm. Let me try again! I was actually hoping for some clarification in comment #4. In particular the last paragraph:

IOW, why manufacture a new mechanism of tracking "merge points" when it's
perfectly well represented already by the symmetrical concept "branch points"?
Every branch point implies a merge point, right? Are we (logically) doing
anything other than "forcing every Nth branch to be a branch-exit on *both* its
taken- and not-taken edges"?

(I understand that even doing so requires new *policy* concerning the formation of branches; I'm wondering about the need for new mechanisms though.)
If I understand comment #10 correctly, we have a limitation in that the register/spill state might not be compatible from a branch-exit to a merge point. We have to spill everything and reload it. It might not be worthwhile for every branch exit to also be a merge point (for example, shape guards) or it might not be worth recording such new traces unless they are very hot. If we could immediately jump into the middle of a trace, maybe there wouldn't be a problem.
Reporter

Comment 12

10 years ago
#11: I am working on improving that situation. We currently (a) spill all registers and (b) move all data from spill slots into the native stack. I will gradually undo both limitations. I don't think we want all merge points as potential branch targets, since in that case we can't do dead code elimination across basic blocks any more.

Comment 13

10 years ago
Why Platform tag is "Mac OS X"? Shouldn't it be "All" as per bug 509986?
Reporter

Comment 14

10 years ago
Bugzilla sets the default platform/architecture based on the browser UA info, and I filed the bug with Firefox running on Mac OS X. Fixed now.
OS: Mac OS X → All
Hardware: x86 → All
recorder: started(1), aborted(0), completed(33), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(0), breaks(0), returns(0), unstableInnerCalls(0), blacklisted(0)
monitor: triggered(99205), exits(99205), type mismatch(0), global mismatch(0), flushed(0)

real	0m0.137s
With an experimental patch for merge tracing:

recorder: started(1), aborted(0), completed(24), different header(0), trees trashed(1), slot promoted(0), unstable loop variable(0), breaks(0), returns(0), merged loop exits(1), unstableInnerCalls(0), blacklisted(0)
monitor: exits(25), timeouts(0), type mismatch(0), triggered(25), global mismatch(0), flushed(0)

real	0m0.028s
Posted patch WIP v1 (obsolete) — Splinter Review
This is an extremely hacky patch that I used to experiment. Applies on top of bug 524620 and then bug 525371.
Assignee: general → dvander
Status: NEW → ASSIGNED
fwiw, that crashes on loading http://benfirshman.com/projects/jsnes/ :

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000004
0x003fe8fe in TraceRecorder::dataAlloc (this=0x1c2b23a0) at jstracer.h:1291
1291        VMAllocator& dataAlloc() const { return *traceMonitor->dataAlloc; }

traceMonitor is null.

#0  0x003fe8fe in TraceRecorder::dataAlloc (this=0x1c2b23a0) at jstracer.h:1291
#1  0x003c85f9 in TraceRecorder::outOfMemory (this=0x1c2b23a0) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jstracer.cpp:2596
#2  0x003f89d6 in TraceRecorder::monitorRecording (cx=0x1370e00, tr=0x1c2b23a0, op=JSOP_IFEQ) at /Users/bzbarsky/mozilla/tracemonkey/mozilla/js/src/jstracer.cpp:7164
#3  0x003043af in js_Interpret (cx=0x1370e00) at jsops.cpp:79
(In reply to comment #18)

Thanks, I messed up someone's lifetime somewhere.
Using Date.now() to benchmark --

JIT without merge traces: 1143ms
Interpreter:               729ms
v8:                         81ms

With merge traces, the results are very dependent on the cut-off heuristics. Here's a little table: the first column is how many IFEQs result in a merge point, and the second is the resulting benchmark time.

| IFEQ cutoff | Time  |
|-------------|-------|
| 5           | 710ms |
| 4           | 662ms |
| 3           |  72ms |
| 2           |  83ms |
Posted patch WIP v2 (obsolete) — Splinter Review
The approach used in comment #20 meant merge traces could easily have overlapping regions, so the higher cfgDiversions gets, the more needless merge traces there will be. Now IFEQ will always join to a compatible merge trace if one exists.

So here are new statistics on the same benchmark:
|-------------|------|-------|
| IFEQ cutoff | Time | Exits |
|-------------|------|-------|
| 2           | 73ms | 35    |
| 3           | 66ms | 51    |
| 4           | 58ms | 77    |
| 5           | 55ms | 107   |
|-------------|------|-------|

I am not sure what's going on with the time-versus-exit-count pattern there.

Will test on SunSpider as well to see the effect of merge traces on code with a small number of branches.
Attachment #409836 - Attachment is obsolete: true
An explanation for the disparity is probably that transitioning between traces is slow (comments 11, 12). So with a lower cut-off, the traces are less branchy, but the generated code ends up being not as good.
Posted patch fixes crash bz found (obsolete) — Splinter Review
Attachment #410113 - Attachment is obsolete: true
Posted patch WIP v3 (obsolete) — Splinter Review
It seems like it's not valid to key merge points on the PC, call depth, and typemap - we actually need to add each PC of the call chain into the mix. Otherwise a function could be inlined twice, and for one callsite to merge to the other would be invalid. Please correct me if this thinking is wrong, Andreas.

This patch addresses that issue. The only opcode supported is still JSOP_IFEQ - more will come soon.
Attachment #410338 - Attachment is obsolete: true
Reporter

Comment 25

10 years ago
If we start at the same tree anchor, and maintain merge traces per tree, then if we inline twice callDepth should be different.
(In reply to comment #25)
> If we start at the same tree anchor, and maintain merge traces per tree, then
> if we inline twice callDepth should be different.

I am thinking of a scenario like:

for (;;) {
   f();
   g();
   f();
}

Now f() has been inlined twice, in each case callDepth == 1.
Reporter

Comment 27

10 years ago
Right ... mhm. Yeah, ok, I concede. So it's the hash of all PCs along callDepth.
Posted patch handles more opcodes (obsolete) — Splinter Review
Handles all ifop() and fused-if opcodes.
Attachment #410418 - Attachment is obsolete: true
Posted patch WIP v5Splinter Review
Fixed some serious perf regressions. The heuristics are still not good, or something is missing. SunSpider gets a slight win in 3d-cube (1% overall), but the NES benchmark slows down noticeably. Still investigating.
Attachment #410429 - Attachment is obsolete: true
(In reply to comment #10)
> Hm. Let me try again! I was actually hoping for some clarification in comment
> #4. In particular the last paragraph: [..]

I must say I find the terminology confusing too.  In particular I
don't see why we need to speak about either merge points or branch
points.

My impression from talking to dvander was that this is a simple
mechanism.  As we record, we may at any point decide to truncate the
trace and force a continuation on a common successor trace.  The
reason for the truncation can be completely arbitrary, but presumably
would be along the lines of a judgement that the probability of still
actually being on trace at this point had become unacceptably low.

And the main complication is reconciling the stack frame state for all
the various control flow edges reaching the successor trace.

Is that correct?  Did I miss something?

Updated

10 years ago
Blocks: 535925

Updated

10 years ago
No longer blocks: 535925

Updated

10 years ago
Blocks: 530956

Comment 31

9 years ago
Is this something that will become irrelevant with the JaegerMonkey JS engine?
Not necessarily; it's still an interesting topic, but there really isn't pressure to research & implement it anymore.

Comment 33

9 years ago
Being able to waste some time on JSNES isn't sufficient pressure? ;)

Comment 34

9 years ago
Giant leap in JSNES' fps with the landing of JaegerMonkey, congrats to the developers.

For the record, on my machine (2g ram, intel core 2 CPU 3GHz), Chrome is still faster, running at a constant 58fps, while Firefox varies from 40fps to 58fps.
(In reply to comment #34)
> Giant leap in JSNES' fps with the landing of jaegermonkey, congrats to
> developers.
> 
> For the record, on my machine (2g ram, intel core 2 CPU 3GHz), Chrome is still
> faster, running at a constant 58fps, while Firefox varies from 40fps to 58fps.

What happens if you set in the about:config prefs

  javascript.options.tracejit.content = false

Does the frame rate still vary, or is it steady with that setting?

Comment 36

9 years ago
Frame rate still varies after switching tracejit off. It's also not boosting or negatively impacting the fps.
This bug is obsolete with the removal of the tracejit. Bug 509986 is still open for tracking JSNES performance in SM.
Status: ASSIGNED → RESOLVED
Last Resolved: 8 years ago
Resolution: --- → WONTFIX

Comment 38

8 years ago
So TM is gone yes ? All that work ! What a shame if things didn't work out. Is there a blog anywhere ?
(In reply to Duncan Loveday from comment #38)
> So TM is gone yes ? All that work ! What a shame if things didn't work out.
> Is there a blog anywhere ?

http://blog.mozilla.com/nnethercote/2011/11/23/memshrink-progress-report-week-23/

It also explains why the work on TM is not (entirely) wasted.