Closed Bug 510554. Opened 15 years ago; closed 13 years ago.

TM: Remove parent guard at function calls

Categories: Core :: JavaScript Engine, defect
Severity: normal
Status: RESOLVED WONTFIX
People: Reporter: dmandelin; Assigned: dmandelin
Attachments: 1 file, 5 obsolete files

The function |guardCallee| is used to generate the guards that go before a function call. It generates two guards:

  (a) on the function identity
  (b) on the identity of the function's parent (i.e., its scope chain)

Guard (b) will cause problems for tracing certain closure patterns. (See bug 495331 comment 1, second half.) Getting rid of it should also give a modest overall speed boost.
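To make guard (b) concrete, here is a plain-JS illustration (my sketch, not tracer code): one compiled function can be instantiated with many different parents, and only the parent distinguishes the upvar bindings.

```javascript
// Both closures share one JSFunction (guard (a) would pass), but each has a
// different parent Call object (guard (b) would not pass across instances).
function make(x) {
  return function () { return x; };
}

var g1 = make(1);
var g2 = make(2);
// Same code, different scope chains: g1() reads a different 'x' than g2().
```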
I tried it out, along with JSOP_NAME for general closures, on try and it seems OK. Now I'll try to prove it.
Step 1 is to check for any callee parent addresses that get hardcoded into the trace. I only found 2 suspicious things:

(a) |setProp| passes an object parent address as |obj2| to |guardPropertyCacheHit|. |obj2| can then be hardcoded into LIR. But this path is not executed if we are setting a property of a call object, so I don't think it matters.

(b) callNative can use a hardcoded callee parent address as |this_ins|. Thus, it seems like we should keep the parent guard for callNative. That shouldn't hurt upvar-type stuff.
Here is a test case that will go wrong without the callee guard:

    function esc(g) {
      //  dis(g);
    }

    function f(a, i, j) {
      var k = 3;
      var g = function() {
	print(k);
      }
      var k = 10 + i;
      esc(g);

      a[i] = g;

      for (var i = 0; i < 5; ++i) {
	a[j]();
      }
    }

    var a = [];
    f(a, 1, 1);
    f(a, 2, 2);
    f(a, 3, 3);

    f(a, 1, 3);

Analysis: |f| creates a version of |g| that should read and print a value derived from |i| (namely 10 + i), and then calls the |g| in position |j| in the array. The first 3 calls to |f| set up the array so that the |g| in position |j| reads and prints 10 + j. 

Because |g|'s upvar access refers to a frame that is active and in range, the tracer just reads the value directly out of the tracer native frame. But in the final call |f(a, 1, 3)|, the tracer native frame has a 1, but the |g| that we call should refer to a 3. Since we are reading directly out of the tracker, we read the 1 and lose. I have confirmed this behavior with my returned-closure and eliminate-parent-guard patches applied.
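For reference, here is a print-free variant of the test case (|out| is my scaffolding, collecting what each |g| would have printed). It shows the semantics the trace must preserve: the final |f(a, 1, 3)| must observe the 13 captured by |f(a, 3, 3)|, not the 11 sitting in the active frame.

```javascript
var out = [];

function esc(g) {}   // escape hatch: keeps g from being a null closure

function f(a, i, j) {
  var k = 3;
  var g = function () { out.push(k); };
  k = 10 + i;
  esc(g);
  a[i] = g;
  for (var n = 0; n < 5; ++n) {
    a[j]();
  }
}

var a = [];
f(a, 1, 1);   // records 11 five times
f(a, 2, 2);   // records 12 five times
f(a, 3, 3);   // records 13 five times
f(a, 1, 3);   // must record 13, not the active frame's 11
```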

I believe this problem occurs only when |g| is not a null closure. If |g| is a null closure, then |g| does not escape, so the only |g| we could be calling is one created inside |f|, and its proper parent must be the same |f| active on trace entry.

I don't think the problem is too hard to solve. The basic idea is that we have to make sure to call the builtin (which knows everything about native stack frames, interpreter frames, etc.) if there is a chance that the tracker's version of the upvar is not the right one. The easiest approach is to call the builtin whenever the lambda is not a null closure. It might be possible to optimize a bit more by generating a LIR fast path for the case where the callee matches the record-time callee. So far with upvars, the key thing is getting them to trace at all; the efficiency of the access paths matters much less, because the speedup from getting all the rest of the code to trace is much greater than the slowdown from reading the upvar in a dumb way.
Depends on: 505591
A proofoid: First, it's clear that flat closures have no problems here, because they don't refer to their parent function in any way after creation. We also know that any upvar access code that traverses the scope chain itself is fine. The only issue is upvar access code that simply refers to a native stack element from the tracker.

Case 1. Null closures. These call TraceRecorder::upvar, which refers to a tracker value if the upvar was imported into the trace or defined on trace. Thus, we have subcases:

Case 1.1. Null closure, upvar imported. The sequence of events must be this ([i] means that event happened in the interpreter, [t] on trace):

  [i] enter function 'f'
  [i] define upvar 'x' in 'f'
      import upvar 'x'
      enter trace
  [t] call function 'g'
  [t] refer to upvar 'x'

Furthermore, g must be defined (created) before it is called. Also, g must be defined after f is entered, because g is a null closure (it cannot escape, so it must be defined in the same 'f' it is called in). Therefore, g's parent must be 'f', which is the same function active at trace start, so the trace stack area contains the correct definition of x.

Case 1.2. Null closure, upvar defined on trace. The sequence must be:

      enter trace
  [t] enter function 'f'
  [t] define upvar 'x' in 'f'
  [t] call function 'g'
  [t] refer to upvar 'x'

By the same reasoning as in case 1.1, g must be defined after f is called and before g is called. And again, g must have this f as parent, and so the trace stack area contains the right x.

This completes the proof for case 1.
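Case 1.1 in plain JS, for concreteness (the [i]/[t] annotations from the event list appear as comments; the trace boundary itself is invisible at the language level):

```javascript
function f() {
  var x = 7;                        // [i] define upvar 'x' in 'f'
  function g() { return x; }        // null closure: g never escapes f
  var sum = 0;
  for (var i = 0; i < 100; ++i) {   // loop gets hot; 'x' imported, trace entered
    sum += g();                     // [t] call 'g'; [t] refer to upvar 'x'
  }
  return sum;
}
```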

Case 2. Non-heavyweight closures. These call TraceRecorder::callProp. In this case, we have no constraint on where the called function was defined. Even if the upvar-defining function was on the stack and part of the trace at record time, a different instance could be seen at run time. So for general closures, we must never rely on being able to read the value out of the tracker state.
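A minimal sketch of the case-2 hazard: one call site, two instances of the same JSFunction, so the record-time instance and the run-time instance need not match.

```javascript
function makeReader(v) {
  return function () { return v; };   // escapes: not a null closure
}

var fns = [makeReader(10), makeReader(20)];
var seen = [];
for (var i = 0; i < fns.length; ++i) {
  seen.push(fns[i]());   // same compiled function, different upvar bindings
}
```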
Remarks:

- The basic strategy suggested by the proof above is:

  - Remove callee parent guards for interpreted function calls
  - Leave callee parent guards in place for native function calls
  - Get upvars out of the tracker only for null closures. General closures
    must use a builtin to read that checks the callee parent.

- Callee parent guards for natives could be a problem -- if the native is inside a closure, this will cause us to leave trace a lot. We should probably not use constant callee pointers if the native call's function object may have varying parents.

- The test in comment 3 shows that we can't ever simply refer to a tracked value for general closures. Referring to the tracked value is presumably faster (but not interestingly so, for now). We might want to do that someday for assignments to upvars, which would require allowing them to be null closures (if they are non-escaping).
Attached patch Patch (obsolete) — Splinter Review
Assignee: general → dmandelin
Status: NEW → ASSIGNED
I discovered a new problem here. I suspect it was exposed by dvander's fix to the upvar frame reading, but that doesn't really matter. Consider this:

function f() {
  var count = 0;

  function g() {
    ++count;
    print(count);
  }

  for (var i = 0; i < 10; ++i) {
    g();
  }
}

f();

Here,

    ++count        uses incname
    print(count)   uses getupvar

If we use the "existing" code (i.e., my current patches), both bytecodes will work through the tracker. And this doesn't cause trouble, because the parent guards ensure that the functions exactly match their record-time structure.

If I try to remove the parent guard, then I have to make the incname not use the tracker. (Reading out of the tracker is valid only if we know we are in the same function.) But then what do we do for the set portion of incname? In principle, we need to do this:

  - if the parent is the same one as at record time, then write into the native stack area. This also requires updating the tracker so the subsequent getupvar (or any other such op, including reads of count inside f) gets the right value.

  - if the parent is not the same as at record time, then call the interpreter builtin. Do not update the tracker, since the value has not changed.

So we don't know if we need to update the tracker until we run the trace, but of course we have to make that decision while recording. Bad.
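A plain-JS sketch of why the set half is the hard part (|hooks|, |log|, and the selector argument are scaffolding for illustration): the write through the escaped closure may or may not target the activation whose frame is in the tracker.

```javascript
var hooks = [];
var log = [];

function f(sel) {
  var count = 0;
  var bump = function () { ++count; };  // escapes: may outlive this activation
  hooks.push(bump);
  for (var i = 0; i < 3; ++i) {
    hooks[sel]();       // might be this activation's bump, or an earlier one's
    log.push(count);    // this read sees the write only if it was ours
  }
}

f(0);  // calls its own bump: count goes 1, 2, 3
f(0);  // calls the first activation's bump: this count stays 0
```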

I'm still kind of confused about this, but I think these ideas might help:

1. If the closure that sets the upvar is non-escaping, then we can go ahead and use the tracker all the time, because there is then no way for another instance of the closure to get here (i.e., the parent is always the same). I think.

2. Otherwise, it seems we need to generate a new load (from the native stack area) for a tracked variable that is weakly updated by a builtin (i.e., may or may not be updated).
(In reply to comment #7)
> 1. If the closure that sets the upvar is non-escaping, then we can go ahead and
> use the tracker all the time, because there is then no way for another instance
> of the closure to get here (i.e., the parent is always the same). I think.

Have no doubts. The non-escaping case is able to use the display in the interpreter, and so long as tracing inlines calls the tracker has the right LIR pointers.

> 2. Otherwise, it seems we need to generate a new load (from the native stack
> area) for a tracked variable that is weakly updated by a builtin (i.e., may or
> may not be updated).

Don't we generate such a load already for existing upvar accesses?

/be
(In reply to comment #8)
> > 2. Otherwise, it seems we need to generate a new load (from the native stack
> > area) for a tracked variable that is weakly updated by a builtin (i.e., may or
> > may not be updated).
> 
> Don't we generate such a load already for existing upvar accesses?

We don't trace setting upvars, so it hasn't been necessary yet.
jorendorff told me this morning that another purpose of the parent guard is to ensure that the global object of the function being called matches that of the trace.

If this is correct, I think it can be handled by adding a guard on the global object if the global is accessed through the scope chain of the active function. This guard would take a bit longer to execute than the parent guard, but it is much weaker and it should be possible to make it happen only once per trace.
Attached patch WIP (obsolete) — Splinter Review
This version passes trace-tests and my personal closure test suite, at least. It doesn't do as well on the Go benchmark (https://bug460050.bugzilla.mozilla.org/attachment.cgi?id=378561) as I want, apparently because it exits with shape changes. When I do bindname/setname, I check that the shape of the call objects being accessed is the same as at record time. 

But here, funcA's call object seems to get a new shape each time funcA is called. And that means we need to record a new trace for the inner loop (in manyTimes) each time, and we never get to finish a trace for the outer loop. So the speedup vs. the interpreter is still pretty modest. I guess I'll have to dig in on shaping; it's about time I learned more about how it works.
Attachment #395478 - Attachment is obsolete: true
Hmm, is the call object's scope empty (and therefore uniquely shaped)? This could be what is biting.

/be
Originally an empty scope had shape 0. Problem was, before shapes even existed, a newborn shared its proto's scope, so you would have newborns whose shape was the shape of their proto, usually not 0. We could have fixed this with an extra test in OBJ_SHAPE but jorendorff argued rightly that the better course was to make each proto have an empty scope shared by all its newborns, which would have a unique shape.

Jason can point to the bugs more quickly than I can, but I wanted to throw this out in case it helps diagnose.

/be
I have a little more insight on this. The call object looks like this just after creation:

object 0x2fbbc0
class 0x1cc860 Call
properties:
proto null
parent <global object at 0x292000>
private 0x816864
slots:
   3 (reserved) = <function funcA at 0x2d5850 (JSFunction at 0x2d5850)>
   4 (reserved) = undefined
   5 = undefined

After the variable 'n' is "accessed" for the first time (by JSOP_BINDNAME here), js_DefineProperty gets called in the Call resolver, and the object looks like this:

object 0x2fbbc0
class 0x1cc860 Call
properties:
    enumerate permanent shared "n": slot -1
proto null
parent <global object at 0x292000>
private 0x816864
slots:
   3 (reserved) = <function funcA at 0x2d5850 (JSFunction at 0x2d5850)>
   4 (reserved) = undefined
   5 = undefined

Once the Call object gets the property 'n', it has a defined shape that is the same for all such call objects in this program (e.g., 0x105 in my test). The just-created Call objects get shapes 0x109, 0x10a, 0x10b, and so on.

During the first trace recording, the property 'n' is accessed (by js_FindIdentifierBase while recording JSOP_BINDNAME), and so the object gets shape 0x105 (the one with 'n') and a guard on that shape gets added to the trace. The loop is inside the lifetime of the funcA activation, so when the trace is called, the shape is still the same, and we finish the run on trace.

Next time we call funcA, it gets a fresh empty shape, say 0x10a. When we enter the trace, we will immediately fail the guard and take a branch exit from the JSOP_BINDNAME. 

Now we should try to start recording a new branch trace. But for some reason, the hit counter never gets high enough to do that [1], so instead what happens is we continue in the interpreter, where we end up setting the shape to 0x105. The next time we cross the loop header, we enter the trace, and we pass the guard and continue running.

Now, the question is how to make it so that we don't exit on that guard. I see at least two possibilities:

1. Initialize Call object properties when the object is created. This way, they always have the same shape by the time you do anything with them. But this idea is kind of unappealing, because not all variables are ever going to be used, so that work would be wasted.

2. Record code for JSOP_BINDNAME that handles this case. The idea would be to first test if the shape matches our record-time shape. If so, we continue. If not, then we call the property accessors from trace (I think it would be CallPropertyOp, because we already know the object and we just want to probe for the property) and then try again. Only if the shape is still different do we exit.

But even the above is not very general. If there are two properties that get set this way, we would end up recording traces where the first guard wants a shape that has only the first property, and the second guard wants both, so we would always fall off trace. So I think this has to be solved together with the PIC stuff and shape guard relaxation stuff.

[1] I'm not sure why, but I think it might be related to the fact that we are exiting from an inner tree called from an outer trace.
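A hypothetical sketch of that two-property scenario (shape numbers are not observable from script, so the comments just mark where each lazy resolve would move the Call object's shape):

```javascript
function funcA() {
  var n = 0, m = 0;
  function inner() {
    for (var i = 0; i < 100; ++i) {
      n = i;   // first BINDNAME: resolving 'n' gives the Call object one shape
      m = i;   // second BINDNAME: resolving 'm' moves it to another shape
    }
    return n + m;
  }
  return inner();
}
```

A trace guarding on the 'n'-only shape at the first site and the 'n'+'m' shape at the second would always miss one of the two guards on re-entry.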
Blocks: 504829
Blocks: 508716
Attached patch WIP 2 (obsolete) — Splinter Review
Here is a refresh of this work. This patch has no effect on SunSpider (not surprising, as those benchmarks don't use many closures). Outstanding issues:

 - This appears to cause a 4% slowdown on v8-crypto. It's probably because I had
   to remove some of the existing fast paths for closure variable access. But 
   new fast paths can be created that use the scope chain.

 - The validity of bz's new fast paths for NAME and SETNAME is said to depend
   on the parent guard. But I couldn't create a test case that used those fast
   paths and failed, so I left them in. Can we let the fuzzers and nightly
   testers find failing test cases (if they even exist)?

I'm going to test on Dromaeo next, and then look into the v8-crypto slowdown. I'll also check out the bugs this blocks currently. Let me know if there are any other perf test cases I should do.
Attachment #402969 - Attachment is obsolete: true
The testGuardCalleeSneakAttack{,2}.js tests in the patch for bug 542002 show the bugs that result if bz's code is not protected by the parent guard.

/be
Attached patch WIP 3 (obsolete) — Splinter Review
This version passes the Sneaky tests as well. It also happens to remove all of what I think are the upvar-related code paths that are invalid without the parent guard.

It shouldn't be considered a candidate patch yet--I need to check the perf for programs that used the removed fast paths and replace with new fast paths as needed.
Attachment #425586 - Attachment is obsolete: true
The charAt and slice dromaeo benchmarks exercise those fast paths nicely, as I recall.  So does the fluidsim thing, though that's harder to measure with.
So, we should be able to restore bz's fast paths simply by protecting them with a LIR_j* that ensures callobj->private is null, and going to the slow path otherwise.

The other fast paths, which apply for the case where the upvar lives in the current trace's tracker, seem more difficult. Those paths simply refer to the tracker to read, or store to the tracker location to write. They are safe given the parent guard because if the parent guard is passed, then we know that the upvar lives in the same place it did when we recorded the trace.

If we don't have the parent guard, then we know only that the upvar lives *somewhere* on trace (it's not hard to create a LIR conditional that establishes this fact). But it's hard to see any way at record time to know in which trace frame the upvar lives (with recursion, it could potentially live in any frame of the same function, even one in an outer tree that calls us). Given that, we need type stability guards and invalidation of frames in calling trees.
And those things are at least mildly hard.

Before figuring out what to do about that, I think I need to look at some samples of how upvars are actually getting traced these days, in particular which of the tracker-based fast paths are important, and when.

But we might be able to restore those fast paths with a simple trick--if we get into a situation where we want to use them, then we add the parent guard at that point.
Depends on: 545051
Attached patch Patch (obsolete) — Splinter Review
Before this goes to review, I think we need to decide whether we want it at all -- specifically, whether the benefits are worth the risk of correctness regressions and of perf regressions. The latter exists because I had to remove cases where we just operate via the tracker. Those are hard to restore without this guard. 

Here are the perf gains:

  SunSpider              no effect
  V8                     no effect
  Dromaeo                no effect

  fluid simulator        maybe +2 fps (26 fps vs. 24 fps)
  bug 504829 test case   much faster

So, there is no evidence that it speeds up anything important by very much, and it doesn't help our benchmark scores at all.

Brendan, bz, what do you think?
Attachment #425845 - Attachment is obsolete: true
The main benefit from this patch will be in cases where we in fact have a bunch of different function objects that have the same JSFunction going through, right?  Do we hit that with the jsa* benchmark in bug 536564?  Or some sort of module-like patterns?

It basically feels to me like this patch is more about avoiding a particular not-that-hard-to-hit case that causes performance to fall through the floor than about speeding up our usual metrics....  Getting rid of such cases is a worthwhile goal, all else being equal.  Not sure all else is equal here, of course.  ;)
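For what it's worth, a minimal instance of the pattern bz describes (module-pattern style, hypothetical names): one JSFunction, many function objects, one call site.

```javascript
function makeModule(name) {
  var calls = 0;
  return {
    ping: function () { ++calls; return name + ":" + calls; }
  };
}

var mods = [makeModule("a"), makeModule("b"), makeModule("c")];
var results = [];
for (var i = 0; i < mods.length; ++i) {
  results.push(mods[i].ping());  // same compiled ping, three different parents
}
```

With the parent guard in place, each distinct ping instance flowing through that call site would miss the guard and leave trace.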
OK, I think this version is ready for review.
Attachment #426111 - Attachment is obsolete: true
Attachment #426412 - Flags: review?(dvander)
Attachment #426412 - Attachment is patch: true
Attachment #426412 - Attachment mime type: application/octet-stream → text/plain
Obsolete with the removal of tracejit.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX