Closed Bug 448828 Opened 16 years ago Closed 13 years ago

Eliminating JSFrameRegs.sp

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: igor, Assigned: igor)

References

Details

Currently interpreter's stack pointer sp is exposed to the rest of runtime through a pointer to JSStackFrame. This way the GC and the error reporter can find out what is exactly the current stack to perform GC-tracing and search for error values for the decompiler.

But this exposure of sp is costly as it forces redundant stores and loads of the sp value in the interpreter. it would be nice to eliminate this cost. 

For this it would be necessary to teach the GC and the error reporter to calculate the stack depth based on the current pc rather then using the access to JSFrameRegs.sp. This calculations can piggy back on the stack modeling already implemented for the decompiler.
Depends on: 412296
Igor, this is great -- can we get rid of regs.pc, that is get rid of regs on the stack altogether?

I exposed a js_ReconstructStackDepth library-internal API from jsopcode.{h,cpp} in http://hg.mozilla.org/tracemonkey/index.cgi/rev/55a104b3ea55 for tracemonkey. It could be useful here.

/be
(In reply to comment #1)
> Igor, this is great -- can we get rid of regs.pc, that is get rid of regs on
> the stack altogether?

To implement that we would need a static analysis to to verify that any interpreter case that calls external functions contains instructions to sync regs.pc with the real PC before the call. That is, something like SAVE_PC that flushes PC into regs.pc. Note that it is important to continue to use regs.pc, not fp->pc, as doing latter requires to load fp, while writing to regs.sp is just updating the native stack.

In principle that SAVE_PC may even be extended to SAVE_SP_PC, but it would force redundant writes of SP into regs.sp. The GC can calculate SP itself from regs.pc if the interpreter adds few *sp++ = NULL on slow paths to ensure that all the slots that the GC is going to scan are initialized. This is exactly what I would like to try in this bug but it is orthogonal to the decoupling of regs.pc from the PC as used by the interpreter. 

On the other hand this bug also needs a static analysis to enforce those *sp++ = NULL. So I guess the best way to proceed is to bribe Taras to help with the analysis ;)
I think this sounds doable. I'd need concrete examples to proceed further.
(In reply to comment #3)
> I think this sounds doable. I'd need concrete examples to proceed further.

Currently the interpreter bytecode cases looks like:

BEGIN_CASE(bytecode)
  implementation_code;
END_CASE(bytecode)

Here BEGIN_CASE and END_CASE macros expands either into machinary to support indirect threading when computed goto are available or into "case bytecode:" and "goto advance_pc". 

Now this bug about eliminating the sp counter require a analysis that would enforce the following rule:

Suppose we have a bytecode case with the following 2 conditions:

1) implementation_code contains an expression that increases the local variable sp like sp++, ++sp or sp += <compiler-time-constant>.

2)  implementation_code contains a call to a non-inline function or a call to an inline function which can potentially call non-inline function.

Then all expressions that increases sp must happen before the jump or a function call on all possible code paths. Moreover, all such increase operations must have a form:

*sp++ = something. 

The idea behind this rule is to allow for the GC to assume that, when the GC is called during evaluation of some script at a particular bytecode, then all the stack slots that the interpreter can access for that particular bytecode are initialized.

Here is some examples:

          BEGIN_CASE(JSOP_DUP2)
            JS_ASSERT(sp - 2 >= StackBase(fp));
            lval = sp[-2];
            rval = sp[-1];
            *sp++ = lval;
            *sp++ = rval;
          END_CASE(JSOP_DUP2)

This code should be accepted. Although sp++ comes after a call to StackBase, the StackBase is an inline function that does not call non-inline functions.

          BEGIN_CASE(JSOP_ANONFUNOBJ)
...
            parent = js_GetScopeChain(cx, fp);
            if (!parent)
                goto error;
...
            *sp++ = OBJECT_TO_JSVAL(obj);
          END_CASE(JSOP_ANONFUNOBJ)

Here  *sp++ comes after a call to non-inline js_GetScopeChain and the code should be rejected. A fixed version may look like:

          BEGIN_CASE(JSOP_ANONFUNOBJ)
...
            *sp++ = JSVAL_NULL;
            parent = js_GetScopeChain(cx, fp);
            if (!parent)
                goto error;
...
            sp[-1] = OBJECT_TO_JSVAL(obj);
          END_CASE(JSOP_ANONFUNOBJ)

Another example that should be rejected:

          BEGIN_CASE(JSOP_INCNAME)
          BEGIN_CASE(JSOP_DECNAME)
          BEGIN_CASE(JSOP_NAMEINC)
          BEGIN_CASE(JSOP_NAMEDEC)
...
            if (JS_LIKELY(OBJ_IS_NATIVE(obj))) {
                PROPERTY_CACHE_TEST(cx, regs.pc, obj, obj2, entry, atom);
...
            }
...
            if (JS_LIKELY(CAN_DO_FAST_INC_DEC(v))) {
...
            } else {
                *sp++ = JSVAL_NULL;
....
            }
...

Here PROPERTY_CACHE_TEST expands to something that contains non-inline call so sp++ cannot follow it. The code would be fixed to look like:

          BEGIN_CASE(JSOP_INCNAME)
          BEGIN_CASE(JSOP_DECNAME)
          BEGIN_CASE(JSOP_NAMEINC)
          BEGIN_CASE(JSOP_NAMEDEC)
            *sp++ = JSVAL_NULL;
...
            sp[-1] = JSVAL_NULL;

Note a complication here about several consecutive BEGIN_CASE(JSOP_INCNAME). 

That is for this bug. Now for that "eliminate redundant pc updates bug" we need to enforce much simple rule which is:

If implementation_code contains a call to a non-inline function or a call to an inline function which can potentially call non-inline function, then on any code path that may lead to a such call implementation_code must have the following statement:

regs.pc = pc;
dmandelin is doing inline/call-threading of js_Interpret, and he's a static analysis heavy hitter to boot. Cc'ing him.

/be
The analysis looks good. The easy part would be to find cases that don't call non-inline functions and just leave them out. For the hard part, here's an idea of how it could work:

- Discover stack depth change for the op. I think this can be done with jsopcode.tbl, correct? Except for variadic input ops like JSOP_CALL, which might need manual handling?

- Use abstract interpretation to find the relative value of sp at all points in the case.

- Insert the needed number of '*sp++ = NULL' at the nearest dominator to all non-inline calls in the case. The 'nearest dominator' is the "innermost" point such that all paths to a call flow through that point. (Technical note: "point X dominates point Y" means all control flow paths that reach Y pass through X first. The dominator relation forms a tree. "Nearest" dominator is the nearest common ancestor.) Dominators is a standard static analysis.

- Rewrite stores to sp[_] to use the index computed from the abstract interpretation. Also delete all modifications to sp. This sounds tricky to me but I think it amounts to intricate but not deep programming.

The key practical point about this design is with code like:

  BEGIN_CASE(JSOP_BLAH)
    if (slow_case_1) {
      call_foo();
    } else {
      if (fast_case) {
        // no calls here
      } else {
        call_bar();
      }
  END_CASE(JSOP_BLAH)

The '*sp++ = NULL' would go at the very top, because that is the only place dominating all calls. But that puts it on the fast path as well, which is perhaps not wanted. We could fix that by not placing the spacers ('*sp++ = NULL') at the dominator point, instead doing something smarter. I think the solution is closely related to transforming code to SSA form or doing constant subexpression elimination, so it shouldn't be too hard to figure out. But maybe the code is structured so the nearest dominator is usually off the fast path anyway.
(In reply to comment #6)
> - Discover stack depth change for the op. I think this can be done with
> jsopcode.tbl, correct? Except for variadic input ops like JSOP_CALL, which
> might need manual handling?

JSOP_CALL does not need special handling since sp is always decreasing after it. What would require special handling is just JSOP_ENTERBLOCK as currently it is the only bytecode that pushes the variable number of elements on the stack. But then the analysis may just check that the code does indeed pushe those JSVAL_VOID into all block slots.

> The key practical point about this design is with code like:
> 
>   BEGIN_CASE(JSOP_BLAH)
>     if (slow_case_1) {
>       call_foo();
>     } else {
>       if (fast_case) {
>         // no calls here
>       } else {
>         call_bar();
>       }
>   END_CASE(JSOP_BLAH)
> 
> The '*sp++ = NULL' would go at the very top, because that is the only place
> dominating all calls. But that puts it on the fast path as well, which is
> perhaps not wanted.

In fact in most cases *sp++ = JSVAL_NULL is wanted! At one point it would be nice to have the bug 397177 fixed. But that would need strong roots, that is, space on the interpreter stack. And this is exactly what *sp++ = 0 provides.

Note that I agree 100% that potentially redundant *sp++ = JSVAL_NULL should be done only exactly before non-inline function calls, since only they could trigger GC or invocation of stack-walking functions. If a bytecode case has a spaghetti of ifs, only those branches that do calls to external functions should need *sp++ = JSVAL_NULL, not the rest of call-free branches.

But my worry is a feasibility of bulletproof analysis with full stack simulation. Hence the reason for a simpler proposal to ensure that all sp++ happens at the beginning of the bytecode.
Pushing nulls pessimistically will regress performance.

We should have fast-paths first (if (fast_case) { ... } before other else-ifs). PGO'ed builds will reorder to favor common cases, but those should be the fast ones or we're biasing value-tagging, etc., suboptimally. (We could be, this would be good to find out; see bug 360324.)

I talked to dmandelin last night about how we need GC safety of course, but also significant load/store reduction (see his measurements from bug 442379). This suggests to me that we should focus on a combination of static analysis and call plus inline (for short opcodes) threading, with immediates normalized into the call/inline-threaded instruction stream, and pc updates minimized to those cases that call out, possibly even only to specially marked call-outs.

Easy to hand-wave, but the practical point is that we should be careful not to fix this bug only, which may be performance-neutral or even negative, while not helping the call/inline threading work (which would have good GC safety effects too).

Ideally we would make use of our GCC plugins to avoid too much hand-coding of magic offsets and byte-stuffing patterns to search for in the patches for bug 442379. C alone (or C++, doesn't matter at this level) is a poor host language for writing a call/inline-threaded interpreter. Can we combine our GCC static analysis skills and JS interpreter hacking skills quickly to win on both perf and GC safety?

/be
(In reply to comment #8)
> Pushing nulls pessimistically will regress performance.

It is possible to eliminate JSFrameRegs.sp without any pushes of nulls and without the need for a static analysis. The idea is to zero the stack before entering the frame and also zero the unused part of the frame in the GC as calculated based on the current pc. This way the GC will never see the garbage again. 

Of cause, this schema may slow down a specially crafted script that has a deep stack that is used only on a rarely executed script branch due to the overhead of the initial memset, but I expect for normal scripts this should be a win compared with the case of repeated *sp = NULL.

Another drawback is that due to precisely the absence of *sp = 0 before calling external functions anything that is pointed by sp will be treated as a live object if the GC runs before the code overwrites the sp. But in practice such leak would again only possible with a specially crafted script that tries to schedule the GC at the same bytecode.

On the other hand the big plus of the schema is that it would be much harder to exploit a bug in the code that models the maximum stack usage based on the current pc. Since the GC will null the unused stack, such bug would either lead to a harmless leak (GC overestimated the usage) or turns the value into JSVAL_NULL  (GC underestimated the usage). The latter, without farther bugs in the engine, would lead in the worst case to accessing a null pointer.
Depends on: 449494
(In reply to comment #1 by Brendan)
> I exposed a js_ReconstructStackDepth library-internal API from jsopcode.{h,cpp}
> in http://hg.mozilla.org/tracemonkey/index.cgi/rev/55a104b3ea55 for
> tracemonkey. It could be useful here.

It turned out ReconstructStackDepth is not useful for this bug in its current form. Function's reasonable performance depends on the source note cache while eliminating JSFrameRegs.sp requires to map pc -> stack depth during the GC, when the cache is emptied.

So I am going to have a specialized function that just calculates sp based on pc without using source notes. In principle it should be possible even possible to rewrite ReconstructStackDepth so it does not use source notes (which would allow to cache the source notes just during the decompilations), but that is rather involved.
Blocks: 397177
With JITs this bug is no longer relevant.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.