Open Bug 903457 Opened 11 years ago Updated 2 years ago

Generator functions are too slow: Baseline compile generators

Categories

(Core :: JavaScript Engine, defect)

22 Branch
defect

Tracking


mozilla36
Tracking Status
relnote-firefox --- 36+

People

(Reporter: georgir, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: dev-doc-needed)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0 (Beta/Release)
Build ID: 20130710173932

Steps to reproduce:

1. Run the sample JS code at http://jsfiddle.net/FAynF/1/ with an open Web Console (or firebug console).
2. Compare the time it takes to run the non-generator version to the time it takes to run the generator version.



Actual results:

On Firefox 22.0 on Gentoo I get:
normal: 11ms
generator: 759ms

On Firefox 22.0 and 23.0 on Windows I got:
normal: 13ms
generator: 1038ms

The difference factor is huge.

On Firefox 23.0 on Ubuntu I get:
normal: 98ms
generator: 457ms

The difference factor is smaller but still significant.



Expected results:

The generator and non-generator version should run in a relatively similar amount of time.
Note that the generator version is only resumed once, so the performance difference it exhibits is not related to yield/resume overhead.
Sorry, I meant to post this jsfiddle link:  http://jsfiddle.net/FAynF/

The http://jsfiddle.net/FAynF/1/ revision is nearly the same, but it splits the timing of the generator case into 'create' and 'run' phases, which adds no useful information and clutters the console output and the sample code. You can still look at that one too, but instead of the "normal:" and "generator:" timers you will have to compare the "normal:" and "generator run:" timers in the console output.
// Here is the sample code for those who cannot or do not want to open jsfiddle.
// Note: this uses SpiderMonkey's legacy generator syntax, where any function
// containing `yield` is a generator (no `function*` keyword).

console.clear();
function norm(){
    for(var i=0;i<10000000;i++);
    return 42;
}

function gen(){
    for(var i=0;i<10000000;i++);
    yield 42;
}

console.time('normal')
console.log(norm());
console.timeEnd('normal')

console.time('generator')
console.log(gen().next());
console.timeEnd('generator')
An interesting thing that I noticed: if I run the above sample code in the Web Console (Ctrl+Shift+K), both the normal and generator versions become slow: ~630ms on my Firefox 22.0 on Gentoo, a bit faster than the ~760ms of the generator version when the code is run from the jsfiddle page.

If I run it in the Firebug console (Ctrl+Shift+L for those who have Firebug), only the generator version is slow, as it is when the code runs from within the jsfiddle page.
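For reference, the same microbenchmark written in standard ES6 generator syntax (supported natively in later Firefox releases) would look like this:

```javascript
// ES6 version of the sample: only the generator declaration changes
// (function* instead of a plain function containing yield).
function norm() {
  for (var i = 0; i < 10000000; i++);
  return 42;
}

function* gen() {
  for (var i = 0; i < 10000000; i++);
  yield 42;
}

console.time('normal');
console.log(norm());              // 42
console.timeEnd('normal');

console.time('generator');
console.log(gen().next().value);  // 42
console.timeEnd('generator');
```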
Assignee: nobody → general
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Generators are not JIT-compiled; they always run in the interpreter and that's what makes them slower in this case.

Andy Wingo is working on improving our generator implementation. I think generators can be a lot faster with Baseline JIT support, but I don't expect us to Ion-compile them (IonMonkey is the optimizing compiler).
Status: UNCONFIRMED → NEW
Ever confirmed: true
Hi!  I was looking for a bug to discuss generator implementation
details, and it seems this bug is as good as any.  Adding VM folks,
debugging folks, and self-hosting folks, as it seems any change needs
input from these components.  Georgi, it's going to get pretty detailed,
so feel free to remove yourself from Cc if you are not interested in the
nitty-gritties :)

Generators are functions that can suspend.  The state of a function is
basically its call frame: its program counter, the arguments it was
called with, and its local variables, both named and un-named.
Suspending a generator has to capture this state.  Resuming a generator
should restore it.

SM already has a generators implementation.  As far as I understand, it
works like this:

 1. A generator starts with the OP_GENERATOR opcode.  The first time a
    generator function is run, the interpreter will see this opcode.
    OP_GENERATOR causes a new generator object to be allocated, with
    enough space for the generator's state, including any stack values
    it might need.  That object is then returned to the calling frame.

 2. "next" is a function implemented in C++.  When it is called on that
    generator object, the C function recurses into the interpreter,
    *using the generator itself as the stack frame*, without copying.

    It's a neat trick but it does conflate concerns of the interpreter
    (stack frame layout) with the state of the generator.

 3. Calling a function from within a generator might re-enter JIT, or it
    might simply push another frame on the stack and interpret it.

As Georgi notes, the problem is that generators are interpreted, and so
they are slow.  We would like to make the baseline JIT capable of
running generators.  (Ion would be nice too but that's not a priority
now.)

I think my suggestion for how to do this is like V8 does it: have "next"
/ "throw" on generators be self-hosted, so you don't have to leave the
JIT or even the interpreter to start doing generator things; and then
add "special intrinsics" that get called within "next" / "throw" that
cause the bytecompiler to emit some OP_RESUME inline opcode or so.

In the interpreter, OP_RESUME would result in calling a ResumeGenerator
helper that would push an inline frame in the current interpreter
activation and adjust the interpreter state so that it is in that
frame.  OP_YIELD would do the reverse: copy the necessary state from the
inline frame out to the generator object, then return.

In the baseline JIT, OP_RESUME would be codegenned as follows:

  1. Check if the generator object can be resumed as JIT code.
     Initially this will be a call to a runtime function.

  2. If the generator can't be resumed as JIT code, call out to the
     interpreter to resume it.  Return the value that the interpreter
     returns.

  3. Otherwise push on a stack frame and copy the generator state from
     the object to the frame.

       * This is more copying than we currently do.  To minimize
         copying, we should allocate locals on the heap instead of the
         stack -- for example arguments should be accessed through a
         reified arguments object, locals should be in heap objects on
         the scope chain, etc.

  4. Jump to the pc.

Again, OP_YIELD in the baseline JIT would do the reverse.

This is what V8 does, FWIW.
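The four steps above can be modeled in plain JavaScript. This is only a toy illustration of the dispatch; none of these names are real SpiderMonkey functions:

```javascript
// Toy model of OP_RESUME: try the JIT path, else fall back to the
// interpreter (steps 1-2), rebuild a frame from the generator object
// (step 3), and "jump" to the saved pc (step 4).
function opResume(genObj, runtime) {
  if (!runtime.canResumeInJit(genObj)) {
    return runtime.interpreterResume(genObj);  // steps 1-2: fallback
  }
  const frame = {                              // step 3: copy state
    pc: genObj.pc,
    locals: genObj.locals.slice(),
  };
  return runtime.jitRun(frame);                // step 4: jump to pc
}

// A fake "runtime" to exercise both paths:
const runtime = {
  canResumeInJit: (g) => g.hasJitCode,
  interpreterResume: (g) => "interpreted@" + g.pc,
  jitRun: (f) => "jit@" + f.pc,
};

console.log(opResume({ hasJitCode: false, pc: 7, locals: [] }, runtime)); // interpreted@7
console.log(opResume({ hasJitCode: true,  pc: 7, locals: [1] }, runtime)); // jit@7
```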

So, questions:

  How does this sound to you all?

  What are the debugging implications?

  Can we add the needed pieces to the self-hosted "next" and "throw"?
  How?  What syntax?  (V8 uses %_Foo, as opposed to %Foo).

  WDYT about the copying overhead?

  How do you get a function to allocate its locals on the heap?

Thoughts appreciated!

Andy
Hello,

Apologies in advance, but yes, a lot of that technical mumbo-jumbo is odd to me ;) 
That said, I really do not see generators as some magical new thing that needs big changes to the engine. Supporting them should not take anything the engine does not already do to support closures.

What I mean is that generators are really just syntactic sugar for things that can be done without them anyway. If JavaScript had a 'goto' statement, rewriting a generator without 'yield' would be trivial. Without 'goto' it is a bit trickier, but still possible; besides, I am willing to bet the internals of the engine do have 'goto'.

A simple example:

function gen(a1, a2) {
  var x=whatever;
  // something1
  var y = someexpr(yield yield_val);
  // something2
}

function genequiv(a1, a2) {
  var x, y; // move vars outside
  var _step = 0;
  function _run(_resume_val, _resume_e) {
    if(_step == 1) goto y1;
    // if(_step == 2) goto y2; // if we did have another yield.
    // ... and so on ...
    if(_step == 'done') throw new StopIteration();
  
    // original code with stripped 'var' and refactored 'yield'-s follows
    // ...


    x=whatever;
    // something1

    // refactored yield
    _step = 1; return yield_val;
y1: if (_resume_e) throw _resume_e;
    y = someexpr(_resume_val);

    // something2

    _step = 'done'; // on every real 'return'
  }
  return {
    // not in specs, but don't you hate the exception-based approach?
    // paused: function() {return _step != 'done'}, 
    // finished: function() {return _step == 'done'}, 
    next: function (resume_val) {return _run(resume_val)},
    throw: function (resume_e) {return _run(undefined, resume_e)}
  };
}

(Off-topic: I have actually started a pet project to see if I can do just this - you feed it a generator function source, it parses it and returns the equivalent code you need to 'eval' in order to get that generator in old JS lingo. Lack of 'goto' makes it a challenge)
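For what it's worth, the usual goto-free form of this transformation is a switch-based state machine. Here is a minimal runnable sketch (all names are illustrative, not from any engine):

```javascript
// Hand-desugared equivalent of:
//   function* () { for (var i = 0; i < limit; i++) yield i; }
// Locals are hoisted into the closure so they survive suspension;
// each yield records a resume point and returns.
function makeCounter(limit) {
  var i;
  var step = 0;  // resume point: 0 = start, 1 = after yield, 2 = done
  return {
    next: function () {
      switch (step) {
        case 0:
          i = 0;
          // fall through to the loop test
        case 1:
          if (i < limit) {
            step = 1;
            return { value: i++, done: false };
          }
          step = 2;
          // fall through
        default:
          return { value: undefined, done: true };
      }
    }
  };
}
```

Each `yield` becomes a `return` plus a recorded resume point, and `next` re-enters through the `switch`, so no `goto` is needed at the cost of restructuring the control flow.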
re: goto special cases
I do not expect goto from outside to inside an if or loop to be a problem, but goto from outside to inside a try block might need special consideration. Likewise, goto from outside to inside a let (not a problem in old JS versions) or with (just don't use it) blocks might be tricky.
Can yield be used in array comprehension expressions? In generator comprehension expressions?
Hi Georgi :)

Thank you for your interest.  I'm quite familiar with the semantics and implementation strategies for generators, having just finished implementing them in V8.  We are going to focus on standard ES6 generators.

With regards to generator semantics, you might be interested in this article:

  http://wingolog.org/archives/2013/05/08/generators-in-v8

In the meantime, I don't mean to be rude, but the SM folks are the ones whose input I need :)  I suggest just sitting back and chilling until we get this done.

Regards,

Andy
Adding djvj on Cc, suggested by dherman.  Thoughts appreciated!
(In reply to Andy Wingo from comment #5)
>  2. "next" is a function implemented in C++.  When it is called on that
>     generator object, the C function recurses into the interpreter,
>     *using the generator itself as the stack frame*, without copying.
> 
>     It's a neat trick but it does conflate concerns of the interpreter
>     (stack frame layout) with the state of the generator.

We used to copy the generator's "floating" StackFrame to the interpreter stack when resuming the generator, and copy the whole StackFrame back to the heap when suspending. As part of bug 881902, I changed this so that we use the generator's heap frame directly when resuming in the interpreter, just because it simplified things at the time.

I think copying from stack to heap and back makes sense, but we should really change the heap representation from StackFrame to something else. It would be nice to make StackFrame interpreter-stack-only. That should also simplify the GC marking code I think.

>   How does this sound to you all?

Sounds great at first glance.

>   How do you get a function to allocate its locals on the heap?

The parser/emitter should mark all locals and arguments as aliased, so that they are stored in the call object. A call object is an object on the scope chain that's allocated as part of the function prologue (functions that require a call object are called "heavyweight" in some places).

Aliased vars are then accessed using JSOP_GETALIASEDVAR / JSOP_SETALIASEDVAR instead of JSOP_GETLOCAL / JSOP_SETLOCAL. |eval| and |with| also force locals to be aliased; I think you want to do something similar to what we're doing there.
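As an illustration of why aliasing matters (plain JavaScript, nothing engine-specific): a closed-over local must already live on the heap so it can outlive its frame, which is exactly the property generator locals need across a yield:

```javascript
// `n` cannot live on counter()'s stack frame: the returned closure
// reads and writes it long after counter() has returned. Engines store
// such "aliased" locals in a heap-allocated call/environment object.
function counter() {
  let n = 0;
  return () => ++n;
}

const c = counter();
c(); c();
console.log(c()); // 3: the same heap-allocated `n` across all calls
```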
(In reply to Andy Wingo from comment #5)
> In the baseline JIT, OP_RESUME would be codegenned as follows:
> 
>   1. Check if the generator object can be resumed as JIT code.
>      Initially this will be a call to a runtime function.
> 
>   2. If the generator can't be resumed as JIT code, call out to the
>      interpreter to resume it.  Return the value that the interpreter
>      returns.
> 
>   3. Otherwise push on a stack frame and copy the generator state from
>      the object to the frame.
> 
>        * This is more copying than we currently do.  To minimize
>          copying, we should allocate locals on the heap instead of the
>          stack -- for example arguments should be accessed through a
>          reified arguments object, locals should be in heap objects on
>          the scope chain, etc.
> 
>   4. Jump to the pc.
> 
> Again, OP_YIELD in the baseline JIT would do the reverse.
> 
> This is what V8 does, FWIW.
> 
> So, questions:
> 
>   How does this sound to you all?

Your plan sounds reasonable.

> 
>   What are the debugging implications?

The copy-to-heap, run-on-stack approach will work particularly well with debugging logic.  The frame should look just like a regular JS frame on stack, and everything should "just work".

Do you need to keep track of anything "extra" in the frame that's not present in a non-generator frame?  I'm thinking of the generator object (which will presumably identify the heapspace where the frame will be copied to, among other things).  Will this need to be kept track of in the frame?  If so, are you planning on making this object just another "hidden" local value?

>   How do you get a function to allocate its locals on the heap?

Wait, why do you need to allocate the locals on the heap?  Why can't they be on the stack and be copied out to the generator object on OP_YIELD?

You'll be copying things anyway.  I don't see what precludes the locals being copied along with the other generator state that's on the stack.
(In reply to Kannan Vijayan [:djvj] from comment #12)
> You'll be copying things anyway.  I don't see what precludes the locals
> being copied along with the other generator state that's on the stack.

There are three reasons I can think of (from pre-bug 881902, when we did exactly this):
 - the time to resume a generator becomes proportional to the number of locals (it didn't matter so much before because there was tons of junk running before and after, but I imagine it would be significant once we were jitting generator calls)
 - there were really subtle GC barrier issues that we kept finding, stemming from locals effectively changing back and forth between being stack and heap slots
 - the barriers this necessitates on the resume/yield paths would, I imagine, also be a perf hit
I would like to advocate for a more "desugared" representation. The
idea would be to post-process the source of a generator to make
references to "local variables" defined in the generator be
implemented in a similar fashion to closure upvars; that is,
references to them are in fact indirected through some "generator
state" object. The same object would store an integer marking the
current resume point. Ideally, this would be basically the same as the
closure upvar path. Essentially, in a normal closure 0 hops represents
the local variables, 1 hop represents the environment, in a generator
0 hops would be unused, 1 hop would represent locals, and 2 hops would
represent the environment. This would likely require some work to
integrate with the debugger and profiler, but I'd hope not too much
since they have to support upvars anyhow.

Taking this approach ought to make ion/baseline integration quite
straightforward. Resume would be the same as a switch instruction.
Yield would just be a normal return.

If we are concerned about the performance costs of indirecting to
access a local variable, we should (1) measure and then if these
concerns are well-founded (2) implement optimizations in ion that
apply to lambda upvars too. A simple variation on PRE would allow us
to cache local variables on the stack and only save them back to the
state variable when a yield occurs, for example. Similarly, variables
that are never live across yields might never be saved back.

The advantages of the desugared approach, in my mind, is that:

- Only minimal work on the Baseline and Ion backend should be necessary.
  Porting the interpreter to work in this style also seems easy,
  since upvar access paths and `switch` exist in the interpreter already.

- Ion optimizations etc will continue to work, and won't need to
  think about generators at all.
  
- There is no need to make a safe version of yield that works
  in parallel execution, since we're just relying on the normal ion
  MIR instructions for upvar access.

The only challenge is making sure that the debugger, profiler, and other
user-visible tools continue to make a generator look "like a normal
function". This doesn't seem too hard to me, but I am not close
enough to the code to be sure.
Note: I was not saying that there are no RESUME or YIELD bytecodes; rather, that
the translation of such bytecodes to MIR could be done with a `switch` and a return.
Keeping the bytecodes might be handy for the front-end.
I have to admit I didn't follow all of comment 14. I think any kind of nontrivial desugaring is really hard to get *right* in JS, because the semantics are so not-Scheme. Let me try and show the sort of thing I think makes this kind of approach hard:

  - You can't use a switch-statement to jump into the middle of a nested statement
    (the way you can in C -- Duff's device is not valid JS). Yeah, yeah. If this
    were the only problem we'd be doing great. Read on, smarty-pants.

  - yield inside a try block does not execute the corresponding finally block.

  - Each time you enter a block that contains let-declarations or const-declarations,
    you have to create a new activation of that block; after yielding you must
    continue in the same activation. I don't see how you can use closures
    to implement that. At least, it's not straightforward.

  - Direct eval will punish you in proportion to the cleverness of your desugaring
    transformation. (This one really sucks.)

I dunno, it can all be overcome, but it looks hard. (I'm not thrilled about the debugger story either.)
To clarify, I understand in theory we could let all such stuff bail us out of the JIT, but then we'd have two implementations of generators to maintain.

While I'm here: maybe we can avoid copying locals to the stack by, in the baseline jit, changing addressOfLocal to use a different register when emitting generator code, rather than BaselineFrameReg.
So, first of all, I am much less familiar with the interpreter or baseline as compared to IonMonkey, so it's possible that what I'm saying doesn't make sense.

However, I was a bit sloppy in saying "desugaring" and referring to "switch statements" and the like. I did not mean to imply that we would do a syntactic transformation; nor did I even mean to imply that we would reuse a SWITCH bytecode exactly.

Mostly what I was saying is that we could avoid this complex machinery of copying values between the stack and the heap by transforming local variable accesses in a generator so that they use a distinct bytecode that includes within it an indirection through the state object. Similarly, the RESUME bytecode would take this state object and read from it the integer code indicating where to resume execution.

Now, once we enter the JIT, we *could* transform the RESUME bytecode into the same kind of control-flow structure that we use to compile switch statements. Similarly, the loads of local variables would be translated into the same sort of loads that we use for upvars -- in my ideal world, we'd just insert the state object into the "scope chain". So just as an upvar in a closure can be translated into something like "skip up 2 levels in the scope chain and take the 5th local", we could translate a generator local variable as "skip up 1 level in the scope chain and take the 5th local". But that might have unintended consequences, so it might be easier to just compile an access to a generator local as a load from a particular slot of the state object.

So when I said "desugar" what I really meant was "in the JITs, use the existing low-level constructs to express generator semantics", not "convert generators syntactically into other JS that expresses their semantics".
Re-reading a bit, I wonder if what I'm arguing for is...basically what's being proposed. It's possible.
(In reply to Niko Matsakis [:nmatsakis] from comment #18)
> So when I said "desugar" what I really meant
> was "in the JITs, use the existing low-level
> constructs to express generator semantics"
> not "convert generators syntactically into
> other JS that expresses their semantics".

Oh, I see. This makes sense to me now.

yield can also need to preserve other temporary values, not just locals. So to desugar

    x = f() + (yield x);

would require an extra variable for f(), right? And anything else that (in the interpreter) uses the operand stack. let-scopes and lambdas would always have to be represented in the least-optimized way. I guess that's not so bad.
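To make that concrete, a hand-desugaring of `x = f() + (yield x)` has to spill the operand-stack value of `f()` into an extra variable. This is an illustrative sketch only, ignoring throw/return handling:

```javascript
// The intermediate result of f() lives on the interpreter's operand
// stack across the yield, so the desugared form needs `tmp` for it.
function make(f, x0) {
  var x = x0, tmp, step = 0;
  return {
    next: function (sent) {
      if (step === 0) {
        tmp = f();                          // save f() across the yield
        step = 1;
        return { value: x, done: false };   // yield x
      }
      x = tmp + sent;                       // resume: finish the addition
      step = 2;
      return { value: x, done: true };
    }
  };
}
```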

(I was being extremely sloppy with my suggestion in comment 17; I imagine many bits of code would have to be changed to use the proposed "GeneratorFrameReg" rather than BaselineFrameReg when compiling a generator.)
The advantage of the GeneratorFrameReg approach would be that the frame layout doesn't have to change; it just lives at a different address.
Thanks all for the feedback.

(In reply to Kannan Vijayan [:djvj] from comment #12)
> Do you need to keep track of anything "extra" in the frame that's not
> present in a non-generator frame?  I'm thinking of the generator object
> (which will presumably identify the heapspace where the frame will be copied
> to, among other things).  Will this need to be kept track of in the frame? 
> If so, are you planning on making this object just another "hidden" local
> value?

Yes, this will need to be in the frame.  I think it will be another hidden local value, yes -- probably JSOP_YIELD will take its index as an argument.  Seems easier to do that than to ensure that it has some particular slot.

> >   How do you get a function to allocate its locals on the heap?
> 
> Wait, why do you need to allocate the locals on the heap?  Why can't they be
> on the stack and be copied out to the generator object on OP_YIELD?

When I suggested this I was assuming that the baseline jit reserved a register for the scope chain, so accessing a stack var would have the same perf characteristics as accessing a heap var.  I see now that that is not the case, and accessing a heap value requires an extra load.

> You'll be copying things anyway.  I don't see what precludes the locals
> being copied along with the other generator state that's on the stack.

Nothing precludes it, but it's not required either.  Just a trade-off.  It seems that leaving the values on the heap is probably a win, as Luke notes.

Regarding Ion, you could simply treat each yield expression as a call, causing all registers to spill.  Of course it would be ideal to teach Ion how to spill to the heap instead of the stack, but that would require unboxed scope chain slots -- tricky.  Either way there are GC implications to copying raw values out to the heap that I don't fully understand (e.g. how do you mark a heap object that consists partly of tagged and untagged data?).  It's solvable, but it's unclear to me how important Ion compilation of generators is.  If suspend/resume dominates the profile, optimizing run time is likely to be unimportant.
I think my plan here goes like this:

 1. Mark all locals and arguments as aliased in generator functions.

 2. Refactor generator objects to store state in separate fields:
    - The callee function.
    - The scope chain.
    - An array of values, holding the expression stack.  This could be a singleton empty array if the expression stack is empty.
    - The number of actual arguments (apparently necessary).
    - The arguments object (needed?)
    - The pc relative to the function's start, with one bit reserved to indicate JIT or interpreted code.

    I think that's all the needed state.  It seems like quite a lot.  Whether or not to restore the block state should be statically determinable, but perhaps that is another piece of data we should store.

    Can this data be allocated inline to the GeneratorObject?  Is there any benefit to making generator objects not need finalizers?  With this strategy, generator objects would have a static shape, which I would think would make GC's job easier.

    Anyway, with this refactor in place, to resume we could build a generator frame by hand from jsiter.cpp using the lifo alloc from the activation, then proceed as before to activate the frame.  This would be an intermediate step.

 3. Make "next", "throw", and (for legacy generators) "close" be builtins that call "special intrinsics", which emit JSOP_RESUME inline.  JSOP_RESUME would build a frame as jsiter.cpp did, but do so without leaving the interpreter.  (I guess this means that JSOP_RESUME would cause a function to not be baseline compiled, for now.)

 4. Implement support for JSOP_RESUME, JSOP_YIELD, and the result boxing (for star generators) of JSOP_RETURN in the baseline JIT.

Let me know your thoughts, and anything I'm forgetting here.
(In reply to Andy Wingo from comment #23)
>     - The arguments object (needed?)

Only if we can mutate the arguments variable or if we can access the generator by name.

function g() {
  return f.arguments;
}

function* f() {
  yield g();
}

or

function* f(a) {
  arguments[0] = 1;
  yield a;
}

otherwise the arguments would be on the stack.

Also, I am not sure I understand what it means for a generator to have arguments. Are they erased on the second call, or do we reuse the first set of arguments?
I don't know if generators have an arguments binding or not, but if they do, it's retained when the generator is suspended, just like all the other local bindings.
Assignee: general → wingo
(In reply to Andy Wingo [:wingo] from comment #23)
>  2. Refactor generator objects to store state in separate fields:
>     - The callee function.
>     - The scope chain.
>     - An array of values, holding the expression stack.  This could be a
> singleton empty array if the expression stack is empty.
>     - The number of actual arguments (apparently necessary).
>     - The arguments object (needed?)
>     - The pc relative to the function's start, with one bit reserved to
> indicate JIT or interpreted code.
> 
>     I think that's all the needed state.  It seems like quite a lot. 
> Whether or not to restore the block state should be statically determinable,
> but perhaps that is another piece of data we should store.

It looks like we have to include the block chain as well -- bug 927782.  Bummer :(

A
Any clue which Gecko version we would target this for? One of the primary reasons we are avoiding generators in FxOS development is their current performance (and using them would make many people very happy).
James: There will be a speedup, but I wouldn't bet the project on it.  There's lots that needs doing.  Depends on your workload.  Anyway I want to finish my work here in a few weeks, so that would mean FF28 I think.
Andy, any ETA on this? Am I right to assume this bug only covers baseline-jit, and ion-jit will be in a followup?
Flags: needinfo?(wingo)
This bug only covers the baseline JIT, yes.  I don't think I have time to do Ion.  I don't have an ETA currently; maybe three weeks, depending on how many yaks there are in the Andes this time of year.
Flags: needinfo?(wingo)
Depends on: 987560
Depends on: 1090491
Depends on: 1093573
Depends on: 1097890
All depending bugs are fixed. Should we close this now or file Ion bugs?
Sorry for the bugspam. I think a separate (tracking) bug for Ion would be better, filed once the need arises.

dev-doc-needed for the "Developer Release Notes"; more information in the dependent bugs.

Release Note Request (optional, but appreciated)
[Why is this notable]: Lightning-fast ES6 generators
[Suggested wording]: Improved performance of ES6 generators
[Links (documentation, blog post, etc)]: https://wingolog.org/archives/2014/11/14/generators-in-firefox-now-twenty-two-times-faster
relnote-firefox: --- → ?
Keywords: dev-doc-needed
OS: Linux → All
Hardware: x86_64 → All
Summary: Generator functions are too slow → Generator functions are too slow: Baseline compile generators
Target Milestone: --- → mozilla36
Blocks: 1393712
Depends on: 1409231
Blocks: 1521435

Removing myself as owner as I haven't worked in this area for some time :)

Assignee: wingo → nobody

In Firefox 89, generators are about 10x slower than iterators or callbacks. In both V8 and WebKit, generators are much faster.

In Firefox, I see about 10 ops/s. V8 does 44 ops/s (also a known issue). WebKit does about 100 ops/s.

  function* iterator() {
    for (let i = 0; i < 1e9; i++) {
      yield i;
    }
  }
  
  let sum = 0;

  // timed code:
  for (const i of iterator()) {
    sum += i;
  }

Iterators and callbacks get about 100-300 ops/s (depending on the browser engine) for doing the sum. Is this an issue with the JS execution engine, or is the benchmark unrealistic, so that the optimized code paths are not taken?
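For concreteness, the callback-style comparison presumably looks something like this (an assumed shape, since the exact comparison code was not posted, shown here with a smaller N so it finishes instantly):

```javascript
// Callback-based equivalent of summing the iterator's values; the
// generator version above does the same work via yield/next instead.
function each(n, cb) {
  for (let i = 0; i < n; i++) cb(i);
}

let total = 0;
each(1e6, (i) => { total += i; });
console.log(total); // 499999500000 (= 1e6 * (1e6 - 1) / 2)
```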

There are still things we could do to improve generator performance, but we're not aware of any real-world code for which generator performance is a bottleneck, so it's not our highest priority. We would welcome any examples you can provide from actual websites.

(In particular, resuming into a Warp-compiled generator is currently implemented by resuming in baseline and doing OSR, which imposes some overhead. That overhead will be most significant in cases where the body of the generator is trivial, so this microbenchmark is pretty close to the worst-case scenario for us.)

Severity: normal → S3