Closed Bug 642002 Opened 13 years ago Closed 9 years ago

Analyze v8 raytrace performance

Categories: Core :: JavaScript Engine, defect
Priority: Not set
Severity: normal
Status: RESOLVED WORKSFORME
People: (Reporter: dmandelin, Assigned: dmandelin)
Attachments: (3 files, 1 obsolete file)

Similar prologue to bug 642001. On this one we are 2x worse than v8/prev and 3x worse than v8/cs. Why the 2x? Inquiring minds want to know.
Blocks: 642003
Attached file JS profiler output (obsolete) —
Initial observations:

We're about 2x slower than v8/nocs, so we need to explain how about 50% of our total time is wasted. Most of that is:

0.358755 0.358755   raytrace.js               36
                     0.102453 0.286   STUB_UNKNOWN                36
                     0.100596 0.280   RUN_MJITCODE                36
                     0.089066 0.248   STUB_CREATETHIS             36
                     0.066592 0.186   GC                          36
                     0.000049 0.000   IC_GETPROP                  36

This is our old friend |create|:

  var Class = {
    create: function() {
      return function() {
        this.initialize.apply(this, arguments);
      }
    }
  };

I thought we had a special optimization for that pattern. Did it get disabled at some point, or did we never implement it?

Other than that, we have 7% in GC and 9% in CreateThis, which we already know about (need generational GC/super-fast obj allocation).

Finally, for each hot function, most of the time is charged to mjit code for the first source line in the function. So I guess we need to look into that in detail, since it seems to matter a lot both on earley and here. It could of course be a bug in the profiler.
There was indeed a bug in the profiler causing time to get charged to the wrong lines. Correct output is attached.
Attachment #520036 - Attachment is obsolete: true
With the corrected line-of-code charging, the hot lines of code are mostly 'new' expressions. But there are also a few that look like this:

        for(var i=0; i<scene.lights.length; i++){

scene.lights is an array. I found that v8/nocs is faster than us on empty loops with this kind of header, and also ones with a fixed iteration bound, by about 2x in either case. In this benchmark, these loop headers are probably not more than 5% of our execution time, so fixing this is only worth 2%. That doesn't seem like a huge deal, and for this benchmark alone, it's not, but we really should try to improve our basic loop performance.
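For reference, the usual hand-optimization here hoists the length lookup out of the header. This is just an illustration of the repeated work being measured, not the engine-side fix:

  for (var i = 0, n = scene.lights.length; i < n; i++) {
    // body unchanged; scene.lights.length is now read only once
  }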
Assignee: general → dmandelin
I've looked at it with Firebug, and from what I can gather this piece of code runs a lot:

var vU = new Flog.RayTracer.Vector(this.position.y, this.position.z, -this.position.x);
var vV = vU.cross(this.position);
var u = info.position.dot(vU);
var v = info.position.dot(vV);

It's basically creating temporary objects as wrappers to do the calculation and then discarding them. As noted in other bugs, using a generational GC and bump allocation can make this a lot faster.

But there is another way to make this cheap: Escape Analysis (stealing once again from the Java optimizations cookbook).

If type inference or tracing can determine which functions are called and guarantee that none of those functions assign the object to any scope outside the current stack, i.e. that it will never escape, then there are two options:
a) decompose the object into primitives (in this case the 3 coordinates of the vector) and put them on the stack
b) simply allocate the whole object on the stack
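
As a concrete illustration of option a), here is the decomposition done by hand on the snippet above. It's only a sketch: it assumes the conventional x/y/z fields and cross/dot definitions for Vector, and distanceUV is a made-up wrapper, not benchmark code.

  // Before: two short-lived Vector objects are heap-allocated per call.
  function distanceUV(info, pos) {
    var vU = new Flog.RayTracer.Vector(pos.y, pos.z, -pos.x);
    var vV = vU.cross(pos);
    return [info.position.dot(vU), info.position.dot(vV)];
  }

  // After decomposition (option a): the non-escaping vectors become
  // primitive locals that can live in registers or stack slots.
  function distanceUVScalar(info, pos) {
    var ux = pos.y, uy = pos.z, uz = -pos.x;   // vU
    var vx = uy * pos.z - uz * pos.y;          // vV = vU cross pos
    var vy = uz * pos.x - ux * pos.z;
    var vz = ux * pos.y - uy * pos.x;
    var p = info.position;
    return [p.x * ux + p.y * uy + p.z * uz,    // info.position dot vU
            p.x * vx + p.y * vy + p.z * vz];   // info.position dot vV
  }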


I don't know how closures are implemented, but if they create objects too, then the common pattern of creating one, passing it to some iterator, and then discarding it could be optimized that way as well.
In any case this would optimize the pattern where temporary, immutable wrapper objects like the Vector here are created, especially when several such operations on immutable objects are chained together.
My understanding is that escape analysis doesn't help much in the presence of a fast GC, based on modern Java results. They might see different allocation patterns than we do in JS, though.
Escape analysis as implemented in Java 6 does not actually perform stack allocation; it is only used for lock elision. In Java 7 it performs stack allocation for scenario a), i.e. if the object can be decomposed into a few primitive values (and fixed-size arrays) and its function calls can be inlined, which mostly covers the immutable wrapper pattern. Scenario b) has not been implemented, so we don't actually have real-world data on that case.

I'll see if I can find and/or make some measurements for decomposed + stack-allocated objects.
(In reply to comment #7)
> I'll see if I can find and/or make some measurements for
> decomposed + stack-allocated objects.

Good observations! Please do give that a try.
OK, after doing more or less the same thing as the raytracer (dot product of two immutable vectors), and making sure that the compiler doesn't do dead-code elimination, I get the following:

100 million dot products calculated
without escape analysis: 1494ms 
with escape analysis: 276ms

Debug output confirms that zero garbage collections run with escape analysis; hundreds of them are performed without it. Do note that GCs are actually cheap here, since no other objects are on the heap. In a real case they might be much more expensive.

jvm parameters:
-server -Xms20m -Xmx20m -XX:+UnlockExperimentalVMOptions -Xbatch -XX:-DoEscapeAnalysis

turn the -DoEscape... into +DoEscape... to enable it.

source used to test is attached.



Is that gain worth it? I don't know. It depends on how good our garbage collector would be, how well we can actually determine whether something escapes the stack in JavaScript, and how many temporary objects are created and discarded in tight loops.
I did some further testing by varying the size of the young generation, where practically all GCs happen for this benchmark:

no EA:
1m  young gen: 2254ms
2m  young gen: 1611ms
10m young gen: 1419ms
20m young gen: 1328ms
EA:
1m  young gen: 260ms
20m young gen: 260ms


So escape analysis + decomposition not only reduces GC costs (especially with small allocation pools) but also provides additional savings, possibly stemming from eliminating allocation and object-management overhead, fewer indirections for field access, better cache locality (everything on the stack), or something like that.
Since this is related to type inference, should Brian Hackett get in on this discussion?
Mhh, yes. He should know whether we can actually get enough information to do this with TI. Another question is whether TraceMonkey could do this too; can anyone comment on that?
Attached file same benchmark in JS
To compare, the same code implemented in JS
(takes about 190145ms on JM-nightly)
Attachment #535870 - Attachment mime type: application/octet-stream → text/plain
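The attached JS source isn't inlined in this page; a minimal sketch of the pattern it measures, assuming a 3-component immutable vector and the same 100 million iterations as the Java run (Vec is an illustrative name), would be:

  function Vec(x, y, z) { this.x = x; this.y = y; this.z = z; }
  Vec.prototype.dot = function (o) {
    return this.x * o.x + this.y * o.y + this.z * o.z;
  };

  var sum = 0;
  for (var i = 0; i < 100000000; i++) {
    // Two temporaries per iteration; without escape analysis each one
    // is a heap allocation that immediately becomes garbage.
    sum += new Vec(i, i + 1, i + 2).dot(new Vec(i, i - 1, i - 2));
  }
  print(sum); // consume the result so dead-code elimination can't drop the loop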
I've been planning on trying this analysis once IonMonkey is ready... representing board positions as objects and banging on them in tight loops was a motivating use case. Patrick, a while back you mentioned a test case like that, where the allocations were killing performance - do you still have that?

The analysis itself would not be hard on TM's LIR, but the decomposition would be trickier, as you'd have to teach bailouts how to recompose the object. For JM, the architecture isn't there at all, so it'd have to piggyback entirely on TI. JM+TI's calls out to C++ also guarantee that the stack is reified, so I'm not sure how that would play out.
My feeling is that we should really see how well generational GC does before investing much in escape analysis.

With a bump allocator, you could do this in a fairly straightforward way without too much pain (not optimal code though) by allocating space for the object from the allocator, then rolling back when you know the object is no longer in use (requires that no escaping allocations occur while the non-escaping object is live).  Similar to how we push/pop storage on the temporary arena pool in a context.  If a side exit decides the object can escape, no fancy movement needs to occur, just don't rollback the bump allocator when done.

Or, to be faster, reserve space from the bump allocator, but leave it totally blank and be able to keep the object's properties in registers (maybe stack space too).  That's a general technique which it would be great for IonMonkey to support.
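
To make the rollback idea concrete, here is a model of it in JS. This is purely illustrative; heap, bumpAlloc, and withNonEscaping are invented names, not engine internals:

  var heap = { cursor: 0, limit: 1 << 20 };

  // Allocation is just a pointer increment; hitting the limit would
  // trigger a GC on the slow path.
  function bumpAlloc(size) {
    var addr = heap.cursor;
    heap.cursor += size;
    if (heap.cursor > heap.limit) throw "slow path: GC";
    return addr;
  }

  function withNonEscaping(size, body) {
    var watermark = heap.cursor;   // remember where we were
    var obj = bumpAlloc(size);
    var escaped = body(obj);       // a side exit may decide it escapes
    // Only sound if body performs no escaping allocations after obj.
    if (!escaped)
      heap.cursor = watermark;     // roll back: the object is dead
    // If it escaped, leave the cursor alone; no copying needed.
  }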
Rolling back the bump allocator sounds very interesting, but it doesn't seem to cover method chaining.

E.g.

function chained() {
  var a = new Immuteable256BitInt();
  for (var i = 0; i < 10000; i++) {
    a = new Immuteable256BitInt(i).mul(a).add(new Immuteable256BitInt(i % 256));
  }
  return a;
}

This always creates a new object from the old ones before they become obsolete, and the last one created in the chain is returned, thus escaping and preventing bump-pointer rollback.

EA on the other hand could at least stack-allocate 3 of the 4 allocations. With some magic special-casing for the last loop iteration all but the very last allocation could be stack-allocated despite the linear dependencies of the objects.

In other words, bump-allocator rollback requires that the lifespans of temporary objects be non-overlapping; otherwise they just keep leapfrogging the pointer until it forces a GC.
Hmm, with some analysis on code like this you could figure out what is being allocated and in what order, and allocate space for the escaping objects first. Regardless of its precision, type inference tends to be impermanent and subject to change any time you reenter the VM, so having escape hatches that make it easy to recover a generic state is a handy way to build robust optimizations. The alternative is to really analyze the code carefully and make sure that type changes invalidating the optimization are never possible, but such analyses are tricky and less robust. Not to say they're never necessary; we do an analysis in this style to ignore integer overflows on ADDs that will definitely flow into a bitop.

Also, doing escape analysis with type information should be no problem, since it gives you a complete callgraph (again, not a permanent one).  We already do an escape analysis when analyzing scripts that are called with 'new', to figure out which properties are definitely added to them before 'this' escapes or tries to use any of its properties.  The same principles that apply there should apply here.  Note though that the v8-raytrace type information is currently garbage for a few silly reasons (will fix those after TI merges into TM).
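
For concreteness, here is the kind of distinction the 'new'-script analysis draws. Both constructors are hypothetical, and register stands for an arbitrary external call:

  // 'this' does not escape before x and y are added, so the analysis
  // can give every Point definite slots for both properties.
  function Point(x, y) {
    this.x = x;
    this.y = y;
  }

  // Here 'this' escapes into register() before y is added, so only x
  // can be treated as a definite property.
  function Tracked(x, y) {
    this.x = x;
    register(this);
    this.y = y;
  }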
Mhh, yes. Reordering should work for most cases. The main difference is that it needs to know which objects will escape in advance, while escape analysis only needs to know how many objects *won't* escape.

Maybe we could have a separate allocation region for allocations that shouldn't escape according to current type information. It would basically act as a second stack, dedicated to something resembling stack allocation. If the assumptions turn out to be wrong, it'll just hit the limit like a normal allocation pool and will need to be GCed, but that would be a slow path that should almost never happen.

This would be a bit less efficient since we still have to instantiate the whole object instead of just its fields but it should be quite robust.
(In reply to comment #14)
> I've been planning on trying this analysis once IonMonkey is ready...
> representing board positions as objects and banging on them in tight loops
> was a motivating use case. Patrick, a while back you mentioned a test case
> like that, where the allocations were killing performance - do you still
> have that?

I don't actually have a test case in which *allocations* kill performance. I've seen GC kill performance before, of course (most notably in FOTN).
I sat down with this benchmark today to try to see what has changed since TI. As before, all the OO stuff is really screwing us up. Brian has a patch in bug 638794 that mitigates the problem somewhat, but not entirely.

The code basically looks like this:

var Class = {
  create: function() {
    return function() {
      this.initialize.apply(this, arguments);
    }
  }
};

Flog.RayTracer.Color = Class.create();
Flog.RayTracer.Color.prototype = {
    initialize: function...
    method: ...
    field: ...
};

I tested three versions of the benchmark. The first one is unmodified. The second one is modified to avoid the Class.create pattern. It turns the initialize member into a normal constructor. I'll call this version "ctor". The final version makes the code look like this:

Flog.RayTracer.Color = function(...) { ctor };
Flog.RayTracer.Color.prototype.method = ...
Flog.RayTracer.Color.prototype.field = ...

I'll call this version "ctor+proto".
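
For concreteness, here is what the three variants look like on one class. The r/g/b fields match the benchmark's Color, but brightness is a made-up method standing in for the real ones:

  // unmodified: the Class.create pattern from the benchmark.
  Flog.RayTracer.Color = Class.create();
  Flog.RayTracer.Color.prototype = {
    initialize: function (r, g, b) { this.r = r; this.g = g; this.b = b; },
    brightness: function () { return (this.r + this.g + this.b) / 3; }
  };

  // "ctor": initialize becomes a real constructor; the prototype is
  // still replaced wholesale.
  Flog.RayTracer.Color = function (r, g, b) {
    this.r = r; this.g = g; this.b = b;
  };
  Flog.RayTracer.Color.prototype = {
    brightness: function () { return (this.r + this.g + this.b) / 3; }
  };

  // "ctor+proto": methods are added to the default prototype one at a time.
  Flog.RayTracer.Color = function (r, g, b) {
    this.r = r; this.g = g; this.b = b;
  };
  Flog.RayTracer.Color.prototype.brightness = function () {
    return (this.r + this.g + this.b) / 3;
  };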

Here are the results:

Benchmark     SM       SM+bug638794   V8
unmodified    4.51s    3.48s          1.25s
ctor          2.45s    1.47s*         0.94s
ctor+proto    1.55s*   1.47s*         0.93s

One of the most crucial aspects to these numbers is whether we can inline the GC allocation code. I've put a * next to the cases where we're doing the inlining.

Aside from allocations, I don't know much about what's going on here. I know that in all cases, we take a ton of ValueToBoolean stubs that convert a double to a boolean. In the ctor+proto case of SM+bug638794, the profiler says we spend 15% of the running time doing js_ValueToBoolean or stubs::ValueToBoolean. If we could get rid of these, it might get us down to ~1.25s. There are a few other stub calls that might be worth eliminating, but they're less significant.
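
For context, this is the kind of code that forces a double-to-boolean conversion (a hypothetical example, not a line from the benchmark):

  // 'if (d)' on a double must test against 0, -0, and NaN; when the
  // operand isn't known to be a boolean, we fall into the ValueToBoolean stub.
  function anyHit(distances) {
    for (var i = 0; i < distances.length; i++) {
      if (distances[i])      // distances[i] is a double; implicit ToBoolean
        return true;
    }
    return false;
  }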

I looked at how much time we spend in GC here, and it's pretty piddling--about 1%. As usual, our cache behavior is terrible. We take 105x as many cache misses as V8. It's not clear if this matters, though.

So in summary, if we take Brian's patch, plus some tweaks to ensure that allocations are inlined, and we get rid of the ValueToBoolean stub calls, we might be competitive here. The main thing I don't understand is whether the gap between unmodified and ctor in the SM+bug638794 case is entirely explained by inline allocations.
I'm pretty sure all the V8 benchmarks have been thoroughly analyzed at this point.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WORKSFORME