Closed Bug 1211795 Opened 4 years ago Closed 1 year ago

Calling another compartment's Array constructor is too slow.

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: daniel, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36

Steps to reproduce:

Subclassed an Array for an in-production module. 

Subclassing Method :: (Iframe injection) 

//Create an Iframe for the new script
var subClasser = document.createElement("iframe");
document.body.appendChild(subClasser);
var currentFrame = frames.length - 1;

//create a script that passes it's own Array type to a function in the global scope.
//within this function you can safely extend the prototype of the Array without pollution 
//of the global Array.  
frames[currentFrame].document.write(
    "<script>parent.InjectorFunction(Array);<\/script>"
);

//remove the Iframe
subClasser.parentNode.removeChild(subClasser);




Actual results:

Instantiating the subclassed array is much slower in firefox then expected. 
(compared with IE 11, chrome, and Safari) 

See jsPerf here :: http://jsperf.com/array-performance-test-sub


Expected results:

Speedy laser shark infested awesomeness
What the jsPerf test does is something like following:

  // preparation
  var subClasser = document.createElement("iframe");
  document.body.appendChild(subClasser);
  var SubArray = subClasser.contentWindow.Array;

  // test
  var myArray = new SubArray();

So, SubArray is an Array constructor function in another compartment, and the issue is that calling such function is too slow.
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Summary: Performance issue when Instantiating subclassed array. → Calling another compartment's Array constructor is too slow.
Version: 41 Branch → Trunk
Status: UNCONFIRMED → NEW
Ever confirmed: true
Right, because this is going through a cross-compartment wrapper, which is much slower than a normal function call.

We really need better JIT support for transparent cross-compartment wrappers.  Though note that in this case there's a limit on how fast this can be compared to the same-compartment case, since it needs to allocate two objects: the new array and a new cross-compartment wrapper for it.  It will also need to do at least the work of computing that cross-compartment wrapper, though we may be able to optimize _that_ a bit since we know we're creating a new object...  Compared to all that, the normal |new Array| allocation path is actually inlined in the JIT, so is _extremely_ fast.
Component: JavaScript Engine → JavaScript Engine: JIT
Depends on: 907369
I agree we should look into ways to make the most common operations on CCWs not totally suck. I think they'll always be slower but as long as it's not totally horrible that's probably fine. Maybe we could use a specialized VM function or an IC to get rid of some of the layers; we should investigate that.
That said, I tried profiling this just to see what it looked like and compared to the case of just allocating an Array and setting its first element (to keep the Array allocation from being entirely optimized out).  The subarray case is about 150x slower, just for scale, so 1% of the subarray case is about the entirety of the Array case.

Some interesting things:

1)  In the SubArray case, about 1/3 of the time is under InterruptCheck doing js::Nursery::collect.  This is about half GenericBuffer::trace calling WrapperMapRef::trace, about 1/4 Nursery::sweep calling _platform_bzero (via memset, I expect), and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling TenuringTracer::moveToTenured.  In the Array case, I see no GC activity at all, at least over the 5s I ran the test.  Terrence, what's up with that?

2)  The remaining 2/3 of the time is jit::InvokeFunction calling, eventually, proxy_Construct (via js::Construct, InternalConstruct).  Getting there takes a few percent of the time.  Then another percent or two to get to js::CrossCompartmentWrapper::construct.  The actual DirectProxyHandler::construct case is about 1/6 of the total time; the rest (about 1/3 of the total time) is JSCompartment::wrap.

The CrossCompartmentWrapper::construct lands us right back in js::Construct but now for the proxy target, which goes and calls ArrayConstructor, which ends up doing js:::Allocate and all that jazz.  We end up with _more_ gc under js::Allocate, which is where a lot of the time here is spent.

JSCompartment::wrap spends a bunch of time under JSCompartment::putWrapper and even more under Wrapper::New; no obvious GC here, at least....
Oh, the test I used is:

  // preparation
  var subClasser = document.createElement("iframe");
  document.body.appendChild(subClasser);
  var SubArray = subClasser.contentWindow.Array;

  // test
  function test() {
    var start = new Date;
    var count = 1000000;
    for (var i = 0; i < count; ++i) {
      var myArray = new SubArray();
      // var myArray = new Array();
      // myArray[0] = 1;
    }
    var stop = new Date;
    alert((stop - start) / count * 1e6);                    
  }
  test();
Depends on: 1211944
Thanks for the extremely fast response! It's much appreciated.
Clicking through bugs and I see:

(In reply to Boris Zbarsky [:bz] from comment #4)
> That said, I tried profiling this just to see what it looked like and
> compared to the case of just allocating an Array and setting its first
> element (to keep the Array allocation from being entirely optimized out). 
> The subarray case is about 150x slower, just for scale, so 1% of the
> subarray case is about the entirety of the Array case.
> 
> Some interesting things:
> 
> 1)  In the SubArray case, about 1/3 of the time is under InterruptCheck
> doing js::Nursery::collect.  This is about half GenericBuffer::trace calling
> WrapperMapRef::trace, about 1/4 Nursery::sweep calling _platform_bzero (via
> memset, I expect), and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling
> TenuringTracer::moveToTenured.  In the Array case, I see no GC activity at
> all, at least over the 5s I ran the test.  Terrence, what's up with that?

ni? to Terrence; maybe this has already been discussed someplace else, but it'd be nice to have it in the bug?
Flags: needinfo?(terrence)
(In reply to Boris Zbarsky [:bz] from comment #4)
> That said, I tried profiling this just to see what it looked like and
> compared to the case of just allocating an Array and setting its first
> element (to keep the Array allocation from being entirely optimized out). 
> The subarray case is about 150x slower, just for scale, so 1% of the
> subarray case is about the entirety of the Array case.
> 
> Some interesting things:
> 
> 1)  In the SubArray case, about 1/3 of the time is under InterruptCheck
> doing js::Nursery::collect.

Ouch! In theory we shouldn't be doing more nursery collection as these create the same number of nursery objects. In practice, the CCW map is going to be creating store buffer edges that are going to result in FULL_STORE_BUFFER collections. In general, the nursery wants to treat all edges as strong so that collection work is not proportional to the number of dead objects. There are exceptions though: wrapper cache and typed arrays. It sounds like we need a third, which means we need to grow a generic mechanism.

> This is about half GenericBuffer::trace calling
> WrapperMapRef::trace,

Yes. The generic buffer is slow. Converting the wrapper map to two-level hashing should fix this (or at least move the performance cliff elsewhere). That's currently blocked on making CrossCompartmentKey into a mozilla::Variant. This is blocked on us having generic lambdas; otherwise we need to add ~5 new functors that are almost-but-not-quite the same as an existing set of functors for DispatchTyped usage. This is blocked on us getting MSVC2015, which finally adds support for those. And to a certain extent on GCC 4.9, which IIRC was crashing if we used a certain decltype that forces the Variant match functor to use a slightly different calling convention than the other functors. That's all in the works now though, finally.

> about 1/4 Nursery::sweep calling _platform_bzero (via
> memset, I expect),

This should not be happening. We only overwrite the nursery when poisoning and poison bytes are non-zero. I'm not sure what's going on there. Do you have a stack? 

> and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling
> TenuringTracer::moveToTenured.

I'd guess the cross-compartment proto edge is causing the wrapper map to hold the entire minor heap live. I'm kinda surprised that this isn't a larger percentage of the time.

> In the Array case, I see no GC activity at
> all, at least over the 5s I ran the test.  Terrence, what's up with that?

This doesn't seem plausible. The nursery maxes out at 16MiB (and needs 5 collections to get there in the first place). On 32bit an Object4 is 48 bytes, which means that we'd be using a minimum of 48MiB over the course of the test. If there was system activity before, the nursery could already be expanded, but that's still over the limit.

Maybe the elements are COW? I thought that was only for larger element vectors, but if that's the case then the objects would only be 16MiB. That still isn't going to fit because of heap overhead, but at least is closer to plausible.

The only other possibility is that these arrays are getting allocated Tenured. That doesn't seem that likely though if we're running in the jit the entire time.

> 2)  The remaining 2/3 of the time is jit::InvokeFunction calling,
> eventually, proxy_Construct (via js::Construct, InternalConstruct).  Getting
> there takes a few percent of the time.  Then another percent or two to get
> to js::CrossCompartmentWrapper::construct.  The actual
> DirectProxyHandler::construct case is about 1/6 of the total time; the rest
> (about 1/3 of the total time) is JSCompartment::wrap.
> 
> The CrossCompartmentWrapper::construct lands us right back in js::Construct
> but now for the proxy target, which goes and calls ArrayConstructor, which
> ends up doing js:::Allocate and all that jazz.  We end up with _more_ gc
> under js::Allocate, which is where a lot of the time here is spent.

I think all proxies have a finalizer right now, so this is going to be tenured allocations, which are considerably slower than nursery allocations.

> JSCompartment::wrap spends a bunch of time under JSCompartment::putWrapper
> and even more under Wrapper::New; no obvious GC here, at least....

This might be map overhead.
Flags: needinfo?(terrence)
ni? bz for questions in comment 8.
Flags: needinfo?(bzbarsky)
> Do you have a stack? 

This is the only question in comment 8, right?  The stack is pretty simple:

 js::jit::InvokeFunction
 js::jit::InterruptCheck
 JSRuntime::handleInterrupt
 js::gc::GCRuntime::gcIfRequested
 js::gc::GCRuntime::minorGC
 js::Nursery::collect
 js::Nursery::sweep
 _platform_bzero$VARIANT$Unknown

In this case, the bzero call takes 350ms out of a total running time of 5800ms.

What I don't have are line numbers.  Could this be cellsWithUid_.clear()?  Or the initChunk calls?

I tried breakpointing in a debugger, and if it's not lying to me there are bzero calls coming from js::NewObjectCache::clearNurseryObjects, which is called from js::Nursery::collect.  So it's possible the profiler is mis-blaming those hits and they actually belong in collect?  Doing this in opt builds is no fun.
Flags: needinfo?(bzbarsky)
Depends on: 1245965
Blocks: 1245974
Depends on: 1237058
Priority: -- → P3
Depends on: 1338802

OK, so with the CCW overhead eliminated, this is about 6x faster, but still much slower than creating an Array from the same global.

I created profiles for the Array case (https://perfht.ml/2SBI9DW) and the SubArray case (https://perfht.ml/2Sx4sKD) using the testcase in comment 5. The latter shows we're calling js::ArrayConstructor and then doing a bunch of work with ObjectGroup, etc, etc. The former shows us calling js::jit::NewArrayWithGroup directly from jitcode.

Presumably there's some same-realm assumption baked in there (e.g. we'd need to change Realm to call NewArrayWithGroup from jitcode in the SubArray case). Worth a separate bug blocking this one?

Flags: needinfo?(jdemooij)
Depends on: 1527822

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

OK, so with the CCW overhead eliminated, this is about 6x faster, but still much slower than creating an Array from the same global.

I filed bug 1527822.

Flags: needinfo?(jdemooij)
Depends on: 1527843

Can we close this now?

Flags: needinfo?(bzbarsky)

We should at least fix bug 1527822 and then remeasure. And decide whether we care at all about the remaining cross-compartment case.

Flags: needinfo?(bzbarsky)

Ah, I missed the other bug - sorry about that.

Now I get:

Browser Time
Firefox 65 513
Safari 57
Chrome 21
Nightly 19

Without the myArray[0] = 1:

Browser Time
Firefox 65 427
Safari 32
Chrome 20
Nightly 2

(Yes we're likely optimizing away the unused allocation.)

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.