CrossCompartmentWrapper handling is slow on Google Spreadsheet

RESOLVED FIXED in Firefox 51

Status


defect
RESOLVED FIXED
4 years ago
8 months ago

People

(Reporter: smaug, Assigned: jonco)

Tracking

(Blocks 5 bugs)

36 Branch
mozilla51

Firefox Tracking Flags

(platform-rel +, firefox51 fixed)

Details

(Whiteboard: [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSheets])

Attachments

(5 attachments, 1 obsolete attachment)

I was profiling scrolling (using touchpad's 2-finger scrolling when cursor is over the 'comment' column) of https://docs.google.com/spreadsheets/d/10UeyRoiWV2HjkWwAU51HXyXAV7YLi4BjDm55mr5Xv6c/edit?pref=2&pli=1#gid=368835050
and plenty of the time is spent under CrossCompartmentWrapper::call.
Seems like lots of new Proxy objects are being created (creation taking ~7% of total time!), which also leads to rather frequent minorGCs.
(profile looks pretty much the same in e10s-child-process and non-e10s)


I don't know if we cache or reuse those proxy objects, but if we do, perhaps we shouldn't be doing so much
minorGCing? Anyhow, someone more familiar with the JS engine should profile this.
Terrence, do you have time to look at this? If not, need some way to get it on the JS team's radar.

Martin, aside from pointing at various people, what is the current best-practice (for someone without the cycles to get involved) to make sure a useful and actionable report like this one doesn't float away?
Flags: needinfo?(terrence)
Flags: needinfo?(mbest)
Proxies have a finalizer, so we don't actually put them in the nursery. I'll take a look.
(In reply to Bobby Holley (busy) from comment #1)
> Terrence, do you have time to look at this? If not, need some way to get it
> on the JS team's radar.
> 
> Martin, aside from pointing at various people, what is the current
> best-practice (for someone without the cycles to get involved) to make sure
> a useful and actionable report like this one doesn't float away?

Good question.  We have it on the plan for 2016 to get someone focused on improved Google Docs support.  Once we have that person identified, they can drive the effort to catalog them.  I'll defer to JST on the best way to handle things in the meantime.  CC me on them in any case, until we get a clearer path.
Flags: needinfo?(mbest) → needinfo?(jst)
(In reply to Martin Best (:mbest) from comment #3)
> (In reply to Bobby Holley (busy) from comment #1)
> > Terrence, do you have time to look at this? If not, need some way to get it
> > on the JS team's radar.
> > 
> > Martin, aside from pointing at various people, what is the current
> > best-practice (for someone without the cycles to get involved) to make sure
> > a useful and actionable report like this one doesn't float away?
> 
> Good question.  We have it on the plan for 2016 to get someone focused on
> Google Docs improved support.  Once we have that person identified, they can
> drive the effort to catalog them.  I'll defer to JST on the best way to
> handle things in the mean time.  CC me on them in any case, until we get a
> clearer path.

The question I'm asking is more general than google-docs performance specifically. If I see a bug that may be important to some cross-cutting metric that we probably care about (or might care about it in the future), I want some way to get it on the radar of people who are keeping track of these things.

Using the org structure doesn't work well, because we clearly don't want to put it all on jst or dbryant's plate, but the owning team may not be immediately obvious either. And even if it _is_ obvious, it's a crapshoot as to whether the manager of that team is interested in that metric in the moment.

In the status quo, there are lots of ways for bugs like this to drift away. When that happens, we have to start from scratch when someone upstairs decides to investigate and prioritize that metric.

This is why I think we need to do some level of tracking within platform planning / product management for bugs that impact a top-level metric that may or may not have a "driver" at this time. For all I knew when I wrote comment 1, that Google Docs driver might have existed already, because it's hard to keep track of who in management is paying attention to what.

Basically, I want a well-publicized inbox to the PM machine that will eventually route things like this to the appropriate driver (which may not exist at the time the bug is filed). At the moment I don't know of a better way than NI mbest, which is why I'm proposing that you create and publicize one. :-)
Flags: needinfo?(mbest)
Yeah, all good points and I agree.  The challenge is that we have a serious shortage of the types of roles you mention on Platform.  This question has come up a few times recently and I don't have a good answer for you this second but let me take this away and see what we can come up with.  Does seem like a good investment.

We definitely don't want it going to JST but a system by which we can find owners is something he should be involved in defining as largely his teams will be impacted.
Flags: needinfo?(mbest)
Sounds great - thanks Martin!
Terrence, let me know if I can help here. I'm interested in these real-world performance issues and I've seen CCW stuff show up in other profiles.
The Gecko profiler isn't too good at showing how much time is spent in some method in total (compared to some other profiling tools), but here is a profile anyhow
https://cleopatra.io/#report=aca131ad96ec741ee4acf33a5b665740d97acdae
It is really almost all JS, and cross-compartment handling showing up everywhere.
(In reply to Olli Pettay [:smaug] from comment #8)
> Gecko profiler isn't too good at showing how much time is totally spent in
> some method (comparing to some other profiling tools)

Inverting the callstack definitely helps. You can also just click on the arrow icon to the right of a function and see the time spent in it, though that doesn't always work reliably for me.

>, but here is a profile
> anyhow
> https://cleopatra.io/#report=aca131ad96ec741ee4acf33a5b665740d97acdae
> It is really almost all JS, and cross-compartment handling showing up
> everywhere.

Some interesting things that I see:

* We do an enormous number of minor GCs, which cause us to spend a combined ~6% in MarkTypeUnbarriered and IonEntry::mark.
* 1.7% in JitcodeGlobalTable::lookupInternal.

CCW overhead doesn't actually dominate the profile in the way I'd expect, but I do see a lot of time spread throughout the interpreter, so it's possible that they're forcing us off-JIT, which could be the real problem.

NI Jan to get his thoughts.
Flags: needinfo?(jdemooij)
Well, if you look at the "most time spent in", CCW is very high there. And in Zoom this is easier to see (it tends to give better data for low level stuff).
(In reply to Olli Pettay [:smaug] from comment #10)
> Well, if you look at the "most time spent in", CCW is very high there. And
> in Zoom this is easier to see (it tends to give better data for low level
> stuff).

I'm not totally clear on the semantics of "most time spent in". JSCompartment::putWrapper is at the top of "most time spent in" (not sure what it's sorted by), but when I invert the callstack and select putWrapper, I see 186 ms, which isn't small, but is smaller than some other things on there.

BenWa, can you clarify the above, and also how to find the amount of time spent in a function for all the different possible ways of getting there? If I click the arrow next to putWrapper I get an empty list.
Flags: needinfo?(bgirard)
"most time spent in" is just a shortcut that shows what you'd see if you expended the tree all the way. This will give you the left most derivation of the tree starting at your current selection. It might not in fact be the heaviest leaf and wont make any attempt to locate it. It's just a helper to see peek at the left most expansion of your selection.
Flags: needinfo?(bgirard)
(In reply to Benoit Girard (:BenWa) from comment #12)
> "most time spent in" is just a shortcut that shows what you'd see if you
> expended the tree all the way. This will give you the left most derivation
> of the tree starting at your current selection. It might not in fact be the
> heaviest leaf and wont make any attempt to locate it. It's just a helper to
> see peek at the left most expansion of your selection.

Ok. So rather than a series of heavy functions, it is actually just the callstack of the "heaviest" single call. And that stack is only really meaningful up until (traversing the callstack top to bottom) we hit a function whose heaviest callee is much lighter than itself (i.e. the weight disperses among several callees). Which is exactly what has happened by the time we reach putWrapper (0.1% with that callstack in that profile).

Is there a way to sum up the time spent in a function, regardless of callstack?
Flags: needinfo?(bgirard)
You can filter by a function name. It will throw out any samples that don't have a frame with the function in them. When adding the filter, keep track of the total sample count before/after and the timeline before/after (non-matching samples will become blank in the timeline).
Flags: needinfo?(bgirard)
Thanks BenWa, that's super-helpful.

So, out of ~11374 ms between the PollWrapper calls (which is presumably the meat of the profile), we spend:
* 948 ms in GCRuntime::collect
* 3169 ms in GCRuntime::minorGC
* 1298 ms in JSCompartment::wrap (~800 of which is in ProxyObject::New)
* 1983 ms in canvas drawImage (almost all of which is billed to sse2_composite_src_x888_8888 and friends)

That all adds up to 7398, which leaves 3976 ms of other stuff which I wasn't able to easily categorize into heavy functions. A lot of this is JS execution spread across the VM. It does seem pretty clear to me that we're spending more time in the interpreter than we should, probably related to the fact that we can't JIT code crossing CCWs.

I can't tell entirely, but it seems pretty clear to me that almost all of the objects being allocated are also cross-compartment wrappers. This supports the theory that we're spending about half our time allocating and GCing cross-compartment wrappers.

Any brilliant ideas on what we can do about that?
NI efaust and jorendorff for ideas.
Flags: needinfo?(jorendorff)
Flags: needinfo?(efaustbmo)
Are we nursery-allocating cross-compartment wrappers?  Are the wrapper targets in the nursery?  If not, would it make sense to also avoid nursery-allocating the wrapper?
(In reply to Boris Zbarsky [:bz] from comment #17)
> Are we nursery-allocating cross-compartment wrappers?  Are the wrapper
> targets in the nursery?  If not, would it make sense to also avoid
> nursery-allocating the wrapper?

Almost all of the time in GCRuntime::minorGC is in Nursery::collect, which suggests to me that we are. But I suppose there's a chance that these things aren't in the nursery, and we're just banging our head against the wall collecting the nursery over and over again and getting nowhere.

Steve, can you weigh in?
Flags: needinfo?(sphink)
(In reply to Bobby Holley (busy) from comment #18)
> (In reply to Boris Zbarsky [:bz] from comment #17)
> > Are we nursery-allocating cross-compartment wrappers?  Are the wrapper
> > targets in the nursery?  If not, would it make sense to also avoid
> > nursery-allocating the wrapper?

Oh wait, I misunderstood your question - you're suggesting that we _don't_ nursery-allocate the proxies. That might indeed make sense.
(In reply to Bobby Holley (busy) from comment #19)
> (In reply to Bobby Holley (busy) from comment #18)
> > (In reply to Boris Zbarsky [:bz] from comment #17)
> > > Are we nursery-allocating cross-compartment wrappers?  Are the wrapper
> > > targets in the nursery?  If not, would it make sense to also avoid
> > > nursery-allocating the wrapper?
> 
> Oh wait, I misunderstood your question - you're suggesting that we _don't_
> nursery-allocate the proxies. That might indeed make sense.

s/don't/stop doing it if we're doing it/.
Right.  I'm suggesting that for the particular case of a cross-compartment wrapper for a tenured object we consider not nursery-allocating it, on the assumption that the lifetimes will be somewhat similar.  Of course this may not be a very good assumption...
(In reply to Bobby Holley (busy) from comment #18)
> and we're just banging our head against the wall
> collecting the nursery over and over again and getting nowhere.

Perhaps I'm misunderstanding what you're saying here, but if an object in the nursery is still alive by the time we do a minor GC, it will get moved to the tenured heap. Of course, if *a lot* of objects are still alive by the time we clear the nursery, we'll spend a lot of time moving objects out of it, so if objects are long-lived we want to pre-tenure them so they skip the nursery altogether. We have some heuristics for this - perhaps they aren't working out here.
My hypothesis as to what's happening here goes like this.  We have some objects that are long-lived, over in global A.  Code from global B touches them, we create cross-compartment wrappers, put those in the nursery.  We do a minor gc, collect all those wrappers.  The code from B touches the objects again, we create new wrappers, etc.  As a simple example, think of an array of objects over in A, B is holding a reference to that array (via a cross-compartment wrapper) and doing stuff with its entries.  Every time it touches an entry, we create a cross-compartment wrapper for that entry, then the nursery fills up as we touch other entries and we minorgc, then we touch that entry again and need a new wrapper.

Again, this is a hypothesis based on no instrumentation and just the profile numbers from comment 15, so it could be totally off.  We should measure.
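The hypothesized churn can be sketched as a toy model outside the engine. In this sketch, `WrapperMap`, `wrap`, and `minorGC` are invented names (not SpiderMonkey APIs); the point is only to contrast wrappers that survive between touches with wrappers that die at every minor GC:

```javascript
// Toy model of the hypothesis above: compartment B repeatedly touching
// long-lived objects in compartment A. All names here are invented for
// illustration; none of this is SpiderMonkey API.
class WrapperMap {
  constructor() {
    this.map = new Map();   // wrappee -> wrapper
    this.created = 0;       // total wrappers ever allocated
  }
  // Get (or create) the wrapper for a target object.
  wrap(target) {
    let w = this.map.get(target);
    if (!w) {
      w = { target };
      this.map.set(target, w);
      this.created++;
    }
    return w;
  }
  // Simulate a minor GC that discards all nursery-allocated wrappers.
  minorGC() {
    this.map.clear();
  }
}

const arr = Array.from({ length: 4 }, () => ({})); // long-lived objects in "A"

// Wrappers survive between touches: each target is wrapped exactly once.
const cached = new WrapperMap();
for (let pass = 0; pass < 3; pass++)
  for (const obj of arr) cached.wrap(obj);
console.log(cached.created);  // 4

// Wrappers die at every minor GC: each pass re-creates all of them.
const churned = new WrapperMap();
for (let pass = 0; pass < 3; pass++) {
  for (const obj of arr) churned.wrap(obj);
  churned.minorGC();          // nursery fills up between touches
}
console.log(churned.created); // 12
```

If the real wrappers behave like the second loop, allocation and minor-GC cost scales with the number of touches rather than the number of distinct wrapped objects.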
Not really better with bz's patch to allocate Proxies more often in the tenured heap.
This is still all Compartment/Proxy/minorGC stuff. Seen both in Zoom and in
https://cleopatra.io/#report=80db192e20cdff1711f9e228dd6359bc0d6a55bc

I was looking at our old bugs: bug 816159, bug 819723, bug 907369, Bug 1021223
Looks like each minor GC isn't taking up *that* much time, but there sure are a lot of them.

Also, it seems slightly fishy to me that marking generic things is taking most of the minor GC's time -- I was under the impression that generic things are a slow path and we should usually be hitting the specific buffers (e.g. the whole-cell or slots edges).
The minor gc reason is overwhelmingly FULL_STORE_BUFFER.

Speculation: perhaps we are hitting a slow path where we are repeatedly filling the generic edges store buffer?
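The speculation above can be illustrated with a toy simulation. The `simulate` helper and the capacity numbers are invented for illustration (they are not the engine's actual store buffer sizes); the sketch just shows why routing most post-barrier edges through a small generic buffer forces frequent FULL_STORE_BUFFER minor GCs:

```javascript
// Toy model: if most tenured->nursery edges land in the slow generic store
// buffer, it fills quickly and each fill triggers a minor GC with reason
// FULL_STORE_BUFFER. Sizes and names are invented for illustration.
function simulate(edgeCount, bufferCapacity) {
  let buffered = 0;
  let minorGCs = 0;
  for (let i = 0; i < edgeCount; i++) {
    buffered++;                       // record a generic post-barrier edge
    if (buffered >= bufferCapacity) { // buffer full -> minor GC
      minorGCs++;
      buffered = 0;                   // collection empties the buffer
    }
  }
  return minorGCs;
}

// A small generic buffer forces far more collections than a large one.
console.log(simulate(100000, 1000));  // 100
console.log(simulate(100000, 10000)); // 10
```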
Also, there is already an assertion right above that line that we don't allocate wrappers in the nursery, so it makes sense that patches attempting to do what is already being done had no effect.
Blocks: 1245974
Flags: needinfo?(jorendorff)
(In reply to Boris Zbarsky [:bz] from comment #23)
> My hypothesis as to what's happening here goes like this.  We have some
> objects that are long-lived, over in global A.  Code from global B touches
> them, we create cross-compartment wrappers, put those in the nursery.  We do
> a minor gc, collect all those wrappers.

The cross compartment wrapper map currently creates strong edges, so we'll actually be tenuring the wrappers at this point.
I've been looking at this problem (actually bug 1258172, which seems to be the same thing) for the past couple of days. I've yet to figure out exactly what's going on, just wanted to note that I've finally freed up some cycles to look at this.

The following shell script seems to exhibit the same symptom:

var N = 1000000;
var g = newGlobal();
g.arr = new g.Array(N);

var i = 0;
for (var v of g.arr) {
    i += 1;
}
print(i);
Clearing the NI as Terrence is on it.
Flags: needinfo?(jdemooij)
Assignee: nobody → terrence
Ditto jandem
Flags: needinfo?(efaustbmo)
Depends on: 1259578
Depends on: 1259580
Depends on: 1232417
Stable hashing for the CCW map may or may not be faster here. I'll have to profile it, and possibly land it first if it makes the implementation of nursery-allocated wrappers easier.
See Also: → 1225577
I managed to improve performance on the synthetic shell benchmark above from 200ms to 100ms. Unfortunately, when porting to the browser it became apparent that the mechanism I'm using cannot work in the presence of RecomputeWrappers and brain transplants.

The issue is historical and, unfortunately, totally unavoidable. When a page navigates, or a DOM element is transferred between origins, or the page origin changes via document.domain, all references to the old bare object or wrapper need to become a wrapper, potentially of a different type. This means that the entire character of the object has to change in place, including the number of slots. This violates pretty much every major assumption in the GC; however, without the ability to find all pointers, this is the only way to do it at all.

Unfortunately, if we want to nursery-allocate some wrappers, we are going to have to solve the above problem in a less restrictive way for all wrappers. Luckily, in the intervening years we've grown much better tools to do this sort of thing: the ability to find all references in the heap, barriers to do this quickly for a subset of objects, bump-allocation heaps, comprehensible GC phases, strong typing to tell us what kind of thing is being pointed to, etc.

I need to give this some more thought, but we should be able to solve both problems at once and without adding too much new code.
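The transplant constraint described above can be illustrated with a toy sketch: every pre-existing reference must observe the object's new character, so the contents are swapped in place rather than the object being replaced. `swapGuts` below is an invented helper and only a very loose analogue of TradeGuts (it swaps own properties; the real operation also changes the class and slot layout):

```javascript
// Toy illustration of an identity-preserving "guts swap": exchange all own
// properties of two objects so that every existing reference to either one
// observes the other's contents. swapGuts is invented for this sketch.
function swapGuts(a, b) {
  const aProps = Object.getOwnPropertyDescriptors(a);
  const bProps = Object.getOwnPropertyDescriptors(b);
  for (const k of Object.keys(aProps)) delete a[k];
  for (const k of Object.keys(bProps)) delete b[k];
  Object.defineProperties(a, bProps);
  Object.defineProperties(b, aProps);
}

const target = { kind: "bare object", slots: 2 };
const wrapper = { kind: "wrapper", slots: 4 };
const aliasOfTarget = target; // a pre-existing reference held elsewhere

swapGuts(target, wrapper);

// The alias sees the new character without ever being updated.
console.log(aliasOfTarget.kind);  // "wrapper"
console.log(aliasOfTarget.slots); // 4
```

In the engine, the "number of slots" part is what breaks GC assumptions: a nursery allocator sizes cells up front, so an object whose size can change in place does not fit the nursery's model.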
This is surprisingly nice, considering what it's doing, but still adds some really bothersome complexity. I'm loath to take a hack of this magnitude, but unfortunately, we do really need better Google Docs performance urgently. While the other options we discussed would be more generally palatable, a patch in the tree is better than two in the trello.
Flags: needinfo?(terrence)
Attachment #8759769 - Flags: review?(jcoppeard)
Comment on attachment 8759769 [details] [diff] [review]
allocate_proxies_in_the_nursery-v0.diff

Review of attachment 8759769 [details] [diff] [review]:
-----------------------------------------------------------------

I think you must have posted an old version of this patch because it doesn't compile for me.  I tweaked it until I got it to build, but then the tests didn't pass.  So I'm cancelling review for now.

That said, it looks good.  I do have a bunch of questions though.

Firstly, how do we mark wrapped things where the wrapper is reachable in a minor GC?  It wasn't clear to me how that happens.

::: js/public/Proxy.h
@@ +220,5 @@
>      }
>  
> +    virtual bool canNurseryAllocate() const {
> +        /*
> +         * If the proxy handler finalization is non-effectful, then we can

It would be good to have a more specific explanation of what 'non-effectful' means in this context.

::: js/src/gc/Nursery.h
@@ +97,5 @@
> + */
> +inline bool
> +CanNurseryAllocateFinalizedClass(const js::Class* const clasp)
> +{
> +    bool allow = (clasp->flags & JSCLASS_SKIP_NURSERY_FINALIZE);

Should we assert clasp->hasFinalize()?  If this is called for non-finalized classes then the name is confusing.

Also, can we add a case in here for proxies?  I saw a couple of places where we call this and then also check if the object is a proxy.

::: js/src/jsapi-tests/testGCMarking.cpp
@@ +143,5 @@
> +        ++count;
> +    return count;
> +}
> +
> +BEGIN_TEST(testDeadNurseryCCW)

Thanks for adding tests.

::: js/src/proxy/CrossCompartmentWrapper.cpp
@@ +615,5 @@
>  js::RecomputeWrappers(JSContext* cx, const CompartmentFilter& sourceFilter,
>                        const CompartmentFilter& targetFilter)
>  {
> +    // Drop any nursery-allocated wrappers.
> +    cx->runtime()->gc.evictNursery(JS::gcreason::EVICT_NURSERY);

Does this drop them or tenure them?  Hmm, I guess it drops them if they are not reachable.

::: js/src/vm/Debugger.cpp
@@ +1047,5 @@
>          envobj = &p->value()->as<NativeObject>();
>      } else {
>          /* Create a new Debugger.Environment for env. */
>          RootedObject proto(cx, &object->getReservedSlot(JSSLOT_DEBUG_ENV_PROTO).toObject());
> +        NewObjectKind newKind = IsInsideNursery(env) ? GenericObject : TenuredObject;

Maybe factor this out into NewProxyObjectKind?

::: js/src/vm/ProxyObject.cpp
@@ +48,5 @@
> +          ? TenuredObject
> +          : handler->canNurseryAllocate()
> +            ? NurseryAllocatedProxy
> +            : TenuredObject;
> +

This would be easier to read as a series of if/else if statements.

@@ +77,5 @@
>  
> +gc::AllocKind
> +ProxyObject::allocKindForTenure() const
> +{
> +    //auto self = const_cast<ProxyObject*>(this);

Nit: remove commented out code.

@@ +98,5 @@
> +    else
> +        trc->runtime()->gc.nursery.removeMallocedBuffer(psrc.data.values);
> +
> +    // Trace all values in ProxyValueArray to ensure that any nursery
> +    // allocated objects also get moved.

Doesn't this happen when we call the trace hook in marking?
Attachment #8759769 - Flags: review?(jcoppeard)
D'oh! I indeed appear to have folded the wrong patch.
(In reply to Jon Coppeard (:jonco) from comment #42)
> Comment on attachment 8759769 [details] [diff] [review]
> allocate_proxies_in_the_nursery-v0.diff
> 
> Review of attachment 8759769 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think you must have posted an old version of that patch because this
> doesn't compile for.  I tweaked it until I got it to build, but then the
> tests's didn't pass.  So cancelling review for now.
> 
> That said, it looks good.  I do have a bunch of questions though.
> 
> Firstly, how do we mark wrapped things where the wrapper is reachable in a
> minor GC?  It wasn't clear to me how that happens.

I'm not entirely sure what you're asking: whether it's the strategy in general or only the case where we drop the wrapper.

The relevant code is in NurseryAwareHashMap::sweepAfterMinorGC, keeping in mind that IsAboutToBeFinalized during a nursery GC will automatically relocate a moved (i.e. live) reference. In this code, we either drop the wrapper from the table if the wrappee is dead, or move it (noting the unused return from needsSweep(&key), where we're actually just moving the key). We can assume that the key is not dead here, because we've already called needsSweep(&value), and the value contains a strong edge to the key.

If you're asking what happens if we drop because the wrapper is dead and *don't* move the wrappee from this code: either the other references to the wrappee will have been barriered, in which case it has already been moved, or it is simply also dead and will get swept when we reset the nursery head pointer.
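The sweep logic described above can be sketched as a toy model. Here each "cell" is a plain object whose `dead` and `forwardedTo` fields stand in for the GC's liveness and forwarding-pointer state; `sweepAfterMinorGC` is a simplified stand-in for the real NurseryAwareHashMap method, not its actual code:

```javascript
// Toy model of the post-minor-GC sweep: drop entries whose wrapper (value)
// died, and update surviving keys to their relocated (tenured) addresses.
// All names and the cell representation are invented for this sketch.
function sweepAfterMinorGC(map) {
  const swept = new Map();
  for (const [key, value] of map) {
    if (value.dead) continue;               // wrapper died: drop the entry
    const newKey = key.forwardedTo || key;  // wrappee tenured: move the key
    const newValue = value.forwardedTo || value;
    swept.set(newKey, newValue);
  }
  return swept;
}

const tenuredWrappee = { name: "wrappee (tenured copy)" };
const liveKey = { forwardedTo: tenuredWrappee };        // moved by the GC
const liveWrapper = { dead: false, forwardedTo: null };
const deadKey = { forwardedTo: null };
const deadWrapper = { dead: true, forwardedTo: null };

const map = new Map([[liveKey, liveWrapper], [deadKey, deadWrapper]]);
const swept = sweepAfterMinorGC(map);

console.log(swept.size);                // 1
console.log(swept.has(tenuredWrappee)); // true
```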

> ::: js/public/Proxy.h
> @@ +220,5 @@
> >      }
> >  
> > +    virtual bool canNurseryAllocate() const {
> > +        /*
> > +         * If the proxy handler finalization is non-effectful, then we can
> 
> It would be good to have a more specific explanation of what 'non-effectful'
> means in this context.

Yeah, that's way too hand-wavy; I should have revisited that after learning in more detail what's going on. How about:

        /*
         * Nursery allocation is allowed if and only if it is safe to not
         * run |finalize| when the ProxyObject dies.
         */

> ::: js/src/gc/Nursery.h
> @@ +97,5 @@
> > + */
> > +inline bool
> > +CanNurseryAllocateFinalizedClass(const js::Class* const clasp)
> > +{
> > +    bool allow = (clasp->flags & JSCLASS_SKIP_NURSERY_FINALIZE);
> 
> Should we assert clasp->hasFinalize()?  If this is called for non-finalized
> classes then the name is confusing.
>
> Also, can we add a case in here for proxies?  I saw a couple of places where
> we call this and then also check if the object is a proxy.

Well, that's sort of the core problem with this patch. Proxies *have to* share the same JSClass so that we can TradeGuts them. That JSClass cannot have nursery allocation in general, for several good reasons largely related to gecko's extensive use of DomProxy.

Previously, our only criterion for "is it nursery allocable" was the code here (which I abstracted out for [I thought] more clarity). The new test is: |CanNurseryAllocateFinalizedClass(clasp) || (clasp->isProxy() && handler->canNurseryAllocate())|. This is simple enough, but there is one major downside: we don't have access to |handler| in JSObject::create, where we've typically made the "do we nursery allocate it" decision.

Observation: there is one other case where the class does not provide enough information for us to choose the right heap and we have to instead pass that information through manually: singletons. That case happens to force us to allocate in the tenured heap instead of the nursery, where the class would naively indicate. There's no reason we shouldn't be able to use the same mechanism to force allocation in the nursery where the class would otherwise indicate tenured (other than the assertions enforcing this, of course). Thus, I introduced NurseryAllocatedProxy to force nursery allocation, with the caveat that clasp->isProxy() must be true. When viewed as the contrapositive of singleton allocation, this is even somewhat elegant.
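The combined decision described above can be sketched as pseudocode. The field names on these plain objects (`skipNurseryFinalize`, `isProxy`, `canNurseryAllocate`) are invented stand-ins for the C++ `JSClass` flags and handler hook; the logic mirrors the test `CanNurseryAllocateFinalizedClass(clasp) || (clasp->isProxy() && handler->canNurseryAllocate())`, with the singleton and NurseryAllocatedProxy overrides:

```javascript
// Sketch of the allocation-heap decision for new objects. Plain-object
// fields here are invented analogues of the C++ class flags and handler.
function chooseHeap(clasp, handler, newKind) {
  if (newKind === "SingletonObject")
    return "tenured"; // singletons always force the tenured heap
  if (clasp.skipNurseryFinalize)
    return "nursery"; // analogue of JSCLASS_SKIP_NURSERY_FINALIZE
  if (clasp.isProxy && handler && handler.canNurseryAllocate)
    return "nursery"; // NurseryAllocatedProxy: handler opts in despite the class
  return "tenured";
}

const proxyClass = { isProxy: true, skipNurseryFinalize: false };
const ccwHandler = { canNurseryAllocate: true };  // safe to skip finalize
const domHandler = { canNurseryAllocate: false }; // e.g. Gecko DOM proxies

console.log(chooseHeap(proxyClass, ccwHandler, "GenericObject"));   // "nursery"
console.log(chooseHeap(proxyClass, domHandler, "GenericObject"));   // "tenured"
console.log(chooseHeap(proxyClass, ccwHandler, "SingletonObject")); // "tenured"
```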

As to the core complaint here: do you have a suggestion for a better name with the above context in mind?

> ::: js/src/jsapi-tests/testGCMarking.cpp
> @@ +143,5 @@
> > +        ++count;
> > +    return count;
> > +}
> > +
> > +BEGIN_TEST(testDeadNurseryCCW)
> 
> Thanks for adding tests.

I don't think this patch would be feasible without TDD.

> ::: js/src/proxy/CrossCompartmentWrapper.cpp
> @@ +615,5 @@
> >  js::RecomputeWrappers(JSContext* cx, const CompartmentFilter& sourceFilter,
> >                        const CompartmentFilter& targetFilter)
> >  {
> > +    // Drop any nursery-allocated wrappers.
> > +    cx->runtime()->gc.evictNursery(JS::gcreason::EVICT_NURSERY);
> 
> Does this drop them or tenure them?  Hmm, I guess it drops them if they are
> not reachable.

It does indeed drop any that are not reachable *before* we tradeguts them. I think this path is only used for document.domain, so I think the performance is irrelevant here, regardless.

> ::: js/src/vm/Debugger.cpp
> @@ +1047,5 @@
> >          envobj = &p->value()->as<NativeObject>();
> >      } else {
> >          /* Create a new Debugger.Environment for env. */
> >          RootedObject proto(cx, &object->getReservedSlot(JSSLOT_DEBUG_ENV_PROTO).toObject());
> > +        NewObjectKind newKind = IsInsideNursery(env) ? GenericObject : TenuredObject;
> 
> Maybe factor this out into NewProxyObjectKind?

What is NewProxyObjectKind?

Note: this is only for performance and only required then because the Debugger is skipping NewProxyObject here.

> ::: js/src/vm/ProxyObject.cpp
> @@ +48,5 @@
> > +          ? TenuredObject
> > +          : handler->canNurseryAllocate()
> > +            ? NurseryAllocatedProxy
> > +            : TenuredObject;
> > +
> 
> This would be easier to read as a series of if/else if statements.

Yeah, it got out of hand: fixed.

> @@ +77,5 @@
> >  
> > +gc::AllocKind
> > +ProxyObject::allocKindForTenure() const
> > +{
> > +    //auto self = const_cast<ProxyObject*>(this);
> 
> Nit: remove commented out code.

Thanks, missed one.

> @@ +98,5 @@
> > +    else
> > +        trc->runtime()->gc.nursery.removeMallocedBuffer(psrc.data.values);
> > +
> > +    // Trace all values in ProxyValueArray to ensure that any nursery
> > +    // allocated objects also get moved.
> 
> Doesn't this happen when we call the trace hook in marking?

Oh, Good point! I'd guess it was actually crashing because TraceCrossCompartmentEdge doesn't actually do anything for nursery GC's. I guess the best fix would be to make ShouldMarkCrossCompartment aware of nursery GC's now that it is applicable.
Actually, the tenuring tracer is not a marking tracer, so even the cross compartment edge should be moved already. I'll test again without marking here and see if it maybe wasn't something else entirely that was going off the rails.
This should make more sense.
Attachment #8759769 - Attachment is obsolete: true
Attachment #8760469 - Flags: review?(jcoppeard)
(In reply to Terrence Cole [:terrence] from comment #44)
> > Firstly, how do we mark wrapped things where the wrapper is reachable in a
> > minor GC?  It wasn't clear to me how that happens.
> 
> I'm not entirely sure what you're asking: whether it's the strategy in
> general or only the case where we drop the wrapper.
>
> The relevant code is in NurseryAwareHashMap::sweepAfterMinorGC, keeping in
> mind that IsAboutToBeFinalized during a nursery GC will automatically
> relocate a moved (e.g. live) reference. In this code, we either drop the
> wrapper from the table if the wrappee is dead, or move it (noting the unused
> return from neesSweep(&key) where we're actually just moving the key). We
> can assume that the key is not dead here, because we've already
> needsSweep(&value), which contains a strong edge to the key.
> 
> If you're asking what happens if we drop because the wrapper is dead and
> *don't* move the wrappee from this code: either the other references to the
> wrappee will have been barriered, in which case it has already been moved,
> or it is simply also dead and will get swept when we reset the nursery head
> pointer.

Thanks for the explanation.  I was asking about the general case, but I realise now that this just happens in the usual way so this is all fine.

>         /*
>          * Nursery allocation is allowed if and only if it is safe to not
>          * run |finalize| when the ProxyObject dies.
>          */

Great.  We might also want to make this "when a nursery-allocated ProxyObject dies" just to spell it out.

> > ::: js/src/gc/Nursery.h
> > @@ +97,5 @@
> > > + */
> > > +inline bool
> > > +CanNurseryAllocateFinalizedClass(const js::Class* const clasp)
> > > +{
> > > +    bool allow = (clasp->flags & JSCLASS_SKIP_NURSERY_FINALIZE);
> > 
> > Should we assert clasp->hasFinalize()?  If this is called for non-finalized
> > classes then the name is confusing.
> >
> > Also, can we add a case in here for proxies?  I saw a couple of places where
> > we call this and then also check if the object is a proxy.
> 
> Well, that's sort of the core problem with this patch. Proxies *have to*
> share the same JSClass so that we can TradeGuts them. That JSClass cannot
> have nursery allocation in general, for several good reasons largely related
> to gecko's extensive use of DomProxy.
>
> Previously, our only criterion for "is it nursery allocable" was the code
> here (which I abstracted out for [I thought] more clarity). The new test is:
> |CanNurseryAllocateFinalizedClass(clasp) || (clasp->isProxy() &&
> handler->canNurseryAllocate())|. This is simple enough, but there is one
> major downside: we don't have access to |handler| in JSObject::create, where
> we've typically made the "do we nursery allocate it" decision.

Oh right, this was confusing me but I get it now.  That's annoying that we share the same class between proxies that can and cannot live in the nursery but I guess that's the way it is.

I think using the NurseryAllocatedProxy kind is a good idea and works well here.

> As to the core complaint here: do you have a suggestion for a better name
> with the above context in mind?

The name thing was just a nit, as the current name suggests to me that it's always called for classes with a finalizer, which is not the case in a couple of places.  So my suggestion would be to change this to CanNurseryAllocateClass (and also return true if there's no finalizer), or to leave it as it is and assert that the class has a finalizer (and check this before calling it in a couple of places).  But it's no big deal to leave it as is.

> > > +        NewObjectKind newKind = IsInsideNursery(env) ? GenericObject : TenuredObject;
> > 
> > Maybe factor this out into NewProxyObjectKind?
> 
> What is NewProxyObjectKind?

A new yet-to-be-created function to factor out this logic.  But again, no big deal.
> That's annoying that we share the same class between proxies that can and cannot live in the nursery

If we'd just fix the bug about moving the handler into the JSClass, so many problems would go away... :(
Comment on attachment 8760469 [details] [diff] [review]
allocate_proxies_in_the_nursery-v1.diff

Review of attachment 8760469 [details] [diff] [review]:
-----------------------------------------------------------------

This looks really good.  It's complicated but it doesn't seem like a terrible hack, more like just what we need to do to make this work.

::: js/src/gc/NurseryAwareHashMap.h
@@ +11,5 @@
> +
> +// This class only handles the incremental case and does not deal with nursery
> +// pointers. Most users will want to use ReadBarriered instead.
> +template <typename T>
> +class UnsafeBareReadBarriered : public ReadBarrieredBase<T>

I prefer this defined in this file.  Can we go as far as to make it an inner class of NurseryAwareHashMap?

@@ +46,5 @@
> +template <typename Key,
> +          typename Value,
> +          typename HashPolicy = DefaultHasher<Key>,
> +          typename AllocPolicy = TempAllocPolicy>
> +class NurseryAwareHashMap

This class could use a comment explaining what it's used for and how it's different from a GCHashMap.
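To illustrate what such a comment might explain, here is a toy sketch of the core idea (illustrative names only, not the real NurseryAwareHashMap): a plain map plus a side list of entries that may point into the nursery, so a minor GC only has to visit the side list rather than the whole table, while a GCHashMap would be swept in full:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Fake GC cell; "minor GC" tenures a cell by setting its forwarding pointer.
struct FakeCell {
    bool inNursery;
    FakeCell* forwarded = nullptr;
};

class NurseryAwareMap {
    std::unordered_map<FakeCell*, FakeCell*> map;
    std::vector<FakeCell*> nurseryEntries;  // keys of entries touching the nursery
  public:
    bool put(FakeCell* k, FakeCell* v) {
        if (k->inNursery || v->inNursery)
            nurseryEntries.push_back(k);    // remember for the next minor GC
        map[k] = v;
        return true;
    }
    FakeCell* lookup(FakeCell* k) {
        auto p = map.find(k);
        return p == map.end() ? nullptr : p->second;
    }
    // After a minor GC: re-key only the entries we recorded, following
    // forwarding pointers for moved cells. The bulk of the table is untouched.
    void sweepAfterMinorGC() {
        for (FakeCell* oldKey : nurseryEntries) {
            auto p = map.find(oldKey);
            if (p == map.end())
                continue;
            FakeCell* v = p->second;
            map.erase(p);
            FakeCell* newKey = oldKey->forwarded ? oldKey->forwarded : oldKey;
            if (v->forwarded)
                v = v->forwarded;
            map[newKey] = v;
        }
        nurseryEntries.clear();
    }
};
```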

::: js/src/jsapi-tests/testGCMarking.cpp
@@ +30,5 @@
> +        fprintf(stderr, "failed to create second global");
> +        return false;
> +    }
> +
> +    // This should always be true, regardless.

Should always be false, I think you mean.

@@ +223,5 @@
> +    CHECK(js::gc::IsInsideNursery(wrapper));
> +
> +    // Let the wrappee die. The wrapper is still held, however, so our minor
> +    // collection must tenure the wrappee and keep it in the table as well as
> +    // the rooted wrapper.

"Let the wrappee die" - we drop the reference to the wrappee but it doesn't die because the wrapper is still reachable, right?
Attachment #8760469 - Flags: review?(jcoppeard) → review+
Blocks: 1278554
(In reply to Jon Coppeard (:jonco) from comment #50)
> Comment on attachment 8760469 [details] [diff] [review]
> allocate_proxies_in_the_nursery-v1.diff
> 
> Review of attachment 8760469 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks really good.  It's complicated but it doesn't seem like a
> terrible hack, more like just what we need to do to make this work.
> 
> ::: js/src/gc/NurseryAwareHashMap.h
> @@ +11,5 @@
> > +
> > +// This class only handles the incremental case and does not deal with nursery
> > +// pointers. Most users will want to use ReadBarriered instead.
> > +template <typename T>
> > +class UnsafeBareReadBarriered : public ReadBarrieredBase<T>
> 
> I prefer this defined in this file.  Can we go as far as to make it an inner
> class of NurseryAwareHashMap?

I don't think it's possible. We need a GCPolicy for it because we're using the base GCHashMap::sweep during major GC sweeping, and when I tried to define the policy for an inner class, C++ was unable to deduce the key, value, and policy parameters of the outer class. I guess it would be almost as good to just move it to a detail namespace, which I've now done.
Our debug code is iterating the whole wrapper map each minor GC to check the tables; this changes the big-O of sweeping. This seems to only really bite on reftests, probably because of how they are architected. It seems that we're also crashing in the new sweeping on opt builds, so this will probably not solve the problem, but will at least let us get to the code that's actually crashing in a debug build.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b45cf58eb131
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs]
Clearing stale needinfo.
Flags: needinfo?(sphink)
Depends on: 1281450
platform-rel: --- → ?
There is a second watchdog that tracks pauses on the main thread. It looks like unconditionally doing some light checks on every minor GC is too slow for this watchdog, if not for the other one. Hopefully this will get debug builds to crash in the same place as opt builds.
I'm still not able to reproduce, but while investigating the performance angle, I noticed something strange that I think may be implicated, possibly indirectly. Apparently, at the very end of jstestbrowser, the event loop receives and processes 1548 WindowDestroyedEvents. As far as I can tell running locally, we appear to create 2 actual windows, so I'm really not sure where these are all coming from. As part of destroying each "window", we NukeCrossCompartmentWrappers. The mechanism here has to add an evictNursery to each of these calls, which apparently adds over 1ms to each call.

I don't know where these windows are coming from. I don't know why we're not hitting the "nothing in the nursery" fastpath. And I especially don't know why we're spending 800-1000 microseconds in markRuntime when the runtime is basically dead at this point. Questions to answer tomorrow.
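A toy model of the "nothing in the nursery" fast path in question, assuming a hypothetical Nursery type (the real API differs):

```cpp
#include <cassert>
#include <cstddef>

// Toy nursery: everything here is illustrative, not the real engine API.
struct Nursery {
    std::size_t usedBytes = 0;  // bytes allocated since the last minor GC
    int minorGCs = 0;
    bool isEmpty() const { return usedBytes == 0; }
    void collect() { ++minorGCs; usedBytes = 0; }  // the expensive path
};

// Wrapper nuking forces a nursery eviction, but when nothing has been
// nursery-allocated since the last minor GC the eviction should be free.
void EvictNurseryForWrapperNuking(Nursery& nursery) {
    if (nursery.isEmpty())
        return;            // fast path: skip the roots walk entirely
    nursery.collect();     // slow path: full minor GC, marks all roots
}
```

In the scenario described above, a single allocation during each DestroyWindowEvent defeats this fast path, forcing a full roots walk over thousands of compartments per event.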
Do we navigate? Each navigation creates a new inner window (i.e. global object).
The test code itself is pure js and does not (I believe) have any notion of windows, navigation, dom, etc, so it would depend on exactly how the reftest runner is implemented.

It looks like after all 7000+ tests run, we spin the event loop, which does ipc::MessagePump, which delivers lots (still need to count them) of PBrowser::Msg_Destroy__ID notifications from the content process. This does TabChild::DestroyWindow, which eventually does SetDocShell(nullptr) many times, which eventually results in 1500 DestroyWindowEvents in the event loop. After ipc::MessagePump finishes we resume the event loop and process those 1500 DestroyWindowEvents.

Each of these events has become much slower due to the minor GC. This would normally not be a problem, but *something* is running code in the DestroyWindowEvent (I can't see the stack because it's in the JIT), which is causing a single allocation. Because we have a non-empty nursery, we have to walk the entire runtime to find roots. This would normally be very fast as the stack is basically empty, but for some reason the runtime contains 8128(!) compartments, which, obviously, slows things down a bit.

There's quite a lot going wrong here all at once. Why are there so many compartments? Presumably we're creating one per test for isolation, but there are more than the number of tests. We're probably creating some extra ones during tests, but not enough (apparently) to account for the number of inner windows we're seeing. Why are these compartments still alive at all? I know we GC'd plenty during the test. Why do we only destroy all the windows at the end of the test? Why are there even so many windows?

Unfortunately, answers will have to wait until Tuesday because of the holiday.
Jon asked a great question in this morning's GC meeting: do compartments pile up that way in a stock build too? It turns out they do not. Are we triggering fewer GCs because we have consistently lower heap utilization? Or is there some bug in the new code that is preventing collection of compartments?
Depends on: 1285605
platform-rel: ? → +
See Also: → 844769
(In reply to Terrence Cole [:terrence] from comment #62)
> Jon asked a great question in this morning's GC meeting: do compartments
> pile up that way in a stock build too. It turns out they do not. Are we
> triggering fewer GC's because we have consistently lower heap utilization?
> Or is there some bug in the new code that is preventing collection of
> compartments?

Note that this is actually a bit of a lie. We end up with half as many compartments, but there's only one collection that collects compartments in the before case. Adding a ton of new GC behavior does not actually solve the issue either, so there must be something related to how windows are thrown away that's not being satisfied by our reftest runner. I'm doing the investigation in bug 1285605 (suffering from the same problem), because that bug is substantially simpler.
Depends on: 1293360
Depends on: 1296484
I tested locally with and without the patch to try and track down the reftest issue.  Here are some stats after running the first 1/10 of the tests:

                      Pre:   Post:
Output lines:        24190   26376
Tests run:            1368    1368
DOM windows created:  6305    6305
Complete GCs:          608     602
GC slices:             936    2031 *
Max zones before GC:    57      57
Max compartments:      682    2806 *
Max zones after GC:     57      57
Max compartments:      682    2806 *

Interesting points:
 - we're doing roughly the same number of GCs, but in twice as many slices
 - there are a lot of compartments and they are not being collected
Here's a patch on top of allocate_proxies_in_the_nursery-v1.diff.  I think the problem was the get() in GCPolicy<UnsafeBareReadBarriered>::isTenured causing a read barrier.  The patch actually removes that method in favour of passing a Value& since there doesn't seem much point in wrapping it in a read barrier before we put it in the table.
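As a hedged illustration of the barrier problem described here, a toy model (all types are stand-ins for ReadBarriered and friends, not the real SpiderMonkey types):

```cpp
#include <cassert>

// Fake GC value.
struct FakeValue { bool inNursery; };

// Models a read-barriered wrapper: get() fires the read barrier as a side
// effect, which can keep nursery things alive (and, in the bug above,
// prevented compartments from being collected).
struct ReadBarrieredValue {
    FakeValue value;
    mutable int barrierHits = 0;
    const FakeValue& get() const {
        ++barrierHits;             // side effect: the read barrier fires
        return value;
    }
    const FakeValue& unbarrieredGet() const { return value; }
};

// The fix in spirit: test tenured-ness on the bare value so no barrier
// fires as a side effect of the check.
bool IsTenured(const FakeValue& v) { return !v.inNursery; }
```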

Local testing gives the following results which don't show the compartment build up we were seeing before:

Output lines:        23972
Test run:             1368
DOM windows created:  6305
Complete GCs:          597
GC slices:             829
Max zones before GC:    57
Max compartments:      682
Max zones after GC:      8
Max compartments:      348

A try build will show though: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e265c63f38f

If the try build is green and if you approve, could you also land this for me as I'll be out next week?
Attachment #8789893 - Flags: review?(terrence)
Comment on attachment 8789893 [details] [diff] [review]
bug1237058-patch

Review of attachment 8789893 [details] [diff] [review]:
-----------------------------------------------------------------

This is great, thanks for taking a look! Wish I'd thought to ask for another pair of eyes on it 2 months ago.

::: js/src/gc/NurseryAwareHashMap.h
@@ +102,5 @@
>  
>      MOZ_MUST_USE bool put(const Key& k, const Value& v) {
>          auto p = map.lookupForAdd(k);
>          if (p) {
> +            if (!JS::GCPolicy<Key>::isTenured(k) || !JS::GCPolicy<Value>::isTenured(v)) {

Wha??? How is that not already present??? I guess my tests must have just missed this specific case!

::: js/src/jscompartment.cpp
@@ -245,5 @@
>  {
>      MOZ_ASSERT(wrapped.is<JSString*>() == wrapper.isString());
>      MOZ_ASSERT_IF(!wrapped.is<JSString*>(), wrapper.isObject());
>  
> -    if (!crossCompartmentWrappers.put(wrapped, ReadBarriered<Value>(wrapper))) {

Yup! That was unintentional. I guess I must have dropped it in one of the rebases and just missed it when re-reading. Yay for a second pair of eyes!
Attachment #8789893 - Flags: review?(terrence) → review+
This patch produced a performance improvement in Talos, thanks.

Improvements:

  3%  cart summary windowsxp opt e10s     39.89 -> 38.71
  2%  kraken summary linux64 opt          1584.01 -> 1548.71

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=3067
https://hg.mozilla.org/mozilla-central/rev/9ff810febb0a
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
and some talos improvements came out of this:
== Change summary for alert #3129 (as of September 12 2016 09:36 UTC) ==

Improvements:

  2%  tp5o Private Bytes windowsxp opt     203740136.32 -> 199316445.6
  2%  tp5o Main_RSS windowsxp opt          222040891.12 -> 217553276.66

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=3129


as a note, I see some on win7 as well :)
with all the data that has come in, I see this as the set of changes talos discovered:
== Change summary for alert #3067 (as of September 12 2016 09:29 UTC) ==

Regressions:

  8%  tabpaint summary windows8-64 pgo     68.83 -> 74.07

Improvements:

  3%  cart summary windowsxp opt e10s     39.89 -> 38.71
  3%  kraken summary linux64 pgo          1530.23 -> 1490.79
  3%  kraken summary linux64 pgo e10s     1549.88 -> 1511.13
  2%  kraken summary linux64 opt          1584.01 -> 1548.71
  2%  kraken summary linux64 opt e10s     1601.52 -> 1568.46

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=3067

The tabpaint regression is odd: it is PGO-only and replicated across the branches, but opt builds have never had an issue.  Most likely this is something odd in the PGO process.

This is really an issue with the tabpaint-from-parent (not the -from-content version);
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=af4c053741f1b847888f5566175fd4c7931e5bd7&newProject=mozilla-inbound&newRevision=9ff810febb0af9cf4e08ed5c1b4c902aef3f4f5f&originalSignature=c608611eeaa8825fca06d2f477c62bde5e491e6f&newSignature=c608611eeaa8825fca06d2f477c62bde5e491e6f&framework=1

but looking at the graph for that subtest, we sort of reset a few days later:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=%5Bmozilla-inbound,09dc420434476eaaafb41ba0d05eb69b8bb5d987,1,1%5D&highlightedRevisions=af4c053741f1&highlightedRevisions=9ff810febb0a

:terrence, do you have any thoughts on why we might have this regression?  Do you think there is anything we could do to look into it?  Given that it is PGO-only and that the subtest reset itself, it might be more worthwhile to focus on new features and improvements instead.
Flags: needinfo?(terrence)
A wild hypothesis would be that with this change, we no longer exercise some corner of the tenuring code path as much during the training session, and it gets less optimized as a result. Then a few days later, we get back over whatever the mysterious threshold is.

But I doubt it. I would expect something like that to show up as a regression on a lot more stuff.
(In reply to Joel Maher ( :jmaher) from comment #82)
I concur. This patch did not move the needle either way on gdocs either. Given that it's net meh, I think a deeper investigation is not warranted.
Flags: needinfo?(terrence)
No longer depends on: 1296484
Depends on: 1303015
Assignee: terrence.d.cole → jcoppeard
Whiteboard: [platform-rel-Google][platform-rel-GoogleDocs] → [platform-rel-Google][platform-rel-GoogleSuite][platform-rel-GoogleSheets]
Flags: needinfo?(jst)
Blocks: 1338802
Duplicate of this bug: 1211944