GGC regressed performance on dom-attr

RESOLVED FIXED in mozilla32

Status

defect
RESOLVED FIXED
Opened 6 years ago; closed 5 years ago

People

(Reporter: terrence, Assigned: terrence)

Tracking

(Blocks 1 bug)

Version: unspecified
Target Milestone: mozilla32
Points: ---

Firefox Tracking Flags

(firefox31-)

Details

(Whiteboard: [talos_regression])

Attachments

(11 attachments)

47.05 KB, patch
jonco: review+
terrence: checkin+
25.00 KB, patch
jonco: review+
terrence: checkin+
9.03 KB, patch
jonco: review+
terrence: checkin+
6.26 KB, patch
jonco: review+
terrence: checkin+
25.29 KB, patch
jonco: review+
terrence: checkin+
26.46 KB, patch
jonco: review+
terrence: checkin+
22.11 KB, patch
jonco: review+
terrence: checkin+
35.24 KB, patch
jonco: review+
terrence: checkin+
6.78 KB, patch
jonco: review+
terrence: checkin+
17.03 KB, patch
jonco: review+
984 bytes, patch
jonco: review+
test( "element.expando = value", function(){
	for ( var i = 0; i < num; i++ )
		a["test" + num] = function(){};
});
Blocks: 990085
Whiteboard: [talos_regression]
This repros in a linux64 shell build.
At 1M iterations, we go from 200ms to 290ms. GGC is repeatedly GCing because of FULL_STORE_BUFFER on the slots buffer. Although each minor GC is only 50us, the total minor GC time is 74ms, explaining the majority of the regression.
If we fix the FULL_STORE_BUFFER GCs through some form of deduplication, we shuffle almost all of this work into the dedup, gaining at most 4-8ms on the total. I have a feeling that if we make the store buffer compaction faster but sloppier, we'll end up translating the overhead into extra marking. Still worth a try, though.
Checking in here: as we are <2 weeks away from the uplift of this to Aurora, do we have a plan in place for the performance regressions?
Flags: needinfo?(terrence)
Bug 990336 and bug 988950 should address the performance gap on the affected micro-benchmarks.
Flags: needinfo?(terrence)
Assignee: nobody → terrence
Depends on: 988950, 990336
Also bug 995442, maybe some bug 984101 too, and probably others I'm not thinking of. I expect us to be in a good position to release by end-of-week, if not sooner.
Depends on: 995442, 984101
Allowing normal deduplication of the slots buffer worked, in a sense. We now do only 2 minor GCs here, one of them for DestroyContext. Surprisingly, we're still quite a bit slower. On the other hand, in the meantime, Jan optimized Lambda creation, so everything is faster. The new numbers for 1M iterations are: 150ms before and 260ms after, roughly the same ratio.

If we switch the test to create a bare object instead of a function, we are actually a bit faster with ggc, so it is something specific to functions. The big obvious difference is that we're not allocating these in the nursery -- they're captured by the script currently, so it wouldn't matter anyway. Looking at cachegrind output, both scripts take roughly the same amount of time in the jit. The object version spends a significant portion of its time in minor collection, while the function version spends no time there. Neither version has significant data in the store buffer. There is no one heavy user: the profile information is spread out all over the run. There is no major source of overhead outside the jit other than atomization, which is unchanged with ggc and is the same on both versions of the test.

Also note that the testcase in bug 945555 is almost identical to this test and its performance is roughly 3x faster with ggc.
Depends on: 945555
If we run these as --baseline-eager --no-ion, the difference is from 330ms -> 360ms, so the ggc difference is all in ion.
(In reply to Terrence Cole [:terrence] from comment #8)
> If we run these as --baseline-eager --no-ion, the difference is from 330ms
> -> 360ms, so the ggc difference is all in ion.

The interpreter/Baseline still allocate lambdas tenured, can we fix that to match Ion (or fix Ion)?
The MIR that's generated before and after is identical because we are hitting the CallSetElement fallback. This enters the interpreter and after checking a surprisingly large number of special cases, it takes the common path, at long last ending up in setSlot, doing getSlotRef().set(). The only difference between the two binaries is the execution of this post barrier. We know from gc dumps that this post barrier is not causing any pathological algorithmic behavior. This basically eliminates everything but codegen.

Looking at clang's -O3 codegen: mostly miracles, all the way down. The bottom of the stack is identical with these 5 frames, all the intermediates and parameter juggling handily removed:

#1  js::NativeSet<(js::ExecutionMode)0>
#2  js::baseops::SetPropertyHelper<(js::ExecutionMode)0>
#3  setGeneric
#4  SetObjectElementOperation
#5  js::SetObjectElement

In the non-ggc case, clang does a callq to ObjectImpl::setSlot, which inlines HeapSlot::set, the pre-barrier, and the actual set -- a surprising amount of code, even if it's mostly jumped over normally. In the ggc case, clang instead inlines ObjectImpl::setSlot and does a callq to HeapSlot::set, which then inlines both the pre- and post-barrier.

First, this makes the code in NativeSet a bit less elegant; HeapSlot::set takes two more parameters than ObjectImpl::setSlot. Then, the actual set is just quite long: 365 bytes of code. This is actually large enough to be reflected in the executed instruction counts:

==27287== I   refs:      3,104,099,224
==27287== D   refs:      1,370,402,602  (854,508,317 rd   + 515,894,285 wr)

==27325== I   refs:      3,335,606,115
==27325== D   refs:      1,473,968,889  (930,717,225 rd   + 543,251,664 wr)

This isn't really enough instructions to account for a 30% slowdown. When we include the I1 misses and LL refs, however, the picture gets much clearer.

==27287== I1  misses:       18,790,554
==27287== LL refs:          22,482,207  ( 20,687,583 rd   +   1,794,624 wr)

==27325== I1  misses:       46,294,016
==27325== LL refs:          50,043,843  ( 48,231,231 rd   +   1,812,612 wr)

Even though the absolute numbers here are small in comparison to the totals, these are all going to be high-latency.
(In reply to Terrence Cole [:terrence] from comment #7)
> Also note that the testcase in bug 945555 is almost identical to this test
> and its performance is roughly 3x faster with ggc.

I'd guess the testcase there isn't falling out of Ion like we are here.
Okay, so the question is what to do about it. The bare logical minimum we need is:

if (zone->needsPreBarriers())
    MarkValue(trc, value, "pre-barrier");
value = newValue;
if (value.isObject() && !IsInsideNursery(owner) && IsInsideNursery(&value.toObject()))
    rt->storeBuffer.slotsBuffer[last++] = SlotRef(owner, offset, kind);

Unfortunately, this simple concept is burdened with a ton of other complicating crap that's been slowly getting piled on. What it actually looks like in practice is more like:

if (!(value.isString() && StringIsPermanentAtom(value.toString())) &&
    zone->needsBarrier())
{
    Value tmp(value);
    js::gc::MarkValueUnbarriered(shadowZone->barrierTracer(), &tmp, "write barrier");
}
value = newValue;
if (value.isObject() &&
    storeBuffer.isEnabled() &&
    CurrentThreadCanAccessRuntime(value.toObject().chunk()->info.trailer.runtime) &&
    !IsInsideNursery(owner) && IsInsideNursery(&value.toObject()))
{
    SlotRef *t = rt->storeBuffer.slotsBuffer.alloc();
    *t = SlotRef(owner, offset, kind);
}
https://tbpl.mozilla.org/?tree=Try&rev=efeb048d445e

This is part 1 of several. The end goal is to unify the heap Ptr/Value/Id classes using the same mechanism used by Rooted/Handle. This should be a tremendous reduction in SLOC and hopefully much simpler as well once I'm done.
Attachment #8412752 - Flags: review?(jcoppeard)
Comment on attachment 8412752 [details] [diff] [review]
rebase_heapptr_to_ptr-v0.diff

Review of attachment 8412752 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +292,5 @@
> +template <typename T>
> +struct InternalGCMethods {};
> +
> +template <typename T>
> +struct InternalGCMethods<T *>

It would be nice to share these eventually with the GCMethods in RootingAPI.h; however, for the first pass I wanted to maintain the existing inlining behavior, which cannot be done from outside the API currently.

@@ +306,4 @@
>  /*
>   * Base class for barriered pointer types.
>   */
> +template <class T>

The Union bits appear to be completely unused now. Nothing was required other than deleting the code.

@@ +605,5 @@
> +typedef HeapPtr<BaseShape*> HeapPtrBaseShape;
> +typedef HeapPtr<UnownedBaseShape*> HeapPtrUnownedBaseShape;
> +typedef HeapPtr<types::TypeObject*> HeapPtrTypeObject;
> +typedef HeapPtr<types::TypeObjectAddendum*> HeapPtrTypeObjectAddendum;
> +typedef HeapPtr<jit::JitCode*> HeapPtrJitCode;

This was a nice opportunity to unify our style and switch absolutely everything over to typedefed names.

::: js/src/gc/Marking.h
@@ +11,5 @@
>  
>  class JSAtom;
>  class JSLinearString;
>  
> +

Sorry, did not notice this. Will remove it.

@@ +54,2 @@
>   *     This function should be used for marking JSObjects, in preference to all
> + *     others below. Use it when you have HeapPtr<JSObject*>, which

I'll change these names to HeapPtrObject.

@@ +102,1 @@
>  type *Update##base##IfRelocated(JSRuntime *rt, type **thingp);

This was a bit finicky to get right. Thankfully, we're finally getting close to having a fully typed API, which should let us templatize this mess.

@@ -243,5 @@
> - * The unioned HeapPtr stored in script->globalObj needs special treatment to
> - * typecheck correctly.
> - */
> -void
> -MarkObject(JSTracer *trc, HeapPtr<GlobalObject, JSScript *> *thingp, const char *name);

Our one and only "user" of the unioned bits. Fortunately, unused.

@@ +283,5 @@
>      MarkScript(trc, o, name);
>  }
>  
>  inline void
> +Mark(JSTracer *trc, HeapPtr<jit::JitCode*> *code, const char *name)

Not sure why I didn't use the typedef here. Fixed now.
https://tbpl.mozilla.org/?tree=Try&rev=237e835c870a

This uses the new BarrieredPtr to implement BarrieredValue and subclasses. It removes 222 lines of code. Id coming up next.
Attachment #8412921 - Flags: review?(jcoppeard)
Comment on attachment 8412921 [details] [diff] [review]
rebase_barrieredvalue_on_barrieredptr-v0.diff

Review of attachment 8412921 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/Value.h
@@ +1609,5 @@
>      JSString *toString() const { return value()->toString(); }
>      JSObject &toObject() const { return value()->toObject(); }
>      JSObject *toObjectOrNull() const { return value()->toObjectOrNull(); }
>      void *toGCThing() const { return value()->toGCThing(); }
> +    uint64_t asRawBits() const { return value()->asRawBits(); }

Was not exposed before. No reason not to, since it's just .get().asRawBits() otherwise.

::: js/src/gc/Barrier.cpp
@@ -22,5 @@
> -        return true;
> -
> -    return ZoneOfValue(value) == zone ||
> -           zone->runtimeFromAnyThread()->isAtomsZone(ZoneOfValue(value));
> -}

The zone variant of |set| was never actually called, so we're not actually losing an assertion.

::: js/src/gc/Barrier.h
@@ +365,5 @@
> +        JSRuntime *rt = static_cast<js::gc::Cell *>(vp->toGCThing())->runtimeFromAnyThread();
> +        JS::shadow::Runtime *shadowRuntime = JS::shadow::Runtime::asShadowRuntime(rt);
> +        shadowRuntime->gcStoreBufferPtr()->removeRelocatableValue(vp);
> +#endif
> +    }

I did not attempt to make any improvements to the way we treat these. I plan to follow with optimization and hopefully simplification.

@@ +372,5 @@
>  /*
>   * Base class for barriered pointer types.
>   */
>  template <class T>
> +class BarrieredPtr : public HeapBase<T>

I'm deriving HeapBase<T> in order to pull in ValueOperations. I think it makes sense to re-use the public Heap infrastructure for this. At least, I don't see any obvious foot-guns and the Try run is green.

@@ +387,5 @@
>          this->value = v;
>      }
>  
>      /* Use this if the automatic coercion to T isn't working. */
> +    const T &get() const { return value; }

C++ was having trouble using the return as a |const Value&| for some reason. There may have been a confounding definition in an intermediate state or something. In any case this should be strictly faster on 32bit, so I decided to keep it.

@@ +398,5 @@
>      void unsafeSet(T v) { value = v; }
>  
>      T operator->() const { return value; }
>  
> +    operator const T &() const { return value; }

Ditto.

@@ +598,4 @@
>              relocate();
>              this->value = v;
> +        } else {
> +            this->value = v;

The pointer case is either null or non-null, but with value we need the else. I didn't think it was worth adding new infrastructure just to assert we don't hit this case for pointers when the usage is pretty clear from construction.

@@ +605,5 @@
>      }
>  
>    protected:
> +    static bool isMarkable(const T &v) {
> +        return InternalGCMethods<T>::isMarkable(v);

I added this static method solely to save on typing above.

@@ +760,5 @@
>          post(rt, owner, kind, slot, v);
>      }
>  
> +    //operator Value() const { return this->value; }
> +    //operator const Value &() const { return this->ptr; }

D'oh! Will remove this hunk.

@@ +786,5 @@
>          value = v;
>          post(shadowZone->runtimeFromAnyThread(), owner, kind, slot, v);
>      }
>  
> +    static void writeBarrierPost(JSObject *obj, Kind kind, uint32_t slot, const Value &target)

Caught this when reading. Totally not critical to this particular patch, just sneaking this and the next in since it's arguably related.

::: js/src/jsfun.h
@@ +216,5 @@
>      }
>  
>      void setGuessedAtom(JSAtom *atom) {
> +        JS_ASSERT(!atom_);
> +        JS_ASSERT(atom);

I was hoping to get away without operator== and operator!= entirely, so that C++ would enforce this particular style guideline. Unfortunately, I had to add them back in the next patch for HeapId, which we actually do compare to raw jsid in quite a few places.

::: js/src/jsiter.cpp
@@ +1756,5 @@
>               * write barriers here.
>               */
>              HeapValue::writeBarrierPre(gen->regs.sp[-1]);
>              gen->regs.sp[-1] = arg;
> +            HeapValue::writeBarrierPost(gen->regs.sp[-1], &gen->regs.sp[-1]);

The runtime saves us a few instructions, but this path isn't hot and it's the only user.
This saves about 100 lines. I think as a followup we should MOZ_CRASH at JSID_IS_OBJECT so that we're not carrying around probably-rotted code. I'd remove it entirely, but there is discussion of re-using it for Symbols at some point.
Attachment #8412930 - Flags: review?(jcoppeard)
This removes the * from FixedHeapPtr to make it match the others and renames it to ImmutableTenuredPtr. There is only the one PropertyName user, which I have called ImmutablePropertyNamePtr. I think this is a fair balance between clarity and length, but disagreements are more than welcome.
Attachment #8412933 - Flags: review?(jcoppeard)
(In reply to Terrence Cole [:terrence] from comment #16)
> @@ +605,5 @@
> >      }
> >  
> >    protected:
> > +    static bool isMarkable(const T &v) {
> > +        return InternalGCMethods<T>::isMarkable(v);
> 
> I added this static method solely to save on typing above.

Could this be

  using InternalGCMethods<T>::isMarkable;

? Maybe it's not a good idea even if it works; too unusual.

(In reply to Terrence Cole [:terrence] from comment #18)
> This removes the * from FixedHeapPtr to make it match the others and renames
> it to ImmutableTenuredPtr. There is only the one PropertyName user, which I
> have called ImmutablePropertyNamePtr. I think this is a fair balance between
> clarity and length, but disagreements are more than welcome.

I think I may have suggested ImmutableTenuredPtr, but I do wonder if the Tenured part should be in there -- what happens when we have compacting GC? Will we even still have these?
EncapsulatedPtr is a terribly unhelpful name in pretty much every dimension. I think we should call these PreBarriered to more accurately reflect their usage, even if it is a bit jargony.
Attachment #8412988 - Flags: review?(jcoppeard)
Comment on attachment 8412988 [details] [diff] [review]
rename_encapsulated_to_prebarriered-v0.diff

Review of attachment 8412988 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +695,5 @@
>  typedef BarrieredPtr<JSScript*> BarrieredPtrScript;
>  
> +typedef PreBarriered<JSObject*> PreBarrieredObject;
> +typedef PreBarriered<JSScript*> PreBarrieredScript;
> +typedef PreBarriered<jit::JitCode*> PreBarrieredJitCode;

This consciously removes Ptr to closer match Rooted.
This gives the same treatment to ReadBarriered: move the * outside the template and merge with Value.
Attachment #8413905 - Flags: review?(jcoppeard)
Comment on attachment 8413905 [details] [diff] [review]
deref_readbarriered_and_merge_value-v0.diff

Review of attachment 8413905 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier-inl.h
@@ -18,5 @@
> -        JSObject::readBarrier(&value.toObject());
> -    else if (value.isString())
> -        JSString::readBarrier(value.toString());
> -    else
> -        JS_ASSERT(!value.isMarkable());

The only code using this is the cross-compartment wrapper map, which only has objects and strings. These branches are always going to test true, and the subsequent mark is out-of-line, so there's really no point inlining this method.

@@ -26,5 @@
> -
> -inline
> -ReadBarrieredValue::operator const Value &() const
> -{
> -    return get();

Since this is just forwarding, I have no idea why it couldn't just be in the header.

@@ -32,5 @@
> -
> -inline JSObject &
> -ReadBarrieredValue::toObject() const
> -{
> -    return get().toObject();

There is only one caller, so I just moved the .get() there so we can kill this file.

::: js/src/gc/Barrier.h
@@ -727,5 @@
> -typedef PreBarriered<jsid> PreBarrieredId;
> -typedef RelocatablePtr<jsid> RelocatableId;
> -typedef HeapPtr<jsid> HeapId;
> -
> -typedef ImmutableTenuredPtr<PropertyName*> ImmutablePropertyNamePtr;

Just moving these below ReadBarriered so that we can add them to the list. Also sorting the individual sections.
BarrieredPtr is now a misnomer.
Attachment #8413982 - Flags: review?(jcoppeard)
Comment on attachment 8413982 [details] [diff] [review]
rename_barrieredptr_to_barrieredbase-v0.diff

Review of attachment 8413982 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ -768,5 @@
>  struct TypeObjectAddendum;
>  }
>  
> -typedef BarrieredPtr<JSObject*> BarrieredPtrObject;
> -typedef BarrieredPtr<JSScript*> BarrieredPtrScript;

Unlike the other types here, we do not want people to actually use BarrieredBase. Removing the typedefs adds another hint that it is special.

@@ -794,5 @@
>  typedef HeapPtr<jit::JitCode*> HeapPtrJitCode;
>  typedef HeapPtr<types::TypeObject*> HeapPtrTypeObject;
>  typedef HeapPtr<types::TypeObjectAddendum*> HeapPtrTypeObjectAddendum;
>  
> -typedef BarrieredPtr<Value> BarrieredValue;

And this was just wrong: BarrieredValue does not actually provide post-barriers.
I looked at the results for linux and dom-attr doesn't appear to be fixed -- what specific platforms were regressed here?
Bug 619558 comment 36 says the regression was across the board.
Comment on attachment 8412752 [details] [diff] [review]
rebase_heapptr_to_ptr-v0.diff

Review of attachment 8412752 [details] [diff] [review]:
-----------------------------------------------------------------

Glad to see the Unioned stuff go, I always wondered why we needed that.
Attachment #8412752 - Flags: review?(jcoppeard) → review+
Comment on attachment 8412921 [details] [diff] [review]
rebase_barrieredvalue_on_barrieredptr-v0.diff

Review of attachment 8412921 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +297,5 @@
>  
>  template <typename T>
>  struct InternalGCMethods<T *>
>  {
> +    static bool isMarkable(T *v) { return bool(v); }

I would prefer |v != nullptr| here.
Attachment #8412921 - Flags: review?(jcoppeard) → review+
Attachment #8412930 - Flags: review?(jcoppeard) → review+
Comment on attachment 8412933 [details] [diff] [review]
rename_and_ref_fixedptr-v0.diff

Review of attachment 8412933 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +573,1 @@
>          value = ptr;

Can we assert that the pointer is tenured here if it's not too inconvenient?
Attachment #8412933 - Flags: review?(jcoppeard) → review+
Attachment #8412988 - Flags: review?(jcoppeard) → review+
Attachment #8413905 - Flags: review?(jcoppeard) → review+
Attachment #8413982 - Flags: review?(jcoppeard) → review+
https://tbpl.mozilla.org/?tree=Try&rev=1896d4d12d6f

Putting the store buffer pointer in the trailer turns out to be pure awesome-sauce. Will make comments inline.
Attachment #8415393 - Flags: review?(jcoppeard)
Attachment #8412752 - Flags: checkin+
Attachment #8412921 - Flags: checkin+
Attachment #8412930 - Flags: checkin+
Attachment #8412933 - Flags: checkin+
Attachment #8412988 - Flags: checkin+
Attachment #8413905 - Flags: checkin+
Attachment #8413982 - Flags: checkin+
Comment on attachment 8415393 [details] [diff] [review]
storebuffer_in_trailer-v0.diff

Review of attachment 8415393 [details] [diff] [review]:
-----------------------------------------------------------------

Cool, it seems this combines the runtime check with checking whether the object is in the nursery.

::: js/src/gc/Barrier.h
@@ +357,4 @@
>      static void postBarrier(Value *vp) {
>  #ifdef JSGC_GENERATIONAL
> +        if (vp->isObject()) {
> +            gc::StoreBuffer *sb = reinterpret_cast<gc::Cell *>(&vp->toObject())->storeBuffer();

I think you don't need to cast here because JSObject ultimately derives from Cell.

@@ +366,5 @@
>  
>      static void postBarrierRelocate(Value *vp) {
>  #ifdef JSGC_GENERATIONAL
> +        if (vp->isObject()) {
> +            gc::StoreBuffer *sb = reinterpret_cast<gc::Cell *>(&vp->toObject())->storeBuffer();

Ditto above.

::: js/src/gc/Heap.h
@@ +623,4 @@
>  #endif
>  
>      JSRuntime       *runtime;
>  };

The padding looks a bit strange.  If we have it in the same place on both architectures we won't need the #ifdefs:

  uint32_t location;
  uint32_t padding;
  StoreBuffer *storeBuffer;
  JSRuntime *runtime;
Attachment #8415393 - Flags: review?(jcoppeard) → review+
Comment on attachment 8415393 [details] [diff] [review]
storebuffer_in_trailer-v0.diff

Review of attachment 8415393 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +232,5 @@
>          }
>  #endif
>      }
>  
> +    static void writeBarrierPost(T *thing, gc::Cell **addr) {}

We need to cast in quite a few more places now, particularly headers that don't understand that Object derives from Cell, but I think it makes the intention clearer. I'd appreciate your thoughts once you've seen the changes that were required.

@@ +358,5 @@
>  #ifdef JSGC_GENERATIONAL
> +        if (vp->isObject()) {
> +            gc::StoreBuffer *sb = reinterpret_cast<gc::Cell *>(&vp->toObject())->storeBuffer();
> +            if (sb)
> +                sb->putValueFromAnyThread(vp);

This saves the indirection through the runtime, which is nice. The real win, though, is that the null check here is actually checking whether the target is in the nursery. This filters not-in-remembered-set edges faster /and/ means that we can remove |!nursery.isInside(deref(v))| from maybeInRememberedSet: a mask, three compares, etc. This is probably the biggest win from this patch.

@@ +368,5 @@
>  #ifdef JSGC_GENERATIONAL
> +        if (vp->isObject()) {
> +            gc::StoreBuffer *sb = reinterpret_cast<gc::Cell *>(&vp->toObject())->storeBuffer();
> +            if (sb)
> +                sb->putRelocatableValueFromAnyThread(vp);

Relocate works on the post value again.

@@ +377,5 @@
>      static void postBarrierRemove(Value *vp) {
>  #ifdef JSGC_GENERATIONAL
>          JSRuntime *rt = static_cast<js::gc::Cell *>(vp->toGCThing())->runtimeFromAnyThread();
>          JS::shadow::Runtime *shadowRuntime = JS::shadow::Runtime::asShadowRuntime(rt);
> +        shadowRuntime->gcStoreBufferPtr()->removeRelocatableValueFromAnyThread(vp);

Now that I'm thinking about it, removal should work on prior value, so it should be safe to do the same optimization here. This doesn't check maybeInRememberedSet however, so it's not a huge perf concern.

@@ +863,4 @@
>  #ifdef DEBUG
>      bool preconditionForSet(JSObject *owner, Kind kind, uint32_t slot);
>      bool preconditionForSet(Zone *zone, JSObject *owner, Kind kind, uint32_t slot);
> +    bool preconditionForWriteBarrierPost(JSObject *obj, Kind kind, uint32_t slot, Value target) const;

I have no idea why this was implemented differently from the two above.

@@ +881,2 @@
>          value = v;
> +        post(owner, kind, slot, v);

The runtime now buys us absolutely nothing, so I've ripped out all the runtime-taking paths.

::: js/src/gc/Nursery.h
@@ +202,5 @@
>      }
>  
>      MOZ_ALWAYS_INLINE void initChunk(int chunkno) {
>          NurseryChunkLayout &c = chunk(chunkno);
> +        c.trailer.storeBuffer = JS::shadow::Runtime::asShadowRuntime(runtime())->gcStoreBufferPtr();

We still need gcStoreBufferPtr() in a few places, so moving this out-of-line wouldn't allow us to remove it yet.

::: js/src/gc/StoreBuffer.cpp
@@ -352,5 @@
>  JS_PUBLIC_API(void)
>  JS::HeapValuePostBarrier(JS::Value *valuep)
>  {
>      JS_ASSERT(valuep->isMarkable());
> -    if (valuep->isString() && StringIsPermanentAtom(valuep->toString()))

This is totally superfluous since we only want barriers for things pointing at objects.

@@ +365,5 @@
>  JS::HeapValueRelocate(JS::Value *valuep)
>  {
>      /* Called with old contents of *valuep before overwriting. */
>      JS_ASSERT(valuep->isMarkable());
> +    if (valuep->isString() && valuep->toString()->isPermanentAtom())

No need to call the out-of-line StringIsPermanentAtom from a cpp file. Whoops.

::: js/src/gc/StoreBuffer.h
@@ +244,5 @@
>          bool operator!=(const CellPtrEdge &other) const { return edge != other.edge; }
>  
>          bool maybeInRememberedSet(const Nursery &nursery) const {
> +            JS_ASSERT(nursery.isInside(*edge));
> +            return !nursery.isInside(edge);

Sadly we don't know if the edge here is inside of the GC heap, so we can't safely check the chunk trailer.

@@ +336,5 @@
>  
>          bool operator==(const WholeCellEdges &other) const { return edge == other.edge; }
>          bool operator!=(const WholeCellEdges &other) const { return edge != other.edge; }
>  
> +        bool maybeInRememberedSet(const Nursery &nursery) const {return true; }

O_o  Will fix.

@@ -362,5 @@
>          void *data;
>      };
>  
> -    template <typename Edge>
> -    bool isOkayToUseBuffer(const Edge &edge) const {

This used to call maybeInRememberedSet, but it had to be moved after the reentrancy guard for reasons I don't remember anymore. I guess I forgot to fix the signature when I made that change.

@@ +378,2 @@
>           */
>          if (!CurrentThreadCanAccessRuntime(runtime_))

This is the elephant in the room. This unconditionally calls PR_CurrentThread(). Not only is this out of line, it's probably quite slow in general. The next step is going to be implementing putFromMainThread, which will assert this instead of testing it.

::: js/src/vm/ObjectImpl.h
@@ +970,4 @@
>  #ifdef JSGC_GENERATIONAL
> +    gc::StoreBuffer *storeBuffer = obj->storeBuffer();
> +    if (storeBuffer)
> +        storeBuffer->putRelocatableCellFromAnyThread(cellp);

Gah, I really shouldn't make last-minute changes. Will fix.
Attachment #8415393 - Flags: review+ → review?(jcoppeard)
(In reply to Jon Coppeard (:jonco) from comment #35)
> Comment on attachment 8415393 [details] [diff] [review]
> storebuffer_in_trailer-v0.diff
> 
> Review of attachment 8415393 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Cool, it seems this combines the runtime check with the checking whether the
> object is in the nursery.
> 
> ::: js/src/gc/Barrier.h
> @@ +357,4 @@
> >      static void postBarrier(Value *vp) {
> >  #ifdef JSGC_GENERATIONAL
> > +        if (vp->isObject()) {
> > +            gc::StoreBuffer *sb = reinterpret_cast<gc::Cell *>(&vp->toObject())->storeBuffer();
> 
> I think you don't need to cast here because JSObject ultimately derives from
> Cell.

Sadly this isn't the case. We need full definitions of Barrier available to define JSObject, since it embeds some HeapPtr<>. This means we only have the empty public definition of JSObject in Barrier.h. If we could express that relation without exposing the types, it would make everything in the public tracing api much simpler.
 
> ::: js/src/gc/Heap.h
> @@ +623,4 @@
> >  #endif
> >  
> >      JSRuntime       *runtime;
> >  };
> 
> The padding looks a bit strange.  If we have it in the same place on both
> architectures we won't need the #ifdefs:
> 
>   uint32_t location;
>   uint32_t padding;
>   StoreBuffer *storeBuffer;
>   JSRuntime *runtime;

Wow, great observation! Done.
> if (!CurrentThreadCanAccessRuntime(runtime_))
This also calls InParallelSection, which uses TLS. So let's make that *catastrophically* slow.
> @@ +377,5 @@
> >      static void postBarrierRemove(Value *vp) {
> >  #ifdef JSGC_GENERATIONAL
> >          JSRuntime *rt = static_cast<js::gc::Cell *>(vp->toGCThing())->runtimeFromAnyThread();
> >          JS::shadow::Runtime *shadowRuntime = JS::shadow::Runtime::asShadowRuntime(rt);
> > +        shadowRuntime->gcStoreBufferPtr()->removeRelocatableValueFromAnyThread(vp);
> 
> Now that I'm thinking about it, removal should work on prior value, so it should be safe to do the same
> optimization here. This doesn't check maybeInRememberedSet however, so it's not a huge perf concern.

Actually, the conditions for calling relocate vs remove are isMarkable/non-null, whereas the conditions for actually inserting would be more tightly constrained. This runs into at least one problem if we do 1) ptr = Nursery; 2) ptr = Tenured; 3) ~ptr. This fails to remove the entry on destruction because the tenured can't get to the store buffer to do the removal. I was thinking we removed at (2), but since we're only checking bool(ptr), we end up just inserting a second entry for the tenured entry.

I'm just going to leave all the relocatable barrier code alone for now, as they aren't actually implicated in any perf issues yet. I think we can move the isNursery check to needsPostBarrier to fix this, but it's worth a separate issue.
Comment on attachment 8415709 [details] [diff] [review]
speed_up_some_barriers-v0.diff

Review of attachment 8415709 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/VMFunctions.cpp
@@ -713,5 @@
>  void
>  PostWriteBarrier(JSRuntime *rt, JSObject *obj)
>  {
>      JS_ASSERT(!IsInsideNursery(rt, obj));
> -    rt->gcStoreBuffer.putWholeCell(obj);

I'm surprised how little effect this had on octane. Oh well.
Attachment #8415709 - Flags: review?(jcoppeard)
This gets a small but notable speedup on the testcase in this bug. I'm clearly still missing something important.
Attachment #8415710 - Flags: review?(jcoppeard)
(In reply to Terrence Cole [:terrence] from comment #36)
> > +    static void writeBarrierPost(T *thing, gc::Cell **addr) {}
> 
> We need to cast in quite a few more places now, particularly headers that
> don't understand that Object derives from Cell, but I think it makes the
> intention clearer. I'd appreciate your thoughts once you've seen the changes
> that were required.

I like the typing, but I don't like all the extra casting :)

Can we have overloaded AsCell(JSObject *thing) style functions to do the casting for us?  Maybe if we declare it inline and only supply a definition when JSObject has been defined then we won't need the casting.

> The real win
> though is that the null check here is actually checking if the target is in
> the nursery. This filters not-in-remember set faster /and/ means that we can
> remove |!nursery.isInside(deref(v))| from maybeInRememberedSet: a mask,
> three compares, etc.

Slightly off-topic, but IsInsideNursery() does:

> return uintptr_t(p) >= runtime->gcNurseryStart_ && uintptr_t(p) < runtime->gcNurseryEnd_;

I suspect compilers may optimize this pattern already, but in case they don't we can change it to something like:

> return uintptr_t(p) - runtime->gcNurseryStart_ < Nursery::NurserySize;

Which would save a compare and branch.
Comment on attachment 8415393 [details] [diff] [review]
storebuffer_in_trailer-v0.diff

Review of attachment 8415393 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Barrier.h
@@ +232,5 @@
>          }
>  #endif
>      }
>  
> +    static void writeBarrierPost(T *thing, gc::Cell **addr) {}

This does make the intention clearer, but unfortunately it means we have to cast (almost?) everywhere we call it.

I experimented with making it |T **addr| and that didn't work out either.

TBH I'm not sure this is worth it.

We can add an assert that |*addr == thing| that will prevent misuse.

::: js/src/gc/StoreBuffer.cpp
@@ +365,5 @@
>  JS::HeapValueRelocate(JS::Value *valuep)
>  {
>      /* Called with old contents of *valuep before overwriting. */
>      JS_ASSERT(valuep->isMarkable());
> +    if (valuep->isString() && valuep->toString()->isPermanentAtom())

Shouldn't we have the same check as in HeapValuePostBarrier() here?

However, if we're going to do this I think we could just change GCMethods<JS::Value>::needsPostBarrier(const JS::Value &v) to check v.isObject() rather than isMarkable() and save an API call.
Attachment #8415393 - Flags: review?(jcoppeard) → review+
Comment on attachment 8415709 [details] [diff] [review]
speed_up_some_barriers-v0.diff

Review of attachment 8415709 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

::: js/src/gc/StoreBuffer.h
@@ +447,5 @@
>      void putCellFromAnyThread(Cell **cellp) { putFromAnyThread(bufferCell, CellPtrEdge(cellp)); }
>      void putSlotFromAnyThread(JSObject *obj, int kind, int32_t start, int32_t count) {
>          putFromAnyThread(bufferSlot, SlotsEdge(obj, kind, start, count));
>      }
> +    void putWholeCellFromAnyThread(Cell *cell) {

This doesn't appear to be used - can we get rid of it?
Attachment #8415709 - Flags: review?(jcoppeard) → review+
Comment on attachment 8415710 [details] [diff] [review]
speed_up_nativesetslot-v0.diff

Review of attachment 8415710 [details] [diff] [review]:
-----------------------------------------------------------------

Nice, gets rid of the main thread test.

::: js/src/gc/Barrier.h
@@ +924,5 @@
> +                if (ContextFriendFields::get(cx)->isJSContext())
> +                    storeBuffer->putSlotFromMainThread(owner, kind, slot, 1);
> +                else
> +                    storeBuffer->putSlotFromAnyThread(owner, kind, slot, 1);
> +            }

So I'm a little confused now...

At the moment, only the main thread can ever touch things in the nursery.  That means if we extract a store buffer pointer from a chunk trailer, then we must be on the main thread (and we assert this in putFromMainThread).  If we're not on the main thread, we don't ever insert entries into the store buffer (but we don't seem to assert that an entry wasn't required, which I guess we could do).

This would mean that the call of storeBuffer->putSlotFromAnyThread() should not be required.  Did I misunderstand what this does?

(Also I think in future PJS is planning to make use of nurseries off the main thread, but that just means our thread assertion will have to be updated to take account of that).
(In reply to Jon Coppeard (:jonco) from comment #46)
> Comment on attachment 8415710 [details] [diff] [review]
> speed_up_nativesetslot-v0.diff
> 
> Nice, gets rid of the main thread test
> 
> ::: js/src/gc/Barrier.h
> @@ +924,5 @@
> > +                if (ContextFriendFields::get(cx)->isJSContext())
> > +                    storeBuffer->putSlotFromMainThread(owner, kind, slot, 1);
> > +                else
> > +                    storeBuffer->putSlotFromAnyThread(owner, kind, slot, 1);
> > +            }
> 
> So I'm a little confused now...
> 
> At the moment, only the main thread can ever touch things in the nursery. 
> That means if we extract a store buffer pointer from a chunk trailer, then
> we must be on the main thread (and we assert this in putFromMainThread).  If
> we're not on the main thread, we don't ever insert entries into the store
> buffer (but we don't seem to assert that an entry wasn't required, which I
> guess we could do).
> 
> This would mean that the call of storeBuffer->putSlotFromAnyThread() should
> not be required.  Did I misunderstand what this does?
> 
> (Also I think in future PJS is planning to make use of nurseries off the
> main thread, but that just means our thread assertion will have to be updated
> to take account of that).

You are 100% correct: we do not need the putSlotFromAnyThread here. More to the point: this is a lot of ugly code and the wins are surprisingly marginal, at least on linux. I'm going to push the two preceding patches independently to see how big a boost we get from putWholeCellFromMainThread. If that's actually significant on some platform I'll move forward with this; otherwise I'm not sure passing a context is worth it.
Oh, oh, oh... I am an idiot. I've been so focused on reducing overhead that I forgot /why/ GGC is faster. The Lambdas are getting allocated tenured, so we are not getting any of GGC's benefits, only the overhead.

The reasons for this are: (1) generally putting functions in the nursery is a bad idea because they are held live by scripts; (2) things allocated through the interpreter are generally on a slow path anyway so we should avoid wasting nursery and store-buffer space that would be better used by the jits. This particular case is different because Lambdas are (apparently) not held live by their script and the interpreter routine is just getting called from a jit stub here.

This one-line patch allocates Lambdas in the nursery and gets us back to performance parity. It seems to pass all jsapi-tests and jit-tests; I will post try results if and when the try push finishes.
Attachment #8415970 - Flags: review?(jcoppeard)
Backed out for mochitest-2 asan failures.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e5a6e3bfa4be
Attachment #8415970 - Flags: review?(jcoppeard) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f4e9cf85af8

Try run seems to be green, modulo the inbound bustage from yesterday morning:
https://tbpl.mozilla.org/?tree=Try&rev=907c8fdeb598
When this lands again we will see a win of 4.5% on linux32/linux64 on the tart test!
This is getting a bit messy, so let me clarify what the current state is. There are 3 fixes here that I'd like to land in this bug.

1) Lambda objects in the nursery. This is currently "breaking" jit-tests: we have a bunch of tests that set gcMaxBytes to a small value, then do a recursive call. These are marked to fail with OOM, but they no longer generate garbage, so instead they hit the stack limit first. I think I've got a fix; we'll see shortly: https://tbpl.mozilla.org/?tree=Try&rev=089d0461401a.

This will get us a 100% perf improvement on the dom-attr micro-benchmark, although it is unlikely to help elsewhere. It may even be a net loss, so we'll need to watch other tests as well.

2) StoreBufferPtr through the chunk trailer. This is currently broken because of Relocatable<>. We are using needsPostBarrier for insertion, which only checks isMarkable/non-null. This allows us to put a nursery pointer, then a non-nursery pointer (whose chunk trailer has a null StoreBuffer*). Now when we assign a nullptr and try to remove, we have no access to the store buffer and end up with a store buffer entry pointing at a slot that now holds nullptr. I think the right solution here is going to be to make the needsPostBarrier check also check IsInsideNursery, but this will warrant a complete try run.

This will get us 4.5% on tart and will be a small win across the board.

3) Do not check CurrentThreadCanAccessRuntime from putWholeCell. Blocked on (2). Should be a small win everywhere.
Duplicate of this bug: 973584
Repushing part 2:
http://hg.mozilla.org/integration/mozilla-inbound/rev/f5ac479e69d4

Green try run, particularly missing the prior M-2 bustage:
https://tbpl.mozilla.org/?tree=Try&rev=e4c1c0a48952
Comment on attachment 8415710 [details] [diff] [review]
speed_up_nativesetslot-v0.diff

Review of attachment 8415710 [details] [diff] [review]:
-----------------------------------------------------------------

Not sure whether you're planning to land this, but it looks fine apart from previous comments.
Attachment #8415710 - Flags: review?(jcoppeard) → review+
Attachment #8415393 - Flags: checkin+
Attachment #8415709 - Flags: checkin+
Comment on attachment 8415710 [details] [diff] [review]
speed_up_nativesetslot-v0.diff

I am leaning towards no; it's a bunch of complexity and extra code for a tiny gain.
This shows up as a 2.3% regression on OSX 10.6 for Kraken. Is this expected?
Flags: needinfo?(terrence)
(In reply to Joel Maher (:jmaher) from comment #62)
> this shows up as a 2.3% regression on osx 10.6 for kraken, is this expected?

The new code uses fewer instructions: I got 23 talos improvement e-mails for the change. The 10.6 numbers appear bi-modal to me so it is almost certainly a changed alignment or access pattern. It's certainly not worth investigating; we can get bigger general wins in the time it would take to figure out what's going off the rails in this one narrow case.
Flags: needinfo?(terrence)
sounds like a bunch of steps forward and a quarter step back, sounds like a plan!
(In reply to Joel Maher (:jmaher) from comment #64)
> sounds like a bunch of steps forward and a quarter step back, sounds like a
> plan!

When we landed GGC, we got improvements of 6-10% across the board on octane: a real benchmark. We also got a handful of 2-4% regressions on some silly micro-benchmarks. I wouldn't even bother with /this/ bug if it weren't trivially reproducible and a good excuse to do some much-needed cleanups at the same time.
thanks for taking the time to clean this up.  From a performance tracking perspective I am fine with closing this.
(In reply to Joel Maher (:jmaher) from comment #66)
> thanks for taking the time to clean this up.  From a performance tracking
> perspective I am fine with closing this.

Let's land the lambdas-in-nursery first.

And maybe some out-of-lining. Initial results of that are mixed, so I'm going to try one more thing there.
No longer depends on: 945555
Moving all the barrier code out-of-line saves ~100 lines of very ugly code immediately and probably would allow us to remove some stuff from the shadow namespace too. It appears entirely performance neutral, other than a winxp regression that may just be noise. I think it would be worth checking it in, at least for a couple of days, to get some statistical rigor behind "it is performance neutral".
Blocks: GC.performance
No longer blocks: GenerationalGC
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/8c234572141a
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
GGC won't be part of 31.
Depends on: 1004457
You need to log in before you can comment on or make changes to this bug.