889129 - GenerationalGC: fix performance on splay

Assignee

Description

•

11 years ago

We currently tenure all the things. There seem to be ~7 bits available in TypeObjectFlags that we can easily make into a saturating counter.

Terrence Cole [:terrence]

Assignee

Updated

•

11 years ago

Assignee: terrence → general

Terrence Cole [:terrence]

Assignee

Comment 1

•

11 years ago

Attached patch fix_splay_perf-wip1.diff (obsolete) — Details — Splinter Review

The attached gets our perf on splay almost back to baseline. Yay. Unfortunately it also triggers a crash in Octane, PdfJS in particular. Boo. Fortunately, this looks like a pre-existing bug. Yay.

I will look into this more when I get back from PTO. If anyone wants to look at the failures in the meantime, be my guest.

Note: we need to investigate what value of tenuredCount is best for isLongLived(). Also it looks like we may want to skip the new test in some of our allocation sites to get back more performance, or have a different isLongLived for each site.

Tom S [:evilpie]

Comment 2

•

11 years ago

Attached patch fix-splay (obsolete) — Details — Splinter Review

I was curious and tested out the patch, but something funky was going on with the format so I rebased it. I also tried to find the problem with the crash. After disabling most of the new code so it looks like the thing causing the crash is the change to IonMacroAssembler where we don't nursery allocate long lived object.

Terrence Cole [:terrence]

Assignee

Comment 3

•

11 years ago

Thanks for checking that out, Tom! It's good to know that it's only that one site: I would have expected any of the places I added that check to have the same result.

For the life of me, I cannot figure out how this is causing us not to forward nursery things when they move; I suspect we may just be uncovering a different GGC bug. I'll look into it once I get back from PTO.

Terrence Cole [:terrence]

Assignee

Updated

•

11 years ago

Depends on: 906128

Paul Wright

Comment 4

•

11 years ago

With 906128 resolved, is this bug clear to proceed?

Terrence Cole [:terrence]

Assignee

Comment 5

•

11 years ago

Attached patch pre_tenure_long_lived_objects-v0.diff — Details — Splinter Review

Brian, your idea worked well.

GGC current to GGC with patch:
Richards          |  13125 ->  13083 = -  0.32%
DeltaBlue         |  16159 ->  15954 = -  1.27%
Crypto            |  17274 ->  17659 = +  2.23%
RayTrace          |  26862 ->  24543 = -  8.63%
EarleyBoyer       |  17464 ->  16180 = -  7.35%
RegExp            |   2516 ->   2531 = +  0.60%
Splay             |   4850 ->  12044 = +148.33%
NavierStokes      |  20479 ->  19994 = -  2.37%
PdfJS             |   7849 ->   7004 = - 10.77%
Mandreel          |   3985 ->   4204 = +  5.50%
Gameboy           |  19317 ->  20670 = +  7.00%
CodeLoad          |  15127 ->  15061 = -  0.44%
Box2D             |   5903 ->   5379 = -  8.88%
Score (version 8) |  10668 ->  11207 = +  5.05%

This patch regresses Raytrace, EarleyBoyer, PdfJS, and Box2D a bit and massively improves Splay, making up the difference.

GGC disabled to GGC with patch:
Richards          |  15222 ->  13083 = - 14.05%
DeltaBlue         |  14222 ->  15954 = + 12.18%
Crypto            |  17609 ->  17659 = +  0.28%
RayTrace          |  16559 ->  24543 = + 48.22%
EarleyBoyer       |  14749 ->  16180 = +  9.70%
RegExp            |   2771 ->   2531 = -  8.66%
Splay             |  12452 ->  12044 = -  3.28%
NavierStokes      |  20337 ->  19994 = -  1.69%
PdfJS             |   7785 ->   7004 = - 10.03%
Mandreel          |   3990 ->   4204 = +  5.36%
Gameboy           |  21493 ->  20670 = -  3.83%
CodeLoad          |  16330 ->  15061 = -  7.77%
Box2D             |   5839 ->   5379 = -  7.88%
Score (version 8) |  11159 ->  11207 = +  0.43%

The majority of the remaining regression from trunk to the performance we should be getting appears to be a fault in parsing. Parsing appears to just run ~10% slower with GGC enabled: I'll file a new bug for that issue.

Assignee: general → terrence

Attachment #788466 - Attachment is obsolete: true

Attachment #788602 - Attachment is obsolete: true

Attachment #798060 - Flags: review?(bhackett1024)

Brian Hackett [Laid off!]

Comment 6

•

11 years ago

Comment on attachment 798060 [details] [diff] [review]
pre_tenure_long_lived_objects-v0.diff

Review of attachment 798060 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Nursery.cpp
@@ +341,5 @@
>      JS_ASSERT(!isInside(*pSlotsElems));
>  }
>  
> +static void
> +InvalidateScript(JSRuntime *rt, types::TypeObject *type)

This name is too generic.  MaybeInvalidateScriptUsedWithNew?

@@ +349,5 @@
> +        return;
> +
> +    JSFunction *fun = newScript->fun;
> +    if (!fun)
> +        return;

newScript->fun should never be null.

::: js/src/jit/LIR-Common.h
@@ +4521,5 @@
>      }
>  };
>  
> +// Generational write barrier used when writing to multiple slots in an object.
> +class LPostWriteBarrier : public LInstructionHelper<0, 1, 0>

This should have a better name, given PostWriteBarrier{O,V} already exist.  PostWriteBarrierForObject?

::: js/src/jit/VMFunctions.cpp
@@ +255,5 @@
> +    NewObjectKind newKind = !type
> +                            ? SingletonObject
> +                            : typeArg->isLongLivedForJITAlloc()
> +                              ? TenuredObject
> +                              : GenericObject;

Split this up using at least one |if| statement, for clarity.

@@ +275,5 @@
> +    NewObjectKind newKind = templateObject->hasSingletonType()
> +                            ? SingletonObject
> +                            : templateObject->type()->isLongLivedForJITAlloc()
> +                              ? TenuredObject
> +                              : GenericObject;

Ditto.

::: js/src/jsinfer.cpp
@@ +1925,5 @@
> +
> +    return count >= MaxJITAllocTenures;
> +}
> +
> +

Extra \n here.

::: js/src/jsinfer.h
@@ +1062,5 @@
>  
> +    /* Tenure counter management. */
> +
> +    const static uint32_t MaxCachedAllocTenures = 64;
> +    const static uint32_t MaxJITAllocTenures = OBJECT_FLAG_TENURE_COUNT_LIMIT - 2;

These constants need comments.

@@ +1076,5 @@
> +         * Uncompiled paths are rarely taken and much more likely to be
> +         * singletons: pre-tenure immediately.
> +         */
> +        return true;
> +    }

This function and comment are really weird.  Can you remove this function entirely and remove its uses in JSObject::{create,createArray}?  It's not at all obvious what's happening and if this is fixing a performance bug it should happen separately from this one.

@@ +1080,5 @@
> +    }
> +
> +    bool isLongLivedForCachedAlloc() const {
> +        return tenureCount() >= MaxCachedAllocTenures;
> +    }

This function is also pretty weird.  What is so special about 64 for MaxCachedAllocTenures?  Can you just use MaxJITAllocTenures and give it a more generic name?  Or just always use the nursery if possible when allocating in the VM (where we'll be hurting for performance regardless.)

Attachment #798060 - Flags: review?(bhackett1024) → review+

Terrence Cole [:terrence]

Assignee

Comment 7

•

11 years ago

(In reply to Brian Hackett (:bhackett) from comment #6)
> ::: js/src/jit/LIR-Common.h
> @@ +4521,5 @@
> >      }
> >  };
> >  
> > +// Generational write barrier used when writing to multiple slots in an object.
> > +class LPostWriteBarrier : public LInstructionHelper<0, 1, 0>
> 
> This should have a better name, given PostWriteBarrier{O,V} already exist. 
> PostWriteBarrierForObject?

ForObject doesn't make much sense to me. This is skipping the target check: how about LPostWriteBarrierAllSlots?
 
> @@ +1080,5 @@
> > +    }
> > +
> > +    bool isLongLivedForCachedAlloc() const {
> > +        return tenureCount() >= MaxCachedAllocTenures;
> > +    }
> 
> This function is also pretty weird.  What is so special about 64 for
> MaxCachedAllocTenures?  Can you just use MaxJITAllocTenures and give it a
> more generic name?  Or just always use the nursery if possible when
> allocating in the VM (where we'll be hurting for performance regardless.)

I originally split the cases into VM and JIT, but was unable to find a good tuning with just those knobs: it turns out that NewObjectCache is super hot through allocations we are making in stubs on some tests. The right solution is probably to compile these cases, but I think this is a reasonable hack until that happens. As to why 64: octane happens to be ~200 points faster when this knob is between 32 and 96, with a soft peak in the middle.

Tuning this was quite awkward: if we end up with a heavily used cross-generation edge because we are tenuring half of an object graph, we spend all our time putting things in the store buffer and doing useless minor GC's when the store buffer overflows. I tried a number of other schemes as well -- propagating the tenure count along edges when tenuring, pre-tenuring common store buffer targets, etc -- but none worked as well in the general case as the attached. There is definitely more work to be done here: I basically just stopped hacking when the performance passed tip and twiddled tuning knobs to optimize on top of that.

> @@ +1076,5 @@
> > +         * Uncompiled paths are rarely taken and much more likely to be
> > +         * singletons: pre-tenure immediately.
> > +         */
> > +        return true;
> > +    }
> 
> This function and comment are really weird.  Can you remove this function
> entirely and remove its uses in JSObject::{create,createArray}?  It's not at
> all obvious what's happening and if this is fixing a performance bug it
> should happen separately from this one.

As you surmised, it turns out the best parameter for VM allocations that are not cached is to always allocate tenured. This used to be |return tenureCount() >= MaxInterpAllocTenures|, but MaxInterpAllocTenures was optimal at 0. I will go ahead and just remove this knob.

Terrence Cole [:terrence]

Assignee

Comment 8

•

11 years ago

https://tbpl.mozilla.org/?tree=Try&rev=9e375d66d4cb
https://hg.mozilla.org/integration/mozilla-inbound/rev/a387224eecca

Ed Morley [:emorley]

Comment 9

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/a387224eecca

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla27

Paul Wright

Comment 10

•

11 years ago

Based on the performance as indicated on AWFY, I would not necessarily call this bug fixed.

Terrence Cole [:terrence]

Assignee

Comment 11

•

11 years ago

(In reply to Paul Wright from comment #10)
> Based on the performance as indicated on AWFY, I would not necessarily call
> this bug fixed.

What graph are you looking at? For me, AWFY is showing a 50% improvement to Splay, RayTrace almost almost even with V8, with only a mild regression on RegExp and EarleyBoyer. More tuning of the mechanism in this patch would be nice, but I don't think it will get us significantly farther. Further improvements will need different techniques and should go in new bugs.

Paul Wright

Comment 12

•

11 years ago

(In reply to Terrence Cole [:terrence] from comment #11)
> (In reply to Paul Wright from comment #10)
> > Based on the performance as indicated on AWFY, I would not necessarily call
> > this bug fixed.
> 
> What graph are you looking at? For me, AWFY is showing a 50% improvement to
> Splay, RayTrace almost almost even with V8, with only a mild regression on
> RegExp and EarleyBoyer. More tuning of the mechanism in this patch would be
> nice, but I don't think it will get us significantly farther. Further
> improvements will need different techniques and should go in new bugs.

I am looking at the 64-bit Octane Splay graph.  Currently, the GGC build is at about 7900, and the non-GGC build is at roughly 12,000.  Are there other influences which would account for the 4000-point difference?  This bug is GGC: Fix Splay Performance.  I do not understand how this would be considered fixed given the "standard" build is performing significantly better than the GGC one, if there are no other differences between the two.
With that being said, the performance of the GCC build is significantly better with the improvements this bug has made, and I applaud those efforts.  However, IMHO, being "fixed" is to achieve parity with the standard build, given the title of this bug.

Terrence Cole [:terrence]

Assignee

Comment 13

•

11 years ago

(In reply to Paul Wright from comment #12)
> (In reply to Terrence Cole [:terrence] from comment #11)
> > (In reply to Paul Wright from comment #10)
> > > Based on the performance as indicated on AWFY, I would not necessarily call
> > > this bug fixed.
> > 
> > What graph are you looking at? For me, AWFY is showing a 50% improvement to
> > Splay, RayTrace almost almost even with V8, with only a mild regression on
> > RegExp and EarleyBoyer. More tuning of the mechanism in this patch would be
> > nice, but I don't think it will get us significantly farther. Further
> > improvements will need different techniques and should go in new bugs.
> 
> I am looking at the 64-bit Octane Splay graph.  Currently, the GGC build is
> at about 7900, and the non-GGC build is at roughly 12,000.  Are there other
> influences which would account for the 4000-point difference?  This bug is
> GGC: Fix Splay Performance.  I do not understand how this would be
> considered fixed given the "standard" build is performing significantly
> better than the GGC one, if there are no other differences between the two.
> With that being said, the performance of the GCC build is significantly
> better with the improvements this bug has made, and I applaud those efforts.
> However, IMHO, being "fixed" is to achieve parity with the standard build,
> given the title of this bug.

Splay creates a single large tree that lives the entire duration of the test. GGC assumes that the majority of objects are short lived. This test is simply going to perform worse with GGC than it does without. In particular, note how big the gap between V8 (GGC) and SpiderMonkey (no GGC) was back in March. Also note that our GGC performance was almost identical to V8's performance as of March. That is simply the baseline cost of enabling GGC on Splay.

Since March, V8 has landed several GC changes that are similar to the change here. The change here has a similar magnitude of improvement to the patch to V8 that was the most similar to the technique used here. With that background and within the context of "initial landing of GGC in SpiderMonkey," I do indeed consider splay performance to be "fixed." We will do more later, but will be focusing on lower hanging fruit first.

Paul Wright

Comment 14

•

11 years ago

(In reply to Terrence Cole [:terrence] from comment #13)
> (In reply to Paul Wright from comment #12)
> > (In reply to Terrence Cole [:terrence] from comment #11)
> > > (In reply to Paul Wright from comment #10)
> > > > Based on the performance as indicated on AWFY, I would not necessarily call
> > > > this bug fixed.
> > > 
> > > What graph are you looking at? For me, AWFY is showing a 50% improvement to
> > > Splay, RayTrace almost almost even with V8, with only a mild regression on
> > > RegExp and EarleyBoyer. More tuning of the mechanism in this patch would be
> > > nice, but I don't think it will get us significantly farther. Further
> > > improvements will need different techniques and should go in new bugs.
> > 
> > I am looking at the 64-bit Octane Splay graph.  Currently, the GGC build is
> > at about 7900, and the non-GGC build is at roughly 12,000.  Are there other
> > influences which would account for the 4000-point difference?  This bug is
> > GGC: Fix Splay Performance.  I do not understand how this would be
> > considered fixed given the "standard" build is performing significantly
> > better than the GGC one, if there are no other differences between the two.
> > With that being said, the performance of the GCC build is significantly
> > better with the improvements this bug has made, and I applaud those efforts.
> > However, IMHO, being "fixed" is to achieve parity with the standard build,
> > given the title of this bug.
> 
> Splay creates a single large tree that lives the entire duration of the
> test. GGC assumes that the majority of objects are short lived. This test is
> simply going to perform worse with GGC than it does without. In particular,
> note how big the gap between V8 (GGC) and SpiderMonkey (no GGC) was back in
> March. Also note that our GGC performance was almost identical to V8's
> performance as of March. That is simply the baseline cost of enabling GGC on
> Splay.
> 
> Since March, V8 has landed several GC changes that are similar to the change
> here. The change here has a similar magnitude of improvement to the patch to
> V8 that was the most similar to the technique used here. With that
> background and within the context of "initial landing of GGC in
> SpiderMonkey," I do indeed consider splay performance to be "fixed." We will
> do more later, but will be focusing on lower hanging fruit first.

I appreciate the background info.  With that, I agree that this bug is fixed with some followups forthcoming.  Thanks.

Paul Bone [:pbone]

Updated

•

5 years ago

Blocks: 1507445

fix_splay_perf-wip1.diff 11 years ago Terrence Cole [:terrence] 16.40 KB, patch		Details \| Diff \| Splinter Review
fix-splay 11 years ago Tom S [:evilpie] 16.03 KB, patch		Details \| Diff \| Splinter Review
pre_tenure_long_lived_objects-v0.diff 11 years ago Terrence Cole [:terrence] 27.37 KB, patch	bhackett1024 : review+	Details \| Diff \| Splinter Review