Closed Bug 528338 Opened 15 years ago Closed 14 years ago

[Speed] Tune the paths into and through GC::Alloc, GC::Free, et al

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: lhansen, Assigned: lhansen)

References

Details

Attachments

(18 obsolete files)

Turns out there's quite a bit that can be done to reduce the computation on the paths into and through GC::Alloc: tests can be coalesced or lifted off the common paths; methods can be in-lined; lookup tables can replace decision trees; compiler problems can be worked around; accounting calls can be merged.
Priority: -- → P2
Target Milestone: --- → flash10.1
Simple adjustments: make sure functions that should be inlined really are inlined, and specialize 'new' a couple of places so that AllocExtra is only called when there's anything extra to allocate - that path has one extra overflow check on it.
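A minimal sketch of that specialization (hypothetical names and stub bodies; only the shape of the split is taken from the comment above) — the call site tests the extra-byte count once, so the overflow check runs only when there really is something extra:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative stand-in for the allocator; counters replace real work.
struct GC {
    int allocs = 0, extraAllocs = 0;

    void* Alloc(size_t size) { ++allocs; return nullptr; }  // stub

    // Extra-bytes path, carrying the extra overflow check mentioned above.
    void* AllocExtra(size_t size, size_t extra) {
        assert(size <= SIZE_MAX - extra);  // the overflow check
        ++extraAllocs;
        return Alloc(size + extra);
    }
};

// Specialized 'new' helper: route through AllocExtra only when there
// is actually something extra to allocate.
inline void* New(GC& gc, size_t size, size_t extra) {
    return extra == 0 ? gc.Alloc(size) : gc.AllocExtra(size, extra);
}
```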
Attachment #412462 - Flags: review?(treilly)
Attachment #412462 - Flags: review?(treilly) → review+
Comment on attachment 412462 [details] [diff] [review]
Cleanup inlining & specialize new

redux changeset:   3123:e7bc2f5c27ca
Attachment #412462 - Attachment is obsolete: true
Attached patch Streamline GC::Alloc (obsolete) — Splinter Review
Comment on attachment 412831 [details] [diff] [review]
Streamline GC::Alloc

This is more or less a wash on the Tamarin benchmarks, even with the AllocDouble optimization disabled.  It has shown some speedups in the Flash Player (benchmark results to appear).  Inspection of the generated code shows it to be considerably shorter and less branchy, though, so I expect it will benefit some platforms more than my development machine.
Attached patch Streamline GC::Free (obsolete) — Splinter Review
Minor changes: move one test in under another, group the flags together in the GC object to allow caching to work better, and add a bunch of comments about remaining work and where the cost goes.
Summary: [Speed] Tune the paths into and through GC::Alloc → [Speed] Tune the paths into and through GC::Alloc, GC::Free, et al
+#ifdef MMGC_MEMORY_INFO
+		item = GetUserPointer(item);
+#endif

Is a verbose way of saying:

+		item = GetUserPointer(item);
(In reply to comment #6)
> +#ifdef MMGC_MEMORY_INFO
> +        item = GetUserPointer(item);
> +#endif
> 
> Is a verbose way of saying:
> 
> +        item = GetUserPointer(item);

I know, but I was in an über-pedantic mood and wanted to be able to see exactly what work would be performed.  Will clean it up tho'.
Attached patch Streamline GC::Free, v2 (obsolete) — Splinter Review
One more tweak: the C++ spec states that calling 'delete' on a NULL pointer has no effect.  Ergo the null pointer check at the beginning of Free is redundant, if Free is called from delete.  So introduce a new function FreeNotNull called from 'delete', then make Free() an inline wrapper that checks for NULL and calls FreeNotNull as appropriate.

Also fixed some warnings and a Debug-mode compile problem.
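The Free/FreeNotNull split might look like this (a sketch: names follow the comment, the bodies are placeholder counters, not MMgc's actual code):

```cpp
// Hypothetical sketch of the refactoring described above.
struct GC {
    int freeCalls = 0;  // stand-in for the real deallocation work

    // Out-of-line worker: may assume its argument is non-null, since
    // 'delete' on a null pointer is already required to be a no-op.
    void FreeNotNull(void* item) {
        ++freeCalls;    // real code would return 'item' to its block
    }

    // Inline wrapper: callers that cannot prove non-nullness use this,
    // while 'delete' goes straight to FreeNotNull.
    inline void Free(void* item) {
        if (item != nullptr)
            FreeNotNull(item);
    }
};
```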
Attachment #412844 - Attachment is obsolete: true
Cleaned up and tested more extensively.
Attachment #412831 - Attachment is obsolete: true
Attachment #413082 - Attachment is obsolete: true
Attachment #413590 - Flags: review?(treilly)
This is preliminary, but apart from hanging the sampler/Callback test, it works fine.

The gist is that we layer a quick list on top of the blocks in each instance of GCAlloc.  This removes most decision points from both Alloc and Free, effectively batches many expensive operations, and requires fewer expensive operations overall.  A consistent global state with empty quick lists is reestablished as needed.  See comment at the head of GCAlloc.cpp in the patch for more info.

This patch needs a couple of things still:

- proper performance numbers
- more testing
- a mechanism to prevent the quick lists from growing without bound, since
  that would increase memory consumption.  I have some ideas, not fully baked
  yet.
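The quick-list idea can be sketched as follows (illustrative names only; the real patch stores more state and has slow-path fallbacks): freed objects from all of an allocator's blocks are threaded onto one list, so the common Alloc and Free become a pop and a push.

```cpp
// Minimal sketch of a per-allocator "quick list": the next-pointer is
// stored in the free object's first word.
struct QuickListAllocator {
    void* head = nullptr;  // first free object, or null

    // Fast-path free: push onto the quick list.
    void Free(void* item) {
        *reinterpret_cast<void**>(item) = head;
        head = item;
    }

    // Fast-path alloc: pop the head; a null return sends the caller to
    // the slow path (sweep a block, ask the heap for a new one).
    void* Alloc() {
        void* item = head;
        if (item != nullptr)
            head = *reinterpret_cast<void**>(item);
        return item;
    }
};
```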
Together these two patches drop the time spent in MMgc by 2% on the AS2 GCBench benchmark.  (This is with MMGC_HOOKS turned on; I need to re-measure with that disabled, and if the difference is significant look into tuning the hook paths some more.  For example, there are two tests on the alloc path for whether hooks are enabled, but one ought to be enough.)
We can take the hook checks out if !MMGC_MEMORY_PROFILER and !DEBUG no?
This is the test currently in MMgc.h, should it be refined?

#if defined DEBUGGER || defined MMGC_MEMORY_PROFILER || defined MMGC_MEMORY_INFO
    #ifndef MMGC_HOOKS
        #define MMGC_HOOKS
    #endif
#endif
(In reply to comment #11)
> Together these two patches drop the time spent in MMgc by 2% on the AS2 GCBench
> benchmark.

2 percentage points out of 18 for MMgc = 11% speedup in MMgc for this benchmark.
(In reply to comment #13)
> This is the test currently in MMgc.h, should it be refined?
> 
> #if defined DEBUGGER || defined MMGC_MEMORY_PROFILER || defined
> MMGC_MEMORY_INFO
>     #ifndef MMGC_HOOKS
>         #define MMGC_HOOKS
>     #endif
> #endif

Looks right to me
Here are some more ideas.

- Merge hooks more (alluded to above) and control more precisely when the
  hook check is needed (ditto)

- Have a soft cache of blocks that have already been initialized for a given
  allocator.  The idea is that once an allocator frees a block it should keep
  it but notify GCHeap that it has it - the block is in a consistent state
  and re-initialization should be particularly cheap.  If GCHeap needs a block
  it can take it from somebody's soft cache instead of allocating from the
  heap.  A little complication creeps in when GCHeap needs more than one
  consecutive block: we'd like to be able to perform coalescing of free blocks
  across the allocators' free lists.  Anyway, I mention this because 
  CreateChunk is showing up in the profile and it would be interesting to see
  if we could make it go away again.
Here's yet another idea:

Observe that GC::Alloc performs a double dispatch: it dispatches on size and on the flag bits.  In most cases both are known statically, because the calls to Alloc come mostly from the 'new' operator, which inserts those flags and knows the size.  Ergo the 'new' operator may be able to specialize the call.

Specializing by flags is dead easy, we just provide allocator methods for the flag combinations we care about.  Specializing by size is easy once we've specialized by flag, because then we already know which allocator pool to choose from - it's just a simple table lookup.  The code is small and can be inlined everywhere.

A quick test shows that this works: String::createDynamic invokes 'new (gc) String(...)', this is open-coded by the compiler as a direct call to GCAlloc::Alloc, taking GC::Alloc entirely out of the loop (modulo running of hooks and so on, which we'd have to figure out how to deal with).
This is not something finished, just a general idea of what it would look like.  We still need to figure out how to run hooks, and trigger the collector somehow.  Both add to the inline path.  In the existing case of AllocDouble we don't specialize if MMGC_HOOKS is enabled; this is a possible solution but not ideal.
Combines the first streamlining patch with an elaboration of the specialized allocation patch; this one also specializes AllocExtra by means of a fast-path overflow check.

(This probably makes the GCAlloc::Alloc/Free patch no longer apply cleanly.)
Attachment #413590 - Attachment is obsolete: true
Attachment #414037 - Attachment is obsolete: true
Attachment #414319 - Flags: review?(treilly)
Attachment #413590 - Flags: review?(treilly)
Like the previous patch, but fixes a bug in debug accounting.
Attachment #414319 - Attachment is obsolete: true
Attachment #414520 - Flags: review?(treilly)
Attachment #414319 - Flags: review?(treilly)
This is clean and has all facilities.  Again, see major block comment in GCAlloc.cpp for an explanation of how the quick lists work.
Attachment #413591 - Attachment is obsolete: true
Attachment #414708 - Flags: review?(treilly)
Attached patch Bug fixes for the two patches (obsolete) — Splinter Review
Some bugs I found while mopping up, these fixes apply to both patches.  I can fold them in if you think it's unworkable to review multiple patches like this.

- Inline specialized allocators did not always go to the right allocator pool
- Inline specialized allocators did not compute size classes correctly
- DRC pinning can modify free objects, which messes up the storage of the
  index of an object within its block in the second word of the free object.
  Worked around this by documenting the issue and masking the bit on load.
- Removed one redundant division in GCAlloc::CreateChunk.  Fairly strange that
  it was not constant-folded but it showed up hot in a profiler run (could be
  profiler noise).  Anyway we don't need it.
Attachment #414741 - Flags: review?(treilly)
Comment on attachment 414520 [details] [diff] [review]
Streamline GC::Alloc and GC::Free, and specialize - v2

AllocPtrZero/AllocPtrZeroFinalized/AllocRCObject all use containsPointersRCAllocs, that's not right is it?
Nevermind, I see patch 3 fixes it.
Attachment #414520 - Flags: review?(treilly) → review+
Attachment #414708 - Flags: review?(treilly) → review+
Attachment #414741 - Flags: review?(treilly) → review+
Will integrate into TR after Beta 2 branch / integration.
Whiteboard: Has patch
Edwin notes elsewhere about how the JIT code is distributed (this is TR): 

"With an AS3-symbol-instrumented shark run, I get 6.5% self time in Point(), 5.9% in allocate(), 1.3% in Object(), negligible everywhere else.  If I flatten, it all attributes to allocate(), except for a few stray samples attributed to Point but without allocate on the stack.  (i blame shark for fuzzy noise like that)"

Comparatively that's a lot of time in allocate(), but if it has to call finddef_cache, then call newInstance, then call the constructor, I can see it,
depending on how well those calls are done.

Ed again: "[Calls] are not superb, so the relative time in Point vs Allocate doesn't surprise me.  Improvements to finddef, the as3 jit ABI, and general jit code quality should uniformly help."
Comment on attachment 414520 [details] [diff] [review]
Streamline GC::Alloc and GC::Free, and specialize - v2

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414520 - Attachment is obsolete: true
Comment on attachment 414708 [details] [diff] [review]
Streamline GCAlloc::Alloc and GCAlloc::Free, v2

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414708 - Attachment is obsolete: true
Comment on attachment 414741 [details] [diff] [review]
Bug fixes for the two patches

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414741 - Attachment is obsolete: true
Whiteboard: Has patch
I have one more fix coming, an optimization to the GC::FreeNotNull path.  Details currently available in attachment #415359 [details] [diff] [review] and in comments on bug #531250.
Patches were backed out due to regression, will re-land shortly.
Whiteboard: Has patch
Certainly one bug in the change is that the patch assumes that every object is always two words on 64-bit systems.  That turns out not to be true, of course: short non-pointer-containing objects (strings) are still allocated in 8 bytes, one word.  Thus the trick of storing the object index in the second word of objects on the free list only works on 32-bit systems.

I found another bug which may or may not have any influence on the result; RCObject::Stick is pretty much completely wrong.  It should 'or' in the sticky bit, not set composite to sticky, and if the object is in the ZCT it needs to be removed from the ZCT.  I'll file a separate bug for that.
Yet another bug shows up in debug mode: the collector may now be triggered while no userptr to the object is on the stack or in a register, and the newly allocated object may be reaped as a consequence.
How, is this a GetUserPointer/GetRealPointer issue?
(In reply to comment #34)
> How, is this a GetUserPointer/GetRealPointer issue?

In the patch, SignalAllocWork has moved into GCAlloc::Alloc.  However the 'item' that's returned from GCAlloc::Alloc is a RealPtr and a UserPtr may not exist at that point (it hasn't been computed yet - the caller computes it).  Since SignalAllocWork can trigger a collection, only the RealPtr may be on the stack or in a register when the GC runs.  But the GC does not work on RealPtrs, it needs a UserPtr.  Ergo the GC will reclaim the newly allocated 'item' - in debug builds.
Depends on: 533957, 503304
I consider this reviewed already; it's logged here because it integrates the bugfixes into the patch and also has a refinement necessary to avoid regressing the sampling code (see bug #533954).
Depends on: 533954
No longer depends on: 533957
Tested, passes the buildbot, rebased to 3306, ready to land once the tree opens.
Attachment #416913 - Attachment is obsolete: true
This comes from bug #531250, applies on top of attachment #417102 [details] [diff] [review], and is relative to a work tree I have - not clean yet, but passes regression testing on my system.
Comment on attachment 417102 [details] [diff] [review]
Streamline GC::Alloc and GC::Free, and specialize - v4

redux changeset:   3320:05f8ad8e3e2c
Attachment #417102 - Attachment is obsolete: true
Make GCAlloc::Free and GCLargeAlloc::Free virtual functions; subclass the allocators from a common abstract base; move logic into the two Free functions; make GC::FreeNotNull a simple inlinable wrapper.  We get speedup from the moved complexity - the old FreeNotNull was not being compiled well by GCC - and from the inlining.
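In outline (class and method names mirror the comment; the bodies are placeholder counters): FreeNotNull no longer decides small-vs-large itself, it makes one virtual call through the common base, which is small enough to inline.

```cpp
// Sketch of the virtual-dispatch refactoring described above.
struct GCAllocBase {
    virtual ~GCAllocBase() {}
    virtual void Free(void* item) = 0;  // all free logic lives here
};

struct GCAlloc : GCAllocBase {
    int freed = 0;
    void Free(void*) override { ++freed; }  // small-object path
};

struct GCLargeAlloc : GCAllocBase {
    int freed = 0;
    void Free(void*) override { ++freed; }  // large-object path
};

// Inlinable wrapper: look up the owning allocator (here passed in;
// really derived from the block header) and dispatch once.
inline void FreeNotNull(GCAllocBase* owner, void* item) {
    owner->Free(item);
}
```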
Attachment #417116 - Attachment is obsolete: true
Attachment #417515 - Flags: review?(treilly)
This has been scaled back a little from what we had before, because the method of communicating the object index from CreateChunk/Free/Sweep to Alloc does not work well on 64-bit systems (objects can be one word long) and more importantly, interacts poorly with the invariants on the 'composite' field in RCObject.

(This patch includes some very light cleanup of RCObject and documentation of those invariants, all in the name of making things clearer.  I may or may not factor that out before this patch lands.)
Comment on attachment 417515 [details] [diff] [review]
Speed up GC::Free by rearranging and refactoring (clean)

What's with the padding in GCBlockHeader?
(In reply to comment #42)
> (From update of attachment 417515 [details] [diff] [review])
> What's with the padding in GCBlockHeader?

You mean in GCLargeAlloc::LargeBlock?  I meant to document that.  It's necessary to get 8-byte alignment for large-object allocation.  GCBlockHeader increased from three entries to four, and with only one in LargeBlock we need to pad.
Comment on attachment 417515 [details] [diff] [review]
Speed up GC::Free by rearranging and refactoring (clean)


might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?

speeding up Free involves adding a virtual method call, really?  virtual call can't be faster than the math based large alloc check can it?

> GCAssert(m_gc->collecting == false || m_gc->marking == true)
I can't get my head around this, could both be false?	

Another goto bites the dust, yeah!

explicit Collecting check in ~StringObject seems unnecessary
The quicklist work changes object behavior enough that bug #506013 starts showing up as failures in the mmgc selftests.
Depends on: 506013
(In reply to comment #44)
> (From update of attachment 417515 [details] [diff] [review])
> 
> might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?

Good point, will fix this.

> speeding up Free involves adding a virtual method call, really?  virtual call
> can't be faster than the math based large alloc check can it?

As I alluded to in comment #40, there are several reasons why it is faster:

  - The out-of-line FreeNotNull was not compiled well (by GCC at least) because
    of all the special cases in the function; the prologue/epilogue ended up
    being fairly complicated for what is essentially a simple trampoline in
    many cases
  - Complexity moves into the underlying Free where it can be shared with
    the complexity that was already there, so we gain a bit that way
  - You pretty much get perfect branch prediction everywhere when
    FreeNotNull is inlined and makes a virtual call to the underlying Free,
    since typical sites that free objects only free objects of one size.
    In the case where we have a test-and-call, branch prediction is going to
    be less perfect, plus there are two branches to predict.
  - With no decision tree in it FreeNotNull is now so small that it makes
    sense to inline it so we save one function call/return too

As reported in bug #531250, on a synthetic benchmark meant to focus on alloc/dealloc cost, we got a 7% speedup from this change alone.

> > GCAssert(m_gc->collecting == false || m_gc->marking == true)
> I can't get my head around this, could both be false?

Yes.  The point of this test is to ensure that collecting can only be true if marking is true; when that's the case, a test of the former only needs to be performed if the latter is true.  This shaves a little bit off the deallocation path when the marker's not running.
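The exploited invariant can be sketched like so (field names as in the comment; the surrounding code is illustrative): because collecting implies marking, the deallocation path nests the collecting test under the marking test and skips both when the marker is idle.

```cpp
#include <cassert>

struct GCState {
    bool marking = false;
    bool collecting = false;
};

inline bool NeedsCollectorHandling(const GCState& gc) {
    // Invariant: collecting == true implies marking == true.
    assert(gc.collecting == false || gc.marking == true);
    if (gc.marking)             // rare: marker is running
        return gc.collecting;   // only now is collecting worth testing
    return false;               // common fast path: a single test
}
```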

> Another goto bites the dust, yeah!

I stared at that code for a while and then ... "that's a while loop!"

> explicit Collecting check in ~StringObject seems unnecessary

Will remove it.
Attachment #417515 - Flags: review?(treilly) → review+
Comment on attachment 417515 [details] [diff] [review]
Speed up GC::Free by rearranging and refactoring (clean)

(In reply to comment #46)
> (In reply to comment #44)
> > (From update of attachment 417515 [details] [diff] [review] [details])
> > 
> > might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?
> 
> Good point, will fix this.

Changed my mind, it's all debug code, and if we cache this we should also cache m_itemSize and b->items.  (The reason alloc is in a local anyway was just to avoid eye bleed from the cast that was required everywhere.)

redux changeset:   3346:d4334f9c401c
Attachment #417515 - Attachment is obsolete: true
Attached patch Quick lists for GCAlloc (obsolete) — Splinter Review
Optimizes the paths through GCAlloc::Alloc and GCAlloc::Free by moving away from the current situation, where we have either free space at the end of a block or an object on an on-block free list, to a situation where we have one per-allocator list of free objects from various blocks (a "quick list").  The effect of the current situation is akin to eager coalescing with the costs that come with that; with quick lists, we move toward lazy coalescing and try to create fast paths that are hit most of the time.

See documentation block at the beginning of GCAlloc.cpp for more information.
Attachment #417520 - Attachment is obsolete: true
Attachment #421531 - Flags: review?(treilly)
There's probably a way to get rid of the if (size <= kLargestAlloc) by adding the large alloc to the allocs table and doing a little smarter size->index calculation.
(In reply to comment #49)
> There's probably a way to get rid of the if (size <= kLargestAlloc) by adding
> the large alloc to the allocs table and doing a little smarter size->index
> calculation.

Possibly.

There might also be ways of avoiding the constant checking for the quick list budget by batching that.

I will consider these as follow-on items though; won't try to do them in this patch.
Attachment #421531 - Flags: review?(treilly) → review+
tamarin-redux changeset: 3559:9e00648eb776 (merge from tamarin-argo).

Remaining issues deferred.
Priority: P2 → --
Whiteboard: Has patch
Target Milestone: flash10.1 → Future
(In reply to comment #48)
> Created an attachment (id=421531) [details]
> Quick lists for GCAlloc

The quick list patch has a bug with -Dgreedy which manifests itself with a stack overflow or an infinite loop, depending on the compiler's ability to perform tail call elimination.  This patch fixes that.

tamarin-argo 3498:30e38e285ea0 (self-reviewed)
Attachment #421531 - Attachment is obsolete: true
Attachment #422518 - Attachment is obsolete: true
Due to a snafu, the quicklist patch landed in tamarin-redux-argo changeset: 3561:a74a4f33061a, not the previously mentioned change sets.
Large-alloc follow-on item: bug #557898.
Quicklist budget follow-on item: bug #557901.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED