Closed
Bug 528338
Opened 15 years ago
Closed 14 years ago
[Speed] Tune the paths into and through GC::Alloc, GC::Free, et al
Categories
(Tamarin Graveyard :: Garbage Collection (mmGC), defect)
Tracking
(Not tracked)
RESOLVED
FIXED
Future
People
(Reporter: lhansen, Assigned: lhansen)
References
Details
Attachments
(18 obsolete files)
Turns out there's quite a bit that can be done to reduce the computation on the paths into and through GC::Alloc: tests can be coalesced or lifted off the common paths; methods can be in-lined; lookup tables can replace decision trees; compiler problems can be worked around; accounting calls can be merged.
Assignee
Updated•15 years ago
Priority: -- → P2
Target Milestone: --- → flash10.1
Assignee
Comment 1•15 years ago
Simple adjustments: make sure functions that should be inlined really are inlined, and specialize 'new' in a couple of places so that AllocExtra is only called when there is actually something extra to allocate - that path has an extra overflow check on it.
Attachment #412462 -
Flags: review?(treilly)
Updated•15 years ago
|
Attachment #412462 -
Flags: review?(treilly) → review+
Assignee
Comment 2•15 years ago
Comment on attachment 412462 [details] [diff] [review] Cleanup inlining & specialize new

redux changeset: 3123:e7bc2f5c27ca
Attachment #412462 -
Attachment is obsolete: true
Assignee
Comment 3•15 years ago
Assignee
Comment 4•15 years ago
Comment on attachment 412831 [details] [diff] [review] Streamline GC::Alloc

This is more or less a wash on the Tamarin benchmarks, even with the AllocDouble optimization disabled. It has shown some speedups in the Flash Player (benchmark results to appear). Inspection of the generated code shows it to be considerably shorter and less branchy, though, so I expect it will benefit some platforms more than my development machine.
Assignee
Comment 5•15 years ago
Minor changes: move one test in under another, group the flags together in the GC object to allow caching to work better, and add a bunch of comments about remaining work and where the cost goes.
Assignee
Updated•15 years ago
Summary: [Speed] Tune the paths into and through GC::Alloc → [Speed] Tune the paths into and through GC::Alloc, GC::Free, et al
Comment 6•15 years ago
+#ifdef MMGC_MEMORY_INFO
+    item = GetUserPointer(item);
+#endif

Is a verbose way of saying:

+    item = GetUserPointer(item);
Assignee
Comment 7•15 years ago
(In reply to comment #6)
> +#ifdef MMGC_MEMORY_INFO
> +    item = GetUserPointer(item);
> +#endif
>
> Is a verbose way of saying:
>
> +    item = GetUserPointer(item);

I know, but I was in an über-pedantic mood and wanted to be able to see exactly what work would be performed. Will clean it up tho'.
Assignee
Comment 8•15 years ago
One more tweak: the C++ spec states that calling 'delete' on a NULL pointer has no effect. Ergo the null pointer check at the beginning of Free is redundant, if Free is called from delete. So introduce a new function FreeNotNull called from 'delete', then make Free() an inline wrapper that checks for NULL and calls FreeNotNull as appropriate. Also fixed some warnings and a Debug-mode compile problem.
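A minimal sketch of the split described here; the names follow the comment but the bodies are instrumentation stand-ins, not MMgc's real deallocation code:

```cpp
#include <cstddef>

static int g_freeNotNullCalls = 0;  // instrumentation for the sketch only

// Out-of-line path: may assume p != NULL, because the C++ spec makes
// 'delete' on a null pointer a no-op, so 'delete' can call this directly.
void FreeNotNull(void* p)
{
    (void)p;
    ++g_freeNotNullCalls;  // stand-in for the real deallocation work
}

// Inline wrapper: the only null check left on the general Free path.
inline void Free(void* p)
{
    if (p != NULL)
        FreeNotNull(p);
}
```

The point is that call sites reached via 'delete' skip the null test entirely, while other callers keep the old Free() semantics through a wrapper cheap enough to inline.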
Attachment #412844 -
Attachment is obsolete: true
Assignee
Comment 9•15 years ago
Cleaned up and tested more extensively.
Attachment #412831 -
Attachment is obsolete: true
Attachment #413082 -
Attachment is obsolete: true
Attachment #413590 -
Flags: review?(treilly)
Assignee
Comment 10•15 years ago
This is preliminary, but apart from hanging the sampler/Callback test, it works fine. The gist is that we layer a quick list on top of the blocks in each instance of GCAlloc. This removes most decision points from both Alloc and Free, effectively batches many expensive operations, and requires fewer expensive operations overall. A consistent global state with empty quick lists is reestablished as needed. See the comment at the head of GCAlloc.cpp in the patch for more info.

This patch still needs a couple of things:

- proper performance numbers
- more testing
- a mechanism to prevent the quick lists from growing without bound, since that would increase memory consumption. I have some ideas, not fully baked yet.
Assignee
Comment 11•15 years ago
Together these two patches drop the time spent in MMgc by 2% on the AS2 GCBench benchmark. (This is with MMGC_HOOKS turned on; I need to re-measure with that disabled, and if the difference is significant look into tuning the hook paths some more. For example, there are two tests on the alloc path for whether hooks are enabled, but one ought to be enough.)
Comment 12•15 years ago
We can take the hook checks out if !MMGC_MEMORY_PROFILER and !DEBUG no?
Assignee
Comment 13•15 years ago
This is the test currently in MMgc.h, should it be refined?

#if defined DEBUGGER || defined MMGC_MEMORY_PROFILER || defined MMGC_MEMORY_INFO
  #ifndef MMGC_HOOKS
    #define MMGC_HOOKS
  #endif
#endif
Assignee
Comment 14•15 years ago
(In reply to comment #11)
> Together these two patches drop the time spent in MMgc by 2% on the AS2 GCBench
> benchmark.

2 percentage points out of 18 for MMgc = an 11% speedup in MMgc for this benchmark.
Comment 15•15 years ago
(In reply to comment #13)
> This is the test currently in MMgc.h, should it be refined?
>
> #if defined DEBUGGER || defined MMGC_MEMORY_PROFILER || defined MMGC_MEMORY_INFO
>   #ifndef MMGC_HOOKS
>     #define MMGC_HOOKS
>   #endif
> #endif

Looks right to me
Assignee
Comment 16•15 years ago
Here are some more ideas.

- Merge hooks more (alluded to above) and control more precisely when the hook check is needed (ditto)
- Have a soft cache of blocks that have already been initialized for a given allocator. The idea is that once an allocator frees a block it should keep it but notify GCHeap that it has it - the block is in a consistent state and re-initialization should be particularly cheap. If GCHeap needs a block it can take it from somebody's soft cache instead of allocating from the heap. A little complication creeps in when GCHeap needs more than one consecutive block; we'd like to be able to perform coalescing of free blocks across the allocators' free lists.

Anyway, I mention this because CreateChunk is showing up in the profile and it would be interesting to see if we could make it go away again.
Assignee
Comment 17•15 years ago
Here's yet another idea: observe that GC::Alloc performs a double dispatch: it dispatches on size and on the flag bits. In most cases these are both known (because the calls to Alloc are mostly from the new operator, and the new operator inserts those flags and knows the size). Ergo the 'new' operator may be able to specialize the call. Specializing by flags is dead easy; we just provide allocator methods for the flag combinations we care about. Specializing by size is easy once we've specialized by flag, because then we already know which allocator pool to choose from - it's just a simple table lookup. The code is small and can be inlined everywhere. A quick test shows that this works: String::createDynamic invokes 'new (gc) String(...)', and this is open-coded by the compiler as a direct call to GCAlloc::Alloc, taking GC::Alloc entirely out of the loop (modulo running of hooks and so on, which we'd have to figure out how to deal with).
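A rough sketch of the size half of the specialization; the pool sizes and the linear scan here are invented for illustration (a real build would use a precomputed table so the index folds away when the size is a compile-time constant):

```cpp
#include <cstddef>

// Small allocator pools by size class; sizes are invented for the
// sketch and do not match MMgc's real configuration.
struct SizeClassAllocator { size_t itemSize; };

static SizeClassAllocator g_pools[4] = { {8}, {16}, {32}, {64} };

// size -> pool index; with 'size' known at compile time the compiler
// can reduce this to a constant.
inline int SizeClassIndex(size_t size)
{
    int i = 0;
    while (g_pools[i].itemSize < size)
        ++i;
    return i;
}

// What a specialized 'new (gc) T(...)' could open-code once the flag
// bits are fixed at the call site: a direct jump to the right pool.
inline SizeClassAllocator* AllocatorFor(size_t size)
{
    return &g_pools[SizeClassIndex(size)];
}
```

Once flags select the method and size selects the pool, the generic double dispatch in GC::Alloc disappears from the inlined fast path.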
Assignee
Comment 18•15 years ago
This is not something finished, just a general idea of what it would look like. We still need to figure out how to run hooks, and trigger the collector somehow. Both add to the inline path. In the existing case of AllocDouble we don't specialize if MMGC_HOOKS is enabled; this is a possible solution but not ideal.
Assignee
Comment 19•15 years ago
Combines the first streamlining patch with an elaboration of the specialized allocation patch; this one also specializes AllocExtra by means of a fast-path overflow check. (This probably makes the GCAlloc::Alloc/Free patch no longer apply cleanly.)
Attachment #413590 -
Attachment is obsolete: true
Attachment #414037 -
Attachment is obsolete: true
Attachment #414319 -
Flags: review?(treilly)
Attachment #413590 -
Flags: review?(treilly)
Assignee
Comment 20•15 years ago
Like the previous patch, but fixes a bug in debug accounting.
Attachment #414319 -
Attachment is obsolete: true
Attachment #414520 -
Flags: review?(treilly)
Attachment #414319 -
Flags: review?(treilly)
Assignee
Comment 21•15 years ago
This is clean and has all facilities. Again, see major block comment in GCAlloc.cpp for an explanation of how the quick lists work.
Attachment #413591 -
Attachment is obsolete: true
Attachment #414708 -
Flags: review?(treilly)
Assignee
Comment 22•15 years ago
Some bugs I found while mopping up; these fixes apply to both patches. I can fold them in if you think it's unworkable to review multiple patches like this.

- Inline specialized allocators did not always go to the right allocator pool
- Inline specialized allocators did not compute size classes correctly
- DRC pinning can modify free objects, which messes up the storage of the index of an object within its block in the second word of the free object. Worked around this by documenting the issue and masking the bit on load.
- Removed one redundant division in GCAlloc::CreateChunk. Fairly strange that it was not constant-folded, but it showed up hot in a profiler run (could be profiler noise). Anyway, we don't need it.
Attachment #414741 -
Flags: review?(treilly)
Comment 23•15 years ago
Comment on attachment 414520 [details] [diff] [review] Streamline GC::Alloc and GC::Free, and specialize - v2

AllocPtrZero/AllocPtrZeroFinalized/AllocRCObject all use containsPointersRCAllocs; that's not right, is it?
Comment 24•15 years ago
Nevermind, I see patch 3 fixes it.
Updated•15 years ago
Attachment #414520 -
Flags: review?(treilly) → review+
Updated•15 years ago
Attachment #414708 -
Flags: review?(treilly) → review+
Updated•15 years ago
Attachment #414741 -
Flags: review?(treilly) → review+
Assignee
Comment 25•15 years ago
Will integrate into TR after Beta 2 branch / integration.
Whiteboard: Has patch
Assignee
Comment 26•15 years ago
Edwin notes elsewhere about how the JIT code is distributed (this is TR): "With an AS3-symbol-instrumented shark run, I get 6.5% self time in Point(), 5.9% in allocate(), 1.3% in Object(), negligible everywhere else. If I flatten, it all attributes to allocate(), except for a few stray samples attributed to Point but without allocate on the stack. (i blame shark for fuzzy noise like that)" Comparatively that's a lot of time in allocate(), but if it has to call finddef_cache, then call newInstance, then call the constructor, I can see it, depending on how well those calls are done. Ed again: "[Calls] are not superb, so the relative time in Point vs Allocate doesn't surprise me. Improvements to finddef, the as3 jit ABI, and general jit code quality should uniformly help."
Assignee
Comment 27•15 years ago
Comment on attachment 414520 [details] [diff] [review] Streamline GC::Alloc and GC::Free, and specialize - v2

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414520 -
Attachment is obsolete: true
Assignee
Comment 28•15 years ago
Comment on attachment 414708 [details] [diff] [review] Streamline GCAlloc::Alloc and GCAlloc::Free, v2

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414708 -
Attachment is obsolete: true
Assignee
Comment 29•15 years ago
Comment on attachment 414741 [details] [diff] [review] Bug fixes for the two patches

redux changesets 3245:206804842ea6 through 3248:8f8a3b392128
Attachment #414741 -
Attachment is obsolete: true
Assignee
Updated•15 years ago
Whiteboard: Has patch
Assignee
Comment 30•15 years ago
I have one more fix coming, an optimization to the GC::FreeNotNull path. Details currently available in attachment #415359 [details] [diff] [review] and in comments on bug #531250.
Assignee
Comment 31•15 years ago
Patches were backed out due to regression, will re-land shortly.
Whiteboard: Has patch
Assignee
Comment 32•15 years ago
Certainly one bug in the change is that the patch assumes that every object is always two words on 64-bit systems. That turns out not to be true, of course: short non-pointer-containing objects (strings) are still allocated in 8 bytes, one word. Thus the trick of storing the object index in the second word of objects on the free list only works on 32-bit systems. I found another bug which may or may not have any influence on the result: RCObject::Stick is pretty much completely wrong. It should 'or' in the sticky bit, not set composite to sticky, and if the object is in the ZCT it needs to be removed from the ZCT. I'll file a separate bug for that.
Assignee
Comment 33•15 years ago
Yet another bug shows up in debug mode: the collector may now be triggered while no userptr to the object is on the stack or in a register, and the newly allocated object may be reaped as a consequence.
Comment 34•15 years ago
How? Is this a GetUserPointer/GetRealPointer issue?
Assignee
Comment 35•15 years ago
(In reply to comment #34)
> How, is this a GetUserPointer/GetRealPointer issue?

In the patch, SignalAllocWork has moved into GCAlloc::Alloc. However, the 'item' that's returned from GCAlloc::Alloc is a RealPtr, and a UserPtr may not exist at that point (it hasn't been computed yet - the caller computes it). Since SignalAllocWork can trigger a collection, only the RealPtr may be on the stack or in a register when the GC runs. But the GC does not work on RealPtrs; it needs a UserPtr. Ergo the GC will reclaim the newly allocated 'item' - in debug builds.
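For readers unfamiliar with the RealPtr/UserPtr distinction: in debug builds a header precedes each object, so the two pointers differ by a fixed offset. The 16-byte header size below is an assumption for the sketch, not MMgc's actual layout; the point is only that a root scan looking for UserPtrs will not recognize a RealPtr.

```cpp
#include <cstddef>

// Assumed debug-header size; illustrative only.
static const size_t kDebugHeaderSize = 16;

// UserPtr = RealPtr + header: what mutator code holds and what the
// GC's conservative scan recognizes.
inline void* GetUserPointer(void* realPtr)
{
    return (char*)realPtr + kDebugHeaderSize;
}

// RealPtr = UserPtr - header: the block address the allocator manages.
inline void* GetRealPointer(void* userPtr)
{
    return (char*)userPtr - kDebugHeaderSize;
}
```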
Assignee
Updated•15 years ago
Assignee
Comment 36•15 years ago
I consider this reviewed already; it's logged here because it integrates the bugfixes into the patch and also has a refinement necessary to avoid regressing the sampling code (see bug #533954).
Assignee
Updated•15 years ago
Assignee
Comment 37•15 years ago
Tested, passes the buildbot, rebased to 3306, ready to land once the tree opens.
Attachment #416913 -
Attachment is obsolete: true
Assignee
Comment 38•15 years ago
This comes from bug #531250, applies on top of attachment #417102 [details] [diff] [review], and is relative to a work tree I have - not clean yet, but passes regression testing on my system.
Assignee
Comment 39•15 years ago
Comment on attachment 417102 [details] [diff] [review] Streamline GC::Alloc and GC::Free, and specialize - v4

redux changeset: 3320:05f8ad8e3e2c
Attachment #417102 -
Attachment is obsolete: true
Assignee
Comment 40•15 years ago
Make GCAlloc::Free and GCLargeAlloc::Free virtual functions; subclass the allocators from a common abstract base; move logic into the two Free functions; make GC::FreeNotNull a simple inlinable wrapper. We get speedup from the moved complexity - the old FreeNotNull was not being compiled well by GCC - and from the inlining.
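The shape of that refactoring, sketched; the class names follow the comment but the counters and the way the owning allocator is passed in are invented for illustration:

```cpp
// Common abstract base: GC::FreeNotNull becomes one well-predicted
// indirect call instead of an out-of-line decision tree.
struct GCAllocBase
{
    virtual void Free(void* item) = 0;
    virtual ~GCAllocBase() {}
};

static int g_smallFrees = 0;  // instrumentation for the sketch only
static int g_largeFrees = 0;

struct SmallAlloc : GCAllocBase
{
    virtual void Free(void*) { ++g_smallFrees; }  // real logic lives here
};

struct LargeAlloc : GCAllocBase
{
    virtual void Free(void*) { ++g_largeFrees; }
};

// Inlinable wrapper; assumes the caller can find the owning allocator.
inline void FreeNotNull(GCAllocBase* owner, void* item)
{
    owner->Free(item);  // single indirect call, no small/large branch
}
```

Since a given free site typically frees objects of one size class, the indirect branch stays well predicted, which is part of why this beats the arithmetic small/large test (see comment 46 below in the thread).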
Attachment #417116 -
Attachment is obsolete: true
Attachment #417515 -
Flags: review?(treilly)
Assignee
Comment 41•15 years ago
This has been scaled back a little from what we had before, because the method of communicating the object index from CreateChunk/Free/Sweep to Alloc does not work well on 64-bit systems (objects can be one word long) and more importantly, interacts poorly with the invariants on the 'composite' field in RCObject. (This patch includes some very light cleanup of RCObject and documentation of those invariants, all in the name of making things clearer. I may or may not factor that out before this patch lands.)
Comment 42•15 years ago
Comment on attachment 417515 [details] [diff] [review] Speed up GC::Free by rearranging and refactoring (clean)

What's with the padding in GCBlockHeader?
Assignee
Comment 43•15 years ago
(In reply to comment #42)
> (From update of attachment 417515 [details] [diff] [review])
> What's with the padding in GCBlockHeader?

You mean in GCLargeAlloc::LargeBlock? I meant to document that. It's necessary to get 8-byte alignment for large-object allocation. GCBlockHeader increased from three entries to four, and with only one in LargeBlock we need to pad.
Comment 44•15 years ago
Comment on attachment 417515 [details] [diff] [review] Speed up GC::Free by rearranging and refactoring (clean)

might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?

speeding up Free involves adding a virtual method call, really? virtual call can't be faster than the math based large alloc check can it?

> GCAssert(m_gc->collecting == false || m_gc->marking == true)
I can't get my head around this, could both be false?

Another goto bites the dust, yeah!

explicit Collecting check in ~StringObject seems unnecessary
Assignee
Comment 45•15 years ago
The quicklist work changes object behavior enough that bug #506013 starts showing up as failures in the mmgc selftests.
Depends on: 506013
Assignee
Comment 46•15 years ago
(In reply to comment #44)
> (From update of attachment 417515 [details] [diff] [review])
> might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?

Good point, will fix this.

> speeding up Free involves adding a virtual method call, really? virtual call
> can't be faster than the math based large alloc check can it?

As I alluded to in comment #40, there are several reasons why it is faster:

- The out-of-line FreeNotNull was not compiled well (by GCC at least) because of all the special cases in the function; the prologue/epilogue ended up being fairly complicated for what is essentially a simple trampoline in many cases
- Complexity moves into the underlying Free where it can be shared with the complexity that was already there, so we gain a bit that way
- You pretty much get perfect branch prediction everywhere when FreeNotNull is inlined and makes a virtual call to the underlying Free, since typical sites that free objects only free objects of one size. In the case where we have a test-and-call, branch prediction is going to be less perfect, plus there are two branches to predict.
- With no decision tree in it, FreeNotNull is now so small that it makes sense to inline it, so we save one function call/return too

As reported in bug #531250, on a synthetic benchmark meant to focus on alloc/dealloc cost, we got a 7% speedup from this change alone.

> GCAssert(m_gc->collecting == false || m_gc->marking == true)
> I can't get my head around this, could both be false?

Yes. The point of this test is to ensure that collecting can only be true if marking is true; when that's the case, a test of the former only needs to be performed if the latter is true. This shaves a little bit off the deallocation path when the marker's not running.

> Another goto bites the dust, yeah!

I stared at that code for a while and then ... "that's a while loop!"

> explicit Collecting check in ~StringObject seems unnecessary

Will remove it.
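The assert in question encodes an implication, which a tiny sketch may make plainer; the flag and function names here are stand-ins, not the patch's code:

```cpp
#include <cassert>

static bool g_marking = false;
static bool g_collecting = false;

// GCAssert(collecting == false || marking == true) says "collecting
// implies marking": both flags may be false, but collecting can never
// be true alone. The hot path can therefore test 'marking' first and
// only look at 'collecting' inside that (usually untaken) branch.
inline bool InActiveCollection()
{
    assert(!g_collecting || g_marking);  // collecting => marking
    return g_marking && g_collecting;
}
```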
Updated•15 years ago
|
Attachment #417515 -
Flags: review?(treilly) → review+
Assignee
Comment 47•15 years ago
Comment on attachment 417515 [details] [diff] [review] Speed up GC::Free by rearranging and refactoring (clean)

(In reply to comment #46)
> (In reply to comment #44)
> > might as well cache m_itemsPerBlock in a local instead of just GCAlloc no?
> Good point, will fix this.

Changed my mind: it's all debug code, and if we cache this we should also cache m_itemSize and b->items. (The reason alloc is in a local anyway was just to avoid eye bleed from the cast that was required everywhere.)

redux changeset: 3346:d4334f9c401c
Attachment #417515 -
Attachment is obsolete: true
Assignee
Comment 48•15 years ago
Optimizes the paths through GCAlloc::Alloc and GCAlloc::Free by moving away from the current situation, where we have either free space at the end of a block or an object on an on-block free list, to a situation where we have one per-allocator list of free objects from various blocks (a "quick list"). The effect of the current situation is akin to eager coalescing with the costs that come with that; with quick lists, we move toward lazy coalescing and try to create fast paths that are hit most of the time. See documentation block at the beginning of GCAlloc.cpp for more information.
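The fast paths can be sketched in a few lines; this is a minimal illustration of the quick-list idea, with block management, the quick-list budget, and the slow path that lazily coalesces objects back into blocks all elided:

```cpp
#include <cstddef>

// Free pushes the object onto a singly linked list threaded through
// the freed storage itself; Alloc pops it. Both fast paths are a
// couple of loads and stores with a single branch.
struct QuickListAlloc
{
    void* quickList;

    QuickListAlloc() : quickList(NULL) {}

    void Free(void* item)
    {
        *(void**)item = quickList;  // link through the object's first word
        quickList = item;
    }

    void* Alloc()
    {
        void* item = quickList;
        if (item != NULL)
            quickList = *(void**)item;  // pop: the common case
        return item;                    // NULL means: take the slow path
    }
};
```

Because the list reuses the most recently freed object first (LIFO), it also tends to hand back cache-warm memory.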
Attachment #417520 -
Attachment is obsolete: true
Attachment #421531 -
Flags: review?(treilly)
Comment 49•15 years ago
There's probably a way to get rid of the if (size <= kLargestAlloc) by adding the large alloc to the allocs table and doing a little smarter size->index calculation.
Assignee
Comment 50•15 years ago
(In reply to comment #49)
> There's probably a way to get rid of the if (size <= kLargestAlloc) by adding
> the large alloc to the allocs table and doing a little smarter size->index
> calculation.

Possibly. There might also be ways of avoiding the constant checking for the quick list budget by batching that. I will consider these as follow-on items though; won't try to do them in this patch.
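One speculative shape for comment 49's idea: map every size into one table whose last slot is the large allocator, so allocator dispatch needs no separate large-object code path. The granule size, class count, and clamp below are all invented for illustration:

```cpp
#include <cstddef>

static const size_t kLargestAlloc = 64;
static const int    kLargeIndex   = 8;   // one past the small classes

// size -> index into a hypothetical allocs table whose entry at
// kLargeIndex is the large allocator. A clamp (here conditional, but
// easily made branchless) replaces the explicit small/large dispatch.
inline int AllocIndex(size_t size)
{
    size_t idx = (size + 7) >> 3;        // 8-byte granules, rounded up
    return (int)(idx > (size_t)kLargeIndex ? kLargeIndex : idx);
}
```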
Updated•15 years ago
Attachment #421531 -
Flags: review?(treilly) → review+
Assignee
Comment 51•15 years ago
Comment on attachment 421531 [details] [diff] [review] Quick lists for GCAlloc

tamarin-argo changeset: 3497:c0d7bbdb6dbb
Assignee
Comment 52•15 years ago
tamarin-redux changeset: 3559:9e00648eb776 (merge from tamarin-argo). Remaining issues deferred.
Priority: P2 → --
Whiteboard: Has patch
Target Milestone: flash10.1 → Future
Assignee
Comment 53•15 years ago
(In reply to comment #48)
> Created an attachment (id=421531) [details]
> Quick lists for GCAlloc

The quick list patch has a bug with -Dgreedy which manifests itself as a stack overflow or an infinite loop, depending on the compiler's ability to perform tail call elimination. This patch fixes that.

tamarin-argo 3498:30e38e285ea0 (self-reviewed)
Assignee
Updated•15 years ago
Attachment #421531 -
Attachment is obsolete: true
Assignee
Updated•15 years ago
Attachment #422518 -
Attachment is obsolete: true
Assignee
Comment 54•15 years ago
Due to a snafu, the quicklist patch landed in tamarin-redux-argo changeset: 3561:a74a4f33061a, not the previously mentioned change sets.
Assignee
Comment 55•14 years ago
Large-alloc follow-on item: bug #557898. Quicklist budget follow-on item: bug #557901.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED