Investigations show that only 10%-50% of the heap contains finalizable objects (including RC objects) in typical Flash applications. The pause induced by FinishIncrementalMark could thus be reduced, possibly substantially, by segregating the allocation of finalizable objects from non-finalizable objects; only the former blocks need be scanned during FinishIncrementalMark.
Created attachment 533301 [details] [diff] [review] Preliminary patch Passes tests and is conceptually complete, but I've not investigated the impact of the change on memory / pauses / performance.
Currently all RCObjects are finalized, but clearly only a subset of them need to be. Is it possible/practical to make a non-finalized RCObject?
(In reply to comment #2) > Currently all RCObjects are finalized, but clearly only a subset of them > need to be. Is it possible/practical to make a non-finalized RCObject? It's probably doable. But the benefits are far from clear: every ScriptObject needs to be finalized because the refcount on the "delegate" member of ScriptObject needs to be decremented. And String has m_master and Namespace has some strings, so every AvmPlusScriptableObject needs to be finalized to do refcounting right. So how many other RCObjects are there (that we care about)? Probably almost none.
Preliminary numbers from a case study on Get the Glass on a MacBook Air suggests that finalization scanning volume is generally cut in half and max pause times are reduced by 25%, without increasing the heap size at all (measured either as GC blocks or total blocks). (An interesting anomaly on that program is that finalization will sometimes pause for 4 - 5 seconds even on a MacPro, with or without this change. It would appear that some of the "finalizers" actually do major synchronous work.) The volume of weak references appears to be high on this benchmark (up to 7000 reaped per GC, out of an unknown population); more investigation is needed to determine the population. If the population is generally large then segregating weakrefs can be beneficial because those that are live will not be finalized and those that are dead are no longer finalizable (they've been cleared); ergo they need never be scanned at all.
Further study reveals that the toal weakref populations are not much larger than the reapable weakref populations on this program. Still, it seems that we can't do any harm if we remove the weakref scanning work from GCAlloc::Finalize.
Another data point. Get the Glass appears to be an AS2 application. The vast bulk of finalizable objects falls into two groups, AS2 ScriptObjects (which are RCObjects) and AS2 variable containers (which are FinalizedObjects). The variable container probably need not be marked as finalized, because its finalization function is NULL - its contents are finalized by the ScriptObject destructor. This could be fixed by having ExactStructContainer turn off the finalized bit in the constructor if the finalization function is NULL, it's a minor tweak and allocating it non-finalized in the first place would be better.
A broader study of Brent's benchmark suite is somewhat less encouraging, these are relative numbers (new / old) for peak heap blocks, gc blocks, and private blocks: audiotool 1.00 1.00 1.03 bigpixelzombies 1.04 1.08 0.92 brainstorm 0.98 0.97 0.96 checkinapp 1.03 1.00 1.06 cignatabbing 1.00 1.00 0.99 coverflow 0.99 1.01 0.98 crushthecastle 0.99 0.95 1.06 flexdatagrid 1.00 1.02 1.01 gettheglass 1.15 1.31 1.16 mechanism 0.95 0.91 0.97 phystest 1.04 1.10 1.20 watsonexp 1.06 1.08 1.08 Some of these are highly variable due to gameplay and in-game advertising (get the glass, crush the castle, big pixel zombies, mechanism), and at one point in get the glass i finally figured out how to steer the van, so... the 10% growth in gc blocks in phystest worries me the most (not clear what the "private" numbers are worth, they are as reported but i don't trust them). (FRR from yesterday + TR from yesterday + the segregated-alloc patch, 64-bit plugin, updated Safari on MacOS 10.6, peak-occupancy data extracted from a gcbehavior dump) The underlying numbers are worth a study too; audiotool appears to trigger GC explicitly (one incremental mark per GC cycle for a long time).
Much controlled benchmark running and the usual gnuplot hair-pulling later: There is some variation on phystest, but the amount of noise in that program is huge. Across five runs with the old and new allocator setup: With the original allocator, the peak number of GC blocks ranges from 5119 to 5427, and the peak number of heap blocks from 7438 to 7898. The number of collections ranges from 20 to 40. The amount of allocation work varies from 1.2GB to 2.4GB (correlates directly with the number of collections). With the new allocator, the peak number of GC blocks ranges from 5099 to 6148, and the peak number of heap blocks from 7271 to 8553. The number of collections ranges from 26 to 32. The allocation work varies from 1.6 to 1.9GB. It's anyone's guess why the new allocator seems to have higher variability in the allocation numbers - with an impact on GC activity and block occupancy - than the old allocator; there should be no feedback there. That said, there may be very different heap dynamics with the segregated allocator: we're relying on lazy sweeping to return memory more than before, and lazy sweeping in one allocator is /not/ triggered by scarcity in other allocators - allocators go straight to GCHeap. This is an issue I brought up in bug #659317 also. With the unsegregated allocator it doesn't make sense for GCHeap to trigger sweeping - there would be little or nothing to gain from it. With the segregated allocator, it's different - there may be entirely empty blocks sitting there, waiting to be swept. Thus if the present change were to land on its own it would create the need for a follow-on item to investigate how we can drive sweeping from GCHeap when memory is scarce.
(In reply to comment #8) Argh! > With the original allocator, ... > With the new allocator, ... I mixed up my numbers. They should be: With the original allocator, the peak number of GC blocks ranges from 5119 to 5427, and the peak number of heap blocks from 7438 to 7898. The number of collections ranges from 26 to 32. The allocation work varies from 1.6 to 1.9GB. With the new allocator, the peak number of GC blocks ranges from 5099 to 6148, and the peak number of heap blocks from 7271 to 8553. The number of collections ranges from 20 to 40. The amount of allocation work varies from 1.2GB to 2.4GB (correlates directly with the number of collections).
Created attachment 535341 [details] [diff] [review] Patch Rebased to current TR. I'd like to land this because we want it for the incrementality work; as commented earlier in this bug and one place in the patch there will be knock-on work items for lazy sweeping, but there's no evidence of serious regressions from this change now, and it only goes into Brannan.
Comment on attachment 535341 [details] [diff] [review] Patch - Consider alpha-renaming *PointersAllocs to *PointersNonFinalizedAllocs; it would help ensure that both old and new code is properly vetted to accommodate the change, the same way we say noPointersAllocs instead of just Allocs. - Good food for thought in the note above GCAlloc::Sweep. Should we open a ticket to investigate that?
(In reply to comment #11) > Comment on attachment 535341 [details] [diff] [review] [review] > Patch > > - Consider alpha-renaming *PointersAllocs to *PointersNonFinalizedAllocs; it > would help ensure that both old and new code is properly vetted to > accommodate the change, the same way we say noPointersAllocs instead of just > Allocs. Agreed. > - Good food for thought in the note above GCAlloc::Sweep. Should we open a > ticket to investigate that? I will consider it, though it's feeding into some other work I'm doing on block recycling so I don't know for sure yet whether it's going to become its own work item.
Created attachment 539774 [details] [diff] [review] Updated patch Contains a couple of bug fixes and the rename suggested by Felix, ready for integration and testing.
changeset: 6402:c8459ef051b0 user: Lars T Hansen <firstname.lastname@example.org> summary: Fix 657949 - Segregated allocation for finalizable objects (r=fklockii) http://hg.mozilla.org/tamarin-redux/rev/c8459ef051b0
The GCAlloc::Sweep issue is logged as bug #665416.