Bug 517446 (closed): Wordcode interpreter uses too much memory

Component: Tamarin Graveyard :: Interpreter (defect)
Severity: normal; Priority: not set
Status: RESOLVED WONTFIX
Target Milestone: Future
Reporter: lhansen; Assignee: unassigned
Opened 15 years ago; closed 6 years ago
Attachments: 1 file

There's evidence (most of it stale by now) that the wordcode interpreter uses /more/ memory than the JIT.  That is curious and unacceptable; the most we should really expect is a factor-of-four increase over the ABC size.  We need to investigate.

Possible culprits are:

 - the ABC code emitted from ASC is really awful and the JIT optimizes a lot
   of it away, but the wordcode interpreter cannot

 - the inline cache structures may be too large because the wordcode interpreter
   does not do all the early binding the JIT does

 - there could be other data that hangs around too long

Trevor writes:

"On zerowing: /download/attachments/171169123/FRR-TR+Real++World+App+Execution+Speed+and+Memory+Comparison.xlsx
 
Edit cell M4 and set to 2
 
Then refer to columns B, C, D for an ABC to Wordcode comparison from the sandbox build we’re hoping to integrate to FR soon."
Target Milestone: --- → Future
Two things have changed recently:

- there was a storage leak in the PrecomputedMultinames table; in the past this
  would have put the wordcode interpreter at a disadvantage

- the jit has incorporated the precomputed multinames as well as inline caches,
  and open-codes slightly more primitives, so it now uses more memory.

We should re-test.
Component: Virtual Machine → Interpreter
Assignee: nobody → lhansen
I have tested a standalone FRR with current TR on two flash apps (checkinapp and phystestfast).  The results are ... interesting.  (FRR 7713:146bc500271e, TR 4797:4a0ad842e214, MacOS 10.6, MacPro.)

Results are from single runs.

On checkinapp it appears the WC interpreter uses more memory (peak private is 13000 blocks vs 11300), they have the same number of GC cycles.  The WC interpreter reports < 10% more allocation work in terms of number of objects, and about 10% more in terms of bytes.  GC and pause times are comparable.

On phystestfast, the wc interpreter uses half the memory (peak private 8800 vs 15000), but it has 520 collections vs the jit's 21 (25x).  Allocation work is also completely out of whack - 570M objects vs 21M objects for the jit (25x), 5.2GB vs 978MB (6x).

At first blush it appears that something in the wordcode system forces massively higher allocation rates and that reference counting (RC) cannot handle those allocations, hence the incredible collection rate, but to know this for sure I have to compare against the ABC interpreter first.
Status: NEW → ASSIGNED
Same setup, ABC interpreter:

On checkinapp the memory consumption and allocation work are close to those of the JIT.

On phystestfast the memory consumption is close to that of the JIT but the allocation and GC work are in the same league as the wordcode interpreter.

This is sort of exciting, because it indicates that on this program the interpreters allocate 20-25x more memory than the jit - it might be worthwhile to find out what causes that.  If it's something the jit optimizes away then the wordcode interpreter could benefit from it, at least; if we're lucky, it's just a missing optimization that both interpreters can use.

(As far as the memory consumption is concerned the jury is still out - "more study needed".)
If there's a lot of math involved (phystestfast?) then I naively suspect doubleToAtom().  However, there's also possible QCache activity; I would expect the interpreters to access TraitsBindings (TB) and MethodSignature (MS) objects more frequently than the JIT does, adding more QCache pressure, if the app has lots of early-binding opportunities.  Caveat: WC exploits some of those too, but not ABC.  If WC and ABC have similar allocation patterns then maybe that's not it.
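(For a concrete picture of the doubleToAtom() suspicion: with a low-bit-tagged atom representation, a 64-bit double cannot fit in a tagged machine word, so every double-producing operation heap-allocates a box.  A minimal sketch with invented names and tag values, not the actual avmplus code:)

    #include <cstdint>

    // Illustrative only: a low-bit tag leaves no room for a 64-bit double,
    // so boxing a double means one GC allocation per result.
    typedef uintptr_t Atom;
    const uintptr_t kDoubleTag = 7;            // hypothetical tag value

    Atom doubleToAtomSketch(double d) {
        double* box = new double(d);           // one heap allocation
        return (uintptr_t)box | kDoubleTag;    // assumes 8-byte allocation alignment
    }

    // A loop like "x = x * 1.1" interpreted atom-by-atom calls this every
    // iteration; the JIT keeps x unboxed in a register instead.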

QCache is interesting -- a fixed-size table of strong refs to TB and MS objects, with random eviction of those strong refs, plus weak references from Traits->TB and MethodInfo->MS.  If an item is evicted from either QCache, then when the next GC occurs its weakref is cleared.  The access after that will allocate several objects and reinsert a strong ref into the QCache.  So, if general allocation activity causes lots of GC sweeps, thus clearing weakrefs more often, it's possible for the QCache design to cause increased misses, which leads to increased allocation activity.  (Hmm, can it get into a self-sustaining thrash?)
Quick test: add a strong ref from MethodInfo->MethodSignature and Traits->TraitsBindings.
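(To make the miss path concrete, here is a minimal sketch of the design just described, using std::shared_ptr/std::weak_ptr as stand-ins for MMgc strong refs and weakrefs.  One caveat: a weak_ptr clears as soon as the last strong ref dies, whereas a GC weakref is cleared only at collection time, but the shape of the miss path is the same.  The real avmplus QCache types differ.)

    #include <cstdlib>
    #include <memory>

    struct Bindings { /* stand-in for TraitsBindings / MethodSignature */ };

    // Fixed-size table of strong refs with random eviction, as described.
    struct QCacheSketch {
        static const std::size_t SIZE = 64;
        std::shared_ptr<Bindings> strong[SIZE];        // the only strong refs

        void insert(const std::shared_ptr<Bindings>& b) {
            strong[std::rand() % SIZE] = b;            // evictee is now weak-only
        }
    };

    // Owner side: Traits (or MethodInfo) holds only a weak ref.
    struct TraitsSketch {
        std::weak_ptr<Bindings> weakTB;

        std::shared_ptr<Bindings> get(QCacheSketch& cache) {
            if (std::shared_ptr<Bindings> tb = weakTB.lock())
                return tb;                             // hit: still strongly held
            // Miss: evicted earlier and the weak ref has been cleared, so
            // rebuild -- this is the several-object allocation burst.
            std::shared_ptr<Bindings> tb = std::make_shared<Bindings>();
            cache.insert(tb);                          // may evict another entry
            weakTB = tb;
            return tb;
        }
    };
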
Right you are.  I hacked up the memory profiler to investigate.  The program crawls along (barely) and a partial profile is completely dominated by allocation of Numbers.
Bug 549143 records a ton of recent activity (discussion, studies, etc.) overhauling SpiderMonkey to use 128-bit fatvals on its interpreter stack, and 64-bit NaN encoding throughout the rest of the VM.  WebKit and LuaJIT have gone to the NaN encoding too, FWIW.
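(The NaN encoding works because every non-double value can hide in the payload bits of a quiet NaN, so doubles themselves need no heap box at all.  A minimal sketch; the tag layout here is invented, not SpiderMonkey's or LuaJIT's actual layout:)

    #include <cstdint>
    #include <cstring>

    typedef uint64_t Value;

    // Canonical quiet NaN; no real double uses any bit pattern above it,
    // so those patterns are free to carry a type tag plus a payload.
    const uint64_t CANONICAL_NAN = 0xFFF8000000000000ULL;
    const uint64_t TAG_INT32     = 0xFFF9000000000000ULL;  // invented tag

    Value boxDouble(double d) {
        if (d != d) return CANONICAL_NAN;     // canonicalize all NaNs first
        Value v; std::memcpy(&v, &d, sizeof v);
        return v;                             // doubles stored as themselves
    }
    Value boxInt32(int32_t i)   { return TAG_INT32 | (uint32_t)i; }

    bool isDouble(Value v)      { return v <= CANONICAL_NAN; }
    double unboxDouble(Value v) { double d; std::memcpy(&d, &v, sizeof d); return d; }
    int32_t unboxInt32(Value v) { return (int32_t)(uint32_t)v; }
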
(In reply to comment #5)
> If there's a lot of math involved (phystestfast?) then i naively suspect
> doubleToAtom(). 

Did we ever investigate the possibility of storing doubles as RCObject rather than GCObject? (I'd be loath to make such a change unless it was a big win, and fat boxes are probably a better solution anyway; just curious whether it was entertained back in the early days.)

> QCache is interesting -- fixed size table of strong refs to TB and MS objects,
> with random evictions of strong refs to TB and MS, plus weak references from
> Traits->TB and MethodInfo->MS.  If an item is evicted from either QCache, then
> when the GC occurs, the weakref is cleared.  The  next access will allocate
> several objects and reinsert a strong ref in the QCache.  So, if general
> allocation activity causes lots of GC sweeps, thus clearing weakrefs more, its
> possible for the QCache design to cause increased misses, which leads to
> increased allocation activity.  (hmm, can it get into a self-sustaining
> thrash?)

Interesting analysis, but I'd think that's only a risk if the QCache size is set too low -- items in the cache should always be strongly held (the weak ref should only be collectable for items recently in the cache).
(In reply to comment #9)

> > QCache is interesting -- fixed size table of strong refs to TB and MS objects,
> > with random evictions of strong refs to TB and MS, plus weak references from
> > Traits->TB and MethodInfo->MS.  If an item is evicted from either QCache, then
> > when the GC occurs, the weakref is cleared.  The  next access will allocate
> > several objects and reinsert a strong ref in the QCache.  So, if general
> > allocation activity causes lots of GC sweeps, thus clearing weakrefs more, its
> > possible for the QCache design to cause increased misses, which leads to
> > increased allocation activity.  (hmm, can it get into a self-sustaining
> > thrash?)
> 
> Interesting analysis, but I'd think that's only a risk if the QCache size is
> set too low -- items in the cache should always be strongly held (the weak ref
> should only be collectable for items recently in the cache).

The situation would be when you have the interpreters accessing hundreds (>> QCache size) of different types and methods, causing constant eviction from the QCache, and simultaneously causing enough allocation traffic to trigger frequent collections.  Ordinarily the QCache evictions don't add GC pressure, but a collection clears all the weakrefs (the majority of which are not in the QCache), which then leads to a flurry of allocations as those TB/MS objects are requested after the collection.

So you need independent allocation activity to make this happen, but the effect is then for the QCache to add to the allocation activity even more.

It's just a theory, anyway.  The number of unique TBs and MSs being accessed in an app's critical section probably scales with the size of the app's codebase, so this theoretical problem is probably never an issue in smallish apps like phystestfast.

If it were a problem, a possible fix would be for the QCache to resize itself according to TB/MS loading.
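(A sketch of what such self-resizing might look like, purely hypothetical:)

    // Hypothetical self-resizing heuristic: if most lookups in the last
    // window missed, the TB/MS working set exceeds capacity, so grow.
    struct ResizePolicy {
        size_t hits, misses;
        ResizePolicy() : hits(0), misses(0) {}

        size_t nextCapacity(size_t current) {
            bool thrashing = misses > hits;   // >50% miss rate this window
            hits = misses = 0;                // start a fresh window
            return thrashing ? current * 2 : current;
        }
    };
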
(In reply to comment #9)
> (In reply to comment #5)
> > If there's a lot of math involved (phystestfast?) then i naively suspect
> > doubleToAtom(). 
> 
> Did we ever investigate the possibility of store doubles as RCObject rather
> than GCObject? (I'd be loathe to make such a change unless it was a big win,
> and fat boxes are probably a better solution anyway, just curious if it was
> entertained back in the early days)

Not to my knowledge, but then I'm a newbie around here.

It would be surprising if that were a win except in programs with substantial heap sizes, since GC in almost-empty heaps is fairly cheap and reference counting has a fair amount of overhead that will show up for these kinds of allocation volumes.

I'll look around for better test programs than phystestfast.
You might also want to try using a Flex app, which probably has a completely different start-up profile.
Here is data for ESC's main.swf showing JIT code size vs Wordcode size vs ABC size, on x86-32 and x86-64.

Same data plotted:
https://spreadsheets.google.com/ccc?key=0AkU9wThayjL-dDBRb1VCekpnUkIxQllJRnY2YzRscFE&hl=en&authkey=CIWPlJwK

Summary, from smallest to largest: wordcode-32, wordcode-64, jit-32, jit-64.  However, in the size range where most methods lie, many methods don't fall into this ranking.

Todo: do it again for real-world SWFs, and exclude the 64-bit data where size isn't as important.
(In reply to comment #13)
> Todo: do it again for real world swfs, and exclude 64-bit data where size isn't
> as important.

And, compare data on heap size in -Dverifyonly runs without dynamic behavior.
WC32 expansion factor around 2.8 - very much in line with what was expected.  Good!

Looks like WC has a very predictable expansion factor - flat, pretty much - while the JIT has a high expansion factor for small methods and a much lower factor for larger methods.

If we're serious about WC code size we can possibly do better by adding more instructions and specializing more intelligently, for starters.  The effective memory consumption may also be reduced by caching WC content, because that allows init code to be discarded.  The JIT will have an edge on some SWF content because it does not translate init code; caching in the WC interpreter would reduce that edge.  (A "baseline JIT" would not benefit from such caching, though, so for the purposes of that investigation the current WC numbers are probably more interesting.)
(In reply to comment #15)
> If we're serious about WC code size we can possibly do better by adding more
> instructions and specializing more intelligently, for starters.

Y'know, we've spent effort on trying to use ABC in place and avoid redundant allocations where possible, but I wonder if we might be better off going in the other direction: deliberately reading everything we need from the ABC chunk and then discarding it entirely.  It would certainly end up consuming more memory, but how much more might be an interesting question; if it's a modest inflation it could be worth it in terms of potential performance improvements.  (Of course, for this to be worthwhile, some innards of Flash/AIR would have to be restructured so that embedded ABC chunks actually do get discarded rather than held in memory with the rest of the SWF, but that's probably not too problematic.)
Playing devil's advocate to Steven: what if we treated WC like we do TraitsData, i.e., keep the wordcode for hot functions around but delete it for cold functions?  The idea is that we can re-create it from the ABC later on.  I'm not sure what the ABC->WC translation cost is like or whether verification can be separated out.
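(Roughly like this; the counter, the threshold, and translateFromAbc() are invented for illustration:)

    #include <cstdint>

    uint32_t* translateFromAbc(const uint8_t* abc);   // assumed translator

    // Hypothetical hot/cold policy: keep wordcode for methods that keep
    // running, discard it for cold ones, re-translate from ABC on demand.
    struct MethodSketch {
        const uint8_t* abc;        // original ABC, always retained
        uint32_t* wordcode;        // translated form (~2.8x the ABC size)
        unsigned invocations;

        void invoke() {
            if (!wordcode)
                wordcode = translateFromAbc(abc);   // pay translation again
            ++invocations;
            // ... run the wordcode interpreter on this method ...
        }

        void sweepIfCold(unsigned hotThreshold) {
            if (invocations < hotThreshold) {
                delete[] wordcode;                  // reclaim the expansion
                wordcode = 0;
            }
            invocations = 0;                        // new observation window
        }
    };
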
(In reply to comment #15)

> Looks like WC has a very predictable expansion factor - flat, pretty much -
> while the JIT has a high expansion factor for small methods and a much lower
> factor for larger methods.

Correct.  The stack-overflow check and the MethodFrame init boilerplate are the cause.  WC would have this effect too if we turned the interpreter prologue into a series of instructions.  (Look at all the code in interpBoxed() before the loop starts executing.)
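
(Schematically, each compiled method carries this fixed overhead; for a tiny method these steps dominate the emitted code.  The names below are stand-ins, not the actual avmplus helpers:)

    struct MethodEnv { /* per-call context */ };

    static void checkStackOverflow(MethodEnv*) { /* compare SP to limit */ }

    struct MethodFrameSketch {
        MethodFrameSketch* next;
        void enter(MethodEnv*) { /* link this frame for stack walking */ }
        void exit(MethodEnv*)  { /* unlink it */ }
    };

    // Every compiled method, no matter how small, pays for this prologue
    // and epilogue around its body:
    void compiledMethod(MethodEnv* env) {
        checkStackOverflow(env);        // 1. stack limit probe
        MethodFrameSketch frame;
        frame.enter(env);               // 2. frame bookkeeping
        // ... the method body, often just a few machine instructions ...
        frame.exit(env);                // 3. teardown
    }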

I think an expansion factor of 2.8 is fine for now.  The posted results confirm that WC should use less memory than JIT, statically speaking, and we can make that even lower if desired.  Another way would be to use the same static (or later, dynamic) heuristics as the JIT, and build a VM with both the ABC and WC interpreters present.

The real next step is to measure the full heap after a -Dverifyonly run (which translates without executing anything), and to plot this for real swfs from ASC, not just main.swf from ESC.  (my first attempt at this produced 128,000 rows of data; MS-Excel choked at 32,000 and Google-Spreadsheets choked at 100,000).

Is private memory the best coarse metric of heap usage?  The metric would be:

    /* translate lots of swfs */
    gc->Collect();   // full collection first, so only live data is counted
    printf("heap %.0f\n", double(VMPI_getPrivateResidentPageCount()) * VMPI_getVMPageSize());

(I cribbed the code from the System.privateMemory getter)
Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Flags: flashplayer-qrb+
Assignee: nobody → lhansen
Will not be working on this.
Assignee: lhansen → nobody
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX