performance issue in Flash Player 10 with callLater()

Status: RESOLVED FIXED

Product: Tamarin
Component: Virtual Machine
Priority: --
Severity: critical
Reported: 8 years ago
Last resolved: 7 years ago

People

(Reporter: tim.huang, Assigned: pnkfelix)

Details

(Whiteboard: has-patch)

Attachments

(8 attachments, 1 obsolete attachment)

(Reporter)

Description

8 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13 ( .NET CLR 3.5.30729)
Build Identifier: FlashPlayer/10.1.102.64

My customer noticed that the performance of the SWF file generated by our product (Xcelsius) is very slow and unacceptable when running in Flash Player 10, while it is much faster in Flash Player 9.  We use a lot of callLater() in our SWF files to prevent script timeouts, and it may be the culprit.

If we do not use callLater(), or execute fewer loop iterations in the method called by callLater(), the performance of Flash Player 10 is similar to or better than Flash Player 9.  For example, change 260 to 26 on line 32 of the attached test.mxml file:
     for(var i:int=0; i<260; i++){ <======= change to 26


Reproducible: Always

Steps to Reproduce:
1. Extract the SWF file ("test 260.swf") from the attachment ("Test SWF.rar").
2. Compare the loading performance when running in Flash Player 9 and Flash Player 10 respectively.
Actual Results:  
test 260.swf:
   11229 ms in Flash Player 9,0,280,0
   19109 ms in Flash Player 10,1,102,64

Expected Results:  
The performance of Flash Player 10 should be similar to that of Flash Player 9.
(Reporter)

Comment 1

8 years ago
Created attachment 498289 [details]
Sample Source code to reproduce the problem
(Reporter)

Comment 2

8 years ago
Created attachment 498290 [details]
SWF file generated by the sample code

Comment 3

8 years ago
Not a Tamarin bug, presumably needs to be moved to Jira or Watson.
Assignee: nobody → trbaker

Comment 4

7 years ago
Moved to Flash Player bugbase as the callLater API is provided by the Flash Player, not by Tamarin.  

https://bugs.adobe.com/jira/browse/FP-6018
Status: UNCONFIRMED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → INVALID

Comment 5

7 years ago
Created attachment 507396 [details]
Equivalent source code that runs in the avmshell
(Assignee)

Updated

7 years ago
Assignee: trbaker → fklockii
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INVALID → ---
Created attachment 507994 [details]
Analogous source code that runs in the avmshell and reports time and memory usage

After doing player runs comparing 10.0/10.1/10.2 (see notes in WE 2672250), I tried to see whether the performance blow-up can be reproduced in the shell alone.  Lars had mentioned trying revisions 3994, 5000, and the TR tip; I've done that below and thrown in revision 2162 (it dates from July 2009; I thought it might better approximate the 10.0 virtual machine).

I collected the results from runs on my iMac (a 2.66 GHz Core i5, 4 GB RAM) at various points along the TR revision history (encoded below as a suffix on the avmshell executable).

% for AVM in $(ls shell/avmshell.r*) ; do echo $AVM; for i in 1 2 3; do $AVM  ../../distilled.abc ; done ; done
shell/avmshell.r2162
8463 ms - CallLater End minmem 6815744 maxmem 384172032
8474 ms - CallLater End minmem 7733248 maxmem 384696320
9318 ms - CallLater End minmem 7733248 maxmem 407633920
shell/avmshell.r3994
14259 ms - CallLater End minmem 5242880 maxmem 378277888
14428 ms - CallLater End minmem 5111808 maxmem 378277888
14458 ms - CallLater End minmem 5242880 maxmem 378277888
shell/avmshell.r5000
14016 ms - CallLater End minmem 5115904 maxmem 379322368
13693 ms - CallLater End minmem 5115904 maxmem 379322368
13893 ms - CallLater End minmem 5115904 maxmem 379322368
shell/avmshell.r5842
12165 ms - CallLater End minmem 5251072 maxmem 383516672
12143 ms - CallLater End minmem 5251072 maxmem 383516672
12153 ms - CallLater End minmem 5251072 maxmem 383516672
Attachment #507396 - Attachment is obsolete: true
Looking more broadly at revisions along tamarin-redux, there is one spot where the max memory drops and performance does not regress (good), and later in the timeline there is another spot where the performance regresses but the max memory usage stays about the same (bad).  I'm in the midst of bisecting, but will report what I know so far: the good spot falls somewhere between rev 2000 and rev 2162; the bad spot falls somewhere between rev 3500 and rev 3994.
Created attachment 508028 [details]
another avmshell test case with time and memory usage

This (distilled3) is a closer match for the original test case from the Watson ticket.

Here are the results from a series of runs (I'm doing the bisection base 5 instead of base 2, or something).


 % for AVM in $(ls -1 shell/avmshell.r*) ; do echo $AVM; for i in 1 2 3; do $AVM  ../../distilled3.abc ; done ; done
shell/avmshell.r2000
tf1: 27447 ms - callLater test minmem: 2048 K maxmem: 917120 K
tf1: 28360 ms - callLater test minmem: 2048 K maxmem: 936448 K
tf1: 27851 ms - callLater test minmem: 2048 K maxmem: 929536 K
shell/avmshell.r2162
tf1: 15685 ms - callLater test minmem: 2048 K maxmem: 626560 K
tf1: 15519 ms - callLater test minmem: 2048 K maxmem: 622976 K
tf1: 15646 ms - callLater test minmem: 2048 K maxmem: 627072 K
shell/avmshell.r3000
tf1: 12462 ms - callLater test minmem: 2052 K maxmem: 639368 K
tf1: 12017 ms - callLater test minmem: 2052 K maxmem: 639368 K
tf1: 11456 ms - callLater test minmem: 2052 K maxmem: 631048 K
shell/avmshell.r3500
tf1: 12324 ms - callLater test minmem: 2052 K maxmem: 627592 K
tf1: 12300 ms - callLater test minmem: 2052 K maxmem: 627592 K
tf1: 12283 ms - callLater test minmem: 2052 K maxmem: 627592 K
shell/avmshell.r3600
tf1: 12230 ms - callLater test minmem: 2052 K maxmem: 619400 K
tf1: 11685 ms - callLater test minmem: 2052 K maxmem: 619400 K
tf1: 12066 ms - callLater test minmem: 2052 K maxmem: 598152 K
shell/avmshell.r3700
tf1: 11794 ms - callLater test minmem: 2052 K maxmem: 598148 K
tf1: 11766 ms - callLater test minmem: 2052 K maxmem: 620804 K
tf1: 11672 ms - callLater test minmem: 2052 K maxmem: 598148 K
shell/avmshell.r3800 <========================================= bad!
tf1: 36661 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36666 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36887 ms - callLater test minmem: 2052 K maxmem: 600196 K
shell/avmshell.r3900
tf1: 36542 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36596 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36578 ms - callLater test minmem: 2052 K maxmem: 600196 K
shell/avmshell.r3994
tf1: 36415 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36776 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36993 ms - callLater test minmem: 2052 K maxmem: 600196 K
shell/avmshell.r5000
tf1: 37511 ms - callLater test minmem: 2052 K maxmem: 598272 K
tf1: 37168 ms - callLater test minmem: 2052 K maxmem: 598272 K
tf1: 37275 ms - callLater test minmem: 2052 K maxmem: 598272 K
shell/avmshell.r5842
tf1: 28203 ms - callLater test minmem: 2052 K maxmem: 615552 K
tf1: 27930 ms - callLater test minmem: 2052 K maxmem: 615552 K
tf1: 27964 ms - callLater test minmem: 2052 K maxmem: 615552 K
Further bisection (dissection?):

shell/avmshell.r3760
tf1: 11382 ms - callLater test minmem: 2052 K maxmem: 651652 K
tf1: 11332 ms - callLater test minmem: 2052 K maxmem: 651652 K
tf1: 11325 ms - callLater test minmem: 2052 K maxmem: 651652 K
shell/avmshell.r3770
tf1: 36565 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36037 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36055 ms - callLater test minmem: 2052 K maxmem: 600196 K
Followup notes:
- In comment 9, I focused on the range of 10 revisions where the 3x time blowup happened.

- Within that range, the precise shift happens here:
shell/avmshell.r3766
tf1: 12634 ms - callLater test minmem: 2052 K maxmem: 639364 K
tf1: 11268 ms - callLater test minmem: 2052 K maxmem: 651652 K
tf1: 11366 ms - callLater test minmem: 2052 K maxmem: 651652 K
shell/avmshell.r3767
tf1: 36498 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36402 ms - callLater test minmem: 2052 K maxmem: 600196 K
tf1: 36448 ms - callLater test minmem: 2052 K maxmem: 600196 K

- But also note that the memory usage went from 600M/620M up to ~650M somewhere along the line after rev 3700; I didn't follow the same pattern of transcribing the full history as in comment 9, but I do have it here.  That memory increase is reflected here:
shell/avmshell.r3730
tf1: 11771 ms - callLater test minmem: 2052 K maxmem: 598148 K
tf1: 11754 ms - callLater test minmem: 2052 K maxmem: 598148 K
tf1: 11833 ms - callLater test minmem: 2052 K maxmem: 620804 K
shell/avmshell.r3740
tf1: 11390 ms - callLater test minmem: 2052 K maxmem: 653700 K
tf1: 11331 ms - callLater test minmem: 2052 K maxmem: 651652 K
tf1: 11362 ms - callLater test minmem: 2052 K maxmem: 651652 K

- I don't plan to investigate the memory increase until after I've finished looking into the running time regression.
(In reply to comment #10)
> - I don't plan to investigate the memory increase until after I've finished
> looking into the running time regression.

As noted above, the 3x time blowup happens with changeset 3767.  That particular changeset resolved Bug 544695; it replaces a fixed inverse load factor of 2.0 with a telescoping series of load factors that gets smaller as the heap gets larger.  I haven't looked too carefully at the ranges of heap sizes and their associated inverse load factors, but it's not hard to imagine that this example is getting tripped up on an overly conservative choice at the edge.
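
For readers not familiar with the -load series, here is a minimal C++ sketch (added for illustration) of how a telescoping table of inverse load factors could be selected by heap size.  The pair values mirror the default series quoted in the next comment; reading the cutoffs as megabytes of heap is my assumption, and the names (LoadStep, loadFactorForHeap) are hypothetical, not the actual MMgc::GCPolicyManager code.

  // Hedged sketch: pick an inverse load factor from a telescoping series.
  // The pairs mirror the default "2.5,10,2.0,25,...,1.1,150,1.05" series;
  // the cutoff units (MB) and all names here are assumptions, not MMgc code.
  #include <cstddef>

  struct LoadStep { double loadFactor; double heapLimitMB; };

  static const LoadStep kSteps[] = {
      {2.5, 10}, {2.0, 25}, {1.75, 50}, {1.5, 75}, {1.25, 100}, {1.1, 150}
  };
  static const double kFinalLoadFactor = 1.05;  // applies above the last cutoff

  double loadFactorForHeap(double heapMB)
  {
      for (size_t i = 0; i < sizeof(kSteps)/sizeof(kSteps[0]); i++)
          if (heapMB < kSteps[i].heapLimitMB)
              return kSteps[i].loadFactor;
      return kFinalLoadFactor;
  }

  // A ~600 MB heap (as in the distilled3 runs) falls through to 1.05, so
  // the collector only allows roughly 5% growth over live data per cycle.
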
Food for thought:

Here are the default load factors, fed explicitly on the command line:
% ./shell/avmshell.r5842 -load 2.5,10,2.0,25,1.75,50,1.5,75,1.25,100,1.1,150,1.05 ../../distilled3.abc
tf1: 28060 ms - callLater test minmem: 2052 K maxmem: 615552 K

If I peel off that 1.05, so that 1.1 becomes the max inverse load factor:
% ./shell/avmshell.r5842 -load 2.5,10,2.0,25,1.75,50,1.5,75,1.25,100,1.1 ../../distilled3.abc
tf1: 19923 ms - callLater test minmem: 2052 K maxmem: 628060 K

And peeling off another layer:
% ./shell/avmshell.r5842 -load 2.5,10,2.0,25,1.75,50,1.5,75,1.25 ../../distilled3.abc
tf1: 13504 ms - callLater test minmem: 2052 K maxmem: 648668 K

(The point being that we don't need to go all the way back to an inverse load
factor of 2 to bring performance back for this benchmark.)

Running the shell is fun and all, but my plan for now is to scribble out some
formulas and see if I can come up with a disciplined argument for or against
them.

Comment 13

7 years ago
The very tight load factors at high memory sizes were always suspect, but IIRC they were put in place, in significant part, in response to a rebellion in the Flash QE team, because some of their trusty test cases started using more memory.  And being mindful of mobile needs starting to dominate desktop needs, we also felt that they were somewhat justified.

Comment 14

7 years ago
We should confirm that changing the load factors in the player helps the problem there too.  There's way too much difference between the shell and the player to not constantly go back to the player to verify findings.
(In reply to comment #14)
> We should confirm that changing the load factors in the player helps the problem
> there too.  There's way too much difference between the shell and the player
> to not constantly go back to the player to verify findings.

Will do.
Yesterday on IRC, Lars asked what shark'ing the app reveals in terms
of where the time is going -- "is it all jitted code or are we dying
in system subroutines for array growth and so on?"

Since the benchmark runs in under 30 seconds, I was able to set up
Shark to collect samples during the entire application run, with a
sample interval of 20 microseconds (not milliseconds).

Most of the time is spent in the GC (details below [1]).  Inspecting the
back-trace provided by Shark indicates to me that most of the GC calls
are happening in response to newarray calls from jitted code [2].

(Correct me if I'm misinterpreting the data or if I should be looking at something else.)

[1]
== Where time is being spent with the two different load factors: ==

With the default load factor:
  2.5,10,2.0,25,1.75,50,1.5,75,1.25,100,1.1,150,1.05
the top seven entries (of the "Heavy (Bottom-Up)" breakdown) are gc
code, summing to 711,139 of the total 1,155,435 samples taken, or
61.5% of the samples.  I'm copying the top ten entries here mostly for
comparison with the data below.

    24.8%    24.8%    avmshell    MMgc::GC::TraceAtomValue(int)
    17.0%    17.0%    avmshell    avmplus::ScriptObject::gcTrace(MMgc::GC*, unsigned long)
     5.6%     5.6%    avmshell    MMgc::GC::MarkItem(MMgc::GCWorkItem&)
     5.1%     5.1%    avmshell    MMgc::GC::TracePointer(void*)
     4.0%     4.0%    avmshell    MMgc::GCAlloc::Finalize()
     2.5%     2.5%    avmshell    MMgc::GC::IncrementalMark()
     2.5%     2.5%    avmshell    MMgc::GCAlloc::Alloc(int)
     2.1%     2.1%    avmshell    avmplus::InlineHashtable::add(int, int, avmplus::Toplevel*)
     1.7%     1.7%    avmshell    avmplus::InlineHashtable::grow(avmplus::Toplevel*)
     1.6%     1.6%    avmshell    avmplus::Traits::traceSlots(MMgc::GC*, avmplus::ScriptObject*) const

(I'm not sure this is really meaningful in isolation from the rest of
the data, though... these ten entries represent 772,719 samples, or
66.9% of the total samples).

With the faster "peeled" load factor I wrote in comment 12:
  2.5,10,2.0,25,1.75,50,1.5,75,1.25
the breakdown at the top is quite different.

    10.8%    10.8%    avmshell    MMgc::GC::TraceAtomValue(int)
     7.3%     7.3%    avmshell    avmplus::ScriptObject::gcTrace(MMgc::GC*, unsigned long)
     4.5%     4.5%    avmshell    MMgc::GCAlloc::Alloc(int)
     3.8%     3.8%    avmshell    avmplus::InlineHashtable::add(int, int, avmplus::Toplevel*)
     3.1%     3.1%    avmshell    avmplus::InlineHashtable::grow(avmplus::Toplevel*)
     2.7%     2.7%    avmshell    avmplus::ListImpl<int, avmplus::AtomListHelper>::ListImpl(MMgc::GC*, unsigned int, int const*)
     2.5%     2.5%    avmshell    avmplus::AvmCore::atomWriteBarrier_ctor(MMgc::GC*, void const*, int*, int)
     2.5%     2.5%    avmshell    MMgc::GCAlloc::Finalize()
     2.5%     2.5%    mach_kernel lo_mach_scall
     2.5%     2.5%    avmshell    MMgc::GC::MarkItem(MMgc::GCWorkItem&)

Note that there are only 841,085 *total* samples in this trace (this
makes sense, because the program completes and exits much sooner).
These ten entries represent 354,011 samples, or 42.1% of the total
samples.

I find this data consistent with my hypothesis that with the load
factor of 1.05, we are running the garbage collector more often in
order to keep from expanding too far beyond the last-known live
volume, but this leads to very slow ramp-up for programs like this
one, which allocate as fast as possible in minuscule increments.
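
To make the ramp-up cost concrete, here is a back-of-the-envelope sketch in C++ (added for illustration; the ~600 MB live volume is a rough reading of the maxmem figures above, and this is not MMgc's actual trigger arithmetic):

  // Rough arithmetic: allocation headroom per collection cycle under an
  // inverse load factor L, for an assumed live volume of ~600 MB.
  #include <cstdio>

  int main()
  {
      const double liveMB = 600.0;            // roughly distilled3's steady state
      const double factors[] = {1.05, 1.25};  // "default" limit vs "peeled" limit
      for (double L : factors) {
          double headroomMB = (L - 1.0) * liveMB;  // allocation allowed per cycle
          std::printf("L=%.2f -> ~%.0f MB allocated between full marks\n",
                      L, headroomMB);
      }
      // L=1.05 -> ~30 MB, L=1.25 -> ~150 MB: the tight factor makes the
      // collector re-mark the whole heap about five times as often.
      return 0;
  }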

[2]
== Fragment of backtrace for default load factor ==

I expanded the Shark backtrace for the top entry (24.8% of the
samples) out until either I hit jitted code, or the "Total" column
listed a value < 5.0% (which I interpret as meaning an insignificant
number of the samples originated from this control context).

From my inspection, the two big items seem to be ArrayClass::newarray
(called from jitted code), and InlineHashtable::grow (called
indirectly from ScriptObject::setAtomProperty which in turn was called
from jitted code).


+ 24.8%, MMgc::GC::TraceAtomValue(int), avmshell
| + 15.9%, avmplus::ScriptObject::gcTrace(MMgc::GC*, unsigned long), avmshell
| | + 15.9%, MMgc::GC::MarkItem(MMgc::GCWorkItem&), avmshell
| | | + 12.1%, MMgc::GC::IncrementalMark(), avmshell
| | | | + 12.0%, MMgc::GCAlloc::Alloc(int), avmshell
| | | | | - 3.7%, avmplus::ArrayClass::newarray(int*, int), avmshell
| | | | | - 3.1%, avmplus::ListImpl<int, avmplus::AtomListHelper>::ListImpl(MMgc::GC*, unsigned int, int const*), avmshell
| | | | | - 3.0%, avmplus::InlineHashtable::grow(avmplus::Toplevel*), avmshell
| | | | | - 1.4%, avmplus::InlineHashtable::initialize(MMgc::GC*, int), avmshell
| | | | | - 0.8%, avmplus::ObjectClass::construct(int, int*), avmshell
| | | | | - 0.0%, avmplus::ArrayClass::createInstance(avmplus::VTable*, avmplus::ScriptObject*), avmshell
| | | | - 0.1%, MMgc::GCLargeAlloc::Alloc(unsigned long, int), avmshell
| | | - 3.8%, MMgc::GC::DoMarkFromStack(void*, void*), avmshell
| | - 0.0%, avmplus::ArrayObject::gcTrace(MMgc::GC*, unsigned long), avmshell
| + 8.7%, avmplus::TracedListData<int>::gcTrace(MMgc::GC*, unsigned long), avmshell
| | + 8.7%, MMgc::GC::HandleLargeMarkItem(MMgc::GCWorkItem&, unsigned long&), avmshell
| | | + 8.7%, MMgc::GC::MarkItem(MMgc::GCWorkItem&), avmshell
| | | | + 5.1%, MMgc::GC::IncrementalMark(), avmshell
| | | | | + 5.1%, MMgc::GCAlloc::Alloc(int), avmshell
| | | | | | - 1.6%, avmplus::ArrayClass::newarray(int*, int), avmshell
| | | | | | - 1.3%, avmplus::ListImpl<int, avmplus::AtomListHelper>::ListImpl(MMgc::GC*, unsigned int, int const*), avmshell
| | | | | | - 1.2%, avmplus::InlineHashtable::grow(avmplus::Toplevel*), avmshell
| | | | | | - 0.6%, avmplus::InlineHashtable::initialize(MMgc::GC*, int), avmshell
| | | | | | - 0.3%, avmplus::ObjectClass::construct(int, int*), avmshell
| | | | | | - 0.0%, avmplus::ArrayClass::createInstance(avmplus::VTable*, avmplus::ScriptObject*), avmshell
| | | | | - 0.1%, MMgc::GCLargeAlloc::Alloc(unsigned long, int), avmshell
| | | | - 3.6%, MMgc::GC::DoMarkFromStack(void*, void*), avmshell
| | - 0.0%, MMgc::GC::MarkItem(MMgc::GCWorkItem&), avmshell
| - 0.2%, MMgc::GC::HandleLargeMarkItem(MMgc::GCWorkItem&, unsigned long&), avmshell
| - 0.0%, MMgc::GC::MarkItem(MMgc::GCWorkItem&), avmshell
| - 0.0%, avmplus::ScopeChain::gcTrace(MMgc::GC*, unsigned long), avmshell
| - 0.0%, avmplus::Namespace::gcTrace(MMgc::GC*, unsigned long), avmshell
| - 0.0%, avmplus::HeapHashtable::gcTrace(MMgc::GC*, unsigned long), avmshell
An alternate interpretation of the data (which may not be derivable from the presentation above):

- for the "peeled" load factors (i.e. limit = 1.25), much of our time *is* being spent in array growth routines (or in the GC responding to such growth), as opposed to newarray allocations.

- for the "default" load factors (i.e. limit = 1.05), the time is completely dominated by the GC's responses to newarray allocations.

Comment 18

7 years ago
Early on I tried to control for this kind of behavior using the factor "G" (see the big block comment in GCPolicyManager.cpp), which is the max fraction of the time that we want to allow GC to run.  Effectively, with a low G, a program that allocates like there's no tomorrow at large heap sizes would end up with a larger L.  I never got this to work satisfactorily, but I didn't have a lot of programs to test with, and I tested across all heap sizes, so my conclusions could well have been skewed.  We only care about this kind of setting for low L, and the experiments with G predated the variable L.
(In reply to comment #18)
> Early on I tried to control for this kind of behavior using the factor "G" (see
> big block comment in GCPolicyManager.cpp), which is the max fraction of the
> time that we want to allow GC to run.

Some additional notes on this:

- The G-factor has a default value of 0.25, which would make one think
  that we strive to keep the collector from occupying more than a
  quarter of the running time (no firm guarantees; the idea is just to
  respond immediately after violations occur).

- The gcbehavior-reported mutator-efficiencies (when running
  distilled3.abc with the "default" load factor settings) are

  85.24%, 67.56%, 69.96%, 70.65%, 59.97%, 61.15%, 64.96%, 48.70%,
  33.17%, 28.37%, 16.60%, 29.97%, 27.98%, 6.35%, 19.10%, 14.22%,
  17.27%, 15.99%, 6.27%, 16.35%, 14.15%, 18.55%, 17.45%, 6.20%,
  18.86%, 14.65%, 17.46%, 16.36%, 6.14%, 15.89%, 13.88%, 17.17%,
  16.84%, 6.19%, 17.00%, 14.22%, 16.94%, 16.60%, 6.06%, 17.55%, 14.23%

  So clearly the collector (at 100% minus each of the above values) is
  going way over the 25% threshold.

- The G-factor is _potentially_ used to adjust the load factor if we
  determine that we spent too much time in the collector during the
  last incremental collection cycle; but the adjustment is
  _overridden_ by the setting for gcLoadCeiling X-factor (set on the
  command line via -loadCeiling); the resulting load with ceiling is

    L = min(L_adjustedForG, X*L_selectedForHeap)

  (See GCPolicyManager::adjustL(double) for the actual calculation; a
  hedged sketch of this clamp follows after the results table below.)

  The default X-factor value is 1.0, with a comment that it is
  "probably OK for desktop, maybe less so for mobile."  This
  effectively means that we'll never ramp the load factor _up_; we'll
  just continue to spend much time in the collector.

- The G-factor is not exposed on the command-line, but its default of
  0.25 is probably so aggressive that there is no need to adjust it
  at this point -- the X-factor _is_ exposed and adjusting _that_
  does suffice.

  As an experiment to illustrate this, I ran the distilled3.abc
  benchmarks for 3 iterations with the following X-factor values:
     1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09
     1.10 1.20 1.30 1.40 1.50 1.60 1.70 1.80 1.90 2.00

  Results:
  X-factor        Running Time    Max Memory
  1.00 (default)  28s             616-630 M
  1.01            24s             616-616 M
  1.03            21s             616-618 M
  1.06            18s             629-635 M
  1.20            13s             649-649 M
  1.60            11s             642-670 M
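
As promised above, here is a hedged C++ sketch of the L = min(L_adjustedForG, X*L_selectedForHeap) clamp.  The structure follows the description in this comment; the function name, parameters, and the particular growth rule used when the collector overshoots its budget are illustrative stand-ins, not the actual GCPolicyManager::adjustL code.

  // Illustrative sketch of the load-adjustment clamp described above.
  #include <algorithm>

  double adjustedLoad(double gcFraction,        // time fraction spent in GC last cycle
                      double G,                 // GC time budget, default 0.25
                      double L_selectedForHeap, // from the telescoping -load series
                      double X)                 // gcLoadCeiling, default 1.0
  {
      double L = L_selectedForHeap;
      if (gcFraction > G) {
          // Collector went over budget: ramp L up to give the mutator more room.
          // (This growth rule is a stand-in, not the real formula.)
          L = L_selectedForHeap * (gcFraction / G);
      }
      // The ceiling X caps the ramp-up; with the default X = 1.0 the cap is
      // exactly L_selectedForHeap, so in practice L never increases.
      return std::min(L, X * L_selectedForHeap);
  }
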
(In reply to comment #19)
>   As an experiment to illustrate this, I ran the distilled3.abc
>   benchmarks for 3 iterations with the following X-factor values:
>      1.00 1.01 1.02 1.03 1.04 1.05 1.06 1.07 1.08 1.09
>      1.10 1.20 1.30 1.40 1.50 1.60 1.70 1.80 1.90 2.00
> 
>   Results:
>   X-factor        Running Time    Max Memory
>   1.00 (default)  28s             616-630 M
>   1.01            24s             616-616 M
>   1.03            21s             616-618 M
>   1.06            18s             629-635 M
>   1.20            13s             649-649 M
>   1.60            11s             642-670 M

Note: I left out data points where I subjectively decided that the running time did not decrease by enough to be worth noting, with a threshold of about 3 seconds.  Before trying to capture anything more formal, I think I'd want to start investigating a larger set of benchmarks.

Comment 21

7 years ago
Very nice.

Worrisome+interesting that we can increase X to 1.6 and reap huge performance benefits from that but not actually increase memory by more than 6%.

Presumably this is just an illustration of what you found with the modest changes to the load factors (from 1.05 to 1.25), that a little bit of breathing room can go a long way?
(In reply to comment #21)
> Worrisome+interesting that we can increase X to 1.6 and reap huge performance
> benefits from that but not actually increase memory by more than 6%.
>
> Presumably this is just an illustration of what you found with the modest
> changes to the load factors (from 1.05 to 1.25), that a little bit of breathing
> room can go a long way?

Yes, the results in comment 12 show a similar tradeoff: a large time
saving for a small memory-usage cost.

But I did not (and do not) equate increasing the X-factor with
increasing the L-factor, because restricting changes to just the
X-factor should keep us from increasing L unless there is motivation
to do so based on the mutator utilization; that is, I thought this was
a more conservative change to policy.

(Maybe that just brings us back to a question you raised in IRC: why
put in a complicated policy with tons of knobs when a simple policy,
properly tuned, would suffice.  In this particular case, changes to X
are complicated to think about, while changes to L are simpler.)

Comment 23

7 years ago
(In reply to comment #22)

> (Maybe that just brings us back to a question you raised in IRC: why
> put in a complicated policy with tons of knobs when a simple policy,
> properly tuned, would suffice.  In this particular case, changes to X
> are complicated to think about, while changes to L are simpler.)

I still believe that, but if it turns out that X is a well-functioning check on ill-chosen values of L - relative to a particular program - then that may still be the better solution.  It's not like we've exposed any of this except in the avmshell.  We may have to expose it at some point, but for now it's up to us to pick values that work.

It's probably the case that L=1.05 is simply unreasonable in a non-generational collector for any program that allocates much at all; assuming accurate accounting, one has to scan 20 bytes for every byte allocated, and that makes allocation very, very expensive.  With 1.25 the scanning cost is one fifth of that, which is a huge difference.
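
The 20-bytes-per-byte figure follows directly from the headroom arithmetic; here is a tiny worked example in C++ (illustrative only, under the simplifying assumption that the entire live heap is re-marked once per cycle):

  // Bytes of live heap marked per byte allocated is roughly 1/(L-1).
  #include <cstdio>

  int main()
  {
      const double factors[] = {1.05, 1.25, 2.0};
      for (double L : factors) {
          std::printf("L=%.2f -> mark ~%.0f bytes per byte allocated\n",
                      L, 1.0 / (L - 1.0));
      }
      // L=1.05 -> ~20, L=1.25 -> ~4 (one fifth of 20), L=2.00 -> ~1.
      return 0;
  }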

My gut feeling based on your data and all these deliberations is that we want L=1.25 to extend to much larger memory sizes on desktop, and we want X=1.5 at least, to cater to huge fast-allocating (i.e., largely untuned) computations.  We may want L=1.25 to go all the way out on desktop, because Flash applications in the 500MB-1GB range are on the horizon.  And we want generational collection, to take some of the pressure off, given that large parts of large heaps will be stable.

We probably want to create a repository of swfs that we can use for GC policy tuning, I have one I've been using that I'll send you separately.
(Assignee)

Updated

7 years ago
Attachment #507994 - Attachment mime type: application/octet-stream → text/plain
(Assignee)

Updated

7 years ago
Attachment #508028 - Attachment mime type: application/octet-stream → text/plain
Exploration on some real-world SWFs indicates to me that adjusting the inverse load factor series to converge to a limit of 1.25 is going to hurt memory usage in some cases (not universally, but when it arises, the issue is significant).  I did not find a real-world SWF where the memory increase had a noticeable payoff in terms of improved performance.

The benchmark I've been using most recently is:
  http://www.snailsanimation.com/benchmark08_play.php
which provides a benchmark with four distinct phases, increasing the load with each phase.  It does not report memory usage; I've been using Instruments.app to track that.

An oddity with the above benchmark is that loosening the belt by converging the load on 1.25 instead of 1.05 *hurt* performance, dropping from 40 fps to 34 fps in the benchmark's heavy phase (and memory usage got hit as well, growing from 77 MB to 105 MB).
- Perhaps the performance problem is a caching issue?
- I might be making a mistake relying on the benchmark's reported average frames-per-second as a gauge of performance.
- I should probably try to make a local copy of the benchmark so that we can repeat the experiment in the future.

Adjusting the X-factor seems like a more promising path for solving this problem, at least without the risk of introducing unjustified memory overhead elsewhere.  So I am coming around to a conclusion similar to Lars' gut feeling that both the L and X factors need adjustment, except that I am planning a more conservative adjustment (at least for this round of changes).  I hope to have a patch up soon.
Created attachment 516878 [details] [diff] [review]
suggested semi-conservative fix

Let's put this bug to bed.

I'll be posting a spreadsheet with some analysis shortly.
Created attachment 516886 [details]
PDF.  Theoretical analysis of policies; empirical results for callLater

I did the spreadsheet in Numbers.app, so I'm uploading the PDF.
Created attachment 516929 [details]
PDF: charts comparing behavior on 4 "sites"

After spending some time surveying the landscape, I selected a
few sites that either seemed memory intensive (at the time I was
using external monitoring tools) or claimed to be benchmarks, and
I also put in a SWF provided by Lars that is the most memory
intensive of the bunch.

I then turned on gcbehavior.txt instrumentation, did some runs,
and post-processed the resulting files to plot two items that I
thought looked interesting when initially skimming the data:

-- the GC occupancy (in blocks) before each GC as the program
   allocated, and

-- the reported mutator efficiency for each collection.

That's what is charted here, in the attached PDF.

The main points illustrated by the charts are:

  1. There are fewer collections under the new policy (that is,
     the one in attachment 516878 [details] [diff] [review]).  It is, after all, meant to
     be a belt loosening.

  2. The peak memory is a bit worse, but I think it remains
     acceptable.

  3. The reported mutator efficiency often seems better, but
     it's a bit scattered.  Probably better to just be happy
     that there are fewer collections.

This PDF just has the charts, not the raw coordinates.  I'll put
up a tar ball with the original gcbehavior.txt's next (and the
scripts I used to extract the coordinates used for the plots), so
that others can perform their own analysis.  (If someone wants
spreadsheets, I can upload the Numbers.app document.)
Created attachment 516930 [details]
tar.gz archive with gcbehavior.txt's

Here are the gcbehavior.txt's used as the raw data for analysis in comment 27.
(In reply to comment #27)
> Created attachment 516929 [details]
> PDF: charts comparing behavior on 4 "sites"

FYI: it is probably best to just focus on the first three pages, and not worry about the fourth page of charts.

Details on why follow (but it's really boring).

Why: While double-checking that the uploads worked, I noticed that there is a funny connection of lines on the chart in the lower-right corner of page 2 (Habbo, XLcombo: Mut Efficiency).  Namely, there appears to be a line, run2, that jumps forward and then backward in time (it's the orange square circa x=4e+08).  I double-checked how this popped up: my scripts did not distinguish between different GC instances (which was silly of me).  This means that all the different GC reports get smushed together, and so one can see coordinates later in the series that report less total allocation work than coordinates earlier in the series.

This means that the notion of "time" that the charts are using doesn't make sense for the cases where there was >1 GC in play.

There was only one GC for benchmark08 and jean-philippe (I verified this by looking in their gcbehavior.txt logs), so those charts are fine.  For Habbo, the only run with more than one GC was the aforementioned XLcombo:run2; all of the other runs had only one GC.  That just leaves the fourth benchmark, SuperMarioTruck (14114.html); it's all over the place, but its data was not as interesting as the other cases anyway, so I'm not going to go back and redo the charts just for this case.
(Assignee)

Updated

7 years ago
Attachment #516878 - Flags: superreview?(lhansen)
Attachment #516878 - Flags: review?(treilly)
(Assignee)

Updated

7 years ago
Whiteboard: has-patch

Updated

7 years ago
Attachment #516878 - Flags: superreview?(lhansen) → superreview+
(In reply to comment #25)
> Created attachment 516878 [details] [diff] [review]
> suggested semi-conservative fix
> 
> Let's put this bug to bed.

gentle review ping.

Updated

7 years ago
Attachment #516878 - Flags: review?(treilly) → review+

Comment 31

7 years ago
changeset: 6087:c5407232613b
user:      Felix S Klock II <fklockii>
summary:   Bug 619885: loosen gc load; improve time/space tradeoff (r=treilly,sr=lhansen).

http://hg.mozilla.org/tamarin-redux/rev/c5407232613b
(Assignee)

Updated

7 years ago
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED