scimark memory blowup between TR revs 6395:2529571bfdfa and 6411:4013df504cfc

RESOLVED FIXED

Status

Tamarin
Garbage Collection (mmGC)

People

(Reporter: pnkfelix, Assigned: pnkfelix)

Tracking

Details

(Whiteboard: WE:3109196)

Attachments

(6 attachments, 5 obsolete attachments)

There appears to be a memory usage regression in scimark introduced over the summer:

tr-old:
changeset:   6395:2529571bfdfa
date:        Wed Jun 15 08:52:05 2011 -0400

tr-new: 
changeset:   6411:4013df504cfc
date:        Wed Jun 22 10:26:32 2011 -0400

In the shell, you can build each of these changesets and run scimark (iterating five times to reduce effects of random noise) as follows in tamarin-redux/test/performance/:
% python runtests.py -i 5 -m --avm /path/to/avmshell scimark

Below I have copied the results of running the above in tr-old and tr-new.  You can see that the reported memory usage (which, in this context, uses --memstats, not System.totalMemory) has indeed blown up.  (I'll track down the injection tamarin-redux revision next.)

avm: ../../../tr-old/objdir-rel64/shell/avmshell  version: 6395:2529571bfdfa
iterations: 5


test                                                 best     avg     95%_conf

Metric: memory 
Dir: scimark/
  FFT                                                  4.0M     3.9M  2.0% 
  LU                                                   7.3M     6.2M 15.2% 
  MonteCarlo                                           2.0M     2.0M  0.0% 
  SOR                                                  6.3M     6.2M  3.2% 
  SparseCompRow                                        3.9M     3.8M  3.2% 


avm: ../../../tr-new/objdir-rel64/shell/avmshell  version: 6411:4013df504cfc
iterations: 5


test                                                 best     avg     95%_conf

Metric: memory 
Dir: scimark/
  FFT                                                 31.8M    30.5M  2.7% 
  LU                                                  31.8M    31.4M  1.4% 
  MonteCarlo                                           2.0M     2.0M  0.0% 
  SOR                                                 33.8M    32.0M  4.1% 
  SparseCompRow                                        4.4M     4.1M  5.8%
(next up: identifying the injection changeset.)
Injection changeset:

changeset:   6402:c8459ef051b0
user:        Lars T Hansen <lhansen@adobe.com>
date:        Mon Jun 20 08:34:10 2011 +0200
summary:     Fix 657949 - Segregated allocation for finalizable objects (r=fklockii)

Hmmm.
(the above was addressing Bug 657949.)
(Assignee)

Updated

6 years ago
Whiteboard: WE:3109196
I did a full run of the test/performance/runtests.py benchmarks with the -m flag to see what other cases should have sent up some signs of smoke.  Here are all the cases that regressed by more than 100% max memory usage:

   % python runtests.py -i 3 -m --avmname r6401 --avm2name r6402 \
     --avm ../../../tr-old/objdir-rel64/shell/avmshell \
     --avm2 ../../../tr-new/objdir-rel64/shell/avmshell

                                      r6401           r6402
test                           best     avg    best     avg   %dBst   %dAvg
Metric: memory
Dir: asmicro/
  alloc-10                     3.9M    3.9M   21.6M   21.6M  -453.8  -453.8 --
  alloc-3                      3.8M    3.8M   36.4M   36.0M  -857.9  -848.2 --
  alloc-8                      3.9M    3.9M   21.6M   21.2M  -453.8  -444.4 --
  alloc-9                      3.9M    3.9M   22.0M   21.5M  -464.1  -452.1 --
  try-2                        3.9M    3.9M   29.3M   28.7M  -651.3  -635.0 --
Dir: jsbench/
  Euler                       16.5M   16.2M   35.8M   33.8M  -117.0  -108.4 --
  Moldyn                       4.5M    4.5M   33.1M   33.1M  -635.6  -635.6 --
  RayTracer                    4.4M    4.3M   33.1M   32.6M  -652.3  -658.9 --
Dir: jsbench/typed/
  Euler                        8.0M    7.9M   35.1M   33.6M  -338.8  -325.7 --
  Moldyn                       4.6M    4.5M   31.8M   31.5M  -591.3  -595.6 --
  Series                       4.4M    4.3M   31.8M   31.2M  -622.7  -619.2 --
Dir: jsmicro/
  alloc-10                     3.9M    3.9M   21.6M   21.6M  -453.8  -453.8 --
  alloc-3                      3.9M    3.8M   36.4M   35.8M  -833.3  -833.0 --
  alloc-8                      3.9M    3.9M   21.6M   21.1M  -453.8  -440.2 --
  alloc-9                      3.9M    3.9M   21.6M   21.3M  -453.8  -445.3 --
  for-2                        4.0M    4.0M   32.4M   31.5M  -710.0  -694.1 --
  for-3                        4.0M    4.0M   31.8M   31.4M  -695.0  -691.6 --
  oop-1                        3.5M    3.5M    8.4M    8.1M  -140.0  -131.4 --
  parseFloat-1                 3.9M    3.6M   19.9M   19.1M  -410.3  -426.6 --
  try-2                        3.6M    3.6M   28.6M   28.3M  -694.4  -686.1 --
Dir: misc/
  boids                        3.8M    3.8M   13.9M   12.6M  -265.8  -232.5 --
Dir: scimark/
  FFT                          4.4M    4.2M   32.4M   31.7M  -636.4  -661.6 --
  LU                           5.8M    5.6M   33.8M   31.8M  -482.8  -464.5 --
  SOR                          6.3M    5.8M   32.4M   31.3M  -414.3  -440.2 --
Dir: sunspider-0.9.1/js/
  s3d-cube                     4.1M    4.1M   10.5M   10.3M  -156.1  -151.2 --
Dir: v8/
  raytrace                     4.1M    4.1M   12.9M   12.3M  -214.6  -202.5 --
Dir: v8.5/untyped/
  raytrace                     4.1M    4.1M   13.1M   12.6M  -219.5  -206.5 --
* I looked first at the asmicro alloc-3 benchmark, since it is teensy.

* Careful analysis of the -gcbehavior output for asmicro/jsmicro may not yield fruit as quickly as looking at the benchmarks with a fixed work load.
-- (asmicro/jsmicro are all dynamic workloads targeting a total running time of 1 second.)

* Still, there is some interesting data
-- mainly that the benchmark itself is running much more quickly (before: 7.8 iters/sec, after: 13.6 iters/sec).
-- But that in turn means that a lot of the statistics from -gcbehavior are not directly interesting.
-- The increased number of iterations means that you see basically twice as much allocation in total, and so there are direct correlations in things like 2x allocation work, 2x zct work, ...

* Exceptions to these general observations:
-- After the patch landed, the total number of collections dropped from 110 collections to 20 collections, *despite* the increased #iterations (and thus amount of allocation).
-- Also, the peak occupancy jumped by 10x.
-- Finally, the policy numbers during the run indicate that kbytes-live, before the patch landed, hits a steady state quickly and stays pretty solidly at 699 kbytes, while after the patch it grows slowly up to 26 mbytes and then hovers around that level.  That's quite a difference in live heap size, and is quite troubling.

(I was originally going to switch to a benchmark with fixed work load, but now I want an answer for why the kbytes-live estimate grew so dramatically before and after the patch landed.)
(Assignee)

Updated

6 years ago
Assignee: nobody → fklockii
Status: NEW → ASSIGNED
My first guess as to what is going on: the feedback mechanism in the GC load policy (where we estimate the amount of live storage and use that estimate to set the amount of storage we will allocate during the next GC cycle) has been disrupted.  A small change in the estimate early on is getting blown up by this feedback loop.
More notes:
* Before rev 6402:c8459ef051b0, on the alloc-3.abc benchmark, an early invocation of GCAlloc::FinalizationPass is able to populate the GC's smallEmptyPageList.  After the injection, that early invocation goes through GCAlloc::LazySweepPass instead of GCAlloc::FinalizationPass, and the smallEmptyPageList does not get populated.

* GCAlloc::FinalizationPass explicitly points out this possibility when it lays out the 3 outcomes for each block at the end of its control flow (it's the first potential outcome).

* I cannot yet tell whether Lars anticipated the effects of side-stepping the smallEmptyPageList with his introduction of LazySweepPass.  

* One obvious option would be to duplicate the portion of FinalizationPass that traverses the block items looking for marks, and then use that to duplicate the effect of populating the smallEmptyPageList with suitable blocks. 
-- Part of the point of LazySweepPass is to avoid doing work during the FinishIncrementalMark critical section (in order to avoid long pauses there).
-- So a drawback of this option is that we would be introducing an inner loop traversing the items (though we could bail out as soon as we encounter any mark at all -- we would not need to count the total number of marks -- so it might not be so bad).
Created attachment 596297 [details] [diff] [review]
patch E v1: emulate former handling of all-free blocks

As described in comment 7: duplicate the portion of FinalizationPass that traverses the block items looking for marks, and then use that to duplicate the effect of populating the smallEmptyPageList with suitable blocks.

This appears to fix the memory issue on the case of alloc-3.  I still need to benchmark its effects on time and memory across the board.

(So far, it looks like it may hurt overall run time on the alloc-3 benchmark, but the comparison really isn't fair because it's very easy to take up only half the time when you're using 10x as much memory.)
(Assignee)

Updated

6 years ago
Attachment #596297 - Attachment description: emulate former handling of all-free blocks → patch E v1: emulate former handling of all-free blocks
Created attachment 596364 [details]
.numbers spreadsheet F v1: time+mem for TR rev 6401,6402,7193 (w/o and w/ patch)

Had fklockii-iMac grind away today on a x86_64 release build on the test/performance benchmarks to get an idea of what the effects of the injection changeset are, as well as whether the suggested patch E helps alleviate the problem.

(I'll upload PDFs shortly with the most important summary pages, and put some overall analysis with their comments.)
Created attachment 596367 [details]
pdf Fmw: mem worst

portion of F showing benchmarks where E had the worst effect on max memory usage for the tip of TR, by %change in max memory.  (The %change is in the rightmost column.)

I figure I'll start with this because I have to choose between the good news and the bad news, so bad news first: memory does regress in some cases.  But even that is a good-news situation, because it's a very small number of benchmarks where the regression is at all significant:

* mmgc/ofib.as max mem regresses by 18%
* v8.5/optimized/raytrace.as max mem regresses by 15%
* sunspider-0.9.1/js/access-nsieve max mem regresses by 13%

Beyond that, only 5 or 6 further benchmarks regress by >= 3%.  That's pretty good considering we are looking at 527 benchmarks here.
Created attachment 596368 [details]
pdf Ftw: time worst

portion of F showing benchmarks where E had the worst effect on the time metric for the tip of TR, by %change in time.  (The %change is in the third column from the right.)

Not quite as rosy a story here as what we saw in comment 10, *except* when one also considers the associated memory usage *improvement* for the vast majority of the cases where one sees a time regression.  (These are mostly instances of the phenomenon I mentioned at the end of comment 8 -- it is easy to use half as much time when you also use 10x as much memory.)
Created attachment 596377 [details]
pdf Ftb: time best

These are the top time-improving cases with the patch (ordered by percentage time improvement, again in the third column from the right).  A larger number of benchmarks than I expected actually get faster with this change; it is a bit counter-intuitive.

* It may be a side-effect of a significantly smaller working set in memory: look at the memory improvement numbers in the rightmost column to see the cases where the two correlate.

* Some of the others are benchmarks that are known to be noisy/highly-sensitive to supposedly unrelated changes (I'm looking at you, sunspider).

* And then there are other cases like asmicro/string-lastIndexOf where I don't have an answer.

Anyway, this is really just a lead up to the finale: the mem best presentation.  Coming up next.
Created attachment 596379 [details]
pdf Fmb: mem best

portion of F showing benchmarks where E had the best effect on max memory usage for the tip of TR, by %change in max memory.  (The %change is in the rightmost column.)

----

Oh, I just realized these last two ("best") PDF's are missing an important legend at the top:

tr-old: rev 6401         (i.e. immediately pre-injection)
tr-new: rev 6402         (i.e. immediately post-injection)
tr-tip: rev 7193         (i.e. the current tip of TR)
tr-fix: rev 7193 + patch (i.e. what I expect if I land patch as-is)

----

It is worth comparing the "d(old,new) mem" column with the "d(tip,fix) mem" column: the first is showing the amount of regression at the injection point, and the second is showing whether the patch fixes the problem.

Hmm.  shoot; something looks wrong now.  The results when I compare the aforementioned columns for the floatBench/dbl/*.as cases here do not make sense.  I suspect something is wrong in the original spreadsheet, but that calls the whole presentation into question.  I have to stop working for the evening now, so I will try to debug that tomorrow and upload corrected versions of the .numbers file and the PDFs.
(In reply to Felix S Klock II from comment #13)
> Hmm.  shoot; something looks wrong now.  The results when I compare the
> aforementioned columns for the floatBench/dbl/*.as cases here do not make
> sense.  I suspect something is wrong in the original spreadsheet, ...

Ah, I see: I deleted some of the entries from the intermediate tables because the original data for some reason contains *just* time results for a few benchmarks and *just* memory results for a few others; but when I deleted the entries I was sloppy and did it on a table with formulas containing some relative addresses (rather than a table of freshly copied values).  I'm pretty certain the *only* end-effect was that the sign got swapped on some of the subtraction operations in the presentation, but that's still bad.  Fixing now.
(In reply to Felix S Klock II from comment #14)
> the original data for some reason will only contain *just* time
> results for a few benchmarks and *just* memory results for a few others

This is of course in and of itself a total bug in our runtests.py benchmarking infrastructure.  Filed as Bug 726403.
Created attachment 596428 [details]
csv tar.gz for F

(the original csv files that fed into F)

After reviewing the data, it seems my analysis from comment 14 was wrong, and the discrepancy I noted must have been introduced by my sloppy spreadsheet editing.  It's okay; I've back-tracked to a sane point and will upload shortly.
Created attachment 596434 [details]
.numbers spreadsheet F v2: time+mem for TR rev 6401,6402,7193 (w/o and w/ patch E)

Fixed the formulas.  Everything looks sane now.

The legend (same as before):

tr-old: rev 6401         (i.e. immediately pre-injection)
tr-new: rev 6402         (i.e. immediately post-injection)
tr-tip: rev 7193         (i.e. the current tip of TR)
tr-fix: rev 7193 + patch (i.e. what I expect if I land patch as-is)
Attachment #596364 - Attachment is obsolete: true
Created attachment 596435 [details]
pdf Ftime: time worst and best

Updated with sane(r) analysis.  The notes from comment 11 and comment 12 still essentially apply.
Attachment #596368 - Attachment is obsolete: true
Attachment #596377 - Attachment is obsolete: true
Created attachment 596436 [details]
pdf Fmem: mem worst and best

combined summary showing benchmarks where E had the worst/best effect on max memory usage for the tip of TR, by %change in max memory.  (The %change is in the rightmost column.)

The notes from comment 10 and comment 13 essentially still apply (except now I've fixed the mistake from the end of comment 13.)
Attachment #596367 - Attachment is obsolete: true
Attachment #596379 - Attachment is obsolete: true
Created attachment 596454 [details]
pdf F-all: all data from F in print(ed/able) form

(this is the same as F but actually viewable on non-OS X platforms via no-charge software.)
Acceptance test runs look clean.

The memory results (attachment 596436 [details]) imply to me that landing this is a no-brainer.

The time results (attachment 596435 [details]) may give one pause, but keep in mind that (1.) we can probably disregard jsmicro and sunspider as unimportant, and that (2.) the other time regressions often come with a big payoff in memory usage.
Comment on attachment 596297 [details] [diff] [review]
patch E v1: emulate former handling of all-free blocks

I am going to land this with the review pending.

Ruchi: I recommend doing at least one pass of the review with the mindset of "does this recreate/approximate the prior control flow of this code (from before the injection changeset noted in comment 2)?"  Of course an independent review with a head clear of such distraction is also appreciated.  :)
Attachment #596297 - Flags: review?(rulohani)

Comment 23

6 years ago
changeset: 7197:cf863cb7ddb5
user:      Felix S Klock II <fklockii@adobe.com>
summary:   Bug 725955: emulate former handling of 100% clear blocks in GCAlloc::Finalize (r pending=rulohani).

http://hg.mozilla.org/tamarin-redux/rev/cf863cb7ddb5

Comment 24

6 years ago
(In reply to Tamarin Bot from comment #23)
> changeset: 7197:cf863cb7ddb5
> user:      Felix S Klock II <fklockii@adobe.com>
> summary:   Bug 725955: emulate former handling of 100% clear blocks in
> GCAlloc::Finalize (r pending=rulohani).
> 
> http://hg.mozilla.org/tamarin-redux/rev/cf863cb7ddb5

Checked in FRR mac64 standalone release and memory usage is significantly less running ScimarkGUI: 
Max System.totalMemory: 
FRR+7196: 32563200
FRR+7197: 12640256

Not only is the memory usage less for each test but the MFLOPS is either on par or actually a little better:
FRR+7196:
Fast Fourier Transform     MFLOPS: 20.18   Memory: 14.3mb
Jacobi SOR                 MFLOPS: 32.8    Memory: 22.7mb
Monte Carlo Integration    MFLOPS: 42.66   Memory: 22.8mb
Sparse Matrix Multiply     MFLOPS: 2       Memory: 31.1mb
LU Factorization           MFLOPS: 22.7    Memory: 13.5mb

FRR+7197
Fast Fourier Transform     MFLOPS: 22.9    Memory: 9.1mb
Jacobi SOR                 MFLOPS: 38.45   Memory: 10.7mb
Monte Carlo Integration    MFLOPS: 41.32   Memory: 10.7mb
Sparse Matrix Multiply     MFLOPS: 2.16    Memory: 12.1mb
LU Factorization           MFLOPS: 26.45   Memory: 9.4mb

Comment 25

6 years ago
Comment on attachment 596297 [details] [diff] [review]
patch E v1: emulate former handling of all-free blocks

R+. Is there a reason why  GCAssertMsg(mq != kQueued.. ) is not there in LazySweepPass() along with other checks in the block items mark check loop? 

I also think that the comment on top of LazySweepPass() should be modified to account for this change.
Attachment #596297 - Flags: review?(rulohani) → review+
(In reply to Ruchi Lohani from comment #25)
> Comment on attachment 596297 [details] [diff] [review]
> patch E v1: emulate former handling of all-free blocks
> 
> R+. Is there a reason why  GCAssertMsg(mq != kQueued.. ) is not there in
> LazySweepPass() along with other checks in the block items mark check loop? 

I think I was being conservative in not introducing code that wasn't there before the injection changeset 6402:c8459ef051b0.  But I'm honestly not sure.

I'll need to revisit this anyway to fix the comment you noted, so I'll look into the assertion situation when I do that.
(Assignee)

Updated

5 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED