Closed Bug 575316 Opened 14 years ago Closed 14 years ago

Reengineer per-object bits to provide for more bits

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: lhansen, Assigned: lhansen)

References

Details

(Whiteboard: has-patch)

Attachments

(4 files, 4 obsolete files)

(Spun off from bug #516156.)

A number of useful projects - exact marking, generations among them - depend on having more per-object bits available.
Attached patch Preliminary patch (obsolete)
Rescued from the parent bug, painstakingly rebased to redux 4882, tested on MacOS, test-compiled on Windows.  Confidence level is medium; I also need to run all benchmarks (for acceptance testing) and to run acceptance on more platforms.  I also need to measure performance and memory consumption, although frankly, the memory consumption will have to be what it is - we need more bits.

There are really two changes rolled into one here.

The big change is a general refactoring that changes the representation of object bits and, as a result, changes many APIs internal to the GC; also as a result, the number of GC bits per object now needs to be an integral number of bytes.  This is what you get if MMGC_FASTBITS is disabled in MMgc.h.

The second change is MMGC_FASTBITS - a new representation for the object bit /array/, which allows the array to have holes that go unused.  The purpose of the representation is to make it quicker to compute the address of the bit vector corresponding to a particular object.  (If desirable I can break this out as a separate patch.)

Apart from more testing and benchmarking, I need to go through and add some documentation here and there, at least.

(For the purpose of making the review easier I've neglected to re-indent some code in GCAlloc.cpp whose controlling statements were coalesced.)
Attached patch Refactor mark bits, make more space available (obsolete)
See description of previous patch.  This one comes without the MMGC_FASTBITS changes, though it anticipates them in that it makes a distinction between the offset of an object within its block and the offset of that object's bits within the block's bit vector.  Without MMGC_FASTBITS that distinction would not have been very interesting.

In sum, this patch increases the number of mark bits per object from four to eight, and requires further increases to be in multiples of eight bits.  It implements a number of simplifications that fall out from that change.
Attachment #454579 - Attachment is obsolete: true
Attachment #455455 - Flags: review?(treilly)
Attachment #455455 - Flags: review?(fklockii)
Attached patch Implement MMGC_FASTBITS
This implements MMGC_FASTBITS on top of the previous patch.  The comment in GCAlloc.cpp describes the change in detail, but effectively we allow holes in the bit vector in order to compute the offset of the object's mark bits within the bit vector more efficiently.

It's unclear how important this change is, as of yet.  I won't insist on landing it, but I would like to have it reviewed.
Attachment #455456 - Flags: review?(treilly)
Attachment #455456 - Flags: review?(fklockii)
Blocks: 576307
Whiteboard: has-patch
Comment on attachment 455455 [details] [diff] [review]
Refactor mark bits, make more space available

An update to this patch will be posted later.  The difference between the patch and the new patch is that four instances of 

b->bits[...] |= kFreelist;

will change to

b->bits[...] = kFreelist;

(Those are all the instances of the former pattern in the MMgc code.)

Please review with that in mind.
(In reply to comment #4)
> (From update of attachment 455455 [details] [diff] [review])
> An update to this patch will be posted later.  The difference between the patch
> and the new patch is that four instances of 
> 
> b->bits[...] |= kFreelist;
> 
> will change to
> 
> b->bits[...] = kFreelist;
> 
> (Those are all the instances of the former pattern in the MMgc code.)
> 
> Please review with that in mind.

Just to be clear: the former |= variant was keeping in the spirit of a refactoring of the existing code (which invoked SetBit), while the latter = variant is a (sound) change to the behavior, correct?

Is this fixing a bug?  Or was the code previously clearing the other bits elsewhere in the control flow, and this is more of a local performance fix?
(In reply to comment #5)
> 
> Just to be clear: the former |= variant was keeping in the spirit of a
> refactoring of the existing code (which invoked SetBit), while the latter =
> variant is a (sound) change to the behavior, correct?

A necessary change.

> Is this fixing a bug?

Yes.  We want to make sure that when something moves onto the free list, all other bits are reset so that they do not linger, because we do not always properly clear bits when we pick an object off the free list.  The invariant is that objects on the free list have kFreelist set, no other bits.  CreateChunk establishes the invariant.
Depends on: 576576
Incorporates the dependency on bug #576576, folds in the fixes I mentioned earlier, and has one more change in GCAlloc::AllocFromQuickList, where the flags on the object are cleared and the kFinalizable flag is set if appropriate.  Previously it would only clear the low four bits, but that's not correct: the object is fresh and should have cleared flags.
Attachment #455455 - Attachment is obsolete: true
Attachment #456265 - Flags: review?(treilly)
Attachment #456265 - Flags: review?(fklockii)
Attachment #455455 - Flags: review?(treilly)
Attachment #455455 - Flags: review?(fklockii)
(In reply to comment #6)
> The invariant is
> that objects on the free list have kFreelist set, no other bits.

The correct invariant is that objects on the free list have kFreelist set.  CreateChunk, GCAlloc::Free, and the garbage collector all maintain that invariant.  Objects on the free list that were freed explicitly or by DRC may have some other bits set additionally; the additional bits are cleared in GCAlloc::AllocFromQuickList.
Comment on attachment 456265 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

How's performance?  Isn't working with machine-word-sized pointers faster than byte pointers (gcbits&) in general?  Or is that old dumb compiler wisdom?
Comment on attachment 456265 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Random thought: could we make GCAlloc::GetBitsIndex return 0 for LargeObjects and fold more large/small code together?  items would have to go into GCBlockHeader, which would make it bigger for GCLarge.  items is a pretty simple calculation, so we could get rid of it (gaining 4 bytes at the cost of an extra read and some simple math).
Attachment #456265 - Flags: review?(treilly) → review+
(In reply to comment #10)
> Comment on attachment 456265 [details] [diff] [review]
> Refactor mark bits, make more space available (v2)
> 
> How's performance?   Isn't working with machine word sized pointers faster than
> byte pointers (gcbits&) in general?   Or is that old dumb compiler wisdom?

Hmm; I was worried about potential lossage due to moving from
register-allocatable values to C++-references [1], but I had not thought of the
word/byte size issue.

[1] Lars and I chatted about this in the asteam chat room; he suggested
exposing separate accessors for value- or reference- semantics. 
http://asteam.corp.adobe.com/irc/log/asteam/asteam/20100702
Comment on attachment 456265 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

I bit the bullet and did a tiny bit of comparison of disassembled routines.  My sample was very small; just two routines: GC::ClearUnmarkedWeakRefs and GC::WriteBarrierHit.  Skimming them was enough to convince me that there are not any obvious problems with using C++ references.  (The ordering of quantifiers is important there.)

I'll attach the relevant portions of my transcript, for completeness.
Attachment #456265 - Flags: review?(fklockii) → review+
Attached file baseline disassembly —
baseline disassembly; contrast against disassembly post-refactor via attachment 456265 [details] [diff] [review] (to be posted next).
Attached file refactor disassembly —
disassembly post-refactor via attachment 456265 [details] [diff] [review]; contrast against baseline disassembly (attachment 462752 [details])
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

(reviewing)
Rebase of original patch to the tip, to save Lars and Tommy the trouble.  (I've already gone through this rebasing process twice, so I thought I'd put the rebased version up to avoid duplicating that effort again.)

Lars: diffing the patches themselves should indicate that I did not change the patch itself.
Attachment #456265 - Attachment is obsolete: true
(In reply to comment #17)
> Lars: diffing the patches themselves should indicate that I did not change the
> patch itself.

(diffing them will tell you that I did accidentally drop Lars's header describing the patch; it said: "New mark bits structure")
Comment on attachment 465657 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Marking rebase of patch as self-reviewed. 

Requesting super-review from Lars mostly as a formality.

(Note that the previous patch, originally authored by Lars, had been reviewed by Tommy and myself.)
Attachment #465657 - Flags: superreview?(lhansen)
Attachment #465657 - Flags: review+
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as?  (By 17-35% (!)?)

(I thought it might be an artifact of time improvements changing GC behavior, but after turning off incremental GC I still see a big improvement in memory usage on FFT.)

Here's the relevant details:

avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell -Dnoincgc version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell -Dnoincgc version: cyclone
iterations: 5

                                  avm:cyclone     avm2:cyclone  
test                              best     avg    best     avg   %dBst   %dAvg
Metric: memory 
...
jsbench/FFT.as                   41.2M   31.6M   26.7M   26.0M   35.19   17.65 +   
...
jsbench/typed/FFT.as              3.1M    3.1M    3.1M    3.1M       0   -0.65     
...


avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell  version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell  version: cyclone
iterations: 5

                                  avm:cyclone     avm2:cyclone  
test                              best     avg    best     avg   %dBst   %dAvg
Metric: time 
...
jsbench/FFT.as                  12579  12611.4  12010  12144.8    4.52    3.70 +   
...
jsbench/typed/FFT.as              7919  7934.6    7594  7616.4    4.10    4.01 +
(In reply to comment #20)
> Comment on attachment 455456 [details] [diff] [review]
> Implement MMGC_FASTBITS
> 
> Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as?  (By 17-35%
> (!)?)

Clarification: when running in incremental GC mode, the memory usage improvement is "only" 9-22%; the above 17-35% figure is just from the -Dnoincgc run.

avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell  version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell  version: cyclone
iterations: 5


                                  avm:cyclone     avm2:cyclone  
test                              best     avg    best     avg   %dBst   %dAvg
Metric: memory 
...
jsbench/FFT.as                   39.5M   36.0M   35.9M   28.0M    9.11   22.12
...
Comment on attachment 465657 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Rubber stamp.
Attachment #465657 - Flags: superreview?(lhansen) → superreview+
(In reply to comment #20)
> 
> Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as?
> (By 17-35% (!)?)

No.

(In reply to comment #21)
> Clarification: when running in incremental GC mode, the memory usage
> improvement is "only" 9-22%; the above 17-35% figure is just from the
> -Dnoincgc run.

No, even so :-)

I assume the comparison here is between new mark bits on the one hand and new mark bits plus mmgc_fastbits on the other?  Or is the baseline an unmodified redux?

Anyway, some speculation:

MMGC_FASTBITS will definitely change memory dynamics because fewer bitmaps will be allocated on-page - the new mark bits by themselves require twice the memory, and with MMGC_FASTBITS we require more still.  If there is a lot of floating point data on the stack it is conceivable that the changed dynamics results in different false retention patterns.  Since memory management is essentially chaotic and programs like FFT will allocate some arrays, a small change could have a large effect.

It's not a great explanation because jsbench/FFT.as uses mainly untyped variables and data, and everything should be boxed, and floating point boxes are handled as non-pointer-containing.
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Could we use FindOneBit on size instead of adding a new field?   Do all architectures have a log2n instruction like x86 (I think so)?
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Why does MMGC_FASTBITS have its own GetGCBits?  Seems like the one that calls GetBitsIndex should work fine, no?
(In reply to comment #24)
> Comment on attachment 455456 [details] [diff] [review]
> Implement MMGC_FASTBITS
> 
> Could we use FindOneBit on size instead of adding a new field?   Do all
> architectures have a log2n instruction like x86 (I think so)?

Of all the platforms we support so far, the only one without a find-bit instruction seems to be SH4.  The gcc intrinsic is __builtin_ctz[ll], and the MSVC intrinsic is _BitScanForward[64].

I'm not into the code enough to argue whether a load is faster/slower than using the processor intrinsic, however.
And rebased again to account for recent GC changes.  The MMGC_FASTBITS patch is still good.
Attachment #465657 - Attachment is obsolete: true
Some data (first of several).  Conclusions will be posted after all data.

The system is my macpro, not doing much.
avm1 is the baseline, TR 5067.
avm2 is baseline+refactoring.
compiled with --enable-shell, otherwise defaults.

Metric: time 
jsbench/Crypt.as                3348    3330    0.5    0.5
jsbench/Euler.as                6587    6396    2.9    2.9
jsbench/FFT.as                  6481    6682   -3.1   -3.1
jsbench/HeapSort.as             2789    2784    0.2    0.2
jsbench/LUFact.as               7488    7600   -1.5   -1.5
jsbench/Moldyn.as               8612    8480    1.5    1.5
jsbench/RayTracer.as            6877    6895   -0.3   -0.3
jsbench/Series.as               8055    7860    2.4    2.4
jsbench/SOR.as                 30949   29360    5.1    5.1
jsbench/SparseMatmult.as       11808   11746    0.5    0.5
jsbench/typed/Crypt.as           815     822   -0.9   -0.9
jsbench/typed/Euler.as          8310    8053    3.1    3.1
jsbench/typed/FFT.as            4060    3861    4.9    4.9
jsbench/typed/HeapSort.as       1380    1434   -3.9   -3.9
jsbench/typed/LUFact.as         8062    7550    6.4    6.4
jsbench/typed/Moldyn.as         3692    3553    3.8    3.8
jsbench/typed/RayTracer.as      1039    1029    1.0    1.0
jsbench/typed/Series.as         7007    6804    2.9    2.9
jsbench/typed/SOR.as           22808   21176    7.2    7.2
jsbench/typed/SparseMatmult.as  2221    2598  -17.0  -17.0

I've run this twice; the second run was with the binaries reversed to see if that mattered.  (It didn't, mostly.)  The general summary is that overall the trend is positive, there are minor slowdowns (typed/SparseMatmult dropped to -0.4% on the second run), and there is a persistent 3.5-4% slowdown on typed/HeapSort.as.
More data, same system.

Metric: time 
mmgc/gcbench.as                 2652    2271   14.4   14.4
mmgc/ofib-rc.as                  213     206    3.3    3.3
mmgc/ofib.as                     987     967    2.0    2.0
mmgc/sfib.as                     333     321    3.6    3.6

I did three runs.  The 14% speedup on gcbench is persistent, as is the speedup on sfib.  ofib-rc was always positive, ofib fluctuated from -0.4 to +2.0.
More data, same system.

Metric: v8 custom v8 normalized metric (hardcoded in the test)
v8.5/js/crypto.as                437     445    1.8    1.8
v8.5/js/deltablue.as             387     390    0.8    0.8
v8.5/js/earley-boyer.as         1219    1244    2.1    2.1
v8.5/js/raytrace.as              781     813    4.1    4.1
v8.5/js/regexp.as                107     111    3.7    3.7
v8.5/js/richards.as              319     316   -0.9   -0.9
v8.5/js/splay.as                 836     986   17.9   17.9
v8.5/optimized/crypto.as        4549    4510   -0.9   -0.9
v8.5/optimized/deltablue.as     3798    3883    2.2    2.2
v8.5/optimized/earley-boyer.as  1213    1233    1.6    1.6
v8.5/optimized/raytrace.as      9038    9587    6.1    6.1
v8.5/optimized/regexp.as         108     112    3.7    3.7
v8.5/optimized/richards.as      4455    4664    4.7    4.7
v8.5/optimized/splay.as         6105    6691    9.6    9.6
v8.5/typed/crypto.as            2654    2706    2.0    2.0
v8.5/typed/deltablue.as         4122    4207    2.1    2.1
v8.5/typed/earley-boyer.as      1218    1246    2.3    2.3
v8.5/typed/raytrace.as          9047    9579    5.9    5.9
v8.5/typed/regexp.as             108     112    3.7    3.7
v8.5/typed/richards.as          4455    4658    4.6    4.6
v8.5/typed/splay.as             1249    1289    3.2    3.2
v8.5/untyped/crypto.as           482     476   -1.2   -1.2
v8.5/untyped/deltablue.as       1942    1966    1.2    1.2
v8.5/untyped/earley-boyer.as    1216    1256    3.3    3.3
v8.5/untyped/raytrace.as        3544    3705    4.5    4.5
v8.5/untyped/regexp.as           108     111    2.8    2.8
v8.5/untyped/richards.as         501     506    1.0    1.0
v8.5/untyped/splay.as           1188    1206    1.5    1.5

Two runs, the data appear stable.  A clear positive trend, some significant wins, slowdowns are all minor.
More data, same system.  I've purged differences in the range -2.5..2.5.

Metric: v8 custom v8 normalized metric (hardcoded in the test)
asmicro/alloc-1.as                44      48    9.1    9.1
asmicro/alloc-10.as               16      17    6.2    6.2
asmicro/alloc-11.as               16      17    6.2    6.2
asmicro/alloc-13.as               95     106   11.6   11.6
asmicro/alloc-14.as               82      90    9.8    9.8
asmicro/alloc-2.as                22      23    4.5    4.5
asmicro/alloc-3.as                18      19    5.6    5.6
asmicro/alloc-4.as                51      58   13.7   13.7
asmicro/alloc-5.as                35      38    8.6    8.6
asmicro/alloc-6.as                78      83    6.4    6.4
asmicro/alloc-7.as                39      44   12.8   12.8
asmicro/alloc-8.as                18      19    5.6    5.6
asmicro/alloc-9.as                18      19    5.6    5.6
asmicro/array-2.as               772     724   -6.2   -6.2
asmicro/array-shift-1.as         148     157    6.1    6.1
asmicro/array-slice-1.as          21      22    4.8    4.8
asmicro/array-sort-1.as           29      30    3.4    3.4
asmicro/array-sort-3.as           22      23    4.5    4.5
asmicro/array-unshift-1.as       185     179   -3.2   -3.2
asmicro/closedvar-write-2.as    5208    5999   15.2   15.2
asmicro/funcall-4.as             119     128    7.6    7.6
asmicro/globalvar-read-1.as     4984    4779   -4.1   -4.1
asmicro/globalvar-write-1.as    6526    4739  -27.4  -27.4
asmicro/isNaN-1.as               421     574   36.3   36.3
asmicro/number-toString-2.as      72      78    8.3    8.3
asmicro/parseInt-1.as            208     194   -6.7   -6.7
asmicro/regex-exec-1.as           67      71    6.0    6.0
asmicro/regex-exec-2.as           79      85    7.6    7.6
asmicro/regex-exec-3.as          137     143    4.4    4.4
asmicro/regex-exec-4.as          306     324    5.9    5.9
asmicro/restarg-1.as             883     855   -3.2   -3.2
asmicro/restarg-2.as             491     519    5.7    5.7
asmicro/restarg-3.as              45      48    6.7    6.7
asmicro/restarg-4.as              30      34   13.3   13.3
asmicro/string-casechange-1.as    26      29   11.5   11.5
asmicro/string-casechange-2.as    26      29   11.5   11.5
asmicro/string-charAt-2.as        73      80    9.6    9.6
asmicro/string-fromCharCode-2.as  64      72   12.5   12.5
asmicro/string-indexOf-1.as      215     220    2.3    2.3
asmicro/string-slice-1.as        124     139   12.1   12.1
asmicro/string-substring-1.as    130     146   12.3   12.3
asmicro/switch-2.as               47      49    4.3    4.3
asmicro/try-1.as                 203     255   25.6   25.6
asmicro/try-2.as                  16      18   12.5   12.5
asmicro/try-3.as                  57      64   12.3   12.3
asmicro/vector-push-1.as          47      44   -6.4   -6.4

These benchmarks are inordinately sensitive to small perturbations (alignment on the stack and in the cache) so use with care, I guess.  Again a very clear positive trend.
Based on the above data it seems pretty clear that the change does not cause a slowdown in any systematic way, though individual tests may be affected; we don't know whether that's because we're using more memory (we are) or because of alignment or other issues.  Since the general trend is a speedup across the board, it appears that the new code paths are beneficial.

Since the patch was reviewed and approved, and the numbers look good, I'm inclined to land, and will do so tomorrow if I don't hear otherwise.  Objections need to be registered today :-)
(In reply to comment #32)
> Since the patch was reviewed and approved, and the numbers look good, I'm
> inclined to land, and will do so tomorrow if I don't hear otherwise. 
> Objections need to be registered today :-)

You landing both, or just the refactoring?

(I won't object to landing both.  I had not yet r+'ed MMGC_FASTBITS because I wanted to better understand GC and GCAlloc before doing so, but that should not hold things up.)
(In reply to comment #33)
> You landing both, or just the refactoring?

The latter, since the former has not been reviewed yet and there's a discussion still ongoing.

(Performance data for fastbits coming up.)
Second set of data (first chunk of several).  Conclusions will be posted after all data.

The system is my macpro, not doing much.
avm1 is the baseline, TR 6067+refactoring.
avm2 is the baseline+fastbits
compiled with --enable-shell, otherwise defaults.


Metric: time 
jsbench/Crypt.as                3333    3346   -0.4   -0.4
jsbench/Euler.as                6393    6246    2.3    2.3
jsbench/FFT.as                  6509    6464    0.7    0.7
jsbench/HeapSort.as             2782    2789   -0.3   -0.3
jsbench/LUFact.as               7509    7370    1.9    1.9
jsbench/Moldyn.as               8445    8211    2.8    2.8
jsbench/RayTracer.as            6864    6520    5.0    5.0
jsbench/Series.as               7803    7609    2.5    2.5
jsbench/SOR.as                 29189   28256    3.2    3.2
jsbench/SparseMatmult.as       11800   11638    1.4    1.4
jsbench/typed/Crypt.as           822     822    0.0    0.0
jsbench/typed/Euler.as          8069    7789    3.5    3.5
jsbench/typed/FFT.as            3867    3770    2.5    2.5
jsbench/typed/HeapSort.as       1432    1383    3.4    3.4
jsbench/typed/LUFact.as         7558    7272    3.8    3.8
jsbench/typed/Moldyn.as         3555    3617   -1.7   -1.7
jsbench/typed/RayTracer.as      1028    1026    0.2    0.2
jsbench/typed/Series.as         6788    6591    2.9    2.9
jsbench/typed/SOR.as           21018   19925    5.2    5.2
jsbench/typed/SparseMatmult.as  2172    2174   -0.1   -0.1

Generally a positive trend, but few numbers fall outside the 2.5% range, so it's hard to say that the change makes a big difference.
More data: v8.5, five iterations.

                             avm:cyclone     avm2:cyclone  
test                         best     avg    best     avg   %dBst   %dAvg
Metric: v8 custom v8 normalized metric (hardcoded in the test)
js/crypto.as                  448   447.8     458     458    2.23    2.28 +   
js/deltablue.as               393   392.6     393   392.6       0       0     
js/earley-boyer.as           1278  1271.4    1303  1297.4    1.96    2.04 +   
js/raytrace.as                817   815.4     817     812       0   -0.42     
js/regexp.as                  112   111.2     106   105.6   -5.36   -5.04 --  
js/richards.as                316   315.8     321   320.8    1.58    1.58 +   
js/splay.as                   989   985.2     993   987.8    0.40    0.26     
optimized/crypto.as          4523  4520.6    4570  4564.4    1.04    0.97 +   
optimized/deltablue.as       3937  3917.8    3972    3962    0.89    1.13 +   
optimized/earley-boyer.as    1246    1239    1285  1281.4    3.13    3.42 +   
optimized/raytrace.as        9607  9595.4    9784  9760.6    1.84    1.72 +   
optimized/regexp.as           112     112     107   106.6   -4.46   -4.82 -   
optimized/richards.as        4670  4666.4    4753  4747.4    1.78    1.74 +   
optimized/splay.as           6705  6677.4    6881    6860    2.62    2.73 +   
typed/crypto.as              2708  2706.2    2723  2720.8    0.55    0.54 +   
typed/deltablue.as           4240  4221.8    4289  4273.2    1.16    1.22 +   
typed/earley-boyer.as        1246  1242.8    1289    1286    3.45    3.48 +   
typed/raytrace.as            9607  9589.8    9774  9752.8    1.74    1.70 +   
typed/regexp.as               112   111.8     107   106.2   -4.46   -5.01 -   
typed/richards.as            4670  4666.4    4746  4744.8    1.63    1.68 +   
typed/splay.as               1288  1277.4    1239  1234.6   -3.80   -3.35 -   
untyped/crypto.as             477   475.4     500   499.2    4.82    5.01 +   
untyped/deltablue.as         1968  1962.2    1902  1894.8   -3.35   -3.43 -   
untyped/earley-boyer.as      1255  1245.8    1285    1281    2.39    2.83 +   
untyped/raytrace.as          3694    3683    3767  3754.8    1.98    1.95 +   
untyped/regexp.as             111     111     106   105.8   -4.50   -4.68 -   
untyped/richards.as           510   508.2     513   512.4    0.59    0.83 +   
untyped/splay.as             1210  1203.6    1193  1191.8   -1.40   -0.98 -   

Generally the trend is positive, though the regexp benchmark has a strong negative trend.
According to Shark the regexp benchmark spends 30% of its time in regular expression matching.  That code is not affected by the GC but is a tight loop that is sensitive to small perturbations (as when a function is aligned differently in a cache line or on a page).  That's a credible explanation for the negative trend on that benchmark.
Another interesting data point, maybe not pertinent here, is that the regexp benchmark really hammers on FixedMalloc - actually about 33% of the time goes to malloc and free, called from the regexp code.
(In reply to comment #26)
> (In reply to comment #24)
> > Comment on attachment 455456 [details] [diff] [review] [details]
> > Implement MMGC_FASTBITS
> > 
> > Could we use FindOneBit on size instead of adding a new field?   Do all
> > architectures have a log2n instruction like x86 (I think so)?
> 
> Of all the platforms we support so far, the only one without a find-bit
> instruction seems to be SH4.  The gcc intrinsic is __builtin_ctz[ll], and the
> MSVC intrinsic is _BitScanForward[64].
> 
> I'm not into the code enough to argue whether a load is faster/slower than
> using the processor intrinsic, however.

I'm not sure what the benefit would be.  Now we have:

  load a value (bitsShift)
  use that to variable-shift another value

If we were to use a log instruction, we'd have

  load a value (size)
  log
  use that to variable-shift another value

Ergo more work where we want less.  I also don't know if shifting with the log of the size would make sense for large objects.

I agree that it would be good to reduce the amount of per-block data, but the goal of the patch is to make it faster to find the bits for an object, and I'm willing to spend a little to make that possible.

Am I missing something?
(In reply to comment #25)
> Comment on attachment 455456 [details] [diff] [review]
> Implement MMGC_FASTBITS
> 
> Why does MMGC_FASTBITS have its own GetGCBits, seems like the one that calls
> GetBitsIndex should work fine no?

I don't understand the question.
(In reply to comment #39)
> Am I missing something?

I'm operating on the hypothesis that reading the memory is gonna be many times more expensive than the log instruction, so they would be roughly equivalent.  Another factor is that we have 4 bytes to play with on 32-bit systems, since the block header is currently 44 bytes.  I was also thinking that the real savings here probably come from not having to read shift and multiple from the GCAlloc instance, and that we may get similar speedups by just moving those to the block instead.
(In reply to comment #40)
> (In reply to comment #25)
> > Comment on attachment 455456 [details] [diff] [review] [details]
> > Implement MMGC_FASTBITS
> > 
> > Why does MMGC_FASTBITS have its own GetGCBits, seems like the one that calls
> > GetBitsIndex should work fine no?
> 
> I don't understand the question.

Seems like unnecessary code duplication; GetGCBits could just call GetBitsIndex instead of inlining it by hand.
Comment on attachment 466991 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

tamarin-redux changeset:   5068:8bd75a9d9735 (patch)
tamarin-redux changeset:   5084:a750dcce0c16 (merge)
(In reply to comment #41)
> (In reply to comment #39)
> > Am I missing something?
> 
> I'm operating on the hypothesis that reading the memory is gonna be many
> times more expensive than the log instruction so they would be roughly
> equivalent.

I still don't get it.  If we were to compute the log of the size every time
we'd still have to load the size field, so we'd just be adding the log instruction, and not getting rid of any instructions.  Can you please be a lot more specific about what you have in mind?
 
> Another factor is that we have 4 bytes to play with on 32 bit systems since
> the block header is currently 44 bytes.   I was also thinking that the real
> savings here probably come from not having to read shift and multiple from
> the GCAlloc instance and that we may get similar speedups by just moving
> those to the block instead.

Again, can you please be a lot more specific?  I don't understand what code you're referring to.
(In reply to comment #42)
> Seems like unnecessary code duplication, GetGCBits could just call GetBitsIndex
> instead of inlining it by hand.

I see.  Could be a consequence of refactoring, I'll consider it.
> I still don't get it.  If we were to compute the log of the size every time
> we'd still have to load the size field, so we'd just be adding the log
> instruction, and not getting rid of any instructions.  Can you please be a lot
> more specific about what you have in mind?

If it costs 100 cycles to load the memory and 1 cycle to do the log, then there's not much diff between 101 and 100 cycles.
 
> > Another factor is that we have 4 bytes to play with on 32 bit systems since
> > the block header is currently 44 bytes.   I was also thinking that the real
> > savings here probably come from not having to read shift and multiple from
> > the GCAlloc instance and that we may get similar speedups by just moving
> > those to the block instead.
> 
> Again, can you please be a lot more specific?  I don't understand what code
> you're referring to.

sizeof(GCAlloc::GCBlock) == 44 bytes, and GC objects are 8-byte aligned, so there's a 4-byte pocket to play with; adding the shift to the header didn't cost us anything.  But going on my earlier theory that loading memory is the true cost, maybe the shift/multiple approach would see the same speedup from moving those values into the block.  I'm basically trying to prove to myself that the (minor) memory wastage we're introducing is actually worth it.  I'll run some benchmarks on refactor+fastbits vs. refactor + move shift/multiple into block header and report back.
Results of refactor vs. fastbits, followed by refactor vs. move-multiply/shift-to-block, followed by fastbits vs. move-multiply/shift-to-block:


avm: /tmp/avmshell-refactor  version: cyclone
avm2: /tmp/avmshell-fastbits  version: cyclone
iterations: 1

test                           avm:cyclone avm2:cyclone   %diff

Metric: time 
jsbench/Crypt.as                5915    5944   -0.5   -0.5
jsbench/Euler.as               11240   10733    4.5    4.5
jsbench/FFT.as                 11279   10609    5.9    5.9
jsbench/HeapSort.as             4491    4557   -1.5   -1.5
jsbench/LUFact.as              11522   11355    1.4    1.4
jsbench/Moldyn.as              15906   14839    6.7    6.7
jsbench/RayTracer.as           11514   10818    6.0    6.0
jsbench/Series.as              12947   12759    1.5    1.5
jsbench/SOR.as                 51644   49095    4.9    4.9
jsbench/SparseMatmult.as       17896   17771    0.7    0.7
jsbench/typed/Crypt.as          1151    1145    0.5    0.5
jsbench/typed/Euler.as         13826   12771    7.6    7.6
jsbench/typed/FFT.as            6162    5827    5.4    5.4
jsbench/typed/HeapSort.as       2191    2273   -3.7   -3.7
jsbench/typed/LUFact.as        11919   11446    4.0    4.0
jsbench/typed/Moldyn.as         5613    5306    5.5    5.5
jsbench/typed/RayTracer.as      1767    1720    2.7    2.7
jsbench/typed/Series.as        11929   10945    8.2    8.2
jsbench/typed/SOR.as           38646   34911    9.7    9.7
jsbench/typed/SparseMatmult.as  2730    2735   -0.2   -0.2
treilly-MacBookPro:performance treilly$ ./runtests.py --avm /tmp/avmshell-refactor --avm2 /tmp/avmshell-ms-move-to-block jsbench/
Tamarin tests started: 2010-08-20 11:57:01.101088
Executing 21 test(s)
avm: /tmp/avmshell-refactor  version: cyclone
avm2: /tmp/avmshell-ms-move-to-block  version: cyclone
iterations: 1

test                           avm:cyclone avm2:cyclone   %diff

Metric: time 
jsbench/Crypt.as                5941    5913    0.5    0.5
jsbench/Euler.as               11237   11067    1.5    1.5
jsbench/FFT.as                 11227   11062    1.5    1.5
jsbench/HeapSort.as             4491    4584   -2.1   -2.1
jsbench/LUFact.as              11494   11484    0.1    0.1
jsbench/Moldyn.as              15926   15286    4.0    4.0
jsbench/RayTracer.as           11475   11323    1.3    1.3
jsbench/Series.as              12917   13211   -2.3   -2.3
jsbench/SOR.as                 51615   52021   -0.8   -0.8
jsbench/SparseMatmult.as       17837   17754    0.5    0.5
jsbench/typed/Crypt.as          1160    1157    0.3    0.3
^C
Keyboard Interupt
treilly-MacBookPro:performance treilly$ 
treilly-MacBookPro:performance treilly$ ./runtests.py --avm /tmp/avmshell-fastbits --avm2 /tmp/avmshell-ms-move-to-block jsbench/typed/
Tamarin tests started: 2010-08-20 12:02:50.487311
Executing 10 test(s)
avm: /tmp/avmshell-fastbits  version: cyclone
avm2: /tmp/avmshell-ms-move-to-block  version: cyclone
iterations: 1

test                           avm:cyclone avm2:cyclone   %diff

Metric: time 
jsbench/typed/Crypt.as          1146    1158   -1.0   -1.0
jsbench/typed/Euler.as         12784   13723   -7.3   -7.3
jsbench/typed/FFT.as            5829    6014   -3.2   -3.2
jsbench/typed/HeapSort.as       2275    2234    1.8    1.8
jsbench/typed/LUFact.as        11503   11616   -1.0   -1.0
jsbench/typed/Moldyn.as         5345    5658   -5.9   -5.9
jsbench/typed/RayTracer.as      1731    1751   -1.2   -1.2
jsbench/typed/Series.as        10929   11654   -6.6   -6.6
jsbench/typed/SOR.as           35012   35275   -0.8   -0.8
jsbench/typed/SparseMatmult.as  2733    2733    0.0    0.0
Results show:

- an even greater speedup due to fastbits than Lars reported (10% in one case)
- negligible speedup from moving shift/multiply to the block
Attachment #455456 - Flags: review?(treilly) → review+
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

We should consider opening a bug to replace the calls to log2 with an intrinsic where available.
Attachment #455456 - Flags: review?(fklockii) → review+
(In reply to comment #49)
> Comment on attachment 455456 [details] [diff] [review]
> Implement MMGC_FASTBITS
> 
> We should consider opening a bug to replace the calls to log2 with an intrinsic
> where available.

I would only recommend taking on the complexity of intrinsics for critical paths.   That said, we do use log2n in avmplus::Hashtable, and if we wanted to push the impl down to the VMPI layer and share it, that would be nice.
(In reply to comment #48)
> Results show:
> even greater speedup due to fastbits than lars reported (%10 in one case)
> neglible speedup by moving shift/multiple to the block

Thanks for investigating!

The 44 (actually 48) bytes of overhead in the block seems like a lot, actually - it's nearly 12%.  So it seems like a good idea to investigate ways of reducing it.  (For one thing, the items pointer can go away if we align the user objects with the start of the block and place the metainformation at the end of the block; it'll still be trivial to find: block + 4K - sizeof(meta), for example.  But I don't know yet what the consequences of removing 'items' will be elsewhere.)  I'll create a work item.
I think you mean 1.2%?   It's double that for 64 bits.
(In reply to comment #52)
> I think you mean 1.2%?   Its double that for 64 bits.

You're right - I'm off by a factor of 10 again.  Cancel red alert.
Memory investigation for the fastbits change.

This is jsbench:

                          avm:cyclone     avm2:cyclone  
test                      best     avg    best     avg   %dBst   %dAvg
Metric: memory 
Crypt.as                 42.9M   42.9M   42.9M   42.9M       0       0     
Euler.as                  4.7M    4.4M    5.3M    4.4M  -12.77   -0.91     
FFT.as                   31.3M   28.0M   30.6M   23.0M    2.24   17.93     
HeapSort.as               4.7M    4.7M    4.7M    4.7M       0       0     
LUFact.as                32.0M   24.5M   32.0M   20.7M       0   15.50     
Moldyn.as                 1.2M    962K    904K    899K   26.43    6.58 +   
RayTracer.as              1.2M   1008K    1.1M    989K    8.33    1.95     
SOR.as                   57.6M   55.4M   56.5M   56.4M    1.91   -1.70     
Series.as                11.2M   10.6M   11.2M   10.5M       0    0.75     
SparseMatmult.as         13.9M   13.4M   13.9M   13.4M       0    0.15     
typed/Crypt.as           34.7M   34.7M   34.7M   34.7M       0       0     
typed/Euler.as            936K    889K    1.7M    1.2M  -85.98  -41.13 --  
typed/FFT.as              468K    468K    468K    428K       0    8.55     
typed/HeapSort.as         4.1M    4.1M    4.1M    4.1M       0       0     
typed/LUFact.as           2.1M    2.1M    2.1M    2.1M       0    0.96     
typed/Moldyn.as           1.3M    1.0M    908K    908K   31.79   14.56 ++  
typed/RayTracer.as        368K    352K    360K    349K    2.17    0.91     
typed/SOR.as              8.5M    8.3M    8.5M    8.3M       0       0     
typed/Series.as           992K    924K    784K    760K   20.97   17.82 ++  
typed/SparseMatmult.as    4.7M    4.5M    4.7M    4.7M       0   -4.44     

The consistently worse results on Euler are worrisome; other than that it looks good.  (The use of "best" in the headings is misleading; it is really "worst", or better, "largest".  But that's a quibble.)
Memory investigation continues.

This is v8.5:

                             avm:cyclone     avm2:cyclone  
test                         best     avg    best     avg   %dBst   %dAvg
Metric: memory 
js/crypto.as                1020K    857K    1.1M    894K  -10.43   -4.35     
js/deltablue.as              552K    496K    604K    553K   -9.42  -11.43     
js/earley-boyer.as           4.6M    4.5M    4.6M    4.5M       0       0     
js/raytrace.as               340K    340K    340K    340K       0       0     
js/regexp.as                 1.4M    849K    896K    776K   37.50    8.70     
js/richards.as               340K    340K    340K    340K       0       0     
js/splay.as                 95.8M   95.8M   96.2M   96.2M   -0.42   -0.42     
optimized/crypto.as          340K    340K    340K    340K       0       0     
optimized/deltablue.as       964K    788K    904K    856K    6.22   -8.62     
optimized/earley-boyer.as    4.6M    4.5M    4.7M    4.6M   -2.17   -3.11     
optimized/raytrace.as        340K    340K    340K    340K       0       0     
optimized/regexp.as          1.4M    838K    872K    801K   39.17    4.43     
optimized/richards.as        340K    340K    340K    340K       0       0     
optimized/splay.as          40.2M   40.2M   40.7M   40.7M   -1.24   -1.19 -   
typed/crypto.as              664K    616K    732K    650K  -10.24   -5.58     
typed/deltablue.as           996K    835K    1.0M    937K   -2.81  -12.26     
typed/earley-boyer.as        4.8M    4.6M    4.6M    4.6M    4.17    0.43 +   
typed/raytrace.as            340K    340K    340K    340K       0       0     
typed/regexp.as              736K    670K    820K    768K  -11.41  -14.56     
typed/richards.as            340K    340K    340K    340K       0       0     
typed/splay.as              67.0M   67.0M   67.6M   67.6M   -0.90   -0.90     
untyped/crypto.as            1.2M    867K    1.1M    770K    8.33   11.22     
untyped/deltablue.as        1008K    880K    804K    694K   20.24   21.09 +   
untyped/earley-boyer.as      4.5M    4.5M    4.7M    4.5M   -4.44    0.44 -   
untyped/raytrace.as          1.2M    971K    1.3M    1.0M   -8.33   -5.65     
untyped/regexp.as            796K    694K    760K    739K    4.52   -6.45     
untyped/richards.as          340K    340K    340K    340K       0       0     
untyped/splay.as            67.7M   67.7M   68.2M   68.2M   -0.74   -0.74     

Apart from Earley-Boyer and Splay the heap sizes here are all puny and the results are probably not terribly interesting.  For those two benchmarks the new code is more or less a wash, memory-wise.
(In reply to comment #51)
> (For one thing, the items pointer can go away if we align the
> user objects with the start of the block and place the metainformation at the
> end of the block, it'll still be trivial to find it: block + 4K - sizeof(meta),
> for example.

Another idea would be to remove items and replace it with an offset in GCAlloc, since items is always at the same offset for each size class, e.g. items = b + b->alloc->itemsOffset.
On the mmgc tests, there were improvements on ofib and sfib but otherwise nothing to report.

Summarizing:

- there's a negative trend on v8.5
- there's a positive trend on jsbench
- there's a positive trend on mmgc
- we have one significant regression on jsbench, typed/Euler, but many big
  wins on the positive side.

Generally the memory data are inconclusive.

According to the comments in the code the worst efficiency for the bits array is 50% (half the bytes are wasted).  Suppose this happens for the smallest object size (8 bytes).  Then mark bytes should account for two out of every ten bytes, or 20% of the heap, not counting other overhead.  Suppose all the mark bits arrays previously fit on-block and now have to go off-block (not realistic).  Then a 20% increase in space is the worst we should see anywhere, for a heap consisting entirely of 8-byte objects.  Ergo the 41% growth in average heap size with fastbits relative to no fastbits is at least half due to sheer dynamics in the heap.
In an effort to get to thumbs up or down on the fastbits work I rebuilt the binaries from today's tip and reran jsbench/typed:

                                      avm          avm2
test                          best    avg   best    avg  %dBst  %dAvg
Metric: memory 
Dir: jsbench/typed/
  Crypt                      34.7M  34.7M  34.7M  34.7M      0      0   
  Euler                       1.4M  1012K   1.5M   976K   -7.1    3.5   
  FFT                         468K   468K   468K   460K      0    1.5   
  HeapSort                    4.1M   4.1M   4.1M   4.1M      0      0   
  LUFact                      2.2M   2.2M   2.3M   2.2M   -4.5      0   
  Moldyn                      964K   898K   1.3M   1.0M  -38.1  -14.2 - 
  RayTracer                   348K   341K   340K   340K    2.3    0.5 + 
  SOR                         8.5M   7.1M   8.5M   8.4M      0  -17.4   
  Series                      1.2M   1.0M   1.2M   987K      0    7.3   
  SparseMatmult               4.7M   4.7M   4.7M   4.3M      0    8.5   

Now Euler is a wash, and Moldyn, which was the poster child for the memory-reducing benefits of this patch, is suddenly the black sheep.

I'm going to conclude that the memory numbers are subject to chaos effects and that there will probably be a slight real increase due to the increased use of memory by the bit vectors.  However, that effect is not systematically visible.  Furthermore, there are performance benefits documented in a couple of different ways.

I'll land the patch.
Fastbits pushed:

tamarin-redux changeset:   5148:4555067fa143 (patch)
tamarin-redux changeset:   5151:85eb17230b1f (merge)
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: flashplayer-bug-