Last Comment Bug 601075 - bump-allocator for GC arenas
: bump-allocator for GC arenas
Status: RESOLVED FIXED
fixed-in-tracemonkey
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: ---
Assigned To: Igor Bukanov
:
Mentors:
: 650095 (view as bug list)
Depends on: 601234 656261
Blocks: 665354
  Show dependency treegraph
 
Reported: 2010-10-01 01:31 PDT by Igor Bukanov
Modified: 2011-12-25 18:06 PST (History)
12 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-
wanted


Attachments
v1 (24.76 KB, patch)
2010-10-01 07:01 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v2 (23.21 KB, patch)
2010-10-02 10:30 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v3 (46.67 KB, patch)
2010-10-04 09:04 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
testcase (535 bytes, text/plain)
2010-10-04 14:04 PDT, Gregor Wagner [:gwagner]
no flags Details
v4 (47.57 KB, patch)
2010-10-04 15:05 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
updated testcase (484 bytes, text/plain)
2010-10-23 11:59 PDT, Igor Bukanov
no flags Details
v5 (45.43 KB, patch)
2010-10-23 12:42 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v6 (45.36 KB, patch)
2011-04-28 19:24 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v7 (32.12 KB, patch)
2011-05-11 06:15 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v8 (39.71 KB, patch)
2011-05-24 15:47 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v9 (39.76 KB, patch)
2011-06-02 11:36 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
v10 (40.25 KB, patch)
2011-06-15 06:52 PDT, Igor Bukanov
wmccloskey: review+
Details | Diff | Splinter Review

Description Igor Bukanov 2010-10-01 01:31:37 PDT
Currently FinalizeArenaList assembles the free list even if after the GC the arena will be put to the free list. To avoid the performance loss when rebuilding the free list during allocation the patch uses specialized empty free lists. That leads to substantial complication of the code.

Thus the idea in FinalizeArenaList is to query before the finalization loop if the mark bitmap contains any marked things and avoid the the free list assembly if so.

This will effectively moves the free list assembly to the allocation phase for such arenas effectively trading shorter GC pause for more activity during allocation.
Comment 1 Igor Bukanov 2010-10-01 07:01:51 PDT
Created attachment 480101 [details] [diff] [review]
v1

The patch implements the above idea. The callgrind shows very tiny win for v8-splay on the scvale of 0.2% so in general the above trade-of of always building free lists during the allocation is worth the efforts and simpler code does not make things slower.
Comment 2 Gregor Wagner [:gwagner] 2010-10-01 14:26:15 PDT
Now we don't throw the the typeinformation away (typed free lists...) and have to create it from scratch if we get the arena off the freelist again. This was a win for high throughput benchmarks.
Comment 3 Gregor Wagner [:gwagner] 2010-10-01 14:28:32 PDT
Even better would be if we could move this to a helper thread. So we don't have the allocation slowdown or the GC slowdown.
Comment 4 Igor Bukanov 2010-10-02 08:38:08 PDT
(In reply to comment #2)
> Now we don't throw the the typeinformation away (typed free lists...) and have
> to create it from scratch if we get the arena off the freelist again.

Currently the free list in the empty arena is assembled during finalization. For a typical finalizer like string or object that does not write any memory this free list assembly implies the cost of writing cache lines into the main memory. This slows down the the GC. So if we can check that all things in the arena are free before doing a finalization, we can avoid that cost for empty arenas.

During the allocation the code typically write to the object/string fields so assembling the free list does not cost from the write cache point of view. Moreover, it avoids the need to read the memory into the cache at all if *all* fields of the GC things are initialized 

(In reply to comment #3)
> Even better would be if we could move this to a helper thread. So we don't have
> the allocation slowdown or the GC slowdown.

If the free list is assembled on another core or CPU, that CPU will need to write its cache context eventually. When that list is read on the allocation thread CPU it means that CPU will need to read all the cache lines competing for memory bandwidth.

For this reason I would like to try to avoid building any free lists during finalization and always do it during allocation, but that is for another bug. Here the focus is to the optimization for empty arenas.
Comment 5 Igor Bukanov 2010-10-02 10:30:54 PDT
Created attachment 480406 [details] [diff] [review]
v2

For a synthetic benchmark like 

for (var i = 0; i != 100; ++i) {
    for (var j = 0; j != 15000; ++j)
	"qwerty".substring(1);
    gc();
}
the patch shows 5-7% win according the user time and callgrind. For SS/V8 I do not see any changes on my rather noisy laptop when using the benchmark suite scripts. For v8-splay and v8-earley-boyer running under the callgring the patch shows a win on the scale of 0.2-0.3%.
Comment 6 Igor Bukanov 2010-10-04 09:04:56 PDT
Created attachment 480629 [details] [diff] [review]
v3

To minimize the amount of Arena<FreeCell> <-> Arena<T> casts the new version makes ArenaList a non-template class that uses Arena<FreeCell>* as the type of the list head and cursor.

This change allowed to use in JSCompartment a single ArenaList arenas[] array representing all arena lists replacing explicit list fields and simplifying the code that has to deal with multiple arena lists.

After this change the code contains only casts from Arena<FreeCell> to Arena<T>. They present is 3 places, in RefillFinalizableArenaList, FinalizeArenaList and during conservative marking, i.e. exactly the places where the code needs to know the type of GC things for maximal efficiency.
Comment 7 Igor Bukanov 2010-10-04 09:11:42 PDT
(In reply to comment #6)
> Created attachment 480629 [details] [diff] [review]
> v3

Another change here is to use the conservative GC machinery to locate bad pointers in gc_root_traversal.
Comment 8 Brian Hackett (:bhackett) 2010-10-04 10:08:29 PDT
(In reply to comment #6)
> Created attachment 480629 [details] [diff] [review]
> v3
> 
> To minimize the amount of Arena<FreeCell> <-> Arena<T> casts the new version
> makes ArenaList a non-template class that uses Arena<FreeCell>* as the type of
> the list head and cursor.
> 
> This change allowed to use in JSCompartment a single ArenaList arenas[] array
> representing all arena lists replacing explicit list fields and simplifying the
> code that has to deal with multiple arena lists.
> 
> After this change the code contains only casts from Arena<FreeCell> to
> Arena<T>. They present is 3 places, in RefillFinalizableArenaList,
> FinalizeArenaList and during conservative marking, i.e. exactly the places
> where the code needs to know the type of GC things for maximal efficiency.

FWIW the patch in bug 584917 also makes this change, and gets rid of the other template specializations in jsgc.h and jsgc.cpp.  That bug is blocked waiting for someone to review the changes to these two files.  Igor, would you mind looking at those?
Comment 9 Gregor Wagner [:gwagner] 2010-10-04 10:57:42 PDT
What performance improvement do you see with this patch? I see a slowdown for high throughput benchmarks like the one I posted in the arenaheader bug.
I measure the time to allocate 1000000 objects like:
    var t1 = Date.now();
    for (var j = 0; j < 1000; ++j) {
        for (var i = 0; i < 1000; ++i) {
                x.next = ({});
                x = x.next;
            }
    }
    print(Date.now() - t1);


Current tip (I edited the output that it fits in one row):
dhcp-004052:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js 
76, 89, 88, 93, 99, 104, 110, 26, 27, 27...... (stays at 27)

real	0m6.329s
user	0m6.226s
sys	0m0.101s

with this patch:
dhcp-004052:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js 
77, 94, 94, 100, 107, 113, 119, 37, 36, 36, 36... (stays at 36)

real	0m6.519s
user	0m6.415s
sys	0m0.101s

The finalization time reduction (when we GC the whole list again and all arenas are empty) is minimal but this is kind of the "best case" benchmark for this patch:
tip:
 AppTime,  Total,   Mark,  Sweep, FinObj, FinStr,  Destroy, 
9245.1,  126.8,   79.0,   47.8,   47.7,    0.1,      0.0,   

with this patch:
9949.2,  122.7,   79.2,   43.4,   43.3,    0.1,      0.0,
Comment 10 Gregor Wagner [:gwagner] 2010-10-04 11:10:55 PDT
There is also the overhead of checking the bitmap with a full heap. So perform a GC right when I created the linked-list.

I don't know why but the marking is also slower with this patch.

Average GC-time for a full heap:
AppTime,  Total,   Mark,  Sweep, FinObj, FinStr,  Destroy
Tip:
12891.3,  228.8,  208.2,   20.5,   20.4,    0.1,      0.0, 

with this patch:
13305.4,  233.4,  211.7,   21.6,   21.5,    0.1,      0.0,
Comment 11 Igor Bukanov 2010-10-04 11:52:40 PDT
(In reply to comment #8)
> FWIW the patch in bug 584917 also makes this change, and gets rid of the other
> template specializations in jsgc.h and jsgc.cpp.

Gregor, do you have time to review that patch?
Comment 12 Igor Bukanov 2010-10-04 12:19:47 PDT
(In reply to comment #9)
>     var t1 = Date.now();
>     for (var j = 0; j < 1000; ++j) {
>         for (var i = 0; i < 1000; ++i) {
>                 x.next = ({});
>                 x = x.next;
>             }
>     }
>     print(Date.now() - t1);

I assume that you have var x = {}; at the beginning.

> dhcp-004052:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js 
> 76, 89, 88, 93, 99, 104, 110, 26, 27, 27...... (stays at 27)

Hm, for me the time runs between 86 and 72. I have never get anything less. This is on iCore 7 dual-core CPU with 2.6GHz ceiling with 32 bit executable *64 bits runs slower). The patch and tm tip both reaches the value 72 but the tip show bigger fluctuations:

79 76 75 81 74 74 75 80 86 75 84 73 77 89 82 74 79 79 72 74 tip
85 78 76 83 80 74 75 73 74 72 74 73 72 72 74 74 73 74 75 75 patch

The callgrind shows 7% win. 

I will try this on a less advanced CPU.
Comment 13 Gregor Wagner [:gwagner] 2010-10-04 14:04:30 PDT
Created attachment 480733 [details]
testcase

Thats the testcase I used. I added more prints to get more information.
We have 500000 objects that are alive all the time.
So the first number is allocation of 1000000 objects, the second number is a gc run with a full heap and the third one a GC run when we finalize 100000 objects.

I see a 25% slowdown (from 26 to 37) for the allocation once the number gets stable.

Average numbers on tip:
26, 86, 48

with this patch:
37, 89, 45
Comment 14 Gregor Wagner [:gwagner] 2010-10-04 14:31:04 PDT
For v8-v4 in the shell I get this
(FROM is tip)

TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      *1.018x as slow*  2630.4ms +/- 1.2%   2677.5ms +/- 0.7%     significant

=============================================================================

  v8:             *1.018x as slow*  2630.4ms +/- 1.2%   2677.5ms +/- 0.7%     significant
    crypto:       ??                 246.5ms +/- 5.7%    255.8ms +/- 3.4%     not conclusive: might be *1.038x as slow*
    deltablue:    -                  333.0ms +/- 4.0%    331.5ms +/- 0.9% 
    earley-boyer: *1.054x as slow*   945.1ms +/- 0.9%    996.0ms +/- 0.8%     significant
    raytrace:     -                  436.7ms +/- 2.1%    431.9ms +/- 1.0% 
    regexp:       1.009x as fast     172.2ms +/- 0.8%    170.7ms +/- 0.4%     significant
    richards:     -                  167.3ms +/- 0.8%    167.0ms +/- 2.7% 
    splay:        1.015x as fast     329.6ms +/- 1.1%    324.6ms +/- 0.9%     significant
Comment 15 Igor Bukanov 2010-10-04 15:05:59 PDT
Created attachment 480752 [details] [diff] [review]
v4

I was able to reproduce the slowdown with the test case. cachegrind has shown that it comes from js_FinishTitle which where writing to its fields during finalization. That made the penalty of assembling the free list significantly less costly as the cache was written in any case. 

With this fixed the test case shows that the best first number has increases slightly with the patch from 59 to 63 while the finalization timing has decreased from 66 to 62. The marking time has stayed the same at 174. This is sort of confirms that the trade-off is about faster GC for the cost of slowing down the allocation paths for objects.

Gregor, could you test the patch on your box?
Comment 16 Gregor Wagner [:gwagner] 2010-10-04 15:31:13 PDT
I still see the 30ms slowdown on v8.

For the microbenchmark: There is still the 3ms win for the finalization but with the cost of 10ms for allocation. And again: this microbenchmark is kind of the "best case" for this patch so I am not convinced at all.
Comment 17 Gregor Wagner [:gwagner] 2010-10-04 17:48:46 PDT
An optimized browser build with this patch crashes when I want to run the v8 benchmarks with the signature: 

Thread 0 Crashed:  Dispatch queue: com.apple.main-thread
0   XUL                           	0x0000000100ec89d3 js::gc::MarkKind(JSTracer*, void*, unsigned int) + 595
1   ???                           	000000000000000000 0 + 0
2   ???                           	0x000000011ee121f0 0 + 4813038064
3   ???                           	0x000000011ed15f50 0 + 4812005200
4   ???                           	0x000000011ed165f0 0 + 4812006896
5   ???                           	0x00000001176ed2c0 0 + 4688106176
Comment 18 Igor Bukanov 2010-10-05 03:55:28 PDT
(In reply to comment #16)
> I still see the 30ms slowdown on v8.

The allocation slowdown comes from building the free list for empty arena. So I will try avoiding that at all. The idea is to have a flag that tells that the arena is empty and replace if so freeList = freeList->next with freeList++. This should avoid touching the arena during allocation.
Comment 19 Igor Bukanov 2010-10-05 04:01:57 PDT
(In reply to comment #17)
> An optimized browser build with this patch crashes when I want to run the v8
> benchmarks with the signature: 

Thanks for testing this!
Comment 20 Gregor Wagner [:gwagner] 2010-10-05 09:44:31 PDT
(In reply to comment #18)
> (In reply to comment #16)
> > I still see the 30ms slowdown on v8.
> 
> The allocation slowdown comes from building the free list for empty arena. So I
> will try avoiding that at all. The idea is to have a flag that tells that the
> arena is empty and replace if so freeList = freeList->next with freeList++.
> This should avoid touching the arena during allocation.

The freelist generation for arenas during allocation showed up in all the profiles I did for my patch. We were thinking of pointer bumping but decided to stick with the freelist model because of patch size reasons.

But I like the pointer-bumping idea a lot. Would be great to have pointer bumping for empty arenas and the freelists approach for non-empty arenas.

BTW: A clean browser build doesn't crash any more. It might have been unrelated to this patch.
Comment 21 Igor Bukanov 2010-10-23 11:59:41 PDT
Created attachment 485515 [details]
updated testcase

To run without OOM on 64 bit system under js shell  I have to shrink the number of allocated objects.
Comment 22 Igor Bukanov 2010-10-23 12:42:07 PDT
Created attachment 485523 [details] [diff] [review]
v5

The new patch implements the bump allocator for empty arenas so no free lists are constructed for them during the GC. For the test case above it shows that a typical timing changes from (21, 48, 38) to (22, 44, 31). That is, the object allocation time increases by 5% while the GC timing for mostly live heap drops by 8% and the GC time with heap filled by garbage drops by 18%.

The object allocation time increase is expected as the code first checks for a free list case. Only after checking that it is empty the bump allocator checks/increases its pointer. Thus for a case when most of the arenas are empty this will contribute at least one not taken branch. If I change the code to try the bump allocation first then the loss AFAICS disappears. Still such change may not be beneficial in practice if allocations from empty arenas do not dominate significantly over arenas with some live things. 

Still even in the current form the patch shows no changes in SS scores (the GC wins are not observable there) while V8 improves by 3% (sensitive to the GC):

TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      1.030x as fast    1980.7ms +/- 0.1%   1923.7ms +/- 0.1%     significant

=============================================================================

  v8:             1.030x as fast    1980.7ms +/- 0.1%   1923.7ms +/- 0.1%     significant
    crypto:       *1.007x as slow*   207.8ms +/- 0.5%    209.3ms +/- 0.4%     significant
    deltablue:    1.024x as fast     314.1ms +/- 0.2%    306.9ms +/- 0.2%     significant
    earley-boyer: 1.059x as fast     305.1ms +/- 0.1%    288.1ms +/- 0.2%     significant
    raytrace:     1.068x as fast     428.4ms +/- 0.2%    401.2ms +/- 0.3%     significant
    regexp:       1.008x as fast     223.9ms +/- 0.1%    222.2ms +/- 0.1%     significant
    richards:     *1.027x as slow*   222.9ms +/- 0.6%    228.8ms +/- 0.6%     significant
    splay:        1.042x as fast     278.5ms +/- 0.4%    267.2ms +/- 0.5%     significant
Comment 23 Igor Bukanov 2010-10-25 04:46:40 PDT
Nominating for 2.0 due to 3% win in V8.
Comment 24 Igor Bukanov 2011-04-28 19:24:14 PDT
Created attachment 529021 [details] [diff] [review]
v6

unrotten patch
Comment 25 Bill McCloskey (:billm) 2011-04-28 19:53:32 PDT
Igor, would it be possible to work some of the ideas from bug 650095 into this patch? As we move toward generational GC, I expect that we'll have a lot of arenas that have only one or two pinned objects in them. It would be nice if we could bump pointer allocate in those as well.
Comment 26 Igor Bukanov 2011-04-28 23:52:16 PDT
(In reply to comment #25)
> Igor, would it be possible to work some of the ideas from bug 650095 into this
> patch?

We can replace the link field stored in the GC cell with a pointer into the next free range and then assemble all free ranges of things of the given kind into single a list. But that would requires measurements regarding performance to see if this would be beneficial compared with the current hybrid approach in this patch.
Comment 27 Igor Bukanov 2011-04-29 00:43:42 PDT
(In reply to comment #26)
> We can replace the link field stored in the GC cell with a pointer into the
> next free range and then assemble all free ranges of things of the given kind
> into single a list.

That is, we can replace struct FreeCell with something like:

struct FreeCell {
   FreeCell *nextFreeStart;
   FreeCell *nextFreeEnd;
}

and use that in the allocator directly. But without the moving GC when
utilized arenas have random holes in them that may not be beneficial.
So lets measure.
Comment 28 Bill McCloskey (:billm) 2011-04-29 10:26:01 PDT
Storing two pointers in the FreeCell is much cleaner. Very nice.
Comment 29 Igor Bukanov 2011-05-10 04:40:56 PDT
I am experimenting with various patches that implements the bump allocation. It revelled a problem with the current allocation code pattern where the inline methods return an allocated GC thing via the return value. Consider the following:

T *cursor;
T *end; 

inline T *allocate()
{
    if (cursor != end)
        return cursor++;    
    return allocateSlow();
}

T *t = allocate();
if (!t)
    return_failure;
do_something(t);


For the last 4 lines GCC effectively generates the code like for the following fragment:

if (cursor != end)
    t = cursor++;
else
    t = allocateSlow();
if (!t)
    return_failure;
do_something(t);

Note that the !t check is done even on the fast path of the allocation as the compiler do not see that the pointer cannot be NULL. Using C++ references does not help at all - apparently GC treats them internally as nullable pointers.

To remove the redundant check that adds a non-trivial cost it is necessary to change the allocate function to return an explicit success/failure flag together with the pointer:

T *cursor;
T *end; 

inline bool allocate(T **thingp)
{
    if (cursor != end) {
        *thingp = cursor++;
        return true;
    }    
    T *t = allocateSlow();
    if (t) {
        *thingp = t;
        return true;
    }
    return false;
}

T *t;
if (!allocate(&t))
    return_failure;
do_something(t);

Then for the last 4 lines the code becomes:

    if (cursor == end)
        goto slow;
    t = cursor++
  action:
    do_something(t);
    ...

  slow:
    t = allocateSlow();
    if (!t)
       return_failure;
    goto action;
Comment 30 Brian Hackett (:bhackett) 2011-05-10 06:00:32 PDT
(In reply to comment #29)
> I am experimenting with various patches that implements the bump allocation.
> It revelled a problem with the current allocation code pattern where the
> inline methods return an allocated GC thing via the return value. Consider
> the following:

I wouldn't worry much about GCC inserting an extra NULL check.  How much does this slow down the C++?  Normally the cost of these well-predicted branches is pretty puny, whether in C++ or in jitcode (e.g. bug 609440).  For really really hot code it may matter, but in such cases we'll be doing things in jitcode anyways and will control the emitted assembly (we'll definitely bump allocate objects from jitcode, and ought to be doing so with the current freelist allocator).
Comment 31 Igor Bukanov 2011-05-10 06:25:16 PDT
(In reply to comment #30)
> I wouldn't worry much about GCC inserting an extra NULL check.  How much
> does this slow down the C++? 

I was investigating an unexpected slowdown with some synthetic benchmark with an early version of the patch when the bump allocation were losing against allocation from a new arena with pre-build free list losing 1-2% in JS code. That is how I discovered the issue. Now, that early patch has some other performance problems and bugs that I the new version supposed to fix.

I will post an update when the new patch is ready.
Comment 32 Igor Bukanov 2011-05-10 06:33:06 PDT
(In reply to comment #29)

> Note that the !t check is done even on the fast path of the allocation as
> the compiler do not see that the pointer cannot be NULL.

On GCC mail list I got a suggestion how to workaround this - a dereference of a pointer makes GCC to treat it as non-nullable. So at worst it would be necessary to do the refactoring from the comment 29 on the limited number of functions.
Comment 33 Igor Bukanov 2011-05-10 17:04:39 PDT
It seems a fixed patch addressed performance issues. Now I have 1% win in SunSpider, 3% win in V8 where splay becomes 15% faster.
Comment 34 Igor Bukanov 2011-05-11 05:03:35 PDT
I change the title to better reflect the nature of the patch.
Comment 35 Igor Bukanov 2011-05-11 06:15:22 PDT
Created attachment 531605 [details] [diff] [review]
v7

The new patch implements the idea from the commnet 27 and replaces the linked list of free things with a linked list of spans. Thus for a newly allocated arena there is only one span from which the code can bump-allocate. But this is also very beneficial for arenas with just few used things as for them the bump allocation works most of the time. On the other hand for arenas where most of the spans covers just one thing and the list of spans effectively reduces into the list of free things the overhead is an extra if check per thing allocation.

With the patch the fast allocation path becomes:

        thing = freeList->start;
        if (thing != freeList->end) {
            freeList->start += thingSize;
        } else if (thing & ArenaMask) {
            *freeList = *freeList->nextSpan();
        } else {
            return NULL;
        }
        return thing;

where the first "if" checks if we have more than one thing in the current span and the second if checks if the current span is not the last one (that ends at the arena end) but rather has a linkage to another span. If so, it moves to that span.
Comment 36 Igor Bukanov 2011-05-11 11:57:00 PDT
If I farther change the patch as outlined in the comment 29, then V8 shows an extra 0.3% improvement. So the extra NULL checks do cost cycles. I will turn those changes into separated bug.
Comment 37 Igor Bukanov 2011-05-12 07:03:31 PDT
The perf tool on Linux shows that the benefits of the bump allocator comes from less cache misses:

Typical run of v8-splay before the patch:

 725586623  cycles
 677410957  instructions
   1136306  cache-misses

After the patch:
 694101176  cycles
 688832839  instructions
    954948  cache-misses

As expected, the bump allocation compared with the free list requires more instructions but those are faster as less cache misses happens.
Comment 38 Bill McCloskey (:billm) 2011-05-12 11:12:00 PDT
Fantastic! Thanks Igor. Now if we can do bug 648320, we'll have almost entirely eliminated all the cache traffic for garbage objects during sweeping and finalization. This would be a big step towards generational GC.
Comment 39 Igor Bukanov 2011-05-24 15:47:39 PDT
Created attachment 534920 [details] [diff] [review]
v8

The new version updates for changes from the bug 658137.
Comment 40 Igor Bukanov 2011-06-02 11:36:03 PDT
Created attachment 536945 [details] [diff] [review]
v9

Here is an updated patch. The comments in the patch should explain how the bump allocation works.
Comment 41 Bill McCloskey (:billm) 2011-06-06 13:37:08 PDT
Comment on attachment 536945 [details] [diff] [review]
v9

Review of attachment 536945 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, Igor. This is a great patch, and I'm surprised at how small it is. However, I'd like to propose an alternate design, mostly just to see if it's workable.

It seems like FreeSpan is actually serving two different purposes. One purpose is to determine the layout of free cells in the heap. The other purpose is to save information about the current free span while allocating. The reason I mention this is because I think it might be useful to split these apart into two uses into different structures.

For storing the free list segments in the heap, it seems like you only need two pointers:
- a pointer to the end of the segment
- a pointer to the beginning of the next segment
I'll call this structure a FreeSegment to distinguish it from a FreeSpan. You don't really need the starting pointer in this case, because it's always equal to its own address. And if we store the next pointer here, rather than at the end, I think the code is cleaner and easier to understand.

With this arrangement, the ArenaHeader could just have a pointer to a FreeSegment. The pointer would be NULL if the arena has no free cells. This way you wouldn't need the complexity of the two uint16_ts.

Then in the FreeList data structure, you could store three pointers (for each thingKind):
- beginning of the current free segment (which is the next cell that we'll allocate)
- end of the current free segment
- pointer to next free segment (or NULL if it's the last one)
We could call this structure AllocatorInfo or something. Then if you need to purge, it would look like this:

struct FreeLists {
  AllocatorInfo lists[FINALIZE_LIMIT];
  void purge() {
      foreach (AllocatorInfo *alloc in lists) {
          FreeSegment *segment = (FreeSegment *) alloc->begin;
          segment->end = alloc->end;
          segment->next = alloc->next;
      }
  }
};

I don't know if this would work, since I haven't tried it. But it seems like a cleaner design to me. I don't think the allocation path would be any more expensive, but maybe I've missed something.

::: js/src/jsgc.h
@@ +336,5 @@
>      static size_t thingsStartOffset(size_t thingSize) {
> +        return ArenaSize - thingsSpan(thingSize);
> +    }
> +
> +    static bool isAligned(uintptr_t thing, size_t thingSize) {

Since this is a method of Arena, I think isThingAligned would be a better name,

@@ +338,5 @@
> +    }
> +
> +    static bool isAligned(uintptr_t thing, size_t thingSize) {
> +        /* Things ends at the arena end. */
> +        uintptr_t tailOffset = (ArenaSize - thing) & ArenaMask;

It seems more natural to me to do (ArenaSize - (thing & ArenaMask)). Is there a reason you didn't do it that way?
Comment 42 Igor Bukanov 2011-06-06 15:25:54 PDT
(In reply to comment #41)
> Comment on attachment 536945 [details] [diff] [review] [review]
> v9
> 
> Review of attachment 536945 [details] [diff] [review] [review]:
> -----------------------------------------------------------------
> 
> Thanks, Igor. This is a great patch, and I'm surprised at how small it is.
> However, I'd like to propose an alternate design, mostly just to see if it's
> workable.
> 
> It seems like FreeSpan is actually serving two different purposes. One
> purpose is to determine the layout of free cells in the heap. The other
> purpose is to save information about the current free span while allocating.
> The reason I mention this is because I think it might be useful to split
> these apart into two uses into different structures.
> 
> For storing the free list segments in the heap, it seems like you only need
> two pointers:
> - a pointer to the end of the segment
> - a pointer to the beginning of the next segment
> I'll call this structure a FreeSegment to distinguish it from a FreeSpan.
> You don't really need the starting pointer in this case, because it's always
> equal to its own address. And if we store the next pointer here, rather than
> at the end, I think the code is cleaner and easier to understand.
> 
> With this arrangement, the ArenaHeader could just have a pointer to a
> FreeSegment. The pointer would be NULL if the arena has no free cells. This
> way you wouldn't need the complexity of the two uint16_ts.

Storing the link information in the first cell does not work as it would be overwritten by allocated things. It is possible to workaround this, but that resulted in strictly more complex allocation code.

Before the current patch I tried an alternative when the FreeSegment is represented as a pointer to the last cell that stores the pointer to the start of the segment and the link to the next segment. With this setup the arena header also needs just a pointer to the last cell of the current segment. But that required an extra check on the allocation path to check for an empty segment list plus an extra indirection.

In addition both these approaches resulted in more corner cases and complexity during finalization so FreeSpan won. 

> @@ +338,5 @@
> > +    }
> > +
> > +    static bool isAligned(uintptr_t thing, size_t thingSize) {
> > +        /* Things ends at the arena end. */
> > +        uintptr_t tailOffset = (ArenaSize - thing) & ArenaMask;
> 
> It seems more natural to me to do (ArenaSize - (thing & ArenaMask)). Is
> there a reason you didn't do it that way?

IIRC that came from some other part of the code when I needed to ensure that tailOffset < ArenaSize even if the thing points to the start of the arena. But this is not necessary here.
Comment 43 Bill McCloskey (:billm) 2011-06-06 16:33:36 PDT
(In reply to comment #42)
> Storing the link information in the first cell does not work as it would be
> overwritten by allocated things. It is possible to workaround this, but that
> resulted in strictly more complex allocation code.

I don't think this is a problem. Any segment that you're allocating into will be part of the FreeLists structure, whose AllocatorInfo has a copy of the next pointer. Then when you call purge, the information will be written back into *a different* FreeSegment--the one that's now at the head.

Here's an example. Say you five cells in an arena: A,B,C,D,E. Assume A, B, and E are free. Initially, the arena's header will have a single FreeSegment pointer to A. The pointers look like this:
  A->end = C
  A->next = E
  E->end = <end of the arena> = E+sizeof(T)
  E->next = NULL

Now we start allocating into this arena. Assume that alloc == &cx->compartment->freeLists.list[thingKind]. Then we set alloc->start to A, alloc->end to A->end, and alloc->next to A->next.

The allocation procedure would look like this:
  thing = alloc->start;
  if (thing != alloc->end) {
    alloc->start += sizeof(T)
  } else if (alloc->next) {
    FreeSegment *next = alloc->next;
    thing = next;
    alloc->start = next + sizeof(T);
    alloc->end = next->end;
    alloc->next = next->next;
  } else {
    thing = NULL;
  }
  return thing;

This includes an extra store when you reach the end of a segment, but the memory will already be in L1, so it should only be a few cycles.

So if you allocate, you'll return A and alloc->start will be set to B. Now if we do a GC, purge() will set the ArenaHeader's freelist pointer to B. And it will set B->end = C, and B->next = E. This information would come from alloc. It's copied to B, since B is the new head of the free list. (I realize now that purge would also have to deal with the case where alloc->start == alloc->end, and in that case it would set the freelist pointer to alloc->next.)
Comment 44 Igor Bukanov 2011-06-07 09:25:27 PDT
(In reply to comment #43)
> The allocation procedure would look like this:
>   thing = alloc->start;
>   if (thing != alloc->end) {
>     alloc->start += sizeof(T)
>   } else if (alloc->next) {
>     FreeSegment *next = alloc->next;
>     thing = next;
>     alloc->start = next + sizeof(T);
>     alloc->end = next->end;
>     alloc->next = next->next;
>   } else {
>     thing = NULL;
>   }
>   return thing;
> 
> This includes an extra store when you reach the end of a segment, but the
> memory will already be in L1, so it should only be a few cycles.

What about storing the last GC thing of the segment at the start of the segment and the next link at the end of the segment. I.e. we would have:

struct SegmentStart {
    SegmentTail *tail;
};

struct SegmentTail {
    SegmentStart pad; 
    SegmentStart *next;
};

Here SegmentTail::pad coincides with the real SegmentStart if the segment has only one node.

The the arena header stores SegmentStart* for the first segment and the allocation code becomes:

alloc = &cx->compartment->freeLists[thingKind];
thing = alloc->start;
if (thing != alloc->end) {
    alloc->start += sizeof(T);
} else {
    SegmentTail *tail = (SegmentTail *) alloc->end;
    if (tail->next) {
        alloc->start = tail->next;
        alloc->end = tail->next->tail;
    } else {
        thing = NULL;
    }
}

I will try to prototype this.
Comment 45 Bill McCloskey (:billm) 2011-06-07 10:42:43 PDT
OK, thanks, that sounds reasonable.
Comment 46 Igor Bukanov 2011-06-15 06:52:35 PDT
Created attachment 539498 [details] [diff] [review]
v10

The new version is v9 with an assert fixed. I tried to implement the alternatives above, but the resulting code complexity is greater due to extra corner cases in purge and finalize so I abandoned those efforts. What was simpler was using FreeSpan directly in ArenaHeader. That simplified things without the need to encode/decode the first free span. But that regressed on V8 and synthetic benchmark due to extra header bloat. So I returned to the original version.
Comment 47 Bill McCloskey (:billm) 2011-06-17 18:20:32 PDT
Comment on attachment 539498 [details] [diff] [review]
v10

Review of attachment 539498 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure I understand what you mean by "extra header bloat". However, I don't want to hold up this patch any more. But for posterity, could you post your prototypes here? Also, what specifically regressed? SS/v8?

::: js/src/jsgc.cpp
@@ +185,5 @@
> +        return;
> +
> +    FreeSpan firstSpan;
> +    firstSpan.start = address() + firstFreeSpanStart;
> +    firstSpan.end = address() + firstFreeSpanEnd;

It might be nice to give FreeSpan a constructor and use that here. I've also seen a few other places where a constructor would be cleaner, but possibly slower, depending on how smart the compiler is.

@@ +669,5 @@
> +             * span->end < thing < thingsEnd and so we must have more spans.
> +             */
> +            span = span->nextSpan();
> +        }
> +    }

I think it's better to have this code in a separate function, like it was before. Is there a reason you inlined it?

::: js/src/jsgc.h
@@ +129,5 @@
> + * span covers adjacent free things. The last span in the list covers the
> + * things from [start, end) interval at the end of the arena. It can be
> + * empty if the arena has no free things at the end. Any span before the last
> + * covers the interval [start, end]. It can not be empty and stores the link
> + * to the next span in the GC thing that starts at the end address.

I wrote an alternate version of this paragraph, since it seemed confusing. Feel free to integrate them however you want.

"A FreeSpan represents a contiguous sequence of free cells in an Arena. |start| is the address of the first free cell in the span. |end| is the address of the last free cell in the span. The last cell (starting at |end|) holds a FreeSpan data structure for the next span. However, the last FreeSpan in an Arena is special: |end| points to the end of the Arena (an unusable address), and no next FreeSpan is stored there."

@@ +225,5 @@
> +     * The first span of free things in the arena. We encode it as the start
> +     * and end offsets within the arena, not as FreeSpan structure, to
> +     * minimize the header size. When the arena has no free things, the span
> +     * must be empty one pointing at the arena end. For such span the start
> +     * and end offsets must be ArenaSize.

Nit: "When the arena has no free things, the span
must be the empty one pointing to the arena's end. For such a span the start
and end offsets must be ArenaSize."

@@ +256,5 @@
> +    void setAsFullyUsed() {
> +        firstFreeSpanStart = firstFreeSpanEnd = ArenaSize;
> +    }
> +
> +    void getFirstFreeSpan(FreeSpan *span) const {

I may be wrong, but I think that if you just return a FreeSpan, the compiler will optimize this so it performs as well as what you have here. Then you could just say:
  return FreeSpan(address() + firstFreeSpanStart, address() + firstFreeSpanEnd);
and the caller could say:
  FreeSpan span = getFirstFreeSpan();

@@ +383,5 @@
>  /* The chunk header (located at the end of the chunk to preserve arena alignment). */
>  struct ChunkInfo {
>      Chunk           *link;
>      JSRuntime       *runtime;
> +    ArenaHeader     *emptyArenaList;

How about emptyArenaListHead?

@@ +791,5 @@
> +     * things we copy its first free span into the list and set arena is if
> +     * it has no free things. This way we do not need to update the arena
> +     * header after the initial allocation and only move the the head of the
> +     * of the list of spans spans back to the arena when starting the GC only
> +     * for the arena that was not fully allocated.

I think you meant "set arena *as* if it has no free things." Also, there's some typos where you repeated words here.

@@ +861,5 @@
> +            /*
> +             * We either has at least one thing in the span that ends the
> +             * arena list or we have at least two things in the non-last span.
> +             * In both cases we just need to bump the start pointer to account
> +             * for the allocation.

"We either have at least"
Comment 48 Igor Bukanov 2011-06-19 08:59:20 PDT
http://hg.mozilla.org/tracemonkey/rev/da75107e0632 - pushed with all nits addressed
Comment 49 Igor Bukanov 2011-06-19 11:16:37 PDT
http://hg.mozilla.org/tracemonkey/rev/af6968f8231a - followup to fix MSVC warings and adjust the testStringBufferMallocAccounting.js test to overwrite all conservative GC roots
Comment 50 Chris Leary [:cdleary] (not checking bugmail) 2011-06-20 17:14:34 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/da75107e0632
http://hg.mozilla.org/mozilla-central/rev/af6968f8231a
Comment 51 Bill McCloskey (:billm) 2011-12-25 18:06:04 PST
*** Bug 650095 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.