Last Comment Bug 673795 - Avoid hashing empty GC chunks
: Avoid hashing empty GC chunks
Status: RESOLVED FIXED
[MemShrink:P2]
: footprint
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla8
Assigned To: Igor Bukanov
:
Mentors:
Depends on: 673760
Blocks: 676205 671702
  Show dependency treegraph
 
Reported: 2011-07-24 10:54 PDT by Igor Bukanov
Modified: 2011-08-17 02:44 PDT (History)
18 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
part 1 (19.65 KB, patch)
2011-07-26 00:54 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
part 2 (16.77 KB, patch)
2011-07-26 00:57 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
part 1 (19.84 KB, patch)
2011-08-01 00:48 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
part 2 (17.54 KB, patch)
2011-08-01 01:07 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
part 1 (17.37 KB, patch)
2011-08-02 14:07 PDT, Igor Bukanov
anygregor: review+
Details | Diff | Splinter Review
part 2 (17.09 KB, patch)
2011-08-02 14:27 PDT, Igor Bukanov
no flags Details | Diff | Splinter Review
part 1 (17.38 KB, patch)
2011-08-03 00:59 PDT, Igor Bukanov
igor: review+
Details | Diff | Splinter Review
part 2 (17.10 KB, patch)
2011-08-03 01:04 PDT, Igor Bukanov
wmccloskey: review+
Details | Diff | Splinter Review
part 2 (18.52 KB, patch)
2011-08-12 14:38 PDT, Igor Bukanov
igor: review+
Details | Diff | Splinter Review

Description Igor Bukanov 2011-07-24 10:54:13 PDT
An apparent win from smaller GC chunks on Linux from the patches in the bug 671702 comes from not including the empty GC chunks in the chunk hashtables. As these tables are enumerated at the beginning of the GC to clear the mark bitmap each empty 1MB chunk implies useless clearing of 16K.

Another problem is that when searching for a chunk with available arenas the code also enumerates the chunk hashtables. As such an empty chunk can be picked up when there are other chunks with arenas preventing its release during the GC.

So to clearly see the effect of smaller chunks the idea is to factor out that part of the bug 671702 into a separated patch here.
Comment 1 Nicholas Nethercote [:njn] 2011-07-24 16:31:16 PDT
Nice observation!  Breaking the big changes up into smaller changes is a good idea.
Comment 2 Igor Bukanov 2011-07-26 00:54:47 PDT
Created attachment 548385 [details] [diff] [review]
part 1

Part 1 implements that idea of not including the empty chunks into the chunk hash. However, it regressed on v8 in a shell on Linux. Apparently with the patch more chunks are released to the system and newly-allocated during the benchmark. This is a direct consequence of not using empty chunks until all other chunks are fully used so there are more chances that a chunk ages and released.

But this regression on Linux can be fully offset with another idea from the bug 671702, that is of putting all chunks with at least one available arena into a list and use it for allocations. That will come in part 2, bit here some numbers:

MC tip against part1 in thread-safe js shell on Linux:


TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      *1.007x as slow*  1532.9ms +/- 0.1%   1543.6ms +/- 0.1%     significant

=============================================================================

  v8:             *1.007x as slow*  1532.9ms +/- 0.1%   1543.6ms +/- 0.1%     significant
    crypto:       -                  202.7ms +/- 0.2%    202.6ms +/- 0.2% 
    deltablue:    *1.008x as slow*   282.2ms +/- 0.2%    284.6ms +/- 0.3%     significant
    earley-boyer: ??                 252.6ms +/- 0.3%    253.2ms +/- 0.2%     not conclusive: might be *1.002x as slow*
    raytrace:     *1.006x as slow*   196.5ms +/- 0.2%    197.7ms +/- 0.2%     significant
    regexp:       *1.015x as slow*   192.2ms +/- 0.3%    195.1ms +/- 0.2%     significant
    richards:     ??                 205.7ms +/- 0.4%    206.5ms +/- 0.5%     not conclusive: might be *1.004x as slow*
    splay:        *1.015x as slow*   200.9ms +/- 0.3%    204.0ms +/- 0.2%     significant


MC tip against part1 + part2:

TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      1.011x as fast    1532.9ms +/- 0.1%   1516.7ms +/- 0.1%     significant

=============================================================================

  v8:             1.011x as fast    1532.9ms +/- 0.1%   1516.7ms +/- 0.1%     significant
    crypto:       -                  202.7ms +/- 0.2%    202.7ms +/- 0.3% 
    deltablue:    -                  282.2ms +/- 0.2%    281.8ms +/- 0.3% 
    earley-boyer: *1.024x as slow*   252.6ms +/- 0.3%    258.6ms +/- 0.4%     significant
    raytrace:     1.003x as fast     196.5ms +/- 0.2%    195.9ms +/- 0.1%     significant
    regexp:       *1.017x as slow*   192.2ms +/- 0.3%    195.6ms +/- 0.2%     significant
    richards:     -                  205.7ms +/- 0.4%    205.4ms +/- 0.4% 
    splay:        1.137x as fast     200.9ms +/- 0.3%    176.7ms +/- 0.4%     significant

During score-based benchmarking in the browser on http://v8.googlecode.com/svn/data/benchmarks/v6/run.html I see no differences:

                tip      part1       part1+part2

Score:         3953       3959         3952

Richards:      6234       6047         6128
DeltaBlue:     4093       3848         3930
Crypto:        6380       6342         6033
RayTrace:      2719       2948         2942
EarleyBoyer:   3861       3915         3859
RegExp:        1674       1736         1753
Splay:         5275       5157         5207
Comment 3 Igor Bukanov 2011-07-26 00:57:37 PDT
Created attachment 548386 [details] [diff] [review]
part 2

The part 2 replaces the system and user hashtables with one global hashtable and separated user and system lists of available chunks. For benchmarking see above.
Comment 4 Igor Bukanov 2011-07-26 01:00:00 PDT
Note that the patches here depend on the patch from the bug 673760. The try server builds for part1+part2 is at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ibukanov@mozilla.com-ba3113b5ae27/
Comment 5 Gregor Wagner [:gwagner] 2011-07-28 13:17:07 PDT
(In reply to comment #2)
> Created attachment 548385 [details] [diff] [review] [review]
> part 1
> 
> Part 1 implements that idea of not including the empty chunks into the chunk
> hash. However, it regressed on v8 in a shell on Linux. Apparently with the
> patch more chunks are released to the system and newly-allocated during the
> benchmark. This is a direct consequence of not using empty chunks until all
> other chunks are fully used so there are more chances that a chunk ages and
> released.
> 

I am not convinced by your regression explanation. I looked at the chunks allocation and they don't really change with your first patch (at least on OS X). Have you measured it? The GCTimer tells you how many chunks are allocated/released since the last GC.

I am not a big fan of making big GC changes right now. All the GC behavior is  off (30% pause time regression) because of the instrumentation patches from Bill.
Comment 6 Igor Bukanov 2011-07-29 08:22:54 PDT
(In reply to comment #5)
> I am not convinced by your regression explanation. I looked at the chunks
> allocation and they don't really change with your first patch (at least on
> OS X). Have you measured it? The GCTimer tells you how many chunks are
> allocated/released since the last GC.

I patched the code so the chunk allocation counters are never reset and only reported at js_FinishGC. This way I can see the numbers of chunk allocation calls during the whole benchmark suite. For v8 without the patch the number was 154, with the patch that becomes 158, so the patch releases empty chunks slightly more aggressively and the effect from the comment 0 is visible even during v8 run. 

> 
> I am not a big fan of making big GC changes right now. All the GC behavior
> is  off (30% pause time regression) because of the instrumentation patches
> from Bill.

But why is this bad? I tried to re-do base tests when doing measurements after syncing with a tree unless I can guess that the changes would not affect benchmarks... Or is it too time consuming in your setup?
Comment 7 Igor Bukanov 2011-08-01 00:48:27 PDT
Created attachment 549733 [details] [diff] [review]
part 1

Here is a rebased patch.
Comment 8 Igor Bukanov 2011-08-01 01:07:41 PDT
Created attachment 549736 [details] [diff] [review]
part 2
Comment 9 Gregor Wagner [:gwagner] 2011-08-01 12:35:49 PDT
Do we have to store the "age" in the chunk address? I don't like this. I'd rather use an additional field in chunkinfo or reuse another one.
Comment 10 Igor Bukanov 2011-08-01 13:08:52 PDT
(In reply to comment #9)
> Do we have to store the "age" in the chunk address? I don't like this. I'd
> rather use an additional field in chunkinfo or reuse another one.

Storing the age outside the chunk allows not to reference the empty chunk at all avoiding TLB/cache misses. With big chunks this does not matter, but with smaller chunks this became noticeable. Initially I have used a struct with 2 fields for that, but with smaller chunks this just waste memory and does not make source any simpler.
Comment 11 Gregor Wagner [:gwagner] 2011-08-01 14:44:47 PDT
(In reply to comment #10)
> (In reply to comment #9)
> > Do we have to store the "age" in the chunk address? I don't like this. I'd
> > rather use an additional field in chunkinfo or reuse another one.
> 
> Storing the age outside the chunk allows not to reference the empty chunk at
> all avoiding TLB/cache misses. With big chunks this does not matter, but
> with smaller chunks this became noticeable. Initially I have used a struct
> with 2 fields for that, but with smaller chunks this just waste memory and
> does not make source any simpler.


We only iterate over all empty chunks in ExpireGCChunks and this happens on the background thread right? This is really a regression on a multithreaded build?
Comment 12 Igor Bukanov 2011-08-02 00:54:14 PDT
(In reply to comment #11)
> 
> We only iterate over all empty chunks in ExpireGCChunks and this happens on
> the background thread right? This is really a regression on a multithreaded
> build?

I noticed a spike of cache misses with cachegrind. With a single core CPU that becomes expensive. For example, with smaller chunks we can trivially end up with 1000 or more empty aging chunks. Accessing each empty chunk info header pretty much guarantee that we get TLB and L2 miss on each access. Assuming that the cost of those misses is about 250 cycles this wastes 0.15ms on recently-released single-core Atom per GC or even more on ARM.

If this is acceptable price for simpler code then I will replace the age vector with a linked list.
Comment 13 Bill McCloskey (:billm) 2011-08-02 09:56:45 PDT
Given that this is on the background thread, 0.15ms seems perfectly acceptable to me.
Comment 14 Igor Bukanov 2011-08-02 14:07:10 PDT
Created attachment 550195 [details] [diff] [review]
part 1

The new version uses a linked list for empty chunks.
Comment 15 Igor Bukanov 2011-08-02 14:27:20 PDT
Created attachment 550208 [details] [diff] [review]
part 2

The new version accounts for the empty chunk list changes from part 1.
Comment 16 Gregor Wagner [:gwagner] 2011-08-02 15:01:02 PDT
Comment on attachment 550195 [details] [diff] [review]
part 1

>+    if (unused()) {
>+        rt->gcUserChunkSet.remove(this);
>+        rt->gcSystemChunkSet.remove(this);
>+
>+        /*
>+         * We must not release empty chunks until we done with finalization to

until we _are_ done

>+         * allow calling IsAboutToBeFinalized/Cell::isMarked for finalized GC
>+         * things in empty chunks. So we always add the chunk to the empty
>+         * set.

I guess this is similar to the other IsAboutToBeFinalized discussion we had. We could do it if we stick to "no more IsAboutToBeFinalized after we started finalizing" right? The new code makes sense for performance since we always free the chunk in the background even for arenas with foreground finalization. Maybe you could reflect this in the comment.



>+    /* Use an empty chunk when available or allocate a new one. */
>+    chunk = rt->gcEmptyChunkListHead;
>+    if (chunk) {
>+        JS_ASSERT(chunk->unused());
>+        JS_ASSERT(!rt->gcUserChunkSet.has(chunk));
>+        JS_ASSERT(!rt->gcSystemChunkSet.has(chunk));
>+        JS_ASSERT(rt->gcEmptyChunkCount >= 1);
>+        rt->gcEmptyChunkListHead = chunk->info.link;
>+        rt->gcEmptyChunkCount--;
>+    } else {
>+        chunk = AllocateGCChunk(rt);
>         if (!chunk)
>             return NULL;
>+
>+        chunk->init(rt);
>+        rt->gcChunkAllocationSinceLastGC = true;

Now it would make sense to move the last line out of the else for smaller memory footprint. We should always try to perform IDLE_GC if we start using new memory. Before we didn't have enough information about the ratio of used vs. unused chunks.


> 
>+    uintptr_t address() const {
>+        uintptr_t addr = reinterpret_cast<uintptr_t>(this);
>+        JS_ASSERT(!(addr & GC_CHUNK_MASK));
>+        return addr;
>+    }
>+
Still needed?
Comment 17 Igor Bukanov 2011-08-02 15:35:48 PDT
(In reply to comment #16)
> I guess this is similar to the other IsAboutToBeFinalized discussion we had.
> We could do it if we stick to "no more IsAboutToBeFinalized after we started
> finalizing" right? The new code makes sense for performance since we always
> free the chunk in the background even for arenas with foreground
> finalization. Maybe you could reflect this in the comment.

Our shape code uses Cell::isMarked during finalization in an essential way so this is not only about IsAboutToBiFinalized.

> >+    /* Use an empty chunk when available or allocate a new one. */
> >+    chunk = rt->gcEmptyChunkListHead;
> >+    if (chunk) {
...
> >+    } else {
> >+        chunk = AllocateGCChunk(rt);
...
> >+        rt->gcChunkAllocationSinceLastGC = true;
> 
> Now it would make sense to move the last line out of the else for smaller
> memory footprint. We should always try to perform IDLE_GC if we start using
> new memory. Before we didn't have enough information about the ratio of used
> vs. unused chunks.

I will file a new bug about it to keep GC tuning outside this bug.
Comment 18 Igor Bukanov 2011-08-03 00:59:27 PDT
Created attachment 550318 [details] [diff] [review]
part 1

Here is the patch with fixed comments.
Comment 19 Igor Bukanov 2011-08-03 01:04:15 PDT
Created attachment 550319 [details] [diff] [review]
part 2

Here is a rebased part 2.
Comment 20 Igor Bukanov 2011-08-11 02:25:38 PDT
Gregor: do you have time review the part 2 here?
Comment 21 Gregor Wagner [:gwagner] 2011-08-11 14:11:02 PDT
Comment on attachment 550319 [details] [diff] [review]
part 2

Uh I forgot about the 2nd part. I don't have the time for my usual review process with all the testing so it's better if Bill takes this.
Comment 22 Bill McCloskey (:billm) 2011-08-12 08:08:55 PDT
Comment on attachment 550319 [details] [diff] [review]
part 2

Review of attachment 550319 [details] [diff] [review]:
-----------------------------------------------------------------

Nice patch!

::: js/src/jscntxt.h
@@ +376,5 @@
>  
>      /* Garbage collector state, used by jsgc.c. */
> +    js::GCChunkSet      gcChunkSet;
> +    js::gc::Chunk       *gcSystemAvailableChunks;
> +    js::gc::Chunk       *gcUserAvailableChunks;

I would prefer different names here, as well as a comment.

It seems inconsistent that gcEmptyChunkListHead ends in "Head" and the new ones don't. How about "gcAvailableSystem/UserChunkListHead"? It's pretty long, but we almost never use the name in the code.

The comment should say something about how gcChunkSet contains only contains non-empty chunks and that the Available lists contain chunks that are not empty and not completely full.

I could also see it being confusing that the empty chunk list does not have prev pointers. If we keep this inconsistency, it should be mentioned in the comment. But I think it would be better just to maintain all the lists in the same way (see below).

::: js/src/jsgc.cpp
@@ +346,4 @@
>  }
>  
> +inline Chunk **
> +FindAvailableChunkList(JSCompartment *comp)

I'm not sure if this is SpiderMonkey style, but I usually try to use "Find" for things that require a search. Maybe "GetAvailableChunkList"?

@@ +353,4 @@
>  }
>  
> +inline void
> +Chunk::addToAvailableList(JSCompartment *comp)

How about generalizing this to addToChunkList and passing in a Chunk** instead of a compartment? Then we could use this function (and a similarly renamed removeFromChunkList) for all three chunk lists. I think it would be cleaner this way.

@@ +449,5 @@
>           * in empty chunks. So we add the chunk to the empty set even during
>           * GC_SHRINK.
>           */
>          info.age = 0;
> +        info.next = rt->gcEmptyChunkListHead;

This could change to addToChunkList(&rt->gcEmptyChunkListHead).

@@ +525,5 @@
>      return chunk;
>  }
>  
>  static void
>  ReleaseEmptyGCChunks(JSRuntime *rt)

I realize it's not part of your patch, but this function seems somewhat redundant. Could we just remove it, and then modify the main ExpireGCChunks loop to always release the chunk if it's a shrinking GC?

@@ +556,3 @@
>              JS_ASSERT(chunk->info.age <= MAX_EMPTY_CHUNK_AGE);
>              if (chunk->info.age == MAX_EMPTY_CHUNK_AGE) {
> +                *chunkp = chunk->info.next;

And here we could use removeFromChunkList.
Comment 23 Igor Bukanov 2011-08-12 14:38:51 PDT
Created attachment 552770 [details] [diff] [review]
part 2

The new version uses longer ListHead names, adds comments and inlines/removes ReleaseEmptyGCChunks. I have kept the empty chunk list as single list. In bug 671702 the available lists I will move to the compartment and the suggested reorganization would not work in that setup.
Comment 25 Kyle Huey [:khuey] (Exited; not receiving bugmail, email if necessary) 2011-08-14 05:06:38 PDT
http://hg.mozilla.org/mozilla-central/rev/8c36a72adee9
http://hg.mozilla.org/mozilla-central/rev/3210abdedf8a
Comment 26 Nicholas Nethercote [:njn] 2011-08-14 16:13:59 PDT
Igor, can you briefly summarize the what final patches did and what the benefits are?  Also, you haven't given any performance numbers after comment 2, which were for an early version.  Do you have any?
Comment 27 Igor Bukanov 2011-08-14 23:50:15 PDT
(In reply to Nicholas Nethercote [:njn] from comment #26)
> Igor, can you briefly summarize the what final patches did and what the
> benefits are?

The goal of the patches here is to make the patch for the bug 671702 smaller and to make performance characteristics more visible there. I filed it as a separated bug as it makes sense to do on its own.

>  Also, you haven't given any performance numbers after comment
> 2, which were for an early version.  Do you have any?

No, not for the landed patch. The changes from the measured versions should not affect benchmarks, plus this patch combined with other patches in my patch queue unrelated to this work give a small V8 win that matches the comment 2.
Comment 28 Ed Morley [:emorley] 2011-08-15 02:32:50 PDT
(In reply to Igor Bukanov from comment #27)
> No, not for the landed patch. The changes from the measured versions should
> not affect benchmarks, plus this patch combined with other patches in my
> patch queue unrelated to this work give a small V8 win that matches the
> comment 2.

Do you have any stats from a MemShrink POV? :-)
Comment 29 Igor Bukanov 2011-08-16 03:22:29 PDT
(In reply to Ed Morley [:edmorley] from comment #28)
 
> Do you have any stats from a MemShrink POV? :-)

I redid that http://gregor-wagner.com/tmp/mem test with 32-bit Firefox build on Linux. Here the stats without and with changesets from this bug:


before:
                             initial     all tabs       no tabs     no window

heap-allocated               29.04 MB   1,191.33 MB     575.62 MB    81.41 MB
heap-committed               37.00 MB	1,236.00 MB   1,224.00 MB   940.00 MB
heap-dirty                    2.32 MB	    3.60 MB       2.37 MB     2.35 MB
heap-unallocated              7.96 MB	   44.67 MB     648.38 MB   858.59 MB
js-compartments-system              2	          2             2           2
js-compartments-user                1	        251           149           1
js-gc-heap                    7.00 MB	  314.00 MB     310.00 MB    32.00 MB
js-gc-heap-arena-unused       0.72 MB	   53.94 MB      55.24 MB     2.48 MB
js-gc-heap-chunk-clean-unuse  0.00 MB	    6.00 MB       0.00 MB     0.00 MB
js-gc-heap-chunk-dirty-unuse  2.77 MB	   18.62 MB     196.83 MB    25.26 MB
js-gc-heap-unused-fraction     49.77%	     25.02%        81.31%      86.70%
page-faults-hard                    2	         18            18          18
page-faults-soft               17,037	  1,465,915     1,573,512   1,590,947
resident                     71.39 MB	1,396.98 MB   1,044.98 MB   290.79 MB
vsize                       341.13 MB	1,804.29 MB   1,657.52 MB 1,289.02 MB


after:

                             initial     all tabs       no tabs     no window

heap-allocated               30.07 MB   1,092.00 MB     523.90 MB     76.15 MB
heap-committed               40.00 MB	1,119.00 MB   1,120.00 MB    874.00 MB
heap-dirty                    3.05 MB	    3.27 MB       3.19 MB      3.21 MB
heap-unallocated              9.93 MB	   27.00 MB     596.10 MB    797.85 MB
js-compartments-system              2	          2             2            2
js-compartments-user                1	        249           150            1
js-gc-heap                    7.00 MB	  259.00 MB     268.00 MB     29.00 MB
js-gc-heap-arena-unused       0.94 MB	   47.70 MB      51.06 MB      2.55 MB
js-gc-heap-chunk-clean-unuse  0.00 MB	    0.00 MB       0.00 MB      0.00 MB
js-gc-heap-chunk-dirty-unuse  2.51 MB	    5.12 MB     163.02 MB     22.12 MB
js-gc-heap-unused-fraction     49.18%	     20.39%        79.88%       85.05%
page-faults-hard                    2	          6             6            7
page-faults-soft               18,096	    676,624       778,808      799,441
resident                     69.58 MB	1,264.78 MB     929.28 MB    213.10 MB
vsize                       340.32 MB	1,655.29 MB   1,599.58 MB  1,207.57 MB


So the patch did slightly minimize the number of unallocated arenas in GC chunks resulting in overall slightly better memory usage. I suppose this comes from using empty chunks only after allocating all GC arenas in non-empty chunks. Before the patch we selected randomly a chunk for new allocation from empty and and non-empty ones.
Comment 30 Nicholas Nethercote [:njn] 2011-08-16 04:07:31 PDT
> before:
> js-gc-heap                    7.00 MB	  314.00 MB     310.00 MB    32.00 MB
> page-faults-hard                    2	         18            18          18
> page-faults-soft               17,037	  1,465,915     1,573,512   1,590,947
> resident                     71.39 MB	1,396.98 MB   1,044.98 MB   290.79 MB
> vsize                       341.13 MB	1,804.29 MB   1,657.52 MB 1,289.02 MB
> 
> after:
> js-gc-heap                    7.00 MB	  259.00 MB     268.00 MB     29.00 MB
> page-faults-hard                    2	          6             6            7
> page-faults-soft               18,096	    676,624       778,808      799,441
> resident                     69.58 MB	1,264.78 MB     929.28 MB    213.10 MB
> vsize                       340.32 MB	1,655.29 MB   1,599.58 MB  1,207.57 MB

That's really good!  Peak resident dropped by 9.5%, final resident dropped by 27%.  And I never know what to make of soft page faults, but the fact that they halved can only be good :)

Note You need to log in before you can comment on or make changes to this bug.