Closed
Bug 673795
Opened 13 years ago
Closed 13 years ago
Avoid hashing empty GC chunks
Categories
(Core :: JavaScript Engine, defect)
Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla8
People
(Reporter: igor, Assigned: igor)
References
Depends on: 673760
Details
(Keywords: memory-footprint, Whiteboard: [MemShrink:P2])
Attachments
(2 files, 7 obsolete files)
17.38 KB, patch, igor: review+
18.52 KB, patch, igor: review+
The apparent win from smaller GC chunks on Linux with the patches in bug 671702 comes from not including empty GC chunks in the chunk hashtables. Since these tables are enumerated at the beginning of the GC to clear the mark bitmaps, each empty 1 MB chunk implies a useless clearing of 16 KB.
Another problem is that when searching for a chunk with available arenas the code also enumerates the chunk hashtables. As a result an empty chunk can be picked even when other chunks still have free arenas, which prevents its release during the GC.
So, to see the effect of smaller chunks clearly, the idea is to factor that part of bug 671702 out into a separate patch here.
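To make the cost concrete, here is a minimal C++ sketch of the problem described above. It is an illustration only, not SpiderMonkey's actual jsgc.cpp code: the struct layout and function names are assumptions, while the 1 MB chunk size and ~16 KB mark bitmap come from the numbers above.

#include <cstdint>
#include <cstring>
#include <unordered_set>

// Illustrative sketch, not SpiderMonkey source: a 1 MB chunk carries a
// ~16 KB mark bitmap, and clearing it for a chunk that holds no live
// arenas is pure overhead at the start of every GC.
struct Chunk {
    std::uint8_t markBitmap[16 * 1024];  // ~16 KB of mark bits per 1 MB chunk
    unsigned usedArenas;                 // 0 for an empty chunk

    bool unused() const { return usedArenas == 0; }
};

typedef std::unordered_set<Chunk *> ChunkSet;

// Before the patch: empty chunks stay in the hashtable, so their bitmaps
// are cleared (and their pages touched) even though nothing can be marked
// in them. The fix is to keep only non-empty chunks in the set.
void clearMarkBitmaps(ChunkSet &chunks) {
    for (Chunk *chunk : chunks)
        std::memset(chunk->markBitmap, 0, sizeof(chunk->markBitmap));
}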
Comment 1•13 years ago
Nice observation! Breaking the big changes up into smaller changes is a good idea.
Updated•13 years ago
Whiteboard: [MemShrink]
Comment 2•13 years ago (Assignee)
Part 1 implements the idea of not including empty chunks in the chunk hash. However, it regressed V8 in the shell on Linux. Apparently with the patch more chunks are released to the system and newly allocated during the benchmark. This is a direct consequence of not using empty chunks until all other chunks are fully used, so there are more chances that a chunk ages and gets released.
But this regression on Linux can be fully offset with another idea from bug 671702, that of putting all chunks with at least one available arena into a list and using it for allocations. That will come in part 2, but here are some numbers:
MC tip against part1 in thread-safe js shell on Linux:
TEST COMPARISON FROM TO DETAILS
=============================================================================
** TOTAL **: *1.007x as slow* 1532.9ms +/- 0.1% 1543.6ms +/- 0.1% significant
=============================================================================
v8: *1.007x as slow* 1532.9ms +/- 0.1% 1543.6ms +/- 0.1% significant
crypto: - 202.7ms +/- 0.2% 202.6ms +/- 0.2%
deltablue: *1.008x as slow* 282.2ms +/- 0.2% 284.6ms +/- 0.3% significant
earley-boyer: ?? 252.6ms +/- 0.3% 253.2ms +/- 0.2% not conclusive: might be *1.002x as slow*
raytrace: *1.006x as slow* 196.5ms +/- 0.2% 197.7ms +/- 0.2% significant
regexp: *1.015x as slow* 192.2ms +/- 0.3% 195.1ms +/- 0.2% significant
richards: ?? 205.7ms +/- 0.4% 206.5ms +/- 0.5% not conclusive: might be *1.004x as slow*
splay: *1.015x as slow* 200.9ms +/- 0.3% 204.0ms +/- 0.2% significant
MC tip against part1 + part2:
TEST COMPARISON FROM TO DETAILS
=============================================================================
** TOTAL **: 1.011x as fast 1532.9ms +/- 0.1% 1516.7ms +/- 0.1% significant
=============================================================================
v8: 1.011x as fast 1532.9ms +/- 0.1% 1516.7ms +/- 0.1% significant
crypto: - 202.7ms +/- 0.2% 202.7ms +/- 0.3%
deltablue: - 282.2ms +/- 0.2% 281.8ms +/- 0.3%
earley-boyer: *1.024x as slow* 252.6ms +/- 0.3% 258.6ms +/- 0.4% significant
raytrace: 1.003x as fast 196.5ms +/- 0.2% 195.9ms +/- 0.1% significant
regexp: *1.017x as slow* 192.2ms +/- 0.3% 195.6ms +/- 0.2% significant
richards: - 205.7ms +/- 0.4% 205.4ms +/- 0.4%
splay: 1.137x as fast 200.9ms +/- 0.3% 176.7ms +/- 0.4% significant
During score-based benchmarking in the browser on http://v8.googlecode.com/svn/data/benchmarks/v6/run.html I see no differences:
tip part1 part1+part2
Score: 3953 3959 3952
Richards: 6234 6047 6128
DeltaBlue: 4093 3848 3930
Crypto: 6380 6342 6033
RayTrace: 2719 2948 2942
EarleyBoyer: 3861 3915 3859
RegExp: 1674 1736 1753
Splay: 5275 5157 5207
Attachment #548385 - Flags: review?(anygregor)
Comment 3•13 years ago (Assignee)
Part 2 replaces the system and user hashtables with one global hashtable and separate user and system lists of available chunks. For benchmarking see above.
Attachment #548386 - Flags: review?(anygregor)
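As a rough sketch of the layout part 2 describes: one global set of non-empty chunks plus separate system and user lists of chunks that still have free arenas, so allocation never has to enumerate the hashtable. The field names gcChunkSet, gcSystemAvailableChunks and gcUserAvailableChunks appear in the review quotes later in this bug; everything else here is an assumption, not the patch itself.

#include <unordered_set>

// Illustrative sketch of the part 2 data layout, not the actual patch.
struct Chunk {
    Chunk *next;            // link in an "available chunks" list
    unsigned freeArenas;    // arenas still allocatable in this chunk
};

struct Runtime {
    // One global table of all non-empty chunks, used for example to answer
    // "does this address belong to the GC heap?" queries.
    std::unordered_set<Chunk *> gcChunkSet;

    // Chunks with at least one free arena, split into system and user
    // lists; an allocation just looks at the right list head.
    Chunk *gcSystemAvailableChunks = nullptr;
    Chunk *gcUserAvailableChunks = nullptr;
};

// Picking a chunk for allocation becomes a list-head lookup instead of a
// hashtable enumeration.
Chunk *pickAvailableChunk(Runtime &rt, bool isSystemAllocation) {
    return isSystemAllocation ? rt.gcSystemAvailableChunks
                              : rt.gcUserAvailableChunks;
}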
Comment 4•13 years ago (Assignee)
Note that the patches here depend on the patch from bug 673760. The try server builds for part 1 + part 2 are at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ibukanov@mozilla.com-ba3113b5ae27/
Depends on: 673760
Updated•13 years ago
Whiteboard: [MemShrink] → [MemShrink:P2]
Comment 5•13 years ago
(In reply to comment #2)
> Created attachment 548385 [details] [diff] [review] [review]
> part 1
>
> Part 1 implements that idea of not including the empty chunks into the chunk
> hash. However, it regressed on v8 in a shell on Linux. Apparently with the
> patch more chunks are released to the system and newly-allocated during the
> benchmark. This is a direct consequence of not using empty chunks until all
> other chunks are fully used so there are more chances that a chunk ages and
> released.
>
I am not convinced by your regression explanation. I looked at the chunk allocations and they don't really change with your first patch (at least on OS X). Have you measured it? The GCTimer tells you how many chunks are allocated/released since the last GC.
I am not a big fan of making big GC changes right now. All the GC behavior is off (30% pause time regression) because of the instrumentation patches from Bill.
Comment 6•13 years ago (Assignee)
(In reply to comment #5)
> I am not convinced by your regression explanation. I looked at the chunks
> allocation and they don't really change with your first patch (at least on
> OS X). Have you measured it? The GCTimer tells you how many chunks are
> allocated/released since the last GC.
I patched the code so that the chunk allocation counters are never reset and are only reported at js_FinishGC. This way I can see the number of chunk allocation calls during the whole benchmark suite. For V8 the number was 154 without the patch and 158 with it, so the patch releases empty chunks slightly more aggressively and the effect from comment 0 is visible even during a V8 run.
>
> I am not a big fan of making big GC changes right now. All the GC behavior
> is off (30% pause time regression) because of the instrumentation patches
> from Bill.
But why is this bad? I try to redo the base tests when doing measurements after syncing with the tree, unless I can guess that the changes would not affect the benchmarks... Or is that too time-consuming in your setup?
Comment 7•13 years ago (Assignee)
Here is a rebased patch.
Attachment #548385 - Attachment is obsolete: true
Attachment #548385 - Flags: review?(anygregor)
Attachment #549733 - Flags: review?(anygregor)
Comment 8•13 years ago (Assignee)
Attachment #549736 - Flags: review?(anygregor)
Updated•13 years ago (Assignee)
Attachment #548386 - Attachment is obsolete: true
Attachment #548386 - Flags: review?(anygregor)
Comment 9•13 years ago
Do we have to store the "age" in the chunk address? I don't like this. I'd rather use an additional field in chunkinfo or reuse another one.
Comment 10•13 years ago (Assignee)
(In reply to comment #9)
> Do we have to store the "age" in the chunk address? I don't like this. I'd
> rather use an additional field in chunkinfo or reuse another one.
Storing the age outside the chunk allows us not to reference the empty chunk at all, avoiding TLB/cache misses. With big chunks this does not matter, but with smaller chunks it became noticeable. Initially I used a struct with 2 fields for that, but with smaller chunks this just wastes memory and does not make the source any simpler.
Comment 11•13 years ago
(In reply to comment #10)
> (In reply to comment #9)
> > Do we have to store the "age" in the chunk address? I don't like this. I'd
> > rather use an additional field in chunkinfo or reuse another one.
>
> Storing the age outside the chunk allows not to reference the empty chunk at
> all avoiding TLB/cache misses. With big chunks this does not matter, but
> with smaller chunks this became noticeable. Initially I have used a struct
> with 2 fields for that, but with smaller chunks this just waste memory and
> does not make source any simpler.
We only iterate over all empty chunks in ExpireGCChunks, and that happens on the background thread, right? Is this really a regression on a multithreaded build?
Comment 12•13 years ago (Assignee)
(In reply to comment #11)
>
> We only iterate over all empty chunks in ExpireGCChunks and this happens on
> the background thread right? This is really a regression on a multithreaded
> build?
I noticed a spike of cache misses with cachegrind. With a single-core CPU that becomes expensive. For example, with smaller chunks we can trivially end up with 1000 or more empty aging chunks. Accessing each empty chunk's info header pretty much guarantees a TLB and L2 miss per access. Assuming those misses cost about 250 cycles each, that is roughly 250,000 cycles for 1000 chunks, which wastes about 0.15 ms per GC on a recently released single-core Atom (at roughly 1.6 GHz), or even more on ARM.
If this is an acceptable price for simpler code, then I will replace the age vector with a linked list.
Comment 13•13 years ago
Given that this is on the background thread, 0.15 ms seems perfectly acceptable to me.
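A minimal sketch of the approach comment 14 settles on: the age lives in the chunk's info header and empty chunks sit on a singly linked list that an expiration pass walks on the background thread. The names info.age, info.next, gcEmptyChunkListHead, gcEmptyChunkCount and MAX_EMPTY_CHUNK_AGE appear in the patch quotes below; the age limit's value and the rest of the code are assumptions, not the patch itself.

#include <cstddef>

// Illustrative sketch of aging empty chunks on a singly linked list; not
// the patch's actual code. The age limit value is assumed.
static const unsigned MAX_EMPTY_CHUNK_AGE = 4;

struct Chunk {
    struct ChunkInfo {
        Chunk *next;    // link in the empty-chunk list
        unsigned age;   // number of GCs survived while empty
    } info;
};

struct Runtime {
    Chunk *gcEmptyChunkListHead = nullptr;
    std::size_t gcEmptyChunkCount = 0;
};

// Stand-in for unmapping the chunk and returning its memory to the OS.
static void releaseChunk(Chunk *chunk) { delete chunk; }

// Run once per GC on the background thread: bump the age of each empty
// chunk and release the ones that stayed empty for too long. Only the
// headers of chunks on this list are touched.
void expireEmptyChunks(Runtime &rt) {
    Chunk **chunkp = &rt.gcEmptyChunkListHead;
    while (Chunk *chunk = *chunkp) {
        if (chunk->info.age == MAX_EMPTY_CHUNK_AGE) {
            *chunkp = chunk->info.next;   // unlink, then free
            rt.gcEmptyChunkCount--;
            releaseChunk(chunk);
        } else {
            chunk->info.age++;
            chunkp = &chunk->info.next;
        }
    }
}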
Comment 14•13 years ago (Assignee)
The new version uses a linked list for empty chunks.
Attachment #549733 - Attachment is obsolete: true
Attachment #549736 - Attachment is obsolete: true
Attachment #549733 - Flags: review?(anygregor)
Attachment #549736 - Flags: review?(anygregor)
Attachment #550195 - Flags: review?(anygregor)
Comment 15•13 years ago (Assignee)
The new version accounts for the empty chunk list changes from part 1.
Attachment #550208 - Flags: review?(anygregor)
Comment 16•13 years ago
Comment on attachment 550195 [details] [diff] [review]
part 1
>+ if (unused()) {
>+ rt->gcUserChunkSet.remove(this);
>+ rt->gcSystemChunkSet.remove(this);
>+
>+ /*
>+ * We must not release empty chunks until we done with finalization to
until we _are_ done
>+ * allow calling IsAboutToBeFinalized/Cell::isMarked for finalized GC
>+ * things in empty chunks. So we always add the chunk to the empty
>+ * set.
I guess this is similar to the other IsAboutToBeFinalized discussion we had. We could do it if we stick to "no more IsAboutToBeFinalized after we started finalizing" right? The new code makes sense for performance since we always free the chunk in the background even for arenas with foreground finalization. Maybe you could reflect this in the comment.
>+ /* Use an empty chunk when available or allocate a new one. */
>+ chunk = rt->gcEmptyChunkListHead;
>+ if (chunk) {
>+ JS_ASSERT(chunk->unused());
>+ JS_ASSERT(!rt->gcUserChunkSet.has(chunk));
>+ JS_ASSERT(!rt->gcSystemChunkSet.has(chunk));
>+ JS_ASSERT(rt->gcEmptyChunkCount >= 1);
>+ rt->gcEmptyChunkListHead = chunk->info.link;
>+ rt->gcEmptyChunkCount--;
>+ } else {
>+ chunk = AllocateGCChunk(rt);
> if (!chunk)
> return NULL;
>+
>+ chunk->init(rt);
>+ rt->gcChunkAllocationSinceLastGC = true;
Now it would make sense to move the last line out of the else for smaller memory footprint. We should always try to perform IDLE_GC if we start using new memory. Before we didn't have enough information about the ratio of used vs. unused chunks.
>
>+ uintptr_t address() const {
>+ uintptr_t addr = reinterpret_cast<uintptr_t>(this);
>+ JS_ASSERT(!(addr & GC_CHUNK_MASK));
>+ return addr;
>+ }
>+
Still needed?
Attachment #550195 - Flags: review?(anygregor) → review+
Comment 17•13 years ago (Assignee)
(In reply to comment #16)
> I guess this is similar to the other IsAboutToBeFinalized discussion we had.
> We could do it if we stick to "no more IsAboutToBeFinalized after we started
> finalizing" right? The new code makes sense for performance since we always
> free the chunk in the background even for arenas with foreground
> finalization. Maybe you could reflect this in the comment.
Our shape code uses Cell::isMarked during finalization in an essential way, so this is not only about IsAboutToBeFinalized.
> >+ /* Use an empty chunk when available or allocate a new one. */
> >+ chunk = rt->gcEmptyChunkListHead;
> >+ if (chunk) {
...
> >+ } else {
> >+ chunk = AllocateGCChunk(rt);
...
> >+ rt->gcChunkAllocationSinceLastGC = true;
>
> Now it would make sense to move the last line out of the else for smaller
> memory footprint. We should always try to perform IDLE_GC if we start using
> new memory. Before we didn't have enough information about the ratio of used
> vs. unused chunks.
I will file a new bug about it to keep GC tuning outside this bug.
Comment 18•13 years ago (Assignee)
Here is the patch with fixed comments.
Attachment #550195 - Attachment is obsolete: true
Attachment #550208 - Attachment is obsolete: true
Attachment #550208 - Flags: review?(anygregor)
Attachment #550318 - Flags: review+
Comment 19•13 years ago (Assignee)
Here is a rebased part 2.
Attachment #550319 - Flags: review?(anygregor)
Comment 20•13 years ago (Assignee)
Gregor: do you have time to review part 2 here?
Comment 21•13 years ago
Comment on attachment 550319 [details] [diff] [review]
part 2
Uh I forgot about the 2nd part. I don't have the time for my usual review process with all the testing so it's better if Bill takes this.
Attachment #550319 - Flags: review?(anygregor) → review?(wmccloskey)
Comment 22•13 years ago
Comment on attachment 550319 [details] [diff] [review]
part 2
Review of attachment 550319 [details] [diff] [review]:
-----------------------------------------------------------------
Nice patch!
::: js/src/jscntxt.h
@@ +376,5 @@
>
> /* Garbage collector state, used by jsgc.c. */
> + js::GCChunkSet gcChunkSet;
> + js::gc::Chunk *gcSystemAvailableChunks;
> + js::gc::Chunk *gcUserAvailableChunks;
I would prefer different names here, as well as a comment.
It seems inconsistent that gcEmptyChunkListHead ends in "Head" and the new ones don't. How about "gcAvailableSystem/UserChunkListHead"? It's pretty long, but we almost never use the name in the code.
The comment should say something about how gcChunkSet only contains non-empty chunks and that the Available lists contain chunks that are not empty and not completely full.
I could also see it being confusing that the empty chunk list does not have prev pointers. If we keep this inconsistency, it should be mentioned in the comment. But I think it would be better just to maintain all the lists in the same way (see below).
::: js/src/jsgc.cpp
@@ +346,4 @@
> }
>
> +inline Chunk **
> +FindAvailableChunkList(JSCompartment *comp)
I'm not sure if this is SpiderMonkey style, but I usually try to use "Find" for things that require a search. Maybe "GetAvailableChunkList"?
@@ +353,4 @@
> }
>
> +inline void
> +Chunk::addToAvailableList(JSCompartment *comp)
How about generalizing this to addToChunkList and passing in a Chunk** instead of a compartment? Then we could use this function (and a similarly renamed removeFromChunkList) for all three chunk lists. I think it would be cleaner this way.
@@ +449,5 @@
> * in empty chunks. So we add the chunk to the empty set even during
> * GC_SHRINK.
> */
> info.age = 0;
> + info.next = rt->gcEmptyChunkListHead;
This could change to addToChunkList(&rt->gcEmptyChunkListHead).
@@ +525,5 @@
> return chunk;
> }
>
> static void
> ReleaseEmptyGCChunks(JSRuntime *rt)
I realize it's not part of your patch, but this function seems somewhat redundant. Could we just remove it, and then modify the main ExpireGCChunks loop to always release the chunk if it's a shrinking GC?
@@ +556,3 @@
> JS_ASSERT(chunk->info.age <= MAX_EMPTY_CHUNK_AGE);
> if (chunk->info.age == MAX_EMPTY_CHUNK_AGE) {
> + *chunkp = chunk->info.next;
And here we could use removeFromChunkList.
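A rough sketch of the generalization suggested above, with addToChunkList/removeFromChunkList taking the list head directly so all three chunk lists could share one insertion/removal path. The prev-pointer-to-the-next-field link lets removal work without a search; this is only an illustration of the suggestion, not the landed patch, which per comment 23 kept the empty-chunk list singly linked.

// Illustrative sketch of the suggested generic chunk-list helpers.
struct Chunk {
    struct ChunkInfo {
        Chunk *next;
        Chunk **prevp;   // points at the field that holds this chunk
    } info;

    void addToChunkList(Chunk **listHeadp) {
        info.next = *listHeadp;
        if (info.next)
            info.next->info.prevp = &info.next;
        info.prevp = listHeadp;
        *listHeadp = this;
    }

    void removeFromChunkList() {
        *info.prevp = info.next;
        if (info.next)
            info.next->info.prevp = info.prevp;
    }
};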
Attachment #550319 - Flags: review?(wmccloskey) → review+
Comment 23•13 years ago (Assignee)
The new version uses the longer ListHead names, adds comments, and inlines/removes ReleaseEmptyGCChunks. I have kept the empty chunk list singly linked. In bug 671702 I will move the available lists into the compartment, and the suggested reorganization would not work in that setup.
Attachment #550319 - Attachment is obsolete: true
Attachment #552770 - Flags: review+
Comment 24•13 years ago (Assignee)
http://hg.mozilla.org/integration/mozilla-inbound/rev/8c36a72adee9
http://hg.mozilla.org/integration/mozilla-inbound/rev/3210abdedf8a
Whiteboard: [MemShrink:P2] → [MemShrink:P2][inbound]
Comment 25•13 years ago
http://hg.mozilla.org/mozilla-central/rev/8c36a72adee9
http://hg.mozilla.org/mozilla-central/rev/3210abdedf8a
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink:P2][inbound] → [MemShrink:P2]
Target Milestone: --- → mozilla8
Comment 26•13 years ago
Igor, can you briefly summarize what the final patches did and what the benefits are? Also, you haven't given any performance numbers after comment 2, which were for an early version. Do you have any?
Comment 27•13 years ago (Assignee)
(In reply to Nicholas Nethercote [:njn] from comment #26)
> Igor, can you briefly summarize the what final patches did and what the
> benefits are?
The goal of the patches here is to make the patch for bug 671702 smaller and to make the performance characteristics there more visible. I filed this as a separate bug as it makes sense to do on its own.
> Also, you haven't given any performance numbers after comment
> 2, which were for an early version. Do you have any?
No, not for the landed patch. The changes from the measured versions should not affect benchmarks, plus this patch combined with other patches in my patch queue unrelated to this work gives a small V8 win that matches comment 2.
Comment 28•13 years ago
(In reply to Igor Bukanov from comment #27)
> No, not for the landed patch. The changes from the measured versions should
> not affect benchmarks, plus this patch combined with other patches in my
> patch queue unrelated to this work give a small V8 win that matches the
> comment 2.
Do you have any stats from a MemShrink POV? :-)
Comment 29•13 years ago (Assignee)
(In reply to Ed Morley [:edmorley] from comment #28)
> Do you have any stats from a MemShrink POV? :-)
I redid the http://gregor-wagner.com/tmp/mem test with a 32-bit Firefox build on Linux. Here are the stats without and with the changesets from this bug:
before:
initial all tabs no tabs no window
heap-allocated 29.04 MB 1,191.33 MB 575.62 MB 81.41 MB
heap-committed 37.00 MB 1,236.00 MB 1,224.00 MB 940.00 MB
heap-dirty 2.32 MB 3.60 MB 2.37 MB 2.35 MB
heap-unallocated 7.96 MB 44.67 MB 648.38 MB 858.59 MB
js-compartments-system 2 2 2 2
js-compartments-user 1 251 149 1
js-gc-heap 7.00 MB 314.00 MB 310.00 MB 32.00 MB
js-gc-heap-arena-unused 0.72 MB 53.94 MB 55.24 MB 2.48 MB
js-gc-heap-chunk-clean-unuse 0.00 MB 6.00 MB 0.00 MB 0.00 MB
js-gc-heap-chunk-dirty-unuse 2.77 MB 18.62 MB 196.83 MB 25.26 MB
js-gc-heap-unused-fraction 49.77% 25.02% 81.31% 86.70%
page-faults-hard 2 18 18 18
page-faults-soft 17,037 1,465,915 1,573,512 1,590,947
resident 71.39 MB 1,396.98 MB 1,044.98 MB 290.79 MB
vsize 341.13 MB 1,804.29 MB 1,657.52 MB 1,289.02 MB
after:
initial all tabs no tabs no window
heap-allocated 30.07 MB 1,092.00 MB 523.90 MB 76.15 MB
heap-committed 40.00 MB 1,119.00 MB 1,120.00 MB 874.00 MB
heap-dirty 3.05 MB 3.27 MB 3.19 MB 3.21 MB
heap-unallocated 9.93 MB 27.00 MB 596.10 MB 797.85 MB
js-compartments-system 2 2 2 2
js-compartments-user 1 249 150 1
js-gc-heap 7.00 MB 259.00 MB 268.00 MB 29.00 MB
js-gc-heap-arena-unused 0.94 MB 47.70 MB 51.06 MB 2.55 MB
js-gc-heap-chunk-clean-unuse 0.00 MB 0.00 MB 0.00 MB 0.00 MB
js-gc-heap-chunk-dirty-unuse 2.51 MB 5.12 MB 163.02 MB 22.12 MB
js-gc-heap-unused-fraction 49.18% 20.39% 79.88% 85.05%
page-faults-hard 2 6 6 7
page-faults-soft 18,096 676,624 778,808 799,441
resident 69.58 MB 1,264.78 MB 929.28 MB 213.10 MB
vsize 340.32 MB 1,655.29 MB 1,599.58 MB 1,207.57 MB
So the patch slightly reduced the number of unallocated arenas in GC chunks, resulting in slightly better memory usage overall. I suppose this comes from using empty chunks only after allocating all GC arenas in non-empty chunks. Before the patch we effectively selected a chunk for new allocations at random from both empty and non-empty ones.
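For illustration, a small sketch of the allocation order this explanation implies after the patch: partly used chunks first, then the empty-chunk list, and a freshly mapped chunk only as a last resort, so empty chunks get a chance to age out and be released. All names here are hypothetical stand-ins, not the actual jsgc.cpp identifiers.

// Illustrative sketch of the post-patch allocation order, not real code.
struct Chunk {
    Chunk *next = nullptr;
};

struct Runtime {
    Chunk *availableChunkListHead = nullptr;  // chunks with free arenas
    Chunk *emptyChunkListHead = nullptr;      // fully empty, aging chunks
};

// Stand-in for mapping a fresh chunk from the OS.
static Chunk *allocateNewChunk() { return new Chunk(); }

Chunk *pickChunkForAllocation(Runtime &rt) {
    if (Chunk *chunk = rt.availableChunkListHead)
        return chunk;                         // reuse a partly used chunk first
    if (Chunk *chunk = rt.emptyChunkListHead) {
        rt.emptyChunkListHead = chunk->next;  // then recycle an empty chunk
        return chunk;
    }
    return allocateNewChunk();                // finally map a brand new one
}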
Comment 30•13 years ago
> before:
> js-gc-heap 7.00 MB 314.00 MB 310.00 MB 32.00 MB
> page-faults-hard 2 18 18 18
> page-faults-soft 17,037 1,465,915 1,573,512 1,590,947
> resident 71.39 MB 1,396.98 MB 1,044.98 MB 290.79 MB
> vsize 341.13 MB 1,804.29 MB 1,657.52 MB 1,289.02 MB
>
> after:
> js-gc-heap 7.00 MB 259.00 MB 268.00 MB 29.00 MB
> page-faults-hard 2 6 6 7
> page-faults-soft 18,096 676,624 778,808 799,441
> resident 69.58 MB 1,264.78 MB 929.28 MB 213.10 MB
> vsize 340.32 MB 1,655.29 MB 1,599.58 MB 1,207.57 MB
That's really good! Peak resident dropped by 9.5%, final resident dropped by 27%. And I never know what to make of soft page faults, but the fact that they halved can only be good :)