Closed Bug 673795 Opened 13 years ago Closed 13 years ago

Avoid hashing empty GC chunks

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla8

People

(Reporter: igor, Assigned: igor)

References

Details

(Keywords: memory-footprint, Whiteboard: [MemShrink:P2])

Attachments

(2 files, 7 obsolete files)

An apparent win from smaller GC chunks on Linux from the patches in bug 671702 comes from not including the empty GC chunks in the chunk hashtables. As these tables are enumerated at the beginning of the GC to clear the mark bitmaps, each empty 1MB chunk implies useless clearing of 16K. Another problem is that when searching for a chunk with available arenas the code also enumerates the chunk hashtables. As such, an empty chunk can be picked up even when there are other chunks with available arenas, preventing its release during the GC. So to clearly see the effect of smaller chunks, the idea is to factor out that part of bug 671702 into a separate patch here.
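To make the cost concrete, here is a hedged sketch of the problem. The names and the chunk layout are hypothetical, not SpiderMonkey's actual code; they only illustrate why a set that also holds empty chunks makes the start-of-GC bitmap sweep pay 16K of clearing per empty 1MB chunk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <unordered_set>

// Hypothetical layout: a 1 MB GC chunk carries a ~16 KB mark bitmap
// (one mark bit per 8-byte cell, 1 MB / 64 = 16 KB).
constexpr size_t ChunkSize = size_t(1) << 20;
constexpr size_t MarkBitmapBytes = ChunkSize / 64;   // 16 KB

struct Chunk {
    uint8_t markBitmap[MarkBitmapBytes];
    bool empty = false;
};

// Pre-patch behavior: the enumerated chunk set also holds empty
// chunks, so this sweep clears 16 KB of bitmap per empty chunk for
// nothing. Keeping empty chunks out of the set avoids that work.
size_t clearMarkBitmaps(const std::unordered_set<Chunk*>& chunkSet) {
    size_t cleared = 0;
    for (Chunk* c : chunkSet) {
        std::memset(c->markBitmap, 0, MarkBitmapBytes);
        cleared += MarkBitmapBytes;
    }
    return cleared;
}
```

With two empty chunks out of three, two thirds of the cleared bytes are wasted; a set holding only non-empty chunks clears exactly what marking will use.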
Nice observation! Breaking the big changes up into smaller changes is a good idea.
Whiteboard: [MemShrink]
Attached patch part 1 (obsolete) — Splinter Review
Part 1 implements that idea of not including the empty chunks into the chunk hash. However, it regressed on v8 in a shell on Linux. Apparently with the patch more chunks are released to the system and newly allocated during the benchmark. This is a direct consequence of not using empty chunks until all other chunks are fully used, so there are more chances that a chunk ages and is released. But this regression on Linux can be fully offset with another idea from bug 671702, that of putting all chunks with at least one available arena into a list and using it for allocations. That will come in part 2, but here are some numbers.

MC tip against part1 in thread-safe js shell on Linux:

TEST              COMPARISON         FROM               TO                 DETAILS
==================================================================================
** TOTAL **:      *1.007x as slow*   1532.9ms +/- 0.1%  1543.6ms +/- 0.1%  significant
==================================================================================
v8:               *1.007x as slow*   1532.9ms +/- 0.1%  1543.6ms +/- 0.1%  significant
  crypto:         -                  202.7ms +/- 0.2%   202.6ms +/- 0.2%
  deltablue:      *1.008x as slow*   282.2ms +/- 0.2%   284.6ms +/- 0.3%   significant
  earley-boyer:   ??                 252.6ms +/- 0.3%   253.2ms +/- 0.2%   not conclusive: might be *1.002x as slow*
  raytrace:       *1.006x as slow*   196.5ms +/- 0.2%   197.7ms +/- 0.2%   significant
  regexp:         *1.015x as slow*   192.2ms +/- 0.3%   195.1ms +/- 0.2%   significant
  richards:       ??                 205.7ms +/- 0.4%   206.5ms +/- 0.5%   not conclusive: might be *1.004x as slow*
  splay:          *1.015x as slow*   200.9ms +/- 0.3%   204.0ms +/- 0.2%   significant

MC tip against part1 + part2:

TEST              COMPARISON         FROM               TO                 DETAILS
==================================================================================
** TOTAL **:      1.011x as fast     1532.9ms +/- 0.1%  1516.7ms +/- 0.1%  significant
==================================================================================
v8:               1.011x as fast     1532.9ms +/- 0.1%  1516.7ms +/- 0.1%  significant
  crypto:         -                  202.7ms +/- 0.2%   202.7ms +/- 0.3%
  deltablue:      -                  282.2ms +/- 0.2%   281.8ms +/- 0.3%
  earley-boyer:   *1.024x as slow*   252.6ms +/- 0.3%   258.6ms +/- 0.4%   significant
  raytrace:       1.003x as fast     196.5ms +/- 0.2%   195.9ms +/- 0.1%   significant
  regexp:         *1.017x as slow*   192.2ms +/- 0.3%   195.6ms +/- 0.2%   significant
  richards:       -                  205.7ms +/- 0.4%   205.4ms +/- 0.4%
  splay:          1.137x as fast     200.9ms +/- 0.3%   176.7ms +/- 0.4%   significant

During score-based benchmarking in the browser on http://v8.googlecode.com/svn/data/benchmarks/v6/run.html I see no differences:

             tip   part1  part1+part2
Score:       3953  3959   3952
Richards:    6234  6047   6128
DeltaBlue:   4093  3848   3930
Crypto:      6380  6342   6033
RayTrace:    2719  2948   2942
EarleyBoyer: 3861  3915   3859
RegExp:      1674  1736   1753
Splay:       5275  5157   5207
Attachment #548385 - Flags: review?(anygregor)
Attached patch part 2 (obsolete) — Splinter Review
Part 2 replaces the system and user hashtables with one global hashtable and separate user and system lists of available chunks. For benchmarking see above.
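The layout part 2 describes can be sketched as follows. This is an illustrative model only; the names and structures are not SpiderMonkey's (the real patch is attached to this bug).

```cpp
#include <cassert>
#include <unordered_set>

// Sketch: one global set of all non-empty chunks, plus separate
// "available" lists for system and user chunks that still have at
// least one free arena.
struct Chunk {
    Chunk* next = nullptr;
    bool isSystem = false;
};

struct Runtime {
    std::unordered_set<Chunk*> chunkSet;  // every non-empty chunk
    Chunk* systemAvailable = nullptr;     // heads of the available lists
    Chunk* userAvailable = nullptr;
};

Chunk** availableChunkList(Runtime& rt, bool systemAllocation) {
    return systemAllocation ? &rt.systemAvailable : &rt.userAvailable;
}

// Allocation pops from the matching available list instead of
// enumerating a hashtable to find a chunk with free arenas.
Chunk* pickChunk(Runtime& rt, bool systemAllocation) {
    Chunk** listHead = availableChunkList(rt, systemAllocation);
    Chunk* chunk = *listHead;
    if (chunk)
        *listHead = chunk->next;
    return chunk;
}
```

The design point is that finding a chunk with a free arena becomes O(1) list-head access, and fully-used or empty chunks never appear in the lists at all.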
Attachment #548386 - Flags: review?(anygregor)
Note that the patches here depend on the patch from bug 673760. The try server build for part1+part2 is at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/ibukanov@mozilla.com-ba3113b5ae27/
Depends on: 673760
Whiteboard: [MemShrink] → [MemShrink:P2]
(In reply to comment #2)
> Created attachment 548385 [details] [diff] [review]
> part 1
>
> Part 1 implements that idea of not including the empty chunks into the chunk
> hash. However, it regressed on v8 in a shell on Linux. Apparently with the
> patch more chunks are released to the system and newly-allocated during the
> benchmark. This is a direct consequence of not using empty chunks until all
> other chunks are fully used so there are more chances that a chunk ages and
> released.

I am not convinced by your regression explanation. I looked at the chunk allocations and they don't really change with your first patch (at least on OS X). Have you measured it? The GCTimer tells you how many chunks are allocated/released since the last GC.

I am not a big fan of making big GC changes right now. All the GC behavior is off (30% pause time regression) because of the instrumentation patches from Bill.
(In reply to comment #5)
> I am not convinced by your regression explanation. I looked at the chunks
> allocation and they don't really change with your first patch (at least on
> OS X). Have you measured it? The GCTimer tells you how many chunks are
> allocated/released since the last GC.

I patched the code so the chunk allocation counters are never reset and are only reported at js_FinishGC. This way I can see the number of chunk allocation calls during the whole benchmark suite. For v8 without the patch the number was 154; with the patch it becomes 158, so the patch releases empty chunks slightly more aggressively and the effect from comment 0 is visible even during a v8 run.

> I am not a big fan of making big GC changes right now. All the GC behavior
> is off (30% pause time regression) because of the instrumentation patches
> from Bill.

But why is this bad? I tried to re-do base tests when doing measurements after syncing with the tree unless I could guess that the changes would not affect benchmarks. Or is that too time consuming in your setup?
Attached patch part 1 (obsolete) — Splinter Review
Here is a rebased patch.
Attachment #548385 - Attachment is obsolete: true
Attachment #548385 - Flags: review?(anygregor)
Attachment #549733 - Flags: review?(anygregor)
Attached patch part 2 (obsolete) — Splinter Review
Attachment #549736 - Flags: review?(anygregor)
Attachment #548386 - Attachment is obsolete: true
Attachment #548386 - Flags: review?(anygregor)
Do we have to store the "age" in the chunk address? I don't like this. I'd rather use an additional field in chunkinfo or reuse another one.
(In reply to comment #9)
> Do we have to store the "age" in the chunk address? I don't like this. I'd
> rather use an additional field in chunkinfo or reuse another one.

Storing the age outside the chunk means we never have to touch the empty chunk at all, avoiding TLB/cache misses. With big chunks this does not matter, but with smaller chunks it becomes noticeable. Initially I used a struct with 2 fields for that, but with smaller chunks this just wastes memory and does not make the source any simpler.
(In reply to comment #10)
> (In reply to comment #9)
> > Do we have to store the "age" in the chunk address? I don't like this. I'd
> > rather use an additional field in chunkinfo or reuse another one.
>
> Storing the age outside the chunk allows not to reference the empty chunk at
> all avoiding TLB/cache misses. With big chunks this does not matter, but
> with smaller chunks this became noticeable. Initially I have used a struct
> with 2 fields for that, but with smaller chunks this just waste memory and
> does not make source any simpler.

We only iterate over all empty chunks in ExpireGCChunks, and this happens on the background thread, right? Is this really a regression on a multithreaded build?
(In reply to comment #11)
> We only iterate over all empty chunks in ExpireGCChunks and this happens on
> the background thread right? This is really a regression on a multithreaded
> build?

I noticed a spike of cache misses with cachegrind. With a single-core CPU that becomes expensive. For example, with smaller chunks we can trivially end up with 1000 or more empty aging chunks. Accessing each empty chunk's info header pretty much guarantees a TLB and L2 miss on each access. Assuming that the cost of those misses is about 250 cycles, this wastes 0.15ms per GC on a recently released single-core Atom, or even more on ARM. If this is an acceptable price for simpler code then I will replace the age vector with a linked list.
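As a back-of-the-envelope check of the 0.15ms figure: the comment gives ~250 cycles per miss and ~1000 empty chunks; the ~1.6 GHz clock below is an assumed value for a single-core Atom of that era, not stated in the comment.

```cpp
#include <cassert>

// 1000 chunks * 250 cycles per miss, divided by an assumed 1.6 GHz
// clock, converted to milliseconds.
constexpr double emptyChunks = 1000.0;
constexpr double cyclesPerMiss = 250.0;
constexpr double clockHz = 1.6e9;
constexpr double wastedMs = emptyChunks * cyclesPerMiss / clockHz * 1e3;  // ~0.156 ms
```

The result is ~0.156 ms per GC, consistent with the "about 0.15ms" claim above.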
Given that this is on the background thread, 0.15ms seems perfectly acceptable to me.
Attached patch part 1 (obsolete) — Splinter Review
The new version uses a linked list for empty chunks.
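The linked-list approach can be sketched like this. It is a hypothetical simplification of the attached patch (real field names like info.link differ): release pushes onto a singly linked list, allocation pops, and a background expiry pass ages chunks and frees the ones that stay empty too long.

```cpp
#include <cassert>
#include <cstddef>

constexpr unsigned MAX_EMPTY_CHUNK_AGE = 4;  // illustrative threshold

struct Chunk {
    Chunk* next = nullptr;
    unsigned age = 0;
};

struct Runtime {
    Chunk* emptyChunkListHead = nullptr;
    size_t emptyChunkCount = 0;
};

// Releasing an empty chunk: push it onto the list with age reset.
void releaseToEmptyList(Runtime& rt, Chunk* c) {
    c->age = 0;
    c->next = rt.emptyChunkListHead;
    rt.emptyChunkListHead = c;
    rt.emptyChunkCount++;
}

// Allocation reuses an empty chunk before asking the OS for a new one.
Chunk* takeFromEmptyList(Runtime& rt) {
    Chunk* c = rt.emptyChunkListHead;
    if (c) {
        rt.emptyChunkListHead = c->next;
        rt.emptyChunkCount--;
    }
    return c;
}

// One expiry pass (run on the background thread in the real GC);
// returns how many chunks would be unmapped.
size_t expireEmptyChunks(Runtime& rt) {
    size_t freed = 0;
    Chunk** chunkp = &rt.emptyChunkListHead;
    while (Chunk* c = *chunkp) {
        if (c->age++ >= MAX_EMPTY_CHUNK_AGE) {
            *chunkp = c->next;  // unlink; a real GC would unmap here
            rt.emptyChunkCount--;
            freed++;
        } else {
            chunkp = &c->next;
        }
    }
    return freed;
}
```

Compared with the age vector, the list still touches each chunk's header during the expiry pass, but that pass runs off the main thread, which is why the earlier 0.15ms concern was judged acceptable.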
Attachment #549733 - Attachment is obsolete: true
Attachment #549736 - Attachment is obsolete: true
Attachment #549733 - Flags: review?(anygregor)
Attachment #549736 - Flags: review?(anygregor)
Attachment #550195 - Flags: review?(anygregor)
Attached patch part 2 (obsolete) — Splinter Review
The new version accounts for the empty chunk list changes from part 1.
Attachment #550208 - Flags: review?(anygregor)
Comment on attachment 550195 [details] [diff] [review]
part 1

>+ if (unused()) {
>+ rt->gcUserChunkSet.remove(this);
>+ rt->gcSystemChunkSet.remove(this);
>+
>+ /*
>+ * We must not release empty chunks until we done with finalization to

until we _are_ done

>+ * allow calling IsAboutToBeFinalized/Cell::isMarked for finalized GC
>+ * things in empty chunks. So we always add the chunk to the empty
>+ * set.

I guess this is similar to the other IsAboutToBeFinalized discussion we had. We could do it if we stick to "no more IsAboutToBeFinalized after we started finalizing", right? The new code makes sense for performance since we always free the chunk in the background even for arenas with foreground finalization. Maybe you could reflect this in the comment.

>+ /* Use an empty chunk when available or allocate a new one. */
>+ chunk = rt->gcEmptyChunkListHead;
>+ if (chunk) {
>+ JS_ASSERT(chunk->unused());
>+ JS_ASSERT(!rt->gcUserChunkSet.has(chunk));
>+ JS_ASSERT(!rt->gcSystemChunkSet.has(chunk));
>+ JS_ASSERT(rt->gcEmptyChunkCount >= 1);
>+ rt->gcEmptyChunkListHead = chunk->info.link;
>+ rt->gcEmptyChunkCount--;
>+ } else {
>+ chunk = AllocateGCChunk(rt);
> if (!chunk)
> return NULL;
>+
>+ chunk->init(rt);
>+ rt->gcChunkAllocationSinceLastGC = true;

Now it would make sense to move the last line out of the else for a smaller memory footprint. We should always try to perform IDLE_GC if we start using new memory. Before, we didn't have enough information about the ratio of used vs. unused chunks.

>
>+ uintptr_t address() const {
>+ uintptr_t addr = reinterpret_cast<uintptr_t>(this);
>+ JS_ASSERT(!(addr & GC_CHUNK_MASK));
>+ return addr;
>+ }
>+

Still needed?
Attachment #550195 - Flags: review?(anygregor) → review+
(In reply to comment #16)
> I guess this is similar to the other IsAboutToBeFinalized discussion we had.
> We could do it if we stick to "no more IsAboutToBeFinalized after we started
> finalizing" right? The new code makes sense for performance since we always
> free the chunk in the background even for arenas with foreground
> finalization. Maybe you could reflect this in the comment.

Our shape code uses Cell::isMarked during finalization in an essential way, so this is not only about IsAboutToBeFinalized.

> >+ /* Use an empty chunk when available or allocate a new one. */
> >+ chunk = rt->gcEmptyChunkListHead;
> >+ if (chunk) {
...
> >+ } else {
> >+ chunk = AllocateGCChunk(rt);
...
> >+ rt->gcChunkAllocationSinceLastGC = true;
>
> Now it would make sense to move the last line out of the else for smaller
> memory footprint. We should always try to perform IDLE_GC if we start using
> new memory. Before we didn't have enough information about the ratio of used
> vs. unused chunks.

I will file a new bug about it to keep GC tuning outside this bug.
Attached patch part 1Splinter Review
Here is the patch with fixed comments.
Attachment #550195 - Attachment is obsolete: true
Attachment #550208 - Attachment is obsolete: true
Attachment #550208 - Flags: review?(anygregor)
Attachment #550318 - Flags: review+
Attached patch part 2 (obsolete) — Splinter Review
Here is a rebased part 2.
Attachment #550319 - Flags: review?(anygregor)
Blocks: 676205
Gregor: do you have time to review part 2 here?
Comment on attachment 550319 [details] [diff] [review] part 2 Uh I forgot about the 2nd part. I don't have the time for my usual review process with all the testing so it's better if Bill takes this.
Attachment #550319 - Flags: review?(anygregor) → review?(wmccloskey)
Comment on attachment 550319 [details] [diff] [review]
part 2

Review of attachment 550319 [details] [diff] [review]:
-----------------------------------------------------------------

Nice patch!

::: js/src/jscntxt.h
@@ +376,5 @@
>
> /* Garbage collector state, used by jsgc.c. */
> + js::GCChunkSet gcChunkSet;
> + js::gc::Chunk *gcSystemAvailableChunks;
> + js::gc::Chunk *gcUserAvailableChunks;

I would prefer different names here, as well as a comment. It seems inconsistent that gcEmptyChunkListHead ends in "Head" and the new ones don't. How about "gcAvailableSystem/UserChunkListHead"? It's pretty long, but we almost never use the name in the code.

The comment should say something about how gcChunkSet contains only non-empty chunks and that the available lists contain chunks that are not empty and not completely full.

I could also see it being confusing that the empty chunk list does not have prev pointers. If we keep this inconsistency, it should be mentioned in the comment. But I think it would be better just to maintain all the lists in the same way (see below).

::: js/src/jsgc.cpp
@@ +346,4 @@
> }
>
> +inline Chunk **
> +FindAvailableChunkList(JSCompartment *comp)

I'm not sure if this is SpiderMonkey style, but I usually try to use "Find" for things that require a search. Maybe "GetAvailableChunkList"?

@@ +353,4 @@
> }
>
> +inline void
> +Chunk::addToAvailableList(JSCompartment *comp)

How about generalizing this to addToChunkList and passing in a Chunk** instead of a compartment? Then we could use this function (and a similarly renamed removeFromChunkList) for all three chunk lists. I think it would be cleaner this way.

@@ +449,5 @@
> * in empty chunks. So we add the chunk to the empty set even during
> * GC_SHRINK.
> */
> info.age = 0;
> + info.next = rt->gcEmptyChunkListHead;

This could change to addToChunkList(&rt->gcEmptyChunkListHead).
@@ +525,5 @@
> return chunk;
> }
>
> static void
> ReleaseEmptyGCChunks(JSRuntime *rt)

I realize it's not part of your patch, but this function seems somewhat redundant. Could we just remove it, and then modify the main ExpireGCChunks loop to always release the chunk if it's a shrinking GC?

@@ +556,3 @@
> JS_ASSERT(chunk->info.age <= MAX_EMPTY_CHUNK_AGE);
> if (chunk->info.age == MAX_EMPTY_CHUNK_AGE) {
> + *chunkp = chunk->info.next;

And here we could use removeFromChunkList.
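The addToChunkList/removeFromChunkList generalization the review suggests can be sketched as a small intrusive doubly linked list keyed on the list head, so one pair of helpers serves all three chunk lists. This is a hypothetical illustration of the suggestion, not the code that landed.

```cpp
#include <cassert>
#include <cstddef>

struct Chunk {
    Chunk* next = nullptr;
    Chunk** prevp = nullptr;  // address of the pointer that points at us

    // Push onto any list given its head pointer; works for the empty,
    // system-available, and user-available lists alike.
    void addToChunkList(Chunk** headp) {
        next = *headp;
        if (next)
            next->prevp = &next;
        prevp = headp;
        *headp = this;
    }

    // Unlink in O(1) without knowing which list we are on.
    void removeFromChunkList() {
        *prevp = next;
        if (next)
            next->prevp = prevp;
        next = nullptr;
        prevp = nullptr;
    }
};
```

The prevp back-pointer is what makes removal O(1) from the middle of a list; this matches the reviewer's point that the prev-less empty chunk list was the odd one out.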
Attachment #550319 - Flags: review?(wmccloskey) → review+
Attached patch part 2Splinter Review
The new version uses longer ListHead names, adds comments, and inlines/removes ReleaseEmptyGCChunks. I have kept the empty chunk list as a single list. In bug 671702 I will move the available lists to the compartment, and the suggested reorganization would not work in that setup.
Attachment #550319 - Attachment is obsolete: true
Attachment #552770 - Flags: review+
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [MemShrink:P2][inbound] → [MemShrink:P2]
Target Milestone: --- → mozilla8
Igor, can you briefly summarize what the final patches did and what the benefits are? Also, you haven't given any performance numbers after comment 2, which were for an early version. Do you have any?
(In reply to Nicholas Nethercote [:njn] from comment #26)
> Igor, can you briefly summarize what the final patches did and what the
> benefits are?

The goal of the patches here is to make the patch for bug 671702 smaller and to make its performance characteristics more visible there. I filed this as a separate bug as the change makes sense on its own.

> Also, you haven't given any performance numbers after comment
> 2, which were for an early version. Do you have any?

No, not for the landed patch. The changes from the measured versions should not affect benchmarks, and this patch combined with other patches in my patch queue unrelated to this work gives a small V8 win that matches comment 2.
(In reply to Igor Bukanov from comment #27)
> No, not for the landed patch. The changes from the measured versions should
> not affect benchmarks, plus this patch combined with other patches in my
> patch queue unrelated to this work give a small V8 win that matches the
> comment 2.

Do you have any stats from a MemShrink POV? :-)
(In reply to Ed Morley [:edmorley] from comment #28)
> Do you have any stats from a MemShrink POV? :-)

I redid the http://gregor-wagner.com/tmp/mem test with a 32-bit Firefox build on Linux. Here are the stats without and with the changesets from this bug:

before:
                              initial    all tabs     no tabs      no window
heap-allocated                29.04 MB   1,191.33 MB  575.62 MB    81.41 MB
heap-committed                37.00 MB   1,236.00 MB  1,224.00 MB  940.00 MB
heap-dirty                    2.32 MB    3.60 MB      2.37 MB      2.35 MB
heap-unallocated              7.96 MB    44.67 MB     648.38 MB    858.59 MB
js-compartments-system        2          2            2            2
js-compartments-user          1          251          149          1
js-gc-heap                    7.00 MB    314.00 MB    310.00 MB    32.00 MB
js-gc-heap-arena-unused       0.72 MB    53.94 MB     55.24 MB     2.48 MB
js-gc-heap-chunk-clean-unuse  0.00 MB    6.00 MB      0.00 MB      0.00 MB
js-gc-heap-chunk-dirty-unuse  2.77 MB    18.62 MB     196.83 MB    25.26 MB
js-gc-heap-unused-fraction    49.77%     25.02%       81.31%       86.70%
page-faults-hard              2          18           18           18
page-faults-soft              17,037     1,465,915    1,573,512    1,590,947
resident                      71.39 MB   1,396.98 MB  1,044.98 MB  290.79 MB
vsize                         341.13 MB  1,804.29 MB  1,657.52 MB  1,289.02 MB

after:
                              initial    all tabs     no tabs      no window
heap-allocated                30.07 MB   1,092.00 MB  523.90 MB    76.15 MB
heap-committed                40.00 MB   1,119.00 MB  1,120.00 MB  874.00 MB
heap-dirty                    3.05 MB    3.27 MB      3.19 MB      3.21 MB
heap-unallocated              9.93 MB    27.00 MB     596.10 MB    797.85 MB
js-compartments-system        2          2            2            2
js-compartments-user          1          249          150          1
js-gc-heap                    7.00 MB    259.00 MB    268.00 MB    29.00 MB
js-gc-heap-arena-unused       0.94 MB    47.70 MB     51.06 MB     2.55 MB
js-gc-heap-chunk-clean-unuse  0.00 MB    0.00 MB      0.00 MB      0.00 MB
js-gc-heap-chunk-dirty-unuse  2.51 MB    5.12 MB      163.02 MB    22.12 MB
js-gc-heap-unused-fraction    49.18%     20.39%       79.88%       85.05%
page-faults-hard              2          6            6            7
page-faults-soft              18,096     676,624      778,808      799,441
resident                      69.58 MB   1,264.78 MB  929.28 MB    213.10 MB
vsize                         340.32 MB  1,655.29 MB  1,599.58 MB  1,207.57 MB

So the patch slightly reduced the number of unallocated arenas in GC chunks, resulting in overall slightly better memory usage.
I suppose this comes from using empty chunks only after allocating all GC arenas in non-empty chunks. Before the patch we selected a chunk for new allocation randomly from both empty and non-empty ones.
> before:
> js-gc-heap        7.00 MB    314.00 MB    310.00 MB    32.00 MB
> page-faults-hard  2          18           18           18
> page-faults-soft  17,037     1,465,915    1,573,512    1,590,947
> resident          71.39 MB   1,396.98 MB  1,044.98 MB  290.79 MB
> vsize             341.13 MB  1,804.29 MB  1,657.52 MB  1,289.02 MB
>
> after:
> js-gc-heap        7.00 MB    259.00 MB    268.00 MB    29.00 MB
> page-faults-hard  2          6            6            7
> page-faults-soft  18,096     676,624      778,808      799,441
> resident          69.58 MB   1,264.78 MB  929.28 MB    213.10 MB
> vsize             340.32 MB  1,655.29 MB  1,599.58 MB  1,207.57 MB

That's really good! Peak resident dropped by 9.5%, final resident dropped by 27%. And I never know what to make of soft page faults, but the fact that they halved can only be good :)