Closed Bug 541140 Opened 15 years ago Closed 14 years ago

TM: don't return GCChunks immediately

Categories: Core :: JavaScript Engine, defect
Status: RESOLVED FIXED
People: (Reporter: gwagner, Assigned: gwagner)
Whiteboard: fixed-in-tracemonkey
Attachments (12 files, 12 obsolete files)
- patch, 30.08 KB
- patch, 36.94 KB
- patch, 36.96 KB
- image/png, 6.99 KB
- image/png, 5.95 KB
- image/png, 8.83 KB
- image/png, 9.48 KB
- image/png, 8.89 KB
- image/png, 9.60 KB
- image/png, 9.60 KB
- image/png, 10.15 KB
- patch, 9.03 KB
The measurements in bug 539532 show that we spend a lot of time returning GC pages to the OS during each GC. These pages have to be allocated again. The idea is to keep the pages around. This will reduce the GC pause time and object allocation time in general.
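The core idea can be sketched as follows. This is a hypothetical illustration, not the actual TraceMonkey patch: all names (AllocGCChunk, ReleaseGCChunk, FlushChunkCache) are made up, and malloc stands in for the real mmap/VirtualAlloc call. Empty chunks go on a free list instead of back to the OS, and a flush hook (see comment 7) can still release them under memory pressure.

```c
/* Hypothetical sketch of keeping empty GC chunks on a free list instead
 * of returning them to the OS after every GC. Names are illustrative;
 * malloc stands in for mmap/VirtualAlloc. */
#include <stddef.h>
#include <stdlib.h>

#define GC_CHUNK_SIZE (64 * 1024)

typedef struct CachedChunk {
    struct CachedChunk *next;   /* link stored in the chunk's first word */
} CachedChunk;

static CachedChunk *chunkCache = NULL;
static size_t cachedCount = 0;

/* Allocate a chunk, preferring the cache over a fresh OS allocation. */
void *AllocGCChunk(void) {
    if (chunkCache) {
        CachedChunk *c = chunkCache;
        chunkCache = c->next;
        cachedCount--;
        return c;
    }
    return malloc(GC_CHUNK_SIZE);   /* stand-in for mmap/VirtualAlloc */
}

/* Instead of returning the chunk to the OS, push it onto the cache. */
void ReleaseGCChunk(void *p) {
    CachedChunk *c = (CachedChunk *)p;
    c->next = chunkCache;
    chunkCache = c;
    cachedCount++;
}

/* Memory-pressure hook: really free everything cached. */
void FlushChunkCache(void) {
    while (chunkCache) {
        CachedChunk *c = chunkCache;
        chunkCache = c->next;
        free(c);
    }
    cachedCount = 0;
}
```

The next allocation after a GC then reuses a cached chunk (LIFO, so the most recently touched memory comes back first) rather than paying for a fresh page allocation.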
Assignee
Updated•15 years ago
Assignee: general → anygregor

Assignee
Comment 1•14 years ago

Assignee
Comment 2•14 years ago
Is there a better way to get the chunk address from the chunkinfo? The code in DestroyEmptyGCChunks is a little bit complicated.
Attachment #423624 - Attachment is obsolete: true
Attachment #423842 - Flags: review?(igor)

Comment 3•14 years ago
(In reply to comment #2)
> Created an attachment (id=423842) [details]
> patch

Hm, I thought the idea was to offload chunk destruction to another thread and see how that would affect performance on different platforms. For example, OS X is known for a slow VM; on a faster VM the overhead of keeping up to 50 MB of unused memory around may slow things down. Also, if munmap is bad even when done on another thread, an alternative is to use posix_memalign(big_chunk). That would at least return unused memory to the malloc heap.

> Is there a better way to get the chunk address from the chunkinfo? The code in
> DestroyEmptyGCChunks is a little bit complicated.

Unused chunks can be linked through a link at the first chunk word. They do not need chunkinfo.
Comment 4•14 years ago
We don't want to free the memory. Under load the pages will be required again soon. Gregor's benchmark shows a SunSpider speedup, and we never GC during SunSpider runs (at least not as part of the SunSpider time), so the savings come from avoiding page reallocation. In short, we don't want to free here. Not in the foreground. Not in the background.
Assignee
Comment 5•14 years ago
(In reply to comment #3)
> Hm, I thought the idea was to offload chunk destruction to another thread and
> see how that would affect performance on different platforms. For example, OS X
> is known for a slow VM; on a faster VM the overhead of keeping up to 50 MB of
> unused memory around may slow things down. Also, if munmap is bad even when done
> on another thread, an alternative is to use posix_memalign(big_chunk). That
> would at least return unused memory to the malloc heap.

I agree that a background thread would reduce the pause time, but some benchmarks clearly suffer from the slow reallocation. The top win is the base64 benchmark with a 13% performance win on my Mac Pro.

It looks like an overall win on sunspider:

TEST                  COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:          1.016x as fast  734.0ms +/- 0.3%  722.6ms +/- 0.2%  significant
=============================================================================
3d:                   1.022x as fast  115.8ms +/- 0.8%  113.3ms +/- 0.3%  significant
  cube:               -               37.7ms +/- 0.9%   37.3ms +/- 0.9%
  morph:              1.080x as fast  21.6ms +/- 1.7%   20.0ms +/- 0.0%   significant
  raytrace:           1.009x as fast  56.5ms +/- 0.7%   56.0ms +/- 0.0%   significant
access:               1.043x as fast  96.8ms +/- 0.8%   92.8ms +/- 0.6%   significant
  binary-trees:       1.026x as fast  19.7ms +/- 1.8%   19.2ms +/- 1.6%   significant
  fannkuch:           1.043x as fast  45.9ms +/- 0.5%   44.0ms +/- 0.0%   significant
  nbody:              1.102x as fast  19.4ms +/- 1.9%   17.6ms +/- 2.1%   significant
  nsieve:             ??              11.8ms +/- 2.6%   12.0ms +/- 0.0%   not conclusive: might be *1.017x as slow*
bitops:               1.032x as fast  31.8ms +/- 0.9%   30.8ms +/- 1.0%   significant
  3bit-bits-in-byte:  ??              1.2ms +/- 25.1%   1.4ms +/- 26.4%   not conclusive: might be *1.167x as slow*
  bits-in-byte:       1.078x as fast  9.7ms +/- 3.6%    9.0ms +/- 0.0%    significant
  bitwise-and:        ??              1.9ms +/- 11.9%   2.0ms +/- 0.0%    not conclusive: might be *1.053x as slow*
  nsieve-bits:        1.033x as fast  19.0ms +/- 0.0%   18.4ms +/- 2.0%   significant
controlflow:          -               7.5ms +/- 5.0%    7.2ms +/- 4.2%
  recursive:          -               7.5ms +/- 5.0%    7.2ms +/- 4.2%
crypto:               1.018x as fast  45.8ms +/- 1.0%   45.0ms +/- 0.7%   significant
  aes:                1.023x as fast  26.2ms +/- 1.2%   25.6ms +/- 1.4%   significant
  md5:                1.008x as fast  13.0ms +/- 0.0%   12.9ms +/- 1.8%   significant
  sha1:               -               6.6ms +/- 5.6%    6.5ms +/- 5.8%
date:                 1.029x as fast  103.1ms +/- 0.5%  100.2ms +/- 0.3%  significant
  format-tofte:       1.041x as fast  53.4ms +/- 0.7%   51.3ms +/- 0.7%   significant
  format-xparb:       1.016x as fast  49.7ms +/- 0.7%   48.9ms +/- 0.5%   significant
math:                 -               34.8ms +/- 1.6%   34.8ms +/- 0.9%
  cordic:             ??              15.8ms +/- 1.9%   15.9ms +/- 1.4%   not conclusive: might be *1.006x as slow*
  partial-sums:       -               12.5ms +/- 3.0%   12.5ms +/- 3.0%
  spectral-norm:      -               6.5ms +/- 5.8%    6.4ms +/- 5.8%
regexp:               *1.016x as slow*  36.5ms +/- 1.0%  37.1ms +/- 0.6%  significant
  dna:                *1.016x as slow*  36.5ms +/- 1.0%  37.1ms +/- 0.6%  significant
string:               -               261.9ms +/- 0.3%  261.4ms +/- 0.2%
  base64:             1.130x as fast  11.3ms +/- 3.1%   10.0ms +/- 0.0%   significant
  fasta:              1.037x as fast  50.1ms +/- 0.5%   48.3ms +/- 0.7%   significant
  tagcloud:           ??              86.2ms +/- 0.3%   86.3ms +/- 0.4%   not conclusive: might be *1.001x as slow*
  unpack-code:        *1.031x as slow*  79.6ms +/- 0.5%  82.1ms +/- 0.3%  significant
  validate-input:     -               34.7ms +/- 1.0%   34.7ms +/- 1.0%

> > Is there a better way to get the chunk address from the chunkinfo? The code in
> > DestroyEmptyGCChunks is a little bit complicated.
>
> Unused chunks can be linked through a link at the first chunk word. They do not
> need chunkinfo.

Sure, they could be linked separately. But do we want another data structure?
Comment 6•14 years ago
The code works and is only called once during shutdown. Why does the complexity matter?
Comment 7•14 years ago
(Note: we should add an API hook to flush caches during memory pressure; we could release empty pages in that hook.)
Comment 8•14 years ago
(In reply to comment #5)
> I agree that a background thread would reduce the pause time but some
> benchmarks clearly suffer from the slow reallocation.
> The top win is the base64 benchmark with a 13% performance win on my Mac Pro.
>
> It looks like an overall win on sunspider:
> ....
> ** TOTAL **: 1.016x as fast 734.0ms +/- 0.3% 722.6ms +/- 0.2%
> significant

I did the test on Linux on an Intel Atom N270 1.6 GHz and a Core-2-based Xeon 2.6 GHz. I see no difference with the patch for sunspider. Could we get the results on Windows?
Comment 9•14 years ago
(In reply to comment #4)
> In short, we don't want to free here. Not in the foreground. Not in the
> background.

That depends on the quality of the VM. I have not observed any win on Linux. Thus for Linux the patch makes things worse overall. It does bring the speedup yet it makes the browser consume more memory.
Comment 10•14 years ago
(In reply to comment #9)
> That depends on the quality of the VM. I have not observed any win on Linux.
> Thus for Linux the patch makes things worse overall. It does bring the speedup
> yet it makes the browser consume more memory.

I meant "It does not bring the speedup yet it makes the browser consume more memory."
Comment 11•14 years ago
Did you measure how much more memory is consumed as a percentage of the total memory use over time?
Comment 12•14 years ago
(In reply to comment #11)
> Did you measure how much more memory is consumed as a percentage of the total
> memory use over time?

No. For reliable stats one needs a browser session that resembles a real one in the number of open windows and in duration. I will try to get such stats tomorrow.
Assignee
Comment 13•14 years ago
I can't see a slowdown on Linux on my Intel Q9300 2.5GHz quad-core:

** TOTAL **:  -  788.8ms +/- 0.7%  785.1ms +/- 0.6%

I don't have a Windows machine to test right now.
Comment 14•14 years ago
(In reply to comment #11)
> Did you measure how much more memory is consumed as a percentage of the total
> memory use over time?

After some testing, the memory consumption for my browser session in GC arenas is about 10-15 MB. Opening Google Docs and Maps adds about 5-7 MB to that. Running http://www.chromeexperiments.com/detail/canopy/ adds another 10-20 MB, with an observed peak of about 40 MB. Closing the corresponding tabs returns the memory to the initial 10-15 MB. Thus if the GC did not return the memory to the system, this would waste 20 MB. That memory would be unavailable not only to other applications, but also to the browser itself. Hence the suggestion to try posix_memalign or even plain malloc: the corresponding free would at least make the memory available for other browser needs while avoiding mmap/munmap overhead. The code once supported that option, but only for allocating 4K arenas. It would be nice to try it with the current 64K or bigger chunks.
Comment 15•14 years ago
I don't think this approach makes sense for high-throughput applications. You add a bunch of unnecessary malloc/free and mmap/munmap round trips (malloc implementations will munmap when you release enough memory in a row). I don't see the point of aggressively saving 10-20 MB of footprint. We should rather GC & shrink when there is real memory pressure.
Comment 16•14 years ago
(In reply to comment #15)
> I don't think this approach makes sense for high-throughput applications. You
> add a bunch of unnecessary malloc/free and mmap/munmap round trips (malloc
> implementations will munmap when you release enough memory in a row). I don't
> see the point of aggressively saving 10-20 MB of footprint. We should rather
> GC & shrink when there is real memory pressure.

I made a quick hack to change the GC to use posix_memalign again but allocate in chunks. With the current 16 GC arenas per chunk I see no difference, but with 8 arenas and 32K chunks I got in sunspider on a Xeon X3353 2.66GHz CPU:

** TOTAL **:  1.016x as fast  763.5ms +/- 0.7%  751.8ms +/- 1.3%  significant

That is, on Linux it makes sense to use posix_memalign/free compared with mmap even when munmap does not release the memory.
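The posix_memalign alternative discussed here relies on the chunk being aligned on its own size, so the chunk base of any interior pointer can be recovered by masking. A minimal sketch, with hypothetical names (AllocAlignedChunk, ChunkBase) and an assumed 64K chunk size:

```c
/* Sketch of chunk allocation via posix_memalign: the chunk comes from
 * the malloc heap (so a later free returns it to the allocator, e.g.
 * jemalloc, rather than to the OS), and alignment on the chunk size
 * lets us recover the chunk base by masking. Illustrative names only. */
#define _POSIX_C_SOURCE 200112L
#include <stdint.h>
#include <stdlib.h>

#define GC_CHUNK_SIZE  ((size_t)(64 * 1024))
#define GC_CHUNK_MASK  (GC_CHUNK_SIZE - 1)

/* Allocate one chunk, aligned on GC_CHUNK_SIZE. */
void *AllocAlignedChunk(void) {
    void *p = NULL;
    if (posix_memalign(&p, GC_CHUNK_SIZE, GC_CHUNK_SIZE) != 0)
        return NULL;
    return p;
}

/* Recover the chunk base from any pointer into the chunk. */
void *ChunkBase(void *interior) {
    return (void *)((uintptr_t)interior & ~(uintptr_t)GC_CHUNK_MASK);
}
```

Freeing such a chunk with plain free() is what makes the memory reusable by the rest of the browser, which is the trade-off being measured in this comment.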
Comment 17•14 years ago
To confirm that it is important to call free to share GC'ed memory with the rest of the system allocator, I changed the patch from comment 2 to use memalign/free. The gains from comment 16 then disappeared on Linux. I also see similar results running sunspider in a jemalloc-compiled xpcshell. There, with posix_memalign/free the speedup is 1.2-1.3%, and again there is no speedup if the memory is only released at shutdown.
Assignee
Comment 18•14 years ago
How about delaying the free? We could count how many GC cycles the chunk is empty and once it reaches a certain threshold we give it back to the OS.
Comment 19•14 years ago
Yeah, I'm more comfortable with some sort of decay model here. I wanted to try something like that with arrays when I did the dense array work, so that I could amortize the growth over fewer allocations without having long-term memory impact, but never got to it. We don't get reliable memory pressure signals, and some of our memory pressure is from the market rather than the OS.
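The delayed-free/decay idea from comments 18-19 can be sketched as an age counter per cached chunk: each GC the chunk stays empty its age is bumped, and only chunks past a threshold are returned to the system. All names (CacheEmptyChunk, ExpireEmptyChunks) and the threshold value are hypothetical, not from the actual patch:

```c
/* Sketch of a decay model for empty GC chunks: keep an empty chunk for
 * up to MAX_EMPTY_GC_CYCLES collections before really freeing it.
 * Hypothetical names; the threshold is an arbitrary example value. */
#include <stddef.h>
#include <stdlib.h>

#define MAX_EMPTY_GC_CYCLES 3

typedef struct EmptyChunk {
    struct EmptyChunk *next;
    unsigned emptyAge;          /* GC cycles this chunk has sat empty */
} EmptyChunk;

static EmptyChunk *emptyChunks = NULL;

/* Called when a chunk becomes empty during a GC. */
void CacheEmptyChunk(EmptyChunk *c) {
    c->emptyAge = 0;
    c->next = emptyChunks;
    emptyChunks = c;
}

/* Called at the end of each GC: age cached chunks, free the old ones.
 * Returns how many chunks were released. */
size_t ExpireEmptyChunks(void) {
    size_t freed = 0;
    EmptyChunk **link = &emptyChunks;
    while (*link) {
        EmptyChunk *c = *link;
        if (++c->emptyAge >= MAX_EMPTY_GC_CYCLES) {
            *link = c->next;    /* unlink and release */
            free(c);
            freed++;
        } else {
            link = &c->next;
        }
    }
    return freed;
}
```

A chunk that gets reused before its age crosses the threshold never pays the munmap/mmap round trip, while long-idle chunks are still returned eventually.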
Comment 20•14 years ago
(In reply to comment #18)
> How about delaying the free?

Well, my tests indicate that doing the free in time makes things faster on Linux both with the GLIBC allocator and with jemalloc.
Comment 21•14 years ago
Gregor, could you try this on Mac to see how it would affect sunspider?
Assignee
Updated•14 years ago
Attachment #424051 - Attachment mime type: application/octet-stream → text/plain
Comment 22•14 years ago
(In reply to comment #21)
> Created an attachment (id=424051) [details]
> hack to restore posix_memalign
>
> Gregor, could you try this on Mac to see how it would affect sunspider?

Also, could you try different values of GC_ARENAS_PER_CHUNK?
Assignee
Comment 23•14 years ago
(In reply to comment #21)
> Created an attachment (id=424051) [details]
> hack to restore posix_memalign
>
> Gregor, could you try this on Mac to see how it would affect sunspider?

Yes, this makes SS faster on Mac. It goes down from 734 to 716ms. Adding my patch gives 713ms.

GC_ARENAS_PER_CHUNK
  8    717ms
  32   720ms
  64   715ms
  128  715ms
Comment 24•14 years ago
Do we know why it's faster? Better cache utilization? We are running strictly more cycles, so it would be worthwhile finding out where the speedup comes from. Maybe we can have both: more cycles and better cache effects. In what order are you re-using chunks in your patch, Gregor?
Updated•14 years ago
Attachment #424051 - Attachment is patch: true
Comment 25•14 years ago
I will test these patches on Windows. But do we have a better benchmark? It seems that pause times for long-running JS are much more interesting here than SS scores.
Comment 26•14 years ago
There are pause-time benchmarks in the GC suite from Q4, I think, but I don't know where exactly. Sayre probably will!
Comment 27•14 years ago
(In reply to comment #24)
> Do we know why it's faster? Better cache utilization?

gregor's data indicates that on Mac the situation is the intuitive one. There the VM is slow, so either avoiding munmap or replacing it with free brings similar benefits. Now, using free is perhaps faster than not using munmap (715ms versus 722 in comment 5) since the GC data is then allocated together with the rest of the data allocated using malloc. Still, not using free at all makes things even faster: 713 ms. I will recheck my results on Linux tomorrow, but for now the data indicates that the decision to remove posix_memalign support from js_GC was a wrong one. We should have gone in the other direction, that is, to remove the mmap code.

> We are running strictly more cycles, so it would be worthwhile finding out
> where the speedup comes from. Maybe we can have both. More cycles and better
> cache effects. In what order are you re-using chunks in your patch gregor?
Comment 28•14 years ago
I am not in favor of mmap over posix_memalign, not at all. I am in favor of one mechanism. I still don't understand why we are faster with posix_memalign; I think we should try to understand that. Also, have these numbers been confirmed in the browser? A single-threaded shell has vastly different performance behavior from a multi-threaded browser when it comes to malloc/free (ballpark 5% sunspider difference).
Comment 29•14 years ago
(In reply to comment #25)
> I will test these patches on Windows.

Note that the patch from comment 21 still uses VirtualAlloc on Windows. So to test posix_memalign one would need a jemalloc shell and to hack the patch further to call posix_memalign when MOZ_MEMORY is defined on Windows.

> But do we have a better benchmark?

To test just the GC performance I often use the following synthetic benchmark or its variations:

function test(N) {
    var N = 6e6;
    var a = [];
    var time0 = Date.now();
    for (var i = 0; i != N; ++i) {
        a[i] = i + 0.1;
    }
    var time0 = Date.now() - time0;
    gc();
    var time1 = Date.now();
    gc();
    time1 = Date.now() - time1;
    a = null;
    var time2 = Date.now();
    gc();
    time2 = Date.now() - time2;
    return [time0, time1, time2];
}

// warmup
test();

var min_times = [Infinity, Infinity, Infinity];
for (var i = 0; i != 10; ++i) {
    var times = test();
    min_times[0] = Math.min(min_times[0], times[0]);
    min_times[1] = Math.min(min_times[1], times[1]);
    min_times[2] = Math.min(min_times[2], times[2]);
}
print("Alloc time: " + min_times[0]);
print("GC time all used: " + min_times[1]);
print("GC time all free: " + min_times[2]);
Comment 30•14 years ago
(In reply to comment #28) > Also, have these numbers been confirmed > in the browser? I have seen the same effect in JS_THREADSAFE js shell and xpcshell from browser build. The problem with running sunspider in the browser itself is that the noise is big and hides the small effect.
Comment 31•14 years ago
Ok, then let's measure JS_THREADSAFE in the shell. I have seen the same malloc/free penalty there as in the browser. It should be a fair enough approximation for this benchmark.
Comment 32•14 years ago
(In reply to comment #25) > But do we have a better benchmark? Would :alix's animation/GC demo be of use for that? (Or by chance is that what's being used already?) http://people.mozilla.com/~afranquet/clock.html That link is via this page -- http://hacks.mozilla.org/2010/01/javascript-speedups-in-firefox-3-6/ -- which also has this description: "Besides animating the dial, this demo creates one million 100-character strings per second, so it requires frequent GC. [...] The estimated GC delay meter gives the average estimated GC delay, based on the assumption that if a frame has a delay of 1.7 times the average delay or more, then exactly one GC ran during that frame."
Comment 33•14 years ago
Ah, yeah, that might work. One would hope that it would show up in the pause rates there, given the binge/purge allocation behaviour that it manifests. (I think that demo really creates a million String objects per second, each one pointing at the same 100 character string primitive, so it's not really churning through 100MB/sec, but more like 8MB/sec on a 32-bit machine, right?)
Comment 34•14 years ago
I did the tests again and they confirm the initial observation that delaying free slows things down on Linux. Here for reference are sunspider results on a Xeon X3353 2.66GHz CPU with an xpcshell build against jemalloc. Applying the patch from comment 21 to hg revision 3c3b005de959 improves timing from 742 to 731:

TEST                  COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:          1.015x as fast  742.0ms +/- 0.4%  730.9ms +/- 0.5%  significant
=============================================================================
3d:                   ??              122.4ms +/- 1.6%  124.0ms +/- 2.0%  not conclusive: might be *1.013x as slow*
  cube:               ??              38.5ms +/- 3.8%   38.9ms +/- 3.8%   not conclusive: might be *1.010x as slow*
  morph:              ??              29.5ms +/- 2.4%   30.2ms +/- 2.9%   not conclusive: might be *1.024x as slow*
  raytrace:           *1.009x as slow*  54.4ms +/- 0.4%  54.9ms +/- 1.4%  significant
access:               1.045x as fast  93.8ms +/- 0.9%   89.8ms +/- 1.4%   significant
  binary-trees:       1.112x as fast  25.6ms +/- 1.5%   23.0ms +/- 1.2%   significant
  fannkuch:           -               41.3ms +/- 0.9%   41.0ms +/- 2.2%
  nbody:              1.053x as fast  16.2ms +/- 1.7%   15.4ms +/- 1.3%   significant
  nsieve:             1.035x as fast  10.7ms +/- 2.2%   10.3ms +/- 1.7%   significant
bitops:               -               31.4ms +/- 1.5%   31.1ms +/- 1.7%
  3bit-bits-in-byte:  *1.152x as slow*  1.3ms +/- 10.1%  1.5ms +/- 10.2%  significant
  bits-in-byte:       ??              9.4ms +/- 3.1%    9.6ms +/- 2.6%    not conclusive: might be *1.023x as slow*
  bitwise-and:        1.074x as fast  3.8ms +/- 4.7%    3.5ms +/- 5.5%    significant
  nsieve-bits:        1.033x as fast  16.9ms +/- 1.3%   16.4ms +/- 1.2%   significant
controlflow:          ??              7.4ms +/- 2.0%    7.5ms +/- 2.1%    not conclusive: might be *1.019x as slow*
  recursive:          ??              7.4ms +/- 2.0%    7.5ms +/- 2.1%    not conclusive: might be *1.019x as slow*
crypto:               1.019x as fast  45.1ms +/- 0.8%   44.3ms +/- 0.7%   significant
  aes:                1.012x as fast  25.6ms +/- 0.8%   25.3ms +/- 0.6%   significant
  md5:                1.026x as fast  12.9ms +/- 1.8%   12.5ms +/- 1.4%   significant
  sha1:               1.037x as fast  6.6ms +/- 2.2%    6.4ms +/- 2.2%    significant
date:                 1.011x as fast  109.3ms +/- 0.5%  108.1ms +/- 0.3%  significant
  format-tofte:       -               62.5ms +/- 0.4%   62.5ms +/- 0.4%
  format-xparb:       1.025x as fast  46.8ms +/- 1.1%   45.6ms +/- 0.5%   significant
math:                 ??              34.1ms +/- 1.5%   34.3ms +/- 0.8%   not conclusive: might be *1.005x as slow*
  cordic:             -               12.9ms +/- 2.1%   12.8ms +/- 1.2%
  partial-sums:       ??              15.5ms +/- 1.6%   15.7ms +/- 1.1%   not conclusive: might be *1.014x as slow*
  spectral-norm:      -               5.7ms +/- 2.5%    5.7ms +/- 2.3%
regexp:               -               47.1ms +/- 1.3%   47.0ms +/- 0.8%
  dna:                -               47.1ms +/- 1.3%   47.0ms +/- 0.8%
string:               1.027x as fast  251.4ms +/- 0.4%  244.9ms +/- 0.5%  significant
  base64:             ??              11.3ms +/- 1.2%   11.5ms +/- 1.6%   not conclusive: might be *1.014x as slow*
  fasta:              -               48.9ms +/- 0.6%   48.7ms +/- 0.5%
  tagcloud:           1.035x as fast  79.7ms +/- 0.5%   77.0ms +/- 0.5%   significant
  unpack-code:        1.039x as fast  78.3ms +/- 0.9%   75.3ms +/- 1.2%   significant
  validate-input:     1.024x as fast  33.3ms +/- 0.9%   32.5ms +/- 0.7%   significant

Applying the patch from comment 2 on top of that slows things down from 731 to 738:

TEST                  COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:          *1.010x as slow*  730.9ms +/- 0.5%  738.3ms +/- 0.7%  significant
=============================================================================
3d:                   -               124.0ms +/- 2.0%  122.7ms +/- 2.0%
  cube:               -               38.9ms +/- 3.8%   38.0ms +/- 3.6%
  morph:              -               30.2ms +/- 2.9%   29.4ms +/- 3.0%
  raytrace:           ??              54.9ms +/- 1.4%   55.2ms +/- 1.6%   not conclusive: might be *1.006x as slow*
access:               ??              89.8ms +/- 1.4%   91.2ms +/- 0.9%   not conclusive: might be *1.016x as slow*
  binary-trees:       *1.036x as slow*  23.0ms +/- 1.2%  23.8ms +/- 1.9%  significant
  fannkuch:           -               41.0ms +/- 2.2%   41.0ms +/- 1.2%
  nbody:              -               15.4ms +/- 1.3%   15.2ms +/- 1.2%
  nsieve:             *1.072x as slow*  10.3ms +/- 1.7%  11.1ms +/- 1.1%  significant
bitops:               *1.025x as slow*  31.1ms +/- 1.7%  31.8ms +/- 1.9%  significant
  3bit-bits-in-byte:  ??              1.5ms +/- 10.2%   1.5ms +/- 9.3%    not conclusive: might be *1.013x as slow*
  bits-in-byte:       -               9.6ms +/- 2.6%    9.5ms +/- 2.2%
  bitwise-and:        *1.26x as slow*  3.5ms +/- 5.5%   4.4ms +/- 5.2%    significant
  nsieve-bits:        ??              16.4ms +/- 1.2%   16.4ms +/- 1.7%   not conclusive: might be *1.001x as slow*
controlflow:          ??              7.5ms +/- 2.1%    7.6ms +/- 2.3%    not conclusive: might be *1.005x as slow*
  recursive:          ??              7.5ms +/- 2.1%    7.6ms +/- 2.3%    not conclusive: might be *1.005x as slow*
crypto:               *1.015x as slow*  44.3ms +/- 0.7%  44.9ms +/- 1.9%  significant
  aes:                ??              25.3ms +/- 0.6%   25.4ms +/- 1.4%   not conclusive: might be *1.004x as slow*
  md5:                ??              12.5ms +/- 1.4%   12.7ms +/- 2.1%   not conclusive: might be *1.016x as slow*
  sha1:               *1.056x as slow*  6.4ms +/- 2.2%  6.8ms +/- 8.3%    significant
date:                 ??              108.1ms +/- 0.3%  108.4ms +/- 1.4%  not conclusive: might be *1.003x as slow*
  format-tofte:       -               62.5ms +/- 0.4%   62.1ms +/- 1.5%
  format-xparb:       *1.016x as slow*  45.6ms +/- 0.5%  46.3ms +/- 1.4%  significant
math:                 ??              34.3ms +/- 0.8%   34.4ms +/- 1.6%   not conclusive: might be *1.004x as slow*
  cordic:             ??              12.8ms +/- 1.2%   12.9ms +/- 1.8%   not conclusive: might be *1.005x as slow*
  partial-sums:       -               15.7ms +/- 1.1%   15.7ms +/- 1.7%
  spectral-norm:      ??              5.7ms +/- 2.3%    5.8ms +/- 2.7%    not conclusive: might be *1.011x as slow*
regexp:               1.017x as fast  47.0ms +/- 0.8%   46.2ms +/- 1.4%   significant
  dna:                1.017x as fast  47.0ms +/- 0.8%   46.2ms +/- 1.4%   significant
string:               *1.025x as slow*  244.9ms +/- 0.5%  251.1ms +/- 0.5%  significant
  base64:             1.049x as fast  11.5ms +/- 1.6%   10.9ms +/- 1.9%   significant
  fasta:              -               48.7ms +/- 0.5%   48.3ms +/- 1.0%
  tagcloud:           *1.029x as slow*  77.0ms +/- 0.5%  79.2ms +/- 0.6%  significant
  unpack-code:        *1.062x as slow*  75.3ms +/- 1.2%  80.0ms +/- 1.0%  significant
  validate-input:     ??              32.5ms +/- 0.7%   32.6ms +/- 0.7%   not conclusive: might be *1.004x as slow*
Comment 35•14 years ago
Sunspider on Vista for tip TM vs. first patch applied:

TEST                  COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:          1.017x as fast  865.3ms +/- 0.3%  850.5ms +/- 0.2%  significant
=============================================================================
3d:                   1.009x as fast  138.2ms +/- 0.7%  136.9ms +/- 0.8%  significant
  cube:               1.034x as fast  36.6ms +/- 1.0%   35.4ms +/- 1.0%   significant
  morph:              -               45.7ms +/- 0.8%   45.4ms +/- 2.0%
  raytrace:           ??              55.9ms +/- 0.7%   56.1ms +/- 1.5%   not conclusive: might be *1.004x as slow*
access:               1.021x as fast  122.9ms +/- 0.7%  120.4ms +/- 0.8%  significant
  binary-trees:       -               21.9ms +/- 1.0%   21.6ms +/- 3.6%
  fannkuch:           1.018x as fast  66.2ms +/- 1.1%   65.0ms +/- 0.7%   significant
  nbody:              1.040x as fast  23.4ms +/- 1.6%   22.5ms +/- 1.7%   significant
  nsieve:             -               11.4ms +/- 3.2%   11.3ms +/- 3.1%
bitops:               ??              32.8ms +/- 2.0%   33.5ms +/- 2.3%   not conclusive: might be *1.021x as slow*
  3bit-bits-in-byte:  ??              1.4ms +/- 26.4%   1.6ms +/- 23.1%   not conclusive: might be *1.143x as slow*
  bits-in-byte:       *1.083x as slow*  8.4ms +/- 4.4%  9.1ms +/- 4.5%    significant
  bitwise-and:        -               2.1ms +/- 10.8%   2.0ms +/- 0.0%
  nsieve-bits:        -               20.9ms +/- 2.5%   20.8ms +/- 2.7%
controlflow:          ??              7.2ms +/- 4.2%    7.5ms +/- 5.0%    not conclusive: might be *1.042x as slow*
  recursive:          ??              7.2ms +/- 4.2%    7.5ms +/- 5.0%    not conclusive: might be *1.042x as slow*
crypto:               ??              49.6ms +/- 1.0%   49.7ms +/- 1.0%   not conclusive: might be *1.002x as slow*
  aes:                -               28.1ms +/- 0.8%   27.9ms +/- 0.8%
  md5:                ??              13.5ms +/- 2.8%   13.8ms +/- 2.2%   not conclusive: might be *1.022x as slow*
  sha1:               -               8.0ms +/- 0.0%    8.0ms +/- 0.0%
date:                 -               133.7ms +/- 0.4%  133.7ms +/- 0.4%
  format-tofte:       1.008x as fast  62.3ms +/- 0.6%   61.8ms +/- 0.5%   significant
  format-xparb:       ??              71.4ms +/- 0.7%   71.9ms +/- 0.6%   not conclusive: might be *1.007x as slow*
math:                 -               47.0ms +/- 1.9%   46.2ms +/- 1.0%
  cordic:             1.015x as fast  27.4ms +/- 1.3%   27.0ms +/- 0.0%   significant
  partial-sums:       -               14.0ms +/- 5.4%   13.6ms +/- 2.7%
  spectral-norm:      -               5.6ms +/- 6.6%    5.6ms +/- 6.6%
regexp:               1.066x as fast  45.2ms +/- 1.8%   42.4ms +/- 0.9%   significant
  dna:                1.066x as fast  45.2ms +/- 1.8%   42.4ms +/- 0.9%   significant
string:               1.030x as fast  288.7ms +/- 0.5%  280.2ms +/- 0.3%  significant
  base64:             ??              11.4ms +/- 3.2%   11.5ms +/- 3.3%   not conclusive: might be *1.009x as slow*
  fasta:              1.039x as fast  58.6ms +/- 1.3%   56.4ms +/- 0.7%   significant
  tagcloud:           1.032x as fast  95.4ms +/- 1.6%   92.4ms +/- 0.4%   significant
  unpack-code:        1.032x as fast  86.3ms +/- 0.6%   83.6ms +/- 0.7%   significant
  validate-input:     1.019x as fast  37.0ms +/- 1.3%   36.3ms +/- 1.0%   significant
Comment 36•14 years ago
Again on Vista, TM tip vs. posix_memalign patch applied:

TEST                  COMPARISON            FROM                 TO             DETAILS
=============================================================================
** TOTAL **:          1.011x as fast  865.3ms +/- 0.3%  855.5ms +/- 0.4%  significant
=============================================================================
3d:                   1.009x as fast  138.2ms +/- 0.7%  136.9ms +/- 0.4%  significant
  cube:               1.014x as fast  36.6ms +/- 1.0%   36.1ms +/- 0.6%   significant
  morph:              1.013x as fast  45.7ms +/- 0.8%   45.1ms +/- 0.5%   significant
  raytrace:           -               55.9ms +/- 0.7%   55.7ms +/- 0.6%
access:               1.016x as fast  122.9ms +/- 0.7%  121.0ms +/- 0.6%  significant
  binary-trees:       1.019x as fast  21.9ms +/- 1.0%   21.5ms +/- 1.8%   significant
  fannkuch:           -               66.2ms +/- 1.1%   65.6ms +/- 0.6%
  nbody:              1.026x as fast  23.4ms +/- 1.6%   22.8ms +/- 2.0%   significant
  nsieve:             -               11.4ms +/- 3.2%   11.1ms +/- 2.0%
bitops:               ??              32.8ms +/- 2.0%   33.0ms +/- 3.1%   not conclusive: might be *1.006x as slow*
  3bit-bits-in-byte:  ??              1.4ms +/- 26.4%   1.7ms +/- 20.3%   not conclusive: might be *1.21x as slow*
  bits-in-byte:       ??              8.4ms +/- 4.4%    8.6ms +/- 5.8%    not conclusive: might be *1.024x as slow*
  bitwise-and:        -               2.1ms +/- 10.8%   2.0ms +/- 0.0%
  nsieve-bits:        -               20.9ms +/- 2.5%   20.7ms +/- 2.3%
controlflow:          ??              7.2ms +/- 4.2%    7.3ms +/- 6.6%    not conclusive: might be *1.014x as slow*
  recursive:          ??              7.2ms +/- 4.2%    7.3ms +/- 6.6%    not conclusive: might be *1.014x as slow*
crypto:               -               49.6ms +/- 1.0%   49.0ms +/- 1.0%
  aes:                1.014x as fast  28.1ms +/- 0.8%   27.7ms +/- 1.2%   significant
  md5:                -               13.5ms +/- 2.8%   13.5ms +/- 2.8%
  sha1:               1.026x as fast  8.0ms +/- 0.0%    7.8ms +/- 3.9%    significant
date:                 -               133.7ms +/- 0.4%  133.3ms +/- 0.5%
  format-tofte:       1.011x as fast  62.3ms +/- 0.6%   61.6ms +/- 0.6%   significant
  format-xparb:       ??              71.4ms +/- 0.7%   71.7ms +/- 0.8%   not conclusive: might be *1.004x as slow*
math:                 -               47.0ms +/- 1.9%   46.1ms +/- 1.4%
  cordic:             -               27.4ms +/- 1.3%   27.2ms +/- 1.1%
  partial-sums:       -               14.0ms +/- 5.4%   13.6ms +/- 2.7%
  spectral-norm:      -               5.6ms +/- 6.6%    5.3ms +/- 6.5%
regexp:               1.051x as fast  45.2ms +/- 1.8%   43.0ms +/- 1.1%   significant
  dna:                1.051x as fast  45.2ms +/- 1.8%   43.0ms +/- 1.1%   significant
string:               1.010x as fast  288.7ms +/- 0.5%  285.9ms +/- 0.6%  significant
  base64:             ??              11.4ms +/- 3.2%   11.5ms +/- 3.3%   not conclusive: might be *1.009x as slow*
  fasta:              -               58.6ms +/- 1.3%   58.5ms +/- 1.9%
  tagcloud:           -               95.4ms +/- 1.6%   93.8ms +/- 1.1%
  unpack-code:        -               86.3ms +/- 0.6%   85.7ms +/- 1.2%
  validate-input:     1.016x as fast  37.0ms +/- 1.3%   36.4ms +/- 1.0%   significant
Comment 37•14 years ago
I measured GC pause times on the Canopy Chrome experiment. My procedure:
- Measure time for each call to js_GC with rdtsc() and a class that takes timestamps in ctor/dtor
- Run Canopy for several seconds
- To analyze the data, throw away pause times much shorter than usual (less than 20M cycles)
- Report average and max pause time in ms

Results:
                       mean (ms)  max (ms)
TM tip:                     21.5      26.9
first patch:                17.5      25.7
posix_memalign patch:       22.2      26.1
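The measurement described above used rdtsc from a C++ ctor/dtor pair wrapped around js_GC. A rough portable sketch of the same begin/end pattern, using clock_gettime as a stand-in for rdtsc (the names PauseTimer, PauseTimerStart, PauseTimerStop are made up for illustration):

```c
/* Portable sketch of pause-time measurement. The original used rdtsc in
 * a C++ RAII class; this C stand-in uses CLOCK_MONOTONIC and explicit
 * begin/end calls around the section being measured (e.g. js_GC). */
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

static uint64_t NowNs(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

typedef struct { uint64_t start; } PauseTimer;

void PauseTimerStart(PauseTimer *t) { t->start = NowNs(); }

/* Returns the elapsed pause in nanoseconds since PauseTimerStart. */
uint64_t PauseTimerStop(const PauseTimer *t) { return NowNs() - t->start; }
```

Filtering out very short samples (the "less than 20M cycles" step above) would then be a simple threshold on the returned values.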
Assignee
Comment 38•14 years ago
In order to get a better understanding for Linux, I measured clock cycles (with rdtsc) for the clock benchmark Alex mentioned. I take the first 10 GC runs once I start the benchmark. All numbers represent clock cycles * 1E6. "create chunk" and "destroy" are the clock cycles spent in the createChunk and destroyChunk functions until the next GC; "GC total" means clock cycles spent in js_GC; "never return" is the approach where I never return chunks to the OS. It looks like a combination of memalign and delayed returning of pages is the way to go for Linux, but not for Windows, as David's numbers show.

Tip:
GC total: 59.984250 create chunk: 1.283946, destroy: 5.357551
GC total: 65.148690 create chunk: 1.702142, destroy: 8.668120
GC total: 65.102940 create chunk: 1.668936, destroy: 8.571407
GC total: 66.635302 create chunk: 1.785963, destroy: 8.642302
GC total: 67.303733 create chunk: 1.823751, destroy: 9.096835
GC total: 65.943045 create chunk: 1.647631, destroy: 8.743144
GC total: 71.483768 create chunk: 1.833246, destroy: 11.906154
GC total: 69.165578 create chunk: 1.959102, destroy: 9.234659
GC total: 62.773845 create chunk: 1.602108, destroy: 8.140037
GC total: 69.744675 create chunk: 1.851726, destroy: 9.220635

memalign:
GC total: 56.798535 create chunk: 1.078626, destroy: 4.140181
GC total: 67.185240 create chunk: 2.071883, destroy: 7.399031
GC total: 66.639907 create chunk: 2.088105, destroy: 7.072140
GC total: 66.712673 create chunk: 2.148167, destroy: 7.076758
GC total: 65.022098 create chunk: 1.877782, destroy: 7.274674
GC total: 62.263410 create chunk: 3.300468, destroy: 6.702548
GC total: 62.094457 create chunk: 3.134920, destroy: 6.511204
GC total: 62.282565 create chunk: 3.174125, destroy: 6.583060
GC total: 62.116913 create chunk: 3.093809, destroy: 6.525590
GC total: 60.745748 create chunk: 1.510684, destroy: 3.266253

Tip + never return:
GC total: 53.453002 create chunk: 1.550957, destroy: 0.000000
GC total: 56.476117 create chunk: 0.585223, destroy: 0.000000
GC total: 62.370593 create chunk: 0.256883, destroy: 0.000000
GC total: 55.096665 create chunk: 0.000000, destroy: 0.000000
GC total: 55.641825 create chunk: 0.000000, destroy: 0.000000
GC total: 55.118168 create chunk: 0.000000, destroy: 0.000000
GC total: 55.079535 create chunk: 0.000000, destroy: 0.000000
GC total: 55.285597 create chunk: 0.000000, destroy: 0.000000
GC total: 58.120110 create chunk: 0.000000, destroy: 0.000000
GC total: 58.995465 create chunk: 0.000000, destroy: 0.000000

memalign + never return:
GC total: 34.332472 create chunk: 0.463917, destroy: 0.000000
GC total: 43.483253 create chunk: 0.956575, destroy: 0.000000
GC total: 44.499712 create chunk: 0.051564, destroy: 0.000000
GC total: 44.070758 create chunk: 0.000000, destroy: 0.000000
GC total: 44.366948 create chunk: 0.000000, destroy: 0.000000
GC total: 44.334638 create chunk: 0.000000, destroy: 0.000000
GC total: 44.319345 create chunk: 0.000000, destroy: 0.000000
GC total: 50.329492 create chunk: 0.302865, destroy: 0.000000
GC total: 43.602270 create chunk: 0.000000, destroy: 0.000000
GC total: 45.702323 create chunk: 0.000000, destroy: 0.000000
Comment 39•14 years ago
The new patch moves the free call to the background thread.
Comment 40•14 years ago
(In reply to comment #38)
> In order to get a better understanding for Linux I was measuring clock cycles
> (measured with rdtsc) for the clock benchmark Alex mentioned.

Could you post that measuring code?
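The measuring code was never attached to this bug. A minimal sketch of the rdtsc-based instrumentation that comment 38 describes might look like the following; the CycleCounter accumulator and its naming are assumptions for illustration, not the actual patch.

```cpp
#include <cassert>
#include <cstdint>

// Read the x86 timestamp counter; this is the rdtsc instruction the
// measurements in this bug are based on (GCC/Clang inline asm).
static inline uint64_t ReadTSC() {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// Hypothetical accumulator: an instrumented function (e.g. createChunk)
// adds the cycles it spent; the total is reported in units of 1E6 cycles
// and reset at the next GC, matching the numbers quoted above.
struct CycleCounter {
    uint64_t total;
    void Add(uint64_t start, uint64_t end) { total += end - start; }
    double Millions() const { return total / 1e6; }
    void Reset() { total = 0; }
};
```

An instrumented function would call ReadTSC() on entry and exit and feed the pair into its counter; js_GC would print and reset all counters.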
Comment 41•14 years ago
The new patch increases the chunk size to 128K and uses, as before, memalign + free on the background thread. For the benchmark from comment 29 that exercises only the GC allocator with no malloc calls, I get with the patch in jemalloc+xpcshell:

base:
Alloc time: 209
GC time all used: 58
GC time all free: 15

memalign + background free:
Alloc time: 214
GC time all used: 59
GC time all free: 6

mmap + never return memory:
Alloc time: 200
GC time all used: 59
GC time all free: 7

This shows that on Linux mmap/munmap is faster than memalign allocation, but that is completely offset by slower finalization. Still, mmap/no-release wins on all counts. If I change the benchmark so allocation uses the same number of GC and malloc allocations:

function test(N) {
    var N = 6e6;
    var a = [];
    var time0 = Date.now();
    for (var i = 0; i != N; ++i) {
        a[i] = "aa".toUpperCase();
    }
    var time0 = Date.now() - time0;
    gc();
    var time1 = Date.now();
    gc();
    time1 = Date.now() - time1;
    a = null;
    var time2 = Date.now();
    gc();
    time2 = Date.now() - time2;
    return [time0, time1, time2];
}

// warmup
test();

var min_times = [Infinity, Infinity, Infinity];
for (var i = 0; i != 100; ++i) {
    var times = test();
    min_times[0] = Math.min(min_times[0], times[0]);
    min_times[1] = Math.min(min_times[1], times[1]);
    min_times[2] = Math.min(min_times[2], times[2]);
}
print("Alloc time: " + min_times[0]);
print("GC time all used: " + min_times[1]);
print("GC time all free: " + min_times[2]);

The results:

base:
Alloc time: 760
GC time all used: 97
GC time all free: 227

memalign + background free:
Alloc time: 745
GC time all used: 97
GC time all free: 197

mmap + never return memory:
Alloc time: 722
GC time all used: 97
GC time all free: 196

That is, memalign now beats the base on allocation while mmap/no-free is still the fastest. My theory regarding the memalign allocation speedup is that memalign somehow warms up jemalloc's code and data caches. This speedup is not enough to offset the wins from mmap/no-free in this benchmark, but in SunSpider it is enough to make allocation faster. Still, if one focuses purely on GC mark and finalize timing, then memalign + free on the background thread on Linux is at least as fast as mmap/no-free, with the advantage that it does not leave unused memory around. Gregor/David, could you repeat the tests with this patch?
Attachment #424051 - Attachment is obsolete: true
Attachment #424198 - Attachment is obsolete: true
Comment 42•14 years ago
The patch moves munmap calls to the background thread. On Linux it is worse than the other proposals: there is no change in SunSpider, the allocation tests show a slowdown, and the mark/sweep phases are no faster than with the other patches.
Assignee
Comment 43•14 years ago
MacPro, OS X 10.6.2. I calculated the (arithmetic) mean of 10 GC calls once I started the benchmarks. All numbers (except Count) represent (rdtsc) cycles * 1E6.

GC Total: accumulated cycles in js_GC (mean)
Alloc since last GC: accumulated cycles in NewGCChunk (mean)
Count: number of NewGCChunk calls between the last GC call and the current one (no mean value)
per alloc: current accumulated NewGCChunk cycles / Count (no mean value)
Mark, Sweep, Finalize Obj, Doubles: cycles in mark, sweep, ... (mean)
Destroy: cycles in DestroyGCArenas (mean)
Count: number of DestroyGCArenas calls between the last GC call and the current one (no mean value)
per Dest: current accumulated DestroyGCArenas cycles / Count (no mean value)

Tip: mmap + 16 arenas per chunk

Clock:
GC Total: 86.277183
Alloc since last GC: 11.231231, Count: 490, per alloc: 0.022659
Mark: 20.773317, Sweep: 63.651558
Finalize Obj: 40.383799, Doubles: 0.002311
Destroy: 21.227316, Count: 491, per Dest: 0.037003

Canopy:
GC Total: 108.236756
Alloc since last GC: 97.377330, Count: 985, per alloc: 0.100792
Mark: 35.407740, Sweep: 69.874921
Finalize Obj: 19.989628, Doubles: 6.034316
Destroy: 41.343824, Count: 984, per Dest: 0.03649

It is very interesting that the time per allocation increases dramatically if we call mmap over 600 times.

Attachment 424209 [details] [diff]: 32 arenas per chunk, posix_memalign, free on background thread

Clock:
GC Total: 84.069372
Alloc since last GC: 5.045846, Count: 254, per alloc: 0.020286
Mark: 20.815956, Sweep: 61.436992
Finalize Obj: 40.309536, Doubles: 0.002079
Destroy: 19.095992, Count: 254, per Dest: 0.065833

Canopy:
GC Total: 99.394880
Alloc since last GC: 49.907864, Count: 495, per alloc: 0.115329
Mark: 35.076904, Sweep: 61.967926
Finalize Obj: 18.955368, Doubles: 5.883493
Destroy: 34.705608, Count: 497, per Dest: 0.064539

Attachment 424219 [details] [diff]: background munmap

Clock:
GC Total: 71.826731
Alloc since last GC: 11.787724, Count: 490, per alloc: 0.022092
Mark: 20.867233, Sweep: 48.612205
Finalize Obj: 43.314811, Doubles: 0.002473
Destroy: 3.230002, Count: 486, per Dest: 0.054199

Canopy:
GC Total: 71.387655
Alloc since last GC: 95.360449, Count: 1048, per alloc: 0.103057
Mark: 35.334600, Sweep: 33.680048
Finalize Obj: 18.939819, Doubles: 6.190844
Destroy: 6.139078, Count: 1012, per Dest: 0.059629

Allocation time kills here!

never release chunks:

Clock:
GC Total: 73.801036
Alloc since last GC: 0.000000, Count: 0, per alloc: 0.000000
Mark: 23.834874, Sweep: 47.563324
Finalize Obj: 42.371644, Doubles: 0.003180
Destroy: 3.149600, Count: 0, per Dest: 0.000000

Canopy:
GC Total: 58.777888
Alloc since last GC: 1.598225, Count: 0, per alloc: 0.000000
Mark: 27.906838, Sweep: 28.843181
Finalize Obj: 16.121037, Doubles: 5.332098
Destroy: 5.215941, Count: 0, per Dest: 0.000000
Assignee
Comment 44•14 years ago
And the same for Ubuntu 9.10, kernel 2.6.30.

Tip: mmap + 16 arenas per chunk

Clock:
GC Total: 70.850034
Alloc since last GC: 2.257380, Count: 466, per alloc: 0.004596
Mark: 19.527767, Sweep: 50.059779
Finalize Obj: 35.943663, Doubles: 0.001456
Destroy: 12.141015, Count: 467, per Dest: 0.020657

Canopy:
GC Total: 72.883884
Alloc since last GC: 8.125352, Count: 779, per alloc: 0.011639
Mark: 26.552469, Sweep: 44.960441
Finalize Obj: 8.169489, Doubles: 3.656163
Destroy: 19.895777, Count: 776, per Dest: 0.021073
GC Total: 71.099658

Attachment 424209 [details] [diff]: 32 arenas per chunk, posix_memalign, free on background thread

Clock:
GC Total: 71.766398
Alloc since last GC: 1.622382, Count: 235, per alloc: 0.007643
Mark: 18.974574, Sweep: 45.849586
Finalize Obj: 34.197816, Doubles: 0.002181
Destroy: 2.356967, Count: 234, per Dest: 0.000039

Canopy:
GC Total: 78.486300
Alloc since last GC: 2.612945, Count: 388, per alloc: 0.006740
Mark: 25.261387, Sweep: 49.146618
Finalize Obj: 8.961127, Doubles: 4.070297
Destroy: 4.361608, Count: 393, per Dest: 0.000038

Sweep is unstable between 30 and 60!

Attachment 424219 [details] [diff]: background munmap

Clock:
GC Total: 70.966204
Alloc since last GC: 2.115593, Count: 474, per alloc: 0.004514
Mark: 19.397501, Sweep: 50.368943
Finalize Obj: 35.884453, Doubles: 0.001558
Destroy: 2.411190, Count: 473, per Dest: 0.021089

Canopy:
GC Total: 71.389255
Alloc since last GC: 9.345766, Count: 752, per alloc: 0.011788
Mark: 26.901378, Sweep: 43.113534
Finalize Obj: 7.819234, Doubles: 3.657568
Destroy: 3.910578, Count: 0, per Dest: 0.000000

never release chunks:

Clock:
GC Total: 61.238915
Alloc since last GC: 0.000000, Count: 0, per alloc: 0.000000
Mark: 19.263335, Sweep: 40.673392
Finalize Obj: 36.356200, Doubles: 0.001970
Destroy: 2.414596, Count: 0, per Dest: 0.000000

Canopy:
GC Total: 53.569553
Alloc since last GC: 0.000000, Count: 0, per alloc: 0.000000
Mark: 27.715081, Sweep: 24.418610
Finalize Obj: 8.086296, Doubles: 3.615567
Destroy: 3.709132, Count: 0, per Dest: 0.000000
Assignee
Comment 45•14 years ago
A quick hack that delays the chunk release. If a chunk is empty for 3 GC runs, it will be returned to the OS. I can't see at all how this is slower on Linux:

Tip:
Alloc time: 381
GC time all used: 78
GC time all free: 13

This patch:
Alloc time: 323
GC time all used: 78
GC time all free: 5
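The delayed-release policy described above can be sketched as follows. This is an illustrative reconstruction, not the attached patch: a chunk that stays empty across a fixed number of consecutive GC runs is handed back to the OS, while a chunk that gets reused has its counter reset. All names (GCChunk, SweepEmptyChunks) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A chunk survives this many GCs while empty before it is released.
static const unsigned GC_EMPTY_CHUNK_SURVIVES = 3;

struct GCChunk {
    bool empty;                 // no live arenas after this GC
    unsigned emptySurvivedGCs;  // consecutive GCs spent empty
};

// Called at the end of each GC; removes chunks that have been empty
// long enough from the pool and returns them for release to the OS.
std::vector<GCChunk*> SweepEmptyChunks(std::vector<GCChunk*>& pool) {
    std::vector<GCChunk*> toRelease;
    size_t i = 0;
    while (i < pool.size()) {
        GCChunk* c = pool[i];
        if (!c->empty) {
            c->emptySurvivedGCs = 0;  // chunk was reused: keep it around
            ++i;
        } else if (++c->emptySurvivedGCs >= GC_EMPTY_CHUNK_SURVIVES) {
            toRelease.push_back(c);   // empty for long enough: release
            pool.erase(pool.begin() + i);
        } else {
            ++i;                      // empty, but give it another grace GC
        }
    }
    return toRelease;
}
```

With this policy a benchmark that frees and reallocates the same amount of memory between GCs never touches mmap/munmap, which is where the "Alloc time" win above comes from.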
Comment 46•14 years ago
For code simplicity I would like to try yet another idea - background free and allocation. The idea is to have a free arena pool that the background thread tries to replenish when it drops below a certain limit. The background thread would also free part of the pool if it becomes too big. I hope that with a big enough minimal pool this would allow us to use posix_memalign (_aligned_malloc on Windows when jemalloc is not available) to allocate the arenas directly, removing the code complexity associated with the current chunk management while approaching the performance of never releasing arenas.
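The water-mark pool described here might be sketched like this. The class name, thresholds, and chunk size are illustrative assumptions; in the real engine Maintain() would run on the background thread while Take()/Put() are called from the allocating thread.

```cpp
#include <cassert>
#include <cstdlib>
#include <deque>
#include <mutex>

static const size_t CHUNK_SIZE = 64 * 1024;  // illustrative chunk size

class ChunkPool {
    std::deque<void*> free_;
    std::mutex lock_;

  public:
    static const size_t MIN_FREE = 4;   // replenish when below this
    static const size_t MAX_FREE = 16;  // trim when above this

    // Fast path for the allocating (main) thread: pop from the pool,
    // falling back to an inline allocation only when the pool is dry.
    void* Take() {
        std::lock_guard<std::mutex> g(lock_);
        if (free_.empty())
            return std::malloc(CHUNK_SIZE);
        void* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Freed chunks go back into the pool instead of to the OS.
    void Put(void* p) {
        std::lock_guard<std::mutex> g(lock_);
        free_.push_back(p);
    }

    // What the background thread runs periodically: keep the free list
    // between the two water marks.
    void Maintain() {
        std::lock_guard<std::mutex> g(lock_);
        while (free_.size() < MIN_FREE)
            free_.push_back(std::malloc(CHUNK_SIZE));
        while (free_.size() > MAX_FREE) {
            std::free(free_.back());
            free_.pop_back();
        }
    }

    size_t FreeCount() {
        std::lock_guard<std::mutex> g(lock_);
        return free_.size();
    }
};
```

The trade-off discussed later in this bug applies: the pool keeps up to MAX_FREE chunks of unused memory alive between GCs in exchange for keeping allocation off the mmap/malloc slow path.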
Comment 47•14 years ago
I like #46. We should prepare pools in the background and have them ready for the foreground thread to pick up, instead of allocating and preparing them on the spot as we run out of pools in the foreground. If we do this in the background, why do we need posix_memalign? Once mmap/munmap are off the hot path we should be able to use them.
Comment 48•14 years ago
(In reply to comment #47)
> Once mmap/munmap are not on the hot path we should be able to use them.

On Windows (at least on XP and Vista) the VirtualAlloc allocation granularity is 64K (see http://blogs.msdn.com/oldnewthing/archive/2003/10/08/55239.aspx for the reasons behind it). Allocating anything less just wastes memory. Thus, if we use arenas that are smaller than that, the code would still need to deal with chunks. I would like to avoid that to simplify the GC.
Comment 49•14 years ago
Sounds good. I am all for getting rid of chunks.
Comment 50•14 years ago
(In reply to comment #46)
> I hope that with big enough minimal pool it would allow to use posix_memalign
> (_aligned_malloc on Windows when jemalloc is not available)

That does not work. _aligned_malloc(size, alignment) on Windows is just a wrapper on top of malloc that calls malloc(size + alignment + extra_word). Thus using it like _aligned_malloc(GC_PAGE_SIZE, GC_PAGE_SIZE) just wastes memory. The bottom line is that on Windows the minimal alignment one can get from the system is 64K, and one form or another of chunked allocation is necessary since we cannot require that jemalloc be used, right? So I will try to keep chunks while adding their background allocation and trimming.
Comment 51•14 years ago
This patch implements background allocation and release of GC chunks using mmap/munmap.
Attachment #424209 - Attachment is obsolete: true
Attachment #424219 - Attachment is obsolete: true
Comment 52•14 years ago
This patch replaces mmap/munmap for chunks with an oversized malloc. This wastes about one arena per chunk, but the plus is that any released chunk is immediately available to other callers of malloc. As with the previous patch, both the allocation and release of chunks are done in the background.
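The "oversized malloc" trick behind this patch is to allocate more than needed and round the pointer up to the required alignment, keeping the original base pointer so the block can be freed. A minimal sketch under the assumption that the base pointer is stashed just before the aligned start (the actual patch may store it differently):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Allocate chunkSize bytes aligned to `alignment` (a power of two) out
// of a plain malloc block. Wastes up to alignment - 1 bytes per chunk,
// which is the "about one arena per chunk" cost mentioned above.
void* AllocAlignedChunk(size_t chunkSize, size_t alignment) {
    void* base = std::malloc(chunkSize + alignment - 1 + sizeof(void*));
    if (!base)
        return nullptr;
    uintptr_t raw = (uintptr_t)base + sizeof(void*);
    uintptr_t aligned = (raw + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void**)aligned)[-1] = base;  // remember where malloc's block starts
    return (void*)aligned;
}

void FreeAlignedChunk(void* chunk) {
    // Recover the original malloc pointer stored before the chunk.
    std::free(((void**)chunk)[-1]);
}
```

The aligned start lets the GC find per-chunk metadata by masking any cell address with ~(alignment - 1), which is why chunk alignment matters in the first place.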
Comment 53•14 years ago
The new patch effectively disables the background allocation from the previous patch, so all chunk allocation is done using a straight oversized malloc call.
Comment 54•14 years ago
Here are the benchmark results on Linux for a 4-core Xeon X3353 2.66 GHz CPU for the last 3 patches compared with TM tip (revision ce654228dabe). I have used jemalloc-enabled xpcshell to stay as close to a browser as possible. The table lists the speedup factors compared with the base, so a number less than 1 is a slowdown.

                        sunspider  alloc1  free1  alloc2  free2
background mmap/munmap  1.000      0.84    1.14   1.00    2.14
background malloc/free  1.016      0.91    1.14   1.01    2.14
malloc/background free  1.011      1.02    1.14   0.98    2.14

Here alloc1 and free1 are numbers for the benchmark from comment 41, and alloc2 and free2 are for the benchmark from comment 29. In the first benchmark there is a 50% mix of GC and malloc allocations; the second benchmark exercises only GC allocations. The numbers indicate that background munmap/free makes things better across all patches. The pure allocation measurements show a slowdown with background mmap/malloc done on a separate thread. One possible cause is the extra memory bandwidth required to populate caches on two threads. To explain the SunSpider results, which show a win for malloc over mmap, one has to remember that after each test the SunSpider driver runs gc(). When that returns, the background thread continues to call free. It seems that such freeing improves malloc performance on the other thread. I do not have an explanation for that, but I have observed the effect with jemalloc and GLIBC malloc in other benchmarks and patches. From a code-simplicity point of view, on Linux the winner is oversized malloc done in the foreground with background free. It wastes some space, but some of that can be recovered by moving JSGCArenaInfo there.
Assignee
Comment 55•14 years ago
I tried attachment 425804 [details] [diff] [review] on the clock benchmark on my Mac Pro. I just filed bug 545729 because canopy does not work right now. I also noticed a big GC pause difference between the first start of the browser after a recompile and restarts afterwards: the GC pause time decreases by about 20% after the first restart. These numbers are collected after a restart. Again, all numbers except call counts are cycles * 1E6 and a mean value of 10 GC runs after the benchmark has been running for about one minute.

Tip:
GC Total: 72.612444
NewGCChunk: 9.6939, NewGCChunk calls: 457, per alloc: 0.023627
Mark: 15.872284, Sweep: 55.327894
Finalize Obj: 35.034302, Doubles: 0.002316
Destroy Chunk: 18.502619, Count: 458, per Dest: 0.037205

attachment 425804 [details] [diff] [review]:
GC Total: 55.300570
NewGCChunk: 4.147308, NewGCChunk called: 447, per alloc: 0.010420
Mark: 16.602647, Sweep: 34.473739
Finalize Obj: 32.812583, Doubles: 0.001686
Destroy GCChunk calls: 421, per Dest: 0.081875

return GCChunks if they remain empty for 2 GC runs:
GC Total: 53.816886
NewGCChunk: 0, Count: 0, per alloc: 0.000000
Mark: 16.472776, Sweep: 35.943617
Finalize Obj: 31.837134, Doubles: 0.002064
Destroy GCChunk: 0, Count: 0, per Dest: 0.000000

The GC pause time is reduced, which is good, but we still give back all free chunks and reallocate them right away. All benchmarks that generate short-lived objects will suffer due to the bad allocation job OS X does. SunSpider also suffers, as shown in comment 5. I don't understand the code simplicity argument: JSReplenishGCChunkPoolTask is simpler than adding a count variable to each chunk? Returning chunks after they have remained empty between two GC calls shouldn't be that complicated.
Comment 56•14 years ago
(In reply to comment #55)
> The GC pause time is reduced thats good but we still give back all free chunks
> and reallocate them right away. All benchmarks that generate short-lived
> objects will suffer due to the bad allocation job OS X does. Also SS suffers as
> shown in comment 5.

That is platform-specific. On Linux I see the opposite, so I guess we need results on Windows before we can proceed. Also note that a particular way of delaying chunk release is orthogonal to whether we use mmap or malloc. Could you also try attachment 425803 [details] [diff] [review]?

> I don't understand the code simplicity argument. JSReplenishGCChunkPoolTask is
> simpler than adding a count variable to each chunk?

Sorry, I was not clear about that. I was referring to the difference between attachment 425803 [details] [diff] [review] and attachment 425804 [details] [diff] [review]. The former implements background chunk allocation on top of the latter. That background allocation brings extra complexity that AFAICS does not justify the gains on Linux.
Comment 57•14 years ago
The engine returns GC chunks to the OS way too fast. This fix would yield big speedups in a threaded world where GC runs much more often. Any idea when to expect this to land? Thanks, Mike M.
Assignee
Comment 58•14 years ago
Well, as you can see from the discussion in this bug, it's not so easy. On Mac it seems to be a huge performance win if we don't return the chunks right away, but on Linux it's a little bit different. Neither Igor nor I has a Windows machine to get numbers on, so we are waiting for a Windows user. But I agree we should be more active here again.
Comment 59•14 years ago
I'm your Windows guy. Update your patch for tip (make sure to use js_malloc() instead of malloc() and js_free() instead of free()). Just tell me exactly how to test it and I'll get you some numbers for Win32. Does your testing method involve threads? Mike M.
Assignee
Comment 60•14 years ago
I want to go back to this bug again and show how important it would be to solve this problem for Mac. This graph shows GC times of the Canopy benchmark for the current tip.
Assignee
Comment 61•14 years ago
And the same for my patch, where I return GC chunks if they remain empty for 2 GC runs. The average GC time drops from about 120 [1E6 cycles] to about 80. We should decide which direction we want to go here... My patch shouldn't need a refresh, but I can create one if it does.
Comment 62•14 years ago
Can we get some numbers for Windows? That would be great. As long as we don't make things slower there, we should take the patch.
Comment 63•14 years ago
I can get you numbers. Just tell me what I need to do to get those timings; be as specific as possible. Greg, updating the patches might help too. It seems two are needed for this?
Comment 64•14 years ago
(In reply to comment #61)
> We should decide which direction we want to go here....

I would like to move the finalization of doubles and strings to the background thread. For most practical cases this effectively moves the chunk release off the GC thread. When that is done we may revisit this bug to see whether background allocation or delayed release of the chunks is necessary.
Comment 65•14 years ago
Igor, in an embedding with many threads (where GC runs way more often) this patch will probably make a big difference. I'm not sure background finalization will have the same impact. Your thoughts?
Comment 66•14 years ago
(In reply to comment #65)
> In an embedding with many threads (where GC runs way more often) this patch
> will probably make a big difference. I'm not sure if background finalization
> will have the same impact.

I am not against the proposed changes. I just would like to do them after implementing background finalization, as that may require rewriting any patch proposed here.
Comment 67•14 years ago
Igor, do you have a bug# for the background finalization you are speaking about? If so, are you actively developing it, or is this an idea for the future? ETA? Thanks.
Assignee
Comment 68•14 years ago
(In reply to comment #66)
> I am not against proposed changes. I just would like to do them after
> implementing the background finalization as that may require to rewrite any
> proposed patch here.

When will this patch be ready? So if I understand correctly, you want to move the finalization and chunk release to the background thread. For the canopy benchmark, for example, this means that we free and allocate the same amount of memory in parallel. I don't think this will be very fast. Didn't we see the impact on the GC benchmark we had to remove from the test suite?
Comment 69•14 years ago
(In reply to comment #67)
> Do you have a bug# for the background finalization you are speaking about?
> If so are you in active development or is this an idea for future.
> ETA?

Bug 543036
Comment 70•14 years ago
(In reply to comment #68)
> When will this patch be ready?

For doubles (bug 543036) I should have the patch before Monday. Strings should be ready during the next week.

> This means for the canopy benchmark for example that we free and alloc the same
> amount of memory in parallel.

We already do the free/malloc in parallel. But this is still faster than freeing from one thread, as it minimizes the GC pause and spreads a possible slowdown over all allocations. Still, for the GC chunks we may need to do something special. But then again, let's consider this after the background changes.
Assignee
Comment 71•14 years ago
Added GCChunk count.
Assignee
Comment 72•14 years ago
Added GCChunk count.
Comment 73•14 years ago
In bug 553812 I am going to bump the chunk size to 2 MB among other refactorings. The patch will also use vm_allocate on Mac. That will alter the stats here.
Depends on: 553812
Comment 74•14 years ago
Gregor, could you try the patch from bug 553812 on Mac to see how it performs?
Assignee
Comment 75•14 years ago
Performance for the patch of bug 553812. Igor asked in this bug, and since all the other graphs are in this bug, I post the result here. I can't see a reduction of the GC pause time. We allocate and deallocate fewer chunks, but we still give them back and allocate them right away. It might be an improvement that we perform fewer GC runs, but I have to verify that.
Assignee
Comment 76•14 years ago
Comparison of the current tip, Igor's patch for bug 553812, and the patch from this bug with the delayed chunk deallocation. The main difference is that chunk destruction gets cheaper: 20E6 cycles for tip, 10E6 cycles for Igor's patch, and 4E6 cycles for this patch. The overall GC pause is about 68E6 cycles for tip, 58E6 cycles with Igor's patch, and 53E6 with the delayed chunk deallocation. PS: being able to add multiple files at once would be a nice feature :)
Assignee
Comment 77•14 years ago
Assignee
Comment 78•14 years ago
Comment 79•14 years ago
Gregor: thanks for benchmarking!

(In reply to comment #75)
> I can't see a reduction of the GC pause time. We allocate and deallocate less
> chunks but we still give them back and allocate them right away.

Yes, sure. But with bigger chunks a simple strategy of keeping one extra chunk available (perhaps via background allocation) may just work. Plus these chunks can be shared with jemalloc chunks, so a common pool of chunks for jemalloc and the GC could be very beneficial.
Comment 80•14 years ago
We have a proposed patch that is a strict improvement and a lot of maybes. I side with the patch in hand, and possible future improvements on top of that.
Comment 81•14 years ago
Would it be possible to make this value:

static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;

settable via API? If you are running 100 concurrent threads, GC gets called A LOT and the thrashing is pretty bad. It might be nice to be able to tune this by hand depending on the embedding. Other than that... land it.
Comment 82•14 years ago
(In reply to comment #80)
> We have a proposed patch that is a strict improvement and a lot of maybes.

The patch is a strict improvement on Mac. On Linux, with its better VM, the situation is not that clear. That is why we need data from Windows.

> I side with the patch in hand, and possible future improvements on top of that.

On top of the patch from bug 553812 I would like to implement a pool of chunks that is shared between jemalloc and the GC. For such a pool any internal GC pooling would be counterproductive. Also, even on platforms where pooling is necessary (Mac does not use jemalloc), big 1MB chunks may need or allow a different pooling strategy. That is why I would like to land platform-neutral changes first and then add platform-specific improvements.
Comment 83•14 years ago
Can we land the patch from bug 553682? That would help me get you some numbers for Windows.

> On top of the patch from the bug 553812 I would like to implement a pool of
> chunks that is shared between jemalloc and the GC. For such pool any internal
> GC pooling would be counterproductive.

Igor, that might help the browser but not SpiderMonkey embedders in general. We don't use jemalloc. (If the JS_USE_CUSTOM_ALLOCATOR patch ever lands, we might be able to.) Your scope is too narrow. Also, please do some multi-threaded testing on Linux before you make up your mind. Try testing with many threads running and GC firing every couple of seconds.
Comment 84•14 years ago
(In reply to comment #83)
> Can we land patch from bug 553682? That would help me get you some numbers for
> windows.

That is already done.

> Igor that might help the browser but not spider monkey embedders in general.
> We don't use jemalloc.

This emphasizes that what is good for the browser may not be good for other embeddings. I suspect the ultimate solution is to do the chunk pooling outside the engine via some easy-to-use API, especially given the huge platform differences in mmap implementations. One way to do that would be to support rerouting chunk allocation/deallocation through an embedding-supplied callback. This would support both the browser implementing a common pool via jemalloc and a heavily multi-threaded embedding that just preallocates the whole GC heap for best possible performance. If an embedding has enough memory and does not care that it is not swappable, it could even try VirtualAlloc(MEM_LARGE_PAGES) on Windows.
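The callback rerouting suggested here can be sketched as a pair of installable function pointers with a built-in fallback. The names (JS_SetGCChunkCallbacks and friends) are hypothetical, not an actual SpiderMonkey API; the fallback uses plain malloc/free where the real engine would use its platform chunk allocator.

```cpp
#include <cassert>
#include <cstdlib>

// Embedding-supplied chunk allocation hooks.
typedef void* (*JSChunkAllocOp)(size_t size);
typedef void (*JSChunkFreeOp)(void* chunk, size_t size);

static JSChunkAllocOp gChunkAlloc = nullptr;
static JSChunkFreeOp gChunkFree = nullptr;

// The embedding installs its own allocator/deallocator; passing nullptr
// restores the engine's default behavior.
void JS_SetGCChunkCallbacks(JSChunkAllocOp allocOp, JSChunkFreeOp freeOp) {
    gChunkAlloc = allocOp;
    gChunkFree = freeOp;
}

// What the engine would call internally instead of mmap/VirtualAlloc.
void* NewGCChunk(size_t size) {
    return gChunkAlloc ? gChunkAlloc(size) : std::malloc(size);
}

void DestroyGCChunk(void* chunk, size_t size) {
    if (gChunkFree)
        gChunkFree(chunk, size);
    else
        std::free(chunk);
}
```

A server-style embedding could install callbacks that hand out slices of one big up-front reservation and make the free callback a no-op, which is the "preallocate everything" strategy described above.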
Comment 85•14 years ago
> This emphases that what is good for the browser may not be good for other embed
> dings. I suspect that the ultimate solution would be to do the chunk pooling
> outside the engine via some easy to use API especially given huge platform
> differences in mmap implementations.
> One way to do that would be to support rerouting of chunk
> allocation/deallocation through an embedding supplied callback.

Yes. That was the purpose of bug 549532 (custom allocator support). Once it lands, I will create a follow-up patch that handles the GC's need for aligned memory, such as VirtualAlloc. That patch got too big, so I left it out for now. Then you can plug in jemalloc or any other custom allocator, for only GC memory or for all allocations if you wish. That will help a lot. However, the "ultimate solution" is to have thread-specific GC (or greatly reduced locking) AND a custom allocator in play.
Assignee
Comment 86•14 years ago
(In reply to comment #81)
> Would it be possible to make this value:
>
> static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
>
> settable via API?

Sounds like a good idea. I will refresh it later today. We could even set it to 0 for Linux if we don't find a solution for the platform problem.
Comment 87•14 years ago
(In reply to comment #86)
> (In reply to comment #81)
> > Would it be possible to make this value:
> >
> > static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
> >
> > settable via API?
>
> Sounds like a good idea. I will refresh it later today.
> We could even set it to 0 for linux if we don't find a solution for the
> platform problem.

I suggest instead adding an API that allows overriding the NewChunk/DestroyChunk calls and implementing the pooling outside the engine. This way the browser can implement chunk release on memory pressure without the need for yet another hook into the engine.
Comment 88•14 years ago
> I suggest to add instead an API that would allow to override
> NewChunk/DestroyChunk calls and implement the pooling outside the engine.

See comment #85.

> This way the browser can implement the chunk release on memory pressure
> without the need to provide yet another hook into the engine.

I think the patch is still valid, even with a custom allocator. An allocator could introduce more locking contention, and not very many embedders are going to use one; it's sort of "advanced", I think. It would be nice to improve speed without the need for a custom allocator.
Comment 89•14 years ago
(In reply to comment #88)
> An allocator could introduce more locking contention.

Lock contention is not an issue for GC chunks.

> And not very many
> embedders are going to use one. Its sort of "advanced" I think.
> It would be nice to improve speed without the need for a custom allocator.

Since the benefits on Linux are not that clear, with the proposed API the engine gains a few more lines of platform-dependent code while exposing a chunk management policy that may not be useful outside a browser. It may not even be useful in the browser on Windows and Linux if we can implement a shared pool of chunks for the GC and jemalloc. Also consider that in the browser we have all the necessary infrastructure to do time-based and memory-pressure-based chunk deallocation, which is hard to do in the engine. These are the grounds for my belief that a chunk allocation callback is the better idea. I do not think that a simple-to-use API together with working examples would make things very complex for embedders. For example, the documentation or a test source could show how to write a trivial custom "allocator" that just VirtualAllocs all the necessary memory in one call and releases it only on shutdown. For server-style embeddings that would be the fastest way to deal with chunks.
Assignee
Comment 90•14 years ago
Well, I want to come back to the starting point of this bug, where we found out that for certain webpages or benchmarks sweeping really kills us. Let's take the numbers from bug 539532 and bug 556133 [in cycles 1E6] for OS X:

                mark  sweep  total
Firefox Tip 1:  25    110    135
Firefox Tip 2:  25    210    235
WebKit:         7     0      7

Firefox Tip 1 is cross-compiled with all the flags that were needed a few months ago to build the browser on 10.6. Firefox Tip 2 is a simple optimized browser build without cross compilation. I don't have the time right now to measure every detail again of why we are so slow, but I can only see the GC pause time being reduced by avoiding work, not by adding further complexity.
Comment 91•14 years ago
(In reply to comment #90)
> Well I want to come back to the starting point of this bug where we found out
> that for certain webpages or benchmark sweeping really kills us.
>
> Lets take the numbers from bug 539532 and 556133 [in cycles 1E6] for OS X:
>
>                 mark  sweep  total
> Firefox Tip 1:  25    110    135
> Firefox Tip 2:  25    210    235
> WebKit:         7     0      7
>
> Firefox Tip 1 is cross-compiled with all the flags that were needed a few
> months ago to build the browser on 10.6.
> Firefox Tip 2 is a simple optimized browser build without cross compilation.

Could you clarify what the difference between Tip 1 and Tip 2 is? Is it a different OS or compiler flags?

> I don't have the time right now to measure every detail again why we are so
> slow but I can only see a reduction of the GC pause time by avoiding work and
> not adding further complexity.

I agree with that. But if the pause time is platform-dependent, then adding more platform-dependent code would just increase the complexity. That is why I would prefer to first try platform-neutral optimizations like a bigger chunk size, a better layout of the data structures, etc., and only then consider platform-specific tuning.
Assignee
Comment 92•14 years ago
> Could you clarify what is the difference between tip1 and tip2? Is it different
> OS or compiler flags?
Tip 1 and Tip 2 were compiled on OS X 10.6, 2x2.66GHz Dual-Core Intel Xeon MacPro.
Tip 1 was compiled with the following compiler flags in the .mozconfig:
CC="gcc-4.2 -arch i386"
CXX="g++-4.2 -arch i386"
ac_add_options --target=i386-apple-darwin8.0.0
ac_add_options --with-macos-sdk=/Developer/SDKs/MacOSX10.5.sdk
ac_add_options --enable-macos-target=10.5
ac_add_options --disable-crashreporter
HOST_CC="gcc-4.2"
HOST_CXX="g++-4.2"
RANLIB=ranlib
AR=ar
AS=$CC
LD=ld
STRIP="strip -x -S"
CROSS_COMPILE=1
For Tip 2 I removed everything and only have disable-debug, enable-optimize, and enable-gctimer.
Assignee
Comment 93•14 years ago
>
> I agree with that. But if the pause time is platform-dependent, then putting
> more platform-dependent code would just increase the complexity. That is why I
> would prefer first try platform neutral optimization like using bigger chunk
> size, better layout of data structures etc only then consider platform-specific
> tuning.
I don't think we need platform-specific code here. We are talking about a 30% GC pause time reduction in the browser versus a 1% slowdown on SunSpider for certain Linux distributions in the shell. I can't even see a slowdown for SS on my Ubuntu Linux machine.
With patch: Total: 715.9ms +/- 0.3%
Tip: Total: 715.8ms +/- 0.4%
I am not a big fan of optimizations where we give back and reallocate memory right away rather than reusing it. I would love to see numbers from mobile devices, because this sounds like a major bottleneck there.
Comment 94•14 years ago
(In reply to comment #93)
> I don't think we need platform specific code here.

If we implemented a common pool for jemalloc and the GC, there would inevitably be platform-dependent code in the engine, even if just to disable internal chunk pooling on Linux and Windows. Given this, and that a shared pool for the GC and jemalloc would need to be implemented outside the engine, I would prefer to put GC-only pooling in the same place.
Comment 95•14 years ago
(In reply to comment #93)
> I don't think we need platform specific code here. We are talking about 30% GC
> pause time reduction in the browser versus 1% slowdown on Sunspider for certain
> Linux distributions in the shell. I can't even see a slowdown for SS on my
> Ubuntu linux machine.

This bug seems stuck, but I can't see why. Gregor has provided a lot of data. What are the concrete steps we need to take to complete this work?
Comment 96•14 years ago
(In reply to comment #95)
> This bug seems stuck, but I can't see why. Gregor has provided a lot of data.
> What are the concrete steps we need to take to complete this work?

This is my bad: I did not realize until late in the game that pooling of GC chunks should be done outside the engine, as different platforms require different strategies. For example, when using jemalloc we should use a common pool of GC chunks and jemalloc chunks for optimum performance.

Another observation is that with our own chunk pooling we should provide a way to release the allocated chunks on memory pressure, like a real OOM, especially in view of infallible malloc. Custom chunk allocation would provide just that.

So as a way forward I would like to first implement bug 557538 and move the suggested pooling outside the engine while properly integrating it with infallible malloc.
Comment 97•14 years ago
(In reply to comment #96)
> So as a way forward I would like first to implement the bug 557538

How complex is that bug? Do you have an estimate of when you'll get to it?
Comment 98•14 years ago
(In reply to comment #97)
> (In reply to comment #96)
> >
> > So as a way forward I would like first to implement the bug 557538
>
> How complex is that bug? Do you have an estimate of when you'll get to it?

I have a patch, but it is built on top of bug 553812. I guess I can rebase it and ask for a review tomorrow.
Assignee
Comment 99•14 years ago
(In reply to comment #98)
> (In reply to comment #97)
> > (In reply to comment #96)
> > >
> > > So as a way forward I would like first to implement the bug 557538
> >
> > How complex is that bug? Do you have an estimate of when you'll get to it?
>
> I have a patch but that is build on top of the bug 553812. I guess I can rebase
> it and ask for an review tomorrow.

How are these two bugs related? How does the deallocation and reallocation of chunks for the Mac browser get optimized? Is the idea to reimplement this idea of delayed chunk returns in an outside-defined callback?
Comment 100•14 years ago
(In reply to comment #99)
> How are these two bugs related? How does the deallocation and reallocation of
> chunks for the Mac browser get optimized? Is the idea to reimplement this idea
> with delayed chunk returns in an outside defined callback?

The current patch needs integration with the allocation code to release pooled GC chunks on memory pressure, especially given that the browser has switched to a malloc that never returns null. One way or another this has to be implemented, and implementing the chunk management together with the code that implements infallible malloc makes more sense. So yes, I would like to move the pooling outside the engine via a callback.
Comment 101•14 years ago

Does it get harder to do that if this patch has landed? It seems like it will change our GC timing quite a bit, in pleasant ways, and that will let us start to get data on the next hurdles to tackle.

If landing this patch doesn't represent a regression for the browser, and it doesn't make moving to a callback model massively more difficult, I would be very much in favour of landing it, getting the baseline win, and gathering more data while the refactoring and jemalloc chunk-sharing patches wend their way through the system.

We have a patch here with a lot of good data, and it seems like we've stalled on it mostly because we think there's something even *better* in the future. I think we should try to avoid that, especially where we're talking about a major win on a major pain point!
Assignee
Comment 102•14 years ago
bugfix.
Attachment #423842 - Attachment is obsolete: true
Attachment #424382 - Attachment is obsolete: true
Attachment #437774 - Flags: review?(igor)
Attachment #423842 - Flags: review?(igor)
Comment 103•14 years ago
(In reply to comment #101)
> Does it get harder to do that if this patch has landed? It seems like it will
> change our GC timing quite a bit, in pleasant ways,

It definitely helps Mac OS X, but on other platforms this is less clear. Theoretically, with a fast enough mmap/munmap the patch can even harm things, since the system cannot take the physical memory held by chunks (which will be cached after the GC) and use it for malloc chunks. But OK, let's proceed with this and do alternative solutions later.
Comment 104•14 years ago
(In reply to comment #102)
> Created an attachment (id=437774) [details]
> patch

We need to add two things here: a public API like JS_ClearCaches that would, for a start, clear pooled chunks, and a preprocessor define to switch on the pooling, which for now will always be true. The latter will help to move the pooling outside the engine later. I will promptly review the patch with these changes.
Comment 105•14 years ago

Why not put the ifdef in later, when we have the patch for the pooling, and the public API in when we have a consumer ready for it? Again, I don't think we need those things for this to avoid being a regression, so why hold up the patch on them?

When we have something that is a win, and does not have regressions, we should get it *in*, not hold it up because we might want to do even more later. It is no harder to add the #ifdef afterwards, nor to add the public API afterwards, and working with smaller patches reduces risk.

We really need to get out of this world where possible-future-addition-perfection keeps us from taking things that are, on their own, improvements.
Comment 106•14 years ago
I agree with shaver.
1.) We shouldn't add to the public API without a caller.
2.) The #ifdef should go in the patch that uses it.
Comment 107•14 years ago
Comment on attachment 437774 [details] [diff] [review]
patch

>+AddChunkToEmptyList(JSRuntime *rt, JSGCChunkInfo *ci)
>+{
>+    ci->next = rt->gcEmptyChunkList;
>+    rt->gcEmptyChunkList = ci;
>+}

In bug 553812 I have seen rather bad effects of using a list to store empty arenas, as walking the list brings into the cache and TLB all the pages containing the list link fields stored in the arenas. So to avoid that for GC chunks, and for simpler code, use js::Vector to store the empty chunks. This way, working over the empty chunk list would touch only the vector of pointers itself. Any OOM failure when adding a chunk to the vector should be treated as a signal to release the chunk immediately.
Attachment #437774 - Flags: review?(igor)
Comment 108•14 years ago
Igor, I would like to see profile data for comment 107. Especially after your change to use 1MB chunks, we have very few chunks (8-16 in a browser session, maybe). I have a hard time believing that can measurably impact TLB performance. Let's optimize real performance problems (like the one the patch addresses), not imaginary ones. Even if this were a measurable problem (I am confident it's not), let's take a follow-up patch. I think we have given Gregor enough run-around.
Comment 109•14 years ago
The data for comment 107 came from bug 550373 comment 18, a synthetic benchmark. Indeed we need to study real Firefox GC chunk populations, including with hundreds of tabs open (8-16 in a browser session? Not with that many tabs, I suspect -- but let's measure).

A bug over 100 comments long, with a patch that wins good performance, where only synthetic benchmarks argue for a data structure change, suggests doing the data structure change in a follow-up bug. And doing the real Firefox population measurements.

Usually you can do such measurements by hand. For this and many other studies, we also have TestPilot. We should consider using it.

/be
Assignee
Comment 110•14 years ago
New version due to changes in the GC code. Added a vector for empty chunks.
Attachment #437774 - Attachment is obsolete: true
Comment 111•14 years ago
Comment on attachment 438668 [details] [diff] [review]
patch

>+
>+    js::Vector<JSGCChunkInfo*, 0, js::SystemAllocPolicy> gcEmptyChunkVector;

Nit: I suggest gcEmptyChunks. We don't use Hungarian notation around here :)

>+static const jsuword GC_EMPTY_CHUNK_SURVIVES = 3;
>+

Nit: I don't much like "survives". GC_EMPTY_CHUNK_RESERVE = 3 maybe? It's also overloaded. RESERVOIR maybe? Pick something.

>+    size_t gcSurvived;

Nit: Same here. The name sucks. I suggest something more expressive.

>     JSGCChunkInfo *ci = rt->gcChunkList;
>+
>+    if (!ci && !rt->gcEmptyChunkVector.empty()) {
>+        ci = rt->gcEmptyChunkVector.back();
>+        rt->gcEmptyChunkVector.popBack();
>+        JS_ASSERT(ci);
>+        ci->gcSurvived = 0;
>+        ci->init(rt);
>+    }
>+
>     jsuword chunk;
>     if (!ci) {

Merge these conditions somehow. This seems redundant if you hit the fast path above.

>     if (ci->numFreeArenas == GC_ARENAS_PER_CHUNK) {
>         ci->removeFromList(rt);
>-        ci->next = emptyChunkList;
>-        emptyChunkList = ci;
>+        rt->gcEmptyChunkVector.append(ci);

Not setting gcSurvived = 0 here?
Comment 112•14 years ago
The RESERVOIR comment above is nonsense, of course. So it looks like that's an age counter? Or generations?
Assignee
Comment 113•14 years ago
Age is better. Now with append failure handling.
Attachment #438668 - Attachment is obsolete: true
Comment 114•14 years ago
(In reply to comment #108)
> Igor, I would like to see profile data for #107. Especially after your change
> to use 1MB chunks, we have very few chunks (8-16 in a browser session, maybe).

I was referring to the pre-big-chunk state of things.

> I have a hard time believing that can measurably impact TLB performance.

There is a code-simplicity argument as well. A vector is just a more suitable data structure for appending and bulk removal than a list. I should have seen that a long time ago. See bug 559141, which eliminates the doubly-linked list of chunks based on your suggestion. As another proof, consider that doing the delayed chunk release on top of that patch would mean adding a few lines of code to the chunk scanning loop without the need for any helper methods.
Comment 115•14 years ago
Oh, I completely agree that Vector is much better, and Gregor is adding that. I am just not buying the TLB argument, and think it could have been done as a separate patch (it will be one patch now, though).
Comment 116•14 years ago
(In reply to comment #115)
> I am just not buying the TLB argument

With big chunks those arguments are indeed moot. But with smaller ones it is not that straightforward. For example, with smaller chunks cachegrind has shown that during double finalization, walking over the list of empty arenas to see if the GC can release the chunks containing those arenas was the second biggest source of L1/L2 cache misses during the GC (the biggest source of misses came from arena->hasMarkedDoubles checks). Those misses were clearly also TLB misses, since a GC arena takes a whole CPU page.
Assignee
Comment 117•14 years ago
Attachment #438673 - Attachment is obsolete: true
Comment 118•14 years ago
Comment on attachment 438919 [details] [diff] [review]
patch

>+    if (!rt->gcEmptyChunks.empty()) {
>+        ci = rt->gcEmptyChunks.back();
>+        rt->gcEmptyChunks.popBack();
>+        JS_ASSERT(ci);
>+        chunk = ci->getChunk();

Get a reference straight to gcEmptyChunks if you talk to it repeatedly, to avoid the rt-> dereference every time. It's also easier to read.

>+    ci->gcChunkAge = 0;
>+    if (!rt->gcEmptyChunks.append(ci)) {
>+        jsuword chunk = ci->getChunk();
>+        PutGCChunk(rt, (void *)chunk);
>+    }

PutGCChunk is the worst name in the world. Please make that something else. Free or release. Anything.

>+void DestroyEmptyGCChunks(JSRuntime *rt, bool releaseAll)
>+{
>+    size_t newLength = 0;
>+    JSGCChunkInfo **array = rt->gcEmptyChunks.begin();
>+    size_t length = rt->gcEmptyChunks.length();
>+
>+    for(size_t i = 0; i < length; ++i){
>+        JSGCChunkInfo *ci = array[i];

The proper pattern is probably for (JSGCChunkInfo **ci = begin(); ci != end(); ++ci); also, there is a missing space after "for" and before "{".

>+        ci->gcChunkAge++;
>+        array[newLength++] = ci;

You can keep this one, I guess.

r=me with nits picked at your leisure.
Attachment #438919 - Flags: review+
Assignee
Comment 119•14 years ago
Attachment #438919 - Attachment is obsolete: true
Assignee
Comment 120•14 years ago
Whitespace fix.
Attachment #438929 - Attachment is obsolete: true
Assignee
Comment 121•14 years ago
http://hg.mozilla.org/tracemonkey/rev/15955e3513e0 r=gal
Assignee
Updated•14 years ago
Whiteboard: fixed-in-tracemonkey
Comment 122•14 years ago
http://hg.mozilla.org/mozilla-central/rev/15955e3513e0
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED