Open
Bug 600234
Opened 13 years ago
Updated 6 months ago
moving GC ArenaHeader out of Arena again
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
NEW
People
(Reporter: igor, Unassigned)
References
Details
Attachments
(2 files, 8 obsolete files)
378 bytes,
text/plain
|
Details | |
25.70 KB,
patch
|
Details | Diff | Splinter Review |
The patch for bug 558861 has moved the GC arena header into the arena itself I presume for code simplicity. But having the header there implies that various read and writes that involves only the header field would have bigger CPU cache and TLB footprints. Also this waste some memory especially for objects and function arenas as the header occupies the whole cell for the object when only 6 words are required. Those effects were visible previously in cachegrind and were the reason for moving the header out of arena.
Comment 1•13 years ago
|
||
Not too worried about the memory waste. Peanuts. Can you link me to the cachegrind data?
Reporter | ||
Comment 2•13 years ago
|
||
Here is test based on what I have used before: function prepare(N) { var array = []; for (var i = 0; i != N; ++i) { array.push(String.fromCharCode(100, 101, 102)); array.push({}); } return array; } function do_it() { var array = prepare(12e5); var t = Date.now(); gc(); var time1 = Date.now() - t; array = prepare(10); array = null; var t = Date.now(); gc(); var time2 = Date.now() - t; return [time1, time2]; } print(do_it()); The test prints the pair representing the timing of GC dominated mostly by marking and by finalization time of a lot of empty objects and short strings to consider only pure GC things. On the TM the best timing among 20 runs is the pair 100,36. On a pre-compartment GC the best pair is 93,33. So we have slowdown on the scale of 7-10%. This is for 32 bit js shell on dual-core iCore 7 laptop under fedora 13. Given the code that test exercise the only difference can come from a separated arena header.
Reporter | ||
Comment 3•13 years ago
|
||
Making this bug as a performance regression from the bug 558861.
Comment 4•13 years ago
|
||
Lets assume for a second that the workload is relevant and that the measurement is accurate, how do we explain the difference?
Comment 5•13 years ago
|
||
The patch to move the header out of line. Untested, but pretty trivial. I am still not convinced this is a good idea though.
Assignee: igor → gal
Comment 6•13 years ago
|
||
Without accidental jetpack change.
Attachment #479134 -
Attachment is obsolete: true
Reporter | ||
Comment 7•13 years ago
|
||
(In reply to comment #4) > Lets assume for a second that the workload is relevant and that the measurement > is accurate, how do we explain the difference? The first number in the above test represent the case when almost all GC arenas are fully used with no free cell available. With a separated array of arena headers during the sweep the GC does not need to touch the arena itself as it will see that everything is marked. This explains the first number. The second number is the opposite case when the arena contains only dead things with an empty free list. This is a common scenario when the GC happens after an allocation burst with the heap containing mostly a garbage. But it is harder to see what is responsible for the win here. With or without a separated header the GC has to assemble the new free list which presumably warm ups the CPU cache.
Reporter | ||
Comment 8•13 years ago
|
||
With a single header I could not match the numbers of the pre compartment GC. It is really important to have the header to be the power of two. So the patch adds two headers, one 4-worded and one two-word where only two-worded header is touched during the GC. This allowed to recover all the timing.
Reporter | ||
Comment 9•13 years ago
|
||
This patch uses single separated arena header but shrinks it to 4 words in optimized build. This avoids the extra complexity of 2 headers and allowed to have the same results in the benchmark above or its version that I have observed with pre-compartment-gc code. The patch does not use templates for arena header as semantically the header allows to discover the dynamic type of the arena. The patch also adds Arena::chunkAndIndex with very similar code that was used in older GC. Last time I have checked GCC could not optimize separated chunk() and arenaIndex() calls into the common optimized version. The patch does not affect SS or V8 scores.
Assignee: gal → igor
Attachment #479136 -
Attachment is obsolete: true
Attachment #479143 -
Attachment is obsolete: true
Attachment #479399 -
Flags: review?(anygregor)
Reporter | ||
Comment 10•13 years ago
|
||
Here is a rebased patch that also adds Cell::arenaHeader helper to make those lines with asCell()->arena()->header()-> little bit shorter. The patch also moves the alignment checks into the jsgc.h header. This avoids explicit template incitations and allowed for a wider assert coverage as the check is called during marking of any GC thing including JSXML.
Attachment #479399 -
Attachment is obsolete: true
Attachment #479751 -
Flags: review?(anygregor)
Attachment #479399 -
Flags: review?(anygregor)
Comment 11•13 years ago
|
||
This change was mostly driven by code simplicity. Do we have real numbers that really force us to go back to the old system? SS, V8, also the browser versions of the benchmarks? I was measuring a lot for the new GC and the cycles needed to mark an object (total marking time / amount of marked objects) went down.
Comment 12•13 years ago
|
||
I also used this micro benchmark for shark profiles and I get following numbers: Changeset right before GC landing: idefix2:OPT.OBJ idefix2$ time ./js ../../../../tests/wave.js END real 0m13.179s user 0m13.076s sys 0m0.101s idefix2:OPT.OBJ idefix2$ time ./js -j ../../../../tests/wave.js END real 0m10.501s user 0m10.389s sys 0m0.110s idefix2:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js END real 0m10.513s user 0m10.401s sys 0m0.109s idefix2:OPT.OBJ idefix2$ time ./js -m ../../../../tests/wave.js END real 0m10.627s user 0m10.498s sys 0m0.092s current tip: idefix2:OPT.OBJ idefix2$ time ./js ../../../../tests/wave.js END real 0m10.363s user 0m10.265s sys 0m0.096s idefix2:OPT.OBJ idefix2$ time ./js -j ../../../../tests/wave.js END real 0m7.991s user 0m7.894s sys 0m0.096s idefix2:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js END real 0m7.942s user 0m7.847s sys 0m0.093s idefix2:OPT.OBJ idefix2$ time ./js -m ../../../../tests/wave.js END real 0m8.120s user 0m8.022s sys 0m0.097s
Comment 13•13 years ago
|
||
Ups ignore that testcase. That's the wrong one.
Reporter | ||
Comment 14•13 years ago
|
||
(In reply to comment #11) > This change was mostly driven by code simplicity. Unfortunately simpler does not translate into more efficient. Besides, given that we already have separated mark bitmap and delayed marking structures, having the header that comes together with arena adds feature complexity IMO (one have to deal both with embedded and separated structures instead of uniformity of having everything outside). > Do we have real numbers that > really force us to go back to the old system? The numbers above are real with independent callgrind confirmation. > SS, V8, See the last paragraph of the comment 9. It would be very surprising if the benchmark scores are affected. The benchmarks are mostly CPU bound so changes in the GC support structures that does not touch the free list implementation should not be visible in them. > also the browser versions > of the benchmarks? I will try to check the v8 in the browser. > I was measuring a lot for the new GC and the cycles needed to mark an object > (total marking time / amount of marked objects) went down. If the amount of marked objects went down, it implies different GC heuristics like GC running less frequent. It does not indicate that the marking time per object went down. It could well go up as the data here and in the bug 600648 indicates.
Comment 15•13 years ago
|
||
(In reply to comment #14) > (In reply to comment #11) > > This change was mostly driven by code simplicity. > > Unfortunately simpler does not translate into more efficient. Besides, given > that we already have separated mark bitmap and delayed marking structures, > having the header that comes together with arena adds feature complexity IMO > (one have to deal both with embedded and separated structures instead of > uniformity of having everything outside). I agree. If the measurements show that we could do better with more complex code then we have to go for it. But if we look at JSC and their GC implementation then the message is clearly simplicity wins! They prove that one can be faster with even simpler code. > > > Do we have real numbers that > > really force us to go back to the old system? > > The numbers above are real with independent callgrind confirmation. > > > SS, V8, > > See the last paragraph of the comment 9. It would be very surprising if the > benchmark scores are affected. The benchmarks are mostly CPU bound so changes > in the GC support structures that does not touch the free list implementation > should not be visible in them. > > > also the browser versions > > of the benchmarks? > > I will try to check the v8 in the browser. We still perform GC's in the current v8 shell harness. > > > I was measuring a lot for the new GC and the cycles needed to mark an object > > (total marking time / amount of marked objects) went down. > > If the amount of marked objects went down, it implies different GC heuristics > like GC running less frequent. It does not indicate that the marking time per > object went down. It could well go up as the data here and in the bug 600648 > indicates. Not the amount of marked objects went down. I measured the cycles needed on average to mark an object. So total cycles needed for marking process of a GC divided by the number of marked objects.
Reporter | ||
Comment 16•13 years ago
|
||
(In reply to comment #12) > Created attachment 479813 [details] > benchmark > > I also used this micro benchmark for shark profiles and I get following > numbers: The benchmark is very allocation sensitive and the compartment GC includes the inlining of the allocations. A better compassion would to check the pre-compartmental GC with the patch from the 595048 bug on top aginst the current tip. Or the older code versus the current tip plus the patch from the bug 558861 comment 67. In any case, the patch here shows a win from ~/m/tm/js/src> time ~/b/js64opt.tm/js -jm ~/s/wave.js END real 0m8.640s user 0m8.497s sys 0m0.086s to: ~/m/tm/js/src> time ~/b/js64opt.tm/js -jm ~/s/wave.js END real 0m7.416s user 0m7.292s sys 0m0.080s
Reporter | ||
Comment 17•13 years ago
|
||
(In reply to comment #15) > I agree. If the measurements show that we could do better with more complex > code then we have to go for it. But if we look at JSC and their GC > implementation then the message is clearly simplicity wins! They do not need to deal with threads, OOM, limited amount of the stack space and the need to have a cycle collector. They could also afford much simpler GC heuristics. Still our fast allocation path should be on pair with them. > Not the amount of marked objects went down. > I measured the cycles needed on average to mark an object. Hm, was it on 64 bit system?
Comment 18•13 years ago
|
||
> > Not the amount of marked objects went down. > > I measured the cycles needed on average to mark an object. > > Hm, was it on 64 bit system? Yes. All these measurements are part of bug 593729. It would make sense to tune everything in a larger context but I guess time is against me.
Comment 19•13 years ago
|
||
This patch doesn't apply to current tip.
Reporter | ||
Comment 20•13 years ago
|
||
(In reply to comment #19) > This patch doesn't apply to current tip. Sory, I forgot to mention it is on top of the patch from bug 600687 comment 3.
Reporter | ||
Comment 21•13 years ago
|
||
(In reply to comment #20) > (In reply to comment #19) > > This patch doesn't apply to current tip. > > Sory, I forgot to mention it is on top of the patch from bug 600687 comment 3. And I will land that in a minute.
Reporter | ||
Comment 22•13 years ago
|
||
In the browser running v8 on my laptop is too noisy to have any firm conclusions. Still the best v8 score among 20 runs went up from 2957 to 2981.
Comment 23•13 years ago
|
||
I get a bunch of warnings with this patch: jsgc.cpp ../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSObject]’: ../jsgc.cpp:2308: instantiated from here ../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSFunction]’: ../jsgc.cpp:2309: instantiated from here ../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSXML]’: ../jsgc.cpp:2311: instantiated from here ../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSShortString]’: ../jsgc.cpp:2323: instantiated from here ../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSString]’: ../jsgc.cpp:2324: instantiated from here ../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
Reporter | ||
Comment 24•13 years ago
|
||
(In reply to comment #23) > I get a bunch of warnings with this patch: This is from: ap = (Arena<T> **) &header->next; GCC 4.4 does not complain about it has much better aliasing analysis than older versions. But I guess on Mac one one has to deal with that.
Reporter | ||
Comment 25•13 years ago
|
||
This patch uses template-based header to avoid aliasing warnings.
Attachment #479751 -
Attachment is obsolete: true
Attachment #480035 -
Flags: review?(anygregor)
Attachment #479751 -
Flags: review?(anygregor)
Reporter | ||
Comment 26•13 years ago
|
||
With the latest patch the callgrind reliably shows for the v8-splay benchmark the decrease of instruction count with the patch from 1.67e9 down to 1.65e9 or a 1.3% win. I cannot see that using the benchmark suite - the iCore7 CPU speed on my laptop fluctuates substantially depending on the thermal situation which adds way too much noise.
Comment 27•13 years ago
|
||
Yes the cycles used to mark an object went down. Thats good! I still get a bunch of warnings with this patch: jsapi.cpp ../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSObject]’: ../jsgc.h:234: instantiated from here ../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSString]’: ../jsgc.h:237: instantiated from here ../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSShortString]’: ../jsgc.h:240: instantiated from here ../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSFunction]’: ../jsgc.h:245: instantiated from here ../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules ../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = js::gc::Cell]’: ../jsgc.h:441: instantiated from here ../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules I also don't really like the chunkAndIndex function. Does it really make a difference? If yes please add a comment. I have very limited time now since I am no longer in Mountain View but I will finish the review once all warnings are gone.
Reporter | ||
Comment 28•13 years ago
|
||
Comment on attachment 480035 [details] [diff] [review] v6 I will rebase the patch on top of the bug 601075. That bugs removes EmptyArenaLists which will minimize the amount of changes here and should remove the aliasing warnings.
Attachment #480035 -
Attachment is obsolete: true
Attachment #480035 -
Flags: review?(anygregor)
Reporter | ||
Comment 29•13 years ago
|
||
The new patch is based on patches from the bug 605029 comment 4 and bug 601234 comment 3. To avoid expensive divisions when accessing the header the patch shrinks ArenaHeader down to 4 words by sharing the free list offset together with the ting kind. An alternative of making the header to be 8 words has adverse results on the above benchmark. The patch gives about 0.3%-0.5% win in SS and no regressions in V8. AFAICS it is hard to see wins with V8 as the patch affects the GC heuristics by putting an extra object into the arena and that adds a big noise to v8. For a synthetic benchmark like the following code the patch gives a 5% win. function test() { gc(); var zz = ({}); var z = zz; for (count = 0; count < 1e5; count++) { z.next = ({}); z = z.next; } gc(); for(count = 0; count < 35; count++){ var xx = ({}); var x = xx; for (var j = 0; j < 6e5; ++j) { x.next = ({}); x = x.next; } gc(); xx = x = null; gc(); } } var t = Date.now(); test(); t = Date.now() - t; print("time: "+t);
Attachment #485262 -
Flags: review?(gal)
Reporter | ||
Comment 30•12 years ago
|
||
patch update to reflect bug 616666 chnages
Attachment #485262 -
Attachment is obsolete: true
Attachment #527123 -
Flags: review?(gal)
Attachment #485262 -
Flags: review?(gal)
Reporter | ||
Updated•12 years ago
|
Attachment #527123 -
Attachment description: v8 → attachment for the wrong bug
Attachment #527123 -
Attachment is obsolete: true
Attachment #527123 -
Flags: review?(gal)
Reporter | ||
Comment 31•12 years ago
|
||
I need this bug for 648320. There I plan to add a few 2^N size classes of GC things. But if the arena header is stored at the beginning of the arena, this wastes the space for one GC thing which will be substantial for N >= 5. In addition, it complicates allocation and access to the type map which will be necessary for the conservative GC.
Reporter | ||
Updated•12 years ago
|
Reporter | ||
Comment 32•12 years ago
|
||
Here is a rebased patch, it looks it is necessary to get best results with smaller chunks in bug 671702.
Comment 33•11 months ago
|
||
The bug assignee didn't login in Bugzilla in the last 7 months, so the assignee is being reset.
Assignee: igor → nobody
Updated•6 months ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•