Closed Bug 708382 Opened 13 years ago Closed 13 years ago

investigate faster GC scanning of objects

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: igor, Assigned: igor)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

During investigation for the bug 706062 I observed extreme sensitivity of the GC marking time during V8 to the order of GC graph traversal, the tail recursion elimination and to the way we currently split between marking of large items and other object slots. The attached patch tries to optimize the GC marking based on that information. In particular: 1. All marking stacks are merged into one that uses tagged pointers or assumed knowledge to distinguish between different stack elements. LargeItem are replaced by explicit pushing into the stack of (value_array_start, value_array_end) pair. 2. ScanRope now also pops pushed ropes from the stack allowing to assume that the stack is not changed after it returns. 3. The patch tries to eliminate the tail recursion whenever possible With those changes in JS shell on a 3.0 GHz Linux box the GC timing during V8-splay are reduced from (using MOZ_GCTIMER=stdout): 30.878000 30.765000 0.083000 32.662000 31.993000 0.058000 32.939000 32.324000 0.068000 33.136000 33.062000 0.057000 33.111000 33.039000 0.048000 33.733000 33.665000 0.047000 33.639000 33.533000 0.071000 33.736000 33.667000 0.048000 33.863000 33.768000 0.076000 to: 22.326000 22.214000 0.084000 23.232000 22.584000 0.057000 23.201000 23.115000 0.070000 23.801000 23.715000 0.065000 24.441000 24.376000 0.047000 24.217000 24.147000 0.050000 24.768000 24.676000 0.070000 24.622000 24.556000 0.048000 24.600000 24.503000 0.077000 That is, roughly 30% speed-up. I will investigate if the win persists during normal browser session.
Attached patch v1 (obsolete) — Splinter Review
This is what I am going to test.
Assignee: general → igor
This part makes ScanRope to pop the stack until all ropes pushed by the function are processed. This improves string scanning localization and allows to merge object and string stacks.
Attachment #579798 - Attachment is obsolete: true
Attachment #580039 - Flags: review?(bhackett1024)
This patch merges all the mark stack into one using the tagged pointers to distinguish between the pushed item kinds. The patch tries to eliminate push/pop pair when it is know what would be pushed. Instead it just jumps to the necessary code directly. Here is the output MOZ_GC_TIMER=stdout output: Current TIP part1 + part2 During V8 33.614000 31.979000 1.583000 25.019000 24.387000 0.589000 32.987000 32.357000 0.587000 25.630000 25.031000 0.555000 33.667000 33.057000 0.560000 26.277000 25.200000 0.513000 34.289000 33.719000 0.528000 26.138000 25.576000 0.521000 34.436000 33.869000 0.525000 26.333000 25.690000 0.542000 35.022000 34.425000 0.553000 26.304000 25.744000 0.516000 34.800000 34.226000 0.531000 26.287000 25.705000 0.539000 34.948000 34.354000 0.550000 26.220000 25.628000 0.547000 10.334000 8.233000 2.033000 7.863000 6.779000 1.016000 After http://gregor-wagner.com/tmp/mem opens all 150+ tabs 422.587000 329.048000 90.807000 390.314000 300.842000 87.583000 406.061000 323.988000 80.499000 380.170000 297.208000 81.493000 404.894000 323.755000 79.629000 380.071000 297.395000 81.230000 During V8 with all 150+ tabs still openned 418.272000 326.218000 89.835000 391.836000 300.919000 88.592000 402.318000 323.612000 77.362000 381.030000 299.589000 80.095000 402.594000 324.181000 77.073000 380.389000 299.487000 79.555000 402.212000 323.569000 77.302000 380.374000 299.660000 79.367000 55.005000 42.561000 11.376000 46.795000 34.127000 11.583000 54.616000 42.247000 11.046000 46.906000 34.554000 11.303000 54.881000 42.745000 11.036000 46.889000 34.759000 11.091000 55.779000 43.685000 11.032000 47.499000 35.273000 11.100000 56.463000 44.186000 11.132000 47.860000 35.694000 11.121000 56.296000 44.243000 10.993000 47.989000 35.824000 11.120000 56.917000 44.766000 11.031000 47.848000 35.682000 11.110000 56.506000 44.420000 10.994000 48.553000 36.379000 11.133000 So V8 shows 25%-30% reduction in the GC time when only V8 tab is opened and 20% speed up with 150+ other tabs. The global GC with 150+ opened shows consistent 5% faster timing. This is with 32-bit browser build on 3.0 GHz Linux bux with GCC 4.6.
Attachment #580039 - Attachment is obsolete: true
Attachment #580039 - Flags: review?(bhackett1024)
Attachment #580041 - Flags: review?(bhackett1024)
Comment on attachment 580041 [details] [diff] [review] part2 - common mark stack and push/pop optimizations This looks great, but I don't think I know enough about the mark stack internals to review this properly.
Attachment #580041 - Flags: review?(bhackett1024) → review?(wmccloskey)
Also, does this apply on top of part 1? That patch was marked as obsolete.
Attachment #580039 - Attachment is obsolete: false
Attachment #580039 - Flags: review?(wmccloskey)
(In reply to Brian Hackett (:bhackett) from comment #5) > Also, does this apply on top of part 1? That patch was marked as obsolete. Yes, the patch consists of two parts - the obsolete mark was accidental.
I can also 25% win in GC timing during V-splay on Atom N270 netbook: 152.306000 151.811000 0.320000 122.330000 121.841000 0.327000 156.669000 154.589000 0.222000 126.423000 124.271000 0.228000 Splay: 1765 Splay: 1728
Another reason for the wins the patch provides for the GC under V8-splay is that it scans unmarked objects stored in the slots array from smallest slot to highest slot. Without the patch the code first pushes objects to the stack in that order resulting in scanning in the reverse order. Apparently the object graph from v8-splay benchmark is rather sensitive to the traversal order. If I change the scanning loop in processMarkStackTop from the patch to go from end to start, then the 25% win in V8 shrinks down to 10%. On he other hand http://29a.ch/2010/6/2/realtime-raytracing-in-javascript is not sensitive to that order. With or without that loop directin change the patch shows 10% win.
Comment on attachment 580039 [details] [diff] [review] part 1 - merging string and object mark stacks Review of attachment 580039 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jsgcmark.cpp @@ +768,5 @@ > > static void > ScanRope(GCMarker *gcmarker, JSRope *rope) > { > + unsigned savedTos = gcmarker->objStack.tos; Could you add a comment to the top of the function giving a brief description of what it's doing? Make it clear that, for this function to be correct, marking a string should never cause an object to be marked. Also, it might be nice to add an accessor function to MarkStack that returns the top of the stack, but that's up to you. @@ +770,5 @@ > ScanRope(GCMarker *gcmarker, JSRope *rope) > { > + unsigned savedTos = gcmarker->objStack.tos; > + for (;;) { > + JS_ASSERT(rope->JSString::isRope()); Before this, can you assert that rope is actually a JSTRACE_STRING (via GetGCThingTraceKind, or whatever it's called)? @@ +805,5 @@ > + rope = static_cast<JSRope *>(gcmarker->objStack.pop()); > + } else { > + break; > + } > + } How about asserting here (at the end of ScanRope) that gcmarker->objStack.tos == savedTos?
Attachment #580039 - Flags: review?(wmccloskey) → review+
Comment on attachment 580041 [details] [diff] [review] part2 - common mark stack and push/pop optimizations Review of attachment 580041 [details] [diff] [review]: ----------------------------------------------------------------- Nice work. I hope some day compilers will handle tail recursion better, because processMarkStackTop is not very pretty. ::: js/src/jsgc.h @@ +1644,5 @@ > + static const uintptr_t StackTagMask = 7; > + > + static void staticAsserts() { > + JS_STATIC_ASSERT(StackTagMask >= uintptr_t(LastTag)); > + JS_STATIC_ASSERT(StackTagMask + 1 == sizeof(Value)); What's the second assert for? ::: js/src/jsgcmark.cpp @@ +734,5 @@ > + * Make sure that ScanBaseShape is not recursive so its inlining > + * is possible. > + */ > + UnownedBaseShape *unowned = base->baseUnowned(); > + JS_OPT_ASSERT(base->compartment() == unowned->compartment()); We're no longer using JS_OPT_ASSERT in jsgcmark.cpp. If you don't want to use the usual compartment assert (since the one you have here is stronger), then could you add a macro like JS_GC_COMPARTMENT_ASSERT that can expand to either JS_ASSERT or JS_OPT_ASSERT? Then we could very easily change whether to enable these assertions in nightlies. @@ +791,5 @@ > /* > * When both children are ropes, set aside the right one to > * scan it later. > */ > + if (next && !gcmarker->stack.push(reinterpret_cast<uintptr_t>(next))) Yikes, untagged! Maybe say something about that in the comment at the top of ScanRope (i.e., that everything above savedTos is known to be a rope, so the tag is unnecessary). @@ +831,5 @@ > ScanString(gcmarker, str); > } > > +static JS_NEVER_INLINE void > +DelayScanningForArray(GCMarker *gcmarker, HeapValue *begin, HeapValue *end) Could you call this DelayMarkingValueArray? @@ +846,5 @@ > + color = gcmarker->getMarkColor(); > + } else { > + continue; > + } > + if (cell->markIfUnmarked(color)) I don't understand why you always color strings black. The code is simpler if you always call getMarkColor(). Is this an optimization to avoid calling getMarkColor? It doesn't seem worth it here, since we're already in delayed marking. @@ +854,3 @@ > > +static inline void > +ScanValueArray(GCMarker *gcmarker, HeapValue *array, size_t size) I think PushValueArray would be a better name for this. @@ +1067,5 @@ > } /* namespace gc */ > > +inline void > +GCMarker::processMarkStackTop() > +{ I think this function would be easier to read if you moved scan_value_array and scan_obj to the end. They both end in return statements, so then we could think of "goto scan_obj" as sort of like a function call. And the structure of the function would be clearer, since it would look like this: if (tag == ...) ... goto scan_value_array else if (tag == ...) ... goto scan_obj else if (tag == ...) ... scan_obj: ... return; scan_value_array: ... return; Also, it would be nice to put a comment at the top of the function to say why it's written this way. @@ +1074,5 @@ > + uintptr_t tag = addr & StackTagMask; > + uintptr_t payload; > + HeapValue *vp, *end; > + JSObject *obj; > + if (tag == ValueArrayTag) { Extra line between the variables and the code, please. @@ +1087,5 @@ > + const Value &v = *vp++; > + if (v.isString()) { > + JSString *str = v.toString(); > + if (str->markIfUnmarked()) > + ScanString(this, str); Could you just call PushMarkStack(str) here? I would think it would get inlined. Also, could you add a comment in PushMarkStack that you can mark strings black because the cycle collector never needs to see them. @@ +1109,5 @@ > + unsigned nslots = obj->slotSpan(); > + vp = obj->fixedSlots(); > + if (obj->slots) { > + unsigned nfixed = obj->numFixedSlots(); > + if (nslots > nfixed) { Do we ever have dynamic slots and have nslots <= nfixed?
Attachment #580041 - Flags: review?(wmccloskey) → review+
Actually, on second thought Igor, could you make sure that strings are always marked gray if the GCMarker color is gray? I don't think it actually make the cycle collector run any faster if they're marked black. And I'm worried that someone might copy-and-paste the string marking code somewhere else where we actually want to mark gray. People tend to copy-and-paste a lot in jsgcmark.cpp. (Which suggests that there's probably too much redundancy in the file, but that's for another bug.)
(In reply to Bill McCloskey (:billm) from comment #11) > Actually, on second thought Igor, could you make sure that strings are > always marked gray if the GCMarker color is gray? The patch does not change that always-color-strings-in-black code. It is leftover of the 2-word strings era when we did not have spare bits in the marking bitmap for the color. If we want to change that, lets discuss it in another bug.
(In reply to Bill McCloskey (:billm) from comment #10) > Do we ever have dynamic slots and have nslots <= nfixed? The try server told me so ;)
(In reply to Igor Bukanov from comment #13) > (In reply to Bill McCloskey (:billm) from comment #10) > > Do we ever have dynamic slots and have nslots <= nfixed? nfixed is the inline capacity which predicts eventual object size (via type information), so it makes sense that you could get objects not using their full inline capacity.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Blocks: 711095
Depends on: 719114
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: