Last Comment Bug 708382 - investigate faster GC scanning of objects
: investigate faster GC scanning of objects
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal with 1 vote (vote)
: ---
Assigned To: Igor Bukanov
:
:
Mentors:
Depends on: 719114
Blocks: 706062 711095
  Show dependency treegraph
 
Reported: 2011-12-07 12:48 PST by Igor Bukanov
Modified: 2013-09-21 11:41 PDT (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
v1 (28.67 KB, patch)
2011-12-07 12:53 PST, Igor Bukanov
no flags Details | Diff | Splinter Review
part 1 - merging string and object mark stacks (8.37 KB, patch)
2011-12-08 07:30 PST, Igor Bukanov
wmccloskey: review+
Details | Diff | Splinter Review
part2 - common mark stack and push/pop optimizations (27.67 KB, patch)
2011-12-08 07:43 PST, Igor Bukanov
wmccloskey: review+
Details | Diff | Splinter Review

Description Igor Bukanov 2011-12-07 12:48:17 PST
During investigation for the bug 706062 I observed extreme sensitivity of the GC marking time during V8 to the order of GC graph traversal, the tail recursion elimination and to the way we currently split between marking of large items and other object slots.

The attached patch tries to optimize the GC marking based on that information. In particular:

1. All marking stacks are merged into one that uses tagged pointers or assumed knowledge to distinguish between different stack elements. LargeItem are replaced by explicit pushing into the stack of (value_array_start, value_array_end) pair.

2. ScanRope now also pops pushed ropes from the stack allowing to assume that the stack is not changed after it returns.

3. The patch tries to eliminate the tail recursion whenever possible  

With those changes in JS shell on a 3.0 GHz Linux box the GC timing during V8-splay are reduced from (using MOZ_GCTIMER=stdout):

30.878000 30.765000 0.083000
32.662000 31.993000 0.058000
32.939000 32.324000 0.068000
33.136000 33.062000 0.057000
33.111000 33.039000 0.048000
33.733000 33.665000 0.047000
33.639000 33.533000 0.071000
33.736000 33.667000 0.048000
33.863000 33.768000 0.076000

to:

22.326000 22.214000 0.084000
23.232000 22.584000 0.057000
23.201000 23.115000 0.070000
23.801000 23.715000 0.065000
24.441000 24.376000 0.047000
24.217000 24.147000 0.050000
24.768000 24.676000 0.070000
24.622000 24.556000 0.048000
24.600000 24.503000 0.077000

That is, roughly 30% speed-up. I will investigate if the win persists during normal browser session.
Comment 1 Igor Bukanov 2011-12-07 12:53:23 PST
Created attachment 579798 [details] [diff] [review]
v1

This is what I am going to test.
Comment 2 Igor Bukanov 2011-12-08 07:30:48 PST
Created attachment 580039 [details] [diff] [review]
part 1 - merging string and object mark stacks

This part makes ScanRope to pop the stack until all ropes pushed by the function are processed. This improves string scanning localization and allows to merge object and string stacks.
Comment 3 Igor Bukanov 2011-12-08 07:43:46 PST
Created attachment 580041 [details] [diff] [review]
part2 - common mark stack and push/pop optimizations

This patch merges all the mark stack into one using the tagged pointers to distinguish between the pushed item kinds. The patch tries to eliminate push/pop pair when it is know what would be pushed. Instead it just jumps to the necessary code directly.

Here is the output MOZ_GC_TIMER=stdout output:

Current TIP                               part1 + part2

                              During V8
33.614000 31.979000 1.583000              25.019000 24.387000 0.589000 
32.987000 32.357000 0.587000		  25.630000 25.031000 0.555000 
33.667000 33.057000 0.560000		  26.277000 25.200000 0.513000 
34.289000 33.719000 0.528000		  26.138000 25.576000 0.521000 
34.436000 33.869000 0.525000		  26.333000 25.690000 0.542000 
35.022000 34.425000 0.553000		  26.304000 25.744000 0.516000 
34.800000 34.226000 0.531000		  26.287000 25.705000 0.539000 
34.948000 34.354000 0.550000		  26.220000 25.628000 0.547000 
10.334000 8.233000 2.033000		  7.863000 6.779000 1.016000   

                  After  http://gregor-wagner.com/tmp/mem
                         opens all 150+ tabs

422.587000 329.048000 90.807000           390.314000 300.842000 87.583000
406.061000 323.988000 80.499000		  380.170000 297.208000 81.493000
404.894000 323.755000 79.629000		  380.071000 297.395000 81.230000


                  During V8 with all 150+ tabs still 
                               openned
                     
418.272000 326.218000 89.835000           391.836000 300.919000 88.592000
402.318000 323.612000 77.362000		  381.030000 299.589000 80.095000
402.594000 324.181000 77.073000		  380.389000 299.487000 79.555000
402.212000 323.569000 77.302000		  380.374000 299.660000 79.367000
55.005000 42.561000 11.376000		  46.795000 34.127000 11.583000	
54.616000 42.247000 11.046000		  46.906000 34.554000 11.303000	
54.881000 42.745000 11.036000		  46.889000 34.759000 11.091000	
55.779000 43.685000 11.032000		  47.499000 35.273000 11.100000	
56.463000 44.186000 11.132000		  47.860000 35.694000 11.121000	
56.296000 44.243000 10.993000		  47.989000 35.824000 11.120000	
56.917000 44.766000 11.031000		  47.848000 35.682000 11.110000	
56.506000 44.420000 10.994000		  48.553000 36.379000 11.133000   


So V8 shows 25%-30% reduction in the GC time when only V8 tab is opened and 20% speed up with 150+ other tabs.

The global GC with 150+ opened shows consistent 5% faster timing.

This is with 32-bit browser build on 3.0 GHz Linux bux with GCC 4.6.
Comment 4 Brian Hackett (:bhackett) 2011-12-08 08:31:02 PST
Comment on attachment 580041 [details] [diff] [review]
part2 - common mark stack and push/pop optimizations

This looks great, but I don't think I know enough about the mark stack internals to review this properly.
Comment 5 Brian Hackett (:bhackett) 2011-12-08 08:32:35 PST
Also, does this apply on top of part 1?  That patch was marked as obsolete.
Comment 6 Igor Bukanov 2011-12-08 08:42:44 PST
(In reply to Brian Hackett (:bhackett) from comment #5)
> Also, does this apply on top of part 1?  That patch was marked as obsolete.

Yes, the patch consists of two parts - the obsolete mark was accidental.
Comment 7 Igor Bukanov 2011-12-08 08:50:51 PST
I can also 25% win in GC timing during V-splay on Atom N270 netbook:


152.306000 151.811000 0.320000             122.330000 121.841000 0.327000
156.669000 154.589000 0.222000             126.423000 124.271000 0.228000
Splay: 1765                                Splay: 1728
Comment 8 Igor Bukanov 2011-12-08 17:23:24 PST
Another reason for the wins the patch provides for the GC under V8-splay is that it scans unmarked objects stored in the slots array from smallest slot to highest slot. Without the patch the code first pushes objects to the stack in that order resulting in scanning in the reverse order.

Apparently the object graph from v8-splay benchmark is rather sensitive to the traversal order. If I change the scanning loop in processMarkStackTop from the patch to go from end to start, then the 25% win in V8 shrinks down to 10%. On he other hand http://29a.ch/2010/6/2/realtime-raytracing-in-javascript is not sensitive to that order. With or without that loop directin change the patch shows 10% win.
Comment 9 Bill McCloskey (:billm) 2011-12-10 21:02:32 PST
Comment on attachment 580039 [details] [diff] [review]
part 1 - merging string and object mark stacks

Review of attachment 580039 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgcmark.cpp
@@ +768,5 @@
>  
>  static void
>  ScanRope(GCMarker *gcmarker, JSRope *rope)
>  {
> +    unsigned savedTos = gcmarker->objStack.tos;

Could you add a comment to the top of the function giving a brief description of what it's doing? Make it clear that, for this function to be correct, marking a string should never cause an object to be marked.

Also, it might be nice to add an accessor function to MarkStack that returns the top of the stack, but that's up to you.

@@ +770,5 @@
>  ScanRope(GCMarker *gcmarker, JSRope *rope)
>  {
> +    unsigned savedTos = gcmarker->objStack.tos;
> +    for (;;) {
> +        JS_ASSERT(rope->JSString::isRope());

Before this, can you assert that rope is actually a JSTRACE_STRING (via GetGCThingTraceKind, or whatever it's called)?

@@ +805,5 @@
> +            rope = static_cast<JSRope *>(gcmarker->objStack.pop());
> +        } else {
> +            break;
> +        }
> +    }

How about asserting here (at the end of ScanRope) that gcmarker->objStack.tos == savedTos?
Comment 10 Bill McCloskey (:billm) 2011-12-10 22:01:08 PST
Comment on attachment 580041 [details] [diff] [review]
part2 - common mark stack and push/pop optimizations

Review of attachment 580041 [details] [diff] [review]:
-----------------------------------------------------------------

Nice work. I hope some day compilers will handle tail recursion better, because processMarkStackTop is not very pretty.

::: js/src/jsgc.h
@@ +1644,5 @@
> +    static const uintptr_t StackTagMask = 7;
> +
> +    static void staticAsserts() {
> +        JS_STATIC_ASSERT(StackTagMask >= uintptr_t(LastTag));
> +        JS_STATIC_ASSERT(StackTagMask + 1 == sizeof(Value));

What's the second assert for?

::: js/src/jsgcmark.cpp
@@ +734,5 @@
> +             * Make sure that ScanBaseShape is not recursive so its inlining
> +             * is possible.
> +             */
> +            UnownedBaseShape *unowned = base->baseUnowned();
> +            JS_OPT_ASSERT(base->compartment() == unowned->compartment());

We're no longer using JS_OPT_ASSERT in jsgcmark.cpp. If you don't want to use the usual compartment assert (since the one you have here is stronger), then could you add a macro like JS_GC_COMPARTMENT_ASSERT that can expand to either JS_ASSERT or JS_OPT_ASSERT? Then we could very easily change whether to enable these assertions in nightlies.

@@ +791,5 @@
>                  /*
>                   * When both children are ropes, set aside the right one to
>                   * scan it later.
>                   */
> +                if (next && !gcmarker->stack.push(reinterpret_cast<uintptr_t>(next)))

Yikes, untagged! Maybe say something about that in the comment at the top of ScanRope (i.e., that everything above savedTos is known to be a rope, so the tag is unnecessary).

@@ +831,5 @@
>          ScanString(gcmarker, str);
>  }
>  
> +static JS_NEVER_INLINE void
> +DelayScanningForArray(GCMarker *gcmarker, HeapValue *begin, HeapValue *end)

Could you call this DelayMarkingValueArray?

@@ +846,5 @@
> +            color = gcmarker->getMarkColor();
> +        } else {
> +            continue;
> +        }
> +        if (cell->markIfUnmarked(color))

I don't understand why you always color strings black. The code is simpler if you always call getMarkColor(). Is this an optimization to avoid calling getMarkColor? It doesn't seem worth it here, since we're already in delayed marking.

@@ +854,3 @@
>  
> +static inline void
> +ScanValueArray(GCMarker *gcmarker, HeapValue *array, size_t size)

I think PushValueArray would be a better name for this.

@@ +1067,5 @@
>  } /* namespace gc */
>  
> +inline void
> +GCMarker::processMarkStackTop()
> +{

I think this function would be easier to read if you moved scan_value_array and scan_obj to the end. They both end in return statements, so then we could think of "goto scan_obj" as sort of like a function call. And the structure of the function would be clearer, since it would look like this:
  if (tag == ...) ... goto scan_value_array
  else if (tag == ...) ... goto scan_obj
  else if (tag == ...) ...

scan_obj:
  ...
  return;

scan_value_array:
  ...
  return;

Also, it would be nice to put a comment at the top of the function to say why it's written this way.

@@ +1074,5 @@
> +    uintptr_t tag = addr & StackTagMask;
> +    uintptr_t payload;
> +    HeapValue *vp, *end;
> +    JSObject *obj;
> +    if (tag == ValueArrayTag) {

Extra line between the variables and the code, please.

@@ +1087,5 @@
> +            const Value &v = *vp++;
> +            if (v.isString()) {
> +                JSString *str = v.toString();
> +                if (str->markIfUnmarked())
> +                    ScanString(this, str);

Could you just call PushMarkStack(str) here? I would think it would get inlined.

Also, could you add a comment in PushMarkStack that you can mark strings black because the cycle collector never needs to see them.

@@ +1109,5 @@
> +            unsigned nslots = obj->slotSpan();
> +            vp = obj->fixedSlots();
> +            if (obj->slots) {
> +                unsigned nfixed = obj->numFixedSlots();
> +                if (nslots > nfixed) {

Do we ever have dynamic slots and have nslots <= nfixed?
Comment 11 Bill McCloskey (:billm) 2011-12-10 22:46:33 PST
Actually, on second thought Igor, could you make sure that strings are always marked gray if the GCMarker color is gray? I don't think it actually make the cycle collector run any faster if they're marked black. And I'm worried that someone might copy-and-paste the string marking code somewhere else where we actually want to mark gray. People tend to copy-and-paste a lot in jsgcmark.cpp. (Which suggests that there's probably too much redundancy in the file, but that's for another bug.)
Comment 12 Igor Bukanov 2011-12-11 15:22:32 PST
(In reply to Bill McCloskey (:billm) from comment #11)
> Actually, on second thought Igor, could you make sure that strings are
> always marked gray if the GCMarker color is gray? 

The patch does not change that always-color-strings-in-black code. It is leftover of the 2-word strings era when we did not have spare bits in the marking bitmap for the color. If we want to change that, lets discuss it in another bug.
Comment 13 Igor Bukanov 2011-12-11 15:35:01 PST
(In reply to Bill McCloskey (:billm) from comment #10)
> Do we ever have dynamic slots and have nslots <= nfixed?

The try server told me so ;)
Comment 14 Luke Wagner [:luke] 2011-12-12 11:12:02 PST
(In reply to Igor Bukanov from comment #13)
> (In reply to Bill McCloskey (:billm) from comment #10)
> > Do we ever have dynamic slots and have nslots <= nfixed?

nfixed is the inline capacity which predicts eventual object size (via type information), so it makes sense that you could get objects not using their full inline capacity.

Note You need to log in before you can comment on or make changes to this bug.