Open Bug 600234 Opened 11 years ago Updated 10 years ago

moving GC ArenaHeader out of Arena again

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

People

(Reporter: igor, Assigned: igor)

References

Details

Attachments

(2 files, 8 obsolete files)

The patch for bug 558861 has moved the GC arena header into the arena itself I presume for code simplicity. But having the header there implies that various read and writes that involves only the header field would have bigger CPU cache and TLB footprints. Also this waste some memory especially for objects and function arenas as the header occupies the whole cell for the object when only 6 words are required. Those effects were visible previously in cachegrind and were the reason for moving the header out of arena.
Not too worried about the memory waste. Peanuts. Can you link me to the cachegrind data?
Here is test based on what I have used before:

function prepare(N) {
    var array = [];
    for (var i = 0; i != N; ++i) {
	array.push(String.fromCharCode(100, 101, 102));
	array.push({});
    }
    return array;
}

function do_it()
{
    var array = prepare(12e5);
    var t = Date.now();
    gc();
    var time1 = Date.now() - t;
    array = prepare(10);
    array = null;
    var t = Date.now();
    gc();
    var time2 = Date.now() - t;
    return [time1, time2];
}
    
print(do_it());

The test prints the pair representing the timing of GC dominated mostly by marking and by finalization time of a lot of empty objects and short strings to consider only pure GC things.

On the TM the best timing among 20 runs is the pair 100,36. On a pre-compartment GC the best pair is 93,33. So we have slowdown on the scale of 7-10%. This is for 32 bit js shell on dual-core iCore 7 laptop under fedora 13.

Given the code that test exercise the only difference can come from a separated arena header.
Making this bug as a performance regression from the bug 558861.
Blocks: 558861
No longer depends on: 558861
Lets assume for a second that the workload is relevant and that the measurement is accurate, how do we explain the difference?
Attached patch patch (obsolete) — Splinter Review
The patch to move the header out of line. Untested, but pretty trivial. I am still not convinced this is a good idea though.
Assignee: igor → gal
Attached patch patch (obsolete) — Splinter Review
Without accidental jetpack change.
Attachment #479134 - Attachment is obsolete: true
(In reply to comment #4)
> Lets assume for a second that the workload is relevant and that the measurement
> is accurate, how do we explain the difference?

The first number in the above test represent the case when almost all GC arenas are fully used with no free cell available. With a separated array of arena headers during the sweep the GC does not need to touch the arena itself as it will see that everything is marked. This explains the first number.

The second number is the opposite case when the arena contains only dead things  with an empty free list. This is a common scenario when the GC happens after an allocation burst with the heap containing mostly a garbage. But it is harder to see what is responsible for the win here. With or without a separated header the GC has to assemble the new free list which presumably warm ups the CPU cache.
Attached patch patch with two headers (obsolete) — Splinter Review
With a single header I could not match the numbers of the pre compartment GC. It is really important to have the header to be the power of two. So the patch adds two headers, one 4-worded and one two-word where only two-worded header is touched during the GC. This allowed to recover all the timing.
Attached patch one small header (obsolete) — Splinter Review
This patch uses single separated arena header but shrinks it to 4 words in optimized build. This avoids the extra complexity of 2 headers and allowed to have the same results in the benchmark above or its version that I have observed with pre-compartment-gc code. The patch does not use templates for arena header as semantically the header allows to discover the dynamic type of the arena.

The patch also adds Arena::chunkAndIndex with very similar code that was used in older GC. Last time I have checked GCC could not optimize separated chunk() and arenaIndex() calls into the common optimized version.

The patch does not affect SS or V8 scores.
Assignee: gal → igor
Attachment #479136 - Attachment is obsolete: true
Attachment #479143 - Attachment is obsolete: true
Attachment #479399 - Flags: review?(anygregor)
Attached patch v5 (obsolete) — Splinter Review
Here is a rebased patch that also adds Cell::arenaHeader helper to make those lines with asCell()->arena()->header()-> little bit shorter.

The patch also moves the alignment checks into the jsgc.h header. This avoids explicit template incitations and allowed for a wider assert coverage as the check is called during marking of any GC thing including JSXML.
Attachment #479399 - Attachment is obsolete: true
Attachment #479751 - Flags: review?(anygregor)
Attachment #479399 - Flags: review?(anygregor)
This change was mostly driven by code simplicity. Do we have real numbers that really force us to go back to the old system? SS, V8, also the browser versions of the benchmarks?
I was measuring a lot for the new GC and the cycles needed to mark an object  (total marking time / amount of marked objects) went down.
Attached file benchmark
I also used this micro benchmark for shark profiles and I get following numbers:

Changeset right before GC landing:
idefix2:OPT.OBJ idefix2$ time ./js ../../../../tests/wave.js 
END

real	0m13.179s
user	0m13.076s
sys	0m0.101s
idefix2:OPT.OBJ idefix2$ time ./js -j ../../../../tests/wave.js 
END

real	0m10.501s
user	0m10.389s
sys	0m0.110s
idefix2:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js 
END

real	0m10.513s
user	0m10.401s
sys	0m0.109s
idefix2:OPT.OBJ idefix2$ time ./js -m ../../../../tests/wave.js 
END

real	0m10.627s
user	0m10.498s
sys	0m0.092s



current tip:
idefix2:OPT.OBJ idefix2$ time ./js ../../../../tests/wave.js 
END

real	0m10.363s
user	0m10.265s
sys	0m0.096s
idefix2:OPT.OBJ idefix2$ time ./js -j ../../../../tests/wave.js 
END

real	0m7.991s
user	0m7.894s
sys	0m0.096s
idefix2:OPT.OBJ idefix2$ time ./js -jm ../../../../tests/wave.js 
END

real	0m7.942s
user	0m7.847s
sys	0m0.093s
idefix2:OPT.OBJ idefix2$ time ./js -m ../../../../tests/wave.js 
END

real	0m8.120s
user	0m8.022s
sys	0m0.097s
Ups ignore that testcase. That's the wrong one.
(In reply to comment #11)
> This change was mostly driven by code simplicity.

Unfortunately simpler does not translate into more efficient. Besides, given that we already have separated mark bitmap and delayed marking structures, having the header that comes together with arena adds feature complexity IMO (one have to deal both with embedded and separated structures instead of uniformity of having everything outside).

> Do we have real numbers that
> really force us to go back to the old system?

The numbers above are real with independent callgrind confirmation.

> SS, V8, 

See the last paragraph of the comment 9. It would be very surprising if the benchmark scores are affected. The benchmarks are mostly CPU bound so changes in the GC support structures that does not touch the free list implementation should not be visible in them.

> also the browser versions
> of the benchmarks?

I will try to check the v8 in the browser.

> I was measuring a lot for the new GC and the cycles needed to mark an object 
> (total marking time / amount of marked objects) went down.

If the amount of marked objects went down, it implies different GC heuristics like GC running less frequent. It does not indicate that the marking time per object went down. It could well go up as the data here and in the bug 600648 indicates.
(In reply to comment #14)
> (In reply to comment #11)
> > This change was mostly driven by code simplicity.
> 
> Unfortunately simpler does not translate into more efficient. Besides, given
> that we already have separated mark bitmap and delayed marking structures,
> having the header that comes together with arena adds feature complexity IMO
> (one have to deal both with embedded and separated structures instead of
> uniformity of having everything outside).

I agree. If the measurements show that we could do better with more complex code then we have to go for it. But if we look at JSC and their GC implementation then the message is clearly simplicity wins! They prove that one can be faster with even simpler code. 

> 
> > Do we have real numbers that
> > really force us to go back to the old system?
> 
> The numbers above are real with independent callgrind confirmation.
> 
> > SS, V8, 
> 
> See the last paragraph of the comment 9. It would be very surprising if the
> benchmark scores are affected. The benchmarks are mostly CPU bound so changes
> in the GC support structures that does not touch the free list implementation
> should not be visible in them.
> 
> > also the browser versions
> > of the benchmarks?
> 
> I will try to check the v8 in the browser.

We still perform GC's in the current v8 shell harness.

> 
> > I was measuring a lot for the new GC and the cycles needed to mark an object 
> > (total marking time / amount of marked objects) went down.
> 
> If the amount of marked objects went down, it implies different GC heuristics
> like GC running less frequent. It does not indicate that the marking time per
> object went down. It could well go up as the data here and in the bug 600648
> indicates.

Not the amount of marked objects went down.
I measured the cycles needed on average to mark an object.
So total cycles needed for marking process of a GC divided by the number of marked objects.
(In reply to comment #12)
> Created attachment 479813 [details]
> benchmark
> 
> I also used this micro benchmark for shark profiles and I get following
> numbers:

The benchmark is very allocation sensitive and the compartment GC includes the inlining of the allocations. A better compassion would to check the pre-compartmental GC with the patch from the 595048 bug on top aginst the current tip. Or the older code versus the current tip plus the patch from the bug 558861 comment 67.

In any case, the patch here shows a win from

~/m/tm/js/src> time ~/b/js64opt.tm/js -jm ~/s/wave.js
END

real	0m8.640s
user	0m8.497s
sys	0m0.086s

to:

~/m/tm/js/src> time ~/b/js64opt.tm/js -jm ~/s/wave.js
END

real	0m7.416s
user	0m7.292s
sys	0m0.080s
(In reply to comment #15)
> I agree. If the measurements show that we could do better with more complex
> code then we have to go for it. But if we look at JSC and their GC
> implementation then the message is clearly simplicity wins!

They do not need to deal with threads, OOM, limited amount of the stack space and the need to have a cycle collector. They could also afford much simpler GC heuristics. Still our fast allocation path should be on pair with them.   

> Not the amount of marked objects went down.
> I measured the cycles needed on average to mark an object.

Hm, was it on 64 bit system?
> > Not the amount of marked objects went down.
> > I measured the cycles needed on average to mark an object.
> 
> Hm, was it on 64 bit system?

Yes. All these measurements are part of bug 593729. It would make sense to tune everything in a larger context but I guess time is against me.
This patch doesn't apply to current tip.
(In reply to comment #19)
> This patch doesn't apply to current tip.

Sory, I forgot to mention it is on top of the patch from bug 600687 comment 3.
(In reply to comment #20)
> (In reply to comment #19)
> > This patch doesn't apply to current tip.
> 
> Sory, I forgot to mention it is on top of the patch from bug 600687 comment 3.

And I will land that in a minute.
In the browser running v8 on my laptop is too noisy to have any firm conclusions. Still the best v8 score among 20 runs went up from 2957 to 2981.
I get a bunch of warnings with this patch:
jsgc.cpp
../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSObject]’:
../jsgc.cpp:2308:   instantiated from here
../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSFunction]’:
../jsgc.cpp:2309:   instantiated from here
../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSXML]’:
../jsgc.cpp:2311:   instantiated from here
../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSShortString]’:
../jsgc.cpp:2323:   instantiated from here
../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.cpp: In function ‘void FinalizeArenaList(JSCompartment*, JSContext*, unsigned int) [with T = JSString]’:
../jsgc.cpp:2324:   instantiated from here
../jsgc.cpp:2001: warning: dereferencing type-punned pointer will break strict-aliasing rules
(In reply to comment #23)
> I get a bunch of warnings with this patch:

This is from:

            ap = (Arena<T> **) &header->next;

GCC 4.4 does not complain about it has much better aliasing analysis than older versions. But I guess on Mac one one has to deal with that.
Attached patch v6 (obsolete) — Splinter Review
This patch uses template-based header to avoid aliasing warnings.
Attachment #479751 - Attachment is obsolete: true
Attachment #480035 - Flags: review?(anygregor)
Attachment #479751 - Flags: review?(anygregor)
With the latest patch the callgrind reliably shows for the v8-splay benchmark the decrease of instruction count with the patch from 1.67e9 down to 1.65e9 or a 1.3% win. I cannot see that using the benchmark suite - the iCore7 CPU speed on my laptop fluctuates substantially depending on the thermal situation which adds way too much noise.
Yes the cycles used to mark an object went down. Thats good!

I still get a bunch of warnings with this patch:
jsapi.cpp
../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSObject]’:
../jsgc.h:234:   instantiated from here
../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSString]’:
../jsgc.h:237:   instantiated from here
../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSShortString]’:
../jsgc.h:240:   instantiated from here
../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = JSFunction]’:
../jsgc.h:245:   instantiated from here
../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules
../jsgc.h: In member function ‘js::gc::ArenaHeader<T>* js::gc::Arena<T>::header() const [with T = js::gc::Cell]’:
../jsgc.h:441:   instantiated from here
../jsgc.h:389: warning: dereferencing type-punned pointer will break strict-aliasing rules

I also don't really like the chunkAndIndex function. Does it really make a difference?
If yes please add a comment.

I have very limited time now since I am no longer in Mountain View but I will finish the review once all warnings are gone.
Comment on attachment 480035 [details] [diff] [review]
v6

I will rebase the patch on top of the bug 601075. That bugs removes EmptyArenaLists which will minimize the amount of changes here and should remove the aliasing warnings.
Attachment #480035 - Attachment is obsolete: true
Attachment #480035 - Flags: review?(anygregor)
Depends on: 605029
Attached patch v7 (obsolete) — Splinter Review
The new patch is based on patches from the bug 605029 comment 4 and bug 601234 comment 3. To avoid expensive divisions when accessing the header the patch shrinks ArenaHeader down to 4 words by sharing the free list offset together with the ting kind. An alternative of making the header to be 8 words has adverse results on the above benchmark.

The patch gives about 0.3%-0.5% win in SS and no regressions in V8. AFAICS it is hard to see wins with V8 as the patch affects the GC heuristics by putting an extra object into the arena and that adds a big noise to v8.

For a synthetic benchmark like the following code the patch gives a 5% win.

function test() {
    gc();

    var zz = ({});

    var z = zz;
    for (count = 0; count < 1e5; count++) {
        z.next = ({});
        z = z.next;
    }

    gc();

    for(count = 0; count < 35; count++){
        var xx = ({});
        var x = xx;
        for (var j = 0; j < 6e5; ++j) {
            x.next = ({});
            x = x.next;
        }

        gc();

        xx = x = null;

        gc();
    }
}

var t = Date.now();
test();
t = Date.now() - t;
print("time: "+t);
Attachment #485262 - Flags: review?(gal)
Attached patch attachment for the wrong bug (obsolete) — Splinter Review
patch update to reflect bug 616666 chnages
Attachment #485262 - Attachment is obsolete: true
Attachment #527123 - Flags: review?(gal)
Attachment #485262 - Flags: review?(gal)
Attachment #527123 - Attachment description: v8 → attachment for the wrong bug
Attachment #527123 - Attachment is obsolete: true
Attachment #527123 - Flags: review?(gal)
I need this bug for 648320. There I plan to add a few 2^N size classes of GC things. But if the arena header is stored at the beginning of the arena, this wastes the space for one GC thing which will be substantial for N >= 5. In addition, it complicates allocation and access to the type map which will be necessary for the conservative GC.
Blocks: 648320
No longer blocks: 558861
Depends on: 665354
Attached patch v8Splinter Review
Here is a rebased patch, it looks it is necessary to get best results with smaller chunks in bug 671702.
You need to log in before you can comment on or make changes to this bug.