Last Comment Bug 605029 - ArenaHeader versus Arena<FreeCell> and other cleanups
: ArenaHeader versus Arena<FreeCell> and other cleanups
Status: RESOLVED FIXED
fixed-in-tracemonkey
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: Trunk
: All All
: -- normal (vote)
: ---
Assigned To: Igor Bukanov
:
Mentors:
Depends on:
Blocks: 600234 600648 601234
  Show dependency treegraph
 
Reported: 2010-10-17 13:25 PDT by Igor Bukanov
Modified: 2011-05-02 15:59 PDT (History)
7 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
.x+


Attachments
v1 (57.05 KB, patch)
2010-10-17 13:25 PDT, Igor Bukanov
no flags Details | Diff | Review
v2 (65.76 KB, patch)
2010-10-18 12:26 PDT, Igor Bukanov
no flags Details | Diff | Review
v3 (70.85 KB, patch)
2010-10-20 08:28 PDT, Igor Bukanov
no flags Details | Diff | Review
v4 (70.77 KB, patch)
2010-10-21 09:34 PDT, Igor Bukanov
no flags Details | Diff | Review
v5 (79.43 KB, patch)
2010-12-02 08:31 PST, Igor Bukanov
no flags Details | Diff | Review
v6 (78.64 KB, patch)
2010-12-16 13:41 PST, Igor Bukanov
no flags Details | Diff | Review
v7 (77.29 KB, patch)
2011-01-05 07:15 PST, Igor Bukanov
no flags Details | Diff | Review
v8 (89.28 KB, patch)
2011-04-15 10:04 PDT, Igor Bukanov
no flags Details | Diff | Review
v9 (89.88 KB, patch)
2011-04-15 16:34 PDT, Igor Bukanov
no flags Details | Diff | Review
v10 (88.96 KB, patch)
2011-04-19 15:16 PDT, Igor Bukanov
no flags Details | Diff | Review
v11 (89.07 KB, patch)
2011-04-20 12:45 PDT, Igor Bukanov
no flags Details | Diff | Review
v12 (90.38 KB, patch)
2011-04-21 05:53 PDT, Igor Bukanov
wmccloskey: review+
Details | Diff | Review
v13 (91.45 KB, patch)
2011-04-28 19:40 PDT, Igor Bukanov
igor: review+
Details | Diff | Review
v14 (90.63 KB, patch)
2011-04-29 00:19 PDT, Igor Bukanov
no flags Details | Diff | Review

Description Igor Bukanov 2010-10-17 13:25:33 PDT
Created attachment 483863 [details] [diff] [review]
v1

To minimize the amount of Arena<T> casts in the GC implementation the attached patch replaces various Arena<FreeCell> links with ArenaHeader.

The patch also shrinks MarkingDelay down to one word. With forthcoming patch for bug 600234 that allowed to pack an extra arena into the chunk. Although the win is tiny (it decreases the memory consumption by 0.4%) and not visible during normal execution, the finalization is bound by the memory bandwidth so memory shrinking directly translates into smaller finalization time.

Besides that the patch contains few other cleanups to make patches for bug 600234, bug 600648 and bug 601234 smaller.
Comment 1 Igor Bukanov 2010-10-18 12:26:35 PDT
Created attachment 484049 [details] [diff] [review]
v2

This is v1 with extra changes to remove debug-related checkArenaListsForThing and ArenaList::.arenasContainThing as the corresponding functionality is provided by MarkIfGCThingWord.
Comment 2 Igor Bukanov 2010-10-20 08:28:34 PDT
Created attachment 484692 [details] [diff] [review]
v3

Here are few more changes to simplify the dependent patches. The new version removes ArenaHeader::isUsed replacing it with ArenaHeader::compartment check. It also removed ArenaHeader::thingSize. The latter is used only in debug builds and can be trivially found from thingKind using an array lookup.

Another change is the removal of MarkingDelay::start. That field at best on average makes marking of delayed arena twice faster but the delayed marking does not happen in practice so the simplest code wins. And if we would get any indications that a slow delayed marking is observable, it can be made several times faster using a version of the pre-compartment GC code.

With this change MarkingDelay becomes unnecessary so the patch moves the linking field into ArenaHeader. On 64 bit this brings an extra arena in chunk available for things allocation while making simpler code.

With this patch I see no changes in V8/SS on a 64-bit server box with little noise.
Comment 3 Igor Bukanov 2010-10-21 09:34:56 PDT
Created attachment 485060 [details] [diff] [review]
v4

rebased patch
Comment 4 Gregor Wagner [:gwagner] 2010-10-21 15:53:34 PDT
I am no longer in MV and I have very limited time because of paper deadlines and other stuff. Andreas can you take the review?
Comment 5 Igor Bukanov 2010-12-02 08:31:29 PST
Created attachment 494724 [details] [diff] [review]
v5

Here is a rebased patch
Comment 6 Igor Bukanov 2010-12-16 13:41:59 PST
Created attachment 498211 [details] [diff] [review]
v6

here is a refreshed patch
Comment 7 Igor Bukanov 2011-01-05 06:33:10 PST
Nominating for 2.0 as this bug blocks other bugs that improves SS/V8 scores by few percent.
Comment 8 Igor Bukanov 2011-01-05 07:15:41 PST
Created attachment 501314 [details] [diff] [review]
v7

v6 got too rottent.

Gregor, do you have time to review this?
Comment 9 David Mandelin [:dmandelin] 2011-01-05 10:29:53 PST
(In reply to comment #7)
> Nominating for 2.0 as this bug blocks other bugs that improves SS/V8 scores by
> few percent.

I think our direction is pretty firmly against blocking release on perf wins this late in the game. I do like perf wins, though. Do you have any comments on the risk of regressions we take from landing this? It's a big patch but I guess it's mostly a refactoring?
Comment 10 Igor Bukanov 2011-01-05 12:38:20 PST
(In reply to comment #9)
> Do you have any comments on
> the risk of regressions we take from landing this? It's a big patch but I guess
> it's mostly a refactoring?

The patch is indeed about refactoring. The non-refactoring part is a debug-only changes to the GC stats collection.
Comment 11 Igor Bukanov 2011-04-15 10:04:34 PDT
Created attachment 526287 [details] [diff] [review]
v8

unrotten patch
Comment 12 Igor Bukanov 2011-04-15 16:34:56 PDT
Created attachment 526421 [details] [diff] [review]
v9

The new version removes an incorrect assertion in Mark. With the patch the conservative GC uses the GC tracing machinery to collect the stats and that assert made that impossible during the compartment GC.

Besides the conservative GC stats collection changes the patch does the following:

1. The patch uses ArenaHeader* in place of Arena<FreeCell> * to minimize the number of Arena<FreeCell>* <-> Arena<T>* casts.

2. The delayed marking is simplified father and the single remaining per-arena field is moved from MarkingDelay into ArenaHeader. The latter is compressed down to 4 fords in preparation for the bug 600234.

3. There is some refactoring regarding background finalization implementation to prepare for the bug 601075. In particular, for simplicity the code uses a vector of finalized lists, not a a vector of array, as the former is not justified for few thousands at most elements.

There are few other changes to make the patches for the bug 600648 and bug 601234.
Comment 13 Igor Bukanov 2011-04-18 22:39:59 PDT
gal: review ping
Comment 14 Igor Bukanov 2011-04-19 15:16:14 PDT
Created attachment 527126 [details] [diff] [review]
v10

patch update to reflect bug 616666 chnages
Comment 15 Igor Bukanov 2011-04-20 12:45:01 PDT
Created attachment 527354 [details] [diff] [review]
v11

Here comes another rebase due to the bug 651546.

Gal: any chance to review this before it rots again?
Comment 16 Igor Bukanov 2011-04-21 05:53:38 PDT
Created attachment 527521 [details] [diff] [review]
v12

Landing of bug 651041 required to rebase this patch again.
Comment 17 Igor Bukanov 2011-04-26 03:08:08 PDT
Comment on attachment 527521 [details] [diff] [review]
v12

Bill, could you review this? Andreas or Gregor do not have time for this. See comment 1 and comment 12 for details of what the patch does.
Comment 18 Bill McCloskey (:billm) 2011-04-27 18:15:58 PDT
I'm about half way through the patch. I'll finish tomorrow.

I noticed that you use capital letters for the names of static methods like MarkingDelayEnd. In the rest of the engine, it seems like we use the normal convention for these (e.g., markingDelayEnd).

I also have a larger question. Do you know why so much of the GC code is templated? The only advantage I can see is that the compiler might be able to optimize a few mod operations over sizeof(T) into shifts. But this seems like a pretty minor optimization. The disadvantages are that it bloats our instruction cache footprint and it's a little ugly in my opinion (especially the big switch statements over all possible GC thing kinds).

Anyway, this patch definitely simplifies things. But it would be nice if we could go all the way and just eliminate the parameterization. Do you know of any reason why this wouldn't work?
Comment 19 Igor Bukanov 2011-04-28 07:08:48 PDT
(In reply to comment #18)
> I also have a larger question. Do you know why so much of the GC code is
> templated? The only advantage I can see is that the compiler might be able to
> optimize a few mod operations over sizeof(T) into shifts. But this seems like a
> pretty minor optimization.

We do need templates during finalization. It really help to do the thing kind switch once per arena list. Now, with the background finalization, we may consider doing that once per arena to avoid extra switches and code bloat, but I prefer to do that after fixing bug 652416 as that patch simplifies the background finalization code.

As regrading the allocation I tried to eliminate templates, but that regressed on benchmarks. That comes from two reasons. First multiplication is expensive even if it is done once during RefillFinalizableFreeList. It can be worked around with using a static array with precalculated values, but then we have another problem. Templated version of RefillFinalizableFreeList and corresponding code bloat improves branch prediction especially when allocating arenas from typed free lists or when building free cell list in the arena. 

I plan to reconsider that after fixing the bug 601075 as with that bug fixed we no longer needs to build free lists in RefillFinalizableFreeList nor use typed free lists.
Comment 20 Bill McCloskey (:billm) 2011-04-28 16:55:53 PDT
Comment on attachment 527521 [details] [diff] [review]
v12

Review of attachment 527521 [details] [diff] [review]:

Well, regarding the templates, I want to emphasize that we could remove them incrementally. For example, I think we could take away the type parameter for Arena while still making the allocation and finalization code parametrized. I think that would be a little cleaner than passing ArenaHeaders around, since it would avoid the occasional casts back to Arena that you need to do. (Although I realize that the layout of things in the arena would be a little grosser, but we could hide that behind accessors.)

However, I think this patch has a lot of good stuff in it, and it's been sitting around for a while. I especially like the cleanup of the background finalization data structure. So r+ with the nits fixed.

::: js/src/jsgc.cpp
@@ +1922,5 @@
     for (;;) {
+        bool allClear = reinterpret_cast<Arena<T> *>(aheader)->finalize(cx);
+
+        /*
+         * We don't delete the head because the next allcoated arena has to

"allocated"

@@ +2185,3 @@
     finalizeVector.resize(0);
     cx = NULL;
+

Whitespace.

::: js/src/jsgc.h
@@ +118,5 @@
+
+  private:
+    /*
+     * When recursive marking uses too much stack the marking is delayed and
+     * the corresponding arenas is put into a stack using a linked via the

"corresponding arenas are put"

@@ +127,5 @@
+     * direct byte access for the field.
+     */
+    uintptr_t       thingKind                       : 8;
+    uintptr_t       nextDelayedMarkingArena         : JS_BITS_PER_WORD - 8;
+

This kind of bit packing is a little gross. Maybe in the future we could just scan all the arenas in a compartment during delayed marking? I've seen other GC implementations that work this way.

::: js/src/jsgcmark.cpp
@@ +76,5 @@
     JS_ASSERT(thing);
     JS_ASSERT(JS_IS_VALID_TRACE_KIND(GetGCThingTraceKind(thing)));
     JS_ASSERT(trc->debugPrinter || trc->debugPrintArg);
 
+    JS_ASSERT_IF(!JSAtom::isStatic(thing), thing->asFreeCell()->isAligned());

Shouldn't we be able to assert !JSAtom::isStatic(thing) here? Then we could assert alignment separately.
Comment 21 Igor Bukanov 2011-04-28 19:40:39 PDT
Created attachment 529023 [details] [diff] [review]
v13

Addressing the nits and consistently using lowercase names for static methods.
Comment 22 Igor Bukanov 2011-04-28 19:46:45 PDT
(In reply to comment #20)
> For example, I think we could take away the type parameter for
> Arena while still making the allocation and finalization code parametrized. I
> think that would be a little cleaner than passing ArenaHeaders around, since it
> would avoid the occasional casts back to Arena that you need to do.

I plan to move the ArenaHeader from the arena in the bug 600234. Then even with template-less arena it would be necessary to convert between arena and the header.	

> +    uintptr_t       thingKind                       : 8;
> +    uintptr_t       nextDelayedMarkingArena         : JS_BITS_PER_WORD - 8;
> +
> 
> This kind of bit packing is a little gross.

It helps synthetic benchmarks as it allows to put more things into the arena. For example, without the bitmap fields the number of strings in arena on 64 bit CPU drops by 0.9%. Some objects sizes also benefits from that.
Comment 23 Bill McCloskey (:billm) 2011-04-28 19:57:53 PDT
(In reply to comment #22)
> > This kind of bit packing is a little gross.
> 
> It helps synthetic benchmarks as it allows to put more things into the arena.
> For example, without the bitmap fields the number of strings in arena on 64 bit
> CPU drops by 0.9%. Some objects sizes also benefits from that.

Well, I didn't mean we shouldn't pack them. I meant we should eliminate the nextDelayedMarkingArena field entirely. Then we would just set a bit in the runtime whenever we ran out of room on the mark stack. At the end of GC, we would iterate over all arenas and do delayed marking on them. It's slower, but it should be a very uncommon case, and it seems cleaner to me. But there's no need to do this now.
Comment 24 Igor Bukanov 2011-04-29 00:16:07 PDT
(In reply to comment #23)
> Well, I didn't mean we shouldn't pack them. I meant we should eliminate the
> nextDelayedMarkingArena field entirely. Then we would just set a bit in the
> runtime whenever we ran out of room on the mark stack. At the end of GC, we
> would iterate over all arenas and do delayed marking on them. It's slower, but
> it should be a very uncommon case, and it seems cleaner to me.

Gregor has already discovered that we have XML test cases that exercise delayed marking. In that particular case this happened due to a lot due to the lack of support of LargeElem for XML. But I would not be surprised that even without XML the WEB has structures that need delayed marking. And enumerating all the arenas in the runtime repeatedly could be very slow.
Comment 25 Igor Bukanov 2011-04-29 00:19:34 PDT
Created attachment 529050 [details] [diff] [review]
v14

I should have redone measurements after merging this patch with the marking stack changes. Apparently using bit field for the marking delay and thing kind became rather expensive. So the new patch keeps the marking delay as a separated structure in the chunk as before.
Comment 27 Gregor Wagner [:gwagner] 2011-04-29 14:57:21 PDT
Have you measured the GC pause time during the v8 benchmarks in the browser?
It seems that most GCs take a few ms longer now.
No big differences but GC's that took 8ms before take now 10ms for example.
Comment 28 Igor Bukanov 2011-04-29 15:30:34 PDT
(In reply to comment #27)
> Have you measured the GC pause time during the v8 benchmarks in the browser?
> It seems that most GCs take a few ms longer now.

No - I only checked that the total score is the same and that the callgrind shows no regressions.

Now, the only data structure that the patch changes is eliminating of two fields from the MarkingDelay. One was unused and another was used to scan on average twice less things in arena with delayed marking. On 32 bit eliminating the fields has not changed the number of arenas we can fit into the chunk - it remained at 251. So on 32 bit I suppose the slowdown can come from the delayed marking calls. Now, if we happens to have delayed arenas, then something is wrong with the marking stack patch. I will check that. 

On 64 bit the change resulted in the increase of number of arenas we have per chunk from 250 to 251. That may affect the results.
Comment 29 Igor Bukanov 2011-04-30 13:26:49 PDT
I do not see the GC slowdown during V8 on Linux 64 bit. Without the patch typical timing during the end of V8 run, when we hit the longest GC pauses, as measured with MOZ_GCTIMER defined and JS_WANT_GC_SUITE_PRINT set to true:

67.456000 65.506000 0.674000  
67.581000 65.609000 0.673000 
67.470000 65.524000 0.672000 
...
69.056000 67.055000 0.680000
68.677000 66.639000 0.677000
68.582000 66.503000 0.752000

With the landed patch that becomes:

67.721000 65.750000 0.669000
67.592000 65.596000 0.653000
67.249000 65.100000 0.737000
...
67.228000 65.271000 0.676000
66.863000 64.915000 0.684000
66.920000 64.928000 0.698000

I do not see a regression here.
Comment 30 Gregor Wagner [:gwagner] 2011-05-02 10:17:49 PDT
I didn't mean the long pauses during the splay benchmark.
For the raytracer benchmark I see now a triangle pause time from 5.5ms to 10.5 ms. This was 4.0 to 8.9 before. I don't know where it comes from because the marking just gets slightly worse and doesn't account for all the regression.

before:
Total,  Mark,  Sweep
4.0,    2.5,    0.8,    0.7,    0.0,      0.0,     0.0,
6.4,    4.9,    0.8,    0.7,    0.0,      0.0,     0.0,
7.5,    6.0,    0.8,    0.7,    0.0,      0.0,     0.0, 
8.2,    6.6,    0.8,    0.7,    0.0,      0.0,     0.0,
8.9,    7.4,    0.8,    0.7,    0.0,      0.0,     0.0,

now:
Total, Mark,  Sweep,
5.9,    2.9,    0.8,    0.8,    0.0,      0.0,     0.0,
8.2,    5.2,    0.8,    0.8,    0.0,      0.0,     0.0,
9.4,    6.5,    0.8,    0.8,    0.0,      0.0,     0.0,
9.9,    7.0,    0.8,    0.8,    0.0,      0.0,     0.0,
10.6,    7.7,    0.8,    0.8,    0.0,      0.0,     0.0,
Comment 31 Igor Bukanov 2011-05-02 12:56:19 PDT
(In reply to comment #30)
> For the raytracer benchmark I see now a triangle pause time from 5.5ms to 10.5
> ms. This was 4.0 to 8.9 before. 

Is it on 64 bit?
Comment 32 Gregor Wagner [:gwagner] 2011-05-02 13:30:52 PDT
(In reply to comment #31)
> (In reply to comment #30)
> > For the raytracer benchmark I see now a triangle pause time from 5.5ms to 10.5
> > ms. This was 4.0 to 8.9 before. 
> 
> Is it on 64 bit?

Yes on 64 bit. A 2ms regression for all GCs during the v8 browser suite. They are more obvious during the short GC pauses of course. 
I am doing a clean build now. Maybe that helps.
Comment 33 Igor Bukanov 2011-05-02 13:36:51 PDT
Could also check if adding uintptr_t dummy[2]; to the MarkingDelay structure would affect the benchmark timing?
Comment 34 Igor Bukanov 2011-05-02 13:42:00 PDT
One more thing, the delay would still present if you run the benchmark individually?
Comment 35 Igor Bukanov 2011-05-02 14:02:27 PDT
The regression is observed only in the browser, it does not exist in the shell, does it?
Comment 36 Gregor Wagner [:gwagner] 2011-05-02 15:06:38 PDT
(In reply to comment #35)
> The regression is observed only in the browser, it does not exist in the shell,
> does it?

Yes only in the browser. How can I run the v8 benchmarks individually?
Comment 37 Igor Bukanov 2011-05-02 15:28:00 PDT
(In reply to comment #36)
> Yes only in the browser. How can I run the v8 benchmarks individually?

If you want to run richards from the sunspider distribution, then just create a page and include the script. The script contains the loop at the end that executes the benchmark number of times.

If you want to run richards from V8 distribution or from a copy under js/src/v8 in our tree, then the page should define BenchmarkSuite and Benchmark constructors. So you would need something like:

<html><body>

<script>

function BenchmarkSuite(name, number, tests) {
    this.tests = tests;
}

function Benchmark(name, entryPoint) {
    this.entryPoint = entryPoint;
}

function run_one_test(test) {
    f = test.tests[0].entryPoint;
    for (var i = 0; i != 350; ++i)
        f();    
}

</script>

<script src="raytrace.js"></script>

<input type="button" value="Run RayTrace" onclick="run_one_test(RayTrace)">

</body></html>
Comment 38 Chris Leary [:cdleary] (not checking bugmail) 2011-05-02 15:59:48 PDT
cdleary-bot mozilla-central merge info:
http://hg.mozilla.org/mozilla-central/rev/61bbaedfc2a3
http://hg.mozilla.org/mozilla-central/rev/e2843f43757e

Note You need to log in before you can comment on or make changes to this bug.