Bug 681884 - Faster slow path of GC allocation
Status: RESOLVED FIXED
Keywords: perf
Product: Core
Classification: Components
Component: JavaScript Engine
Version: unspecified
Platform: All All
Importance: -- normal with 4 votes
Target Milestone: mozilla9
Assigned To: Igor Bukanov
Blocks: 684581 671702 684527
Reported: 2011-08-25 00:28 PDT by Igor Bukanov
Modified: 2011-11-29 01:32 PST
Attachments:
empty arena distribution before c-p-g, after, and after-with-2K-arenas (1.33 KB, text/plain), 2011-08-25 09:36 PDT, Luke Wagner [:luke]
v1 (82.40 KB, patch), 2011-08-26 09:34 PDT, Igor Bukanov
v2 (143.35 KB, patch), 2011-09-01 06:45 PDT, Igor Bukanov, wmccloskey: review+

Description Igor Bukanov 2011-08-25 00:28:55 PDT
+++ This bug was initially created as a clone of Bug #671702 +++

For bug 671702 I investigated 2K arenas instead of the current 4K ones, so that a page like <script>alert("Hello!")</script> would allocate less than 64K of GC things, and with 64K per-compartment chunks only one chunk would be necessary for such a page.

However, this regressed V8 by 3-4%, with most of the regression coming from GC-allocation intensive benchmarks. Apparently our slow path of GC allocation is too slow. Part of it comes from using locks when allocating new arenas from new chunks, but some can be offset by better optimization.

For example, as we no longer build a free list for an empty arena (it is represented as a single free span), the switch in RefillFinalizableFreeList over thingKind can be replaced by a table lookup to find the offset of the first free thing in the arena. Another observation is that we use quite a few checks in RefillTypedFreeList to decide if we need to run the last-ditch GC. Those can be optimized. Yet another observation is that when allocating from a new arena we rely on the generic FreeSpan::allocate even though we know that the call cannot fail.
Comment 1 Luke Wagner [:luke] 2011-08-25 09:36:28 PDT
Created attachment 555756 [details]
empty arena distribution before c-p-g, after, and after-with-2K-arenas

Great timing!  I've also been looking at 2K arenas.  The reason is that compartment-per-global (bug 650353) leads to a higher number of mostly-unused arenas.  I found 2K arenas to cut the delta roughly in half, so I'm definitely hopeful you can make the perf work out.

One thing I did notice (that you can see looking at the 90-95% vs. 95-100% full rows in the attached distributions generated from GMail) is that more space is wasted in 2K arenas, presumably due to the attention that has been paid to packing 4K arenas.
Comment 2 Nicholas Nethercote [:njn] 2011-08-25 17:17:59 PDT
In bug 673480 I wondered about whether we could reduce arena waste by reducing the number of thing kinds.  There are some inconclusive measurements there if you're interested.
Comment 3 Igor Bukanov 2011-08-26 09:34:21 PDT
Created attachment 556053 [details] [diff] [review]
v1

Changes in the patch:

1. The thing kind switch during the free list refill is replaced by a lookup in a table that records the offset of the first thing in an arena.

2. During the refill the IsGCAllowed checks are done after checking for rt->gcIsNeeded. This way IsGCAllowed runs only after the much less likely rt->gcIsNeeded check passes.

3. During the GC all fully allocated arenas are moved before the cursor position in the arena list. This way the allocation code does not need to loop through the list and can just check the arenas after the cursor.

4. The free list uses special allocation functions tailored for allocation from the available list or from a newly allocated empty arena.

To keep the code clear after those changes, the patch merges the FreeLists/ArenaList structs into a single ArenaLists struct that contains the free list, arena list and background finalization state as separate arrays. Initially I wanted to have one array whose elements are structs containing the free list, arena list and background finalization info.

However, access to the elements of such an array would require multiplication by 5, and that is something to avoid. It would also be less cache-friendly, as the free lists for different GC things are accessed much more frequently than the arena lists.

With these changes I see a 0.3% win in SS and a 0.8% win in V8:

TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      1.008x as fast    995.5ms +/- 0.1%   988.0ms +/- 0.1%     significant

=============================================================================

  v8:             1.008x as fast    995.5ms +/- 0.1%   988.0ms +/- 0.1%     significant
    crypto:       -                 145.0ms +/- 0.3%   144.7ms +/- 0.1% 
    deltablue:    1.008x as fast    181.0ms +/- 0.3%   179.6ms +/- 0.4%     significant
    earley-boyer: 1.004x as fast    170.6ms +/- 0.2%   169.8ms +/- 0.3%     significant
    raytrace:     1.006x as fast    133.6ms +/- 0.3%   132.8ms +/- 0.2%     significant
    regexp:       -                 122.8ms +/- 0.5%   122.3ms +/- 0.6% 
    richards:     ??                135.5ms +/- 0.5%   135.7ms +/- 0.5%     not conclusive: might be *1.002x as slow*
    splay:        1.038x as fast    106.9ms +/- 0.2%   103.0ms +/- 0.3%     significant


If the GC uses 2K arenas the V8 win is slightly smaller at 0.7% (but that is probably within the noise):

TEST              COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:      1.007x as fast    1012.4ms +/- 0.1%   1004.9ms +/- 0.1%     significant

=============================================================================

  v8:             1.007x as fast    1012.4ms +/- 0.1%   1004.9ms +/- 0.1%     significant
    crypto:       -                  143.6ms +/- 0.2%    143.5ms +/- 0.1% 
    deltablue:    -                  183.1ms +/- 0.4%    182.7ms +/- 0.4% 
    earley-boyer: 1.009x as fast     174.6ms +/- 0.2%    173.1ms +/- 0.3%     significant
    raytrace:     ??                 135.0ms +/- 0.3%    135.3ms +/- 0.3%     not conclusive: might be *1.002x as slow*
    regexp:       -                  124.4ms +/- 0.6%    123.4ms +/- 0.6% 
    richards:     ??                 135.8ms +/- 0.4%    136.2ms +/- 0.5%     not conclusive: might be *1.003x as slow*
    splay:        1.047x as fast     115.7ms +/- 0.3%    110.6ms +/- 0.3%     significant

In splay, the most GC-intensive benchmark, the win increased from 3.8% to 4.7% when going from 4K to 2K arenas.
Comment 4 Igor Bukanov 2011-09-01 06:45:43 PDT
Created attachment 557483 [details] [diff] [review]
v2

Here is a rebased patch with the following extra changes:

1. To prevent accidentally passing size_t values as thingKind and to remove a few redundant JS_ASSERT(thingKind < FINALIZE_LIMIT) checks, I changed the method signatures that took unsigned kind to take FinalizeKind kind. I also renamed FinalizeKind to AllocKind, as the current name has very little to do with finalization.

2. To remove GCC 4.6 warnings about potentially uninitialized variables, I replaced Maybe<AutoGCLock> with AutoGCLock while extending it with a couple of classes to cover the maybe-lock usage.
Comment 5 Bill McCloskey (:billm) 2011-09-01 15:05:33 PDT
Comment on attachment 557483 [details] [diff] [review]
v2

Review of attachment 557483 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the slow review. This is a great patch! It really cleans up a lot of stuff.

::: js/src/jscell.h
@@ +50,5 @@
>  struct ArenaHeader;
>  struct Chunk;
>  
> +/* The GC allocation kinds. */
> +enum AllocKind {

AllocKind is 100 times awesomer than FinalizeKind. Thank you!

::: js/src/jscntxt.h
@@ +450,5 @@
>       * This is used to look for compartment bugs.
>       */
>      JSCompartment       *gcCheckCompartment;
>  
> +    volatile jsuword    gcIsNeeded;

Was there a reason to move this?

::: js/src/jscompartment.h
@@ +393,5 @@
>  struct JS_FRIEND_API(JSCompartment) {
>      JSRuntime                    *rt;
>      JSPrincipals                 *principals;
>  
> +    js::gc::ArenaLists           arenas;

I like the direction this takes us. Eventually, I would like to have a much stronger interface between the GC and the rest of the engine. We would have a gc::Heap structure, which would contain all the gcXyz fields in JSRuntime right now. And we would have some kind of structure like this one that would go inside JSCompartment.

::: js/src/jsgc.cpp
@@ +314,5 @@
> +    while (ArenaHeader *aheader = *ap) {
> +         bool allClear = aheader->getArena()->finalize<T>(cx, thingKind, thingSize);
> +         if (allClear) {
> +            *ap = aheader->next;
> +             aheader->chunk()->releaseArena(aheader);

I think the indentation is wrong in this function.

@@ +341,2 @@
>  {
> +    size_t thingSize = GCThingSizeMap[thingKind];

Couldn't we compute thingSize in FinalizeTypedArenas and avoid the argument in each case below? It seems cleaner to me.

@@ +1238,5 @@
> +    if (!al->head) {
> +        JS_ASSERT(al->cursor == &al->head);
> +        al->cursor = &aheader->next;
> +    }
> +    al->head = aheader;

This code confused me because I thought that only full arenas go at the front of the list. But I guess that arenas that we're allocating into are technically full according to the ArenaHeader, so this makes sense. Could you add something to the comment about it though?

@@ +1344,2 @@
>      } else {
> +        lists->backgroundFinalizeState[thingKind] = BFS_DONE;

Is it safe to use BFS_DONE and not BFS_JUST_FINISHED? It seems like the background finalize thread could have released some arenas, and we want the allocator to see the writes done by releaseArena. If I'm wrong, please add a comment explaining why.

@@ +1435,5 @@
>       * during the GC in optimized builds.
>       */
> +    JSRuntime *rt = cx->runtime;
> +    JS_ASSERT(!rt->gcRunning);
> +    if (rt->gcRunning)

What's the history here? Do we still need this?

@@ +1450,5 @@
>  
>              /*
>               * The JSGC_END callback can legitimately allocate new GC
>               * things and populate the free list. If that happens, just
>               * return that list head.

Does this still happen?

::: js/src/jsgc.h
@@ +86,5 @@
>   */
>  const size_t MAX_BACKGROUND_FINALIZE_KINDS = FINALIZE_LIMIT - (FINALIZE_OBJECT_LAST + 1) / 2;
>  
> +extern JS_FRIEND_DATA(const uint32) GCThingSizeMap[];
> +extern JS_FRIEND_DATA(const uint32) ArenaFirstThingOffsets[];

Can we make inline functions to access these and have them automatically do the bounds assertions?

@@ +144,5 @@
>  
>      /*
>       * To minimize the size of the arena header the first span is encoded
>       * there as offsets from the arena start.
> +     *

Mistake?

@@ +155,5 @@
>          JS_ASSERT(firstOffset <= ((lastOffset + 1) & ~size_t(1)));
>          return firstOffset | (lastOffset << 16);
>      }
>  
>      static const size_t EmptyOffsets = ArenaSize | ((ArenaSize - 1) << 16);

The name here is a little confusing. Aren't these the offsets we use when the arena is completely full? It seems like FullOffsets would be a better name.

@@ +373,2 @@
>       * contain any GC things and is on the list of empty arenas in the GC
>       * chunk. The later allows to quickly check if the arena is allocated

later -> latter

@@ +701,5 @@
>      JS_ASSERT(!allocated());
>      JS_ASSERT(!getMarkingDelay()->link);
>      compartment = comp;
> +    allocKind = kind;
> +    firstFreeSpanOffsets = FreeSpan::EmptyOffsets;

Could you put a comment here like "See FreeSpan::allocateFromNewArena"?

@@ +833,5 @@
> +    /*
> +     * ArenaList::head points to the start of the list and cursor points to
> +     * the first arena in the list with some free things. All arenas before
> +     * cursor are fully allocated. cursor is an indirect pointer to allow for
> +     * efficient list insertion at the cursor point and other manipulations.

Maybe say that arenas that are currently being allocated into count as fully allocated for purposes of the list?

@@ +858,5 @@
> +     * things we copy its first free span into the list and set the arena as
> +     * if it has no free things. This way we do not need to update the arena
> +     * header after the initial allocation. When starting the GC We only move
> +     * the head of the of the list of spans back to the arena only for the
> +     * arena that was not fully allocated.

It looks like the last sentence got spliced during a rebase or something.

::: js/src/jsgcinlines.h
@@ +118,5 @@
>      JS_ASSERT(thing);
>      if (JSAtom::isStatic(thing))
>          return JSTRACE_STRING;
>      const Cell *cell = reinterpret_cast<const Cell *>(thing);
> +    return MapAllocToTraceKind(cell->arenaHeader()->getAllocKind());

Seems like arenaHeader() is not needed.

@@ +158,4 @@
>  }
>  
>  static inline bool
> +CanBumpAllocKind(AllocKind kind)

I think CanIncreaseAllocKind or CanIncrementAllocKind would be a better name. "Bump allocation" means something completely different and I found this confusing at first.

::: js/src/jsgcstats.cpp
@@ +202,5 @@
>  #if defined(MOZ_GCTIMER) || defined(JSGC_TESTPILOT)
>  
>  volatile GCTimer::JSGCReason gcReason = GCTimer::NOREASON;
>  const char *gcReasons[] = {"  API", "Maybe", "LastC", "DestC", "Compa", "LastD",
> +                          "Malloc", "Refill", "Chunk", "Shape", "  None"};

It looks like an extra space of indent is needed here.

::: js/src/methodjit/BaseAssembler.h
@@ +1235,5 @@
>       * taken if a free thing was not retrieved.
>       */
>      Jump getNewObject(JSContext *cx, RegisterID result, JSObject *templateObject)
>      {
> +        gc::AllocKind allocKind = templateObject->arenaHeader()->getAllocKind();

Do you need arenaHeader()?
Comment 6 Igor Bukanov 2011-09-02 01:37:21 PDT
(In reply to Bill McCloskey (:billm) from comment #5)
> ::: js/src/jscntxt.h
> @@ +450,5 @@
> >       * This is used to look for compartment bugs.
> >       */
> >      JSCompartment       *gcCheckCompartment;
> >  
> > +    volatile jsuword    gcIsNeeded;
> 
> Was there a reason to move this?

The move is a leftover from some early version. But changing the type to jsuword is necessary: bool access on a not-so-slow path makes things slower, and volatile bool is not portable, as on some platforms access to bool is not atomic.

> @@ +1435,5 @@
> >       * during the GC in optimized builds.
> >       */
> > +    JSRuntime *rt = cx->runtime;
> > +    JS_ASSERT(!rt->gcRunning);
> > +    if (rt->gcRunning)
> 
> What's the history here? Do we still need this?

I will file a bug about this; until a few months ago we still had XPCOM finalizers that could lead to allocation attempts with some addons.

> 
> @@ +1450,5 @@
> >  
> >              /*
> >               * The JSGC_END callback can legitimately allocate new GC
> >               * things and populate the free list. If that happens, just
> >               * return that list head.
> 
> Does this still happen?

That I have not checked for a while; I'll combine it with the bug above.
Comment 8 Marco Bonardo [::mak] 2011-09-03 02:56:10 PDT
http://hg.mozilla.org/mozilla-central/rev/60df75bc1428
