Always use inline FreeLists

RESOLVED FIXED in Firefox 47

People

(Reporter: terrence, Assigned: ehoogeveen)

Tracking

(Blocks: 1 bug)

Trunk
mozilla47

Firefox Tracking Flags

(firefox47 fixed)

Attachments

(2 attachments, 1 obsolete attachment)

(Reporter)

Description

3 years ago
At the moment we store the free list we are allocating into on the Zone instead of directly in the arena's header. As far as I can tell from the code and comments, we do this for 3 reasons:

1) Better cache usage as we are always frobbing the same lines, rather than moving them around as we allocate from different arenas.
2) We can store the free list in a compact packed form in the arena header and store it unpacked in the free list so that it takes fewer instructions to manipulate.
3) It simplifies the JIT implementation as we can just bake in the address of the unpacked free list rather than computing it from the header.

The two "performance" rationales are almost certainly totally bogus at this point, given our nursery and modern processors. So we would need to do some work on the JIT to load the arena header and [un]pack the free list. This would probably be fairly straightforward and would dramatically simplify the GC's allocation, sweeping, and compartment merging algorithms.
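As a rough sketch of what "storing the free list directly in the arena's header" looks like (all names and layouts here are illustrative, not the actual SpiderMonkey types):

```cpp
#include <cstdint>

// Illustrative sketch only -- not the real SpiderMonkey types.
// The "compact packed form" is two 16-bit offsets from the arena base,
// stored directly in the arena's header.
struct FreeSpan {
    uint16_t first;  // offset of the first free cell
    uint16_t last;   // offset of the last free cell
};

struct ArenaHeader {
    FreeSpan firstFreeSpan;  // the free list lives in the header itself
};

// Bump-allocate straight out of the arena header. Returns nullptr when
// the span has no room to bump; the real slow path would then allocate
// the last cell and install the next span, which is omitted here.
inline void* allocate(ArenaHeader* aheader, uintptr_t arenaBase,
                      uint16_t thingSize) {
    FreeSpan& span = aheader->firstFreeSpan;
    if (span.first >= span.last)
        return nullptr;  // take the slow path
    uint16_t offset = span.first;
    span.first = offset + thingSize;
    return reinterpret_cast<void*>(arenaBase + offset);
}
```

With the status quo, the Zone keeps an unpacked copy of this span and must periodically synchronize it back into the header; with the scheme proposed here, the header is the single source of truth.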
Comment 1

3 years ago

FWIW, string allocation is a potential concern: strings are always tenured, and some benchmarks allocate a lot of strings (Octane-regexp, for instance).
(Assignee)

Comment 2

3 years ago
I have a patch almost ready - I'll clean it up and try to post a WIP tonight. It compiles and seems to run fine ... so long as Ion is disabled. I've hit a bit of a snag there: Now that ArenaHeaders and the FreeSpan they contain are used as the free lists directly, the pointers we bake into JITcode become invalid whenever the free lists are changed! I can see 3 ways to solve this (while continuing to switch to using the ArenaHeaders directly):

1) Discard JITcode whenever we change the free lists. This is not ideal, since we'd have to compile again, and I don't know if we can discard selectively enough to avoid nuking everything.
2) Patch the JITcode to bail out the next time it tries to allocate. This is also not ideal, since we'd have to compile again - but may still be a smaller hammer, since it wouldn't necessarily affect *all* JITcode in the zone.
3) Patch new addresses into the JITcode whenever we start allocating from a new arena (and when we purge them for GC). This would be much better, since we'd be able to reuse the scripts - but I don't know if we can reliably scan scripts for baked-in addresses (or if doing so could, say, clobber instructions). I also don't know how long patching all the scripts in a zone would take, but in theory we already do that to toggle the prebarriers.

Jan, is (3) something we can do? Unfortunately I don't see a way around this without adding back the indirection I just removed.

Note that I also don't have performance numbers for this yet, since the patch isn't stable with Ion enabled (and I still have some debugging code to clean up).
Flags: needinfo?(jdemooij)
(Assignee)

Comment 3

3 years ago
Hmm, I suppose this is also an option:

4) Add indirection to the JITcode, so it bakes in the addresses of the free list array itself instead (then loads their contents when allocating). This would make the JITcode more complicated and potentially slower, but it might be a less drastic change overall. I can experiment with this locally.
(Assignee)

Comment 4

3 years ago
Implementing 4) actually went pretty well. Looking at it now, I think this is the best solution by far.
Flags: needinfo?(jdemooij)
(Assignee)

Comment 5

3 years ago
Created attachment 8723882 [details] [diff] [review]
Part 0: Fix MacroAssembler support for store16().

Ran into this while implementing the JIT changes. On x64, store*() needs to use a scratch register if the target address doesn't fit in 4 bytes. This was implemented for storePtr() and store32(), but the equivalent for store16() appears to be missing. This patch adds that variant, and adds the simpler version for x86 for parity between the files (though I'm not sure it's necessary).
Assignee: nobody → emanuel.hoogeveen
Status: NEW → ASSIGNED
Attachment #8723882 - Flags: review?(jdemooij)
(Assignee)

Comment 6

3 years ago
Created attachment 8723897 [details] [diff] [review]
Part 1: Refactor FreeSpan management to be less indirect and confusing.

And here's the meat of the patch.

10 files changed, 265 insertions(+), 630 deletions(-)

Jan, I'm flagging you for review on the JIT changes (the changes to MacroAssembler.cpp about 1000 lines into the patch). I'm pretty happy with how it turned out in the end, but let me know if there's anything that could be done more efficiently. Terrence, the rest is all yours :)

This patch:

1) Removes FreeList and CompactFreeSpan and moves their functionality into ArenaHeader and FreeSpan.
2) Switches the pointers in FreeSpan over to the 16-bit offsets used previously in CompactFreeSpan.
3) Changes ArenaLists::freeLists into an array of ArenaHeader pointers which are used directly, removing the need for periodic synchronization.
4) Changes the JIT to work with the new interface.

I ran Octane a few times with and without the patch, and I think any difference is in the noise, so that's encouraging.

Overall I think the approach turned out pretty nice, but there are a few warts. Firstly, because FreeSpans now use offsets instead of pointers, they can't find the next span without being told what Arena they belong to. As a result, most of FreeSpan's methods now take an |ArenaHeader*| parameter. It turns out we can pass in the Arena everywhere we touch them, so I was able to make the debug mode checks more comprehensive (which helped greatly in debugging this patch).

Secondly, because ArenaLists::freeLists is now a list of pointers to Arenas that might go away during GC, I needed some way to handle how to clear them. I initially added null checks, but this didn't work with the JIT implementation, which baked in the pointers, so I opted for another approach: I gave ArenaLists a static 'placeholder' ArenaHeader, initialized as unallocated (since it doesn't have an Arena) and with an empty span. This allows the normal bump allocation path to work and gives the JITs something to call into.

I don't think either of those is too bad, and they both come with some advantages (compared to the alternatives I could think up), but I'd be interested in any ideas to improve things more.
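The placeholder trick described above amounts to a sentinel: an always-empty span that makes the fast path fail without a null check, so both C++ and JITcode can bake in a valid address unconditionally. A minimal sketch (names and the offset-based failure signal are illustrative):

```cpp
#include <cstdint>

// Illustrative sketch of the sentinel idea -- not the real types.
struct FreeSpan { uint16_t first, last; };
struct ArenaHeader { FreeSpan firstFreeSpan; };

// An empty span (first == last == 0) means every allocation attempt
// falls through to the slow path, so the fast path never needs a null
// check and JIT code always has a real address to bake in.
static ArenaHeader placeholder{{0, 0}};

constexpr int kAllocKinds = 4;
ArenaHeader* freeLists[kAllocKinds] = {
    &placeholder, &placeholder, &placeholder, &placeholder};

// Fast path: returns the bumped-from offset, or 0 to signal "take the
// slow path". Offset 0 is always the arena header, never a valid cell,
// so it is safe to use as the failure value.
inline uint16_t tryBumpAlloc(int kind, uint16_t thingSize) {
    FreeSpan& span = freeLists[kind]->firstFreeSpan;
    if (span.first >= span.last)
        return 0;  // empty (placeholder) or exhausted span
    uint16_t offset = span.first;
    span.first = offset + thingSize;
    return offset;
}
```

Clearing a free list during GC is then just `freeLists[kind] = &placeholder;` rather than nulling a pointer that every consumer would have to check.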
Attachment #8723897 - Flags: review?(terrence)
Attachment #8723897 - Flags: review?(jdemooij)
(Assignee)

Comment 7

3 years ago
And a try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=727fce06aadd (last attempt failed on a build warning, fingers crossed)
(Assignee)

Comment 8

3 years ago
A couple more notes now that I'm more awake:

Jan: I think the JITcode could be faster by using a 3rd register, but I wasn't sure if I could just take one, or what would happen if none are available.

Terrence: Please pay special attention to the PrepareForIncrementalGC function. I wasn't entirely sure how this worked, so I just tried to match existing behavior - but knowing the assigned arenas even if they're full gives us a little more freedom to choose.
Comment 9

3 years ago

Can we use int32 instead of int16? I think loading/storing int16 can be slower; we have some code that avoids this by using 32-bit loads (see MacroAssembler::branchFunctionKind for instance).
Flags: needinfo?(emanuel.hoogeveen)
(Assignee)

Comment 10

3 years ago
For the storage in FreeSpan, or for the JITcode?

For the storage in FreeSpan: We could, at the cost of increasing the size of ArenaHeader by 4 bytes on x86 (from 16 to 20 - IIRC we're already on 32 bytes due to alignment on x64), and splitting up the load and store when getting the next span. I don't know if that would be worth it, but it's an easy change in and of itself - maybe we could try it in a follow-up so we can compare the performance on AWFY.

For the JITcode: If we make sure that FreeSpan::first ends up in the low word (presumably depends on endianness), we could load them in one go, then copy FreeSpan::last (the high word) into both the high and low words of the other register before comparing them (since |(FreeSpan::last << 16) + FreeSpan::first <= (FreeSpan::last << 16) + FreeSpan::last|). Then we could bump the offset and store both with a 32-bit store. That seems like a fair amount of work though compared to just loading them separately (and we'd still have to clear the high word before turning the offset into an address).

Unless there's a fast way of splitting the high word in one register off into the low word of another?
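The packed-compare idea above can be checked in plain C++ (a sketch of the register-level trick, not actual JIT code): with |first| in the low word, comparing the packed 32-bit value against a register holding |last| in both halves is equivalent to comparing the offsets directly.

```cpp
#include <cstdint>

// Sketch of the single-load compare: |packed| is one 32-bit load of
// the span with first in the low word and last in the high word.
// Since (last << 16) | first < (last << 16) | last iff first < last,
// one unsigned 32-bit compare decides whether the span has room.
inline bool spanHasRoom(uint32_t packed /* (last << 16) | first */) {
    uint16_t last = static_cast<uint16_t>(packed >> 16);
    uint32_t lastBoth = (static_cast<uint32_t>(last) << 16) | last;
    return packed < lastBoth;
}
```

This saves one of the two 16-bit loads, at the cost of the extra word-duplication step and still needing to mask off the high word before turning the offset into an address.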
Flags: needinfo?(emanuel.hoogeveen) → needinfo?(jdemooij)
(Assignee)

Comment 11

3 years ago
The bustage on try in s and g2 looks pre-existing: s has existing bugs filed on it, and while there are no open bugs about g2, it doesn't look like anything my patch would have caused. The only orange without a bug on file is the timeout in C on Windows 7, but unless I introduced a massive performance regression or a hang, I doubt that's related either.
(Assignee)

Comment 12

3 years ago
It might be interesting to check what kind of code our various compilers generate for ArenaHeader::allocate(), since that's what the JITcode is essentially duplicating (minus a few sanity checks, and bailing instead of returning null).
(Reporter)

Comment 13

3 years ago
Comment on attachment 8723897 [details] [diff] [review]
Part 1: Refactor FreeSpan management to be less indirect and confusing.

Review of attachment 8723897 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/GCInternals.h
@@ -26,5 @@
> -
> -  public:
> -    AutoCopyFreeListToArenas(JSRuntime* rt, ZoneSelector selector);
> -    ~AutoCopyFreeListToArenas();
> -};

\o/

::: js/src/gc/Heap.h
@@ +331,5 @@
>    public:
>      // This inits just |first| and |last|; if the span is non-empty it doesn't
>      // do anything with the next span stored at |last|.
> +    void initBoundsUnchecked(uintptr_t first, uintptr_t last, const ArenaHeader* aheader) {
> +        checkRange(first, last, aheader);

Heh.

::: js/src/jsgc.cpp
@@ -479,5 @@
> -     * empty and also points to this arena. Thus they must be the same.
> -     */
> -    MOZ_ASSERT(freeList->isSameNonEmptySpan(firstSpan));
> -}
> -#endif

\o/

@@ +1952,5 @@
>  ArenaLists::prepareForIncrementalGC(JSRuntime* rt)
>  {
>      for (auto i : AllAllocKinds()) {
> +        ArenaHeader* aheader = freeLists[i];
> +        if (aheader && aheader != &placeholder) {

I think you can skip the null check here.

::: js/src/jsgc.h
@@ -775,5 @@
> -            MOZ_ASSERT(freeList->isSameNonEmptySpan(aheader->getFirstFreeSpan()));
> -            return true;
> -        }
> -        return false;
> -    }

\o/

::: js/src/jsgcinlines.h
@@ -120,5 @@
>  
>    public:
>      ArenaCellIterImpl()
> -      : firstThingOffset(0)     // Squelch
> -      , thingSize(0)            //   warnings

O_o

Wow, that kinda speaks volumes.

@@ -141,5 @@
> -        AllocKind kind = aheader->getAllocKind();
> -        MOZ_ASSERT(aheader->zone->arenas.isSynchronizedFreeList(kind));
> -#endif
> -        initUnsynchronized(aheader);
> -    }

Now that there is only one |init()| and it is infallible, we should just move this all into the constructor. That can be a followup.
Attachment #8723897 - Flags: review?(terrence) → review+
(Assignee)

Comment 14

3 years ago
I looked at the code produced by GCC 4.8 (after hacking up a templated version to make it bake in the thing size), and it reads like a less register-starved version of my code:

75bca0: mov    0x40(%rdi),%rdx    ;           rdx = freeLists[thingKind]; // The Arena address
75bca4: movzwl 0x10(%rdx),%ecx    ;           ecx = rdx->firstFreeSpan.first;
75bca8: movzwl 0x12(%rdx),%esi    ;           esi = rdx->firstFreeSpan.last;
75bcac: movzwl %cx,%eax           ;           eax = cx;
75bcaf: add    %rdx,%rax          ;           rax += rdx; // The address to use as the result
75bcb2: cmp    %si,%cx            ;           if (cx >= si)
75bcb5: jae    75bcc0             ;               goto fallback;
75bcb7: add    $0x40,%ecx         ;           ecx += thingSize;
75bcba: mov    %cx,0x10(%rdx)     ;           rdx->firstFreeSpan.first = cx;
75bcbe: retq                      ;           return rax;
75bcbf: nop                       ;
75bcc0: test   %cx,%cx            ; fallback: if (cx == 0)
75bcc3: je     75bccc             ;               goto fail;
75bcc5: mov    (%rdx,%rsi,1),%ecx ;           ecx = *(rdx + rsi); // Load the next span
75bcc8: mov    %ecx,0x10(%rdx)    ;           rdx->firstFreeSpan = ecx;
75bccb: retq                      ;           return rax;
75bccc: xor    %eax,%eax          ; fail:     eax = nullptr;
75bcce: retq                      ;           return rax;

In particular, they don't avoid the 16-bit loads. Code produced by Clang 3.4 is slightly different, but also includes the 16-bit loads. I'll play with making each field 32 bits in a follow-up and see if it gains us anything (it's entirely possible that because of the nursery, this path just really isn't a bottleneck anymore).
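For reference, the annotated disassembly above corresponds to roughly the following C++ (a reconstruction from the assembly, not the actual ArenaHeader::allocate; the header layout and offsets are illustrative):

```cpp
#include <cstdint>
#include <cstring>

// Reconstruction of the logic in the disassembly above -- illustrative,
// not the real SpiderMonkey code.
struct FreeSpan { uint16_t first, last; };
struct Arena {
    unsigned char pad[0x10];   // other header fields before the span
    FreeSpan firstFreeSpan;    // at offset 0x10, as in the disassembly
    unsigned char rest[4096 - 0x10 - sizeof(FreeSpan)];
};

void* allocate(Arena* arena, uint16_t thingSize) {
    unsigned char* base = reinterpret_cast<unsigned char*>(arena);
    FreeSpan& s = arena->firstFreeSpan;
    uint16_t first = s.first, last = s.last;
    void* result = base + first;  // speculative result (rax above)
    if (first < last) {           // fast path: bump |first|
        s.first = first + thingSize;
        return result;
    }
    if (first == 0)               // empty span: allocation fails
        return nullptr;
    // Last cell of this span: its memory holds the next span's packed
    // form, so install that before handing the cell out.
    std::memcpy(&s, base + last, sizeof(FreeSpan));
    return result;
}
```

Note the same structure as the JIT path: one speculative address computation, one compare, and a two-branch fallback that either chains to the next span or returns null.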
(Assignee)

Updated

3 years ago
Blocks: 1251833
(Assignee)

Updated

3 years ago
Blocks: 1251834
(Assignee)

Comment 15

3 years ago
I filed bug 1251833 for some further simplifications and bug 1251834 to see about using 32-bit fields.

Updated

3 years ago
Attachment #8723882 - Flags: review?(jdemooij) → review+
Comment 16

3 years ago

Comment on attachment 8723897 [details] [diff] [review]
Part 1: Refactor FreeSpan management to be less indirect and confusing.

Review of attachment 8723897 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great, thanks for doing this! Will be interesting to see what happens on AWFY.

::: js/src/gc/Heap.h
@@ +460,5 @@
>          static_assert(size_t(AllocKind::LIMIT) <= 255,
>              "We must be able to fit the allockind into uint8_t.");
>          allocKind = size_t(kind);
>  
> +        setAsFullyUnused();

This is so much nicer than marking the arena as full and relying on some other code resetting that.

::: js/src/jit/MacroAssembler.cpp
@@ +791,5 @@
> +    // If there is no room remaining in the span, fall back to get the next one.
> +    loadPtr(AbsoluteAddress(zone->addressOfFreeList(allocKind)), temp);
> +    load16ZeroExtend(Address(temp, js::gc::ArenaHeader::offsetOfFreeSpanFirst()), result);
> +    load16ZeroExtend(Address(temp, js::gc::ArenaHeader::offsetOfFreeSpanLast()), temp);
> +    branchPtr(Assembler::AboveOrEqual, result, temp, &fallback);

Nit: branch32

@@ +794,5 @@
> +    load16ZeroExtend(Address(temp, js::gc::ArenaHeader::offsetOfFreeSpanLast()), temp);
> +    branchPtr(Assembler::AboveOrEqual, result, temp, &fallback);
> +
> +    // Bump the offset for the next allocation.
> +    addPtr(Imm32(thingSize), result);

Nit: add32

@@ +797,5 @@
> +    // Bump the offset for the next allocation.
> +    addPtr(Imm32(thingSize), result);
> +    loadPtr(AbsoluteAddress(zone->addressOfFreeList(allocKind)), temp);
> +    store16(result, Address(temp, js::gc::ArenaHeader::offsetOfFreeSpanFirst()));
> +    subPtr(Imm32(thingSize), result);

Nit: sub32

This add + sub is a bit unfortunate but I don't see an easy way to avoid it. If we had an extra temp register, we could do:

  computeEffectiveAddress(Address(result, thingSize), temp2);
  store16(temp2, ...);

It's probably fine to keep it this way. Getting an extra temp register here doesn't look trivial and extra regalloc pressure might hurt elsewhere.

@@ +809,1 @@
>      branchPtr(Assembler::Equal, result, ImmPtr(0), fail);

Nit:

branchTest32(Assembler::Zero, result, result, fail);

::: js/src/jsgc.cpp
@@ +485,5 @@
>      MOZ_ASSERT(!aheader.allocatedDuringIncremental);
>  
> +    uint16_t firstThing = firstThingOffset(thingKind);
> +    uint16_t firstThingOrSuccessorOfLastMarkedThing = firstThing;
> +    uint16_t lastThing = ArenaSize - thingSize;

It might make sense to use uint_fast16_t here and a few times below.
Attachment #8723897 - Flags: review?(jdemooij) → review+

Comment 17

3 years ago
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #10)
> For the storage in FreeSpan, or for the JITcode?

I assumed the bigger FreeSpans would still fit in the ArenaHeader, but if they don't you're probably right that it's best to keep it this way. Let's see what happens on AWFY; I agree it's likely not worth worrying about.
Flags: needinfo?(jdemooij)
(Assignee)

Comment 18

3 years ago
Created attachment 8724727 [details] [diff] [review]
Part 1 v2: Refactor FreeSpan management to be less indirect and confusing.

Thanks for the careful reviews! Carrying forward r+.

(In reply to Terrence Cole [:terrence] from comment #13)
> > +        if (aheader && aheader != &placeholder) {
> 
> I think you can skip the null check here.

Well spotted, that was left over from before I added the placeholder.

> >      ArenaCellIterImpl()
> > -      : firstThingOffset(0)     // Squelch
> > -      , thingSize(0)            //   warnings
> 
> O_o
> 
> Wow, that kinda speaks volumes.

Doesn't it just? I cleaned this up more in bug 1251833 (if we take part 2), but leaving half the variables uninitialized without a good reason seemed unsound.

> > -        initUnsynchronized(aheader);
> > -    }
> 
> Now that there is only one |init()| and it is infallible, we should just
> move this all into the constructor. That can be a followup.

It didn't work out, but I had a look at this in bug 1251833.

(In reply to Jan de Mooij [:jandem] from comment #16)
> > +        setAsFullyUnused();
> 
> This is so much nicer than marking the arena as full and relying on some
> other code resetting that.

Yeah, there used to be other reasons for this but I removed them a long time ago. I was worried there might still be some vestigial reason, but I couldn't find any (though I did have to add a new function to keep the ArenaList cursor from getting misplaced).

> Nit: branch32
> Nit: add32
> Nit: sub32

Thanks, done.
 
> This add + sub is a bit unfortunate but I don't see an easy way to avoid it.
> If we had an extra temp register, we could do:
> 
>   computeEffectiveAddress(Address(result, thingSize), temp2);
>   store16(temp2, ...);
> 
> It's probably fine to keep it this way. Getting an extra temp register here
> doesn't look trivial and extra regalloc pressure might hurt elsewhere.

Yeah, I looked at what it would take to get another register here, but it branches out into a lot of callers. One of them already uses 4 registers, and I wasn't sure if I could just push one of the registers (and pop it afterward) since we might bail. Can that leave the stack misaligned?

> >      branchPtr(Assembler::Equal, result, ImmPtr(0), fail);
> 
> Nit:
> 
> branchTest32(Assembler::Zero, result, result, fail);

Ooh, I wouldn't have known how to transform that one. Fixed.

> > +    uint16_t firstThing = firstThingOffset(thingKind);
> > +    uint16_t firstThingOrSuccessorOfLastMarkedThing = firstThing;
> > +    uint16_t lastThing = ArenaSize - thingSize;
> 
> It might make sense to use uint_fast16_t here and a few times below.

Good idea, done (everywhere except for the actual FreeSpan).
Attachment #8723897 - Attachment is obsolete: true
Attachment #8724727 - Flags: review+
(Assignee)

Updated

3 years ago
Keywords: checkin-needed

Comment 20

3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/1783f15418c5
https://hg.mozilla.org/mozilla-central/rev/5e76a5e6b927
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox47: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
(Assignee)

Updated

3 years ago
Blocks: 1192301