Make our JIT post barriers more efficient

RESOLVED FIXED in Firefox 50

Status

()

Core
JavaScript: GC
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: jonco, Assigned: jonco)

Tracking

(Blocks: 1 bug)

unspecified
mozilla50
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox50 fixed)

Details

Attachments

(5 attachments, 1 obsolete attachment)

(Assignee)

Description

2 years ago
I wrote a test program that created a large array of object and noticed that it spent ~25% of its time calling into the VM from JIT code to run postbarriers.  This bug is to investigate improving this situation.
How much work would it be to move the is-this-item-is-the-last-item-we-put-into-the-store-buffer check inline into the jit code? The reason that 1-item cache works so well is because it is the common case, so if we could avoid a VM call in the common case, that seems pretty good. This would be much easier with the whole cell buffer than the slots buffer (which is used for array elements, unfortunately).

I think we also talked about checking the ObjectElements::Flags::IN_WHOLE_CELL_BUFFER bit inline in the jit code, too.

We could also investigate eliding post-barriers in the JIT when we can prove that we already put the edge in the store buffer (again, much easier with the whole cell buffer than with the slots buffer).

Can you share how you were creating the large array, and which post barrier (PostWriteBarrier or PostWriteElementBarrier) we end up running?
Blocks: 1057530
(Assignee)

Comment 2

2 years ago
Created attachment 8759133 [details] [diff] [review]
barrier-benchmarks

Here are the benchmarks I'm using.

This is about getting rid of calls to PostWriteBarrier, although I've ideas for PostWriteElementBarrier too.

The plan is to change the whole cell store buffer representation from a HashSet to an optional bit vector associated with each arena.  This is simple enough to manipulate that we can do this from jitcode without the overhead of calling into the VM.  Maybe we just was to check if a cell is already in the buffer, or maybe we go the whole way and add too, I'm not sure.
(Assignee)

Updated

2 years ago
See Also: → bug 1275033
(Assignee)

Updated

2 years ago
Depends on: 1277868
Depends on: 1277866
(Assignee)

Updated

2 years ago
No longer depends on: 1277868
(Assignee)

Comment 3

2 years ago
Created attachment 8762096 [details] [diff] [review]
bitvector-1-add-arena-header-field

The plan is to associate an optional bit vector with each arena that tells us whether the cells in that arena are in the whole cell store buffer.

To do this we need to add a pointer to the arena header.

Here are the stats about the arena layout for the different thing kinds on 32 and 64 bit builds.

    64-bit:                                                             
                                   Pre:                Post:
                                   Size Offset Count   Size Offset Count
                         FUNCTION:  64      64    63     64     64    63
                FUNCTION_EXTENDED:  80      96    50     80     96    50
                          OBJECT0:  32      32   127     32     64   126
               OBJECT0_BACKGROUND:  32      32   127     32     64   126
                          OBJECT2:  48      64    84     48     64    84
               OBJECT2_BACKGROUND:  48      64    84     48     64    84
                          OBJECT4:  64      64    63     64     64    63
               OBJECT4_BACKGROUND:  64      64    63     64     64    63
                          OBJECT8:  96      64    42     96     64    42
               OBJECT8_BACKGROUND:  96      64    42     96     64    42
                         OBJECT12: 128     128    31    128    128    31
              OBJECT12_BACKGROUND: 128     128    31    128    128    31
                         OBJECT16: 160      96    25    160     96    25
              OBJECT16_BACKGROUND: 160      96    25    160     96    25
                           SCRIPT: 208     144    19    208    144    19
                      LAZY_SCRIPT:  64      64    63     64     64    63
                            SHAPE:  40      56   101     40     56   101
                   ACCESSOR_SHAPE:  56      64    72     56     64    72
                       BASE_SHAPE:  40      56   101     40     56   101
                     OBJECT_GROUP:  48      64    84     48     64    84
                FAT_INLINE_STRING:  32      32   127     32     64   126
                           STRING:  24      40   169     24     40   169
                  EXTERNAL_STRING:  24      40   169     24     40   169
                           SYMBOL:  24      40   169     24     40   169
                          JITCODE:  48      64    84     48     64    84

    32-bit:
                                   Pre:                Post:
                                   Size Offset Count   Size Offset Count
                         FUNCTION:  32      32   127     32     32   127
                FUNCTION_EXTENDED:  48      16    85     48     64    84
                          OBJECT0:  16      16   255     16     32   254
               OBJECT0_BACKGROUND:  16      16   255     16     32   254
                          OBJECT2:  32      32   127     32     32   127
               OBJECT2_BACKGROUND:  32      32   127     32     32   127
                          OBJECT4:  48      16    85     48     64    84
               OBJECT4_BACKGROUND:  48      16    85     48     64    84
                          OBJECT8:  80      16    51     80     96    50
               OBJECT8_BACKGROUND:  80      16    51     80     96    50
                         OBJECT12: 112      64    36    112     64    36
              OBJECT12_BACKGROUND: 112      64    36    112     64    36
                         OBJECT16: 144      64    28    144     64    28
              OBJECT16_BACKGROUND: 144      64    28    144     64    28
                           SCRIPT: 144      64    28    144     64    28
                      LAZY_SCRIPT:  48      16    85     48     64    84
                            SHAPE:  24      16   170     24     40   169
                   ACCESSOR_SHAPE:  32      32   127     32     32   127
                       BASE_SHAPE:  24      16   170     24     40   169
                     OBJECT_GROUP:  24      16   170     24     40   169
                FAT_INLINE_STRING:  32      32   127     32     32   127
                           STRING:  16      16   255     16     32   254
                  EXTERNAL_STRING:  16      16   255     16     32   254
                           SYMBOL:  16      16   255     16     32   254
                          JITCODE:  40      16   102     40     56   101
                                                                        
In summary, on 64 bit builds we get one less OBJECT0 (+ BACKGROUND) and FAT_INLINE_STRING per arena (126 rather than 127).  All other kinds are unaffected.

On 32 bit builds there is a great affect, with the following kinds having one less cell per arena:
        FUNCTION_EXTENDED
        OBJECT0 + BACKGROUND
        OBJECT4
        OBJECT5
        LAZY_SCRIPT
        SHAPE
        BASE_SHAPE
        OBJECT_GROUP
        STRING
        EXTERNAL_STRING
        JITCODE
Attachment #8762096 - Flags: feedback?(terrence)
(Assignee)

Comment 4

2 years ago
Created attachment 8762098 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer

Reimplement the whole cell store buffer using an optional bitvector associated with each arena.  These are allocated from the nursery.
(Assignee)

Comment 5

2 years ago
Created attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

For writes to objects known at compile time we can compile a faster store buffer check that calling into the VM to do this.
(Assignee)

Comment 6

2 years ago
Benchmark results for the owb.js microbenchmark:

Initial:

            1.29 real         1.28 user         0.01 sys
            1.30 real         1.28 user         0.01 sys
            1.28 real         1.26 user         0.01 sys

With reimplememented whole cell store buffer:

            0.86 real         0.85 user         0.01 sys
            0.87 real         0.86 user         0.01 sys
            0.86 real         0.84 user         0.01 sys

With constant object optimisation:

            0.62 real         0.61 user         0.01 sys
            0.63 real         0.61 user         0.01 sys
            0.62 real         0.60 user         0.01 sys
(Assignee)

Comment 7

2 years ago
I also spent a while optimising the non-constant object case so we could check this in JIT code without calling into the VM, and made it work on x86 (yay for the BTS instruction), arm and arm64.  It wasn't any faster.
Attachment #8762096 - Flags: review+
Attachment #8762096 - Flags: feedback?(terrence)
Attachment #8762096 - Flags: feedback+
Comment on attachment 8762098 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer

Review of attachment 8762098 [details] [diff] [review]:
-----------------------------------------------------------------

I'd like to see a copy using BitArray first.

::: js/src/gc/Heap.h
@@ +337,5 @@
>   * wastes space but allows to avoid expensive devisions by thing's size when
>   * accessing the bitmap. In addition this allows to use some bits for colored
>   * marking during the cycle GC.
>   */
> +const size_t ArenaCellCount = ArenaSize / CellSize;

Maybe also static_assert that ArenaSize % CellSize is 0.

::: js/src/gc/StoreBuffer-inl.h
@@ +50,5 @@
> +    uint32_t mask = getMask(cellIndex);
> +    uint32_t word = bits[wordIndex];
> +    if ((word & mask) == 0)
> +        bits[wordIndex] = word | mask;
> +}

All of this code seems to be identical to the code in ds/BitArray.h. Please re-implement |bits| using a BitArray.

::: js/src/gc/StoreBuffer.cpp
@@ +137,5 @@
> +        oomUnsafe.crash("Failed to allocate WholeCellSet");
> +        return nullptr;
> +    }
> +
> +    if (nursery.approxFreeSpace() < 64 * 1024)

Please make 64 * 1024 a constant of some sort in the header so that it's apparent that it is a knob we can fiddle with.
Attachment #8762098 - Flags: feedback+
Ah! I guess we can't naively use BitArray because we need to have a fixed word size for the JIT integration! I think it would be worth the effort to give BitArray a template parameter of WordType instead of having a second copy with a fixed size type though. Or even to make BitArray just use uint32_t, for that matter.
Comment on attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

Review of attachment 8762100 [details] [diff] [review]:
-----------------------------------------------------------------

This is nice. Obviously don't land this if it's not actually faster, but you have my r+ to do so if it is.
Attachment #8762100 - Flags: review+
(In reply to Jon Coppeard (:jonco) from comment #6)
> Benchmark results for the owb.js microbenchmark:

Great results!
(Assignee)

Comment 12

2 years ago
Created attachment 8762546 [details] [diff] [review]
bitvector-0-bit-array-changes

Make BitArray use 32 bit words everywhere so it's easier to access its storage from the JITs.
Attachment #8762546 - Flags: review?(terrence)
(Assignee)

Comment 13

2 years ago
Created attachment 8762547 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer v2

Updated patch using BitArray and with review comments applied.
Attachment #8762098 - Attachment is obsolete: true
Attachment #8762547 - Flags: review?(terrence)
Comment on attachment 8762546 [details] [diff] [review]
bitvector-0-bit-array-changes

Review of attachment 8762546 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ds/BitArray.h
@@ +18,5 @@
>  template <size_t nbits>
>  class BitArray
>  {
>    private:
> +    using WordT = uint32_t;

Maybe add a comment on this to the effect that the fixed word size is to make it easier to code against in the JIT?
Attachment #8762546 - Flags: review?(terrence) → review+
Comment on attachment 8762547 [details] [diff] [review]
bitvector-2-whole-cell-store-buffer v2

Review of attachment 8762547 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/Nursery.h
@@ +217,5 @@
>      MOZ_ALWAYS_INLINE uintptr_t heapEnd() const {
>          return heapEnd_;
>      }
>  
> +    // Free space remaining, now counting chunk trailers.

s/now/not/

::: js/src/gc/StoreBuffer-inl.h
@@ +23,5 @@
> +
> +inline /* static */ void
> +ArenaCellSet::getWordIndexAndMask(size_t cellIndex, size_t* wordp, uint32_t* maskp)
> +{
> +    BitArray<ArenaCellCount>::getIndexAndMask(cellIndex, wordp, maskp);

Much nicer!
Attachment #8762547 - Flags: review?(terrence) → review+
(Assignee)

Comment 16

2 years ago
Comment on attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

Requesting additional review for JIT changes.
Attachment #8762100 - Flags: review?(jdemooij)
Comment on attachment 8762100 [details] [diff] [review]
bitvector-3-whole-cell-store-buffer-jit-constant

Review of attachment 8762100 [details] [diff] [review]:
-----------------------------------------------------------------

Cool!

::: js/src/jit/CodeGenerator.cpp
@@ +3444,5 @@
> +
> +    masm.branchTest32(Assembler::NonZero, Address(cells, offset), Imm32(mask), exit);
> +
> +    // Check whether this is the sentinel set and if so call the VM to allocate
> +    // a one for this arena.

Nit: s/a one/one/ or something similar
Attachment #8762100 - Flags: review?(jdemooij) → review+

Comment 18

2 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6a5167669bd2
Make BitArray use 32 bit words to make it easier to access from JIT code r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/b1c44ce827f7
Add an extra field to the arena header r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/85911372f276
Reimplement whole cell store buffer using a bit vector associated with the arena r=terrence
https://hg.mozilla.org/integration/mozilla-inbound/rev/b8b6dd03d8fd
Optimise post barriers in the JIT for constant objects r=terrence r=jandem

Comment 19

2 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/79328d1a536d
Fix implicit conversion constructor error r=me on a CLOSED TREE
Nice gains:
https://treeherder.allizom.org/perf.html#/alerts?id=2726

@Brian: misc 0.7 typedobj-splay-standard linux32 opt shell has regressed. I think it is a fluke and not related to what the benchmark is actually testing. And the linux64 opt shell has improved with the same revision. Can you confirm I can ignore that regression.
Flags: needinfo?(bhackett1024)
Hannes, did you mean to ni Jon instead of Brian?
Flags: needinfo?(hv1989)
(In reply to Terrence Cole [:terrence] from comment #22)
> Hannes, did you mean to ni Jon instead of Brian?

No, not at all. The testcase is a testcase that Brian added. That's why I'm NI'ed him
Flags: needinfo?(hv1989)
(In reply to Hannes Verschore [:h4writer] from comment #21)
> Nice gains:
> https://treeherder.allizom.org/perf.html#/alerts?id=2726
> 
> @Brian: misc 0.7 typedobj-splay-standard linux32 opt shell has regressed. I
> think it is a fluke and not related to what the benchmark is actually
> testing. And the linux64 opt shell has improved with the same revision. Can
> you confirm I can ignore that regression.

This should be fine to ignore, I don't know why linux32 would regress and linux64 would improve.
Flags: needinfo?(bhackett1024)
You need to log in before you can comment on or make changes to this bug.