Closed Bug 988950 Opened 6 years ago Closed 6 years ago

Improve performance of generational barriers

Categories

(Core :: JavaScript: GC, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla31

People

(Reporter: jonco, Assigned: jonco)

References

(Blocks 1 open bug)

Details

Attachments

(9 files, 3 obsolete files)

The generational post-barriers need to work out whether a pointer points from the tenured heap into the nursery.  Currently this is done by range testing on both pointers.

Now that we have a ChunkTrailer that is common between heap and nursery chunks we can put a flag there to indicate whether a chunk is part of the nursery.  This would make the check much simpler.
Assignee: nobody → jcoppeard
Attached patch bug988950-barriers (obsolete) — Splinter Review
Here's a first cut of this.

I benchmarked on x64 only which showed a promising 2.8% improvement on Richards over 20 runs.

Now looking into testing on ARM.
Comment on attachment 8400049 [details] [diff] [review]
bug988950-barriers

Review of attachment 8400049 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! Pre-emptive r=me. Please flag Jan for a second review of the jit bits.

::: js/public/HeapAPI.h
@@ +25,5 @@
>  CurrentThreadCanAccessZone(JS::Zone *zone);
>  
>  namespace gc {
>  
> +struct Cell;

\o/

We'll want to use this to kill void* use in the browser, but that's a different patch.

@@ +174,5 @@
>  #endif
>  }
>  
> +MOZ_ALWAYS_INLINE bool
> +IsInsideNursery(const JS::shadow::Runtime *runtime, const js::gc::Cell *cell)

Please remove the runtime parameter of this instance to make it harder to accidentally get the wrong variant: calling this on a malloc pointer would be disastrous and the browser is uncomfortably eager to cast gc things to void*.

::: js/src/jit/CodeGenerator.cpp
@@ +1791,5 @@
> +     * To get the address of the location field in the chunk trailer, we OR the
> +     * object address with the chunk mask to give us the address of the last
> +     * byte in the chunk, and then subtract off an offset.
> +     */
> +    const int addressOffset = 1 + gc::ChunkLocationOffset - gc::ChunkSize;

Nice!

@@ +1805,5 @@
>          Register objreg = ToRegister(lir->object());
> +        masm.movePtr(objreg, temp);
> +        masm.orPtr(Imm32(gc::ChunkMask), temp);
> +        masm.branch32(Assembler::NotEqual, Address(temp, addressOffset),
> +                      Imm32(gc::ChunkLocationTenuredHeap), ool->rejoin());

It would be interesting to see if we useRegisterAtStart and clobber instead of taking a temp would be faster and where. Probably not worth fussing over until we break down and make this all platform specific.
Attachment #8400049 - Flags: review+
Duplicate of this bug: 922354
Attached patch bug988950-approach2 (obsolete) — Splinter Review
Here's another approach based on the patch in bug 922354 (I made some changes to get this to compile for ARM).
Attached file perf-results
I did some performance testing of this, and it seems that both patches improve performance of Richards by about 5% on X64.  However the first patch is slightly slower on ARM whereas the second is slightly faster.  Overall score improvement seems to be masked by the usual variations in splay result.

I think on balance we should probably go with the second patch as it's simpler too.

It should be possible to improve this too by not making it use a temporary register where possible.
Attachment #8400049 - Attachment is obsolete: true
Comment on attachment 8401823 [details] [diff] [review]
bug988950-approach2

Requesting review for this as it is, we can take the improvement now and refine further at our leisure.
Attachment #8401823 - Flags: review?(terrence)
Comment on attachment 8401823 [details] [diff] [review]
bug988950-approach2

Requesting review for JIT changes.
Attachment #8401823 - Flags: review?(jdemooij)
Comment on attachment 8401823 [details] [diff] [review]
bug988950-approach2

Review of attachment 8401823 [details] [diff] [review]:
-----------------------------------------------------------------

r=me
Attachment #8401823 - Flags: review?(terrence) → review+
Comment on attachment 8401823 [details] [diff] [review]
bug988950-approach2

Review of attachment 8401823 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.

::: js/src/jit/CodeGenerator.cpp
@@ +1793,4 @@
>          Register objreg = ToRegister(lir->object());
> +        masm.movePtr(ImmWord(-ptrdiff_t(nursery.start())), temp);
> +        masm.addPtr(objreg, temp);
> +        masm.branchPtr(Assembler::Below, temp, ImmWord(nursery.heapEnd() - nursery.start()),

nursery.heapEnd() - nursery.start() should always fit in 32 bits right? In that case you can use Imm32 instead of ImmWord (branchPtr instead of branch32 will ensure we still compare the full 64-bit value).

This avoids an extra mov instruction on x64, because cmpq can't have a 64-bit immediate. In some cases we optimize this in the macro-assembler, but not this one, looks like.

Same for the 4 cases below.
Attachment #8401823 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #9)
> Same for the 4 cases below.

3 cases..
Just realized you may have to add cmpPtr(Register, Imm32) to the macro assemblers to get this to compile. It should be very similar to cmpPtr(Operand, Imm32) (and on 32-bit platforms exactly the same as cmp32).
Attached patch bug988950-approach2 v2 (obsolete) — Splinter Review
Thanks for the comments!  I'm asking for review again to check I got the macro assembler changes right, and as I also fixed a problem in the second half CodeGenerator::visitPostWriteBarrierV() when valuereg ended up being the same register as the temp.
Attachment #8401823 - Attachment is obsolete: true
Attachment #8405436 - Flags: review?(jdemooij)
Comment on attachment 8405436 [details] [diff] [review]
bug988950-approach2 v2

Review of attachment 8405436 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/CodeGenerator.cpp
@@ +1874,5 @@
> +     * This is a little complicated because sometimes the value object address
> +     * ends up in a temp register, and sometimes it's available in a register
> +     * already.  We can trash the temp, but not the original register.  And if
> +     * it's in the temp, we mustn't trash this before we use its value!
> +     */

Nit: //-style-comments in jit/*

@@ +1878,5 @@
> +     */
> +    Register temp = ToRegister(lir->temp());
> +    Register valuereg = masm.extractObject(value, temp);
> +    if (temp != valuereg)
> +        masm.movePtr(valuereg, temp);

You can use unboxObject so that you don't need valuereg and the if:

Register temp = ToRegister(lir->temp());
masm.unboxObject(value, temp);

@@ +1880,5 @@
> +    Register valuereg = masm.extractObject(value, temp);
> +    if (temp != valuereg)
> +        masm.movePtr(valuereg, temp);
> +    masm.rshiftPtr(Imm32(Nursery::ChunkShift), temp);
> +    masm.subPtr(Imm32(nursery.start() >> Nursery::ChunkShift), temp);

Ah, that's clever! Would be good to add a short comment explaining why Imm32 is fine on x64 (nursery.start() always fits in 47 bits for Value format) and the following assert:

MOZ_ASSERT((nursery.start() >> Nursery::ChunkShift) < INT32_MAX);

::: js/src/jit/arm/MacroAssembler-arm.h
@@ +989,5 @@
>          movePtr(imm, ScratchRegister);
>          branchPtr(cond, lhs, ScratchRegister, label);
>      }
> +    void branchPtr(Condition cond, Register lhs, Imm32 imm, Label *label) {
> +        branchPtr(cond, lhs, imm, label);

branch32
Attachment #8405436 - Flags: review?(jdemooij) → review+
Thinking about this more, I don't think you need the separate rshiftPtr + subPtr methods if you do:

masm.addPtr(ImmWord(-ptrdiff_t(nursery.start())), temp);

On x64 we have a dedicated scratch register, and addPtr will use it internally if nursery.start() doesn't fit in 32-bits.
Blocks: 989414
To guide further optimisation, I gathered some stats for postbarrier instructions for the v8 benchmarks.  o is the object whose property is being written and v is the value being written to it:

                           instrs   vObj  oCnst oNrs  vNurs taken
               Richards:    1623943  48.8   0.0 100.0 100.0   0.0
              DeltaBlue:    1048647  98.4   0.0  99.2  75.8   0.5
                 Crypto:      20346 100.0   0.0  99.9 100.0   0.1
               RayTrace:    1585805 100.0   0.0 100.0  73.1   0.0
            EarleyBoyer:   14206107  47.0   8.5  91.5  90.5   8.5
                 RegExp:          0
                  Splay:    2035582  87.4   0.0  35.1  34.5   8.2
           NavierStokes:          0

                  Total:   20520430  57.9   5.9  87.6  83.6   6.7

    instrs: postbarrier instructions executed
    vObj:   LPostWriteBarrierO instruction rather than LPostWriteBarrierV
    oCnst:  o is known to be a constant
    oNurs:  o was in the nursery
    vNurs:  v was in the nursery
    taken:  the postbarrier was called

The last five columns are percentages.
Great work! That is beautify data. It may be time for us to start thinking about what sort of static or dynamic analyses could let us elide barriers in the common cases. Maybe we could look at the top couple of instances in E-B to see what the JS looks like?
Attached file barrier-assembly.txt
Here's the code we currently generate for these barriers on different architectures.
In the value postbarrier, we currently check the whether the value is an object before we check whether the object whose property is being set is tenured.

For the v8 benchmark, the first check results in us skipping the barrier 43% of the time, whereas the second check lets us skip the barrier 87% of the time.

This suggests we should do these checks the other way around, to allow us to skip the barrier check earlier more frequently.
Attachment #8410374 - Flags: review?(terrence)
Factor out checks nursery checks into two new methods in the macro assembler.
Attachment #8410376 - Flags: review?(jdemooij)
Move these nursery checks to architecture specific classes to allow optimization per-architecture (no code changes).
Attachment #8410378 - Flags: review?(jdemooij)
Use the scratch register rather than requiring a temp on x64 and ARM.
Attachment #8410380 - Flags: review?(jdemooij)
Attached patch 5-improve-x64Splinter Review
Improve x86 value test by combining tag and payload comparisons.
Attachment #8410386 - Flags: review?(jdemooij)
Comment on attachment 8410376 [details] [diff] [review]
1-add-nursery-check-instruction

Review of attachment 8410376 [details] [diff] [review]:
-----------------------------------------------------------------

Nice.

::: js/src/jit/IonMacroAssembler.cpp
@@ +1790,5 @@
> +#ifdef JSGC_GENERATIONAL
> +
> +void
> +MacroAssembler::branchPtrInNurseryRange(Register ptr, Register temp, Label *label)
> +{

You can also call this method in ICStubCompiler::emitPostWriteBarrierSlot in BaselineIC.cpp. The value check there branches if *not* in the nursery; that one doesn't need to be changed in this patch.

(Also, I just noticed that the barriers in BaselineCompiler.cpp check that the object is tenured and one of them checks that the value is an object, but we don't ensure it's a nursery object, maybe worth fixing..)
Attachment #8410376 - Flags: review?(jdemooij) → review+
Comment on attachment 8410374 [details] [diff] [review]
0-check-value-type-after-object

Review of attachment 8410374 [details] [diff] [review]:
-----------------------------------------------------------------

A zero line patch that provides a provable performance boost? Brilliant! r=me
Attachment #8410374 - Flags: review?(terrence) → review+
Comment on attachment 8410378 [details] [diff] [review]
2-split-macro-assembler-methods

Review of attachment 8410378 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/MacroAssembler-arm.cpp
@@ +4342,5 @@
> +    JS_ASSERT(ptr != temp);
> +    JS_ASSERT(temp != InvalidReg);
> +
> +    const Nursery &nursery = GetIonContext()->runtime->gcNursery();
> +    movePtr(ImmWord(-ptrdiff_t(nursery.start())), temp);

Pre-existing, but will this do the right thing if nursery.start() has the upper bit set?
Attachment #8410378 - Flags: review?(jdemooij) → review+
Comment on attachment 8410380 [details] [diff] [review]
3-use-scratch-reg-where-possible

Review of attachment 8410380 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/x86/MacroAssembler-x86.cpp
@@ +433,5 @@
>  void
>  MacroAssemblerX86::branchPtrInNurseryRange(Register ptr, Register temp, Label *label)
>  {
>      JS_ASSERT(ptr != temp);
> +    JS_ASSERT(temp != InvalidReg);  // A temp register is requied for x86.

Nit: required
Attachment #8410380 - Flags: review?(jdemooij) → review+
Comment on attachment 8410380 [details] [diff] [review]
3-use-scratch-reg-where-possible

Review of attachment 8410380 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/x86/MacroAssembler-x86.cpp
@@ +449,5 @@
>  
>      branchTestObject(Assembler::NotEqual, value, &done);
>  
>      Register obj = extractObject(value, temp);
> +    branchPtrInNurseryRange(obj, temp, label);

Sorry I didn't notice this until I got to the last patch, but instead of extractObject you can pass value.payloadReg() to branchPtrInNurseryRange. Same for ARM/MIPS.
Comment on attachment 8410386 [details] [diff] [review]
5-improve-x64

Review of attachment 8410386 [details] [diff] [review]:
-----------------------------------------------------------------

This is cool!

::: js/src/jit/x64/Lowering-x64.h
@@ +35,5 @@
>  
>      LDefinition tempToUnbox();
>  
>      bool needTempForObjectInNurseryRange() { return false; }
> +    bool needTempForValueIsNurseryObject() { return false; }

With this change, these two return the same value on x86/x64/ARM. Maybe we can have a single needTempForPostBarrier instead?
Attachment #8410386 - Flags: review?(jdemooij) → review+
(In reply to Jan de Mooij [:jandem] from comment #25)

Thanks for all the reviews!

I filed bug 1000100 for the baseline improvements.

>> +    movePtr(ImmWord(-ptrdiff_t(nursery.start())), temp);
> Pre-existing, but will this do the right thing if nursery.start() has the upper bit set?

On ARM the result of -ptrdiff_t(nursery.start()) ends up for example as 0x4a500000, so I think this works.

Other comments addressed.
Depends on: 1000712
This added two build warnings, as Nursery is now unused in BaselineCompiler.cpp:

In file included from /home/ben/code/moz/builds/d64/js/src/Unified_cpp_js_src2.cpp:67:0:
/home/ben/code/moz/inbound/js/src/jit/BaselineCompiler.cpp: In member function ‘bool
js::jit::BaselineCompiler::emit_JSOP_SETALIASEDVAR()’:
/home/ben/code/moz/inbound/js/src/jit/BaselineCompiler.cpp:2105:14: warning: unused variable ‘nursery’
[-Wunused-variable]
     Nursery &nursery = cx->runtime()->gcNursery;
              ^
In file included from /home/ben/code/moz/builds/d64/js/src/Unified_cpp_js_src2.cpp:67:0:
/home/ben/code/moz/inbound/js/src/jit/BaselineCompiler.cpp: In member function ‘bool
js::jit::BaselineCompiler::emitFormalArgAccess(uint32_t, bool)’:
/home/ben/code/moz/inbound/js/src/jit/BaselineCompiler.cpp:2425:18: warning: unused variable ‘nursery’
[-Wunused-variable]
         Nursery &nursery = cx->runtime()->gcNursery;
                  ^
Attachment #8411612 - Flags: review?(jdemooij)
Assignee: jcoppeard → benj
Status: NEW → ASSIGNED
Assignee: benj → jcoppeard
oops, sorry didn't mean to assign the bug to myself, bzexport did that in my behalf :/
Attachment #8411612 - Flags: review?(jdemooij) → review+
Attachment #8405436 - Attachment is obsolete: true
Attached patch 4-improve-armSplinter Review
This is only a small win, but by doing the comparison in terms of chunks rather than raw addresses, we can load the nursery start in one MOV rather than two, saving in total two instructions for each barrier.
Attachment #8412535 - Flags: review?(mrosenberg)
Comment on attachment 8412535 [details] [diff] [review]
4-improve-arm

Review of attachment 8412535 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/arm/MacroAssembler-arm.cpp
@@ +4361,5 @@
> +
> +    ma_mov(Imm32(startChunk), ScratchRegister);
> +    as_rsb(ScratchRegister, ScratchRegister, lsr(ptr, Nursery::ChunkShift));
> +    branch32(cond == Assembler::Equal ? Assembler::Below : Assembler::AboveOrEqual,
> +              ScratchRegister, Imm32(Nursery::NumNurseryChunks), label);

There should be an assert that Nursery::NumNurseryChunks < 256, because things will go very wrong (ScratchRegister will get re-used to load the constant)
Alternately, you can use SecondScratchRegister, and not worry about it.
Attachment #8412535 - Flags: review?(mrosenberg) → review+
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [leave open]
Target Milestone: --- → mozilla31
Blocks: 1507445
You need to log in before you can comment on or make changes to this bug.