Closed Bug 1233857 Opened 4 years ago Closed 4 years ago

Remove the nursery performance cliff caused by putting large arrays in the WholeObjectBuffer

Categories

(Core :: JavaScript: GC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla47
Tracking Status
firefox46 --- affected
firefox47 --- fixed

People

(Reporter: terrence, Assigned: fitzgen, Mentored)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 6 obsolete files)

269.34 KB, image/png
Details
27.32 KB, patch
fitzgen
: review+
Details | Diff | Splinter Review
4.41 KB, patch
fitzgen
: review+
Details | Diff | Splinter Review
Problem
=======
In IonMonkey, post-write barriers for Arrays have a very clever (and surprisingly simple) implementation. Instead of storing a reference to each and every slot that was written, we just store the source object and then at GC time trace every element in the Array to discover the actual set of cross-generation edges.

This sounds, on paper, like it would be sub-optimal compared to recording the actual set of written slots; however, for most programs, some subtle aspects of our implementation make the prior implementation faster in practice. First, most Objects, including Arrays, are fairly small and are normally written to in one block of writes. Thus, the number of non-cross generational edges we visit for most objects is actually quite low in practice. Secondly, we have a one-element cache in front of the HashTable backing the StoreBuffer. Thus, subsequent writes to the same object are a nop, whereas recording the actual slots would probably result in a hash on each write.

Unfortunately, there is one serious downside to this approach: the massive performance cliff that occurs if one writes a cross generation edge into a large array that is mostly filled with same-generation edges. In this case, we spend an inordinate amount of time visiting edges that are not actually part of the remembered set.

since we've inlined the entire tenuring path, it is only 10's of ms per million edges, but that can still be extremely bad.

Solution
========

The typical solution to this problem is known as "card marking". Because we do not control our element memory, however, this approach will not work for us. We've discussed writing an elements allocator to make this possible, but there is actually another way that is much simpler: specialize the jit to use a more exact barrier when writing to a large array.

Since C++ code is already so slow, we do not bother use the whole-object buffer there and already use an exact method there. In C++, elements are represented via the HeapSlot class [1]: a wrapper around Value. As can be seen at [2], we store an exact reference to the slot when storing to an object. Given that the address of the elements vector may be relocated via realloc, we store this as the source object + offset + whether it is slots or elements being stored. The GC then re-looks-up the correct address at the time the GC happens [3].

The purpose of this bug is to teach IonMonkey how to insert into this store buffer instead of the other one when writing to a large array. Subtly, but importantly, we also need to ensure that we recompile any code that uses the small Array store buffer to use the large array store buffer when the target array gets large. There is already a sophisticated constraint engine in place to make exactly this sort of thing possible: Jan or Brian will be able to show us how this works once we get something working.

The relevant method to look at in IonMonkey is jsop_setelem [4]. The store buffers nodes are added in the implementations of setelem for concrete types that may have cross-generation edges (e.g. not TypedArrays, etc), here [5]. The source object is already in a register and the vector type is obviously "elements", so the only thing we need to do in the new case is compute the offset.

1- https://dxr.mozilla.org/mozilla-central/source/js/src/gc/Barrier.h?from=HeapSlot#639
2- https://dxr.mozilla.org/mozilla-central/source/js/src/gc/Barrier.h?from=HeapSlot#693
3- https://dxr.mozilla.org/mozilla-central/source/js/src/gc/Marking.cpp#1993
4- https://dxr.mozilla.org/mozilla-central/source/js/src/jit/IonBuilder.cpp#9620
5- https://dxr.mozilla.org/mozilla-central/source/js/src/jit/IonBuilder.cpp#9963,9995
Assignee: nobody → nfitzgerald
Status: NEW → ASSIGNED
The plan as I understand it:

1. Grab a bit from somewhere (ideally `OBJECT_FLAG_*`, but that seems to be full) that means "this is a large array". How many bits does the generation need to be? It is 2 right now, could it get away with 1 (that seems to good to be true)? Is there somewhere else we could grab a bit from?

2. Set or unset this new bit when realloc'ing array elements, since we already bailout in that case.

3. Add a new type constraint that uses the new bit.

4. Create `OutOfLineCallPostWriteHeapSlotBarrier` and `CodeGenerator::visitOutOfLineCallPostWriteHeapSlotBarrier` to call the new barrier.

5. Add plumbing to pass the offset of the slot being written to to LPostWriteBarrier.

6. In `CodeGenerator::visitPostWriteBarrierV`, check whether the new bit is set. (What about PostWriteBarrierO?)

6.a. If it is not set, proceed as we do now.

6.b. If it is set, jump to `OutOfLinePostWriteHeapSlotBarrier` instead.

-------------

:jandem: does this sound right? Any idea where I could grab a bit from for step (1)?
Flags: needinfo?(jdemooij)
Using TI is an option, but it's fairly complicated and we may invalidate/recompile more code. I'd try this simpler approach first:

(1) Add a new MIR instruction (MElementPostWriteBarrier or something) that we use when we might be accessing an element. We know this statically in IonBuilder and I think most of the (perf-critical) barriers are for slots (setprop, setgname etc.) and we don't want to regress/change those.

We will use this element barrier in IonBuilder::jsop_setelem_dense, in IonBuilder::setElemTryCache if index->type() == MIRType_Int32, and probably some similar cases.

(2) MElementPostWriteBarrier has an extra operand, the index.

(3) We add a new function to VMFunctions.h/cpp, similar to the PostWriteBarrier function there, but with an |int32_t index| argument. That function can do something like this:

if (obj->is<NativeObject>() && obj->as<NativeObject>().getDenseInitializedLength() > X)
    ...add element barrier (obj, index)...
else
    rt->gc.storeBuffer.putWholeCell(obj);

(4) The Baseline ICs for SETELEM should do something similar.

I think this shouldn't be measurably slower on any benchmarks.

---

Also, ObjectElements has a number of bits available. We could add a flag that means "object is already in the whole-cell buffer", or we could add a counter: if we barriered many elements, give up and just put it in the whole cell buffer instead.
Flags: needinfo?(jdemooij)
Attached patch WIP WIP WIP (obsolete) — Splinter Review
See Also: → 1242691
See Also: → 1244279
This is 10 runs of octane without this patch, and then 10 runs with this patch.

This patch looks promising, cleaning it up now.
This commit teaches IonMonkey how to put individual array elements' edges in the
store buffer, rather than using the whole cell buffer. This alleviates
perfomance cliffs where there are very large arrays in the tenured heap and then
the mutator adds a relatively small number of edges from this array into the
nursery. With the whole cell buffer, which was used previously, a nursery
collection would need to trace the whole large array. With this patch, only
the modified edges need by traced.
Attachment #8713868 - Flags: review?(jdemooij)
Attachment #8711205 - Attachment is obsolete: true
perf key word?
Comment on attachment 8713868 [details] [diff] [review]
Teach the JIT how to put individual elements' edges in the store buffer

Review of attachment 8713868 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks for doing this! r=me with comments below addressed.

Baseline ICs should probably do something similar, but it's great to have this fixed for Ion.

::: js/src/jit/CodeGenerator.cpp
@@ +3134,2 @@
>  void
> +CodeGenerator::visitPostWriteBarrierCommonO(LPostBarrierType* lir, OutOfLineCallType* ool)

I think we don't need the OutOfLineCallType template parameter and can use |OutOfLineCode* ool| directly.

@@ +3166,5 @@
>          masm.branchPtrInNurseryRange(Assembler::Equal, ToRegister(lir->object()), temp,
>                                       ool->rejoin());
>      }
>  
> +    ValueOperand value = ToValue(lir, LPostWriteElementBarrierV::Input);

s/LpostWriteElementBarrierV/LPostBarrierType/

@@ +3242,5 @@
> +    Register runtimereg = regs.takeAny();
> +    masm.mov(ImmPtr(GetJitContext()->runtime), runtimereg);
> +
> +    Register indexreg = ToRegister(index);
> +    regs.takeUnchecked(indexreg);

It's possible runtimereg and indexreg will be the same register, so we will pass a bogus index value to the VM function. |objreg = regs.takeAny()| has a similar problem: it could take our indexreg before we remove it from the set here. To fix these problems, we should restructure the code like this:

  Register objreg = obj->isConstant() ? InvalidReg : ToRegister(obj);
  Register indexreg = ToRegister(index);

  AllocatableGeneralRegisterSet regs(GeneralRegisterSet::Volatile());
  if (objreg != InvalidReg)
      regs.takeUnchecked(objreg);
  regs.takeUnchecked(indexreg);

After that, we can call regs.takeAny() and be sure it won't conflict with objreg/indexreg.

::: js/src/jit/MIR.h
@@ +12762,5 @@
>  
> +// Given a value being written to another object's elements at the specified
> +// index, update the generational store buffer if the value is in the nursery
> +// and object is in the tenured heap.
> +class MPostWriteElementBarrier : public MTernaryInstruction, public ObjectPolicy<0>::Data

We should use a type policy for the index as well. Even though it's always MIRType_Int32 in IonBuilder, this can change later (we can box it and end up with MIRType_Value).

So instead of ObjectPolicy<0>::Data, we can use

MixPolicy<ObjectPolicy<0>, IntPolicy<2>>::Data

If that doesn't compile/link, we likely have to add an entry to TEMPLATE_TYPE_POLICY_LIST in TypePolicy.cpp, if it's the first time we use this MixPolicy.

::: js/src/jit/VMFunctions.cpp
@@ +613,5 @@
>      MOZ_ASSERT(!IsInsideNursery(obj));
>      rt->gc.storeBuffer.putWholeCell(obj);
>  }
>  
> +static const size_t MAX_WHOLE_CELL_BUFFER_SIZE = 8196;

8196 is quite large. A pref to always use the element store buffer might be useful, to make it easier for the fuzzers to hit this case. We could add a JitOption and add a test that uses setJitCompilerOption, or something similar.

@@ +620,5 @@
> +PostWriteElementBarrier(JSRuntime* rt, JSObject* obj, size_t index)
> +{
> +    MOZ_ASSERT(!IsInsideNursery(obj));
> +    if (obj->is<NativeObject>() &&
> +        obj->as<NativeObject>().getDenseInitializedLength() > MAX_WHOLE_CELL_BUFFER_SIZE) {

Nit: { goes on next line if the condition spans multiple lines.

::: js/src/jit/shared/LIR-shared.h
@@ +6605,5 @@
> +        setOperand(1, index);
> +        setTemp(0, temp);
> +    }
> +
> +    static const size_t Input = 1;

This should be 2. It's the index of the Value operand and it comes after the object and index (the setOperand calls in the constructor).
Attachment #8713868 - Flags: review?(jdemooij) → review+
This commit teaches IonMonkey how to put individual array elements' edges in the
store buffer, rather than using the whole cell buffer. This alleviates
perfomance cliffs where there are very large arrays in the tenured heap and then
the mutator adds a relatively small number of edges from this array into the
nursery. With the whole cell buffer, which was used previously, a nursery
collection would need to trace the whole large array. With this patch, only
the modified edges need by traced.
Comment on attachment 8715516 [details] [diff] [review]
Teach the JIT how to put individual elements' edges in the store buffer; r=jandem

Jan, could you look at the register usage in CodeGenerator.cpp once again? Want to double check that I got things correct this time around.

Will make a second patch for a zeal mode to always use the slots edge store buffer.
Attachment #8715516 - Flags: review?(jdemooij)
Attachment #8713868 - Attachment is obsolete: true
This commit adds gc zeal mode 15 to force the use of the individual elements
edges barrier regardless of the size of the elements. It also adds a jit-test
which uses the zeal option. Hopefully, this will let the fuzzers go to town with
the new barrier type.
Attachment #8715532 - Flags: review?(terrence)
Fix stupid var typo and missing namespace. Turns out it is useful to build with
JS_GC_ZEAL when testing JS_GC_ZEAL specific changes...
Attachment #8715549 - Flags: review?(terrence)
Attachment #8715532 - Attachment is obsolete: true
Attachment #8715532 - Flags: review?(terrence)
Comment on attachment 8715549 [details] [diff] [review]
Follow up: Add a new GC zeal mode for the elements edges barrier

Review of attachment 8715549 [details] [diff] [review]:
-----------------------------------------------------------------

Nice. Maybe re-use mode 12, otherwise this looks great.

::: js/src/jsgc.h
@@ +1263,5 @@
>  const int ZealIncrementalMultipleSlices = 10;
>  const int ZealIncrementalMarkingValidator = 11;
>  const int ZealCheckHashTablesOnMinorGC = 13;
>  const int ZealCompactValue = 14;
> +const int ZealElementsBarrier = 15;

Note that these are not well packed. We don't move compact then when we remove a mode, so this could easily re-use 12.
Attachment #8715549 - Flags: review?(terrence) → review+
Comment on attachment 8715516 [details] [diff] [review]
Teach the JIT how to put individual elements' edges in the store buffer; r=jandem

Review of attachment 8715516 [details] [diff] [review]:
-----------------------------------------------------------------

regalloc in CodeGenerator.cpp looks good to me. Thanks!
Attachment #8715516 - Flags: review?(jdemooij) → review+
This commit teaches IonMonkey how to put individual array elements' edges in the
store buffer, rather than using the whole cell buffer. This alleviates
perfomance cliffs where there are very large arrays in the tenured heap and then
the mutator adds a relatively small number of edges from this array into the
nursery. With the whole cell buffer, which was used previously, a nursery
collection would need to trace the whole large array. With this patch, only
the modified edges need by traced.
Attachment #8715946 - Flags: review+
Attachment #8715516 - Attachment is obsolete: true
This commit adds gc zeal mode 15 to force the use of the individual elements
edges barrier regardless of the size of the elements. It also adds a jit-test
which uses the zeal option. Hopefully, this will let the fuzzers go to town with
the new barrier type.
Attachment #8715948 - Flags: review+
Attachment #8715549 - Attachment is obsolete: true
Fix dumb syntax error introduced by switching 15 -> 12.
Attachment #8715978 - Flags: review+
Attachment #8715948 - Attachment is obsolete: true
Linux jit tests are finally happy! \o/
Keywords: checkin-needed
Apparently, this made no difference on AWFY.
https://hg.mozilla.org/mozilla-central/rev/cbb480545732
https://hg.mozilla.org/mozilla-central/rev/2ada62724f2a
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
(In reply to Guilherme Lima from comment #25)
> Apparently, this made no difference on AWFY.

Yeah, not surprising. Discussed this with folks on irc and came to the conclusion that my numbers are off because my first set of octane runs were while I was browsing the internet in the background, and the second set was while I was off to lunch. Nonetheless, it still alleviates the nursery performance cliff, so that's a win :)
Depends on: 1246593
You need to log in before you can comment on or make changes to this bug.