Closed Bug 902174 Opened 6 years ago Closed 5 years ago

G1: The Good Parts

Categories

(Core :: JavaScript Engine, defect)

Tracking

RESOLVED INVALID

People

(Reporter: terrence, Assigned: terrence)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

The major problem with naive GGC -- as we have now -- is that we cannot control the time at which collection takes place. Minor GCs happen when the nursery fills up, and thus may happen at any point within a script's execution. This can, and occasionally does, leave us doing a minor GC at extremely unfortunate times. The result is that GC things allocated immediately after the last minor GC are much more likely to be dead than GC things allocated immediately before the next minor GC. This leads to a constant, if usually small, level of unnecessary tenuring, as well as occasional, catastrophic collections.

We mitigate this to a certain extent by growing and shrinking the nursery with demand; however, this relies on us having at least one unfortunate minor GC before we can correct. V8 mitigates the above problem to a somewhat larger degree by combining this optimization with a semi-space nursery. Since objects can live through two collections in that scheme, any object considered for tenuring is guaranteed to have been allocated comparatively soon after a minor GC. On the other hand, the semi-space requires a second memcpy for live objects. Because our objects are at least twice the size of V8's, a copying semi-space nursery would provide significantly less benefit for us than it does for V8.

Thus, instead of blindly copying V8, I have been looking for a different nursery enhancement that would avoid these issues and solve the underlying problem of forced GCs. I believe that a judiciously selected subset of the features of Java's "Garbage First" collector (G1) can give us the performance boost our first generation heap needs.

Allocation
----------
Instead of a single large allocation for the nursery, we will have a collection of fixed-size blocks, linked into a FIFO list. These blocks will have the same size and alignment as a Chunk (1MiB). When an allocation does not fit in the current block, we trigger a minor collection (as now); however, if the time is not right for the collection, we can skip it and allocate a new block instead. If the block allocation fails with OOM, or we exceed the first-generation size threshold, then we will force a full minor GC to free memory.
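As a rough illustration of the allocation policy above, here is a minimal C++ sketch; Nursery, NurseryBlock, and the heuristic stub are invented names for illustration, not code from the patch:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <vector>

constexpr size_t BlockSize = 1 << 20;  // 1MiB, matching a Chunk

struct NurseryBlock {
    std::vector<uint8_t> storage;
    size_t cursor = 0;
    NurseryBlock() : storage(BlockSize) {}
    // Bump allocation within the block; null when the block is full.
    void *tryAlloc(size_t nbytes) {
        if (cursor + nbytes > storage.size())
            return nullptr;
        void *p = storage.data() + cursor;
        cursor += nbytes;
        return p;
    }
};

struct Nursery {
    std::list<NurseryBlock> blocks;  // FIFO: oldest block at the front
    size_t maxBlocks;                // first-generation size threshold
    explicit Nursery(size_t maxBlocks) : maxBlocks(maxBlocks) {
        blocks.emplace_back();
    }
    // Stand-in for the G1-style "good time to collect" heuristics.
    bool goodTimeToCollect() const { return false; }
    void minorGC() { blocks.clear(); blocks.emplace_back(); }

    void *allocate(size_t nbytes) {
        if (void *p = blocks.back().tryAlloc(nbytes))
            return p;
        // Current block is full: collect if the time is right, otherwise
        // chain a new block; the size limit forces a full minor GC.
        if (goodTimeToCollect() || blocks.size() >= maxBlocks)
            minorGC();
        else
            blocks.emplace_back();
        return blocks.back().tryAlloc(nbytes);
    }
};
```

The point of the sketch is only the fast-path/fallback split: the bump allocation stays as cheap as today, and the collect-or-chain decision happens off the fast path.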

Barriers
--------
Each block and Chunk will have, in addition to the current shared JSRuntime*, a store buffer, list next/prev pointers, an isNursery bit, an isCollecting bit, and some extra stuff to help us find good times to collect. This new data will only be used on the nursery blocks, so should not impact the Chunk's capacity.
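A hypothetical sketch of that per-block data; only the field names come from the description above, and the layout is invented for illustration. The key invariant is that any cell address maps to its chunk header by masking off the low bits:

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t ChunkSize = 1 << 20;  // 1MiB
constexpr uintptr_t ChunkMask = ChunkSize - 1;

struct JSRuntimeStub;  // stand-in for JSRuntime

struct StoreBuffer { /* cell/slot/whole-object buffers, elided */ };

struct ChunkInfo {
    JSRuntimeStub *runtime;  // shared, as today
    StoreBuffer storeBuffer; // per-block; used only on nursery blocks
    ChunkInfo *next, *prev;  // FIFO list links
    bool isNursery;
    bool isCollecting;
    // ...plus bookkeeping to help find good times to collect
};

// Any cell address reaches its chunk header by masking off the low bits,
// since chunks are 1MiB-aligned.
inline uintptr_t chunkBase(uintptr_t cellAddr) {
    return cellAddr & ~ChunkMask;
}
```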

Our existing post-barrier for a write at a location |JSObject** objp;| is 4 compares, a logical op, and ~4 loads, although at least one should be in the same cache line:

    JSRuntime *rt = ((Chunk *)(uintptr_t(objp) & ~ChunkMask))->info.runtime;
    if ((objp < rt->gcNurseryStart_ || objp >= rt->gcNurseryEnd_) &&
        *objp >= rt->gcNurseryStart_ && *objp < rt->gcNurseryEnd_)
    {
        rt->gcStoreBuffer.putCell(objp);
    }

Our new post-barrier will look like 1 compare, 3 logic ops, and ~1 load:

    if ((uintptr_t(objp) & ~ChunkMask) ^ (uintptr_t(*objp) & ~ChunkMask))
    {
        ((Chunk *)(uintptr_t(*objp) & ~ChunkMask))->info.storeBuffer.putCell(objp);
    }

The new store buffer should not pose a new performance concern.
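To see why the mask-and-xor test works: two pointers lie in the same 1MiB-aligned chunk exactly when their chunk bases agree, so xoring the bases is zero for same-chunk edges and nonzero for cross-chunk edges. A plain-integer demonstration (not SpiderMonkey code):

```cpp
#include <cassert>
#include <cstdint>

constexpr uintptr_t ChunkMask = (uintptr_t(1) << 20) - 1;

// True when src and dst live in different 1MiB-aligned chunks, i.e. when
// the edge might need to be recorded in the target chunk's store buffer.
inline bool crossesChunk(uintptr_t src, uintptr_t dst) {
    return ((src & ~ChunkMask) ^ (dst & ~ChunkMask)) != 0;
}
```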

Collection
----------
Minor collections will select the first N blocks in the FIFO list, up to the first block that is not appropriate to collect. This may be zero blocks, in which case we skip the minor collection. OOM situations, limit triggers, and full GC slices will force all nursery blocks to be collected. The G1 paper includes a number of cheap heuristics for determining when it is a "good time" to collect a block: we should shamelessly steal any of these that are appropriate for SpiderMonkey and benchmark to develop further metrics as needed.
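The selection policy above amounts to a prefix scan over the FIFO; a minimal sketch, where the per-block flag is a stand-in for the G1-style heuristics:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Block {
    bool worthCollecting;  // stand-in for the per-block heuristic
};

// Walk the FIFO from the oldest block and stop at the first block the
// heuristic rejects; the prefix is the collection set.
size_t selectCollectionSet(const std::vector<Block> &fifo) {
    size_t n = 0;
    while (n < fifo.size() && fifo[n].worthCollecting)
        ++n;
    return n;  // may be zero, in which case the minor GC is skipped
}
```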

The new collection will work almost identically to our existing minor collection. The primary difference is in how we mark and sweep the store buffer. The marking will need to be cognizant of whether the source is in the collecting set at mark time. The sweeping will need to update the store buffers of the non-collected blocks to remove pointers to things we have moved.
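A toy sketch of that store-buffer fixup for a non-collected block, assuming a forwarding map from old to new cell addresses; all names here are invented for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Addr = uintptr_t;

// After moving cells out of the collected blocks, rewrite a surviving
// block's store buffer: retarget edges to moved cells via the forwarding
// map, and drop edges into collected blocks whose cells did not survive.
void sweepStoreBuffer(std::vector<Addr> &edges,
                      const std::unordered_map<Addr, Addr> &forwarded,
                      const std::unordered_set<Addr> &collected) {
    std::vector<Addr> kept;
    for (Addr edge : edges) {
        auto it = forwarded.find(edge);
        if (it != forwarded.end())
            kept.push_back(it->second);  // moved: point at the tenured copy
        else if (!collected.count(edge))
            kept.push_back(edge);        // outside the collection set: keep
        // else: pointed into a collected block and died; drop it
    }
    edges.swap(kept);
}
```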

Risks
-----
I've been mulling this over by myself for a few weeks and have not seen any serious problems, and Bill also did not see anything catastrophically wrong with the plan when I presented it to him. This is a large and complex project, however, so any feedback that could help identify problems early would be very welcome. I currently see the following as potential risks of going this route:

* It may be hard to find good metrics for when to collect.
* Highly interconnected nursery blocks could make the store buffer rewriting slow.
* I forgot some important difficulty that isn't reflected in the discussion above.
This patch converts the nursery from a linear buffer to a series of 1MiB chunks. 

IsInsideNursery is changed to check the chunk footer, so it now only works with Cell*. Most of the patch is code motion to type this path. New typed paths are added for isInRememberedSet, isNurserySlots, and isInsideSlow, which do slightly different things to handle the more subtle semantics.

This still uses a global store buffer, so the entire nursery must still be collected at once. Thus, the management of the nursery chunks is basically identical to the existing logic.

JIT support is not implemented, so build with --disable-ion --disable-baseline. We will want to implement the per-chunk store buffer logic before trying to JIT the new logic, since we would be throwing out anything that gets written against this state.

Zeal mode 7 is ignored at the moment, no attempt is made to delay freeing unused chunks, and there are a bunch of other rough edges.
Assignee: general → terrence
Status: NEW → ASSIGNED
Whiteboard: [leave open
I forgot to guard these ion bits when I added them.
Attachment #8335413 - Flags: review?(jcoppeard)
Attachment #8335413 - Flags: review?(jcoppeard) → review+
Whiteboard: [leave open → [leave open]
This is part 1 of "shrink the store buffer". The goal here is to make the store buffer < 172 bytes on debug x64 builds so that it will fit in the padding between the last arena and the chunk footer in all builds.
Attachment #8335462 - Flags: review?(jcoppeard)
This is the second half of the store buffer minification. This allocates the LifoAllocs on demand, reducing the size of the StoreBuffer from ~600 bytes to ~100 bytes, getting us to our goal.
Attachment #8335470 - Flags: review?(jcoppeard)
Create the compaction hashtables on demand. This was a bit of a premature optimization. The hashtable takes 50 bytes in StoreBuffer in debug x64 builds.
Attachment #8335473 - Flags: review?(jcoppeard)
Comment on attachment 8335462 [details] [diff] [review]
remove_duplicate_storebuffer_fields-v0.diff

Review of attachment 8335462 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!
Attachment #8335462 - Flags: review?(jcoppeard) → review+
Comment on attachment 8335470 [details] [diff] [review]
create_storebuffer_on_demand-v0.diff

Review of attachment 8335470 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/StoreBuffer.h
@@ +117,5 @@
>          void put(StoreBuffer *owner, const T &t) {
> +            if (!storage_) {
> +                storage_ = js_new<LifoAlloc>(LifoAllocBlockSize);
> +                if (!storage_)
> +                    MOZ_CRASH();

Can we allocate this ahead of time to avoid the possibility of dying here?

@@ +183,5 @@
>          void put(StoreBuffer *owner, const T &t) {
> +            if (!storage_) {
> +                storage_ = js_new<LifoAlloc>(LifoAllocBlockSize);
> +                if (!storage_)
> +                    MOZ_CRASH();

Ditto above.
Attachment #8335470 - Flags: review?(jcoppeard) → review+
Comment on attachment 8335473 [details] [diff] [review]
create_hashtables_on_demand-v0.diff

Review of attachment 8335473 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/StoreBuffer.cpp
@@ +80,5 @@
>  void
>  StoreBuffer::MonoTypeBuffer<T>::compactRemoveDuplicates(StoreBuffer *owner)
>  {
> +    EdgeSet duplicates;
> +    duplicates.init();

We need to check the return value here.

@@ +135,5 @@
>  StoreBuffer::RelocatableMonoTypeBuffer<T>::compactMoved(StoreBuffer *owner)
>  {
>      LifoAlloc &storage = *this->storage_;
> +    EdgeSet invalidated;
> +    invalidated.init();

Ditto above.
Attachment #8335473 - Flags: review?(jcoppeard) → review+
Depends on: 944040
This patch will be folded into "remove duplicate storebuffer fields" before landing. I'll probably fold the rest together too to minimize the amount of separate testing I need.

It turns out the order in which we make our assertions in the store buffer matters a whole lot and is extremely subtle. I did not get it exactly right in the previous patch and hit some strange assertions with --tbpl.

First, we push store buffer entries in NewFunction in the off-main-thread parser. These are guaranteed not to actually push into the store buffer, and since we only made the assertion after checking inRememberedSet, it was not firing. However, there is still a bug there, because putSlot between two threads could race on their ReentrancyGuard. We happened to "never" hit this before, but with the ReentrancyGuard flag commoned up now, we frequently hit putSlot with MinorGC on the main thread's stack. The solution here is to promote the assertion to a runtime check and to do it before touching any fields other than runtime_.

Next, when the store buffer is stored in the chunk trailer, only the fields which are valid in the constructor are actually valid: touching the reentrancy guard is right out. Thus, we have to move isEnabled() up too. However, we don't want to move this above the runtime_ check because then the read of enabled_ could race -- ineffectually, it turns out, but TSAN would probably still complain. Accessing runtime_ from multiple threads should still be okay, however, as chunk init should be threadsafe.

I'm sad at how subtle this has to be, so I split the logic into a separate template method that all the implementations funnel through. I think overall this makes for a nice cleanup as well as reducing the solid angle of foot available to the barriers.
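The check ordering argued for above, funneled through one shared method, can be sketched like this; the struct and its flags are stand-ins for the real StoreBuffer fields, not the patch's code:

```cpp
#include <cassert>

struct StoreBufferStub {
    bool onOwnerThread = true;  // stand-in for CurrentThreadCanAccessRuntime(runtime_)
    bool enabled = true;        // stand-in for enabled_ / isEnabled()
    bool entered = false;       // reentrancy guard flag
    int puts = 0;

    void put() {
        // 1. The off-thread check comes first: an off-main-thread writer
        //    must not touch any field past runtime_, including the guard.
        if (!onOwnerThread)
            return;
        // 2. With the buffer living in the chunk trailer, enabled_ is the
        //    next field that is safe to read, and only on the owner thread.
        if (!enabled)
            return;
        // 3. Only now is it safe to take the reentrancy guard and push.
        assert(!entered);
        entered = true;
        ++puts;
        entered = false;
    }
};
```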
Attachment #8339621 - Flags: review?(jcoppeard)
Comment on attachment 8339621 [details] [diff] [review]
common_up_storebuffer_assertions-v0.diff

Review of attachment 8339621 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, this is a nice cleanup.

::: js/src/gc/StoreBuffer.h
@@ +328,5 @@
> +         * The concurrent parsing thread cannot validly insert into the buffer,
> +         * but it should not activate the re-entrancy guard either.
> +         */
> +        if (!CurrentThreadCanAccessRuntime(runtime_)) {
> +            JS_ASSERT(!edge.inRememberedSet(nursery_));

I like this.
Attachment #8339621 - Flags: review?(jcoppeard) → review+
Blocks: GC.performance
No longer blocks: 875863
Now that we've taken extensive measurements, it turns out that our current GGC's tunings are pretty much right on for the majority of workloads. Also, it's been 10+ years and the Java team still hasn't figured out how to make G1 match the performance and latencies of a good generational scheme. The cleanups we did here in preparation for a split heap were excellent in general, so we totally want to keep them; however, I don't think we need to pursue a fully chunked heap any further.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID
Whiteboard: [leave open]