Use a separate thread for garbage collection

NEW
Unassigned

Status

()

Core
JavaScript: GC
P3
normal
a year ago
4 months ago

People

(Reporter: jonco, Unassigned)

Tracking

(Blocks: 1 bug, {triage-deferred})

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

a year ago
We need to be able to yield to the mutator during GC and at the moment we do that by having this state machine in GCRuntime::incrementalSweepSlice and its associated methods.  The main part of this isn't too bad, but sweepPhase in particular is particularly complicated.  Collector state has to be stored as part of GCRuntime.

Instead, we should run the GC on a completely separate thread and use standard mutex/conditional variables to hand control between them when necessary.  The threads would not run concurrently (at the moment, although this could be seen as a starting point for a more concurrent GC).  This would however allow collector state to be stored on the stack of the GC thread.  This would make the implementation much simpler and more understandable.
(Reporter)

Comment 1

a year ago
Created attachment 8802212 [details] [diff] [review]
gc-thread WIP

Work in progress patch to do GC on a separate thread.  Passes SM tests but crashes in mach package.
ron-paul-its-happening.gif!
Comment on attachment 8802212 [details] [diff] [review]
gc-thread WIP

Review of attachment 8802212 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry to drive by your WIP patch, but I'm very excited for this change :)

::: js/src/gc/GCRuntime.h
@@ +1208,5 @@
> +    ConditionVariable mutatorWakeup;
> +    bool collectorThreadStarted;
> +    bool shutdownCollector;
> +    SliceBudget* budgetArg;
> +    JS::gcreason::Reason reasonArg;

Somewhere, maybe here, it would be *great* to have a big overview comment that documents the architecture and how the mutator and collector interact and transfer control to and from one another.

@@ +1211,5 @@
> +    SliceBudget* budgetArg;
> +    JS::gcreason::Reason reasonArg;
> +
> +#ifdef DEBUG
> +    bool collectorThreadRunning;

Is this accessed only by the main thread? Is it protected by the GC lock? Should it be atomic?

I would expect some kind of comment documenting these things.

::: js/src/jsgc.cpp
@@ +5858,5 @@
> +        }
> +    }
> +
> +    if (collectorThread.joinable())
> +        collectorThread.join();

Should we unset `collectorThreadStarted` in this function?

@@ +5917,5 @@
> +
> +void
> +GCRuntime::yieldToMutator(AutoLockGC& lock)
> +{
> +    AutoSetCollectorThreadRunning runState(this, false, lock);

Maybe an enum would make this easier to read?

  enum class CollectorThreadState {
      Running,
      Paused,
  }

@@ +5919,5 @@
> +GCRuntime::yieldToMutator(AutoLockGC& lock)
> +{
> +    AutoSetCollectorThreadRunning runState(this, false, lock);
> +    mutatorWakeup.notify_one();
> +    collectorWakeup.wait(lock.guard());

I'm pretty sure all of these condvar waits need to have a predicate to protect against spurious wake ups.

@@ +5947,5 @@
> +    budgetArg = &budget;
> +    reasonArg = reason;
> +    AutoUnlockForExclusiveAccess unlock(rt, lock);
> +    AutoLockGC lockgc(rt);
> +    yieldToCollector(lockgc);

Lock ordering is scary, especially with auto-unlocking. We really need to enforce the ordering somehow (I saw you filed a bug recently). I thought that AutoLockForExclusiveAccess took an AutoLockGC& to enforce ordering between them, but it seems like that is not actually the case.

Maybe we could dynamically build up a graph of lock ordering that we actually do in practice inside js::Mutex and then if we ever detect a cycle in that graph, panic and crash.

@@ +5971,5 @@
> +            break;
> +
> +        MOZ_ASSERT(rt->isHeapCollecting());
> +        AutoUnlockGC unlock(lock);
> +        AutoLockForExclusiveAccess lock(rt);

So we have to acquire the exclusiveAccessLock before the gc lock. My naive assumption was that the gc lock is either held by the mutator or the collector, and whoever holds it is the one running. But giving up the gc lock to take the exclusiveAccessLock means that we have a third state which is when the collector is running but does not have the gc lock. Do we ever take the gc lock while holding the exclusiveAccessLock? If not, maybe we can change the ordering here. If so, I don't really know...

This isn't really a constructive comment other than "the locking is a little more complicated than I'd hoped" and "maybe we can do something about it?"
"Maybe we could dynamically build up a graph of lock ordering that we actually do in practice inside js::Mutex and then if we ever detect a cycle in that graph, panic and crash."

There is a deadlock detector in XPCOM that I think works in that fashion, in xpcom/glue/DeadlockDetector.h. That might be something to look at and/or try to incorporate.
(Reporter)

Comment 5

a year ago
(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #3)
Thanks for the comments!

> I'm pretty sure all of these condvar waits need to have a predicate to
> protect against spurious wake ups.

That's interesting.  I did have a check here originally but never saw it fail.  If it's possible though then I think there are several places in the code where we need to do this.
Blocks: 1311731
(In reply to Andrew McCreight [:mccr8] from comment #4)
> "Maybe we could dynamically build up a graph of lock ordering that we
> actually do in practice inside js::Mutex and then if we ever detect a cycle
> in that graph, panic and crash."
> 
> There is a deadlock detector in XPCOM that I think works in that fashion, in
> xpcom/glue/DeadlockDetector.h. That might be something to look at and/or try
> to incorporate.

We used to have static assertions about lock ordering, but I can't find them anymore. I hope they weren't removed. See the canLock assertions in bug 928050.
(In reply to Bill McCloskey (:billm) from comment #6)
> We used to have static assertions about lock ordering, but I can't find them
> anymore. I hope they weren't removed. See the canLock assertions in bug
> 928050.

FWIW, it looks like they were removed in bug 1290589. "PTHREAD_MUTEX_ERRORCHECK gives us this error checking against reentrancy and unlocking an unlocked lock or another thread's lock already, so it isn't needed. This also makes the *CanLock assertions no-ops."
IIRC, the assertions I removed were all about whether or not locks were being held by the correct thread when acquiring/releasing, and checks against reentrancy. We didn't have much for general ordering. Either way, jonco has added a proper static ordering in bug bug 1309909.
Hmm. Bug 928050 seems to be one of those cases where the patch Brian landed was different than the one I reviewed. Somehow the assertions got weaker.
Any progress on this Jon?
Flags: needinfo?(jcoppeard)
(Reporter)

Comment 11

a year ago
(In reply to Bill McCloskey (:billm) from comment #10)
I'm working on it.  One issue is that you need a separate GC thread per ongoing incremental GC, because the state is all stored on the stack.  My idea for this is to have one dedicated GC thread for the main thread and one GC thread for all other threads in the system, since having an extra thread per JS runtime/content seems to much.

I'm testing sharing a GC thread between all threads in the system and currently this is causing deadlocks on try.
Flags: needinfo?(jcoppeard)
(Reporter)

Updated

a year ago
Depends on: 1337450
(Reporter)

Comment 12

10 months ago
This is proving problematic because we sometime leak JSRuntimes.  If we do this to a runtime which is currently GCing, we effectively leak the GC thread as well and deadlock when other threads try to do shutdown GCs.
What kind of GC runtime is leaking (worker, main thread, something else)? All the ones I know of have pretty clear lifetimes. When we shut down the JSContext, we're not going to be doing anything on the runtime anymore.
Keywords: triage-deferred
Priority: -- → P3
You need to log in before you can comment on or make changes to this bug.