1297122 - Use a separate thread for garbage collection

Reporter

Description

•

8 years ago

We need to be able to yield to the mutator during GC and at the moment we do that by having this state machine in GCRuntime::incrementalSweepSlice and its associated methods. The main part of this isn't too bad, but sweepPhase in particular is particularly complicated. Collector state has to be stored as part of GCRuntime. Instead, we should run the GC on a completely separate thread and use standard mutex/conditional variables to hand control between them when necessary. The threads would not run concurrently (at the moment, although this could be seen as a starting point for a more concurrent GC). This would however allow collector state to be stored on the stack of the GC thread. This would make the implementation much simpler and more understandable.

Jon Coppeard (:jonco)

Reporter

Comment 1

•

8 years ago

Attached patch gc-thread WIP — Details — Splinter Review

Work in progress patch to do GC on a separate thread. Passes SM tests but crashes in mach package.

Nick Fitzgerald [:fitzgen] [⏰PST; UTC-8]

Comment 2

•

8 years ago

ron-paul-its-happening.gif!

Nick Fitzgerald [:fitzgen] [⏰PST; UTC-8]

Comment 3

•

8 years ago

Comment on attachment 8802212 [details] [diff] [review] gc-thread WIP Review of attachment 8802212 [details] [diff] [review]: ----------------------------------------------------------------- Sorry to drive by your WIP patch, but I'm very excited for this change :) ::: js/src/gc/GCRuntime.h @@ +1208,5 @@ > + ConditionVariable mutatorWakeup; > + bool collectorThreadStarted; > + bool shutdownCollector; > + SliceBudget* budgetArg; > + JS::gcreason::Reason reasonArg; Somewhere, maybe here, it would be *great* to have a big overview comment that documents the architecture and how the mutator and collector interact and transfer control to and from one another. @@ +1211,5 @@ > + SliceBudget* budgetArg; > + JS::gcreason::Reason reasonArg; > + > +#ifdef DEBUG > + bool collectorThreadRunning; Is this accessed only by the main thread? Is it protected by the GC lock? Should it be atomic? I would expect some kind of comment documenting these things. ::: js/src/jsgc.cpp @@ +5858,5 @@ > + } > + } > + > + if (collectorThread.joinable()) > + collectorThread.join(); Should we unset `collectorThreadStarted` in this function? @@ +5917,5 @@ > + > +void > +GCRuntime::yieldToMutator(AutoLockGC& lock) > +{ > + AutoSetCollectorThreadRunning runState(this, false, lock); Maybe an enum would make this easier to read? enum class CollectorThreadState { Running, Paused, } @@ +5919,5 @@ > +GCRuntime::yieldToMutator(AutoLockGC& lock) > +{ > + AutoSetCollectorThreadRunning runState(this, false, lock); > + mutatorWakeup.notify_one(); > + collectorWakeup.wait(lock.guard()); I'm pretty sure all of these condvar waits need to have a predicate to protect against spurious wake ups. @@ +5947,5 @@ > + budgetArg = &budget; > + reasonArg = reason; > + AutoUnlockForExclusiveAccess unlock(rt, lock); > + AutoLockGC lockgc(rt); > + yieldToCollector(lockgc); Lock ordering is scary, especially with auto-unlocking. We really need to enforce the ordering somehow (I saw you filed a bug recently). I thought that AutoLockForExclusiveAccess took an AutoLockGC& to enforce ordering between them, but it seems like that is not actually the case. Maybe we could dynamically build up a graph of lock ordering that we actually do in practice inside js::Mutex and then if we ever detect a cycle in that graph, panic and crash. @@ +5971,5 @@ > + break; > + > + MOZ_ASSERT(rt->isHeapCollecting()); > + AutoUnlockGC unlock(lock); > + AutoLockForExclusiveAccess lock(rt); So we have to acquire the exclusiveAccessLock before the gc lock. My naive assumption was that the gc lock is either held by the mutator or the collector, and whoever holds it is the one running. But giving up the gc lock to take the exclusiveAccessLock means that we have a third state which is when the collector is running but does not have the gc lock. Do we ever take the gc lock while holding the exclusiveAccessLock? If not, maybe we can change the ordering here. If so, I don't really know... This isn't really a constructive comment other than "the locking is a little more complicated than I'd hoped" and "maybe we can do something about it?"

Andrew McCreight [:mccr8]

Comment 4

•

8 years ago

"Maybe we could dynamically build up a graph of lock ordering that we actually do in practice inside js::Mutex and then if we ever detect a cycle in that graph, panic and crash." There is a deadlock detector in XPCOM that I think works in that fashion, in xpcom/glue/DeadlockDetector.h. That might be something to look at and/or try to incorporate.

Nick Fitzgerald [:fitzgen] [⏰PST; UTC-8]

Updated

•

8 years ago

Comment 5

•

8 years ago

(In reply to Nick Fitzgerald [:fitzgen] [⏰PDT; UTC-7] from comment #3) Thanks for the comments! > I'm pretty sure all of these condvar waits need to have a predicate to > protect against spurious wake ups. That's interesting. I did have a check here originally but never saw it fail. If it's possible though then I think there are several places in the code where we need to do this.

Steve Fink [:sfink] [:s:]

Updated

•

8 years ago

Blocks: 1311731

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 6

•

8 years ago

(In reply to Andrew McCreight [:mccr8] from comment #4) > "Maybe we could dynamically build up a graph of lock ordering that we > actually do in practice inside js::Mutex and then if we ever detect a cycle > in that graph, panic and crash." > > There is a deadlock detector in XPCOM that I think works in that fashion, in > xpcom/glue/DeadlockDetector.h. That might be something to look at and/or try > to incorporate. We used to have static assertions about lock ordering, but I can't find them anymore. I hope they weren't removed. See the canLock assertions in bug 928050.

Andrew McCreight [:mccr8]

Comment 7

•

8 years ago

(In reply to Bill McCloskey (:billm) from comment #6) > We used to have static assertions about lock ordering, but I can't find them > anymore. I hope they weren't removed. See the canLock assertions in bug > 928050. FWIW, it looks like they were removed in bug 1290589. "PTHREAD_MUTEX_ERRORCHECK gives us this error checking against reentrancy and unlocking an unlocked lock or another thread's lock already, so it isn't needed. This also makes the *CanLock assertions no-ops."

Nick Fitzgerald [:fitzgen] [⏰PST; UTC-8]

Comment 8

•

8 years ago

IIRC, the assertions I removed were all about whether or not locks were being held by the correct thread when acquiring/releasing, and checks against reentrancy. We didn't have much for general ordering. Either way, jonco has added a proper static ordering in bug bug 1309909.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 9

•

8 years ago

Hmm. Bug 928050 seems to be one of those cases where the patch Brian landed was different than the one I reviewed. Somehow the assertions got weaker.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 10

•

8 years ago

Any progress on this Jon?

Flags: needinfo?(jcoppeard)

Jon Coppeard (:jonco)

Reporter

Comment 11

•

8 years ago

(In reply to Bill McCloskey (:billm) from comment #10) I'm working on it. One issue is that you need a separate GC thread per ongoing incremental GC, because the state is all stored on the stack. My idea for this is to have one dedicated GC thread for the main thread and one GC thread for all other threads in the system, since having an extra thread per JS runtime/content seems to much. I'm testing sharing a GC thread between all threads in the system and currently this is causing deadlocks on try.

Flags: needinfo?(jcoppeard)

Jon Coppeard (:jonco)

Reporter

Updated

•

8 years ago

Depends on: 1337450

Jon Coppeard (:jonco)

Reporter

Comment 12

•

8 years ago

This is proving problematic because we sometime leak JSRuntimes. If we do this to a runtime which is currently GCing, we effectively leak the GC thread as well and deadlock when other threads try to do shutdown GCs.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 13

•

8 years ago

What kind of GC runtime is leaking (worker, main thread, something else)? All the ones I know of have pretty clear lifetimes. When we shut down the JSContext, we're not going to be doing anything on the runtime anymore.

Firefox Bug Husbandry Bot

Updated

•

7 years ago

Keywords: triage-deferred

Priority: -- → P3

Jon Coppeard (:jonco)

Reporter

Comment 14

•

4 years ago

This didn't work out. Some things got easier with a separate thread, but it also made some things more difficult.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → WONTFIX