Closed Bug 1064578 (ParallelSweeping) Opened 10 years ago Closed 10 years ago

Do embarrassingly parallel non-incrementalizable sweeping phases in parallel

Categories

(Core :: JavaScript: GC, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: terrence, Assigned: terrence)

References

(Blocks 1 open bug)

Details

Attachments

(8 files, 5 obsolete files)

2.26 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
7.32 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
2.77 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
9.63 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
6.20 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
12.33 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
10.13 KB, patch
jonco
: review+
terrence
: checkin+
Details | Diff | Splinter Review
49.86 KB, patch
jonco
: review+
bhackett1024
: review+
Details | Diff | Splinter Review
I think we could safely parallelize sweeping of the atoms table, types, analysis data, maybe compartments, and probably other things. this may be an easy way to cut a few ms off our max pause.
Blocks: GC.60fps
No longer blocks: GC.performance
This is a great idea!  I bet there is lots of stuff that can't be run in parallel with the mutator but doesn't actually require the main thread.
Alias: ConcurrentSweeping
Shouldn't that be ParallelSweeping?
Eh, I don't really care too much what we call it. I guess we shouldn't call it BackgroundSweeping since that's already taken.
So the idea is that the mutator is stopped, but the GC is sweeping things in multiple threads at once?
(In reply to Andrew McCreight [:mccr8] from comment #4)
> So the idea is that the mutator is stopped, but the GC is sweeping things in
> multiple threads at once?

Yes. In particular, we have a handful of caches that are very slow to sweep. These should only be doing reads from the mark bits, so should be relatively easy to do in parallel.
Ok, cool.  I think according to standard GC terminology, "background sweeping" is really "concurrent sweeping" and this is "parallel sweeping".
Alias: ConcurrentSweeping → ParallelSweeping
Jon, does this transformation make sense?

The watchpoint and debugger maps appear to be global, yet we sweep them out at the start of every zone group. It seems like the first pass is going to clear out everything that isn't dead at the end of the MARK phase. Anything we insert inbetween SWEEP slices seems like it will be live, so won't need sweeping out. What am I missing?
Attachment #8492467 - Flags: feedback?(jcoppeard)
Comment on attachment 8492467 [details] [diff] [review]
dont_repeat_watchpoint_and_debugger_sweep-v0.diff

Review of attachment 8492467 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Terrence Cole [:terrence] from comment #7)

Sadly this doesn't work.  These sweepAll() methods iterate through all debuggers/watchpoints and finalize any that are being swept in the current zone group, which is why this is called for every zone group.  We can't finalize things in zone group which we are still marking in case something didn't get marked yet.

The Watchpoints code is easily fixable so it only sweep the current zone group, and the debugger may be fixable too with some work.  I think the reason this hasn't been done before though is that in the common use case there aren't any debuggers or watchpoints, so we haven't spent time optimising this.
Attachment #8492467 - Flags: feedback?(jcoppeard) → feedback-
That makes sense, thanks for the explanation! I'll add a comment.
Attachment #8492467 - Attachment is obsolete: true
This splits zone->discardJITCode, comp->sweep, and zone->sweep into different sections and hoists their AutoPhase for later.
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Attachment #8494681 - Flags: review?(jcoppeard)
This makes JSCompartment::sweep fine-grained.
Attachment #8494682 - Flags: review?(jcoppeard)
This splits the comp->sweep section into the parts that are delineated by sub-sections and those that are not accounted separately. This also moves the two full-runtime sweep methods down next to the "misc" section for later.
Attachment #8494692 - Flags: review?(jcoppeard)
Attached patch reorganize_4_drop_tables-v0.diff (obsolete) — Splinter Review
First, drop the SWEEP_TABLES section and let the individual table sweeping bits parent to SWEEP_COMPARTMENT. It was never really a distinct a portion of JSCompartment::sweep and the way we're breaking things up across phases for parallelization makes it twice as annoying to have a second level here when we're already crossing one to sweep atoms in parallel.

Second, move all of the sweep tables sections that we want to parallelize into separate loops. This will allow us to at least sweep all zones in a zone-group in parallel, even if we can't break it up further.

Third, move all of the AutoPhase bits out of called functions and into beginSweepingZoneGroups. This is a bit longer now, but leads to a very nice simplification later.
Attachment #8494702 - Flags: review?(jcoppeard)
This patch looks enormous, but it's really just an indentation change. You might want to grab this one locally and qdiff -w to see what's really going on. Basically, it's (1) indenting everything in the compartment phase and commoning up the AutoPhase and AutoSCC and (2) moving sweepSymbols and discardJITCode down to the section that will be swept on the main thread eventually.
Attachment #8494709 - Flags: review?(jcoppeard)
This does the same for Zone::sweep -- splitting it into sweepBreakpoint and sweepAnalysis (as well as adding some encapsulation that I'm surprised wasn't already there). I also finally got around to adding the comment I promised to earlier.
Attachment #8494718 - Flags: review?(jcoppeard)
In order to give the phase times the actual time instead of time on the main thread, I added an endParallelPhase and associated AutoParallalPhase and AutoMaybeParallelPhase. This was an absurd amount of duplication though, so instead I think we should reify and give AutoPhase extra constructors to handle the Maybe and Parallel workloads with the same class. This patch does that.
Attachment #8494722 - Flags: review?(jcoppeard)
The reason I needed both Auto and AutoMaybe is because atoms sweeping uses an AutoPhase instead of an AutoMaybePhase(!isHeapCompacting()), like every other instance in that method. I'm totally baffled as to why this works if we need the others unless we're never getting the atoms zone in compacting. I guess this could be because there aren't objects in the atoms compartment? In any case, is it okay if I change the atoms phase guard to an AutoMaybePhase(!isHeapCompacting())?
Flags: needinfo?(jcoppeard)
The next patch in the queue is the actual parallelization bits. It's passing all jit-tests, jsapi-tests, jsreftests and can run gmail/cnn/bbc in the browser, but is still hitting some assertions on Try:
  https://tbpl.mozilla.org/?tree=Try&rev=1a785a06e03a.

These all appear to be pretty simple, so hopefully won't take too long to clean up.
Attachment #8494681 - Flags: review?(jcoppeard) → review+
Attachment #8494682 - Flags: review?(jcoppeard) → review+
Comment on attachment 8494692 [details] [diff] [review]
reorganize_3-v0.diff

Review of attachment 8494692 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  I'm not sure why some things are accounted separately and some are not - it look arbitrary to me.  We may want to add stats phases for those if this unaccounted section is taking significant time.
Attachment #8494692 - Flags: review?(jcoppeard) → review+
Comment on attachment 8494702 [details] [diff] [review]
reorganize_4_drop_tables-v0.diff

Review of attachment 8494702 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +2366,5 @@
>          } else {
>              /* Update cross compartment wrappers into moved zones. */
> +            for (CompartmentsInZoneIter c(zone); !c.done(); c.next()) {
> +                gcstats::MaybeAutoPhase ap(stats, !rt->isHeapCompacting(),
> +                                           gcstats::PHASE_SWEEP_TABLES_WRAPPER);

Wait, PHASE_SWEEP_TABLES_WRAPPER just got renamed to PHASE_SWEEP_CC_WRAPPER - is this the right version of the patch?
Attachment #8494702 - Flags: review?(jcoppeard)
I honestly have no idea how that hunk got there.
Attachment #8494702 - Attachment is obsolete: true
Attachment #8495317 - Flags: review?(jcoppeard)
Flags: needinfo?(jcoppeard)
D'oh, didn't mean to remove the ni?.
Flags: needinfo?(jcoppeard)
And fixup 7 from the changes in 4.
Attachment #8494722 - Attachment is obsolete: true
Attachment #8494722 - Flags: review?(jcoppeard)
Attachment #8495319 - Flags: review?(jcoppeard)
Comment on attachment 8495317 [details] [diff] [review]
reorganize_4_drop_tables-v1.diff

Review of attachment 8495317 [details] [diff] [review]:
-----------------------------------------------------------------

Great, I prefer having the AutoPhases in jsgc.cpp.
Attachment #8495317 - Flags: review?(jcoppeard) → review+
Comment on attachment 8494709 [details] [diff] [review]
reorganize_5_serial_to_the_back-v0.diff

Review of attachment 8494709 [details] [diff] [review]:
-----------------------------------------------------------------

TIL about diff -w...
Attachment #8494709 - Flags: review?(jcoppeard) → review+
Attachment #8494718 - Flags: review?(jcoppeard) → review+
Comment on attachment 8495317 [details] [diff] [review]
reorganize_4_drop_tables-v1.diff

Review of attachment 8495317 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +4660,5 @@
>  
>      {
>          gcstats::AutoPhase ap(stats, gcstats::PHASE_SWEEP_COMPARTMENTS);
>          gcstats::AutoSCC scc(stats, zoneGroupIndex);
> +        gcstats::MaybeAutoPhase apiv(stats, !isHeapCompacting(),

I didn't spot this the first time, but we don't need the checks for !isHeapCompacting() here because this method is not called during compaction, as this calls the sweep methods directly.
Comment on attachment 8495319 [details] [diff] [review]
reorganize_7_merge_maybe_phase-v1.diff

Review of attachment 8495319 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, this makes sense, but see previous comment - in beginSweepingZoneGroup() we don't need to use the maybe constructor.
Attachment #8495319 - Flags: review?(jcoppeard) → review+
(In reply to Terrence Cole [:terrence] from comment #17)

As explained above we don't need this condition here at all.
Flags: needinfo?(jcoppeard)
Attached patch sweep_tables_in_parallel-v0.diff (obsolete) — Splinter Review
Got a green try yesterday, but one more to be sure after the rebase:
https://tbpl.mozilla.org/?tree=Try&rev=80bfe2babd65
Attachment #8496146 - Flags: review?(jcoppeard)
Attachment #8496146 - Flags: review?(bhackett1024)
Comment on attachment 8496146 [details] [diff] [review]
sweep_tables_in_parallel-v0.diff

Review of attachment 8496146 [details] [diff] [review]:
-----------------------------------------------------------------

I just looked at the helper threads stuff.  Generally this looks very good but some comments below for how I think the structure could be improved, and it would be nice to see a new patch.

::: js/src/vm/HelperThreads.cpp
@@ +726,5 @@
> +    // Tasks cannot be started twice.
> +    MOZ_ASSERT(state == NotStarted);
> +
> +    {
> +        AutoLockHelperThreadState helperLock;

Since the caller will start a bunch of tasks at once, it will be more efficient if the helper thread state is only taken once, in that caller.

@@ +738,5 @@
> +        if (!HelperThreadState().gcParallelWorklist().append(this))
> +            return false;
> +        state = Enqueued;
> +
> +        HelperThreadState().notifyAll(GlobalHelperThreadState::PRODUCER);

This should be more efficient if you use notifyOne()

@@ +747,5 @@
> +
> +void
> +js::GCParallelTask::join()
> +{
> +    AutoLockHelperThreadState helperLock;

As for start(), he caller will join a bunch of tasks at once, so it would be nice if this lock was only take once.

@@ +765,5 @@
> +    // to keep TSAN from complaining.
> +    {
> +        AutoLockHelperThreadState helperLock;
> +        state = Working;
> +    }

I think the state enum is too detailed.  There doesn't seem to be any need to distinguish Enqueued from Working, so why not merge these two and remove this chunk?

@@ +773,5 @@
> +    duration_ = PRMJ_Now() - timeStart;
> +
> +    AutoLockHelperThreadState helperLock;
> +    state = Finished;
> +    HelperThreadState().notifyAll(GlobalHelperThreadState::CONSUMER);

The caller is about to retake the helper thread state lock so this stuff should be done there, which will avoid this unnecessary lock/unlock of the helper thread state.

Also, after removing this block and the initial block this function is pretty thin, so why not do the timing stuff in handleGCParallelWorkload and remove this weirdly named function.

::: js/src/vm/HelperThreads.h
@@ +229,5 @@
>      JSScript *finishParseTask(JSContext *maybecx, JSRuntime *rt, void *token);
>      bool compressionInProgress(SourceCompressionTask *task);
>      SourceCompressionTask *compressionTaskForSource(ScriptSource *ss);
>  
> +    //bool gcParallelTaskInProgress(GCParallelTask *task);

rm
Attachment #8496146 - Flags: review?(bhackett1024)
Attached patch sweep_tables_in_parallel-v1.diff (obsolete) — Splinter Review
(In reply to Brian Hackett (:bhackett) from comment #30)
> Comment on attachment 8496146 [details] [diff] [review]
> sweep_tables_in_parallel-v0.diff
> @@ +773,5 @@
> > +    duration_ = PRMJ_Now() - timeStart;
> > +
> > +    AutoLockHelperThreadState helperLock;
> > +    state = Finished;
> > +    HelperThreadState().notifyAll(GlobalHelperThreadState::CONSUMER);
> 
> The caller is about to retake the helper thread state lock so this stuff
> should be done there, which will avoid this unnecessary lock/unlock of the
> helper thread state.

Good catch!
 
> Also, after removing this block and the initial block this function is
> pretty thin, so why not do the timing stuff in handleGCParallelWorkload and
> remove this weirdly named function.

We can't expose these details to just HelperThread because of the dependency loop, so instead I moved things the other direction into the task dispatcher. I've renamed it to runFromHelperThread, which is clearer and provides a nice complement to runFromMainThread. In any case, we only re-take the lock once now.
Attachment #8496146 - Attachment is obsolete: true
Attachment #8496146 - Flags: review?(jcoppeard)
Attachment #8496956 - Flags: review?(jcoppeard)
Attachment #8496956 - Flags: review?(bhackett1024)
Sorry for the churn. I pushed the rest of the reordering and AutoPhase changes that were in this patch because I am lazy down into the patches where they actually belong.
Attachment #8496956 - Attachment is obsolete: true
Attachment #8496956 - Flags: review?(jcoppeard)
Attachment #8496956 - Flags: review?(bhackett1024)
Attachment #8497034 - Flags: review?(jcoppeard)
Attachment #8497034 - Flags: review?(bhackett1024)
Keywords: leave-open
Attachment #8494681 - Flags: checkin+
Attachment #8494682 - Flags: checkin+
Attachment #8494692 - Flags: checkin+
Attachment #8494709 - Flags: checkin+
Attachment #8494718 - Flags: checkin+
Attachment #8495317 - Flags: checkin+
Attachment #8495319 - Flags: checkin+
Comment on attachment 8497034 [details] [diff] [review]
sweep_tables_in_parallel-v2.diff

Review of attachment 8497034 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: js/src/vm/HelperThreads.cpp
@@ +729,5 @@
> +    MOZ_ASSERT(state == NotStarted);
> +
> +    // If we do the shutdown GC before running anything, we may never
> +    // have initialized the helper threads. Just use the serial path
> +    // since we cannot safely intialize them at this point.

typo
Attachment #8497034 - Flags: review?(bhackett1024) → review+
Comment on attachment 8497034 [details] [diff] [review]
sweep_tables_in_parallel-v2.diff

Review of attachment 8497034 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  I like that you've fixed all those places we triggered read barriers unnecessarily.

::: js/src/gc/GCRuntime.h
@@ +723,5 @@
> +    SweepBaseShapesTask   sweepBaseShapesTask;
> +    SweepInitialShapesTask sweepInitialShapesTask;
> +    SweepTypeObjectsTask  sweepTypeObjectsTask;
> +    SweepRegExpsTask      sweepRegExpsTask;
> +    SweepMiscTask         sweepMiscTask;

Since these are only ever used within an invocation of GCRuntime::beingSweepingZoneGroup() could these be created on the stack instead?

::: js/src/gc/Zone.h
@@ +195,5 @@
> +        if (runtimeFromAnyThread()->isHeapCollecting())
> +            return gcState_ != NoGC;
> +        else
> +            return needsIncrementalBarrier();
> +    }

I don't think this is safe for general use, as if it returns false there is no guarantee that a collection isn't about to start.  We could make this into AssertCollectingFromAnyThread() which is what it is used for.
Attachment #8497034 - Flags: review?(jcoppeard) → review+
(In reply to Jon Coppeard (:jonco) from comment #36)
> Comment on attachment 8497034 [details] [diff] [review]
> sweep_tables_in_parallel-v2.diff
> 
> Review of attachment 8497034 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good.  I like that you've fixed all those places we triggered read
> barriers unnecessarily.
> 
> ::: js/src/gc/GCRuntime.h
> @@ +723,5 @@
> > +    SweepBaseShapesTask   sweepBaseShapesTask;
> > +    SweepInitialShapesTask sweepInitialShapesTask;
> > +    SweepTypeObjectsTask  sweepTypeObjectsTask;
> > +    SweepRegExpsTask      sweepRegExpsTask;
> > +    SweepMiscTask         sweepMiscTask;
> 
> Since these are only ever used within an invocation of
> GCRuntime::beingSweepingZoneGroup() could these be created on the stack
> instead?

But what about the heavy initialization of the CondVar... and... oh yeah, now that that's all gone, this is just logic. Done.

> ::: js/src/gc/Zone.h
> @@ +195,5 @@
> > +        if (runtimeFromAnyThread()->isHeapCollecting())
> > +            return gcState_ != NoGC;
> > +        else
> > +            return needsIncrementalBarrier();
> > +    }
> 
> I don't think this is safe for general use, as if it returns false there is
> no guarantee that a collection isn't about to start.  We could make this
> into AssertCollectingFromAnyThread() which is what it is used for.

You are quite correct! I guess was I got into auto-pilot with adding FromAnyThread to IsAboutToBeFinalized and didn't stop to think about this specific case. We know this is only callable under gcCycle, so we can trivially simplify this to just an isHeapCollecting check... which we can inline:

    JS_ASSERT(zone()->runtimeFromAnyThread()->isHeapCollecting());
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/8be54e6c4dcd
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
Telemetry post-36 release shows that GC_MAX_PAUSE is shifted to the left from the 35 release. Of course, we only have a couple days of samples so far, so this could be selection bias, but still it's a nice early sign.
Plus there is a very clear drop in GC_SCC_SWEEP_MAX_PAUSE_MS telemetry in Nightly from 10/6 to 10/7, when this first reached Nightly.

95th percentile went from about 90ms to 60ms, 75th percentile went from about 40ms to about 30ms, median goes from about 22ms to to 17ms, 25th percentile went from about 15ms to 10ms.  Even the 5th percentile clearly went down a bit.
You need to log in before you can comment on or make changes to this bug.