Closed Bug 1622757 Opened 8 months ago Closed 7 months ago

Parallel unmarking is slower on 64 bit Linux

Categories

(Core :: JavaScript: GC, defect, P1)

Tracking

RESOLVED FIXED

People

(Reporter: jonco, Assigned: jonco)

Details

Attachments

(4 files)

Looking at telemetry since bug 1618638 landed, I can see a reduction in the proportion of budget overruns due to waiting for unmarking to finish on most platforms. The only exception is 64 bit Linux where there is an increase.

I don't know what's causing this, but I'm going to make a bunch of small improvements and see if that improves matters.

Keywords: leave-open

The change to use multiple parallel threads for unmarking meant that all unmarking had to complete before any other GC work could start. This patch queues all parallel tasks at the same time and waits for everything to complete.
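The pattern described above can be sketched as follows. This is a minimal illustration using standard C++ futures, not the actual SpiderMonkey task machinery: all tasks are queued up front and there is a single join point, so no one phase (such as unmarking) acts as a barrier for the rest.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <vector>

// Hypothetical sketch: queue every parallel task before waiting, then
// block once for all of them, so phases overlap instead of running
// one barrier at a time.
void runAllThenJoin(const std::vector<std::function<void()>>& tasks) {
  std::vector<std::future<void>> pending;
  for (const auto& t : tasks)
    pending.push_back(std::async(std::launch::async, t));
  for (auto& f : pending)
    f.get();  // single join point for all queued work
}
```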

Assignee: nobody → jcoppeard
Status: NEW → ASSIGNED

Previously we could start more parallel tasks than there were work items. This patch changes things so that we take a work item when we create a ParallelWorker, allowing us to stop creating workers if we run out of work.
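A rough sketch of that idea (the `Worker` type and work-item representation here are illustrative, not the real ParallelWorker API): each worker claims its first item at creation time, so worker creation stops as soon as the queue runs dry rather than always starting the maximum number of threads.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for a parallel worker that owns its first item.
struct Worker { int firstItem; };

// Create at most maxThreads workers, but never more than there are
// work items: creation stops when the queue is empty.
std::vector<Worker> createWorkers(std::queue<int>& work, size_t maxThreads) {
  std::vector<Worker> workers;
  while (workers.size() < maxThreads && !work.empty()) {
    workers.push_back(Worker{work.front()});
    work.pop();
  }
  return workers;
}
```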

I changed the foreground cell update code to not use ParallelWorker at all for work that happens entirely on the main thread and doesn't need synchronisation.

Depends on D66961

Yoshi pointed out that we don't always unlock the helper thread mutex when using AutoRunParallelWork. The code will work anyway, but it looks strange and perhaps we should not rely on this. This patch adds an unlock in the one place it's currently missing.
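The fix can be sketched like this (names are illustrative, not the real SpiderMonkey API): the helper-thread lock is released explicitly before main-thread work runs, instead of relying on the task machinery to drop it as a side effect.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical sketch: dispatch parallel work under the helper-thread
// lock, then explicitly unlock before running a task on the main thread.
void runMainThreadTask(std::mutex& helperThreadLock,
                       const std::function<void()>& task) {
  std::unique_lock<std::mutex> lock(helperThreadLock);
  // ... queue parallel work while holding the lock ...
  lock.unlock();  // the explicit release this patch adds
  task();         // main-thread work runs without the lock held
}
```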

Depends on D66962

Not a performance improvement, but a tidy-up I was planning. This change makes us use the same number of threads for all parallel work rather than passing the count as a parameter. The number of threads used is half the CPU count, clamped between 1 and 8.

The main change is that we will use fewer threads for sweeping weak caches now, but I doubt that will make much difference since all this work is likely to be memory bound.

This cleaned up GCRuntime::updateCellPointers; the thread count passed in could never be zero here anyway.
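The thread-count policy described above amounts to a one-line formula. The helper below is a hypothetical illustration of it (name and signature are mine, not the SpiderMonkey code): half the CPU count, clamped to [1, 8], so the result can never be zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative helper: use half the CPU count for parallel GC work,
// clamped between 1 and 8 threads.
size_t gcParallelThreadCount(size_t cpuCount) {
  return std::clamp<size_t>(cpuCount / 2, 1, 8);
}
```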

Depends on D66964

Attachment #9133550 - Attachment description: Bug 1622757 - Run unmarking work in parallel with other GC tasks r?sfink → Bug 1622757 - Run unmarking work in parallel with other GC tasks r=sfink
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/47610b8acc76
Run unmarking work in parallel with other GC tasks r=sfink
https://hg.mozilla.org/integration/autoland/rev/ff16837de269
Start fewer parallel tasks if not enough work to satisfy maximum number of threads r=sfink
https://hg.mozilla.org/integration/autoland/rev/2b586baa631e
Explicitly release the helper thread lock rather than rely on this happening when we run the first task on the main thread r=sfink
https://hg.mozilla.org/integration/autoland/rev/0dd757180739
Use the same thread count for all parallel GC work r=sfink

This hasn't worked and over the last four days I can see a big increase in the proportion of GC_SLOW_TASK telemetry caused by unmarking. Let's start the investigation by backing out the second patch ("Start fewer parallel tasks if not enough work to satisfy maximum number of threads") as that's the largest change.

The priority flag is not set for this bug.
:jonco, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jcoppeard)
Priority: -- → P1

Backing out that change hasn't made any discernible impact on telemetry, so I'm going to reland it.

Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6c58a466d557
Start fewer parallel tasks if not enough work to satisfy maximum number of threads r=sfink

Looking at the GC_SLOW_TASK telemetry again, I can see that the largest task changes a lot when we change the code, but the GC_BUDGET_OVERRUN telemetry doesn't necessarily link that with longer overruns. Looking at that telemetry, we see a regression when bug 1618638 landed and a subsequent improvement back to original levels when the patches in this bug landed, on all platforms including Linux. I'm therefore going to mark this as fixed now.

Status: ASSIGNED → RESOLVED
Closed: 7 months ago
Resolution: --- → FIXED