Open Bug 1809610 Opened 1 year ago Updated 6 days ago

[meta] Reduce lock contention in jemalloc

Categories

(Core :: Memory Allocator, enhancement, P1)

enhancement

Tracking

()

REOPENED

People

(Reporter: pbone, Assigned: pbone)

References

(Depends on 6 open bugs, Blocks 4 open bugs)

Details

(Keywords: meta, perf:responsiveness)

Attachments

(4 obsolete files)

Jon and Jan noticed a lot of lock contention in jemalloc (Bug 1809058) specifically in the free routine from off-thread sweeping. Likewise I noticed the JS arena is often contended (Bug 1805644) we can move several things to outside the arena lock's critical section on the free path. The candidates are:

  • VirtualFree/munmap
  • madvise
  • Poisoning

In the speedometer 3 meeting yesterday, Randal suggested that he's already been looking at this and that there are existing bugs.

Flags: needinfo?(rjesup)

There are existing bugs, and known problems from doing so (at least for poisoning, IIRC).

Found the known problems poisoning outside a lock: https://phabricator.services.mozilla.com/D60036#1843167

IME the best way to reduce lock contention is to have more fine-grained locking (at the run level), which itself requires not using pthread locks because they are too large. I did that in the Rust prototype, with good results.

(In reply to Mike Hommey [:glandium] from comment #4)

IME the best way to reduce lock contention is to have more fine-grained locking (at the run level), which itself requires not using pthread locks because they are too large. I did that in the Rust prototype, with good results.

I'll take a look. I'm proposing "just moving" this code rather than doing fine-grained locking because I didn't want to invest too much time. But I'll consider making that a longer-term plan, part of looking at other allocators.

See Also: → 1609478

The system calls of releasing a chunk of memory can be costly and should be
done outside the arena lock's critical section so that other threads aren't
blocked waiting for the lock.

This should reduce lock contention on the jemalloc locks but it also makes
it easier to manage jemalloc state in the next patch.

Depends on D166776

We must hold the arena lock while calling madvise so that our record of what
is madvised and what isn't is consistent with the OS. But we can batch
these operations until after the JS GC has completed sweeping which combines
more system calls and doesn't block the arena lock for so long.

Depends on D166777

Attachment #9312151 - Attachment description: Bug 1809610 - Coalleasce madvise calls until after sweeping r=glandium,jonco → Bug 1809610 - Coalesce madvise calls until after sweeping r=glandium,jonco

In a review jonco asked if coalleascing the calls to madvise leads to fewer-but-longer pauses. I want to share that here on the bug where it's more visible:

mozilla-central with patches
Number of times Purge() called 49,706.00 36,110.00
Total duration of Purge() 2.44s 1.95s
Average pause 49.04us 53.98us
Median 43.96us 40.77us
90th percentile 85.03us 81.77us
99th percentile 181.08us 337.83us

The arena lock is held for the entire duration of Purge(), so all these numbers represent the length of that critical section and likely pauses on the main thread if the JS engine tries to allocate memory (but not gecko, that's a different arena). I obtained the figures above from opening a tab with the speedometer 2 benchmark, running the benchmark then closing the browser. They are for the content process only.

While the total time spent in the lock, and therefore the benchmark result is a little better. I don't like the idea of adding latency in those 99th percentile results. So I'm now experimenting with having Purge() exit early even if it hasn't de-committed "enough" dirty pages, then running again each time a thread obtains the lock until enough dirtied pages are purged. The best results so far are for purging one chunk at a time. Which has better average-case and worst-case behaviour than my patches so far, and only slightly worse overall throughput - but I'll keep experimenting.

Flags: needinfo?(jcoppeard)

(In reply to Paul Bone [:pbone] from comment #10)
Thanks for sharing these results. This seems to reduce the number of purges by 25% while keeping the median and average pause roughly the same, so I'd say that's a good win.

Flags: needinfo?(jcoppeard)

Here are some results from try:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=3240143fd8c0db90e502ba9ede37b54a21ec544d&newProject=try&newRevision=62882f3e634051a946e0c550ab3d8678a0c5adad&framework=13&page=1&showOnlyComparable=1

I also ran tests in the "browsertime", "awsy" and "js-bench" groups.

Lots of things are better, particularly responsiveness but also some throughput. Some page load benchmarks are worse though. I'm wondering if now that a bunch of madvise calls happen all in one lock transaction if that's hitting some benchmarks in an unlucky way. I'm going to investigate a bit further.

See Also: → 1810954
Blocks: 1811985
Blocks: 1811987
Depends on: 1812231

(In reply to Paul Bone [:pbone] from comment #12)

Here are some results from try:

https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=3240143fd8c0db90e502ba9ede37b54a21ec544d&newProject=try&newRevision=62882f3e634051a946e0c550ab3d8678a0c5adad&framework=13&page=1&showOnlyComparable=1

I also ran tests in the "browsertime", "awsy" and "js-bench" groups.

Lots of things are better, particularly responsiveness but also some throughput. Some page load benchmarks are worse though. I'm wondering if now that a bunch of madvise calls happen all in one lock transaction if that's hitting some benchmarks in an unlucky way. I'm going to investigate a bit further.

I ran the "fandom" tests (some of the most extreme in try) on my PC and extracted the fcp values. I ran it against the baseline and then each of the 5 patches in my workspace applied in series. I'm not an expert statistician, so grain of salt. But the data appears to be too noisy to say that there's a statistical difference. I was hoping to find "Ah, this patch is the one that makes page loads slower" and have clues about why. But I can't spot a difference that "feels" significant. I'll try them on "try" in case the environment makes a difference.

In the meantime if someone else can spot something in this data or has other insights about the tests I'd appreciate the clues.

Flags: needinfo?(smaug)
Flags: needinfo?(rjesup)
Flags: needinfo?(bas)

I'm going to break this bug up into sub-bugs so that landing each one is lower risk and gives us a better chance at understanding performance changes (including unforseen regressions).

No longer depends on: 1812231
Summary: Move some work to outside critical sections. → Reduce lock contention in jemalloc
Priority: -- → P1
Depends on: 1810953
Depends on: 1810954
Depends on: 1609478
Depends on: 1814808

Comment on attachment 9312148 [details]
Bug 1809610 - Move chunk releasing out of the arena lock's critical section r=glandium

Revision D166775 was moved to bug 1814808. Setting attachment 9312148 [details] to obsolete.

Attachment #9312148 - Attachment is obsolete: true

Comment on attachment 9312149 [details]
Bug 1809610 - Poison memory we want to free before we take the arena lock r=glandium

Revision D166776 was moved to bug 1609478. Setting attachment 9312149 [details] to obsolete.

Attachment #9312149 - Attachment is obsolete: true
Depends on: 1814813
Depends on: 1814815

Comment on attachment 9312151 [details]
Bug 1809610 - Coalesce madvise calls until after sweeping r=glandium,jonco

Revision D166778 was moved to bug 1814815. Setting attachment 9312151 [details] to obsolete.

Attachment #9312151 - Attachment is obsolete: true
Depends on: 1815069
Keywords: meta
Summary: Reduce lock contention in jemalloc → [meta] Reduce lock contention in jemalloc

I have some ideas to try to try to figure out the regression.

Flags: needinfo?(smaug)
Flags: needinfo?(bas)

Comment on attachment 9312150 [details]
Bug 1809610 - Don't run background sweep and background free concurrently r=jonco

Revision D166777 was moved to bug 1814813. Setting attachment 9312150 [details] to obsolete.

Attachment #9312150 - Attachment is obsolete: true
Whiteboard: [sp3]

Unassigned because this is a meta bug.

Status: ASSIGNED → RESOLVED
Closed: 6 days ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1898646
Depends on: 1898647
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: