[meta] Reduce lock contention in jemalloc
Categories
(Core :: Memory Allocator, enhancement, P1)
People
(Reporter: pbone, Assigned: pbone)
References
(Depends on 9 open bugs, Blocks 3 open bugs)
Details
(Keywords: meta, perf:responsiveness)
Attachments
(4 obsolete files)
Jon and Jan noticed a lot of lock contention in jemalloc (Bug 1809058), specifically in the free routine from off-thread sweeping. Likewise, I noticed the JS arena is often contended (Bug 1805644). We can move several things outside the arena lock's critical section on the free path (see the sketch after the list). The candidates are:
- VirtualFree/munmap
- madvise
- Poisoning
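To make the pattern concrete, here is a minimal C++ sketch. The `Arena`/`FreeChunk` names and the use of `std::mutex` are invented for illustration and are not mozjemalloc's actual types; the point is only that the bookkeeping stays inside the critical section while the system call moves out of it.

```cpp
#include <cstddef>
#include <mutex>
#include <sys/mman.h>

struct Arena {
  std::mutex lock;

  void FreeChunk(void* chunk, size_t size) {
    {
      std::lock_guard<std::mutex> guard(lock);
      // ... unlink the chunk from the arena's data structures ...
    }
    // The expensive system call now runs outside the critical
    // section, so threads blocked on `lock` can proceed meanwhile.
    munmap(chunk, size);  // VirtualFree(chunk, 0, MEM_RELEASE) on Windows
  }
};
```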
Comment 1•2 years ago
In the Speedometer 3 meeting yesterday, Randal mentioned that he's already been looking at this and that there are existing bugs.
Comment 2•2 years ago
There are existing bugs, and known problems from doing so (at least for poisoning, IIRC).
Comment 3•2 years ago
Found the known problems poisoning outside a lock: https://phabricator.services.mozilla.com/D60036#1843167
Comment 4•2 years ago
IME the best way to reduce lock contention is to have more fine-grained locking (at the run level), which itself requires not using pthread locks because they are too large. I did that in the Rust prototype, with good results.
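For a sense of scale, here is an illustrative one-byte spinlock in C++ (a sketch only, not the lock used in the Rust prototype); glibc's pthread_mutex_t is around 40 bytes on 64-bit Linux, far too large to embed in every run header:

```cpp
#include <atomic>

// A lock that occupies a single byte, small enough to put in each run.
class TinyLock {
  std::atomic_flag mHeld = ATOMIC_FLAG_INIT;

 public:
  void Lock() {
    while (mHeld.test_and_set(std::memory_order_acquire)) {
      // Busy-wait. A production lock would pause/yield here, or fall
      // back to a futex after a bounded number of spins.
    }
  }
  void Unlock() { mHeld.clear(std::memory_order_release); }
};
```

Many small locks like this let two threads free into different runs of the same arena without contending, at the cost of more lock/unlock operations overall.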
Comment 5•2 years ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #4)
> IME the best way to reduce lock contention is to have more fine-grained locking (at the run level), which itself requires not using pthread locks because they are too large. I did that in the Rust prototype, with good results.
I'll take a look. I'm proposing "just moving" this code rather than doing fine-grained locking because I didn't want to invest too much time. But I'll consider making that a longer-term plan, as part of looking at other allocators.
Comment 6•2 years ago (Assignee)
The system calls that release a chunk of memory can be costly and should be done outside the arena lock's critical section, so that other threads aren't blocked waiting for the lock.
Comment 7•2 years ago (Assignee)
Depends on D166775
Comment 8•2 years ago (Assignee)
This should reduce contention on the jemalloc locks, but it also makes it easier to manage jemalloc state in the next patch.
Depends on D166776
Comment 9•2 years ago (Assignee)
We must hold the arena lock while calling madvise so that our record of what is madvised and what isn't stays consistent with the OS. But we can batch these operations until after the JS GC has completed sweeping, which combines more work into fewer system calls and holds the arena lock for less time. (A sketch of the idea follows.)
Depends on D166777
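A sketch of the batching idea in C++ (the Range/QueueDirty/FlushDirty names and data structures are invented for illustration; this is not the patch itself):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>
#include <sys/mman.h>

struct Range {
  char* addr;
  size_t len;
};

struct Arena {
  std::mutex lock;
  std::vector<Range> pendingDirty;  // protected by `lock`

  // During sweeping: record the range instead of madvising immediately,
  // merging it with the previous range when they are contiguous.
  void QueueDirty(char* addr, size_t len) {
    std::lock_guard<std::mutex> guard(lock);
    if (!pendingDirty.empty() &&
        pendingDirty.back().addr + pendingDirty.back().len == addr) {
      pendingDirty.back().len += len;
      return;
    }
    pendingDirty.push_back({addr, len});
  }

  // After the GC finishes sweeping: one madvise per merged range,
  // still under the arena lock so our records stay consistent.
  void FlushDirty() {
    std::lock_guard<std::mutex> guard(lock);
    for (const Range& r : pendingDirty) {
      madvise(r.addr, r.len, MADV_DONTNEED);
    }
    pendingDirty.clear();
  }
};
```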
Comment 10•2 years ago (Assignee)
In a review, jonco asked whether coalescing the calls to madvise leads to fewer-but-longer pauses. I want to share the data here on the bug where it's more visible:
| | mozilla-central | with patches |
| --- | --- | --- |
| Number of times Purge() called | 49,706.00 | 36,110.00 |
| Total duration of Purge() | 2.44s | 1.95s |
| Average pause | 49.04us | 53.98us |
| Median pause | 43.96us | 40.77us |
| 90th percentile | 85.03us | 81.77us |
| 99th percentile | 181.08us | 337.83us |
The arena lock is held for the entire duration of Purge(), so all these numbers represent the length of that critical section, and likely pauses on the main thread if the JS engine tries to allocate memory (but not Gecko; that's a different arena). I obtained the figures above by opening a tab with the Speedometer 2 benchmark, running the benchmark, then closing the browser. They are for the content process only.
While the total time spent in the lock, and therefore the benchmark result, is a little better, I don't like the idea of adding latency in those 99th-percentile results. So I'm now experimenting with having Purge() exit early even if it hasn't de-committed "enough" dirty pages, then running it again each time a thread obtains the lock, until enough dirtied pages are purged. The best results so far are for purging one chunk at a time, which has better average-case and worst-case behaviour than my patches so far, and only slightly worse overall throughput - but I'll keep experimenting.
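A sketch of that experiment in C++ (all names and the dirty-byte limit are invented for illustration):

```cpp
#include <cstddef>
#include <mutex>
#include <vector>
#include <sys/mman.h>

struct DirtyChunk {
  char* addr;
  size_t len;
};

struct Arena {
  std::mutex lock;
  std::vector<DirtyChunk> dirtyChunks;  // protected by `lock`
  size_t dirtyBytes = 0;
  static constexpr size_t kMaxDirtyBytes = 1 << 20;  // arbitrary limit

  // Purge at most one chunk per call, so no single lock acquisition
  // holds the lock for long. Returns true while more purging remains;
  // each thread that takes the lock can make a little progress.
  bool PurgeSome() {
    std::lock_guard<std::mutex> guard(lock);
    if (dirtyBytes <= kMaxDirtyBytes || dirtyChunks.empty()) {
      return false;
    }
    DirtyChunk c = dirtyChunks.back();
    dirtyChunks.pop_back();
    madvise(c.addr, c.len, MADV_DONTNEED);
    dirtyBytes -= c.len;
    return dirtyBytes > kMaxDirtyBytes;
  }
};
```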
Comment 11•2 years ago
(In reply to Paul Bone [:pbone] from comment #10)
Thanks for sharing these results. This seems to reduce the number of purges by about 27% while keeping the median and average pause roughly the same, so I'd say that's a good win.
Comment 12•2 years ago (Assignee)
Here are some results from try:
I also ran tests in the "browsertime", "awsy" and "js-bench" groups.
Lots of things are better, particularly responsiveness but also some throughput. Some page-load benchmarks are worse though. I'm wondering whether, now that a bunch of madvise calls happen in one lock transaction, that's hitting some benchmarks in an unlucky way. I'm going to investigate a bit further.
Comment 13•2 years ago (Assignee)
(In reply to Paul Bone [:pbone] from comment #12)
> Here are some results from try:
> I also ran tests in the "browsertime", "awsy" and "js-bench" groups.
> Lots of things are better, particularly responsiveness but also some throughput. Some page-load benchmarks are worse though.
I ran the "fandom" tests (some of the most extreme in try) on my PC and extracted the fcp values. I ran it against the baseline and then each of the 5 patches in my workspace applied in series. I'm not an expert statistician, so grain of salt. But the data appears to be too noisy to say that there's a statistical difference. I was hoping to find "Ah, this patch is the one that makes page loads slower" and have clues about why. But I can't spot a difference that "feels" significant. I'll try them on "try" in case the environment makes a difference.
In the meantime if someone else can spot something in this data or has other insights about the tests I'd appreciate the clues.
Comment 14•2 years ago (Assignee)
I'm going to break this bug up into sub-bugs so that landing each one is lower risk and gives us a better chance at understanding performance changes (including unforeseen regressions).
Comment 15•2 years ago
Comment on attachment 9312148 [details]
Bug 1809610 - Move chunk releasing out of the arena lock's critical section r=glandium
Revision D166775 was moved to bug 1814808. Setting attachment 9312148 [details] to obsolete.
Comment 16•2 years ago
Comment on attachment 9312149 [details]
Bug 1809610 - Poison memory we want to free before we take the arena lock r=glandium
Revision D166776 was moved to bug 1609478. Setting attachment 9312149 [details] to obsolete.
Comment 17•2 years ago
Comment on attachment 9312151 [details]
Bug 1809610 - Coalesce madvise calls until after sweeping r=glandium,jonco
Revision D166778 was moved to bug 1814815. Setting attachment 9312151 [details] to obsolete.
Comment 18•2 years ago (Assignee)
I have some ideas to try to figure out the regression.
Comment 19•2 years ago
Comment on attachment 9312150 [details]
Bug 1809610 - Don't run background sweep and background free concurrently r=jonco
Revision D166777 was moved to bug 1814813. Setting attachment 9312150 [details] to obsolete.
Comment 20•6 months ago (Assignee)
Unassigned because this is a meta bug.