decommit can be slow while holding the allocator lock
Categories
(Core :: Memory Allocator, enhancement, P1)
Tracking
Performance Impact: medium

| | Tracking | Status |
| --- | --- | --- |
| firefox64 | --- | fix-optional |
| firefox65 | --- | fix-optional |
People
(Reporter: tcampbell, Assigned: pbone)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [sp3])
Attachments
(17 files, 1 obsolete file)
17 × 48 bytes, text/x-phabricator-request
In the profile from https://bugzilla.mozilla.org/show_bug.cgi?id=1488435:

(In reply to Randell Jesup [:jesup] from comment #3)
> https://docs.google.com/document/d/1CwPnZ7NnzzIt_UjqhaESbgWyI8VRDDzUQA_le9qaeG0/edit?usp=sharing

https://perf-html.io/public/af86f00f11f7c7719ed6018a24dab7a86ecc507b/calltree/?globalTrackOrder=0-1-2-3-4&hiddenGlobalTracks=0-1-2-3&hiddenLocalTracksByPid=2573-0&invertCallstack&localTrackOrderByPid=2573-0~&range=3.3981_8.3281~4.3286_7.4642&thread=5&v=3

Here, 10% of the time is spent spinning on a lock. This seems to be contention due to off-thread parsing and Ion compilation.
Comment 1•6 years ago
We could experiment with using separate jemalloc arenas for LIFO allocations on helper threads.
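As a rough illustration of what that could look like, here is a sketch only: it assumes the explicit-arena entry points declared in memory/build/mozmemory.h (moz_create_arena, moz_arena_malloc, with signatures approximated from memory), and HelperThreadArena/AllocLifoChunk/FreeLifoChunk are hypothetical wrappers, not existing functions.

```cpp
// Sketch only. Assumes the explicit-arena API from memory/build/mozmemory.h
// (moz_create_arena, moz_arena_malloc); signatures approximated from memory.
// HelperThreadArena/AllocLifoChunk/FreeLifoChunk are hypothetical wrappers.
#include <cstdlib>

#include "mozmemory.h"

static arena_id_t HelperThreadArena() {
  // One arena shared by all JS helper threads, created lazily. Allocations
  // routed here no longer contend on the main thread's arena lock.
  static arena_id_t sArena = moz_create_arena();
  return sArena;
}

void* AllocLifoChunk(size_t aSize) {
  // Helper-thread LifoAlloc chunk allocations would call this instead of
  // plain malloc().
  return moz_arena_malloc(HelperThreadArena(), aSize);
}

void FreeLifoChunk(void* aPtr) {
  // mozjemalloc routes a free() back to the arena the pointer came from, so
  // plain free() should be sufficient here.
  free(aPtr);
}
```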
Reporter
Comment 2•6 years ago
(In reply to Jon Coppeard (:jonco) from comment #1) > We could experiment with using separate jemalloc arenas for LIFO allocations > on helper threads. This seems like a good idea. Helper threads sharing an arena is probably fine since they are by definition background work.
Comment 3•6 years ago
Another idea is to use multiple malloc heaps on the main thread and cycle between them, so that when we free things on the background thread we don't contend on the heap lock. Background free happens separately for malloc allocations associated with nursery and tenured objects.
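A hypothetical sketch of that cycling scheme, reusing the same assumed moz_create_arena/moz_arena_malloc entry points as the earlier sketch; all names and the rotation trigger here are purely illustrative:

```cpp
// Hypothetical sketch of cycling between several main-thread arenas, reusing
// the assumed moz_create_arena / moz_arena_malloc API from the sketch above.
// All names here are illustrative.
#include <cstddef>

#include "mozmemory.h"

static constexpr size_t kMainThreadArenas = 4;
static arena_id_t sMainThreadArenas[kMainThreadArenas];
static size_t sCurrent = 0;  // Only touched on the main thread.

void InitMainThreadArenas() {
  for (auto& id : sMainThreadArenas) {
    id = moz_create_arena();
  }
}

// Called at a natural boundary, e.g. when a batch of malloc'd blocks is handed
// to the background-free thread, so that new main-thread allocations stop
// touching the arena whose lock the background thread is about to take.
void RotateMainThreadArena() { sCurrent = (sCurrent + 1) % kMainThreadArenas; }

void* MainThreadMalloc(size_t aSize) {
  return moz_arena_malloc(sMainThreadArenas[sCurrent], aSize);
}
```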
Comment 4•6 years ago
A further idea would be to make jemalloc use more fine-grained locking. At the moment I think it's pretty coarse.
Reporter
Comment 5•6 years ago
Jon, on this profile it seems pretty severe. I see 10% of main-thread time, unless I'm reading this wrong. Pushing back to qf:67 is fine if we think the fix is more experimental. Thoughts?
Comment 6•6 years ago
I wasn't able to reproduce that degree of contention (I only saw ~1%), but it would be great to improve the situation around malloc locking. I can't commit to working on this right now though.
Comment 7•6 years ago
Based on recent discussion with Jon and Ted, we decided to move the QF target to qf:p1:f67.
Comment 8•6 years ago
FYI, I still see 10% in Google Docs: we still experience (at least on my dual-Xeon machine) ~10% jemalloc lock contention. See https://perfht.ml/2N0lzNn - during keypresses it's even higher - https://perfht.ml/2ztMEVS - with several long spins of up to 60ms while one of the JS Helpers does a LifoAlloc::freeAll() (mostly in the madvise()). Perhaps we should bump the priority of this, unless it's specific to high-core-count machines. Note this profile is a "fast" profile where keypress time is <100ms.
Comment 9•6 years ago
Can we just use the same thread-local arena stuff that we wound up using for Stylo threads? Or at least see if that makes any difference?
Comment 10•6 years ago
(In reply to Randell Jesup [:jesup] from comment #8)
> We still experience (at least on my dual-Xeon machine) ~10% jemalloc lock
> contention. See https://perfht.ml/2N0lzNn - during keypresses it's even
> higher - https://perfht.ml/2ztMEVS - with several long spins of up to 60ms
> while one of the JS Helpers does a LifoAlloc::freeAll() (mostly in the
> madvise()).
Can we re-evaluate this observation now that Bug 1489572 has landed? It should have greatly reduced the number of these allocations.
Comment 11•6 years ago
I was looking into this this week and found that when you free memory, jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.
Using a separate arena for JIT data doesn't stop this from happening, unfortunately, because the initial allocations (IonBuilder) are done on the main thread. It might still help, though, as at least it wouldn't conflict with other kinds of allocation.
Ideally this decommit process would be incremental and would release the lock while calling into the OS.
Comment 12•6 years ago
(In reply to Jon Coppeard (:jonco) from comment #11)
> I was looking into this this week and found that when you free memory, jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.
> Using a separate arena for JIT data doesn't stop this from happening, unfortunately, because the initial allocations (IonBuilder) are done on the main thread. It might still help, though, as at least it wouldn't conflict with other kinds of allocation.
> Ideally this decommit process would be incremental and would release the lock while calling into the OS.
glandium - any thoughts on whether jemalloc could background the decommits, or at least move them out from under the lock, so they would only block the calling thread and not other uses of jemalloc on other threads?
The profile from above (https://perfht.ml/2TH2ovO focused on an madvise() in the top JS helper thread) has LifoAlloc::freeAll() -> arena_dalloc() -> Purge() (likely with some others inlined there and thus missing, like DallocSmall()) -> madvise(). Perhaps Purge could move segments to a to-be-freed list, and then process them at the end of arena_dalloc() after dropping the lock? Or even fancier, a separate thread could free segments afterwards - but that's probably overkill, and might have some negative ramifications in load situations.
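A minimal sketch of that suggestion: gather the ranges under the lock, then issue the madvise() calls after dropping it. Arena, DirtyRange and PurgeDeferred are simplified stand-ins, not the real mozjemalloc structures.

```cpp
// Illustrative sketch only: collect the ranges to purge while the arena lock
// is held, then call madvise() after releasing it. These types are
// hypothetical stand-ins for mozjemalloc internals.
#include <sys/mman.h>

#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct DirtyRange {
  void* addr;
  size_t len;
};

struct Arena {
  std::mutex mLock;
  // In the real allocator this would come from walking chunk page maps; here
  // it is simply a recorded list of dirty ranges.
  std::vector<DirtyRange> mDirty;
};

void PurgeDeferred(Arena& aArena) {
  std::vector<DirtyRange> toPurge;
  {
    std::lock_guard<std::mutex> guard(aArena.mLock);
    // Phase 1 (locked): detach the dirty ranges from the arena's bookkeeping.
    toPurge = std::move(aArena.mDirty);
    aArena.mDirty.clear();
  }
  // Phase 2 (unlocked): the slow system calls no longer block other threads
  // that need this arena's lock.
  for (const DirtyRange& range : toPurge) {
    madvise(range.addr, range.len, MADV_DONTNEED);
  }
}
```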
Comment 13•6 years ago
Can you try removing https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#211-215, replacing #ifdef XP_DARWIN with #if defined(XP_DARWIN) || defined(XP_LINUX) on https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#170, and getting a new profile?
Comment 14•6 years ago
I don't think this is a JS engine bug anymore. I think that aspect of it was fixed by bug 1489572.
The new idea in comment 11 should probably be a new bug, but for now I'm just moving this one -> Core :: Memory Allocator.
Comment 16•5 months ago
Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13
Assignee
Comment 17•4 months ago
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #16)
> Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13
Thanks. I've already worked on some things that reduce contention in jemalloc (Bug 1809610), but without very significant changes decommit still has to be done with the lock held. However, Bug 1814815, Bug 1898646 and Bug 1843997 could all help reduce the impact.
Assignee
Comment 18•2 months ago
I've been working on some patches to perform the system calls outside of the arena lock.
Assignee
Comment 19•2 months ago
Assignee
Comment 20•2 months ago
Assignee
Comment 21•2 months ago
Purge() will now take the lock itself. It must not be called with the lock held, so this patch defers calls to Purge() until after the lock is released.
Assignee
Comment 22•2 months ago
A future patch will lock only the first and third phases. Until then this
change will make the next few patches easier to read.
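For orientation, a hypothetical outline of what locking only the first and third phases of a purge could look like; the Run type and helper functions are stand-ins for mozjemalloc internals, not the actual patch.

```cpp
// Hypothetical outline: only the first and third phases take the arena lock,
// and the system call in between runs unlocked. The Run type and the helper
// functions are stand-ins for mozjemalloc internals, not the actual code.
#include <sys/mman.h>

#include <cstddef>
#include <mutex>

struct Run {
  void* addr = nullptr;
  size_t len = 0;
};

// Stubs standing in for the allocator's real bookkeeping.
static bool FindDirtyRun(Run*) { return false; }
static void MarkRunBusy(const Run&) {}
static void AccountPurged(const Run&) {}
static void MarkRunAvailable(const Run&) {}

bool PurgeOneRun(std::mutex& aArenaLock) {
  Run run;
  {
    // Phase 1 (locked): pick a dirty run and mark it busy so no other thread
    // allocates from it or purges it while we work on it.
    std::lock_guard<std::mutex> guard(aArenaLock);
    if (!FindDirtyRun(&run)) {
      return false;
    }
    MarkRunBusy(run);
  }

  // Phase 2 (unlocked): the expensive system call no longer blocks other
  // threads that need this arena's lock.
  madvise(run.addr, run.len, MADV_DONTNEED);

  {
    // Phase 3 (locked): update dirty-page accounting and make the run
    // available again.
    std::lock_guard<std::mutex> guard(aArenaLock);
    AccountPurged(run);
    MarkRunAvailable(run);
  }
  return true;
}
```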
Assignee
Comment 23•2 months ago
- Rename DeallocChunk to PrepareRemoveChunk
- Move chunk removal code into a new function, RemoveChunk
Assignee
Comment 24•2 months ago
Assignee
Comment 25•2 months ago
- Add an mDying flag to Chunks
- After Purge(), if a chunk is dying, don't add it to structures such as mChunksMadvised
- Add an mHasBusyPages flag to Chunks
- Don't delete busy chunks
- Instead, in Purge(), if a chunk has mDying set then Purge is responsible for finishing the deletion.
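A hypothetical illustration of the flag protocol described above; Chunk and the two helpers are simplified stand-ins, not the real mozjemalloc types.

```cpp
// Hypothetical illustration of the mDying / mHasBusyPages protocol. Chunk and
// these helpers are simplified stand-ins for the real mozjemalloc types.
struct Chunk {
  bool mDying = false;         // Deletion requested while a purge was running.
  bool mHasBusyPages = false;  // Some pages are currently being purged.
};

// Called (with the arena lock held) when the chunk becomes empty.
// Returns true if the caller may actually delete it now.
bool PrepareRemoveChunk(Chunk& aChunk) {
  if (aChunk.mHasBusyPages) {
    // A purge is in flight: flag the chunk and let the purging thread finish
    // the deletion when it is done.
    aChunk.mDying = true;
    return false;
  }
  return true;
}

// Called (with the arena lock held) by the purging thread once its pages are
// no longer busy. Returns true if this thread must now delete the chunk.
bool FinishPurge(Chunk& aChunk) {
  aChunk.mHasBusyPages = false;
  if (aChunk.mDying) {
    // Don't re-add the chunk to structures such as mChunksMadvised; it is
    // going away.
    return true;
  }
  return false;
}
```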
Assignee
Comment 26•2 months ago
The easiest way to prevent multiple threads from trying to purge the same memory is to remove the chunk being purged from the list of dirty chunks. It can be re-added later if it has any dirty pages remaining.
Although this may prevent another thread from purging some memory, at least immediately, it will make the following patches easier to implement.
This also means that Purge() may not find a chunk to purge memory from even if there is dirty memory.
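A sketch of that claim-by-removal idea, under the same caveat as the earlier sketches: the container and Chunk fields are simplified stand-ins, not the real mozjemalloc code.

```cpp
// Hypothetical sketch of "claim the chunk by taking it off the dirty list".
#include <cstddef>
#include <list>
#include <mutex>

struct Chunk {
  size_t mNumDirtyPages = 0;
};

struct Arena {
  std::mutex mLock;
  std::list<Chunk*> mDirtyChunks;
};

Chunk* ClaimChunkForPurge(Arena& aArena) {
  std::lock_guard<std::mutex> guard(aArena.mLock);
  if (aArena.mDirtyChunks.empty()) {
    // Another thread may already be purging: dirty memory can exist even
    // though no chunk is currently on the list.
    return nullptr;
  }
  Chunk* chunk = aArena.mDirtyChunks.front();
  aArena.mDirtyChunks.pop_front();  // Other threads can no longer pick it.
  return chunk;
}

void ReleaseChunkAfterPurge(Arena& aArena, Chunk* aChunk) {
  std::lock_guard<std::mutex> guard(aArena.mLock);
  if (aChunk->mNumDirtyPages > 0) {
    // Some dirty pages remain (e.g. pages that were busy); make the chunk
    // eligible for purging again.
    aArena.mDirtyChunks.push_back(aChunk);
  }
}
```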
Assignee
Comment 27•2 months ago
Assignee
Comment 28•2 months ago
- Mark runs as busy by marking their first and last page
- Remove busy runs from mAvailRuns during purge
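A hypothetical sketch of marking a run busy via the first and last entries of its chunk's page map; PageMapEntry and the helpers are illustrative only.

```cpp
// Hypothetical sketch: a run is marked busy by flagging its first and last
// page in the chunk's page map, so code that inspects a neighbouring run sees
// the busy bit regardless of which end it arrives from. Illustrative only.
#include <cstddef>

struct PageMapEntry {
  bool mBusy = false;
  bool mDirty = false;
};

void MarkRunBusy(PageMapEntry* aPageMap, size_t aFirst, size_t aLast) {
  aPageMap[aFirst].mBusy = true;
  aPageMap[aLast].mBusy = true;
}

void MarkRunAvailable(PageMapEntry* aPageMap, size_t aFirst, size_t aLast) {
  aPageMap[aFirst].mBusy = false;
  aPageMap[aLast].mBusy = false;
}

// Free-run coalescing only needs to look at the page adjacent to the run
// being freed to notice that the neighbour is being purged.
bool NeighbourIsBusy(const PageMapEntry* aPageMap, size_t aAdjacentPage) {
  return aPageMap[aAdjacentPage].mBusy;
}
```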
Assignee
Comment 29•2 months ago
When a run of pages is freed, don't merge it with adjacent busy runs.
Assignee
Comment 30•2 months ago
Assignee
Comment 31•2 months ago
A chunk may have been busy and unable to be deleted, or a run may have been unable to be merged due to neighbouring busy runs. So when Purge() finishes it must check whether it needs to merge runs or delete the chunk. Note that merging a run may result in one large empty run, which means the chunk must then become the spare chunk, and Purge will delete the previous spare chunk, if any.
Assignee
Comment 32•2 months ago
Assignee
Comment 33•2 months ago
Assignee
Comment 34•2 months ago
Assignee
Comment 35•2 months ago
I'm waiting for test results; when they're done they'll appear here:
The windows11 subtests show the biggest wins: https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=d04985d0c83e9f10d6f4a606eaeb28dca3c5e55c&originalSignature=5131097&newSignature=5131097&framework=13&application=firefox&originalRevision=e9a39c5993dc5d6deb267d05a376498102bd59ed&page=1&showOnlyConfident=1
Edit: Added correct URLs.
Assignee
Comment 36•2 months ago
Assignee
Comment 37•2 months ago