Open Bug 1488780 Opened 6 years ago Updated 9 days ago

decommit can be slow while holding the allocator lock

Categories

(Core :: Memory Allocator, enhancement, P1)

enhancement

Tracking

()

ASSIGNED
Performance Impact medium
Tracking Status
firefox64 --- fix-optional
firefox65 --- fix-optional

People

(Reporter: tcampbell, Assigned: pbone)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [sp3])

Attachments

(17 files, 1 obsolete file)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
We could experiment with using separate jemalloc arenas for LIFO allocations on helper threads.
Assignee: tcampbell → nobody
Whiteboard: [qf:p1:64]
(In reply to Jon Coppeard (:jonco) from comment #1)
> We could experiment with using separate jemalloc arenas for LIFO allocations
> on helper threads.

This seems like a good idea. Helper threads sharing an arena is probably fine since they are by definition background work.
Whiteboard: [qf:p1:64] → [qf:p1:f64]
Another idea is to use multiple malloc heaps on the main thread and cycle between them so that when we free things on the background thread we don't content on the heap lock.  Background free happens separately for malloc allocations associated with nursery and tenured objects.
A further idea would be to make jemalloc use more finegrained locking.  At the moment I think it's pretty coarse.
Jon, on this profile it seems pretty severe. I see 10% of mainthread time unless I'm reading this wrong. Pushing back to qf:67 is fine if we think the fix is more experimental. Thoughts?
Flags: needinfo?(jcoppeard)
I wasn't able to reproduce that degree of contention (I only saw ~1%), but it would be great to improve the situation around malloc locking.

I can't commit to working on this right now though.
Flags: needinfo?(jcoppeard)
Based on recent discussion with Jon and Ted, we decided to move the QF target to qf:p1:f67.
Whiteboard: [qf:p1:f64] → [qf:p1:f67]
FYI, I still see 10% in google docs:

We still experience (at least on my dual-xeon machine) ~10% jemalloc lock contention. See https://perfht.ml/2N0lzNn - during keypresses, it's even higher - https://perfht.ml/2ztMEVS with several long spins of up to 60ms while one of the JS Helpers does a LifAlloc::freeAll() (mostly in the madvise()).  Perhaps we should bump the priority of this, unless it's specific to high-core-count machines.

Note this profile is a "fast" profile where keypress time is <100ms.
Can we just use the same thread-local arena stuff that we wound up using for Stylo threads?  Or at least see if that makes any difference?
Whiteboard: [qf:p1:f67] → [qf:p2]

(In reply to Randell Jesup [:jesup] from comment #8)

We still experience (at least on my dual-xeon machine) ~10% jemalloc lock
contention. See https://perfht.ml/2N0lzNn - during keypresses, it's even
higher - https://perfht.ml/2ztMEVS with several long spins of up to 60ms
while one of the JS Helpers does a LifAlloc::freeAll() (mostly in the
madvise()).

Can we re-evaluate this observation since Bug 1489572 landed and should have highly reduced the number of these allocations.

Flags: needinfo?(rjesup)

I was looking into this this week and found that when you free memory jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.

Using a separate arena for JIT data doesn't stop this happening unfortunately because the initial allocations (IonBuilder) are done on the main thread. I guess it might help as at least it wouldn't conflict with other kinds of allocation.

Ideally this decommit process would be incremental and would release the lock while calling into the OS.

(In reply to Jon Coppeard (:jonco) from comment #11)

I was looking into this this week and found that when you free memory jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.

Using a separate arena for JIT data doesn't stop this happening unfortunately because the initial allocations (IonBuilder) are done on the main thread. I guess it might help as at least it wouldn't conflict with other kinds of allocation.

Ideally this decommit process would be incremental and would release the lock while calling into the OS.

glandium - any thoughts on if jemalloc could background the decommits or at least move them out from under the lock, so it would only block the thread and not other uses of jemalloc on other threads?

The profile from above (https://perfht.ml/2TH2ovO focused on an madvise() in the top js helper thread) has LifoAlloc::freeAll() -> arena_dalloc() -> Purge() (likely some others inlined there and thus missing, like DallocSmall()), -> madvise(). Perhaps Purge could move segments to a to-be-freed list, and then process them at the end of arena_dalloc() after dropping the lock? Or even fancier, a separate thread to free segments afterwards - but that's probably overkill, and might have some negative ramifications in load situations.

Flags: needinfo?(rjesup) → needinfo?(mh+mozilla)

Can you try removing https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#211-215 and replace #ifdef XP_DARWIN with #if defined(XP_DARWIN) || defined(XP_LINUX) on https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#170 and get a new profile?

Flags: needinfo?(mh+mozilla)
Flags: needinfo?(rjesup)

I don't think this is a JS engine bug anymore. I think that aspect of it was fixed by bug 1489572.

The new idea in comment 11 should probably be a new bug, but for now I'm just moving this one -> Core :: Memory Allocator.

Component: JavaScript Engine → Memory Allocator
Priority: P1 → --
Flags: needinfo?(rjesup)
Priority: -- → P3
Summary: Lots of jemalloc lock contention during gdocs startup → decommit can be slow while holding the allocator lock

Jesup, can you try out comment 13?

Flags: needinfo?(rjesup)
Performance Impact: --- → P2
Whiteboard: [qf:p2]
Severity: normal → S3

Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13

Flags: needinfo?(rjesup) → needinfo?(pbone)

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #16)

Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13

Thanks, I worked on some things that reduce contention in jemalloc already (Bug 1809610),
but without very significant changes decommit has to be done with the lock held. However Bug 1814815, Bug 1898646 and Bug 1843997 could all help reduce the impact.

Depends on: 1898646, 1843997, 1814815
Flags: needinfo?(pbone)

I've been working on some patches to perform the system calls outside of the arena lock.

Assignee: nobody → pbone
Blocks: 1809610
Status: NEW → ASSIGNED
Type: defect → enhancement
No longer depends on: 1814815, 1843997, 1898646
Priority: P3 → P1

Purge() will now take the lock itself. It must not be called with the lock
held and therefore this patch moves calls to Purge until after the lock is
released.

A future patch will lock only the first and third phases. Until then this
change will make the next few patches easier to read.

  • Rename DeallocChunk to PrepareRemoveChunk,
  • Move chunk removral code into a new function RemoveChunk
  • Add a mDying flag to Chunks
  • After Purge(), if a chunk is dying don't add it to structures such as
    mChunksMadvised
  • Add a mHasBusyPages flag to Chunks
  • Don't delete busy chunks
  • Instead in Purge() if a chunk has a mDying set then Purge is responsible
    for finishing the deletion.

The easiest way to avoid multiple threads from trying to purge the same
memory is to remove the chunk being purged from the list of dirty chunks.
It can be re-added later if it has any dirty pages remaining.

Although this may prevent another thread from purging some memory, at least
immediately, it will make following patches easier to implement.

This also means that Purge() may not find a chunk to purge memory from even
if there is dirty memory.

  • Mark runs as busy by marking their first and last page
  • Remove busy runs from mAvailRuns during purge
Whiteboard: [sp3]
Attachment #9414081 - Attachment description: WIP: Bug 1488780 - pt 1. Refactoring in arena_t::Purge → Bug 1488780 - pt 1. Refactoring in arena_t::Purge r=glandium
Attachment #9414082 - Attachment description: WIP: Bug 1488780 - pt 2. Make Purge() a one-shot operation → Bug 1488780 - pt 2. Make Purge() a one-shot operation r=glandium
Attachment #9414083 - Attachment description: WIP: Bug 1488780 - pt 3. Move Purge() out of the arena lock → Bug 1488780 - pt 3. Move Purge() out of the arena lock r=glandium
Attachment #9414084 - Attachment description: WIP: Bug 1488780 - pt 4. Split Pruge() into three phases → Bug 1488780 - pt 4. Split Pruge() into three phases r=glandium

When a run of pages is freed don't merge it with adjacent busy runs

A chunk may have been busy and unable to be deleted or a run may have been
unable to be merged due to neighouring busy runs. So when Purge() finishes
it must check if it needs to merge runs or delete the chunk. Note that
merging a run may result in one large empty run which means the chunk must
then become the spare chunk and purge will delete the previous spare chunk -
if any.

Attachment #9414085 - Attachment description: WIP: Bug 1488780 - pt 5. Refactor DeallocChunk → Bug 1488780 - pt 5. Refactor DeallocChunk r=glandium
Attachment #9414086 - Attachment description: WIP: Bug 1488780 - pt 6. Don't insert then remove an empty chunk's run → Bug 1488780 - pt 6. Don't insert then remove an empty chunk's run r=glandium
Attachment #9414087 - Attachment description: WIP: Bug 1488780 - pt 7. Make Purge() and RemoveChunk() safe to run concurrently → Bug 1488780 - pt 7. Make Purge() and RemoveChunk() safe to run concurrently r=glandium
Attachment #9414088 - Attachment description: WIP: Bug 1488780 - pt 8. Remove busy chunks from dirty list during purge → Bug 1488780 - pt 8. Remove busy chunks from dirty list during purge r=glandium
Attachment #9414089 - Attachment description: WIP: Bug 1488780 - pt 9. Mark pages busy during the purge operation → Bug 1488780 - pt 9. Mark pages busy during the purge operation r=glandium
Attachment #9414090 - Attachment description: WIP: Bug 1488780 - pt 10. Don't allocate runs with busy pages → Bug 1488780 - pt 10. Don't allocate runs with busy pages r=glandium
Attachment #9414274 - Attachment description: WIP: Bug 1488780 - pt 11. Don't merge busy runs → Bug 1488780 - pt 11. Don't merge busy runs r=glandium
Attachment #9414275 - Attachment description: WIP: Bug 1488780 - pt 12. Move Coaleascing code out of DallocRun → Bug 1488780 - pt 12. Move Coaleascing code out of DallocRun r=glandium
Attachment #9414276 - Attachment description: WIP: Bug 1488780 - pt 13. Coalesce runs and delete chunks in Purge() → Bug 1488780 - pt 13. Coalesce runs and delete chunks in Purge() r=glandium
Attachment #9414277 - Attachment description: WIP: Bug 1488780 - pt 14. Don't realloc into busy memory → Bug 1488780 - pt 14. Don't realloc into busy memory r=glandium
Attachment #9414278 - Attachment description: WIP: Bug 1488780 - pt 15. Only allocate into the spare chunk if it's not busy → Bug 1488780 - pt 15. Only allocate into the spare chunk if it's not busy r=glandium
Attachment #9414279 - Attachment description: WIP: Bug 1488780 - pt 16. Unlock for decommit → Bug 1488780 - pt 16. Unlock for decommit r=glandium
Attachment #9414081 - Attachment is obsolete: true
See Also: → 1903758
Attachment #9414084 - Attachment description: Bug 1488780 - pt 4. Split Pruge() into three phases r=glandium → Bug 1488780 - pt 4. Split Purge() into three phases r=glandium
Blocks: 1913753
See Also: → 1916803
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: