decommit can be slow while holding the allocator lock
Categories
(Core :: Memory Allocator, enhancement, P1)
Tracking
Performance Impact: medium

| | Tracking | Status |
| --- | --- | --- |
| firefox64 | --- | fix-optional |
| firefox65 | --- | fix-optional |
People
(Reporter: tcampbell, Assigned: pbone)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [sp3])
Attachments
(17 files, 1 obsolete file)
17 × 48 bytes, text/x-phabricator-request
In the profile from https://bugzilla.mozilla.org/show_bug.cgi?id=1488435:

(In reply to Randell Jesup [:jesup] from comment #3)
> https://docs.google.com/document/d/1CwPnZ7NnzzIt_UjqhaESbgWyI8VRDDzUQA_le9qaeG0/edit?usp=sharing

https://perf-html.io/public/af86f00f11f7c7719ed6018a24dab7a86ecc507b/calltree/?globalTrackOrder=0-1-2-3-4&hiddenGlobalTracks=0-1-2-3&hiddenLocalTracksByPid=2573-0&invertCallstack&localTrackOrderByPid=2573-0~&range=3.3981_8.3281~4.3286_7.4642&thread=5&v=3

Here, 10% of the time is spent spinning on a lock. This seems to be contention due to off-thread parsing and Ion compilation.
Comment 1•6 years ago
We could experiment with using separate jemalloc arenas for LIFO allocations on helper threads.
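As a rough illustration of what that could look like, here is a sketch only: it assumes the explicit-arena entry points declared in memory/build/mozmemory.h (moz_create_arena, moz_arena_malloc, with signatures approximated from memory), and HelperThreadArena/AllocLifoChunk/FreeLifoChunk are hypothetical wrappers, not existing functions.

```cpp
// Sketch only. Assumes the explicit-arena API from memory/build/mozmemory.h
// (moz_create_arena, moz_arena_malloc); signatures approximated from memory.
// HelperThreadArena/AllocLifoChunk/FreeLifoChunk are hypothetical wrappers.
#include <cstdlib>

#include "mozmemory.h"

static arena_id_t HelperThreadArena() {
  // One arena shared by all JS helper threads, created lazily. Allocations
  // routed here no longer contend on the main thread's arena lock.
  static arena_id_t sArena = moz_create_arena();
  return sArena;
}

void* AllocLifoChunk(size_t aSize) {
  // Helper-thread LifoAlloc chunk allocations would call this instead of
  // plain malloc().
  return moz_arena_malloc(HelperThreadArena(), aSize);
}

void FreeLifoChunk(void* aPtr) {
  // mozjemalloc routes a free() back to the arena the pointer came from, so
  // plain free() should be sufficient here.
  free(aPtr);
}
```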
Reporter
Comment 2•6 years ago
(In reply to Jon Coppeard (:jonco) from comment #1) > We could experiment with using separate jemalloc arenas for LIFO allocations > on helper threads. This seems like a good idea. Helper threads sharing an arena is probably fine since they are by definition background work.
Comment 3•6 years ago
Another idea is to use multiple malloc heaps on the main thread and cycle between them, so that when we free things on the background thread we don't contend on the heap lock. Background free happens separately for malloc allocations associated with nursery and tenured objects.
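A hypothetical sketch of that cycling scheme, reusing the same assumed moz_create_arena/moz_arena_malloc entry points as the earlier sketch; all names and the rotation trigger here are purely illustrative:

```cpp
// Hypothetical sketch of cycling between several main-thread arenas, reusing
// the assumed moz_create_arena / moz_arena_malloc API from the sketch above.
// All names here are illustrative.
#include <cstddef>

#include "mozmemory.h"

static constexpr size_t kMainThreadArenas = 4;
static arena_id_t sMainThreadArenas[kMainThreadArenas];
static size_t sCurrent = 0;  // Only touched on the main thread.

void InitMainThreadArenas() {
  for (auto& id : sMainThreadArenas) {
    id = moz_create_arena();
  }
}

// Called at a natural boundary, e.g. when a batch of malloc'd blocks is handed
// to the background-free thread, so that new main-thread allocations stop
// touching the arena whose lock the background thread is about to take.
void RotateMainThreadArena() { sCurrent = (sCurrent + 1) % kMainThreadArenas; }

void* MainThreadMalloc(size_t aSize) {
  return moz_arena_malloc(sMainThreadArenas[sCurrent], aSize);
}
```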
Comment 4•6 years ago
A further idea would be to make jemalloc use more fine-grained locking. At the moment I think it's pretty coarse.
Reporter
Comment 5•6 years ago
Jon, on this profile it seems pretty severe. I see 10% of main-thread time, unless I'm reading this wrong. Pushing back to qf:67 is fine if we think the fix is more experimental. Thoughts?
Comment 6•6 years ago
I wasn't able to reproduce that degree of contention (I only saw ~1%), but it would be great to improve the situation around malloc locking. I can't commit to working on this right now though.
Comment 7•6 years ago
Based on recent discussion with Jon and Ted, we decided to move the QF target to qf:p1:f67.
Comment 8•6 years ago
FYI, I still see 10% in Google Docs: we still experience (at least on my dual-Xeon machine) ~10% jemalloc lock contention. See https://perfht.ml/2N0lzNn - during keypresses it's even higher - https://perfht.ml/2ztMEVS - with several long spins of up to 60ms while one of the JS Helpers does a LifoAlloc::freeAll() (mostly in the madvise()). Perhaps we should bump the priority of this, unless it's specific to high-core-count machines. Note this profile is a "fast" profile where keypress time is <100ms.
Comment 9•6 years ago
Can we just use the same thread-local arena stuff that we wound up using for Stylo threads? Or at least see if that makes any difference?
Comment 10•6 years ago
(In reply to Randell Jesup [:jesup] from comment #8)
> We still experience (at least on my dual-Xeon machine) ~10% jemalloc lock
> contention. See https://perfht.ml/2N0lzNn - during keypresses it's even
> higher - https://perfht.ml/2ztMEVS - with several long spins of up to 60ms
> while one of the JS Helpers does a LifoAlloc::freeAll() (mostly in the
> madvise()).
Can we re-evaluate this observation now that Bug 1489572 has landed? It should have greatly reduced the number of these allocations.
Comment 11•6 years ago
I was looking into this this week and found that when you free memory, jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.
Using a separate arena for JIT data doesn't stop this from happening, unfortunately, because the initial allocations (IonBuilder) are done on the main thread. It might still help, though, as at least it wouldn't conflict with other kinds of allocation.
Ideally this decommit process would be incremental and would release the lock while calling into the OS.
Comment 12•6 years ago
(In reply to Jon Coppeard (:jonco) from comment #11)
> I was looking into this this week and found that when you free memory, jemalloc may decide to decommit/madvise all the dirty pages for an arena in one go, with the lock held. This can be slow.
> Using a separate arena for JIT data doesn't stop this from happening, unfortunately, because the initial allocations (IonBuilder) are done on the main thread. It might still help, though, as at least it wouldn't conflict with other kinds of allocation.
> Ideally this decommit process would be incremental and would release the lock while calling into the OS.
glandium - any thoughts on whether jemalloc could background the decommits, or at least move them out from under the lock, so they would only block the calling thread and not other uses of jemalloc on other threads?
The profile from above (https://perfht.ml/2TH2ovO focused on an madvise() in the top JS helper thread) has LifoAlloc::freeAll() -> arena_dalloc() -> Purge() (likely with some others inlined there and thus missing, like DallocSmall()) -> madvise(). Perhaps Purge could move segments to a to-be-freed list, and then process them at the end of arena_dalloc() after dropping the lock? Or even fancier, a separate thread could free segments afterwards - but that's probably overkill, and might have some negative ramifications in load situations.
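A minimal sketch of that suggestion: gather the ranges under the lock, then issue the madvise() calls after dropping it. Arena, DirtyRange and PurgeDeferred are simplified stand-ins, not the real mozjemalloc structures.

```cpp
// Illustrative sketch only: collect the ranges to purge while the arena lock
// is held, then call madvise() after releasing it. These types are
// hypothetical stand-ins for mozjemalloc internals.
#include <sys/mman.h>

#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

struct DirtyRange {
  void* addr;
  size_t len;
};

struct Arena {
  std::mutex mLock;
  // In the real allocator this would come from walking chunk page maps; here
  // it is simply a recorded list of dirty ranges.
  std::vector<DirtyRange> mDirty;
};

void PurgeDeferred(Arena& aArena) {
  std::vector<DirtyRange> toPurge;
  {
    std::lock_guard<std::mutex> guard(aArena.mLock);
    // Phase 1 (locked): detach the dirty ranges from the arena's bookkeeping.
    toPurge = std::move(aArena.mDirty);
    aArena.mDirty.clear();
  }
  // Phase 2 (unlocked): the slow system calls no longer block other threads
  // that need this arena's lock.
  for (const DirtyRange& range : toPurge) {
    madvise(range.addr, range.len, MADV_DONTNEED);
  }
}
```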
Comment 13•6 years ago
Can you try removing https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#211-215, replacing #ifdef XP_DARWIN with #if defined(XP_DARWIN) || defined(XP_LINUX) on https://dxr.mozilla.org/mozilla-central/source/memory/build/mozjemalloc.cpp#170, and getting a new profile?
Comment 14•6 years ago
I don't think this is a JS engine bug anymore. I think that aspect of it was fixed by bug 1489572.
The new idea in comment 11 should probably be a new bug, but for now I'm just moving this one -> Core :: Memory Allocator.
Comment 16•5 months ago
Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13
Assignee
Comment 17•4 months ago
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #16)
> Paul - you might want to take a look at this, and see if it's still an issue or still applies. See comment 13
Thanks. I've already worked on some things that reduce contention in jemalloc (Bug 1809610), but without very significant changes decommit still has to be done with the lock held. However, Bug 1814815, Bug 1898646 and Bug 1843997 could all help reduce the impact.
Assignee
Comment 18•2 months ago
I've been working on some patches to perform the system calls outside of the arena lock.
Assignee
Comment 19•2 months ago
Assignee
Comment 20•2 months ago
Assignee
Comment 21•2 months ago
Purge() will now take the lock itself. It must not be called with the lock held, so this patch defers calls to Purge() until after the lock is released.
Assignee
Comment 22•2 months ago
A future patch will lock only the first and third phases. Until then this
change will make the next few patches easier to read.
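For orientation, a hypothetical outline of what locking only the first and third phases of a purge could look like; the Run type and helper functions are stand-ins for mozjemalloc internals, not the actual patch.

```cpp
// Hypothetical outline: only the first and third phases take the arena lock,
// and the system call in between runs unlocked. The Run type and the helper
// functions are stand-ins for mozjemalloc internals, not the actual code.
#include <sys/mman.h>

#include <cstddef>
#include <mutex>

struct Run {
  void* addr = nullptr;
  size_t len = 0;
};

// Stubs standing in for the allocator's real bookkeeping.
static bool FindDirtyRun(Run*) { return false; }
static void MarkRunBusy(const Run&) {}
static void AccountPurged(const Run&) {}
static void MarkRunAvailable(const Run&) {}

bool PurgeOneRun(std::mutex& aArenaLock) {
  Run run;
  {
    // Phase 1 (locked): pick a dirty run and mark it busy so no other thread
    // allocates from it or purges it while we work on it.
    std::lock_guard<std::mutex> guard(aArenaLock);
    if (!FindDirtyRun(&run)) {
      return false;
    }
    MarkRunBusy(run);
  }

  // Phase 2 (unlocked): the expensive system call no longer blocks other
  // threads that need this arena's lock.
  madvise(run.addr, run.len, MADV_DONTNEED);

  {
    // Phase 3 (locked): update dirty-page accounting and make the run
    // available again.
    std::lock_guard<std::mutex> guard(aArenaLock);
    AccountPurged(run);
    MarkRunAvailable(run);
  }
  return true;
}
```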
Assignee
Comment 23•2 months ago
- Rename DeallocChunk to PrepareRemoveChunk
- Move chunk removal code into a new function, RemoveChunk
Assignee
Comment 24•2 months ago
Assignee
Comment 25•2 months ago
- Add an mDying flag to Chunks
- After Purge(), if a chunk is dying, don't add it to structures such as mChunksMadvised
- Add an mHasBusyPages flag to Chunks
- Don't delete busy chunks
- Instead, in Purge(), if a chunk has mDying set then Purge is responsible for finishing the deletion.
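A hypothetical illustration of the flag protocol described above; Chunk and the two helpers are simplified stand-ins, not the real mozjemalloc types.

```cpp
// Hypothetical illustration of the mDying / mHasBusyPages protocol. Chunk and
// these helpers are simplified stand-ins for the real mozjemalloc types.
struct Chunk {
  bool mDying = false;         // Deletion requested while a purge was running.
  bool mHasBusyPages = false;  // Some pages are currently being purged.
};

// Called (with the arena lock held) when the chunk becomes empty.
// Returns true if the caller may actually delete it now.
bool PrepareRemoveChunk(Chunk& aChunk) {
  if (aChunk.mHasBusyPages) {
    // A purge is in flight: flag the chunk and let the purging thread finish
    // the deletion when it is done.
    aChunk.mDying = true;
    return false;
  }
  return true;
}

// Called (with the arena lock held) by the purging thread once its pages are
// no longer busy. Returns true if this thread must now delete the chunk.
bool FinishPurge(Chunk& aChunk) {
  aChunk.mHasBusyPages = false;
  if (aChunk.mDying) {
    // Don't re-add the chunk to structures such as mChunksMadvised; it is
    // going away.
    return true;
  }
  return false;
}
```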
Assignee
Comment 26•2 months ago
The easiest way to prevent multiple threads from trying to purge the same memory is to remove the chunk being purged from the list of dirty chunks. It can be re-added later if it has any dirty pages remaining.
Although this may prevent another thread from purging some memory, at least immediately, it will make the following patches easier to implement.
This also means that Purge() may not find a chunk to purge memory from even if there is dirty memory.
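A sketch of that claim-by-removal idea, under the same caveat as the earlier sketches: the container and Chunk fields are simplified stand-ins, not the real mozjemalloc code.

```cpp
// Hypothetical sketch of "claim the chunk by taking it off the dirty list".
#include <cstddef>
#include <list>
#include <mutex>

struct Chunk {
  size_t mNumDirtyPages = 0;
};

struct Arena {
  std::mutex mLock;
  std::list<Chunk*> mDirtyChunks;
};

Chunk* ClaimChunkForPurge(Arena& aArena) {
  std::lock_guard<std::mutex> guard(aArena.mLock);
  if (aArena.mDirtyChunks.empty()) {
    // Another thread may already be purging: dirty memory can exist even
    // though no chunk is currently on the list.
    return nullptr;
  }
  Chunk* chunk = aArena.mDirtyChunks.front();
  aArena.mDirtyChunks.pop_front();  // Other threads can no longer pick it.
  return chunk;
}

void ReleaseChunkAfterPurge(Arena& aArena, Chunk* aChunk) {
  std::lock_guard<std::mutex> guard(aArena.mLock);
  if (aChunk->mNumDirtyPages > 0) {
    // Some dirty pages remain (e.g. pages that were busy); make the chunk
    // eligible for purging again.
    aArena.mDirtyChunks.push_back(aChunk);
  }
}
```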
Assignee
Comment 27•2 months ago
Assignee
Comment 28•2 months ago
- Mark runs as busy by marking their first and last page
- Remove busy runs from mAvailRuns during purge
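A hypothetical sketch of marking a run busy via the first and last entries of its chunk's page map; PageMapEntry and the helpers are illustrative only.

```cpp
// Hypothetical sketch: a run is marked busy by flagging its first and last
// page in the chunk's page map, so code that inspects a neighbouring run sees
// the busy bit regardless of which end it arrives from. Illustrative only.
#include <cstddef>

struct PageMapEntry {
  bool mBusy = false;
  bool mDirty = false;
};

void MarkRunBusy(PageMapEntry* aPageMap, size_t aFirst, size_t aLast) {
  aPageMap[aFirst].mBusy = true;
  aPageMap[aLast].mBusy = true;
}

void MarkRunAvailable(PageMapEntry* aPageMap, size_t aFirst, size_t aLast) {
  aPageMap[aFirst].mBusy = false;
  aPageMap[aLast].mBusy = false;
}

// Free-run coalescing only needs to look at the page adjacent to the run
// being freed to notice that the neighbour is being purged.
bool NeighbourIsBusy(const PageMapEntry* aPageMap, size_t aAdjacentPage) {
  return aPageMap[aAdjacentPage].mBusy;
}
```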
Assignee
Comment 29•2 months ago
When a run of pages is freed, don't merge it with adjacent busy runs.
Assignee
Comment 30•2 months ago
Assignee
Comment 31•2 months ago
A chunk may have been busy and unable to be deleted, or a run may have been unable to be merged due to neighbouring busy runs. So when Purge() finishes it must check whether it needs to merge runs or delete the chunk. Note that merging a run may result in one large empty run, which means the chunk must then become the spare chunk, and Purge will delete the previous spare chunk, if any.
Assignee
Comment 32•2 months ago
Assignee
Comment 33•2 months ago
Assignee
Comment 34•2 months ago
Assignee
Comment 35•2 months ago
I'm waiting for test results; when they're done they'll appear here:
The windows11 subtests show the biggest wins: https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=d04985d0c83e9f10d6f4a606eaeb28dca3c5e55c&originalSignature=5131097&newSignature=5131097&framework=13&application=firefox&originalRevision=e9a39c5993dc5d6deb267d05a376498102bd59ed&page=1&showOnlyConfident=1
Edit: Added correct URLs.
Assignee
Comment 36•2 months ago
Assignee
Comment 37•2 months ago