Consider adding batch allocation functions
Categories
(Core :: Memory Allocator, enhancement)
People
(Reporter: pbone, Assigned: pbone)
References
(Blocks 4 open bugs)
Details
(Whiteboard: [sp3])
It'd be nice to amortise some allocation costs when allocating batches of objects: to be able to ask jemalloc "please allocate 25 objects of 32 bytes each" and have it amortise the lookup and locking costs across the batch.
Freeing objects in the same way could also be beneficial; we should consider what invariants we want, such as:
- Do we allow objects that weren't batch allocated to be batch freed? Can we mix objects from different batches?
- Can objects be batch freed across arenas? Different sizes?
Bug 1843932 may have a motivating test case.
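A minimal sketch of what such a batch-allocation entry point could look like. The name, signature, and the `malloc()` stand-in are all hypothetical, not a real mozjemalloc API; a real implementation would take the arena/bin lock once per batch and carve the objects out of the same run.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical API: fill aOut with aCount pointers to aSize-byte
 * objects and return how many were actually allocated. Plain malloc()
 * stands in for the amortised allocation path. */
static size_t alloc_batch(size_t aSize, size_t aCount, void **aOut) {
  size_t i;
  for (i = 0; i < aCount; i++) {
    aOut[i] = malloc(aSize);
    if (!aOut[i]) {
      break; /* partial batches are possible under memory pressure */
    }
  }
  return i;
}
```

The "25 objects of 32 bytes each" request from the description would then be `alloc_batch(32, 25, ptrs)`.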
Comment 1•1 year ago
Continuing discussion from Bug 1845130 comment 8.
(In reply to Emilio Cobos Álvarez (:emilio) from comment #8)
I'd be happy to attend such a call of course.
we could make jemalloc free things quickly if we knew they all belonged to the same run, bin, and arena. But would that restriction be so severe that such a batch API isn't useful? So what rules for a batch API give us the right amount of optimisation?
That seems hard to guarantee from style code at least. Things are usually allocated from the worker threads, which have thread-local arenas, but we don't keep around which worker thread computed the style (and even so we might share some allocations across threads).
That sounds almost as if jemalloc had some kind of visibility into a "free/not free" status of objects and could process them all in a batch. Like mark bits in a GC. Blending normal allocators and some GC-ish features (mark bits, or refcounts) is pretty exciting.
Kinda? I was arguably thinking of something a bit more dumb (just record the free calls, then do them all together, potentially more efficiently, at some point later). I don't think I have a strong opinion on how it's implemented though...
Based on this I think we can do this asynchronously.
For batched frees it sounds like we want no constraints on how the objects were allocated: they can come from different arenas and be of different allocation classes and sizes. What jemalloc can do is sort by address, which will allow it to amortise some lookups/locks. It'll know that if a sequence of addresses is in the same chunk then they belong to the same arena, and if they're in the same "run" then we only need to look up the run/bin information once. Something like:
void free_batch(void **aAddressArray, size_t aNumAddresses);
The array is allocated by the caller and may be modified by the callee. When the call terminates every non-null pointer within the array is freed.
We could do batched allocation too but in that case I would say all the resulting allocations belong to the same arena and are the same size.
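A minimal sketch of the shape described above, with plain `free()` standing in for the per-pointer work. The sort is the proposed optimisation: a real implementation would reuse the chunk/arena/run/bin lookup (and the associated lock) across runs of adjacent addresses, which this sketch does not attempt.

```c
#include <stdlib.h>

/* Order pointers by address so adjacent entries are likely to share a
 * chunk/run in a real allocator. */
static int ptr_cmp(const void *a, const void *b) {
  const void *pa = *(void *const *)a;
  const void *pb = *(void *const *)b;
  return (pa > pb) - (pa < pb);
}

/* Sketch of the proposed API: the caller owns the array, the callee may
 * modify it, and every non-null pointer is freed by the time it returns. */
void free_batch(void **aAddressArray, size_t aNumAddresses) {
  qsort(aAddressArray, aNumAddresses, sizeof(void *), ptr_cmp);
  for (size_t i = 0; i < aNumAddresses; i++) {
    if (aAddressArray[i]) { /* nulls are permitted and skipped */
      free(aAddressArray[i]);
      aAddressArray[i] = NULL; /* record that the slot was consumed */
    }
  }
}
```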
Comment 2•1 year ago
I started working on a patch that added the batching code to servo's Arc implementation, i.e. the code that collects the pointers to free before handing them to mozjemalloc. It would work that way, but I've changed my mind: it's better to add this to mozjemalloc itself, for two reasons. First, it's simpler and can capture other pointers from servo that aren't managed by Arc (I don't know if servo has something like C++'s unique_ptr, but if so it'd catch that too). Second, it can be used from other places in Firefox that free things in batches, like the cycle collector.
I'll continue work on it but with this new plan.
Comment 3•1 year ago
For comparison, libpas (WebKit's malloc) uses a fixed-size thread-local deallocation log.
When draining the log it only releases the currently held lock if the next entry needs a different one, which avoids locking and unlocking the same lock over and over again. See process_deallocation_log_with_config.
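A sketch of that lock-batching idea under stated assumptions: the names are mine (not libpas's), an integer "lock id" plus an acquisition counter stand in for a real mutex so the saving is observable, and `free()` stands in for the per-entry deallocation work.

```c
#include <stddef.h>
#include <stdlib.h>

/* One logged deallocation: which lock guards it, and the pointer. */
typedef struct {
  int lock_id;
  void *ptr;
} LogEntry;

/* Drain the log in order, holding each lock across consecutive entries
 * that need it instead of releasing and re-acquiring per entry.
 * Returns the number of lock acquisitions, so the amortisation is
 * visible: equal adjacent lock_ids cost one acquisition total. */
static size_t drain_log(LogEntry *log, size_t n) {
  size_t acquisitions = 0;
  int held = -1; /* -1 means no lock is currently held */
  for (size_t i = 0; i < n; i++) {
    if (log[i].lock_id != held) { /* switch only when a different lock is needed */
      held = log[i].lock_id;
      acquisitions++;
    }
    free(log[i].ptr); /* per-entry deallocation work */
  }
  return acquisitions;
}
```

With five entries guarded by locks `[0, 0, 0, 1, 1]`, draining takes two acquisitions instead of five.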