Investigate lock-free thread-local arenas
Categories: Core :: Memory Allocator, enhancement, P3
People: Reporter: nical, Unassigned
References: Blocks 1 open bug
Whiteboard: [sp3]
The following text is a proposal for improving the performance of allocations and deallocations made in thread-local arenas (TLAs) only.
To my knowledge, only WebRender and Stylo threads opt into TLAs. As a result, the scope of the improvement is limited, but so are the risks (in particular the risk of significantly increasing Firefox's memory usage).
The idea is borrowed from mimalloc (paper) and, I think, adapts well to mozjemalloc's thread-local arenas.
Motivation
At the moment, thread-local arenas (like most arenas in mozjemalloc) are guarded by a mutex. This mutex can be heavily contended (see the Stylo worker thread in this profile, captured while running Speedometer) when allocations follow a "producer-consumer" pattern: allocations happen on the thread owning the thread-local arena, while frees happen on another thread.
The change
- Arenas optionally contain the ID of a thread that has exclusive mutable access to them.
- TLAs store the free bitmap for small runs twice: one copy for local frees and one for remote frees.
- thread-local arenas are initialized with the arena lock disabled.
- allocations on TLAs only ever happen on their owning thread (this is already the case, by design).
- for frees:
- local frees: If the current thread ID is the owning thread ID of the TLA, then free the memory as usual
- remote frees: If the current thread ID is not the owning thread ID of the TLA, then the remote-frees bitmap of the corresponding run is updated using atomics (instead of the regular local free bitmap), but the rest of the deallocation logic is not run.
- "every now and then" on the owning thread, the remote free bitmaps are pulled to update the local bitmap. If any region is freed at this point, run the rest of the deallocation logic (freeing empty runs, dirty pages, etc.)
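The split-bitmap scheme above can be sketched in C++ as follows. This is a minimal illustration, not mozjemalloc code: all names (`Run`, `ThreadLocalArena`, `FlushRemoteFrees`, the single-run arena) are hypothetical, and real runs would track more than one word of bits.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

struct Run {
  // Local bitmap: touched only by the owning thread, so plain
  // (non-atomic) reads and writes are fine.
  uint64_t local_free_bits = 0;
  // Remote bitmap: updated by other threads with atomic RMW ops.
  std::atomic<uint64_t> remote_free_bits{0};
};

struct ThreadLocalArena {
  std::thread::id owner = std::this_thread::get_id();
  Run run;  // a real arena manages many runs; one is enough here

  void Free(int region) {
    uint64_t bit = uint64_t(1) << region;
    if (std::this_thread::get_id() == owner) {
      // Local free: ordinary store, then the usual deallocation
      // logic (freeing empty runs, dirty pages, ...) would run.
      run.local_free_bits |= bit;
    } else {
      // Remote free: just publish the bit and defer everything else
      // to the owning thread. No lock is taken.
      run.remote_free_bits.fetch_or(bit, std::memory_order_release);
    }
  }

  // "Every now and then" on the owning thread: pull the remote bits
  // into the local bitmap, then run the deferred deallocation logic
  // for any region freed at this point.
  uint64_t FlushRemoteFrees() {
    uint64_t pulled =
        run.remote_free_bits.exchange(0, std::memory_order_acquire);
    run.local_free_bits |= pulled;
    return pulled;
  }
};
```

The acquire exchange in `FlushRemoteFrees` pairs with the release `fetch_or` in the remote path, so the owning thread observes the freed regions' contents as dead before reusing them.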
The definition of "every now and then" here would require a bit of testing to get right, but I suspect that we would want a mix of:
- every N allocations and every M frees,
- during allocation when a run is full
- via an explicit API that the thread can choose to call in a cold path.
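The triggers listed above could be combined in a small policy object along these lines. The structure, counter names, and period values are all hypothetical; as noted, N and M would need tuning.

```cpp
#include <cstdint>

// Decides when the owning thread should pull remote frees.
struct FlushPolicy {
  uint32_t alloc_count = 0;
  uint32_t free_count = 0;
  static constexpr uint32_t kAllocPeriod = 1024;  // "every N allocations"
  static constexpr uint32_t kFreePeriod = 1024;   // "every M frees"

  bool ShouldFlushOnAlloc(bool run_is_full) {
    // When a run fills up, remote frees may have opened up regions
    // we can reuse instead of growing the arena, so flush eagerly.
    if (run_is_full) {
      return true;
    }
    return ++alloc_count % kAllocPeriod == 0;
  }

  bool ShouldFlushOnFree() { return ++free_count % kFreePeriod == 0; }

  // The explicit cold-path API simply flushes unconditionally.
  bool ShouldFlushExplicit() { return true; }
};
```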
As a result, allocations on TLAs would be lock-free and almost entirely atomic-free.
The main trap with this scheme is a producer thread that makes a large number of allocations on a TLA, sends them to a consumer thread where they are dropped, and then does not wake up to run the "every now and then" cleanups for a long time.
Since TLAs are currently opt-in and used by few systems, I think that it is reasonable to make sure that they schedule cleanups.
Alternative
We could decide to skip this and aim directly for more thread-local caches (in all or a subset of threads), which may make TLAs obsolete. However, I suspect that this would be a lot more difficult to pull off, so it may be worth prototyping lock-free TLAs as a stopgap for the short/medium term.