Closed Bug 1644637 Opened 6 months ago Closed 3 months ago

Attempt to improve performance by reducing contention in jemalloc deallocation code


(Core :: Memory Allocator, enhancement)






(Reporter: abr, Assigned: abr)


(Whiteboard: [qf-])


(1 file)

Currently, our jemalloc implementation, during deallocation of memory <1 MB, acquires the arena lock used by all small- and large-block allocations, and then performs a memset() on the to-be-deallocated block of memory. If this block of memory is paged out, then the arena lock is held during the swap-in process, which blocks all other threads in the process from allocating memory blocks <1 MB.
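
To make the pattern concrete, here is a minimal sketch of the two orderings in question. It is not the actual allocator code: `ToyArena`, `kPoison`, and both function names are hypothetical, and a plain `std::mutex` stands in for the arena lock. It only illustrates why poisoning under the lock can stall other threads, and what moving the memset() ahead of the lock acquisition might look like, assuming it is safe to touch the block before the lock is taken.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <mutex>
#include <vector>

// Hypothetical stand-in for an arena: one lock guarding a free list.
struct ToyArena {
  std::mutex lock;
  std::vector<void*> free_list;
};

// Illustrative poison byte; the value is an assumption for this sketch.
constexpr unsigned char kPoison = 0xe5;

// Ordering described in the report: the memset runs while the lock is held,
// so a page fault on `ptr` stalls every other thread waiting on the arena.
void dealloc_poison_under_lock(ToyArena& arena, void* ptr, std::size_t size) {
  std::lock_guard<std::mutex> guard(arena.lock);
  std::memset(ptr, kPoison, size);
  arena.free_list.push_back(ptr);
}

// Possible alternative: touch the memory first, then hold the lock only
// for the bookkeeping. Any swap-in cost is paid outside the critical section.
void dealloc_poison_before_lock(ToyArena& arena, void* ptr, std::size_t size) {
  std::memset(ptr, kPoison, size);
  std::lock_guard<std::mutex> guard(arena.lock);
  arena.free_list.push_back(ptr);
}
```

Both functions leave the block poisoned and recorded on the free list; the only difference is whether the potentially slow memory write happens inside or outside the critical section.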

During evaluation of IPC performance, it was found that allocating blocks of memory in the 128-byte to 4096-byte range would occasionally take several hundred milliseconds. The locking behavior described above is the most likely cause of this delay.

This bug is to track an experiment to examine the performance impact of attempts to alleviate the arena lock contention caused by the behavior described above.

[qf triage of Core:Performance bugs]:
Calling this [qf-] for now since it sounds theoretical/investigatory at this point. Adam, feel free to adjust (e.g. to qf:p3:responsiveness or similar) if you think (or when you discover) there's a clear perf gain to be had here.

Whiteboard: [qf-]
Component: Performance → Memory Allocator
Closed: 6 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1609478

I haven't looked at the patch yet, but see the discussion in the patch attached to bug 1609478.

(In reply to Mike Hommey [:glandium] from comment #6)

I haven't looked at the patch yet, but see the discussion in the patch attached to bug 1609478.

Thanks! Yes, this is related, inasmuch as I'm pretty certain we have an unnecessary contention issue in jemalloc, and currently suspect that the deallocation flow is the cause. Here's what I know for sure:

If I measure time immediately before this line and then immediately after this line, I've seen delays as long as 564 ms to do work that essentially boils down to allocating a single 128-byte buffer:

In a single test run (loading and scrolling a set of 20 pages 5 times each), I measured this allocation taking in excess of 10 ms nearly 100 times (13 of these allocations took in excess of 100 ms to allocate buffers ranging in size from 128 bytes to 4 kB). Changing the underlying buffer to use the system malloc() eliminated these delays.
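
For anyone wanting to reproduce this kind of measurement, the approach can be sketched roughly as follows. This is not the instrumentation I actually used: `SlowAllocStats`, `record`, and `timed_malloc` are hypothetical names, and `std::malloc` stands in for whatever allocation is being timed. The 10 ms and 100 ms thresholds match the buckets reported above.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical counters for allocations that exceeded each threshold.
struct SlowAllocStats {
  std::size_t over_10ms = 0;
  std::size_t over_100ms = 0;
};

// Classify one measured duration into the two buckets.
void record(SlowAllocStats& stats,
            std::chrono::steady_clock::duration elapsed) {
  using std::chrono::milliseconds;
  if (elapsed > milliseconds(10)) ++stats.over_10ms;
  if (elapsed > milliseconds(100)) ++stats.over_100ms;
}

// Take a timestamp immediately before and after the allocation and
// record how long the call took. steady_clock is monotonic, so wall-clock
// adjustments cannot skew the interval.
void* timed_malloc(std::size_t size, SlowAllocStats& stats) {
  const auto start = std::chrono::steady_clock::now();
  void* p = std::malloc(size);
  record(stats, std::chrono::steady_clock::now() - start);
  return p;
}
```

Note that an allocation over 100 ms is counted in both buckets, which is consistent with how the numbers above are stated (the 13 very slow allocations are a subset of the nearly 100 slow ones).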

At the same time, I've observed that the IPC thread has been blocked on NtWaitForAlertByThreadId (the underlying Windows construct for mutexes) for windows of time that the main thread was performing a long-running memset().

So while I cannot conclusively prove that the deallocation-associated memory poisoning while the arena lock is held is the cause of some of our measured IPC delays, it's a prime candidate. In any case, this bug is intended to ferret out where in jemalloc this resource contention issue arises and propose a fix for it.

I'm reopening this to cover the experiments I'm running to try to resolve this, which may or may not be the same as bug 1609478. It depends on whether the issue is caused by deallocator contention or contention elsewhere.

Resolution: DUPLICATE → ---

(In reply to Mike Hommey [:glandium] from comment #6)

I haven't looked at the patch yet...

Oh, it's also probably useful to point out that I'm touching some structures in this patch that I'm not 100% sure are safe to touch without holding the arena lock. I plan to run this down before I propose landing anything -- the first step is to make sure that I'm solving the right problem.


Given that I'm unlikely to have time to pursue this as a volunteer, I'm going to go ahead and re-dupe it to the bug Mike identified above.

Closed: 6 months ago → 3 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1609478