Large IPC messages on Unix are accidentally quadratic and generally inefficient, especially with small OS buffer sizes
Categories
(Core :: IPC, defect)
Tracking
firefox88: fixed
People
(Reporter: jld, Assigned: jld)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
This bug requires some context:
- IPC messages are stored in a list of buffers (BufferList) rather than a single allocation. By default, the chunks are 4KiB; I think that decision was influenced by issues with 32-bit address space exhaustion.
- On Unix platforms, we use the struct iovec array taken by sendmsg to do gathered I/O, and we give the OS as many segments in each call as we can. On macOS and Linux, that limit (IOV_MAX) appears to be 1024. (On Windows, we just WriteFile each segment individually.)
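As a rough illustration of gathered I/O, here is a minimal, self-contained sketch (not Mozilla's actual code; the function name and structure are mine) showing how a single sendmsg call can transmit a message stored in several separate chunks over a Unix socketpair:

```cpp
#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

// Sends a message split across three separate buffers (as in a chunked
// buffer list) with one gathered sendmsg() call, reads it back, and
// returns the reassembled string, or "" on error.
std::string DemoGatheredSend() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return "";

  const char* chunks[] = {"large ", "IPC ", "message"};
  struct iovec iov[3];
  for (int i = 0; i < 3; i++) {
    iov[i].iov_base = const_cast<char*>(chunks[i]);
    iov[i].iov_len = strlen(chunks[i]);
  }

  struct msghdr msg = {};
  msg.msg_iov = iov;
  msg.msg_iovlen = 3;  // real code caps this at IOV_MAX (1024 here)

  std::string out;
  ssize_t sent = sendmsg(fds[0], &msg, 0);
  if (sent > 0) {
    char buf[64] = {};
    ssize_t got = read(fds[1], buf, sizeof(buf));
    if (got > 0) out.assign(buf, static_cast<size_t>(got));
  }
  close(fds[0]);
  close(fds[1]);
  return out;
}
```

The kernel copies from all the iovecs in one pass, which is why building a large iovec array that the kernel can't actually drain (the second problem below) is pure overhead.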
And there are two parts to this problem:
- Every time we call sendmsg, we iterate the entire remainder of the BufferList to find out how much data could be written, in order to detect a short write, even after we stop accumulating iovecs and a short write is guaranteed. This is O(n²).
- The amount of space in the socket buffer may be much less than we're trying to write; this wastes time in userspace constructing useless iovecs, and the kernel possibly (this appears to be the case at least on macOS) also wastes time processing them.

Concretely, on macOS the default socket buffer size is only 8KiB. So if we have a 100MiB message in 25600 parts, like the one in bug 1680860, then on every sendmsg call we inspect on average 12800 buffers and construct ~1022 iovecs, and the kernel uses at most 2 of them. This is very wasteful, and it's relatively easy to fix.
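A back-of-envelope model of the pre-fix cost (purely illustrative code; it just replays the numbers above: 25600 segments, roughly 2 drained per call given an 8KiB buffer and 4KiB chunks) reproduces the "on average 12800 buffers" figure:

```cpp
#include <cstdint>

struct CostModel {
  int64_t calls;         // sendmsg calls needed to drain the message
  int64_t avgInspected;  // buffers walked per call, on average
};

// Models the old behavior: each call walks the entire remaining list,
// while the kernel only consumes a couple of segments per call.
CostModel OldCost(int64_t segments, int64_t consumedPerCall) {
  int64_t calls = 0, inspected = 0;
  for (int64_t rem = segments; rem > 0; rem -= consumedPerCall) {
    calls++;
    inspected += rem;  // the O(n^2) term: whole remainder, every time
  }
  return {calls, inspected / calls};
}
```

With OldCost(25600, 2), this gives 12800 calls and an average of ~12800 buffers (12801 exactly) inspected per call, i.e. on the order of 160 million buffer visits for one message.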
We may also want to consider increasing the socket buffer size on Mac, and maybe using larger buffer segments on 64-bit platforms. Both of these could have implications for memory usage, however (and for socket buffers, that would be kernel memory, which probably won't appear anywhere in about:memory).
The longer-term plan to avoid latency issues with large messages is to provide more facilities for using shared memory instead, but increasing the cutoff point where it's necessary to do that should still help.
Assignee
Comment 1 • 4 years ago
Currently we walk through the entire list of not-yet-written IPC buffers when building the gathered I/O list for sendmsg, to determine the total remaining length of the messages, even after reaching the OS's limit on how many iovecs it will accept in one call.

This patch halts the iteration when we reach the iovec limit, because we don't need the exact length; it's sufficient to know whether the entire message was written, which is impossible in that case.
This increases throughput on large messages by about 7x on macOS (from
~0.04 to ~0.3 GB/s) and 1.7x on Linux (from ~0.3 to ~0.5 GB/s), on my
test machines. The effect is more significant on macOS because its
smaller socket buffer size (8kB vs. ~200kB) means we spend more time
setting up the syscall per unit data copied; see also the next patch.
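A hedged sketch of the early-exit idea (the names and shapes here are illustrative, not the real BufferList API): once the iovec limit is hit, stop walking segments and just record that the message isn't fully covered, since that alone guarantees a short write.

```cpp
#include <cstddef>
#include <vector>

struct BuildResult {
  size_t iovCount;     // how many iovecs we would hand to sendmsg
  bool coversMessage;  // false => a short write is guaranteed anyway
};

// Builds (notionally) the iovec array for one sendmsg call. The old
// code kept iterating past iovMax just to total up the remaining
// length; this version stops, making the cost O(iovMax) per call
// instead of O(remaining segments).
BuildResult BuildIovecs(const std::vector<size_t>& segLens, size_t iovMax) {
  BuildResult r{0, true};
  for (size_t len : segLens) {
    if (r.iovCount == iovMax) {
      r.coversMessage = false;  // early exit: exact length not needed
      break;
    }
    (void)len;  // real code would fill in iov_base/iov_len here
    r.iovCount++;
  }
  return r;
}
```

Short-write detection still works: if coversMessage is true, comparing the bytes sendmsg returned against the bytes described is enough; if it is false, the write is short by construction.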
Assignee
Comment 2 • 4 years ago
When setting up calls to sendmsg for IPC on Unix systems, we generate iovecs for the entire message or until the IOV_MAX limit is reached, whichever comes first. However, messages can be very large (up to 256 MiB currently), while the OS socket buffer is relatively small (8KiB on macOS and FreeBSD, ~200KiB on Linux).

This patch detects the socket buffer size with the SO_SNDBUF socket option and cuts off the iovec array after it's reached; it also adjusts the Linux sandbox policy to allow reading that value in all processes.
On my test machines this increases throughput on large messages by about
2.5x on macOS (from ~0.3 to ~0.7 GB/s), but on Linux the improvement is
only about 5% (most of the running time is spent elsewhere).
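A minimal sketch of the approach, assuming a plain Unix-domain socketpair (names here are illustrative, not the actual patch): query SO_SNDBUF with getsockopt, then cap how many segments we bother describing at roughly what the send buffer can absorb.

```cpp
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

// Queries SO_SNDBUF on a fresh socketpair; returns 0 on error.
// (Note: on Linux the returned value is the kernel's doubled figure.)
int QuerySndbuf() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 0;
  int sndbuf = 0;
  socklen_t len = sizeof(sndbuf);
  if (getsockopt(fds[0], SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) != 0) {
    sndbuf = 0;
  }
  close(fds[0]);
  close(fds[1]);
  return sndbuf;
}

// How many equal-size segments are worth describing as iovecs: enough
// to cover the send buffer, and never more than the message has.
size_t SegmentsWorthSending(size_t sndbuf, size_t segLen, size_t numSegs) {
  size_t fit = (sndbuf + segLen - 1) / segLen;  // round up
  return fit < numSegs ? fit : numSegs;
}
```

With an 8KiB buffer and 4KiB segments, SegmentsWorthSending caps the array at 2 entries instead of ~1022, matching the "kernel uses at most 2 of them" observation in the description.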
Assignee
Comment 3 • 4 years ago
The function to detect whether the kernel has separate syscalls for socket operations (rather than only socketcall) had a comment that it's called only once, which is no longer true. So, this seems like a good time to add a cache (but not on newer archs like x86_64, where the answer is constant).

This patch also removes the ifdefs on __NR_socket, because all archs have it now, and our local headers will define it even if the build host's headers don't.
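The caching pattern described here can be sketched as follows (ProbeKernel and the surrounding names are stand-ins for the real detection logic, not the actual functions): a function-local static ensures the probe runs at most once, no matter how often the question is asked.

```cpp
// Counts how many times the (stand-in) probe actually runs.
static int gProbeCalls = 0;

// Stand-in for the real runtime check of whether the kernel accepts
// separate socket syscalls rather than only socketcall().
static bool ProbeKernel() {
  gProbeCalls++;
  return true;
}

static bool HasSeparateSocketSyscalls() {
  // On archs like x86_64 the answer is constant and needs no cache;
  // elsewhere, compute it once and remember the result. C++11
  // guarantees this initialization happens exactly once, thread-safely.
  static const bool cached = ProbeKernel();
  return cached;
}
```

Repeated calls to HasSeparateSocketSyscalls return the cached answer without re-running the probe.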
Comment 4 • 4 years ago
There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:jld, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 6 • 4 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/74f7e4524c04
https://hg.mozilla.org/mozilla-central/rev/9328b2f3187c
https://hg.mozilla.org/mozilla-central/rev/7a453581836b