Closed Bug 1690921 Opened 4 years ago Closed 4 years ago

Large IPC messages on Unix are accidentally quadratic and generally inefficient, especially with small OS buffer sizes

Categories

(Core :: IPC, defect)

RESOLVED FIXED
88 Branch
Tracking Status
firefox88 --- fixed

People

(Reporter: jld, Assigned: jld)

References

(Blocks 1 open bug)

Attachments

(3 files)

This bug requires some context:

  • IPC messages are stored in a list of buffers (BufferList) rather than a single allocation. By default, the chunks are 4KiB; I think that decision was influenced by issues with 32-bit address space exhaustion.

  • On Unix platforms, we use the struct iovec array taken by sendmsg to do gathered I/O, and we give the OS as many segments in each call as we can. On macOS and Linux, that limit (IOV_MAX) appears to be 1024. (On Windows, we just WriteFile each segment individually.)

And there are two parts to this problem:

  1. Every time we call sendmsg, we iterate over the entire remainder of the BufferList to find out how much data could be written, so that we can detect a short write. We do this even after we have stopped accumulating iovecs, at which point a short write is already guaranteed. Over the course of a whole message this is O(n²) in the number of buffers.

  2. The amount of space in the socket buffer may be much less than what we're trying to write; this wastes time in userspace constructing useless iovecs, and the kernel may also waste time processing them (this appears to be the case on macOS at least).

Concretely, on macOS the default socket buffer size is only 8KiB. So if we have a 100MiB message in 25600 parts, like the one in bug 1680860, then on every sendmsg call we inspect on average 12800 buffers and construct ~1022 iovecs, of which the kernel uses at most 2. This is very wasteful, and it's relatively easy to fix.
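The arithmetic behind those figures can be checked with a short sketch. It assumes an idealized sender, in which every sendmsg call writes exactly one socket buffer's worth of data and the receiver drains it before the next call; the variable names are illustrative and not taken from the Firefox source.

```python
# Worked arithmetic for the macOS example (sizes taken from the report).
KIB = 1024
MIB = 1024 * KIB

message_size = 100 * MIB   # size of the message from bug 1680860
segment_size = 4 * KIB     # default BufferList chunk size
socket_buffer = 8 * KIB    # default macOS socket buffer size

segments = message_size // segment_size            # number of 4KiB parts
segments_per_call = socket_buffer // segment_size  # parts the kernel can accept
calls = segments // segments_per_call              # sendmsg calls needed

# Before the fix, each call walks every not-yet-written segment:
# 25600 on the first call, 25598 on the second, and so on down to 2.
inspected = sum(range(segments, 0, -segments_per_call))
average_inspected = inspected / calls

print(segments, segments_per_call, calls, inspected, average_inspected)
```

Under these assumptions the sender makes 12800 calls and inspects roughly 164 million buffers in total to send 25600 of them.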

We may also want to consider increasing the socket buffer size on Mac, and maybe using larger buffer segments on 64-bit platforms. Both of these could have implications for memory usage, however (and for socket buffers, that would be kernel memory, which probably won't appear anywhere in about:memory).

The longer-term plan to avoid latency issues with large messages is to provide more facilities for using shared memory instead, but increasing the cutoff point where it's necessary to do that should still help.

Currently we walk through the entire list of not-yet-written IPC buffers
when building the gathered I/O list for sendmsg, to determine the
total remaining length of the messages, even after reaching the OS's
limit on how many iovecs it will accept in one call.

This patch halts the iteration when we reach the iovec limit, because
we don't need the exact length; it's sufficient to know whether the
entire message was written, which is impossible in that case.
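As a rough illustration of the idea (not the actual C++ patch), the early exit could look like the following sketch. The function names, the `is_partial` flag, and the use of Python byte strings as segments are all assumptions made for the example.

```python
def build_gather_list(segments, iov_max=1024):
    """Collect up to iov_max segments for one gathered write.

    Returns (iovecs, is_partial). is_partial is True when iteration stopped
    at the iovec limit with data still remaining, so the caller already
    knows the message cannot be completed by this call.
    """
    iovecs = []
    for segment in segments:
        if len(iovecs) == iov_max:
            # Stop here instead of walking the rest of the list just to
            # total up its length.
            return iovecs, True
        iovecs.append(segment)
    return iovecs, False


def message_fully_written(bytes_written, iovecs, is_partial):
    """True only if the whole remaining message went out in this call."""
    return not is_partial and bytes_written == sum(len(s) for s in iovecs)
```

The work done per call is now bounded by the iovec limit rather than by the number of segments still queued.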

This increases throughput on large messages by about 7x on macOS (from
~0.04 to ~0.3 GB/s) and 1.7x on Linux (from ~0.3 to ~0.5 GB/s), on my
test machines. The effect is more significant on macOS because its
smaller socket buffer size (8KiB vs. ~200KiB) means we spend more time
setting up the syscall per unit data copied; see also the next patch.

When setting up calls to sendmsg for IPC on Unix systems, we generate
iovecs for the entire message or until the IOV_MAX limit is reached,
whichever comes first. However, messages can be very large (up to 256
MiB currently), while the OS socket buffer is relatively small (8KiB on
macOS and FreeBSD, ~200KiB on Linux).

This patch detects the socket buffer size with the SO_SNDBUF socket
option and cuts off the iovec array after it's reached; it also adjusts
the Linux sandbox policy to allow reading that value in all processes.
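A minimal sketch of the cutoff, written in Python rather than the C++ of the patch: `send_buffer_size` reads SO_SNDBUF through the standard `socket` module, while `build_capped_gather_list` and its parameters are hypothetical names chosen for this example. The sketch assumes the list is cut off once the queued bytes reach the buffer size, so a segment that crosses the boundary is still included.

```python
import socket


def send_buffer_size(sock):
    """Read the socket's send buffer size as reported by SO_SNDBUF."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)


def build_capped_gather_list(segments, buffer_size, iov_max=1024):
    """Collect segments until the iovec limit or the buffer size is reached.

    Returns (iovecs, is_partial); is_partial is True when segments remain
    that were left out of this call.
    """
    iovecs = []
    queued = 0
    for segment in segments:
        if len(iovecs) == iov_max or queued >= buffer_size:
            return iovecs, True
        iovecs.append(segment)
        queued += len(segment)
    return iovecs, False


if __name__ == "__main__":
    left, right = socket.socketpair()
    try:
        size = send_buffer_size(left)
        iovecs, partial = build_capped_gather_list([bytes(4096)] * 25600, size)
        print(size, len(iovecs), partial)
    finally:
        left.close()
        right.close()
```

With 4KiB segments and an 8KiB buffer this builds 2 iovecs per call instead of about a thousand.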

On my test machines this increases throughput on large messages by about
2.5x on macOS (from ~0.3 to ~0.7 GB/s), but on Linux the improvement is
only about 5% (most of the running time is spent elsewhere).

The function to detect whether the kernel has separate syscalls for
socket operations (rather than only socketcall) had a comment that
it's called only once, which is no longer true. So, this seems like a
good time to add a cache (but not on newer archs like x86_64 where the
answer is constant).

This patch also removes the ifdefs on __NR_socket, because all archs
have it now, and our local headers will define it even if the build
host's headers don't.

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:jld, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(jld)
Flags: needinfo?(gpascutto)
Pushed by jedavis@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/74f7e4524c04
Avoid quadratic runtime when building `sendmsg` gather lists for IPC. r=mccr8
https://hg.mozilla.org/integration/autoland/rev/9328b2f3187c
Limit IPC `sendmsg` gather list sizes based on socket buffer capacity. r=mccr8,gcp
https://hg.mozilla.org/integration/autoland/rev/7a453581836b
Detect socket syscalls only once per process when building Linux sandbox policies. r=gcp
Flags: needinfo?(gpascutto)
Flags: needinfo?(jld)