
Splicing Sockets in FreeBSD

FreeBSD has recently gained support for splicing sockets via the SO_SPLICE option to setsockopt(2). The API is source-compatible with OpenBSD, which has had this feature since version 4.9, released in 2011. The FreeBSD implementation is independent and varies slightly in behavior.

What is socket splicing?

Many network processes these days primarily shuffle data between services; consider, for example, a public-facing reverse proxy like HAProxy that connects to your backend servers and forwards HTTP requests between both sockets.

Classically, the BSD socket API results in heavy copying: both recv(2) and send(2) operate on userspace buffers, so data has to be copied between kernel and userspace on every call. Thus, a simple userspace forwarder that passes a packet from one server to another needlessly copies the packet twice.
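To make the double copy concrete, a classic forwarding loop looks roughly like the sketch below; src and dst stand for already connected sockets, and the error handling is reduced to a minimum:

char buf[16384];
ssize_t n;

/* each iteration copies the data kernel -> userspace (recv)
   and userspace -> kernel (send) */
while ((n = recv(src, buf, sizeof(buf), 0)) > 0) {
    if (send(dst, buf, n, 0) != n)
        break;  /* error or short write; real code has to retry */
}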

For common operations like sending files, FreeBSD has supported sendfile(2) since version 3.0, so file contents can be sent from disk directly to the network stack without copying them to userspace first.
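As a minimal sketch of the sendfile(2) interface on FreeBSD, assuming file_fd is an open file and sock_fd a connected TCP socket:

off_t sent = 0;

/* an nbytes of 0 means: send until the end of file is reached */
if (sendfile(file_fd, sock_fd, 0, 0, NULL, &sent, 0) == -1)
    warn("sendfile");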

Until now, there was no equivalent mechanism to forward data between two sockets. Since FreeBSD 14.2 (and in the upcoming FreeBSD 15.0), there is support for SO_SPLICE. Using this option, one can tell the kernel to forward data between TCP sockets directly.

In the simplest case, one just needs a call like:

struct splice sp = {
    .sp_fd = s2
};
setsockopt(s1, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp));

The kernel will then forward data arriving on socket s1 to socket s2. For a bidirectional copy, you need to set up SO_SPLICE twice, once for each direction.
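For a full bidirectional proxy between two connected TCP sockets, this could look roughly like the following sketch (error checking of the setsockopt(2) calls omitted):

struct splice sp1 = { .sp_fd = s2 };
struct splice sp2 = { .sp_fd = s1 };

/* forward data arriving on s1 to s2 ... */
setsockopt(s1, SOL_SOCKET, SO_SPLICE, &sp1, sizeof(sp1));
/* ... and data arriving on s2 back to s1 */
setsockopt(s2, SOL_SOCKET, SO_SPLICE, &sp2, sizeof(sp2));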

How this works

Internally, FreeBSD spawns a number of so_splice kernel threads (one for each CPU) which tackle the forwarding. The sponsors of this feature have a blog post that explains it in more detail.

Benchmarking

I did a very simple benchmark using the in-tree proxy.c example and iperf3. The test machine has a quad-core Intel i3-8100T CPU @ 3.10GHz with a Mellanox ConnectX-3 Pro dual-port SFP+ Ethernet adapter.

I'm running iperf3 from a remote machine with a 60 second test time. Clearly, the machine can do wire speed at 10 Gbit/s with negligible load. (I'm not sure what the benchmark linked above did that a single-stream iperf3 was limited to 7.6 Gbit/s.)

When using the proxy with -m copy, it needs 6.2s system time and 0.36s user time, totalling 6.56s CPU time. When using the -m splice option to enable SO_SPLICE, the so_splice kernel thread needs 5.98s CPU time. This is about 8% more efficient.

Curiously, for an iperf3 -R run, i.e. making the FreeBSD machine send to the client, the copying proxy needs 19.06s of CPU time, while so_splice needs 14.93s (almost 22% less). (But I'm surprised the write path is so much less efficient in general?)

Implications

The design using a thread pool has upsides and downsides. Other operating systems such as Linux use a splice(2) syscall instead, which does the copying in the kernel context of the calling process. This requires more code on the client side, as you need to write the control loop yourself. However, portable code likely has a basic send/recv loop already and can thus easily switch to splice. (It's slightly more complicated, as Linux splice(2) can't directly splice between two sockets; you need a pipe in between to hold the buffer.)
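To illustrate the difference, a Linux-style forwarding loop might look roughly like this sketch; src and dst are assumed to be connected sockets, and a real loop would also have to handle partial transfers and errors:

int pfd[2];
pipe(pfd);

for (;;) {
    /* move data from the source socket into the pipe buffer ... */
    ssize_t n = splice(src, NULL, pfd[1], NULL, 65536, SPLICE_F_MOVE);
    if (n <= 0)
        break;
    /* ... and from the pipe buffer on to the destination socket */
    splice(pfd[0], NULL, dst, NULL, n, SPLICE_F_MOVE);
}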

After toying around a bit, I found a few interesting things:

Since SO_SPLICE uses a kernel thread worker pool, packet forwarding keeps working even when the process that initiated it is stopped. Upon killing the process that owns the sockets, the sockets are closed and the thread pool stops copying.

It is possible to create loops. In the OpenBSD implementation, this is detected and results in the error ELOOP. In FreeBSD, you can happily copy in a loop. My system maxes out at 18.7 Gbit/s and 420k packets/s over the loopback interface. However, the so_splice threads only eat roughly 70% of a CPU core, as they run with userspace priority (I'm not sure what the bottleneck is...). This is entirely useless, though: as the sockets are blocked for userspace while splicing is active, there's nothing useful you can do with such a loop.

The kinds of sockets that can be spliced are quite limited so far: you can only splice TCP sockets (though splicing between IPv4 and IPv6 sockets works). Being able to splice to Unix domain sockets would be very useful for proxying to application servers, but this is still a TODO. It could also make sense to allow splicing UDP or SCTP sockets.