Understanding Sendfile() and Splice()

Does Linux have zero-copy? splice or sendfile?

sendfile has been ever since, and still is zero-copy (assuming the hardware allows for it, but that is usually the case). Being zero-copy was the entire point of having this syscall in the first place. sendfile is nowadays implemented as a wrapper around splice.

That suggests that splice, too, is zero-copy, and this is indeed the case. At least in theory, and at least in some cases. The problem is figuring out how to correctly use it so it works reliably and so it is zero-copy. The documentation is... sparse, to say the least.

In particular, splice only works zero-copy if the pages were given as "gift", i.e. you don't own them any more (formally, but in reality you still do). That is a non-issue if you simply splice a file descriptor onto a socket, but it is a big issue if you want to splice data from your application's address space, or from one pipe to another. It is unclear what to do with the pages afterwards (and when). The documentation states that you may not touch the pages afterwards or do anything with them, never, not ever. So if you follow the letter of the documentation, you must leak the memory.

That's obviously not correct (it can't be), but there is no good way of knowing (for you at least!) when it's safe to reuse or release that memory. The kernel doing a sendfile would know, since as soon as it receives the TCP ACK, it knows that the data is never needed again. The problem is, you don't ever get to see an ACK. All you know when splice has returned is that data has been accepted to be sent (but you have no idea whether it has already been sent or received, nor when this will happen).

Which means you need to figure this out somehow on an application layer, either by doing manual ACKs (comes for free with reliable UDP), or by assuming that if the other side sends an answer to your request, they obviously must have gotten the request.

Another thing you have to manage is the finite pipe space. The default is very small, but even if you increase the size, you can't just naively splice a file of any size. sendfile on the other hand will just let you do that, which is cool.

All in all, sendfile is nice because it just works, and it works well, and you don't need to care about any of the above details. It's not a panacea, but it sure is a great addition.

I would, personally, stay away from splice and its family until the whole thing is greatly overhauled and until it is 100% clear what you have to do (and when) and what you don't have to do.

The real, effective gains over plain old write are marginal for most applications, anyway. I recall some less than polite comments by Mr. Torvalds a few years ago (when BSD had a form of write that would do some magic with remapping pages to get zero-copy, and Linux didn't) which pointed out that making a copy usually isn't any issue, but playing tricks with pages is [won't repeat that here].

How to splice onto socketfd?

sendfile() systemcall does not check if the filedescriptor is seekable. The only check onto that fd is, if you can read (FMODE_READ) onto the fd.

splice() does some more checks. Among others, if the fd is seekable (FMODE_PREAD) / (FMODE_PWRITE).

That's why sendfile works, but splice won't.

can io_uring system call or any other system call be used by an application to transfer data from a socket to file while doing a zero copy?

(This looks like a duplicate of Understanding sendfile() and splice()) This question asker here wants to know if data read from a socket can be zero-copied to a file and the mention of io_uring strongly suggests the asker is specifically interested in Linux.

In short yes, it is possible to receive from a socket and output to a file without having to make unnecessary duplicate copies by using splice(2) on Linux but it's not trivial - the socket must be attached to one end of the pipe and the file's descriptor to the other end. Since the 5.7 Linux kernel io_uring also supports a splice operation so it too can do zero copy from a socket to a file via a pipe.

Is writing to a socket an arbitrary limitation of the sendfile() syscall?

Fundamentally, the only thing limiting it is that "no-one's written the code yet".

However, I gather that the reason that no-ones written the code for those two cases you mention is that they both would require the data to be copied, which removes much of the advantage of using sendfile in the first place.

For a file-to-file sendfile, you'd need a copy because otherwise the same page would have to be in the pagecache as both a clean page in the source file and a dirty page in the destination file. I don't think the pagecache is built to handle that case at the moment (though of course, this could be changed if there was sufficient motivation).
For a file-to-pipe sendfile, you need a copy regardless because the destination process needs to get a private, writeable copy of the data. Anyway, for most uses of this case we already have mmap.

Sendfile without file descriptor

The primary benefit of sendfile() is that it allows you to avoid the overhead of having to first read() data from a file descriptor into memory before you can send() it. If the data you want to send is already in memory, sendfile() is not needed. Using weird workarounds to move the data into a file (like mmap()ing it) will only reduce performance.

What is the most efficient way to copy many files programmatically?

Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.

generate a list of files to copy and store it

copy files in parallel using openMP

#pragma omp parallel for
for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
{
   copyFile(*iter);
}

copy each file using copy_file_range()
falling back to using splice() with a pipe() when compiling for old platforms not supporting copy_file_range().

Reflinking, as supported by copy_file_range(), to avoid copying at all when the source and destination are on the same filesystem is a massive win.

Linux supports splice() and sendfile(), how about Android?

These are Linux kernel calls, so they do exist on Android.

The more interesting question is if Bionic libc provides wrappers as it does for most ordinarily used system calls, or if you will have to invoke them directly. Additionally, apart from being included in Bionic there is the question of the functionality being exported for general use in the NDK.

It appears that sendfile() has been there since the first NDK release.

splice() has not historically seemed to be part of the NDK (I did not check the latest), though it was added to the AOSP sources of Bionic libc in June 2014.

Incidentally grep -r on relevant parts of the NDK installation and/or an AOSP Bionic checkout is a quick way to look into things like this.