vmsplice() and TCP

Yes, because the TCP socket can hold on to the pages for an indeterminate time, you cannot use the double-buffering scheme mentioned in the example code. Also, in my use case the pages come from a circular buffer, so I cannot gift the pages to the kernel and allocate fresh ones. I can verify that I am seeing data corruption in the received data.

I resorted to polling the level of the TCP socket's send queue until it drains to 0. This fixes the data corruption, but it is suboptimal because waiting for the send queue to drain to 0 hurts throughput.

n = ::vmsplice(mVmsplicePipe.fd.w, &iov, 1, 0);
while (n > 0) {
    // splice pipe to socket
    m = ::splice(mVmsplicePipe.fd.r, NULL, mFd, NULL, n, 0);
    if (m < 0) {
        LOG_ERR_PERROR("splice");
        break;
    }
    n -= m;
}

while (1) {
    int outsize = 0;
    int result;

    usleep(20000);

    // SIOCOUTQ reports the number of unsent bytes left in the send queue.
    result = ::ioctl(mFd, SIOCOUTQ, &outsize);
    if (result == 0) {
        LOG_NOISE("outsize %d", outsize);
    } else {
        LOG_ERR_PERROR("SIOCOUTQ");
        break;
    }
    //if (outsize <= (bufLen >> 1)) {
    if (outsize == 0) {
        LOG("send queue drained, outsize %d", outsize);
        break;
    }
}

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

As I posted in an update to my question, the underlying problem is that zero-copy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() happens to use under the hood as well). The reason is that this type of memory (with the VM_PFNMAP flag set) does not have metadata in the form of a struct page* associated with each page, which the zero-copy machinery needs.

The solution, then, is to allocate the memory in a way that associates a struct page* with each page.

The workflow that now works for me to allocate the memory is:

  1. Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages allocated is given by 2**page_order.
  2. Split the high-order page into 0-order pages by calling split_page(page, page_order);. This means that struct page* page now points to the first of 2**page_order independent 0-order pages (a sketch of these two steps follows this list).
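
For illustration, a minimal kernel-side sketch of these two steps (the my_alloc_buffer wrapper and its error handling are my own; the two calls are the ones named above):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate 2^page_order physically contiguous pages, then split the
 * high-order page into independent 0-order pages, each with its own
 * struct page. */
static struct page *my_alloc_buffer(unsigned int page_order)
{
    struct page *page = alloc_pages(GFP_USER, page_order);

    if (!page)
        return NULL;

    split_page(page, page_order);

    /* page[0] .. page[(1 << page_order) - 1] are now valid. */
    return page;
}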

Now, to submit such a region to the DMA (for data reception), the steps are as follows (a combined sketch appears after the list):

  1. dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
  2. dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
  3. dmaengine_submit(dma_desc);
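
Strung together, with the usual dmaengine error checks added (my_submit_rx is a hypothetical wrapper; note that dma_async_issue_pending() is what actually starts the queued transfer):

#include <linux/dma-mapping.h>
#include <linux/dmaengine.h>

static int my_submit_rx(struct device *dev, struct dma_chan *chan,
                        struct page *page, size_t length)
{
    struct dma_async_tx_descriptor *desc;
    dma_addr_t dma_addr;
    dma_cookie_t cookie;

    /* 1. Map the page for device-to-memory DMA. */
    dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
    if (dma_mapping_error(dev, dma_addr))
        return -ENOMEM;

    /* 2. Prepare the slave transfer descriptor. */
    desc = dmaengine_prep_slave_single(chan, dma_addr, length,
                                       DMA_DEV_TO_MEM, 0);
    if (!desc)
        goto unmap;

    /* 3. Queue the descriptor and kick the engine. */
    cookie = dmaengine_submit(desc);
    if (dma_submit_error(cookie))
        goto unmap;
    dma_async_issue_pending(chan);
    return 0;

unmap:
    dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
    return -EIO;
}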

When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU; the unmap takes care of the cache maintenance so that we're not reading stale data (a sketch of the callback follows below):

  1. dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
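
Typically that unmap lives in the dmaengine completion callback. A sketch, where struct my_dev and its fields are hypothetical, and the callback is assumed to have been wired up via dma_desc->callback and dma_desc->callback_param before dmaengine_submit():

static void my_dma_done(void *arg)
{
    struct my_dev *mydev = arg;

    /* Transfer ownership back to the CPU; this performs the cache
     * maintenance so we don't read stale data. */
    dma_unmap_page(mydev->dev, mydev->dma_addr, mydev->length,
                   DMA_FROM_DEVICE);
}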

Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:

static int my_mmap(struct file *file, struct vm_area_struct *vma) {
    int res = 0;
    int i;
    ...
    /* Insert each pre-allocated 0-order page into the userspace mapping.
     * Note: 2**page_order in the text must be written 1 << page_order in C. */
    for (i = 0; i < (1 << page_order); ++i) {
        if ((res = vm_insert_page(vma, vma->vm_start + i*PAGE_SIZE, &page[i])) < 0) {
            break;
        }
    }
    vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
    ...
    return res;
}

When the file is closed, don't forget to free the pages:

for (i = 0; i < (1 << page_order); ++i) {
    __free_page(&page[i]);
}

Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
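
For completeness, a minimal userspace sketch of such a send (send_zerocopy is a hypothetical helper; real code must also reap the completion notifications from the socket's error queue before reusing the buffer, which is omitted here):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif

static ssize_t send_zerocopy(int fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg;
    int one = 1;

    /* Enabling SO_ZEROCOPY is required once per socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, MSG_ZEROCOPY);
}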

Although this works, there are two things that don't sit well with me with this approach:

  • You can only allocate power-of-2-sized buffers with this method. You could, however, call alloc_pages as many times as needed with decreasing orders to build a buffer of any size out of sub-buffers of varying sizes, but that would require logic to tie these buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than the single-buffer calls shown above.
  • split_page() says in its documentation:
 * Note: this is probably too low level an operation for use in drivers.
 * Please consult with lkml before using this in your driver.

These issues would be easily solved if there were some interface in the kernel to allocate an arbitrary number of contiguous physical pages. I don't know why there isn't one, but I don't find the above issues so important as to go digging into why it isn't available / how to implement it :-)

Does Linux have zero-copy? splice or sendfile?

sendfile has been zero-copy ever since it was introduced, and it still is (assuming the hardware allows for it, but that is usually the case). Being zero-copy was the entire point of having this syscall in the first place. sendfile is nowadays implemented as a wrapper around splice.
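
For reference, a minimal sendfile() call looks like this (send_file is just an illustrative wrapper):

#include <sys/sendfile.h>

/* File -> socket without the data ever entering userspace; the kernel
 * tracks the in-flight pages itself, so there is nothing for the
 * caller to clean up. */
static ssize_t send_file(int sock_fd, int file_fd, size_t count)
{
    off_t offset = 0;

    return sendfile(sock_fd, file_fd, &offset, count);
}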

That suggests that splice, too, is zero-copy, and this is indeed the case. At least in theory, and at least in some cases. The problem is figuring out how to correctly use it so it works reliably and so it is zero-copy. The documentation is... sparse, to say the least.

In particular, splice only works zero-copy if the pages were given as a "gift" (the SPLICE_F_GIFT flag to vmsplice()), i.e. you don't own them any more (formally, that is; in reality you still do). That is a non-issue if you simply splice a file descriptor onto a socket, but it is a big issue if you want to splice data from your application's address space, or from one pipe to another. It is unclear what to do with the pages afterwards (and when). The documentation states that you may not touch the pages afterwards or do anything with them, never, not ever. So if you follow the letter of the documentation, you must leak the memory.
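
A sketch of what gifting looks like in practice (gift_to_pipe is an illustrative wrapper; SPLICE_F_GIFT only takes effect if the address and length are page-aligned):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>

static ssize_t gift_to_pipe(int pipe_wfd, void *page_aligned_buf, size_t len)
{
    struct iovec iov = { .iov_base = page_aligned_buf, .iov_len = len };

    /* Per the documentation, once gifted, these pages must never be
     * touched again. */
    return vmsplice(pipe_wfd, &iov, 1, SPLICE_F_GIFT);
}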

That's obviously not correct (it can't be), but there is no good way of knowing (for you at least!) when it's safe to reuse or release that memory. The kernel doing a sendfile would know, since as soon as it receives the TCP ACK, it knows that the data is never needed again. The problem is, you don't ever get to see an ACK. All you know when splice has returned is that data has been accepted to be sent (but you have no idea whether it has already been sent or received, nor when this will happen).

Which means you need to figure this out somehow on an application layer, either by doing manual ACKs (comes for free with reliable UDP), or by assuming that if the other side sends an answer to your request, they obviously must have gotten the request.

Another thing you have to manage is the finite pipe space. The default is very small, but even if you increase the size, you can't just naively splice a file of any size. sendfile, on the other hand, will just let you do that, which is cool.
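
On Linux the pipe capacity can be raised with fcntl(F_SETPIPE_SZ); a sketch (grow_pipe is an illustrative wrapper):

#define _GNU_SOURCE
#include <fcntl.h>

static int grow_pipe(int pipe_fd, int size)
{
    /* Returns the actual new capacity (the kernel rounds up to a
     * power of two and clamps unprivileged requests to
     * /proc/sys/fs/pipe-max-size), or -1 on error. */
    return fcntl(pipe_fd, F_SETPIPE_SZ, size);
}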

All in all, sendfile is nice because it just works, and it works well, and you don't need to care about any of the above details. It's not a panacea, but it sure is a great addition.

I would, personally, stay away from splice and its family until the whole thing is greatly overhauled and until it is 100% clear what you have to do (and when) and what you don't have to do.

The real, effective gains over plain old write are marginal for most applications, anyway. I recall some less than polite comments by Mr. Torvalds a few years ago (when BSD had a form of write that would do some magic with remapping pages to get zero-copy, and Linux didn't) which pointed out that making a copy usually isn't any issue, but playing tricks with pages is [won't repeat that here].

Is it possible to `splice()` from a socket to a buffer with zero-copy?

@yeyo answered this question in the question's comments, in my opinion, but never posted it as an answer. So I will summarize:

You cannot splice() from a socket to a buffer with "zero-copy".


The reason you can't use splice() this way is that splice() "requires either the source or destination to be a pipe". In other words, when the source is a socket, the destination must be a pipe, because the API demands that at least one end of the transfer (source, destination, or both) be a pipe.
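
To see the restriction in practice, the closest you can get is a two-step path, which is exactly where the copy sneaks back in (recv_via_pipe is a hypothetical helper):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static ssize_t recv_via_pipe(int sock_fd, int pipe_fds[2], void *buf, size_t len)
{
    /* Legal: one end of the splice is a pipe. */
    ssize_t n = splice(sock_fd, NULL, pipe_fds[1], NULL, len, SPLICE_F_MOVE);

    if (n <= 0)
        return n;

    /* But landing the data in a user buffer still takes a copy. */
    return read(pipe_fds[0], buf, (size_t)n);
}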


That answers the yes or no question but not the "why".

I still had questions about the "why", specifically:

  • "Why does splice() require the source or destination to be a pipe?"
  • and
  • "Why can't I splice directly from a socket to user memory?"

Linus Torvalds answers those questions directly at the link below, and his answer seems clear and concise to me (if it takes a little studying of what he writes): http://yarchive.net/comp/linux/splice.html


