Why No Zero-Copy Networking in Linux Kernel

Is Netty's Zero Copy different from OS level Zero Copy?

According to Wikipedia:

Zero-Copy describes computer operations in which the CPU does not perform the task of copying data from one memory area to another.

OS-level zero copy involves avoiding copying memory blocks from one location to another (typically from user space to kernel space) before sending data to the hardware driver (network card or disk drive) or vice versa.
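The canonical example of OS-level zero copy on Linux is sendfile(), which streams file data to a socket entirely inside the kernel, so the payload never passes through a user-space buffer. A minimal sketch (error handling abbreviated; the socket is assumed to be already connected):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send an entire file over an already-connected socket without the
     * file contents ever being copied into a user-space buffer. */
    int send_file_zero_copy(int sock_fd, const char *path)
    {
        struct stat st;
        off_t offset = 0;
        int file_fd = open(path, O_RDONLY);

        if (file_fd < 0) {
            perror("open");
            return -1;
        }
        if (fstat(file_fd, &st) < 0) {
            perror("fstat");
            close(file_fd);
            return -1;
        }

        /* The kernel moves the data from the page cache to the socket
         * internally; only metadata crosses the user/kernel boundary. */
        while (offset < st.st_size) {
            ssize_t sent = sendfile(sock_fd, file_fd, &offset,
                                    st.st_size - offset);
            if (sent <= 0) {
                if (sent < 0)
                    perror("sendfile");
                break;
            }
        }

        close(file_fd);
        return offset == st.st_size ? 0 : -1;
    }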

Netty's zero copy refers to optimizing data manipulation at the Java level (user space only): its ChannelBuffer makes it possible to read the contents of multiple byte buffers without actually copying them.

In other words, even though Netty works only in user space, it is still valid to call its approach "zero copy". However, if the OS does not use or support true zero copy, then when data produced by a Netty-powered program is sent over the network, it may still be copied from user space to kernel space, and true zero copy is not achieved.

Zero-copy user-space TCP send of dma_mmap_coherent() mapped memory

As I posted in an update to my question, the underlying problem is that zero-copy networking does not work for memory that has been mapped using remap_pfn_range() (which dma_mmap_coherent() also uses under the hood). The reason is that this type of memory (mapped with the VM_PFNMAP flag set) does not have metadata in the form of a struct page * associated with each page, which the zero-copy networking code requires.

The solution, then, is to allocate the memory in a way that associates a struct page * with each page of the buffer.

The workflow that now works for me to allocate the memory is:

  1. Use struct page* page = alloc_pages(GFP_USER, page_order); to allocate a block of contiguous physical memory, where the number of contiguous pages that will be allocated is given by 2**page_order.
  2. Split the high-order/compound page into 0-order pages by calling split_page(page, page_order);. This now means that struct page* page has become an array with 2**page_order entries (both steps are sketched below).
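A minimal sketch of those two steps, with a hypothetical helper name and without the error handling a real driver would need:

    #include <linux/gfp.h>
    #include <linux/mm.h>

    /* Allocate 2^page_order physically contiguous pages and split the
     * high-order allocation into independent 0-order pages, so that each
     * page has its own struct page. */
    static struct page *alloc_contiguous_pages(unsigned int page_order)
    {
        struct page *pages = alloc_pages(GFP_USER, page_order);

        if (!pages)
            return NULL;

        /* After this, pages[0] .. pages[(1 << page_order) - 1] can be
         * mapped, DMA'd and freed individually. */
        split_page(pages, page_order);
        return pages;
    }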

Now to submit such a region to the DMA (for data reception):

  1. dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
  2. dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length, DMA_DEV_TO_MEM, 0);
  3. dmaengine_submit(dma_desc);
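A sketch of how these three calls might fit together in the driver, assuming the slave DMA channel was already requested elsewhere (e.g. with dma_request_chan()); the context struct and the function/callback names are made up for illustration:

    #include <linux/dma-mapping.h>
    #include <linux/dmaengine.h>

    /* Hypothetical per-transfer context so the completion callback can unmap. */
    struct my_rx_ctx {
        struct device *dev;
        dma_addr_t dma_addr;
        size_t length;
    };

    static void my_rx_dma_complete(void *param);    /* sketched further below */

    static int my_submit_rx_dma(struct device *dev, struct dma_chan *dma_chan,
                                struct page *page, size_t length,
                                struct my_rx_ctx *ctx)
    {
        struct dma_async_tx_descriptor *dma_desc;
        dma_addr_t dma_addr;

        /* Hand the buffer over to the device (the CPU gives up ownership). */
        dma_addr = dma_map_page(dev, page, 0, length, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dma_addr))
            return -ENOMEM;

        dma_desc = dmaengine_prep_slave_single(dma_chan, dma_addr, length,
                                               DMA_DEV_TO_MEM,
                                               DMA_PREP_INTERRUPT);
        if (!dma_desc) {
            dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
            return -EIO;
        }

        ctx->dev = dev;
        ctx->dma_addr = dma_addr;
        ctx->length = length;
        dma_desc->callback = my_rx_dma_complete;
        dma_desc->callback_param = ctx;

        dmaengine_submit(dma_desc);
        dma_async_issue_pending(dma_chan);    /* actually start the transfer */
        return 0;
    }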

When we get a callback from the DMA that the transfer has finished, we need to unmap the region to transfer ownership of this block of memory back to the CPU, which takes care of caches to make sure we're not reading stale data:

  1. dma_unmap_page(dev, dma_addr, length, DMA_FROM_DEVICE);
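For example, the completion callback (using the hypothetical my_rx_ctx from the sketch above) might look like this:

    /* Runs when the DMA engine reports that the transfer has finished. */
    static void my_rx_dma_complete(void *param)
    {
        struct my_rx_ctx *ctx = param;

        /* Returns ownership of the buffer to the CPU and performs the cache
         * maintenance needed so we don't read stale data. */
        dma_unmap_page(ctx->dev, ctx->dma_addr, ctx->length, DMA_FROM_DEVICE);

        /* ... wake up whoever is waiting for the data, e.g. complete() a
         * struct completion or wake a poll() waiter ... */
    }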

Now, when we want to implement mmap(), all we really have to do is call vm_insert_page() repeatedly for all of the 0-order pages that we pre-allocated:

    static int my_mmap(struct file *file, struct vm_area_struct *vma) {
        int res = 0;
        int i;
        ...
        /* Insert each pre-allocated 0-order page into the caller's VMA.
         * (1 << page_order) is the 2**page_order from the text. */
        for (i = 0; i < (1 << page_order); ++i) {
            if ((res = vm_insert_page(vma, vma->vm_start + i * PAGE_SIZE, &page[i])) < 0) {
                break;
            }
        }
        vma->vm_flags |= VM_LOCKED | VM_DONTCOPY | VM_DONTEXPAND | VM_DENYWRITE;
        ...
        return res;
    }

When the file is closed, don't forget to free the pages:

    for (i = 0; i < (1 << page_order); ++i) {    /* 2**page_order pages */
        __free_page(&dev->shm[i].pages[i]);
    }

Implementing mmap() this way now allows a socket to use this buffer for sendmsg() with the MSG_ZEROCOPY flag.
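On the user-space side, opting a socket in to zero copy and sending the mmap()ed buffer looks roughly like this (a minimal sketch: buf and len are assumed to come from mmap()ing the driver's device node, and reading the completion notifications from the socket's error queue is omitted):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #ifndef SO_ZEROCOPY            /* provided by kernel >= 4.14 UAPI headers */
    #define SO_ZEROCOPY 60
    #endif
    #ifndef MSG_ZEROCOPY
    #define MSG_ZEROCOPY 0x4000000
    #endif

    /* buf/len are assumed to point into the buffer mmap()ed from the driver. */
    int send_zerocopy(int sock_fd, const void *buf, size_t len)
    {
        int one = 1;
        ssize_t n;

        /* Opt the socket in to zero-copy transmission. */
        if (setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0) {
            perror("setsockopt(SO_ZEROCOPY)");
            return -1;
        }

        /* The kernel pins the pages instead of copying them; a completion
         * notification is later delivered on the socket's error queue
         * (recvmsg() with MSG_ERRQUEUE), which is not shown here. */
        n = send(sock_fd, buf, len, MSG_ZEROCOPY);
        if (n < 0) {
            perror("send(MSG_ZEROCOPY)");
            return -1;
        }
        return 0;
    }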

Although this works, there are two things about this approach that don't sit well with me:

  • You can only allocate power-of-two-sized buffers with this method, although you could implement logic to call alloc_pages as many times as needed, with decreasing orders, to build a buffer of any size out of sub-buffers of varying sizes (a rough sketch of this idea appears below). This then requires some logic to tie these sub-buffers together in the mmap() and to DMA them with scatter-gather (sg) calls rather than a single transfer.
  • split_page() says in its documentation:

        * Note: this is probably too low level an operation for use in drivers.
        * Please consult with lkml before using this in your driver.

These issues would be easily solved if there were some interface in the kernel to allocate an arbitrary number of contiguous physical pages. I don't know why there isn't, but I don't find the above issues important enough to go digging into why this isn't available / how to implement it :-)
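As a rough illustration of the first point, a decreasing-order allocation loop might look something like the following (a hypothetical sketch, not taken from the driver above; scatter-gather setup and cleanup on failure are omitted):

    #include <linux/errno.h>
    #include <linux/gfp.h>
    #include <linux/kernel.h>
    #include <linux/log2.h>
    #include <linux/mm.h>

    /* Hypothetical: build an arbitrary-size buffer out of several physically
     * contiguous chunks, allocating the largest power-of-two block that still
     * fits and then moving on to the remainder.  Each chunk would become one
     * scatter-gather entry for the DMA and one run of vm_insert_page() calls
     * in mmap().  Returns the number of chunks used, or a negative errno. */
    static int alloc_chunked_buffer(unsigned long nr_pages,
                                    struct page **chunks, unsigned int *orders,
                                    unsigned int max_chunks)
    {
        unsigned int n = 0;

        while (nr_pages > 0 && n < max_chunks) {
            /* Largest order such that 2^order <= remaining pages, capped at
             * the page allocator's limit. */
            unsigned int order = min_t(unsigned int, ilog2(nr_pages),
                                       MAX_ORDER - 1);
            struct page *p = alloc_pages(GFP_USER, order);

            if (!p)
                return -ENOMEM;    /* real code would free the chunks so far */

            split_page(p, order);  /* turn the block into 0-order pages */
            chunks[n] = p;
            orders[n] = order;
            n++;
            nr_pages -= 1UL << order;
        }

        return nr_pages ? -ENOMEM : (int)n;
    }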

Is zero-copy UDP packet receiving possible on Linux?

This article may be useful:

http://yusufonlinux.blogspot.com/2010/11/data-link-access-and-zero-copy.html


