How Much Memory Is Consumed by the Linux Kernel Per TCP/IP Network Connection

How much memory is consumed by the Linux kernel per TCP/IP network connection?

For a TCP connection, the memory consumed depends on:

  1. the size of sk_buff (the internal networking structure used by the Linux kernel)

  2. the read and write buffers for the connection

The size of the buffers can be tweaked as required:

root@x:~# sysctl -A | grep net | grep mem

Check for these variables; they specify the default and maximum memory buffer usage for all network connections in the kernel:

net.core.wmem_max = 131071

net.core.rmem_max = 131071

net.core.wmem_default = 126976

net.core.rmem_default = 126976

These specify buffer memory usage specific to TCP connections:

net.ipv4.tcp_mem = 378528   504704  757056

net.ipv4.tcp_wmem = 4096 16384 4194304

net.ipv4.tcp_rmem = 4096 87380 4194304

For net.ipv4.tcp_rmem and net.ipv4.tcp_wmem, the three values specified are the "min default max" buffer sizes, in bytes.
So to start with, Linux will use the default read and write buffer size for each connection.
As the number of connections increases and memory comes under pressure, these buffers will be shrunk (at most down to the specified min value); when memory is plentiful, a buffer may grow up to the max value.
(net.ipv4.tcp_mem is slightly different: its three values are low/pressure/high thresholds for the total memory used by all TCP sockets, measured in pages, not bytes.)
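Since tcp_rmem/tcp_wmem are counted in bytes but tcp_mem is counted in pages, comparing them takes a small conversion. A shell sketch using the example values above (the 4096-byte page size is an assumption; check `getconf PAGESIZE` on your system):

```shell
#!/bin/sh
# tcp_rmem / tcp_wmem hold "min default max" in BYTES
tcp_rmem="4096 87380 4194304"
set -- $tcp_rmem
echo "rmem: min=$1 default=$2 max=$3 bytes"

# tcp_mem holds "low pressure high" thresholds for ALL TCP sockets, in PAGES
tcp_mem="378528 504704 757056"
page_size=4096   # assumed; use `getconf PAGESIZE` for the real value
for pages in $tcp_mem; do
    echo "$pages pages = $(( pages * page_size / 1024 / 1024 )) MiB"
done
```

With a 4096-byte page, the tcp_mem values above work out to roughly 1.4 GiB (low) through about 2.9 GiB (high) of total TCP memory before the kernel starts trimming buffers.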

These values can be set using sysctl -w key="value".

E.g. the commands below ensure the read and write buffers for each connection are 4096 bytes each:

sysctl -w net.ipv4.tcp_rmem='4096 4096 4096'

sysctl -w net.ipv4.tcp_wmem='4096 4096 4096'
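Note that changes made with sysctl -w do not survive a reboot. To persist them, the usual approach is to add the settings to /etc/sysctl.conf (or a file under /etc/sysctl.d/ on most modern distributions) and reload:

```shell
# Persist the settings across reboots (requires root)
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_rmem = 4096 4096 4096
net.ipv4.tcp_wmem = 4096 4096 4096
EOF
sysctl -p   # reload /etc/sysctl.conf
```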

How to track memory usage of the networking subsystem in the Linux kernel?

Did you try ss -m? Documentation of the reported values seems scarce, but you can make educated guesses based on their full names as defined in linux/sock_diag.h.
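For illustration, here is a sketch of pulling one field out of an ss -m "skmem" line. The sample line is made up; the field abbreviations (r = rmem_alloc, rb = rcvbuf, t = wmem_alloc, tb = sndbuf, and so on) correspond to the SK_MEMINFO_* names from linux/sock_diag.h:

```shell
#!/bin/sh
# A sample memory line as printed by `ss -m` (the numbers are made up):
line='skmem:(r0,rb369280,t0,tb87040,f0,w0,o0,bl0,d0)'

# Extract the receive buffer limit (rb), i.e. the socket's SO_RCVBUF
rb=$(printf '%s\n' "$line" | sed -n 's/.*rb\([0-9]*\).*/\1/p')
echo "receive buffer limit: $rb bytes"
```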

Travelling time of data through TCP/IP stack (Linux)

After res.send() has been called, the packet has left the NIC. What you are interested in is the time send() takes.

How does the Linux kernel manage data that has been passed to a user program via DMA?

The short answer is that it doesn't. Data isn't going to be processed in more than one location at once, so if networking packets are passed directly to a user-space program, then the kernel isn't going to do anything else with them; it has been bypassed. It is up to the user-space program to handle them.

An example of this was presented in a device-drivers class I took a while back: high-frequency stock trading. There is an article about one such implementation at Forbes.com. The idea is that traders want their information as fast as possible, so they use specially crafted packets which, when received (by equally specialized hardware), are presented directly to the trader's program, bypassing the relatively high-latency TCP/IP stack in the kernel. Here's an excerpt from the linked article talking about two such special network cards:

Both of these cards provide kernel bypass drivers that allow you to send/receive data via TCP and UDP in userspace. Context switching is an expensive (high-latency) operation that is to be avoided, so you will want all critical processing to happen in user space (or kernel-space if so inclined).

This technique can be used for just about any application where the latency between user programs and the hardware needs to be minimized, but as your question implies, it means that the kernel's normal mechanisms for handling such transactions are going to be bypassed.

NET_DMA TCP receive offload in Linux

My guess is that with small packets (1448 bytes is considered small nowadays), the latency overhead of activating and waiting for the IOAT interrupt is higher than the overhead of simply copying the memory, especially when memory and CPU access are fast. Modern servers can push 5 GB/s with memcpy.

For the 10 Gbit/s Ethernet case it would be worthwhile to work with as high an MTU as possible, and certainly with larger buffer sizes. I think the original tests with receive offload only started showing performance gains once single packets were around PAGE_SIZE.
