High Performance TCP Server in C#

High performance TCP server in C#

It must be async; there is no way around this. High performance and scalability don't mix with one-thread-per-socket. You can have a look at what Stack Exchange themselves are doing: see async Redis await BookSleeve, which leverages the CTP features from the next C# release (so it is on the edge and subject to change, but it is cool). For the even more bleeding edge, the solution revolves around the SocketAsyncEventArgs class, which takes things one step further by eliminating the frequent allocations of async handlers associated with 'classic' C# async processing:

The SocketAsyncEventArgs class is part of a set of enhancements to the System.Net.Sockets.Socket class that provide an alternative asynchronous pattern that can be used by specialized high-performance socket applications. This class was specifically designed for network server applications that require high performance. An application can use the enhanced asynchronous pattern exclusively or only in targeted hot areas (for example, when receiving large amounts of data).

Long story short: learn async or die trying...

BTW, if you're asking why async, then read the three articles linked from this post: High Performance Windows programs. The ultimate answer is: the underlying OS design requires it.
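As a rough illustration of the pattern the quoted docs describe, a single reusable SocketAsyncEventArgs per connection might look like this (a minimal sketch, not production code; the 8 kB buffer and the error handling are simplified assumptions):

using System.Net.Sockets;

static void StartReceive(Socket socket)
{
    var args = new SocketAsyncEventArgs();
    args.SetBuffer(new byte[8192], 0, 8192);        // one buffer, reused for every receive
    args.Completed += (_, a) => ProcessReceive(socket, a);

    if (!socket.ReceiveAsync(args))                 // false = completed synchronously,
        ProcessReceive(socket, args);               // so Completed will not fire
}

static void ProcessReceive(Socket socket, SocketAsyncEventArgs args)
{
    // Loop as long as receives keep completing synchronously.
    while (args.BytesTransferred > 0 && args.SocketError == SocketError.Success)
    {
        // ... consume args.Buffer[0 .. args.BytesTransferred) here ...

        if (socket.ReceiveAsync(args))              // went async; Completed will call us again
            return;
    }

    socket.Close();
    args.Dispose();
}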

High-performance TCP Socket programming in .NET C#

Because this question gets a lot of views I decided to post an "answer", even though technically this isn't an answer but my final conclusion for now, so I will mark it as the answer.

About the approaches:

The async/await functions tend to produce awaitable async Tasks assigned to the TaskScheduler of the dotnet runtime, so thousands of simultaneous connections, and therefore thousands of read/write operations, will start up thousands of Tasks. As far as I know this creates thousands of state machines stored in RAM and countless context switches in the threads they are assigned to, resulting in very high CPU overhead. With a few connections/async calls it is better balanced, but as the awaitable Task count grows it slows down exponentially.

The BeginReceive/EndReceive/BeginSend/EndSend socket methods are technically async methods with no awaitable Tasks, but with callbacks at the end of the call. This actually optimizes the multithreading more, but in my opinion the dotnet design of these socket methods is still poor; for simple solutions (or a limited number of connections), though, it is the way to go.
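A minimal sketch of that Begin/End (APM) receive pattern, for illustration only (buffer size and error handling are simplified assumptions):

using System.Net.Sockets;

class ApmConnection
{
    private readonly Socket _socket;
    private readonly byte[] _buffer = new byte[8192];

    public ApmConnection(Socket socket) => _socket = socket;

    public void Start() =>
        _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceived, null);

    private void OnReceived(IAsyncResult ar)
    {
        int read = _socket.EndReceive(ar);
        if (read <= 0) { _socket.Close(); return; }   // peer closed the connection

        // ... process _buffer[0 .. read) here ...

        _socket.BeginReceive(_buffer, 0, _buffer.Length, SocketFlags.None, OnReceived, null);
    }
}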

The SocketAsyncEventArgs/ReceiveAsync/SendAsync type of socket implementation is the best on Windows for a reason. It utilizes Windows IOCP in the background to achieve the fastest async socket calls, using overlapped I/O and a special socket mode. This solution is the "simplest" and fastest under Windows. Under mono/linux it will never be that fast, because mono emulates Windows IOCP on top of linux epoll; epoll itself is actually much faster than IOCP, but having to emulate IOCP for dotnet compatibility causes some overhead.

About buffer sizes:

There are countless ways to handle data on sockets. Reading is straightforward: data arrives, you know its length, and you just copy bytes from the socket buffer into your application and process them.
Sending data is a bit different.

  • You can pass your complete data to the socket and it will cut it into chunks, copying the chunks to the socket buffer until there is no more to send; the socket's send method returns when all data has been sent (or when an error happens).
  • You can take your data, cut it into chunks yourself and call the socket's send method with one chunk at a time, sending the next chunk when the previous call returns, until there is no more (sketched just below).
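A minimal sketch of that second option, assuming an 8 kB chunk size (illustrative only, not from the original answer):

using System;
using System.Net.Sockets;

static void SendInChunks(Socket socket, byte[] data, int chunkSize = 8 * 1024)
{
    int offset = 0;
    while (offset < data.Length)
    {
        int count = Math.Min(chunkSize, data.Length - offset);
        // Send may write fewer bytes than requested; it returns the number actually sent.
        offset += socket.Send(data, offset, count, SocketFlags.None);
    }
}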

In either case you should consider what socket buffer size to choose. If you are sending a large amount of data, then the bigger the buffer is, the fewer chunks have to be sent, so there are fewer calls in your loop (or in the socket's internal loop), less memory copying, less overhead.
But allocating large socket buffers and program data buffers results in large memory usage, especially if you have thousands of connections, and repeatedly allocating (and freeing) large blocks of memory is always expensive.

On the sending side a socket buffer size of 1, 2, 4 or 8 kB is ideal for most cases, but if you are preparing to send large files (over a few MB) regularly then 16, 32 or 64 kB is the way to go. Over 64 kB there is usually no point in going.

But this only has an advantage if the receiving side has relatively large receive buffers too.

Usually over internet connections (as opposed to the local network) there is no point going over 32 kB; even 16 kB is ideal.

Going under 4-8 kB can result in an exponentially increased call count in the reading/writing loop, causing a large CPU load and slow data processing in the application.

Go under 4 kB only if you know your messages will usually be smaller than 4 kB, or only very rarely over it.
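For reference, the OS-level socket buffer sizes discussed here are exposed in .NET as plain properties; 16 kB below is just an example value, not a recommendation from the original answer:

using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
socket.SendBufferSize = 16 * 1024;      // size of the kernel-side send buffer
socket.ReceiveBufferSize = 16 * 1024;   // size of the kernel-side receive buffer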

My conclusion:

Based on my experiments, the built-in socket classes/methods/solutions in dotnet are OK but not efficient at all. My simple linux C test programs using non-blocking sockets could outperform the fastest, "high-performance" solution of dotnet sockets (SocketAsyncEventArgs).

This does not mean it is impossible to have fast socket programming in dotnet, but under Windows I had to make my own implementation of Windows IOCP: communicating directly with the Windows kernel via InteropServices/Marshaling, calling Winsock2 methods directly, using a lot of unsafe code to pass the context structs of my connections as pointers between my classes/calls, creating my own ThreadPool and I/O event handler threads, and creating my own TaskScheduler to limit the number of simultaneous async calls and avoid pointlessly many context switches.

This was a lot of work, with a lot of research, experimenting, and testing. If you want to do it on your own, do it only if you really think it is worth it. Mixing unsafe/unmanaged code with managed code is a pain in the ass, but in the end it was worth it, because with this solution I could reach about 36,000 HTTP requests/sec with my own HTTP server on a 1 gbit LAN, on Windows 7, with an i7 4790.

This is a level of performance that I could never reach with dotnet's built-in sockets.

When running my dotnet server on an i9 7900X on Windows 10, connected to a 4c/8t Intel Atom NAS on Linux via 10 gbit LAN, I can use the complete bandwidth (therefore copying data at 1 GB/s) no matter whether I have only 1 or 10000 simultaneous connections.

My socket library also detects if the code is running on linux; instead of Windows IOCP (obviously) it then uses linux kernel calls via InteropServices/Marshalling to create and use sockets and to handle socket events directly with linux epoll, which managed to max out the performance of the test machines.
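As an illustration of what such interop bindings can look like (not the author's library), a minimal epoll P/Invoke sketch follows; the struct layout assumes x86-64 glibc, where epoll_event is packed, so verify it before relying on it:

using System;
using System.Runtime.InteropServices;

internal static class Epoll
{
    public const int EPOLL_CTL_ADD = 1;
    public const uint EPOLLIN = 0x001;

    [StructLayout(LayoutKind.Sequential, Pack = 4)]   // 12 bytes on x86-64 glibc
    public struct epoll_event
    {
        public uint events;   // EPOLLIN, EPOLLOUT, ...
        public long data;     // user data, e.g. the socket fd
    }

    [DllImport("libc", SetLastError = true)]
    public static extern int epoll_create1(int flags);

    [DllImport("libc", SetLastError = true)]
    public static extern int epoll_ctl(int epfd, int op, int fd, ref epoll_event ev);

    [DllImport("libc", SetLastError = true)]
    public static extern int epoll_wait(int epfd, [In, Out] epoll_event[] events, int maxevents, int timeout);
}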

Design tip:

As it turned out, it is difficult to design a networking library from scratch, especially one that is meant to be universal for all purposes. You have to design it to have many settings, or tailor it specifically to the task you need.
This means finding the proper socket buffer sizes, the I/O processing thread count, the worker thread count, the allowed async task count; these all have to be tuned to the machine the application is running on, to the connection count, and to the data type you want to transfer through the network. This is why the built-in sockets do not perform that well: they have to be universal, and they do not let you set these parameters.
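Purely as an illustration, the tunable parameters listed above could be collected into a settings object along these lines (the names and default values are hypothetical, not from the author's library):

using System;

public sealed class ServerTuning
{
    public int ReceiveBufferSize { get; set; } = 8 * 1024;
    public int SendBufferSize { get; set; } = 8 * 1024;
    public int IoThreadCount { get; set; } = 2;                  // e.g. match the RSS queue count
    public int WorkerThreadCount { get; set; } = Environment.ProcessorCount;
    public int MaxConcurrentAsyncOps { get; set; } = 256;        // cap to limit context switching
}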

In my case, assigning more than 2 dedicated threads to I/O event processing actually makes the overall performance worse, because with only 2 RSS queues it causes more context switching than is ideal.

Choosing the wrong buffer sizes will result in performance loss.

Always benchmark different implementations with the simulated task you need, to find out which solution or setting is best.

Different settings may produce different performance results on different machines and/or operating systems!

Mono vs Dotnet Core:

Since I've written my socket library in a FW/Core compatible way, I could test it under linux both with mono and with Core native compilation. Most interestingly I could not observe any remarkable performance differences; both were fast, but of course leaving mono and compiling on Core should be the way to go.

Bonus performance tip:

If your network card is capable of RSS (Receive Side Scaling), then enable it in Windows in the network device settings, in the advanced properties, and set the RSS queue count from 1 to as high as you can / as high as is best for your performance.

If it is supported by your network card, it is usually set to 1, which means the kernel assigns network events to be processed by only one CPU core. If you can increase this queue count to higher numbers, the network events will be distributed between more CPU cores, resulting in much better performance.

On linux it is also possible to set this up, but in different ways; it is better to search for information on your linux distro/LAN driver.

I hope my experience will help some of you!

High performance TCP Client in .net

Go with NetMQ (0MQ). It's available as a NuGet package, so it should be easy to maintain.

I'd suggest something like a request socket on the client side and a router/dealer construction on the server side. The documentation provided here: http://zguide.zeromq.org/page:all is excellent.
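A rough sketch of that shape with NetMQ might look like this (the port, framing and payloads are illustrative only, not from the original answer):

using NetMQ;
using NetMQ.Sockets;

// Client: REQ socket sends a request and waits for the reply.
using (var client = new RequestSocket())
{
    client.Connect("tcp://localhost:5556");
    client.SendFrame("hello");
    string reply = client.ReceiveFrameString();
}

// Server: ROUTER socket keeps the client identity so replies can be routed back.
using (var server = new RouterSocket())
{
    server.Bind("tcp://*:5556");
    byte[] identity = server.ReceiveFrameBytes();   // who sent it
    server.ReceiveFrameString();                    // empty delimiter frame added by REQ
    string request = server.ReceiveFrameString();   // the payload
    server.SendMoreFrame(identity)
          .SendMoreFrameEmpty()
          .SendFrame("world");
}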

High-Availability TCP server application

The number of listeners is unlikely to be a limiting factor. Here at Stack Overflow we handle ~60k sockets per instance, and the only reason we need multiple listeners is so we can split the traffic over multiple ports to avoid ephemeral port exhaustion at the load balancer. Likewise, I should note that those 60k per-instance socket servers run at basically zero CPU, so: it is premature to think about multiple exes, VMs, etc. That is not the problem. The problem is the code, and distributing a poor socket infrastructure over multiple processes just hides the problem.

Writing high performance socket servers is hard, but the good news is: you can avoid most of this. Kestrel (the ASP.NET Core http server) can act as a perfectly good TCP server, dealing with most of the horrible bits of async, sockets, buffer management, etc for you, so all you have to worry about is the actual data processing. The "pipelines" API even deals with back-buffers for you, so you don't need to worry about over-read.

An extensive walkthrough of this is in my 3-and-a-bit part blog series starting here - it is simply way too much information to try and post here. But it links through to a demo server - a dummy redis server hosted via Kestrel. It can also be hosted without Kestrel, using Pipelines.Sockets.Unofficial, but... frankly I'd use Kestrel. The server shown there is broadly similar (in terms of broad initialization - not the actual things it does) to our 60k-per-instance web-socket tier.
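For illustration, a raw TCP handler hosted in Kestrel might look roughly like this (a minimal sketch assuming the ASP.NET Core 2.1+ connections API; the echo logic and port are placeholders, not the demo server from the blog series):

using System.Threading.Tasks;
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Connections;
using Microsoft.AspNetCore.Hosting;

public class EchoHandler : ConnectionHandler
{
    public override async Task OnConnectedAsync(ConnectionContext connection)
    {
        while (true)
        {
            var result = await connection.Transport.Input.ReadAsync();
            if (result.IsCompleted && result.Buffer.IsEmpty) break;

            // Echo whatever arrived straight back to the client.
            foreach (var segment in result.Buffer)
                await connection.Transport.Output.WriteAsync(segment);

            connection.Transport.Input.AdvanceTo(result.Buffer.End);
        }
    }
}

public static class Program
{
    public static void Main(string[] args) =>
        WebHost.CreateDefaultBuilder(args)
               .UseKestrel(o => o.ListenLocalhost(5000,
                    builder => builder.UseConnectionHandler<EchoHandler>()))
               .Configure(app => { })   // no HTTP pipeline needed for this sketch
               .Build()
               .Run();
}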

High performance socket server (like MMO)

There: C# SocketAsyncEventArgs High Performance Socket Code; based on things I have learnt from this (and some other resources) I have written a high performance TCP server which handles more than 7000 clients.

Edit: Other good .NET code bases I studied to some extent are fracture (F#), SocketAwaitable and SuperSocket. I especially like fracture because of its simple (not naive) and smart buffer pool handling, but (at least the version I've worked with) it does not provide a separate pool for acceptors, which I've easily added myself based on the already provided pool.
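For illustration, the slab-style buffer pooling such libraries use can be sketched like this (a simplified example under my own assumptions, not fracture's actual implementation): one large array divided into fixed slices handed out to each SocketAsyncEventArgs, so receives don't allocate or fragment the heap.

using System.Collections.Concurrent;
using System.Net.Sockets;

public sealed class SaeaBufferPool
{
    private readonly byte[] _slab;
    private readonly ConcurrentStack<int> _freeOffsets = new ConcurrentStack<int>();
    private readonly int _sliceSize;

    public SaeaBufferPool(int sliceSize, int sliceCount)
    {
        _sliceSize = sliceSize;
        _slab = new byte[sliceSize * sliceCount];   // single large allocation, never freed
        for (int i = 0; i < sliceCount; i++)
            _freeOffsets.Push(i * sliceSize);
    }

    public bool TryAssign(SocketAsyncEventArgs args)
    {
        if (!_freeOffsets.TryPop(out int offset)) return false;
        args.SetBuffer(_slab, offset, _sliceSize);   // hand one slice to this operation
        return true;
    }

    public void Release(SocketAsyncEventArgs args)
    {
        _freeOffsets.Push(args.Offset);              // return the slice to the pool
        args.SetBuffer(null, 0, 0);
    }
}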

Tips / techniques for high-performance C# server sockets

A lot of this has to do with many threads running on your system and the kernel giving each of them a time slice. The design is simple, but does not scale well.

You should probably look at using Socket.BeginReceive, which will execute on the .NET thread pool (you can configure the number of threads it uses), and then pushing onto a queue from the asynchronous callback (which can be running on any of the .NET threads). This should give you much higher performance.
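A minimal sketch of that "callback pushes onto a queue, workers drain it" shape, assuming a BlockingCollection as the hand-off (illustrative only):

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public sealed class ReceiveQueue
{
    private readonly BlockingCollection<byte[]> _work = new BlockingCollection<byte[]>();

    public ReceiveQueue(int workerCount)
    {
        // A small fixed set of consumers does the processing,
        // independent of how many sockets are producing data.
        for (int i = 0; i < workerCount; i++)
            Task.Run(() =>
            {
                foreach (var message in _work.GetConsumingEnumerable())
                {
                    // ... handle message ...
                }
            });
    }

    // Called from the BeginReceive callback: copy the received bytes and enqueue
    // them, so the socket can immediately post its next BeginReceive.
    public void Enqueue(byte[] buffer, int count)
    {
        var message = new byte[count];
        Array.Copy(buffer, message, count);
        _work.Add(message);
    }
}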

Tcp high performance server/framework and configuration to use on Azure

After some investigation I found that:

  1. Writing a custom TCP server is our only choice.
  2. It is possible to do in WCF but very difficult, because we would need to use a custom binding with a custom message encoder, and this is very difficult in our case.
  3. SignalR is not suitable for us because it is by design for other kinds of tasks, not for a generic TCP server.
  4. We found no info about that.
  5. ZeroMQ is really good and it supports the TCP request-reply style and multicasting, but it forces us to use zmq on the client side because it uses custom framing. Unfortunately our client side is implemented in hardware and it is not possible to add zmq at all.

create a TCP socket server which is able to handle thousands of requests per second

You may have a "non-blocking listener", but when any particular client connects, it devotes itself to just that client until that client has sent a message and a response has been sent back to it. That's not going to scale well.

I'm usually not a fan of async void, but it's in keeping with your current code:

public async void StartListener() // non-blocking listener
{
    listener.Start();
    while (true)
    {
        TcpClient client = await listener.AcceptTcpClientAsync().ConfigureAwait(false);
        HandleClient(client);
    }
}

private async void HandleClient(TcpClient client)
{
    using (client)
    {
        NetworkStream networkStream = client.GetStream();
        byte[] bytesFrom = new byte[20];
        int totalRead = 0;
        // A single ReadAsync may return fewer bytes than requested,
        // so keep reading until the full 20-byte message has arrived.
        while (totalRead < 20)
        {
            int read = await networkStream.ReadAsync(bytesFrom, totalRead, 20 - totalRead).ConfigureAwait(false);
            if (read == 0) return; // client disconnected before sending a full message
            totalRead += read;
        }
        string dataFromClient = System.Text.Encoding.ASCII.GetString(bytesFrom);
        string serverResponse = "Received!";
        byte[] sendBytes = Encoding.ASCII.GetBytes(serverResponse);
        await networkStream.WriteAsync(sendBytes, 0, sendBytes.Length).ConfigureAwait(false);
        networkStream.Flush(); /* Not sure necessary */
    }
}

I've also fixed the bug I mentioned in the comments about ignoring the return value from Read and removed the "hide errors from me making bugs impossible to spot in the wild" error handling.

If you're not guaranteed that your clients will always send a 20 byte message to this code, then you need to do something else so that the server knows how much data to read. This is usually done by either prefixing the message with its length or using some form of sentinel value to indicate the end. Note that even with length-prefixing, you're not guaranteed to read the whole length in one go and so you'd need to also use a read loop, as above, to discover the length first.
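A minimal sketch of the length-prefix variant, assuming a 4-byte length header (illustrative, not part of the original answer):

using System;
using System.IO;
using System.Net.Sockets;
using System.Threading.Tasks;

static class Framing
{
    // Read one length-prefixed message: a 4-byte length header, then that many payload bytes.
    public static async Task<byte[]> ReadMessageAsync(NetworkStream stream)
    {
        byte[] header = await ReadExactAsync(stream, 4).ConfigureAwait(false);
        int length = BitConverter.ToInt32(header, 0);   // assumes both sides agree on endianness
        return await ReadExactAsync(stream, length).ConfigureAwait(false);
    }

    // Loop until exactly 'count' bytes have arrived; a single ReadAsync may return fewer.
    private static async Task<byte[]> ReadExactAsync(NetworkStream stream, int count)
    {
        var buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int read = await stream.ReadAsync(buffer, total, count - total).ConfigureAwait(false);
            if (read == 0) throw new EndOfStreamException("Connection closed mid-message.");
            total += read;
        }
        return buffer;
    }
}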


If switching everything to async isn't giving you the scale you need, then you need to abandon using NetworkStream and start working at the Socket level, and specifically with the async methods designed to work with SocketAsyncEventArgs:

The SocketAsyncEventArgs class is part of a set of enhancements to the System.Net.Sockets.Socket class that provide an alternative asynchronous pattern that can be used by specialized high-performance socket applications... An application can use the enhanced asynchronous pattern exclusively or only in targeted hot areas (for example, when receiving large amounts of data).

The main feature of these enhancements is the avoidance of the repeated allocation and synchronization of objects during high-volume asynchronous socket I/O...

In the new System.Net.Sockets.Socket class enhancements, asynchronous socket operations are described by reusable SocketAsyncEventArgs objects allocated and maintained by the application...


