Difference Between Unix Domain Stream and Datagram Sockets

Difference between UNIX domain STREAM and DATAGRAM sockets?

Just as the manual page says, Unix sockets are always reliable. The difference between SOCK_STREAM and SOCK_DGRAM is in the semantics of consuming data from the socket.

A stream socket allows reading an arbitrary number of bytes while still preserving the byte sequence. In other words, a sender might write 4K of data to the socket, and the receiver can consume that data byte by byte. The other way around is true too: a sender can write several small messages to the socket, and the receiver can consume them all in one read. A stream socket does not preserve message boundaries.
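To make that concrete, here is a minimal sketch using Python's socket.socketpair() to get two connected Unix domain stream sockets in one process. The three-byte read size is an arbitrary choice for illustration; the exact chunking you see can vary, but the byte order will not.

```python
import socket

# A connected pair of Unix domain stream sockets in one process.
sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Two separate five-byte writes...
sender.sendall(b"hello")
sender.sendall(b"world")
sender.close()  # signals end-of-stream so the loop below can finish

# ...can be consumed in chunks of any size; here, at most three bytes at a time.
chunks = []
while True:
    chunk = receiver.recv(3)
    if not chunk:  # b"" means the sender closed its end
        break
    chunks.append(chunk)
receiver.close()

stream_data = b"".join(chunks)
print(chunks, stream_data)
```

The receiver has no way to tell that the data was written as two messages; it only sees ten bytes in order.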

A datagram socket, on the other hand, does preserve these boundaries: one write by the sender always corresponds to one read by the receiver, even if the receiver's buffer given to read(2) or recv(2) is smaller than that message (in which case the excess bytes are discarded).
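A small sketch of that behaviour, again with socket.socketpair() in Python. The message contents and buffer sizes are made up for the example. Note what happens to the first message when the receive buffer is too small: the tail is dropped rather than carried over to the next read.

```python
import socket

sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

# Three writes produce three separate datagrams.
sender.send(b"first message")
sender.send(b"second")
sender.send(b"third")

# Buffer smaller than the first datagram: we get its first 5 bytes and the
# rest of that datagram is discarded, not left for the next read.
truncated = receiver.recv(5)

# Each following read returns exactly one whole datagram.
second = receiver.recv(1024)
third = receiver.recv(1024)

sender.close()
receiver.close()
print(truncated, second, third)
```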

So if your application protocol has small messages with a known upper bound on message size, you are better off with SOCK_DGRAM, since that's easier to manage.

If your protocol calls for arbitrarily long message payloads, or is just an unstructured stream (like raw audio), then pick SOCK_STREAM and do the required buffering.
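One common way to do "the required buffering" over SOCK_STREAM is to prefix every message with its length. The sketch below is only one possible scheme: the 4-byte big-endian header and the helper names send_msg, recv_exact and recv_msg are my own choices for illustration, not part of any standard API.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian length; an illustrative choice

def send_msg(sock, payload):
    """Send one message as <length><payload>."""
    sock.sendall(HEADER.pack(len(payload)) + payload)

def recv_exact(sock, n):
    """Keep reading until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)

def recv_msg(sock):
    """Read the length header, then read that many payload bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
send_msg(a, b"short")
send_msg(a, b"x" * 3000)
framed = [recv_msg(b), recv_msg(b)]
a.close()
b.close()
print([len(m) for m in framed])
```

The loop in recv_exact is the important part: a single recv() on a stream socket may return fewer bytes than requested, so the reader has to keep going until the whole message is in.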

Performance should be about the same, since both types just go through local in-kernel memory; only the buffer management is different.

What's the difference between streams and datagrams in network programming?

A long time ago I read a great analogy for explaining the difference between the two. I don't remember where I read it so unfortunately I can't credit the author for the idea, but I've also added a lot of my own knowledge to the core analogy anyway. So here goes:

A stream socket is like a phone call -- one side places the call, the other answers, you say hello to each other (SYN/ACK in TCP), and then you exchange information. Once you are done, you say goodbye (FIN/ACK in TCP). If one side doesn't hear a goodbye, they will usually call the other back since this is an unexpected event; usually the client will reconnect to the server. There is a guarantee that data will not arrive in a different order than you sent it, and there is a reasonable guarantee that data will not be damaged.

A datagram socket is like passing a note in class. Consider the case where you are not directly next to the person you are passing the note to; the note will travel from person to person. It may not reach its destination, and it may be modified by the time it gets there. If you pass two notes to the same person, they may arrive in an order you didn't intend, since the route the notes take through the classroom may not be the same, one person might not pass a note as fast as another, etc.

So you use a stream socket when having information in order and intact is important. File transfer protocols are a good example here. You don't want to download some file with its contents randomly shuffled around and damaged!

You'd use a datagram socket when order is less important than timely delivery (think VoIP or game protocols), when you don't want the higher overhead of a stream (this is why DNS is primarily a datagram protocol, so that servers can respond to many, many requests at once very quickly), or when you don't care too much if the data ever reaches its destination.

To expand on the VoIP/game case, such protocols include their own data-ordering mechanism. But if one packet is damaged or lost, you don't want to wait on the stream protocol (usually TCP) to issue a re-send request -- you need to recover quickly. TCP can take up to some number of minutes to recover, and for realtime protocols like gaming or VoIP even three seconds may be unacceptable! Using a datagram protocol like UDP allows the software to recover from such an event extremely quickly, by simply ignoring the lost data or re-requesting it sooner than TCP would.

VoIP is a good candidate for simply ignoring the lost data -- one party would just hear a short gap, similar to what happens when talking to someone on a cell phone when they have poor reception. Gaming protocols are often a little more complex, but the actions taken will usually be to either ignore the missing data (if subsequently-received data supersedes the data that was lost), re-request the missing data, or request a complete state update to ensure that the client's state is in sync with the server's.
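As a rough illustration of the "ignore data that has been superseded" strategy, here is a sketch of a receiver that tags each update with a sequence number. The packet layout (a sequence number followed by a single x position) and the function name latest_state are hypothetical, and sequence-number wraparound is not handled.

```python
import struct

# Hypothetical packet layout: unsigned sequence number, signed 16-bit x position.
PACKET = struct.Struct("!Ih")

def latest_state(datagrams):
    """Apply datagrams in arrival order, skipping any that are not newer
    than the newest one already applied."""
    newest_seq = -1
    state = None
    for raw in datagrams:
        seq, x = PACKET.unpack(raw)
        if seq <= newest_seq:
            continue  # stale or duplicate: newer data already superseded it
        newest_seq, state = seq, x
    return newest_seq, state

# Update 2 arrives after update 3, so it is ignored.
arrived = [PACKET.pack(1, 10), PACKET.pack(3, 30), PACKET.pack(2, 20)]
result = latest_state(arrived)
print(result)
```

Nothing waits for the late packet: the receiver just carries on with the newest state it has.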

What's the difference between a stream-type socket and a datagram socket type?

The short answer: message boundaries and connections.

With a stream socket you can write two five-byte messages and wind up reading one ten-byte message. This is because the data you write just gets placed into a single stream, with no boundaries between writes. This is just like writing a word at a time to a file. As a reader of the file, how do you know whether the writer originally wrote to the file one character at a time, one word at a time, one sentence at a time, one paragraph at a time, or wrote the whole file all at once? Basically, if the file is already written, you don't. With a stream, how will you know whether the source sent two five-byte messages or one ten-byte message if the sending was done in rapid succession? You have to have some sort of length or delimiter to indicate message boundaries. Sometimes you don't care about messages or their boundaries. Other times, you add application-level data (e.g., headers, delimiters, pre-defined message lengths). That makes a stream socket usable for messages as well, since you handle the messaging yourself (i.e., at the application layer).
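For the delimiter option, a minimal sketch might look like the following. It assumes newline-terminated messages whose payload never contains a newline; the function name read_lines and the tiny 4-byte read size are just for the demonstration.

```python
import socket

def read_lines(sock):
    """Yield newline-terminated messages from a stream socket."""
    pending = b""
    while True:
        chunk = sock.recv(4)  # deliberately tiny to force partial reads
        if not chunk:
            break  # peer closed; an unterminated tail in `pending` is dropped
        pending += chunk
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
# The writes do not line up with the message boundaries at all.
a.sendall(b"alpha\nbe")
a.sendall(b"ta\ngamma\n")
a.close()
lines = list(read_lines(b))
b.close()
print(lines)
```

The reader recovers the three messages regardless of how the writer's send calls were split.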

With a datagram-based socket, the receiver knows the size of the messages that the sender sent, because they are delivered 1:1 (barring losses, duplicates, etc.), retaining their original sizes.

In addition to all of this, stream-based sockets tend to be connection-oriented and 1:1, while datagram sockets are connectionless and potentially one (source) to many (receivers), with broadcast / multicast.

What is each type of Unix socket communication best suited for?

It really depends what kind of server you are going to implement.

If message boundaries are important, then SOCK_DGRAM would be the best choice, because recvfrom/recvmsg/select will return when a complete message is received.

With SOCK_STREAM, receiving messages is trickier: one receiving call may return a partial message, parts of two messages, several messages, and so on.

If message boundaries are not important, then SOCK_STREAM could be the best choice.

SOCK_DGRAM of AF_INET is unreliable UDP. But on most systems, SOCK_DGRAM of AF_UNIX is reliable.
For example: if the receiver's queue is full, the sender will block until there is space.
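You can observe this back-pressure without hanging the process by making the sender non-blocking, so that a full queue shows up as an error instead of a block. This is a sketch under assumptions: the exact limit and error are platform-dependent (on Linux I would expect EAGAIN; some other systems may report ENOBUFS), and the 100,000 iteration cap is just a safety net.

```python
import errno
import socket

sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
sender.setblocking(False)    # a full queue raises instead of blocking
receiver.setblocking(False)  # an empty queue raises instead of blocking

# Send numbered datagrams until the kernel refuses to queue any more.
sent = 0
for i in range(100_000):
    try:
        sender.send(i.to_bytes(4, "big"))
    except OSError as exc:
        if exc.errno in (errno.EAGAIN, errno.EWOULDBLOCK, errno.ENOBUFS):
            break  # queue is full; a blocking sender would wait here
        raise
    sent += 1

# Drain the receiver: every datagram that was accepted should be there, in order.
received = []
while True:
    try:
        received.append(int.from_bytes(receiver.recv(16), "big"))
    except BlockingIOError:
        break

sender.close()
receiver.close()
print(sent, len(received))
```

On a system where AF_UNIX datagrams are reliable, the two counts printed at the end match: nothing that the kernel accepted was silently dropped.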

How do Unix Domain Sockets differentiate between multiple clients?

If you create a PF_UNIX socket of type SOCK_STREAM, and accept connections on it, then each time you accept a connection, you get a new file descriptor (as the return value of the accept system call). This file descriptor reads data from and writes data to a file descriptor in the client process. Thus it works just like a TCP/IP connection.
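Here is a minimal single-process sketch of that, in Python. The socket path under a temporary directory is a placeholder, and a real server would normally accept in a loop rather than exactly twice.

```python
import os
import socket
import tempfile

path = os.path.join(tempfile.mkdtemp(), "server.sock")  # placeholder path

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)
server.listen(5)

# Two clients connect to the same path; connect() completes once the
# connection is queued in the listen backlog.
client_a = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client_a.connect(path)
client_b = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client_b.connect(path)

# Each accept() returns a new socket (file descriptor) for one client.
conn_1, _ = server.accept()
conn_2, _ = server.accept()
fds = (conn_1.fileno(), conn_2.fileno())

client_a.sendall(b"from A")
client_b.sendall(b"from B")
got_1 = conn_1.recv(64)  # data from the first client to connect
got_2 = conn_2.recv(64)  # data from the second client to connect

for s in (conn_1, conn_2, client_a, client_b, server):
    s.close()
os.unlink(path)
print(fds, got_1, got_2)
```

The server tells the clients apart purely by which accepted descriptor the data arrives on.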

There's no “unix domain protocol format”. There doesn't need to be, because a Unix-domain socket can't be connected to a peer over a network connection. In the kernel, the file descriptor representing your end of a SOCK_STREAM Unix-domain socket points to a data structure that tells the kernel which file descriptor is at the other end of the connection. When you write data to your file descriptor, the kernel looks up the file descriptor at the other end of the connection and appends the data to that other file descriptor's read buffer. The kernel doesn't need to put your data inside a packet with a header describing its destination.

For a SOCK_DGRAM socket, you have to tell the kernel the path of the socket that should receive your data, and it uses that to look up the file descriptor for that receiving socket.

If you bind a path to your client socket before you connect to the server socket (or before you send data if you're using SOCK_DGRAM), then the server process can get that path using getpeername (for SOCK_STREAM). For a SOCK_DGRAM, the receiving side can use recvfrom to get the path of the sending socket.
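A short sketch of the SOCK_DGRAM case in Python, where the client binds its own path so the server can see who sent the message and reply to it. The file names server.sock and client.sock are placeholders.

```python
import os
import socket
import tempfile

tmpdir = tempfile.mkdtemp()
server_path = os.path.join(tmpdir, "server.sock")  # placeholder names
client_path = os.path.join(tmpdir, "client.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(server_path)

client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
client.bind(client_path)  # gives the client an address the server can see
client.sendto(b"ping", server_path)

# recvfrom reports the path the sending socket was bound to.
data, sender_addr = server.recvfrom(64)
server.sendto(b"pong", sender_addr)  # reply to that address
reply = client.recv(64)

client.close()
server.close()
os.unlink(client_path)
os.unlink(server_path)
print(data, sender_addr == client_path, reply)
```

Without the client.bind() call the server would have no address to send the reply to.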

If you don't bind a path, then the receiving process can't get an id that uniquely identifies the peer. At least, not on the Linux kernel I'm running (2.6.18-238.19.1.el5).
