Python Socket Receive - Incoming Packets Always Have a Different Size

Python socket receive - incoming packets always have a different size

Note: As people have pointed out in the comments, calling recv() with no parameters is not allowed in Python, and so this answer should be disregarded.

Original answer:


The network is always unpredictable. TCP makes a lot of this random behavior go away for you. One wonderful thing TCP does: it guarantees that the bytes will arrive in the same order. But! It does not guarantee that they will arrive chopped up in the same way. You simply cannot assume that every send() from one end of the connection will result in exactly one recv() on the far end with exactly the same number of bytes.

When you say socket.recv(x), you're saying 'don't return until you've read x bytes from the socket'. This is called "blocking I/O": you will block (wait) until your request has been filled. If every message in your protocol was exactly 1024 bytes, calling socket.recv(1024) would work great. But it sounds like that's not true. If your messages are a fixed number of bytes, just pass that number in to socket.recv() and you're done.

But what if your messages can be of different lengths? The first thing you need to do: stop calling socket.recv() with an explicit number. Changing this:

data = self.request.recv(1024)

to this:

data = self.request.recv()

means recv() will always return whenever it gets new data.

But now you have a new problem: how do you know when the sender has sent you a complete message? The answer is: you don't. You're going to have to make the length of the message an explicit part of your protocol. Here's the best way: prefix every message with a length, either as a fixed-size integer (converted to network byte order using socket.ntohs() or socket.ntohl() please!) or as a string followed by some delimiter (like '123:'). This second approach often less efficient, but it's easier in Python.

Once you've added that to your protocol, you need to change your code to handle recv() returning arbitrary amounts of data at any time. Here's an example of how to do this. I tried writing it as pseudo-code, or with comments to tell you what to do, but it wasn't very clear. So I've written it explicitly using the length prefix as a string of digits terminated by a colon. Here you go:

length = None
buffer = ""
while True:
data += self.request.recv()
if not data:
break
buffer += data
while True:
if length is None:
if ':' not in buffer:
break
# remove the length bytes from the front of buffer
# leave any remaining bytes in the buffer!
length_str, ignored, buffer = buffer.partition(':')
length = int(length_str)

if len(buffer) < length:
break
# split off the full message from the remaining bytes
# leave any remaining bytes in the buffer!
message = buffer[:length]
buffer = buffer[length:]
length = None
# PROCESS MESSAGE HERE

Python Socket Receive Large Amount of Data

TCP/IP is a stream-based protocol, not a message-based protocol. There's no guarantee that every send() call by one peer results in a single recv() call by the other peer receiving the exact data sent—it might receive the data piece-meal, split across multiple recv() calls, due to packet fragmentation.

You need to define your own message-based protocol on top of TCP in order to differentiate message boundaries. Then, to read a message, you continue to call recv() until you've read an entire message or an error occurs.

One simple way of sending a message is to prefix each message with its length. Then to read a message, you first read the length, then you read that many bytes. Here's how you might do that:

def send_msg(sock, msg):
# Prefix each message with a 4-byte length (network byte order)
msg = struct.pack('>I', len(msg)) + msg
sock.sendall(msg)

def recv_msg(sock):
# Read message length and unpack it into an integer
raw_msglen = recvall(sock, 4)
if not raw_msglen:
return None
msglen = struct.unpack('>I', raw_msglen)[0]
# Read the message data
return recvall(sock, msglen)

def recvall(sock, n):
# Helper function to recv n bytes or return None if EOF is hit
data = bytearray()
while len(data) < n:
packet = sock.recv(n - len(data))
if not packet:
return None
data.extend(packet)
return data

Then you can use the send_msg and recv_msg functions to send and receive whole messages, and they won't have any problems with packets being split or coalesced on the network level.

Python Socket is receiving inconsistent messages from Server

It is best to think of a socket as a continuous stream of data, that may arrive in dribs and drabs, or a flood.

In particular, it is the receivers job to break the data up into the "records" that it should consist of, the socket does not magically know how to do this for you. Here the records are lines, so you must read the data and split into lines yourself.

You cannot guarantee that a single recv will be a single full line. It could be:

  • just part of a line;
  • or several lines;
  • or, most probably, several lines and another part line.

Try something like: (untested)

# we'll use this to collate partial data
data = ""

while 1:
# receive the next batch of data
data += s.recv(BUFFER_SIZE).decode('utf-8')

# split the data into lines
lines = data.splitlines(keepends=True)

# the last of these may be a part line
full_lines, last_line = lines[:-1], lines[-1]

# print (or do something else!) with the full lines
for l in full_lines:
print(l, end="")

# was the last line received a full line, or just half a line?
if last_line.endswith("\n"):
# print it (or do something else!)
print(last_line, end="")

# and reset our partial data to nothing
data = ""
else:
# reset our partial data to this part line
data = last_line

Total bytes receive through socket is different with real file size

If the socket is asynchronous, you do not get whole 1024*1024 bytes. You get only as many bytes as is available (was received already).

Print the size of the block instead, to get an accurate number of bytes read.

How does the python socket.recv() method know that the end of the message has been reached?

It depends on the protocol. Some protocols like UDP send messages and exactly 1 message is returned per recv. Assuming you are talking about TCP specifically, there are several factors involved. TCP is stream oriented and because of things like the amount of currently outstanding send/recv data, lost/reordered packets on the wire, delayed acknowledgement of data, and the Nagle algorithm (which delays some small sends by a few hundred milliseconds), its behavior can change subtly as a conversation between client and server progresses.

All the receiver knows is that it is getting a stream of bytes. It could get anything from 1 to the fully requested buffer size on any recv. There is no one-to-one correlation between the send call on one side and the recv call on the other.

If you need to figure out message boundaries its up to the higher level protocols to figure that out. Take HTTP for example. It starts with a \r\n delimited header and then has a count of the remaining bytes the client should expect to receive. The client knows how to read the header because of the \r\n then knows exactly how many bytes are coming next. Part of the charm of RESTful protocols is that they are HTTP based and somebody else already figured this stuff out!

Some protocols use NUL to delimit messages. Others may have a fixed length binary header that includes a count of any variable data to come. I like zeromq which has a robust messaging system on top of TCP.

More details on what happens with receive...

When you do recv(1024), there are 6 possibilities

  1. There is no receive data. recv will wait until there is receive data. You can change that by setting a timeout.

  2. There is partial receive data. You'll get that part right away. The rest is either buffered or hasn't been sent yet and you just do another recv to get more (and the same rules apply).

  3. There is more than 1024 bytes available. You'll get 1024 of that data and the rest is buffered in the kernel waiting for another receive.

  4. The other side has shut down the socket. You'll get 0 bytes of data. 0 means you will never get more data on that socket. But if you keep asking for data, you'll keep getting 0 bytes.

  5. The other side has reset the socket. You'll get an exception.

  6. Some other strange thing has gone on and you'll get an exception for that.

What is the actual impact of calling socket.recv with a bufsize that is not a power of 2?

I'm pretty sure the 'power of 2' advice is based on an error in editing, and should not be taken as a requirement.

That specific piece of advice was added to the Python 2.5 documentation (and backported to Python 2.4.3 docs), in response to Python issue #756104. The reporter was using an unreasonably large buffer size for socket.recv(), which prompted the update.

It was Tim Peters that introduced the 'power of 2' concept:

I expect you're the only person in history to try passing such
a large value to recv() -- even if it worked, you'd almost
certainly run out of memory trying to allocate buffer space for
1.9GB. sockets are a low-level facility, and it's common to
pass a relatively small power of 2 (for best match with
hardware and network realities).

(Bold emphasis mine). I've worked with Tim and he has a huge amount of experience with network programming and hardware, so generally speaking I'd take him on his word when making a remark like that. He was particularly 'fond' of the Windows 95 stack, he called it his canary in a coalmine for its ability to fail under stress. But note that he says it is common, not that it is required to use a power of 2.

It was that wording that then led to the documentation update:

This is a documentation bug; something the user should be
"warned" about.

This caught me once, and two different persons asked about
this in #python, so maybe we should put something like the
following in the recv() docs.

"""

For best match with hardware and network realities, the

value of "buffer" should be a relatively small power of 2,

for example, 4096.

"""

If you think the wording is right, just assign the bug to
me, I'll take care of it.

No one challenged the 'power of 2' assertion here, but the editor moved from it is common to should be in the space of a few replies.

To me, those proposing the documentation update were more concerned with making sure you use a small buffer, and not whether or not it is a power of 2. That's not to say it is not good advice however; any low-level buffer that interacts with the kernel benefits with alignment with the kernel data structures.

But although there may well be an esoteric stack where buffers with a size that is a power of 2 matters even more, I doubt Tim Peters ever meant for his experience (that it is common practice) to be cast in such iron-clad terms. Just ignore it if a different buffer size makes more sense for your specific use cases.

Can I say that socket.send() flushed/resets the TCP stream here?

I've read here that recv blocks IO until my request of 1000 bytes has been fulfilled.

Which is wrong. recv blocks until at least one byte is received. The number given just specifies the maximum number of bytes which should be read, i.e. neither the exact number nor the minimum number.

The following 3 bytes go unreceived.

It is likely that in this specific case the 1000 bytes are received at once, leaving 3 bytes unread. This is different though if larger amounts of data are send, especially over links with low MTU (i.e. local network, WiFi vs. localhost traffic). Here it can be seen that only parts of the expected data are received during a single recv.

Even the assumption that send will send all given data is wrong: send will only send at most the given data. One needs to actually check the return value to see how much actually got send. Use sendall instead if you want to have everything send.

Can I say that socket.send() “flushed”/“resets” the TCP stream here?

No. send and recv work only on the socket write and read buffers. They don't actually cause a sending or receiving. This is done by the OS instead. A send just puts the data into the sockets write buffer and the OS will eventually transmit this data. This transmission is not in all cases done immediately though. If there are outstanding unacknowledged data the sending might get deferred until the data are acknowledged (details depend on the TCP window). If only few data are in the buffer the OS might wait a while for the application to call send with more data in order to keep the transmission overhead low (NAGLE algorithm).

Thus the phrase "flush" has no real meaning here. And "reset" actually means something completely different with TCP - namely forcibly breaking the connection using the RST flag. So don't use these phrases in this context.



Related Topics



Leave a reply



Submit