Tcp: When Is Epollhup Generated

TCP: When is EPOLLHUP generated?

For these kind of questions, use the source! Among other interesting comments, there is this text:

EPOLLHUP is UNMASKABLE event (...). It means that after we received EOF, poll always returns immediately, making impossible poll() on write() in state CLOSE_WAIT. One solution is evident --- to set EPOLLHUP if and only if shutdown has been made in both directions.

And then the only code that sets EPOLLHUP:

if (sk->sk_shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
    mask |= EPOLLHUP;

Being SHUTDOWN_MASK equal to RCV_SHUTDOWN |SEND_SHUTDOWN.

TL; DR; You are right, this flag is only sent when the shutdown has been both for read and write (I reckon that the peer shutdowning the write equals to my shutdowning the read). Or when the connection is closed, of course.

UPDATE: From reading the source code with more detail, these are my conclusions.

About shutdown:

Doing shutdown(SHUT_WR) sends a FIN and marks the socket with SEND_SHUTDOWN.
Doing shutdown(SHUT_RD) sends nothing and marks the socket with RCV_SHUTDOWN.
Receiving a FIN marks the socket with RCV_SHUTDOWN.

And about epoll:

If the socket is marked with SEND_SHUTDOWN and RCV_SHUTDOWN, poll will return EPOLLHUP.
If the socket is marked with RCV_SHUTDOWN, poll will return EPOLLRDHUP.

So the HUP events can be read as:

EPOLLRDHUP: you have received FIN or you have called shutdown(SHUT_RD). In any case your reading half-socket is hung, that is, you will read no more data.
EPOLLHUP: you have both half-sockets hung. The reading half-socket is just like the previous point, For the sending half-socket you did something like shutdown(SHUT_WR).

To complete a a graceful shutdown I would do:

Do shutdown(SHUT_WR) to send a FIN and mark the end of sending data.
Wait for the peer to do the same by polling until you get a EPOLLRDHUP.
Now you can close the socket with grace.

PS: About your comment:

it's counterintuitive to get writable, as the writing half is closed

It is actually expected if you understand the output of epoll not as ready but as will not block. That is, if you get EPOLLOUT you have the guarantee that calling write() will not block. And certainly, after shutdown(SHUT_WR), write() will return immediately.

How do I use EPOLLHUP

You use EPOLLRDHUP to detect peer shutdown, not EPOLLHUP (which signals an unexpected close of the socket, i.e. usually an internal error).

Using it is really simple, just "or" the flag with any other flags that you are giving to epoll_ctl. So, for example instead of EPOLLIN write EPOLLIN|EPOLLRDHUP.

After epoll_wait, do an if(my_event.events & EPOLLRDHUP) followed by whatever you want to do if the other side closed the connection (you'll probably want to close the socket).

Note that getting a "zero bytes read" result when reading from a socket also means that the other end has shut down the connection, so you should always check for that too, to avoid nasty surprises (the FIN might arrive after you have woken up from EPOLLIN but before you call read, if you are in ET mode, you'll not get another notification).

Given any epoll TCP socket event, if EPOLLRDHUP=0 and EPOLLIN=1; is a subsequent call to read()/recv() guaranteed to return a read size unequal to 0?

If you get an event with EPOLLRDHUP=1 then just close the connection right away without reading. If you get an event with EPOLLRDHUP=0 and EPOLLIN=1 then go ahead and read, but you should be prepared to handle the possibility of recv() still returning 0, just in case. Perhaps a FIN arrives after you got EPOLLIN=1 but before you actually call recv().

Why am I getting the EPOLLHUP event on a brand new socket

As my example in the comments shows, it seems you can't poll the socket before it's properly initialized, unless you want to handle EPOLLHUP.

As for the question, no, you won't miss any events. Calling listen() then epoll() is the same you'd have to do otherwise (listen() + blocking accept()); actual incoming connections between those calls are handled by the kernel and stay waiting until your code handles them.

EPOLLRDHUP not reliable

To answer this: EPOLLRDHUP indeed comes if you continue to poll after receiving a zero-byte read.
So from my experiments it looks like either an EPOLLIN with zero-byte read or an EPOLLRDHUP are reliable indicators for orderly shutdown, the only trouble was, they are not received together. Sometimes (the case that makes the subject of this question), it happens that EPOLLIN is received, yielding zero bytes (connection terminated), and on subsequent polling you get to see the EPOLLRDHUP. Other times, it's vice-versa: you get the EPOLLRDHUP together with an EPOLLIN that signals actual bytes to be read. Then, on subsequent reads, you get zero bytes.

Epoll and remote 1-way shutdown

I'm answering this myself after doing the heavy lifting to find the answer.

A socket listening for epoll events will typically receive an EPOLLRDHUP (in addition to EPOLLIN) event flag upon the remote peer calling close or shutdown(SHUT_WR). This does not neccessarily mean the socket is dead. Subsequent calls to recv() will return any unread data on the socket and eventually "0" will be returned to indicate EOF. It may even be possible to send data back if the remote peer only did a half-close of its socket.

The one notable exception is if the remote peer is using the SO_LINGER option enabled on its socket with a linger value of "0". The result of closing such a socket may result in a TCP RST getting sent instead of a FIN. From what I've read, a connection reset event will generate either a EPOLLHUP or EPOLLERR. (I haven't had time to confirm, but it makes sense).

There is some documentation to suggest there are older Linux implementations that don't support EPOLLRDHUP, as such EPOLLHUP gets generated instead.

And for what it is worth, in my particular case, I found that it is not too interesting to have code that special cases EPOLLHUP or EPOLLRDHUP events. Instead, just treat these events the same as EPOLLIN/EPOLLOUT and call recv() (or send() as appropriate). But pay close attention to return codes returned back from recv() and send().

POLLHUP vs. POLLRDHUP?

No, when poll()ing a socket, POLLHUP will signal that the connection was closed in both directions.

POLLRDHUP will be set when the other end has called shutdown(SHUT_WR) or when this end has called shutdown(SHUT_RD), but the connection may still be alive in the other direction.

You can have a look at net/ipv4/tcp.c the kernel source:

        if (sk->sk_shutdown == SHUTDOWN_MASK || state == TCP_CLOSE)
                mask |= EPOLLHUP;
        if (sk->sk_shutdown & RCV_SHUTDOWN)
                mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;

SHUTDOWN_MASK is RCV_SHUTDOWN|SEND_SHUTDOWN. RCV_SHUTDOWN is set when a FIN packet is received, and SEND_SHUTDOWN is set when a FIN packet is acknowledged by the other end, and the socket moves to the FIN-WAIT2 state.

[except for the TCP_CLOSE part, that snippet is replicated by all protocols; and the whole thing works similarly for unix sockets, etc]

There are other important differences -- POLLRDHUP (unlike POLLHUP) has to be set explicitly in .events in order to be returned in .revents.

And POLLRDHUP only works on sockets, not on fifos/pipes or ttys.

Tcp: When Is Epollhup Generated