Select VS Poll VS Epoll

Why is epoll faster than select?

There's a lot of misinformation about this, but the real reason is this:

A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.

With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.

By contrast, with epoll, the epoll socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.

To be clear, with epoll, the epoll socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.

Ironically, with select, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll socket when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity while epoll rigs it so that the socket activity itself notifies the process.

What are the differences between poll and select?

I think that this answers your question:

From Richard Stevens (rstevens@noao.edu):

The basic difference is that select()'s fd_set is a bit mask and
therefore has some fixed size. It would be possible for the kernel to
not limit this size when the kernel is compiled, allowing the
application to define FD_SETSIZE to whatever it wants (as the comments
in the system header imply today) but it takes more work. 4.4BSD's
kernel and the Solaris library function both have this limit. But I
see that BSD/OS 2.1 has now been coded to avoid this limit, so it's
doable, just a small matter of programming. :-) Someone should file a
Solaris bug report on this, and see if it ever gets fixed.

With poll(), however, the user must allocate an array of pollfd
structures, and pass the number of entries in this array, so there's
no fundamental limit. As Casper notes, fewer systems have poll() than
select, so the latter is more portable. Also, with original
implementations (SVR3) you could not set the descriptor to -1 to tell
the kernel to ignore an entry in the pollfd structure, which made it
hard to remove entries from the array; SVR4 gets around this.
Personally, I always use select() and rarely poll(), because I port my
code to BSD environments too. Someone could write an implementation
of poll() that uses select(), for these environments, but I've never
seen one. Both select() and poll() are being standardized by POSIX
1003.1g.

October 2017 Update:

The email referenced above is at least as old as 2001; the poll() command is now (2017) supported across all modern operating systems - including BSD. In fact, some people believe that select() should be deprecated. Opinions aside, portability issues around poll() are no longer a concern on modern systems. Furthermore, epoll() has since been developed (you can read the man page), and continues to rise in popularity.

For modern development you probably don't want to use select(), although there's nothing explicitly wrong with it. poll(), and it's more modern evolution epoll(), provide the same features (and more) as select() without suffering from the limitations therein.

poll vs. epoll insight

Always use poll unless all of the following are satisfied:

You can ensure you're on a (Linux) system that has epoll or you provide a fallback for systems that don't.
You have a huge number of file descriptors active (at least 1000-10000).
The set of file descriptors you're working with is stable over a long time period (adding/removing file descriptors from the epoll list is just as expensive as a poll operation, since it requires entering/leaving kernelspace).

Caveats of select/poll vs. epoll reactors in Twisted

For very small numbers of sockets (varies depending on your hardware, of course, but we're talking about something on the order of 10 or fewer), select can beat epoll in memory usage and runtime speed. Of course, for such small numbers of sockets, both mechanisms are so fast that you don't really care about this difference in the vast majority of cases.

One clarification, though. Both select and epoll scale linearly. A big difference, though, is that the userspace-facing APIs have complexities that are based on different things. The cost of a select call goes roughly with the value of the highest numbered file descriptor you pass it. If you select on a single fd, 100, then that's roughly twice as expensive as selecting on a single fd, 50. Adding more fds below the highest isn't quite free, so it's a little more complicated than this in practice, but this is a good first approximation for most implementations.

The cost of epoll is closer to the number of file descriptors that actually have events on them. If you're monitoring 200 file descriptors, but only 100 of them have events on them, then you're (very roughly) only paying for those 100 active file descriptors. This is where epoll tends to offer one of its major advantages over select. If you have a thousand clients that are mostly idle, then when you use select you're still paying for all one thousand of them. However, with epoll, it's like you've only got a few - you're only paying for the ones that are active at any given time.

All this means that epoll will lead to less CPU usage for most workloads. As far as memory usage goes, it's a bit of a toss up. select does manage to represent all the necessary information in a highly compact way (one bit per file descriptor). And the FD_SETSIZE (typically 1024) limitation on how many file descriptors you can use with select means that you'll never spend more than 128 bytes for each of the three fd sets you can use with select (read, write, exception). Compared to those 384 bytes max, epoll is sort of a pig. Each file descriptor is represented by a multi-byte structure. However, in absolute terms, it's still not going to use much memory. You can represent a huge number of file descriptors in a few dozen kilobytes (roughly 20k per 1000 file descriptors, I think). And you can also throw in the fact that you have to spend all 384 of those bytes with select if you only want to monitor one file descriptor but its value happens to be 1024, wheras with epoll you'd only spend 20 bytes. Still, all these numbers are pretty small, so it doesn't make much difference.

And there's also that other benefit of epoll, which perhaps you're already aware of, that it is not limited to FD_SETSIZE file descriptors. You can use it to monitor as many file descriptors as you have. And if you only have one file descriptor, but its value is greater than FD_SETSIZE, epoll works with that too, but select does not.

Randomly, I've also recently discovered one slight drawback to epoll as compared to select or poll. While none of these three APIs supports normal files (i.e., files on a file system), select and poll present this lack of support as reporting such descriptors as always readable and always writable. This makes them unsuitable for any meaningful kind of non-blocking filesystem I/O, a program which uses select or poll and happens to encounter a file descriptor from the filesystem will at least continue to operate (or if it fails, it won't be because of select or poll), albeit it perhaps not with the best performance.

On the other hand, epoll will fail fast with an error (EPERM, apparently) when asked to monitor such a file descriptor. Strictly speaking, this is hardly incorrect. It's merely signalling its lack of support in an explicit way. Normally I would applaud explicit failure conditions, but this one is undocumented (as far as I can tell) and results in a completely broken application, rather than one which merely operates with potentially degraded performance.

In practice, the only place I've seen this come up is when interacting with stdio. A user might redirect stdin or stdout from/to a normal file. Whereas previously stdin and stdout would have been a pipe -- supported by epoll just fine -- it then becomes a normal file and epoll fails loudly, breaking the application.

Why poll is not replaced with epoll?

Okay, 7 years later I have a more convincing answer based on this article by Evan Klitzke.

Firstly, the reason I asked the question in the first place is the often mentioned performance advantage of epoll compared to poll/select. The word goes that epoll is asymptotically more efficient (O(1)) than poll (O(N)).

What is not as widely known is that only edge-triggered epoll is truly O(1), while level-trggered epoll has same asymptotics of O(N). Indeed, level-triggered flavor has to go over the list of watched fds every time it is called to find ones that potentially has still more data pending. Edge-triggered variety can rely on signals in response to new bytes appearing in an fd.

It would be interesting to find out, how exactly a resumed thread finds out which fd woke it up, but it's certainly possible that this datum is passed through during epoll-triggered wake-up.

Obviously, poll/select cannot use edge-triggered epoll as the semantics are different. As we saw, implementing with level-triggered epoll wouldn't bring asymptotic performance benefits. And possibly, also negatively affect it if constant factors or constant terms are high (as they seem to be based on coarse benchmark that I did and quoted in another comment).

For more information, please read Blocking I/O, Nonblocking I/O, And Epoll.

Why is poll() better than select()?

Active FD means an open file descriptor.

Both select() and poll() are for single-threaded single-process programs to allow them to handle multiple connections at the same time. For instance OpenWRT's uhttpd web-server is like that.

select() and poll() are available on all Unices.
The better scaling O(1) versions are epoll on Linux and kqueue on BSDs. Less portable, though. But you can install libkqueue0 on Debian Linux.

Many programs use other approaches. For instance sshd, the SSH daemon, spawns a child process for each connection. Others handle each connection in a thread.

Shall we use poll() or select()?

All remotely modern systems have poll, and it's a greatly superior interface to select/pselect in almost all ways:

poll allows more fine-grained detection of status than select.
poll does not have limits on the max file descriptor you can use (and more importantly, does not have critical vulnerabilities when you fail to check for file descriptors past the FD_SETSIZE limit).

The only disadvantages I can think of to using poll are that:

unlike pselect, poll cannot atomically unmask/mask signals, so you can't use it for waiting for a set of events that includes both file descriptor activity and signals unless you resort to the self-pipe trick.
poll only has millisecond resolution for the wait timeout, rather than microsecond (select) or nanosecond (pselect).

Certainly portability of poll is not a consideration anymore. Any system old enough to lack poll is full of so many vulnerabilities it should not be connected to a network.

In summary, unless you have very special needs (tiny timeout intervals, nasty signal interactions, scaling to millions of persistent connections, etc.) I would simply use poll and be done with it. As others have mentioned, libevent is also an option, but it's not clean/safe code (its use of select actually invokes dangerous UB trying to workaround the limitations of select!) and I find code that uses libevent is generally a lot more unnecessarily complicated than code that simply uses poll directly.