Does Epoll(), Do Its Job in O(1)

Does epoll(), do its job in O(1)?

This makes sense once you look for ep_find. I only spent a few minutes with it and I see ep_find is only called by epoll_ctl.

So indeed, when you add the descriptors (EPOLL_CTL_ADD) that costly operation is performed. BUT when doing the real work (epoll_wait) it isn't. You only add the descriptors in the beginning.

In conclusion, it's not enough to ask the complexity of epoll, since there is no epoll system call. You want the individual complexities of epoll_ctl, epoll_wait etc.

Other stuff

There are other reasons to avoid select and use epoll. When using select, you don't know how many descriptors need attention. So you must keep track of the biggest and loop to it.

rc = select(...);
/* check rc */
for (s = 0; s <= maxfd; s++) {
    if (FD_ISSET(s)) {
        /* ... */
    }
}

Now with epoll it's a lot cleaner:

nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
/* check nfds */
for (n = 0; n < nfds; ++n) {
    /* events[n].data.fd needs attention */
}

Why exactly does ePoll scale better than Poll?

The poll system call needs to copy your list of file descriptors to the kernel each time. This happens only once with epoll_ctl, but not every time you call epoll_wait.

Also, epoll_wait is O(1) in respect of the number of descriptors watched¹, which means it does not matter whether you wait on one descriptor or on 5,000 or 50,000 descriptors. poll, while being more efficient than select, still has to walk over the list every time (i.e. it is O(N) in respect of number of descriptors).

And lastly, epoll can in addition to the "normal" mode work in "edge triggered" mode, which means the kernel does not need keep track of how much data you've read after you've been signalled readiness. This mode is more difficult to grasp, but somewhat more efficient.

¹As correctly pointed out by David Schwartz, epoll_wait is of course still O(N) in respect of events that occur. There is hardly a way it could be any different, with any interface. If N events happen on a descriptor that is watched, then the application needs to get N notifications, and needs to do N "things" in order to react on what's happening.

This is again slightly, but not fundamentally different in edge triggered mode, where you actually get M events with M <= N. In edge triggered mode, when the same event (say, POLLIN) happens several times, you will probably get fewer notifications, possibly only a single one. However, this doesn't change much about the big-O notation as such.

However, epoll_wait is irrespective of the number of descriptors watched. Under the assumption that it is used in the intended, "normal" way (that is, many descriptors, few events), this is what really matters, and here it is indeed O(1).

As an analogy, you can think of a hash table. A hash table accesses its content in O(1), but one could argue that calculating the hash is actually O(N) in respect of the key length. This is technically absolutely correct, and there probably exist cases where this is a problem, however, for most people, this just doesn't matter.

Does epoll_wait() return events one at a time?

The documentation of epoll_wait() says:

The events field is a bit mask that indicates the events that
have occurred for the corresponding open file description.

The plural "events" implies that multiple events can occur for the same descriptor.

I believe the maxevents argument to epoll_wait() actually specifies the maximum elements in the events array -- if a single FD has multiple events occur, they're all in one element.

Note that if you use level-triggered events (the default), EPOLLOUT will almost always be triggered immediately, since the socket is always ready to write unless you write so much that you fill the kernel's socket buffer.

Why is epoll faster than select?

There's a lot of misinformation about this, but the real reason is this:

A typical server might be dealing with, say, 200 connections. It will service every connection that needs to have data written or read and then it will need to wait until there's more work to do. While it's waiting, it needs to be interrupted if data is received on any of those 200 connections.

With select, the kernel has to add the process to 200 wait lists, one for each connection. To do this, it needs a "thunk" to attach the process to the wait list. When the process finally does wake up, it needs to be removed from all 200 wait lists and all those thunks need to be freed.

By contrast, with epoll, the epoll socket itself has a wait list. The process needs to be put on only that one wait list using only one thunk. When the process wakes up, it needs to be removed from only one wait list and only one thunk needs to be freed.

To be clear, with epoll, the epoll socket itself has to be attached to each of those 200 connections. But this is done once, for each connection, when it is accepted in the first place. And this is torn down once, for each connection, when it is removed. By contrast, each call to select that blocks must add the process to every wait queue for every socket being monitored.

Ironically, with select, the largest cost comes from checking if sockets that have had no activity have had any activity. With epoll, there is no need to check sockets that have had no activity because if they did have activity, they would have informed the epoll socket when that activity happened. In a sense, select polls each socket each time you call select to see if there's any activity while epoll rigs it so that the socket activity itself notifies the process.

epoll does not signal an event when socket is close

There are a few problems with your attempt.

You should not use EPOLLONESHOT unless you know what you are doing and you really need it. It disables the report of any other events to the epoll instance until you enable it again with EPOLL_CTL_MOD.
You should not use EPOLLHUP to determine if a connection was closed. The EPOLLHUP event may be raised before all data is read from the socket input stream, even if the client disconnects gracefully. I recommend you to only use EPOLLIN. If there is no input left (because of forceful or graceful disconnect), the read() would return 0 (EOF) and you can close the socket.
Your read() call will block the reading thread and consume the whole stream until EOF (connection closed). The whole point in using epoll() is to not have to use a while ( read(...) > 0 ) loop.
You should not use EPOLLET because "the code runs multithreaded" but because you need it. You can write multithreaded code without the edge-triggered mode. The use of EPOLLET requires a thorough knowledge of the differences between blocking and non-blocking I/O. You can easily run into pitfalls (as mentioned in the manual), like edge-triggered starvation.

epoll instantly returns on stdin

The timeout value 0 on epoll_wait() means: return immediately and only report currently pending events.

You need to specify timeout value -1, which means, "wait indefinitely for events":

epoll_wait(fd, &event, 1, -1);

Then it should work as expected.

Does Epoll(), Do Its Job in O(1)