Poorly-balanced socket accepts with Linux 3.2 kernel vs 2.6 kernel

Don't depend on the OS's socket multiple accept to balance load across web server processes.

The Linux kernel's behavior here differs from version to version. We saw particularly imbalanced behavior with the 3.2 kernel, which appeared to be somewhat more balanced in later versions, e.g. 3.6.

We were operating under the assumption that there should be a way to make Linux do something like round-robin here, but we ran into a variety of issues, including:

  • Linux kernel 2.6 showed something like round-robin behavior on bare metal (imbalances were about 3-to-1), Linux kernel 3.2 did not (10-to-1 imbalances), and kernel 3.6.10 seemed okay again. We did not attempt to bisect to the actual change.
  • Regardless of the kernel version or build options used, the behavior we saw on a 32-logical-core HVM instance on Amazon Web Services was severely weighted toward a single process; there may be issues with Xen socket accept: https://serverfault.com/questions/272483/why-is-tcp-accept-performance-so-bad-under-xen

You can see our tests in great detail on the GitHub issue we were using to correspond with the excellent Node.js team, starting about here: https://github.com/joyent/node/issues/3241#issuecomment-11145233

That conversation ends with the Node.js team indicating that they are seriously considering implementing explicit round-robin in Cluster (they opened an issue for that: https://github.com/joyent/node/issues/4435), and with the Trello team (that's us) going to our fallback plan: a local HAProxy process on each server machine proxying across 16 ports, with a 2-worker-process Cluster instance running on each port (for fast failover at the accept level in case of a process crash or hang). That plan is working beautifully, with greatly reduced variation in request latency and a lower average latency as well.
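For reference, the shape of that fallback setup is roughly the HAProxy config sketched below. This is only an illustrative sketch with assumed names and ports, not our actual configuration; in practice there are 16 server lines, one per local port.

listen node_workers
    bind *:80
    mode http
    balance roundrobin
    server worker01 127.0.0.1:3001 check
    server worker02 127.0.0.1:3002 check
    # ...continue through worker16 on 127.0.0.1:3016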

There is a lot more to be said here, but I did NOT take the step of mailing the Linux kernel mailing list, as it was unclear whether this was really a Xen issue, a Linux kernel issue, or just an incorrect expectation of multiple-accept behavior on our part.

I'd love to see an answer from an expert on multiple accept, but we're going back to what we can build using components that we understand better. If anyone posts a better answer, I would be delighted to accept it instead of mine.

Context switches much slower in new Linux kernels

The solution to the bad thread wake-up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which occur almost constantly unless an all-core-consuming synthetic benchmark (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver
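On a recent Intel machine running a stock kernel this typically prints:

intel_idle

whereas older kernels (or non-Intel hardware) usually report acpi_idle, and after applying the boot options below it may report none (more on that in the details further down).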

If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power saving features:

On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll
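As a concrete example, assuming the entry previously contained only the stock "quiet splash" options, the edited line in /etc/default/grub would look like the following (run update-grub and reboot afterwards):

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll"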

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0 is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert you from intel_idle to acpi_idle with just the first boot option, please let me know in the comments whether the second option, processor.max_cstate, did what it was documented to do, so that I can update this answer.

Finally, the last of the three parameters, idle=poll, is a real power hog. It will disable C1/C1E, which removes the final remaining bit of latency at the expense of a lot more power consumption, so use it only when it's really necessary. For most applications this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency-sensitive applications (e.g., financial trading, high-precision telemetry/tracking, high-frequency data acquisition, etc.), but may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance versus the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait if supported by your hardware is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and with hyperthreading (for which MWAIT is the ideal mechanism: it avoids consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, and I will continue to test it.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux Kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said test/profile and see if it meets your latency needs...

and

idle=poll - The highest performance option, at the expense of power and heat.

How does the operating system load balance between multiple processes accepting the same socket?

"Load balancing" is perhaps a bit poor choice of words, essentially it's just a question of how does the OS choose which process to wake up and/or run next. Generally, the process scheduler tries to choose the process to run based on criteria like giving an equal share of cpu time to processes of equal priority, cpu/memory locality (don't bounce processes around the cpu's), etc. Anyway, by googling you'll find plenty of stuff to read about process scheduling algorithms and implementations.

Now, for the particular case of accept(), that also depends on how the OS implements waking up processes that are waiting on accept().

  • A simple implementation is to just wake up every process blocked on the accept() call, then let the scheduler choose the order in which they get to run.

  • The above is simple but leads to a "thundering herd" problem: only the first process succeeds in accepting the connection, and the others go back to blocking. A more sophisticated approach is for the OS to wake up only one process; here the choice of which process to wake up can be made by asking the scheduler, or e.g. just by picking the first process in the blocked-on-accept()-for-this-socket list. The latter is what Linux has done for a decade or more, based on the link already posted by others.

  • Note that this only works for blocking accept(); for non-blocking accept() (which I'm sure is what node.js is doing) the issue becomes which process blocking in select()/poll()/whatever to deliver the event to. The semantics of poll()/select() actually demand that all of them be woken up, so you have the thundering herd issue there again. For Linux, and probably in similar ways for other systems with system-specific high-performance polling interfaces, it's possible to avoid the thundering herd by using a single shared epoll fd and edge-triggered events. In that case the event will be delivered to only one of the processes blocked on epoll_wait(). I think that, similar to blocking accept(), the choice of which process to deliver the event to is just to pick the first one in the list of processes blocked on epoll_wait() for that particular epoll fd. (A minimal sketch of this setup follows below.)

So at least for Linux, both for the blocking accept() and the non-blocking accept() with edge triggered epoll, there is no scheduling per se when choosing which process to wake. But OTOH, the workload will probably be quite evenly balanced between the processes anyway, as essentially the system will round-robin the processes in the order in which they finish their current work and go back to blocking on epoll_wait().
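Here is a minimal C sketch of that shared-epoll-fd, edge-triggered approach (an illustration of the mechanism, not node.js's actual implementation; the port number and worker count are arbitrary). One listening socket and one epoll fd are created before fork(), all workers block in epoll_wait() on the same epoll instance, and each incoming connection wakes only one of them:

#define _GNU_SOURCE            /* for accept4() and SOCK_NONBLOCK */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    /* Non-blocking listening socket on an arbitrary port. */
    int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 128);

    /* One epoll instance, created before fork(), so all workers share it.
       EPOLLET (edge-triggered) means a new connection wakes only one of the
       processes blocked in epoll_wait(), avoiding the thundering herd. */
    int epfd = epoll_create1(0);
    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = lfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    for (int i = 0; i < 4; i++) {            /* 4 workers, arbitrary */
        if (fork() == 0) {
            for (;;) {
                struct epoll_event out;
                if (epoll_wait(epfd, &out, 1, -1) <= 0)
                    continue;
                /* Edge-triggered: drain every pending connection before
                   going back to sleep, or we may miss notifications. */
                for (;;) {
                    int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
                    if (cfd < 0)
                        break;               /* EAGAIN: nothing left */
                    dprintf(cfd, "handled by pid %d\n", (int)getpid());
                    close(cfd);
                }
            }
        }
    }
    for (;;)
        pause();                             /* parent just waits */
}

If you drop EPOLLET and use the default level-triggered mode, the semantics are closer to poll()/select() and all waiters may be woken, which is the thundering-herd case described in the list above.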

Compiling Kernel on Debian Wheezy

The -j4 argument to make-kpkg (which gets passed to the underlying make) only sets the number of parallel compilation processes during the kernel build (and has no influence on the produced kernel packages). And it does not matter that much (so -j4 vs. -j8 won't make a very big difference in terms of build time).

I often pass only -j3 to leave a core available to other processes (e.g. my web surfing or my email reading during the kernel compilation).

Also, some parts of make-kpkg are intrinsically serial and cannot be parallelized (some tar invocations, for instance).

And you could even remove the -j4 (same as -j1): kernel build time will increase, but your machine will be more responsive during it.

PS: you don't need both sudo and fakeroot if the parent directory (..), which will contain the produced .deb packages, is user-writable. BTW, you could also edit your /etc/kernel-package.conf.
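For completeness, a typical invocation from inside the kernel source tree might look like the line below (assuming your make-kpkg accepts -j as in the question; the --revision string is just an example):

fakeroot make-kpkg -j3 --initrd --revision=1.0.custom kernel_image kernel_headers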

gRPC (C-based) polling engine is built with 'epollex' despite being under Linux kernel version 4.5


TL;DR

  • RHEL7/CentOS7's kernel 3.10.x may have EPOLLEXCLUSIVE.
  • epollex engine does NOT exist in gRPC source anymore.

Details

CentOS/RHEL seems to have EPOLLEXCLUSIVE backported into its 3.10.x kernel, starting with release 7.3.

  • https://bugzilla.redhat.com/show_bug.cgi?id=1426133

gRPC has kernel-feature availability-check code which actually tries an epoll system call with the EPOLLEXCLUSIVE flag set. It does not depend on the actual version of the Linux kernel.

https://github.com/grpc/grpc/blob/77e2827f3d70650182474624b4de22e053ac01f6/src/core/lib/iomgr/is_epollexclusive_available.cc#L63-L95

/* This polling engine is only relevant on linux kernels supporting epoll() */
bool grpc_is_epollexclusive_available(void) {

  ...

  struct epoll_event ev;
  /* choose events that should cause an error on
     EPOLLEXCLUSIVE enabled kernels - specifically the combination of
     EPOLLONESHOT and EPOLLEXCLUSIVE */
  ev.events =
      static_cast<uint32_t>(EPOLLET | EPOLLIN | EPOLLEXCLUSIVE | EPOLLONESHOT);
  ev.data.ptr = nullptr;
  if (epoll_ctl(fd, EPOLL_CTL_ADD, evfd, &ev) != 0) {
    if (errno != EINVAL) {
      if (!logged_why_not) {
        gpr_log(
            GPR_ERROR,
            "epoll_ctl with EPOLLEXCLUSIVE | EPOLLONESHOT failed with error: "
            "%d. Not using epollex polling engine.",
            errno);
        logged_why_not = true;
      }
      close(fd);
      close(evfd);
      return false;
    }

  ...


BTW, the epollex polling engine has now been removed from the gRPC repository for some unknown reason:

  • https://github.com/grpc/grpc/pull/29160
  • https://github.com/grpc/grpc/issues/30328#issuecomment-1189477119
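Incidentally, if you want to override which polling engine the gRPC C core picks at runtime, it reads the GRPC_POLL_STRATEGY environment variable (documented in gRPC's doc/environment_variables.md); the binary name below is just a placeholder:

GRPC_POLL_STRATEGY=epoll1 ./your_grpc_server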

If my Node process creates child processes, will those child processes be able to take advantage of other processor cores?

Linux answer only; short answer: YES.

Long answer: on checking the Node.js docs (last page):

This is a special case of the spawn() functionality for spawning Node
processes. In addition to having all the methods in a normal
ChildProcess instance, the returned object has a communication channel
built-in. See child.send(message, [sendHandle]) for details.

These child Nodes are still whole new instances of V8. Assume at least
30ms startup and 10mb memory for each new Node. That is, you cannot
create many thousands of them.

Since each child is a whole new V8 instance (a separate OS process), the kernel will schedule them across cores and multitask the job. Linux has been doing multi-core multitasking for a long time (since kernel 2.4). Hence, you want to spawn only as many child processes as there are CPUs. Use require('os').cpus().length to determine the CPU count.

Mac is Unix-based (via NeXT), so I'd assume that the kernel behaves the same way.


