How to Do Runtime Binding Based on CPU Capabilities on Linux

Edit: I found out later on that the technique described below will only work under limited circumstances. Specifically, your shared libraries must contain functions only, without any global variables. If there are globals inside the libraries that you want to dispatch to, then you will end up with a runtime dynamic linker error. This occurs because global variables are relocated before shared library constructors are invoked. Thus, the linker needs to resolve those references early, before the dispatching scheme described here has a chance to run.


One way of accomplishing what you want is to (ab)use the DT_SONAME field in your shared library's ELF header. This can be used to alter the name of the file that the dynamic loader (ld-linux.so*) loads at runtime in order to resolve the shared library dependency. This is best explained with an example. Say I compile a shared library libtest.so with the following command line:

g++ test.cc -shared -o libtest.so -Wl,-soname,libtest_dispatch.so

This will create a shared library whose filename is libtest.so, but its DT_SONAME field is set to libtest_dispatch.so. Let's see what happens when we link a program against it:

g++ testprog.cc -o test -ltest

Let's examine the runtime library dependencies for the resulting application binary test:

> ldd test
linux-vdso.so.1 => (0x00007fffcc5fe000)
libtest_dispatch.so => not found
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fd1e4a55000)
/lib64/ld-linux-x86-64.so.2 (0x00007fd1e4e4f000)

Note that instead of looking for libtest.so, the dynamic loader wants to load libtest_dispatch.so. You can exploit this to implement the dispatching functionality that you want. Here's how I would do it:

  • Create the various versions of your shared library. I assume that there is some "generic" version that can always be used, with other optimized versions utilized at runtime as appropriate. I would name the generic version with the "plain" library name libtest.so, and name the others however you choose (e.g. libtest_sse2.so, libtest_avx.so, etc.).

  • When linking the generic version of the library, override its DT_SONAME to something else, like libtest_dispatch.so.

  • Create a dispatcher library called libtest_dispatch.so. When the dispatcher is loaded at application startup, it is responsible for loading the appropriate implementation of the library. Here's a sketch of what the implementation of libtest_dispatch.so might look like (using GCC's __builtin_cpu_supports() to probe CPU features at runtime):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    // The constructor attribute ensures that this function is called
    // when the dispatcher library is loaded, before main() runs.
    __attribute__((constructor)) static void init(void)
    {
        // Manually load the appropriate shared library based upon what
        // the CPU supports at runtime.
        void *handle;
        if (__builtin_cpu_supports("avx"))
            handle = dlopen("libtest_avx.so", RTLD_NOW | RTLD_GLOBAL);
        else if (__builtin_cpu_supports("sse2"))
            handle = dlopen("libtest_sse2.so", RTLD_NOW | RTLD_GLOBAL);
        else
            handle = dlopen("libtest.so", RTLD_NOW | RTLD_GLOBAL);
        // Fail loudly if no implementation could be loaded.
        if (!handle) {
            fprintf(stderr, "dispatch failed: %s\n", dlerror());
            abort();
        }
    }
  • When linking an application against your library, link it against the "vanilla" libtest.so, the one that has its DT_SONAME overridden to point to the dispatcher library. This makes the dispatching essentially transparent to any application authors that use your library.

This should work as described above on Linux. On Mac OS, shared libraries have an "install name" that is analogous to the DT_SONAME used in ELF shared libraries, so a process very similar to the above could be used instead. I'm not sure about whether something similar could be used on Windows.

Note: There is one important assumption made in the above: ABI compatibility between the various implementations of the library. That is, your library should be designed such that it is safe to link against the most generic version at link time while using an optimized version (e.g. libtest_avx.so) at runtime.

Dynamic library timing and CPU load analysis in Linux

Static linking requires more time and I/O at link time because all the binding occurs during linking. The result is an executable file that needs no further processing before it can call the library code.

Dynamic loading requires more work at runtime: the loader has to look up the .so file, open it, and bind referenced addresses, all before the first call into the library. What you are measuring is expected and normal.

How to bind a process to a set of CPUs in Go?

taskset: To make a process run on specific CPUs, you can use the taskset command on Linux. You can build your logic around "taskset -p [mask] [pid]", where the mask represents the cores on which the particular process should run, provided the whole program runs with GOMAXPROCS=1.
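For example, pinning an already-running process (a throwaway background `sleep` stands in for the real process here) to CPU 0 might look like:

```shell
# taskset is part of util-linux.
sleep 30 &
pid=$!
taskset -p 0x1 "$pid"   # mask 0x1 = CPU 0 only
taskset -p "$pid"       # read the mask back to confirm
kill "$pid"
```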

pthread_setaffinity_np: You can use cgo and build your logic around pthread_setaffinity_np, since Go uses pthreads in cgo mode. (The pthread_setaffinity_np() function sets the CPU affinity mask of the thread referred to by thread to the value specified in cpuset.)

Go also incorporates affinity control via SchedSetaffinity (in the golang.org/x/sys/unix package), which can be used to confine a thread to specific cores. SchedSetaffinity(pid int, set *CPUSet) sets the CPU affinity mask of the thread specified by pid; if pid is 0, the calling thread is used.

It should be noted that the GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. If it is > 1, you can use Go's runtime.LockOSThread, which pins the current goroutine to the thread it is running on. The calling goroutine will always execute in that thread, and no other goroutine will execute in it, until the calling goroutine has made as many calls to UnlockOSThread as to LockOSThread.

cgroups: There is also the option of using cgroups, which organize processes hierarchically and distribute system resources along the hierarchy in a controlled and configurable manner. There is a controller termed cpuset that enables assigning individual CPUs (on a multicore system) and memory nodes to the processes in a cgroup. The cpuset.cpus file lists the CPUs to be used by tasks within the cgroup, as comma-separated numbers or ranges. For example:

# cat cpuset.cpus
0-4,6,8-10

A process is confined to run only on the CPUs in the cpuset it belongs to, and to allocate memory only on the memory nodes in that cpuset. It should be noted that each process is put in the cgroup that its parent process belongs to at the time of creation, and a process can be migrated to another cgroup. Migration of a process doesn't affect already existing descendant processes.

Can I make shared library constructors execute before relocations?


Failure. If I remove all instances of the global variable bar and try to dispatch the foo() function only, then it all works.

The reason this works without global variables is that functions (by default) use lazy binding, but variables cannot (for obvious reasons).

You would get the exact same failure without any global variables if your test program is linked with -Wl,-z,now (which would disable lazy binding of functions).

You could fix this by introducing an instance of every global variable referenced by your main program into the dispatch library.

Contrary to what your other answer suggests, this is not the standard way to do CPU-specific dispatch.

There are two standard ways.

The older one: use $PLATFORM as part of DT_RPATH or DT_RUNPATH. The kernel will pass in a string, such as x86_64, or i386, or i686 as part of the aux vector, and ld.so will replace $PLATFORM with that string.

This allowed distributions to ship both i386 and i686-optimized libraries, and have a program select appropriate version depending on which CPU it was running on.

Needless to say, this isn't very flexible, and (as far as I understand) doesn't allow you to distinguish between various x86_64 variants.

The new hotness is IFUNC dispatch, documented here. This is what GLIBC currently uses to provide different versions of e.g. memcpy depending on which CPU it is running on. There is also target and target_clones attribute (documented on the same page) that allows you to compile several variants of the routine, optimized for different processors (in case you don't want to code them in assembly).

I'm trying to apply this functionality to an existing, very large library, so just a recompile is the most straightforward way of implementing it.

In that case, you may have to wrap the binary in a shell script, and set LD_LIBRARY_PATH to different directories depending on the CPU. Or have the user source your script before running the program.

target_clones does look interesting; is that a recent addition to gcc?

I believe the IFUNC support is about 4-5 years old, the automatic cloning in GCC is about 2 years old. So yes, quite recent.

How can one achieve late binding in the C language?

Late binding is not really a function of the C language itself, more something that your execution environment provides for you.

Many systems will provide deferred binding as a feature of the linker/loader and you can also use explicit calls such as dlopen (to open a shared library) and dlsym (to get the address of a symbol within that library so you can access it or call it).

The only semi-portable way of getting late binding with the C standard would be to use some trickery with system() and even that is at least partially implementation-specific.

If you're not so much talking about deferred binding but instead polymorphism, you can achieve that effect with function pointers. Basically, you create a struct which has all the data for a type along with function pointers for locating the methods for that type. Then, in the "constructor" (typically an init() function), you set the function pointers to the relevant functions for that type.

You still need to include all the code even if you don't use it but it is possible to get polymorphism that way.

Is there a way to dynamically change CPU count for a docker container at runtime?

Run this command (you will have to provide your container id, of course):

docker update --cpuset-cpus="0" <container-id>

That will update it at runtime! There is a lot of old, out-of-date information on the internet which says you can't do this. It might only work on Linux Docker hosts, though.

Is there a way for non-root processes to bind to privileged ports on Linux?

Okay, thanks to the people who pointed out the capabilities system and CAP_NET_BIND_SERVICE capability. If you have a recent kernel, it is indeed possible to use this to start a service as non-root but bind low ports. The short answer is that you do:

setcap 'cap_net_bind_service=+ep' /path/to/program

And then any time program is executed thereafter it will have the CAP_NET_BIND_SERVICE capability. setcap is in the Debian package libcap2-bin.

Now for the caveats:

  1. You will need at least a 2.6.24 kernel
  2. This won't work if your file is a script (i.e. one that uses a #! line to launch an interpreter). In this case, as far as I understand, you'd have to apply the capability to the interpreter executable itself, which of course is a security nightmare, since any program using that interpreter will have the capability. I wasn't able to find any clean, easy way to work around this problem.
  3. Linux disables LD_LIBRARY_PATH for any program that has elevated privileges, whether via file capabilities or setuid. So if your program uses its own .../lib/, you might have to look into another option like port forwarding.

Resources:

  • capabilities(7) man page. Read this long and hard if you're going to use capabilities in a production environment. There are some really tricky details of how capabilities are inherited across exec() calls that are detailed here.
  • setcap man page
  • "Bind ports below 1024 without root on GNU/Linux": The document that first pointed me towards setcap.

Note: RHEL first added this in v6.

Isolate Kernel Module to a Specific Core Using Cpuset


So I want the module to get executed in an isolated core.

and

actually isolate a specific core in our system and execute just one
specific process to that core

This is working source code, compiled and tested on a Debian box running kernel 3.16. I'll first describe how to load and unload the module and what the parameter passed to it means.

All sources can be found on github here...

https://github.com/harryjackson/doc/tree/master/linux/kernel/toy/toy

Build and load the module...

make
insmod toy.ko param_cpu_id=2

To unload the module use

rmmod toy

I'm not using modprobe because it expects the module to be installed and configured first. The parameter we're passing to the toy kernel module is the CPU we want to isolate. Each device operation that gets called checks whether it is executing on that CPU and logs the result.

Once the module is loaded you can find it here

/dev/toy

Simple operations like

cat /dev/toy

create events that the kernel module catches and produces some output. You can see the output using dmesg.

Source code...

#include <linux/module.h>
#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/smp.h>      /* get_cpu(), put_cpu(), smp_processor_id() */
#include <linux/cpumask.h>  /* nr_cpu_ids */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Harry");
MODULE_DESCRIPTION("toy kernel module");
MODULE_VERSION("0.1");

#define DEVICE_NAME "toy"
#define CLASS_NAME  "toy"

static int param_cpu_id;
module_param(param_cpu_id, int, (S_IRUSR | S_IRGRP | S_IROTH));
MODULE_PARM_DESC(param_cpu_id, "CPU ID that operations run on");

static int toy_open(struct inode *inodep, struct file *fp);
static ssize_t toy_read(struct file *fp, char *buffer, size_t len, loff_t *offset);
static ssize_t toy_write(struct file *fp, const char *buffer, size_t len, loff_t *offset);
static int toy_release(struct inode *inodep, struct file *fp);

static struct file_operations toy_fops = {
    .owner   = THIS_MODULE,
    .open    = toy_open,
    .read    = toy_read,
    .write   = toy_write,
    .release = toy_release,
};

static struct miscdevice toy_device = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "toy",
    .fops  = &toy_fops
};

static int toy_open(struct inode *inodep, struct file *filep)
{
    int this_cpu = get_cpu();  /* disables preemption until put_cpu() */
    printk(KERN_INFO "open: called on CPU:%d\n", this_cpu);
    if (this_cpu == param_cpu_id)
        printk(KERN_INFO "open: is on requested CPU: %d\n", smp_processor_id());
    else
        printk(KERN_INFO "open: not on requested CPU:%d\n", smp_processor_id());
    put_cpu();
    return 0;
}

static ssize_t toy_read(struct file *filep, char *buffer, size_t len, loff_t *offset)
{
    int this_cpu = get_cpu();
    printk(KERN_INFO "read: called on CPU:%d\n", this_cpu);
    if (this_cpu == param_cpu_id)
        printk(KERN_INFO "read: is on requested CPU: %d\n", smp_processor_id());
    else
        printk(KERN_INFO "read: not on requested CPU:%d\n", smp_processor_id());
    put_cpu();
    return 0;  /* EOF */
}

static ssize_t toy_write(struct file *filep, const char *buffer, size_t len, loff_t *offset)
{
    int this_cpu = get_cpu();
    printk(KERN_INFO "write: called on CPU:%d\n", this_cpu);
    if (this_cpu == param_cpu_id)
        printk(KERN_INFO "write: is on requested CPU: %d\n", smp_processor_id());
    else
        printk(KERN_INFO "write: not on requested CPU:%d\n", smp_processor_id());
    put_cpu();
    return 0;
}

static int toy_release(struct inode *inodep, struct file *filep)
{
    int this_cpu = get_cpu();
    printk(KERN_INFO "release: called on CPU:%d\n", this_cpu);
    if (this_cpu == param_cpu_id)
        printk(KERN_INFO "release: is on requested CPU: %d\n", smp_processor_id());
    else
        printk(KERN_INFO "release: not on requested CPU:%d\n", smp_processor_id());
    put_cpu();
    return 0;
}

static int __init toy_init(void)
{
    int cpu_id, ret;

    if (param_cpu_id < 0 || param_cpu_id >= nr_cpu_ids) {
        printk(KERN_INFO "toy: unable to load module without a valid cpu parameter\n");
        return -EINVAL;
    }
    printk(KERN_INFO "toy: loading device driver, param_cpu_id: %d\n", param_cpu_id);
    cpu_id = get_cpu();  /* disables preemption, see notes below */
    printk(KERN_INFO "toy init called and running on CPU: %d\n", cpu_id);
    put_cpu();
    /* misc_register() may sleep, so call it only after put_cpu() */
    ret = misc_register(&toy_device);
    if (ret)
        return ret;
    /* smp_call_function_single(1, foo, (void *)(uintptr_t)1, 1); */
    return 0;
}

static void __exit toy_exit(void)
{
    misc_deregister(&toy_device);
    printk(KERN_INFO "toy exit called\n");
}

module_init(toy_init);
module_exit(toy_exit);

The code above contains the two things you asked for, i.e. isolating work to a specific CPU and, on init, checking which core we are running on.

On init, get_cpu disables preemption, i.e. anything that comes after it will not be preempted by the kernel and will run on one core. Note, this was done on kernel 3.16; your mileage may vary depending on your kernel version, but I think these APIs have been around a long time.

This is the Makefile...

obj-m += toy.o

all:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Notes. get_cpu is declared in linux/smp.h as

#define get_cpu()   ({ preempt_disable(); smp_processor_id(); })
#define put_cpu() preempt_enable()

so you don't actually need to call preempt_disable before calling get_cpu.
The get_cpu call is a wrapper around the following sequence of calls...

preempt_count_inc();
barrier();

and put_cpu is really doing this...

barrier();
if (unlikely(preempt_count_dec_and_test())) {
__preempt_schedule();
}

You can get as fancy as you like using the above. Almost all of this was taken from the following sources..

Google for... smp_call_function_single

Linux Kernel Development, book by Robert Love.

http://derekmolloy.ie/writing-a-linux-kernel-module-part-2-a-character-device/

https://github.com/vsinitsyn/reverse/blob/master/reverse.c


