How to Determine Thread Local Storage Model Used by a Library on Linux

Is there a way to determine the thread local storage model used by a library on Linux?

I ran into this error myself, and while investigating it, I came across a mailing list post with this information:

If you link a shared object containing IE-model access relocs, the object
will have the DF_STATIC_TLS flag set. By the spec, this means that dlopen
might refuse to load it.

Looking at /usr/include/elf.h, we have:

/* Values of `d_un.d_val' in the DT_FLAGS entry.  */
...
#define DF_STATIC_TLS   0x00000010      /* Module uses the static TLS model */

So you need to test if DF_STATIC_TLS is set in the DT_FLAGS entry of the shared library.

To test things, I created a simple piece of code using thread local storage:

static __thread int foo;
void set_foo(int new) {
    foo = new;
}

I then compiled it twice with the two different thread local storage models:

gcc -ftls-model=initial-exec -fPIC -c tls.c  -o tls-initial-exec.o
gcc -shared tls-initial-exec.o -o tls-initial-exec.so

gcc -ftls-model=global-dynamic -fPIC -c tls.c  -o tls-global-dynamic.o
gcc -shared tls-global-dynamic.o -o tls-global-dynamic.so

And sure enough, I can see a difference between the two libraries using readelf:

$ readelf --dynamic tls-initial-exec.so

Dynamic section at offset 0xe00 contains 25 entries:
  Tag        Type                         Name/Value
...
 0x000000000000001e (FLAGS)              STATIC_TLS

The tls-global-dynamic.so version did not have a DT_FLAGS entry, presumably because it didn't have any flags set. So it should be fairly easy to create a script using readelf and grep to find affected libraries.
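As an alternative to a readelf and grep script, the same check can be sketched directly in C by walking the dynamic section of the file. This is only a minimal sketch under stated assumptions: it handles 64-bit ELF files in the host's byte order only, and the function name has_static_tls is hypothetical (it is not part of any library).

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if DF_STATIC_TLS is set in DT_FLAGS,
   0 if it is not, and -1 on error or unsupported input.
   Assumption: 64-bit ELF file in the host's byte order. */
int has_static_tls(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;

    int result = -1;
    Elf64_Ehdr eh;
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64)
        goto out;

    for (unsigned i = 0; i < eh.e_phnum; i++) {
        Elf64_Phdr ph;
        long pos = (long)(eh.e_phoff + (Elf64_Off)i * eh.e_phentsize);
        if (fseek(f, pos, SEEK_SET) != 0 || fread(&ph, sizeof ph, 1, f) != 1)
            goto out;
        if (ph.p_type != PT_DYNAMIC)
            continue;

        /* Found the dynamic section: scan its entries for DT_FLAGS. */
        if (fseek(f, (long)ph.p_offset, SEEK_SET) != 0)
            goto out;
        result = 0;
        for (Elf64_Xword n = ph.p_filesz / sizeof(Elf64_Dyn); n > 0; n--) {
            Elf64_Dyn dyn;
            if (fread(&dyn, sizeof dyn, 1, f) != 1) {
                result = -1;
                goto out;
            }
            if (dyn.d_tag == DT_NULL)
                break;
            if (dyn.d_tag == DT_FLAGS && (dyn.d_un.d_val & DF_STATIC_TLS))
                result = 1;
        }
        goto out;
    }
    result = 0; /* no PT_DYNAMIC segment, hence no DT_FLAGS entry */
out:
    fclose(f);
    return result;
}
```

With the two libraries built above, has_static_tls("tls-initial-exec.so") should report the flag and has_static_tls("tls-global-dynamic.so") should not, matching the readelf output.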

How does thread_local! work with dynamic libraries in rust?

This behavior occurs because the shared library contains its own copy of the code of the crates it depends on, resulting in two different thread local declarations.

The solution is to pass a reference to the thread local in question instead of accessing it directly. For more information on how to obtain a reference to a thread local, see: How to create a thread local variable inside of a Rust struct?

How fast is thread local variable access on Linux

How fast is accessing a thread local variable on Linux?

It depends on a lot of things.

Some processors (i*86) have a special segment register that Linux uses for this (gs in 32-bit mode, fs in x86_64 mode). Other processors do not (but they usually have a register reserved for accessing the current thread, and TLS is easy to find using that dedicated register).

On i*86, using that segment register, the access is almost as fast as direct memory access.

I keep on reading horror stories about the slowness of thread local variable access

It would have helped if you provided links to some such horror stories. Without the links, it's impossible to tell whether their authors know what they are talking about.
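Such claims can also be checked on your own machine. The following is a rough sketch of a micro-benchmark, not a rigorous measurement: the function names are hypothetical, clock() measures CPU time at coarse resolution, and the counters are volatile so that the compiler performs a real load and store on every iteration.

```c
#include <stdint.h>
#include <time.h>

/* volatile forces an actual memory access on each iteration */
static __thread volatile uint64_t tls_counter;
static volatile uint64_t global_counter;

/* CPU seconds spent incrementing the thread-local counter n times. */
double time_tls_increments(uint64_t n)
{
    clock_t start = clock();
    tls_counter = 0;
    for (uint64_t i = 0; i < n; i++)
        tls_counter++;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* CPU seconds spent incrementing an ordinary global counter n times. */
double time_global_increments(uint64_t n)
{
    clock_t start = clock();
    global_counter = 0;
    for (uint64_t i = 0; i < n; i++)
        global_counter++;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling both functions with a large n and comparing the results gives a first impression. The outcome depends on the TLS model in use: code built with -fPIC into a shared library may use the global-dynamic model, which can involve a call into the runtime instead of a single segment-relative access, so it is worth measuring in the configuration you actually ship.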

LD_PRELOAD and thread local variable

This is not possible, since thread-local storage requires per-thread initialisation.

LD_PRELOAD will load the library even before the standard library is loaded, which messes up TLS initialisation.

Update:

Please read sections 2 and 3 of ELF Handling For Thread-Local Storage

How does the gcc `__thread` work?

Recent GCC versions (e.g. GCC 5) support C11 and its thread_local (when compiling with e.g. gcc -std=c11). As FUZxxl commented, you could instead use the __thread qualifier supported by older GCC versions. Read about Thread Local Storage.
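To make the per-thread semantics concrete, here is a minimal sketch using _Thread_local, the C11 keyword behind thread_local, together with POSIX threads. The names counter, worker and run_two_workers are hypothetical, and on older glibc versions the program may need to be built with gcc -pthread.

```c
#include <pthread.h>
#include <stddef.h>

static _Thread_local int counter;   /* one independent copy per thread */

/* Thread body: bump this thread's copy and report its final value. */
static void *worker(void *arg)
{
    int *out = arg;
    for (int i = 0; i < 5; i++)
        counter++;
    *out = counter;
    return NULL;
}

/* Starts two threads and stores each thread's final counter in results.
   Returns the calling thread's own counter, or -1 if a thread could not
   be created. */
int run_two_workers(int results[2])
{
    pthread_t t[2];
    int created = 0, ok = 1;

    counter = 100;  /* changes only the calling thread's copy */
    for (; created < 2; created++) {
        if (pthread_create(&t[created], NULL, worker, &results[created]) != 0) {
            ok = 0;
            break;
        }
    }
    for (int i = 0; i < created; i++)
        pthread_join(t[i], NULL);
    return ok ? counter : -1;
}
```

Each worker starts from its own zero-initialized copy, so both report 5, while the calling thread still sees the 100 it stored before starting them.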

pthread_getspecific is indeed comparatively slow, since it involves a function call (it belongs to the POSIX threads library, so it is provided not by GCC but by the C library, e.g. GNU glibc or musl-libc). Using thread_local variables will very probably be faster.

Look into the source code of musl's thread/pthread_getspecific.c file for an example implementation. Read this answer to a related question.

And __thread and thread_local are (often) not magically translated to calls to pthread_getspecific. They usually involve some specific addressing mode and/or register, with help from the compiler, the linker and the runtime system (details are implementation specific and related to the ABI; on Linux, I guess that since x86-64 has more registers and addressing modes, its implementation of TLS is faster than on i386). On the contrary, it could happen that some implementations of pthread_getspecific use internal thread_local variables (in your implementation of POSIX threads).

As an example, compiling the following code

#include <pthread.h>

extern const pthread_key_t key;

__thread int data;

int
get_data (void) {
  return data;
}

int
get_by_key (void) {
  return *(int*) (pthread_getspecific (key));
}

using GCC 5.2 (on Debian/Sid) with gcc -m32 -S -O2 -fverbose-asm gives the following code for get_data using TLS:

  .type get_data, @function
get_data:
.LFB3:
  .cfi_startproc
  movl  %gs:data@ntpoff, %eax   # data,
  ret
  .cfi_endproc

and the following code of get_by_key with an explicit call to pthread_getspecific:

get_by_key:
.LFB4:
  .cfi_startproc
  subl  $24, %esp   #,
  .cfi_def_cfa_offset 28
  pushl key # key
  .cfi_def_cfa_offset 32
  call  pthread_getspecific #
  movl  (%eax), %eax    # MEM[(int *)_4], MEM[(int *)_4]
  addl  $28, %esp   #,
  .cfi_def_cfa_offset 4
  ret
  .cfi_endproc

Hence using TLS with __thread (or thread_local in C11) should probably be faster than using pthread_getspecific (avoiding the overhead of a call).
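The C snippet above declares key as extern and never creates it. For comparison, here is a minimal sketch of the setup each approach needs in a complete program. The helper names make_key, set_by_key and set_data are hypothetical, error handling for key creation is omitted, and on older glibc versions the program may need to be built with gcc -pthread.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_key_t key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static __thread int data;

/* Creates the key exactly once; the return value is ignored in this sketch. */
static void make_key(void)
{
    pthread_key_create(&key, NULL);
}

/* POSIX key route: the caller supplies per-thread storage, and the key
   only holds a pointer to it. Returns 0 on success. */
int set_by_key(int *storage, int value)
{
    pthread_once(&key_once, make_key);
    *storage = value;
    return pthread_setspecific(key, storage);
}

/* Returns the calling thread's value, or 0 if none has been set yet. */
int get_by_key(void)
{
    pthread_once(&key_once, make_key);
    int *p = pthread_getspecific(key);
    return p ? *p : 0;
}

/* __thread route: no key, no setup, just a variable. */
void set_data(int value)
{
    data = value;
}

int get_data(void)
{
    return data;
}
```

Besides the call overhead visible in the assembly, the key-based route needs explicit key creation and separately managed storage, whereas the __thread variable is set up by the compiler, linker and runtime.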

Notice that in C11, thread_local is a convenience macro defined in <threads.h> (a C11 standard header) that expands to the _Thread_local keyword.