Why sizeof(spinlock_t) Is Greater Than Zero on Uni-Processor

The best guess I have is that although you have a single CPU, the kernel is still compiled with CONFIG_SMP set.
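As an illustration only (a hedged sketch, not the actual kernel headers, where spinlock_t is layered through raw_spinlock_t and arch_spinlock_t), the idea is that the lock needs real state only when SMP or lock debugging is configured:

#ifdef CONFIG_SMP
typedef struct {
    volatile unsigned int slock;   /* real lock word, needed on SMP */
} spinlock_t;
#else
typedef struct { } spinlock_t;     /* GCC extension: sizeof() == 0 on pure UP */
#endif

So if sizeof(spinlock_t) > 0 on your single-CPU machine, the kernel was most likely built with CONFIG_SMP (or a debug option such as CONFIG_DEBUG_SPINLOCK) enabled.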

What is the size of an empty struct in C?

A struct cannot be empty in C because the syntax forbids it. Furthermore, there is a semantic constraint that makes behavior undefined if a struct has no named member:

struct-or-union-specifier:
    struct-or-union identifier_opt { struct-declaration-list }
    struct-or-union identifier

struct-or-union:
    struct
    union

struct-declaration-list:
    struct-declaration
    struct-declaration-list struct-declaration

struct-declaration:
    specifier-qualifier-list struct-declarator-list ;

/* type-specifier or qualifier required here! */
specifier-qualifier-list:
    type-specifier specifier-qualifier-list_opt
    type-qualifier specifier-qualifier-list_opt

struct-declarator-list:
    struct-declarator
    struct-declarator-list , struct-declarator

struct-declarator:
    declarator
    declarator_opt : constant-expression

If you write

struct identifier { };

it will get you a diagnostic message, because it violates the syntax rules. If you write

struct identifier { int : 0; };

then you have a (syntactically non-empty) struct with no named members, which makes the behavior undefined without requiring a diagnostic:

If the struct-declaration-list contains no named members, the behavior is undefined.

Notice that the following is disallowed because a flexible array member cannot be the only member:

struct identifier { type ident[]; };
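As a side note, and strictly as an observation about specific compilers rather than the standard: GCC accepts the empty struct as an extension in C and gives it size zero, while C++ requires distinct objects to have distinct addresses, so the same struct has size one there. A small probe:

/* empty_struct.c - probing compiler-specific behavior; ISO C itself
   rejects the empty struct, so results vary by compiler and language. */
#include <stdio.h>

struct empty { };   /* GCC extension in C; a conforming C compiler must diagnose it */

int main(void) {
    printf("sizeof(struct empty) = %zu\n", sizeof(struct empty));  /* 0 with GCC in C mode */
    return 0;
}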

spin_lock on non-preemptive linux kernels

Quoted from «Linux Device Drivers», by Jonathan Corbet, Alessandro Rubini and Greg Kroah-Hartman:

If a nonpreemptive uniprocessor system ever went into a spin on a
lock, it would spin forever; no other thread would ever be able to
obtain the CPU to release the lock (because it couldn't yield).
Because of this, spinlock operations on uniprocessor systems without
preemption enabled are optimized to do nothing, with the exception of
the ones that change the IRQ masking status (in Linux, that would be
spin_lock_irqsave()). Because of preemption, even if you never
expect your code to run on an SMP system, you still need to implement
proper locking.

If you're interested in a spinlock that can be taken by code running in interrupt context (hardware or software), you must use a form of spin_lock_* that disables interrupts. Not doing so will deadlock the system as soon as an interrupt arrives while you are inside your critical section.
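A minimal sketch of that pattern, with hypothetical names (my_lock, update_shared_state); spin_lock_irqsave()/spin_unlock_irqrestore() is the standard kernel pairing for it:

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);

void update_shared_state(void)
{
    unsigned long flags;

    /* Disable local interrupts while holding the lock, so an interrupt
       handler on this CPU can never spin on a lock we already hold. */
    spin_lock_irqsave(&my_lock, flags);
    /* ... critical section shared with interrupt context ... */
    spin_unlock_irqrestore(&my_lock, flags);
}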

false sharing in boost::detail::spinlock_pool?

BRIEF

You are correct that the spinlocks share the same cache line when they could be in different cache lines - a.k.a. false sharing. And there may well be some performance gain to be had by allocating the locks in different cache lines (but see below).

However, this is NOT the same as lock contention. Lock contention arises when a lock is held by one guy, and one or more other guys want to access it.

I.e. spinlock_pool has cache line contention caused by locks co-resident in the same cache line. But it has (reduced) software lock contention.

The reduced software lock contention is almost undoubtedly good.

The cache line contention probably hurts a bit, as your benchmarks show to some degree, but is a second order effect compared to the software lock contention.

DETAIL

Background

First, some background:

Classic spinloops are test-and-test-and-set:

loop:
    tmp := load( memory_location_of_sw_lock )
    if( is_locked(tmp) ) goto loop
    was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
                                        value_that_will_get_sw_lock )
    if( was_locked ) goto loop
got_the_sw_lock:
    ...
    ... critical section
    ...
    // release the lock, e.g.
    memory_location_of_sw_lock := 0

There are also test-and-set spinlocks, which look like

loop:
    was_locked := atomic_hw_locked_rmw( memory_location_of_sw_lock,
                                        value_that_will_get_sw_lock )
    if( was_locked ) goto loop

These can have very bad performance on most modern memory systems with caches, write-through or write-back. (Although some of the hardware optimizations that I have proposed make test-and-set spinloops as fast as test-and-test-and-set spinloops - slightly faster because they are smaller.)
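For concreteness, here is a minimal C11 sketch of both loop shapes (my own illustration, not from Boost or any kernel; spin_lock_tts and spin_unlock are made-up names):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock;

/* test-and-test-and-set: spin on a plain load, which stays in the local
   cache, and attempt the (write-causing) atomic RMW only when the lock
   looks free */
static void spin_lock_tts(spinlock *l) {
    for (;;) {
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;  /* the "test-" part: ordinary reads, no bus writes */
        if (!atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
            return;  /* the "test-and-set" part succeeded */
    }
}

/* a plain test-and-set loop would skip the inner while and hammer the
   atomic exchange directly */

static void spin_unlock(spinlock *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}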

Note that there are two different concepts of lock here: the "software" lock that the spinlock is acquiring, and the hardware lock that is used by the atomic_hw_locked_rmw instruction, such as Intel LOCK INC mem or CMPXCHG. We don't care about the latter much, except to know that it usually unconditionally writes to the cache line holding the software lock, invalidating other copies of the cache line. (Making the write conditional is another possible hardware optimization.)

O(N^2) burst of cache misses on (software) lock contention

Lock contention with test-and-test-and-set spinloops is particularly bad. The waiters all spin on the lock, and when it is released there is a burst of bus accesses. One guy wins, the others realize that they have lost, and eventually they settle down to spin again. This burst of activity is particularly bad because for N waiting guys (threads/processes/processors) the burst of bus activity can be O(N^2) in size, because in the worst case everyone exits the test- part of a test-and-test-and-set spinloop, and everyone tries to do the atomic locked RMW (read-modify-write) instruction, like x86 LOCK INC mem or CMPXCHG, at the same time. This means that everyone will eventually write the line, even though all but the first don't need to write the lock because they won't get the lock.

E.g.

Lock is held by P0.
P1..PN are spinning in test-and-test-and-set spinloops waiting for it.

P0 releases the lock, e.g. by writing it to 0.

P1's "test-" instruction reads the lock
...
PN's "test-" instruction reads the lock

All of P1..PN see the lock as released,
so they fall out of the "test-" part of the spinloop, which uses ordinary instructions,
into the test-and-set part of the spinloop, which uses a hardware atomic RMW like LOCK INC mem.

P1's locked RMW happens, and acquires the spinlock for P1.
It invalidates all other copies of the cache line.
P1 goes away and does something with the data protected by the lock.

P2's locked RMW fails to acquire the spinlock.
It invalidates all other copies of the cache line, because it is a write.
P2 falls back into its "test-" spinloop.

P3's locked RMW fails.
It invalidates all other copies of the cache line, because it is a write.
P3 falls back into its "test-" spinloop.

...

PN's locked RMW fails.
and now, at the very least, all of the remaining P2..PN processors must do ordinary unlocked cache misses for their test- spinloop. This implies at least N+(N-1) cache misses. It can be considerably worse, because it is possible for each write, by a waiter who will fail to acquire the lock, to trigger all of the other waiters to do an unlocked read. I.e., depending on timing, you may get

1 release of the lock
N reads to "test-" the lock
1 locked atomic RMW to acquire the lock
N reads ...
for each of the remaining N-1 processors:
    1 locked RMW that fails to acquire the lock
    N reads
    ...

which is O(N^2). Or possibly, for processor M, 1 locked RMW, followed by 2..M reads -- which is still O(N^2).
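To make the counting concrete, under the worst-case assumption of the trace above (every failed RMW forces every waiter to re-read the line):

    1 (release) + N (test reads) + N * (1 locked RMW + N re-reads)
        = 1 + N + N + N^2
        = (N+1)^2
        = O(N^2)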

What does this mean for this question?

OK, if there is true lock contention, you get this O(N^2) burst of cache misses when a contended lock is released.

However, the spinlock_pool spreads the waiters out across multiple locks. If there are S locks in the spinlock pool, you get far fewer waiters per lock: probably fewer than N/S (because contention tends to be superlinear in the number of people sharing the same lock).

I.e. with Boost's spinlock_pool, you might naively expect to get 1/41st the amount of contention, and probably less than that.

Remember, spinlock_pool is a compromise between having a lock per shared_pointer, which increases the size of every shared pointer, and having all of the shared_pointers share the same lock. So any contention on a shared_pointer's spinlock may be (a) true contention, or (b) false contention caused by independent shared_pointers hashing to the same entry in the spinlock_pool.
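A minimal sketch of the pool idea, modeled on (not copied from) boost::detail::spinlock_pool; the 41-entry size matches the Boost figure used below, and spinlock_for is a made-up name:

#include <stdatomic.h>
#include <stdint.h>

#define POOL_SIZE 41                  /* Boost's pool size */

static atomic_bool pool[POOL_SIZE];   /* one lock word per entry */

/* hash the object's address to pick a lock: independent objects usually
   land on different locks, and a collision is exactly the "false
   contention" case (b) above */
static atomic_bool *spinlock_for(const void *addr) {
    return &pool[(uintptr_t)addr % POOL_SIZE];
}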

Now, yes, if instead of having N waiters you have "only" N/41 waiters, then the burst is still O((N/41)^2), which asymptotically is still O(N^2). But if N is typically less than 41... you get the idea.

Basically, hashing to spread the shared_pointers out over multiple spinlock_pool entries quickly reduces the amount of contention.

But... the spinlocks live in the same cache line? Right... but waiters on the other locks sharing that cache line will not advance to the test-and-set part of their spinloop, because their own lock still looks held.

I.e. if a contended lock with N waiters shares a cache line with M other contended locks, you can get M*N traffic. But if the hashing has reduced M to O(1), you get only N traffic.

And if most of the time there are no other waiters at all, then you get only O(1) traffic on a lock release.

BOTTOM LINE:

Reducing software lock contention is a much bigger performance win than reducing the cache false sharing that causes hardware cache-line contention.

Now, there still may be benefit to be had in not packing so many spinlock_pool entries into the same cache line. But this is by no means obvious; it is an empirical question, by which I mean that you need to run an experiment, and the answer will vary with workload.

Sometimes false sharing of these locks in the same cache line will be bad. A classic example is the locks that protect processor run queues.

Sometimes false sharing of these locks in the same cache line is good - it can provide the same benefits that prefetch does in a cache. E.g. imagine that you have allocated an array of shared_pointers, and that people usually access aosptr[i] and then aosptr[i+1], etc.

It all depends on your workload. I've seen it fall both ways. Usually in my experience one lock per cache line is better, although often not by much.
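If you do want to try one lock per cache line, a minimal sketch of the layout (the 64-byte line size is an assumption about the target CPU; padded_lock is a made-up name):

#include <stdatomic.h>

#define CACHE_LINE 64   /* assumed line size; query the target CPU in real code */

struct padded_lock {
    _Alignas(CACHE_LINE) atomic_bool locked;   /* each lock starts a new line */
};

static struct padded_lock pool_padded[41];     /* same 41 entries, one line each */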

MORE FUN

In case you care, hardware optimization for test-and-test-and-set spinloops was the subject of my MS thesis, "A Feature Taxonomy of Synchronization Primitives, including the Bus Abandonment Lock" - a hardware update cache protocol that eliminated the burst of bus accesses. Not published in any formal publication, except for University Microfiche.

My work was inspired by the paper "The performance of spin lock alternatives for shared-memory multiprocessors" by T Anderson - IEEE Transactions on Parallel and Distributed Systems , 1990.

I think it's fair to say that most publications since then have been on software algorithms, including the famous MCS lock. I think it is also fair to say that such techniques, while popular in theory, have been little used by journeymen programmers.

By the way, there are some more hardware optimizations possible here. E.g. CMPXCHG need not write the lock at all. Unfortunately, on current Intel hardware, or at least on the Intel hardware I designed in 1991, which I still think is used, the only way to release the hardware lock (that is used to implement the atomic locked RMW) is to use a special microcode write "store-unlock". Heck, the microcode could even use LoadLinked/StoreConditional (LL/SC), falling back to locking in those situations where a naive LL/SC makes no forward progress on some threads.

It is possible that Intel has fixed this recently. I left Intel in 2009. I was trying to get it fixed, improved, optimized, oh, since 1991. And Intel has been improving the performance of locks a lot recently. But I think they have mainly been working on uncontended lock performance, and have not been optimizing contended lock performance.

Similarly, Alain Kagi, in his dissertation and in some publications, as well as patent http://www.google.com/patents/US6460124, shows that by adding judicious delays the cache hardware can make spinlocks as efficient as queued locks.

Some of these hardware optimizations make test-and-set spinloops better than test-and-test-and-set spinloops.

The most recent development has been Intel TSX (Transactional Synchronization Extensions), consisting of HLE (Hardware Lock Elision (Ravi Rajwar worked on this at UWisc while I and Alain Kagi were there, although my work in synchronization was earlier at UIUC)) and RTM (Restricted Transactional Memory). Both of these help uncontended .. hmm, they help with contended locks that are falsely contending, e.g. a coarse grain lock protecting what could be independent stuff. I.e. false lock contention. In some ways, HLE may make spinlock_pool unnecessary.

APOLOGY

I am sorry: I have provided a long-winded answer that boils down to "reducing software lock contention can be much more important than hardware cache line contention for spinloops".

While I would expect some performance to be gained from allocating 1 spinloop per cache line, it may be small, and on some not-exactly-uncommon workloads there may even be a loss.

And to be sure, you would have to measure it.

But, in any case, the big gain comes from reducing software lock contention.

Spinlock not working to protect critical section on multi-core system

Turns out the critical section was calling wait_for_completion_timeout. Even though the timeout was zero, it still slept and didn't wake up to release the spin lock if the interrupt occurred in the blocking section. Using try_wait_for_completion in this case resolved the issue.

I would have posted source but it spans many modules and has architecture abstractions for portability between operating systems. Would have been a mess.
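A hedged sketch of the shape of the fix (hypothetical names my_lock and my_done, not the poster's multi-module source): inside a spinlock-held section, poll the completion with try_wait_for_completion(), which never sleeps, instead of wait_for_completion_timeout(), which can:

#include <linux/completion.h>
#include <linux/spinlock.h>
#include <linux/types.h>

static DEFINE_SPINLOCK(my_lock);
static DECLARE_COMPLETION(my_done);

bool consume_event_if_done(void)
{
    bool done;

    spin_lock(&my_lock);
    /* non-blocking: returns true and consumes the completion if it has
       been signalled, false otherwise; safe to call under a spinlock */
    done = try_wait_for_completion(&my_done);
    spin_unlock(&my_lock);
    return done;
}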

Why spinlocks are used in interrupt handlers

Semaphores cause tasks to sleep on contention, which is unacceptable for interrupt handlers, since interrupt context cannot sleep. Basically, for such a short and fast task (interrupt handling) the work carried out by the semaphore is overkill. Also, a spinlock can't be held by more than one task.

Why are compilers so stupid?

Oh, I don't know. Sometimes compilers are pretty smart. Consider the following C program:

#include <stdio.h>  /* printf() */

int factorial(int n) {
    return n == 0 ? 1 : n * factorial(n - 1);
}

int main() {
    int n = 10;

    printf("factorial(%d) = %d\n", n, factorial(n));

    return 0;
}

On my version of GCC (4.3.2 on Debian testing), when compiled with no optimizations, or -O1, it generates code for factorial() like you'd expect, using a recursive call to compute the value. But on -O2, it does something interesting: It compiles down to a tight loop:

factorial:
.LFB13:
    testl   %edi, %edi
    movl    $1, %eax
    je      .L3
    .p2align 4,,10
    .p2align 3
.L4:
    imull   %edi, %eax
    subl    $1, %edi
    jne     .L4
.L3:
    rep
    ret

Pretty impressive. The recursive call (not even tail-recursive) has been completely eliminated, so factorial now uses O(1) stack space instead of O(N). And although I have only very superficial knowledge of x86 assembly (actually AMD64 in this case, but I don't think any of the AMD64 extensions are being used above), I doubt that you could write a better version by hand. But what really blew my mind was the code that it generated on -O3. The implementation of factorial stayed the same. But main() changed:

main:
.LFB14:
    subq    $8, %rsp
.LCFI0:
    movl    $3628800, %edx
    movl    $10, %esi
    movl    $.LC0, %edi
    xorl    %eax, %eax
    call    printf
    xorl    %eax, %eax
    addq    $8, %rsp
    ret

See the movl $3628800, %edx line? GCC is pre-computing factorial(10) = 3628800 at compile time. It doesn't even call factorial(). Incredible. My hat is off to the GCC development team.

Of course, all the usual disclaimers apply, this is just a toy example, premature optimization is the root of all evil, etc, etc, but it illustrates that compilers are often smarter than you think. If you think you can do a better job by hand, you're almost certainly wrong.

(Adapted from a posting on my blog.)

How to create a high resolution timer in Linux to measure program performance?

Check out clock_gettime, which is a POSIX interface to high-resolution timers.

If, having read the manpage, you're left wondering about the difference between CLOCK_REALTIME and CLOCK_MONOTONIC, see Difference between CLOCK_REALTIME and CLOCK_MONOTONIC?

See the following page for a complete example: http://www.guyrutenberg.com/2007/09/22/profiling-code-using-clock_gettime/

#include <iostream>
#include <time.h>

using namespace std;

timespec diff(timespec start, timespec end);

int main()
{
    timespec time1, time2;
    volatile int temp = 0;  // volatile and initialized, so the loop is neither UB nor optimized away
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time1);
    for (int i = 0; i < 242000000; i++)
        temp += temp;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &time2);
    cout << diff(time1, time2).tv_sec << ":" << diff(time1, time2).tv_nsec << endl;
    return 0;
}

// subtract two timespecs, borrowing a second when the nanosecond part underflows
timespec diff(timespec start, timespec end)
{
    timespec temp;
    if ((end.tv_nsec - start.tv_nsec) < 0) {
        temp.tv_sec = end.tv_sec - start.tv_sec - 1;
        temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
    } else {
        temp.tv_sec = end.tv_sec - start.tv_sec;
        temp.tv_nsec = end.tv_nsec - start.tv_nsec;
    }
    return temp;
}

