Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size

The intent of these constants is indeed to get the cache-line size. The best place to read about the rationale for them is in the proposal itself:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

I'll quote a snippet of the rationale here for ease-of-reading:

[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.

Uses of cache-line size fall into two broad categories:

  • Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
  • Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.

The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

  • Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
  • Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.

In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.
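
For instance, here is a minimal sketch of that alignas() use (C++17, assuming a standard library that ships these constants):

#include <atomic>
#include <new>

// aligning (and thereby padding) each counter to its own
// destructive-interference unit makes false-sharing between
// writers on different threads unlikely
struct alignas(std::hardware_destructive_interference_size) padded_counter {
    std::atomic<unsigned> value{0};
};

padded_counter per_thread_counters[4]; // one per writer thread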


"How are these constants related to the L1 cache line size?"

In theory, pretty directly.

If the compiler knows exactly what architecture you'll be running on, then these will almost certainly give you the L1 cache-line size precisely. (As noted later, this is a big assumption.)

For what it's worth, I would almost always expect these values to be the same. I believe the only reason they are declared separately is for completeness. (That said, maybe a compiler wants to estimate L2 cache-line size instead of L1 cache-line size for constructive interference; I don't know if this would actually be useful, though.)


"Is there a good example that demonstrates their use cases?"

At the bottom of this answer I've attached a long benchmark program that demonstrates false-sharing and true-sharing.

It demonstrates false-sharing by allocating an array of int wrappers: in one case multiple elements fit in the L1 cache-line, and in the other a single element takes up the L1 cache-line. In a tight loop, a single, fixed element is chosen from the array and updated repeatedly.

It demonstrates true-sharing by allocating a single pair of ints in a wrapper: in one case, the two ints within the pair do not fit in L1 cache-line size together, and in the other they do. In a tight loop, each element of the pair is updated repeatedly.

Note that the code for accessing the object under test does not change; the only difference is the layout and alignment of the objects themselves.

I don't have a C++17 compiler (and assume most people currently don't either), so I've replaced the constants in question with my own. You need to update these values to be accurate on your machine. That said, 64 bytes is probably the correct value on typical modern desktop hardware (at the time of writing).

Warning: the test will use all cores on your machine, and allocate ~256MB of memory. Don't forget to compile with optimizations!

On my machine, the output is:


Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.


"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"

This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.

This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.

In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:

#include <cstddef>
#include <new>

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
    return KNOWN_L1_CACHE_LINE_SIZE;
#else
    return std::hardware_destructive_interference_size;
#endif
}

During compilation, if you want to assume a cache-line size, just define KNOWN_L1_CACHE_LINE_SIZE.
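
For example (the file name is illustrative, and 64 is an assumption you must verify for your target):

g++ -std=c++17 -O2 -DKNOWN_L1_CACHE_LINE_SIZE=64 main.cpp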

Hope this helps!

Benchmark program:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <random>
#include <thread>
#include <tuple>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of ints, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of ints, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
        ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int)); // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int)); // allocate 256 mebibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}

Is std::hardware_constructive_interference_size ever useful?

Consider a std::deque<T>. It's often implemented using chunks of a given size. But how many T's do you store per chunk? A reasonable answer is std::hardware_constructive_interference_size/sizeof(T), if sizeof(T) is small.
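
A minimal sketch of that computation (the elements_per_chunk helper is my own name, not part of any library):

#include <cstddef>
#include <new>

// how many T's fit in one constructive-interference unit;
// clamped to at least one element for large T
template <typename T>
constexpr std::size_t elements_per_chunk() {
    std::size_t const n =
        std::hardware_constructive_interference_size / sizeof(T);
    return n > 0 ? n : 1;
}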

Similarly, a string class with the Small String Optimization may aim for a size of std::hardware_constructive_interference_size. In general, the size is useful when you can have a run-time variable amount of data with high locality of reference.
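
As a toy sketch of the SSO idea (not a real string class; the layout is purely illustrative):

#include <cstddef>
#include <new>

// size the in-place buffer so the whole object occupies at most
// one constructive-interference unit
class sso_string {
    unsigned char size_ = 0; // toy length field
    char buffer_[std::hardware_constructive_interference_size - 1] = {};
};
static_assert(sizeof(sso_string) <=
              std::hardware_constructive_interference_size, "");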

no member named 'hardware_constructive_interference_size' in namespace 'std'

How can I handle this situation?

You could detect the broken language implementation using pre-defined macros and make an exception for it.

You can detect support by trying to compile and run a small program:

#include <new>

int main() {
#ifdef __cpp_lib_hardware_interference_size
    // return 0 if the interference sizes are defined
    return !(std::hardware_constructive_interference_size &&
             std::hardware_destructive_interference_size);
#else
    return 1; // no interference sizes
#endif
}

clang++ -std=c++17 -o hwisize hwisize.cpp 2>/dev/null && ./hwisize
has_hw_interference_sizes=$?
  • If the compilation fails, has_hw_interference_sizes will be 1.
  • If the compilation succeeds, but __cpp_lib_hardware_interference_size is not defined, has_hw_interference_sizes will be 1.
  • If the compilation succeeds, and __cpp_lib_hardware_interference_size is defined, has_hw_interference_sizes will be 0.

(The inverted logic follows the shell convention that 0 means true/success and non-zero means false/failure.)

Just plug that into your build system.
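
For instance, in a shell-driven build (a sketch; HAVE_HW_INTERFERENCE_SIZES is a hypothetical macro your code would check):

if [ "$has_hw_interference_sizes" -eq 0 ]; then
    CXXFLAGS="$CXXFLAGS -DHAVE_HW_INTERFERENCE_SIZES"
fi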

Where is std::hardware_destructive_interference_size?

Neither libstdc++ (GCC) nor libc++ (Clang) has implemented this feature yet. This is documented on their C++17 compliance status lists:

https://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.2017

http://libcxx.llvm.org/cxx1z_status.html

(On each, search for 'P0154R1'.)

You may be able to detect whether the feature is available via

#if __cpp_lib_hardware_interference_size >= 201603
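
A fuller sketch of that fallback (the 64-byte default is an assumption for typical current x86-64, not a guarantee):

#include <cstddef>
#include <new>

#if defined(__cpp_lib_hardware_interference_size) && \
    __cpp_lib_hardware_interference_size >= 201603
constexpr std::size_t destructive_size =
    std::hardware_destructive_interference_size;
#else
constexpr std::size_t destructive_size = 64; // assumed cache-line size
#endif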

Is false sharing the case with heap memory?

Memory is memory. To the CPU, an array on the stack is exactly the same as an array on the heap, so any false-sharing problem remains the same.
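
For example, the heap-allocated versions have exactly the same layout concern:

#include <vector>

// assumes the naive_int and cache_int wrappers from the benchmark above
std::vector<naive_int> shared(1024); // heap storage; elements share cache lines
std::vector<cache_int> padded(1024); // heap storage; each element gets its own line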

Should the cache padding size of x86-64 be 128 bytes?

Intel's optimization manual does describe the L2 spatial prefetcher in SnB-family CPUs. Yes, it tries to complete 128B-aligned pairs of 64B lines when there's spare memory bandwidth (off-core request tracking slots) as the first line is getting pulled in.

Your microbenchmark doesn't show any significant time difference between 64- and 128-byte separation. Without any actual false sharing (within the same 64-byte line), after some initial chaos, it quickly reaches a state where each core has exclusive ownership of the cache line it's modifying. That means no further L1d misses, and thus no requests to L2 that would trigger the L2 spatial prefetcher.

Contrast that with, for example, two pairs of threads each contending over its own atomic<int> in adjacent (or non-adjacent) cache lines, or false-sharing within them. Then L2 spatial prefetching can couple the contention together, so all 4 threads contend with each other instead of running as 2 independent pairs. In basically any case where cache lines actually are bouncing back and forth between cores, L2 spatial prefetching can make it worse if you aren't careful.

(The L2 prefetcher doesn't keep trying indefinitely to complete pairs of lines of every valid line it's caching; that would hurt cases like this where different cores are repeatedly touching adjacent lines, more than it helps anything.)

Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size includes an answer with a longer benchmark; I haven't looked at it recently, but I think it's supposed to demo destructive interference at 64 bytes but not 128. The answer there unfortunately doesn't mention L2 spatial prefetching as one of the effects that can cause some destructive interference (although not as much as a 128-byte line size in an outer level of cache, especially not if it's an inclusive cache).
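
If you do decide to pad against prefetcher-induced coupling as well, something like this captures the idea (128 is an assumption for SnB-family x86-64, not a portable constant):

#include <atomic>

// each counter gets its own 128-byte-aligned slot, i.e. its own
// pair of 64-byte lines as seen by the L2 spatial prefetcher
struct alignas(128) spaced_counter {
    std::atomic<int> value{0};
};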



Perf counters reveal a difference even with your benchmark

There is some initial chaos, though, and we can measure it with performance counters for your benchmark. On my i7-6700k (quad-core Skylake with Hyperthreading; 4c8t, running Linux 5.16), I improved the source so I could compile with optimization without defeating the memory access, and added a CPP macro so I can set the stride (in bytes) from the compiler command line. Note the ~500 memory-order mis-speculation pipeline nukes (machine_clears.memory_ordering) when we use adjacent lines. The actual number is quite variable, from 200 to 850, but there's still a negligible effect on the overall time.

Adjacent lines, 500 +- 300 machine clears

$ g++ -DSIZE=64 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

Performance counter stats for './a.out' (25 runs):

560.22 msec task-clock # 3.958 CPUs utilized ( +- 0.12% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
126 page-faults # 224.752 /sec ( +- 0.35% )
2,180,391,747 cycles # 3.889 GHz ( +- 0.12% )
2,003,039,378 instructions # 0.92 insn per cycle ( +- 0.00% )
1,604,118,661 uops_issued.any # 2.861 G/sec ( +- 0.00% )
2,003,739,959 uops_executed.thread # 3.574 G/sec ( +- 0.00% )
494 machine_clears.memory_ordering # 881.172 /sec ( +- 9.00% )

0.141534 +- 0.000342 seconds time elapsed ( +- 0.24% )

vs with 128-byte separation, only a very few machine clears

$ g++ -DSIZE=128 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

Performance counter stats for './a.out' (25 runs):

560.01 msec task-clock # 3.957 CPUs utilized ( +- 0.13% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
124 page-faults # 221.203 /sec ( +- 0.16% )
2,180,048,243 cycles # 3.889 GHz ( +- 0.13% )
2,003,038,553 instructions # 0.92 insn per cycle ( +- 0.00% )
1,604,084,990 uops_issued.any # 2.862 G/sec ( +- 0.00% )
2,003,707,895 uops_executed.thread # 3.574 G/sec ( +- 0.00% )
22 machine_clears.memory_ordering # 39.246 /sec ( +- 9.68% )

0.141506 +- 0.000342 seconds time elapsed ( +- 0.24% )

Presumably there's some dependence on how Linux schedules threads to logical cores on this 4c8t machine. Related:

  • What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? - actual false sharing within one line is much worse for logical cores sharing a physical core, but for adjacent lines there's probably no effect: the L2 is the same per physical core, and both lines will stay hot in L1d.
  • Why flush the pipeline for Memory Order Violation caused by other logical processors?

vs. actual false-sharing within one line: 10M machine clears

The store buffer (and store forwarding) get a bunch of increments done for every false-sharing machine clear, so it's not as bad as one might expect. (It would be far worse with atomic RMWs, like std::atomic<int>::fetch_add, since every single increment needs direct access to L1d cache as it executes; see the hypothetical atomic variant sketched after the improved benchmark below.) See also Why does false sharing still affect non-atomics, but much less than atomics?

$ g++ -DSIZE=4 -pthread -O2 false-share.cpp && perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,machine_clears.memory_ordering -r25 ./a.out 

Performance counter stats for './a.out' (25 runs):

809.98 msec task-clock # 3.835 CPUs utilized ( +- 0.42% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
122 page-faults # 152.953 /sec ( +- 0.22% )
3,152,973,230 cycles # 3.953 GHz ( +- 0.42% )
2,003,038,681 instructions # 0.65 insn per cycle ( +- 0.00% )
2,868,628,070 uops_issued.any # 3.596 G/sec ( +- 0.41% )
2,934,059,729 uops_executed.thread # 3.678 G/sec ( +- 0.30% )
10,810,169 machine_clears.memory_ordering # 13.553 M/sec ( +- 0.90% )

0.21123 +- 0.00124 seconds time elapsed ( +- 0.59% )


Improved benchmark: align the array, and use volatile to allow optimization

I used volatile so I could enable optimization. I assume you compiled with optimization disabled, so int j was also getting stored/reloaded inside the loop.

And I used alignas(128) counter[] so we'd be sure the start of the array was in two pairs of 128-byte lines, not spread across three.

#include <thread>

alignas(128) volatile int counter[1024]{};

void update(int idx) {
    for (int j = 0; j < 100000000; j++) ++counter[idx];
}

static const int stride = SIZE / sizeof(counter[0]);

int main() {
    std::thread t1(update, 0*stride);
    std::thread t2(update, 1*stride);
    std::thread t3(update, 2*stride);
    std::thread t4(update, 3*stride);
    t1.join();
    t2.join();
    t3.join();
    t4.join();
}
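
For comparison, a hypothetical atomic-RMW variant of the same loop (not part of the benchmark above) would be expected to be much worse under actual false sharing:

#include <atomic>

std::atomic<int> atomic_counter[1024]{};

void update_atomic(int idx) {
    // unlike the plain volatile increments above, every fetch_add must
    // own the cache line in L1d at the moment it executes
    for (int j = 0; j < 100000000; j++)
        atomic_counter[idx].fetch_add(1, std::memory_order_relaxed);
}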

