How to Allocate, in User Space, a Non-Cacheable Block of Memory on Linux

Is it possible to allocate, in user space, a non-cacheable block of memory on Linux?

How to avoid polluting the caches with data like this is covered in What Every Programmer Should Know About Memory (PDF). It is written from the perspective of Red Hat development, so it's a good fit for you, and most of it is cross-platform anyway.

What you want is called "Non-Temporal Access": it tells the processor that the value you are reading now will not be needed again for a while, so the processor avoids caching it.

See page 49 of the PDF I linked above. It uses the Intel intrinsics to do the streaming around the cache.

On the read side, processors, until recently, lacked support aside from weak hints using non-temporal access (NTA) prefetch instructions. There is no equivalent to write-combining for reads, which is especially bad for uncacheable memory such as memory-mapped I/O. Intel, with the SSE4.1 extensions, introduced NTA loads. They are implemented using a small number of streaming load buffers; each buffer contains a cache line. The first movntdqa instruction for a given cache line will load a cache line into a buffer, possibly replacing another cache line. Subsequent 16-byte aligned accesses to the same cache line will be serviced from the load buffer at little cost. Unless there are other reasons to do so, the cache line will not be loaded into a cache, thus enabling the loading of large amounts of memory without polluting the caches. The compiler provides an intrinsic for this instruction:

#include <smmintrin.h>
__m128i _mm_stream_load_si128 (__m128i *p);

This intrinsic should be used multiple times, with addresses of 16-byte blocks passed as the parameter, until each cache line is read. Only then should the next cache line be started. Since there are a few streaming read buffers it might be possible to read from two memory locations at once.

This would be perfect for you if, when reading, the buffers are read in linear order through memory; you use streaming reads to do so. When you want to modify them, modify them in linear order as well, and you can use streaming writes to do that if you don't expect to read them again any time soon from the same thread.
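If SSE4.1 is available, a streaming read loop along those lines might look like the sketch below (sum_stream is just an illustrative helper; compile with -msse4.1). Note that movntdqa only truly bypasses the cache on write-combining memory; on ordinary write-back memory most CPUs treat it like a regular load, so measure before relying on it. The matching streaming-write intrinsic is _mm_stream_si128.

#include <smmintrin.h>   /* _mm_stream_load_si128 (SSE4.1) */
#include <stdint.h>
#include <stddef.h>

/* Sum a large, 16-byte aligned buffer of 32-bit integers, consuming one
 * whole cache line (4 x 16 bytes) of streaming loads before moving on. */
int32_t sum_stream(const int32_t *buf, size_t n_ints)
{
    __m128i acc = _mm_setzero_si128();
    for (size_t i = 0; i + 16 <= n_ints; i += 16) {
        __m128i a = _mm_stream_load_si128((__m128i *)(buf + i));
        __m128i b = _mm_stream_load_si128((__m128i *)(buf + i + 4));
        __m128i c = _mm_stream_load_si128((__m128i *)(buf + i + 8));
        __m128i d = _mm_stream_load_si128((__m128i *)(buf + i + 12));
        acc = _mm_add_epi32(acc, _mm_add_epi32(_mm_add_epi32(a, b),
                                               _mm_add_epi32(c, d)));
    }
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);   /* spill the 4 lanes */
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}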

How can I disable the CPU cache for certain memory regions?

That comment about non-caching doesn't mean what you think it means, and where it is used, it isn't usually a user-accessible feature. That is, CPU cache control is typically a privileged operation.

That said...

-- A normal user program can be built with functions whose attributes are "hot" or "cold" to let the compiler tell the linker to group the functions in ways that will utilize the cache most usefully.

-- A normal program can use the madvise() function on Linux to tell the paging subsystem various things, including the fact that the memory just used is, or is not, likely to be used again soon (see the sketch after this list, which also illustrates the "cold" attribute from the previous point).

-- The kernel itself uses the Memory Type Range Registers (MTRR) and, in later kernels, the Page Attribute Table (PAT) flags to tell the hardware that particular ranges of memory (such as the memory-mapped display buffer and the various parts of the PCI bus) are not to be cached.
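Here is a minimal sketch of the madvise() hint from the second point, with a "cold" attribute thrown in to illustrate the first one as well; the buffer size and the advice flags are illustrative choices, not a recipe.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

/* Rarely executed error path: the "cold" attribute lets GCC group it away
 * from the hot code. */
__attribute__((cold)) static void die(const char *msg)
{
    perror(msg);
    exit(1);
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;          /* 64 MiB scratch buffer */
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        die("mmap");

    madvise(buf, len, MADV_SEQUENTIAL);     /* hint: one pass, front to back */

    memset(buf, 0xAB, len);                 /* ... use the buffer ... */

    /* Hint: we won't need this data again soon; the kernel may reclaim
     * the pages immediately. */
    madvise(buf, len, MADV_DONTNEED);

    munmap(buf, len);
    return 0;
}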

"Normal Data™" such as you are likely to use in any C program will essentially never benefit from marking any of its data not cache-worthy. The performance improvement that not-cached data enjoys is the subsequent absence of the various cache-flush and memory barrier operations that memory mapped devices and display buffers would need almost constantly. Laying a cache over a memory mapped device, for example, would require a cache invalidate command before every read and a cache forced write command after every single write to make sure that the reads and writes happen at the exact moment needed. This would "poison" the cache usage, using up and instantly discarding cache lines (a physically limited resource) in a most unfriendly and unhelpful way.

In the rare case that you write a program that gains access to one of these cache-harmful regions -- such as if you wrote part of the X display server on a Linux system -- the kernel would have already set the registers for the device, and the non-cache behavior would be transparent to you.

There is effectively no situation in which your normal, application-grade program is going to benefit from any ability to mark a variable as harmful to cache beyond the various madvise() types of usage.

Even then, the cases where you could gain any benefit are so rare that if you'd ever actually run into one, the problem set would have included the need and methodology as part of your research, and you'd have been told how and why so explicitly that you'd never have needed to ask this question.

To go back to the same example again: if you'd been writing the necessary driver, then when you'd been reading up on the display adapter hardware or the PCI bus, the various flags and techniques would have been documented and discussed in the hardware guide.

There are ways to pull off cache-line eviction and such from user space with things like the CLFLUSH instruction on an Intel platform. These techniques will not improve general performance.
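For completeness, a minimal sketch of that kind of user-space eviction via the CLFLUSH intrinsic (flush_range is an illustrative helper). It only throws lines out of the cache; nothing stops the CPU from caching them again on the very next access, which is exactly why it rarely helps general-purpose code.

#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stddef.h>

/* Evict every cache line covering [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
    const char *p = (const char *)addr;
    for (size_t i = 0; i < len; i += 64)    /* 64-byte lines on current x86 */
        _mm_clflush(p + i);
    _mm_mfence();                           /* order the flushes before later accesses */
}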

Since marking memory uncacheable is a privileged operation on a Linux system, you could write a kernel driver that acquired and marked a region of memory as uncacheable and then let you map it into your application. But the need for such a region is so rare, and so likely to be misused, that there isn't a normal methodology for doing it in place.

So how do you do it? You don't, at least not the you that you are today. When you become a kernel driver writer with an intimate specialty knowledge of multi-threaded code and data synchronization issues, you'll know how you could do it, and at that point you'll know why you don't want to except as a last resort.

TL;DR :: Because of the way Linux uses and manages data and code, there is never a benefit to marking any part of a normal application as uncacheable that doesn't cause more heartbreak than it saves. As such, there is no unprivileged API for doing this.

P.S. Also, that said, someone already pointed to things that led to this article http://lwn.net/Articles/255364/ which covers ways to make your program very cache-friendly and some of the ways that you can do certain cache-bypass operations very cheaply. For instance, use of memset() tends to go around the cache while setting memory, and some operations can "stream past" the cache. This isn't the same thing as what you asked, but once you understand all of that article you'll have a much better understanding of why marking a region of memory as uncacheable is usually, as the Jedi say, not the solution you are looking for.

How to mark some memory ranges as non-cacheable from C++?

On Windows, you can use VirtualProtect(ptr, length, PAGE_NOCACHE, &oldFlags) to set the caching behavior for memory to avoid caching.
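A minimal Windows sketch, assuming you want the region uncacheable from the moment it is created: it uses VirtualAlloc rather than VirtualProtect, and note that PAGE_NOCACHE is a modifier that has to be combined with an ordinary protection such as PAGE_READWRITE.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    SIZE_T len = 1 << 20;   /* 1 MiB, illustrative */
    void *p = VirtualAlloc(NULL, len, MEM_COMMIT | MEM_RESERVE,
                           PAGE_READWRITE | PAGE_NOCACHE);
    if (p == NULL) {
        fprintf(stderr, "VirtualAlloc failed: %lu\n", GetLastError());
        return 1;
    }

    /* ... use the uncached region ... */

    VirtualFree(p, 0, MEM_RELEASE);
    return 0;
}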

Regarding too many indirections: yes, they can damage cache performance if you access different pieces of memory very often (which is usually what happens). It's important to note, though, that if you consistently dereference the same set of, say, 8 blocks of memory, and only the 9th block differs, then it generally won't make a difference, because the 8 blocks would be cached after the first access.

How to access physical addresses from user space in Linux?

You can map a device file into a user process's memory using the mmap(2) system call. Usually, device files are mappings of physical memory into the file system.
Otherwise, you have to write a kernel module which creates such a file or provides a way to map the needed memory into a user process.

Another way is remapping parts of /dev/mem into user memory.

Edit:
Example of mmapping /dev/mem (this program must have access to /dev/mem, e.g. have root rights):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 3) {
        printf("Usage: %s <phys_addr> <length>\n", argv[0]);
        return 1;
    }

    off_t offset = strtoul(argv[1], NULL, 0);
    size_t len = strtoul(argv[2], NULL, 0);

    // Round the offset down to a multiple of the page size, or mmap will fail.
    size_t pagesize = sysconf(_SC_PAGE_SIZE);
    off_t page_base = (offset / pagesize) * pagesize;
    off_t page_offset = offset - page_base;

    int fd = open("/dev/mem", O_RDONLY | O_SYNC);
    if (fd < 0) {
        perror("Can't open /dev/mem");
        return 1;
    }

    unsigned char *mem = mmap(NULL, page_offset + len, PROT_READ,
                              MAP_PRIVATE, fd, page_base);
    if (mem == MAP_FAILED) {
        perror("Can't map memory");
        return 1;
    }

    // Dump the requested bytes as hex.
    for (size_t i = 0; i < len; ++i)
        printf("%02x ", (int)mem[page_offset + i]);
    printf("\n");

    munmap(mem, page_offset + len);
    close(fd);
    return 0;
}

Contiguous physical memory from userspace

No, there is not. You need to do this from kernel space.

If you say "we need to do this from user space" without anything going on in kernel space, it makes little sense, because a user-space program has no way of controlling, or even knowing, whether the underlying memory is contiguous or not.

The only reason you would need to do this is if you were working in conjunction with a piece of hardware, or some other low-level (i.e. kernel) service, that had this requirement. So again, you would have to deal with it at that level.

So the answer isn't just "you can't" - but "you should never need to".

I have written memory managers that do allow this, but it was always because of some underlying issue at the kernel level, which had to be addressed at the kernel level, generally because some other agent on the bus (a PCI card, the BIOS, or even another computer over an RDMA interface) had the physically contiguous memory requirement. Again, all of this had to be addressed in kernel space.

When you talk about "cache lines", you don't need to worry: you can be assured that each page of your user-space memory is physically contiguous, and each page is much larger than a cache line (no matter what architecture you're talking about).

How do userspace programs pass memory back to the kernel after free()?


Before the call to free, this process has a heap size of at least 4GiB...

The C language does not define either "heap" or "stack". Before the call to free(), this process has a chunk of 4 GiB of dynamically allocated memory...

and afterward, does it still have that heap size?

...and after the free(), access to that memory would be undefined behaviour, so for practical purposes, that dynamically allocated memory is no longer "there".

What the library does "under the hood" (e.g. caching, see below) is up to the library, and is subject to change without further notice. This could change with the amount of available physical memory, system load, runtime parameters, ...

How do modern operating systems allow userspace programs to return memory back to kernel space?

It's up to the standard library's implementation to decide (which, of course, has to talk to the operating system to actually, physically allocate / free memory).

Others have pointed out how certain, existing implementations do it. Other libraries, operating systems, and environments exist.

Do free implementations execute a syscall to the kernel (or many syscalls) to tell it which areas of memory are now available?

Possibly. A common optimization done by library implementations is to "cache" free()d memory, so subsequent malloc() calls can be served without talking to the kernel (which is a costly operation). When, how much, and how long memory is cached this way is, you guessed it, implementation-defined.
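As a concrete, glibc-specific illustration: an allocation as large as 4 GiB is typically served by mmap() and handed straight back to the kernel by free(), but the smaller chunks the allocator keeps cached can be explicitly released with the glibc extension malloc_trim(). This is a sketch, not portable code; other implementations may have no equivalent.

#include <malloc.h>   /* malloc_trim() -- glibc extension */
#include <stdlib.h>

int main(void)
{
    /* Many small allocations land on the main heap, and glibc may keep
     * them cached after free() to serve later malloc() calls. */
    enum { N = 100000 };
    static char *p[N];
    for (int i = 0; i < N; ++i)
        p[i] = malloc(1024);
    for (int i = 0; i < N; ++i)
        free(p[i]);

    /* Ask glibc to return as much cached memory as possible to the kernel.
     * Whether and how much is released remains implementation-defined. */
    malloc_trim(0);
    return 0;
}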

And is it possible that my 4 GiB allocation will be non-contiguous?

The process will always "see" contiguous memory. In a system supporting virtual memory (i.e. "modern" desktop OSes like Linux or Windows), the physical memory might be non-contiguous, but the virtual addresses your process gets to see will be contiguous (or the malloc() would have failed if this requirement could not be serviced).

Again, other systems exist. You might be looking at a system that doesn't virtualize addresses (i.e. gives physical addresses to the process). You might be looking at a system that assigns a given amount of memory to a process on startup, serves any malloc() requests from that, and doesn't support the allocation of additional memory. And so on.

How to declare a memory range as uncacheable using gcc on the x86 platform?

I think what you're describing is Memory Type Range Registers (MTRRs). You can control these under Linux (if available, and if you're user 0) using /proc/mtrr or ioctl(2); see here for an example. As it works on a physical address range, I think you're going to have a hard time using it in a reasonable way.

A better way is to look at the compiler intrinsics GCC provides and find one or more that express your intent. Have a look at Ulrich Drepper's series "What every programmer should know about memory", in particular part 5, which deals with bypassing the cache. It looks like _mm_prefetch(ptr, _MM_HINT_NTA) might be appropriate for your needs.
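For example, a single pass over a large array might use that hint roughly like the sketch below (sum_once and the prefetch distance are illustrative; the hint is advisory and may be ignored, so measure).

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */
#include <stddef.h>

/* Sum an array we expect to touch exactly once, prefetching ahead with the
 * non-temporal hint to keep cache pollution low. */
long sum_once(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 32 < n)                     /* prefetch distance: tune on real hardware */
            _mm_prefetch((const char *)&data[i + 32], _MM_HINT_NTA);
        total += data[i];
    }
    return total;
}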

As always when it comes to performance: measure, measure, measure. Drepper's series has excellent parts detailing how this can be done (part 7), as well as code examples and other strategies to try when speeding up the memory performance of your code.


