Performance Difference Between IPC Shared Memory and Threads' Memory

Performance difference between IPC shared memory and threads' memory

1) shmat() maps the local process's virtual memory onto the shared
segment. This translation has to be performed for each shared memory
address and can represent a significant cost relative to the number
of shm accesses. In a multi-threaded application no extra
translation is required: all VM addresses are converted to physical
addresses, as in a regular process that does not access shared memory.

There is no overhead compared to regular memory access, aside from the initial cost of setting up the shared pages - populating the page table in the process that calls shmat() - which in most flavours of Linux is one page-table entry (4 or 8 bytes) per 4KB of shared memory.

It's (for any relevant comparison) the same cost whether the pages are allocated as shared memory or within the same process.
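A minimal sketch to illustrate the point - once both regions have been faulted in, walking shmat()'d memory costs the same as walking malloc()'d memory. The segment size, the cache-line stride and the clock_gettime() timing are my own choices, and error checking is omitted:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SIZE (64 * 1024 * 1024)

    /* touch one byte per cache line and time the walk */
    static double walk(volatile char *p)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < SIZE; i += 64)
            p[i]++;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char *heap = malloc(SIZE);
        memset(heap, 0, SIZE);            /* fault the heap pages in */

        int id = shmget(IPC_PRIVATE, SIZE, IPC_CREAT | 0600);
        char *shm = shmat(id, NULL, 0);
        memset(shm, 0, SIZE);             /* pay the shm setup cost once */

        printf("heap walk: %.4f s\n", walk(heap));
        printf("shm  walk: %.4f s\n", walk(shm));

        shmdt(shm);
        shmctl(id, IPC_RMID, NULL);       /* mark the segment for removal */
        free(heap);
        return 0;
    }

Both walks should come out essentially identical, because after shmat() the page-table entries are in place and every access goes through the same MMU path.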

2) The shared memory segment must be maintained somehow by the kernel.
I do not know what that 'somehow' means in terms of performance, but,
for example, when all processes attached to the shm segment are taken
down, the segment is still up and can eventually be re-accessed by
newly started processes. There must be at least some degree of overhead
related to the things the kernel needs to check during the lifetime of
the shm segment.

Whether shared or not, each page of memory has a "struct page" attached to it, with some data about the page. One of the items is a reference count. When a page is given out to a process [whether it is through "shmat" or some other mechanism], the reference count is incremented. When it is freed through some means, the reference count is decremented. If the decremented count is zero, the page is actually freed - otherwise "nothing more happens to it".
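As a toy model of that get/put pattern (this is not the real kernel struct page - the names and the printf are mine):

    #include <stdio.h>
    #include <stdlib.h>

    struct toy_page {
        int refcount;
        /* ... per-page bookkeeping would live here ... */
    };

    static void page_get(struct toy_page *p)
    {
        p->refcount++;              /* another user: shmat(), fork(), ... */
    }

    static void page_put(struct toy_page *p)
    {
        if (--p->refcount == 0) {   /* last user gone: really free it */
            printf("page freed\n");
            free(p);
        }
    }

    int main(void)
    {
        struct toy_page *p = malloc(sizeof *p);
        p->refcount = 1;            /* first mapping */
        page_get(p);                /* a second process attaches */
        page_put(p);                /* first process exits: page survives */
        page_put(p);                /* second process exits: now freed */
        return 0;
    }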

The overhead is basically zero, compared to any other allocated memory. The same mechanism is used for pages for other purposes anyway - say, for example, you have a page that is also used by the kernel, and your process dies: the kernel needs to know not to free that page until it has been released by the kernel as well as by the user process.

The same thing happens when a "fork" is created. When a process is forked, the entire page table of the parent process is essentially copied into the child process, and all pages are made read-only. Whenever a write happens, the kernel takes a fault, which leads to that page being copied - so there are now two copies of the page, and the process doing the writing can modify its page without affecting the other process. Once the child (or parent) process dies, all pages still owned by BOTH processes [such as the code space that never gets written, and probably a bunch of common data that never got touched, etc.] obviously can't be freed until BOTH processes are "dead". So the reference-counted pages come in useful here too, since we only count down the ref-count on each page, and when the ref-count is zero - that is, when all processes using that page have freed it - the page is actually returned as a "useful page".
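A small demonstration of that copy-on-write behaviour (the buffer size and the strings are arbitrary):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        char *buf = malloc(4096);
        strcpy(buf, "original");

        pid_t pid = fork();
        if (pid == 0) {                    /* child */
            strcpy(buf, "child wrote");    /* write fault: page gets copied */
            printf("child : %s\n", buf);
            exit(0);
        }
        wait(NULL);                        /* let the child run first */
        printf("parent: %s\n", buf);       /* still "original" */
        return 0;
    }

The parent still prints "original": the child's write faulted, the kernel copied the page for the child, and the parent's ref-counted original was untouched.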

Exactly the same thing happens with shared libraries. If one process uses a shared library, it will be freed when that process ends. But if two, three or 100 processes use the same shared library, the code obviously will have to stay in memory until the page is no longer needed.

So, basically, all pages in the whole kernel are already reference counted. There is very little overhead.

What's the difference between shared memory for IPCs and threads' shared memory?

SHM is for IPC between multiple processes. In a modern OS, processes cannot see each other's memory space. They use a common key with shmget() to get the shared memory segment, and shmat() to map the shared memory pages to a local memory address inside each process. The mapped shared memory address might differ between processes, due to different memory usage and the shared libraries loaded into each process's space. The SHM key and size are predefined and fixed among those processes.
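A minimal sketch of that flow - the key, the size and the "argument means writer" convention are made up for illustration, and error checking is omitted:

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    #define SHM_KEY  0x1234   /* predefined key known to both processes */
    #define SHM_SIZE 4096     /* predefined size, fixed among the processes */

    int main(int argc, char **argv)
    {
        int id = shmget(SHM_KEY, SHM_SIZE, IPC_CREAT | 0600);
        char *p = shmat(id, NULL, 0);   /* address may differ per process */

        if (argc > 1) {                 /* run with an argument: writer */
            strcpy(p, argv[1]);
        } else {                        /* run without: reader */
            printf("read from shm: %s\n", p);
        }
        shmdt(p);
        return 0;
    }

Run it once with an argument to write, then again without one to read - even after the first process has exited. The segment outlives its processes until it is explicitly removed (e.g. with ipcrm or shmctl() with IPC_RMID).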

For threads' memory, we might not call it shared memory, because all of a process's threads address a single memory space. They can see and read/write the same process space directly.
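To make the contrast concrete - a sketch with a plain global and pthreads (the variable and the value are arbitrary); build with -pthread:

    #include <stdio.h>
    #include <pthread.h>

    static int shared_value;    /* ordinary global, visible to all threads */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_value = 42;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        pthread_join(t, NULL);
        printf("main thread sees: %d\n", shared_value);  /* 42, no mapping step */
        return 0;
    }

No key, no shmget(), no shmat() - the threads simply dereference the same addresses.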

shared memory performance and protection from other processes

An alternative architecture you might consider is dynamic loading. Instead of 2 processes, you have just the first one; it uses dlopen() to load your newly compiled code. It calls the entry point of this "library", and the code has access to all the space, including the persistent variables. On return, you unload the library, ready for the next "run".

Creating such a loadable library and calling it is fairly simple, and faster than executing a whole new process. There are no problems with permissions, as your one and only process decides what to load and run.
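A sketch of that architecture - the library name and the run_step entry point are hypothetical placeholders; link with -ldl:

    #include <stdio.h>
    #include <dlfcn.h>

    int main(void)
    {
        void *handle = dlopen("./newly_compiled.so", RTLD_NOW);
        if (!handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* look up the entry point; cast via the usual POSIX idiom */
        void (*run_step)(void) = (void (*)(void))dlsym(handle, "run_step");
        if (run_step)
            run_step();        /* runs with full access to our address space */

        dlclose(handle);       /* unload, ready for the next "run" */
        return 0;
    }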

Is there an Advantage to Shared Memory in a Multithreaded Program?

Shared memory usually refers to memory shared between different processes, and it needs special OS calls to set up and use: shm_open() for POSIX shared memory, shmget() for SysV shared memory, or mmap() with the MAP_SHARED flag.

Threads within the same process can simply access the process' memory (the one you get from malloc).

Since Shared Memory has overhead that's unnecessary for a normal multi-threaded program, you do not gain any benefit by using it in a single-process program.
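For contrast, here is the setup two processes would need for the same thing via POSIX shared memory (the segment name and size are arbitrary; link with -lrt on older glibc):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* create-or-open a named segment, size it, then map it */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);

        strcpy(p, "visible to any process that opens /demo_shm");
        printf("%s\n", p);

        munmap(p, 4096);
        close(fd);
        shm_unlink("/demo_shm");  /* none of this is needed between threads */
        return 0;
    }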

multithreading in two processes communicating over shared memory gives a huge slowdown and then core-mapping recovers it

The "optimal" configuration when you have two threads running at full tilt is to have a core for each. If they aren't moved around i e each thread stays on "its" core you'll have better performance than if they are moved back and forth between the cores. So essentially a 2+2 thread solution will require 4 cores to run optimally.

In addition, since two cores are running the same code, it is vital that they (in your case) aren't moved from "their" core. This is because the operating environment of both cores is more or less the same, which makes switching between them less cumbersome (at the cache level) than if everything had to be loaded onto a different core.
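A sketch of pinning each thread to "its" core with the Linux-specific pthread_setaffinity_np() (the core numbers and the empty work function are placeholders):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <pthread.h>
    #include <sched.h>

    static void pin_to_core(pthread_t t, int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(t, sizeof(set), &set);
    }

    static void *work(void *arg)
    {
        (void)arg;
        /* ... hot loop over the shared buffer ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, work, NULL);
        pthread_create(&b, NULL, work, NULL);
        pin_to_core(a, 0);      /* thread a stays on core 0 */
        pin_to_core(b, 1);      /* thread b stays on core 1 */
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }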

Then you have the issue of memory-system saturation. A "normal" single-threaded program will usually use up most, if not all, of the available memory bandwidth. Its speed will usually be determined by the rate at which the memory system can provide it with data. There are exceptions, such as when you're in a division instruction during which no memory activity occurs, or when you're in a tight loop which doesn't require data reads or writes. In most other cases the memory system will be working its butt off to shove memory into the program, and a lot of the time still not as fast as the program can make use of it.

A program which doesn't take this into account will run slower multi-threaded than single-threaded, because both threads will start colliding when they need memory access, and this slows things down a lot. That goes for compiled languages such as C or C++. With Java there are a lot of memory accesses going on behind the scenes (caused by the engine) over which the programmer has little control, so the Java engine and its workings will use up a lot of the cache memory and bandwidth, which means that your shared memory will be competing with the engine's needs and will be in and out of the cache more or less constantly.

My two cents.
