What happens in the kernel during malloc?
When user space applications call malloc(), that call isn't implemented in the kernel. Instead, it's a library call (implemented in glibc or similar).
The short version is that the malloc implementation in glibc obtains memory either from the brk()/sbrk() system call or as anonymous memory via mmap(). This gives glibc a big chunk of memory that is contiguous in virtual address space, which the malloc implementation further slices and dices into smaller chunks and hands out to your application.
Here's a small malloc implementation that'll give you the idea, along with many, many links.
Note that nothing cares about physical memory yet -- that's handled by the kernel's virtual memory system when the process data segment is altered via brk()/sbrk() or mmap(), and when the memory is first referenced (by a read or write to it).
To summarize:
- malloc() will search its managed pieces of memory to see if there's a piece of unused memory that satisfies the allocation request.
- Failing that, malloc() will try to extend the process data segment (via sbrk()/brk() or, in some cases, mmap()). sbrk() ends up in the kernel.
- The brk()/sbrk() calls in the kernel adjust some of the offsets in the struct mm_struct of the process, so the process data segment will be larger. At first, there is no physical memory mapped to the additional virtual addresses that extending the data segment provided.
- When that unmapped memory is first touched (likely by a read/write from the malloc implementation), a page-fault handler kicks in and traps down to the kernel, where the kernel assigns physical memory to the unmapped addresses.
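The data-segment extension described above can be observed directly from user space. Here's a minimal sketch (Linux/glibc assumed; sbrk() is deprecated in POSIX but still available in glibc) that moves the program break by hand, the same way malloc does when it runs out of managed memory:

```c
#include <unistd.h>  /* sbrk */

/* Extend the process data segment by n bytes and report how far
 * the program break actually moved. Returns -1 on failure.
 * Note: no physical memory is assigned yet -- that only happens
 * when the new range is first touched. */
long grow_break_by(long n)
{
    void *before = sbrk(0);          /* sbrk(0) returns the current break */
    if (sbrk(n) == (void *)-1)
        return -1;                   /* kernel refused to extend the segment */
    void *after = sbrk(0);
    return (long)((char *)after - (char *)before);
}
```

Mixing raw sbrk() calls with malloc() in real programs is a bad idea, since glibc's allocator assumes it controls the break; this is only to illustrate the mechanism.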
Is memory allocated using malloc inside a kernel accessible by threads of other blocks?
Yes, this memory comes from the so-called "device heap", and it is accessible by any device code (any thread) from any kernel running on that GPU. Note that this applies even to kernels other than the one that actually did the malloc operation.
The above statement applies until application termination, or until you explicitly free that memory using an in-kernel free() call on the pointer.
You may wish to read the documentation on the in-kernel malloc() functionality. There are size limits, which you can modify, and it's good practice, if you are having trouble with such code, to check the returned pointer for NULL after the malloc() call. A NULL return is the API's way of signalling an error (usually meaning you ran out of allocation space on the "device heap").
A pointer allocated in this fashion cannot participate in (be used in) any host API for data movement, such as cudaMemcpy. It is usable/accessible from device code only.
Also note that the malloc() operation, like most device code you write, is performed per-thread. Each thread that executes the malloc() call will do so independently, and each thread (assuming no failures) will receive a separate pointer to a separate allocation. However, all such pointers are subsequently usable by any code running on that device, until they are explicitly freed.
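As a rough CUDA sketch (the kernel name and size are made up for illustration), a per-thread in-kernel allocation with the recommended NULL check might look like:

```cuda
#include <cstdio>

__global__ void worker(int n)
{
    // Each thread performs its own, independent allocation
    // from the device heap.
    int *buf = (int *)malloc(n * sizeof(int));
    if (buf == NULL) {
        // NULL is how in-kernel malloc signals failure, usually
        // because the device heap (cudaLimitMallocHeapSize) is exhausted.
        printf("thread %d: device malloc failed\n", threadIdx.x);
        return;
    }
    for (int i = 0; i < n; ++i)
        buf[i] = i;
    free(buf);  // return this thread's allocation to the device heap
}
```

If a pointer obtained this way needs to outlive the kernel, it can be stashed in global memory and used by a later kernel, since the allocation persists until freed or until the application ends.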
What is the difference between the functions `malloc()` and `kmalloc()`?
I'll answer the second question, assuming that you are using Linux. Regarding the first one, please have a look at my comment.
kmalloc uses get_free_page to get the memory. The way in which the pages are collected depends on the second parameter (GFP_ATOMIC, GFP_KERNEL, ..., in which GFP means Get Free Page). The advantage of kmalloc over the raw GFP interface is that it can fit multiple allocations into a single page.
Some of the options for kmalloc are:
GFP_USER - Allocate memory on behalf of user. May sleep.
GFP_KERNEL - Allocate normal kernel ram. May sleep.
GFP_ATOMIC - Allocation will not sleep. May use emergency pools. For example, use this inside interrupt handlers.
GFP_HIGHUSER - Allocate pages from high memory.
GFP_NOIO - Do not do any I/O at all while trying to get memory.
GFP_NOFS - Do not make any fs calls while trying to get memory.
GFP_NOWAIT - Allocation will not sleep.
GFP_THISNODE - Allocate node-local memory only.
GFP_DMA - Allocation suitable for DMA. Should only be used for kmalloc caches. Otherwise, use a slab created with SLAB_DMA.
Apart from this, get_free_page and kmalloc are very similar. __get_free_pages differs from get_free_page in that it gives the pointer to the first byte of a memory area that is potentially several (physically contiguous) pages long.
Another function that is again very similar to get_free_page is get_zeroed_page(unsigned int flags), which gets a single page like get_free_page but zeroes the memory.
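A minimal sketch of kmalloc usage in a Linux kernel module (illustrative boilerplate, not a complete driver; build environment for out-of-tree modules assumed):

```c
#include <linux/module.h>
#include <linux/slab.h>   /* kmalloc, kfree */

static char *buf;

static int __init demo_init(void)
{
    /* GFP_KERNEL: normal allocation, may sleep; fine in process
     * context, but use GFP_ATOMIC in interrupt handlers instead. */
    buf = kmalloc(128, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;   /* kmalloc returns NULL on failure */
    return 0;
}

static void __exit demo_exit(void)
{
    kfree(buf);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```

Note that the flag is passed on every call rather than fixed at allocator creation, because the same code path may need different sleeping behaviour depending on the context it runs in.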
Why doesn't malloc allocate memory until I hit a certain threshold?
1 - Why doesn't it allocate when the requested size is relatively small?
The task of the function malloc is to provide the application with memory whenever it asks for it. Theoretically, malloc could, as you suggest, just forward all memory allocation requests to the operating system's kernel, so that it only acts as a wrapper for the kernel's memory allocator. However, this has the following disadvantages:
- The kernel only provides large amounts of memory at once, at least one page of memory, which is, depending on the configuration of the operating system, normally at least 4096 bytes. Therefore, if an application asked for only 10 bytes of memory, a lot of memory would be wasted.
- System calls are expensive in terms of CPU performance.
For these reasons, it is more efficient for malloc not to forward memory allocation requests directly to the kernel, but rather to act as an intermediary between the application's allocation requests and the kernel. It requests memory in larger amounts from the kernel, so that it can satisfy many smaller allocation requests from the application. Therefore, only when asking for a large amount of memory at once will malloc forward that request to the kernel.
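In glibc specifically, the cut-off is the tunable M_MMAP_THRESHOLD (128 KiB by default): requests above it are typically served by a dedicated mmap() rather than from the arena malloc already holds. A small sketch of both cases, which look identical to the caller:

```c
#include <stdlib.h>

/* Requests a small and a large block. In glibc the small one is
 * carved out of an existing arena with no system call at all,
 * while the large one (above M_MMAP_THRESHOLD) is likely a fresh
 * mmap() region handed straight back by the kernel.
 * Returns 1 if both allocations succeed. */
int small_and_large_succeed(void)
{
    char *small = malloc(10);            /* served from malloc's arena */
    char *large = malloc(1024 * 1024);   /* likely a dedicated mmap()  */
    int ok = (small != NULL) && (large != NULL);
    free(small);
    free(large);
    return ok;
}
```

Tools like strace make the difference visible: the small request usually generates no syscall, while the large one shows up as an mmap() of roughly the requested size.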
2 - Why is the allocated memory size not exactly the same? In the first run, it shows that the size is 1004KB while I've only allocated 1000KB.
The malloc allocator must keep track of all the memory allocations it has granted to the application, and also of all the memory it has been granted by the kernel. To store this information, it requires a bit of additional memory space. This additional space is called "overhead".
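On glibc you can observe this rounding-up directly with malloc_usable_size() (a glibc extension, not portable C):

```c
#include <malloc.h>   /* malloc_usable_size (glibc extension) */
#include <stdlib.h>

/* Returns the number of bytes actually reserved for a request of
 * `requested` bytes. The allocator may round up for alignment and
 * bookkeeping, which is part of why measured memory usage exceeds
 * what the application literally asked for. */
size_t usable_size_for(size_t requested)
{
    void *p = malloc(requested);
    size_t n = p ? malloc_usable_size(p) : 0;
    free(p);
    return n;
}
```

For small requests the reserved size is frequently a little larger than the request; per-allocation headers and chunk alignment account for the rest of the gap you measured.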
CUDA/C - Using malloc in kernel functions gives strange results
In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap, and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.
How does the kernel stop you using malloc?
It's not so much that it's locked down. It's just that your kernel module has no idea where malloc() is. The malloc() function is part of the C standard library, which is loaded alongside programs in user space. When a userland program is executed, the linker loads the shared libraries the program needs and figures out where the required functions are. So it loads libc at some address, and malloc() ends up at some offset from that. When your program calls malloc(), it actually calls into libc.
Your kernel module isn't linked against libc or any other userspace components. It's linked against the kernel, which doesn't include malloc. Your kernel driver can't depend on the address of anything in userspace, because it may have to run in the context of any userspace program or even in no context, like in an interrupt. So the code for malloc() may not even be in memory anywhere when your module runs. Now if you knew that you were running in the context of a process that had libc loaded, and knew the address that malloc() was located at, you could potentially call that address by storing it in a function pointer. Bad things would probably happen though, possibly including a kernel panic. You don't want to cross userspace and kernelspace boundaries except through sane, well defined interfaces.