Malloc in Kernel

What happens in the kernel during malloc?

When user-space applications call malloc(), that call isn't implemented in the kernel. Instead, it's a library call (implemented in glibc or similar).

The short version is that the malloc implementation in glibc obtains memory either from the brk()/sbrk() system calls or as anonymous memory via mmap(). This gives glibc a large chunk of memory that is contiguous in virtual address space, which the malloc implementation further slices into smaller chunks and hands out to your application.

Studying a small example malloc implementation will give you the idea.

Note that nothing has touched physical memory yet -- that's handled by the kernel's virtual memory system when the process data segment is altered via brk()/sbrk() or mmap(), and later when the memory is first referenced (by a read or write to it).

To summarize:

  1. malloc() will search its managed pieces of memory to see if there's a piece of unused memory that satisfies the allocation request.
  2. Failing that, malloc() will try to extend the process data segment (via sbrk()/brk(), or in some cases mmap()). sbrk() ends up in the kernel.
  3. The brk()/sbrk() calls in the kernel adjust some of the offsets in the process's struct mm_struct, so the process data segment becomes larger. At first, no physical memory is mapped to the additional virtual addresses gained by extending the data segment.
  4. When that unmapped memory is first touched (likely a read or write by the malloc implementation), a fault handler kicks in and traps into the kernel, where the kernel assigns physical memory to the unmapped region.
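The steps above can be sketched as a tiny bump allocator, assuming Linux/glibc where sbrk() is available. This is illustrative only -- no free(), no reuse, not thread-safe; real glibc malloc is far more sophisticated:

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

#define CHUNK 4096 /* grow the data segment roughly one page at a time */

static char *heap_cur = NULL; /* next unused byte in our managed pool */
static char *heap_end = NULL; /* end of memory obtained from the kernel */

void *tiny_malloc(size_t size)
{
    size = (size + 15) & ~(size_t)15; /* keep allocations 16-byte aligned */
    if (heap_cur == NULL || heap_cur + size > heap_end) {
        /* Step 2: our pool is exhausted, ask the kernel for more. */
        size_t grow = (size > CHUNK) ? size : CHUNK;
        void *p = sbrk((intptr_t)grow);
        if (p == (void *)-1)
            return NULL; /* kernel refused to extend the data segment */
        /* Step 3: the kernel has only adjusted offsets; no physical
           memory is mapped yet. Restart the pool at the returned break
           in case libc itself moved the break in between. */
        heap_cur = (char *)p;
        heap_end = (char *)p + grow;
    }
    /* Step 1: carve the request out of our managed region. */
    void *result = heap_cur;
    heap_cur += size;
    /* Step 4 happens implicitly when the caller first touches the memory. */
    return result;
}
```

The name tiny_malloc and the 16-byte alignment policy are illustrative choices, not anything glibc mandates.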

Is the memory allocated using malloc inside a CUDA kernel accessible by threads of other blocks?

Yes, this memory comes from the so-called "device heap" and it is accessible by any device code (any thread) from any kernel running on that GPU.

Note that this applies even to kernels other than the one that actually did the malloc operation.

The above statement applies until application termination, or until you explicitly free that memory using an in-kernel free() call on the pointer.

You may wish to read the documentation on the in-kernel malloc() functionality. There are size limits, which you can modify, and it's good practice when you are having trouble with such code to check the returned pointer for NULL after the malloc() call. A NULL return is the API's way of signalling an error (usually meaning you ran out of allocation space on the "device heap").

A pointer allocated in this fashion cannot participate in (be used in) any host API for data movement, such as cudaMemcpy. It is usable/accessible from device code only.

Also note that the malloc() operation, like most device code you write, is performed per-thread. Each thread that executes the malloc() call does so independently, and each thread (assuming no failures) receives a separate pointer to a separate allocation. However, all such pointers are usable subsequently by any code running on that device, until they are explicitly freed.
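A minimal CUDA sketch of this pattern, with the NULL check discussed above (kernel names and sizes are illustrative assumptions, not from the original question):

    #include <cstdio>

    __global__ void alloc_kernel(int **slots, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        // Each thread gets its own, separate allocation from the device heap.
        int *p = (int *)malloc(n * sizeof(int));
        if (p == NULL) {
            // Out of device-heap space: raise the limit on the host with
            // cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...).
            printf("thread %d: device malloc failed\n", tid);
            return;
        }
        for (int i = 0; i < n; ++i) p[i] = tid;
        slots[tid] = p;  // publish: any thread in any later kernel may use it
    }

    __global__ void free_kernel(int **slots) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        free(slots[tid]);  // must use in-kernel free(), not cudaFree()
    }

Note that slots itself would have to be allocated with cudaMalloc on the host, since the device-heap pointers cannot be touched by host APIs.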

What is the difference between the functions malloc() and kmalloc()?

This answer addresses the second question, assuming that you are using Linux.

kmalloc uses get_free_page to obtain its memory. The way in which the pages are collected depends on the second parameter (GFP_ATOMIC, GFP_KERNEL, ..., in which GFP means Get Free Page). The advantage of kmalloc over the page-level allocators is that it can fit multiple allocations into a single page.

Some of the flag options for kmalloc are:

GFP_USER - Allocate memory on behalf of user. May sleep.
GFP_KERNEL - Allocate normal kernel ram. May sleep.
GFP_ATOMIC - Allocation will not sleep. May use emergency pools. For example, use this inside interrupt handlers.
GFP_HIGHUSER - Allocate pages from high memory.
GFP_NOIO - Do not do any I/O at all while trying to get memory.
GFP_NOFS - Do not make any fs calls while trying to get memory.
GFP_NOWAIT - Allocation will not sleep.
GFP_THISNODE - Allocate node-local memory only.
GFP_DMA - Allocation suitable for DMA. Should only be used for kmalloc caches. Otherwise, use a slab created with SLAB_DMA.

Apart from this, get_free_page and kmalloc are very similar. __get_free_pages differs from get_free_page in that it returns a pointer to the first byte of a memory area that is potentially several (physically contiguous) pages long.
Another function that is again very similar to get_free_page is get_zeroed_page(unsigned int flags), which obtains a single page like get_free_page but zeroes the memory.
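A kernel-module fragment (a sketch, not a standalone program -- it needs the kernel build system and headers) showing the typical kmalloc flag usage described above:

    #include <linux/slab.h>
    #include <linux/gfp.h>
    #include <linux/errno.h>

    static int example(void)
    {
        /* Normal allocation in process context: may sleep. */
        char *buf = kmalloc(128, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;

        /* In an interrupt handler you must not sleep: use GFP_ATOMIC. */
        char *irq_buf = kmalloc(64, GFP_ATOMIC);
        if (!irq_buf) {
            kfree(buf);
            return -ENOMEM;
        }

        kfree(irq_buf);
        kfree(buf);
        return 0;
    }

The function name example and the sizes are placeholders; the point is that the flags choose the allocation behaviour, and every kmalloc must be paired with kfree.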

Why malloc doesn't allocate memory until I hit a certain threshold?

1 - Why doesn't it allocate when the requested memory size is relatively small?

The task of the function malloc is to provide the application with memory, whenever it asks for it. Theoretically, malloc could, as you suggest, just forward all requests for memory allocations to the operating system's kernel, so that it only acts as a wrapper for the kernel's memory allocator. However, this has the following disadvantages:

  1. The kernel only provides large amounts of memory at once, at least one page of memory, which is, depending on the configuration of the operating system, normally at least 4096 bytes. Therefore, if an application asked for only 10 bytes of memory, a lot of memory would be wasted.
  2. System calls are expensive in terms of CPU performance.

For these reasons, it is more efficient for malloc to not forward memory allocation requests directly to the kernel, but to rather act as an intermediary between the application's memory allocation requests and the kernel. It requests memory in larger amounts from the kernel, so that it can satisfy many smaller memory allocation requests from the application.

Therefore, only when asking for a large amount of memory at once will malloc forward that memory allocation request to the kernel (glibc, for instance, serves requests above a tunable threshold, 128 KiB by default, with a dedicated mmap() call).
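This intermediary behaviour can be observed directly: two successive small malloc() calls are normally served from the same region that the allocator obtained from the kernel in one go, so the returned addresses sit just a few bytes apart -- far less than the 4096-byte page granularity the kernel deals in. This is allocator-specific; the sketch below assumes glibc on Linux, and the function name is illustrative:

```c
#include <stdlib.h>
#include <stdint.h>

/* Distance in bytes between two successive 16-byte allocations. */
size_t small_alloc_distance(void)
{
    char *a = malloc(16);
    char *b = malloc(16);
    /* Compare the addresses numerically to avoid relying on pointer
       subtraction between distinct objects. */
    uintptr_t ua = (uintptr_t)a, ub = (uintptr_t)b;
    size_t d = (ua > ub) ? (size_t)(ua - ub) : (size_t)(ub - ua);
    free(a);
    free(b);
    return d;
}
```

On glibc the two chunks are typically adjacent, so the distance is on the order of a few dozen bytes rather than a full page.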



2 - Why is the allocated memory size not exactly the same? In the first run, it shows that the size is 1004 KB while I've only allocated 1000 KB.

The malloc allocator must keep track of all the memory allocations it has granted to the application, and also of all the memory it has been granted by the kernel. Storing this information requires a bit of additional memory space. This additional space is called "overhead".
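The overhead can be glimpsed with malloc_usable_size(), a glibc extension (declared in <malloc.h>, not part of ISO C): the usable size of a chunk can exceed what was requested, because the allocator rounds sizes up to its bookkeeping granularity. A small sketch, with an illustrative helper name:

```c
#include <malloc.h>
#include <stdlib.h>

/* Returns the usable size glibc actually reserved for a request,
   or 0 if the allocation failed. */
size_t usable_size_of(size_t requested)
{
    void *p = malloc(requested);
    if (p == NULL)
        return 0;
    size_t usable = malloc_usable_size(p);
    free(p);
    return usable;
}
```

For a 10-byte request, glibc will typically report a usable size of 24 bytes: the request is rounded up, and chunk metadata lives just outside the returned pointer.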

CUDA/C - Using malloc in kernel functions gives strange results

In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap, and your two kernels then attempt to read and write uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.

The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:

 size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2 * L_CELLE);
 cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);

to your host code before any other API calls, should solve the problem.
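A sketch of the host-side setup, combined with the API error checking mentioned above (the CHECK macro is a common convention, not a CUDA API; N_CELLE and L_CELLE come from the question's code, so the values here are placeholders):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    #define N_CELLE 1024  /* placeholder; use the question's actual sizes */
    #define L_CELLE 64

    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(1);                                              \
            }                                                         \
        } while (0)

    int main() {
        /* Must run before the kernels that call in-kernel malloc(). */
        size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2 * L_CELLE);
        CHECK(cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize));
        /* ... kernel launches, each followed by CHECK(cudaGetLastError()) ... */
        return 0;
    }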

How does the kernel stop you using malloc?

It's not so much that it's locked down. It's just that your kernel module has no idea where malloc() is. The malloc() function is part of the C standard library, which is loaded alongside programs in userspace. When a userland program is executed, the linker loads the shared libraries needed by the program and figures out where the needed functions are. So it loads libc at some address, and malloc() ends up at some offset from that. When your program calls malloc(), it actually calls into libc.

Your kernel module isn't linked against libc or any other userspace components. It's linked against the kernel, which doesn't include malloc. Your kernel driver can't depend on the address of anything in userspace, because it may have to run in the context of any userspace program, or even in no context at all, such as in an interrupt. So the code for malloc() may not even be in memory anywhere when your module runs. Now, if you knew that you were running in the context of a process that had libc loaded, and knew the address where malloc() was located, you could potentially call that address by storing it in a function pointer. Bad things would probably happen, though, possibly including a kernel panic. You don't want to cross the userspace/kernelspace boundary except through sane, well-defined interfaces.


