Change the RLIMIT_NPROC in Linux


Take a look at /etc/security/limits.conf, or at /etc/security/limits.d/ if the latter exists in your installation. Don't forget to log in again afterward for the change to take effect.
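For example, to raise the per-user process limit for a hypothetical user alice (the user name and values here are placeholders, not recommendations), the file would contain lines like:

```
# /etc/security/limits.conf format: <domain> <type> <item> <value>
alice    soft    nproc    4096
alice    hard    nproc    8192
```

The soft limit is what processes see by default; the hard limit is the ceiling an unprivileged process may raise its soft limit to.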

Which is the better way to edit the RLIMIT_NPROC value?

First, I believe you are wrong to have nearly a thousand threads. Threads are quite costly, and it is usually not reasonable to have so many of them. I would suggest having a few dozen threads at most (unless you run on a very costly supercomputer).

You could have some event loop around a multiplexing syscall like poll(2). A single thread can then deal with many thousands of connections. Read about the C10K problem and epoll, and consider using an event library such as libevent or libev.

You could start your application as root (perhaps via setuid techniques), set up the required resources (in particular, opening privileged TCP/IP ports), and then change the user with setreuid(2).

Read Advanced Linux Programming...

You could also wrap your application in a tiny setuid C program which increases the limits using setrlimit(2), changes the user with setreuid(2), and finally execve(2)s your real program.

Why does setrlimit(RLIMIT_NPROC) not work when run as root, but work fine when run as a normal user?

The following proposed code:

  1. cleanly compiles
  2. fails to perform the desired functionality (?why?)
  3. incorporates all the needed header files
  4. only the 'parent' tries to create child processes
  5. Note: both the OP's program and the proposed program exit without waiting for the child processes to finish; i.e., the main program should be calling wait() or waitpid() for each child process started.
  6. Note: the call to sleep(1) keeps the output nice and organized. However, during that sleep each child completes and exits, so there is actually only one child process running at any one time; even if the call to setrlimit() had been successful, that fork() loop could have kept running without fork() ever failing.

and now, the proposed code:

#include <stdio.h>
#include <stdlib.h>

#include <sys/time.h>
#include <sys/resource.h>

#include <sys/types.h>
#include <unistd.h>

int main( void )
{
    struct rlimit rlim;
    rlim.rlim_cur = rlim.rlim_max = 4;

    if( getrlimit(RLIMIT_NPROC, &rlim) == -1 )
    {
        perror( "getrlimit failed" );
        exit( EXIT_FAILURE );
    }

    if( setrlimit(RLIMIT_NPROC, &rlim) == -1 )
    {
        perror( "setrlimit failed" );
        exit( EXIT_FAILURE );
    }

    for( int i = 0; i < 4; ++i )
    {
        pid_t pid = fork();
        switch( pid )
        {
            case -1:
                perror( "fork failed" );
                exit( EXIT_FAILURE );
                break;

            case 0:
                printf( "child pid: %d\n", getpid() );
                exit( EXIT_SUCCESS );
                break;

            default:
                printf( "parent pid: %d\n", getpid() );
                break;
        }
        sleep(1);
    }
    return 0;
}

a run of the program results in:

fork failed: Resource temporarily unavailable

which indicates a problem with the call to setrlimit()

From the man page:

RLIMIT_NPROC
      This is a limit on the number of extant processes (or, more
      precisely on Linux, threads) for the real user ID of the
      calling process.  So long as the current number of processes
      belonging to this process's real user ID is greater than or
      equal to this limit, fork(2) fails with the error EAGAIN.

      The RLIMIT_NPROC limit is not enforced for processes that
      have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE
      capability.

So RLIMIT_NPROC limits the total number of processes (or, more precisely on Linux, threads) belonging to the calling process's real user ID, not just this program's children, and it is not enforced at all for privileged processes. That exemption is why setrlimit(RLIMIT_NPROC) appears to do nothing when the program is run as root.

However, suppose we add a couple of print statements immediately after the call to getrlimit() and again after the call to setrlimit():

    if( getrlimit(RLIMIT_NPROC, &rlim) == -1 )
    {
        perror( "getrlimit failed" );
        exit( EXIT_FAILURE );
    }

    printf( "soft limit: %d\n", (int)rlim.rlim_cur );
    printf( "hard limit: %d\n\n", (int)rlim.rlim_max );

    if( setrlimit(RLIMIT_NPROC, &rlim) == -1 )
    {
        perror( "setrlimit failed" );
        exit( EXIT_FAILURE );
    }

    if( getrlimit(RLIMIT_NPROC, &rlim) == -1 )
    {
        perror( "getrlimit failed" );
        exit( EXIT_FAILURE );
    }

    printf( "soft limit: %d\n", (int)rlim.rlim_cur );
    printf( "hard limit: %d\n\n", (int)rlim.rlim_max );

The result is then:

soft limit: 27393
hard limit: 27393

soft limit: 27393
hard limit: 27393

parent pid: 5516
child pid: 5517
parent pid: 5516
child pid: 5518
parent pid: 5516
child pid: 5519
parent pid: 5516
child pid: 5520

which shows that the limits were never actually changed (and explains the "?why?" above): the getrlimit() call runs after the assignment rlim.rlim_cur = rlim.rlim_max = 4 and overwrites those values, so setrlimit() merely re-applies the limits already in effect.

Note: I'm running Ubuntu Linux 18.04.

Apache 2.4 hits RLIMIT_NPROC: hidden processes?

Found the problem thanks to the suggestion from @sarnold. My application depends on mpm_prefork, and up until Ubuntu 13.04 this module was automatically enabled when the apache2-mpm-prefork package was installed. I assumed this was still the case, but it turned out that mpm_event was running.

It seems that in Apache 2.4 the packaging of MPMs has changed, and mpm_prefork needs to be enabled manually after installation:

sudo a2dismod mpm_event
sudo a2enmod mpm_prefork
sudo service apache2 restart

Now the problems seem to have disappeared.

Multiple instances of Python running simultaneously limited to 35

Decomposing the Error Message

Your error message includes the following hint:

OpenBLAS blas_thread_init: pthread_create: Resource temporarily unavailable
OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 2067021 max

RLIMIT_NPROC controls the total number of processes a user can have. More specifically, since it is a per-process setting, when fork(), clone(), vfork(), etc. are called by a process, that process's RLIMIT_NPROC value is compared to the total process count of the process's real user. If that value is exceeded, the call fails with EAGAIN, as you've experienced.

The error message indicates that OpenBLAS was unable to create additional threads because your user had used all the threads RLIMIT_NPROC had given it.

Since you're running on a cluster, it's unlikely that your user is running many threads (unlike, say, if you were on your personal machine and browsing the web, playing music, &c), so it's reasonable to conclude that OpenBLAS is trying to start multiple threads.

How OpenBLAS Uses Threads

OpenBLAS can use multiple threads to accelerate linear algebra. You may want many threads for solving a single, larger problem quickly. You may want fewer threads for solving many smaller problems simultaneously.

OpenBLAS has several ways to limit the number of threads it uses. These are controlled via:

export OPENBLAS_NUM_THREADS=4
export GOTO_NUM_THREADS=4
export OMP_NUM_THREADS=4

The priorities are OPENBLAS_NUM_THREADS > GOTO_NUM_THREADS > OMP_NUM_THREADS. (I think this means that OPENBLAS_NUM_THREADS overrides OMP_NUM_THREADS; however, OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.)

If none of the foregoing variables is set, OpenBLAS will run with a number of threads equal to the number of cores on the machine (32 in your case).

Your Situation

Your cluster has 32-core CPUs. You're trying to run 36 instances of Python. Each instance requires 1 thread for Python + 32 threads for OpenBLAS. You'll also need 1 thread for your SSH connection and 1 thread for your shell. That means that you need 36*(32+1)+2=1190 threads.

The nuclear option for fixing the problem is to use:

export OPENBLAS_NUM_THREADS=1

which should bring you down to 36*(1+1)+2=74 threads.

Since you have spare capacity, you could adjust OPENBLAS_NUM_THREADS to a higher value, but then the OpenBLAS instances owned by your separate Python processes will interfere with each other. So there's a trade-off between how fast you get one solution versus how fast you can get many solutions. Ideally, you can solve this trade-off by running fewer Pythons per node and using more nodes.

Is there a programmatic way in C to determine the number of processes ever used in a group of processes under Linux?

To enforce the RLIMIT_NPROC limit, the Linux kernel reads the p->real_cred->user->processes field in the copy_process function (on fork(), for example):
http://lxr.free-electrons.com/source/kernel/fork.c?v=4.8#L1371

1371         if (atomic_read(&p->real_cred->user->processes) >=
1372                         task_rlimit(p, RLIMIT_NPROC)) {

or in sys_execve (do_execveat_common in fs/exec.c):

1504         if ((current->flags & PF_NPROC_EXCEEDED) &&
1505             atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
1506                 retval = -EAGAIN;
1507                 goto out_ret;

So, if the user's process count exceeds RLIMIT_NPROC, the function fails. This field is defined as part of struct user_struct (reached via the struct cred real_cred pointer) in sched.h as:

 atomic_t processes;    /* How many processes does this user have? */

So the process count accounting is per-user.

The field is decremented in copy_process in case of failure:

1655 bad_fork_cleanup_count:
1656         atomic_dec(&p->cred->user->processes);

And the field is incremented in copy_creds: http://code.metager.de/source/xref/linux/stable/kernel/cred.c#313

313 /*
314  * Copy credentials for the new process created by fork()
315  *
316  * We share if we can, but under some circumstances we have to generate a new
317  * set.
318  *
319  * The new process gets the current process's subjective credentials as its
320  * objective and subjective credentials
321  */
322 int copy_creds(struct task_struct *p, unsigned long clone_flags)
...
339         atomic_inc(&p->cred->user->processes);
...
372         atomic_inc(&new->user->processes);

The man page confirms that it is a per-user limit: http://man7.org/linux/man-pages/man2/setrlimit.2.html

RLIMIT_NPROC
      The maximum number of processes (or, more precisely on Linux,
      threads) that can be created for the real user ID of the
      calling process.  Upon encountering this limit, fork(2) fails
      with the error EAGAIN.

