How to Configure Linux Capabilities Per User

How to set CAP_SYS_NICE capability to a Linux user?

Jan Hudec is right that a process can't just give itself a capability, and a setuid wrapper is the obvious way to get the capability. Also, keep in mind that you'll need to call prctl(PR_SET_KEEPCAPS, ...) before you drop root. (See the prctl man page for details.) Otherwise, you'll lose the capability when you transition to your non-root real user ID.

If you really just want to launch user sessions with a different allowed nice level, you might see the pam_limits and limits.conf man pages, as the pam_limits module allows you to change the hard nice limit. It could be a line like:

yourspecialusername hard nice -10

Changing user IDs for assigning additional capabilities

If your program executes with effective user ID root, then you do have root privileges.

In Linux, capabilities are divided into three sets: inheritable, permitted, and effective. Inheritable defines which capabilities stay permitted across an exec(). Permitted defines which capabilities are permitted for a process. Effective defines which capabilities are currently in effect.

Edited to add: When the filesystem containing the binary that will be exec()'d supports filesystem capabilities, these always affect which capabilities the executed process will have. See the "Transformation of capabilities during execve()" section of the capabilities(7) man page (man 7 capabilities).

When a process changes its user IDs from root to non-root, the effective capability set is always cleared. By default, the permitted capability set is cleared as well, but calling prctl(PR_SET_KEEPCAPS, 1L) before the identity change tells the kernel to keep the permitted set intact.

Therefore, to have the CAP_NET_RAW capability, your program has to have it in both permitted and effective sets. If you wish for the CAP_NET_RAW to remain in effect over an exec(), it must be included in all three capability sets.

Edited to add: If file capabilities are supported for the target of the exec(), the file capabilities must also contain those capabilities in the inheritable and effective sets. (Only including the capability in the file's inheritable and effective sets does not grant the capability, since it is not in the permitted set of the file capabilities; but it is enough to allow passing the capability from the executor to the execee, if the executor has the capability.)

You can use the setcap command to grant specific capabilities to a binary. (Most Linux filesystems nowadays support these file capabilities.) The binary does not need to be run by a privileged user, or be setuid. Just remember to add the desired capabilities to both the permitted and effective sets.

Edited to add some examples:

Grant CAP_NET_RAW to /usr/bin/myprog (which must NOT be setuid or setgid root):

sudo setcap 'cap_net_raw=pe' /usr/bin/myprog

By default, do not grant CAP_NET_RAW to /usr/bin/myprog, but if the executor has the capability (in both inheritable and permitted sets), retain the capability (in inheritable and permitted sets, and activating it in the effective set):

sudo setcap 'cap_net_raw=ie' /usr/bin/myprog

If your program has to be setuid root anyway, then you can use e.g.

#define  _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/capability.h>
#include <sys/prctl.h>

#define   NEED_CAPS 1
static const cap_value_t need_caps[NEED_CAPS] = { CAP_NET_RAW };

int main(void)
{
    uid_t  real = getuid();
    cap_t  caps;

    /* Elevate privileges */
    if (setresuid(0, 0, 0))
        return 1; /* Fatal error, probably not setuid root */

    /* Add need_caps to current capabilities. */
    caps = cap_get_proc();
    if (!caps)
        return 1; /* Fatal error, could not read current capabilities */
    if (cap_set_flag(caps, CAP_PERMITTED,   NEED_CAPS, need_caps, CAP_SET) ||
        cap_set_flag(caps, CAP_EFFECTIVE,   NEED_CAPS, need_caps, CAP_SET) ||
        cap_set_flag(caps, CAP_INHERITABLE, NEED_CAPS, need_caps, CAP_SET))
        return 1; /* Fatal error */

    /* Update capabilities */
    if (cap_set_proc(caps))
        return 1; /* Fatal error */

    /* Retain capabilities over an identity change */
    if (prctl(PR_SET_KEEPCAPS, 1L))
        return 1; /* Fatal error */

    /* Return to original, real-user identity */ 
    if (setresuid(real, real, real))
        return 1; /* Fatal error */

    /* Because the identity changed, we need to
     * re-install the effective set. */
    if (cap_set_proc(caps))
        return 1; /* Fatal error */

    /* Capability set is no longer needed. */
    cap_free(caps);

    /* You now have the CAP_NET_RAW capability.
     * It will be retained over fork() and exec().
    */

    return 0;
}

Linux capabilities to launch process as root from a user mode program in C++

Try this (parent.cc):

#include <cstdio>      // perror()
#include <cstdlib>     // exit()
#include <iostream>
#include <sys/capability.h>
#include <unistd.h>

int main() {
    cap_t caps = cap_get_proc();
    cap_value_t val = CAP_SETUID;
    cap_set_flag(caps, CAP_EFFECTIVE, 1, &val, CAP_SET);
    if (cap_set_proc(caps)) {
        perror("failed to raise cap_setuid");
        exit(1);
    }
    if (setuid(0)) {
        perror("unable to setuid");
        exit(1);
    }
    execl("./child.sh", "child.sh", NULL);
    std::cout << "didn't work, uid=" << getuid();
    exit(1);
}

With this (child.sh):

#!/bin/bash
id -u

Build and set things up:

$ chmod +x child.sh
$ g++ -o parent parent.cc -lcap
$ sudo setcap cap_setuid=p ./parent

If you run ./parent it should work like this:

$ ./parent 
0

This example is single threaded. If your app is single threaded, it should be sufficient. If your program is multithreaded, you might need to explore something like libpsx.

Request Linux Capabilities During Runtime

The most common method to provide extra capabilities to a process is to assign filesystem capabilities to its binary.

For example, if you want the processes executing /sbin/yourprog to have the CAP_CHOWN capability, add that capability to the permitted and effective sets of that file: sudo setcap cap_chown=ep /sbin/yourprog.

The setcap utility is provided by the libcap2-bin package on Debian-based distributions (the package name differs elsewhere), and is installed by default on most Linux distributions.

It is also possible to provide the capabilities to the original process, and have that process manipulate its effective capability set as needed. For example, Wireshark's dumpcap is typically installed with CAP_NET_ADMIN and CAP_NET_RAW filesystem capabilities in the effective, permitted, and inheritable sets.

I dislike the idea of adding any filesystem capabilities to the inheritable set. When the capabilities are not in the inheritable set, executing another binary causes the kernel to drop those capabilities (assuming KEEPCAPS is zero; see prctl(PR_SET_KEEPCAPS) and man 7 capabilities for details).

As an example, if you granted /sbin/yourprog only the CAP_CHOWN capability and only in the permitted set (sudo setcap cap_chown=p /sbin/yourprog), then the CAP_CHOWN capability will not be automatically effective, and it will be dropped if the process executes some other binary. To use the CAP_CHOWN capability, a thread can add the capability to its effective set for the duration of the operations needed, then remove it from the effective set (but keep it in the permitted set), via capset() calls, which libcap wraps as cap_get_proc()/cap_set_proc(). Note that on Linux these calls change only the calling thread's capability sets; if every thread in the process must see the change, something like libpsx is needed.

For temporarily granting a capability, a worker sub-process can be used. This makes sense for a complex process, as it allows delegating the privileged operations to a separate binary. A child process is forked, connected to the parent via a Unix domain stream or datagram socket pair created with socketpair(), and executes the helper binary whose filesystem capabilities grant it the necessary privileges. The helper then uses the Unix domain socket to verify the identity of its peer (process ID, user ID, group ID, and, via the process ID, the executable the other end of the socket is running). A pipe is not used because the SO_PEERCRED socket option, which asks the kernel for the identity of the other end, requires a Unix domain stream socket or a datagram socket created with socketpair().

There are known attack patterns that need to be anticipated and thwarted. The most common one is causing the parent process to execute a compromised binary immediately after forking and executing the privileged child process, timed just right so that the capability-holding child trusts that the other end is its proper parent running the proper binary, when in fact control has been transferred to a completely different, compromised or untrustworthy binary.

The details of exactly how to do this securely are a software engineering question much more than a programming question, but the key steps are using socketpair(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0, fdpair) and verifying, more than just once at the beginning, that the socket peer is the parent process and that it is still executing the expected binary.

The simplest example I can think of is using capset() (for example via libcap) with a CAP_NET_BIND_SERVICE filesystem capability only in the permitted set, so that an otherwise unprivileged process can use a privileged port (1-1023, preferably a system-wide subset listed in a root- or admin-owned configuration file somewhere under /etc) to provide a network service. If the service will close and reopen its listening socket when told to do so (perhaps via a SIGUSR1 signal), the capability cannot simply be used once at the beginning to create the listening socket and then dropped for good. It is a pretty good match for the "keep in permitted set, but only add to effective set of the thread that actually needs it, then drop it immediately afterwards" pattern.

For CAP_CHOWN, an example program might acquire it into its effective and permitted sets via the filesystem capability, but use a trusted configuration file (modifiable by root/admin only) to list the ownership changes it is allowed to make, based on the real user and group identity running the process. (Consider a dedicated "sudo"-style "chown" utility, intended for, say, organizations that want to allow team leads to shift file ownership between their team members, but one that does not use sudo.)

Allow non-root user of container to execute binaries that need capabilities

Elaborating on my comment to the question, I think this is a problem with the ip code's attempt to suppress Ambient capabilities except in one specific command line case.

I've suggested a fix in this bug report: https://github.com/shemminger/iproute2/issues/62 (which appears to have been deleted, since this was not how iproute2 wants to receive bug reports). I was directed to try again with this email thread, so we can see how that report turns out.

I did develop a partial workaround you might want to try:

    $ sudo setcap cap_net_admin=ie ./ip
    $ sudo capsh --inh=cap_net_admin --user=`whoami` --
    $ ./ip ...

This works by using inheritable file capabilities instead of the permitted ones you were relying on. The ip code seems to prefer inheritable capabilities over permitted ones.