Unshare User Namespace and Set UId Mapping with Newuidmap

catanman

[nobody@host ~]$ newuidmap 7134 65534 5000 1
newuidmap: write to uid_map failed: Operation not permitted

Any idea why this is failing?

Documentation (http://man7.org/linux/man-pages/man7/user_namespaces.7.html) states the following:

The child process created by clone(2) with the CLONE_NEWUSER flag
starts out with a complete set of capabilities in the new user
namespace. <...> Note that a call to execve(2) will cause a process's
capabilities to be recalculated in the usual way (see
capabilities(7)).

This happens because unshare calls 'exec bash' before returning control to the user, so you lose the necessary capabilities and cannot change uid_map/gid_map from within the user namespace.

Still, if you compile an application that updates uid_map/gid_map before calling exec (e.g. a slightly modified version of the example from user_namespaces(7)), the update will succeed.

But when I run a process from within the new namespace it still seems
to run as root instead of UID 5000:

What am I missing?

The mapping does not change the user. The mapping links IDs in a child namespace to IDs in its parent namespace, not the other way round.
You can call setuid(2) or seteuid(2) from within a child namespace to switch to other credentials from the same user namespace. Those IDs must of course be mapped onto values in the parent namespace; otherwise the setuid()/seteuid() call fails with EINVAL.

Here are two examples:

Example 1. Suppose we have created a child user namespace.

arksnote linux-namespaces   # unshare -U bash
nobody@arksnote ~  $ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
nobody@arksnote ~  $ echo $$
18526

Now let's link root in the parent namespace to some ID (0 in this case) in the child namespace:

arksnote linux-namespaces   # newuidmap 18526 0 0 1
arksnote linux-namespaces   # cat /proc/18526/uid_map
         0          0          1

Here's what happens to the child namespace:

nobody@arksnote ~  $ id
uid=0(root) gid=65534(nobody) groups=65534(nobody)

You can try some other mappings, like newuidmap 18526 1 0 1 and see that it is applied to the child user namespace, not the parent one.

Example 2. Now we do not set a mapping for root:

arksnote linux-namespaces   # newuidmap 18868 0  1000 1
arksnote linux-namespaces   # cat /proc/18868/uid_map
         0       1000          1

In this case root is left unmapped in the child user namespace:

nobody@arksnote ~  $ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)

What you did with [root@host ~]$ newuidmap 7134 65534 5000 1 was to associate UID 5000 in the parent namespace with UID 65534 in the child namespace, but the process still runs as root. It is shown as 65534 only because that value is used for any unmapped ID:

getuid() and getgid() return the values from /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid, respectively, for UIDs/GIDs that do not have a mapping. The value corresponds to a special user without any system rights: nobody, as you can see in the uid/gid in the output above.

See Unmapped user and group IDs in user_namespaces(7).

unshare user namespace, fork, map uid then execvp failing

Pretty sure you've already found the answer, but this is a minimal sample I could come up with:

// gcc -Wall -std=c11
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sched.h>
#include <sys/wait.h>
#include <stdlib.h>
#include <stdarg.h>

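void write_to_file(const char *which, const char *format, ...) {
  FILE *fu = fopen(which, "w");
  if (fu == NULL) {
    perror(which);
    exit(1);
  }
  va_list args;
  va_start(args, format);
  int written = vfprintf(fu, format, args);
  va_end(args);
  // the data reaches the kernel on flush, so a rejected mapping shows up in fclose
  if (written < 0 || fclose(fu) != 0) {
    perror("cannot write");
    exit(1);
  }
}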

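int main(int argc, char ** argv) {
  if (argc < 2) {
    fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
    exit(1);
  }
  // array of strings, terminated with NULL entry (calloc zero-fills it)
  char **cmd_and_args = (char**) calloc(argc, sizeof(char*));
  if (cmd_and_args == NULL) {
    perror("calloc");
    exit(1);
  }
  for (int i = 1 ; i < argc; i++) {
    cmd_and_args[i-1] = argv[i];
  }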
  uid_t uid = getuid();
  gid_t gid = getgid();
  // first unshare
  if (0 != unshare(CLONE_NEWUSER)) {
    fprintf(stderr, "%s\n", "USER unshare has failed");
    exit(1);
  }
  // remap uid (uid_t is unsigned, so print it with %u)
  write_to_file("/proc/self/uid_map", "0 %u 1", (unsigned) uid);
  // deny setgroups before writing gid_map (see user_namespaces(7))
  write_to_file("/proc/self/setgroups", "deny");
  // remap gid
  write_to_file("/proc/self/gid_map", "0 %u 1", (unsigned) gid);
  // exec the command
  if (execvp(cmd_and_args[0], cmd_and_args) < 0) {
    perror("cannot execvp");
    exit(1);
  }
  // unreachable
  free(cmd_and_args);
  return 0;
}

unshare command doesn't create new PID namespace

Solution

You should add the --fork and --mount-proc switches to unshare, as stated in the man page:

-f, --fork
          Fork the specified program as a child process of unshare rather than running it directly. This is useful
          when creating a new PID namespace. Note that when unshare is waiting for the child process, then it
          ignores SIGINT and SIGTERM and does not forward any signals to the child. It is necessary to send
          signals to the child process.

Explanation (from `man pid_namespaces`)

a process's PID namespace membership is determined when the process is created and cannot be changed thereafter.

What unshare actually does when you supply --pid is point /proc/[PID]/ns/pid_for_children of the current process at the new PID namespace, causing children subsequently created by this process to be placed in a different PID namespace (its children, not itself; this is important!).

So, when you supply --fork to unshare, it will fork your program (in this case busybox sh) as a child process of unshare and place it in the new PID namespace.

Why do I need `--mount-proc` ?

Try running unshare with only --pid and --fork and let's see what happens.

wendel@gentoo-grill ~ λ sudo unshare --pid --fork busybox sh
/home/wendel # echo $$
1
/home/wendel # ps
PID   USER     TIME  COMMAND
12443 root      0:00 unshare --pid --fork busybox sh
12444 root      0:00 busybox sh
24370 root      0:00 {ps} busybox sh
.
.
. // bunch more

From echo $$ we can see that the PID is actually 1, so we know that we must be in the new PID namespace, but when we run ps we see other processes as if we were still in the parent PID namespace.

This is because /proc is a special filesystem called procfs that the kernel creates in memory. From the man page:

A /proc filesystem shows (in the /proc/[pid] directories) only processes visible in the PID namespace of the process that performed the mount, even if the /proc filesystem is viewed from processes in other namespaces.

So, in order for tools such as ps to work correctly, we need to re-mount /proc using a process in the new namespace.

But, assuming that your process is in the root mount namespace, re-mounting /proc will mess up many things for other processes in the same mount namespace, because now they can't see anything (in /proc). So you should also put your process in a new mount namespace.

The good thing is that unshare has --mount-proc.

--mount-proc[=mountpoint]
          Just before running the program, mount the proc filesystem at mountpoint (default is /proc). This is useful when creating a new PID namespace. It also implies creating a new mount namespace since the /proc mount would
          otherwise mess up existing programs on the system. The new proc filesystem is explicitly mounted as private (with MS_PRIVATE|MS_REC).

Let's verify that --mount-proc also puts your process in a new mount namespace.

bash outside:

wendel@gentoo-grill ~ λ ls -go /proc/$$/ns/{user,mnt,pid}
lrwxrwxrwx 1 0 Aug  9 10:05 /proc/17011/ns/mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 0 Aug  9 10:10 /proc/17011/ns/pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Aug  9 10:10 /proc/17011/ns/user -> 'user:[4026531837]'

busybox:

wendel@gentoo-grill ~ λ doas ls -go /proc/16436/ns/{user,mnt,pid}
lrwxrwxrwx 1 0 Aug  9 10:05 /proc/16436/ns/mnt -> 'mnt:[4026533479]'
lrwxrwxrwx 1 0 Aug  9 10:04 /proc/16436/ns/pid -> 'pid:[4026533481]'
lrwxrwxrwx 1 0 Aug  9 10:17 /proc/16436/ns/user -> 'user:[4026531837]'

Notice that their user namespace is the same but mount and pid aren't.

Note: You can see that I cited a lot from man pages. If you want to learn more about Linux namespaces (or anything Unix, really), the first thing to do is read the man page for each namespace. They are well written and really informative.

Cannot open uid_map for writing from an app with cap_setuid capability set

The research

I have found the reason. During my research I found that the uid_map file cannot be opened because its ownership is changed to root.

Unprivileged process, no capabilities:

parent(m): capabilities: '='
parent(m): file /proc/4644/uid_map owner uid: 1000
parent(m): file /proc/4644/uid_map owner gid: 1000

Unprivileged process, capabilities are set (cap_setuid=pe):

parent(m): capabilities: '= cap_setuid+ep'
parent(m): file /proc/4644/uid_map owner uid: 0
parent(m): file /proc/4644/uid_map owner gid: 0
ERROR: open /proc/4668/uid_map: Permission denied

The following research has led me to this topic: what causes proc pid resources to become owned by root?

The rules on "dumpable" flag

This is what happens:

1) When a process is not dumpable, its /proc/<pid> inodes are given root ownership:

// fs/proc/base.c

struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task)
...
        if (task_dumpable(task)) {
                rcu_read_lock();
                cred = __task_cred(task);
                inode->i_uid = cred->euid;
                inode->i_gid = cred->egid;
                rcu_read_unlock();
        }

2) The process is dumpable only when its "dumpable" attribute has a value 1 (SUID_DUMP_USER). See ptrace(2).

3) prctl(2) clarifies the situation further:

  Normally, this flag is set to 1.  However, it is reset to the
          current value contained in the file /proc/sys/fs/suid_dumpable
          (which by default has the value 0), in the following
          circumstances:

          *  The process's effective user or group ID is changed.

          *  The process's filesystem user or group ID is changed (see
             credentials(7)).

          *  The process executes (execve(2)) a set-user-ID or set-
             group-ID program, resulting in a change of either the
             effective user ID or the effective group ID.

          *  The process executes (execve(2)) a program that has file
             capabilities (see capabilities(7)), but only if the
             permitted capabilities gained exceed those already
             permitted for the process.

Thus my problem arose from the last of the above rules:

int commit_creds(struct cred *new)
<...> 
    /* dumpability changes */
    if (!uid_eq(old->euid, new->euid) ||
        !gid_eq(old->egid, new->egid) ||
        !uid_eq(old->fsuid, new->fsuid) ||
        !gid_eq(old->fsgid, new->fsgid) ||
        !cred_cap_issubset(old, new)) {
            if (task->mm)
                    set_dumpable(task->mm, suid_dumpable);

Fixes

There are a number of ways to overcome the issue:

Globally change /proc/sys/fs/suid_dumpable:

echo 1 > /proc/sys/fs/suid_dumpable

Set "dumpable" flag just for the process:

prctl(PR_SET_DUMPABLE, 1, 0, 0, 0)

How to test user namespace with clone system call with CLONE_NEWUSER flag

Unprivileged user namespaces are probably disabled. Since you don't check the return value of clone, you won't notice. Running the program through strace on my system prints:

.... startup stuff ...
clone(child_stack=0x55b41f2a4070, flags=CLONE_NEWUSER) = -1 EPERM (Operation not permitted)
geteuid()                               = 1000
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 6), ...}) = 0
brk(NULL)                               = 0x55b4200b8000
brk(0x55b4200d9000)                     = 0x55b4200d9000
write(1, "UID outside the namespace is 100"..., 34UID outside the namespace is 1000
) = 34
getegid()                               = 1000
write(1, "GID outside the namespace is 100"..., 34GID outside the namespace is 1000
) = 34
wait4(-1, NULL, 0, NULL)                = -1 ECHILD (No child processes)
exit_group(0)   = ?

So clone, and therefore waitpid, fails: there is no child process.

See here for how to enable unprivileged user namespaces: https://superuser.com/questions/1094597/enable-user-namespaces-in-debian-kernel

grantpt report error after unshare

Since I've had the same issue I have also looked into this. Here are my findings:

grantpt(3) tries to ensure that the slave pseudo terminal has its group set to the special tty group (or whatever TTY_GROUP is when compiling glibc):

static int tty_gid = -1;
if (__glibc_unlikely (tty_gid == -1))
  {
    char *grtmpbuf;
    struct group grbuf;
    size_t grbuflen = __sysconf (_SC_GETGR_R_SIZE_MAX);
    struct group *p;

    /* Get the group ID of the special `tty' group.  */
    if (grbuflen == (size_t) -1L)
      /* `sysconf' does not support _SC_GETGR_R_SIZE_MAX.
         Try a moderate value.  */
      grbuflen = 1024;
    grtmpbuf = (char *) __alloca (grbuflen);
    __getgrnam_r (TTY_GROUP, &grbuf, grtmpbuf, grbuflen, &p);
    if (p != NULL)
      tty_gid = p->gr_gid;
  }
gid_t gid = tty_gid == -1 ? __getgid () : tty_gid;

/* Make sure the group of the device is that special group.  */
if (st.st_gid != gid)
  {
    if (__chown (buf, uid, gid) < 0)
      goto helper;
  }

See https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/grantpt.c;h=c04c85d450f9296efa506121bcee022afda3e2dd;hb=HEAD#l137.

On my system, the tty group has GID 5. However, that group isn't mapped into your user namespace, so the chown(2) fails because GID 5 doesn't exist there. glibc then falls back to executing the pt_chown helper, which also fails. I haven't looked into the details of why it fails, but I assume it's because it's setuid nobody unless you mapped the root user to your user namespace. Here's strace output that shows the failing operation:

[pid    30] chown("/dev/pts/36", 1000, 5) = -1 EINVAL (Invalid argument)

This gives you a couple of methods to work around this problem:

Map the required groups (i.e. tty), which may not be possible without CAP_SYS_ADMIN in the binary that opens the user namespace
Use subuids and subgids together with newuidmap(1) and newgidmap(1) to make these groups available (this might work, but I haven't tested it).
Make changes that avoid the failure of the chown(2) call, e.g. by using a mount namespace and changing the GID of the tty group in /etc/group to your user's GID.
Avoid the chown(2) call, e.g. by making the st.st_gid != gid check false; this should be possible by deleting the tty group from your target mount namespace's /etc/group. Of course, that may cause other problems.
