Linux - understanding the mount namespace & clone CLONE_NEWNS flag

The “mount namespace” of a process is just the set of mounted filesystems that it sees. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, you must decide what to do when creating a child process with clone().

Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes: there was one global mount namespace, seen by all processes, and if any change was made (e.g. using the mount command) all processes would immediately see that change irrespective of their relationship to the mount command.

With per-process mount namespaces, a child process can now have a different mount namespace to its parent. The question now arises:

Should changes to the mount namespace made by the child propagate back to the parent?

Clearly, this functionality must at least be supported and, indeed, must probably be the default. Otherwise, launching the mount command itself would effect no change (since the filesystem as seen by the parent shell would be unaffected).

Equally clearly, it must also be possible for this necessary propagation to be suppressed, otherwise we can never create a child process whose mount namespace differs from its parent, and we have one global mount namespace again (the filesystem as seen by init).

Thus, we must decide when creating a child process with clone() whether the child process gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).

If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise, it gets a pointer to the parent's mount data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).

Now if I use clone() with CLONE_NEWNS to create a child process, does this mean that the child will get an exact copy of the mount points in the tree (5 & 6) and still be able to access the rest of the original tree?

Yes. It sees the exact same tree as its parent after the call to clone().

Does it also mean that the child could mount 5 & 6 at will, without affecting what's mounted at 5 or 6 in its parent process's mount namespace?

Yes. Since you've used CLONE_NEWNS, the child can unmount one device from 5 and mount another device there, and only it (and its children) can see the changes. No other process can see the changes made by the child in this case.

If yes, does it also mean that the child could mount / unmount a different directory than 5 or 6 and affect what's visible to the parent process?

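No. If you've used CLONE_NEWNS, the changes made in the child cannot propagate back to the parent. This assumes private mount propagation, which is the kernel default; mount points marked shared still forward mount and unmount events between namespaces.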

If you haven't used CLONE_NEWNS, the child would have received a pointer to the same mount namespace data as its parent, and any changes made by the child would be seen by any process that shares those data structures, including the parent. (This is also the case when the new child is created using fork().)
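The difference between sharing and copying can be observed from user space, because each mount namespace is identified by the target of the /proc/<pid>/ns/mnt symlink. The following is a minimal sketch, not a definitive implementation: the helper names are made up for illustration, and unshare(CLONE_NEWNS) called in a forked child stands in for clone() with CLONE_NEWNS, since Python has no direct clone() wrapper. Creating a new mount namespace needs CAP_SYS_ADMIN, so an unprivileged run reports an error for that case.

```python
import ctypes
import os

CLONE_NEWNS = 0x00020000  # value from <linux/sched.h>


def mount_namespace_id(pid="self"):
    """Return the mount namespace identifier of a process, e.g. 'mnt:[4026531840]'."""
    return os.readlink(f"/proc/{pid}/ns/mnt")


def child_mount_namespace_id(new_namespace=False):
    """fork() a child and return the mount namespace identifier it reports.

    With new_namespace=True the child first calls unshare(CLONE_NEWNS);
    if that fails it reports 'error: <reason>' instead of an identifier.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report through the pipe, then exit without running parent cleanup.
        try:
            os.close(read_fd)
            message = None
            if new_namespace:
                libc = ctypes.CDLL(None, use_errno=True)
                if libc.unshare(CLONE_NEWNS) != 0:
                    message = "error: " + os.strerror(ctypes.get_errno())
            if message is None:
                message = mount_namespace_id()
            os.write(write_fd, message.encode())
        finally:
            os._exit(0)
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        reported = pipe.read()
    os.waitpid(pid, 0)
    return reported


if __name__ == "__main__":
    print("parent:           ", mount_namespace_id())
    print("fork() child:     ", child_mount_namespace_id())
    print("CLONE_NEWNS child:", child_mount_namespace_id(new_namespace=True))
```

A plain fork() child should report the same identifier as its parent, while a child that successfully unshared its mount namespace should report a different one.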

Linux - use of term mount in clone man page

If you think of “the set of mounts” as being (at least) a set of (device, mount point) pairs, rather than merely a set of mount points, then it starts to look a lot like the fstab or the output of the mount command (with no arguments), albeit without the additional information about flags and options (e.g. rw, nosuid, etc.).

Such a “set of mounts” provides complete information about what filesystems are mounted where. This is, by definition, the “mount namespace” of a process. Once you go from the traditional situation of having one global mount namespace to having per-process mount namespaces, additional questions arise when a process fork()s.
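To make the "set of (device, mount point) pairs" view concrete, here is a small sketch that builds such a set from /proc/<pid>/mounts, which lists the mounts visible in that process's mount namespace. The function names are illustrative only. Note that a set collapses identical pairs, so a filesystem stacked twice on the same mount point shows up once.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into a set of (device, mount_point) pairs."""
    pairs = set()
    for line in text.splitlines():
        fields = line.split()
        # Each line is: device, mount point, type, options, dump, pass.
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return pairs


def current_mounts(pid="self"):
    """Return the (device, mount_point) pairs visible to the given process."""
    with open(f"/proc/{pid}/mounts") as mounts:
        return parse_mounts(mounts.read())


if __name__ == "__main__":
    for device, mount_point in sorted(current_mounts()):
        print(f"{device} on {mount_point}")
```

Comparing current_mounts() for two PIDs is a quick way to see whether their views of the filesystem have diverged.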

Traditionally, mounting or unmounting a filesystem changed the filesystem as seen by all processes.

With per-process mount namespaces, it is possible for a child process to have a different mount namespace from its parent. A question now arises:

Should changes to the mount namespace made by the child propagate back to the parent?

Thus, we must decide on fork() whether the child process gets its own copy of the data about mounted filesystems from the parent, which it can change without affecting the parent, or gets a pointer to the same data structures as the parent, which it can change (necessary for changes to propagate back, as when you launch mount from the shell).

If the CLONE_NEWNS flag is passed to clone(), the child gets a copy of its parent's mounted filesystem data, which it can change without affecting the parent's mount namespace. Otherwise (including with a plain fork(), which takes no flags), it gets a pointer to the parent's data structures, where changes made by the child will be seen by the parent (so the mount command itself can work).

mount() after clone() with CLONE_NEWNS set affects parent

This came down to two issues for me.

First, it seems that the version of Ubuntu (16.04.1 LTS) or of the util-linux package I'm using marks the / mount as shared, and a namespace created with CLONE_NEWNS inherits that setting. My / mount was shared; I verified this in /proc/self/mountinfo and /proc/1/mountinfo. Following this answer, I tried sudo mount --make-private -o remount / and upgraded the package it mentions (https://unix.stackexchange.com/questions/246312/why-is-my-bind-mount-visible-outside-its-mount-namespace). That allowed me to make an extra mount without any effect on the parent namespace.

The second problem was unmounting /proc. This didn't work because my system had /proc/sys/fs/binfmt_misc mounted twice. The discussion here inspired me to check that: http://linux-kernel.vger.kernel.narkive.com/aVUicig1/umount-proc-after-clone-newns-in-2-6-25

My final child_exec code ended up being:

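#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <unistd.h>

/* Entry point for clone(); arg is the NULL-terminated argv of the command to run. */
int child_exec(void *arg)
{
        char **commands = (char **)arg;
        printf("child...%s\n", commands[0]);
//      if (unshare(CLONE_NEWNS) < 0)
//              printf("unshare issue?\n");
        /* binfmt_misc is mounted twice on this system, so unmount it twice. */
        if (umount("/proc/sys/fs/binfmt_misc") < 0)
                printf("error unmount bin: %s\n", strerror(errno));
        if (umount("/proc/sys/fs/binfmt_misc") < 0)
                printf("error unmount bin: %s\n", strerror(errno));
        if (umount("/proc") < 0)
                printf("error unmount: %s\n", strerror(errno));
        /* Mount a fresh procfs in its place. */
        if (mount("proc", "/proc", "proc", 0, NULL) < 0)
                printf("error mount: %s\n", strerror(errno));
        execvp(commands[0], commands);
        /* execvp() returns only on failure. */
        printf("error exec: %s\n", strerror(errno));
        return 1;
}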
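The double mount of /proc/sys/fs/binfmt_misc described above is the kind of thing that can be spotted by counting how often each mount point appears in /proc/self/mountinfo. The following is a rough sketch with an illustrative function name; it only reads mountinfo and does not unmount anything.

```python
from collections import Counter


def stacked_mount_points(mountinfo_text):
    """Return {mount_point: count} for mount points that appear more than once."""
    counts = Counter()
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) > 4:
            counts[fields[4]] += 1  # the fifth mountinfo field is the mount point
    return {point: n for point, n in counts.items() if n > 1}


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as mountinfo:
        for point, n in sorted(stacked_mount_points(mountinfo.read()).items()):
            print(f"{point} is mounted {n} times")
```

Each reported mount point needs as many umount() calls as its count before the directory underneath (here /proc) can itself be unmounted.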

Mount filesystem after clone with CLONE_NEWNS flag

Short answer: It looks like the type of mount propagation isn't properly set.

Explanation

The Linux kernel defaults all mounts to MS_PRIVATE, but systemd overrides this during early boot to MS_SHARED, for the convenience of nspawn.
This can be observed by looking at the optional fields of /proc/$PID/mountinfo.
For instance, something like this might be expected:

$ cat /proc/self/mountinfo
  . . .
25 0 8:6 / / rw,relatime shared:1 - ext4 /dev/sda6 rw,errors=remount-ro,data=ordered
                         ^^^^^^
  . . .

Notice the shared:1 field marked above (by me), indicating that the current propagation type of the / mount point is MS_SHARED and that the peer group ID is 1 (the peer group ID does not matter in our case).
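Reading the optional fields by eye gets tedious on a system with many mounts. As a sketch (the function name is made up for illustration), the propagation type can be derived from a mountinfo line by looking at the tags between the mount options and the lone "-" separator: shared:N means shared, master:N means slave, and no tag at all means private.

```python
def propagation_of(mountinfo_line):
    """Return (mount_point, propagation) for one line of /proc/<pid>/mountinfo."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed; optional fields follow and end at the lone "-".
    separator = fields.index("-", 6)
    optional = fields[6:separator]
    kinds = []
    if any(tag.startswith("shared:") for tag in optional):
        kinds.append("shared")
    if any(tag.startswith("master:") for tag in optional):
        kinds.append("slave")
    if "unbindable" in optional:
        kinds.append("unbindable")
    return fields[4], "+".join(kinds) or "private"


if __name__ == "__main__":
    with open("/proc/self/mountinfo") as mountinfo:
        for line in mountinfo:
            mount_point, propagation = propagation_of(line)
            print(f"{propagation:15} {mount_point}")
```

Applied to the example line above, this yields the mount point / with propagation "shared".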

When using the CLONE_NEWNS flag on clone(2) a new mount namespace is created, which is initialized as a copy of the caller's mount namespace.
The new, replicated mount points of the new namespace join the same peer group as their respective original mount points in the caller's mount namespace.

The propagation type of a new mount point whose parent's propagation type is MS_SHARED is MS_SHARED too. Thus, when your "contained" process mount()s the filesystem on the loop device, the mount is MS_SHARED by default. All later mounts under it are propagated to the "main" process's namespace too, which is why the "main" process can see them.

For your request to be satisfied (for the "main" process not to see "contained" process's mount points), the mount propagation type you seek is either MS_SLAVE or MS_PRIVATE, depending on whether you want your "contained" process's root mount point to receive propagation events from other peers or not, respectively.
Obviously, MS_PRIVATE offers greater isolation than MS_SLAVE.

Thus, in your case, it should be sufficient to change the propagation type of "contained" process's root mount point to MS_PRIVATE or MS_SLAVE before you mount the rest of the filesystems, so the mounts won't be propagated to "main" process's namespace.

The code

At first, one would try to set the propagation type properly when the "contained" process creates its root mount point.

However, I noticed the following in man 8 mount (quoting):

Note that the Linux kernel does not allow to change multiple
propagation flags with a single mount(2) system call, and the flags
cannot be mixed with other mount options.

Since util-linux 2.23 the mount command allows to use several
propagation flags together and also together with other mount
operations. This feature is EXPERIMENTAL. The propagation flags are
applied by additional mount(2) system calls when the preceding mount
operations were successful.

Looking at your code, after the "contained" process mount()s the filesystem on the loop device, it issues chroot() into it. At this point, you could set its propagation type by injecting this mount(2) call:

if (chroot(".") < 0) {
    // handle error
}

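// Only the target and the propagation flag matter here: mount(2) ignores
// the source, filesystem type and data arguments for propagation changes.
if (mount("/", "/", c->fstype, MS_PRIVATE, "") < 0) {
    // handle error
}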

if (mkdir(...)) {
    // handle error
}

Now that the propagation type is set to MS_PRIVATE, all the subsequent mounts that "contained" process does under / won't be propagated, thus won't be visible in "main" process's namespace, as you can observe in /proc/mounts or /proc/$PID/mountinfo.
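For experimenting with the same idea outside the C program, here is a hedged Python sketch that issues the propagation-changing mount(2) call through ctypes. The helper names are hypothetical, and the MS_REC flag (apply to the whole subtree, like mount --make-rprivate) is an addition of this sketch, not part of the call shown above. The call needs CAP_SYS_ADMIN in the namespace where it runs.

```python
import ctypes
import os

# Flag values from <sys/mount.h>.
MS_REC = 16384
MS_PRIVATE = 1 << 18
MS_SLAVE = 1 << 19
MS_SHARED = 1 << 20

PROPAGATION = {"private": MS_PRIVATE, "slave": MS_SLAVE, "shared": MS_SHARED}


def propagation_flags(kind, recursive=True):
    """Return the mount(2) flags for a propagation change of the given kind."""
    flags = PROPAGATION[kind]
    return flags | MS_REC if recursive else flags


def set_propagation(path, kind, recursive=True):
    """Change the propagation type of the mount at path; raises OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    libc.mount.restype = ctypes.c_int
    # Source, filesystem type and data are ignored for propagation changes.
    flags = propagation_flags(kind, recursive)
    if libc.mount(None, os.fsencode(path), None, flags, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), path)
```

In the "contained" process, a call such as set_propagation("/", "private") would be made right after entering the new mount namespace and before any further mounts.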

Resources

Linux kernel's Shared Subtrees documentation for more information on mount propagation.
Michael Kerrisk's excellent LWN article explaining mount namespaces better than I could.

Which process/thread capabilities sets will be changed during clone(2), unshare(2), and setns(2)?

In order to get an idea of what setns(2) and unshare(2) might do to capabilities, I've created the following tiny Python 3 scripts. Make sure to install the packages nsenter and unshare (pip3 install nsenter, ...) before any attempt to run them.

setns(2)

# usernscaps.py: dump all capabilities sets of this process
# when entering a specific (grand)child user namespace.
from nsenter import Namespace
import sys

def dumpcaps(s):
    print(s)
    with open('/proc/self/status', 'r') as st:
        for line in st:
            if line.startswith('Cap'):
                print(line.rstrip())

if len(sys.argv) != 2:
    print('usage: usernscaps.py <PID>')
    sys.exit(1)

dumpcaps('initial:')
try:
    with Namespace('/proc/%d/ns/user' % int(sys.argv[1]), 'user'):
        entered = True
        dumpcaps('after setns:')
except PermissionError:
    # Switching back to our original user namespace isn't allowed, so ignore the exception.
    try:
        entered
    except NameError:
        print('no permission to enter user namespace')

As an ordinary unprivileged user, let us create a new user namespace which will be owned by us, and keep it open with a sleeping process (note: we put it in the background):

unshare -U bash -c "readlink /proc/self/ns/user && sleep infinity" &

Next, run the Python script usernscaps.py from above, and tell it to enter our newly created user namespace using setns(2), then finally dump the capability sets:

python3 usernscaps.py $(lsns -t user | grep "infinity" | awk '{ print $4 }')

This gives, even for our unprivileged user and process, after setns(2):

initial:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
after setns:
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

This seems to indicate that setns(2) in fact grants a full set of capabilities not only in the effective set, but also in the permitted set (which makes sense, as the effective set must be bounded by the permitted set at all times). It doesn't seem to top up the inheritable set, though.

unshare(2)

Similar to the previous script, but unshare(2)ing this time.

# usernsunsharecaps.py: dump all capabilities sets of this process
# upon unsharing the user namespace.
import unshare
import sys

def dumpcaps(s):
    print(s)
    with open('/proc/self/status', 'r') as st:
        for line in st:
            if line.startswith('Cap'):
                print(line.rstrip())

dumpcaps('initial:')
unshare.unshare(unshare.CLONE_NEWUSER)
dumpcaps('after unshare:')

Simply run it with python3 usernsunsharecaps.py:

initial:
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
after unshare:
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

So, this also gives full permitted and effective capabilities within the new user namespace after unsharing.
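The hexadecimal masks in these dumps are bit sets, one bit per capability number. As a small illustrative sketch (the function names are not from any library, and only three bit numbers from <linux/capability.h> are spelled out), they can be decoded like this:

```python
# Bit numbers from <linux/capability.h> for a few capabilities of interest.
CAP_NAMES = {18: "CAP_SYS_CHROOT", 19: "CAP_SYS_PTRACE", 21: "CAP_SYS_ADMIN"}


def capability_bits(hex_mask):
    """Return the sorted list of capability numbers set in a Cap* mask."""
    mask = int(hex_mask, 16)
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


def has_capability(hex_mask, bit):
    """Return True if the given capability number is set in the mask."""
    return bool((int(hex_mask, 16) >> bit) & 1)


if __name__ == "__main__":
    full = "0000003fffffffff"
    print(len(capability_bits(full)), "capabilities set")
    for bit, name in sorted(CAP_NAMES.items()):
        print(name, has_capability(full, bit))
```

The mask 0000003fffffffff seen above has bits 0 through 37 set, which is why it reads as "all capabilities" on the kernel that produced these dumps.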

Can't enter mount namespace created by a setuid process

Try using:

sudo setcap cap_sys_admin,cap_sys_chroot,cap_sys_ptrace=ep ./client

It looks like this detail of requiring the cap_sys_ptrace capability is buried in the kernel patch in a code comment:

+ * This syscall gets a copy of a file descriptor from another process
+ * based on the pidfd, and file descriptor number. It requires that
+ * the calling process has the ability to ptrace the process represented
+ * by the pidfd. The process which is having its file descriptor copied
+ * is otherwise unaffected.
