This article describes the bare-metal container technology that has taken the computing world by storm. The underlying ideas aren’t new; they have been in use for a very long time. The difference is that the current implementation is much cleaner and more effective.
Consider, for example, that you are trying to build your own distribution, or packages for several distributions. What you want is complete isolation of the build environment for such a package or distribution from your own installed system.
Earlier this was done using chroot, which changes the root file system of a process to another directory tree, typically the distribution or package build environment, which has nothing to do with the installed system. Let’s take a look at how this can be done.
chroot (Change root / pivot_root) way of isolation
Some of the issues with the chroot approach are listed below:
- Only the filesystem root was changed; the user would still be the actual system user, not a user private to the chroot jail.
- Processes couldn’t be isolated; system processes were still visible from within the chroot.
- The network couldn’t be isolated, i.e. iptables / ebtables rules applied to the whole system, including the chroot jail.
Implementing a chroot jail
In order to appreciate what namespaces bring to the table, let’s first try the old way, using the chroot command. You may have to install additional package(s) to get the chroot utility on your system.
Packages Required
Package Name | Description | Where to get it from? |
BusyBox | A very lightweight package containing a shell and other Linux utilities. It can be compiled as a statically linked binary, which is quite useful here. | https://busybox.net/downloads/busybox-1.30.1.tar.bz2 |
chroot | Command to create a chroot jail. You need to be the root user to execute this command. | Check your distro’s package manager. |
Compiling BusyBox
From the shell, cd into the BusyBox source directory and run the following command:
BusyBox Compilation
$ make menuconfig && make -j4 && make install
Make sure you select the Build static binary option under Settings while in menuconfig. You may also need to install libncurses, if you haven’t already, for the menuconfig interface to appear.
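If you would rather not go through the menuconfig interface, the same switch can be flipped by editing .config directly. The following is only a sketch: it assumes the option behind "Build static binary" is called CONFIG_STATIC in your BusyBox version (check your own .config), and enable_static is a helper name made up for this example.

```shell
# enable_static: hypothetical helper that turns on the static-build option
# in a BusyBox .config file given as $1.
# Assumption: the option is named CONFIG_STATIC in your BusyBox version.
enable_static() {
    config_file="$1"
    if grep -q '^CONFIG_STATIC=y$' "$config_file"; then
        return 0    # already enabled, nothing to do
    fi
    if grep -q '^# CONFIG_STATIC is not set$' "$config_file"; then
        # Replace the "not set" marker with an enabled entry.
        sed -i 's/^# CONFIG_STATIC is not set$/CONFIG_STATIC=y/' "$config_file"
    else
        # The option is not mentioned at all: append it.
        echo 'CONFIG_STATIC=y' >> "$config_file"
    fi
}
```

A possible sequence would be make defconfig, then enable_static .config, then make -j4 && make install. Running make oldconfig after the edit should let the build system settle any options that depend on the change.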
Chroot into the busybox
If the installation was successful you’ll have a directory named _install. It contains a directory structure similar to what you would expect on a Linux machine. Now, as the root user (or using sudo), you can do the following:
Moving into BusyBox chroot jail
# chroot _install /bin/sh
You should have a pound (#) prompt with a directory structure similar to the one shown below:
/ # ls
bin linuxrc sbin usr
/ #
As you can see there’s no proc directory, so running ps will give an error. To fix this, let’s first mount procfs so we can see our processes.
Mounting procfs inside chroot jail
/ # mkdir /proc
/ # mount -t proc none /proc
Now ps -ef shows all the processes, not only those from the chroot but also those of the system. This is one of the major drawbacks of chroot: it solves the root file system issue (we just changed it above), but it has a lot of limitations. The user is also root, and it is the system’s root rather than one isolated to the chroot.
Before you exit, unmount the procfs from the chroot!
Make sure you run umount /proc before typing exit, to avoid any surprises later!
So what we want …
- For complete isolation, at the very least the chroot users should not exist on our installed machine.
- It would be nice to have separate mounts of proc and sysfs, so that they show only information about processes in the chroot.
- Network isolation would be nice.
- Control over the resources allocated to such a chroot is much desired.
Enter Namespaces
Just like namespaces in programming languages, Linux namespaces are isolated “places” whose contents are accessible only to the processes living in that namespace. Several kinds of namespace are implemented (7 in all at the time of writing), and every process belongs to exactly one namespace of each kind.
However, some namespaces can be cascaded (nested) and some can be shared (the mount namespace), but under no circumstances can the hierarchy be reversed. We’ll take a look at this later.
Using namespaces takes only 2 new system calls; most of the heavy lifting is done by the Linux kernel at the time of process creation. To understand namespaces correctly, it’s important that we do some hands-on work.
Check if namespaces are available in the kernel
Running Kernel Configuration
$ grep -E '_NS=|CONFIG_CGROUPS=' /boot/config-$(uname -r)
What we are looking for are the following options:
Namespace Config Options for Kernel
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_CGROUPS=y
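The check above can be wrapped in a small script that reports each option individually. This is a sketch only: check_ns_config is a helper name invented here, and it assumes your distribution ships the kernel config as a plain file under /boot, which is not always the case.

```shell
# check_ns_config: hypothetical helper that reports, for the kernel config
# file given as $1, whether each namespace-related option is built in.
check_ns_config() {
    config_file="$1"
    for opt in UTS_NS IPC_NS USER_NS PID_NS NET_NS CGROUPS; do
        if grep -q "^CONFIG_${opt}=y\$" "$config_file"; then
            echo "CONFIG_${opt}: enabled"
        else
            echo "CONFIG_${opt}: missing"
        fi
    done
}

# Assumption: the running kernel's config lives under /boot.
config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    check_ns_config "$config"
else
    echo "no readable config at $config (some systems expose /proc/config.gz instead)"
fi
```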
Each process in Linux belongs to all 7 kinds of namespace, and to exactly one namespace of each kind. So, for example, if a running process changes its mount namespace, it is no longer part of the previous mount namespace.
The notable exception is the PID namespace, which is cascaded; otherwise it would be impossible to control the child processes from the parent process. We’ll take a look at what this means.
See running shell’s namespace
$ ls -l /proc/$$/ns
These appear to be broken symlinks; however, you can see that each namespace has an id attached to it. Processes that share a namespace show exactly the same id for it. So, for example, a command run from the shell inherits its parent’s namespaces unless the command changes them explicitly.
NOTE: Replace $$ with any other process’s pid to see its namespaces.
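To check the inheritance claim without reading the listing by eye, the ids can be compared from a script. A minimal sketch, assuming a Linux /proc is mounted; ns_id and same_ns are helper names invented for this example.

```shell
# ns_id: hypothetical helper, prints the namespace id of a process.
#   $1 = pid (or "self"), $2 = entry name under /proc/<pid>/ns (uts, mnt, ...)
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# same_ns: hypothetical helper, succeeds when processes $1 and $2
# are in the same namespace of kind $3.
same_ns() {
    a=$(ns_id "$1" "$3") && b=$(ns_id "$2" "$3") && [ "$a" = "$b" ]
}

# A child started from this shell inherits its namespaces,
# so both lines are expected to show the same id.
parent_uts=$(ns_id $$ uts)
child_uts=$(sh -c 'readlink /proc/$$/ns/uts')
echo "parent: $parent_uts"
echo "child:  $child_uts"
```

With two pids at hand, same_ns 1234 5678 net (the pids are placeholders) can then be used in an if statement.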
Types of namespaces
Namespace | Description | Clone Flag |
UTS | Information map for the identifiers returned by system call uname. See man 2 uname. | CLONE_NEWUTS |
IPC | IPC namespace of the process. | CLONE_NEWIPC |
NET | Network namespace of the process. | CLONE_NEWNET |
NS | Mount namespace of the process. | CLONE_NEWNS |
PID | PID namespace of the process. This one is cascaded, that is, a process can see the processes in its child namespaces but not the other way around. | CLONE_NEWPID |
USER | User namespace of the process. | CLONE_NEWUSER |
CGROUPS | Control Groups. Required for resource management within a namespace. | CLONE_NEWCGROUP |
A process can do either of the following:
- Create new namespaces, any or all of the above 7, when cloning a child process.
- Become part of existing namespace(s).
To see the above in action, let’s use the tools already available:
Command | Description |
unshare | As the name suggests, this command unshares a namespace by creating a new namespace. |
nsenter | Join an existing namespace. |
The above tools use the following system calls
System call | Description |
unshare | As the name suggests, this system call unshares a namespace by creating a new one for the calling process. See man 2 unshare. |
setns | Join an existing namespace. See man 2 setns. |
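A quick way to see the unshare tool at work, before building a full container, is to change the hostname inside a new UTS namespace and confirm that the host keeps its own. This is a sketch under a few assumptions: unprivileged user namespaces are allowed on your system, the hostname utility is installed, and demo-container is just a name picked for the example.

```shell
# -U -r : new user namespace with the caller mapped to root (no sudo needed)
# -u    : new UTS namespace, so the hostname change stays private to it
host_before=$(uname -n)

if inside=$(unshare -U -r -u sh -c 'hostname demo-container && uname -n' 2>/dev/null); then
    echo "hostname inside the new namespace: $inside"
else
    # Falls through here when user namespaces are disabled or restricted.
    echo "unshare failed; unprivileged user namespaces may be disabled here"
fi

host_after=$(uname -n)
echo "host before: $host_before, host after: $host_after"
```

The host name printed before and after should be identical, since the change only ever happened in the UTS namespace that disappeared with the command.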
Creating a simple usable container
Things we’ll need
- Busybox, so we can have everything we need in a small package.
- A running Linux kernel with namespaces enabled.
- Chroot installed on the system.
- Utilities unshare and nsenter.
After installing BusyBox, we repeat the same steps, i.e. chroot into the BusyBox tree; however, this time we also change our namespaces while we’re at it, so that we are isolated from the rest of the system.
$ unshare --kill-child -f -pmnuU -r -- chroot $(pwd)/_install /bin/sh
To see what all the options do, see man unshare. I’ll explain some of the options below:
Unshare Option | Description |
-r | Map the root user of the newly created namespace to the user who ran unshare. |
-- | Separates the command to execute from the rest of the options, i.e. what follows -- is the command and its argument(s). It’s important to understand that the command is run after moving to the new namespace(s), which is why we don’t require root permissions on the host system. |
Once the above command succeeds you should see the same directory structure as we did earlier with chroot. Again, let’s first mount the procfs, but this time you’ll see the difference:
/ # mkdir /proc
/ # mount -t proc none /proc
/ # ps -ef
PID USER TIME COMMAND
1 0 0:00 /bin/sh
5 0 0:00 ps -ef
As you can see, though we are root, we can’t see the processes outside our chroot jail. This is, in essence, the basic idea of a container.
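The same effect can be observed without the BusyBox tree. The sketch below relies on the --mount-proc option of unshare, which mounts a fresh procfs for the new PID namespace; it assumes a util-linux version that has this option and a system that permits unprivileged user namespaces, and it falls back to a message otherwise.

```shell
# -U -r        : new user namespace, caller mapped to root
# -f -p        : fork, and put the child into a new PID namespace
# --mount-proc : mount a private /proc matching the new PID namespace
if listing=$(unshare -U -r -f -p --mount-proc ps -e -o pid,comm 2>/dev/null); then
    # Only processes of the new PID namespace are listed here.
    echo "$listing"
else
    listing="unavailable"
    echo "could not create the namespaces here (restricted environment?)"
fi
```

When it works, the listing should contain ps itself and nothing from the host, because ps is the first process in the new PID namespace.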
Now let’s see what the namespace ids are for our init process, in this case the shell (/bin/sh executed via chroot).
/ # ls -l /proc/1/ns
lrwxrwxrwx 1 0 0 0 May 12 17:58 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 0 0 0 May 12 17:58 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 0 0 0 May 12 17:58 mnt -> mnt:[4026532426]
lrwxrwxrwx 1 0 0 0 May 12 17:58 net -> net:[4026532430]
lrwxrwxrwx 1 0 0 0 May 12 17:58 pid -> pid:[4026532428]
lrwxrwxrwx 1 0 0 0 May 12 17:58 pid_for_children -> pid:[4026532428]
lrwxrwxrwx 1 0 0 0 May 12 17:58 user -> user:[4026532425]
lrwxrwxrwx 1 0 0 0 May 12 17:58 uts -> uts:[4026532427]
Now let’s see what changed. Without closing the BusyBox terminal above, run the following from another terminal session:
$ ps -eo pid,ppid,cmd | grep /bin/sh
27134 6647 unshare --kill-child -f -pmnuU -r -- chroot /home/pranay/busybox_src/busybox/_install /bin/sh
27135 27134 /bin/sh
Now let me compare the namespaces of the above process (/bin/sh) with those of the shell I have on my other terminal:
pranay@pranay-Inspiron-3250:~/busybox_src$ ls -l /proc/27135/ns
total 0
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 mnt -> 'mnt:[4026532426]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 net -> 'net:[4026532430]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 pid -> 'pid:[4026532428]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 pid_for_children -> 'pid:[4026532428]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 user -> 'user:[4026532425]'
lrwxrwxrwx 1 pranay pranay 0 May 12 22:55 uts -> 'uts:[4026532427]'
pranay@pranay-Inspiron-3250:~/busybox_src$ ls -l /proc/$$/ns
total 0
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 net -> 'net:[4026532008]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 user -> 'user:[4026531837]'
lrwxrwxrwx 1 pranay pranay 0 May 12 16:36 uts -> 'uts:[4026531838]'
Apart from the cgroup and ipc namespaces, all the others are different. Since we didn’t create a new cgroup or ipc namespace, we expected those two to be the same.
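Comparing two such listings by eye gets tedious, so here is a sketch of a helper that prints only the namespaces in which two processes differ. ns_diff is a name made up for this example, and reading another process’s ns entries is assumed to be permitted (it is for your own processes).

```shell
# ns_diff: hypothetical helper, lists the namespaces in which
# the processes with pids $1 and $2 differ.
ns_diff() {
    for ns in cgroup ipc mnt net pid user uts; do
        # Skip entries that cannot be read (missing or no permission).
        a=$(readlink "/proc/$1/ns/$ns") || continue
        b=$(readlink "/proc/$2/ns/$ns") || continue
        [ "$a" = "$b" ] || echo "$ns: $a vs $b"
    done
}

# A process compared with itself differs nowhere, so this prints nothing.
ns_diff $$ $$
```

Run as ns_diff $$ 27135 with the pids from the session above, it would be expected to report mnt, net, pid, user and uts, and to stay silent about cgroup and ipc.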
Important Note
Killing pid 1 inside a pid namespace kills all the other processes within that pid namespace. In effect, the namespaces that pid 1 was part of are torn down along with it once no process is left in them.
Running a process in another process’s namespace
Now that we have a basic container running, let’s try to run a process within it. To demonstrate, here’s a very simple C program which does nothing but sleep in a loop, 20 seconds at a time.
#include <unistd.h>

int main(int argc, char *argv[])
{
    /* Stay alive forever so the process remains visible in ps. */
    for (;;) {
        sleep(20);
    }
}
We can compile the above program; let’s call the generated binary test.
Now, to run this inside the namespaces of the BusyBox shell, we are going to use the nsenter utility. All it needs is a target pid and the namespaces which should be switched to those of the target. In our case:
- The target pid = 27135, i.e. the pid of the shell /bin/sh started using the unshare command.
- The namespaces we need to move into. We’ll move into the pid, network, uts and user namespaces of the target, and leave the cgroup, ipc and mount namespaces as they are.
By default, when nsenter enters a user namespace it tries to reset its credentials: it drops supplementary groups and sets the UID and GID to 0 (see man nsenter). In our container that attempt fails. When we started unshare with the -r option, it already wrote the uid_map and gid_map files of the new user namespace, and those files can be written exactly once; after that the mappings are only inherited and can’t be modified. The -r option also disables setgroups in that namespace, so the attempt to drop groups is refused.
Therefore we need to tell nsenter to leave the credentials alone, which can be done using the --preserve-credentials option. The command now becomes the following; note that we are running this from our system’s shell:
$ nsenter -p -n -u -U -t 27135 --preserve-credentials ./test
This process will now be visible inside the BusyBox shell as well, using the usual ps command:
/ # ps -ef
PID USER TIME COMMAND
1 0 0:00 /bin/sh
10 0 0:00 ./test
12 0 0:00 ps -ef