This is an experimental type of Linux Virtual Machine that does not use a hypervisor (no monitor, no emulation, no HW virtualization). A guest runs multiple processes in a single address space, inside a single host process that uses only 12 syscalls (sandboxed using seccomp). The guest kernel is a modified Linux configured with User-Mode-Linux (UML) and no-MMU.
The best way to see this is as a modified User-Mode-Linux (UML) that is faster and more secure, but is significantly less general-purpose.
docker run --rm -it kollerr/linux-um-nommu
The source Dockerfile is tests/docker/linux-um-nommu/Dockerfile, which uses the alpine-test.ext3 image built by tests/Makefile (which is then based on tests/docker/alpine/Dockerfile).
Container runtimes have been using virtualization as a way of improving isolation (e.g., Kata containers). And in order to make them feel like regular containers, the community has been trying to slim down their virtual machine (VM) monitors (e.g., Firecracker). This experimental "VM" is what happens when you slim down to the extreme: no monitor at all.
Nabla Linux is a Linux virtual machine that runs as a single unprivileged user-level process on top of only 12 syscalls. We achieve isolation equivalent to virtual machines, without using a monitor, by restricting the VM process to only these 12 system calls using seccomp. The system is built on a combination of two well-known Linux features: User-Mode-Linux (UML) and no-MMU support (used for embedded devices), the latter both in the kernel and in userspace (musl and busybox).
Our initial experiments show that this Linux VM is capable of running multiple unmodified binaries from Alpine (like python, nginx, redis), and can boot in 6 milliseconds (to our knowledge, this is the fastest); albeit with some limitations: PIE executables only, and no forks (processes are emulated using vforks).
This shows a run with the host syscalls on the left. The point of this is to show that lots of applications just work while running on a small set of syscalls.
A single make at the root should build linux, musl, and busybox. Then you need a disk image (think of this as a VM). You can create a raw disk file based on alpine using the alpine-test.ext3 target in tests/, or just do a make demo in tests/ which will build one and then run it.
make
cd tests && make demo
- Linux Kernel Library (LKL), which also uses the NOMMU config but has a different use case: to be used as a library instead of a "VM". There are two very interesting developments related to LKL:
  - Unifying LKL into UML. Most of the kernel changes are already implemented in the LKL patch. After it's merged, "nabla linux" could be implemented with mostly userlevel changes.
  - Porting Linux to Nabla Containers.
- gVisor, which looks like UML when running in ptrace mode (one host process per guest process, trapped using ptrace).
- The solo5-spt monitor, which runs unikernels as a single process sandboxed using seccomp (same idea).
- No virtual memory (VM) and no memory protection. A single address space is shared by multiple processes, so a process writing into the NULL page will "kill" every process running in the VM (not what you would expect).
- No sys_fork. This is partially solved by supporting vfork (and posix_spawn). The catch is that applications need to use vfork or posix_spawn instead of fork and exec (like busybox configured for NOMMU). Applications doing sys_fork will get an EINVAL. The most common usage of fork and exec (running a new program) is the shell: that's why we need busybox configured for NOMMU. Other applications like nginx or redis don't fork (at least we haven't seen them fork), so they don't need to be patched.
- Can only run PIE executables. This is the case for most of the binaries in Alpine Linux as explained here/Secure.
- Have to use our modified musl libc. This libc supports making syscalls over vsyscall (i.e. a function call instead of the syscall instruction).
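
The no-fork and PIE-only limitations above can be illustrated with a small Python sketch. It is not part of this repository; the helper names (elf_type, looks_like_pie, spawn_and_wait) are made up for the example. It assumes a Python build that provides os.posix_spawnp (3.8+) and a libc whose posix_spawn does not fall back to fork. Note that an ELF type of ET_DYN is only a necessary condition for PIE, since shared libraries use the same type.

```python
import os
import struct

ET_DYN = 3  # ELF e_type used by PIE executables (and by shared objects)


def elf_type(header: bytes) -> int:
    """Return the e_type field from the first 18 bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # e_ident[EI_DATA] (byte 5): 1 = little-endian, 2 = big-endian.
    byte_order = {1: "<", 2: ">"}.get(header[5])
    if byte_order is None:
        raise ValueError("unknown ELF byte order")
    # e_type is the 16-bit field right after the 16-byte e_ident array.
    return struct.unpack_from(byte_order + "H", header, 16)[0]


def looks_like_pie(path: str) -> bool:
    """Heuristic only: ET_DYN also matches shared libraries."""
    with open(path, "rb") as f:
        return elf_type(f.read(18)) == ET_DYN


def spawn_and_wait(argv: list) -> int:
    """Run argv with posix_spawn instead of fork+exec; return its exit status."""
    # posix_spawnp searches PATH, like execvp would after a fork.
    pid = os.posix_spawnp(argv[0], argv, os.environ)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative signal number if killed
```

For example, looks_like_pie("/bin/busybox") would report whether that binary has the ELF type a PIE needs, and spawn_and_wait(["sh", "-c", "exit 3"]) starts a shell without calling fork in the Python process itself.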