# Linux Containers but using the host's runtime



I wanted to deploy a BGP/OSPF router in a container that would peer with another router running on the host OS. Either Incus/LXC or Docker/Podman would have worked fine, but both felt inefficient.
On the host I was running FRRouting, and in the container I wanted to run FRRouting too, ideally matching the versions.

* Using Docker, I would have downloaded an image containing a full OS+FRRouting.
* Using Incus/LXC, the OS image would be even bigger because it includes systemd & friends.

Clearly this is an issue that bothers nobody, because storage is cheap and Docker is convenient.
I, however, am using the cheapest VPS from Scaleway (`stardust`), with 1GB of RAM and 10GB of NVMe storage.

The simplest proof of concept would be to mount `/usr`, `/bin` and `/lib64` read-only in a distroless container with Docker.
For my use case, though, Docker is not such a great fit: FRRouting spawns several processes, and I would like more control over the networking than Docker provides by default.

This was done (first) by Lennart Poettering¹ himself, although he mounts his entire disk as an overlay into the container. I can't do that because I want my container's storage to persist, and to stay clear of my host OS's configs and home directory.

### Proof of Concept
The goal here was to start FRRouting without a running OS. This was tested on a Fedora Workstation.
A filesystem skeleton was needed to make FRR happy:
```text
.
├── dev
├── etc
│   ├── frr
│   │   ├── daemons
│   │   ├── frr.conf
│   │   └── vtysh.conf
│   ├── group
│   └── passwd
├── init.sh
├── lib
├── lib64
├── proc
├── root
├── run
│   ├── dbus
│   └── frr
├── sys
├── tmp
├── usr
└── var
    ├── lib
    │   └── frr
    ├── log
    └── tmp
```
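The skeleton above can be recreated with a handful of `mkdir` and `touch` calls. This is only a minimal sketch, not necessarily how the original tree was built: the helper name `make_skeleton` is made up for illustration, and every file is created empty as a placeholder, so `daemons`, `frr.conf`, `passwd`, `group` and `init.sh` still need real content before FRR can start.

```bash
# Sketch: recreate the directory skeleton shown above under a target directory.
# make_skeleton is a hypothetical helper name, not part of FRR or systemd.
make_skeleton() {
    root="$1"
    # Mount points and runtime directories
    for d in dev etc/frr lib lib64 proc root run/dbus run/frr sys tmp usr \
             var/lib/frr var/log var/tmp; do
        mkdir -p "$root/$d"
    done
    # Placeholder files: created empty here, to be filled in by hand
    for f in etc/frr/daemons etc/frr/frr.conf etc/frr/vtysh.conf \
             etc/group etc/passwd init.sh; do
        touch "$root/$f"
    done
    chmod +x "$root/init.sh"
}
```

For example, `make_skeleton /mnt/test-base` would prepare the directory used in the `systemd-nspawn` command further down.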
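The file `init.sh` is a simple entrypoint. It is needed because the `frrinit.sh` script starts FRR as daemons and returns, so something else (here an interactive `bash`) has to keep the container alive: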
```bash
#!/bin/bash
/usr/libexec/frr/frrinit.sh start
ip link add dev loo type dummy
bash
```
Using `systemd-nspawn`, we can start a process inside its own network, PID and filesystem namespaces. We need `tini` as the init process, because otherwise terminated child processes would never be reaped and would linger as zombies.
The option `--volatile=overlay` makes the filesystem skeleton read-write inside the container, but changes are lost when the container exits, similarly to a Docker image.
```bash
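# On the host: install systemd-nspawn and tini
apt install systemd-container tini

# Start the container; this command stays in the foreground
systemd-nspawn \
   --private-network \
   --volatile=overlay \
   --bind-ro=/usr \
   --bind-ro=/bin \
   --bind-ro=/lib \
   --bind-ro=/lib64 \
   --directory /mnt/test-base \
   --tmpfs=/run/frr:mode=777 \
   --capability=CAP_NET_ADMIN,CAP_NET_RAW \
   --resolv-conf=off \
   /usr/bin/tini /init.sh

# From a second terminal on the host: move the dummy0 interface into the
# container's network namespace, addressed by the container's leader PID
nsid=$(machinectl status test-base | grep "Leader" | grep -Eo '[0-9]+')
ip link set dummy0 netns "$nsid"

# Open a shell inside the container's namespaces
pid=$(machinectl status test-base --no-pager --quiet | grep /usr/bin/tini | grep -Eo '[0-9]+')
nsenter -t "$pid" -m -u -i -n -p /bin/bash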
```
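Grepping the output of `machinectl status` works, but it depends on the human-readable layout. A possible alternative is to ask `machinectl show` for the `Leader` property directly. The sketch below assumes the container is registered as `test-base`, as in the commands above; the helper name `leader_pid` is made up for this example.

```bash
# Sketch: read the container's leader PID as a single property instead of
# parsing "machinectl status". leader_pid is a hypothetical helper name.
leader_pid() {
    machinectl show "$1" --property=Leader --value
}

# Usage (on the host, as root, while the container is running):
#   ip link set dummy0 netns "$(leader_pid test-base)"
#   nsenter -t "$(leader_pid test-base)" -m -u -i -n -p /bin/bash
```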
However this does not integrate well with systemd (`machinectl`...): there is no way to see the logs or run a shell.

### A better solution
Instead of kludging a filesystem together to run a single process, I used `debootstrap` to populate a minimal Debian filesystem. I also gave up on running a single process and simply run systemd inside the container, which integrates better with the `machinectl` commands.

```bash
debootstrap --include=dbus,libpam-systemd,libnss-systemd,frr stable /var/lib/machines/my-container
systemd-nspawn \
   --directory /var/lib/machines/my-container \
   --capability=CAP_NET_ADMIN,CAP_NET_RAW \
   --private-network \
   --private-users=pick \
   --bind-ro=/usr \
   --bind-ro=/etc/alternatives \
   --network-veth \
   --boot
```
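With systemd and D-Bus running inside the container, the usual host-side tooling should apply. The wrappers below are a small sketch of the commands I have in mind; the function names and the `DRY_RUN` switch are made up for illustration, and the unit name `frr` is an assumption based on Debian's `frr` package.

```bash
# Sketch: day-to-day commands for a booted nspawn container.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Interactive root shell inside the container
container_shell() { run machinectl shell "$1"; }

# FRR logs, read from the container's journal (unit name is an assumption)
container_logs() { run journalctl --machine="$1" -u frr; }

# Clean shutdown of the container
container_stop() { run machinectl poweroff "$1"; }
```

For example, `container_shell my-container` from a second terminal on the host opens a shell in the container started above.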

Since `/usr` is bind-mounted from the host, the container's own copy can be deleted, which removes most of the disk usage:
```bash
rm -rf /var/lib/machines/my-container/var/cache
rm /var/lib/machines/my-container/etc/hostname # debootstrap copies the host's hostname, which will cause mass confusion
rm /var/lib/machines/my-container/etc/os-release # it is a symlink to /usr
cp /var/lib/machines/my-container/usr/lib/os-release /var/lib/machines/my-container/etc/os-release
rm -rf /var/lib/machines/my-container/usr # big clean up !
mkdir -p /var/lib/machines/my-container/usr/{bin,sbin,lib,lib64,libexec,local}
```
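Because the cleanup deletes a lot, a quick sanity check of the resulting tree can catch mistakes before the next boot. This is only a sketch of the conditions the steps above are meant to establish; `check_rootfs` is a hypothetical helper name.

```bash
# Sketch: verify the state the cleanup steps above should leave behind.
# check_rootfs is a hypothetical helper; it returns non-zero on the first problem.
check_rootfs() {
    root="$1"
    # os-release must be a regular file, since its old symlink target lived in /usr
    [ -f "$root/etc/os-release" ] && [ ! -L "$root/etc/os-release" ] || return 1
    # the hostname copied from the host should be gone
    [ ! -e "$root/etc/hostname" ] || return 1
    # /usr must exist but contain no files or symlinks, only empty directories
    [ -d "$root/usr" ] || return 1
    [ -z "$(find "$root/usr" -type f -o -type l | head -n 1)" ] || return 1
}

# Usage:
#   check_rootfs /var/lib/machines/my-container && echo "rootfs looks fine"
```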

### Making things permanent
```ini
# /etc/systemd/nspawn/my-container.nspawn
[Exec]
Boot=yes
# needed for FRRouting
Capability=CAP_NET_ADMIN CAP_NET_RAW
PrivateUsers=pick
Hostname=my-container

[Files]
BindReadOnly=/usr
# needed for vim
BindReadOnly=/etc/alternatives

[Network]
Private=yes
VirtualEthernet=yes

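# Resource limits are not .nspawn settings (only [Exec], [Files] and [Network]
# are read from this file). They belong in a drop-in for the service unit, e.g.
# /etc/systemd/system/systemd-nspawn@my-container.service.d/limits.conf
[Service]
CPUQuota=50%
MemoryMax=256M
MemorySwapMax=512M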
```

Note that the `.nspawn` file and the directory in `/var/lib/machines` must share the same name (here `my-container`). If you don't want to store the filesystem in `/var/lib/machines`, you can put a symlink there instead.
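The symlink variant could look like the sketch below. The helper name `link_machine` and the example path `/srv/containers/my-container` are made up for illustration; only `/var/lib/machines` and the container name come from the setup above.

```bash
# Sketch: keep the root filesystem elsewhere and expose it under the
# machines directory through a symlink. link_machine is a hypothetical helper.
link_machine() {
    rootfs="$1"                                # where the filesystem really lives
    name="$2"                                  # must match the .nspawn file name
    machines_dir="${3:-/var/lib/machines}"
    ln -s "$rootfs" "$machines_dir/$name"
}

# Usage (as root):
#   link_machine /srv/containers/my-container my-container
#   machinectl start my-container     # start it now
#   machinectl enable my-container    # start it at every boot
```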


## References

=> https://0pointer.net/blog/running-an-container-off-the-host-usr.html 🔗 [1]: done (first) by Lennart Poettering
