Linux Containers but using the host's runtime

I wanted to deploy a BGP/OSPF router in a container that would peer with another router on the host OS. Either Incus/LXC or Docker/Podman would have worked fine, but both felt inefficient. On the host I was running FRRouting, and in the container I wanted to run FRRouting too, ideally with matching versions.

  • Using Docker, I would have downloaded an image containing a full OS+FRRouting.
  • Using Incus/LXC, the OS image would be even bigger because it includes systemd & friends.

Clearly this is an issue that bothers nobody, because storage is cheap and Docker is convenient. I, however, am using the cheapest VPS from Scaleway (Stardust), with 1GB of RAM and 10GB of NVMe storage.

The simplest proof of concept would be to mount /usr, /bin and /lib64 read-only in a distroless container with Docker. But actually for my use case, Docker is not such a great fit because FRRouting spawns several processes and also I would like more control over the networking than Docker provides by default.

This was done (first) by Lennart Poettering himself, although he mounts his entire disk as an overlay into the container. I can’t do that because I want my container’s storage to persist, and to stay clear of my host OS’s configs and home directory.

Proof of Concept

The goal here was to start FRRouting without a running OS. This was tested on a Fedora Workstation. A filesystem skeleton was needed to make FRR happy:

.
├── dev
├── etc
│   ├── frr
│   │   ├── daemons
│   │   ├── frr.conf
│   │   └── vtysh.conf
│   ├── group
│   └── passwd
├── init.sh
├── lib
├── lib64
├── proc
├── root
├── run
│   ├── dbus
│   └── frr
├── sys
├── tmp
├── usr
└── var
    ├── lib
    │   └── frr
    ├── log
    └── tmp
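
The skeleton above can be recreated with a few lines of shell. This is only a sketch: the ROOT variable is my own placeholder (the commands further down use /mnt/test-base), and every file is created empty, so the FRR configuration, the user and group entries and the entrypoint still have to be filled in by hand.

```shell
#!/bin/sh
set -eu

# ROOT is a placeholder; defaults to a throwaway directory so the sketch is safe to run
ROOT="${ROOT:-$(mktemp -d)}"

# Directories: mount points, runtime and state directories
for d in dev proc sys tmp root usr lib lib64 \
         etc/frr run/dbus run/frr \
         var/lib/frr var/log var/tmp; do
    mkdir -p "$ROOT/$d"
done

# Files, created empty here; they still need real content
for f in etc/frr/daemons etc/frr/frr.conf etc/frr/vtysh.conf \
         etc/group etc/passwd init.sh; do
    touch "$ROOT/$f"
done

# The entrypoint must be executable
chmod +x "$ROOT/init.sh"

echo "skeleton created in $ROOT"
```

Running it as ROOT=/mnt/test-base sh skeleton.sh (the script name is arbitrary) would build the tree in the directory used below.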

The file init.sh is a simple entrypoint. It is needed because frrinit.sh starts the FRR daemons in the background and returns, so something else has to keep running in the foreground:

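#!/bin/bash
# Start the FRR daemons
/usr/libexec/frr/frrinit.sh start
# Create a dummy interface named "loo"
ip link add dev loo type dummy
# Keep the container alive with an interactive shell
bash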

Using systemd-nspawn, we can start a process inside a network/pid/fs namespace. We need to use tini as an init process because otherwise we could be creating zombie processes. The option --volatile=overlay makes the filesystem skeleton read-write inside the container, but changes are lost when the container exits, similarly to a Docker image.

$ apt install systemd-container tini
$ systemd-nspawn \
   --private-network \
   --volatile=overlay \
   --bind-ro=/usr \
   --bind-ro=/bin \
   --bind-ro=/lib \
   --bind-ro=/lib64 \
   --directory /mnt/test-base \
   --tmpfs=/run/frr:mode=777 \
   --capability=CAP_NET_ADMIN,CAP_NET_RAW \
   --resolv-conf=off \
   /usr/bin/tini /init.sh

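# From another terminal on the host: move the host's dummy0 interface
# into the container's network namespace
nsid=$(machinectl status test-base | grep "Leader" | grep -Eo '[0-9]+')
ip link set dummy0 netns "$nsid"

# Enter the container's namespaces to get a shell next to tini
pid=$(machinectl status test-base --no-pager --quiet | grep /usr/bin/tini | grep -Eo '[0-9]+')
nsenter -t "$pid" -m -u -i -n -p /bin/bash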
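
Grepping the human-readable output of machinectl status is a bit fragile. A possible alternative, sketched here, is to read the Leader property from machinectl show, which prints KEY=VALUE lines. The helper name leader_pid is hypothetical, and the machinectl and nsenter calls are left as comments because they need a running container and root.

```shell
#!/bin/sh

# leader_pid: read "machinectl show <machine>" style KEY=VALUE lines on stdin
# and print the numeric value of the Leader property, if present
leader_pid() {
    sed -n 's/^Leader=\([0-9][0-9]*\)$/\1/p'
}

# Intended use on the host (not executed here):
#   pid=$(machinectl show test-base | leader_pid)
#   ip link set dummy0 netns "$pid"
#   nsenter -t "$pid" -m -u -i -n -p /bin/bash
```

Depending on the systemd version, machinectl show test-base --property=Leader --value may return the bare PID directly, which would make the helper unnecessary.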

However, this does not integrate well with systemd's tooling (machinectl…): there is no way to see the logs or open a shell with it.

A better solution

Instead of kludging a filesystem to run a single process, I used debootstrap to populate a minimal Debian filesystem. I also gave up on running a single process and simply run systemd inside the container, which integrates better with machinectl commands.

debootstrap --include=dbus,libpam-systemd,libnss-systemd,frr stable /var/lib/machines/my-container
systemd-nspawn -D /var/lib/machines/my-container \
   --capability=CAP_NET_ADMIN,CAP_NET_RAW \
   --private-network \
   --private-users=pick \
   --bind-ro=/usr \
   --bind-ro=/etc/alternatives \
   --boot \
   --network-veth

Since /usr is bind-mounted from the host, the container's own copy can be deleted, which removes most of the disk usage:

rm -rf /var/lib/machines/my-container/var/cache
rm /var/lib/machines/my-container/etc/hostname # debootstrap copies the host's hostname, which will cause mass confusion
rm /var/lib/machines/my-container/etc/os-release # it is a symlink to /usr
cp /var/lib/machines/my-container/usr/lib/os-release /var/lib/machines/my-container/etc/os-release
rm -rf /var/lib/machines/my-container/usr # big clean up !
mkdir -p /var/lib/machines/my-container/usr/{bin,sbin,lib,lib64,libexec,local}

Making things permanent

# /etc/systemd/nspawn/my-container.nspawn
[Exec]
Boot=yes
# needed for FRRouting
Capability=CAP_NET_ADMIN CAP_NET_RAW
PrivateUsers=pick
Hostname=my-container

[Files]
BindReadOnly=/usr
# needed for vim
BindReadOnly=/etc/alternatives

[Network]
Private=yes
VirtualEthernet=yes

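# Resource limits are service unit settings: a .nspawn file only knows the
# [Exec], [Files] and [Network] sections. They go in a drop-in for the unit,
# for example /etc/systemd/system/systemd-nspawn@my-container.service.d/limits.conf
# (the file name limits.conf is arbitrary)
[Service]
CPUQuota=50%
MemoryMax=256M
MemorySwapMax=512M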
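
With the configuration in place, the container can be driven with the usual machinectl and journalctl commands, which is what the proof of concept was missing. The sketch below only prints the commands by default; the run wrapper and the DRY_RUN variable are my own additions so that nothing is executed by accident, and I am assuming the container keeps its own journal.

```shell
#!/bin/sh

# Print commands instead of running them unless DRY_RUN=0 is set (hypothetical wrapper)
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

MACHINE="my-container"

run machinectl start "$MACHINE"     # boot the container
run machinectl enable "$MACHINE"    # start it automatically at boot
run machinectl shell "$MACHINE"     # open a root shell inside it
run journalctl -M "$MACHINE" -b     # read the container's journal from the host
```

Setting DRY_RUN=0 (as root) runs the commands for real.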

Note that the .nspawn file and the directory in /var/lib/machines must share the same name. If you don’t want to store the filesystem in /var/lib/machines, you can use a symlink.
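
Since a name mismatch is easy to miss, here is a small sketch of a check. The function name check_nspawn_names is hypothetical; it takes the two directories as arguments and accepts a symlink to a directory just like a real directory.

```shell
#!/bin/sh

# check_nspawn_names NSPAWN_DIR MACHINES_DIR
# For every NAME.nspawn file, verify that MACHINES_DIR/NAME is a directory.
# Returns 1 if at least one container filesystem is missing.
check_nspawn_names() {
    nspawn_dir=$1
    machines_dir=$2
    status=0
    for f in "$nspawn_dir"/*.nspawn; do
        # The glob stays literal when there is no .nspawn file at all
        [ -e "$f" ] || continue
        name=$(basename "$f" .nspawn)
        # -d follows symlinks, so a symlinked filesystem is accepted
        if [ -d "$machines_dir/$name" ]; then
            echo "ok: $name"
        else
            echo "missing: $name"
            status=1
        fi
    done
    return $status
}

# Intended use (not executed here):
#   ln -s /srv/containers/my-container /var/lib/machines/my-container
#   check_nspawn_names /etc/systemd/nspawn /var/lib/machines
```

The path /srv/containers/my-container in the comment is only an example of an alternative storage location.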