# Storage
This document describes the known use-cases and architecture options
we have for Linux virtualization storage in [KubeVirt][].
## Problem description
The main goal of KubeVirt is to leverage the storage subsystem of
Kubernetes (built around [CSI][] and [Persistent Volumes][] aka PVs),
so that both kinds of workload (VMs and containers) can consume the
same storage. As a consequence, KubeVirt is limited in its use of the
QEMU storage subsystem and its features. That means:
* Storage solutions should be implemented in k8s in a way that can be
consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are
available in the pod, through k8s APIs. For example, a VM will not
support disk snapshots if it’s attached to a storage provider that
doesn’t support them. Ditto for incremental backup, block jobs,
encryption, etc.
## Current situation
### Storage handled outside of QEMU
In this scenario, the VM pod uses a [Persistent Volume Claim
(PVC)][Persistent Volumes] to give QEMU access to a raw storage
device or fs mount, which is provided by a [CSI][] driver. QEMU
**doesn’t** handle any of the storage use-cases such as thin
provisioning, snapshots, change block tracking, block jobs, etc.
This is how things work today in KubeVirt.
![Storage handled outside of QEMU][Storage-Current]
Devices and interfaces:
* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
  * alternative: emulated device for wider compatibility and Windows
    installations
    * CDROM (sata)
    * disk (sata)
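
For illustration, here is a minimal sketch of what this maps to at
the QEMU level: a block-mode PVC exposed as a raw host device behind
a virtio-blk frontend. This is not KubeVirt's actual launcher code,
and the device path and machine options are hypothetical placeholders.

```python
import subprocess

# Hypothetical path where a block-mode PVC shows up inside the VM pod.
PVC_DEVICE = "/dev/pvc-disk"

# QEMU backend: the raw host device, opened with O_DIRECT.
# QEMU frontend: a virtio-blk PCI device on top of that node.
qemu_cmd = [
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",
    "-m", "2048",
    "-blockdev",
    f"driver=host_device,filename={PVC_DEVICE},"
    "node-name=pvc0,cache.direct=on",
    "-device", "virtio-blk-pci,drive=pvc0",
    # For a file-system PVC the backend would instead be driver=file
    # (plus a raw format node) pointing at an image file on the
    # mounted volume; a SATA CDROM/disk frontend would only swap the
    # -device line.
]

subprocess.run(qemu_cmd, check=True)
```
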
Pros:
* Simplicity
* Sharing the same storage model with other pods/containers
Cons:
* Limited feature-set (fully off-loaded to the CSI storage provider).
* No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for high
performance.
* Hotplug is challenging, because the set of PVCs in a pod is
immutable.
Questions and comments:
* How to optimize this in QEMU?
  * Can we bypass the block layer for this use-case? Like having SPDK
    inside the VM pod?
  * Rust-based storage daemon (e.g. [vhost_user_block][]) running
    inside the VM pod alongside QEMU (bypassing the block layer)
  * We should be able to achieve high performance with local NVMe
    storage here, with multiple polling IOThreads and multi-queue.
* See [this blog post][PVC resize blog] for information about the PVC
resize feature. To implement this for VMs we could have KubeVirt
watch PVCs and respond to capacity changes with a corresponding
call to resize the image file (if applicable) and to notify QEMU of
the enlarged device (see the QMP sketch after this list).
* Features such as incremental backup (CBT) and snapshots could be
implemented through a generic CSI backend... Device mapper?
Stratis? (See [Other Topics](#other-topics))
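
As a rough sketch of the last step of the PVC-resize idea above: once
the image or device has grown, QEMU can be notified over QMP with the
`block_resize` command. The socket path and node name below are
hypothetical, and in KubeVirt this would more likely be driven
through libvirt rather than raw QMP.

```python
import json
import socket

QMP_SOCKET = "/var/run/kubevirt/qmp.sock"  # hypothetical QMP socket
NODE_NAME = "pvc0"                         # hypothetical blockdev node
NEW_SIZE = 20 * 1024**3                    # new capacity in bytes


def qmp_command(sock_file, command, arguments=None):
    """Send one QMP command and return the first non-event reply."""
    sock_file.write(json.dumps({
        "execute": command,
        "arguments": arguments or {},
    }).encode() + b"\n")
    sock_file.flush()
    while True:
        reply = json.loads(sock_file.readline())
        if "return" in reply or "error" in reply:
            return reply  # skip asynchronous events


with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(QMP_SOCKET)
    f = sock.makefile("rwb")
    json.loads(f.readline())            # consume the QMP greeting
    qmp_command(f, "qmp_capabilities")  # leave capabilities mode
    # Tell QEMU the underlying device/image grew to NEW_SIZE bytes.
    print(qmp_command(f, "block_resize",
                      {"node-name": NODE_NAME, "size": NEW_SIZE}))
```
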
## Possible alternatives
### Storage device passthrough (highest performance)
Device passthrough via PCI VFIO, SCSI, or vDPA. No storage use-cases
are handled and no CSI is involved, as the device is passed directly
to the guest.
![Storage device passthrough][Storage-Passthrough]
Devices and interfaces:
* N/A (hardware passthrough)
Pros:
* Highest possible performance (same as host)
Cons:
* No storage features anywhere outside of the guest.
* No live-migration in most cases.
### File-system passthrough (virtio-fs)
File mount volumes (directories, actually) can be exposed to QEMU via
[virtio-fs][] so that VMs have access to files and directories.
![File-system passthrough (virtio-fs)][Storage-Virtiofs]
Devices and interfaces:
* PVC: file-system
Pros:
* Simplicity from the user-perspective
* Flexibility
* Great for heterogeneous workloads that share data between
containers and VMs (e.g. OpenShift pipelines)
Cons:
* Lower performance compared to block device passthrough
Questions and comments:
* The feature is still quite new (the Windows driver is fresh out of
the oven)
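
To make the virtio-fs data path concrete, here is a minimal sketch of
the plumbing, assuming a file-system PVC mounted at `/mnt/pvc` inside
the pod. The binary paths and socket locations are hypothetical, and
virtiofsd's options differ between the C and Rust implementations
(the Rust flags are shown).

```python
import subprocess

SHARED_DIR = "/mnt/pvc"                 # hypothetical fs-mode PVC mount
VHOST_SOCK = "/var/run/virtiofsd.sock"  # hypothetical vhost-user socket

# 1. virtiofsd exports the directory over a vhost-user socket.
virtiofsd = subprocess.Popen([
    "/usr/libexec/virtiofsd",
    "--socket-path", VHOST_SOCK,
    "--shared-dir", SHARED_DIR,
    "--cache", "auto",
])

# 2. QEMU attaches to that socket with a vhost-user-fs device. Guest
#    RAM must come from a shared memory backend so the daemon can map
#    it, which is why both processes have to run on the same host.
qemu = subprocess.Popen([
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",
    "-m", "2G",
    "-object", "memory-backend-memfd,id=mem0,size=2G,share=on",
    "-numa", "node,memdev=mem0",
    "-chardev", f"socket,id=charfs0,path={VHOST_SOCK}",
    "-device", "vhost-user-fs-pci,chardev=charfs0,tag=pvc-share",
])

# 3. Inside the guest, the share is mounted by tag:
#      mount -t virtiofs pvc-share /mnt
```
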
### QEMU storage daemon in CSI for local storage
The qemu-storage-daemon is a user-space daemon that exposes QEMU’s
block layer to external users. It’s similar to [SPDK][], but includes
the implementation of QEMU block layer features such as snapshots and
bitmap tracking for incremental backup (CBT). It also allows
splitting a single NVMe device so that multiple QEMU VMs can share
one NVMe disk.
In this architecture, the storage daemon runs as part of CSI (control
plane), with the data-plane being either a vhost-user-blk interface
for QEMU or a fs-mount export for containers.
![QEMU storage daemon in CSI for local storage][Storage-QSD]
Devices and interfaces:
* CSI:
  * fs mount with a vhost-user-blk socket for QEMU to open
  * (OR) fs mount via NBD or FUSE with the actual file-system
    contents
* qemu-storage-daemon backend: NVMe local device w/ raw or qcow2
  * alternative: any driver supported by QEMU, such as file-posix.
* QEMU frontend: virtio-blk
  * alternative: any emulated device (CDROM, virtio-scsi, etc.)
    * In this case QEMU itself would be consuming vhost-user-blk and
      emulating the device for the guest
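
Here is a minimal sketch of this split: qemu-storage-daemon exports a
qcow2 image over vhost-user-blk (data plane) plus a QMP monitor
(control plane), and QEMU in the VM pod consumes the export. All
paths are hypothetical; in the architecture described here the daemon
would run in a CSI-managed pod and only the unix sockets would be
shared with the VM pod.

```python
import subprocess

IMAGE = "/var/lib/qsd/disk.qcow2"            # hypothetical local image
VHOST_SOCK = "/var/run/qsd/vhost-user.sock"  # data plane (vhost-user-blk)
QMP_SOCK = "/var/run/qsd/qmp.sock"           # control plane (QMP)

# qemu-storage-daemon: QEMU block layer (qcow2 on a local file),
# exported via vhost-user-blk, with a QMP monitor for snapshots,
# bitmaps, block jobs, etc. A local NVMe device could be used instead
# with driver=nvme.
qsd = subprocess.Popen([
    "qemu-storage-daemon",
    "--blockdev", f"driver=file,filename={IMAGE},node-name=file0",
    "--blockdev", "driver=qcow2,file=file0,node-name=disk0",
    "--export", "type=vhost-user-blk,id=export0,node-name=disk0,"
                f"addr.type=unix,addr.path={VHOST_SOCK},writable=on",
    "--chardev", f"socket,id=mon0,path={QMP_SOCK},server=on,wait=off",
    "--monitor", "chardev=mon0",
])

# QEMU (in the VM pod): vhost-user-blk frontend on the exported
# socket. Guest RAM must come from a shared memory backend so the
# daemon can access it, hence the same-host restriction noted below.
qemu = subprocess.Popen([
    "qemu-system-x86_64",
    "-machine", "q35,accel=kvm",
    "-m", "2G",
    "-object", "memory-backend-memfd,id=mem0,size=2G,share=on",
    "-numa", "node,memdev=mem0",
    "-chardev", f"socket,id=charblk0,path={VHOST_SOCK}",
    "-device", "vhost-user-blk-pci,chardev=charblk0,num-queues=4",
])
```
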
Pros:
* The NVMe driver from the storage daemon can support partitioning
one NVMe device into multiple block devices, each shared via a
vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU
block layer to regular pods/containers:
  * Snapshots and thin-provisioning (qcow2)
  * Incremental Backup (CBT)
* Compatibility with use-cases from other projects (oVirt, OpenStack,
etc.)
  * Snapshots, thin-provisioning, CBT and block jobs via QEMU
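
For example, the snapshot and CBT features listed above could be
driven by the control plane through the daemon's QMP monitor (the
hypothetical `/var/run/qsd/qmp.sock` from the sketch above). A rough
sketch:

```python
import json
import socket

QMP_SOCK = "/var/run/qsd/qmp.sock"  # hypothetical, see previous sketch


def qmp(sock_file, command, arguments=None):
    """Send one QMP command and return the first non-event reply."""
    sock_file.write(json.dumps(
        {"execute": command, "arguments": arguments or {}}
    ).encode() + b"\n")
    sock_file.flush()
    while True:
        reply = json.loads(sock_file.readline())
        if "return" in reply or "error" in reply:
            return reply


with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
    sock.connect(QMP_SOCK)
    f = sock.makefile("rwb")
    json.loads(f.readline())    # consume the QMP greeting
    qmp(f, "qmp_capabilities")

    # Incremental backup (CBT): track writes to disk0 in a persistent
    # dirty bitmap that a later backup job can consume.
    qmp(f, "block-dirty-bitmap-add",
        {"node": "disk0", "name": "backup-bitmap", "persistent": True})

    # Thin snapshot: switch disk0 to a new qcow2 overlay, keeping the
    # current image as its read-only backing file.
    qmp(f, "blockdev-snapshot-sync",
        {"node-name": "disk0",
         "snapshot-file": "/var/lib/qsd/disk0-snap.qcow2",
         "snapshot-node-name": "disk0-snap",
         "format": "qcow2"})
```
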
Cons:
* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right
use-cases.
Questions and comments:
* Local restrictions: QEMU and qemu-storage-daemon must be running
on the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize,
creation, etc.)
* How to share a partitioned NVMe device (from one storage daemon)
with multiple pods?
  * See also: [kubevirt/kubevirt#3208][] (similar idea for
    vhost-user-net).
* We could do hotplugging under the hood with the storage daemon.
  * To expose a new PV, a new qemu-storage-daemon pod can be created
    with a corresponding PVC. Conversely, on unplug, the pod can be
    deleted. Ideally, we might have a 1:1 relationship between PVs
    and storage daemon pods (though 1:n for attaching multiple guests
    to a single daemon).
  * This requires that we can create a new unix socket connection
    from new storage daemon pods to the VMs. The exact way to achieve
    this is still to be figured out. According to Adam Litke, the
    naive way would require elevated privileges for both pods.
  * After having the socket (either the file or a file descriptor)
    available in the VM pod, QEMU can connect to it.
* In order to avoid a mix of block devices having a PVC in the VM pod
and others where we just pass the unix socket, we can completely
avoid the PVC case for the VM pod:
  * For exposing a PV to QEMU, we would always go through the storage
    daemon (i.e. the PVC moves from the VM pod to the storage daemon
    pod), so the VM pod always only gets a unix socket connection,
    unifying the two cases.
  * Using vhost-user-blk from the storage daemon pod performs the
    same as having a PVC directly in the VM pod (or potentially
    better, if it allows for polling that we wouldn’t have done
    otherwise), so while it looks like an indirection, the actual
    I/O path would be comparable.
  * This architecture would also allow using the native
    Gluster/Ceph/NBD/… block drivers in the QEMU process without
    making them special (because they wouldn’t use a PVC either),
    unifying even more cases.
  * Kubernetes has fairly low per-node Pod limits by default, so we
    may need to be careful about 1:1 Pod/PVC mapping. We may want to
    support aggregation of multiple storage connections into a
    single q-s-d Pod.
## Other topics
### Device Mapper
Another possibility is to leverage the Linux device-mapper to provide
features such as snapshots and even incremental backup.
For example, [dm-era][] seems to provide the basic primitives for
bitmap tracking.
This could be part of the first scenario (storage handled outside of
QEMU), or cascaded with other PVs somewhere else.
Is this already being used? For example, [cybozu-go/topolvm][] is a
CSI LVM Plugin for k8s.
### Stratis
[Stratis][] seems to be an interesting project to be leveraged in the
world of Kubernetes.
### vhost-user-blk in other CSI backends
Would it make sense for other CSI backends to implement support for
vhost-user-blk?
[CSI]:
https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]:
https://kubevirt.io/
[PVC resize blog]:
https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-k...
[Persistent Volumes]:
https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]:
https://spdk.io/
[Storage-Current]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage...
[Storage-Passthrough]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage...
[Storage-QSD]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage...
[Storage-Virtiofs]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage...
[Stratis]:
https://stratis-storage.github.io/
[cybozu-go/topolvm]:
https://github.com/cybozu-go/topolvm
[dm-era]:
https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]:
https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]:
https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_us...
[virtio-fs]:
https://virtio-fs.gitlab.io/