# KubeVirt and the KVM user space
This is the entry point to a series of documents which, together,
detail the current status of KubeVirt and how it interacts with the
KVM user space.
The intended audience is people who are familiar with the traditional
virtualization stack (QEMU plus libvirt): to make the material more
approachable to them, comparisons are included, and little to no
knowledge of KubeVirt or Kubernetes is assumed.
Each section contains a short summary as well as a link to a separate
document discussing the topic in more detail, with the intention that
readers will be able to quickly get a high-level understanding of the
various topics by reading this overview document and then dig further
into the specific areas they're interested in.
## Architecture
### Goals
* KubeVirt aims to feel completely native to Kubernetes users
* VMs should behave like containers whenever possible
* There should be no features that are limited to VMs when it would
make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional
virtualization can handle
* Windows support, device assignment etc. are all fair game
* When these two goals clash, integration with Kubernetes usually
wins
### Components
* KubeVirt is made up of various discrete components that interact
with Kubernetes and the KVM user space
* The overall design is somewhat similar to that of libvirt, except
with a much higher granularity and many of the tasks offloaded to
Kubernetes
* Some of the components run at the cluster level or host level
with very high privileges, while others run at the pod level with
significantly reduced privileges
Additional information: [Components][]
### Runtime environment
* QEMU expects its environment to be set up in advance, something
that is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has
root-level access to the system and can perform pretty much any
privileged operation
* In Kubernetes, the runtime environment is usually heavily locked
down and many privileged operations are not permitted (see the
sketch below)
* Requiring additional permissions for VMs goes against the goal,
mentioned earlier, of having VMs behave the same as containers
whenever possible
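To make "locked down" more concrete, here is a minimal sketch, using
the Go types from k8s.io/api/core/v1, of the kind of security context
a pod is commonly expected to run with: no root user, no privilege
escalation and no extra Linux capabilities. The exact settings
KubeVirt applies to its pods differ, so treat the values below as
illustrative assumptions.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func boolPtr(b bool) *bool { return &b }

func main() {
	// A typical locked-down security context: no root, no privilege
	// escalation, and all Linux capabilities dropped. Anything QEMU
	// needs beyond this has to be arranged differently than in a
	// traditional libvirtd deployment.
	sc := corev1.SecurityContext{
		RunAsNonRoot:             boolPtr(true),
		Privileged:               boolPtr(false),
		AllowPrivilegeEscalation: boolPtr(false),
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
	}
	fmt.Printf("%+v\n", sc)
}
```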
## Specific areas
### Hotplug
* QEMU supports hotplug (and hot-unplug) of most devices, and its use
is extremely common
* Conversely, resources associated with containers such as storage
volumes, network interfaces and CPU shares are allocated upfront
and do not change throughout the life of the workload
* If the container needs more (or fewer) resources, the Kubernetes
approach is to destroy the existing one and schedule a new one to
take over its role
Additional information: [Hotplug][]
### Storage
* Handled through the same Kubernetes APIs used for containers
* QEMU / libvirt only see an image file and don't have direct
access to the underlying storage implementation (see the sketch
below)
* This makes certain scenarios that are common in the
virtualization world very challenging: examples include hotplug
and full VM snapshots (storage plus memory)
* It might be possible to remove some of these limitations by
changing the way storage is exposed to QEMU, or even by taking
advantage of the storage technologies that QEMU already implements
and making them available to containers in addition to VMs
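As an illustration of the point that QEMU only sees a file, the
sketch below inspects a disk path from inside the pod and reports
whether it is a plain image file or a block device node; the path is
an assumption based on KubeVirt's conventional per-VM disk directory
and may differ in a real deployment.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	// Assumed path: KubeVirt conventionally exposes file-backed
	// volumes to QEMU under a per-VM directory inside the launcher
	// pod; adjust it to match your deployment.
	const disk = "/var/run/kubevirt-private/vmi-disks/rootdisk/disk.img"

	info, err := os.Stat(disk)
	if err != nil {
		fmt.Println("disk not found:", err)
		return
	}
	if info.Mode()&os.ModeDevice != 0 {
		fmt.Println(disk, "is a block device node")
	} else {
		fmt.Println(disk, "is a plain image file of", info.Size(), "bytes")
	}
}
```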
Additional information: [Storage][]
### Networking
* Application processes running in VMs are hidden behind a network
interface, as opposed to local sockets and processes running in a
separate user namespace
* Service meshes proxy and monitor applications by means of socket
redirection and classification based on local ports and process
identifiers, so KubeVirt needs to aim for generic compatibility
with them
* Existing solutions replicate a full TCP/IP stack to make
applications running in a QEMU instance appear local, which leaves
no opportunity for zero-copy transfers or for avoiding context
switches (a minimal sketch of this kind of user space forwarding
follows the list)
* Network connectivity is shared between the control plane and the
workload itself, so addressing and port mapping need particular
attention
* The Linux capabilities granted to the pod might be minimal, or
missing entirely, and live migration presents further challenges
in terms of network addressing and port mapping
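To make the proxying point concrete, here is a minimal sketch of the
kind of user space forwarding involved: a process that listens on a
local port and copies bytes to and from another address, standing in
for the VM's interface. Every byte traverses user space, which is why
zero-copy and context-switch avoidance are hard to achieve. The
addresses are arbitrary placeholders, not anything KubeVirt actually
uses.

```go
package main

import (
	"io"
	"log"
	"net"
)

// Placeholder addresses: listen on a pod-local port and forward to the
// address where the VM's service is reachable (assumption for the sketch).
const (
	listenAddr = "127.0.0.1:8080"
	vmAddr     = "192.168.122.10:8080"
)

func forward(client net.Conn) {
	defer client.Close()

	// Open a second connection towards the VM; every byte now has to be
	// copied from one socket to the other through this process.
	vm, err := net.Dial("tcp", vmAddr)
	if err != nil {
		log.Println("cannot reach VM:", err)
		return
	}
	defer vm.Close()

	// Copy in both directions until either side closes.
	go io.Copy(vm, client)
	io.Copy(client, vm)
}

func main() {
	ln, err := net.Listen("tcp", listenAddr)
	if err != nil {
		log.Fatal(err)
	}
	log.Println("forwarding", listenAddr, "->", vmAddr)
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go forward(conn)
	}
}
```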
Additional information: [Networking][]
### Live migration
* QEMU supports live migration between hosts, usually coordinated by
libvirt
* Kubernetes expects containers to be disposable, so the equivalent
of live migration would be to simply destroy the ones running on
the source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created
on the target node, then the VM is migrated from the old container,
running on the source node, to the newly created one (see the sketch
below)
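The sketch below is a highly simplified outline of that hybrid flow;
the helper functions are hypothetical placeholders standing in for
work that is really split across the Kubernetes scheduler, KubeVirt
components and libvirt, not an actual KubeVirt API.

```go
package main

import "log"

// The helpers below are stand-ins for the real steps, which involve the
// Kubernetes scheduler, virt-handler and libvirt rather than direct calls.

func createTargetPod(vm string) string {
	log.Println("scheduling a fresh launcher pod for", vm, "on the target node")
	return "target-pod"
}

func migrateVM(vm, targetPod string) error {
	log.Println("asking libvirt to live migrate", vm, "into", targetPod)
	return nil // the guest keeps running throughout
}

func deleteSourcePod(vm string) {
	log.Println("tearing down the old pod once", vm, "runs on the target")
}

func main() {
	vm := "example-vm"
	target := createTargetPod(vm) // new container first, on the target node
	if err := migrateVM(vm, target); err != nil {
		log.Fatal("migration failed, VM keeps running in the source pod: ", err)
	}
	deleteSourcePod(vm) // only the disposable container is destroyed
}
```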
Additional information: [Live migration][]
### CPU pinning
* CPU pinning is not handled by QEMU directly, but is instead
delegated to libvirt
* KubeVirt figures out which CPUs are assigned to the container after
it has been started by Kubernetes, and passes that information to
libvirt so that it can perform CPU pinning (see the sketch below)
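As a rough illustration of the discovery step, the sketch below reads
the cpuset limits that the container runtime has applied to the
current cgroup; the paths cover common cgroup v2 and v1 layouts, and
the parsing of a list such as `0-3,6` is kept deliberately simple.
This is an assumption about how such discovery can be done, not the
exact code KubeVirt uses.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpusetPaths lists common locations of the effective cpuset, for
// cgroup v2 and cgroup v1 respectively.
var cpusetPaths = []string{
	"/sys/fs/cgroup/cpuset.cpus.effective",
	"/sys/fs/cgroup/cpuset/cpuset.effective_cpus",
	"/sys/fs/cgroup/cpuset/cpuset.cpus",
}

// parseCPUSet expands a list like "0-3,6" into individual CPU numbers.
func parseCPUSet(s string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(s), ",") {
		if part == "" {
			continue
		}
		bounds := strings.SplitN(part, "-", 2)
		lo, err := strconv.Atoi(bounds[0])
		if err != nil {
			return nil, err
		}
		hi := lo
		if len(bounds) == 2 {
			if hi, err = strconv.Atoi(bounds[1]); err != nil {
				return nil, err
			}
		}
		for cpu := lo; cpu <= hi; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	for _, path := range cpusetPaths {
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		cpus, err := parseCPUSet(string(data))
		if err != nil {
			fmt.Println("cannot parse", path, ":", err)
			return
		}
		fmt.Println("CPUs available to this cgroup:", cpus)
		return
	}
	fmt.Println("no cpuset information found")
}
```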
Additional information: [CPU pinning][]
### NUMA pinning
* NUMA pinning is not handled by QEMU directly, but is instead
delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment
Additional information: [NUMA pinning][]
### Isolation
* For security reasons, it's a good idea to run each QEMU process in
an environment that is isolated from the host as well as other VMs
* This includes using a separate unprivileged user account, setting
up namespaces and cgroups, using SELinux, and so on (see the sketch
below)
* QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so
KubeVirt can rely on Kubernetes providing a similar level of
isolation without further intervention
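For a flavour of the primitives involved, the sketch below starts a
child process in fresh mount, PID and network namespaces and drops it
to an unprivileged user, which is roughly the kind of setup libvirt
or a container runtime performs on QEMU's behalf. It has to be run as
root on Linux, and the UID/GID values are arbitrary assumptions.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Run a shell in new mount, PID and network namespaces, as an
	// unprivileged user (UID/GID 1000 are placeholders). This mirrors,
	// in a very reduced form, the isolation a container runtime or
	// libvirt sets up around a QEMU process.
	cmd := exec.Command("/bin/sh", "-c", "id && ip link")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWNS |
			syscall.CLONE_NEWPID |
			syscall.CLONE_NEWNET,
		Credential: &syscall.Credential{Uid: 1000, Gid: 1000},
	}
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}
```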
Additional information: [Isolation][]
## Other tidbits
### Upgrades
* When libvirt is upgraded, running VMs keep using the old QEMU
binary: the new QEMU binary is used for newly-started VMs as well
as after VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and
QEMU remaining in use for running VMs
Additional information: [Upgrades][]
### Backpropagation
* Applications using libvirt usually don't provide all information,
e.g. a full PCI topology, and let libvirt fill in the blanks
* This might require a second step where the additional information
is collected and stored along with the original configuration (see
the sketch below)
* Backpropagation doesn't fit well in Kubernetes' declarative model,
so KubeVirt doesn't currently perform it
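As a toy illustration of the pattern (not KubeVirt or libvirt code),
the sketch below fills in a missing PCI address the way a lower layer
would, and the caller then has to read the completed configuration
back if it wants to store it.

```go
package main

import "fmt"

// Disk is a deliberately tiny stand-in for a device definition: the
// caller may leave PCIAddress empty and let a lower layer choose one.
type Disk struct {
	Name       string
	PCIAddress string
}

// fillDefaults plays the role of libvirt: it completes any field the
// caller left blank.
func fillDefaults(d *Disk) {
	if d.PCIAddress == "" {
		d.PCIAddress = "0000:00:05.0" // arbitrary placeholder address
	}
}

func main() {
	// The user only says "I want a disk"...
	d := Disk{Name: "rootdisk"}

	// ...the lower layer fills in the blanks...
	fillDefaults(&d)

	// ...and a second step is needed to propagate the completed
	// definition back into whatever the user originally stored.
	fmt.Printf("completed definition to store back: %+v\n", d)
}
```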
Additional information: [Backpropagation][]
## Contacts and credits
This information was collected and organized by many people at Red
Hat, some of whom have agreed to serve as points of contact for
follow-up discussion.
Additional information: [Contacts][]
[Backpropagation]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagatio...
[CPU pinning]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md