[libvirt PATCH 0/3] Actually do secure erase with explicit_bzero
by Daniel P. Berrangé
If we're going to have a virSecureErase function, we
might as well make it do secure erasure with the
explicit_bzero function currently available on FreeBSD/Linux.
While we're here, we should use it from the RPC code.
The remaining hole in the RPC code is xdr_free which
does not securely erase buffers. That's not easily
fixed without dropping the RPC impl in favour of a
custom one.
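For readers unfamiliar with the pattern, here is a minimal sketch of an
erase helper that prefers explicit_bzero and otherwise falls back to a
volatile store loop. It is not the code from this series: the names
demo_secure_erase and DEMO_HAVE_EXPLICIT_BZERO are hypothetical, and the
macro stands in for whatever symbol the build system check defines.

```c
#include <stddef.h>
#include <string.h>   /* explicit_bzero() on glibc 2.25 and newer */
#include <strings.h>  /* explicit_bzero() on FreeBSD */

/* Hypothetical helper for illustration; not libvirt's virSecureErase(). */
static void demo_secure_erase(void *ptr, size_t len)
{
    if (!ptr || len == 0)
        return;

#if defined(DEMO_HAVE_EXPLICIT_BZERO)
    /* Assumed to be defined by a build-time check when the libc
     * provides explicit_bzero(), which the compiler may not elide. */
    explicit_bzero(ptr, len);
#else
    /* Fallback: writing through a volatile pointer discourages the
     * compiler from optimizing the stores away. */
    volatile unsigned char *p = ptr;

    while (len--)
        *p++ = 0;
#endif
}
```

A plain memset() right before free() is what this guards against, since
the compiler is allowed to drop a store to memory that is never read again.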
Daniel P. Berrangé (3):
util: implement secure erase with explicit_bzero
rpc: fix buffer offset updates after decoding payload
rpc: securely erase the message buffers
meson.build | 1 +
src/rpc/virnetmessage.c | 4 +++-
src/util/virsecureerase.c | 6 ++++++
3 files changed, 10 insertions(+), 1 deletion(-)
--
2.38.1
1 year, 11 months
[PATCH] formatcaps: Update capabilities example
by Michal Privoznik
In the formatcaps.rst we give an example output of capabilities.
Well, there are a couple of issues with it:
1) We show <features/> nested under /capabilities/host/cpu.
There's no such element and never was.
2) The ordering of elements is corrupted.
3) There are plenty of elements missing.
Fix these by showing an actual output of 'virsh capabilities' as
obtained on my machine.
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
docs/formatcaps.rst | 230 +++++++++++++++++++++++++++++++++-----------
1 file changed, 176 insertions(+), 54 deletions(-)
diff --git a/docs/formatcaps.rst b/docs/formatcaps.rst
index 39b1fb78ac..f7e5342654 100644
--- a/docs/formatcaps.rst
+++ b/docs/formatcaps.rst
@@ -143,60 +143,182 @@ capabilities enabled in the chip and BIOS you will see:
::
- <capabilities>
- <host>
- <cpu>
- <arch>x86_64</arch>
- <features>
- <vmx/>
- </features>
- <model>core2duo</model>
- <vendor>Intel</vendor>
- <topology sockets="1" dies="1" cores="2" threads="1"/>
- <feature name="lahf_lm"/>
- <feature name='xtpr'/>
- <pages unit='KiB' size='4'/>
- <pages unit='KiB' size='2048'/>
- <pages unit='KiB' size='1048576'/>
- <microcode version='36'/>
- <maxphysaddr mode='emulate' bits='46'/>
- ...
- </cpu>
- <power_management>
- <suspend_mem/>
- <suspend_disk/>
- <suspend_hybrid/>
- </power_management>
- </host>
+ <capabilities>
- <!-- xen-3.0-x86_64 -->
- <guest>
- <os_type>xen</os_type>
- <arch name="x86_64">
- <wordsize>64</wordsize>
- <domain type="xen"></domain>
- <emulator>/usr/lib64/xen/bin/qemu-dm</emulator>
- </arch>
- <features>
- </features>
- </guest>
+ <host>
+ <uuid>7b55704c-29f4-11b2-a85c-9dc6ff50623f</uuid>
+ <cpu>
+ <arch>x86_64</arch>
+ <model>Skylake-Client-noTSX-IBRS</model>
+ <vendor>Intel</vendor>
+ <microcode version='236'/>
+ <signature family='6' model='142' stepping='12'/>
+ <counter name='tsc' frequency='2303997000' scaling='no'/>
+ <topology sockets='1' dies='1' cores='4' threads='2'/>
+ <maxphysaddr mode='emulate' bits='39'/>
+ <feature name='ds'/>
+ <feature name='acpi'/>
+ <feature name='ss'/>
+ <feature name='ht'/>
+ <feature name='tm'/>
+ <feature name='pbe'/>
+ <feature name='dtes64'/>
+ <feature name='monitor'/>
+ <feature name='ds_cpl'/>
+ <feature name='vmx'/>
+ <feature name='smx'/>
+ <feature name='est'/>
+ <feature name='tm2'/>
+ <feature name='xtpr'/>
+ <feature name='pdcm'/>
+ <feature name='osxsave'/>
+ <feature name='tsc_adjust'/>
+ <feature name='sgx'/>
+ <feature name='clflushopt'/>
+ <feature name='intel-pt'/>
+ <feature name='md-clear'/>
+ <feature name='stibp'/>
+ <feature name='arch-capabilities'/>
+ <feature name='ssbd'/>
+ <feature name='xsaves'/>
+ <feature name='sgx1'/>
+ <feature name='sgx-debug'/>
+ <feature name='sgx-mode64'/>
+ <feature name='sgx-provisionkey'/>
+ <feature name='sgx-tokenkey'/>
+ <feature name='pdpe1gb'/>
+ <feature name='invtsc'/>
+ <feature name='rdctl-no'/>
+ <feature name='ibrs-all'/>
+ <feature name='skip-l1dfl-vmentry'/>
+ <feature name='mds-no'/>
+ <feature name='tsx-ctrl'/>
+ <pages unit='KiB' size='4'/>
+ <pages unit='KiB' size='2048'/>
+ <pages unit='KiB' size='1048576'/>
+ </cpu>
+ <power_management>
+ <suspend_mem/>
+ </power_management>
+ <iommu support='yes'/>
+ <migration_features>
+ <live/>
+ <uri_transports>
+ <uri_transport>tcp</uri_transport>
+ <uri_transport>rdma</uri_transport>
+ </uri_transports>
+ </migration_features>
+ <topology>
+ <cells num='1'>
+ <cell id='0'>
+ <memory unit='KiB'>32498112</memory>
+ <pages unit='KiB' size='4'>6813808</pages>
+ <pages unit='KiB' size='2048'>2048</pages>
+ <pages unit='KiB' size='1048576'>1</pages>
+ <distances>
+ <sibling id='0' value='10'/>
+ </distances>
+ <cpus num='8'>
+ <cpu id='0' socket_id='0' die_id='0' core_id='0' siblings='0,4'/>
+ <cpu id='1' socket_id='0' die_id='0' core_id='1' siblings='1,5'/>
+ <cpu id='2' socket_id='0' die_id='0' core_id='2' siblings='2,6'/>
+ <cpu id='3' socket_id='0' die_id='0' core_id='3' siblings='3,7'/>
+ <cpu id='4' socket_id='0' die_id='0' core_id='0' siblings='0,4'/>
+ <cpu id='5' socket_id='0' die_id='0' core_id='1' siblings='1,5'/>
+ <cpu id='6' socket_id='0' die_id='0' core_id='2' siblings='2,6'/>
+ <cpu id='7' socket_id='0' die_id='0' core_id='3' siblings='3,7'/>
+ </cpus>
+ </cell>
+ </cells>
+ </topology>
+ <cache>
+ <bank id='0' level='3' type='both' size='8' unit='MiB' cpus='0-7'/>
+ </cache>
+ <secmodel>
+ <model>none</model>
+ <doi>0</doi>
+ </secmodel>
+ <secmodel>
+ <model>dac</model>
+ <doi>0</doi>
+ <baselabel type='kvm'>+77:+77</baselabel>
+ <baselabel type='qemu'>+77:+77</baselabel>
+ </secmodel>
+ </host>
- <!-- hvm-3.0-x86_32 -->
- <guest>
- <os_type>hvm</os_type>
- <arch name="i686">
- <wordsize>32</wordsize>
- <domain type="xen"></domain>
- <emulator>/usr/lib/xen/bin/qemu-dm</emulator>
- <machine>pc</machine>
- <machine>isapc</machine>
- <loader>/usr/lib/xen/boot/hvmloader</loader>
- </arch>
- <features>
- <cpuselection/>
- <deviceboot/>
- </features>
- </guest>
+ <guest>
+ <os_type>hvm</os_type>
+ <arch name='x86_64'>
+ <wordsize>64</wordsize>
+ <emulator>/usr/bin/qemu-system-x86_64</emulator>
+ <machine maxCpus='255'>pc-i440fx-7.1</machine>
+ <machine canonical='pc-i440fx-7.1' maxCpus='255'>pc</machine>
+ <machine maxCpus='288'>pc-q35-5.2</machine>
+ <machine maxCpus='255'>pc-i440fx-2.12</machine>
+ <machine maxCpus='255'>pc-i440fx-2.0</machine>
+ <machine maxCpus='255'>pc-i440fx-6.2</machine>
+ <machine maxCpus='288'>pc-q35-4.2</machine>
+ <machine maxCpus='255'>pc-i440fx-2.5</machine>
+ <machine maxCpus='255'>pc-i440fx-4.2</machine>
+ <machine maxCpus='255'>pc-i440fx-5.2</machine>
+ <machine maxCpus='255' deprecated='yes'>pc-i440fx-1.5</machine>
+ <machine maxCpus='255'>pc-q35-2.7</machine>
+ <machine maxCpus='288'>pc-q35-7.1</machine>
+ <machine canonical='pc-q35-7.1' maxCpus='288'>q35</machine>
+ <machine maxCpus='255'>pc-i440fx-2.2</machine>
+ <machine maxCpus='255'>pc-i440fx-2.7</machine>
+ <machine maxCpus='288'>pc-q35-6.1</machine>
+ <machine maxCpus='255'>pc-q35-2.4</machine>
+ <machine maxCpus='288'>pc-q35-2.10</machine>
+ <machine maxCpus='1'>x-remote</machine>
+ <machine maxCpus='288'>pc-q35-5.1</machine>
+ <machine maxCpus='255' deprecated='yes'>pc-i440fx-1.7</machine>
+ <machine maxCpus='288'>pc-q35-2.9</machine>
+ <machine maxCpus='255'>pc-i440fx-2.11</machine>
+ <machine maxCpus='288'>pc-q35-3.1</machine>
+ <machine maxCpus='255'>pc-i440fx-6.1</machine>
+ <machine maxCpus='288'>pc-q35-4.1</machine>
+ <machine maxCpus='255'>pc-i440fx-2.4</machine>
+ <machine maxCpus='255'>pc-i440fx-4.1</machine>
+ <machine maxCpus='255'>pc-i440fx-5.1</machine>
+ <machine maxCpus='255'>pc-i440fx-2.9</machine>
+ <machine maxCpus='1'>isapc</machine>
+ <machine maxCpus='255' deprecated='yes'>pc-i440fx-1.4</machine>
+ <machine maxCpus='255'>pc-q35-2.6</machine>
+ <machine maxCpus='255'>pc-i440fx-3.1</machine>
+ <machine maxCpus='288'>pc-q35-2.12</machine>
+ <machine maxCpus='288'>pc-q35-7.0</machine>
+ <machine maxCpus='255'>pc-i440fx-2.1</machine>
+ <machine maxCpus='288'>pc-q35-6.0</machine>
+ <machine maxCpus='255'>pc-i440fx-2.6</machine>
+ <machine maxCpus='288'>pc-q35-4.0.1</machine>
+ <machine maxCpus='255'>pc-i440fx-7.0</machine>
+ <machine maxCpus='255' deprecated='yes'>pc-i440fx-1.6</machine>
+ <machine maxCpus='288'>pc-q35-5.0</machine>
+ <machine maxCpus='288'>pc-q35-2.8</machine>
+ <machine maxCpus='255'>pc-i440fx-2.10</machine>
+ <machine maxCpus='288'>pc-q35-3.0</machine>
+ <machine maxCpus='255'>pc-i440fx-6.0</machine>
+ <machine maxCpus='288'>pc-q35-4.0</machine>
+ <machine maxCpus='288'>microvm</machine>
+ <machine maxCpus='255'>pc-i440fx-2.3</machine>
+ <machine maxCpus='255'>pc-i440fx-4.0</machine>
+ <machine maxCpus='255'>pc-i440fx-5.0</machine>
+ <machine maxCpus='255'>pc-i440fx-2.8</machine>
+ <machine maxCpus='288'>pc-q35-6.2</machine>
+ <machine maxCpus='255'>pc-q35-2.5</machine>
+ <machine maxCpus='255'>pc-i440fx-3.0</machine>
+ <machine maxCpus='288'>pc-q35-2.11</machine>
+ <domain type='qemu'/>
+ <domain type='kvm'/>
+ </arch>
+ <features>
+ <acpi default='on' toggle='yes'/>
+ <apic default='on' toggle='no'/>
+ <cpuselection/>
+ <deviceboot/>
+ <disksnapshot default='on' toggle='no'/>
+ </features>
+ </guest>
- ...
- </capabilities>
+ </capabilities>
--
2.37.4
1 year, 11 months
[PATCH] docs: Add missing elements to formatcaps.rst
by Nobuhiro MIKI
Signed-off-by: Nobuhiro MIKI <nmiki(a)yahoo-corp.jp>
---
docs/formatcaps.rst | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/docs/formatcaps.rst b/docs/formatcaps.rst
index 9d7932a6a8..39b1fb78ac 100644
--- a/docs/formatcaps.rst
+++ b/docs/formatcaps.rst
@@ -155,6 +155,11 @@ capabilities enabled in the chip and BIOS you will see:
<topology sockets="1" dies="1" cores="2" threads="1"/>
<feature name="lahf_lm"/>
<feature name='xtpr'/>
+ <pages unit='KiB' size='4'/>
+ <pages unit='KiB' size='2048'/>
+ <pages unit='KiB' size='1048576'/>
+ <microcode version='36'/>
+ <maxphysaddr mode='emulate' bits='46'/>
...
</cpu>
<power_management>
--
2.38.1
1 year, 11 months
[PATCH RFC 0/6] spec: Decompose the daemon subpackage
by Jim Fehlig
Currently it is not possible to install a modular daemon subpackage without
also installing the monolithic daemon:
https://listman.redhat.com/archives/libvir-list/2022-September/234554.html
This series is an initial attempt at moving common daemons, utilities, and
files from the daemon subpackage to a new daemon-core subpackage. The
monolithic and modular daemons can then depend on the new subpackage.
libvirt-guests is moved to a new libvirt-guests subpackage, which is
recommended by the daemon subpackage to provide a smoother upgrade.
I've likely overlooked several items, but before continuing down this
path too far I first wanted to gauge interest and see if this work is
worth pursuing. If so, any comments on the RFC are appreciated!
Note that patches 1-3 are things I noticed while working on the others
and could be pushed independently.
Jim Fehlig (6):
spec: Remove redundant with_libxl
spec: Use more %{name} macro
spec: Remove daemon postun trigger
spec: Move common daemons to a separate subpackage
spec: Move more files to the daemon-core subpackage
spec: Move libvirt-guests to guests subpackage
libvirt.spec.in | 406 ++++++++++++++++++++++++++----------------------
1 file changed, 219 insertions(+), 187 deletions(-)
--
2.37.3
1 year, 11 months
[libvirt PATCH] tools: Fix style issues in virt-qemu-sev-validate
by Andrea Bolognani
The script had an incorrect interpreter line until commit
f6a19d7264bb, so the flake8 check would not realize it needed
to pick it up and these issues, some of which were present in
the very first version that was committed, were not being
reported.
Signed-off-by: Andrea Bolognani <abologna(a)redhat.com>
---
tools/virt-qemu-sev-validate | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/tools/virt-qemu-sev-validate b/tools/virt-qemu-sev-validate
index 46a92aa7a0..3d8b292fef 100755
--- a/tools/virt-qemu-sev-validate
+++ b/tools/virt-qemu-sev-validate
@@ -849,7 +849,7 @@ class ConfidentialVM(abc.ABC):
secret64 = b64encode(secret_table_ciphertext).decode('utf8')
log.debug("Header: %s (%d bytes)", header64, len(header))
log.debug("Secret: %s (%d bytes)",
- secret64, len(secret_table_ciphertext))
+ secret64, len(secret_table_ciphertext))
return header64, secret64
@@ -955,7 +955,7 @@ class LibvirtConfidentialVM(ConfidentialVM):
self.dom = self.conn.lookupByName(id_name_uuid)
log.debug("VM: id=%d name=%s uuid=%s",
- self.dom.ID(), self.dom.name(), self.dom.UUIDString())
+ self.dom.ID(), self.dom.name(), self.dom.UUIDString())
if not self.dom.isActive():
raise InvalidStateException(
@@ -1331,5 +1331,6 @@ def main():
print("ERROR: %s" % e, file=sys.stderr)
sys.exit(6)
+
if __name__ == "__main__":
main()
--
2.38.1
1 year, 11 months
[PATCH] virnetdevtap.c: Disallow pre-existing TAP devices
by Michal Privoznik
When starting a guest with <interface/> which has the target
device name set (i.e. not generated by us), it may happen that
the TAP device already exists. This then may lead to all sorts of
problems. For instance: for <interface type='network'/> the TAP
device is plugged into the network's bridge, but since the TAP
device is persistent it remains plugged there even after the
guest is shut off. We don't have code that unplugs TAP devices
from the bridge because TAP devices we create are transient, i.e.
are removed automatically when QEMU closes their FD.
The only exception is <interface type='ethernet'/> with <target
managed='no'/> where we specifically want to let users use
pre-created TAP device and basically not touch it at all.
There's another reason for refusing to use pre-created TAP
devices: if we ever have a bug in TAP name generation, we may
re-use a TAP device from another domain.
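The check described above can be sketched in isolation as follows. This
is only an illustration: demo_tap_precheck and DEMO_TAP_ALLOW_EXISTING
are hypothetical names, and the POSIX if_nametoindex() call is used here
as a stand-in for the virNetDevExists() helper that the patch calls.

```c
#include <net/if.h>
#include <stdio.h>

/* Hypothetical flag mirroring the role of
 * VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING in the patch. */
enum { DEMO_TAP_ALLOW_EXISTING = 1 << 4 };

/* Returns 0 if creation may proceed, -1 if the name is already taken
 * and the caller did not opt in to re-using an existing device. */
static int demo_tap_precheck(const char *ifname, unsigned int flags)
{
    if (flags & DEMO_TAP_ALLOW_EXISTING)
        return 0;

    /* if_nametoindex() returns 0 when no interface has this name. */
    if (if_nametoindex(ifname) != 0) {
        fprintf(stderr, "The %s interface already exists\n", ifname);
        return -1;
    }

    return 0;
}
```

Only the caller that handles <interface type='ethernet'/> would pass the
flag; every other caller gets the strict behaviour by default.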
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2144738
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
src/qemu/qemu_interface.c | 2 ++
src/util/virnetdevtap.c | 31 ++++++++++++++++++++++++++++++-
src/util/virnetdevtap.h | 2 ++
3 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_interface.c b/src/qemu/qemu_interface.c
index 4cc76e07a5..264d5e060c 100644
--- a/src/qemu/qemu_interface.c
+++ b/src/qemu/qemu_interface.c
@@ -461,6 +461,8 @@ qemuInterfaceEthernetConnect(virDomainDef *def,
if (!net->ifname)
template_ifname = true;
+ tap_create_flags |= VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING;
+
if (virNetDevTapCreate(&net->ifname, tunpath, tapfd, tapfdSize,
tap_create_flags) < 0) {
goto cleanup;
diff --git a/src/util/virnetdevtap.c b/src/util/virnetdevtap.c
index 112a1e8b99..406339c583 100644
--- a/src/util/virnetdevtap.c
+++ b/src/util/virnetdevtap.c
@@ -148,12 +148,15 @@ virNetDevTapGetRealDeviceName(char *ifname G_GNUC_UNUSED)
* @tunpath: path to the tun device (if NULL, /dev/net/tun is used)
* @tapfds: array of file descriptors return value for the new tap device
* @tapfdSize: number of file descriptors in @tapfd
- * @flags: OR of virNetDevTapCreateFlags. Only one flag is recognized:
+ * @flags: OR of virNetDevTapCreateFlags. Only the following flags are
+ * recognized:
*
* VIR_NETDEV_TAP_CREATE_VNET_HDR
* - Enable IFF_VNET_HDR on the tap device
* VIR_NETDEV_TAP_CREATE_PERSIST
* - The device will persist after the file descriptor is closed
+ * VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING
+ * - The device creation does not fail if @ifname already exists
*
* Creates a tap interface. The caller must use virNetDevTapDelete to
* remove a persistent TAP device when it is no longer needed. In case
@@ -182,6 +185,19 @@ int virNetDevTapCreate(char **ifname,
if (virNetDevGenerateName(ifname, VIR_NET_DEV_GEN_NAME_VNET) < 0)
return -1;
+ if (!(flags & VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING)) {
+ int rc = virNetDevExists(*ifname);
+
+ if (rc < 0) {
+ return -1;
+ } else if (rc > 0) {
+ virReportError(VIR_ERR_OPERATION_INVALID,
+ _("The %s interface already exists"),
+ *ifname);
+ return -1;
+ }
+ }
+
if (!tunpath)
tunpath = "/dev/net/tun";
@@ -319,6 +335,19 @@ int virNetDevTapCreate(char **ifname,
if (virNetDevGenerateName(ifname, VIR_NET_DEV_GEN_NAME_VNET) < 0)
return -1;
+ if (!(flags & VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING)) {
+ int rc = virNetDevExists(*ifname);
+
+ if (rc < 0) {
+ return -1;
+ } else if (rc > 0) {
+ virReportError(VIR_ERR_OPERATION_INVALID,
+ _("The %s interface already exists"),
+ *ifname);
+ return -1;
+ }
+ }
+
/* As FreeBSD determines interface type by name,
* we have to create 'tap' interface first and
* then rename it to 'vnet'
diff --git a/src/util/virnetdevtap.h b/src/util/virnetdevtap.h
index 197ea10f94..c9d29c0384 100644
--- a/src/util/virnetdevtap.h
+++ b/src/util/virnetdevtap.h
@@ -56,6 +56,8 @@ typedef enum {
VIR_NETDEV_TAP_CREATE_USE_MAC_FOR_BRIDGE = 1 << 2,
/* The device will persist after the file descriptor is closed */
VIR_NETDEV_TAP_CREATE_PERSIST = 1 << 3,
+ /* The device is allowed to exist before creation */
+ VIR_NETDEV_TAP_CREATE_ALLOW_EXISTING = 1 << 4,
} virNetDevTapCreateFlags;
int
--
2.37.4
1 year, 11 months
[libvirt PATCH 00/21] meson: remove many obsolete checks
by Daniel P. Berrangé
We have a lot of checks for Linux kernel features that are obsolete since
our supported platform matrix lets us assume new enough kernel versions.
Removing the checks will speed up the meson phase and reduce the tangle
of #ifdefs in the code.
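To illustrate the kind of simplification this enables, here is a
hypothetical before/after pair for EPOLL_CLOEXEC. It is not taken from
the libvirt tree: the function names and the DEMO_HAVE_EPOLL_CLOEXEC
macro are invented for the example, and it is Linux-only.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* "Before": behaviour depends on a (hypothetical) build-time macro. */
static int demo_epoll_open_guarded(void)
{
#ifdef DEMO_HAVE_EPOLL_CLOEXEC
    return epoll_create1(EPOLL_CLOEXEC);
#else
    /* Older-kernel fallback: set close-on-exec in a second step. */
    int fd = epoll_create(1);

    if (fd >= 0 && fcntl(fd, F_SETFD, FD_CLOEXEC) < 0) {
        close(fd);
        return -1;
    }
    return fd;
#endif
}

/* "After": the feature is assumed, so the #ifdef and the fallback go. */
static int demo_epoll_open(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}
```

Both variants return a close-on-exec descriptor; the second one simply
has no configure-time input and no dead branch to maintain.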
I thought I could remove the check for linux/kvm.h but for some reason
our code in virhostcpu.c / cpu_x86.c is enabling the codebases even
on FreeBSD, which I find kind of odd. Does FreeBSD really ship a
linux/kvm.h?
Daniel P. Berrangé (21):
meson: remove obsolete check for LOOP_CTL_GET_FREE
meson: remove obsolete check for EPOLL_CLOEXEC
meson: remove obsolete check for LO_FLAGS_AUTOCLEAR
meson: drop check for unshare()
netdev: simplify check for ethtool functionality
meson: remove obsolete check for ETHTOOL_GGSO
meson: remove obsolete check for ETHTOOL_GGRO
meson: remove obsolete check for ETHTOOL_GFLAGS
meson: remove obsolete check for ETH_FLAG_LRO
meson: remove obsolete check for ETH_FLAG_TXVLAN/RXVLAN
meson: remove obsolete check for ETH_FLAG_NTUPLE
meson: remove obsolete check for ETH_FLAG_RXHASH
meson: remove obsolete check for ETHTOOL_GFEATURES
meson: remove obsolete check for ETHTOOL_GCOALESCE
meson: remove obsolete check for GET_VLAN_VID_CMD
meson: simplify check for virnetdevbridge.c headers
meson: remove obsolete check for DEVLINK_CMD_ESWITCH_GET
meson: remove obsolete check for linux/magic.h
meson: remove obsolete check for VHOST_VSOCK_SET_GUEST_CID
meson: remove obsolete check for BPF_PROG_QUERY
meson: remove obsolete check for BPF_CGROUP_DEVICE
meson.build | 82 +++--------------------------------
src/util/virbpf.c | 6 +--
src/util/virbpf.h | 6 +--
src/util/vircgroupv2devices.c | 10 ++---
src/util/virfile.c | 15 ++-----
src/util/virnetdev.c | 65 ++++-----------------------
src/util/virvsock.c | 4 +-
tests/securityselinuxhelper.c | 4 +-
tests/virfilemock.c | 2 +-
9 files changed, 32 insertions(+), 162 deletions(-)
--
2.38.1
1 year, 11 months
[libvirt PATCH 0/2] virt-qemu-sev-validate: A couple of small fixes
by Andrea Bolognani
Andrea Bolognani (2):
docs: Fix typo in virt-qemu-sev-validate(1)
tools: Fix interpreter for virt-qemu-sev-validate
docs/manpages/virt-qemu-sev-validate.rst | 2 +-
tools/virt-qemu-sev-validate | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
--
2.38.1
1 year, 11 months
RFC: New APIs for delegation of privileged operations
by Andrea Bolognani
Hi,
this is a proposal for introducing a new family of APIs in libvirt,
with the goal of improving integration with management applications.
KubeVirt is intended to be the primary consumer of these APIs.
Background
----------
KubeVirt makes it possible to run VMs on a Kubernetes cluster, side
by side with containers.
It does so by running QEMU and libvirtd themselves inside a
container. The architecture is explained in more detail at
https://kubevirt.io/user-guide/architecture/
but for the purpose of this discussion we only need to keep in mind
two components:
* virt-launcher
- runs in the same container as QEMU and libvirtd
- one instance per VM
* virt-handler
- runs in a separate container
- one instance per node
Conceptually, these two components roughly map to QEMU processes and
libvirtd respectively.
From a security perspective, there is a strong push in Kubernetes to
run workloads under unprivileged user accounts and without additional
capabilities. Again, this is similar to how libvirtd itself runs as
root but the QEMU processes it starts are under the unprivileged
"qemu" account.
KubeVirt has been working towards the goal of running VMs as
completely unprivileged workloads and made excellent progress so far.
Some of the operations needed for running a VM, however, inherently
require elevated privilege. In KubeVirt, the conundrum is solved by
having virt-handler (a privileged component) take care of those
operations, making it possible for virt-launcher (as well as QEMU and
libvirtd) to run in an unprivileged context.
Examples
--------
Here are a few examples of how KubeVirt has been able to reduce the
privilege required by virt-launcher by selectively handing over
responsibilities to virt-handler:
* Remove SYS_RESOURCE capability from launcher pod
https://github.com/kubevirt/kubevirt/pull/2584
* Drop SYS_RESOURCE capability
https://github.com/kubevirt/kubevirt/pull/5558
* Housekeeping cgroup
https://github.com/kubevirt/kubevirt/pull/8233
* Real time VMs fail to change vCPU scheduler and priority in
non-root deployments
https://github.com/kubevirt/kubevirt/pull/8750
* virt-launcher: Drop SYS_PTRACE capability
https://github.com/kubevirt/kubevirt/pull/8842
The pattern we can see is that, initially, libvirt just assumes that
it can perform a certain privileged operation. This fails in the
context of KubeVirt, where libvirtd runs with significantly reduced
privileges. As a consequence, libvirt is patched to be more resilient
to such lack of privilege: for example, instead of attempting to
create a file and erroring out due to lack of permissions, it will
instead first check whether the file already exists and, if it does,
assume that it has been prepared ahead of time by an external entity.
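As a rough sketch of that "check first, then create" pattern (the
function name demo_ensure_file is hypothetical and this is not code
from libvirt):

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 0 if the file exists afterwards, -1 otherwise. */
static int demo_ensure_file(const char *path)
{
    int fd;

    /* Already there: assume an external, privileged entity prepared
     * it ahead of time and leave it alone. */
    if (access(path, F_OK) == 0)
        return 0;

    /* Not there: try to create it ourselves, as before. */
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno == EEXIST ? 0 : -1;  /* lost a race: still fine */

    close(fd);
    return 0;
}
```

Note that nothing here tells the caller who created the file or whether
its contents are what is expected, which is the implicit handover
discussed in the next section.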
Limitations
-----------
This approach works fine, but only for the privileged operations that
would be performed by libvirt before the VM starts running.
Looking at the "housekeeping cgroup" PR in particular, we notice that
the VM is initially created in paused state: this is necessary in
order to create a point in time in which all the VM threads already
exist but, crucially, none of the vCPUs have started running yet. This
is the only opportunity to move threads across cgroups without
invalidating the expectations of a real time workload.
When it comes to live migration, however, there is no way to create
similar conditions, since the VM is running on the destination host
right out of the gate. As a consequence, live migration has to be
blocked when the housekeeping cgroup is in use, which is an
unfortunate limitation.
Moreover, there's an overall sense of fragility surrounding these
interactions: both KubeVirt and, to some extent, libvirt need to be
acutely aware of what the other component is going to do, but there
is never an explicit handover and the whole thing only works if you
just so happen to do everything with the exact right order and
timing.
Proposal
--------
In order to address the issues outlined above, I propose that we
introduce a new set of APIs in libvirt.
These APIs would expose some of the inner workings of libvirt, and
as such would come with *massively reduced* stability guarantees
compared to the rest of our public API.
The idea is that applications such as KubeVirt, which track libvirt
fairly closely and stay pinned to specific versions, would be able to
adapt to changes in these APIs relatively painlessly. More
traditional management applications such as virt-manager would simply
not opt into using the new APIs and maintain the status quo.
Using memlock as an example, the new API could look like
typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
unsigned long long bytes);
int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
virInternalSetMaxMemLockHandler handler);
The application-provided handler would be responsible for performing
the privileged operation (in this case raising the memlock limit for
a process). For KubeVirt, virt-launcher would have to pass the baton
to virt-handler.
If such a handler is installed, libvirt would invoke it (and likely
go through some sanity checks afterwards); if not, it would attempt
to perform the privileged operation itself, as it does today.
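A self-contained sketch of that dispatch logic, with no libvirt types
involved, could look like the following. Everything prefixed demo_ is
hypothetical: struct demo_conn stands in for the connection object,
demo_set_memlock_directly for the existing in-process code path, and
demo_recording_handler for what an application would supply.

```c
#include <stddef.h>
#include <sys/types.h>

typedef int (*demo_memlock_handler)(pid_t pid, unsigned long long bytes);

struct demo_conn {
    demo_memlock_handler memlock_handler;  /* NULL: none installed */
};

static int demo_direct_calls;
static int demo_handler_calls;
static pid_t demo_last_pid;
static unsigned long long demo_last_bytes;

/* Registration entry point, analogous to the proposed API. */
static int demo_install_memlock_handler(struct demo_conn *conn,
                                        demo_memlock_handler handler)
{
    if (!conn)
        return -1;
    conn->memlock_handler = handler;
    return 0;
}

/* Stand-in for today's behaviour: do the privileged call ourselves. */
static int demo_set_memlock_directly(pid_t pid, unsigned long long bytes)
{
    (void)pid;
    (void)bytes;
    demo_direct_calls++;
    return 0;
}

/* Example application handler: just records what it was asked to do. */
static int demo_recording_handler(pid_t pid, unsigned long long bytes)
{
    demo_handler_calls++;
    demo_last_pid = pid;
    demo_last_bytes = bytes;
    return 0;
}

/* Internal call site: delegate if a handler exists, else fall back. */
static int demo_set_max_memlock(struct demo_conn *conn, pid_t pid,
                                unsigned long long bytes)
{
    if (conn->memlock_handler)
        return conn->memlock_handler(pid, bytes);
    return demo_set_memlock_directly(pid, bytes);
}
```

The sanity checks mentioned above would go right after the handler
returns, for example re-reading the limit to confirm it was raised.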
This would make the interaction between libvirt and the management
application explicit rather than implicit. Not having to stick to our
usual API stability guarantees would make it possible to be more
liberal in exposing the internals of libvirt as interaction points.
Scope
-----
I think we should initially limit the new APIs to the scenarios that
have already been identified, then gradually expand the scope as
needed. In other words, we shouldn't comb through the codebase
looking for potential adopters.
Since the intended consumers of these APIs are those that can
adopt a new libvirt release fairly quickly, this shouldn't be a
problem.
Once the pattern has been established, we might consider introducing
support for it at the same time as a new feature that might benefit
from it is added.
Caveats
-------
libvirt is all about stable API, so introducing an API that is
unstable *by design* is completely uncharted territory.
To ensure that the new APIs are 100% opt-in, we could define them in
a separate <libvirt/libvirt-internal.h> header. Furthermore, we could
have a separate libvirt-internal.so shared library for the symbols
and a corresponding libvirt-internal.pc pkg-config file. We could
even go as far as requiring a preprocessor symbol such as
VIR_INTERNAL_UNSTABLE_API_OPT_IN
to be defined before the entry points are visible to the compiler.
Whatever the mechanism, we would need to make sure that it's usable
from language bindings as well.
Internal APIs are liable not only to come and go, but also to change
semantics between versions. We should make sure that such changes are
clearly exposed to the user, for example by requiring them to pass a
version number to the function and erroring out immediately if the
value doesn't match our expectations. KubeVirt has a massive suite of
functional tests, so this kind of change would immediately be spotted
when a new version of libvirt is imported, with no risk of an
incompatibility lingering in the codebase until it affects users.
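To make the two opt-in mechanisms concrete, here is a small sketch
combining the preprocessor guard with a version check. The
VIR_INTERNAL_UNSTABLE_API_OPT_IN symbol is the one suggested above;
DEMO_INTERNAL_API_VERSION and demo_internal_api_init are hypothetical
names, and the header and application sides are shown in one file only
to keep the example self-contained.

```c
#include <stdio.h>

/* Application side: must be defined before including the header. */
#define VIR_INTERNAL_UNSTABLE_API_OPT_IN

/* Header side: refuse to expose anything without the opt-in. */
#ifndef VIR_INTERNAL_UNSTABLE_API_OPT_IN
# error "define VIR_INTERNAL_UNSTABLE_API_OPT_IN to use unstable APIs"
#endif

/* Bumped whenever an internal entry point appears, disappears or
 * changes semantics (hypothetical value). */
#define DEMO_INTERNAL_API_VERSION 2U

/* Library side: error out immediately on any version mismatch. */
static int demo_internal_api_init(unsigned int caller_version)
{
    if (caller_version != DEMO_INTERNAL_API_VERSION) {
        fprintf(stderr,
                "internal API version mismatch: caller=%u library=%u\n",
                caller_version, DEMO_INTERNAL_API_VERSION);
        return -1;
    }
    return 0;
}
```

An application would hard-code the version it was written against rather
than pass the library's own macro back, so that importing a newer libvirt
fails loudly at initialization time.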
Disclaimer
----------
This proposal is intentionally vague on several of the details.
Before attempting to nail those down, I want to gather feedback on
the high-level idea, both from the libvirt and KubeVirt side.
Credits
-------
Thanks to Michal and Martin for helping shape and polish the idea
from its initial rough state.
--
Andrea Bolognani / Red Hat / Virtualization
1 year, 11 months