[libvirt PATCH 0/5] Add support for vDPA block devices

see https://bugzilla.redhat.com/show_bug.cgi?id=1900770.

Jonathon Jongsma (5):
  conf: add ability to configure a vdpa block disk device
  qemu: add virtio-blk-vhost-vdpa capability
  qemu: make vdpa connect function more generic
  qemu: consider vdpa block devices for memlock limits
  qemu: Implement support for vDPA block devices

 docs/formatdomain.rst                      | 19 ++++++++-
 src/ch/ch_monitor.c                        |  1 +
 src/conf/domain_conf.c                     |  7 ++++
 src/conf/schemas/domaincommon.rng          | 13 +++++++
 src/conf/storage_source_conf.c             |  6 ++-
 src/conf/storage_source_conf.h             |  1 +
 src/libxl/xen_xl.c                         |  1 +
 src/qemu/qemu_block.c                      | 20 ++++++++++
 src/qemu/qemu_capabilities.c               |  2 +
 src/qemu/qemu_capabilities.h               |  1 +
 src/qemu/qemu_command.c                    | 24 +++++++++++-
 src/qemu/qemu_command.h                    |  1 +
 src/qemu/qemu_domain.c                     | 37 +++++++++++++++++-
 src/qemu/qemu_interface.c                  | 23 -----------
 src/qemu/qemu_interface.h                  |  2 -
 src/qemu/qemu_migration.c                  |  2 +
 src/qemu/qemu_snapshot.c                   |  4 ++
 src/qemu/qemu_validate.c                   | 45 +++++++++++++++++++---
 src/storage_file/storage_source.c          |  1 +
 tests/qemuhotplugmock.c                    |  4 +-
 tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++
 tests/qemuxml2argvdata/disk-vhostvdpa.xml  | 21 ++++++++++
 tests/qemuxml2argvmock.c                   |  2 +-
 tests/qemuxml2argvtest.c                   |  2 +
 24 files changed, 235 insertions(+), 39 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml

-- 
2.40.1

vDPA block devices can be configured as follows:

  <disk type='vhostvdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </disk>

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 docs/formatdomain.rst             | 19 +++++++++++++++++--
 src/ch/ch_monitor.c               |  1 +
 src/conf/domain_conf.c            |  7 +++++++
 src/conf/schemas/domaincommon.rng | 13 +++++++++++++
 src/conf/storage_source_conf.c    |  6 +++++-
 src/conf/storage_source_conf.h    |  1 +
 src/libxl/xen_xl.c                |  1 +
 src/qemu/qemu_block.c             |  6 ++++++
 src/qemu/qemu_command.c           |  1 +
 src/qemu/qemu_migration.c         |  2 ++
 src/qemu/qemu_snapshot.c          |  4 ++++
 src/qemu/qemu_validate.c          |  1 +
 src/storage_file/storage_source.c |  1 +
 13 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst
index c3526439bf..778c04506c 100644
--- a/docs/formatdomain.rst
+++ b/docs/formatdomain.rst
@@ -2666,6 +2666,11 @@ paravirtualized driver is specified via the ``disk`` element.
       </source>
       <target dev='vdf' bus='virtio'/>
     </disk>
+    <disk type='vhostvdpa' device='disk'>
+      <driver name='qemu' type='raw'/>
+      <source dev='/dev/vhost-vdpa-0' />
+      <target dev='vdg' bus='virtio'/>
+    </disk>
   </devices>
   ...

@@ -2676,8 +2681,9 @@ paravirtualized driver is specified via the ``disk`` element.
 ``type``
    Valid values are "file", "block", "dir" ( :since:`since 0.7.5` ),
    "network" ( :since:`since 0.8.7` ), or "volume" ( :since:`since 1.0.5` ),
-   or "nvme" ( :since:`since 6.0.0` ), or "vhostuser" ( :since:`since 7.1.0` )
-   and refer to the underlying source for the disk. :since:`Since 0.0.3`
+   or "nvme" ( :since:`since 6.0.0` ), or "vhostuser" ( :since:`since 7.1.0` ),
+   or "vhostvdpa" ( :since:`since 9.5.0 (QEMU 8.1.0)`) and refer to the
+   underlying source for the disk. :since:`Since 0.0.3`
 ``device``
    Indicates how the disk is to be exposed to the guest OS. Possible values
    for this attribute are "floppy", "disk", "cdrom", and "lun", defaulting to
@@ -2855,6 +2861,15 @@ paravirtualized driver is specified via the ``disk`` element.
    ``<disk>`` XML for this disk type. Additionally features such as blockjobs,
    incremental backups and snapshots are not supported for this disk type.

+``vhostvdpa``
+   Enables the hypervisor to connect to a vDPA block device. Requires shared
+   memory configured for the VM, for more details see ``access`` mode for
+   ``memoryBacking`` element (See `Memory Backing`_).
+
+   The ``source`` element has a mandatory attribute ``dev`` that specifies
+   the fully-qualified path to the vhost-vdpa character device (e.g.
+   ``/dev/vhost-vdpa-0``).
+
 With "file", "block", and "volume", one or more optional sub-elements
 ``seclabel`` (See `Security label`_) can be used to override the domain
 security labeling policy for just that source file.
diff --git a/src/ch/ch_monitor.c b/src/ch/ch_monitor.c
index d902bc6959..607c969e67 100644
--- a/src/ch/ch_monitor.c
+++ b/src/ch/ch_monitor.c
@@ -197,6 +197,7 @@ virCHMonitorBuildDiskJson(virJSONValue *disks, virDomainDiskDef *diskdef)
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_LAST:
     default:
         virReportEnumRangeError(virStorageType, diskdef->src->type);
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index 0edb1bda9d..cd9d895292 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -7511,6 +7511,9 @@ virDomainStorageSourceParse(xmlNodePtr node,
         if (virDomainDiskSourceVHostUserParse(node, src, xmlopt, ctxt) < 0)
             return -1;
         break;
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
+        src->path = virXMLPropString(node, "dev");
+        break;
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
@@ -22290,6 +22293,10 @@ virDomainDiskSourceFormat(virBuffer *buf,
         virDomainDiskSourceVhostuserFormat(&attrBuf, &childBuf, src->vhostuser);
         break;

+    case VIR_STORAGE_TYPE_VHOST_VDPA:
+        virBufferEscapeString(&attrBuf, " dev='%s'", src->path);
+        break;
+
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
diff --git a/src/conf/schemas/domaincommon.rng b/src/conf/schemas/domaincommon.rng
index c1725bb511..ac33c442c3 100644
--- a/src/conf/schemas/domaincommon.rng
+++ b/src/conf/schemas/domaincommon.rng
@@ -1824,6 +1824,7 @@
         <ref name="diskSourceVolume"/>
         <ref name="diskSourceNvme"/>
         <ref name="diskSourceVhostUser"/>
+        <ref name="diskSourceVhostVdpa"/>
       </choice>
     </define>

@@ -2394,6 +2395,18 @@
     </element>
   </define>

+  <define name="diskSourceVhostVdpa">
+    <attribute name="type">
+      <value>vhostvdpa</value>
+    </attribute>
+    <element name="source">
+      <attribute name="dev">
+        <ref name="absFilePath"/>
+      </attribute>
+      <empty/>
+    </element>
+  </define>
+
   <define name="diskTargetDev">
     <data type="string">
       <param name="pattern">(ioemu:)?(fd|hd|sd|vd|xvd|ubd)[a-zA-Z0-9_]+</param>
diff --git a/src/conf/storage_source_conf.c b/src/conf/storage_source_conf.c
index 99061e4df7..0953770a52 100644
--- a/src/conf/storage_source_conf.c
+++ b/src/conf/storage_source_conf.c
@@ -47,7 +47,8 @@ VIR_ENUM_IMPL(virStorage,
               "network",
               "volume",
               "nvme",
-              "vhostuser"
+              "vhostuser",
+              "vhostvdpa"
 );

@@ -957,6 +958,7 @@ virStorageSourceIsSameLocation(virStorageSource *a,
         break;

     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_FILE:
     case VIR_STORAGE_TYPE_BLOCK:
@@ -1053,6 +1055,7 @@ virStorageSourceIsLocalStorage(const virStorageSource *src)
          * Therefore, we have to return false here. */
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_LAST:
     case VIR_STORAGE_TYPE_NONE:
         return false;
@@ -1245,6 +1248,7 @@ virStorageSourceIsRelative(virStorageSource *src)
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         return false;
diff --git a/src/conf/storage_source_conf.h b/src/conf/storage_source_conf.h
index c6187dda59..f6cf1c09d6 100644
--- a/src/conf/storage_source_conf.h
+++ b/src/conf/storage_source_conf.h
@@ -43,6 +43,7 @@ typedef enum {
     VIR_STORAGE_TYPE_VOLUME,
     VIR_STORAGE_TYPE_NVME,
     VIR_STORAGE_TYPE_VHOST_USER,
+    VIR_STORAGE_TYPE_VHOST_VDPA,

     VIR_STORAGE_TYPE_LAST
 } virStorageType;
diff --git a/src/libxl/xen_xl.c b/src/libxl/xen_xl.c
index 77f9f112f0..cb22425ebf 100644
--- a/src/libxl/xen_xl.c
+++ b/src/libxl/xen_xl.c
@@ -1523,6 +1523,7 @@ xenFormatXLDiskSrc(virStorageSource *src, char **srcstr)
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         break;
diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 8b2159f845..35312fb7d4 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -873,6 +873,11 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
             return NULL;
         break;

+    case VIR_STORAGE_TYPE_VHOST_VDPA:
+        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                       _("vhostvdpa disk type not yet supported"));
+        return NULL;
+
     case VIR_STORAGE_TYPE_VHOST_USER:
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                        _("unable to create blockdev props for vhostuser disk type"));
@@ -2321,6 +2326,7 @@ qemuBlockStorageSourceCreateGetStorageProps(virStorageSource *src,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
         return 0;

     case VIR_STORAGE_TYPE_NONE:
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index a19902988c..22deac2a9b 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -1636,6 +1636,7 @@ qemuBuildDriveSourceStr(virDomainDiskDef *disk,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         break;
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index ed41a03851..a610f230e2 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -347,6 +347,7 @@ qemuMigrationDstPrecreateDisk(virConnectPtr *conn,
     case VIR_STORAGE_TYPE_DIR:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
@@ -1626,6 +1627,7 @@ qemuMigrationSrcIsSafe(virDomainDef *def,
         break;

     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_BLOCK:
     case VIR_STORAGE_TYPE_DIR:
diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c
index 91de8b0c31..e74ab7d46c 100644
--- a/src/qemu/qemu_snapshot.c
+++ b/src/qemu/qemu_snapshot.c
@@ -400,6 +400,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
@@ -418,6 +419,7 @@ qemuSnapshotPrepareDiskExternalInactive(virDomainSnapshotDiskDef *snapdisk,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
@@ -462,6 +464,7 @@ qemuSnapshotPrepareDiskExternalActive(virDomainSnapshotDiskDef *snapdisk,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
@@ -604,6 +607,7 @@ qemuSnapshotPrepareDiskInternal(virDomainDiskDef *disk,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         virReportError(VIR_ERR_INTERNAL_ERROR,
diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c
index 04d0c9df73..9dce908cfe 100644
--- a/src/qemu/qemu_validate.c
+++ b/src/qemu/qemu_validate.c
@@ -632,6 +632,7 @@ qemuValidateDomainDefNvram(const virDomainDef *def,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                        _("unsupported nvram disk type '%1$s'"),
                        virStorageTypeToString(src->type));
diff --git a/src/storage_file/storage_source.c b/src/storage_file/storage_source.c
index 47fc1edbd2..dc31d1bf50 100644
--- a/src/storage_file/storage_source.c
+++ b/src/storage_file/storage_source.c
@@ -583,6 +583,7 @@ virStorageSourceUpdatePhysicalSize(virStorageSource *src,
     case VIR_STORAGE_TYPE_VOLUME:
     case VIR_STORAGE_TYPE_NVME:
     case VIR_STORAGE_TYPE_VHOST_USER:
+    case VIR_STORAGE_TYPE_VHOST_VDPA:
     case VIR_STORAGE_TYPE_NONE:
     case VIR_STORAGE_TYPE_LAST:
         return -1;
-- 
2.40.1

On Tue, Jun 06, 2023 at 16:11:00 -0500, Jonathon Jongsma wrote:
vDPA block devices can be configured as follows:
  <disk type='vhostvdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </disk>
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 docs/formatdomain.rst             | 19 +++++++++++++++++--
 src/ch/ch_monitor.c               |  1 +
 src/conf/domain_conf.c            |  7 +++++++
 src/conf/schemas/domaincommon.rng | 13 +++++++++++++
 src/conf/storage_source_conf.c    |  6 +++++-
 src/conf/storage_source_conf.h    |  1 +
 src/libxl/xen_xl.c                |  1 +
 src/qemu/qemu_block.c             |  6 ++++++
 src/qemu/qemu_command.c           |  1 +
 src/qemu/qemu_migration.c         |  2 ++
 src/qemu/qemu_snapshot.c          |  4 ++++
 src/qemu/qemu_validate.c          |  1 +
 src/storage_file/storage_source.c |  1 +
 13 files changed, 60 insertions(+), 3 deletions(-)
[...]

By re-using virStorageSource->path for the path to the block device, the code which e.g. sets up the mount namespace will consider that the device node for the vdpa device needs to be created in the per-vm /dev filesystem. This should not be needed though, as we're FD-passing it.

selinux labelling and cgroups will skip labelling the /dev/ node as virStorageSourceIsLocalStorage() returns false. This is okay in the case of selinux, but I'm not sure how the cgroups device controller handles access to the device.

I also didn't see anything related to labelling the fd passed to qemu. What I'm missing is selinux labelling of the FD passed to the VM. Since the /dev/ node doesn't get labelled either, it will most likely be forbidden by selinux.

Do you have any guide on how to set up the VDPA simulator handy?

Generally this patch looks okay and the labelling stuff is more for the implementation patch, so,

Reviewed-by: Peter Krempa <pkrempa@redhat.com>

On 6/8/23 8:44 AM, Peter Krempa wrote:
On Tue, Jun 06, 2023 at 16:11:00 -0500, Jonathon Jongsma wrote:
vDPA block devices can be configured as follows:
  <disk type='vhostvdpa'>
    <source dev='/dev/vhost-vdpa-0'/>
  </disk>
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 docs/formatdomain.rst             | 19 +++++++++++++++++--
 src/ch/ch_monitor.c               |  1 +
 src/conf/domain_conf.c            |  7 +++++++
 src/conf/schemas/domaincommon.rng | 13 +++++++++++++
 src/conf/storage_source_conf.c    |  6 +++++-
 src/conf/storage_source_conf.h    |  1 +
 src/libxl/xen_xl.c                |  1 +
 src/qemu/qemu_block.c             |  6 ++++++
 src/qemu/qemu_command.c           |  1 +
 src/qemu/qemu_migration.c         |  2 ++
 src/qemu/qemu_snapshot.c          |  4 ++++
 src/qemu/qemu_validate.c          |  1 +
 src/storage_file/storage_source.c |  1 +
 13 files changed, 60 insertions(+), 3 deletions(-)
[...]
By re-using virStorageSource->path for the path to the block device, the code which e.g. sets up the mount namespace will consider that the device node for the vdpa device needs to be created in the per-vm /dev filesystem. This should not be needed though, as we're FD-passing it.
selinux labelling and cgroups will skip labelling the /dev/ node as virStorageSourceIsLocalStorage() returns false. This is okay in the case of selinux, but I'm not sure how the cgroups device controller handles access to the device.
I also didn't see anything related to labelling the fd passed to qemu.
What I'm missing is selinux labelling of the FD passed to the VM. Since the /dev/ node doesn't get labelled either, it will most likely be forbidden by selinux.
Do you have any guide on how to set up the VDPA simulator handy?
I've just been using this little script to enable the vdpa block simulator:

$ cat ./enable-vdpa-sim-blk
modprobe -r virtio_vdpa
modprobe vhost_vdpa
modprobe vdpa_sim_blk
vdpa dev add name testvdpablk mgmtdev vdpasim_blk
ls /dev/vhost-vdpa*
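For reference, the simulated device can be inspected and removed again with the same iproute2 'vdpa' tool; a small sketch (the device name matches the script above):

$ vdpa mgmtdev show          # management devices; should include vdpasim_blk
$ vdpa dev show              # configured devices; should include testvdpablk
$ vdpa dev del testvdpablk   # tear the simulated device back down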
Generally this patch looks okay and the labelling stuff is more for the implementation patch, so,
Reviewed-by: Peter Krempa <pkrempa@redhat.com>

Check whether the qemu binary supports the vdpa block driver. We can't rely simply on the existence of the virtio-blk-vhost-vdpa block driver since the first releases of qemu didn't support fd-passing for this driver. So we have to check for the 'fdset' feature on the driver object. This feature will be present in the qemu 8.1.0 release and was merged to qemu in commit 98b126f5.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_capabilities.c | 2 ++
 src/qemu/qemu_capabilities.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index cf85d42198..11d5ecd622 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -693,6 +693,7 @@ VIR_ENUM_IMPL(virQEMUCaps,
               "virtio-gpu.blob", /* QEMU_CAPS_VIRTIO_GPU_BLOB */
               "rbd-encryption-layering", /* QEMU_CAPS_RBD_ENCRYPTION_LAYERING */
               "rbd-encryption-luks-any", /* QEMU_CAPS_RBD_ENCRYPTION_LUKS_ANY */
+              "virtio-blk-vhost-vdpa", /* QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA */
 );

@@ -1560,6 +1561,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = {
     { "object-add/arg-type/+iothread/thread-pool-max", QEMU_CAPS_IOTHREAD_THREAD_POOL_MAX },
     { "query-migrate/ret-type/blocked-reasons", QEMU_CAPS_MIGRATION_BLOCKED_REASONS },
     { "screendump/arg-type/format/^png", QEMU_CAPS_SCREENSHOT_FORMAT_PNG },
+    { "blockdev-add/arg-type/+virtio-blk-vhost-vdpa/$fdset", QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA},
 };

 typedef struct _virQEMUCapsObjectTypeProps virQEMUCapsObjectTypeProps;
diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h
index 3b55aed07a..6feaa81bbf 100644
--- a/src/qemu/qemu_capabilities.h
+++ b/src/qemu/qemu_capabilities.h
@@ -672,6 +672,7 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */
     QEMU_CAPS_VIRTIO_GPU_BLOB, /* -device virtio-gpu-*.blob= */
     QEMU_CAPS_RBD_ENCRYPTION_LAYERING, /* layered encryption support for Ceph RBD */
     QEMU_CAPS_RBD_ENCRYPTION_LUKS_ANY, /* luks-any (LUKS and LUKS2) encryption format for Ceph RBD */
+    QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA, /* -device virtio-blk-vhost-vdpa */

     QEMU_CAPS_LAST /* this must always be the last item */
 } virQEMUCapsFlags;
-- 
2.40.1
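For context, the schema query above is essentially an automated version of introspecting the binary by hand; a hedged sketch of doing the same via QMP (an illustrative session, not part of the patch):

$ qemu-system-x86_64 -M none -nodefaults -qmp stdio
{"execute": "qmp_capabilities"}
{"execute": "query-qmp-schema"}

In the (very large) reply, the virtio-blk-vhost-vdpa member of blockdev-add's argument type should advertise 'fdset' in its 'features' array on new enough qemu.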

On Tue, Jun 06, 2023 at 16:11:01 -0500, Jonathon Jongsma wrote:
Check whether the qemu binary supports the vdpa block driver. We can't rely simply on the existence of the virtio-blk-vhost-vdpa block driver since the first releases of qemu didn't support fd-passing for this driver. So we have to check for the 'fdset' feature on the driver object. This feature will be present in the qemu 8.1.0 release and was merged to qemu in commit 98b126f5.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_capabilities.c | 2 ++
 src/qemu/qemu_capabilities.h | 1 +
 2 files changed, 3 insertions(+)
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index cf85d42198..11d5ecd622 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -693,6 +693,7 @@ VIR_ENUM_IMPL(virQEMUCaps,
               "virtio-gpu.blob", /* QEMU_CAPS_VIRTIO_GPU_BLOB */
               "rbd-encryption-layering", /* QEMU_CAPS_RBD_ENCRYPTION_LAYERING */
               "rbd-encryption-luks-any", /* QEMU_CAPS_RBD_ENCRYPTION_LUKS_ANY */
+              "virtio-blk-vhost-vdpa", /* QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA */
 );
@@ -1560,6 +1561,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = {
     { "object-add/arg-type/+iothread/thread-pool-max", QEMU_CAPS_IOTHREAD_THREAD_POOL_MAX },
     { "query-migrate/ret-type/blocked-reasons", QEMU_CAPS_MIGRATION_BLOCKED_REASONS },
     { "screendump/arg-type/format/^png", QEMU_CAPS_SCREENSHOT_FORMAT_PNG },
+    { "blockdev-add/arg-type/+virtio-blk-vhost-vdpa/$fdset", QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA},
Please order this together with other 'blockdev-add' queries at the top and add a space before }.

Also, since I've updated capabilities now, this feature should be detected with the 8.1 dump, so please regenerate the output as the flag will be present now.

Reviewed-by: Peter Krempa <pkrempa@redhat.com>

On Thu, Jun 08, 2023 at 11:00:03 +0200, Peter Krempa wrote:
On Tue, Jun 06, 2023 at 16:11:01 -0500, Jonathon Jongsma wrote:
Check whether the qemu binary supports the vdpa block driver. We can't rely simply on the existence of the virtio-blk-vhost-vdpa block driver since the first releases of qemu didn't support fd-passing for this driver. So we have to check for the 'fdset' feature on the driver object. This feature will be present in the qemu 8.1.0 release and was merged to qemu in commit 98b126f5.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_capabilities.c | 2 ++
 src/qemu/qemu_capabilities.h | 1 +
 2 files changed, 3 insertions(+)
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index cf85d42198..11d5ecd622 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -693,6 +693,7 @@ VIR_ENUM_IMPL(virQEMUCaps,
               "virtio-gpu.blob", /* QEMU_CAPS_VIRTIO_GPU_BLOB */
               "rbd-encryption-layering", /* QEMU_CAPS_RBD_ENCRYPTION_LAYERING */
               "rbd-encryption-luks-any", /* QEMU_CAPS_RBD_ENCRYPTION_LUKS_ANY */
+              "virtio-blk-vhost-vdpa", /* QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA */
 );
@@ -1560,6 +1561,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsQMPSchemaQueries[] = {
     { "object-add/arg-type/+iothread/thread-pool-max", QEMU_CAPS_IOTHREAD_THREAD_POOL_MAX },
     { "query-migrate/ret-type/blocked-reasons", QEMU_CAPS_MIGRATION_BLOCKED_REASONS },
     { "screendump/arg-type/format/^png", QEMU_CAPS_SCREENSHOT_FORMAT_PNG },
+    { "blockdev-add/arg-type/+virtio-blk-vhost-vdpa/$fdset", QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA},
Please order this together with other 'blockdev-add' queries at the top and add a space before }.
Also, since I've updated capabilities now, this feature should be detected with the 8.1 dump, so please regenerate the output as the flag will be present now.
Ah, actually the caps were generated with older libblkio which doesn't have the FD passing feature yet. I'll update and re-generate them.
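(For anyone reproducing this, a hedged sketch of the usual regeneration workflow, from memory of the libvirt tree layout — double-check the exact replies file name and test binary paths:

$ ./build/tests/qemucapsprobe /path/to/qemu-system-x86_64 > tests/qemucapabilitiesdata/caps_8.1.0_x86_64.replies
$ VIR_TEST_REGENERATE_OUTPUT=1 ./build/tests/qemucapabilitiestest
)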

qemuInterfaceVDPAConnect() was a helper function for connecting to the vdpa device file. But in order to support other vdpa devices besides network interfaces (e.g. vdpa block devices) make this function a bit more generic.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_command.c   | 23 ++++++++++++++++++++++-
 src/qemu/qemu_command.h   |  1 +
 src/qemu/qemu_interface.c | 23 -----------------------
 src/qemu/qemu_interface.h |  2 --
 tests/qemuhotplugmock.c   |  4 ++--
 tests/qemuxml2argvmock.c  |  2 +-
 6 files changed, 26 insertions(+), 29 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 22deac2a9b..ae5a797906 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -8474,7 +8474,7 @@ qemuBuildInterfaceConnect(virDomainObj *vm,
         break;

     case VIR_DOMAIN_NET_TYPE_VDPA:
-        if ((vdpafd = qemuInterfaceVDPAConnect(net)) < 0)
+        if ((vdpafd = qemuVDPAConnect(net->data.vdpa.devicepath)) < 0)
             return -1;

         netpriv->vdpafd = qemuFDPassNew(net->info.alias, priv);
@@ -10953,3 +10953,24 @@ qemuBuildStorageSourceChainAttachPrepareBlockdevTop(virStorageSource *top,

     return g_steal_pointer(&data);
 }
+
+
+/* qemuVDPAConnect:
+ * @devicepath: the path to the vdpa device
+ *
+ * returns: file descriptor of the vdpa device
+ */
+int
+qemuVDPAConnect(const char *devicepath)
+{
+    int fd;
+
+    if ((fd = open(devicepath, O_RDWR)) < 0) {
+        virReportSystemError(errno,
+                             _("Unable to open '%1$s' for vdpa device"),
+                             devicepath);
+        return -1;
+    }
+
+    return fd;
+}
diff --git a/src/qemu/qemu_command.h b/src/qemu/qemu_command.h
index 55efa45601..341ec43f9a 100644
--- a/src/qemu/qemu_command.h
+++ b/src/qemu/qemu_command.h
@@ -248,3 +248,4 @@ qemuBuildTPMOpenBackendFDs(const char *tpmdev,
 const char * qemuAudioDriverTypeToString(virDomainAudioType type);
 virDomainAudioType qemuAudioDriverTypeFromString(const char *str);
+int qemuVDPAConnect(const char *devicepath) G_NO_INLINE;
diff --git a/src/qemu/qemu_interface.c b/src/qemu/qemu_interface.c
index e875de48ee..8856bb95a8 100644
--- a/src/qemu/qemu_interface.c
+++ b/src/qemu/qemu_interface.c
@@ -648,29 +648,6 @@ qemuInterfaceBridgeConnect(virDomainDef *def,
 }

-/* qemuInterfaceVDPAConnect:
- * @net: pointer to the VM's interface description
- *
- * returns: file descriptor of the vdpa device
- *
- * Called *only* called if actualType is VIR_DOMAIN_NET_TYPE_VDPA
- */
-int
-qemuInterfaceVDPAConnect(virDomainNetDef *net)
-{
-    int fd;
-
-    if ((fd = open(net->data.vdpa.devicepath, O_RDWR)) < 0) {
-        virReportSystemError(errno,
-                             _("Unable to open '%1$s' for vdpa device"),
-                             net->data.vdpa.devicepath);
-        return -1;
-    }
-
-    return fd;
-}
-
-
 /*
  * Returns: -1 on error, 0 on success. Populates net->privateData->slirp if
  * the slirp helper is needed.
diff --git a/src/qemu/qemu_interface.h b/src/qemu/qemu_interface.h
index d866beb184..6eed3e6bd7 100644
--- a/src/qemu/qemu_interface.h
+++ b/src/qemu/qemu_interface.h
@@ -55,5 +55,3 @@ int qemuInterfaceOpenVhostNet(virDomainObj *def,

 int qemuInterfacePrepareSlirp(virQEMUDriver *driver,
                               virDomainNetDef *net);
-
-int qemuInterfaceVDPAConnect(virDomainNetDef *net) G_NO_INLINE;
diff --git a/tests/qemuhotplugmock.c b/tests/qemuhotplugmock.c
index 89d287945a..dd7e2c67e0 100644
--- a/tests/qemuhotplugmock.c
+++ b/tests/qemuhotplugmock.c
@@ -18,8 +18,8 @@

 #include <config.h>

+#include "qemu/qemu_command.h"
 #include "qemu/qemu_hotplug.h"
-#include "qemu/qemu_interface.h"
 #include "qemu/qemu_process.h"
 #include "testutilsqemu.h"
 #include "conf/domain_conf.h"
@@ -94,7 +94,7 @@ qemuProcessKillManagedPRDaemon(virDomainObj *vm G_GNUC_UNUSED)
 }

 int
-qemuInterfaceVDPAConnect(virDomainNetDef *net G_GNUC_UNUSED)
+qemuVDPAConnect(const char *devicepath G_GNUC_UNUSED)
 {
     /* need a valid fd or sendmsg won't work. Just open /dev/null */
     return open("/dev/null", O_RDONLY);
diff --git a/tests/qemuxml2argvmock.c b/tests/qemuxml2argvmock.c
index 400dd5c020..52c44b2ed0 100644
--- a/tests/qemuxml2argvmock.c
+++ b/tests/qemuxml2argvmock.c
@@ -255,7 +255,7 @@ virNetDevBandwidthSetRootQDisc(const char *ifname G_GNUC_UNUSED,

 int
-qemuInterfaceVDPAConnect(virDomainNetDef *net G_GNUC_UNUSED)
+qemuVDPAConnect(const char *devicepath G_GNUC_UNUSED)
 {
     if (fcntl(1732, F_GETFD) != -1)
         abort();
-- 
2.40.1

On Tue, Jun 06, 2023 at 16:11:02 -0500, Jonathon Jongsma wrote:
qemuInterfaceVDPAConnect() was a helper function for connecting to the vdpa device file. But in order to support other vdpa devices besides network interfaces (e.g. vdpa block devices) make this function a bit more generic.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_command.c   | 23 ++++++++++++++++++++++-
 src/qemu/qemu_command.h   |  1 +
 src/qemu/qemu_interface.c | 23 -----------------------
 src/qemu/qemu_interface.h |  2 --
 tests/qemuhotplugmock.c   |  4 ++--
 tests/qemuxml2argvmock.c  |  2 +-
 6 files changed, 26 insertions(+), 29 deletions(-)
Reviewed-by: Peter Krempa <pkrempa@redhat.com>

vDPA block devices will also need the same consideration for memlock limits as other vdpa devices, so consider these devices when calculating memlock limits.

Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_domain.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 8f77e8fc58..2f6b32e394 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -9564,7 +9564,7 @@ qemuDomainGetNumNVMeDisks(const virDomainDef *def)

 static int
-qemuDomainGetNumVDPANetDevices(const virDomainDef *def)
+qemuDomainGetNumVDPADevices(const virDomainDef *def)
 {
     size_t i;
     int n = 0;
@@ -9574,6 +9574,14 @@ qemuDomainGetNumVDPANetDevices(const virDomainDef *def)
             n++;
     }

+    for (i = 0; i < def->ndisks; i++) {
+        virStorageSource *src;
+        for (src = def->disks[i]->src; src; src = src->backingStore) {
+            if (src->type == VIR_STORAGE_TYPE_VHOST_VDPA)
+                n++;
+        }
+    }
+
     return n;
 }

@@ -9616,7 +9624,7 @@ qemuDomainGetMemLockLimitBytes(virDomainDef *def)

     nvfio = qemuDomainGetNumVFIOHostdevs(def);
     nnvme = qemuDomainGetNumNVMeDisks(def);
-    nvdpa = qemuDomainGetNumVDPANetDevices(def);
+    nvdpa = qemuDomainGetNumVDPADevices(def);
     /* For device passthrough using VFIO the guest memory and MMIO memory
      * regions need to be locked persistent in order to allow DMA.
      *
-- 
2.40.1

On Tue, Jun 06, 2023 at 16:11:03 -0500, Jonathon Jongsma wrote:
vDPA block devices will also need the same consideration for memlock limits as other vdpa devices, so consider these devices when calculating memlock limits.
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>
---
 src/qemu/qemu_domain.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)
Reviewed-by: Peter Krempa <pkrempa@redhat.com>

Requires recent qemu with support for the virtio-blk-vhost-vdpa device and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1900770
---
 src/qemu/qemu_block.c                      | 20 ++++++++--
 src/qemu/qemu_domain.c                     | 25 ++++++++++++
 src/qemu/qemu_validate.c                   | 44 +++++++++++++++++++---
 tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++
 tests/qemuxml2argvdata/disk-vhostvdpa.xml  | 21 +++++++++++
 tests/qemuxml2argvtest.c                   |  2 +
 6 files changed, 139 insertions(+), 8 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml

diff --git a/src/qemu/qemu_block.c b/src/qemu/qemu_block.c
index 35312fb7d4..78f6e63aad 100644
--- a/src/qemu/qemu_block.c
+++ b/src/qemu/qemu_block.c
@@ -778,6 +778,20 @@ qemuBlockStorageSourceGetNVMeProps(virStorageSource *src)
 }

+static virJSONValue *
+qemuBlockStorageSourceGetVhostVdpaProps(virStorageSource *src)
+{
+    virJSONValue *ret = NULL;
+    qemuDomainStorageSourcePrivate *srcpriv = QEMU_DOMAIN_STORAGE_SOURCE_PRIVATE(src);
+
+    ignore_value(virJSONValueObjectAdd(&ret,
+                                       "s:driver", "virtio-blk-vhost-vdpa",
+                                       "s:path", qemuFDPassGetPath(srcpriv->fdpass),
+                                       NULL));
+    return ret;
+}
+
+
 static int
 qemuBlockStorageSourceGetBlockdevGetCacheProps(virStorageSource *src,
                                                virJSONValue *props)
@@ -874,9 +888,9 @@ qemuBlockStorageSourceGetBackendProps(virStorageSource *src,
         break;

     case VIR_STORAGE_TYPE_VHOST_VDPA:
-        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
-                       _("vhostvdpa disk type not yet supported"));
-        return NULL;
+        if (!(fileprops = qemuBlockStorageSourceGetVhostVdpaProps(src)))
+            return NULL;
+        break;

     case VIR_STORAGE_TYPE_VHOST_USER:
         virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2f6b32e394..119e52a7d7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11157,6 +11157,28 @@ qemuDomainPrepareStorageSourceFDs(virStorageSource *src,
 }

+static int
+qemuDomainPrepareStorageSourceVDPA(virStorageSource *src,
+                                   qemuDomainObjPrivate *priv)
+{
+    qemuDomainStorageSourcePrivate *srcpriv = NULL;
+    virStorageType actualType = virStorageSourceGetActualType(src);
+    int vdpafd = -1;
+
+    if (actualType != VIR_STORAGE_TYPE_VHOST_VDPA)
+        return 0;
+
+    if ((vdpafd = qemuVDPAConnect(src->path)) < 0)
+        return -1;
+
+    srcpriv = qemuDomainStorageSourcePrivateFetch(src);
+
+    srcpriv->fdpass = qemuFDPassNew(src->nodestorage, priv);
+    qemuFDPassAddFD(srcpriv->fdpass, &vdpafd, "-vdpa");
+    return 0;
+}
+
+
 int
 qemuDomainPrepareStorageSourceBlockdevNodename(virDomainDiskDef *disk,
                                                virStorageSource *src,
@@ -11197,6 +11219,9 @@ qemuDomainPrepareStorageSourceBlockdevNodename(virDomainDiskDef *disk,
     if (qemuDomainPrepareStorageSourceFDs(src, priv) < 0)
         return -1;

+    if (qemuDomainPrepareStorageSourceVDPA(src, priv) < 0)
+        return -1;
+
     return 0;
 }
diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c
index 9dce908cfe..67b0944162 100644
--- a/src/qemu/qemu_validate.c
+++ b/src/qemu/qemu_validate.c
@@ -3220,6 +3220,28 @@ qemuValidateDomainDeviceDefDiskTransient(const virDomainDiskDef *disk,
 }

+static int
+qemuValidateDomainDeviceDefDiskVhost(const virDomainDef *def,
+                                     virStorageType storagetype,
+                                     virQEMUCapsFlags flag,
+                                     virQEMUCaps *qemuCaps)
+{
+    const char *vhosttype = virStorageTypeToString(storagetype);
+
+    if (!virQEMUCapsGet(qemuCaps, flag)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("%1$s disk is not supported with this QEMU binary"),
+                       vhosttype);
+        return -1;
+    }
+
+    if (qemuValidateDomainDefVhostUserRequireSharedMemory(def, vhosttype) < 0)
+        return -1;
+
+    return 0;
+}
+
+
 int
 qemuValidateDomainDeviceDefDisk(const virDomainDiskDef *disk,
                                 const virDomainDef *def,
@@ -3281,13 +3303,25 @@ qemuValidateDomainDeviceDefDisk(const virDomainDiskDef *disk,
     }

     if (disk->src->type == VIR_STORAGE_TYPE_VHOST_USER) {
-        if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_DEVICE_VHOST_USER_BLK)) {
-            virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                           _("vhostuser disk is not supported with this QEMU binary"));
+        if (qemuValidateDomainDeviceDefDiskVhost(def, disk->src->type,
+                                                 QEMU_CAPS_DEVICE_VHOST_USER_BLK,
+                                                 qemuCaps) < 0)
             return -1;
-        }
+    }

-        if (qemuValidateDomainDefVhostUserRequireSharedMemory(def, "vhostuser") < 0) {
+    if (disk->src->type == VIR_STORAGE_TYPE_VHOST_VDPA) {
+        if (qemuValidateDomainDeviceDefDiskVhost(def, disk->src->type,
+                                                 QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA,
+                                                 qemuCaps) < 0)
+            return -1;
+
+        if (disk->cachemode != VIR_DOMAIN_DISK_CACHE_DIRECTSYNC &&
+            disk->cachemode != VIR_DOMAIN_DISK_CACHE_DISABLE) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("disk type '%1$s' requires cache mode '%2$s' or '%3$s'"),
+                           virStorageTypeToString(disk->src->type),
+                           virDomainDiskCacheTypeToString(VIR_DOMAIN_DISK_CACHE_DIRECTSYNC),
+                           virDomainDiskCacheTypeToString(VIR_DOMAIN_DISK_CACHE_DISABLE));
             return -1;
         }
     }
diff --git a/tests/qemuxml2argvdata/disk-vhostvdpa.args b/tests/qemuxml2argvdata/disk-vhostvdpa.args
new file mode 100644
index 0000000000..878f3b5fce
--- /dev/null
+++ b/tests/qemuxml2argvdata/disk-vhostvdpa.args
@@ -0,0 +1,35 @@
+LC_ALL=C \
+PATH=/bin \
+HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1 \
+USER=test \
+LOGNAME=test \
+XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.local/share \
+XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.cache \
+XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.config \
+/usr/bin/qemu-system-x86_64 \
+-name guest=QEMUGuest1,debug-threads=on \
+-S \
+-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/master-key.aes \
+-machine pc,usb=off,dump-guest-core=off \
+-accel tcg \
+-m 214 \
+-overcommit mem-lock=off \
+-smp 1,sockets=1,cores=1,threads=1 \
+-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
+-display none \
+-no-user-config \
+-nodefaults \
+-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \
+-mon chardev=charmonitor,id=monitor,mode=control \
+-rtc base=utc \
+-no-shutdown \
+-no-acpi \
+-boot strict=on \
+-usb \
+-add-fd set=0,fd=1732,opaque=libvirt-1-storage-vdpa \
+-blockdev '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/0","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
+-device virtio-blk-pci,bus=pci.0,addr=0x2,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on \
+-audiodev '{"id":"audio1","driver":"none"}' \
+-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
+-msg timestamp=on
diff --git a/tests/qemuxml2argvdata/disk-vhostvdpa.xml b/tests/qemuxml2argvdata/disk-vhostvdpa.xml
new file mode 100644
index 0000000000..0ac3899a34
--- /dev/null
+++ b/tests/qemuxml2argvdata/disk-vhostvdpa.xml
@@ -0,0 +1,21 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory unit='KiB'>219136</memory>
+  <currentMemory unit='KiB'>219136</currentMemory>
+  <memoryBacking>
+    <access mode='shared'/>
+  </memoryBacking>
+  <vcpu placement='static'>1</vcpu>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+  </os>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <disk type='vhostvdpa' device='disk'>
+      <driver name='qemu' type='raw' cache='none'/>
+      <source dev='/dev/vhost-vdpa-0'/>
+      <target dev='vda' bus='virtio'/>
+    </disk>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index d914d8cbea..928375c173 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -1224,6 +1224,8 @@ mymain(void)
     DO_TEST_CAPS_VER("disk-vhostuser-numa", "4.2.0");
     DO_TEST_CAPS_LATEST("disk-vhostuser-numa");
     DO_TEST_CAPS_LATEST("disk-vhostuser");
+    DO_TEST("disk-vhostvdpa",
+            QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA);
     DO_TEST_CAPS_LATEST_PARSE_ERROR("disk-device-lun-type-invalid");
     DO_TEST_CAPS_LATEST_PARSE_ERROR("disk-attaching-partition-nosupport");
     DO_TEST_CAPS_LATEST("disk-usb-device");
-- 
2.40.1
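A minimal end-to-end smoke test sketch, combining the simulator recipe from earlier in the thread with the new test XML (assumptions: the module/device names below and a domain definition based on tests/qemuxml2argvdata/disk-vhostvdpa.xml):

$ modprobe -a vhost_vdpa vdpa_sim_blk
$ vdpa dev add name testvdpablk mgmtdev vdpasim_blk
$ ls /dev/vhost-vdpa-*          # note the chardev for <source dev='...'/>
$ virsh define disk-vhostvdpa.xml
$ virsh start QEMUGuest1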

On Tue, Jun 06, 2023 at 16:15:30 -0500, Jonathon Jongsma wrote:
Requires recent qemu with support for the virtio-blk-vhost-vdpa device and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1900770
---
 src/qemu/qemu_block.c                      | 20 ++++++++--
 src/qemu/qemu_domain.c                     | 25 ++++++++++++
 src/qemu/qemu_validate.c                   | 44 +++++++++++++++++++---
 tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++
 tests/qemuxml2argvdata/disk-vhostvdpa.xml  | 21 +++++++++++
 tests/qemuxml2argvtest.c                   |  2 +
 6 files changed, 139 insertions(+), 8 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml
[...]
diff --git a/tests/qemuxml2argvdata/disk-vhostvdpa.args b/tests/qemuxml2argvdata/disk-vhostvdpa.args
new file mode 100644
index 0000000000..878f3b5fce
--- /dev/null
+++ b/tests/qemuxml2argvdata/disk-vhostvdpa.args
@@ -0,0 +1,35 @@
+LC_ALL=C \
+PATH=/bin \
+HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1 \
+USER=test \
+LOGNAME=test \
+XDG_DATA_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.local/share \
+XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.cache \
+XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/.config \
+/usr/bin/qemu-system-x86_64 \
+-name guest=QEMUGuest1,debug-threads=on \
+-S \
+-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain--1-QEMUGuest1/master-key.aes \
+-machine pc,usb=off,dump-guest-core=off \
+-accel tcg \
+-m 214 \
+-overcommit mem-lock=off \
+-smp 1,sockets=1,cores=1,threads=1 \
+-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
+-display none \
+-no-user-config \
+-nodefaults \
+-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \
+-mon chardev=charmonitor,id=monitor,mode=control \
+-rtc base=utc \
+-no-shutdown \
+-no-acpi \
+-boot strict=on \
+-usb \
+-add-fd set=0,fd=1732,opaque=libvirt-1-storage-vdpa \
+-blockdev '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/0","node-name":"libvirt-1-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
+-blockdev '{"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-1-storage"}' \
IIUC the use of the 'raw' driver as an additional layer will prevent the fast path from working when it is eventually implemented in qemu.

There are multiple options for how we could eliminate that, but I think I should excavate my patches for https://bugzilla.redhat.com/show_bug.cgi?id=1838189, which asks for removal of the dummy layer also for other cases where it is not needed.

Alternative ways would require that we limit vdpa to be the only layer and disallow blockjobs, as both of those cases expect the format layer to be present.
+-device virtio-blk-pci,bus=pci.0,addr=0x2,drive=libvirt-1-format,id=virtio-disk0,bootindex=1,write-cache=on \
+-audiodev '{"id":"audio1","driver":"none"}' \
+-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x3 \
+-msg timestamp=on
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index d914d8cbea..928375c173 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -1224,6 +1224,8 @@ mymain(void)
     DO_TEST_CAPS_VER("disk-vhostuser-numa", "4.2.0");
     DO_TEST_CAPS_LATEST("disk-vhostuser-numa");
     DO_TEST_CAPS_LATEST("disk-vhostuser");
+    DO_TEST("disk-vhostvdpa",
+            QEMU_CAPS_DEVICE_VIRTIO_BLK_VHOST_VDPA);
NACK to fake-caps test. I'll regenerate the qemu caps with newer libblkio and you can then use DO_TEST_CAPS_LATEST here.

Apart from the above the patch looks good. I'll have my final word after I consider all the blockjob code paths and snapshots (added by patch 1).

As I promised a long time ago, I gave your patches some testing with regard to their cooperation with blockjobs and snapshots. Since the new version of the patches was not yet posted on the list, I'm replying here, including my observations from testing the patches from your gitlab branch:

On Tue, Jun 06, 2023 at 16:15:30 -0500, Jonathon Jongsma wrote:
Requires recent qemu with support for the virtio-blk-vhost-vdpa device and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)
Since this is a feature addition, the 'Fixes' keyword doesn't make sense. Use e.g. 'Resolves' instead.

Additionally, you're missing the DCO certification here.
---
 src/qemu/qemu_block.c                      | 20 ++++++++--
 src/qemu/qemu_domain.c                     | 25 ++++++++++++
 src/qemu/qemu_validate.c                   | 44 +++++++++++++++++++---
 tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++
 tests/qemuxml2argvdata/disk-vhostvdpa.xml  | 21 +++++++++++
 tests/qemuxml2argvtest.c                   |  2 +
 6 files changed, 139 insertions(+), 8 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml
[...]
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2f6b32e394..119e52a7d7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11157,6 +11157,28 @@ qemuDomainPrepareStorageSourceFDs(virStorageSource *src,
 }
+static int
+qemuDomainPrepareStorageSourceVDPA(virStorageSource *src,
+                                   qemuDomainObjPrivate *priv)
+{
+    qemuDomainStorageSourcePrivate *srcpriv = NULL;
+    virStorageType actualType = virStorageSourceGetActualType(src);
+    int vdpafd = -1;
+
+    if (actualType != VIR_STORAGE_TYPE_VHOST_VDPA)
+        return 0;
+
+    if ((vdpafd = qemuVDPAConnect(src->path)) < 0)
+        return -1;
This function call directly touches the host filesystem, which is not supposed to be in the *DomainPrepareStorageSource* functions; rather, we have completely separate machinery in qemuProcessPrepareHostStorage.

Unfortunately that one doesn't yet need to handle individual backing chain members though.

This ensures that the code doesn't get accidentally called from tests even without mocking the code as the tests reimplement the functions differently for testing purposes.
+
+    srcpriv = qemuDomainStorageSourcePrivateFetch(src);
+
+    srcpriv->fdpass = qemuFDPassNew(src->nodestorage, priv);
+    qemuFDPassAddFD(srcpriv->fdpass, &vdpafd, "-vdpa");
+    return 0;
+}
[...]
diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c
index 9dce908cfe..67b0944162 100644
--- a/src/qemu/qemu_validate.c
+++ b/src/qemu/qemu_validate.c
@@ -3220,6 +3220,28 @@ qemuValidateDomainDeviceDefDiskTransient(const virDomainDiskDef *disk,
 }
+static int
+qemuValidateDomainDeviceDefDiskVhost(const virDomainDef *def,
+                                     virStorageType storagetype,
+                                     virQEMUCapsFlags flag,
+                                     virQEMUCaps *qemuCaps)
+{
+    const char *vhosttype = virStorageTypeToString(storagetype);
+
+    if (!virQEMUCapsGet(qemuCaps, flag)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("%1$s disk is not supported with this QEMU binary"),
+                       vhosttype);
+        return -1;
I'd prefer if both things this function does are duplicated inline below rather than passing it via arguments here. It makes it harder to follow.
+    }
+
+    if (qemuValidateDomainDefVhostUserRequireSharedMemory(def, vhosttype) < 0)
+        return -1;
+
+    return 0;
+}
In my testing of the code from the new branch I've observed that blockjobs and snapshot creation work well thanks to libblkio, so we don't have to add any additional checks or limitations.

I'll still need to go ahead and finish the series removing the 'raw' driver when it's not necessary so that the fast path, once implemented, will be possible. Waiting for that is not necessary for this series as it works properly even with the 'raw' driver in place.

With your new version of the patches I've noticed the following problems:

 - After converting to store the vdpa device path in src->vdpadev:

   - rejecting of the empty disk source doesn't work for vdpa. If you use <source/> instead of the proper path, the XML will be defined but broken.

   - virStorageSourceCopy doesn't copy the new member (as instructed in the comment above the struct), thus the block job code, which uses this extensively to work on the inactive definition, creates broken configurations.

I've also noticed that using 'qcow2' format for the device doesn't work:

error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument

If that is supposed to work, then qemu devs will probably need to know about it; if it is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image, though, as I didn't really try to figure out what's happening.

On 7/24/23 8:05 AM, Peter Krempa wrote:
As I promised a long time ago, I gave your patches some testing with regard to their cooperation with blockjobs and snapshots.
Since the new version of the patches was not yet posted on the list I'm replying here including my observations from testing patches from your gitlab branch:
On Tue, Jun 06, 2023 at 16:15:30 -0500, Jonathon Jongsma wrote:
Requires recent qemu with support for the virtio-blk-vhost-vdpa device and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)
Since this is a feature addition, the 'Fixes' keyword doesn't make sense. Use e.g. 'Resolves' instead.
Additionally, you're missing the DCO certification here.
--- src/qemu/qemu_block.c | 20 ++++++++-- src/qemu/qemu_domain.c | 25 ++++++++++++ src/qemu/qemu_validate.c | 44 +++++++++++++++++++--- tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++ tests/qemuxml2argvdata/disk-vhostvdpa.xml | 21 +++++++++++ tests/qemuxml2argvtest.c | 2 + 6 files changed, 139 insertions(+), 8 deletions(-) create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml
[...]
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2f6b32e394..119e52a7d7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11157,6 +11157,28 @@ qemuDomainPrepareStorageSourceFDs(virStorageSource *src,
 }
+static int
+qemuDomainPrepareStorageSourceVDPA(virStorageSource *src,
+                                   qemuDomainObjPrivate *priv)
+{
+    qemuDomainStorageSourcePrivate *srcpriv = NULL;
+    virStorageType actualType = virStorageSourceGetActualType(src);
+    int vdpafd = -1;
+
+    if (actualType != VIR_STORAGE_TYPE_VHOST_VDPA)
+        return 0;
+
+    if ((vdpafd = qemuVDPAConnect(src->path)) < 0)
+        return -1;
This function call directly touches the host filesystem, which is not supposed to be in the *DomainPrepareStorageSource* functions but we rather have a completely separate machinery in qemuProcessPrepareHostStorage.
Unfortunately that one doesn't yet need to handle individual backing chain members though.
This ensures that the code doesn't get accidentally called from tests even without mocking the code as the tests reimplement the functions differently for testing purposes.
+
+    srcpriv = qemuDomainStorageSourcePrivateFetch(src);
+
+    srcpriv->fdpass = qemuFDPassNew(src->nodestorage, priv);
+    qemuFDPassAddFD(srcpriv->fdpass, &vdpafd, "-vdpa");
+    return 0;
+}
[...]
diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c
index 9dce908cfe..67b0944162 100644
--- a/src/qemu/qemu_validate.c
+++ b/src/qemu/qemu_validate.c
@@ -3220,6 +3220,28 @@ qemuValidateDomainDeviceDefDiskTransient(const virDomainDiskDef *disk,
 }
+static int
+qemuValidateDomainDeviceDefDiskVhost(const virDomainDef *def,
+                                     virStorageType storagetype,
+                                     virQEMUCapsFlags flag,
+                                     virQEMUCaps *qemuCaps)
+{
+    const char *vhosttype = virStorageTypeToString(storagetype);
+
+    if (!virQEMUCapsGet(qemuCaps, flag)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("%1$s disk is not supported with this QEMU binary"),
+                       vhosttype);
+        return -1;
I'd prefer if both things this function does are duplicated inline below rather than passing it via arguments here. It makes it harder to follow.
+    }
+
+    if (qemuValidateDomainDefVhostUserRequireSharedMemory(def, vhosttype) < 0)
+        return -1;
+
+    return 0;
+}
In my testing of the code from the new branch I've observed that blockjobs and snapshot creation work well thanks to libblkio, so we don't have to add any additional checks or limitations.
I'll still need to go ahead and finish the series removing the 'raw' driver when it's not necessary so that the fast path, once implemented, will be possible. Waiting for that is not necessary for this series as it works properly even with the 'raw' driver in place.
With your new version of the patches I've noticed the following problems:
- After converting to store the vdpa device path in src->vdpadev:
- rejecting of the empty disk source doesn't work for vdpa. If you use <source/> instead of the proper path, the XML will be defined but broken.
- virStorageSourceCopy doesn't copy the new member (as instructed in the comment above the struct), thus block job code which uses this extensively to work on the inactive definition creates broken configurations.
I've also noticed that using 'qcow2' format for the device doesn't work:
error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
If that is supposed to work, then qemu devs will probably need to know about it; if it is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image, though, as I didn't really try to figure out what's happening.
That's a good question, and I don't actually know the answer. Were you using an actual vdpa block device for your tests or were you using the vdpa block simulator kernel module? How did you set it up?

Adding Stefano to cc for his thoughts.

Jonathon

On Wed, Aug 2, 2023 at 10:33 PM Jonathon Jongsma <jjongsma@redhat.com> wrote:
On 7/24/23 8:05 AM, Peter Krempa wrote:
[...]
I've also noticed that using 'qcow2' format for the device doesn't work:
error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
If that is supposed to work, then qemu devs will probably need to know about that, if that is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image though, as didn't really try to figure out what's happening.
That's a good question, and I don't actually know the answer. Were you using an actual vdpa block device for your tests or were you using the vdpa block simulator kernel module? How did you set it up? Adding Stefano to cc for his thoughts.
Yep, I would also like to understand how you initialized the device with a qcow2 format.

Theoretically, the best use case for vDPA block is that the backend handles formats; for QEMU it should just be a virtio device. But being a blockdev, we should be able to use formats anyway, so it should work.

For now, waiting for real hardware, the only way to test vDPA block support in QEMU is to use the simulator in the kernel or VDUSE. With the kernel simulator we only have a 128 MB ramdisk available; with VDUSE you can use QSD with any file:

$ modprobe -a vhost_vdpa vduse
$ qemu-storage-daemon \
    --blockdev file,filename=/path/to/image.qcow2,cache.direct=on,aio=native,node-name=file \
    --blockdev qcow2,file=file,node-name=qcow2 \
    --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=qcow2,writable=on
$ vdpa dev add name vduse0 mgmtdev vduse

Then you have a /dev/vhost-vdpa-X device that you can use with the `virtio-blk-vhost-vdpa` blockdev (note: vduse requires QEMU with a memory-backend with `share=on`), but using raw since the qcow2 is handled by QSD.

Of course, we should be able to use a raw file with QSD and qcow2 on qemu (although it's not the optimal configuration), but I don't know how to initialize a `virtio-blk-vhost-vdpa` blockdev with a qcow2 image :-(

Thanks,
Stefano
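As an illustrative aside (not from the original mail): the share=on requirement corresponds to something like the following when launching qemu by hand — flag spellings per current qemu, worth double-checking:

$ qemu-system-x86_64 \
    -m 1G \
    -object memory-backend-memfd,id=mem0,size=1G,share=on \
    -machine q35,memory-backend=mem0 \
    ...

In libvirt terms this is the <memoryBacking><access mode='shared'/></memoryBacking> setup shown in the test XML earlier in the series.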

On Thu, Aug 03, 2023 at 09:48:01 +0200, Stefano Garzarella wrote:
On Wed, Aug 2, 2023 at 10:33 PM Jonathon Jongsma <jjongsma@redhat.com> wrote:
On 7/24/23 8:05 AM, Peter Krempa wrote:
[...]
I've also noticed that using 'qcow2' format for the device doesn't work:
error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
If that is supposed to work, then qemu devs will probably need to know about that, if that is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image though, as didn't really try to figure out what's happening.
That's a good question, and I don't actually know the answer. Were you using an actual vdpa block device for your tests or were you using the vdpa block simulator kernel module? How did you set it up? Adding Stefano to cc for his thoughts.
Yep, I would also like to understand how you initialized the device with a qcow2 format.
Naively and originally, I simply used it as 'raw' at first and formatted it from the guest OS. Then I shut down the VM and started it back up, reconfiguring the image format as qcow2. This normally works with real-file backed storage, and since the vdpa simulator seems to persist the contents, I supposed this would work.
Theoretically, the best use case for vDPA block is that the backend handles formats, for QEMU it should just be a virtio device, but being a blockdev, we should be able to use formats anyway, so it should work.
Yeah, ideally there will be no format driver in qemu used for these devices (this is not yet the case; I'll need to fix libvirt to stop using the 'raw' driver when it's not needed).

Here I'm more interested in whether it is supposed to work, in which case we want to allow using qcow2 as a format in libvirt, or whether it's not supposed to work and we should forbid it before the user gets a suboptimal error message such as the current one.
For now, waiting for real hardware, the only way to test vDPA block support in QEMU is to use the simulator in the kernel or VDUSE.
With the kernel simulator we only have a 128 MB ramdisk available, with VDUSE you can use QSD with any file:
$ modprobe -a vhost_vdpa vduse
$ qemu-storage-daemon \
    --blockdev file,filename=/path/to/image.qcow2,cache.direct=on,aio=native,node-name=file \
    --blockdev qcow2,file=file,node-name=qcow2 \
    --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=qcow2,writable=on
$ vdpa dev add name vduse0 mgmtdev vduse
Then you have a /dev/vhost-vdpa-X device that you can use with the `virtio-blk-vhost-vdpa` blockdev (note: vduse requires QEMU with a memory-backend with `share=on`), but using raw since the qcow2 is handled by QSD. Of course, we should be able to use a raw file with QSD and qcow2 on qemu (although it's not the optimal configuration), but I don't know how to initialize a `virtio-blk-vhost-vdpa` blockdev with a qcow2 image :-(
With the above qemu storage daemon you should be able to do that by simply dropping the qcow2 format driver and exposing a qcow2 formatted image. It similarly works with NBD:

I've formatted 2 qcow2 images:

 # qemu-img create -f qcow2 /root/image1.qcow2 100M
 # qemu-img create -f qcow2 /root/image2.qcow2 100M

And then exported them both via vduse and nbd without interpreting qcow2, thus making the QSD into just a dumb storage device:

 # qemu-storage-daemon \
    --blockdev file,filename=/root/image1.qcow2,cache.direct=on,aio=native,node-name=file1 \
    --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=file1,writable=on \
    --blockdev file,filename=/root/image2.qcow2,cache.direct=on,aio=native,node-name=file2 \
    --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock \
    --export nbd,id=nbd0,node-name=file2,writable=on,name=exportname

Now when I start a VM using the NBD export in qcow2 format:

    <disk type='network' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source protocol='nbd' name='exportname'>
        <host transport='unix' socket='/tmp/nbd.sock'/>
      </source>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </disk>

The VM starts fine, but when using:

    <disk type='vhostvdpa' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source dev='/dev/vhost-vdpa-0'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </disk>

I get:

error: internal error: QEMU unexpectedly closed the monitor (vm='vdpa'): 2023-08-07T12:34:21.628520Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument

What is weird, though, is that if I attempt to use qemu itself to format the image, with the VDPA simulator it appears to work the very first time after the image is formatted, but if I restart the VM it no longer recognizes the image.
Since the simulator's content seems to be preserved accross VM restarts when using 'raw' I thought it should work: + virsh start cd Domain 'cd' started + virsh qemu-monitor-command cd --pass-fds 6 add-fd '{"fdset-id": 1337}' {"return":{"fd":37,"fdset-id":1337},"id":"libvirt-441"} + virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/1337","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' {"return":{},"id":"libvirt-442"} + virsh qemu-monitor-command cd blockdev-create '{"options":{"driver":"qcow2","file":"testformat","size":10485760},"job-id":"formatjob"}' {"return":{},"id":"libvirt-443"} + virsh qemu-monitor-command cd query-jobs {"return":[{"current-progress":1,"status":"concluded","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-444"} + sleep 1 + virsh qemu-monitor-command cd query-jobs {"return":[{"current-progress":1,"status":"concluded","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-445"} + sleep 1 + virsh qemu-monitor-command cd job-dismiss '{"id":"formatjob"}' {"return":{},"id":"libvirt-446"} + virsh qemu-monitor-command cd blockdev-add '{"node-name":"testqcow2","read-only":false,"driver":"qcow2","file":"testformat"}' {"return":{},"id":"libvirt-447"} + virsh destroy cd Domain 'cd' destroyed + virsh qemu-monitor-command cd --pass-fds 6 add-fd '{"fdset-id": 1337}' {"return":{"fd":37,"fdset-id":1337},"id":"libvirt-441"} + virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/1337","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' {"return":{},"id":"libvirt-442"} + virsh qemu-monitor-command cd blockdev-add '{"node-name":"testqcow2","read-only":false,"driver":"qcow2","file":"testformat"}' {"id":"libvirt-443","error":{"class":"GenericError","desc":"Could not read qcow2 header: Invalid argument"}} Additionally, even weirder is that if I use the qemu-storage-daemon backed vhost-vdpa device, the formatting job simply gets stuck forever: + virsh start cd Domain 'cd' started + virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/vhost-vdpa-0","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' {"return":{},"id":"libvirt-441"} + virsh qemu-monitor-command cd blockdev-create '{"options":{"driver":"qcow2","file":"testformat","size":10485760},"job-id":"formatjob"}' {"return":{},"id":"libvirt-442"} + virsh qemu-monitor-command cd query-jobs {"return":[{"current-progress":0,"status":"running","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-443"} + sleep 1 + virsh qemu-monitor-command cd query-jobs {"return":[{"current-progress":0,"status":"running","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-444"} + sleep 1 + virsh qemu-monitor-command cd job-dismiss '{"id":"formatjob"}' {"id":"libvirt-445","error":{"class":"GenericError","desc":"Job 'formatjob' in state 'running' cannot accept command verb 'dismiss'"}} (Note, it behaves the same when using FD passing, I've just tried because it's very weird) Disclaimer: It's possible I broke my test box but I can't simply reboot it at this point.

On Mon, Aug 07, 2023 at 03:41:21PM +0200, Peter Krempa wrote:
On Thu, Aug 03, 2023 at 09:48:01 +0200, Stefano Garzarella wrote:
On Wed, Aug 2, 2023 at 10:33 PM Jonathon Jongsma <jjongsma@redhat.com> wrote:
On 7/24/23 8:05 AM, Peter Krempa wrote:
[...]
I've also noticed that using 'qcow2' format for the device doesn't work:
error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
If that is supposed to work, then qemu devs will probably need to know about that; if it is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image though, as I didn't really try to figure out what's happening.
That's a good question, and I don't actually know the answer. Were you using an actual vdpa block device for your tests or were you using the vdpa block simulator kernel module? How did you set it up? Adding Stefano to cc for his thoughts.
Yep, I would also like to understand how you initialized the device with a qcow2 format.
Naively and originally, I simply used it as 'raw' at first and formatted it from the guest OS. Then I shut down the VM and started it back up, reconfiguring the image format as qcow2. This normally works with real-file-backed storage, and since the vdpa simulator seems to persist the contents, I assumed this would work.
Cool, I'll try that. Can you try to reboot the VM, use it as `raw`, and read the qcow2 image from the guest OS? Note: there could be some bugs in the simulator!
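For example, a quick way to run that check from the guest OS, assuming the vdpa disk shows up there as /dev/vdb and qemu-img happens to be installed in the guest (both of which are assumptions), is to look for the qcow2 magic at offset 0:

$ head -c 4 /dev/vdb | od -An -tx1
 51 46 49 fb                        # 'QFI\xfb', the qcow2 magic
$ qemu-img info -f qcow2 /dev/vdb   # prints the qcow2 metadata if the header is intact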
Theoretically, the best use case for vDPA block is that the backend handles formats and QEMU sees just a virtio device; but since it is exposed as a blockdev, we should be able to layer formats on top anyway, so it should work.
Yeah, ideally there will be no format driver in qemu used for these devices (this is not yet the case; I'll need to fix libvirt to stop using the 'raw' driver when it's not needed).
Here I'm more interested in whether it is supposed to work, in which case we want to allow using qcow2 as a format in libvirt, or whether it's not supposed to work, in which case we should forbid it before the user gets a suboptimal error message such as the one above.
This is a good question. We certainly haven't tested it, because it's an uncommon scenario, but as I said before, maybe it should work. I need to check it better.
For now, while waiting for real hardware, the only way to test vDPA block support in QEMU is to use the simulator in the kernel or VDUSE.
With the kernel simulator we only have a 128 MB ramdisk available; with VDUSE you can use QSD with any file:
$ modprobe -a vhost_vdpa vduse
$ qemu-storage-daemon \
      --blockdev file,filename=/path/to/image.qcow2,cache.direct=on,aio=native,node-name=file \
      --blockdev qcow2,file=file,node-name=qcow2 \
      --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=qcow2,writable=on
$ vdpa dev add name vduse0 mgmtdev vduse
Then you have a /dev/vhost-vdpa-X device that you can use with the `virtio-blk-vhost-vdpa` blockdev (note: vduse requires QEMU with a memory backend with `share=on`), but using raw, since the qcow2 is handled by QSD. Of course, we should also be able to use a raw file with QSD and qcow2 in qemu (although it's not the optimal configuration), but I don't know how to initialize a `virtio-blk-vhost-vdpa` blockdev with a qcow2 image :-(
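For reference, attaching that device to qemu as raw (no format driver on top) could look roughly like the sketch below; this is an untested illustration, and the memfd backend choice and node names are mine, with the share=on backend being the part vduse requires:

qemu-system-x86_64 -m 512M \
    -M q35,accel=kvm,memory-backend=mem \
    -object memory-backend-memfd,id=mem,share=on,size=512M \
    -blockdev node-name=vdpa0,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
    -device virtio-blk-pci,drive=vdpa0

With no format blockdev on top, the guest sees exactly what QSD exports.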
With the above qemu storage daemon you should be able to do that by simply dropping the qcow2 format driver and exposing a qcow2-formatted image. It works similarly with NBD:
I've formatted 2 qcow2 images:
# qemu-img create -f qcow2 /root/image1.qcow2 100M
# qemu-img create -f qcow2 /root/image2.qcow2 100M
And then exported them both via vduse and nbd without interpreting qcow2, thus making the QSD into just a dumb storage device:
# qemu-storage-daemon \
      --blockdev file,filename=/root/image1.qcow2,cache.direct=on,aio=native,node-name=file1 \
      --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=file1,writable=on \
      --blockdev file,filename=/root/image2.qcow2,cache.direct=on,aio=native,node-name=file2 \
      --nbd-server addr.type=unix,addr.path=/tmp/nbd.sock \
      --export nbd,id=nbd0,node-name=file2,writable=on,name=exportname
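As a sanity check (assuming the exports above are running): since QSD is serving the raw bytes of the image, qemu-img pointed at the NBD socket should still detect qcow2:

$ qemu-img info "nbd+unix:///exportname?socket=/tmp/nbd.sock"
# expect "file format: qcow2" here, as the export does not interpret the image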
Cool! Thanks for sharing!
Now when I start a VM using the NBD export in qcow2 format:
<disk type='network' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source protocol='nbd' name='exportname'>
    <host transport='unix' socket='/tmp/nbd.sock'/>
  </source>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</disk>
The VM starts fine, but when using:
<disk type='vhostvdpa' device='disk'>
  <driver name='qemu' type='qcow2' cache='none'/>
  <source dev='/dev/vhost-vdpa-0'/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
</disk>
I get:
error: internal error: QEMU unexpectedly closed the monitor (vm='vdpa'): 2023-08-07T12:34:21.628520Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
mmm, I just tried this scenario using QEMU directly and it worked. These are the steps I did (qemu upstream, commit 9400601a689a128c25fa9c21e932562e0eeb7a26):

./build/storage-daemon/qemu-storage-daemon \
    --blockdev file,filename=test.qcow2,cache.direct=on,aio=native,node-name=file \
    --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=file,writable=on

vdpa dev add name vduse0 mgmtdev vduse

./build/qemu-system-x86_64 -m 512M -smp 2 \
    -M q35,accel=kvm,memory-backend=mem \
    -drive file=f38-vm-build.qcow2,format=qcow2,if=none,id=hd0 \
    -device virtio-blk-pci,drive=hd0,bootindex=1 \
    -blockdev node-name=drive_src1,driver=virtio-blk-vhost-vdpa,path=/dev/vhost-vdpa-0,cache.direct=on \
    -blockdev qcow2,node-name=qcow2,file=drive_src1 \
    -device virtio-blk-pci,id=src1,bootindex=2,drive=qcow2 \
    -object memory-backend-file,share=on,id=mem,size=512M,mem-path="/dev/hugepages"

Then I'm able to see /dev/vdb and /dev/vdb1 (test.qcow2 has a fs on the first partition). I mounted vdb1 and ran md5sum on a file. Then I turned off the machine, moved the `-blockdev qcow2...` from qemu to QSD, did the same steps, and checked that the md5 is the same.

So it seems to work, but maybe we have something different. My host kernel is 6.4.7-200.fc38.x86_64.

Thanks,
Stefano
What is weird, though, is that if I attempt to use qemu itself to format the image, with the vDPA simulator it appears to work the very first time after the image is formatted, but if I restart the VM it no longer recognizes the image. Since the simulator's contents seem to be preserved across VM restarts when using 'raw', I thought it should work:
+ virsh start cd
Domain 'cd' started

+ virsh qemu-monitor-command cd --pass-fds 6 add-fd '{"fdset-id": 1337}'
{"return":{"fd":37,"fdset-id":1337},"id":"libvirt-441"}

+ virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/1337","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
{"return":{},"id":"libvirt-442"}

+ virsh qemu-monitor-command cd blockdev-create '{"options":{"driver":"qcow2","file":"testformat","size":10485760},"job-id":"formatjob"}'
{"return":{},"id":"libvirt-443"}

+ virsh qemu-monitor-command cd query-jobs
{"return":[{"current-progress":1,"status":"concluded","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-444"}

+ sleep 1
+ virsh qemu-monitor-command cd query-jobs
{"return":[{"current-progress":1,"status":"concluded","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-445"}

+ sleep 1
+ virsh qemu-monitor-command cd job-dismiss '{"id":"formatjob"}'
{"return":{},"id":"libvirt-446"}

+ virsh qemu-monitor-command cd blockdev-add '{"node-name":"testqcow2","read-only":false,"driver":"qcow2","file":"testformat"}'
{"return":{},"id":"libvirt-447"}

+ virsh destroy cd
Domain 'cd' destroyed

+ virsh qemu-monitor-command cd --pass-fds 6 add-fd '{"fdset-id": 1337}'
{"return":{"fd":37,"fdset-id":1337},"id":"libvirt-441"}

+ virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/fdset/1337","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
{"return":{},"id":"libvirt-442"}

+ virsh qemu-monitor-command cd blockdev-add '{"node-name":"testqcow2","read-only":false,"driver":"qcow2","file":"testformat"}'
{"id":"libvirt-443","error":{"class":"GenericError","desc":"Could not read qcow2 header: Invalid argument"}}
Additionally, and even weirder, if I use the qemu-storage-daemon-backed vhost-vdpa device, the formatting job simply gets stuck forever:
+ virsh start cd
Domain 'cd' started

+ virsh qemu-monitor-command cd blockdev-add '{"driver":"virtio-blk-vhost-vdpa","path":"/dev/vhost-vdpa-0","node-name":"testformat","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}'
{"return":{},"id":"libvirt-441"}

+ virsh qemu-monitor-command cd blockdev-create '{"options":{"driver":"qcow2","file":"testformat","size":10485760},"job-id":"formatjob"}'
{"return":{},"id":"libvirt-442"}

+ virsh qemu-monitor-command cd query-jobs
{"return":[{"current-progress":0,"status":"running","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-443"}

+ sleep 1
+ virsh qemu-monitor-command cd query-jobs
{"return":[{"current-progress":0,"status":"running","total-progress":1,"type":"create","id":"formatjob"}],"id":"libvirt-444"}

+ sleep 1
+ virsh qemu-monitor-command cd job-dismiss '{"id":"formatjob"}'
{"id":"libvirt-445","error":{"class":"GenericError","desc":"Job 'formatjob' in state 'running' cannot accept command verb 'dismiss'"}}
(Note: it behaves the same when using FD passing; I only tried that because this is all very weird.)
Disclaimer: It's possible I broke my test box but I can't simply reboot it at this point.

On 8/8/23 6:00 AM, Stefano Garzarella wrote:
On Mon, Aug 07, 2023 at 03:41:21PM +0200, Peter Krempa wrote:
[...]
mmm, I just tried this scenario using QEMU directly and it worked. These are the steps I did (qemu upstream, commit 9400601a689a128c25fa9c21e932562e0eeb7a26):
[...]
So it seems to work, but maybe we have something different. My host kernel is 6.4.7-200.fc38.x86_64.
By the way, I get the same "Could not read qcow2 header" error that Peter reported when I use this direct qemu commandline. My laptop is a little bit behind so I'm still on fedora 37.

Jonathon

On 8/16/23 4:19 PM, Jonathon Jongsma wrote:
[...]
By the way, I get the same "Could not read qcow2 header" error that Peter reported when I use this direct qemu commandline. My laptop is a little bit behind so I'm still on fedora 37.
Jonathon
I recently upgraded my laptop to fedora 38 and retested this: using the above procedure, qemu-storage-daemon exported a vdpa block device for a qcow2 disk image, and it was readable inside the guest launched from my libvirt branch. So it looks like the "Could not read qcow2 header" error has been fixed upstream already.

So I don't think there's anything else blocking this libvirt implementation. I'll post an updated patch soon.

Jonathon
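For anyone reproducing this later, the working setup condensed from earlier in the thread (the image path and domain name below are placeholders) boils down to:

$ modprobe -a vhost_vdpa vduse
$ qemu-storage-daemon \
      --blockdev file,filename=test.qcow2,cache.direct=on,aio=native,node-name=file \
      --export vduse-blk,id=vduse0,name=vduse0,num-queues=1,node-name=file,writable=on &
$ vdpa dev add name vduse0 mgmtdev vduse
# then start a domain whose <disk type='vhostvdpa'> points at /dev/vhost-vdpa-0,
# uses <driver name='qemu' type='qcow2'/>, and has shared memory configured
$ virsh start mydomain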

On Fri, Sep 01, 2023 at 03:24:11PM -0500, Jonathon Jongsma wrote:
[...]
I recently upgraded my laptop to fedora 38 and retested this. I used the above procedure and used qemu-storage-daemon to export a vdpa block device for a qcow2 disk image and it was readable inside the guest launched from my libvirt branch. So it looks like the "Could not read qcow2 header" error has been fixed upstream already.
Thanks for the update! Yep, vDPA is under continuous development, so it's likely there was some problem in QEMU or Linux that we fixed in a new release.
So I don't think there's anything else blocking this libvirt implementation.
Yep, I agree!
I'll post an updated patch soon.
Cool, thanks!

Stefano

On 7/24/23 8:05 AM, Peter Krempa wrote:
As I promised a long time ago, I gave your patches some testing with regard to how they cooperate with blockjobs and snapshots.
Since the new version of the patches was not yet posted on the list, I'm replying here with my observations from testing the patches from your gitlab branch:
On Tue, Jun 06, 2023 at 16:15:30 -0500, Jonathon Jongsma wrote:
Requires recent qemu with support for the virtio-blk-vhost-vdpa device and the ability to pass a /dev/fdset/N path for the vdpa path (8.1.0)
Since this is a feature addition, the 'Fixes' keyword doesn't make sense; use e.g. 'Resolves' instead.
Additionally you're missing the DCO certification here.
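For illustration, the commit message could end with trailers along these lines (the bug number here is a placeholder, not taken from the actual patch; the Signed-off-by line is what the DCO requires):

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=NNNNNNN
Signed-off-by: Jonathon Jongsma <jjongsma@redhat.com>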
---
 src/qemu/qemu_block.c                      | 20 ++++++++--
 src/qemu/qemu_domain.c                     | 25 ++++++++++++
 src/qemu/qemu_validate.c                   | 44 +++++++++++++++++++---
 tests/qemuxml2argvdata/disk-vhostvdpa.args | 35 +++++++++++++++++
 tests/qemuxml2argvdata/disk-vhostvdpa.xml  | 21 +++++++++++
 tests/qemuxml2argvtest.c                   |  2 +
 6 files changed, 139 insertions(+), 8 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.args
 create mode 100644 tests/qemuxml2argvdata/disk-vhostvdpa.xml
[...]
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 2f6b32e394..119e52a7d7 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -11157,6 +11157,28 @@ qemuDomainPrepareStorageSourceFDs(virStorageSource *src,
 }

+static int
+qemuDomainPrepareStorageSourceVDPA(virStorageSource *src,
+                                   qemuDomainObjPrivate *priv)
+{
+    qemuDomainStorageSourcePrivate *srcpriv = NULL;
+    virStorageType actualType = virStorageSourceGetActualType(src);
+    int vdpafd = -1;
+
+    if (actualType != VIR_STORAGE_TYPE_VHOST_VDPA)
+        return 0;
+
+    if ((vdpafd = qemuVDPAConnect(src->path)) < 0)
+        return -1;
This function call directly touches the host filesystem, which is not supposed to happen in the *DomainPrepareStorageSource* functions; rather, we have completely separate machinery for that in qemuProcessPrepareHostStorage().
Unfortunately that one doesn't yet need to handle individual backing chain members though.
This ensures that the code doesn't accidentally get called from tests even without mocking, as the tests reimplement the functions differently for testing purposes.
Somehow I missed this comment earlier. Unfortunately, it doesn't seem straightforward to move this code. We can't simply move all of the logic to qemuProcessPrepareHostStorage() because that function doesn't get called during the tests. I thought I could move only the opening of the fd to the PrepareHostStorage() function and then keep the qemuFDPass construction in this function, but that doesn't work: the PrepareHostStorage() function actually gets called *after* this function, so the fd would not even be open yet at the time this function gets called.

So it seems that the options are either:

- leave everything in qemuDomainPrepareStorageSourceVDPA() (as is)
- move the fd opening to PrepareHostStorage() and then move the rest to a different common function that is called after that, such as qemuBuildDiskSourceCommandLine()
- a third option (suggestions?)

It's worth noting that the vdpa *network* device essentially does everything (opening the file, creating the qemuFDPass object, etc.) in the qemuBuildInterfaceCommandLine() function. This was done to match other network devices that connect to open file descriptors (see qemuBuildInterfaceConnect()). But based on your comments above, it sounds like this may not be the ideal situation, even though the function is mocked so as not to actually open any file descriptors from the host filesystem under test.

Jonathon
+
+    srcpriv = qemuDomainStorageSourcePrivateFetch(src);
+
+    srcpriv->fdpass = qemuFDPassNew(src->nodestorage, priv);
+    qemuFDPassAddFD(srcpriv->fdpass, &vdpafd, "-vdpa");
+    return 0;
+}
[...]
diff --git a/src/qemu/qemu_validate.c b/src/qemu/qemu_validate.c
index 9dce908cfe..67b0944162 100644
--- a/src/qemu/qemu_validate.c
+++ b/src/qemu/qemu_validate.c
@@ -3220,6 +3220,28 @@ qemuValidateDomainDeviceDefDiskTransient(const virDomainDiskDef *disk,
 }

+static int
+qemuValidateDomainDeviceDefDiskVhost(const virDomainDef *def,
+                                     virStorageType storagetype,
+                                     virQEMUCapsFlags flag,
+                                     virQEMUCaps *qemuCaps)
+{
+    const char *vhosttype = virStorageTypeToString(storagetype);
+
+    if (!virQEMUCapsGet(qemuCaps, flag)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("%1$s disk is not supported with this QEMU binary"),
+                       vhosttype);
+        return -1;
I'd prefer if both of the things this function does were duplicated inline below rather than passed in via arguments here; it makes the code harder to follow.
+    }
+
+    if (qemuValidateDomainDefVhostUserRequireSharedMemory(def, vhosttype) < 0)
+        return -1;
+
+    return 0;
+}
In my testing of the code from the new branch I've observed that blockjobs and snapshot creation work well thanks to libblkio, so we don't have to add any additional checks or limitations.
I'll still need to go ahead and finish the series removing the 'raw' driver when it's not necessary, so that the fast path, once implemented, will be usable. Waiting for that is not necessary for this series, as it works properly even with the 'raw' driver in place.
With your new version of the patches I've noticed the following problems:
- After converting to store the vdpa device path in src->vdpadev:
- rejecting an empty disk source doesn't work for vdpa: if you use <source/> instead of the proper path, the XML will be defined but broken (see the example after this list).
- virStorageSourceCopy doesn't copy the new member (as instructed in the comment above the struct); thus the block job code, which uses this extensively to work on the inactive definition, creates broken configurations.
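To illustrate the first point, a sketch of a definition that is currently accepted even though the source is empty (the surrounding attributes here are illustrative, not taken from an actual test case):

<disk type='vhostvdpa' device='disk'>
  <driver name='qemu' type='raw'/>
  <source/>
  <target dev='vda' bus='virtio'/>
</disk>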
I've also noticed that using 'qcow2' format for the device doesn't work:
error: internal error: process exited while connecting to monitor: 2023-07-24T12:54:15.818631Z qemu-system-x86_64: -blockdev {"node-name":"libvirt-1-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"qcow2","file":"libvirt-1-storage"}: Could not read qcow2 header: Invalid argument
If that is supposed to work, then qemu devs will probably need to know about that; if it is not supposed to work, libvirt needs to add a check, because the error doesn't tell much. It's also possible I've messed up when formatting the image though, as I didn't really try to figure out what's happening.

On Fri, Sep 08, 2023 at 14:51:14 -0500, Jonathon Jongsma wrote:
[...]
Somehow I missed this comment earlier. Unfortunately, it doesn't seem straightforward to move this code. We can't simply move all of the logic to qemuProcessPrepareHostStorage() because that function doesn't get called during the tests.
That is the exact idea of it: this must not be called from tests, because you must not rely on host state during tests. To make tests work, we then set up a fake FD in testCompareXMLToArgvCreateArgs to be passed to the command-line generator. Other stuff, mostly FD-passing-related, is mocked there similarly.

Additionally, qemuProcessPrepareHostStorage() is also skipped inside the domxmlToNative API, as you don't want to actually attempt opening the VDPA device when it won't be used; in that case mocking via LD_PRELOAD is not possible. In certain cases (if we care enough about the domxml->native API) we even set up an alternative syntax (e.g. not using FD passing) so that users can adapt such a command line and use it as inspiration to actually run qemu.
I thought I could move only the opening of the fd to the PrepareHostStorage() function and then keep the qemuFDPass construction in this function, but that doesn't work: the PrepareHostStorage() function actually gets called *after* this function. So the fd would not even be open yet at the time this function gets called.
So... it seems that the options are either:
- leave everything in qemuDomainPrepareStorageSourceVDPA() (as is)
No. That would violate the idea of qemuProcessPrepareHostStorage().
- move the fd opening to PrepareHostStorage() and then move the rest to a different common function that is called after that, such as qemuBuildDiskSourceCommandLine()
Inside PrepareHostStorage() you both open the FD and create the fdpass structure and assign it to the private data, so there's no need for any other location to do anything else.
- a third option (suggestions?)
It's worth noting that the vdpa *network* device essentially does everything (opening the file, creating the qemuFDPass object, etc) in the qemuBuildInterfaceCommandLine() function. This was done to match other network devices that connect to open file descriptors (see qemuBuildInterfaceConnect()). But based on your comments above, it sounds like this may not be a the ideal situation even though the function is mocked to not actually open any file descriptors from the host filesystem under test.
Hmm, that is right, the function will be mocked. But regardless, so as not to make the code obscure, please do the host-dependent storage backend setup in the aforementioned function, and whatever is required to "mock" it (even including calling qemuVDPAConnect, if you want to rely on the LD_PRELOAD-ed version) in the tests.