[RFC PATCH 0/7] To make <transient/> disk sharable

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

This patch series is an RFC implementation to make the <transient/> disk sharable.

Currently, the <transient/> disk option for qemu uses the blockdev-snapshot QMP command with an overlay. In that case, qemu holds the write lock on the original disk, so the original disk cannot be shared with other qemu guests.

This patch series tries to make the disks which have the <transient/> disk option sharable by using disk hot-plug. First, create the overlay disk with the original disk set as its backingStore. Then, blockdev-del the StorageProps and FormatProps of the disk; this is done in order to fix up the bootindex of the disk. Lastly, device_add the disks and start the CPUs.

The sharable transient disk has the same limitations as the current <transient/> disk option, and also has the following limitations:

- qemu needs to have the hotplug feature
- all disk buses must be virtio

Masayoshi Mizuma (7):
  qemu_hotplug: Add asyncJob argument to qemuDomainAttachDiskGeneric
  qemu_hotplug: Add asyncJob argument to qemuDomainAttachDeviceDiskLiveInternal
  qemu_hotplug: Add asyncJob argument to qemuDomainRemoveDiskDevice
  qemu_hotplug: Add bootindex argument to qemuDomainAttachDiskGeneric
  qemu_hotplug: make <transient/> disk sharable
  qemu: Add check whether the transient disks are sharable
  qemu: Add virtio disks as sharable transient disks

 src/qemu/qemu_command.c |  20 ++-
 src/qemu/qemu_domain.h  |   3 +
 src/qemu/qemu_hotplug.c | 322 ++++++++++++++++++++++++++++++++++++++--
 src/qemu/qemu_hotplug.h |   3 +
 src/qemu/qemu_process.c |  38 ++++-
 5 files changed, 368 insertions(+), 18 deletions(-)

--
2.27.0
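For illustration, a disk covered by this series might be configured along the following lines; the path and target name are hypothetical, and the bus is virtio to match the limitation above. The original image would stay read-only and shared, with a throwaway overlay absorbing the guest's writes:

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/base.qcow2'/>
    <target dev='vda' bus='virtio'/>
    <transient/>
  </disk>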

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add an asyncJob argument to qemuDomainAttachDiskGeneric() so that it can
be used before the CPUs start.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_hotplug.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 882e5d2384..609e9d1a8a 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -693,7 +693,8 @@ qemuDomainChangeEjectableMedia(virQEMUDriverPtr driver,
 static int
 qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
                             virDomainObjPtr vm,
-                            virDomainDiskDefPtr disk)
+                            virDomainDiskDefPtr disk,
+                            qemuDomainAsyncJob asyncJob)
 {
     g_autoptr(qemuBlockStorageSourceChainData) data = NULL;
     int ret = -1;
@@ -740,7 +741,8 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
     if (qemuHotplugAttachManagedPR(driver, vm, disk->src, QEMU_ASYNC_JOB_NONE) < 0)
         goto cleanup;

-    qemuDomainObjEnterMonitor(driver, vm);
+    if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0)
+        goto cleanup;

     if (qemuBlockStorageSourceChainAttach(priv->mon, data) < 0)
         goto exit_monitor;
@@ -820,7 +822,7 @@ qemuDomainAttachVirtioDiskDevice(virQEMUDriverPtr driver,
     if (qemuDomainEnsureVirtioAddress(&releaseaddr, vm, &dev, disk->dst) < 0)
         return -1;

-    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk)) < 0) {
+    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE)) < 0) {
         if (rv == -1 && releaseaddr)
             qemuDomainReleaseDeviceAddress(vm, &disk->info);

@@ -999,7 +1001,7 @@ qemuDomainAttachSCSIDisk(virQEMUDriverPtr driver,
         return -1;
     }

-    if (qemuDomainAttachDiskGeneric(driver, vm, disk) < 0)
+    if (qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE) < 0)
         return -1;

     return 0;
@@ -1016,7 +1018,7 @@ qemuDomainAttachUSBMassStorageDevice(virQEMUDriverPtr driver,
     if (virDomainUSBAddressEnsure(priv->usbaddrs, &disk->info) < 0)
         return -1;

-    if (qemuDomainAttachDiskGeneric(driver, vm, disk) < 0) {
+    if (qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE) < 0) {
         virDomainUSBAddressRelease(priv->usbaddrs, &disk->info);
         return -1;
     }
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add an asyncJob argument to qemuDomainAttachDeviceDiskLiveInternal() so
that it can be used before the CPUs start.

To avoid a compile warning, G_GNUC_UNUSED is added here. A later patch
will remove G_GNUC_UNUSED.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_hotplug.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 609e9d1a8a..a61899d53a 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -1030,7 +1030,8 @@ qemuDomainAttachUSBMassStorageDevice(virQEMUDriverPtr driver,
 static int
 qemuDomainAttachDeviceDiskLiveInternal(virQEMUDriverPtr driver,
                                        virDomainObjPtr vm,
-                                       virDomainDeviceDefPtr dev)
+                                       virDomainDeviceDefPtr dev,
+                                       qemuDomainAsyncJob asyncJob G_GNUC_UNUSED)
 {
     size_t i;
     virDomainDiskDefPtr disk = dev->data.disk;
@@ -1138,7 +1139,7 @@ qemuDomainAttachDeviceDiskLive(virQEMUDriverPtr driver,
         return 0;
     }

-    return qemuDomainAttachDeviceDiskLiveInternal(driver, vm, dev);
+    return qemuDomainAttachDeviceDiskLiveInternal(driver, vm, dev, QEMU_ASYNC_JOB_NONE);
 }
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add an asyncJob argument to qemuDomainRemoveDiskDevice() so that it can
be used before the CPUs start.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_hotplug.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index a61899d53a..b989652533 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -4302,7 +4302,8 @@ static bool qemuIsMultiFunctionDevice(virDomainDefPtr def,
 static int
 qemuDomainRemoveDiskDevice(virQEMUDriverPtr driver,
                            virDomainObjPtr vm,
-                           virDomainDiskDefPtr disk)
+                           virDomainDiskDefPtr disk,
+                           qemuDomainAsyncJob asyncJob)
 {
     qemuDomainDiskPrivatePtr diskPriv = QEMU_DOMAIN_DISK_PRIVATE(disk);
     g_autoptr(qemuBlockStorageSourceChainData) diskBackend = NULL;
@@ -4347,7 +4348,8 @@ qemuDomainRemoveDiskDevice(virQEMUDriverPtr driver,
         }
     }

-    qemuDomainObjEnterMonitor(driver, vm);
+    if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0)
+        goto cleanup;

     if (corAlias)
         ignore_value(qemuMonitorBlockdevDel(priv->mon, corAlias));
@@ -5093,7 +5095,8 @@ qemuDomainRemoveDevice(virQEMUDriverPtr driver,
      * into this function. */
     case VIR_DOMAIN_DEVICE_DISK:
-        if (qemuDomainRemoveDiskDevice(driver, vm, dev->data.disk) < 0)
+        if (qemuDomainRemoveDiskDevice(driver, vm, dev->data.disk,
+                                       QEMU_ASYNC_JOB_NONE) < 0)
             return -1;
         break;
     case VIR_DOMAIN_DEVICE_CONTROLLER:
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add a bootindex argument to qemuDomainAttachDiskGeneric() so that qemu
can detect the boot index for disks which are hot-added before the CPUs
start.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_hotplug.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index b989652533..a2535949b7 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -694,6 +694,7 @@ static int
 qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
                             virDomainObjPtr vm,
                             virDomainDiskDefPtr disk,
+                            unsigned int bootindex,
                             qemuDomainAsyncJob asyncJob)
 {
     g_autoptr(qemuBlockStorageSourceChainData) data = NULL;
@@ -732,7 +733,7 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
         goto cleanup;
     }

-    if (!(devstr = qemuBuildDiskDeviceStr(vm->def, disk, 0, priv->qemuCaps)))
+    if (!(devstr = qemuBuildDiskDeviceStr(vm->def, disk, bootindex, priv->qemuCaps)))
         goto cleanup;

     if (VIR_REALLOC_N(vm->def->disks, vm->def->ndisks + 1) < 0)
@@ -822,7 +823,8 @@ qemuDomainAttachVirtioDiskDevice(virQEMUDriverPtr driver,
     if (qemuDomainEnsureVirtioAddress(&releaseaddr, vm, &dev, disk->dst) < 0)
         return -1;

-    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE)) < 0) {
+    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk, 0,
+                                          QEMU_ASYNC_JOB_NONE)) < 0) {
         if (rv == -1 && releaseaddr)
             qemuDomainReleaseDeviceAddress(vm, &disk->info);

@@ -1001,7 +1003,7 @@ qemuDomainAttachSCSIDisk(virQEMUDriverPtr driver,
         return -1;
     }

-    if (qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE) < 0)
+    if (qemuDomainAttachDiskGeneric(driver, vm, disk, 0, QEMU_ASYNC_JOB_NONE) < 0)
         return -1;

     return 0;
@@ -1018,7 +1020,7 @@ qemuDomainAttachUSBMassStorageDevice(virQEMUDriverPtr driver,
     if (virDomainUSBAddressEnsure(priv->usbaddrs, &disk->info) < 0)
         return -1;

-    if (qemuDomainAttachDiskGeneric(driver, vm, disk, QEMU_ASYNC_JOB_NONE) < 0) {
+    if (qemuDomainAttachDiskGeneric(driver, vm, disk, 0, QEMU_ASYNC_JOB_NONE) < 0) {
         virDomainUSBAddressRelease(priv->usbaddrs, &disk->info);
         return -1;
     }
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add qemuHotplugCreateDisksTransient() to make the <transient/> disk
sharable. The procedure is as follows. First, create the overlay disk
with the original disk set as its backingStore. Then, blockdev-del the
StorageProps and FormatProps of the disk; this is done in order to fix
up the bootindex of the disk. Lastly, device_add the disks.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_command.c |   3 +
 src/qemu/qemu_hotplug.c | 285 ++++++++++++++++++++++++++++++++++++++++
 src/qemu/qemu_hotplug.h |   3 +
 3 files changed, 291 insertions(+)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 1ec302d4ac..81a27703c5 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2139,6 +2139,9 @@ qemuBuildDisksCommandLine(virCommandPtr cmd,
             bootCD = 0;
             break;
         case VIR_DOMAIN_DISK_DEVICE_DISK:
+            /* to use bootindex later for transient disk */
+            disk->info.bootIndex = bootDisk;
+            G_GNUC_FALLTHROUGH;
         case VIR_DOMAIN_DISK_DEVICE_LUN:
             bootindex = bootDisk;
             bootDisk = 0;
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index a2535949b7..5d0445538d 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -6710,3 +6710,288 @@ qemuDomainSetVcpuInternal(virQEMUDriverPtr driver,
     virBitmapFree(livevcpus);
     return ret;
 }
+
+struct _qemuHotplugTransientDiskContext {
+    virDomainDeviceDefPtr trandev;
+    virDomainDiskDefPtr *domdisk;
+    size_t ndd;
+};
+
+typedef struct _qemuHotplugTransientDiskContext qemuHotplugTransientDiskContext;
+typedef qemuHotplugTransientDiskContext *qemuHotplugTransientDiskContextPtr;
+
+static qemuHotplugTransientDiskContextPtr
+qemuHotplugTransientDiskContextNew(size_t ndisks)
+{
+    qemuHotplugTransientDiskContextPtr ret = g_new0(qemuHotplugTransientDiskContext, 1);
+
+    ret->trandev = g_new0(virDomainDeviceDef, ndisks);
+    ret->domdisk = g_new0(virDomainDiskDefPtr, ndisks);
+
+    return ret;
+}
+
+static void
+qemuHotplugTransientDiskCleanup(virDomainDeviceDefPtr data,
+                                virDomainDiskDefPtr *domdisk)
+{
+    VIR_FREE(data);
+    VIR_FREE(domdisk);
+
+    return;
+}
+
+static void
+qemuHotplugTransientDiskContextCleanup(qemuHotplugTransientDiskContextPtr hptctxt)
+{
+    if (!hptctxt)
+        return;
+
+    qemuHotplugTransientDiskCleanup(hptctxt->trandev, hptctxt->domdisk);
+
+    g_free(hptctxt);
+}
+
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(qemuHotplugTransientDiskContext, qemuHotplugTransientDiskContextCleanup);
+
+static int
+qemuHotplugDiskPrepareOneBlockdev(virQEMUDriverPtr driver,
+                                  virDomainObjPtr vm,
+                                  virQEMUDriverConfigPtr cfg,
+                                  virDomainDiskDefPtr domdisk,
+                                  virDomainDiskDefPtr trandisk,
+                                  GHashTable *blockNamedNodeData,
+                                  qemuDomainAsyncJob asyncJob)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    g_autoptr(qemuBlockStorageSourceChainData) data = NULL;
+    g_autoptr(virStorageSource) terminator = NULL;
+
+    terminator = virStorageSourceNew();
+
+    if (qemuDomainPrepareStorageSourceBlockdev(trandisk, trandisk->src,
+                                               priv, cfg) < 0)
+        return -1;
+
+    if (!(data = qemuBuildStorageSourceChainAttachPrepareBlockdevTop(trandisk->src,
+                                                                     terminator,
+                                                                     priv->qemuCaps)))
+        return -1;
+
+    if (qemuBlockStorageSourceCreateDetectSize(blockNamedNodeData,
+                                               trandisk->src, domdisk->src) < 0)
+        return -1;
+
+    if (qemuBlockStorageSourceCreate(vm, trandisk->src, domdisk->src,
+                                     NULL, data->srcdata[0],
+                                     asyncJob) < 0)
+        return -1;
+
+    /* blockdev-del the transient disk src. The disk is blockdev-add'ed
+     * while the disk is hot-added */
+    if (qemuBlockStorageSourceDetachOneBlockdev(driver, vm,
+                                                asyncJob, trandisk->src) < 0)
+        return -1;
+
+    return 0;
+}
+
+static int
+qemuHotplugDiskTransientPrepareOne(virDomainObjPtr vm,
+                                   virQEMUDriverConfigPtr cfg,
+                                   virDomainDiskDefPtr domdisk,
+                                   virDomainDiskDefPtr trandisk,
+                                   GHashTable *blockNamedNodeData,
+                                   qemuDomainAsyncJob asyncJob)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    virQEMUDriverPtr driver = priv->driver;
+    bool supportsCreate;
+
+    if (qemuDomainStorageSourceValidateDepth(trandisk->src, 1, trandisk->dst) < 0)
+        return -1;
+
+    if (virStorageSourceInitChainElement(trandisk->src, domdisk->src, false) < 0)
+        return -1;
+
+    trandisk->src->readonly = false;
+    supportsCreate = virStorageSourceSupportsCreate(trandisk->src);
+
+    if (supportsCreate) {
+        if (qemuDomainStorageFileInit(driver, vm, trandisk->src, NULL) < 0)
+            return -1;
+
+        if (virStorageSourceCreate(trandisk->src) < 0) {
+            virReportSystemError(errno, _("failed to create image file '%s'"),
+                                 NULLSTR(trandisk->src->path));
+            return -1;
+        }
+    }
+
+    if (qemuDomainStorageSourceAccessAllow(driver, vm, trandisk->src,
+                                           false, true, true) < 0)
+        return -1;
+
+    if (qemuHotplugDiskPrepareOneBlockdev(driver, vm, cfg, domdisk, trandisk,
+                                          blockNamedNodeData, asyncJob) < 0)
+        return -1;
+
+    return 0;
+}
+
+static qemuHotplugTransientDiskContextPtr
+qemuHotplugDiskPrepareDisksTransient(virDomainObjPtr vm,
+                                     virQEMUDriverConfigPtr cfg,
+                                     GHashTable *blockNamedNodeData,
+                                     qemuDomainAsyncJob asyncJob)
+{
+    g_autoptr(qemuHotplugTransientDiskContext) hptctxt = NULL;
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    virQEMUDriverPtr driver = priv->driver;
+    size_t i;
+
+    hptctxt = qemuHotplugTransientDiskContextNew(vm->def->ndisks);
+
+    for (i = 0; i < vm->def->ndisks; i++) {
+        virDomainDiskDefPtr domdisk = vm->def->disks[i];
+        virDomainDiskDefPtr trandisk;
+        virDomainDeviceDefPtr trandev;
+
+        if (!(trandisk = virDomainDiskDefNew(driver->xmlopt)))
+            return NULL;
+
+        trandev = hptctxt->trandev + hptctxt->ndd;
+        trandev->type = VIR_DOMAIN_DEVICE_DISK;
+
+        memcpy(&trandisk->info, &domdisk->info, sizeof(virDomainDeviceInfo));
+        trandisk->info.alias = NULL;
+        trandisk->info.romfile = NULL;
+
+        if (domdisk->transient) {
+            trandisk->src = virStorageSourceNew();
+            trandisk->src->type = VIR_STORAGE_TYPE_FILE;
+            trandisk->src->format = VIR_STORAGE_FILE_QCOW2;
+            trandisk->src->path = g_strdup_printf("%s.TRANSIENT-%s",
+                                                  domdisk->src->path, vm->def->name);
+
+            if (!(trandisk->src->backingStore =
+                  virStorageSourceCopy(domdisk->src, false)))
+                return NULL;
+
+            if (virFileExists(trandisk->src->path)) {
+                virReportError(VIR_ERR_OPERATION_UNSUPPORTED,
+                               _("Overlay file '%s' for transient disk '%s' already exists"),
+                               trandisk->src->path, domdisk->dst);
+                return NULL;
+            }
+
+            if (qemuHotplugDiskTransientPrepareOne(vm, cfg, domdisk, trandisk,
+                                                   blockNamedNodeData,
+                                                   asyncJob) < 0)
+                return NULL;
+        } else {
+            /* The disks without transient option will be device_add as well
+             * because to fix the bootindex */
+            if (!(trandisk->src = virStorageSourceCopy(domdisk->src, false)))
+                return NULL;
+        }
+
+        trandisk->bus = domdisk->bus;
+        trandisk->dst = g_strdup(domdisk->dst);
+        trandev->data.disk = trandisk;
+
+        hptctxt->domdisk[hptctxt->ndd] = domdisk;
+
+        hptctxt->ndd++;
+    }
+
+    return g_steal_pointer(&hptctxt);
+}
+
+static int
+qemuHotplugDiskTransientCreate(virDomainObjPtr vm,
+                               qemuHotplugTransientDiskContextPtr hptctxt,
+                               virQEMUDriverConfigPtr cfg,
+                               qemuDomainAsyncJob asyncJob)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    virQEMUDriverPtr driver = priv->driver;
+    const char *alias = NULL;
+    size_t i;
+    int ret;
+
+    for (i = 0; i < hptctxt->ndd; i++) {
+        virDomainDeviceDefPtr trandev = hptctxt->trandev + i;
+        virDomainDiskDefPtr domdisk = hptctxt->domdisk[i];
+        bool transient = domdisk->transient;
+
+        /* transient disk doesn't support disk hotplug. Disable it here temporarily
+         * to remove it */
+        if (transient)
+            domdisk->transient = false;
+
+        /* blockdev-del StorageProgs and FormatProps of domdisk so that
+         * qemuDomainAttachDeviceDiskLiveInternal() can blockdev-add without
+         * write lock issue */
+        if (qemuDomainRemoveDiskDevice(driver, vm, domdisk, asyncJob) < 0)
+            return -1;
+
+        ret = qemuDomainAttachDeviceDiskLiveInternal(driver, vm, trandev, asyncJob);
+        if (!ret) {
+            alias = trandev->data.disk->info.alias;
+            if (transient) {
+                trandev->data.disk->transient = true;
+            }
+        } else {
+            VIR_DEBUG("Failed to attach disk %s (transient: %d) with disk hotplug.",
+                      trandev->data.disk->dst, transient);
+            return -1;
+        }
+
+        if (alias) {
+            virObjectEventPtr event;
+            event = virDomainEventDeviceAddedNewFromObj(vm, alias);
+            virObjectEventStateQueue(driver->domainEventState, event);
+        }
+
+        if (qemuDomainUpdateDeviceList(driver, vm, asyncJob) < 0)
+            return -1;
+    }
+
+    if (virDomainObjSave(vm, driver->xmlopt, cfg->stateDir) < 0)
+        return -1;
+
+    return 0;
+}
+
+int
+qemuHotplugCreateDisksTransient(virDomainObjPtr vm,
+                                qemuDomainAsyncJob asyncJob)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    virQEMUDriverPtr driver = priv->driver;
+    g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    g_autoptr(qemuHotplugTransientDiskContext) hptctxt = NULL;
+    g_autoptr(GHashTable) blockNamedNodeData = NULL;
+
+    if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV) &&
+        virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_PCIE_ROOT_PORT_HOTPLUG)) {
+
+        VIR_DEBUG("prepare transient disks with disk hotplug");
+
+        if (!(blockNamedNodeData = qemuBlockGetNamedNodeData(vm, asyncJob)))
+            return -1;
+
+        if (!(hptctxt = qemuHotplugDiskPrepareDisksTransient(vm, cfg,
+                                                             blockNamedNodeData,
+                                                             asyncJob)))
+            return -1;
+
+        if (qemuHotplugDiskTransientCreate(vm, hptctxt, cfg, asyncJob) < 0)
+            return -1;
+    }
+
+    priv->inhibitDiskTransientDelete = false;
+
+    return 0;
+}
diff --git a/src/qemu/qemu_hotplug.h b/src/qemu/qemu_hotplug.h
index 6287c5b5e8..baef2dba42 100644
--- a/src/qemu/qemu_hotplug.h
+++ b/src/qemu/qemu_hotplug.h
@@ -160,3 +160,6 @@ int qemuHotplugAttachDBusVMState(virQEMUDriverPtr driver,
 int qemuHotplugRemoveDBusVMState(virQEMUDriverPtr driver,
                                  virDomainObjPtr vm,
                                  qemuDomainAsyncJob asyncJob);
+
+int qemuHotplugCreateDisksTransient(virDomainObjPtr vm,
+                                    qemuDomainAsyncJob asyncJob);
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Check whether the transient disks are sharable or not. The transient
disks are sharable if the following are true:

- qemu has the blockdev and hotplug features
- all disk buses support hot-plug

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_command.c | 17 +++++++++++++----
 src/qemu/qemu_domain.h  |  3 +++
 src/qemu/qemu_process.c | 38 +++++++++++++++++++++++++++++++++++++-
 3 files changed, 53 insertions(+), 5 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 81a27703c5..d5958f46ef 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2063,10 +2063,14 @@ qemuBuildDiskCommandLine(virCommandPtr cmd,
                          const virDomainDef *def,
                          virDomainDiskDefPtr disk,
                          virQEMUCapsPtr qemuCaps,
-                         unsigned int bootindex)
+                         unsigned int bootindex,
+                         qemuDomainObjPrivatePtr priv)
 {
     g_autofree char *optstr = NULL;

+    if ((disk->transient) && (priv->TransientDiskSharable))
+        disk->src->readonly = true;
+
     if (qemuBuildDiskSourceCommandLine(cmd, disk, qemuCaps) < 0)
         return -1;

@@ -2084,6 +2088,10 @@ qemuBuildDiskCommandLine(virCommandPtr cmd,
     if (qemuCommandAddExtDevice(cmd, &disk->info) < 0)
         return -1;

+    /* All disks are hot-added later if TransientDiskSharable is true */
+    if (priv->TransientDiskSharable)
+        return 0;
+
     virCommandAddArg(cmd, "-device");

     if (!(optstr = qemuBuildDiskDeviceStr(def, disk, bootindex,
@@ -2098,7 +2106,8 @@ qemuBuildDiskCommandLine(virCommandPtr cmd,
 static int
 qemuBuildDisksCommandLine(virCommandPtr cmd,
                           const virDomainDef *def,
-                          virQEMUCapsPtr qemuCaps)
+                          virQEMUCapsPtr qemuCaps,
+                          qemuDomainObjPrivatePtr priv)
 {
     size_t i;
     unsigned int bootCD = 0;
@@ -2154,7 +2163,7 @@ qemuBuildDisksCommandLine(virCommandPtr cmd,
         if (disk->device == VIR_DOMAIN_DISK_DEVICE_FLOPPY)
             bootindex = 0;

-        if (qemuBuildDiskCommandLine(cmd, def, disk, qemuCaps, bootindex) < 0)
+        if (qemuBuildDiskCommandLine(cmd, def, disk, qemuCaps, bootindex, priv) < 0)
            return -1;
     }

@@ -9935,7 +9944,7 @@ qemuBuildCommandLine(virQEMUDriverPtr driver,
                                           VIR_DOMAIN_CONTROLLER_TYPE_CCID) < 0)
         return NULL;

-    if (qemuBuildDisksCommandLine(cmd, def, qemuCaps) < 0)
+    if (qemuBuildDisksCommandLine(cmd, def, qemuCaps, priv) < 0)
         return NULL;

     if (qemuBuildFilesystemCommandLine(cmd, def, qemuCaps, priv) < 0)
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index 154339ef8f..37b050def4 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -267,6 +267,9 @@ struct _qemuDomainObjPrivate {
     /* prevent deletion of <transient> disk overlay files between startup and
      * succesful setup of the overlays */
     bool inhibitDiskTransientDelete;
+
+    /* True if the all transient disks are sharable */
+    bool TransientDiskSharable;
 };

 #define QEMU_DOMAIN_PRIVATE(vm) \
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f87a3c0f60..2e2d1c6fea 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6892,6 +6892,40 @@ qemuProcessEnablePerf(virDomainObjPtr vm)
     return 0;
 }

+static void
+qemuCheckTransientDiskSharable(virDomainObjPtr vm)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+    bool hotplug = virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_PCIE_ROOT_PORT_HOTPLUG);
+    size_t i;
+
+    priv->TransientDiskSharable = false;
+
+    if (!hotplug)
+        return;
+
+    for (i = 0; i < vm->def->ndisks; i++) {
+        virDomainDiskDefPtr disk = vm->def->disks[i];
+
+        if (disk->transient && disk->bus != VIR_DOMAIN_DISK_BUS_LAST)
+            return;
+    }
+
+    priv->TransientDiskSharable = true;
+}
+
+static int
+qemuProcessCreateDisksTransient(virDomainObjPtr vm,
+                                qemuDomainAsyncJob asyncJob)
+{
+    qemuDomainObjPrivatePtr priv = vm->privateData;
+
+    if (priv->TransientDiskSharable)
+        return qemuHotplugCreateDisksTransient(vm, asyncJob);
+    else
+        return qemuSnapshotCreateDisksTransient(vm, asyncJob);
+}
+

 /**
  * qemuProcessLaunch:
@@ -6982,6 +7016,8 @@ qemuProcessLaunch(virConnectPtr conn,
                                      incoming != NULL) < 0)
         goto cleanup;

+    qemuCheckTransientDiskSharable(vm);
+
     VIR_DEBUG("Building emulator command line");
     if (!(cmd = qemuBuildCommandLine(driver,
                                      qemuDomainLogContextGetManager(logCtxt),
@@ -7228,7 +7264,7 @@ qemuProcessLaunch(virConnectPtr conn,
         goto cleanup;

     VIR_DEBUG("Setting up transient disk");
-    if (qemuSnapshotCreateDisksTransient(vm, asyncJob) < 0)
+    if (qemuProcessCreateDisksTransient(vm, asyncJob) < 0)
        goto cleanup;

     ret = 0;
--
2.27.0

From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>

Add virtio disks as sharable transient disks.

Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
---
 src/qemu/qemu_hotplug.c | 13 ++++++++-----
 src/qemu/qemu_process.c |  2 +-
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 5d0445538d..fc6ca028e3 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -814,7 +814,9 @@ qemuDomainAttachDiskGeneric(virQEMUDriverPtr driver,
 static int
 qemuDomainAttachVirtioDiskDevice(virQEMUDriverPtr driver,
                                  virDomainObjPtr vm,
-                                 virDomainDiskDefPtr disk)
+                                 virDomainDiskDefPtr disk,
+                                 int bootindex,
+                                 qemuDomainAsyncJob asyncJob)
 {
     virDomainDeviceDef dev = { VIR_DOMAIN_DEVICE_DISK, { .disk = disk } };
     bool releaseaddr = false;
@@ -823,8 +825,8 @@ qemuDomainAttachVirtioDiskDevice(virQEMUDriverPtr driver,
     if (qemuDomainEnsureVirtioAddress(&releaseaddr, vm, &dev, disk->dst) < 0)
         return -1;

-    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk, 0,
-                                          QEMU_ASYNC_JOB_NONE)) < 0) {
+    if ((rv = qemuDomainAttachDiskGeneric(driver, vm, disk, bootindex,
+                                          asyncJob)) < 0) {
         if (rv == -1 && releaseaddr)
             qemuDomainReleaseDeviceAddress(vm, &disk->info);

@@ -1033,7 +1035,7 @@ static int
 qemuDomainAttachDeviceDiskLiveInternal(virQEMUDriverPtr driver,
                                        virDomainObjPtr vm,
                                        virDomainDeviceDefPtr dev,
-                                       qemuDomainAsyncJob asyncJob G_GNUC_UNUSED)
+                                       qemuDomainAsyncJob asyncJob)
 {
     size_t i;
     virDomainDiskDefPtr disk = dev->data.disk;
@@ -1080,7 +1082,8 @@ qemuDomainAttachDeviceDiskLiveInternal(virQEMUDriverPtr driver,
         break;

     case VIR_DOMAIN_DISK_BUS_VIRTIO:
-        ret = qemuDomainAttachVirtioDiskDevice(driver, vm, disk);
+        ret = qemuDomainAttachVirtioDiskDevice(driver, vm, disk,
+                                               disk->info.bootIndex, asyncJob);
         break;

     case VIR_DOMAIN_DISK_BUS_SCSI:
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 2e2d1c6fea..55577632c5 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -6907,7 +6907,7 @@ qemuCheckTransientDiskSharable(virDomainObjPtr vm)
     for (i = 0; i < vm->def->ndisks; i++) {
         virDomainDiskDefPtr disk = vm->def->disks[i];

-        if (disk->transient && disk->bus != VIR_DOMAIN_DISK_BUS_LAST)
+        if (disk->transient && disk->bus != VIR_DOMAIN_DISK_BUS_VIRTIO)
            return;
     }

--
2.27.0

On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
This patch series is an RFC implementation to make the <transient/> disk sharable.
Currently, the <transient/> disk option for qemu uses the blockdev-snapshot QMP command with an overlay. In that case, qemu holds the write lock on the original disk, so the original disk cannot be shared with other qemu guests.
This patch series tries to make the disks which have the <transient/> disk option sharable by using disk hot-plug.
First, create the overlay disk with the original disk set as its backingStore. Then, blockdev-del the StorageProps and FormatProps of the disk; this is done in order to fix up the bootindex of the disk. Lastly, device_add the disks and start the CPUs.
So I had a look at the proposed implementation and it's too ugly and complicated.

Firstly, one of the overcomplicated bits is the part which adds the disk backend, then removes it and then adds it back. That's too much.

Secondly, you are hotplugging ALL virtio disks even those which are not transient, that won't work either. This must stay limited to <transient> disks to minimize the impact.

Thirdly, the code is also overcomplicated by the fact that you work around the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually support hotplug of transient disks. You then prepare everything somewhere behind its back and try to hotplug the disk as not-transient.

It would be way better if you support hotplug of transient disks in the first place, reusing the hotplug of the backend and adding image creation for the frontend.

I also don't understand the issue with bootindex. If the problem is that bootindex is calculated in qemuBuildDisksCommandLine and used directly then you'll need to refactor the code in the first place, pre-calculate the indices and store them in the disk private data and then use them from there, but you must not modify disk->info.bootIndex as that gets represented back to the user when dumping XML.

The new code also uses the name of the VM as a suffix for the temporary image, but doesn't update the previous code which is used for disks which don't support hotplug. That needs to be fixed too.

The fact that we are doing hotplug of the disk should also stay hidden, so the code doing the hotplug should not actually emit the device-added event and the whole thing should not need to update the device list at all, but should be placed before the regular update of the device list.

In the end that means you shouldn't need the 'TransientDiskSharable' private data variable and shouldn't do any modification of disk->src->readonly and such.

Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.

Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.

Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.

On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
This patch series is an RFC implementation to make the <transient/> disk sharable.
Currently, the <transient/> disk option for qemu uses the blockdev-snapshot QMP command with an overlay. In that case, qemu holds the write lock on the original disk, so the original disk cannot be shared with other qemu guests.
This patch series tries to make the disks which have the <transient/> disk option sharable by using disk hot-plug.
First, create the overlay disk with the original disk set as its backingStore. Then, blockdev-del the StorageProps and FormatProps of the disk; this is done in order to fix up the bootindex of the disk. Lastly, device_add the disks and start the CPUs.
So I had a look at the proposed implementation and it's too ugly and complicated.
Firstly one of the overcomplicated bits is the part which adds the disk backend, then removes it and then adds it back. That's too much.
Secondly you are hotplugging ALL virtio disks even those which are not transient, that won't work either. This must stay limited to <transient> disk to minimize the impact.
Thirdly, the code is also overcomplicated by the fact that you work around the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually support hotplug of transient disks. You then prepare everything somewhere behind its back and try to hotplug the disk as not-transient.
It would be way better if you support hotplug of transient disks in the first place, reusing the hotplug of the backend and adding image creation for the frontend.
I also don't understand the issue with bootindex. If the problem is that bootindex is calculated in qemuBuildDisksCommandLine and used directly then you'll need to refactor the code in the first place, pre-calculate the indices and store them in the disk private data and then use them from there, but you must not modify disk->info.bootIndex as that gets represented back to the user when dumping XML.
The new code also uses the name of the VM as a suffix for the temporary image, but doesn't update the previous code which is used for disks which don't support hotplug. That needs to be fixed too.
The fact that we are doing hotplug of the disk should also stay hidden, so the code doing the hotplug should not actually emit the device-added event and the whole thing should not need to update the device list at all, but should be placed before the regular update of the device list.
In the end that means you shouldn't need the 'TransientDiskSharable' private data variable and shouldn't do any modification of disk->src->readonly and such.
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ?

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote:
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
[...]
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ?
The problem is that the qemu device frontend code infers the readonly state from the backend image specification when it's instantiated.

Thus if you start qemu with the topmost layer marked as read-only so that the write lock doesn't get asserted, this fact gets stored in the frontend. If you then later insert a writable layer into the chain on top of it, the frontend stays read-only.

IDE/SATA disks are outright refused to be read-only by qemu, virtio disks remember that they are read-only and keep it into guest execution, and SCSI disks caused an abort() (which was recently fixed in qemu upstream).

Thus the only two options are:

1) hotplug the frontend after we do the shenanigans with the image

2) implement explicit control of the read-only state of the frontend in qemu in the first place, adapt to it in libvirt and then keep using the current approach in libvirt

On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote:
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
[...]
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ?
The problem is that the qemu device frontend code infers the readonly state from the backend image specification when it's instantiated.
Thus if you start qemu with the topmost layer marked as read-only so that the write lock doesn't get asserted, this fact gets stored in the frontend. If you then later insert a writable layer into the chain on top of it, the frontend stays read-only.
IDE/SATA disks are outright refused to be read-only by qemu, virtio disks remember that they are read-only and keep it into guest execution, and SCSI disks caused an abort() (which was recently fixed in qemu upstream).
Thus the only two options are:
1) hotplug the frontend after we do the shenanigans with the image 2) implement explicit control of the read-only state of the frontend in qemu in the first place, adapt to it in libvirt and then keep using the current approach in libvirt
I must be missing something here, because the topmost layer with <transient> isn't read-only - it would be writable as that's the whole point of the throwaway transient overlay file. It just feels to me that the situation desired is already one that is supported and working when explicitly top+base are specified in the XML.

Consider if we have

  <disk>
    <source file="top.qcow2"/>
    <backingStore>
      <source file="base.qcow2"/>
    </backingStore>
  </disk>

IIUC, base.qcow2 will be readonly, and top.qcow2 will be writable by default.

I'm suggesting that

  <disk>
    <source file="base.qcow2"/>
    <transient/>
    <readonly/>
  </disk>

should be treated as identical to the first. During domain startup preparation, we would effectively transform the latter into the former, with a random tempfile for top.qcow2. This shouldn't add any real complexity to our existing code and not involve tricks with hotplug.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote:
On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote:
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
[...]
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ?
The problem is that the qemu device frontend code infers the readonly state from the backend image specification when it's instantiated.
Thus if you start qemu with the topmost layer marked as read-only so that the write lock doesn't get asserted, this fact gets stored in the frontend. If you then later insert a writable layer into the chain on top of it, the frontend stays read-only.
IDE/SATA disks are outright refused to be read-only by qemu, virtio disks remember that they are read-only and keep it into guest execution, and SCSI disks caused an abort() (which was recently fixed in qemu upstream).
Thus the only two options are:
1) hotplug the frontend after we do the shenanigans with the image 2) implement explicit control of the read-only state of the frontend in qemu in the first place, adapt to it in libvirt and then keep using the current approach in libvirt
I must be missing something here, because the topmost layer with <transient> isn't read-only - it would be writable as that's the whole point of the throwaway transient overlay file. It just feels to me that the situation desired is already one that is supported and working when explicitly top+base are specified in the XML.
[...]
should be treated as identical to the first. During domain startup preparation, we would effectively transform the latter into the former, with a random tempfile for top.qcow2. This shouldn't add any real complexity to our existing code and not involve tricks with hotplug.
Your assumptions are right, but the problem is how it's done:

The temporary file is created with 'blockdev-create' during the initialization of the VM. That means that the chain starts with all the existing layers, and we then add the throwaway layer before starting execution of the guest once qemu is already started.

Unfortunately that's after the frontend is initialized (and thus remembers the readonly state) and after the storage is already open (thus it already asserted - or wanted to assert - the write lock).

'blockdev-create' is used to prevent another invocation of qemu-img.

On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote:
On Mon, Jan 25, 2021 at 10:51:09AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 09:33:31 +0000, Daniel Berrange wrote:
On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
[...]
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
Is there a reason we can't use the existing <readonly/> element as a way to make the underlying base image be treated as read only and thus become sharable ?
The problem is that the qemu device frontend code infers the readonly state from the backend image specification when it's instantiated.
Thus if you start qemu with the topmost layer marked as read-only so that the write lock doesn't get asserted, this fact gets stored in the frontend. If you then later insert a writable layer into the chain on top of it, the frontend stays read-only.
IDE/SATA disks are outright refused to be read-only by qemu, virtio disks remember that they are read-only and keep it into guest execution, and SCSI disks caused an abort() (which was recently fixed in qemu upstream).
Thus the only two options are:
1) hotplug the frontend after we do the shenanigans with the image 2) implement explicit control of the read-only state of the frontend in qemu in the first place, adapt to it in libvirt and then keep using the current approach in libvirt
I must be missing something here, because the topmost layer with <transient> isn't read-only - it would be writable as that's the whole point of the throwaway transient overlay file. It just feels to me that the situation desired is already one that is supported and working when explicitly top+base are specified in the XML.
[...]
should be treated as identical to the first. During domain startup preparation, we would effectively transform the latter into the former, with a random tempfile for top.qcow2. This shouldn't add any real complexity to our existing code and not involve tricks with hotplug.
Your assumptions are right, but the problem is how it's done:
The temporary file is created with 'blockdev-create' during the initialization of the VM. That means that the chain starts with all the existing layers, and we then add the throwaway layer before starting execution of the guest once qemu is already started.
Unfortunately that's after the frontend is initialized (and thus remembers the readonly state) and after the storage is already open (thus it already asserted-or wanted to assert the write lock).
'blockdev-create' is used to prevent another invocation of qemu-img
Wouldn't it make life simpler to just use qemu-img and avoid these problems in the first place ?

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, Jan 25, 2021 at 10:13:51 +0000, Daniel Berrange wrote:
On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote:
[...]
Unfortunately that's after the frontend is initialized (and thus remembers the readonly state) and after the storage is already open (thus it already asserted-or wanted to assert the write lock).
'blockdev-create' is used to prevent another invocation of qemu-img
Wouldn't it make life simpler to just use qemu-img and avoid these problems in the first place ?
The use of 'qemu-img' in the qemu driver code is limited to inactive snapshots and I'd really like us to keep it that way.

I'm more willing to declare that sharing of the original image with a <transient/> disk is unsupported rather than cave in using qemu-img. The user can always provide their own overlay and discard it appropriately.
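For illustration, a user-provided overlay of that kind might be configured roughly like this in the domain XML (file names are hypothetical): the overlay is an ordinary disk whose backing store is the shared base image, and the user deletes the overlay file themselves once the guest is shut down.

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/guest1-overlay.qcow2'/>
    <backingStore type='file'>
      <format type='qcow2'/>
      <source file='/var/lib/libvirt/images/base.qcow2'/>
    </backingStore>
    <target dev='vda' bus='virtio'/>
  </disk>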

On Mon, Jan 25, 2021 at 11:28:00 +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 10:13:51 +0000, Daniel Berrange wrote:
On Mon, Jan 25, 2021 at 11:11:43AM +0100, Peter Krempa wrote:
On Mon, Jan 25, 2021 at 10:04:41 +0000, Daniel Berrange wrote:
[...]
Unfortunately that's after the frontend is initialized (and thus remembers the readonly state) and after the storage is already open (thus it already asserted-or wanted to assert the write lock).
'blockdev-create' is used to prevent another invocation of qemu-img
Wouldn't it make life simpler to just use qemu-img and avoid these problems in the first place ?
The use of 'qemu-img' in the qemu driver code is limited to inactive snapshots and I'd really like us to keep it that way.
I'm more willing to declare that sharing of the original image with a <transient/> disk is unsupported rather than cave in using qemu-img.
To be more specific, qemu-img uses a really crusty old commandline. Trying to adapt any blockdev code to it (such as encrypted qcow2 and others) isn't something I see a point in doing and maintaining an interlocking matrix of supported things.

Similarly, the idea for supporting offline blockjobs was always to use either qemu -M none or nowadays qemu-storage-daemon, rather than trying to probe and adapt to the quirks of qemu-img.

On Sat, Jan 23, 2021 at 11:54:51AM +0100, Peter Krempa wrote:
On Fri, Jan 22, 2021 at 20:10:59 -0500, Masayoshi Mizuma wrote:
From: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
This patch series is an RFC implementation to make the <transient/> disk sharable.
Currently, the <transient/> disk option for qemu uses the blockdev-snapshot QMP command with an overlay. In that case, qemu holds the write lock on the original disk, so the original disk cannot be shared with other qemu guests.
This patch series tries to make the disks which have the <transient/> disk option sharable by using disk hot-plug.
First, create the overlay disk with the original disk set as its backingStore. Then, blockdev-del the StorageProps and FormatProps of the disk; this is done in order to fix up the bootindex of the disk. Lastly, device_add the disks and start the CPUs.
Thank you for your review!
So I had a look at the proposed implementation and it's too ugly and complicated.
Firstly one of the overcomplicated bits is the part which adds the disk backend, then removes it and then adds it back. That's too much.
Secondly you are hotplugging ALL virtio disks even those which are not transient, that won't work either. This must stay limited to <transient> disk to minimize the impact.
I agree that it's overkill to hotplug all disks... I'm hotplugging all virtio disks because I ran into a boot order issue.

Case: the domain XML has two virtio disks; one has <transient/> and boot order 1, the other doesn't have <transient/> and has boot order 2. In that case, qemu tried to boot from the latter disk after the former disk was hot-added, even though its order is 2. The boot issue doesn't happen if the latter disk is also hot-added.
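For illustration, the scenario described above corresponds to a configuration along these lines (paths and target names are made up):

  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/os.qcow2'/>
    <target dev='vda' bus='virtio'/>
    <boot order='1'/>
    <transient/>
  </disk>
  <disk type='file' device='disk'>
    <driver name='qemu' type='qcow2'/>
    <source file='/var/lib/libvirt/images/data.qcow2'/>
    <target dev='vdb' bus='virtio'/>
    <boot order='2'/>
  </disk>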
Thirdly, the code is also overcomplicated by the fact that you work around the fact that qemuDomainAttachDeviceDiskLiveInternal doesn't actually support hotplug of transient disks. You then prepare everything somewhere behind its back and try to hotplug the disk as not-transient.
It would be way better if you support hotplug of transient disks in the first place, reusing the hotplug of the backend and adding image creation for the frontend.
Make sense, thanks.
I also don't understand the issue with bootindex. If the problem is that bootindex is calculated in qemuBuildDisksCommandLine and used directly then you'll need to refactor the code in the first place, pre-calculate the indices and store them in the disk private data and then use them from there, but you must not modify disk->info.bootIndex as that gets represented back to the user when dumping XML.
The new code also uses the name of the VM as a suffix for the temporary image, but doesn't update the previous code which is used for disks which don't support hotplug. That needs to be fixed too.
Got it.
The fact that we are doing hotplug of the disk should also stay hidden, so the code doing the hotplug should not actually emit the device-added event and the whole thing should not need to update the device list at all, but should be placed before the regular update of the device list.
In the end that means you shouldn't need the 'TransientDiskSharable' private data variable and shouldn't do any modification of disk->src->readonly and such.
OK.
Please note that this is a high level review. I've spotted some coding style inconsistencies and such, but since this will need significant rework I'll not point them out.
Also don't forget to add test cases, where it will be visible that the disk (neither frontend nor backend) is added on the commandline.
Also it should be fairly trivial to support it for SCSI disks since they support hotplug too.
I suppose the patches will be split into the following parts...

1. Hot-add support for <transient/> disks
2. Some changes to make <transient/> disks sharable
3. virtio- and scsi-specific changes to support sharable <transient/>
4. Testing

Thanks!
Masa
participants (3)
- Daniel P. Berrangé
- Masayoshi Mizuma
- Peter Krempa