[PATCH RFC 0/9] qemu: Support mapped-ram migration capability

This series is an RFC for supporting QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is

1. Add mapped-ram migration capability
2. Steal an element from save header 'unused' for a 'features' variable
   and bump save version to 3.
3. Add /etc/libvirt/qemu.conf knob for the save format version,
   defaulting to latest v3
4. Use v3 (aka mapped-ram) by default
5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2
6. include: Define constants for parallel save/restore
7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2
8. qemu: Add support for parallel restore. Implies mapped-ram.
   Reject if v2
9. tools: add parallel parameter to virsh save command
10. tools: add parallel parameter to virsh restore command

This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.

The QEMU mapped-ram capability currently does not support directio. Fabiano is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?

For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions are very welcome.

Note that the logical file size of mapped-ram saved images is slightly larger than the guest RAM size, so the files are often much larger than those produced by the existing, sequential format. However, the number of blocks actually written to disk is often lower with mapped-ram. E.g. a saved image from a 30G, freshly booted, idle guest results in the following 'Size' and 'Blocks' values reported by stat(1):

                 Size         Blocks
  sequential     998595770    1950392
  mapped-ram     34368584225  1800456

With the same guest running a workload that dirties memory:

                 Size         Blocks
  sequential     33173330615  64791672
  mapped-ram     34368578210  64706944

Thanks for any comments on this RFC!

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/mapp...
[2] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/K4BD...
[3] https://mail.gnu.org/archive/html/qemu-devel/2024-05/msg04432.html

Jim Fehlig (9):
  qemu: Enable mapped-ram migration capability
  qemu_fd: Add function to retrieve fdset ID
  qemu: Add function to get migration params for save
  qemu: Add a 'features' element to save image header and bump version
  qemu: conf: Add setting for save image version
  qemu: Add support for mapped-ram on save
  qemu: Enable mapped-ram on restore
  qemu: Support O_DIRECT with mapped-ram on save
  qemu: Support O_DIRECT with mapped-ram on restore

 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |   6 +
 src/qemu/qemu_conf.c               |   8 ++
 src/qemu/qemu_conf.h               |   1 +
 src/qemu/qemu_driver.c             |  25 ++--
 src/qemu/qemu_fd.c                 |  18 +++
 src/qemu/qemu_fd.h                 |   3 +
 src/qemu/qemu_migration.c          |  99 ++++++++++++++-
 src/qemu/qemu_migration.h          |  11 +-
 src/qemu/qemu_migration_params.c   |  20 +++
 src/qemu/qemu_migration_params.h   |   4 +
 src/qemu/qemu_monitor.c            |  40 ++++++
 src/qemu/qemu_monitor.h            |   5 +
 src/qemu/qemu_process.c            |  63 +++++++---
 src/qemu/qemu_process.h            |  16 ++-
 src/qemu/qemu_saveimage.c          | 187 +++++++++++++++++++++++------
 src/qemu/qemu_saveimage.h          |  20 ++-
 src/qemu/qemu_snapshot.c           |  12 +-
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 19 files changed, 455 insertions(+), 85 deletions(-)

-- 
2.44.0
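
A small aside on the seekability constraint mentioned in the cover letter, in case it helps the compression discussion. This is a minimal sketch, not part of the series: mapped-ram places each RAM page at a fixed offset in the image, so the destination fd must accept writes at arbitrary offsets, which a pipe to a compression helper does not.

  #include <sys/types.h>
  #include <unistd.h>

  /* Illustration only: a pipe is not seekable, lseek() fails (ESPIPE),
   * so the existing pipe-based compression cannot back a mapped-ram
   * destination. */
  static int
  fd_is_seekable(int fd)
  {
      return lseek(fd, 0, SEEK_CUR) != (off_t)-1;
  }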

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
---
 src/qemu/qemu_migration_params.c | 1 +
 src/qemu/qemu_migration_params.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/src/qemu/qemu_migration_params.c b/src/qemu/qemu_migration_params.c
index 48f8657f71..201286e58c 100644
--- a/src/qemu/qemu_migration_params.c
+++ b/src/qemu/qemu_migration_params.c
@@ -105,6 +105,7 @@ VIR_ENUM_IMPL(qemuMigrationCapability,
               "return-path",
               "zero-copy-send",
               "postcopy-preempt",
+              "mapped-ram",
 );
 
 
diff --git a/src/qemu/qemu_migration_params.h b/src/qemu/qemu_migration_params.h
index 91bc6792cd..a14b189b48 100644
--- a/src/qemu/qemu_migration_params.h
+++ b/src/qemu/qemu_migration_params.h
@@ -41,6 +41,7 @@ typedef enum {
     QEMU_MIGRATION_CAP_RETURN_PATH,
     QEMU_MIGRATION_CAP_ZERO_COPY_SEND,
     QEMU_MIGRATION_CAP_POSTCOPY_PREEMPT,
+    QEMU_MIGRATION_CAP_MAPPED_RAM,
 
     QEMU_MIGRATION_CAP_LAST
 } qemuMigrationCapability;
-- 
2.44.0

Add new function qemuFDPassGetId(), to be used when adding support for
mapped-ram save format.

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
---
 src/qemu/qemu_fd.c | 18 ++++++++++++++++++
 src/qemu/qemu_fd.h |  3 +++
 2 files changed, 21 insertions(+)

diff --git a/src/qemu/qemu_fd.c b/src/qemu/qemu_fd.c
index e847056573..3ae4a87a20 100644
--- a/src/qemu/qemu_fd.c
+++ b/src/qemu/qemu_fd.c
@@ -276,6 +276,24 @@ qemuFDPassGetPath(qemuFDPass *fdpass)
 }
 
 
+/**
+ * qemuFDPassGetId:
+ * @fdpass: The fd passing helper struct
+ * @id: An out parameter for providing the fdset ID
+ *
+ * Returns 0 on success, -1 on error.
+ */
+int
+qemuFDPassGetId(qemuFDPass *fdpass, unsigned int *id)
+{
+    if (!fdpass)
+        return -1;
+
+    *id = fdpass->fdSetID;
+    return 0;
+}
+
+
 struct _qemuFDPassDirect {
     int fd;
     char *name;
diff --git a/src/qemu/qemu_fd.h b/src/qemu/qemu_fd.h
index cd0ff2c690..dfcd5fb754 100644
--- a/src/qemu/qemu_fd.h
+++ b/src/qemu/qemu_fd.h
@@ -58,6 +58,9 @@ qemuFDPassTransferMonitorRollback(qemuFDPass *fdpass,
 const char *
 qemuFDPassGetPath(qemuFDPass *fdpass);
 
+int
+qemuFDPassGetId(qemuFDPass *fdpass, unsigned int *id);
+
 typedef struct _qemuFDPassDirect qemuFDPassDirect;
-- 
2.44.0
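
For context, the new helper is consumed later in the series when building a 'file:' migration URI from the fdset ID. A fragment sketching that usage (fdPassMigrate and offset come from the surrounding code in the mapped-ram save patch):

    unsigned int setId;
    g_autofree char *uri = NULL;

    /* fdPassMigrate was created with qemuFDPassNew("migrate", priv) and
     * the destination fd added with qemuFDPassAddFD() */
    if (qemuFDPassGetId(fdPassMigrate, &setId) < 0)
        return -1;

    uri = g_strdup_printf("file:/dev/fdset/%u,offset=%#lx", setId, offset);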

Introduce qemuMigrationParamsForMappedSave() to create a
qemuMigrationParams object initialized with appropriate migration
capabilities and parameters for a save operation using mapped-ram.

Note that mapped-ram also requires the multifd capability. For now, the
number of multifd channels is set to 1. Future work to support parallel
save/restore can set the number of channels specified by the user.

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
---
 src/qemu/qemu_migration_params.c | 19 +++++++++++++++++++
 src/qemu/qemu_migration_params.h |  3 +++
 2 files changed, 22 insertions(+)

diff --git a/src/qemu/qemu_migration_params.c b/src/qemu/qemu_migration_params.c
index 201286e58c..96698bde9f 100644
--- a/src/qemu/qemu_migration_params.c
+++ b/src/qemu/qemu_migration_params.c
@@ -781,6 +781,25 @@ qemuMigrationParamsFromFlags(virTypedParameterPtr params,
 }
 
 
+qemuMigrationParams *
+qemuMigrationParamsForMappedSave(void)
+{
+    g_autoptr(qemuMigrationParams) saveParams = NULL;
+
+    if (!(saveParams = qemuMigrationParamsNew()))
+        return NULL;
+
+    if (virBitmapSetBit(saveParams->caps, QEMU_MIGRATION_CAP_MAPPED_RAM) < 0)
+        return NULL;
+    if (virBitmapSetBit(saveParams->caps, QEMU_MIGRATION_CAP_MULTIFD) < 0)
+        return NULL;
+    saveParams->params[QEMU_MIGRATION_PARAM_MULTIFD_CHANNELS].value.i = 1;
+    saveParams->params[QEMU_MIGRATION_PARAM_MULTIFD_CHANNELS].set = true;
+
+    return g_steal_pointer(&saveParams);
+}
+
+
 int
 qemuMigrationParamsDump(qemuMigrationParams *migParams,
                         virTypedParameterPtr *params,
diff --git a/src/qemu/qemu_migration_params.h b/src/qemu/qemu_migration_params.h
index a14b189b48..0f3ed07384 100644
--- a/src/qemu/qemu_migration_params.h
+++ b/src/qemu/qemu_migration_params.h
@@ -85,6 +85,9 @@ qemuMigrationParamsFromFlags(virTypedParameterPtr params,
                              unsigned int flags,
                              qemuMigrationParty party);
 
+qemuMigrationParams *
+qemuMigrationParamsForMappedSave(void);
+
 int
 qemuMigrationParamsDump(qemuMigrationParams *migParams,
                         virTypedParameterPtr *params,
-- 
2.44.0
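
A sketch of how the save path consumes this helper later in the series (error handling trimmed; see the mapped-ram save patch for the full version):

    g_autoptr(qemuMigrationParams) saveParams = NULL;

    if (!(saveParams = qemuMigrationParamsForMappedSave()))
        return -1;

    /* push the mapped-ram and multifd capabilities/parameters to QEMU
     * before issuing the migrate command */
    if (qemuMigrationParamsApply(vm, asyncJob, saveParams, 0) < 0)
        return -1;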

QEMU's new mapped-ram stream format [1] is incompatible with the
existing sequential format. In order to support the new format in
libvirt, a new 'features' element is added to the saved image header.
This element can now be used to indicate use of the mapped-ram feature,
and provides a mechanism to support future save image features.

An older libvirt that is unaware of mapped-ram must not attempt to open
and restore a mapped-ram saved image. ATM, this can only be done by
bumping the save image version.

[1] https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/mapp...

Signed-off-by: Jim Fehlig <jfehlig@suse.com>
---
 src/qemu/qemu_saveimage.c | 7 +++++++
 src/qemu/qemu_saveimage.h | 9 +++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c
index 018ab5a222..50fec33f54 100644
--- a/src/qemu/qemu_saveimage.c
+++ b/src/qemu/qemu_saveimage.c
@@ -74,6 +74,7 @@ qemuSaveImageBswapHeader(virQEMUSaveHeader *hdr)
     hdr->was_running = GUINT32_SWAP_LE_BE(hdr->was_running);
     hdr->compressed = GUINT32_SWAP_LE_BE(hdr->compressed);
     hdr->cookieOffset = GUINT32_SWAP_LE_BE(hdr->cookieOffset);
+    hdr->features = GUINT32_SWAP_LE_BE(hdr->features);
 }
 
 
@@ -637,6 +638,12 @@ qemuSaveImageOpen(virQEMUDriver *driver,
         return -1;
     }
 
+    if (header->features && header->features != QEMU_SAVE_FEATURE_MAPPED_RAM) {
+        virReportError(VIR_ERR_OPERATION_FAILED, "%s",
+                       _("image contains unsupported features)"));
+        return -1;
+    }
+
     if (header->cookieOffset)
         xml_len = header->cookieOffset;
     else
diff --git a/src/qemu/qemu_saveimage.h b/src/qemu/qemu_saveimage.h
index e541792153..9dd7de292d 100644
--- a/src/qemu/qemu_saveimage.h
+++ b/src/qemu/qemu_saveimage.h
@@ -28,10 +28,14 @@
  */
 #define QEMU_SAVE_MAGIC "LibvirtQemudSave"
 #define QEMU_SAVE_PARTIAL "LibvirtQemudPart"
-#define QEMU_SAVE_VERSION 2
+#define QEMU_SAVE_VERSION 3
 
 G_STATIC_ASSERT(sizeof(QEMU_SAVE_MAGIC) == sizeof(QEMU_SAVE_PARTIAL));
 
+typedef enum {
+    QEMU_SAVE_FEATURE_MAPPED_RAM = 1 << 0,
+} qemuSaveFeatures;
+
 typedef struct _virQEMUSaveHeader virQEMUSaveHeader;
 struct _virQEMUSaveHeader {
     char magic[sizeof(QEMU_SAVE_MAGIC)-1];
@@ -40,7 +44,8 @@ struct _virQEMUSaveHeader {
     uint32_t was_running;
     uint32_t compressed;
     uint32_t cookieOffset;
-    uint32_t unused[14];
+    uint32_t features;
+    uint32_t unused[13];
 };
-- 
2.44.0
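
For reference, the version bump is what keeps an older libvirt out: qemuSaveImageOpen() already rejects images whose version is newer than its own QEMU_SAVE_VERSION, roughly like the following (paraphrased from the existing check, not part of this patch):

    if (header->version > QEMU_SAVE_VERSION) {
        virReportError(VIR_ERR_OPERATION_FAILED,
                       _("image version is not supported (%1$d > %2$d)"),
                       header->version, QEMU_SAVE_VERSION);
        return -1;
    }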

Add a 'save_image_version' setting to qemu.conf to control the image version when saving a VM with 'virsh save' or 'virsh managedsave'. Default to the new version 3. Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 6 ++++++ src/qemu/qemu_conf.c | 8 ++++++++ src/qemu/qemu_conf.h | 1 + src/qemu/qemu_driver.c | 3 +-- src/qemu/qemu_saveimage.c | 19 ++++++++++++++----- src/qemu/qemu_saveimage.h | 6 +++--- src/qemu/qemu_snapshot.c | 4 ++-- src/qemu/test_libvirtd_qemu.aug.in | 1 + 9 files changed, 37 insertions(+), 12 deletions(-) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 2b6526538f..acbfcd9fc3 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -92,6 +92,7 @@ module Libvirtd_qemu = | str_array_entry "namespaces" let save_entry = str_entry "save_image_format" + | int_entry "save_image_version" | str_entry "dump_image_format" | str_entry "snapshot_image_format" | str_entry "auto_dump_path" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 6bc2140dcb..b5df8c1cc6 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -590,6 +590,11 @@ # at scheduled saving, and it is an error if the specified save_image_format # is not valid, or the requested compression program can't be found. # +# save_image_version is applicable when using 'virsh save' or 'virsh managed-save'. +# Currently only versions 2 and 3 are supported, with version 3 being the default. +# If saved images must be compatible with an older libvirt without this setting, +# then set save_image_format_version to 2. +# # dump_image_format is used when you use 'virsh dump' at emergency # crashdump, and if the specified dump_image_format is not valid, or # the requested compression program can't be found, this falls @@ -601,6 +606,7 @@ # or the requested compression program can't be found. # #save_image_format = "raw" +#save_image_version = 3 #dump_image_format = "raw" #snapshot_image_format = "raw" diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 4050a82341..3ed17165b8 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -66,6 +66,10 @@ VIR_LOG_INIT("qemu.qemu_conf"); #define QEMU_MIGRATION_PORT_MIN 49152 #define QEMU_MIGRATION_PORT_MAX 49215 +/* Need to reconsile definition here and in qemu_saveimage.h */ +#define QEMU_SAVE_VERSION 3 + + VIR_ENUM_IMPL(virQEMUSchedCore, QEMU_SCHED_CORE_LAST, "none", @@ -246,6 +250,8 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->migrationPortMin = QEMU_MIGRATION_PORT_MIN; cfg->migrationPortMax = QEMU_MIGRATION_PORT_MAX; + cfg->saveImageVersion = QEMU_SAVE_VERSION; + /* For privileged driver, try and find hugetlbfs mounts automatically. * Non-privileged driver requires admin to create a dir for the * user, chown it, and then let user configure it manually. 
*/ @@ -610,6 +616,8 @@ virQEMUDriverConfigLoadSaveEntry(virQEMUDriverConfig *cfg, { if (virConfGetValueString(conf, "save_image_format", &cfg->saveImageFormat) < 0) return -1; + if (virConfGetValueUInt(conf, "save_image_version", &cfg->saveImageVersion) < 0) + return -1; if (virConfGetValueString(conf, "dump_image_format", &cfg->dumpImageFormat) < 0) return -1; if (virConfGetValueString(conf, "snapshot_image_format", &cfg->snapshotImageFormat) < 0) diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 36049b4bfa..30f5c2151e 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -193,6 +193,7 @@ struct _virQEMUDriverConfig { bool securityRequireConfined; char *saveImageFormat; + unsigned int saveImageVersion; char *dumpImageFormat; char *snapshotImageFormat; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index e2698c7924..f9761242d2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -2696,8 +2696,7 @@ qemuDomainSaveInternal(virQEMUDriver *driver, if (!(cookie = qemuDomainSaveCookieNew(vm))) goto endjob; - if (!(data = virQEMUSaveDataNew(xml, cookie, was_running, compressed, - driver->xmlopt))) + if (!(data = virQEMUSaveDataNew(driver, xml, cookie, was_running, compressed))) goto endjob; xml = NULL; diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 50fec33f54..30085dc7bc 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -95,25 +95,34 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUSaveData, virQEMUSaveDataFree); * This function steals @domXML on success. */ virQEMUSaveData * -virQEMUSaveDataNew(char *domXML, +virQEMUSaveDataNew(virQEMUDriver *driver, + char *domXML, qemuDomainSaveCookie *cookieObj, bool running, - int compressed, - virDomainXMLOption *xmlopt) + int compressed) { virQEMUSaveData *data = NULL; virQEMUSaveHeader *header; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); data = g_new0(virQEMUSaveData, 1); if (cookieObj && !(data->cookie = virSaveCookieFormat((virObject *) cookieObj, - virDomainXMLOptionGetSaveCookie(xmlopt)))) + virDomainXMLOptionGetSaveCookie(driver->xmlopt)))) goto error; header = &data->header; memcpy(header->magic, QEMU_SAVE_PARTIAL, sizeof(header->magic)); - header->version = QEMU_SAVE_VERSION; + + if (cfg->saveImageVersion < 2 || cfg->saveImageVersion > QEMU_SAVE_VERSION) { + virReportError(VIR_ERR_OPERATION_FAILED, + _("Invalid save image version %1$u"), + cfg->saveImageVersion); + goto error; + } + header->version = cfg->saveImageVersion; + header->was_running = running ? 
1 : 0; header->compressed = compressed; diff --git a/src/qemu/qemu_saveimage.h b/src/qemu/qemu_saveimage.h index 9dd7de292d..63ad5508ed 100644 --- a/src/qemu/qemu_saveimage.h +++ b/src/qemu/qemu_saveimage.h @@ -123,11 +123,11 @@ virQEMUSaveDataWrite(virQEMUSaveData *data, const char *path); virQEMUSaveData * -virQEMUSaveDataNew(char *domXML, +virQEMUSaveDataNew(virQEMUDriver *driver, + char *domXML, qemuDomainSaveCookie *cookieObj, bool running, - int compressed, - virDomainXMLOption *xmlopt); + int compressed); void virQEMUSaveDataFree(virQEMUSaveData *data); diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c index f5260c4a22..1d75208814 100644 --- a/src/qemu/qemu_snapshot.c +++ b/src/qemu/qemu_snapshot.c @@ -1390,9 +1390,9 @@ qemuSnapshotCreateActiveExternal(virQEMUDriver *driver, !(snapdef->cookie = (virObject *) qemuDomainSaveCookieNew(vm))) goto cleanup; - if (!(data = virQEMUSaveDataNew(xml, + if (!(data = virQEMUSaveDataNew(driver, xml, (qemuDomainSaveCookie *) snapdef->cookie, - resume, compressed, driver->xmlopt))) + resume, compressed))) goto cleanup; xml = NULL; diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index b97e6de11e..8bcb332020 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -70,6 +70,7 @@ module Test_libvirtd_qemu = { "8" = "/dev/userfaultfd" } } { "save_image_format" = "raw" } +{ "save_image_version" = "3" } { "dump_image_format" = "raw" } { "snapshot_image_format" = "raw" } { "auto_dump_path" = "/var/lib/libvirt/qemu/dump" } -- 2.44.0

Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_migration.c | 79 ++++++++++++++++++++++++++++ src/qemu/qemu_migration.h | 7 +++ src/qemu/qemu_monitor.c | 32 ++++++++++++ src/qemu/qemu_monitor.h | 4 ++ src/qemu/qemu_saveimage.c | 105 ++++++++++++++++++++++++++++++-------- src/qemu/qemu_saveimage.h | 1 + src/qemu/qemu_snapshot.c | 2 +- 8 files changed, 208 insertions(+), 24 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index f9761242d2..34f37210d9 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -2696,7 +2696,7 @@ qemuDomainSaveInternal(virQEMUDriver *driver, if (!(cookie = qemuDomainSaveCookieNew(vm))) goto endjob; - if (!(data = virQEMUSaveDataNew(driver, xml, cookie, was_running, compressed))) + if (!(data = virQEMUSaveDataNew(driver, vm, xml, cookie, was_running, compressed))) goto endjob; xml = NULL; diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 1faab5dd23..3110ef2621 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -7072,6 +7072,85 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm, } +int +qemuMigrationSrcToMappedFile(virQEMUDriver *driver, virDomainObj *vm, + int fd, + virDomainAsyncJob asyncJob) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(qemuMigrationParams) saveParams = NULL; + unsigned long saveMigBandwidth = priv->migMaxBandwidth; + int rc; + int ret = -1; + virErrorPtr orig_err = NULL; + + if (qemuMigrationSetDBusVMState(driver, vm) < 0) + return -1; + + if (!(saveParams = qemuMigrationParamsForMappedSave())) + return -1; + + /* Increase migration bandwidth to unlimited since target is a file. + * Failure to change migration speed is not fatal. */ + if (qemuMigrationParamsSetULL(saveParams, + QEMU_MIGRATION_PARAM_MAX_BANDWIDTH, + QEMU_DOMAIN_MIG_BANDWIDTH_MAX * 1024 * 1024) < 0) + return -1; + + if (qemuMigrationParamsApply(vm, asyncJob, saveParams, 0) < 0) + return -1; + + priv->migMaxBandwidth = QEMU_DOMAIN_MIG_BANDWIDTH_MAX; + + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("guest unexpectedly quit")); + /* nothing to tear down */ + return -1; + } + + if (qemuDomainObjEnterMonitorAsync(vm, asyncJob) < 0) + goto cleanup; + + rc = qemuMonitorMigrateToFdSet(vm, 0, fd); + qemuDomainObjExitMonitor(vm); + if (rc < 0) + goto cleanup; + + rc = qemuMigrationSrcWaitForCompletion(vm, asyncJob, NULL, 0); + + if (rc < 0) { + if (rc == -2) { + virErrorPreserveLast(&orig_err); + if (virDomainObjIsActive(vm)) + qemuMigrationSrcCancel(vm, asyncJob, true); + } + goto cleanup; + } + + qemuDomainEventEmitJobCompleted(driver, vm); + ret = 0; + + cleanup: + if (ret < 0 && !orig_err) + virErrorPreserveLast(&orig_err); + + /* Restore max migration bandwidth */ + if (virDomainObjIsActive(vm)) { + if (qemuMigrationParamsSetULL(saveParams, + QEMU_MIGRATION_PARAM_MAX_BANDWIDTH, + saveMigBandwidth * 1024 * 1024) == 0) + ignore_value(qemuMigrationParamsApply(vm, asyncJob, + saveParams, 0)); + priv->migMaxBandwidth = saveMigBandwidth; + } + + virErrorRestore(&orig_err); + + return ret; +} + + /** * This function is supposed to be used only to while reconnecting to a domain * with an active migration job. 
diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index ed62fd4a91..f845a0198b 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -241,6 +241,13 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainAsyncJob asyncJob) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; +int +qemuMigrationSrcToMappedFile(virQEMUDriver *driver, + virDomainObj *vm, + int fd, + virDomainAsyncJob asyncJob) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; + int qemuMigrationSrcCancelUnattended(virDomainObj *vm, virDomainJobObj *oldJob); diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 34e2ccab97..4c92bd740a 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2237,6 +2237,38 @@ qemuMonitorMigrateToFd(qemuMonitor *mon, } +int +qemuMonitorMigrateToFdSet(virDomainObj *vm, + unsigned int flags, + int fd) +{ + qemuDomainObjPrivate *priv = vm->privateData; + qemuMonitor *mon = priv->mon; + off_t offset; + g_autoptr(qemuFDPass) fdPassMigrate = NULL; + unsigned int setId; + g_autofree char *uri = NULL; + + VIR_DEBUG("fd=%d flags=0x%x", fd, flags); + + QEMU_CHECK_MONITOR(mon); + + if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) + return -1; + + fdPassMigrate = qemuFDPassNew("migrate", priv); + qemuFDPassAddFD(fdPassMigrate, &fd, "-fd"); + qemuFDPassTransferMonitor(fdPassMigrate, mon); + + if (qemuFDPassGetId(fdPassMigrate, &setId) < 0) + return -1; + + uri = g_strdup_printf("file:/dev/fdset/%u,offset=%#lx", setId, offset); + + return qemuMonitorJSONMigrate(mon, flags, uri); +} + + int qemuMonitorMigrateToHost(qemuMonitor *mon, unsigned int flags, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 6e81945201..c477def138 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -843,6 +843,10 @@ int qemuMonitorMigrateToFd(qemuMonitor *mon, unsigned int flags, int fd); +int qemuMonitorMigrateToFdSet(virDomainObj *vm, + unsigned int flags, + int fd); + int qemuMonitorMigrateToHost(qemuMonitor *mon, unsigned int flags, const char *protocol, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 30085dc7bc..8f28770086 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -96,6 +96,7 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUSaveData, virQEMUSaveDataFree); */ virQEMUSaveData * virQEMUSaveDataNew(virQEMUDriver *driver, + virDomainObj *vm, char *domXML, qemuDomainSaveCookie *cookieObj, bool running, @@ -122,6 +123,10 @@ virQEMUSaveDataNew(virQEMUDriver *driver, goto error; } header->version = cfg->saveImageVersion; + /* Enable mapped-ram feature if available and save version >= 3 */ + if (header->version >= QEMU_SAVE_VERSION && + qemuMigrationCapsGet(vm, QEMU_MIGRATION_CAP_MAPPED_RAM)) + header->features |= QEMU_SAVE_FEATURE_MAPPED_RAM; header->was_running = running ? 
1 : 0; header->compressed = compressed; @@ -369,6 +374,79 @@ qemuSaveImageDecompressionStop(virCommand *cmd, } +static int +qemuSaveImageCreateSequential(virQEMUDriver *driver, + virDomainObj *vm, + const char *path, + int fd, + virQEMUSaveData *data, + virCommand *compressor, + unsigned int flags, + virDomainAsyncJob asyncJob) +{ + int ret = -1; + virFileWrapperFd *wrapperFd = NULL; + unsigned int wrapperFlags = VIR_FILE_WRAPPER_NON_BLOCKING; + + if ((flags & VIR_DOMAIN_SAVE_BYPASS_CACHE)) + wrapperFlags |= VIR_FILE_WRAPPER_BYPASS_CACHE; + + if (!(wrapperFd = virFileWrapperFdNew(&fd, path, wrapperFlags))) + goto cleanup; + + if (virQEMUSaveDataWrite(data, fd, path) < 0) + goto cleanup; + + /* Perform the migration */ + if (qemuMigrationSrcToFile(driver, vm, fd, compressor, asyncJob) < 0) + goto cleanup; + + if (VIR_CLOSE(fd) < 0) { + virReportSystemError(errno, _("unable to close %1$s"), path); + goto cleanup; + } + + if (qemuDomainFileWrapperFDClose(vm, wrapperFd) < 0) + goto cleanup; + + ret = 0; + + cleanup: + VIR_FORCE_CLOSE(fd); + if (qemuDomainFileWrapperFDClose(vm, wrapperFd) < 0) + ret = -1; + virFileWrapperFdFree(wrapperFd); + + return ret; +} + + +static int +qemuSaveImageCreateMapped(virQEMUDriver *driver, + virDomainObj *vm, + const char *path, + int fd, + virQEMUSaveData *data, + unsigned int flags, + virDomainAsyncJob asyncJob) +{ + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + + /* mapped-ram does not support directIO */ + if ((flags & VIR_DOMAIN_SAVE_BYPASS_CACHE)) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + _("bypass cache unsupported by this system")); + return -1; + } + + if (virQEMUSaveDataWrite(data, fd, path) < 0) + return -1; + + /* Perform the migration */ + return qemuMigrationSrcToMappedFile(driver, vm, fd, asyncJob); +} + + /* Helper function to execute a migration to file with a correct save header * the caller needs to make sure that the processors are stopped and do all other * actions besides saving memory */ @@ -386,12 +464,9 @@ qemuSaveImageCreate(virQEMUDriver *driver, int ret = -1; int fd = -1; int directFlag = 0; - virFileWrapperFd *wrapperFd = NULL; - unsigned int wrapperFlags = VIR_FILE_WRAPPER_NON_BLOCKING; /* Obtain the file handle. */ if ((flags & VIR_DOMAIN_SAVE_BYPASS_CACHE)) { - wrapperFlags |= VIR_FILE_WRAPPER_BYPASS_CACHE; directFlag = virFileDirectFdFlag(); if (directFlag < 0) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", @@ -409,14 +484,12 @@ qemuSaveImageCreate(virQEMUDriver *driver, if (qemuSecuritySetImageFDLabel(driver->securityManager, vm->def, fd) < 0) goto cleanup; - if (!(wrapperFd = virFileWrapperFdNew(&fd, path, wrapperFlags))) - goto cleanup; + if (data->header.features & QEMU_SAVE_FEATURE_MAPPED_RAM) + ret = qemuSaveImageCreateMapped(driver, vm, path, fd, data, flags, asyncJob); + else + ret = qemuSaveImageCreateSequential(driver, vm, path, fd, data, compressor, flags, asyncJob); - if (virQEMUSaveDataWrite(data, fd, path) < 0) - goto cleanup; - - /* Perform the migration */ - if (qemuMigrationSrcToFile(driver, vm, fd, compressor, asyncJob) < 0) + if (ret < 0) goto cleanup; /* Touch up file header to mark image complete. */ @@ -425,14 +498,6 @@ qemuSaveImageCreate(virQEMUDriver *driver, * up to seek backwards on wrapperFd. The reopened fd will * trigger a single page of file system cache pollution, but * that's acceptable. 
*/ - if (VIR_CLOSE(fd) < 0) { - virReportSystemError(errno, _("unable to close %1$s"), path); - goto cleanup; - } - - if (qemuDomainFileWrapperFDClose(vm, wrapperFd) < 0) - goto cleanup; - if ((fd = qemuDomainOpenFile(cfg, vm->def, path, O_WRONLY, NULL)) < 0 || virQEMUSaveDataFinish(data, &fd, path) < 0) goto cleanup; @@ -441,10 +506,6 @@ qemuSaveImageCreate(virQEMUDriver *driver, cleanup: VIR_FORCE_CLOSE(fd); - if (qemuDomainFileWrapperFDClose(vm, wrapperFd) < 0) - ret = -1; - virFileWrapperFdFree(wrapperFd); - if (ret < 0 && needUnlink) unlink(path); diff --git a/src/qemu/qemu_saveimage.h b/src/qemu/qemu_saveimage.h index 63ad5508ed..81d93bf33c 100644 --- a/src/qemu/qemu_saveimage.h +++ b/src/qemu/qemu_saveimage.h @@ -124,6 +124,7 @@ virQEMUSaveDataWrite(virQEMUSaveData *data, virQEMUSaveData * virQEMUSaveDataNew(virQEMUDriver *driver, + virDomainObj *vm, char *domXML, qemuDomainSaveCookie *cookieObj, bool running, diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c index 1d75208814..1e9e0e31d7 100644 --- a/src/qemu/qemu_snapshot.c +++ b/src/qemu/qemu_snapshot.c @@ -1390,7 +1390,7 @@ qemuSnapshotCreateActiveExternal(virQEMUDriver *driver, !(snapdef->cookie = (virObject *) qemuDomainSaveCookieNew(vm))) goto cleanup; - if (!(data = virQEMUSaveDataNew(driver, xml, + if (!(data = virQEMUSaveDataNew(driver, vm, xml, (qemuDomainSaveCookie *) snapdef->cookie, resume, compressed))) goto cleanup; -- 2.44.0
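
To make the fd handoff in this patch concrete: the save file descriptor is placed in an fdset and the migration is pointed at it with a 'file:' URI carrying the current file offset, i.e. just past libvirt's header, XML and cookie. With illustrative values the URI looks like:

  file:/dev/fdset/1,offset=0x1000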

On Thu, Jun 13, 2024 at 04:43:20PM -0600, Jim Fehlig via Devel wrote:
Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/qemu_driver.c | 2 +- src/qemu/qemu_migration.c | 79 ++++++++++++++++++++++++++++ src/qemu/qemu_migration.h | 7 +++ src/qemu/qemu_monitor.c | 32 ++++++++++++ src/qemu/qemu_monitor.h | 4 ++ src/qemu/qemu_saveimage.c | 105 ++++++++++++++++++++++++++++++-------- src/qemu/qemu_saveimage.h | 1 + src/qemu/qemu_snapshot.c | 2 +- 8 files changed, 208 insertions(+), 24 deletions(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index f9761242d2..34f37210d9 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -2696,7 +2696,7 @@ qemuDomainSaveInternal(virQEMUDriver *driver, if (!(cookie = qemuDomainSaveCookieNew(vm))) goto endjob;
- if (!(data = virQEMUSaveDataNew(driver, xml, cookie, was_running, compressed))) + if (!(data = virQEMUSaveDataNew(driver, vm, xml, cookie, was_running, compressed))) goto endjob; xml = NULL;
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 1faab5dd23..3110ef2621 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -7072,6 +7072,85 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm, }
+int +qemuMigrationSrcToMappedFile(virQEMUDriver *driver, virDomainObj *vm, + int fd, + virDomainAsyncJob asyncJob) +{ + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(qemuMigrationParams) saveParams = NULL; + unsigned long saveMigBandwidth = priv->migMaxBandwidth; + int rc; + int ret = -1; + virErrorPtr orig_err = NULL; + + if (qemuMigrationSetDBusVMState(driver, vm) < 0) + return -1; + + if (!(saveParams = qemuMigrationParamsForMappedSave())) + return -1; + + /* Increase migration bandwidth to unlimited since target is a file. + * Failure to change migration speed is not fatal. */ + if (qemuMigrationParamsSetULL(saveParams, + QEMU_MIGRATION_PARAM_MAX_BANDWIDTH, + QEMU_DOMAIN_MIG_BANDWIDTH_MAX * 1024 * 1024) < 0) + return -1; + + if (qemuMigrationParamsApply(vm, asyncJob, saveParams, 0) < 0) + return -1; + + priv->migMaxBandwidth = QEMU_DOMAIN_MIG_BANDWIDTH_MAX; + + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("guest unexpectedly quit")); + /* nothing to tear down */ + return -1; + } + + if (qemuDomainObjEnterMonitorAsync(vm, asyncJob) < 0) + goto cleanup; + + rc = qemuMonitorMigrateToFdSet(vm, 0, fd); + qemuDomainObjExitMonitor(vm); + if (rc < 0) + goto cleanup; + + rc = qemuMigrationSrcWaitForCompletion(vm, asyncJob, NULL, 0); + + if (rc < 0) { + if (rc == -2) { + virErrorPreserveLast(&orig_err); + if (virDomainObjIsActive(vm)) + qemuMigrationSrcCancel(vm, asyncJob, true); + } + goto cleanup; + } + + qemuDomainEventEmitJobCompleted(driver, vm); + ret = 0; + + cleanup: + if (ret < 0 && !orig_err) + virErrorPreserveLast(&orig_err); + + /* Restore max migration bandwidth */ + if (virDomainObjIsActive(vm)) { + if (qemuMigrationParamsSetULL(saveParams, + QEMU_MIGRATION_PARAM_MAX_BANDWIDTH, + saveMigBandwidth * 1024 * 1024) == 0) + ignore_value(qemuMigrationParamsApply(vm, asyncJob, + saveParams, 0)); + priv->migMaxBandwidth = saveMigBandwidth; + } + + virErrorRestore(&orig_err); + + return ret; +} + + /** * This function is supposed to be used only to while reconnecting to a domain * with an active migration job. diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index ed62fd4a91..f845a0198b 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -241,6 +241,13 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainAsyncJob asyncJob) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT;
+int +qemuMigrationSrcToMappedFile(virQEMUDriver *driver, + virDomainObj *vm, + int fd, + virDomainAsyncJob asyncJob) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; + int qemuMigrationSrcCancelUnattended(virDomainObj *vm, virDomainJobObj *oldJob); diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 34e2ccab97..4c92bd740a 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2237,6 +2237,38 @@ qemuMonitorMigrateToFd(qemuMonitor *mon, }
+int +qemuMonitorMigrateToFdSet(virDomainObj *vm, + unsigned int flags, + int fd) +{ + qemuDomainObjPrivate *priv = vm->privateData; + qemuMonitor *mon = priv->mon; + off_t offset; + g_autoptr(qemuFDPass) fdPassMigrate = NULL; + unsigned int setId; + g_autofree char *uri = NULL; + + VIR_DEBUG("fd=%d flags=0x%x", fd, flags); + + QEMU_CHECK_MONITOR(mon); + + if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) + return -1;
virReportSystemError() should probably be called here, or this could be checked/gathered before entering the monitor.
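
For example (the message wording here is only illustrative):

    if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) {
        virReportSystemError(errno, "%s",
                             _("failed to query current file offset"));
        return -1;
    }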
+ + fdPassMigrate = qemuFDPassNew("migrate", priv); + qemuFDPassAddFD(fdPassMigrate, &fd, "-fd"); + qemuFDPassTransferMonitor(fdPassMigrate, mon);
This function can fail (and set an error)
+ + if (qemuFDPassGetId(fdPassMigrate, &setId) < 0) + return -1;
No error is set here, but I see no way for it to fail in this case anyway.
+ + uri = g_strdup_printf("file:/dev/fdset/%u,offset=%#lx", setId, offset); + + return qemuMonitorJSONMigrate(mon, flags, uri); +} + + int qemuMonitorMigrateToHost(qemuMonitor *mon, unsigned int flags, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 6e81945201..c477def138 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -843,6 +843,10 @@ int qemuMonitorMigrateToFd(qemuMonitor *mon, unsigned int flags, int fd);
+int qemuMonitorMigrateToFdSet(virDomainObj *vm, + unsigned int flags, + int fd); + int qemuMonitorMigrateToHost(qemuMonitor *mon, unsigned int flags, const char *protocol, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 30085dc7bc..8f28770086 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -96,6 +96,7 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUSaveData, virQEMUSaveDataFree); */ virQEMUSaveData * virQEMUSaveDataNew(virQEMUDriver *driver, + virDomainObj *vm, char *domXML, qemuDomainSaveCookie *cookieObj, bool running, @@ -122,6 +123,10 @@ virQEMUSaveDataNew(virQEMUDriver *driver, goto error; } header->version = cfg->saveImageVersion; + /* Enable mapped-ram feature if available and save version >= 3 */ + if (header->version >= QEMU_SAVE_VERSION &&
This should really be "3" and not the save version, in case we increase it later. But to keep it readable without magic numbers, and more error proof, maybe there should be a feature->minVersion mapping (a sketch follows at the end of this message)?
+ qemuMigrationCapsGet(vm, QEMU_MIGRATION_CAP_MAPPED_RAM)) + header->features |= QEMU_SAVE_FEATURE_MAPPED_RAM;
header->was_running = running ? 1 : 0; header->compressed = compressed;
Also, in case we add support for compressed mapped-ram save images, this might cause trouble later, since you record the compression but don't execute it (as mentioned in the cover letter).
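
Returning to the feature->minVersion mapping suggested above, one possible shape (a sketch; the names are invented for illustration):

    typedef struct {
        qemuSaveFeatures feature;
        unsigned int minVersion;
    } qemuSaveFeatureVersion;

    static const qemuSaveFeatureVersion qemuSaveFeatureVersions[] = {
        { QEMU_SAVE_FEATURE_MAPPED_RAM, 3 },
    };

    /* a feature would only be enabled when cfg->saveImageVersion is at
     * least the feature's minVersion */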

Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/qemu_migration.c | 19 +++++++++++---- src/qemu/qemu_migration.h | 3 ++- src/qemu/qemu_process.c | 49 ++++++++++++++++++++++++++++----------- src/qemu/qemu_process.h | 13 +++++++---- src/qemu/qemu_saveimage.c | 26 ++++++++++++++------- 5 files changed, 76 insertions(+), 34 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 3110ef2621..b1d27e07e1 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -2281,13 +2281,23 @@ qemuMigrationDstGetURI(const char *migrateFrom, int qemuMigrationDstRun(virDomainObj *vm, const char *uri, - virDomainAsyncJob asyncJob) + virDomainAsyncJob asyncJob, + unsigned int flags) { qemuDomainObjPrivate *priv = vm->privateData; int rv; VIR_DEBUG("Setting up incoming migration with URI %s", uri); + if (flags & VIR_QEMU_PROCESS_START_MAPPED_RAM) { + g_autoptr(qemuMigrationParams) migParams = NULL; + + if (!(migParams = qemuMigrationParamsForMappedSave())) + return -1; + + qemuMigrationParamsApply(vm, asyncJob, migParams, 0); + } + if (qemuDomainObjEnterMonitorAsync(vm, asyncJob) < 0) return -1; @@ -2945,7 +2955,6 @@ qemuMigrationDstPrepare(virDomainObj *vm, unsigned short port, int fd) { - qemuDomainObjPrivate *priv = vm->privateData; g_autofree char *migrateFrom = NULL; if (tunnel) { @@ -2999,8 +3008,8 @@ qemuMigrationDstPrepare(virDomainObj *vm, migrateFrom = g_strdup_printf(incFormat, protocol, listenAddress, port); } - return qemuProcessIncomingDefNew(priv->qemuCaps, listenAddress, - migrateFrom, fd, NULL); + return qemuProcessIncomingDefNew(vm, listenAddress, + migrateFrom, fd, NULL, 0); } @@ -3238,7 +3247,7 @@ qemuMigrationDstPrepareActive(virQEMUDriver *driver, } if (qemuMigrationDstRun(vm, incoming->uri, - VIR_ASYNC_JOB_MIGRATION_IN) < 0) + VIR_ASYNC_JOB_MIGRATION_IN, 0) < 0) goto error; if (qemuProcessFinishStartup(driver, vm, VIR_ASYNC_JOB_MIGRATION_IN, diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index f845a0198b..864f3280e5 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -286,7 +286,8 @@ qemuMigrationDstGetURI(const char *migrateFrom, int qemuMigrationDstRun(virDomainObj *vm, const char *uri, - virDomainAsyncJob asyncJob); + virDomainAsyncJob asyncJob, + unsigned int flags); void qemuMigrationSrcPostcopyFailed(virDomainObj *vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7ef7040a85..f700390a8c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4684,28 +4684,43 @@ qemuProcessIncomingDefFree(qemuProcessIncomingDef *inc) * qemuProcessIncomingDefFree will NOT close it. 
*/ qemuProcessIncomingDef * -qemuProcessIncomingDefNew(virQEMUCaps *qemuCaps, +qemuProcessIncomingDefNew(virDomainObj *vm, const char *listenAddress, const char *migrateFrom, int fd, - const char *path) + const char *path, + unsigned int flags) { + qemuDomainObjPrivate *priv = vm->privateData; qemuProcessIncomingDef *inc = NULL; - if (qemuMigrationDstCheckProtocol(qemuCaps, migrateFrom) < 0) + if (qemuMigrationDstCheckProtocol(priv->qemuCaps, migrateFrom) < 0) return NULL; inc = g_new0(qemuProcessIncomingDef, 1); inc->address = g_strdup(listenAddress); - - inc->uri = qemuMigrationDstGetURI(migrateFrom, fd); - if (!inc->uri) - goto error; - inc->fd = fd; inc->path = path; + if (flags & VIR_QEMU_PROCESS_START_MAPPED_RAM) { + unsigned int fdsetId; + off_t offset; + + if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) + offset = 0; + + inc->fdPassMigrate = qemuFDPassNew("migrate", priv); + qemuFDPassAddFD(inc->fdPassMigrate, &fd, "-fd"); + qemuFDPassGetId(inc->fdPassMigrate, &fdsetId); + inc->uri = g_strdup_printf("file:/dev/fdset/%u,offset=%#lx", fdsetId, offset); + } else { + inc->uri = qemuMigrationDstGetURI(migrateFrom, fd); + } + + if (!inc->uri) + goto error; + return inc; error: @@ -7667,7 +7682,8 @@ qemuProcessLaunch(virConnectPtr conn, VIR_QEMU_PROCESS_START_AUTODESTROY | VIR_QEMU_PROCESS_START_NEW | VIR_QEMU_PROCESS_START_GEN_VMID | - VIR_QEMU_PROCESS_START_RESET_NVRAM, -1); + VIR_QEMU_PROCESS_START_RESET_NVRAM | + VIR_QEMU_PROCESS_START_MAPPED_RAM, -1); cfg = virQEMUDriverGetConfig(driver); @@ -7717,8 +7733,12 @@ qemuProcessLaunch(virConnectPtr conn, &nnicindexes, &nicindexes))) goto cleanup; - if (incoming && incoming->fd != -1) + if (incoming && incoming->fd != -1) { virCommandPassFD(cmd, incoming->fd, 0); + if (incoming->fdPassMigrate != NULL) { + qemuFDPassTransferCommand(incoming->fdPassMigrate, cmd); + } + } /* now that we know it is about to start call the hook if present */ if (qemuProcessStartHook(driver, vm, @@ -8153,7 +8173,8 @@ qemuProcessStart(virConnectPtr conn, VIR_QEMU_PROCESS_START_PAUSED | VIR_QEMU_PROCESS_START_AUTODESTROY | VIR_QEMU_PROCESS_START_GEN_VMID | - VIR_QEMU_PROCESS_START_RESET_NVRAM, cleanup); + VIR_QEMU_PROCESS_START_RESET_NVRAM | + VIR_QEMU_PROCESS_START_MAPPED_RAM, cleanup); if (!migrateFrom && !snapshot) flags |= VIR_QEMU_PROCESS_START_NEW; @@ -8163,8 +8184,8 @@ qemuProcessStart(virConnectPtr conn, goto cleanup; if (migrateFrom) { - incoming = qemuProcessIncomingDefNew(priv->qemuCaps, NULL, migrateFrom, - migrateFd, migratePath); + incoming = qemuProcessIncomingDefNew(vm, NULL, migrateFrom, + migrateFd, migratePath, flags); if (!incoming) goto stop; } @@ -8191,7 +8212,7 @@ qemuProcessStart(virConnectPtr conn, relabel = true; if (incoming) { - if (qemuMigrationDstRun(vm, incoming->uri, asyncJob) < 0) + if (qemuMigrationDstRun(vm, incoming->uri, asyncJob, flags) < 0) goto stop; } else { /* Refresh state of devices from QEMU. 
During migration this happens diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index c1ea949215..a5212ee56e 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -54,14 +54,16 @@ struct _qemuProcessIncomingDef { char *address; /* address where QEMU is supposed to listen */ char *uri; /* used when calling migrate-incoming QMP command */ int fd; /* for fd:N URI */ + qemuFDPass *fdPassMigrate; /* for file:/dev/fdset/n,offset=x URI */ const char *path; /* path associated with fd */ }; -qemuProcessIncomingDef *qemuProcessIncomingDefNew(virQEMUCaps *qemuCaps, - const char *listenAddress, - const char *migrateFrom, - int fd, - const char *path); +qemuProcessIncomingDef *qemuProcessIncomingDefNew(virDomainObj *vm, + const char *listenAddress, + const char *migrateFrom, + int fd, + const char *path, + unsigned int flags); void qemuProcessIncomingDefFree(qemuProcessIncomingDef *inc); int qemuProcessBeginJob(virDomainObj *vm, @@ -77,6 +79,7 @@ typedef enum { VIR_QEMU_PROCESS_START_NEW = 1 << 4, /* internal, new VM is starting */ VIR_QEMU_PROCESS_START_GEN_VMID = 1 << 5, /* Generate a new VMID */ VIR_QEMU_PROCESS_START_RESET_NVRAM = 1 << 6, /* Re-initialize NVRAM from template */ + VIR_QEMU_PROCESS_START_MAPPED_RAM = 1 << 7, /* Re-initialize NVRAM from template */ } qemuProcessStartFlags; int qemuProcessStart(virConnectPtr conn, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 8f28770086..1545c00472 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -628,6 +628,7 @@ qemuSaveImageOpen(virQEMUDriver *driver, int oflags = open_write ? O_RDWR : O_RDONLY; size_t xml_len; size_t cookie_len; + bool use_mapped_ram = false; if (bypass_cache) { int directFlag = virFileDirectFdFlag(); @@ -642,11 +643,6 @@ qemuSaveImageOpen(virQEMUDriver *driver, if ((fd = qemuDomainOpenFile(cfg, NULL, path, oflags, NULL)) < 0) return -1; - if (bypass_cache && - !(*wrapperFd = virFileWrapperFdNew(&fd, path, - VIR_FILE_WRAPPER_BYPASS_CACHE))) - return -1; - data = g_new0(virQEMUSaveData, 1); header = &data->header; @@ -708,10 +704,14 @@ qemuSaveImageOpen(virQEMUDriver *driver, return -1; } - if (header->features && header->features != QEMU_SAVE_FEATURE_MAPPED_RAM) { - virReportError(VIR_ERR_OPERATION_FAILED, "%s", - _("image contains unsupported features)")); - return -1; + if (header->features) { + if (header->features == QEMU_SAVE_FEATURE_MAPPED_RAM) { + use_mapped_ram = true; + } else { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + _("image contains unsupported features)")); + return -1; + } } if (header->cookieOffset) @@ -739,6 +739,11 @@ qemuSaveImageOpen(virQEMUDriver *driver, } } + if (bypass_cache && !use_mapped_ram && + !(*wrapperFd = virFileWrapperFdNew(&fd, path, + VIR_FILE_WRAPPER_BYPASS_CACHE))) + return -1; + /* Create a domain from this XML */ if (!(def = virDomainDefParseString(data->xml, driver->xmlopt, qemuCaps, VIR_DOMAIN_DEF_PARSE_INACTIVE | @@ -777,6 +782,9 @@ qemuSaveImageStartVM(virConnectPtr conn, if (reset_nvram) start_flags |= VIR_QEMU_PROCESS_START_RESET_NVRAM; + if (header->features & QEMU_SAVE_FEATURE_MAPPED_RAM) + start_flags |= VIR_QEMU_PROCESS_START_MAPPED_RAM; + if (qemuProcessStartWithMemoryState(conn, driver, vm, fd, path, NULL, data, asyncJob, start_flags, "restored", &started) < 0) { -- 2.44.0

On Thu, Jun 13, 2024 at 04:43:21PM -0600, Jim Fehlig via Devel wrote:
Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/qemu_migration.c | 19 +++++++++++---- src/qemu/qemu_migration.h | 3 ++- src/qemu/qemu_process.c | 49 ++++++++++++++++++++++++++++----------- src/qemu/qemu_process.h | 13 +++++++---- src/qemu/qemu_saveimage.c | 26 ++++++++++++++------- 5 files changed, 76 insertions(+), 34 deletions(-)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7ef7040a85..f700390a8c 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7717,8 +7733,12 @@ qemuProcessLaunch(virConnectPtr conn, &nnicindexes, &nicindexes))) goto cleanup;
- if (incoming && incoming->fd != -1) + if (incoming && incoming->fd != -1) { virCommandPassFD(cmd, incoming->fd, 0); + if (incoming->fdPassMigrate != NULL) { + qemuFDPassTransferCommand(incoming->fdPassMigrate, cmd);
No need for this last condition; qemuFDPassTransferCommand() is a no-op when the first parameter is NULL.
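
In other words, the hunk could be reduced to something like:

    if (incoming && incoming->fd != -1) {
        virCommandPassFD(cmd, incoming->fd, 0);
        qemuFDPassTransferCommand(incoming->fdPassMigrate, cmd);
    }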
+ } + }
/* now that we know it is about to start call the hook if present */ if (qemuProcessStartHook(driver, vm, @@ -8153,7 +8173,8 @@ qemuProcessStart(virConnectPtr conn, VIR_QEMU_PROCESS_START_PAUSED | VIR_QEMU_PROCESS_START_AUTODESTROY | VIR_QEMU_PROCESS_START_GEN_VMID | - VIR_QEMU_PROCESS_START_RESET_NVRAM, cleanup); + VIR_QEMU_PROCESS_START_RESET_NVRAM | + VIR_QEMU_PROCESS_START_MAPPED_RAM, cleanup);
if (!migrateFrom && !snapshot) flags |= VIR_QEMU_PROCESS_START_NEW; @@ -8163,8 +8184,8 @@ qemuProcessStart(virConnectPtr conn, goto cleanup;
if (migrateFrom) { - incoming = qemuProcessIncomingDefNew(priv->qemuCaps, NULL, migrateFrom, - migrateFd, migratePath); + incoming = qemuProcessIncomingDefNew(vm, NULL, migrateFrom, + migrateFd, migratePath, flags); if (!incoming) goto stop; } @@ -8191,7 +8212,7 @@ qemuProcessStart(virConnectPtr conn, relabel = true;
if (incoming) { - if (qemuMigrationDstRun(vm, incoming->uri, asyncJob) < 0) + if (qemuMigrationDstRun(vm, incoming->uri, asyncJob, flags) < 0) goto stop; } else { /* Refresh state of devices from QEMU. During migration this happens diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index c1ea949215..a5212ee56e 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -54,14 +54,16 @@ struct _qemuProcessIncomingDef { char *address; /* address where QEMU is supposed to listen */ char *uri; /* used when calling migrate-incoming QMP command */ int fd; /* for fd:N URI */ + qemuFDPass *fdPassMigrate; /* for file:/dev/fdset/n,offset=x URI */ const char *path; /* path associated with fd */ };
-qemuProcessIncomingDef *qemuProcessIncomingDefNew(virQEMUCaps *qemuCaps, - const char *listenAddress, - const char *migrateFrom, - int fd, - const char *path); +qemuProcessIncomingDef *qemuProcessIncomingDefNew(virDomainObj *vm, + const char *listenAddress, + const char *migrateFrom, + int fd, + const char *path, + unsigned int flags); void qemuProcessIncomingDefFree(qemuProcessIncomingDef *inc);
int qemuProcessBeginJob(virDomainObj *vm, @@ -77,6 +79,7 @@ typedef enum { VIR_QEMU_PROCESS_START_NEW = 1 << 4, /* internal, new VM is starting */ VIR_QEMU_PROCESS_START_GEN_VMID = 1 << 5, /* Generate a new VMID */ VIR_QEMU_PROCESS_START_RESET_NVRAM = 1 << 6, /* Re-initialize NVRAM from template */ + VIR_QEMU_PROCESS_START_MAPPED_RAM = 1 << 7, /* Re-initialize NVRAM from template */
copy paste error in comment ;)
} qemuProcessStartFlags;
int qemuProcessStart(virConnectPtr conn, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 8f28770086..1545c00472 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -628,6 +628,7 @@ qemuSaveImageOpen(virQEMUDriver *driver, int oflags = open_write ? O_RDWR : O_RDONLY; size_t xml_len; size_t cookie_len; + bool use_mapped_ram = false;
if (bypass_cache) { int directFlag = virFileDirectFdFlag(); @@ -642,11 +643,6 @@ qemuSaveImageOpen(virQEMUDriver *driver, if ((fd = qemuDomainOpenFile(cfg, NULL, path, oflags, NULL)) < 0) return -1;
- if (bypass_cache && - !(*wrapperFd = virFileWrapperFdNew(&fd, path, - VIR_FILE_WRAPPER_BYPASS_CACHE))) - return -1; - data = g_new0(virQEMUSaveData, 1);
header = &data->header; @@ -708,10 +704,14 @@ qemuSaveImageOpen(virQEMUDriver *driver, return -1; }
- if (header->features && header->features != QEMU_SAVE_FEATURE_MAPPED_RAM) { - virReportError(VIR_ERR_OPERATION_FAILED, "%s", - _("image contains unsupported features)")); - return -1; + if (header->features) { + if (header->features == QEMU_SAVE_FEATURE_MAPPED_RAM) { + use_mapped_ram = true; + } else { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + _("image contains unsupported features)")); + return -1; + } }
if (header->cookieOffset) @@ -739,6 +739,11 @@ qemuSaveImageOpen(virQEMUDriver *driver, } }
+ if (bypass_cache && !use_mapped_ram && + !(*wrapperFd = virFileWrapperFdNew(&fd, path, + VIR_FILE_WRAPPER_BYPASS_CACHE))) + return -1; +
Shouldn't this be:

  if (bypass_cache) {
      if (use_mapped_ram)
          error out
      virFileWrapperFdNew(...)
  }

?? But only until the next patches, which add the possibility of bypass_cache + mapped_ram. Or maybe this should error out if QEMU does not support mapped-ram with direct-io.
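
A sketch of that suggested structure (error code and message are illustrative; it would only apply until the later patches wire up bypass_cache with mapped-ram):

    if (bypass_cache) {
        if (use_mapped_ram) {
            virReportError(VIR_ERR_OPERATION_UNSUPPORTED, "%s",
                           _("bypass cache is not supported with mapped-ram"));
            return -1;
        }

        if (!(*wrapperFd = virFileWrapperFdNew(&fd, path,
                                               VIR_FILE_WRAPPER_BYPASS_CACHE)))
            return -1;
    }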
/* Create a domain from this XML */ if (!(def = virDomainDefParseString(data->xml, driver->xmlopt, qemuCaps, VIR_DOMAIN_DEF_PARSE_INACTIVE | @@ -777,6 +782,9 @@ qemuSaveImageStartVM(virConnectPtr conn, if (reset_nvram) start_flags |= VIR_QEMU_PROCESS_START_RESET_NVRAM;
+ if (header->features & QEMU_SAVE_FEATURE_MAPPED_RAM) + start_flags |= VIR_QEMU_PROCESS_START_MAPPED_RAM; + if (qemuProcessStartWithMemoryState(conn, driver, vm, fd, path, NULL, data, asyncJob, start_flags, "restored", &started) < 0) { -- 2.44.0

Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- I'm not happy with this and the subsequent patch, which pass another FD around for QEMU to use for reading/writing unaligned state when BYPASS_CACHE has been specified. One idea is to pass the qemuFdPass object around the various functions, but qemu_fd.h already includes qemu_monitor.h. E.g. passing a qemuFdPass to qemuMonitorMigrateToMappedFile requires including qemu_fd.h in qemu_monitor.h, so a cyclical dependency. I could move qemuFDPass*TransferMonitor* functions to qemu_monitor.[ch] to avoid the include in qemu_fd.h, but would likely require moving the _qemuFDPass struct in qemu_fd.c to qemu_fd.h. Another idea is to move qemuProcessIncomingDefNew in qemu_process.c to qemuMigrationConnectionDefNew() in qemu_migration.c. Also move qemuProcessIncomingDef object in qemu_process.h to qemuMigrationConnectDef (or something like that) in qemu_migration.h. The qemuMigrationConnectDef can encapsulate all the connection info related to the migration. E.g. traditional fd, path associated with fd, uri, listen address, qemuFDPass object containing fds for mapped_ram, etc. This can be used by both save and restore paths. Other suggestions kindly welcomed :-). src/qemu/qemu_migration.c | 3 ++- src/qemu/qemu_migration.h | 1 + src/qemu/qemu_monitor.c | 14 +++++++++++--- src/qemu/qemu_monitor.h | 3 ++- src/qemu/qemu_saveimage.c | 25 +++++++++++++++++-------- 5 files changed, 33 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index b1d27e07e1..a0435a0572 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -7084,6 +7084,7 @@ qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm, int qemuMigrationSrcToMappedFile(virQEMUDriver *driver, virDomainObj *vm, int fd, + int nondirectFD, virDomainAsyncJob asyncJob) { qemuDomainObjPrivate *priv = vm->privateData; @@ -7121,7 +7122,7 @@ qemuMigrationSrcToMappedFile(virQEMUDriver *driver, virDomainObj *vm, if (qemuDomainObjEnterMonitorAsync(vm, asyncJob) < 0) goto cleanup; - rc = qemuMonitorMigrateToFdSet(vm, 0, fd); + rc = qemuMonitorMigrateToFdSet(vm, 0, fd, nondirectFD); qemuDomainObjExitMonitor(vm); if (rc < 0) goto cleanup; diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index 864f3280e5..939205ef13 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -245,6 +245,7 @@ int qemuMigrationSrcToMappedFile(virQEMUDriver *driver, virDomainObj *vm, int fd, + int nondirectFD, virDomainAsyncJob asyncJob) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 4c92bd740a..c0936bee6b 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2240,7 +2240,8 @@ qemuMonitorMigrateToFd(qemuMonitor *mon, int qemuMonitorMigrateToFdSet(virDomainObj *vm, unsigned int flags, - int fd) + int fd, + int nondirectFD) { qemuDomainObjPrivate *priv = vm->privateData; qemuMonitor *mon = priv->mon; @@ -2249,15 +2250,22 @@ qemuMonitorMigrateToFdSet(virDomainObj *vm, unsigned int setId; g_autofree char *uri = NULL; - VIR_DEBUG("fd=%d flags=0x%x", fd, flags); + VIR_DEBUG("fd=%d nondirectFD=%d flags=0x%x", fd, nondirectFD, flags); QEMU_CHECK_MONITOR(mon); - if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) + if (nondirectFD != -1) + offset = lseek(nondirectFD, 0, SEEK_CUR); + else + offset = lseek(fd, 0, SEEK_CUR); + + if (offset == -1) return -1; fdPassMigrate = qemuFDPassNew("migrate", priv); qemuFDPassAddFD(fdPassMigrate, &fd, "-fd"); + if 
(nondirectFD != -1) + qemuFDPassAddFD(fdPassMigrate, &nondirectFD, "-nondirect-fd"); qemuFDPassTransferMonitor(fdPassMigrate, mon); if (qemuFDPassGetId(fdPassMigrate, &setId) < 0) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index c477def138..0fbb2dadc1 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -845,7 +845,8 @@ int qemuMonitorMigrateToFd(qemuMonitor *mon, int qemuMonitorMigrateToFdSet(virDomainObj *vm, unsigned int flags, - int fd); + int fd, + int nondirectFD); int qemuMonitorMigrateToHost(qemuMonitor *mon, unsigned int flags, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 1545c00472..2b0281895a 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -431,19 +431,28 @@ qemuSaveImageCreateMapped(virQEMUDriver *driver, virDomainAsyncJob asyncJob) { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + VIR_AUTOCLOSE nondirectFD = -1; - /* mapped-ram does not support directIO */ + /* If BYPASS_CACHE has been specified, fd has been opened with O_DIRECT. + * In this case, QEMU requires a second FD without O_DIRECT for writing + * unaligned state. We'll use this FD as well to write the header. + * Relative to VM RAM size, this data is a drop in the bucket and fine + * to write without O_DIRECT. + */ if ((flags & VIR_DOMAIN_SAVE_BYPASS_CACHE)) { - virReportError(VIR_ERR_OPERATION_FAILED, "%s", - _("bypass cache unsupported by this system")); - return -1; + if ((nondirectFD = qemuDomainOpenFile(cfg, vm->def, path, O_WRONLY, NULL)) < 0) + return -1; + if (qemuSecuritySetImageFDLabel(driver->securityManager, vm->def, nondirectFD) < 0) + return -1; + if (virQEMUSaveDataWrite(data, nondirectFD, path) < 0) + return -1; + } else { + if (virQEMUSaveDataWrite(data, fd, path) < 0) + return -1; } - if (virQEMUSaveDataWrite(data, fd, path) < 0) - return -1; - /* Perform the migration */ - return qemuMigrationSrcToMappedFile(driver, vm, fd, asyncJob); + return qemuMigrationSrcToMappedFile(driver, vm, fd, nondirectFD, asyncJob); } -- 2.44.0
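
To summarize what QEMU ends up seeing with BYPASS_CACHE in this patch: the single fdset behind the 'file:' URI now carries two descriptors for the same path, roughly (illustrative layout):

  /dev/fdset/1:  fd opened with O_DIRECT     - aligned guest RAM pages
                 fd opened without O_DIRECT  - header and unaligned state
  migrate URI:   file:/dev/fdset/1,offset=0x...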

Signed-off-by: Jim Fehlig <jfehlig@suse.com> --- src/qemu/qemu_driver.c | 22 +++++++++++++--------- src/qemu/qemu_migration.c | 2 +- src/qemu/qemu_process.c | 20 ++++++++++++++++---- src/qemu/qemu_process.h | 3 +++ src/qemu/qemu_saveimage.c | 29 +++++++++++++++++++++++++---- src/qemu/qemu_saveimage.h | 4 +++- src/qemu/qemu_snapshot.c | 8 ++++---- 7 files changed, 65 insertions(+), 23 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 34f37210d9..53674458bf 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -1624,7 +1624,7 @@ static virDomainPtr qemuDomainCreateXML(virConnectPtr conn, } if (qemuProcessStart(conn, driver, vm, NULL, VIR_ASYNC_JOB_START, - NULL, -1, NULL, NULL, + NULL, -1, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags) < 0) { virDomainAuditStart(vm, "booted", false); @@ -5742,6 +5742,7 @@ qemuDomainRestoreInternal(virConnectPtr conn, g_autofree char *xmlout = NULL; const char *newxml = dxml; int fd = -1; + int nondirectFd = -1; int ret = -1; virQEMUSaveData *data = NULL; virFileWrapperFd *wrapperFd = NULL; @@ -5758,7 +5759,7 @@ qemuDomainRestoreInternal(virConnectPtr conn, fd = qemuSaveImageOpen(driver, NULL, path, &def, &data, (flags & VIR_DOMAIN_SAVE_BYPASS_CACHE) != 0, - &wrapperFd, false, false); + &wrapperFd, &nondirectFd, false, false); if (fd < 0) goto cleanup; @@ -5812,13 +5813,14 @@ qemuDomainRestoreInternal(virConnectPtr conn, if (qemuProcessBeginJob(vm, VIR_DOMAIN_JOB_OPERATION_RESTORE, flags) < 0) goto cleanup; - ret = qemuSaveImageStartVM(conn, driver, vm, &fd, data, path, + ret = qemuSaveImageStartVM(conn, driver, vm, &fd, nondirectFd, data, path, false, reset_nvram, VIR_ASYNC_JOB_START); qemuProcessEndJob(vm); cleanup: VIR_FORCE_CLOSE(fd); + VIR_FORCE_CLOSE(nondirectFd); if (virFileWrapperFdClose(wrapperFd) < 0) ret = -1; virFileWrapperFdFree(wrapperFd); @@ -5893,7 +5895,7 @@ qemuDomainSaveImageGetXMLDesc(virConnectPtr conn, const char *path, virCheckFlags(VIR_DOMAIN_SAVE_IMAGE_XML_SECURE, NULL); fd = qemuSaveImageOpen(driver, NULL, path, &def, &data, - false, NULL, false, false); + false, NULL, NULL, false, false); if (fd < 0) goto cleanup; @@ -5930,7 +5932,7 @@ qemuDomainSaveImageDefineXML(virConnectPtr conn, const char *path, state = 0; fd = qemuSaveImageOpen(driver, NULL, path, &def, &data, - false, NULL, true, false); + false, NULL, NULL, true, false); if (fd < 0) goto cleanup; @@ -6011,7 +6013,7 @@ qemuDomainManagedSaveGetXMLDesc(virDomainPtr dom, unsigned int flags) } if ((fd = qemuSaveImageOpen(driver, priv->qemuCaps, path, &def, &data, - false, NULL, false, false)) < 0) + false, NULL, NULL, false, false)) < 0) goto cleanup; ret = qemuDomainDefFormatXML(driver, priv->qemuCaps, def, flags); @@ -6069,13 +6071,14 @@ qemuDomainObjRestore(virConnectPtr conn, g_autoptr(virDomainDef) def = NULL; qemuDomainObjPrivate *priv = vm->privateData; int fd = -1; + int nondirectFd = -1; int ret = -1; g_autofree char *xmlout = NULL; virQEMUSaveData *data = NULL; virFileWrapperFd *wrapperFd = NULL; fd = qemuSaveImageOpen(driver, NULL, path, &def, &data, - bypass_cache, &wrapperFd, false, true); + bypass_cache, &wrapperFd, &nondirectFd, false, true); if (fd < 0) { if (fd == -3) ret = 1; @@ -6120,12 +6123,13 @@ qemuDomainObjRestore(virConnectPtr conn, virDomainObjAssignDef(vm, &def, true, NULL); - ret = qemuSaveImageStartVM(conn, driver, vm, &fd, data, path, + ret = qemuSaveImageStartVM(conn, driver, vm, &fd, nondirectFd, data, path, start_paused, reset_nvram, asyncJob); cleanup: virQEMUSaveDataFree(data); 
VIR_FORCE_CLOSE(fd); + VIR_FORCE_CLOSE(nondirectFd); if (virFileWrapperFdClose(wrapperFd) < 0) ret = -1; virFileWrapperFdFree(wrapperFd); @@ -6321,7 +6325,7 @@ qemuDomainObjStart(virConnectPtr conn, } ret = qemuProcessStart(conn, driver, vm, NULL, asyncJob, - NULL, -1, NULL, NULL, + NULL, -1, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags); virDomainAuditStart(vm, "booted", ret >= 0); if (ret >= 0) { diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index a0435a0572..0b1e03394e 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -3009,7 +3009,7 @@ qemuMigrationDstPrepare(virDomainObj *vm, } return qemuProcessIncomingDefNew(vm, listenAddress, - migrateFrom, fd, NULL, 0); + migrateFrom, fd, -1, NULL, 0); } diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f700390a8c..f5b2f9ea2b 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4688,6 +4688,7 @@ qemuProcessIncomingDefNew(virDomainObj *vm, const char *listenAddress, const char *migrateFrom, int fd, + int nondirectFd, const char *path, unsigned int flags) { @@ -4707,11 +4708,18 @@ qemuProcessIncomingDefNew(virDomainObj *vm, unsigned int fdsetId; off_t offset; - if ((offset = lseek(fd, 0, SEEK_CUR)) == -1) - offset = 0; + if (nondirectFd == -1) + offset = lseek(fd, 0, SEEK_CUR); + else + offset = lseek(nondirectFd, 0, SEEK_CUR); + + if (offset < 0) + goto error; inc->fdPassMigrate = qemuFDPassNew("migrate", priv); qemuFDPassAddFD(inc->fdPassMigrate, &fd, "-fd"); + if (nondirectFd != -1) + qemuFDPassAddFD(inc->fdPassMigrate, &nondirectFd, "-nondirect-fd"); qemuFDPassGetId(inc->fdPassMigrate, &fdsetId); inc->uri = g_strdup_printf("file:/dev/fdset/%u,offset=%#lx", fdsetId, offset); } else { @@ -8148,6 +8156,7 @@ qemuProcessStart(virConnectPtr conn, virDomainAsyncJob asyncJob, const char *migrateFrom, int migrateFd, + int migrateNondirectFd, const char *migratePath, virDomainMomentObj *snapshot, virNetDevVPortProfileOp vmop, @@ -8185,7 +8194,8 @@ qemuProcessStart(virConnectPtr conn, if (migrateFrom) { incoming = qemuProcessIncomingDefNew(vm, NULL, migrateFrom, - migrateFd, migratePath, flags); + migrateFd, migrateNondirectFd, + migratePath, flags); if (!incoming) goto stop; } @@ -8264,6 +8274,7 @@ qemuProcessStart(virConnectPtr conn, * @driver: qemu driver object * @vm: domain object * @fd: FD pointer of memory state file + * @nondirectFd: FD for memory state file, opened without O_DIRECT * @path: path to memory state file * @snapshot: internal snapshot to load when starting QEMU process or NULL * @data: data from memory state file or NULL @@ -8290,6 +8301,7 @@ qemuProcessStartWithMemoryState(virConnectPtr conn, virQEMUDriver *driver, virDomainObj *vm, int *fd, + int nondirectFd, const char *path, virDomainMomentObj *snapshot, virQEMUSaveData *data, @@ -8329,7 +8341,7 @@ qemuProcessStartWithMemoryState(virConnectPtr conn, priv->disableSlirp = true; if (qemuProcessStart(conn, driver, vm, cookie ? 
cookie->cpu : NULL, - asyncJob, migrateFrom, *fd, path, snapshot, + asyncJob, migrateFrom, *fd, nondirectFd, path, snapshot, VIR_NETDEV_VPORT_PROFILE_OP_RESTORE, start_flags) == 0) *started = true; diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index a5212ee56e..6e37dac2ca 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -62,6 +62,7 @@ qemuProcessIncomingDef *qemuProcessIncomingDefNew(virDomainObj *vm, const char *listenAddress, const char *migrateFrom, int fd, + int nondirectFd, const char *path, unsigned int flags); void qemuProcessIncomingDefFree(qemuProcessIncomingDef *inc); @@ -89,6 +90,7 @@ int qemuProcessStart(virConnectPtr conn, virDomainAsyncJob asyncJob, const char *migrateFrom, int stdin_fd, + int nondirectFd, const char *stdin_path, virDomainMomentObj *snapshot, virNetDevVPortProfileOp vmop, @@ -98,6 +100,7 @@ int qemuProcessStartWithMemoryState(virConnectPtr conn, virQEMUDriver *driver, virDomainObj *vm, int *fd, + int nondirectFd, const char *path, virDomainMomentObj *snapshot, virQEMUSaveData *data, diff --git a/src/qemu/qemu_saveimage.c b/src/qemu/qemu_saveimage.c index 2b0281895a..8e634862a6 100644 --- a/src/qemu/qemu_saveimage.c +++ b/src/qemu/qemu_saveimage.c @@ -610,6 +610,7 @@ qemuSaveImageGetCompressionProgram(const char *imageFormat, * @ret_data: returns structure filled with data from the image header * @bypass_cache: bypass cache when opening the file * @wrapperFd: returns the file wrapper structure + * @nondirectFd: returns file descriptor without O_DIRECT set * @open_write: open the file for writing (for updates) * @unlink_corrupt: remove the image file if it is corrupted * @@ -625,11 +626,14 @@ qemuSaveImageOpen(virQEMUDriver *driver, virQEMUSaveData **ret_data, bool bypass_cache, virFileWrapperFd **wrapperFd, + int *nondirectFd, bool open_write, bool unlink_corrupt) { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); VIR_AUTOCLOSE fd = -1; + VIR_AUTOCLOSE nondirect_fd = -1; + int read_fd; int ret = -1; g_autoptr(virQEMUSaveData) data = NULL; virQEMUSaveHeader *header; @@ -646,6 +650,13 @@ qemuSaveImageOpen(virQEMUDriver *driver, _("bypass cache unsupported by this system")); return -1; } + + /* Also open the file without O_DIRECT for reading header and for + * qemu to use when reading unaligned state + */ + if ((nondirect_fd = qemuDomainOpenFile(cfg, NULL, path, oflags, NULL)) < 0) + return -1; + oflags |= directFlag; } @@ -655,7 +666,12 @@ qemuSaveImageOpen(virQEMUDriver *driver, data = g_new0(virQEMUSaveData, 1); header = &data->header; - if (saferead(fd, header, sizeof(*header)) != sizeof(*header)) { + if (nondirect_fd != -1) + read_fd = nondirect_fd; + else + read_fd = fd; + + if (saferead(read_fd, header, sizeof(*header)) != sizeof(*header)) { if (unlink_corrupt) { if (unlink(path) < 0) { virReportSystemError(errno, @@ -732,7 +748,7 @@ qemuSaveImageOpen(virQEMUDriver *driver, data->xml = g_new0(char, xml_len); - if (saferead(fd, data->xml, xml_len) != xml_len) { + if (saferead(read_fd, data->xml, xml_len) != xml_len) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("failed to read domain XML")); return -1; @@ -741,7 +757,7 @@ qemuSaveImageOpen(virQEMUDriver *driver, if (cookie_len > 0) { data->cookie = g_new0(char, cookie_len); - if (saferead(fd, data->cookie, cookie_len) != cookie_len) { + if (saferead(read_fd, data->cookie, cookie_len) != cookie_len) { virReportError(VIR_ERR_OPERATION_FAILED, "%s", _("failed to read cookie")); return -1; @@ -761,6 +777,10 @@ qemuSaveImageOpen(virQEMUDriver 
*driver, *ret_def = g_steal_pointer(&def); *ret_data = g_steal_pointer(&data); + if (nondirectFd) { + *nondirectFd = nondirect_fd; + nondirect_fd = -1; + } ret = fd; fd = -1; @@ -774,6 +794,7 @@ qemuSaveImageStartVM(virConnectPtr conn, virQEMUDriver *driver, virDomainObj *vm, int *fd, + int nondirectFd, virQEMUSaveData *data, const char *path, bool start_paused, @@ -794,7 +815,7 @@ qemuSaveImageStartVM(virConnectPtr conn, if (header->features & QEMU_SAVE_FEATURE_MAPPED_RAM) start_flags |= VIR_QEMU_PROCESS_START_MAPPED_RAM; - if (qemuProcessStartWithMemoryState(conn, driver, vm, fd, path, NULL, data, + if (qemuProcessStartWithMemoryState(conn, driver, vm, fd, nondirectFd, path, NULL, data, asyncJob, start_flags, "restored", &started) < 0) { goto cleanup; diff --git a/src/qemu/qemu_saveimage.h b/src/qemu/qemu_saveimage.h index 81d93bf33c..0a8b35158a 100644 --- a/src/qemu/qemu_saveimage.h +++ b/src/qemu/qemu_saveimage.h @@ -67,12 +67,13 @@ qemuSaveImageStartVM(virConnectPtr conn, virQEMUDriver *driver, virDomainObj *vm, int *fd, + int nondirectFd, virQEMUSaveData *data, const char *path, bool start_paused, bool reset_nvram, virDomainAsyncJob asyncJob) - ATTRIBUTE_NONNULL(4) ATTRIBUTE_NONNULL(5) ATTRIBUTE_NONNULL(6); + ATTRIBUTE_NONNULL(4) ATTRIBUTE_NONNULL(6) ATTRIBUTE_NONNULL(7); int qemuSaveImageOpen(virQEMUDriver *driver, @@ -82,6 +83,7 @@ qemuSaveImageOpen(virQEMUDriver *driver, virQEMUSaveData **ret_data, bool bypass_cache, virFileWrapperFd **wrapperFd, + int *nondirectFd, bool open_write, bool unlink_corrupt) ATTRIBUTE_NONNULL(3) ATTRIBUTE_NONNULL(4); diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c index 1e9e0e31d7..f137a79a8d 100644 --- a/src/qemu/qemu_snapshot.c +++ b/src/qemu/qemu_snapshot.c @@ -2115,7 +2115,7 @@ qemuSnapshotRevertExternalPrepare(virDomainObj *vm, memdata->path = snapdef->memorysnapshotfile; memdata->fd = qemuSaveImageOpen(driver, NULL, memdata->path, &savedef, &memdata->data, - false, NULL, + false, NULL, NULL, false, false); if (memdata->fd < 0) @@ -2359,7 +2359,7 @@ qemuSnapshotRevertActive(virDomainObj *vm, virDomainObjAssignDef(vm, config, true, NULL); if (qemuProcessStartWithMemoryState(snapshot->domain->conn, driver, vm, - &memdata.fd, memdata.path, loadSnap, + &memdata.fd, -1, memdata.path, loadSnap, memdata.data, VIR_ASYNC_JOB_SNAPSHOT, start_flags, "from-snapshot", &started) < 0) { @@ -2513,7 +2513,7 @@ qemuSnapshotRevertInactive(virDomainObj *vm, start_flags |= paused ? VIR_QEMU_PROCESS_START_PAUSED : 0; rc = qemuProcessStart(snapshot->domain->conn, driver, vm, NULL, - VIR_ASYNC_JOB_SNAPSHOT, NULL, -1, NULL, NULL, + VIR_ASYNC_JOB_SNAPSHOT, NULL, -1, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, start_flags); virDomainAuditStart(vm, "from-snapshot", rc >= 0); @@ -2991,7 +2991,7 @@ qemuSnapshotDeleteExternalPrepare(virDomainObj *vm, if (!virDomainObjIsActive(vm)) { if (qemuProcessStart(NULL, driver, vm, NULL, VIR_ASYNC_JOB_SNAPSHOT, - NULL, -1, NULL, NULL, + NULL, -1, -1, NULL, NULL, VIR_NETDEV_VPORT_PROFILE_OP_CREATE, VIR_QEMU_PROCESS_START_PAUSED) < 0) { return -1; -- 2.44.0

On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
They might seem a tiny bit hacky, but it's not that big of a deal, I think. You could eliminate two conditions by making the first FD always non-direct (as in either there is no BYPASS_CACHE or it's already wrapped by the I/O helper), but it would complicate other things in the code and would get even more hairy IMHO.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
By the time I looked at this series the direct-io work had already gone in, but there is still the need for the second descriptor to do some unaligned I/O. From the QEMU patches I'm not sure whether you also need to set the direct-io migration capability/flag when migrating to an fdset. Maybe that's needed for migration into a file directly.
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
I was wondering whether it would make sense to use user-space block I/O, but we'd have to use some compression on a block-by-block basis, and since you need to be able to compress each write separately, that means you might just save a few bytes here and there. And on top of that you'd have to compress each individual block, and that block needs to be allocated as a whole, so no space would be saved at all. So that does not make sense unless there is some new format.

And compression after the save is finished is in my opinion kind of pointless. You don't save time and you only save disk space _after_ the compression step is done. Not to mention you'd have to uncompress it again before starting QEMU from it. I'd be fine with making users choose between compression and mapped-ram, at least for now. They can compress the resulting file on their own.
Note the logical file size of mapped-ram saved images is slightly larger than guest RAM size, so the files are often much larger than the files produced by the existing, sequential format. However, actual blocks written to disk is often lower with mapped-ram saved images. E.g. a saved image from a 30G, freshly booted, idle guest results in the following 'Size' and 'Blocks' values reported by stat(1)
              Size         Blocks
sequential    998595770    1950392
mapped-ram    34368584225  1800456
With the same guest running a workload that dirties memory
              Size         Blocks
sequential    33173330615  64791672
mapped-ram    34368578210  64706944
That's fine and even better. It saves space; the only thing that everyone needs to keep in mind is to treat it as a sparse file.

The other option for the space saving would be to consolidate the streamed changes in the resulting file, but for little to no gain. The mapped-ram is better.
Thanks for any comments on this RFC!
[1] https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/mapp... [2] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/K4BD... [3] https://mail.gnu.org/archive/html/qemu-devel/2024-05/msg04432.html
Some more teeny tiny comments in two patches will follow. Martin
Jim Fehlig (9): qemu: Enable mapped-ram migration capability qemu_fd: Add function to retrieve fdset ID qemu: Add function to get migration params for save qemu: Add a 'features' element to save image header and bump version qemu: conf: Add setting for save image version qemu: Add support for mapped-ram on save qemu: Enable mapped-ram on restore qemu: Support O_DIRECT with mapped-ram on save qemu: Support O_DIRECT with mapped-ram on restore
src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 6 + src/qemu/qemu_conf.c | 8 ++ src/qemu/qemu_conf.h | 1 + src/qemu/qemu_driver.c | 25 ++-- src/qemu/qemu_fd.c | 18 +++ src/qemu/qemu_fd.h | 3 + src/qemu/qemu_migration.c | 99 ++++++++++++++- src/qemu/qemu_migration.h | 11 +- src/qemu/qemu_migration_params.c | 20 +++ src/qemu/qemu_migration_params.h | 4 + src/qemu/qemu_monitor.c | 40 ++++++ src/qemu/qemu_monitor.h | 5 + src/qemu/qemu_process.c | 63 +++++++--- src/qemu/qemu_process.h | 16 ++- src/qemu/qemu_saveimage.c | 187 +++++++++++++++++++++++------ src/qemu/qemu_saveimage.h | 20 ++- src/qemu/qemu_snapshot.c | 12 +- src/qemu/test_libvirtd_qemu.aug.in | 1 + 19 files changed, 455 insertions(+), 85 deletions(-)
-- 2.44.0

On Wed, Aug 07, 2024 at 02:32:57PM +0200, Martin Kletzander wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
They might seem a tiny bit hacky, but it's not that big of a deal, I think.
You could eliminate two conditions by making the first FD always non-direct (as in either there is no BYPASS_CACHE or it's already wrapped by the I/O helper), but it would complicate other things in the code and would get even more hairy IMHO.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
By the time I looked at this series the direct-io work had already gone in, but there is still the need for the second descriptor to do some unaligned I/O.
From the QEMU patches I'm not sure whether you also need to set the direct-io migration capability/flag when migrating to an fdset. Maybe that's needed for migration into a file directly.
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
I was wondering whether it would make sense to use user-space block I/O, but we'd have to use some compression on a block-by-block basis, and since you need to be able to compress each write separately, that means you might just save a few bytes here and there. And on top of that you'd have to compress each individual block, and that block needs to be allocated as a whole, so no space would be saved at all. So that does not make sense unless there is some new format.
And compression after the save is finished is in my opinion kind of pointless. You don't save time and you only save disk space _after_ the compression step is done. Not to mention you'd have to uncompress it again before starting QEMU from it. I'd be fine with making users choose between compression and mapped-ram, at least for now. They can compress the resulting file on their own.
That argument for compressing on their own applies to the existing code too. The reason we want it in libvirt is that we make it 'just work' without requiring apps to have a login shell to the hypervisor to run commands out of band.

So basically it depends on whether disk space is more important than overall wallclock time. It might still be worthwhile if the use of multifd with mapped-ram massively reduces the overall save duration and we also had a parallelized compression tool we were spawning; eg xz can be told to use all CPU threads.

With regards,
Daniel
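Purely as an illustration of what spawning a parallelized compressor could look like, here is a minimal sketch that pipes already-saved data through xz with one worker per core (-T0); the function name and output path are made up and this is not how libvirt wires up compression today:

  #define _POSIX_C_SOURCE 200809L
  #include <stdio.h>

  /* Sketch: stream data through xz using all CPU cores. */
  static int
  compress_with_xz(const char *data, size_t len)
  {
      FILE *xz = popen("xz -T0 -c > /tmp/vm.sav.xz", "w");

      if (!xz)
          return -1;

      if (fwrite(data, 1, len, xz) != len) {
          pclose(xz);
          return -1;
      }

      return pclose(xz) == 0 ? 0 : -1;
  }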

On 8/7/24 09:49, Daniel P. Berrangé wrote:
On Wed, Aug 07, 2024 at 02:32:57PM +0200, Martin Kletzander wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
They might seem a tiny bit hacky, but it's not that big of a deal, I think.
You could eliminate two conditions by making the first FD always non-direct (as in either there is no BYPASS_CACHE or it's already wrapped by the I/O helper), but it would complicate other things in the code and would get even more hairy IMHO.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
By the time I looked at this series the direct-io work had already gone in, but there is still the need for the second descriptor to do some unaligned I/O.
From the QEMU patches I'm not sure whether you also need to set the direct-io migration capability/flag when migrating to an fdset. Maybe that's needed for migration into a file directly.
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
I was wondering whether it would make sense to use user-space block I/O, but we'd have to use some compression on a block-by-block basis, and since you need to be able to compress each write separately, that means you might just save a few bytes here and there. And on top of that you'd have to compress each individual block, and that block needs to be allocated as a whole, so no space would be saved at all. So that does not make sense unless there is some new format.
And compression after the save is finished is in my opinion kind of pointless. You don't save time and you only save disk space _after_ the compression step is done. Not to mention you'd have to uncompress it again before starting QEMU from it. I'd be fine with making users choose between compression and mapped-ram, at least for now. They can compress the resulting file on their own.
That argument for compressing on their own applies to the existing code too. The reason we want it in libvirt is that we make it 'just work' without requiring apps to have a login shell to the hypervisor to run commands out of band.
So basically it depends on whether disk space is more important than overall wallclock time. It might still be worthwhile if the use of multifd with mapped-ram massively reduces the overall save duration and we also had a parallelized compression tool we were spawning; eg xz can be told to use all CPU threads.
mapped-ram+direct-io+multifd is quite an improvement over the current save/restore mechanism. The following table shows some save/restore stats from a guest with 32G RAM, 30G dirty, 1 vcpu in tight loop dirtying memory.

                       |  save   | restore |              |
                       |  time   |  time   |     Size     |  Blocks
-----------------------+---------+---------+--------------+---------
legacy                 | 25.800s | 14.332s |  33154309983 | 64754512
-----------------------+---------+---------+--------------+---------
mapped-ram             | 18.742s | 15.027s |  34368559228 | 64617160
-----------------------+---------+---------+--------------+---------
legacy + direct IO     | 13.115s | 18.050s |  33154310496 | 64754520
-----------------------+---------+---------+--------------+---------
mapped-ram + direct IO | 13.623s | 15.959s |  34368557392 | 64662040
-----------------------+---------+---------+--------------+---------
mapped-ram + direct IO |         |         |              |
 + multifd-channels=8  |  6.994s |  6.470s |  34368554980 | 64665776
--------------------------------------------------------------------

In all cases, the save and restore operations are to/from a block device comprised of two NVMe disks in RAID0 configuration with xfs (~8600MiB/s). The values in the 'save time' and 'restore time' columns were scraped from the 'real' time reported by time(1). The 'Size' and 'Blocks' columns were provided by the corresponding outputs of stat(1).

Regards,
Jim

Hi Martin, On 8/7/24 06:32, Martin Kletzander wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
They might seem a tiny bit hacky, but it's not that big of a deal, I think.
In the meantime I've been working on a V1 that's a bit less hacky :-).
You could eliminate two conditions by making the first FD always non-direct (as in either there is no BYPASS_CACHE or it's already wrapped by the I/O helper), but it would complicate other things in the code and would get even more hairy IMHO.
In V1, the first FD is always buffered. The directio one, if needed, is opened a bit later in the save process.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
By the time I looked at this series the direct-io work had already gone in, but there is still the need for the second descriptor to do some unaligned I/O.
Correct. However, ATM I don't know how to detect if qemu supports the direct-io migration parameter.
From the QEMU patches I'm not sure whether you also need to set the direct-io migration capability/flag when migrating to an fdset. Maybe that's needed for migration into a file directly.
The direct-io migration parameter must be set to 'true', and QEMU must be provided a second FD without O_DIRECT set.
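For illustration only, a minimal sketch of that dual-descriptor setup using plain POSIX open(2); it leaves out the fdset plumbing, error reporting, and security labeling that the actual patches handle, and the function name is made up:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  /* Open the save file twice: an O_DIRECT FD for the bulk, aligned RAM
   * writes and a buffered FD for the header and unaligned state. Both
   * descriptors would then be passed to QEMU in a single fdset. */
  static int
  open_save_fds(const char *path, int *direct_fd, int *buffered_fd)
  {
      *buffered_fd = open(path, O_WRONLY | O_CREAT, 0600);
      if (*buffered_fd < 0)
          return -1;

      *direct_fd = open(path, O_WRONLY | O_DIRECT);
      if (*direct_fd < 0) {
          close(*buffered_fd);
          *buffered_fd = -1;
          return -1;
      }

      return 0;
  }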
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
I was wondering whether it would make sense to use user-space block I/O, but we'd have to use some compression on a block-by-block basis, and since you need to be able to compress each write separately, that means you might just save a few bytes here and there. And on top of that you'd have to compress each individual block, and that block needs to be allocated as a whole, so no space would be saved at all. So that does not make sense unless there is some new format.
And compression after the save is finished is in my opinion kind of pointless. You don't save time and you only save disk space _after_ the compression step is done. Not to mention you'd have to uncompress it again before starting QEMU from it. I'd be fine with making users choose between compression and mapped-ram, at least for now. They can compress the resulting file on their own.
I'd prefer to make compression incompatible with mapped-ram too. Sounds like Daniel is agreeable to that as well. Regards, Jim
Note the logical file size of mapped-ram saved images is slightly larger than guest RAM size, so the files are often much larger than the files produced by the existing, sequential format. However, actual blocks written to disk is often lower with mapped-ram saved images. E.g. a saved image from a 30G, freshly booted, idle guest results in the following 'Size' and 'Blocks' values reported by stat(1)
              Size         Blocks
sequential    998595770    1950392
mapped-ram    34368584225  1800456
With the same guest running a workload that dirties memory
              Size         Blocks
sequential    33173330615  64791672
mapped-ram    34368578210  64706944
That's fine and even better. It saves space; the only thing that everyone needs to keep in mind is to treat it as a sparse file.
The other option for the space saving would be to consolidate the streamed changes in the resulting file, but for little to no gain. The mapped-ram is better.
Thanks for any comments on this RFC!
[1] https://gitlab.com/qemu-project/qemu/-/blob/master/docs/devel/migration/mapp... [2] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/K4BD... [3] https://mail.gnu.org/archive/html/qemu-devel/2024-05/msg04432.html
Some more teeny tiny comments in two patches will follow.
Martin
Jim Fehlig (9): qemu: Enable mapped-ram migration capability qemu_fd: Add function to retrieve fdset ID qemu: Add function to get migration params for save qemu: Add a 'features' element to save image header and bump version qemu: conf: Add setting for save image version qemu: Add support for mapped-ram on save qemu: Enable mapped-ram on restore qemu: Support O_DIRECT with mapped-ram on save qemu: Support O_DIRECT with mapped-ram on restore
src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 6 + src/qemu/qemu_conf.c | 8 ++ src/qemu/qemu_conf.h | 1 + src/qemu/qemu_driver.c | 25 ++-- src/qemu/qemu_fd.c | 18 +++ src/qemu/qemu_fd.h | 3 + src/qemu/qemu_migration.c | 99 ++++++++++++++- src/qemu/qemu_migration.h | 11 +- src/qemu/qemu_migration_params.c | 20 +++ src/qemu/qemu_migration_params.h | 4 + src/qemu/qemu_monitor.c | 40 ++++++ src/qemu/qemu_monitor.h | 5 + src/qemu/qemu_process.c | 63 +++++++--- src/qemu/qemu_process.h | 16 ++- src/qemu/qemu_saveimage.c | 187 +++++++++++++++++++++++------ src/qemu/qemu_saveimage.h | 20 ++- src/qemu/qemu_snapshot.c | 12 +- src/qemu/test_libvirtd_qemu.aug.in | 1 + 19 files changed, 455 insertions(+), 85 deletions(-)
-- 2.44.0

On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
I'm a little on the edge about the user interface for the choice of formats - whether the version number knob in qemu.conf is the ideal approach or not.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
Going back to the original motivation for mapped-ram. The first key factor was that it will make it more viable to use multi-fd for parallelized saving and restoring, as it lets threads write concurrently without needing synchronization. The predictable worst case file size when the VM is live & dirtying memory was an added benefit.

With the streaming format, we can compress as we stream, so adding compression burns more CPU, but this is parallelized with the QEMU saving, so the wallclock time doesn't increase as badly.

With mapped-ram, if we want to compress, we need to wait for the image to be fully written, as we can be re-writing regions if the guest is live. IOW, compression cannot be parallelized with the file writing. So compression will add significant wallclock time, as well as having the transient extra disk usage penalty.

We can optimize to avoid the extra disk penalty by using discard on the intermediate file every time we read a chunk, eg this loop:

  while () {
      read 1 MB from save file
      discard 1 MB from save file
      write 1 MB to compressor pipe
  }

We can't avoid the wallclock penalty from compression, and as you say, we still need the iohelper chained after the compressor to get O_DIRECT.
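A minimal sketch of that read/discard/compress loop, assuming Linux fallocate(2) with FALLOC_FL_PUNCH_HOLE for the discard step and an already-open pipe to the compressor; the 1 MB chunk size and function names are illustrative:

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define CHUNK (1024 * 1024)

  /* Stream a save image into a compressor pipe, punching holes in the
   * source as we go so peak disk usage stays roughly one chunk above
   * the size of the compressed output. */
  static int
  compress_and_discard(int save_fd, int compressor_fd)
  {
      static char buf[CHUNK];
      off_t off = 0;
      ssize_t n;

      while ((n = pread(save_fd, buf, CHUNK, off)) > 0) {
          if (write(compressor_fd, buf, n) != n)
              return -1;
          /* Free the blocks just compressed; KEEP_SIZE preserves the
           * logical file size so later offsets stay valid. */
          if (fallocate(save_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                        off, n) < 0)
              return -1;
          off += n;
      }

      return n < 0 ? -1 : 0;
  }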
Note the logical file size of mapped-ram saved images is slightly larger than guest RAM size, so the files are often much larger than the files produced by the existing, sequential format. However, actual blocks written to disk is often lower with mapped-ram saved images. E.g. a saved image from a 30G, freshly booted, idle guest results in the following 'Size' and 'Blocks' values reported by stat(1)
              Size         Blocks
sequential    998595770    1950392
mapped-ram    34368584225  1800456
With the same guest running a workload that dirties memory
              Size         Blocks
sequential    33173330615  64791672
mapped-ram    34368578210  64706944
I'm a little concerned this difference between sparse and non-sparse formats could trip up existing applications that are using libvirt. eg, OpenStack uses the save/restore to file facilities and IIUC can upload the saved file to its image storage (glance). Experience tells us that applications will often not handle sparseness correctly if they have never been expecting it.

Overall I'm wondering if we need to give a direct choice to mgmt apps.

We added the save/restore variants that accept virTypedParameters, so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts 'stream' and 'mapped' as options. This choice would then influence whether we save in v2 or v3 format. On restore we don't need a parameter as we just probe the on-disk format.

As a documentation task we can then say that compression is incompatible with 'mapped'.

Annoyingly we already have a 'save_image_format' in qemu.conf, though taking 'raw', 'zstd', 'lzop', etc to choose the compression type. So we have a terminology clash.

We probably should have exposed this compression choice in the API too, via a VIR_DOMAIN_SAVE_PARAM_COMPRESSION typed parameter, taking values 'default', 'raw', 'zstd', 'lzop', etc, where 'default' (or omitted) means honour the qemu.conf setting and all others override qemu.conf.

Might I suggest we name the new API parameter VIR_DOMAIN_SAVE_PARAM_LAYOUT for the stream vs mapped choice?

With regards,
Daniel
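As a rough illustration of how a management app might use such a knob, here is a sketch built on the existing virDomainSaveParams() API. Note that the "format"/"mapped" parameter mirrors the VIR_DOMAIN_SAVE_PARAM_FORMAT idea proposed above and does not exist in libvirt today; only VIR_DOMAIN_SAVE_PARAM_FILE is a real constant:

  #include <libvirt/libvirt.h>

  /* Hypothetical: request the mapped-ram layout when saving a domain.
   * The "format" parameter is the proposal from this thread, not a
   * current libvirt API. */
  static int
  save_domain_mapped(virDomainPtr dom, const char *path)
  {
      virTypedParameterPtr params = NULL;
      int nparams = 0;
      int maxparams = 0;
      int ret = -1;

      if (virTypedParamsAddString(&params, &nparams, &maxparams,
                                  VIR_DOMAIN_SAVE_PARAM_FILE, path) < 0)
          goto cleanup;

      /* Proposed: choose the 'stream' (v2) or 'mapped' (v3) layout */
      if (virTypedParamsAddString(&params, &nparams, &maxparams,
                                  "format", "mapped") < 0)
          goto cleanup;

      ret = virDomainSaveParams(dom, params, nparams,
                                VIR_DOMAIN_SAVE_BYPASS_CACHE);

   cleanup:
      virTypedParamsFree(params, nparams);
      return ret;
  }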

On 8/7/24 09:45, Daniel P. Berrangé wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
This series is a RFC for support of QEMU's mapped-ram migration capability [1] for saving and restoring VMs. It implements the first part of the design approach we discussed for supporting parallel save/restore [2]. In summary, the approach is
1. Add mapped-ram migration capability 2. Steal an element from save header 'unused' for a 'features' variable and bump save version to 3. 3. Add /etc/libvirt/qemu.conf knob for the save format version, defaulting to latest v3 4. Use v3 (aka mapped-ram) by default 5. Use mapped-ram with BYPASS_CACHE for v3, old approach for v2 6. include: Define constants for parallel save/restore 7. qemu: Add support for parallel save. Implies mapped-ram, reject if v2 8. qemu: Add support for parallel restore. Implies mapped-ram. Reject if v2 9. tools: add parallel parameter to virsh save command 10. tools: add parallel parameter to virsh restore command
This series implements 1-5, with the BYPASS_CACHE support in patches 8 and 9 being quite hacky. They are included to discuss approaches to make them less hacky. See the patches for details.
I'm a little on the edge about the user interface for the choice of formats - whether the version number knob in qemu.conf is the ideal approach or not.
Yeah, I'm not fond of the idea either after seeing it in code and experimenting with the result.
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
Going back to the original motivation for mapped-ram. The first key factor was that it will make it more viable to use multi-fd for parallelized saving and restoring, as it lets threads write concurrently without needing synchronization. The predictable worst case file size when the VM is live & dirtying memory, was an added benefit.
Correct. The main motivation for mapped-ram is parallelizing save/restore. To that end, we could base the whole feature on whether parallel is requested. If so, we save with mapped-ram (failing if not supported by underlying qemu) and don't support compression. If parallel not requested, use existing save/restore implementation.
With the streaming format, we can compress as we stream, so adding compression burns more CPU, but this is parallelized with the QEMU saving, so the wallclock time doesn't increase as badly.
With mapped-ram, if we want to compress, we need to wait for the image to be fully written, as we can be re-writing regions if the guest is live. IOW, compression cannot be parallelized with the file writing. So compression will add significant wallclock time, as well as having the transient extra disk usage penalty.
We can optimize to avoid the extra disk penalty by using discard on the intermediate file every time we read a chunk. eg this loop:
while () {
    read 1 MB from save file
    discard 1 MB from save file
    write 1 MB to compressor pipe
}
We can't avoid the wallclock penalty from compression, and as you say, we still need the iohelper chained after the compressor to get O_DIRECT.
Note the logical file size of mapped-ram saved images is slightly larger than guest RAM size, so the files are often much larger than the files produced by the existing, sequential format. However, actual blocks written to disk is often lower with mapped-ram saved images. E.g. a saved image from a 30G, freshly booted, idle guest results in the following 'Size' and 'Blocks' values reported by stat(1)
              Size         Blocks
sequential    998595770    1950392
mapped-ram    34368584225  1800456
With the same guest running a workload that dirties memory
              Size         Blocks
sequential    33173330615  64791672
mapped-ram    34368578210  64706944
I'm a little concerned this difference between sparse and non-sparse formats could trip up existing applications that are using libvirt.
eg, OpenStack uses the save/restore to file facilities and IIUC can upload the saved file to its image storage (glance). Experience tells us that applications will often not handle sparseness correctly if they have never been expecting it.
We can avoid these cases if mapped-ram is only used in conjunction with parallel. To make use of the parallel feature, users would need to adjust their tooling, if needed. Any opinions on that?
Overall I'm wondering if we need to give a direct choice to mgmt apps.
We added the save/restore variants that accept virTypedParameters, so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts 'stream' and 'mapped' as options. This choice would then influence whether we save in v2 or v3 format. On restore we don't need a parameter as we just probe the on-disk format.
I worry these options will confuse users. IMO we'd need a basic description of the stream formats, along with caveats of file size, lack of compression support, etc. It seems like quite a bit for the average user to absorb.
As a documentation task we can then say that compression is incompatible with 'mapped'.
Annoyingly we already have a 'save_image_format' in qemu.conf, though taking 'raw', 'zstd', 'lzop', etc to choose the compression type. So we have a terminology clash.
We probably should have exposed this compression choice in the API too, via a VIR_DOMAIN_SAVE_PARAM_COMPRESSION typed parameter, taking values 'default', 'raw', 'zstd', 'lzop', etc, where 'default' (or omitted) means honour the qemu.conf setting and all others override qemu.conf
Agree with that suggestion. And VIR_DOMAIN_SAVE_PARAM_COMPRESSION would be easy to describe and clear to average user :-).
Might I suggest we name the new API parameter VIR_DOMAIN_SAVE_PARAM_LAYOUT for the stream vs mapped choice?
These stream formats seem like an implementation detail we should attempt to hide from the user. Regards, Jim

On Wed, Aug 07, 2024 at 12:04:18PM -0600, Jim Fehlig wrote:
On 8/7/24 09:45, Daniel P. Berrangé wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
Going back to the original motivation for mapped-ram. The first key factor was that it will make it more viable to use multi-fd for parallelized saving and restoring, as it lets threads write concurrently without needing synchronization. The predictable worst case file size when the VM is live & dirtying memory, was an added benefit.
Correct. The main motivation for mapped-ram is parallelizing save/restore. To that end, we could base the whole feature on whether parallel is requested. If so, we save with mapped-ram (failing if not supported by underlying qemu) and don't support compression. If parallel not requested, use existing save/restore implementation.
We had discussed that as a strategy previously and thought it was a good idea.

Your other reply to me, though, makes me reconsider. You're showing that /without/ O_DIRECT and multifd, mapped-ram is already a big win over the legacy method. The scale of that improvement is somewhat surprising to me, and if it is reliably so, that is compelling to want all the time.
Overall I'm wondering if we need to give a direct choice to mgmt apps.
We added the save/restore variants that accept virTypedParameters, so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts 'stream' and 'mapped' as options. This choice would then influence whether we save in v2 or v3 format. On restore we don't need a parameter as we just probe the on-disk format.
I worry these options will confuse users. IMO we'd need a basic description of the stream formats, along with caveats of file size, lack of compression support, etc. It seems like quite a bit for the average user to absorb.
As a documentation task we can then say that compression is incompatible with 'mapped'.
Annoyingly we already have a 'save_image_format' in qemu.conf, though taking 'raw', 'zstd', 'lzop', etc to choose the compression type. So we have a terminology clash.
We probably should have exposed this compression choice in the API too, via a VIR_DOMAIN_SAVE_PARAM_COMPRESSION typed parameter, taking values 'default', 'raw', 'zstd', 'lzop', etc, where 'default' (or omitted) means honour the qemu.conf setting and all others override qemu.conf
Agree with that suggestion. And VIR_DOMAIN_SAVE_PARAM_COMPRESSION would be easy to describe and clear to average user :-).
Might I suggest we name the new API parameter VIR_DOMAIN_SAVE_PARAM_LAYOUT for the stream vs mapped choice?
These stream formats seem like an implementation detail we should attempt to hide from the user.
Yeah it is rather an impl detail, but perhaps it is an impl detail we cannot avoid exposing users to. After all, they need to be aware of the use of sparseness in order to handle the files without ballooning up their size.

With regards,
Daniel

On 8/7/24 12:39, Daniel P. Berrangé wrote:
On Wed, Aug 07, 2024 at 12:04:18PM -0600, Jim Fehlig wrote:
On 8/7/24 09:45, Daniel P. Berrangé wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
Going back to the original motivation for mapped-ram. The first key factor was that it will make it more viable to use multi-fd for parallelized saving and restoring, as it lets threads write concurrently without needing synchronization. The predictable worst case file size when the VM is live & dirtying memory, was an added benefit.
Correct. The main motivation for mapped-ram is parallelizing save/restore. To that end, we could base the whole feature on whether parallel is requested. If so, we save with mapped-ram (failing if not supported by underlying qemu) and don't support compression. If parallel not requested, use existing save/restore implementation.
We had discussed that as a strategy previously and thought it was a good idea.
Your other reply to me, though, makes me reconsider. You're showing that /without/ O_DIRECT and multifd, mapped-ram is already a big win over the legacy method. The scale of that improvement is somewhat surprising to me, and if it is reliably so, that is compelling to want all the time.
How do we have it all of the time without the issues of sparseness you describe? Seems it will always be opt-in, unless using the new parallel option.
Overall I'm wondering if we need to give a direct choice to mgmt apps.
We added the save/restore variants that accept virTypedParameters, so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts 'stream' and 'mapped' as options. This choice would then influence whether we save in v2 or v3 format. On restore we don't need a parameter as we just probe the on-disk format.
I worry these options will confuse users. IMO we'd need a basic description of the stream formats, along with caveats of file size, lack of compression support, etc. It seems like quite a bit for the average user to absorb.
As a documentation task we can then say that compression is incompatible with 'mapped'.
Annoyingly we already have a 'save_image_format' in qemu.conf, though taking 'raw', 'zstd', 'lzop', etc to choose the compression type. So we have a terminology clash.
We probably should have exposed this compression choice in the API too, via a VIR_DOMAIN_SAVE_PARAM_COMPRESSION typed parameter, taking values 'default', 'raw', 'zstd', 'lzop', etc, where 'default' (or omitted) means honour the qemu.conf setting and all others override qemu.conf
Agree with that suggestion. And VIR_DOMAIN_SAVE_PARAM_COMPRESSION would be easy to describe and clear to average user :-).
Might I suggest we name the new API parameter VIR_DOMAIN_SAVE_PARAM_LAYOUT for the stream vs mapped choice?
These stream formats seem like an implementation detail we should attempt to hide from the user.
Yeah it is rather an impl detail, but perhaps it is an impl detail we cannot avoid exposing users to. After all, they need to be aware of the use of sparseness in order to handle the files without ballooning up their size.
As for a name for the new parameter, I can't think of anything better than your suggestion of VIR_DOMAIN_SAVE_PARAM_LAYOUT. I much prefer _FORMAT, but yeah, that ship has sailed.

Before working on the API parameter, I'll summarize my understanding of the options for supporting mapped-ram:

1. No user control. If qemu supports mapped-ram, use it
2. Control at driver level
3. Control at the API level

We don't want 1 due to incompatibilities with compression, sparseness, etc. Experience with the compression feature has shown that 2 is inflexible. Control at the API level is preferred for maximum user flexibility.

Regards,
Jim

On 8/8/24 17:46, Jim Fehlig wrote:
On 8/7/24 12:39, Daniel P. Berrangé wrote:
On Wed, Aug 07, 2024 at 12:04:18PM -0600, Jim Fehlig wrote:
On 8/7/24 09:45, Daniel P. Berrangé wrote:
On Thu, Jun 13, 2024 at 04:43:14PM -0600, Jim Fehlig via Devel wrote:
The QEMU mapped-ram capability currently does not support directio. Fabino is working on that now [3]. This complicates merging support in libvirt. I don't think it's reasonable to enable mapped-ram by default when BYPASS_CACHE cannot be supported. Should we wait until the mapped-ram directio support is merged in QEMU before supporting mapped-ram in libvirt?
For the moment, compression is ignored in the new save version. Currently, libvirt connects the output of QEMU's save stream to the specified compression program via a pipe. This approach is incompatible with mapped-ram since the fd provided to QEMU must be seekable. One option is to reopen and compress the saved image after the actual save operation has completed. This has the downside of requiring the iohelper to handle BYPASS_CACHE, which would preclude us from removing it sometime in the future. Other suggestions much welcomed.
Going back to the original motivation for mapped-ram. The first key factor was that it will make it more viable to use multi-fd for parallelized saving and restoring, as it lets threads write concurrently without needing synchronization. The predictable worst case file size when the VM is live & dirtying memory, was an added benefit.
Correct. The main motivation for mapped-ram is parallelizing save/restore. To that end, we could base the whole feature on whether parallel is requested. If so, we save with mapped-ram (failing if not supported by underlying qemu) and don't support compression. If parallel not requested, use existing save/restore implementation.
We had discussed that as a strategy previously and thought it was a good idea.
Your other reply to me, though, makes me reconsider. You're showing that /without/ O_DIRECT and multifd, mapped-ram is already a big win over the legacy method. The scale of that improvement is somewhat surprising to me, and if it is reliably so, that is compelling to want all the time.
How do we have it all of the time without the issues of sparseness you describe? Seems it will always be opt-in, unless using the new parallel option.
Overall I'm wondering if we need to give a direct choice to mgmt apps.
We added the save/restore variants that accept virTypedParameters, so we could define a VIR_DOMAIN_SAVE_PARAM_FORMAT, which accepts 'stream' and 'mapped' as options. This choice would then influence whether we save in v2 or v3 format. On restore we don't need a parameter as we just probe the on-disk format.
I worry these options will confuse users. IMO we'd need a basic description of the stream formats, along with caveats of file size, lack of compression support, etc. It seems like quite a bit for the average user to absorb.
As a documentation task we can then say that compression is incompatible with 'mapped'.
Annoyingly we already have a 'save_image_format' in qemu.conf, though taking 'raw', 'zstd', 'lzop', etc to choose the compression type. So we have a terminology clash.
We probably should have exposed this compression choice in the API too, via a VIR_DOMAIN_SAVE_PARAM_COMPRESSION typed parameter, taking values 'default', 'raw', 'zstd', 'lzop', etc, where 'default' (or omitted) means honour the qemu.conf setting and all others override qemu.conf
Agree with that suggestion. And VIR_DOMAIN_SAVE_PARAM_COMPRESSION would be easy to describe and clear to average user :-).
Might I suggest we name the new API parameter VIR_DOMAIN_SAVE_PARAM_LAYOUT for the stream vs mapped choice?
These stream formats seem like an implementation detail we should attempt to hide from the user.
Yeah it is rather an impl detail, but perhaps it is an impl detail we cannot avoid exposing users to. After all, they need to be aware of the use of sparseness in order to handle the files without ballooning up their size.
As for a name for the new parameter, I can't think of anything better than your suggestion of VIR_DOMAIN_SAVE_PARAM_LAYOUT. I much prefer _FORMAT, but yeah, that ship has sailed.
Before working on the API parameter, I'll summarize my understanding of the options for supporting mapped-ram:
1. No user control. If qemu supports mapped-ram, use it
2. Control at driver level
Forgot to mention: I sent the implementation for this option since it was ready anyhow:
https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/MNBHG...

Regards,
Jim

On 8/7/24 09:45, Daniel P. Berrangé wrote:
I'm a little on the edge about the user interface for the choice of formats - whether the version number knob in qemu.conf is the ideal approach or not.
With the streaming format, we can compress as we stream, so adding compression burns more CPU, but this is parallelized with the QEMU saving, so the wallclock time doesn't increase as badly.
With mapped-ram, if we want to compress, we need to wait for the image to be fully written, as we can be re-writing regions if the guest is live. IOW, compression cannot be parallelized with the file writing. So compression will add significant wallclock time, as well as having the transient extra disk usage penalty.
We can optimize to avoid the extra disk penalty by using discard on the intermediate file every time we read a chunk, e.g. this loop:

    while () {
        read 1 MB from save file
        discard 1 MB from save file
        write 1 MB to compressor pipe
    }
We can't avoid the wallclock penalty from compression, and as you say, we still need the iohelper chained after the compressor to get O_DIRECT.
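A rough sketch of that read/discard/write loop, purely as an illustration and not the libvirt iohelper: it assumes Linux fallocate() with FALLOC_FL_PUNCH_HOLE support on the filesystem holding the save file, and it glosses over short writes to the compressor pipe.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    #define CHUNK (1024 * 1024)

    /* Stream the save file into the compressor's stdin while punching holes
     * behind the read position, so the intermediate file's blocks are
     * released as soon as they have been consumed. */
    static int
    stream_and_discard(int save_fd, int compressor_fd)
    {
        char *buf = malloc(CHUNK);
        off_t off = 0;
        ssize_t n;

        if (!buf)
            return -1;

        while ((n = pread(save_fd, buf, CHUNK, off)) > 0) {
            if (write(compressor_fd, buf, n) != n)   /* short writes ignored */
                break;

            /* KEEP_SIZE preserves the logical size; PUNCH_HOLE frees the
             * underlying blocks of the range we just consumed. */
            if (fallocate(save_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          off, n) < 0)
                break;

            off += n;
        }

        free(buf);
        return n == 0 ? 0 : -1;
    }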
I'm a little concerned this difference between sparse and non-sparse formats could trip up existing applications that are using libvirt.
e.g., OpenStack uses the save/restore-to-file facilities and IIUC can upload the saved file to its image storage (Glance). Experience tells us that applications will often not handle sparseness correctly if they have not been expecting it.
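To illustrate what "handling sparseness" means in practice, here is a trivial check an application could do before copying or uploading a saved image. It is only a sketch, assuming Linux/POSIX stat(2) semantics where st_blocks is counted in 512-byte units.

    #include <stdbool.h>
    #include <sys/stat.h>

    /* Heuristic: the image is sparse when the blocks actually allocated
     * cover less than its logical size, as with a mapped-ram save of a
     * mostly idle guest. */
    static bool
    image_is_sparse(const char *path)
    {
        struct stat st;

        if (stat(path, &st) < 0)
            return false;

        return (off_t)st.st_blocks * 512 < st.st_size;
    }

A sparseness-aware consumer would then copy with something like cp --sparse=always, or walk the data extents via lseek() with SEEK_DATA/SEEK_HOLE, rather than reading the full logical size.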
Thinking about this more, and your previous idea about abusing the header 'compressed' field [1], I'm beginning to warm to the idea of adding a new item to the list currently accepted by save_image_format. I'm also warming to Fabiano's suggestion to call the new item 'sparse' :-).

AFAICT, from a user perspective, save_image_format does not imply compression. The implication primarily stems from variable and function naming in the code :-). The current documentation of save_image_format may slightly imply compression, but we can easily improve that. The setting already accepts 'raw' for the existing (non-compressed) stream format.

Any opinions on adding 'sparse' as another supported save_image_format?

Regards,
Jim

[1] https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/message/PP3X...
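In qemu.conf terms, the proposal would look something like the snippet below. Note that 'sparse' is only the value suggested above, not a setting accepted at the time of this thread; the existing values select the compression program for the stream format.

    # /etc/libvirt/qemu.conf
    #
    # Existing: choose the compression program for the stream format,
    # with "raw" meaning uncompressed:
    #save_image_format = "zstd"
    #
    # Proposed: select the sparse, mapped-ram based layout instead:
    save_image_format = "sparse"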

On 8/12/24 17:16, Jim Fehlig wrote:
E.g. something like this small series:
https://lists.libvirt.org/archives/list/devel@lists.libvirt.org/thread/BXJVV...

Regards,
Jim