[libvirt] [PATCH] RFC: Support QEMU live upgrade

Hi all,

Recently QEMU developers have been working on a feature to allow upgrading a live QEMU instance to a new version without restarting the VM. This is implemented as live migration between the old and new QEMU process on the same host [1]. Here is the use case:

1) Guests are running QEMU release 1.6.1.
2) Admin installs QEMU release 1.6.2 via RPM or deb.
3) Admin starts a new VM using the updated QEMU binary, and asks the old QEMU process to migrate the VM to the newly started VM.

I think it will be very useful to support QEMU live upgrade in libvirt. After some investigation, I found that migrating to the same host breaks the current migration code. I'd like to propose a new work flow for QEMU live migration; it implements the above step 3). I add a new API named virDomainQemuLiveUpgrade, and a new domain command in virsh named qemu-live-upgrade. The work flow of virDomainQemuLiveUpgrade is as follows:

  newDef = deep copy of oldVm definition
  newVm = create VM using newDef, start QEMU process with all vCPUs paused
  migrate oldVm to newVm over a Unix socket
  shut down oldVm
  newPid = newVm->pid
  finalDef = live deep copy of newVm definition
  drop newVm from the qemu domain table without shutting down its QEMU process
  assign finalDef to oldVm
  oldVm attaches to QEMU process newPid using finalDef
  resume all vCPUs in oldVm

I wrote an RFC patch to demo this work flow. To try my patch, first apply the patch, build and install libvirt, then create and start a KVM virtual machine, and finally run the following command:

  virsh qemu-live-upgrade your_domain_name

Check the output of "virsh list" and "ps aux | grep qemu": you will find that the virtual machine gets a new id, and a new QEMU process is running with a different process id. I tested this patch on a Fedora 19 box with QEMU upgraded to 1.6.

I'd like to hear your opinions on this upgrading flow. I can improve the flow, and once we reach an agreement on it, I can start to write a more formal patch.

After the upgrading work flow becomes mature, I want to add a "page-flipping" flag to this new API. The reason is that migrating to localhost requires twice as much host memory as the original VM uses during the upgrade. Hence the ongoing development of using the vmsplice system call to move memory pages between two QEMU instances, so that the kernel can just re-map pages from the source QEMU process to the destination QEMU process in a zero-copy manner. This is expected to reduce memory consumption and speed up the whole procedure. The mechanism is based on Unix domain sockets, pipes and inter-process FD passing magic [2]. The page re-mapping mechanism is transparent to libvirt; all we need to do to trigger this magic is (1) set the QEMU migration capability to enable page re-mapping, (2) start the destination QEMU process with "-incoming unix:/path/to/socket", and (3) use the "unix:/path/to/socket" URI when issuing the QMP migration command to the source QEMU process. This RFC patch already uses a Unix socket to migrate the QEMU virtual machine. I'll add code to parse and inspect a "page-flipping" flag, and call the QEMU monitor to enable this capability.

Thanks very much for any comments on this patch!

[1] http://lists.nongnu.org/archive/html/qemu-devel/2013-08/msg02916.html
[2] http://lists.gnu.org/archive/html/qemu-devel/2013-09/msg04043.html
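For illustration, here is a minimal client-side sketch of how a management application could drive the proposed call once the patch below is applied. virDomainQemuLiveUpgrade() is the new API introduced by this RFC; everything else is the standard libvirt client API.

    /* Sketch only: exercises the *proposed* virDomainQemuLiveUpgrade() API.
     * The call does not exist in released libvirt; this just shows how a
     * client would consume the work flow described above. */
    #include <stdio.h>
    #include <libvirt/libvirt.h>

    int main(int argc, char **argv)
    {
        virConnectPtr conn;
        virDomainPtr dom, upgraded;
        int rc = 1;

        if (argc != 2) {
            fprintf(stderr, "usage: %s DOMAIN\n", argv[0]);
            return 1;
        }

        if (!(conn = virConnectOpen("qemu:///system")))
            return 1;

        if (!(dom = virDomainLookupByName(conn, argv[1]))) {
            virConnectClose(conn);
            return 1;
        }

        /* Proposed call: migrate the guest to a freshly exec'd QEMU binary
         * on the same host and re-attach the domain to the new process. */
        if (!(upgraded = virDomainQemuLiveUpgrade(dom, 0))) {
            fprintf(stderr, "live upgrade of %s failed\n", argv[1]);
        } else {
            virDomainFree(upgraded);
            rc = 0;
        }

        virDomainFree(dom);
        virConnectClose(conn);
        return rc;
    }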
From 2b659584f2cbe676c843ddeaf198c9a8368ff0ff Mon Sep 17 00:00:00 2001
From: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com>
Date: Wed, 30 Oct 2013 15:36:49 +0800
Subject: [PATCH] RFC: Support QEMU live upgrade
This patch is to support upgrading QEMU version without restarting the virtual machine. Add new API virDomainQemuLiveUpgrade(), and a new virsh command qemu-live-upgrade. virDomainQemuLiveUpgrade() migrates a running VM to the same host as a new VM with new name and new UUID. Then it shutdown the original VM and drop the new VM definition without shutdown the QEMU process of the new VM. At last it attaches original VM to the new QEMU process. Firstly the admin installs new QEMU package, then he runs virsh qemu-live-upgrade domain_name to trigger our virDomainQemuLiveUpgrade() upgrading flow. Signed-off-by: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com> --- include/libvirt/libvirt.h.in | 3 + src/driver.h | 4 + src/libvirt.c | 23 +++ src/libvirt_public.syms | 1 + src/qemu/qemu_driver.c | 339 +++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.c | 2 +- src/qemu/qemu_migration.h | 3 + src/remote/remote_driver.c | 1 + src/remote/remote_protocol.x | 19 ++- tools/virsh-domain.c | 139 ++++++++++++++++++ 10 files changed, 532 insertions(+), 2 deletions(-) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 80b2d78..7c87044 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1331,6 +1331,9 @@ int virDomainMigrateGetMaxSpeed(virDomainPtr domain, unsigned long *bandwidth, unsigned int flags); +virDomainPtr virDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags); + /** * VIR_NODEINFO_MAXCPUS: * @nodeinfo: virNodeInfo instance diff --git a/src/driver.h b/src/driver.h index 8cd164a..1bafa98 100644 --- a/src/driver.h +++ b/src/driver.h @@ -686,6 +686,9 @@ typedef int const char *args, char ***models, unsigned int flags); +typedef virDomainPtr +(*virDrvDomainQemuLiveUpgrade)(virDomainPtr domain, + unsigned int flags); typedef int (*virDrvDomainGetJobInfo)(virDomainPtr domain, @@ -1339,6 +1342,7 @@ struct _virDriver { virDrvDomainMigrateFinish3Params domainMigrateFinish3Params; virDrvDomainMigrateConfirm3Params domainMigrateConfirm3Params; virDrvConnectGetCPUModelNames connectGetCPUModelNames; + virDrvDomainQemuLiveUpgrade domainQemuLiveUpgrade; }; diff --git a/src/libvirt.c b/src/libvirt.c index 90608ab..9e5ff8a 100644 --- a/src/libvirt.c +++ b/src/libvirt.c @@ -7524,6 +7524,29 @@ error: /** + * virDomainQemuLiveUpgrade: + * @domain: a domain object + * @flags: bitwise-OR of flags + * + * Live upgrade qemu binary version of the domain. + * + * Returns the new domain object if the upgrade was successful, + * or NULL in case of error. + */ +virDomainPtr +virDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags) +{ + VIR_DEBUG("domain=%p, flags=%x", domain, flags); + if (!domain->conn->driver->domainQemuLiveUpgrade) { + virLibConnError(VIR_ERR_INTERNAL_ERROR, __FUNCTION__); + return NULL; + } + return domain->conn->driver->domainQemuLiveUpgrade(domain, flags); +} + + +/** * virNodeGetInfo: * @conn: pointer to the hypervisor connection * @info: pointer to a virNodeInfo structure allocated by the user diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms index fe9b497..82f0b37 100644 --- a/src/libvirt_public.syms +++ b/src/libvirt_public.syms @@ -637,6 +637,7 @@ LIBVIRT_1.1.1 { LIBVIRT_1.1.3 { global: virConnectGetCPUModelNames; + virDomainQemuLiveUpgrade; } LIBVIRT_1.1.1; # .... define new API here using predicted next version number .... 
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index ef1359c..7cd76e0 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -15705,6 +15705,344 @@ qemuConnectGetCPUModelNames(virConnectPtr conn, } +static virDomainDefPtr +virDomainDefLiveCopy(virDomainDefPtr src, + virCapsPtr caps, + virDomainXMLOptionPtr xmlopt) +{ + char *xml; + virDomainDefPtr ret; + unsigned int flags = VIR_DOMAIN_XML_SECURE; + + /* Easiest to clone via a round-trip through XML. */ + if (!(xml = virDomainDefFormat(src, flags))) + return NULL; + + ret = virDomainDefParseString(xml, caps, xmlopt, -1, flags); + + VIR_FREE(xml); + return ret; +} + + +static virDomainDefPtr +qemuLiveUpgradeMiniBegin(virQEMUDriverPtr driver, virDomainObjPtr vm) { + virCapsPtr caps = NULL; + virDomainDefPtr newDef = NULL; + virDomainDefPtr result = NULL; + char *newName = NULL; + + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) + goto cleanup; + + if (!(newDef = virDomainDefCopy(vm->def, caps, driver->xmlopt, true))) + goto cleanup; + + if (virAsprintf(&newName, "%s_qemu_live_upgrade", vm->def->name) < 0) + goto cleanup; + + VIR_FREE(newDef->name); + newDef->name = newName; + + if (-1 == virUUIDGenerate(newDef->uuid)) + goto cleanup; + + result = newDef; + newDef = NULL; + +cleanup: + virDomainDefFree(newDef); + virObjectUnref(caps); + + return result; +} + +static virDomainObjPtr +qemuLiveUpgradeMiniPrepare(virQEMUDriverPtr driver, virConnectPtr conn, + virDomainDefPtr newDef) { + virDomainObjPtr newVm = NULL; + virDomainObjPtr result = NULL; + char *upgradeUri = NULL; + virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); + + newVm = virDomainObjListAdd(driver->domains, newDef, driver->xmlopt, + VIR_DOMAIN_OBJ_LIST_ADD_LIVE | + VIR_DOMAIN_OBJ_LIST_ADD_CHECK_LIVE, NULL); + if (!newVm) + goto cleanup; + newDef = NULL; + + if (virAsprintf(&upgradeUri, "unix:%s/qemu.live.upgrade.%s.sock", + cfg->libDir, newVm->def->name) < 0) + goto cleanup; + + if (qemuProcessStart(conn, driver, newVm, upgradeUri, -1, NULL, + NULL, VIR_NETDEV_VPORT_PROFILE_OP_MIGRATE_IN_START, + VIR_QEMU_PROCESS_START_PAUSED | + VIR_QEMU_PROCESS_START_AUTODESTROY) < 0) + goto cleanup; + + result = newVm; + newVm = NULL; + +cleanup: + if (newVm) + qemuDomainRemoveInactive(driver, newVm); + VIR_FREE(upgradeUri); + virDomainDefFree(newDef); + virObjectUnref(cfg); + + if (result) + virObjectUnlock(result); + return result; +} + +static bool +qemuLiveUpgradeMiniPerform(virQEMUDriverPtr driver, virDomainObjPtr vm, + const char *newName) { + char *upgradeSock = NULL; + qemuDomainObjPrivatePtr priv = vm->privateData; + bool result = false; + bool migrate = false; + int r = 0; + virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); + + if (virAsprintf(&upgradeSock, "%s/qemu.live.upgrade.%s.sock", + cfg->libDir, newName) < 0) + goto cleanup; + + if (qemuDomainObjEnterMonitorAsync(driver, vm, QEMU_ASYNC_JOB_NONE) < 0) + goto cleanup; + r = qemuMonitorMigrateToUnix(priv->mon, 0, upgradeSock); + qemuDomainObjExitMonitor(driver, vm); + migrate = true; + if (r < 0) + goto cleanup; + + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("guest unexpectedly quit")); + goto cleanup; + } + + if (qemuMigrationWaitForCompletion(driver, vm, QEMU_ASYNC_JOB_NONE, + NULL, true) < 0) + goto cleanup; + + if (virDomainObjGetState(vm, NULL) == VIR_DOMAIN_RUNNING) { + if (qemuMigrationSetOffline(driver, vm) < 0) + goto cleanup; + } + + result = true; + +cleanup: + /* QEMU memory pages has been re-mapped during the 
migrate, no way to + * continue the original VM */ + if (migrate) + qemuProcessStop(driver, vm, VIR_DOMAIN_SHUTOFF_MIGRATED, + QEMU_ASYNC_JOB_NONE); + VIR_FREE(upgradeSock); + virObjectUnref(cfg); + + return result; +} + +static bool +qemuLiveUpgradeMiniFinish(virQEMUDriverPtr driver, virDomainObjPtr newVm, + const char *origName, const unsigned char *origUuid, + virDomainDefPtr *newDef, bool *newMonJSON, + virDomainChrSourceDefPtr *newMonConf, + pid_t *newPid, const char **newPidFile) { + virCapsPtr caps = NULL; + qemuDomainObjPrivatePtr newPriv = newVm->privateData; + bool result = false; + size_t i = 0; + + virDomainDefPtr tmpDef = NULL; + virDomainChrSourceDefPtr tmpMonConf = NULL; + char *tmpPidFile = NULL; + + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) + goto abort; + + newVm->persistent = 0; + qemuProcessAutoDestroyRemove(driver, newVm); + + if (!(tmpDef = virDomainDefLiveCopy(newVm->def, caps, driver->xmlopt))) + goto abort; + + VIR_FREE(tmpDef->name); + if (VIR_STRDUP(tmpDef->name, origName) < 0) { + goto abort; + } + + for (i=0; i < VIR_UUID_BUFLEN; ++i) { + tmpDef->uuid[i] = origUuid[i]; + } + + if (VIR_ALLOC(tmpMonConf) < 0) + goto abort; + + if (virDomainChrSourceDefCopy(tmpMonConf, newPriv->monConfig) < 0) + goto abort; + + if (VIR_STRDUP(tmpPidFile, newPriv->pidfile) < 0) + goto abort; + + if (newPriv->mon) { + qemuMonitorClose(newPriv->mon); + newPriv->mon = NULL; + } + + *newPid = newVm->pid; + *newPidFile = tmpPidFile; + *newDef = tmpDef; + *newMonConf = tmpMonConf; + *newMonJSON = newPriv->monJSON; + result = true; + +cleanup: + virObjectUnref(caps); + + qemuDomainRemoveInactive(driver, newVm); + return result; + +abort: + VIR_FREE(tmpPidFile); + virDomainChrSourceDefFree(tmpMonConf); + virDomainDefFree(tmpDef); + qemuProcessStop(driver, newVm, VIR_DOMAIN_SHUTOFF_MIGRATED, + QEMU_ASYNC_JOB_NONE); + goto cleanup; +} + +static virDomainPtr +qemuLiveUpgradeMiniConfirm(virQEMUDriverPtr driver, virConnectPtr conn, + virDomainDefPtr newDef, bool newMonJSON, + virDomainChrSourceDefPtr newMonConf, + pid_t newPid, const char *newPidFile) { + virDomainPtr newDom = NULL; + virDomainObjPtr newVm = NULL; + int r = 0; + + if (!(newVm = virDomainObjListAdd(driver->domains, newDef, driver->xmlopt, + 0, NULL))) + goto cleanup; + newDef = NULL; + newVm->def->id = -1; + + VIR_SHRINK_N(newVm->def->seclabels, newVm->def->nseclabels, newVm->def->nseclabels); + if (virSecurityManagerGenLabel(driver->securityManager, newVm->def) < 0) + goto cleanup; + r = qemuProcessAttach(conn, driver, newVm, newPid, + newPidFile, newMonConf, newMonJSON); + newMonConf = NULL; + if (r < 0) + goto cleanup; + + if (qemuProcessStartCPUs(driver, newVm, conn, + VIR_DOMAIN_RUNNING_MIGRATED, + QEMU_ASYNC_JOB_NONE) < 0) { + qemuProcessStop(driver, newVm, VIR_DOMAIN_SHUTOFF_FAILED, + VIR_QEMU_PROCESS_STOP_MIGRATED); + goto cleanup; + } + newDom = virGetDomain(conn, newVm->def->name, newVm->def->uuid); + +cleanup: + VIR_FREE(newPidFile); + virDomainChrSourceDefFree(newMonConf); + virDomainDefFree(newDef); + if (newVm) + virObjectUnlock(newVm); + + return newDom; +} + +static virDomainPtr +qemuDomObjQemuLiveUpgrade(virQEMUDriverPtr driver, virConnectPtr conn, + virDomainObjPtr vm, unsigned int flags) { + char *origName = NULL; + unsigned char origUuid[VIR_UUID_BUFLEN]; + size_t i = 0; + virDomainPtr newDom = NULL; + virDomainDefPtr newDef = NULL; + virDomainObjPtr newVm = NULL; + virDomainChrSourceDefPtr newMonConf = NULL; + bool newMonJSON = false; + pid_t newPid = -1; + const char * newPidFile = NULL; + 
virDomainDefPtr finalDef = NULL; + + VIR_DEBUG("vm=%p, flags=%x", vm, flags); + + if (!(newDef = qemuLiveUpgradeMiniBegin(driver, vm))) + goto cleanup; + + virObjectUnlock(vm); + newVm = qemuLiveUpgradeMiniPrepare(driver, conn, newDef); + virObjectLock(vm); + newDef = NULL; + if (!newVm) { + goto cleanup; + } + + if (!qemuLiveUpgradeMiniPerform(driver, vm, newVm->def->name)) { + goto cleanup; + } + + if (VIR_STRDUP(origName, vm->def->name) < 0) + goto cleanup; + for (i=0; i < VIR_UUID_BUFLEN; ++i) { + origUuid[i] = vm->def->uuid[i]; + } + virObjectUnlock(vm); + vm = NULL; + + if (!qemuLiveUpgradeMiniFinish(driver, newVm, origName, origUuid, + &finalDef, &newMonJSON, &newMonConf, &newPid, + &newPidFile)) + goto cleanup; + newVm = NULL; + + newDom = qemuLiveUpgradeMiniConfirm(driver, conn, finalDef, newMonJSON, + newMonConf, newPid, newPidFile); + finalDef = NULL; + newMonConf = NULL; + newPidFile = NULL; + +cleanup: + VIR_FREE(origName); + if (newVm) + qemuDomainRemoveInactive(driver, newVm); + if (vm) + virObjectUnlock(vm); + + return newDom; +} + +static virDomainPtr +qemuDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags) +{ + virQEMUDriverPtr driver = domain->conn->privateData; + virDomainObjPtr vm = NULL; + + VIR_DEBUG("domain=%p, flags=%x", domain, flags); + + if (!(vm = qemuDomObjFromDomain(domain))) + return NULL; + + if (virDomainQemuLiveUpgradeEnsureACL(domain->conn, vm->def) < 0) { + virObjectUnlock(vm); + return NULL; + } + + return qemuDomObjQemuLiveUpgrade(driver, domain->conn, vm, flags); +} + + static virDriver qemuDriver = { .no = VIR_DRV_QEMU, .name = QEMU_DRIVER_NAME, @@ -15892,6 +16230,7 @@ static virDriver qemuDriver = { .domainMigrateFinish3Params = qemuDomainMigrateFinish3Params, /* 1.1.0 */ .domainMigrateConfirm3Params = qemuDomainMigrateConfirm3Params, /* 1.1.0 */ .connectGetCPUModelNames = qemuConnectGetCPUModelNames, /* 1.1.3 */ + .domainQemuLiveUpgrade = qemuDomainQemuLiveUpgrade, /* 1.1.3 */ }; diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index a3d986f..f859936 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1686,7 +1686,7 @@ qemuMigrationUpdateJobStatus(virQEMUDriverPtr driver, } -static int +int qemuMigrationWaitForCompletion(virQEMUDriverPtr driver, virDomainObjPtr vm, enum qemuDomainAsyncJob asyncJob, virConnectPtr dconn, bool abort_on_error) diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index cafa2a2..48b2009 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -162,6 +162,9 @@ int qemuMigrationConfirm(virConnectPtr conn, int cookieinlen, unsigned int flags, int cancelled); +int qemuMigrationWaitForCompletion(virQEMUDriverPtr driver, virDomainObjPtr vm, + enum qemuDomainAsyncJob asyncJob, + virConnectPtr dconn, bool abort_on_error); bool qemuMigrationIsAllowed(virQEMUDriverPtr driver, virDomainObjPtr vm, virDomainDefPtr def, bool remote, diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index 7181949..cfa70bd 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -7013,6 +7013,7 @@ static virDriver remote_driver = { .domainMigrateFinish3Params = remoteDomainMigrateFinish3Params, /* 1.1.0 */ .domainMigrateConfirm3Params = remoteDomainMigrateConfirm3Params, /* 1.1.0 */ .connectGetCPUModelNames = remoteConnectGetCPUModelNames, /* 1.1.3 */ + .domainQemuLiveUpgrade = remoteDomainQemuLiveUpgrade, /* 1.1.3 */ }; static virNetworkDriver network_driver = { diff --git a/src/remote/remote_protocol.x 
b/src/remote/remote_protocol.x index f942670..25f35b2 100644 --- a/src/remote/remote_protocol.x +++ b/src/remote/remote_protocol.x @@ -2849,6 +2849,15 @@ struct remote_connect_get_cpu_model_names_ret { int ret; }; +struct remote_domain_qemu_live_upgrade_args { + remote_nonnull_domain dom; + unsigned int flags; +}; + +struct remote_domain_qemu_live_upgrade_ret { + remote_nonnull_domain domUpgraded; +}; + /*----- Protocol. -----*/ /* Define the program number, protocol version and procedure numbers here. */ @@ -5018,5 +5027,13 @@ enum remote_procedure { * @generate: none * @acl: connect:read */ - REMOTE_PROC_CONNECT_GET_CPU_MODEL_NAMES = 312 + REMOTE_PROC_CONNECT_GET_CPU_MODEL_NAMES = 312, + + /** + * @generate: both + * @acl: domain:migrate + * @acl: domain:start + * @acl: domain:write + */ + REMOTE_PROC_DOMAIN_QEMU_LIVE_UPGRADE = 313 }; diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c index 60abd3d..e0c7997 100644 --- a/tools/virsh-domain.c +++ b/tools/virsh-domain.c @@ -10451,6 +10451,139 @@ cleanup: return ret; } +/* + * "qemu-live-upgrade" command + */ +static const vshCmdInfo info_qemu_live_upgrade[] = { + {.name = "help", + .data = N_("Live upgrade QEMU binary version of a running domain") + }, + {.name = "desc", + .data = N_("Let the domain make use of a newly upgraded QEMU binary without restart.") + }, + {.name = NULL} +}; + +static const vshCmdOptDef opts_qemu_live_upgrade[] = { + {.name = "domain", + .type = VSH_OT_DATA, + .flags = VSH_OFLAG_REQ, + .help = N_("domain name, id or uuid") + }, + {.name = "page-flipping", + .type = VSH_OT_BOOL, + .help = N_("enable memory page flipping when upgrading the QEMU binray") + }, + {.name = "timeout", + .type = VSH_OT_INT, + .help = N_("force guest to suspend if QEMU live upgrade exceeds timeout (in seconds)") + }, + {.name = "verbose", + .type = VSH_OT_BOOL, + .help = N_("display the progress of uprgade") + }, + {.name = NULL} +}; + +static void +doQemuLiveUpgrade(void *opaque) +{ + char ret = '1'; + virDomainPtr dom = NULL; + virDomainPtr domUpgraded = NULL; + unsigned int flags = 0; + vshCtrlData *data = opaque; + vshControl *ctl = data->ctl; + const vshCmd *cmd = data->cmd; + sigset_t sigmask, oldsigmask; + + sigemptyset(&sigmask); + sigaddset(&sigmask, SIGINT); + if (pthread_sigmask(SIG_BLOCK, &sigmask, &oldsigmask) < 0) + goto out_sig; + + if (!(dom = vshCommandOptDomain(ctl, cmd, NULL))) + goto out; + + if ((domUpgraded = virDomainQemuLiveUpgrade(dom, flags))) { + ret = '0'; + } + +out: + pthread_sigmask(SIG_SETMASK, &oldsigmask, NULL); +out_sig: + if (domUpgraded) + virDomainFree(domUpgraded); + if (dom) + virDomainFree(dom); + ignore_value(safewrite(data->writefd, &ret, sizeof(ret))); + return; +} + +static void +vshQemuLiveUpgradeTimeout(vshControl *ctl, + virDomainPtr dom, + void *opaque ATTRIBUTE_UNUSED) +{ + vshDebug(ctl, VSH_ERR_DEBUG, "suspending the domain, " + "since QEMU live uprgade timed out\n"); + virDomainSuspend(dom); +} + +static bool +cmdQemuLiveUpgrade(vshControl *ctl, const vshCmd *cmd) +{ + virDomainPtr dom = NULL; + int p[2] = {-1, -1}; + virThread workerThread; + bool verbose = false; + bool functionReturn = false; + int timeout = 0; + vshCtrlData data; + int rv; + + if (!(dom = vshCommandOptDomain(ctl, cmd, NULL))) + return false; + + if (vshCommandOptBool(cmd, "verbose")) + verbose = true; + + if ((rv = vshCommandOptInt(cmd, "timeout", &timeout)) < 0 || + (rv > 0 && timeout < 1)) { + vshError(ctl, "%s", _("qemu-live-uprgade: Invalid timeout")); + goto cleanup; + } else if (rv > 0) { + /* Ensure that 
we can multiply by 1000 without overflowing. */ + if (timeout > INT_MAX / 1000) { + vshError(ctl, "%s", _("qemu-live-uprgade: Timeout is too big")); + goto cleanup; + } + } + + if (pipe(p) < 0) + goto cleanup; + + data.ctl = ctl; + data.cmd = cmd; + data.writefd = p[1]; + + if (virThreadCreate(&workerThread, + true, + doQemuLiveUpgrade, + &data) < 0) + goto cleanup; + functionReturn = vshWatchJob(ctl, dom, verbose, p[0], timeout, + vshQemuLiveUpgradeTimeout, NULL, _("Upgrade")); + + virThreadJoin(&workerThread); + +cleanup: + virDomainFree(dom); + VIR_FORCE_CLOSE(p[0]); + VIR_FORCE_CLOSE(p[1]); + return functionReturn; +} + const vshCmdDef domManagementCmds[] = { {.name = "attach-device", .handler = cmdAttachDevice, @@ -10796,6 +10929,12 @@ const vshCmdDef domManagementCmds[] = { .info = info_qemu_agent_command, .flags = 0 }, + {.name = "qemu-live-upgrade", + .handler = cmdQemuLiveUpgrade, + .opts = opts_qemu_live_upgrade, + .info = info_qemu_live_upgrade, + .flags = 0 + }, {.name = "reboot", .handler = cmdReboot, .opts = opts_reboot, -- 1.8.3.1 _____________________________ Zhou Zheng Sheng / 周征晟 Software Engineer E-mail: zhshzhou@cn.ibm.com Telephone: 86-10-82454397

On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
Hi all,
Recently QEMU developers have been working on a feature to allow upgrading a live QEMU instance to a new version without restarting the VM. This is implemented as live migration between the old and new QEMU process on the same host [1]. Here is the use case:
1) Guests are running QEMU release 1.6.1. 2) Admin installs QEMU release 1.6.2 via RPM or deb. 3) Admin starts a new VM using the updated QEMU binary, and asks the old QEMU process to migrate the VM to the newly started VM.
I think it will be very useful to support QEMU live upgrade in libvirt. After some investigations, I found migrating to the same host breaks the current migration code. I'd like to propose a new work flow for QEMU live migration. It is to implement the above step 3).
How does it break the migration code? Your patch below is effectively re-implementing the multistep migration workflow, leaving out many important features (seamless reconnect to SPICE clients, for example), which is really bad for our ongoing code support burden, so not something I want to see.

Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Hi Daniel,

on 2013/11/12 20:23, Daniel P. Berrange wrote:

On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
Hi all,
Recently QEMU developers have been working on a feature to allow upgrading a live QEMU instance to a new version without restarting the VM. This is implemented as live migration between the old and new QEMU process on the same host [1]. Here is the use case:
1) Guests are running QEMU release 1.6.1. 2) Admin installs QEMU release 1.6.2 via RPM or deb. 3) Admin starts a new VM using the updated QEMU binary, and asks the old QEMU process to migrate the VM to the newly started VM.
I think it will be very useful to support QEMU live upgrade in libvirt. After some investigations, I found migrating to the same host breaks the current migration code. I'd like to propose a new work flow for QEMU live migration. It is to implement the above step 3).
How does it break the migration code? Your patch below is effectively re-implementing the multistep migration workflow, leaving out many important features (seamless reconnect to SPICE clients, for example), which is really bad for our ongoing code support burden, so not something I want to see.
Daniel
Actually I wrote another hacking patch to investigate how we can re-use the existing framework to do local migration. I found the following problems.

(1) When migrating to a different host, the destination domain uses the same UUID and name as the source, and this is OK. When migrating to localhost, the destination domain's UUID and name conflict with the source. The QEMU driver maintains a hash table of domain objects keyed by the UUID of the virtual machine. closeCallbacks is also a hash table with the domain UUID as key, and there may be other data structures using the UUID as key. This implies we must use a different name and UUID for the destination domain. In the migration framework, during the Begin and Prepare stages, it calls virDomainDefCheckABIStability to prevent us from using a different UUID, and it also checks that the hostname and host UUID are different. If we want to enable local migration, we have to skip these checks and generate a new UUID and name for the destination domain. Of course we restore the original UUID after migration. The UUID is used by higher-level management software to identify virtual machines; it should stay the same after a QEMU live upgrade.

(2) If I understand the code correctly, libvirt uses a thread pool to handle RPC requests. This means local migration may cause deadlock in P2P migration mode. Suppose there are some concurrent local migration requests and all the worker threads are occupied by these requests. When the source libvirtd connects to the destination libvirtd on the same host to negotiate the migration, the negotiation request is queued, but it will never be handled, because the original migration request from the client is waiting for the negotiation request to finish in order to progress, while the negotiation request is queued waiting for the original request to end. This is one of the deadlock risks I can think of. I guess in traditional migration mode, in which the client opens two connections to the source and destination libvirtd, there is also a risk of deadlock.

(3) Libvirt supports Unix domain socket transport, but this is only used in a tunnelled migration. For native migration, it only supports TCP. We need to enable Unix domain socket transport in native migration. Now we already have a hypervisor migration URI argument in the migration API, but there is no support for parsing and verifying a "unix:/full/path" URI and passing that URI transparently to QEMU. We can add this to the current migration framework, but direct Unix socket transport looks meaningless for normal migration.

(4) When migration fails, the source domain is resumed, and this may not work if we enable page-flipping in QEMU. With page-flipping enabled, QEMU transfers memory page ownership to the destination QEMU, so the source virtual machine should be restarted but not resumed when the migration fails.

To summarize, I made a migration call flow annotated with the things I hacked to enable local migration in the existing migration framework. It's a bit long, so I put it at the end of the mail. I found that if I were to re-use the migration framework, I would need to change the interfaces of a few functions, add some flags, and pass them deep into the inner functions. So I propose a new and compact work flow dedicated to QEMU live upgrade. After all, it's an upgrade operation based on tricky migration. When developing the previous RFC patch for the new API, I focused on the correctness of the work flow, so many other things are missing. I think I can add things like SPICE seamless migration when submitting new versions.
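For point (3), here is a rough sketch (a hypothetical helper, not part of the posted patch) of the kind of parsing and verification a native "unix:" migration URI would need before the path is handed to QEMU:

    /* Hypothetical helper: accept only a "unix:/absolute/path" migration URI
     * and hand the socket path back so it can be passed to QEMU unmodified.
     * Illustrative only. */
    static int
    qemuLiveUpgradeParseUnixURI(const char *uri, char **sockpath)
    {
        const char *path;

        *sockpath = NULL;

        if (!STRPREFIX(uri, "unix:")) {
            virReportError(VIR_ERR_INVALID_ARG,
                           _("expected a unix: URI, got '%s'"), uri);
            return -1;
        }

        path = uri + strlen("unix:");
        if (path[0] != '/') {
            virReportError(VIR_ERR_INVALID_ARG,
                           _("unix socket path '%s' is not absolute"), path);
            return -1;
        }

        if (VIR_STRDUP(*sockpath, path) < 0)
            return -1;

        return 0;
    }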
I am also really happy if you could give me some advice on re-using the migration framework. Re-using the current framework can save a lot of effort.

Appendix: call flow to enable local migration in the current migration framework

All conn->XXX() and dconn->XXX() are remote calls to libvirtd; libvirtd then dispatches the request to the QEMU driver. "domain" and "conn" mean the source domain and libvirt connection, "ddomain" and "dconn" mean the destination domain and libvirt connection. Things I hacked are marked with /* HACKED */.

virDomainMigrate(...)
  virDomainMigrateVersion3(...) -> virDomainMigrateVersion3Full(...)
    dom_xml = conn->domainMigrateBegin3(&cookieout...) => qemuDomainMigrateBegin3
      vm = qemuDomObjFromDomain(domain)
      qemuMigrationBegin(vm, ...)
        qemuMigrationBeginPhase(vm, ...)
          Generate migration cookie: qemuMigrationEatCookie(NULL, ...); qemuMigrationBakeCookie(...)
          if (xmlin)
            def = virDomainDefParseString(xmlin, ...)
            virDomainDefCheckABIStability(def, vm->def) /* HACKED to skip domain uuid check */
          qemuDomainDefFormatLive(def / vm->def, ...)

    cookiein = cookieout
    dconn->domainMigratePrepare3(uri, &uri_out, cookiein, &cookieout, dname, dom_xml, ...) => qemuDomainMigratePrepare3
      def = qemuMigrationPrepareDef(dom_xml, ...)
      qemuMigrationPrepareDirect(uri, uri_out, &def, ...)
        parse uri and build uri_out
        qemuMigrationPrepareAny(def, ...)
          new_def = call migration hooks to manipulate def
          if (virDomainDefCheckABIStability(def, new_def, ...)) /* HACKED to skip domain uuid check */
            def = new_def;
          migrateFrom = "tcp:host:port" /* HACKED to "unix:/full/path" */
          vm = virDomainObjListAdd(domains, def, ...)
          Verify cookie: qemuMigrationEatCookie(cookiein, ...)
            qemuMigrationCookieXMLParseStr(...)
              qemuMigrationCookieXMLParse(...)
                verify our hostname and host UUID are different from source /* HACKED to skip host uuid & name check */
          qemuProcessStart(vm, migrateFrom, ...)
            qemu --incoming "tcp:host:port" --hda ...
          qemuMigrationBakeCookie(cookieout, ...)

    uri = uri_out
    cookiein = cookieout
    conn->domainMigratePerform3(uri, cookiein, &cookieout, ...) => qemuDomainMigratePerform3
      vm = qemuDomObjFromDomain(dom)
      qemuMigrationPerform(uri, ...)
        qemuMigrationPerformPhase(uri, ...)
          doNativeMigrate(vm, uri, ...)
            build qemu migration spec /* HACKED to replace tcp transport spec with unix socket spec */
            qemuMigrationRun(vm, spec, ...)
              qemuMigrationEatCookie(...) /* HACKED to skip host uuid & name check */
              qemuMonitorMigrateToHost / qemuMonitorMigrateToUnix / qemuMonitorMigrateToFd
                qemuMonitorJSONMigrate(uri, ...)
                  cmd = qemuMonitorJSONMakeCommand("migrate", "s:uri", uri, ...)
                  qemuMonitorJSONCommand(cmd)
              start tunneling between src and dst unix socket /* HACKED to skip tunneling */
              qemuMigrationWaitForCompletion(...)
                Poll every 50ms for progress, check error, and allow cancellation
              qemuMigrationBakeCookie(cookieout, ...)

    cookiein = cookieout
    ddomain = dconn->domainMigrateFinish3(dname, uri, cookiein, &cookieout...) => qemuDomainMigrateFinish3
      vm = virDomainObjListFindByName(driver->domains, dname)
      qemuMigrationFinish(vm, ...)
        qemuMigrationEatCookie(...) /* HACKED to skip host uuid & name check */
        qemuProcessStartCPUs(...) -> ... Send QMP command "cont" to QEMU
        dom = virGetDomain(dconn, vm->def->name, vm->def->uuid)
        qemuMigrationBakeCookie(cookieout, ...)
        return dom

    cookiein = cookieout
    conn->domainMigrateConfirm3(cookiein, ...) => qemuDomainMigrateConfirm3
      vm = qemuDomObjFromDomain(domain)
      qemuMigrationConfirm(vm, cookiein, ...)
        qemuMigrationConfirmPhase(vm, cookiein, ...)
          qemuMigrationEatCookie(cookiein, ...) /* HACKED to skip host uuid & name check */
          qemuProcessStop(vm, ..)

    return ddomain

Thanks and best regards!

_____________________________ Zhou Zheng Sheng / 周征晟 Software Engineer E-mail: zhshzhou@cn.ibm.com Telephone: 86-10-82454397

On 11/12/2013 09:15 PM, Zheng Sheng ZS Zhou wrote:
Actually I wrote another hacking patch to investigate how we can re-use the existing framework to do local migration. I found the following problems.
[Can you please configure your mailer to wrap long lines? It's awkward to reply when there is no wrapping]
I found that if I were to re-use the migration framework, I would need to change the interfaces of a few functions, add some flags, and pass them deep into the inner functions.
Yes, but that's a more maintainable approach. Avoiding code duplication by adding parameters so that existing functions can serve both migration paths is better than blindly rewriting the new code path by just copying the original code path. If we find a bug down the road, we'd prefer to fix it in one place for both migration scenarios, rather than having to remember to copy the fix into two places. You can't reuse the entire migration framework, but you SHOULD be trying to refactor the code to use as much of the framework as possible. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Wed, Nov 13, 2013 at 12:15:30PM +0800, Zheng Sheng ZS Zhou wrote:
Hi Daniel,
on 2013/11/12 20:23, Daniel P. Berrange wrote:

On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
Hi all,
Recently QEMU developers have been working on a feature to allow upgrading a live QEMU instance to a new version without restarting the VM. This is implemented as live migration between the old and new QEMU process on the same host [1]. Here is the use case:
1) Guests are running QEMU release 1.6.1. 2) Admin installs QEMU release 1.6.2 via RPM or deb. 3) Admin starts a new VM using the updated QEMU binary, and asks the old QEMU process to migrate the VM to the newly started VM.
I think it will be very useful to support QEMU live upgrade in libvirt. After some investigations, I found migrating to the same host breaks the current migration code. I'd like to propose a new work flow for QEMU live migration. It is to implement the above step 3).
How does it break the migration code? Your patch below is effectively re-implementing the multistep migration workflow, leaving out many important features (seamless reconnect to SPICE clients, for example), which is really bad for our ongoing code support burden, so not something I want to see.
Daniel
Actually I wrote another hacking patch to investigate how we can re-use the existing framework to do local migration. I found the following problems.
(1) When migrating to a different host, the destination domain uses the same UUID and name as the source, and this is OK. When migrating to localhost, the destination domain's UUID and name conflict with the source. The QEMU driver maintains a hash table of domain objects keyed by the UUID of the virtual machine. closeCallbacks is also a hash table with the domain UUID as key, and there may be other data structures using the UUID as key. This implies we must use a different name and UUID for the destination domain. In the migration framework, during the Begin and Prepare stages, it calls virDomainDefCheckABIStability to prevent us from using a different UUID, and it also checks that the hostname and host UUID are different. If we want to enable local migration, we have to skip these checks and generate a new UUID and name for the destination domain. Of course we restore the original UUID after migration. The UUID is used by higher-level management software to identify virtual machines; it should stay the same after a QEMU live upgrade.
This point is something that needs to be solved regardless of whether we use the existing migration framework or re-invent it. The QEMU driver fundamentally assumes that there is only ever one single VM with a given UUID, and that a VM has only one process. IMHO name + uuid must be preserved during any live upgrade process, otherwise mgmt will get confused. This has more problems because 'name' is used for various resources created by QEMU on disk - e.g. the monitor command path. We can't have 2 QEMUs using the same name, but at the same time that's exactly what we'd need here.
(2) If I understand the code correctly, libvirt uses a thread pool to handle RPC requests. This means local migration may cause deadlock in P2P migration mode. Suppose there are some concurrent local migration requests and all the worker threads are occupied by these requests. When the source libvirtd connects to the destination libvirtd on the same host to negotiate the migration, the negotiation request is queued, but it will never be handled, because the original migration request from the client is waiting for the negotiation request to finish in order to progress, while the negotiation request is queued waiting for the original request to end. This is one of the deadlock risks I can think of. I guess in traditional migration mode, in which the client opens two connections to the source and destination libvirtd, there is also a risk of deadlock.
Yes, it sounds like you could get a deadlock even with 2 separate libvirtds, if both of them were migrating to the other concurrently.
(3) Libvirt supports Unix domain socket transport, but this is only used in a tunnelled migration. For native migration, it only supports TCP. We need to enable Unix domain socket transport in native migration. Now we already have a hypervisor migration URI argument in the migration API, but there is no support for parsing and verifying a "unix:/full/path" URI and passing that URI transparently to QEMU. We can add this to current migration framework but direct Unix socket transport looks meaningless for normal migration.
Actually, as far as QEMU is concerned, libvirt uses fd: migration only. Again though, this point seems pretty much unrelated to the question of how we design the APIs & structure the code.
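As an aside, one way the native path could stay on fd: migration while still talking to a local Unix socket is sketched below; this is illustrative only (the helper name is invented, includes and most error reporting are omitted), not something the posted patch does:

    /* Illustrative only: connect to the destination's "-incoming unix:..."
     * socket ourselves and hand the connected FD to QEMU through the
     * existing fd: migration path, instead of teaching QEMU a unix: URI. */
    static int
    upgradeMigrateOverUnixFd(virQEMUDriverPtr driver,
                             virDomainObjPtr vm,
                             const char *sockpath)
    {
        qemuDomainObjPrivatePtr priv = vm->privateData;
        struct sockaddr_un addr;
        int fd;
        int ret = -1;

        if ((fd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0)
            return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        if (virStrcpyStatic(addr.sun_path, sockpath) == NULL)
            goto cleanup;

        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            goto cleanup;

        if (qemuDomainObjEnterMonitorAsync(driver, vm, QEMU_ASYNC_JOB_NONE) < 0)
            goto cleanup;
        ret = qemuMonitorMigrateToFd(priv->mon, 0, fd);
        qemuDomainObjExitMonitor(driver, vm);

    cleanup:
        VIR_FORCE_CLOSE(fd);
        return ret;
    }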
(4) When migration fails, the source domain is resumed, and this may not work if we enable page-flipping in QEMU. With page-flipping enabled, QEMU transfers memory page ownership to the destination QEMU, so the source virtual machine should be restarted but not resumed when the migration fails.
IMHO that is not an acceptable approach. The whole point of doing live upgrades in place is that you consider the VMs to be "precious". If you were OK with VMs being killed & restarted, then we'd not bother doing any of this live upgrade pain at all. So if we're going to support live upgrades, we *must* be able to guarantee that they will either succeed, or the existing QEMU is left intact. Killing the VM and restarting is not an option on failure.
So I propose a new and compact work flow dedicated to QEMU live upgrade. After all, it's an upgrade operation based on tricky migration. When developing the previous RFC patch for the new API, I focused on the correctness of the work flow, so many other things are missing. I think I can add things like SPICE seamless migration when submitting new versions.
This way lies madness. We do not want 2 impls of the internal migration framework.
I am also really happy if you could give me some advice on re-using the migration framework. Re-using the current framework can save a lot of effort.
I consider using the internal migration framework a mandatory requirement here, even if the public API is different. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 11/12/2013 05:14 AM, Zheng Sheng ZS Zhou wrote:
From 2b659584f2cbe676c843ddeaf198c9a8368ff0ff Mon Sep 17 00:00:00 2001
From: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com>
Date: Wed, 30 Oct 2013 15:36:49 +0800
Subject: [PATCH] RFC: Support QEMU live upgrade
This patch is to support upgrading QEMU version without restarting the virtual machine.
Add new API virDomainQemuLiveUpgrade(), and a new virsh command qemu-live-upgrade. virDomainQemuLiveUpgrade() migrates a running VM to the same host as a new VM with new name and new UUID. Then it shutdown the original VM and drop the new VM definition without shutdown the QEMU process of the new VM. At last it attaches original VM to the new QEMU process.
Firstly the admin installs new QEMU package, then he runs virsh qemu-live-upgrade domain_name to trigger our virDomainQemuLiveUpgrade() upgrading flow.
In general, I agree that we need a new API (in fact, I think I helped suggest why we need it as opposed to reusing existing migration API, precisely for some of the deadlock reasons you called out in your reply to Dan). But the new API should still reuse as much of the existing migration code as possible (refactor that to be reusable, rather than bulk copying into completely new code). High-level review below (I didn't test whether things work or look for details like memory leaks, so much as a first impression of style problems and even some major design problems).
Signed-off-by: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com> --- include/libvirt/libvirt.h.in | 3 + src/driver.h | 4 + src/libvirt.c | 23 +++ src/libvirt_public.syms | 1 +
I know this is an RFC patch, but ideally the final patch will be split into parts. Part 1: public interface (the files above).
src/qemu/qemu_driver.c | 339 +++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.c | 2 +- src/qemu/qemu_migration.h | 3 +
Part 4: Qemu driver implementation (these 3 files)
src/remote/remote_driver.c | 1 + src/remote/remote_protocol.x | 19 ++-
Part 3: RPC implementation (these 2)
tools/virsh-domain.c | 139 ++++++++++++++++++
Part 2: Use of the public implementation (should compile and gracefully fail until parts 3 and 4 are also applied; doing it early means you can test the error paths as well as validate that the C API seems usable) You may also need to do a Part 5 to modify the python bindings, depending on whether the code generator was able to get it working on your behalf.
+++ b/include/libvirt/libvirt.h.in @@ -1331,6 +1331,9 @@ int virDomainMigrateGetMaxSpeed(virDomainPtr domain, unsigned long *bandwidth, unsigned int flags);
+virDomainPtr virDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags);
No bandwidth parameter? Seems inconsistent with all the other migration APIs. Also, since this is moving from one qemu process to another, this _totally_ seems like a case for allowing a destination XML argument (lots of people have asked for the ability to point qemu to an alternate disk source with identical bits visible to the guest but better characteristics for the host; right now, their solution is to 'virsh save', 'virsh save-image-edit', 'virsh restore'; but your API could let them specify an alternate XML and do a single live-upgrade so that the new qemu is using the new file name).

Anything with "Qemu" in the API name belongs in libvirt-qemu.h. But this API seems like it is useful to more than just qemu; it could be used for other hypervisors as well. You need a more generic name; maybe virDomainLiveUpgrade().
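To make that concrete, here is one possible shape for such a declaration. It is purely illustrative: the generic name and the dxml/bandwidth parameters are the reviewer's suggestions, not an existing libvirt interface.

    /* Hypothetical generic declaration for libvirt.h.in, mirroring the other
     * migration entry points: a hypervisor-neutral name, an optional
     * destination XML override and a bandwidth limit. */
    virDomainPtr virDomainLiveUpgrade(virDomainPtr domain,
                                      const char *dxml,        /* optional new guest XML */
                                      unsigned long bandwidth, /* MiB/s, 0 for unlimited */
                                      unsigned int flags);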
+++ b/src/driver.h @@ -686,6 +686,9 @@ typedef int const char *args, char ***models, unsigned int flags); +typedef virDomainPtr +(*virDrvDomainQemuLiveUpgrade)(virDomainPtr domain, + unsigned int flags);
Indentation is off.
+++ b/src/libvirt.c @@ -7524,6 +7524,29 @@ error:
/** + * virDomainQemuLiveUpgrade: + * @domain: a domain object + * @flags: bitwise-OR of flags
Which flags?
+ * + * Live upgrade qemu binary version of the domain. + * + * Returns the new domain object if the upgrade was successful, + * or NULL in case of error.
Unusual spacing; indenting by more than one space causes the generated .html documentation to treat that line as code rather than prose. You need to be explicit that if live upgrade fails, the domain is LOST (in contrast to normal migration, where failure just reverts back to running on the source). Users must be aware of the risks.
+ */ +virDomainPtr +virDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags) +{ + VIR_DEBUG("domain=%p, flags=%x", domain, flags); + if (!domain->conn->driver->domainQemuLiveUpgrade) { + virLibConnError(VIR_ERR_INTERNAL_ERROR, __FUNCTION__); + return NULL; + } + return domain->conn->driver->domainQemuLiveUpgrade(domain, flags);
Doesn't look like most other API calls. In particular, missing a VIR_DOMAIN_DEBUG for logging the call, missing a virResetLastError() for correct error handling, missing a check against invalid domain, missing a check against read-only connections, etc.
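For reference, a sketch of the boilerplate the other public entry points in src/libvirt.c of that era use; the macro names are the ones already used in that file, but treat this as illustrative rather than a finished replacement:

    virDomainPtr
    virDomainQemuLiveUpgrade(virDomainPtr domain,
                             unsigned int flags)
    {
        virConnectPtr conn;

        VIR_DOMAIN_DEBUG(domain, "flags=%x", flags);

        virResetLastError();

        if (!VIR_IS_CONNECTED_DOMAIN(domain)) {      /* invalid domain object */
            virLibDomainError(VIR_ERR_INVALID_DOMAIN, __FUNCTION__);
            virDispatchError(NULL);
            return NULL;
        }
        conn = domain->conn;

        if (conn->flags & VIR_CONNECT_RO) {          /* read-only connection */
            virLibDomainError(VIR_ERR_OPERATION_DENIED, __FUNCTION__);
            goto error;
        }

        if (conn->driver->domainQemuLiveUpgrade) {
            virDomainPtr ret = conn->driver->domainQemuLiveUpgrade(domain, flags);
            if (!ret)
                goto error;
            return ret;
        }

        virLibConnError(VIR_ERR_NO_SUPPORT, __FUNCTION__);

    error:
        virDispatchError(conn);
        return NULL;
    }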
+++ b/src/libvirt_public.syms @@ -637,6 +637,7 @@ LIBVIRT_1.1.1 { LIBVIRT_1.1.3 { global: virConnectGetCPUModelNames; + virDomainQemuLiveUpgrade; } LIBVIRT_1.1.1;
Wrong. The earliest this can be introduced is 1.1.5, not 1.1.3.
+static virDomainDefPtr +virDomainDefLiveCopy(virDomainDefPtr src, + virCapsPtr caps, + virDomainXMLOptionPtr xmlopt)
Indentation is off.
+ +static virDomainDefPtr +qemuLiveUpgradeMiniBegin(virQEMUDriverPtr driver, virDomainObjPtr vm) {
{ belongs on its own line when starting a function.
+ if (-1 == virUUIDGenerate(newDef->uuid))
Speak like Yoda we do not. Put your constants on the right side of ==.
+ +static virDomainObjPtr +qemuLiveUpgradeMiniPrepare(virQEMUDriverPtr driver, virConnectPtr conn, + virDomainDefPtr newDef) {
Function bodies start with { on own line. (Several more instances)
+ virDomainObjPtr newVm = NULL; + virDomainObjPtr result = NULL; + char *upgradeUri = NULL; + virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); + + newVm = virDomainObjListAdd(driver->domains, newDef, driver->xmlopt, + VIR_DOMAIN_OBJ_LIST_ADD_LIVE | + VIR_DOMAIN_OBJ_LIST_ADD_CHECK_LIVE, NULL); + if (!newVm) + goto cleanup; + newDef = NULL; + + if (virAsprintf(&upgradeUri, "unix:%s/qemu.live.upgrade.%s.sock", + cfg->libDir, newVm->def->name) < 0)
Computing the socket name once...
+static bool +qemuLiveUpgradeMiniPerform(virQEMUDriverPtr driver, virDomainObjPtr vm, + const char *newName) { + char *upgradeSock = NULL; + qemuDomainObjPrivatePtr priv = vm->privateData; + bool result = false; + bool migrate = false; + int r = 0; + virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); + + if (virAsprintf(&upgradeSock, "%s/qemu.live.upgrade.%s.sock", + cfg->libDir, newName) < 0)
...and again. Seems error prone if we update one string but not the other; shouldn't the name be generated by a common helper function?
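As an illustration of that suggestion, a hypothetical common helper (the name is invented) that both qemuLiveUpgradeMiniPrepare() and qemuLiveUpgradeMiniPerform() could call; the one caller that builds the -incoming URI would then prepend the "unix:" scheme itself:

    /* Hypothetical shared helper: the upgrade socket path is formatted in
     * exactly one place, so the Prepare and Perform sides can never drift. */
    static char *
    qemuLiveUpgradeSocketPath(virQEMUDriverConfigPtr cfg,
                              const char *domName)
    {
        char *sock = NULL;

        ignore_value(virAsprintf(&sock, "%s/qemu.live.upgrade.%s.sock",
                                 cfg->libDir, domName));
        return sock;   /* NULL on allocation failure */
    }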
+ + for (i=0; i < VIR_UUID_BUFLEN; ++i) {
Spaces around operators, and for loops prefer post-increment rather than pre-increment: for (i = 0; i < VIR_UUID_BUFLEN; i++)
+ tmpDef->uuid[i] = origUuid[i];
For that matter, a for loop is rather inefficient; just memcpy(tmpDef->uuid, origUuid, VIR_UUID_BUFLEN) (supposing that using a temporary uuid even makes sense design-wise; after all, the uuid you pass on the destination qemu command line is guest-visible via dmidecode in the guest; you must NOT start the guest with a different uuid and then fix it up later in the host).
+ if (!newVm) { + goto cleanup; + } + + if (!qemuLiveUpgradeMiniPerform(driver, vm, newVm->def->name)) { + goto cleanup; + }
In general, we tend to omit {} around one-liner bodies (not mandatory to omit, and different from qemu's style where {} is always required).
+ + if (VIR_STRDUP(origName, vm->def->name) < 0) + goto cleanup; + for (i=0; i < VIR_UUID_BUFLEN; ++i) { + origUuid[i] = vm->def->uuid[i]; + }
Another poorly-formatted for loop that should use memcpy instead.
@@ -15892,6 +16230,7 @@ static virDriver qemuDriver = { .domainMigrateFinish3Params = qemuDomainMigrateFinish3Params, /* 1.1.0 */ .domainMigrateConfirm3Params = qemuDomainMigrateConfirm3Params, /* 1.1.0 */ .connectGetCPUModelNames = qemuConnectGetCPUModelNames, /* 1.1.3 */ + .domainQemuLiveUpgrade = qemuDomainQemuLiveUpgrade, /* 1.1.3 */
1.1.5 at the earliest.
+++ b/src/remote/remote_driver.c @@ -7013,6 +7013,7 @@ static virDriver remote_driver = { .domainMigrateFinish3Params = remoteDomainMigrateFinish3Params, /* 1.1.0 */ .domainMigrateConfirm3Params = remoteDomainMigrateConfirm3Params, /* 1.1.0 */ .connectGetCPUModelNames = remoteConnectGetCPUModelNames, /* 1.1.3 */ + .domainQemuLiveUpgrade = remoteDomainQemuLiveUpgrade, /* 1.1.3 */
1.1.5
+++ b/src/remote/remote_protocol.x @@ -2849,6 +2849,15 @@ struct remote_connect_get_cpu_model_names_ret { int ret; };
+struct remote_domain_qemu_live_upgrade_args { + remote_nonnull_domain dom; + unsigned int flags;
Needs to match the final API we decide on (again, I don't like qemu in the name, since I think this is more generic than libvirt-qemu.so; I also think you need at least destination xml and probably also bandwidth arguments for consistency; even a virTypedParameter argument will save you hassle if down the road we think of more items that need parameterization).
+static void +vshQemuLiveUpgradeTimeout(vshControl *ctl, + virDomainPtr dom, + void *opaque ATTRIBUTE_UNUSED) +{ + vshDebug(ctl, VSH_ERR_DEBUG, "suspending the domain, " + "since QEMU live uprgade timed out\n");
Is upgrade timeout even possible with live upgrade, or is the point of page-flipping on a local machine to be that we are fast enough to not have to throttle things and never need to worry about converging in a timely manner?
@@ -10796,6 +10929,12 @@ const vshCmdDef domManagementCmds[] = { .info = info_qemu_agent_command, .flags = 0 }, + {.name = "qemu-live-upgrade",
If this does not live in libvirt-qemu.so, then the virsh command must not have 'qemu-' in the name. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Hi Eric,

on 2013/11/13 12:54, Eric Blake wrote:
On 11/12/2013 05:14 AM, Zheng Sheng ZS Zhou wrote:
From 2b659584f2cbe676c843ddeaf198c9a8368ff0ff Mon Sep 17 00:00:00 2001
From: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com>
Date: Wed, 30 Oct 2013 15:36:49 +0800
Subject: [PATCH] RFC: Support QEMU live upgrade
This patch is to support upgrading QEMU version without restarting the virtual machine.
Add new API virDomainQemuLiveUpgrade(), and a new virsh command qemu-live-upgrade. virDomainQemuLiveUpgrade() migrates a running VM to the same host as a new VM with new name and new UUID. Then it shutdown the original VM and drop the new VM definition without shutdown the QEMU process of the new VM. At last it attaches original VM to the new QEMU process.
Firstly the admin installs new QEMU package, then he runs virsh qemu-live-upgrade domain_name to trigger our virDomainQemuLiveUpgrade() upgrading flow.
In general, I agree that we need a new API (in fact, I think I helped suggest why we need it as opposed to reusing existing migration API, precisely for some of the deadlock reasons you called out in your reply to Dan). But the new API should still reuse as much of the existing migration code as possible (refactor that to be reusable, rather than bulk copying into completely new code). High-level review below (I didn't test whether things work or look for details like memory leaks, so much as a first impression of style problems and even some major design problems).
Signed-off-by: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com> --- include/libvirt/libvirt.h.in | 3 + src/driver.h | 4 + src/libvirt.c | 23 +++ src/libvirt_public.syms | 1 +
I know this is an RFC patch, but ideally the final patch will be split into parts. Part 1: public interface (the files above).
src/qemu/qemu_driver.c | 339 +++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.c | 2 +- src/qemu/qemu_migration.h | 3 +
Part 4: Qemu driver implementation (these 3 files)
src/remote/remote_driver.c | 1 + src/remote/remote_protocol.x | 19 ++-
Part 3: RPC implementation (these 2)
tools/virsh-domain.c | 139 ++++++++++++++++++
Part 2: Use of the public implementation (should compile and gracefully fail until parts 3 and 4 are also applied; doing it early means you can test the error paths as well as validate that the C API seems usable)
You may also need to do a Part 5 to modify the python bindings, depending on whether the code generator was able to get it working on your behalf.
+++ b/include/libvirt/libvirt.h.in @@ -1331,6 +1331,9 @@ int virDomainMigrateGetMaxSpeed(virDomainPtr domain, unsigned long *bandwidth, unsigned int flags); +virDomainPtr virDomainQemuLiveUpgrade(virDomainPtr domain, + unsigned int flags);
No bandwidth parameter? Seems inconsistent with all the other migration APIs. Also, since this is moving from one qemu process to another, this _totally_ seems like a case for allowing a destination XML argument (lots of people have asked for the ability to point qemu to an alternate disk source with identical bits visible to the guest but better characteristics for the host; right now, their solution is to 'virsh save', 'virsh save-image-edit', 'virsh restore'; but your API could let them specify an alternate XML and do a single live-upgrade so that the new qemu is using the new file name).
Anything with "Qemu" in the API name belongs in libvirt-qemu.h. But this API seems like it is useful to more than just qemu; it could be used for other hypervisors as well. You need a more generic name; mayb virDomainLiveUpgrade().
Thanks for reminding me of the line wrapping problem. My usual email account gets rejected by the libvir-list server; the server cannot resolve linux.vnet.ibm.com. I used this email account and did not notice the wrapping problem, sorry.

Thank you very much for the suggestions. I agree with you on splitting the patch, the bandwidth and XML parameters, coding style, memcpy and the many other points. I will start to refactor the framework and re-use it as much as possible.

QEMU developer Lei Li said that for a 1 GB memory guest, the downtime of a page-flipping migration is roughly 650 ms. Kernel developers are working on a new vmsplice implementation to improve page-flipping, so timeout and bandwidth may not be very useful for it. But as you suggested, we can make virDomainLiveUpgrade a general API for all hypervisors. Since other hypervisors may not support page-flipping, I will still keep the timeout and bandwidth arguments in the new API.

As I wrote in previous mails, I find the domain UUID very important in libvirt. It causes a lot of trouble if we start the destination domain with the same UUID. Actually I did try to hack libvirt to do this but wasn't successful.

I discussed with Lei Li in the CC list. We can add a new QEMU monitor command to change the guest UUID. During the migration, we first start the destination QEMU process with all vCPUs paused, and this QEMU process gets a different temporary UUID. After migration, we call that monitor command to change the UUID back, then resume the vCPUs. The guest OS should not notice this change. So the new flow would look like this:

1. The original domain migrates to the new domain with a different UUID.
2. Send a monitor command to the QEMU process of the new domain to change the UUID back.
3. Drop the new domain object from the QEMU driver but don't kill the new QEMU process.
4. Shut down the original domain, and ask it to attach to the new QEMU process.
5. Resume the vCPUs.

I doubt other hypervisors would support changing the UUID; that's why I made the new API specific to QEMU in the RFC patch. Another option is that we start the QEMU process with the same UUID but use a different UUID in the libvirt domain object. At the end we drop the temporary domain object and attach to the destination QEMU process. In this way we avoid changing the guest UUID in the hypervisor, so the new API is feasible for all hypervisors. However, this makes it even tougher to re-use the existing migration framework.

Thanks and best regards!

_____________________________ Zhou Zheng Sheng / 周征晟 Software Engineer E-mail: zhshzhou@cn.ibm.com Telephone: 86-10-82454397

On 11/13/2013 12:42 AM, Zheng Sheng ZS Zhou wrote:
As I wrote in previous mails, I find the domain UUID very important in libvirt. It causes a lot of trouble if we start the destination domain with the same UUID. Actually I did try to hack libvirt to do this but wasn't successful.
Unfortunately, starting a domain with a different UUID is guest-visible, and a non-starter. We have to resolve the issue of libvirt being able to track two qemu processes both tied to the same uuid.
I discussed with Lei Li in the CC list. We can add a new QEMU monitor command to change the guest UUID. During the migration, we first start the destination QEMU process with all vCPUs paused, and this QEMU process gets a different temporary UUID. After migration, we call that monitor command to change the UUID back, then resume the vCPUs. The guest OS should not notice this change.
I don't know of any qemu monitor command to hotplug in a UUID; and even if there were, changing dmidecode information at runtime is not how bare metal behaves, so I'm doubtful that qemu will add such a patch. I really think that your attempts to use a fake UUID are going to end in disaster, compared to fixing the real problem of teaching libvirt to manage two qemu processes at once both tied to the same UUID. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Wed, Nov 13, 2013 at 05:56:44AM -0700, Eric Blake wrote:
On 11/13/2013 12:42 AM, Zheng Sheng ZS Zhou wrote:
As I wrote in previous mails, I find the domain UUID very important in libvirt. It causes a lot of trouble if we start the destination domain with the same UUID. Actually I did try to hack libvirt to do this but wasn't successful.
Unfortunately, starting a domain with a different UUID is guest-visible, and a non-starter. We have to resolve the issue of libvirt being able to track two qemu processes both tied to the same uuid.
I discussed with Lei Li in the CC list. We can add a new QEMU monitor command to change the guest UUID. During the migration, we first start the destination QEMU process with all vCPUs paused, and this QEMU process gets a different temporary UUID. After migration, we call that monitor command to change the UUID back, then resume the vCPUs. The guest OS should not notice this change.
I don't know of any qemu monitor command to hotplug in a UUID; and even if there were, changing dmidecode information at runtime is not how bare metal behaves, so I'm doubtful that qemu will add such a patch. I really think that your attempts to use a fake UUID are going to end in disaster, compared to fixing the real problem of teaching libvirt to manage two qemu processes at once both tied to the same UUID.
Yep, that's not going to fly. UUID is part of machine ABI and cannot ever be changed for a guest.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Tue, Nov 12, 2013 at 09:54:44PM -0700, Eric Blake wrote:
On 11/12/2013 05:14 AM, Zheng Sheng ZS Zhou wrote:
From 2b659584f2cbe676c843ddeaf198c9a8368ff0ff Mon Sep 17 00:00:00 2001
From: Zhou Zheng Sheng <zhshzhou@linux.vnet.ibm.com>
Date: Wed, 30 Oct 2013 15:36:49 +0800
Subject: [PATCH] RFC: Support QEMU live upgrade
This patch supports upgrading the QEMU version without restarting the virtual machine.
Add a new API virDomainQemuLiveUpgrade() and a new virsh command qemu-live-upgrade. virDomainQemuLiveUpgrade() migrates a running VM to the same host as a new VM with a new name and new UUID. Then it shuts down the original VM and drops the new VM's definition without shutting down the QEMU process of the new VM. Finally it attaches the original VM to the new QEMU process.
First the admin installs the new QEMU package, then runs "virsh qemu-live-upgrade domain_name" to trigger the virDomainQemuLiveUpgrade() upgrade flow.
In general, I agree that we need a new API (in fact, I think I helped suggest why we need it as opposed to reusing existing migration API, precisely for some of the deadlock reasons you called out in your reply to Dan). But the new API should still reuse as much of the existing migration code as possible (refactor that to be reusable, rather than bulk copying into completely new code). High-level review below (I didn't test whether things work or look for details like memory leaks, so much as a first impression of style problems and even some major design problems).
I really don't like the idea of adding a new API for this - IMHO we need to address the deadlock scenario and fit this into our existing migration APIs. In particular, calling this "live upgrade" is wrong, as that is just a specific use case. Functionally this is "localhost migration" and so belongs in the migration APIs. As mentioned in my other message, I believe the deadlock scenario mentioned could even occur in non-localhost migration, if two libvirtds were doing concurrent migrations in opposite directions. So this seems like something we need to look at fixing somehow. Perhaps it needs a dedicated thread pool, or a spawn-on-demand thread, just for the specific migration RPC call that could deadlock, so we can guarantee we can always succeed in it?

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
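[Editorial sketch] As an illustration only, the "spawn on demand" idea mentioned above could look roughly like this: instead of handling the deadlock-prone migration RPC on a pooled worker, the dispatcher hands it to a detached thread created just for that call. dispatchOnDemand and rpc_job are made-up names; this is a generic pthread sketch, not libvirt's RPC code.

/* Generic sketch: run one deadlock-prone RPC handler on a dedicated,
 * detached thread instead of a pooled worker.  Illustrative names only. */

#include <pthread.h>
#include <stdlib.h>

struct rpc_job {
    void (*handler)(void *opaque);
    void *opaque;
};

static void *
rpcJobThread(void *arg)
{
    struct rpc_job *job = arg;

    job->handler(job->opaque);
    free(job);
    return NULL;
}

/* Run a single RPC handler on its own detached thread so it can never
 * be starved by (or starve) the fixed-size worker pool. */
int
dispatchOnDemand(void (*handler)(void *opaque), void *opaque)
{
    pthread_t tid;
    pthread_attr_t attr;
    struct rpc_job *job;

    if (!(job = malloc(sizeof(*job))))
        return -1;
    job->handler = handler;
    job->opaque = opaque;

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    if (pthread_create(&tid, &attr, rpcJobThread, job) != 0) {
        pthread_attr_destroy(&attr);
        free(job);
        return -1;
    }
    pthread_attr_destroy(&attr);
    return 0;
}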

on 2013/11/13 21:10, Daniel P. Berrange wrote:
On Wed, Nov 13, 2013 at 12:15:30PM +0800, Zheng Sheng ZS Zhou wrote:
Hi Daniel,
on 2013/11/12 20:23, Daniel P. Berrange wrote:
On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
Hi all,
Recently QEMU developers are working on a feature to allow upgrading a live QEMU instance to a new version without restarting the VM. This is implemented as live migration between the old and new QEMU process on the same host [1]. Here is the use case:
1) Guests are running QEMU release 1.6.1.
2) Admin installs QEMU release 1.6.2 via RPM or deb.
3) Admin starts a new VM using the updated QEMU binary, and asks the old QEMU process to migrate the VM to the newly started VM.
I think it will be very useful to support QEMU live upgrade in libvirt. After some investigations, I found migrating to the same host breaks the current migration code. I'd like to propose a new work flow for QEMU live migration. It is to implement the above step 3).
How does it break the migration code? Your patch below is effectively re-implementing the multistep migration workflow, leaving out many important features (seamless reconnect to SPICE clients, for example), which is really bad for our ongoing code support burden, so not something I want to see.
Daniel
Actually I wrote another hacking patch to investigate how we can re-use the existing framework to do local migration. I found the following problems.
(1) When migrating to a different host, the destination domain uses the same UUID and name as the source, and this is OK. When migrating to localhost, the destination domain's UUID and name conflict with the source. The QEMU driver maintains a hash table of domain objects whose key is the UUID of the virtual machine. The closeCallbacks table is also a hash table keyed by domain UUID, and there may be other data structures using the UUID as key. This implies we have to use a different name and UUID for the destination domain. In the migration framework, during the Begin and Prepare stages, it calls virDomainDefCheckABIStability to prevent us from using a different UUID, and it also checks that the hostname and host UUID are different. If we want to enable local migration, we have to skip these checks and generate a new UUID and name for the destination domain. Of course we restore the original UUID after migration. The UUID is used by higher-level management software to identify virtual machines, so it should stay the same after a QEMU live upgrade.
This point is something that needs to be solved regardless of whether you use the migration framework or re-invent it. The QEMU driver fundamentally assumes that there is only ever one single VM with a given UUID, and that a VM has only one process. IMHO name + uuid must be preserved during any live upgrade process, otherwise mgmt will get confused. This has more problems because 'name' is used for various resources created by QEMU on disk - eg the monitor command path. We can't have 2 QEMUs using the same name, but at the same time that's exactly what we'd need here.
Thanks Daniel. I agree with you that we should not change the QEMU UUID. I also think refactoring and re-using the existing migration code is great, so I did some investigation in this direction. I found that the assumption of one process and one UUID per VM exists not only in the QEMU driver but also in libvirt's higher-level data structures and functions, say the virDomainObj structure and the virCloseCallbacksXXX functions. The hypervisor process ID is directly associated with virDomainObj.pid, and virDomainObj contains only one pid field. The virCloseCallbacksXXX functions maintain the invariant that only one callback and connection can be registered for each VM UUID. For example, in a non-p2p migration the client opens two connections to libvirtd, one for the source domain and one for the destination domain. When it tries to register a close callback for the destination connection and the destination domain, libvirt reports an error that another connection has already registered a callback for this UUID - it was registered by the source connection for the source domain.

Is it acceptable if we start the new QEMU process with the same UUID while we refer to it in the libvirt virDomainObj using a different UUID? I mean we can generate a new UUID for the destination VM, which avoids all the conflicts in libvirt. We should also store the original UUID in the VM definition, and when we start the new QEMU process, use the original UUID. After migration, we drop the new virDomainObj and let the original virDomainObj attach to the new QEMU process. In this way the guest should not notice any change in the UUID, and we avoid conflicts in libvirt.
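[Editorial sketch] To make the "track under a different UUID" idea concrete: libvirt would generate a fresh UUID purely for its own bookkeeping (hash-table keys, close callbacks) while the QEMU command line still receives the guest's original UUID, so nothing changes from the guest's point of view. The sketch below uses libuuid rather than libvirt's own UUID helpers, and buildUuidArg is an illustrative helper, not code from the RFC patch; "-uuid" is QEMU's standard option for setting the guest SMBIOS UUID.

/* Illustrative only: separate the UUID libvirt uses for bookkeeping
 * from the UUID the guest actually sees.  Uses libuuid; link with -luuid. */

#include <stdio.h>
#include <uuid/uuid.h>

struct upgrade_ids {
    char tracking_uuid[37];   /* key for libvirt's internal tables          */
    char guest_uuid[37];      /* what goes on the QEMU command line (-uuid) */
};

/* Keep the guest-visible UUID, generate a fresh one for tracking. */
static void
upgradeIdsInit(struct upgrade_ids *ids, const char *original_guest_uuid)
{
    uuid_t tmp;

    snprintf(ids->guest_uuid, sizeof(ids->guest_uuid), "%s",
             original_guest_uuid);

    uuid_generate(tmp);
    uuid_unparse_lower(tmp, ids->tracking_uuid);
}

/* Hypothetical helper: the destination QEMU is started with the original
 * UUID, so the guest's SMBIOS UUID never changes. */
static void
buildUuidArg(const struct upgrade_ids *ids, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "-uuid %s", ids->guest_uuid);
}

int
main(void)
{
    struct upgrade_ids ids;
    char arg[64];

    upgradeIdsInit(&ids, "8f7c2a10-1111-2222-3333-444455556666");
    buildUuidArg(&ids, arg, sizeof(arg));

    printf("tracking (libvirt-internal): %s\n", ids.tracking_uuid);
    printf("QEMU command line argument : %s\n", arg);
    return 0;
}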
(2) If I understand the code correctly, libvirt uses a thread pool to handle RPC requests. This means local migration may cause a deadlock in p2p migration mode. Suppose there are several concurrent local migration requests and all the worker threads are occupied by these requests. When the source libvirtd connects to the destination libvirtd on the same host to negotiate the migration, the negotiation request is queued, but it will never be handled, because the original migration request from the client is waiting for the negotiation to finish before it can make progress, while the negotiation request is queued waiting for the original request to end. This is one of the deadlock risks I can think of. I guess in the traditional migration mode, in which the client opens two connections to the source and destination libvirtd, there is also a risk of deadlock.
Yes, it sounds like you could get a deadlock even with 2 separate libvirtds, if both of them were migrating to the other concurrently.
We will try to locate and fix the deadlock problems when implementing local migration. This seems like the right way to go.
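[Editorial sketch] A toy model of the pool-exhaustion deadlock discussed above, just to show the shape of the problem: each "migration request" holds a worker slot while waiting for a nested "prepare" request that itself needs a slot, so once every slot is occupied by a local migration, nobody can proceed. This is a deliberately simplified illustration, not libvirt's RPC dispatcher.

/* Toy model of the deadlock: a fixed pool of worker slots, where each
 * local-migration request holds a slot while its nested prepare call
 * also needs one.  Build with: cc -pthread deadlock.c */

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define WORKERS 2                  /* size of the worker pool (illustrative) */

static sem_t worker_slots;
static pthread_barrier_t all_busy; /* makes the demo deterministic */

static void *
localMigration(void *arg)
{
    long id = (long)arg;

    sem_wait(&worker_slots);           /* this request occupies a worker */
    pthread_barrier_wait(&all_busy);   /* wait until every slot is taken */

    /* The nested prepare RPC also needs a worker slot.  With every slot
     * held by a migration that is itself waiting on its prepare, a real
     * sem_wait() here would block forever -- the deadlock.  sem_trywait()
     * keeps the demo from hanging. */
    if (sem_trywait(&worker_slots) != 0)
        printf("migration %ld: no free worker for prepare -> deadlock\n", id);
    else
        sem_post(&worker_slots);       /* never reached in this setup */

    sem_post(&worker_slots);           /* release this request's worker */
    return NULL;
}

int
main(void)
{
    pthread_t threads[WORKERS];
    long i;

    sem_init(&worker_slots, 0, WORKERS);
    pthread_barrier_init(&all_busy, NULL, WORKERS);

    /* As many concurrent local migrations as there are worker threads. */
    for (i = 0; i < WORKERS; i++)
        pthread_create(&threads[i], NULL, localMigration, (void *)i);
    for (i = 0; i < WORKERS; i++)
        pthread_join(threads[i], NULL);

    pthread_barrier_destroy(&all_busy);
    sem_destroy(&worker_slots);
    return 0;
}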
(3) Libvirt supports Unix domain socket transport, but this is only used in tunnelled migration. For native migration, it only supports TCP. We need to enable Unix domain socket transport for native migration. We already have a hypervisor migration URI argument in the migration API, but there is no support for parsing and verifying a "unix:/full/path" URI and passing that URI transparently to QEMU. We can add this to the current migration framework, but direct Unix socket transport looks meaningless for normal migration.
Actually, as far as QEMU is concerned libvirt uses fd: migration only. Again though, this point seems pretty much unrelated to the question of how we design the APIs & structure the code.
Yes. I just want to point out that native Unix socket transport is what the QEMU developers decided to use for local migration with page-flipping. You may have already noticed that the vmsplice() system call needs a pipe. The old QEMU process and the new one are not parent and child, so QEMU uses the Unix domain socket's ancillary (out-of-band) data to transfer the pipe fd from one QEMU process to the other. TCP does not support this, and that's why I need to enable direct Unix domain socket transport for QEMU live upgrade.
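[Editorial sketch] For reference, this is the standard SCM_RIGHTS mechanism the ancillary-data remark refers to: a process sends an open file descriptor (here, one end of a pipe) over a connected Unix domain socket, and the kernel installs a duplicate of it in the receiver. This is plain POSIX, shown only to illustrate why a TCP connection cannot carry the pipe fd; it is not code from QEMU or libvirt.

/* Send an open file descriptor (e.g. one end of a pipe) to another
 * process over a connected Unix domain socket using SCM_RIGHTS. */

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int
send_fd(int unix_sock, int fd_to_pass)
{
    struct msghdr msg;
    struct cmsghdr *cmsg;
    struct iovec iov;
    char dummy = 'F';                    /* must carry at least 1 byte */
    union {                              /* correctly aligned cmsg buffer */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } control;

    memset(&msg, 0, sizeof(msg));
    memset(&control, 0, sizeof(control));

    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    /* The descriptor travels as ancillary (out-of-band) data. */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
}

The receiving side does the matching recvmsg() and reads the descriptor out of CMSG_DATA(); that duplicated descriptor is what lets the destination QEMU hand a pipe to vmsplice() even though the two processes are unrelated.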
(4) When migration fails, the source domain is resumed, and this may not work if we enable page-flipping in QEMU. With page-flipping enabled, QEMU transfers memory page ownership to the destination QEMU, so the source virtual machine should be restarted rather than resumed when the migration fails.
IMHO that is not an acceptable approach. The whole point of doing live upgrades in place is that you consider the VMs to be "precious". If you were OK with VMs being killed & restarted, then we'd not bother doing any of this live upgrade pain at all.
So if we're going to support live upgrades, we *must* be able to guarantee that they will either succeed, or the existing QEMU is left intact. Killing the VM and restarting is not an option on failure.
Yes. I'll check with the QEMU developers to see whether a page-flipped guest can resume its vCPUs or not.
So I propose a new and compact work flow dedicated to QEMU live upgrade. After all, it's an upgrade operation based on a tricky migration. When developing the previous RFC patch for the new API, I focused on the correctness of the work flow, so many other things are missing. I think I can add things like SPICE seamless migration when I submit new versions.
This way lies madness. We do not want 2 impls of the internal migration framework.
I would also be really happy if you could give me some advice on re-using the migration framework. Re-using the current framework can save a lot of effort.
I consider using the internal migration framework a mandatory requirement here, even if the public API is different.
Daniel
I also think re-using the migration code is good. I'm trying to find ways to avoid the UUID and name conflict problems. If there is no simple way, I need to investigate how much refactoring should be done, and propose some solutions to enable the QEMU driver to manage multiple processes.

Thanks and best regards!
--
Zhou Zheng Sheng / 周征晟
participants (3)
- Daniel P. Berrange
- Eric Blake
- Zheng Sheng ZS Zhou