[libvirt] [RFC 0/7] Live Migration with Pass-through Devices proposal

Background: Live migration is one of the most important features of virtualization, and for current workloads network I/O performance is critical. Existing network I/O virtualization (e.g. para-virtualized I/O, VMDq) still shows a significant performance gap compared to native network I/O. Pass-through network devices offer near-native performance, but so far they have prevented live migration, and no existing method solves live migration with pass-through devices perfectly.

One approach to the problem is described in:
  https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf
Please refer to that paper for the details. We think the problem can be solved by combining existing technologies, along the following steps:

- Before booting the VM, specify in the XML two NICs that will form a bonding device in the guest (one pass-through NIC and one virtual NIC). The NICs' MAC addresses are given in the XML, which lets the qemu-guest-agent find the corresponding network interfaces inside the guest.

- When the qemu-guest-agent starts up in the guest, it sends a notification to libvirt, and libvirt then invokes the previously registered initialize callbacks. In those callbacks we create the bonding device according to the XML configuration; we use the netcf tool, which makes creating a bonding device easy.

- During migration, unplug the pass-through NIC and then perform a normal migration.

- On the destination side, check whether a new NIC needs to be hotplugged according to the specified XML. Usually the migrate "--xml" option is used to supply the destination host's NIC MAC address, because the source side's pass-through NIC has a different MAC address; the device is then hotplugged according to the destination XML configuration.

TODO:
1. When a new NIC is hot-added on the destination side after migration finishes, it needs to be re-enslaved to the bonding device in the guest, otherwise it stays offline. We may need the bonding driver to support adding slave interfaces dynamically.

This is an example of how this might work, and I would like to hear some opinions about this scenario.

Thanks,
Chen Fan

Chen Fan (7):
  qemu-agent: add agent init callback when detecting guest setup
  qemu: add guest init event callback to do the initialize work for guest
  hostdev: add a 'bond' type element in <hostdev> element
  qemu-agent: add qemuAgentCreateBond interface
  hostdev: add parse ip and route for bond configure
  migrate: hot remove hostdev at perform phase for bond device
  migrate: add hostdev migrate status to support hostdev migration

 docs/schemas/basictypes.rng   |   6 ++
 docs/schemas/domaincommon.rng |  37 ++++++++
 src/conf/domain_conf.c        | 195 ++++++++++++++++++++++++++++++++++++++---
 src/conf/domain_conf.h        |  40 +++++++--
 src/conf/networkcommon_conf.c |  17 ----
 src/conf/networkcommon_conf.h |  17 ++++
 src/libvirt_private.syms      |   1 +
 src/qemu/qemu_agent.c         | 196 +++++++++++++++++++++++++++++++++++++++++-
 src/qemu/qemu_agent.h         |  12 +++
 src/qemu/qemu_command.c       |   3 +
 src/qemu/qemu_domain.c        |  70 +++++++++++++++
 src/qemu/qemu_domain.h        |  14 +++
 src/qemu/qemu_driver.c        |  38 ++++++++
 src/qemu/qemu_hotplug.c       |   8 +-
 src/qemu/qemu_migration.c     |  91 ++++++++++++++++++++
 src/qemu/qemu_migration.h     |   4 +
 src/qemu/qemu_process.c       |  32 +++++++
 src/util/virhostdev.c         |   3 +

 18 files changed, 745 insertions(+), 39 deletions(-)

--
1.9.3
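For illustration, a source-side configuration following the steps above might pair a virtio NIC with the pass-through NIC roughly like the sketch below. The MAC and PCI addresses are made up, and the <driver type='bond'/> and <bond> elements are the ones proposed in patch 3 of this series:

    <interface type='network'>
      <mac address='52:54:00:aa:bb:01'/>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <driver name='vfio' type='bond'/>
      <bond>
        <interface address='52:54:00:aa:bb:01'/>  <!-- MAC of the virtio NIC above -->
        <interface address='52:54:00:aa:bb:02'/>  <!-- MAC of the pass-through NIC as seen in the guest -->
      </bond>
      <source>
        <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
      </source>
    </hostdev>

The guest agent then only needs the two MAC addresses to locate the slave interfaces inside the guest.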

sometimes, we want to do some initialize work in guest when guest startup, but currently, qemu-agent doesn't support that. so here we add an init callback, when guest startup, notify libvirt it has been up, then libvirt can do some work for guest. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_agent.c | 26 +++++++++++++++++++++++--- src/qemu/qemu_agent.h | 2 ++ src/qemu/qemu_process.c | 6 ++++++ 3 files changed, 31 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index 548d580..cee0f8b 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -92,6 +92,7 @@ struct _qemuAgent { int watch; bool connectPending; + bool connected; virDomainObjPtr vm; @@ -306,6 +307,7 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, virJSONValuePtr obj = NULL; int ret = -1; unsigned long long id; + const char *status; VIR_DEBUG("Line [%s]", line); @@ -318,7 +320,11 @@ qemuAgentIOProcessLine(qemuAgentPtr mon, goto cleanup; } - if (virJSONValueObjectHasKey(obj, "QMP") == 1) { + if (virJSONValueObjectHasKey(obj, "QMP") == 1 || + virJSONValueObjectHasKey(obj, "status") == 1) { + status = virJSONValueObjectGetString(obj, "status"); + if (STREQ(status, "connected")) + mon->connected = true; ret = 0; } else if (virJSONValueObjectHasKey(obj, "event") == 1) { ret = qemuAgentIOProcessEvent(mon, obj); @@ -700,8 +706,22 @@ qemuAgentIO(int watch, int fd, int events, void *opaque) VIR_DEBUG("Triggering error callback"); (errorNotify)(mon, vm); } else { - virObjectUnlock(mon); - virObjectUnref(mon); + if (mon->connected) { + void (*init)(qemuAgentPtr, virDomainObjPtr) + = mon->cb->init; + virDomainObjPtr vm = mon->vm; + + mon->connected = false; + + virObjectUnlock(mon); + virObjectUnref(mon); + + VIR_DEBUG("Triggering init callback"); + (init)(mon, vm); + } else { + virObjectUnlock(mon); + virObjectUnref(mon); + } } } diff --git a/src/qemu/qemu_agent.h b/src/qemu/qemu_agent.h index 1cd5749..42414a7 100644 --- a/src/qemu/qemu_agent.h +++ b/src/qemu/qemu_agent.h @@ -34,6 +34,8 @@ typedef qemuAgent *qemuAgentPtr; typedef struct _qemuAgentCallbacks qemuAgentCallbacks; typedef qemuAgentCallbacks *qemuAgentCallbacksPtr; struct _qemuAgentCallbacks { + void (*init)(qemuAgentPtr mon, + virDomainObjPtr vm); void (*destroy)(qemuAgentPtr mon, virDomainObjPtr vm); void (*eofNotify)(qemuAgentPtr mon, diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 276837e..e6fc53a 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -194,8 +194,14 @@ static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, virObjectUnref(vm); } +static void qemuProcessHandleAgentInit(qemuAgentPtr agent ATTRIBUTE_UNUSED, + virDomainObjPtr vm) +{ + VIR_DEBUG("Received init from agent on %p '%s'", vm, vm->def->name); +} static qemuAgentCallbacks agentCallbacks = { + .init = qemuProcessHandleAgentInit, .destroy = qemuProcessHandleAgentDestroy, .eofNotify = qemuProcessHandleAgentEOF, .errorNotify = qemuProcessHandleAgentError, -- 1.9.3
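Judging from the qemuAgentIOProcessLine() change above, the greeting the (patched) guest agent is expected to emit when it comes up would be a single JSON line of this shape -- the exact field set is an assumption based on what the parser looks for; the agent-side change is in the qemu-ga series later in this thread:

    {"status": "connected"}

Any line whose "status" value is "connected" sets mon->connected, and the init callback is then triggered from qemuAgentIO().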

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_domain.h | 7 +++++++ src/qemu/qemu_driver.c | 32 ++++++++++++++++++++++++++++++++ src/qemu/qemu_process.c | 22 ++++++++++++++++++++++ 3 files changed, 61 insertions(+) diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 3225abb..19f4b27 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -136,6 +136,8 @@ struct qemuDomainJobObj { typedef void (*qemuDomainCleanupCallback)(virQEMUDriverPtr driver, virDomainObjPtr vm); +typedef void (*qemuDomainInitCallback)(virDomainObjPtr vm); + typedef struct _qemuDomainObjPrivate qemuDomainObjPrivate; typedef qemuDomainObjPrivate *qemuDomainObjPrivatePtr; struct _qemuDomainObjPrivate { @@ -185,6 +187,10 @@ struct _qemuDomainObjPrivate { size_t ncleanupCallbacks; size_t ncleanupCallbacks_max; + qemuDomainInitCallback *initCallbacks; + size_t nInitCallbacks; + size_t nInitCallbacks_max; + virCgroupPtr cgroup; virCond unplugFinished; /* signals that unpluggingDevice was unplugged */ @@ -205,6 +211,7 @@ typedef enum { QEMU_PROCESS_EVENT_NIC_RX_FILTER_CHANGED, QEMU_PROCESS_EVENT_SERIAL_CHANGED, QEMU_PROCESS_EVENT_BLOCK_JOB, + QEMU_PROCESS_EVENT_GUESTINIT, QEMU_PROCESS_EVENT_LAST } qemuProcessEventType; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index f37b95d..7368145 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4073,6 +4073,35 @@ processGuestPanicEvent(virQEMUDriverPtr driver, static void +processGuestInitEvent(virQEMUDriverPtr driver, + virDomainObjPtr vm) +{ + qemuDomainObjPrivatePtr priv; + int i; + + VIR_DEBUG("init guest from domain %p %s", + vm, vm->def->name); + + if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) + return; + + if (!virDomainObjIsActive(vm)) { + VIR_DEBUG("Domain is not running"); + goto endjob; + } + + priv = vm->privateData; + + for (i = 0; i < priv->nInitCallbacks; i++) { + if (priv->initCallbacks[i]) + priv->initCallbacks[i](vm); + } + + endjob: + qemuDomainObjEndJob(driver, vm); +} + +static void processDeviceDeletedEvent(virQEMUDriverPtr driver, virDomainObjPtr vm, char *devAlias) @@ -4627,6 +4656,9 @@ static void qemuProcessEventHandler(void *data, void *opaque) processEvent->action, processEvent->status); break; + case QEMU_PROCESS_EVENT_GUESTINIT: + processGuestInitEvent(driver, vm); + break; case QEMU_PROCESS_EVENT_LAST: break; } diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index e6fc53a..fcc0566 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -197,7 +197,29 @@ static void qemuProcessHandleAgentDestroy(qemuAgentPtr agent, static void qemuProcessHandleAgentInit(qemuAgentPtr agent ATTRIBUTE_UNUSED, virDomainObjPtr vm) { + struct qemuProcessEvent *processEvent = NULL; + virQEMUDriverPtr driver = qemu_driver; + + virObjectLock(vm); + VIR_DEBUG("Received init from agent on %p '%s'", vm, vm->def->name); + + if (VIR_ALLOC(processEvent) < 0) + goto cleanup; + + processEvent->eventType = QEMU_PROCESS_EVENT_GUESTINIT; + processEvent->vm = vm; + + virObjectRef(vm); + if (virThreadPoolSendJob(driver->workerPool, 0, processEvent) < 0) { + if (!virObjectUnref(vm)) + vm = NULL; + VIR_FREE(processEvent); + } + + cleanup: + if (vm) + virObjectUnlock(vm); } static qemuAgentCallbacks agentCallbacks = { -- 1.9.3
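As a usage sketch (not part of the patch), a subsystem that wants to run something once the agent signals it is up would register a callback matching the new qemuDomainInitCallback typedef; qemuDomainInitAdd() is the registration helper added in patch 4 of this series, and the callback name here is hypothetical:

    /* hypothetical example callback matching
     * typedef void (*qemuDomainInitCallback)(virDomainObjPtr vm); */
    static void
    exampleGuestInitCallback(virDomainObjPtr vm)
    {
        VIR_DEBUG("guest '%s' finished agent startup", vm->def->name);
    }

    /* ... during domain startup, with the vm object locked ... */
    if (qemuDomainInitAdd(vm, exampleGuestInitCallback) < 0)
        goto cleanup;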

this 'bond' element is to create bond device when guest startup, the xml like: <hostdev mode='subsystem' type='pci' managed='yes'> <driver name='vfio' type='bond'/> <bond> <interface address='XXX'/> <interface address='XXX1'/> </bond> </hostdev> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- docs/schemas/basictypes.rng | 6 ++ docs/schemas/domaincommon.rng | 16 ++++++ src/conf/domain_conf.c | 131 ++++++++++++++++++++++++++++++++++++++---- src/conf/domain_conf.h | 13 +++++ src/libvirt_private.syms | 1 + 5 files changed, 157 insertions(+), 10 deletions(-) diff --git a/docs/schemas/basictypes.rng b/docs/schemas/basictypes.rng index f086ad2..aef24fe 100644 --- a/docs/schemas/basictypes.rng +++ b/docs/schemas/basictypes.rng @@ -66,6 +66,12 @@ </choice> </define> + <define name="pciinterface"> + <attribute name="address"> + <ref name="uniMacAddr"/> + </attribute> + </define> + <define name="pciaddress"> <optional> <attribute name="domain"> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 03fd541..0cf82cb 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -3766,9 +3766,25 @@ <value>xen</value> </choice> </attribute> + <optional> + <attribute name="type"> + <choice> + <value>bond</value> + </choice> + </attribute> + </optional> <empty/> </element> </optional> + <optional> + <element name="bond"> + <zeroOrMore> + <element name="interface"> + <ref name="pciinterface"/> + </element> + </zeroOrMore> + </element> + </optional> <element name="source"> <optional> <ref name="startupPolicy"/> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 4d7e3c9..14bcae1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -610,6 +610,11 @@ VIR_ENUM_IMPL(virDomainHostdevSubsysPCIBackend, "vfio", "xen") +VIR_ENUM_IMPL(virDomainHostdevSubsysPCIDevice, + VIR_DOMAIN_HOSTDEV_PCI_DEVICE_TYPE_LAST, + "default", + "bond") + VIR_ENUM_IMPL(virDomainHostdevSubsysSCSIProtocol, VIR_DOMAIN_HOSTDEV_SCSI_PROTOCOL_TYPE_LAST, "adapter", @@ -1907,6 +1912,10 @@ void virDomainHostdevDefClear(virDomainHostdevDefPtr def) } else { VIR_FREE(scsisrc->u.host.adapter); } + } else if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; + if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) + VIR_FREE(pcisrc->macs); } break; } @@ -4978,7 +4987,9 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, char *sgio = NULL; char *rawio = NULL; char *backendStr = NULL; + char *deviceStr = NULL; int backend; + int device; int ret = -1; virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; virDomainHostdevSubsysSCSIPtr scsisrc = &def->source.subsys.u.scsi; @@ -5077,6 +5088,68 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, } pcisrc->backend = backend; + device = VIR_DOMAIN_HOSTDEV_PCI_DEVICE_DEFAULT; + if ((deviceStr = virXPathString("string(./driver/@type)", ctxt)) && + (((device = virDomainHostdevSubsysPCIDeviceTypeFromString(deviceStr)) < 0) || + device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_DEFAULT)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unknown PCI device <driver type='%s'/> " + "has been specified"), deviceStr); + goto error; + } + pcisrc->device = device; + + if (device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { + xmlNodePtr *macs = NULL; + int n = 0; + int i; + char *macStr = NULL; + + if (!(virXPathNode("./bond", ctxt))) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("missing <nond> node specified by bond type")); + goto error; + } + + if 
((n = virXPathNodeSet("./bond/interface", ctxt, &macs)) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Cannot extract interface nodes")); + goto error; + } + + VIR_FREE(pcisrc->macs); + if (VIR_ALLOC_N(pcisrc->macs, n) < 0) + goto error; + + pcisrc->nmac = n; + for (i = 0; i < n; i++) { + xmlNodePtr cur_node = macs[i]; + + macStr = virXMLPropString(cur_node, "address"); + if (!macStr) { + virReportError(VIR_ERR_XML_ERROR, "%s", + _("Missing required address attribute " + "in interface element")); + goto error; + } + if (virMacAddrParse((const char *)macStr, &pcisrc->macs[i]) < 0) { + virReportError(VIR_ERR_XML_ERROR, + _("unable to parse mac address '%s'"), + (const char *)macStr); + VIR_FREE(macStr); + goto error; + } + if (virMacAddrIsMulticast(&pcisrc->macs[i])) { + virReportError(VIR_ERR_XML_ERROR, + _("expected unicast mac address, found multicast '%s'"), + (const char *)macStr); + VIR_FREE(macStr); + goto error; + } + VIR_FREE(macStr); + } + } + break; case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB: @@ -18389,18 +18462,56 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, virDomainHostdevSubsysSCSIHostPtr scsihostsrc = &scsisrc->u.host; virDomainHostdevSubsysSCSIiSCSIPtr iscsisrc = &scsisrc->u.iscsi; - if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - pcisrc->backend != VIR_DOMAIN_HOSTDEV_PCI_BACKEND_DEFAULT) { - const char *backend = - virDomainHostdevSubsysPCIBackendTypeToString(pcisrc->backend); + if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { + const char *backend = NULL; + const char *device = NULL; + int i; + char macstr[VIR_MAC_STRING_BUFLEN]; - if (!backend) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("unexpected pci hostdev driver name type %d"), - pcisrc->backend); - return -1; + if (pcisrc->backend != VIR_DOMAIN_HOSTDEV_PCI_BACKEND_DEFAULT) { + backend = + virDomainHostdevSubsysPCIBackendTypeToString(pcisrc->backend); + + if (!backend) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected pci hostdev driver name type %d"), + pcisrc->backend); + return -1; + } + } + + if (pcisrc->device != VIR_DOMAIN_HOSTDEV_PCI_DEVICE_DEFAULT) { + device = + virDomainHostdevSubsysPCIDeviceTypeToString(pcisrc->device); + + if (!device) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected pci hostdev device name type %d"), + pcisrc->device); + return -1; + } + } + + if (backend) { + virBufferAddLit(buf, "<driver"); + virBufferAsprintf(buf, " name='%s'", backend); + if (device) + virBufferAsprintf(buf, " type='%s'", device); + + virBufferAddLit(buf, "/>\n"); + } + + if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND && + pcisrc->nmac > 0) { + virBufferAddLit(buf, "<bond>\n"); + virBufferAdjustIndent(buf, 2); + for (i = 0; i < pcisrc->nmac; i++) { + virBufferAsprintf(buf, "<interface address='%s'/>\n", + virMacAddrFormat(&pcisrc->macs[i], macstr)); + } + virBufferAdjustIndent(buf, -2); + virBufferAddLit(buf, "</bond>\n"); } - virBufferAsprintf(buf, "<driver name='%s'/>\n", backend); } virBufferAddLit(buf, "<source"); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index e6fa3c9..e62979f 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -416,6 +416,16 @@ typedef enum { VIR_ENUM_DECL(virDomainHostdevSubsysPCIBackend) +/* the type used for PCI hostdev devices */ +typedef enum { + VIR_DOMAIN_HOSTDEV_PCI_DEVICE_DEFAULT, /* default */ + VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND, /* bond device */ + + VIR_DOMAIN_HOSTDEV_PCI_DEVICE_TYPE_LAST +} virDomainHostdevSubsysPCIDeviceType; + 
+VIR_ENUM_DECL(virDomainHostdevSubsysPCIDevice) + typedef enum { VIR_DOMAIN_HOSTDEV_SCSI_PROTOCOL_TYPE_NONE, VIR_DOMAIN_HOSTDEV_SCSI_PROTOCOL_TYPE_ISCSI, @@ -442,6 +452,9 @@ typedef virDomainHostdevSubsysPCI *virDomainHostdevSubsysPCIPtr; struct _virDomainHostdevSubsysPCI { virDevicePCIAddress addr; /* host address */ int backend; /* enum virDomainHostdevSubsysPCIBackendType */ + int device; /* enum virDomainHostdevSubsysPCIDeviceType */ + size_t nmac; + virMacAddr* macs; }; typedef struct _virDomainHostdevSubsysSCSIHost virDomainHostdevSubsysSCSIHost; diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index aafc385..43a769d 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -320,6 +320,7 @@ virDomainHostdevInsert; virDomainHostdevModeTypeToString; virDomainHostdevRemove; virDomainHostdevSubsysPCIBackendTypeToString; +virDomainHostdevSubsysPCIDeviceTypeToString; virDomainHostdevSubsysTypeToString; virDomainHubTypeFromString; virDomainHubTypeToString; -- 1.9.3

via initialize callback to create bond device. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_agent.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_agent.h | 10 ++++ src/qemu/qemu_domain.c | 70 ++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 7 +++ src/qemu/qemu_process.c | 4 ++ 5 files changed, 209 insertions(+) diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index cee0f8b..b8eba01 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -2169,3 +2169,121 @@ qemuAgentGetInterfaces(qemuAgentPtr mon, goto cleanup; } + +static virDomainInterfacePtr +findInterfaceByMac(virDomainInterfacePtr *info, + size_t len, + const char *macstr) +{ + size_t i; + bool found = false; + + for (i = 0; i < len; i++) { + if (info[i]->hwaddr && + STREQ(info[i]->hwaddr, macstr)) { + found = true; + break; + } + } + + if (found) { + return info[i]; + } + + return NULL; +} + +/* + * qemuAgentSetInterface: + */ +int +qemuAgentCreateBond(qemuAgentPtr mon, + virDomainHostdevSubsysPCIPtr pcisrc) +{ + int ret = -1; + virJSONValuePtr cmd = NULL; + virJSONValuePtr reply = NULL; + size_t i; + char macstr[VIR_MAC_STRING_BUFLEN]; + virDomainInterfacePtr *interfaceInfo = NULL; + virDomainInterfacePtr interface; + virJSONValuePtr new_interface = NULL; + virJSONValuePtr subInterfaces = NULL; + virJSONValuePtr subInterface = NULL; + int len; + + if (!(pcisrc->nmac || pcisrc->macs)) + return ret; + + len = qemuAgentGetInterfaces(mon, &interfaceInfo); + if (len < 0) + return ret; + + if (!(new_interface = virJSONValueNewObject())) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "type", "bond") < 0) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "name", "bond0") < 0) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "onboot", "onboot") < 0) + goto cleanup; + + if (!(subInterfaces = virJSONValueNewArray())) + goto cleanup; + + for (i = 0; i < pcisrc->nmac; i++) { + virMacAddrFormat(&pcisrc->macs[i], macstr); + interface = findInterfaceByMac(interfaceInfo, len, macstr); + if (!interface) { + goto cleanup; + } + + if (!(subInterface = virJSONValueNewObject())) + goto cleanup; + + if (virJSONValueObjectAppendString(subInterface, "name", interface->name) < 0) + goto cleanup; + + if (virJSONValueArrayAppend(subInterfaces, subInterface) < 0) + goto cleanup; + + subInterface = NULL; + } + + if (i && virJSONValueObjectAppend(new_interface, "subInterfaces", subInterfaces) < 0) + goto cleanup; + + cmd = qemuAgentMakeCommand("guest-network-set-interface", + "a:interface", new_interface, + NULL); + + if (!cmd) + goto cleanup; + + subInterfaces = NULL; + new_interface = NULL; + + if (qemuAgentCommand(mon, cmd, &reply, true, + VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK) < 0) + goto cleanup; + + if (virJSONValueObjectGetNumberInt(reply, "return", &ret) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("malformed return value")); + } + + cleanup: + virJSONValueFree(subInterfaces); + virJSONValueFree(subInterface); + virJSONValueFree(new_interface); + virJSONValueFree(cmd); + virJSONValueFree(reply); + if (interfaceInfo) + for (i = 0; i < len; i++) + virDomainInterfaceFree(interfaceInfo[i]); + VIR_FREE(interfaceInfo); + return ret; +} diff --git a/src/qemu/qemu_agent.h b/src/qemu/qemu_agent.h index 42414a7..744cb0a 100644 --- a/src/qemu/qemu_agent.h +++ b/src/qemu/qemu_agent.h @@ -97,6 +97,13 @@ struct _qemuAgentCPUInfo { bool offlinable; /* true if the CPU can be offlined */ }; +typedef struct 
_qemuAgentInterfaceInfo qemuAgentInterfaceInfo; +typedef qemuAgentInterfaceInfo *qemuAgentInterfaceInfoPtr; +struct _qemuAgentInterfaceInfo { + char *name; + char *hardware_address; +}; + int qemuAgentGetVCPUs(qemuAgentPtr mon, qemuAgentCPUInfoPtr *info); int qemuAgentSetVCPUs(qemuAgentPtr mon, qemuAgentCPUInfoPtr cpus, size_t ncpus); int qemuAgentUpdateCPUInfo(unsigned int nvcpus, @@ -114,4 +121,7 @@ int qemuAgentSetTime(qemuAgentPtr mon, int qemuAgentGetInterfaces(qemuAgentPtr mon, virDomainInterfacePtr **ifaces); +int qemuAgentCreateBond(qemuAgentPtr mon, + virDomainHostdevSubsysPCIPtr pcisrc); + #endif /* __QEMU_AGENT_H__ */ diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 603360f..584fefb 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2722,6 +2722,46 @@ qemuDomainCleanupRun(virQEMUDriverPtr driver, priv->ncleanupCallbacks_max = 0; } +/* + * The vm must be locked when any of the following init functions is + * called. + */ +int +qemuDomainInitAdd(virDomainObjPtr vm, + qemuDomainInitCallback cb) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + size_t i; + + VIR_DEBUG("vm=%s, cb=%p", vm->def->name, cb); + + for (i = 0; i < priv->nInitCallbacks; i++) { + if (priv->initCallbacks[i] == cb) + return 0; + } + + if (VIR_RESIZE_N(priv->initCallbacks, + priv->nInitCallbacks_max, + priv->nInitCallbacks, 1) < 0) + return -1; + + priv->initCallbacks[priv->nInitCallbacks++] = cb; + return 0; +} + +void +qemuDomainInitCleanup(virDomainObjPtr vm) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + + VIR_DEBUG("vm=%s", vm->def->name); + + VIR_FREE(priv->cleanupCallbacks); + priv->ncleanupCallbacks = 0; + priv->ncleanupCallbacks_max = 0; +} + + static void qemuDomainGetImageIds(virQEMUDriverConfigPtr cfg, virDomainObjPtr vm, @@ -3083,3 +3123,33 @@ qemuDomainSupportsBlockJobs(virDomainObjPtr vm, return 0; } + +void +qemuDomainPrepareHostdevInit(virDomainObjPtr vm) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainDefPtr def = vm->def; + int i; + + if (!def->nhostdevs) + return; + + if (!qemuDomainAgentAvailable(vm, false)) + return; + + if (!virDomainObjIsActive(vm)) + return; + + for (i = 0; i < def->nhostdevs; i++) { + virDomainHostdevDefPtr hostdev = def->hostdevs[i]; + virDomainHostdevSubsysPCIPtr pcisrc = &hostdev->source.subsys.u.pci; + + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && + hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { + qemuDomainObjEnterAgent(vm); + qemuAgentCreateBond(priv->agent, pcisrc); + qemuDomainObjExitAgent(vm); + } + } +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 19f4b27..3244ca0 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -403,6 +403,10 @@ void qemuDomainCleanupRemove(virDomainObjPtr vm, void qemuDomainCleanupRun(virQEMUDriverPtr driver, virDomainObjPtr vm); +int qemuDomainInitAdd(virDomainObjPtr vm, + qemuDomainInitCallback cb); +void qemuDomainInitCleanup(virDomainObjPtr vm); + extern virDomainXMLPrivateDataCallbacks virQEMUDriverPrivateDataCallbacks; extern virDomainXMLNamespace virQEMUDriverDomainXMLNamespace; extern virDomainDefParserConfig virQEMUDriverDomainDefParserConfig; @@ -444,4 +448,7 @@ void qemuDomObjEndAPI(virDomainObjPtr *vm); int qemuDomainAlignMemorySizes(virDomainDefPtr def); void qemuDomainMemoryDeviceAlignSize(virDomainMemoryDefPtr mem); +void +qemuDomainPrepareHostdevInit(virDomainObjPtr vm); + #endif /* 
__QEMU_DOMAIN_H__ */ diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index fcc0566..0a72aca 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4444,6 +4444,9 @@ int qemuProcessStart(virConnectPtr conn, hostdev_flags) < 0) goto cleanup; + if (qemuDomainInitAdd(vm, qemuDomainPrepareHostdevInit)) + goto cleanup; + VIR_DEBUG("Preparing chr devices"); if (virDomainChrDefForeach(vm->def, true, @@ -5186,6 +5189,7 @@ void qemuProcessStop(virQEMUDriverPtr driver, VIR_QEMU_PROCESS_KILL_NOCHECK)); qemuDomainCleanupRun(driver, vm); + qemuDomainInitCleanup(vm); /* Stop autodestroy in case guest is restarted */ qemuProcessAutoDestroyRemove(driver, vm); -- 1.9.3
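Putting the pieces together, the JSON that qemuAgentCreateBond() ends up sending to the guest agent for a two-slave bond should look roughly like this. It is reconstructed from the code above rather than captured output, and "eth0"/"eth1" stand for whatever names guest-network-get-interfaces reported for the configured MACs:

    {"execute": "guest-network-set-interface",
     "arguments": {
       "interface": {
         "type": "bond",
         "name": "bond0",
         "onboot": "onboot",
         "subInterfaces": [
           {"name": "eth0"},
           {"name": "eth1"}
         ]
       }
     }}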

On Fri, Apr 17, 2015 at 04:53:06PM +0800, Chen Fan wrote:
via initialize callback to create bond device.
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_agent.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_agent.h | 10 ++++ src/qemu/qemu_domain.c | 70 ++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 7 +++ src/qemu/qemu_process.c | 4 ++ 5 files changed, 209 insertions(+)
diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index cee0f8b..b8eba01 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -2169,3 +2169,121 @@ qemuAgentGetInterfaces(qemuAgentPtr mon,
goto cleanup; } + +static virDomainInterfacePtr +findInterfaceByMac(virDomainInterfacePtr *info, + size_t len, + const char *macstr) +{ + size_t i; + bool found = false; + + for (i = 0; i < len; i++) { + if (info[i]->hwaddr && + STREQ(info[i]->hwaddr, macstr)) { + found = true; + break; + } + } + + if (found) { + return info[i]; + } + + return NULL; +} +
I think PCI addresses are a better way to identify the devices for this purpose. This will mean softmac doesn't break this functionality. See anything wrong with it?
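For what it's worth, if the slaves were identified by their guest PCI addresses instead of MACs, the <bond> children could reuse libvirt's usual address element, e.g. something like this (purely illustrative, not part of the posted series):

    <bond>
      <interface>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
      </interface>
      <interface>
        <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
      </interface>
    </bond>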
+/* + * qemuAgentSetInterface: + */ +int +qemuAgentCreateBond(qemuAgentPtr mon, + virDomainHostdevSubsysPCIPtr pcisrc) +{ + int ret = -1; + virJSONValuePtr cmd = NULL; + virJSONValuePtr reply = NULL; + size_t i; + char macstr[VIR_MAC_STRING_BUFLEN]; + virDomainInterfacePtr *interfaceInfo = NULL; + virDomainInterfacePtr interface; + virJSONValuePtr new_interface = NULL; + virJSONValuePtr subInterfaces = NULL; + virJSONValuePtr subInterface = NULL; + int len; + + if (!(pcisrc->nmac || pcisrc->macs)) + return ret; + + len = qemuAgentGetInterfaces(mon, &interfaceInfo); + if (len < 0) + return ret; + + if (!(new_interface = virJSONValueNewObject())) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "type", "bond") < 0) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "name", "bond0") < 0) + goto cleanup; + + if (virJSONValueObjectAppendString(new_interface, "onboot", "onboot") < 0) + goto cleanup; + + if (!(subInterfaces = virJSONValueNewArray())) + goto cleanup; + + for (i = 0; i < pcisrc->nmac; i++) { + virMacAddrFormat(&pcisrc->macs[i], macstr); + interface = findInterfaceByMac(interfaceInfo, len, macstr); + if (!interface) { + goto cleanup; + } + + if (!(subInterface = virJSONValueNewObject())) + goto cleanup; + + if (virJSONValueObjectAppendString(subInterface, "name", interface->name) < 0) + goto cleanup; + + if (virJSONValueArrayAppend(subInterfaces, subInterface) < 0) + goto cleanup; + + subInterface = NULL; + } + + if (i && virJSONValueObjectAppend(new_interface, "subInterfaces", subInterfaces) < 0) + goto cleanup; + + cmd = qemuAgentMakeCommand("guest-network-set-interface", + "a:interface", new_interface, + NULL); + + if (!cmd) + goto cleanup; + + subInterfaces = NULL; + new_interface = NULL; + + if (qemuAgentCommand(mon, cmd, &reply, true, + VIR_DOMAIN_QEMU_AGENT_COMMAND_BLOCK) < 0) + goto cleanup; + + if (virJSONValueObjectGetNumberInt(reply, "return", &ret) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("malformed return value")); + } + + cleanup: + virJSONValueFree(subInterfaces); + virJSONValueFree(subInterface); + virJSONValueFree(new_interface); + virJSONValueFree(cmd); + virJSONValueFree(reply); + if (interfaceInfo) + for (i = 0; i < len; i++) + virDomainInterfaceFree(interfaceInfo[i]); + VIR_FREE(interfaceInfo); + return ret; +} diff --git a/src/qemu/qemu_agent.h b/src/qemu/qemu_agent.h index 42414a7..744cb0a 100644 --- a/src/qemu/qemu_agent.h +++ b/src/qemu/qemu_agent.h @@ -97,6 +97,13 @@ struct _qemuAgentCPUInfo { bool offlinable; /* true if the CPU can be offlined */ };
+typedef struct _qemuAgentInterfaceInfo qemuAgentInterfaceInfo; +typedef qemuAgentInterfaceInfo *qemuAgentInterfaceInfoPtr; +struct _qemuAgentInterfaceInfo { + char *name; + char *hardware_address; +}; + int qemuAgentGetVCPUs(qemuAgentPtr mon, qemuAgentCPUInfoPtr *info); int qemuAgentSetVCPUs(qemuAgentPtr mon, qemuAgentCPUInfoPtr cpus, size_t ncpus); int qemuAgentUpdateCPUInfo(unsigned int nvcpus, @@ -114,4 +121,7 @@ int qemuAgentSetTime(qemuAgentPtr mon, int qemuAgentGetInterfaces(qemuAgentPtr mon, virDomainInterfacePtr **ifaces);
+int qemuAgentCreateBond(qemuAgentPtr mon, + virDomainHostdevSubsysPCIPtr pcisrc); + #endif /* __QEMU_AGENT_H__ */ diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 603360f..584fefb 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2722,6 +2722,46 @@ qemuDomainCleanupRun(virQEMUDriverPtr driver, priv->ncleanupCallbacks_max = 0; }
+/* + * The vm must be locked when any of the following init functions is + * called. + */ +int +qemuDomainInitAdd(virDomainObjPtr vm, + qemuDomainInitCallback cb) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + size_t i; + + VIR_DEBUG("vm=%s, cb=%p", vm->def->name, cb); + + for (i = 0; i < priv->nInitCallbacks; i++) { + if (priv->initCallbacks[i] == cb) + return 0; + } + + if (VIR_RESIZE_N(priv->initCallbacks, + priv->nInitCallbacks_max, + priv->nInitCallbacks, 1) < 0) + return -1; + + priv->initCallbacks[priv->nInitCallbacks++] = cb; + return 0; +} + +void +qemuDomainInitCleanup(virDomainObjPtr vm) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + + VIR_DEBUG("vm=%s", vm->def->name); + + VIR_FREE(priv->cleanupCallbacks); + priv->ncleanupCallbacks = 0; + priv->ncleanupCallbacks_max = 0; +} + + static void qemuDomainGetImageIds(virQEMUDriverConfigPtr cfg, virDomainObjPtr vm, @@ -3083,3 +3123,33 @@ qemuDomainSupportsBlockJobs(virDomainObjPtr vm,
return 0; } + +void +qemuDomainPrepareHostdevInit(virDomainObjPtr vm) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainDefPtr def = vm->def; + int i; + + if (!def->nhostdevs) + return; + + if (!qemuDomainAgentAvailable(vm, false)) + return; + + if (!virDomainObjIsActive(vm)) + return; + + for (i = 0; i < def->nhostdevs; i++) { + virDomainHostdevDefPtr hostdev = def->hostdevs[i]; + virDomainHostdevSubsysPCIPtr pcisrc = &hostdev->source.subsys.u.pci; + + if (hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && + hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { + qemuDomainObjEnterAgent(vm); + qemuAgentCreateBond(priv->agent, pcisrc); + qemuDomainObjExitAgent(vm); + } + } +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 19f4b27..3244ca0 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -403,6 +403,10 @@ void qemuDomainCleanupRemove(virDomainObjPtr vm, void qemuDomainCleanupRun(virQEMUDriverPtr driver, virDomainObjPtr vm);
+int qemuDomainInitAdd(virDomainObjPtr vm, + qemuDomainInitCallback cb); +void qemuDomainInitCleanup(virDomainObjPtr vm); + extern virDomainXMLPrivateDataCallbacks virQEMUDriverPrivateDataCallbacks; extern virDomainXMLNamespace virQEMUDriverDomainXMLNamespace; extern virDomainDefParserConfig virQEMUDriverDomainDefParserConfig; @@ -444,4 +448,7 @@ void qemuDomObjEndAPI(virDomainObjPtr *vm); int qemuDomainAlignMemorySizes(virDomainDefPtr def); void qemuDomainMemoryDeviceAlignSize(virDomainMemoryDefPtr mem);
+void +qemuDomainPrepareHostdevInit(virDomainObjPtr vm); + #endif /* __QEMU_DOMAIN_H__ */ diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index fcc0566..0a72aca 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4444,6 +4444,9 @@ int qemuProcessStart(virConnectPtr conn, hostdev_flags) < 0) goto cleanup;
+ if (qemuDomainInitAdd(vm, qemuDomainPrepareHostdevInit)) + goto cleanup; + VIR_DEBUG("Preparing chr devices"); if (virDomainChrDefForeach(vm->def, true, @@ -5186,6 +5189,7 @@ void qemuProcessStop(virQEMUDriverPtr driver, VIR_QEMU_PROCESS_KILL_NOCHECK));
qemuDomainCleanupRun(driver, vm); + qemuDomainInitCleanup(vm);
/* Stop autodestroy in case guest is restarted */ qemuProcessAutoDestroyRemove(driver, vm); -- 1.9.3

On 17.04.2015 10:53, Chen Fan wrote:
via initialize callback to create bond device.
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_agent.c | 118 ++++++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_agent.h | 10 ++++ src/qemu/qemu_domain.c | 70 ++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 7 +++ src/qemu/qemu_process.c | 4 ++ 5 files changed, 209 insertions(+)
If we go this way, we should introduce a much broader set of interface types to create. In fact, I don't like the idea of qemu-ga mangling the guest's network configuration, especially when there are so many tools for that.

Michal

bond device always need to configure the ip address and route way address. so here we add the interface. xml like: <hostdev mode='subsystem' type='pci' managed='no'> <driver name='vfio' type='bond'/> <bond> <ip address='192.168.122.5' family='ipv4' prefix='24'/> <route family='ipv4' address='0.0.0.0' gateway='192.168.122.1'/> <interface address='52:54:00:e8:c0:f3'/> <interface address='44:33:4c:06:f5:8e'/> </bond> Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- docs/schemas/domaincommon.rng | 21 +++++++++++ src/conf/domain_conf.c | 87 ++++++++++++++++++++++++++++++++++++------- src/conf/domain_conf.h | 24 ++++++++---- src/conf/networkcommon_conf.c | 17 --------- src/conf/networkcommon_conf.h | 17 +++++++++ src/qemu/qemu_agent.c | 58 +++++++++++++++++++++++++++-- 6 files changed, 183 insertions(+), 41 deletions(-) diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng index 0cf82cb..4056cbd 100644 --- a/docs/schemas/domaincommon.rng +++ b/docs/schemas/domaincommon.rng @@ -3779,6 +3779,27 @@ <optional> <element name="bond"> <zeroOrMore> + <element name="ip"> + <attribute name="address"> + <ref name="ipAddr"/> + </attribute> + <optional> + <attribute name="family"> + <ref name="addr-family"/> + </attribute> + </optional> + <optional> + <attribute name="prefix"> + <ref name="ipPrefix"/> + </attribute> + </optional> + <empty/> + </element> + </zeroOrMore> + <zeroOrMore> + <ref name="route"/> + </zeroOrMore> + <zeroOrMore> <element name="interface"> <ref name="pciinterface"/> </element> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 14bcae1..7d1cd3e 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -797,6 +797,8 @@ static virClassPtr virDomainXMLOptionClass; static void virDomainObjDispose(void *obj); static void virDomainObjListDispose(void *obj); static void virDomainXMLOptionClassDispose(void *obj); +static virDomainNetIpDefPtr virDomainNetIpParseXML(xmlNodePtr node); + static int virDomainObjOnceInit(void) { @@ -1914,8 +1916,17 @@ void virDomainHostdevDefClear(virDomainHostdevDefPtr def) } } else if (def->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) { virDomainHostdevSubsysPCIPtr pcisrc = &def->source.subsys.u.pci; - if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) - VIR_FREE(pcisrc->macs); + if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { + for (i = 0; i < pcisrc->net.nmacs; i++) + VIR_FREE(pcisrc->net.macs[i]); + VIR_FREE(pcisrc->net.macs); + for (i = 0; i < pcisrc->net.nips; i++) + VIR_FREE(pcisrc->net.ips[i]); + VIR_FREE(pcisrc->net.ips); + for (i = 0; i < pcisrc->net.nroutes; i++) + VIR_FREE(pcisrc->net.routes[i]); + VIR_FREE(pcisrc->net.routes); + } } break; } @@ -5102,26 +5113,68 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, if (device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { xmlNodePtr *macs = NULL; int n = 0; - int i; + size_t i; char *macStr = NULL; + xmlNodePtr *ipnodes = NULL; + int nipnodes; + xmlNodePtr *routenodes = NULL; + int nroutenodes; if (!(virXPathNode("./bond", ctxt))) { virReportError(VIR_ERR_XML_ERROR, "%s", - _("missing <nond> node specified by bond type")); + _("missing <bond> node specified by bond type")); goto error; } + if ((nipnodes = virXPathNodeSet("./bond/ip", ctxt, &ipnodes)) < 0) + goto error; + + if (nipnodes) { + for (i = 0; i < nipnodes; i++) { + virDomainNetIpDefPtr ip = virDomainNetIpParseXML(ipnodes[i]); + + if (!ip) + goto error; + + if (VIR_APPEND_ELEMENT(pcisrc->net.ips, + pcisrc->net.nips, ip) < 0) { + VIR_FREE(ip); + goto error; + } + 
} + } + + if ((nroutenodes = virXPathNodeSet("./bond/route", ctxt, &routenodes)) < 0) + goto error; + + if (nroutenodes) { + for (i = 0; i < nroutenodes; i++) { + virNetworkRouteDefPtr route = NULL; + + if (!(route = virNetworkRouteDefParseXML(_("Domain hostdev device"), + routenodes[i], + ctxt))) + goto error; + + if (VIR_APPEND_ELEMENT(pcisrc->net.routes, + pcisrc->net.nroutes, route) < 0) { + virNetworkRouteDefFree(route); + goto error; + } + } + } + if ((n = virXPathNodeSet("./bond/interface", ctxt, &macs)) < 0) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("Cannot extract interface nodes")); goto error; } - VIR_FREE(pcisrc->macs); - if (VIR_ALLOC_N(pcisrc->macs, n) < 0) + VIR_FREE(pcisrc->net.macs); + if (VIR_ALLOC_N(pcisrc->net.macs, n) < 0) goto error; - pcisrc->nmac = n; + pcisrc->net.nmacs = n; for (i = 0; i < n; i++) { xmlNodePtr cur_node = macs[i]; @@ -5132,14 +5185,18 @@ virDomainHostdevDefParseXMLSubsys(xmlNodePtr node, "in interface element")); goto error; } - if (virMacAddrParse((const char *)macStr, &pcisrc->macs[i]) < 0) { + + if (VIR_ALLOC(pcisrc->net.macs[i]) < 0) + goto error; + + if (virMacAddrParse((const char *)macStr, pcisrc->net.macs[i]) < 0) { virReportError(VIR_ERR_XML_ERROR, _("unable to parse mac address '%s'"), (const char *)macStr); VIR_FREE(macStr); goto error; } - if (virMacAddrIsMulticast(&pcisrc->macs[i])) { + if (virMacAddrIsMulticast(pcisrc->net.macs[i])) { virReportError(VIR_ERR_XML_ERROR, _("expected unicast mac address, found multicast '%s'"), (const char *)macStr); @@ -18501,13 +18558,17 @@ virDomainHostdevDefFormatSubsys(virBufferPtr buf, virBufferAddLit(buf, "/>\n"); } - if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND && - pcisrc->nmac > 0) { + if (pcisrc->device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { virBufferAddLit(buf, "<bond>\n"); virBufferAdjustIndent(buf, 2); - for (i = 0; i < pcisrc->nmac; i++) { + if (virDomainNetIpsFormat(buf, pcisrc->net.ips, pcisrc->net.nips) < 0) + return -1; + if (virDomainNetRoutesFormat(buf, pcisrc->net.routes, pcisrc->net.nroutes) < 0) + return -1; + + for (i = 0; i < pcisrc->net.nmacs; i++) { virBufferAsprintf(buf, "<interface address='%s'/>\n", - virMacAddrFormat(&pcisrc->macs[i], macstr)); + virMacAddrFormat(pcisrc->net.macs[i], macstr)); } virBufferAdjustIndent(buf, -2); virBufferAddLit(buf, "</bond>\n"); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index e62979f..723f07b 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -447,14 +447,28 @@ struct _virDomainHostdevSubsysUSB { unsigned product; }; +typedef struct _virDomainNetIpDef virDomainNetIpDef; +typedef virDomainNetIpDef *virDomainNetIpDefPtr; +struct _virDomainNetIpDef { + virSocketAddr address; /* ipv4 or ipv6 address */ + unsigned int prefix; /* number of 1 bits in the net mask */ +}; + typedef struct _virDomainHostdevSubsysPCI virDomainHostdevSubsysPCI; typedef virDomainHostdevSubsysPCI *virDomainHostdevSubsysPCIPtr; struct _virDomainHostdevSubsysPCI { virDevicePCIAddress addr; /* host address */ int backend; /* enum virDomainHostdevSubsysPCIBackendType */ int device; /* enum virDomainHostdevSubsysPCIDeviceType */ - size_t nmac; - virMacAddr* macs; + + struct { + size_t nips; + virDomainNetIpDefPtr *ips; + size_t nroutes; + virNetworkRouteDefPtr *routes; + size_t nmacs; + virMacAddrPtr *macs; + } net; }; typedef struct _virDomainHostdevSubsysSCSIHost virDomainHostdevSubsysSCSIHost; @@ -507,12 +521,6 @@ typedef enum { VIR_DOMAIN_HOSTDEV_CAPS_TYPE_LAST } virDomainHostdevCapsType; -typedef struct 
_virDomainNetIpDef virDomainNetIpDef; -typedef virDomainNetIpDef *virDomainNetIpDefPtr; -struct _virDomainNetIpDef { - virSocketAddr address; /* ipv4 or ipv6 address */ - unsigned int prefix; /* number of 1 bits in the net mask */ -}; typedef struct _virDomainHostdevCaps virDomainHostdevCaps; typedef virDomainHostdevCaps *virDomainHostdevCapsPtr; diff --git a/src/conf/networkcommon_conf.c b/src/conf/networkcommon_conf.c index 7b7a851..c11baf6 100644 --- a/src/conf/networkcommon_conf.c +++ b/src/conf/networkcommon_conf.c @@ -32,23 +32,6 @@ #define VIR_FROM_THIS VIR_FROM_NETWORK -struct _virNetworkRouteDef { - char *family; /* ipv4 or ipv6 - default is ipv4 */ - virSocketAddr address; /* Routed Network IP address */ - - /* One or the other of the following two will be used for a given - * Network address, but never both. The parser guarantees this. - * The virSocketAddrGetIpPrefix() can be used to get a - * valid prefix. - */ - virSocketAddr netmask; /* ipv4 - either netmask or prefix specified */ - unsigned int prefix; /* ipv6 - only prefix allowed */ - bool has_prefix; /* prefix= was specified */ - unsigned int metric; /* value for metric (defaults to 1) */ - bool has_metric; /* metric= was specified */ - virSocketAddr gateway; /* gateway IP address for ip-route */ -}; - void virNetworkRouteDefFree(virNetworkRouteDefPtr def) { diff --git a/src/conf/networkcommon_conf.h b/src/conf/networkcommon_conf.h index 1500d0f..a9f58e8 100644 --- a/src/conf/networkcommon_conf.h +++ b/src/conf/networkcommon_conf.h @@ -35,6 +35,23 @@ typedef struct _virNetworkRouteDef virNetworkRouteDef; typedef virNetworkRouteDef *virNetworkRouteDefPtr; +struct _virNetworkRouteDef { + char *family; /* ipv4 or ipv6 - default is ipv4 */ + virSocketAddr address; /* Routed Network IP address */ + + /* One or the other of the following two will be used for a given + * Network address, but never both. The parser guarantees this. + * The virSocketAddrGetIpPrefix() can be used to get a + * valid prefix. 
+ */ + virSocketAddr netmask; /* ipv4 - either netmask or prefix specified */ + unsigned int prefix; /* ipv6 - only prefix allowed */ + bool has_prefix; /* prefix= was specified */ + unsigned int metric; /* value for metric (defaults to 1) */ + bool has_metric; /* metric= was specified */ + virSocketAddr gateway; /* gateway IP address for ip-route */ +}; + void virNetworkRouteDefFree(virNetworkRouteDefPtr def); diff --git a/src/qemu/qemu_agent.c b/src/qemu/qemu_agent.c index b8eba01..f9823e2 100644 --- a/src/qemu/qemu_agent.c +++ b/src/qemu/qemu_agent.c @@ -2208,11 +2208,14 @@ qemuAgentCreateBond(qemuAgentPtr mon, virDomainInterfacePtr *interfaceInfo = NULL; virDomainInterfacePtr interface; virJSONValuePtr new_interface = NULL; + virJSONValuePtr ip_interface = NULL; virJSONValuePtr subInterfaces = NULL; virJSONValuePtr subInterface = NULL; int len; - if (!(pcisrc->nmac || pcisrc->macs)) + if (!(pcisrc->net.nmacs && + pcisrc->net.nips && + pcisrc->net.nroutes)) return ret; len = qemuAgentGetInterfaces(mon, &interfaceInfo); @@ -2231,11 +2234,60 @@ qemuAgentCreateBond(qemuAgentPtr mon, if (virJSONValueObjectAppendString(new_interface, "onboot", "onboot") < 0) goto cleanup; + if (virJSONValueObjectAppendString(new_interface, + "options", + "mode=active-backup miimon=100 updelay=10") < 0) + goto cleanup; + + if (!(ip_interface = virJSONValueNewObject())) + goto cleanup; + + if (pcisrc->net.nips) { + /* the first valid */ + virSocketAddrPtr address = &pcisrc->net.ips[0]->address; + char *ipStr = virSocketAddrFormat(address); + const char *familyStr = NULL; + + if (virJSONValueObjectAppendString(ip_interface, "ip-address", ipStr) < 0) + goto cleanup; + VIR_FREE(ipStr); + + if (VIR_SOCKET_ADDR_IS_FAMILY(address, AF_INET6)) + familyStr = "ipv6"; + else if (VIR_SOCKET_ADDR_IS_FAMILY(address, AF_INET)) + familyStr = "ipv4"; + + if (familyStr) + if (virJSONValueObjectAppendString(ip_interface, "ip-address-type", familyStr) < 0) + goto cleanup; + if (pcisrc->net.ips[0]->prefix != 0) + if (virJSONValueObjectAppendNumberInt(ip_interface, "prefix", + pcisrc->net.ips[0]->prefix) < 0) + goto cleanup; + } + + if (pcisrc->net.nroutes) { + /* the first valid */ + char *addr = NULL; + virSocketAddrPtr gateway = &pcisrc->net.routes[0]->gateway; + + if (!(addr = virSocketAddrFormat(gateway))) + goto cleanup; + if (virJSONValueObjectAppendString(ip_interface, "gateway", addr) < 0) + goto cleanup; + VIR_FREE(addr); + } + + if ((pcisrc->net.nroutes || + pcisrc->net.nips) && + virJSONValueObjectAppend(new_interface, "ip-address", ip_interface) < 0) + goto cleanup; + if (!(subInterfaces = virJSONValueNewArray())) goto cleanup; - for (i = 0; i < pcisrc->nmac; i++) { - virMacAddrFormat(&pcisrc->macs[i], macstr); + for (i = 0; i < pcisrc->net.nmacs; i++) { + virMacAddrFormat(pcisrc->net.macs[i], macstr); interface = findInterfaceByMac(interfaceInfo, len, macstr); if (!interface) { goto cleanup; -- 1.9.3
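With this patch applied, the interface object handed to guest-network-set-interface grows an "ip-address" member and bonding options, so the request from the earlier sketch would become roughly the following. This is again derived from the code, with the addresses taken from the XML example in the commit message and the slave names still assumed:

    {"execute": "guest-network-set-interface",
     "arguments": {
       "interface": {
         "type": "bond",
         "name": "bond0",
         "onboot": "onboot",
         "options": "mode=active-backup miimon=100 updelay=10",
         "ip-address": {
           "ip-address": "192.168.122.5",
           "ip-address-type": "ipv4",
           "prefix": 24,
           "gateway": "192.168.122.1"
         },
         "subInterfaces": [
           {"name": "eth0"},
           {"name": "eth1"}
         ]
       }
     }}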

For bond device, we can support the migrate, we can simple to hot remove the device from source side, and after migration end, we hot add the new device at destination side. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/qemu/qemu_driver.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.c | 7 ++++++ 2 files changed, 64 insertions(+) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 7368145..0ba9e4a 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12353,6 +12353,58 @@ qemuDomainMigrateBegin3(virDomainPtr domain, cookieout, cookieoutlen, flags); } +static int +qemuDomainRemovePciPassThruDevices(virConnectPtr conn, + virDomainObjPtr vm) +{ + virQEMUDriverPtr driver = conn->privateData; + virDomainDeviceDef dev; + virDomainDeviceDefPtr dev_copy = NULL; + virCapsPtr caps = NULL; + int ret = -1; + size_t i; + + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) + goto cleanup; + + if (!qemuMigrationJobIsActive(vm, QEMU_ASYNC_JOB_MIGRATION_OUT)) + goto cleanup; + + /* unplug passthrough bond device */ + for (i = 0; i < vm->def->nhostdevs; i++) { + virDomainHostdevDefPtr hostdev = vm->def->hostdevs[i]; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && + hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { + + dev.type = VIR_DOMAIN_DEVICE_HOSTDEV; + dev.data.hostdev = hostdev; + + dev_copy = virDomainDeviceDefCopy(&dev, vm->def, caps, driver->xmlopt); + if (!dev_copy) + goto cleanup; + + if (qemuDomainDetachHostDevice(driver, vm, dev_copy) < 0) { + virDomainDeviceDefFree(dev_copy); + goto cleanup; + } + + virDomainDeviceDefFree(dev_copy); + if (qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE) < 0) + goto cleanup; + } + } + + ret = 0; + + cleanup: + virObjectUnref(caps); + + return ret; +} + static char * qemuDomainMigrateBegin3Params(virDomainPtr domain, virTypedParameterPtr params, @@ -12688,6 +12740,11 @@ qemuDomainMigratePerform3Params(virDomainPtr dom, return -1; } + if (qemuDomainRemovePciPassThruDevices(dom->conn, vm) < 0) { + qemuDomObjEndAPI(&vm); + return -1; + } + return qemuMigrationPerform(driver, dom->conn, vm, dom_xml, dconnuri, uri, graphicsuri, listenAddress, cookiein, cookieinlen, cookieout, cookieoutlen, diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 611f53a..9ea83df 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -2000,6 +2000,13 @@ qemuMigrationIsAllowed(virQEMUDriverPtr driver, virDomainObjPtr vm, forbid = false; for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDefPtr hostdev = def->hostdevs[i]; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && + hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) + continue; + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB) { forbid = true; -- 1.9.3
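One way to sanity-check the failover behaviour from inside the guest while the pass-through NIC is unplugged during the perform phase (assuming a Linux guest with the standard bonding driver; addresses reuse the earlier example and output is indicative only):

    # both slaves should be listed before migration starts
    grep -E 'Slave Interface|MII Status' /proc/net/bonding/bond0
    # while the pass-through slave is detached, traffic should keep
    # flowing over the remaining virtio slave
    ping -c 3 192.168.122.1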

we add a migrate status for hostdev to specify the device don't need to initialze when VM startup, after migration end, we add the migrate status hostdev, so can support hostdev migration. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- src/conf/domain_conf.c | 3 ++ src/conf/domain_conf.h | 7 ++++ src/qemu/qemu_command.c | 3 ++ src/qemu/qemu_driver.c | 53 +-------------------------- src/qemu/qemu_hotplug.c | 8 +++-- src/qemu/qemu_migration.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--- src/qemu/qemu_migration.h | 4 +++ src/util/virhostdev.c | 3 ++ 8 files changed, 114 insertions(+), 59 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 7d1cd3e..b56c6fa 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -3035,6 +3035,9 @@ virDomainDeviceInfoIterateInternal(virDomainDefPtr def, device.type = VIR_DOMAIN_DEVICE_HOSTDEV; for (i = 0; i < def->nhostdevs; i++) { device.data.hostdev = def->hostdevs[i]; + if (device.data.hostdev->state == VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) + continue; + if (cb(def, &device, def->hostdevs[i]->info, opaque) < 0) return -1; } diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 723f07b..4b7b4c9 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -543,6 +543,12 @@ struct _virDomainHostdevCaps { } u; }; +typedef enum { + VIR_DOMAIN_HOSTDEV_STATE_DEFAULT, + VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE, + + VIR_DOMAIN_HOSTDEV_STATE_LAST +} virDomainHostdevState; /* basic device for direct passthrough */ struct _virDomainHostdevDef { @@ -559,6 +565,7 @@ struct _virDomainHostdevDef { } source; virDomainHostdevOrigStates origstates; virDomainDeviceInfoPtr info; /* Guest address */ + int state; }; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index e7e0937..dc5245a 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -10365,6 +10365,9 @@ qemuBuildCommandLine(virConnectPtr conn, virDomainHostdevDefPtr hostdev = def->hostdevs[i]; char *devstr; + if (hostdev->state == VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) + continue; + if (hostdev->info->bootIndex) { if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || (hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 0ba9e4a..4724171 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12353,57 +12353,6 @@ qemuDomainMigrateBegin3(virDomainPtr domain, cookieout, cookieoutlen, flags); } -static int -qemuDomainRemovePciPassThruDevices(virConnectPtr conn, - virDomainObjPtr vm) -{ - virQEMUDriverPtr driver = conn->privateData; - virDomainDeviceDef dev; - virDomainDeviceDefPtr dev_copy = NULL; - virCapsPtr caps = NULL; - int ret = -1; - size_t i; - - if (!(caps = virQEMUDriverGetCapabilities(driver, false))) - goto cleanup; - - if (!qemuMigrationJobIsActive(vm, QEMU_ASYNC_JOB_MIGRATION_OUT)) - goto cleanup; - - /* unplug passthrough bond device */ - for (i = 0; i < vm->def->nhostdevs; i++) { - virDomainHostdevDefPtr hostdev = vm->def->hostdevs[i]; - - if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && - hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && - hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) { - - dev.type = VIR_DOMAIN_DEVICE_HOSTDEV; - dev.data.hostdev = hostdev; - - dev_copy = virDomainDeviceDefCopy(&dev, vm->def, caps, driver->xmlopt); - if (!dev_copy) - goto cleanup; 
- - if (qemuDomainDetachHostDevice(driver, vm, dev_copy) < 0) { - virDomainDeviceDefFree(dev_copy); - goto cleanup; - } - - virDomainDeviceDefFree(dev_copy); - if (qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE) < 0) - goto cleanup; - } - } - - ret = 0; - - cleanup: - virObjectUnref(caps); - - return ret; -} static char * qemuDomainMigrateBegin3Params(virDomainPtr domain, @@ -12740,7 +12689,7 @@ qemuDomainMigratePerform3Params(virDomainPtr dom, return -1; } - if (qemuDomainRemovePciPassThruDevices(dom->conn, vm) < 0) { + if (qemuDomainMigratePciPassThruDevices(driver, vm, false) < 0) { qemuDomObjEndAPI(&vm); return -1; } diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index f07c54d..13a7338 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1239,8 +1239,9 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver, virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); unsigned int flags = 0; - if (VIR_REALLOC_N(vm->def->hostdevs, vm->def->nhostdevs + 1) < 0) - return -1; + if (hostdev->state != VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) + if (VIR_REALLOC_N(vm->def->hostdevs, vm->def->nhostdevs + 1) < 0) + return -1; if (!cfg->relaxedACS) flags |= VIR_HOSTDEV_STRICT_ACS_CHECK; @@ -1344,7 +1345,8 @@ qemuDomainAttachHostPCIDevice(virQEMUDriverPtr driver, if (ret < 0) goto error; - vm->def->hostdevs[vm->def->nhostdevs++] = hostdev; + if (hostdev->state != VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) + vm->def->hostdevs[vm->def->nhostdevs++] = hostdev; VIR_FREE(devstr); VIR_FREE(configfd_name); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 9ea83df..291cb9f 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -2001,10 +2001,7 @@ qemuMigrationIsAllowed(virQEMUDriverPtr driver, virDomainObjPtr vm, for (i = 0; i < def->nhostdevs; i++) { virDomainHostdevDefPtr hostdev = def->hostdevs[i]; - if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && - hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && - hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && - hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) + if (hostdev->state == VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) continue; if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS || @@ -2629,6 +2626,80 @@ qemuMigrationCleanup(virDomainObjPtr vm, } +static void +qemuMigrationSetStateForHostdev(virDomainDefPtr def, + int state) +{ + virDomainHostdevDefPtr hostdev; + size_t i; + + if (!def) + return; + + for (i = 0; i < def->nhostdevs; i++) { + hostdev = def->hostdevs[i]; + + if (hostdev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS && + hostdev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI && + hostdev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO && + hostdev->source.subsys.u.pci.device == VIR_DOMAIN_HOSTDEV_PCI_DEVICE_BOND) + hostdev->state = state; + } +} + + +int +qemuDomainMigratePciPassThruDevices(virQEMUDriverPtr driver, + virDomainObjPtr vm, + bool isPlug) +{ + virDomainDeviceDef dev; + virDomainDeviceDefPtr dev_copy = NULL; + virDomainHostdevDefPtr hostdev; + virCapsPtr caps = NULL; + int ret = -1; + int i; + + if (!(caps = virQEMUDriverGetCapabilities(driver, false))) + goto cleanup; + + /* plug/unplug passthrough bond device */ + for (i = vm->def->nhostdevs; i >= 0; i--) { + hostdev = vm->def->hostdevs[i]; + + if (hostdev->state == VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) { + if (!isPlug) { + dev.type = VIR_DOMAIN_DEVICE_HOSTDEV; + 
dev.data.hostdev = hostdev; + + dev_copy = virDomainDeviceDefCopy(&dev, vm->def, caps, driver->xmlopt); + if (!dev_copy) + goto cleanup; + + if (qemuDomainDetachHostDevice(driver, vm, dev_copy) < 0) { + virDomainDeviceDefFree(dev_copy); + goto cleanup; + } + virDomainDeviceDefFree(dev_copy); + } else { + qemuMigrationSetStateForHostdev(vm->def, VIR_DOMAIN_HOSTDEV_STATE_DEFAULT); + if (qemuDomainAttachHostDevice(NULL, driver, vm, hostdev) < 0) + goto cleanup; + } + if (qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE) < 0) + goto cleanup; + } + } + + ret = 0; + + cleanup: + virObjectUnref(caps); + + return ret; +} + + /* The caller is supposed to lock the vm and start a migration job. */ static char *qemuMigrationBeginPhase(virQEMUDriverPtr driver, @@ -2662,6 +2733,8 @@ static char if (priv->job.asyncJob == QEMU_ASYNC_JOB_MIGRATION_OUT) qemuMigrationJobSetPhase(driver, vm, QEMU_MIGRATION_PHASE_BEGIN3); + qemuMigrationSetStateForHostdev(vm->def, VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE); + if (!qemuMigrationIsAllowed(driver, vm, NULL, true, abort_on_error)) goto cleanup; @@ -2885,6 +2958,8 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, if (!(caps = virQEMUDriverGetCapabilities(driver, false))) goto cleanup; + qemuMigrationSetStateForHostdev(*def, VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE); + if (!qemuMigrationIsAllowed(driver, NULL, *def, true, abort_on_error)) goto cleanup; @@ -5315,6 +5390,13 @@ qemuMigrationFinish(virQEMUDriverPtr driver, goto endjob; } + /* hotplug previous mark migrate hostdev */ + if (qemuDomainMigratePciPassThruDevices(driver, vm, true) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("passthrough for hostdev failed")); + goto endjob; + } + /* Guest is successfully running, so cancel previous auto destroy */ qemuProcessAutoDestroyRemove(driver, vm); } else if (!(flags & VIR_MIGRATE_OFFLINE)) { @@ -5331,6 +5413,8 @@ qemuMigrationFinish(virQEMUDriverPtr driver, VIR_WARN("Unable to encode migration cookie"); endjob: + qemuMigrationSetStateForHostdev(vm->def, VIR_DOMAIN_HOSTDEV_STATE_DEFAULT); + qemuMigrationJobFinish(driver, vm); if (!vm->persistent && !virDomainObjIsActive(vm)) qemuDomainRemoveInactive(driver, vm); diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index 1726455..fa21752 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -177,4 +177,8 @@ int qemuMigrationToFile(virQEMUDriverPtr driver, virDomainObjPtr vm, ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(5) ATTRIBUTE_RETURN_CHECK; +int qemuDomainMigratePciPassThruDevices(virQEMUDriverPtr driver, + virDomainObjPtr vm, + bool isPlug); + #endif /* __QEMU_MIGRATION_H__ */ diff --git a/src/util/virhostdev.c b/src/util/virhostdev.c index f583e54..4b6152a 100644 --- a/src/util/virhostdev.c +++ b/src/util/virhostdev.c @@ -206,6 +206,9 @@ virHostdevGetPCIHostDeviceList(virDomainHostdevDefPtr *hostdevs, int nhostdevs) virDomainHostdevSubsysPCIPtr pcisrc = &hostdev->source.subsys.u.pci; virPCIDevicePtr dev; + if (hostdev->state == VIR_DOMAIN_HOSTDEV_STATE_READY_FOR_MIGRATE) + continue; + if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS) continue; if (hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI) -- 1.9.3
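To tie patches 6 and 7 together, the destination XML passed with --xml would typically differ from the source only in which pass-through NIC it names. A sketch of the invocation, where the guest name, host name, file name and the destination-side PCI/MAC values are all placeholders:

    # dest.xml: same domain XML, but the bond <hostdev> points at the
    # destination host's VF in <source>, and <bond>/<interface address='...'/>
    # lists that VF's MAC
    virsh migrate --live --persistent demo-guest qemu+ssh://dst-host/system \
          --xml dest.xml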

These patches are for libvirt to support migration with passthrough devices using existing features. Chen Fan (3): qemu-agent: add guest-network-set-interface command qemu-agent: add guest-network-delete-interface command qemu-agent: add notify for qemu-ga boot configure | 16 +++ qga/commands-posix.c | 312 +++++++++++++++++++++++++++++++++++++++++++++++++++ qga/commands-win32.c | 13 +++ qga/main.c | 13 +++ qga/qapi-schema.json | 65 +++++++++++ 5 files changed, 419 insertions(+) -- 1.9.3

Nowadays, qemu has supported physical NIC hotplug for high network throughput. but it's in conflict with live migration feature, to keep network connectivity, we could to create bond device interface which provides a mechanism for enslaving multiple network interfaces into a single "bond" interface. the active-backup mode can be used for an automatic switch. so this patch is adding a guest-network-set-interface command for creating bond device. so the management can easy to create a bond device dynamically when guest running. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- configure | 16 ++++ qga/commands-posix.c | 261 +++++++++++++++++++++++++++++++++++++++++++++++++++ qga/commands-win32.c | 7 ++ qga/qapi-schema.json | 54 +++++++++++ 4 files changed, 338 insertions(+) diff --git a/configure b/configure index f185dd0..ebfcc6a 100755 --- a/configure +++ b/configure @@ -3618,6 +3618,18 @@ if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \ fi ########################################## +# Do we need netcf +netcf=no +cat > $TMPC << EOF +#include <netcf.h> +int main(void) { return 0; } +EOF +if compile_prog "" "-lnetcf" ; then + netcf=yes + libs_qga="$libs_qga -lnetcf" +fi + +########################################## # spice probe if test "$spice" != "no" ; then cat > $TMPC << EOF @@ -4697,6 +4709,10 @@ if test "$spice" = "yes" ; then echo "CONFIG_SPICE=y" >> $config_host_mak fi +if test "$netcf" = "yes" ; then + echo "CONFIG_NETCF=y" >> $config_host_mak +fi + if test "$smartcard_nss" = "yes" ; then echo "CONFIG_SMARTCARD_NSS=y" >> $config_host_mak echo "NSS_LIBS=$nss_libs" >> $config_host_mak diff --git a/qga/commands-posix.c b/qga/commands-posix.c index f6f3e3c..5ee7949 100644 --- a/qga/commands-posix.c +++ b/qga/commands-posix.c @@ -46,6 +46,10 @@ extern char **environ; #include <sys/socket.h> #include <net/if.h> +#ifdef CONFIG_NETCF +#include <netcf.h> +#endif + #ifdef FIFREEZE #define CONFIG_FSFREEZE #endif @@ -1719,6 +1723,263 @@ error: return NULL; } +#ifdef CONFIG_NETCF +static const char *interface_type_string[] = { + "bond", +}; + +static const char *ip_address_type_string[] = { + "ipv4", + "ipv6", +}; + +static char *parse_options(const char *str, const char *needle) +{ + char *start, *end, *buffer = NULL; + char *ret = NULL; + + buffer = g_strdup(str); + start = buffer; + if ((start = strstr(start, needle))) { + start += strlen(needle); + end = strchr(start, ' '); + if (end) { + *end = '\0'; + } + if (strlen(start) == 0) { + goto cleanup; + } + ret = g_strdup(start); + } + +cleanup: + g_free(buffer); + return ret; +} + +/** + * @buffer: xml string data to be formatted + * @indent: indent number relative to first line + * + */ +static void adjust_indent(char **buffer, int indent) +{ + char spaces[1024]; + int i; + + if (!*buffer) { + return; + } + + if (indent < 0 || indent >= 1024) { + return; + } + memset(spaces, 0, sizeof(spaces)); + for (i = 0; i < indent; i++) { + spaces[i] = ' '; + } + + sprintf(*buffer + strlen(*buffer), "%s", spaces); +} + +static char *create_bond_interface(GuestNetworkInterface2 *interface) +{ + char *target_xml; + + target_xml = g_malloc0(1024); + if (!target_xml) { + return NULL; + } + + sprintf(target_xml, "<interface type='%s' name='%s'>\n", + interface_type_string[interface->type], interface->name); + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "<start mode='%s'/>\n", + interface->has_onboot ? 
interface->onboot : "none"); + if (interface->has_ip_address) { + GuestIpAddress *address_item = interface->ip_address; + + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "<protocol family='%s'>\n", + ip_address_type_string[address_item->ip_address_type]); + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<ip address='%s' prefix='%" PRId64 "'/>\n", + address_item->ip_address, address_item->prefix); + if (address_item->has_gateway) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<route gateway='%s'/>\n", + address_item->gateway); + } + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "%s\n", "</protocol>"); + } + + adjust_indent(&target_xml, 2); + if (interface->has_options) { + char *value; + + value = parse_options(interface->options, "mode="); + if (value) { + sprintf(target_xml + strlen(target_xml), "<bond mode='%s'>\n", + value); + g_free(value); + } else { + sprintf(target_xml + strlen(target_xml), "%s\n", "<bond>"); + } + + value = parse_options(interface->options, "miimon="); + if (value) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<miimon freq='%s'", + value); + g_free(value); + + value = parse_options(interface->options, "updelay="); + if (value) { + sprintf(target_xml + strlen(target_xml), " updelay='%s'", + value); + g_free(value); + } + value = parse_options(interface->options, "downdelay="); + if (value) { + sprintf(target_xml + strlen(target_xml), " downdelay='%s'", + value); + g_free(value); + } + value = parse_options(interface->options, "use_carrier="); + if (value) { + sprintf(target_xml + strlen(target_xml), " carrier='%s'", + value); + g_free(value); + } + + sprintf(target_xml + strlen(target_xml), "%s\n", "/>"); + } + + value = parse_options(interface->options, "arp_interval="); + if (value) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<arpmon interval='%s'", + value); + g_free(value); + + value = parse_options(interface->options, "arp_ip_target="); + if (value) { + sprintf(target_xml + strlen(target_xml), " target='%s'", + value); + g_free(value); + } + + value = parse_options(interface->options, "arp_validate="); + if (value) { + sprintf(target_xml + strlen(target_xml), " validate='%s'", + value); + g_free(value); + } + + sprintf(target_xml + strlen(target_xml), "%s\n", "/>"); + } + } else { + sprintf(target_xml + strlen(target_xml), "%s\n", "<bond>"); + } + + if (interface->has_subInterfaces) { + GuestNetworkInterfaceList *head = interface->subInterfaces; + + for (; head; head = head->next) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), + "<interface type='ethernet' name='%s'/>\n", + head->value->name); + } + } + + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "%s\n", "</bond>"); + sprintf(target_xml + strlen(target_xml), "%s\n", "</interface>"); + + return target_xml; +} + +static struct netcf *netcf; + +static void create_interface(GuestNetworkInterface2 *interface, Error **errp) +{ + int ret = -1; + struct netcf_if *iface; + unsigned int flags = 0; + char *target_xml; + + /* open netcf */ + if (netcf == NULL) { + if (ncf_init(&netcf, NULL) != 0) { + error_setg(errp, "netcf init failed"); + return; + } + } + + if (interface->type != GUEST_INTERFACE_TYPE_BOND) { + error_setg(errp, "interface type is not supported, only support 'bond' type"); + return; + } + + target_xml = create_bond_interface(interface); + if (!target_xml) { + 
error_setg(errp, "no enough memory spaces"); + return; + } + + iface = ncf_define(netcf, target_xml); + if (!iface) { + error_setg(errp, "netcf interface define failed"); + g_free(target_xml); + goto cleanup; + } + + g_free(target_xml); + + if (ncf_if_status(iface, &flags) < 0) { + error_setg(errp, "netcf interface get status failed"); + goto cleanup; + } + + if (flags & NETCF_IFACE_ACTIVE) { + error_setg(errp, "interface is already running"); + goto cleanup; + } + + ret = ncf_if_up(iface); + if (ret < 0) { + error_setg(errp, "netcf interface up failed"); + goto cleanup; + } + + cleanup: + ncf_if_free(iface); +} + +int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + Error *local_err = NULL; + + create_interface(interface, &local_err); + if (local_err != NULL) { + error_propagate(errp, local_err); + return -1; + } + + return 0; +} +#else +int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} +#endif + #define SYSCONF_EXACT(name, errp) sysconf_exact((name), #name, (errp)) static long sysconf_exact(int name, const char *name_str, Error **errp) diff --git a/qga/commands-win32.c b/qga/commands-win32.c index 3bcbeae..4c14514 100644 --- a/qga/commands-win32.c +++ b/qga/commands-win32.c @@ -446,6 +446,13 @@ int64_t qmp_guest_set_vcpus(GuestLogicalProcessorList *vcpus, Error **errp) return -1; } +int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} + /* add unsupported commands to the blacklist */ GList *ga_command_blacklist_init(GList *blacklist) { diff --git a/qga/qapi-schema.json b/qga/qapi-schema.json index 376e79f..77f499b 100644 --- a/qga/qapi-schema.json +++ b/qga/qapi-schema.json @@ -556,6 +556,7 @@ { 'type': 'GuestIpAddress', 'data': {'ip-address': 'str', 'ip-address-type': 'GuestIpAddressType', + '*gateway': 'str', 'prefix': 'int'} } ## @@ -575,6 +576,43 @@ '*ip-addresses': ['GuestIpAddress'] } } ## +# @GuestInterfaceType: +# +# An enumeration of supported interface types +# +# @bond: bond device +# +# Since: 2.3 +## +{ 'enum': 'GuestInterfaceType', + 'data': [ 'bond' ] } + +## +# @GuestNetworkInterface2: +# +# @type: the interface type which supported in enum GuestInterfaceType. +# +# @name: the interface name. +# +# @onboot: the interface start model. +# +# @ip-address: IP address. +# +# @options: the options argument. +# +# @subInterfaces: the slave interfaces. +# +# Since: 2.3 +## +{ 'type': 'GuestNetworkInterface2', + 'data': {'type': 'GuestInterfaceType', + 'name': 'str', + '*onboot': 'str', + '*ip-address': 'GuestIpAddress', + '*options': 'str', + '*subInterfaces': ['GuestNetworkInterface'] } } + +## # @guest-network-get-interfaces: # # Get list of guest IP addresses, MAC addresses @@ -588,6 +626,22 @@ 'returns': ['GuestNetworkInterface'] } ## +# @guest-network-set-interface: +# +# Set guest network interface +# +# return: 0: call successful. +# +# -1: call failed. +# +# +# Since: 2.3 +## +{ 'command': 'guest-network-set-interface', + 'data' : {'interface': 'GuestNetworkInterface2' }, + 'returns': 'int' } + +## # @GuestLogicalProcessor: # # @logical-id: Arbitrary guest-specific unique identifier of the VCPU. -- 1.9.3
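For illustration, here is roughly how the new handler could be driven from C inside the agent (for example from a test harness). This is only a sketch: it assumes the usual QAPI-generated struct layout (optional members paired with a has_* flag, lists chained through ->next) and an assumed generated header name; the interface names and bonding options are made-up examples.

#include "qga-qmp-commands.h"   /* assumed name of the generated qga header */

static void example_create_bond(Error **errp)
{
    GuestNetworkInterface eth0 = { .name = (char *)"eth0" };
    GuestNetworkInterface eth1 = { .name = (char *)"eth1" };
    GuestNetworkInterfaceList slave1 = { .value = &eth1, .next = NULL };
    GuestNetworkInterfaceList slave0 = { .value = &eth0, .next = &slave1 };
    GuestNetworkInterface2 bond = {
        .type = GUEST_INTERFACE_TYPE_BOND,
        .name = (char *)"bond0",
        .has_onboot = true,
        .onboot = (char *)"onboot",
        .has_options = true,
        /* active-backup so the emulated NIC takes over while the
         * passthrough NIC is unplugged for migration */
        .options = (char *)"mode=active-backup miimon=100",
        .has_subInterfaces = true,
        .subInterfaces = &slave0,
    };

    qmp_guest_network_set_interface(&bond, errp);
}

On the wire this corresponds to a guest-network-set-interface call whose 'interface' argument mirrors the GuestNetworkInterface2 schema added above; create_bond_interface() then turns it into netcf interface XML and ncf_define()/ncf_if_up() bring the bond up in the guest.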

On 17/04/15 11:53, Chen Fan wrote:
Nowadays, qemu supports physical NIC hotplug for high network throughput, but that conflicts with the live migration feature. To keep network connectivity, we can create a bond device interface, which provides a mechanism for enslaving multiple network interfaces into a single "bond" interface; the active-backup mode can be used for automatic switching. So this patch adds a guest-network-set-interface command for creating a bond device, so that management can easily create a bond device dynamically while the guest is running.
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- configure | 16 ++++ qga/commands-posix.c | 261 +++++++++++++++++++++++++++++++++++++++++++++++++++ qga/commands-win32.c | 7 ++ qga/qapi-schema.json | 54 +++++++++++ 4 files changed, 338 insertions(+)
diff --git a/configure b/configure index f185dd0..ebfcc6a 100755 --- a/configure +++ b/configure @@ -3618,6 +3618,18 @@ if test "$darwin" != "yes" -a "$mingw32" != "yes" -a "$solaris" != yes -a \ fi
########################################## +# Do we need netcf +netcf=no +cat > $TMPC << EOF +#include <netcf.h> +int main(void) { return 0; } +EOF +if compile_prog "" "-lnetcf" ; then + netcf=yes + libs_qga="$libs_qga -lnetcf" +fi + +########################################## # spice probe if test "$spice" != "no" ; then cat > $TMPC << EOF @@ -4697,6 +4709,10 @@ if test "$spice" = "yes" ; then echo "CONFIG_SPICE=y" >> $config_host_mak fi
+if test "$netcf" = "yes" ; then + echo "CONFIG_NETCF=y" >> $config_host_mak +fi + if test "$smartcard_nss" = "yes" ; then echo "CONFIG_SMARTCARD_NSS=y" >> $config_host_mak echo "NSS_LIBS=$nss_libs" >> $config_host_mak diff --git a/qga/commands-posix.c b/qga/commands-posix.c index f6f3e3c..5ee7949 100644 --- a/qga/commands-posix.c +++ b/qga/commands-posix.c @@ -46,6 +46,10 @@ extern char **environ; #include <sys/socket.h> #include <net/if.h>
+#ifdef CONFIG_NETCF +#include <netcf.h> +#endif + #ifdef FIFREEZE #define CONFIG_FSFREEZE #endif @@ -1719,6 +1723,263 @@ error: return NULL; }
+#ifdef CONFIG_NETCF +static const char *interface_type_string[] = { + "bond", +}; + +static const char *ip_address_type_string[] = { + "ipv4", + "ipv6", +}; + +static char *parse_options(const char *str, const char *needle) +{ + char *start, *end, *buffer = NULL; + char *ret = NULL; + + buffer = g_strdup(str); + start = buffer; + if ((start = strstr(start, needle))) { + start += strlen(needle); + end = strchr(start, ' '); + if (end) { + *end = '\0'; + } + if (strlen(start) == 0) { + goto cleanup; + } + ret = g_strdup(start); + } + +cleanup: + g_free(buffer); + return ret; +} + +/** + * @buffer: xml string data to be formatted + * @indent: indent number relative to first line + * + */ +static void adjust_indent(char **buffer, int indent) +{ + char spaces[1024]; + int i; + + if (!*buffer) { + return; + } + + if (indent < 0 || indent >= 1024) { + return; + } + memset(spaces, 0, sizeof(spaces)); + for (i = 0; i < indent; i++) { + spaces[i] = ' '; + } + + sprintf(*buffer + strlen(*buffer), "%s", spaces); +} + +static char *create_bond_interface(GuestNetworkInterface2 *interface) +{ + char *target_xml; + + target_xml = g_malloc0(1024); + if (!target_xml) { + return NULL; + } + + sprintf(target_xml, "<interface type='%s' name='%s'>\n", + interface_type_string[interface->type], interface->name); + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "<start mode='%s'/>\n", + interface->has_onboot ? interface->onboot : "none"); + if (interface->has_ip_address) { + GuestIpAddress *address_item = interface->ip_address; + + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "<protocol family='%s'>\n", + ip_address_type_string[address_item->ip_address_type]); + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<ip address='%s' prefix='%" PRId64 "'/>\n", + address_item->ip_address, address_item->prefix); + if (address_item->has_gateway) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<route gateway='%s'/>\n", + address_item->gateway); + } + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "%s\n", "</protocol>"); + } + + adjust_indent(&target_xml, 2); + if (interface->has_options) { + char *value; + + value = parse_options(interface->options, "mode="); + if (value) { + sprintf(target_xml + strlen(target_xml), "<bond mode='%s'>\n", + value); + g_free(value); + } else { + sprintf(target_xml + strlen(target_xml), "%s\n", "<bond>"); + } + + value = parse_options(interface->options, "miimon="); + if (value) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<miimon freq='%s'", + value); + g_free(value); + + value = parse_options(interface->options, "updelay="); + if (value) { + sprintf(target_xml + strlen(target_xml), " updelay='%s'", + value); + g_free(value); + } + value = parse_options(interface->options, "downdelay="); + if (value) { + sprintf(target_xml + strlen(target_xml), " downdelay='%s'", + value); + g_free(value); + } + value = parse_options(interface->options, "use_carrier="); + if (value) { + sprintf(target_xml + strlen(target_xml), " carrier='%s'", + value); + g_free(value); + } + + sprintf(target_xml + strlen(target_xml), "%s\n", "/>"); + } + + value = parse_options(interface->options, "arp_interval="); + if (value) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), "<arpmon interval='%s'", + value); + g_free(value); + + value = parse_options(interface->options, "arp_ip_target="); + if (value) { + sprintf(target_xml 
+ strlen(target_xml), " target='%s'", + value); + g_free(value); + } + + value = parse_options(interface->options, "arp_validate="); + if (value) { + sprintf(target_xml + strlen(target_xml), " validate='%s'", + value); + g_free(value); + } + + sprintf(target_xml + strlen(target_xml), "%s\n", "/>"); + } + } else { + sprintf(target_xml + strlen(target_xml), "%s\n", "<bond>"); + } + + if (interface->has_subInterfaces) { + GuestNetworkInterfaceList *head = interface->subInterfaces; + + for (; head; head = head->next) { + adjust_indent(&target_xml, 4); + sprintf(target_xml + strlen(target_xml), + "<interface type='ethernet' name='%s'/>\n", + head->value->name); + } + } + + adjust_indent(&target_xml, 2); + sprintf(target_xml + strlen(target_xml), "%s\n", "</bond>"); + sprintf(target_xml + strlen(target_xml), "%s\n", "</interface>"); + + return target_xml; +} + +static struct netcf *netcf; + +static void create_interface(GuestNetworkInterface2 *interface, Error **errp) +{ + int ret = -1; + struct netcf_if *iface; + unsigned int flags = 0; + char *target_xml; + + /* open netcf */ + if (netcf == NULL) { + if (ncf_init(&netcf, NULL) != 0) { + error_setg(errp, "netcf init failed"); + return; + } + } + + if (interface->type != GUEST_INTERFACE_TYPE_BOND) { + error_setg(errp, "interface type is not supported, only support 'bond' type"); + return; + } + + target_xml = create_bond_interface(interface); + if (!target_xml) { + error_setg(errp, "no enough memory spaces"); + return; + } + + iface = ncf_define(netcf, target_xml); + if (!iface) { + error_setg(errp, "netcf interface define failed"); + g_free(target_xml); + goto cleanup; + } + + g_free(target_xml); + + if (ncf_if_status(iface, &flags) < 0) { + error_setg(errp, "netcf interface get status failed"); + goto cleanup; + } + + if (flags & NETCF_IFACE_ACTIVE) { + error_setg(errp, "interface is already running"); + goto cleanup; + } + + ret = ncf_if_up(iface); + if (ret < 0) { + error_setg(errp, "netcf interface up failed"); + goto cleanup; + } + + cleanup: + ncf_if_free(iface); +} + +int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + Error *local_err = NULL; + + create_interface(interface, &local_err); + if (local_err != NULL) { + error_propagate(errp, local_err); + return -1; + } + + return 0; +} +#else +int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} +#endif + #define SYSCONF_EXACT(name, errp) sysconf_exact((name), #name, (errp))
static long sysconf_exact(int name, const char *name_str, Error **errp) diff --git a/qga/commands-win32.c b/qga/commands-win32.c index 3bcbeae..4c14514 100644 --- a/qga/commands-win32.c +++ b/qga/commands-win32.c @@ -446,6 +446,13 @@ int64_t qmp_guest_set_vcpus(GuestLogicalProcessorList *vcpus, Error **errp) return -1; }
+int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, + Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} + /* add unsupported commands to the blacklist */ GList *ga_command_blacklist_init(GList *blacklist) { diff --git a/qga/qapi-schema.json b/qga/qapi-schema.json index 376e79f..77f499b 100644 --- a/qga/qapi-schema.json +++ b/qga/qapi-schema.json @@ -556,6 +556,7 @@ { 'type': 'GuestIpAddress', 'data': {'ip-address': 'str', 'ip-address-type': 'GuestIpAddressType', + '*gateway': 'str', 'prefix': 'int'} }
## @@ -575,6 +576,43 @@ '*ip-addresses': ['GuestIpAddress'] } }
## +# @GuestInterfaceType: +# +# An enumeration of supported interface types +# +# @bond: bond device +# +# Since: 2.3 +## +{ 'enum': 'GuestInterfaceType', + 'data': [ 'bond' ] } + +## +# @GuestNetworkInterface2: +# +# @type: the interface type which supported in enum GuestInterfaceType. +# +# @name: the interface name. +# +# @onboot: the interface start model. +# +# @ip-address: IP address. +# +# @options: the options argument. +# +# @subInterfaces: the slave interfaces. +# +# Since: 2.3 +## +{ 'type': 'GuestNetworkInterface2', + 'data': {'type': 'GuestInterfaceType', + 'name': 'str', + '*onboot': 'str', + '*ip-address': 'GuestIpAddress', + '*options': 'str', + '*subInterfaces': ['GuestNetworkInterface'] } } + +## # @guest-network-get-interfaces: # # Get list of guest IP addresses, MAC addresses @@ -588,6 +626,22 @@ 'returns': ['GuestNetworkInterface'] }
## +# @guest-network-set-interface: +# +# Set guest network interface +# +# return: 0: call successful. +# +# -1: call failed. +# +# +# Since: 2.3 +## +{ 'command': 'guest-network-set-interface', + 'data' : {'interface': 'GuestNetworkInterface2' }, + 'returns': 'int' } I thought that using built-in types as the return value is deprecated. Let's return a dictionary from guest-network-set(get)-interface + +## # @GuestLogicalProcessor: # # @logical-id: Arbitrary guest-specific unique identifier of the VCPU.

On 05/21/2015 07:52 AM, Olga Krishtal wrote:
On 17/04/15 11:53, Chen Fan wrote:
Nowadays, qemu supports physical NIC hotplug for high network throughput, but that conflicts with the live migration feature. To keep network connectivity, we can create a bond device interface, which provides a mechanism for enslaving multiple network interfaces into a single "bond" interface; the active-backup mode can be used for automatic switching. So this patch adds a guest-network-set-interface command for creating a bond device, so that management can easily create a bond device dynamically while the guest is running.
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> ---
@@ -588,6 +626,22 @@ 'returns': ['GuestNetworkInterface'] } ## +# @guest-network-set-interface: +# +# Set guest network interface +# +# return: 0: call successful. +# +# -1: call failed. +# +# +# Since: 2.3
You've missed 2.3; if we still want this, it will need to be updated to 2.4.
+## +{ 'command': 'guest-network-set-interface', + 'data' : {'interface': 'GuestNetworkInterface2' }, + 'returns': 'int' } I thought that using built-in types as the return value is deprecated. Let's return a dictionary from guest-network-set(get)-interface
Correct. Returning a non-dictionary now causes the generator to barf if you don't update a whitelist. But you don't even need a return value - QGA is already set up to return {} on success and an error message on failure, if you have nothing further to add. Just omit 'returns' from your 'command' definition. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Add a corresponding command to guest-network-set-interface. Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- qga/commands-posix.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++ qga/commands-win32.c | 6 ++++++ qga/qapi-schema.json | 11 +++++++++++ 3 files changed, 68 insertions(+) diff --git a/qga/commands-posix.c b/qga/commands-posix.c index 5ee7949..058085f 100644 --- a/qga/commands-posix.c +++ b/qga/commands-posix.c @@ -1971,6 +1971,51 @@ int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, return 0; } + +int64_t qmp_guest_network_delete_interface(const char *name, Error **errp) +{ + struct netcf_if *iface; + int ret = -1; + unsigned int flags = 0; + + /* open netcf */ + if (netcf == NULL) { + if (ncf_init(&netcf, NULL) != 0) { + error_setg(errp, "netcf init failed"); + return ret; + } + } + + iface = ncf_lookup_by_name(netcf, name); + if (!iface) { + error_setg(errp, "couldn't find interface named '%s'", name); + return ret; + } + + if (ncf_if_status(iface, &flags) < 0) { + error_setg(errp, "netcf interface get status failed"); + goto cleanup; + } + + if (flags & NETCF_IFACE_ACTIVE) { + ret = ncf_if_down(iface); + if (ret < 0) { + error_setg(errp, "netcf interface stop failed"); + goto cleanup; + } + } + + ret = ncf_if_undefine(iface); + if (ret < 0) { + error_setg(errp, "netcf interface delete failed"); + goto cleanup; + } + + ret = 0; +cleanup: + ncf_if_free(iface); + return ret; +} #else int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, Error **errp) @@ -1978,6 +2023,12 @@ int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, error_set(errp, QERR_UNSUPPORTED); return -1; } + +int64_t qmp_guest_network_delete_interface(const char *name, Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} #endif #define SYSCONF_EXACT(name, errp) sysconf_exact((name), #name, (errp)) diff --git a/qga/commands-win32.c b/qga/commands-win32.c index 4c14514..52f6e47 100644 --- a/qga/commands-win32.c +++ b/qga/commands-win32.c @@ -453,6 +453,12 @@ int64_t qmp_guest_network_set_interface(GuestNetworkInterface2 *interface, return -1; } +int64_t qmp_guest_network_delete_interface(const char *name, Error **errp) +{ + error_set(errp, QERR_UNSUPPORTED); + return -1; +} + /* add unsupported commands to the blacklist */ GList *ga_command_blacklist_init(GList *blacklist) { diff --git a/qga/qapi-schema.json b/qga/qapi-schema.json index 77f499b..b886f97 100644 --- a/qga/qapi-schema.json +++ b/qga/qapi-schema.json @@ -642,6 +642,17 @@ 'returns': 'int' } ## +# @guest-network-delete-interface: +# +# @name: interface name. +# +# Since: 2.3 +## +{ 'command': 'guest-network-delete-interface', + 'data' : {'name': 'str' }, + 'returns': 'int' } + +## # @GuestLogicalProcessor: # # @logical-id: Arbitrary guest-specific unique identifier of the VCPU. -- 1.9.3

Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- qga/main.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/qga/main.c b/qga/main.c index 9939a2b..f011ce0 100644 --- a/qga/main.c +++ b/qga/main.c @@ -1170,6 +1170,19 @@ int main(int argc, char **argv) g_critical("failed to initialize guest agent channel"); goto out_bad; } + + /* send a notification to path */ + if (ga_state->channel) { + QDict *qdict = qdict_new(); + int ret; + + qdict_put_obj(qdict, "status", QOBJECT(qstring_from_str("connected"))); + ret = send_response(s, QOBJECT(qdict)); + if (ret < 0) { + g_warning("error sending connected status"); + } + } + #ifndef _WIN32 g_main_loop_run(ga_state->main_loop); #else -- 1.9.3

On 04/17/2015 02:53 AM, Chen Fan wrote:
Signed-off-by: Chen Fan <chen.fan.fnst@cn.fujitsu.com> --- qga/main.c | 13 +++++++++++++ 1 file changed, 13 insertions(+)
I'm not sure that qga should be sending asynchronous messages (so far, it only ever replies synchronously). As it is, we already wired up a qemu event that fires any time the guest opens or closes the virtio connection powering the agent; libvirt can already use those events to know when the agent has opened the connection, and is presumably ready to listen to commands after first booting. So I don't think this patch is needed. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
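For reference, the host side can already observe this through libvirt's existing agent lifecycle event (driven by the QEMU channel-state event mentioned above), without any new agent-originated message. A minimal sketch, assuming a libvirt new enough to expose VIR_DOMAIN_EVENT_ID_AGENT_LIFECYCLE; error handling is trimmed:

#include <stdio.h>
#include <libvirt/libvirt.h>

static void
agent_lifecycle_cb(virConnectPtr conn, virDomainPtr dom,
                   int state, int reason, void *opaque)
{
    /* fires when the guest agent opens or closes its virtio-serial channel */
    if (state == VIR_CONNECT_DOMAIN_EVENT_AGENT_LIFECYCLE_STATE_CONNECTED)
        printf("agent in '%s' is connected, guest commands can be sent\n",
               virDomainGetName(dom));
}

int main(void)
{
    virConnectPtr conn;

    virEventRegisterDefaultImpl();
    if (!(conn = virConnectOpen("qemu:///system")))
        return 1;

    virConnectDomainEventRegisterAny(conn, NULL,
                                     VIR_DOMAIN_EVENT_ID_AGENT_LIFECYCLE,
                                     VIR_DOMAIN_EVENT_CALLBACK(agent_lifecycle_cb),
                                     NULL, NULL);

    while (virEventRunDefaultImpl() == 0)
        ;
    return 0;
}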

On 04/17/2015 04:53 AM, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
This functionality has been on my mind/bug list for a long time, but I haven't been able to pursue it much. See this BZ, along with the original patches submitted by Shradha Shah from SolarFlare: https://bugzilla.redhat.com/show_bug.cgi?id=896716 (I was a bit optimistic in my initial review of the patches - there are actually a lot of issues that weren't handled by those patches.)
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
An interesting idea, but I think that is a 2nd level enhancement, not necessary initially (and maybe not ever, due to the high possibility of it being extremely difficult to get right in 100% of the cases).
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
This isn't quite making sense - the bond will be on the guest, which may not have netcf installed. Anyway, I think it should be up to the guest's own system network config to have the bond already setup. If you try to impose it from outside that infrastructure, you run too much risk of running afoul of something on the guest (e.g. NetworkManager)
- during migration, unplug the passthroughed NIC. then do native migration.
Correct. This is the most important part. But not just unplugging it, you also need to wait until the unplug operation completes (it is asynchronous). (After this point, the emulated NIC that is part of the bond would get all of the traffic).
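To make that ordering concrete, the wait could look roughly like the fragment below on the libvirt side. It is built only from helpers already used in the migration patch earlier in this thread (qemuDomainDetachHostDevice, qemuDomainUpdateDeviceList); the polling loop, timeout and "still present" test are illustrative assumptions, not the real implementation, which should block on QEMU's DEVICE_DELETED event rather than poll.

/* Sketch only: request the unplug, then wait (with a timeout) until the
 * hostdev has really disappeared from the live device list before
 * letting the migration proceed. */
static int
unplugHostdevAndWait(virQEMUDriverPtr driver,
                     virDomainObjPtr vm,
                     virDomainDeviceDefPtr dev,
                     virDomainHostdevDefPtr hostdev,
                     unsigned int timeoutSec)
{
    time_t deadline = time(NULL) + timeoutSec;
    size_t i;

    if (qemuDomainDetachHostDevice(driver, vm, dev) < 0)
        return -1;

    for (;;) {
        bool present = false;

        for (i = 0; i < vm->def->nhostdevs; i++) {
            if (vm->def->hostdevs[i] == hostdev)
                present = true;
        }
        if (!present)
            return 0;                /* unplug really finished */
        if (time(NULL) > deadline)
            return -1;               /* give up, abort the migration */

        usleep(100 * 1000);
        if (qemuDomainUpdateDeviceList(driver, vm, QEMU_ASYNC_JOB_NONE) < 0)
            return -1;
    }
}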
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration.
Why does the MAC address need to be different? Are you suggesting doing this with passed-through non-SRIOV NICs? An SRIOV virtual function gets its MAC address from the libvirt config, so it's very simple to use the same MAC address across the migration. Any network card that would be able to do this on any sort of useful scale will be SRIOV-capable (or should be replaced with one that is - some of them are not that expensive).
TODO: 1. when hot add a new NIC in destination side after migration finished, the NIC device need to re-enslave on bonding device in guest. otherwise, it is offline. maybe we should consider bonding driver to support add interfaces dynamically.
I never looked at the details of how SolarFlare's code handled the guest side (they have/had their own patchset they maintained for some older version of libvirt which integrated with some sort of enhanced bonding driver on the guests). I assumed the bond driver could handle this already, but have to say I never investigated.
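For what it's worth, the in-kernel bonding driver can already enslave and release slaves at runtime through sysfs, so the guest-side half of the TODO above may not need driver changes at all; the guest (or the guest agent) just has to re-add the interface once the new VF shows up. A self-contained sketch; the interface names are only examples:

#include <stdio.h>
#include <string.h>
#include <errno.h>

/* Add a slave to an existing bond at runtime via the bonding driver's
 * sysfs interface, e.g. enslave_iface("bond0", "ens3").
 * "+<ifname>" enslaves, "-<ifname>" releases; the slave usually has to
 * be down ("ip link set ens3 down") before it can be enslaved. */
static int enslave_iface(const char *bond, const char *slave)
{
    char path[256];
    FILE *fp;

    snprintf(path, sizeof(path), "/sys/class/net/%s/bonding/slaves", bond);
    fp = fopen(path, "w");
    if (!fp) {
        fprintf(stderr, "open %s: %s\n", path, strerror(errno));
        return -1;
    }
    fprintf(fp, "+%s", slave);
    if (fclose(fp) != 0)
        return -1;
    return 0;
}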
This is an example on how this might work, so I want to hear some voices about this scenario.
Thanks, Chen
Chen Fan (7): qemu-agent: add agent init callback when detecting guest setup qemu: add guest init event callback to do the initialize work for guest hostdev: add a 'bond' type element in <hostdev> element
Putting this into <hostdev> is the wrong approach, for two reasons: 1) it doesn't account for the device to be used being in a different address on the source and destination hosts, 2) the <interface> element already has much of the config you need, and an interface type supporting hostdev passthrough. It has been possible to do passthrough of an SRIOV VF via <interface type='hostdev'> for a long time now and, even better, via an <interface type='network'> where the network pointed to contains a pool of VFs - As long as the source and destination hosts both have networks with the same name, libvirt will be able to find a currently available device on the destination as it migrates from one host to another instead of relying on both hosts having the exact same device at the exact same address on the host and destination (and also magically unused by any other guest). This page explains the use of a "hostdev network" which has a pool of devices: http://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs_... This was designed specifically with the idea in mind that one day it would be possible to migrate a domain with a hostdev device (as long as the guest could handle the hostdev device being temporarily unplugged during the migration).
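To make the pool approach concrete (the network name, PF device and MAC below are made up for illustration): define a network backed by a PF's VFs once on each host, then give the domain an <interface type='network'> pointing at it; libvirt picks a free VF on whichever host the guest is running. A minimal sketch using the public API:

#include <libvirt/libvirt.h>

/* XML in the style documented on the wiki page above; 'sriov-pool' and
 * 'eth2' are placeholders for a real network name and PF device. */
static const char *pool_xml =
    "<network>"
    "  <name>sriov-pool</name>"
    "  <forward mode='hostdev' managed='yes'>"
    "    <pf dev='eth2'/>"
    "  </forward>"
    "</network>";

static const char *iface_xml =
    "<interface type='network'>"
    "  <source network='sriov-pool'/>"
    "  <mac address='52:54:00:6d:90:02'/>"
    "</interface>";

static int attach_vf_from_pool(virConnectPtr conn, virDomainPtr dom)
{
    virNetworkPtr net = virNetworkDefineXML(conn, pool_xml);

    if (!net || virNetworkCreate(net) < 0)
        return -1;

    /* libvirt allocates a free VF from the pool and passes it through */
    return virDomainAttachDeviceFlags(dom, iface_xml,
                                      VIR_DOMAIN_AFFECT_LIVE |
                                      VIR_DOMAIN_AFFECT_CONFIG);
}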
qemu-agent: add qemuAgentCreateBond interface hostdev: add parse ip and route for bond configure
Again, I think that this level of detail about the guest network config belongs on the guest, not in libvirt.
migrate: hot remove hostdev at perform phase for bond device
^^ this is the useful part but I don't think the right method is to make this action dependent on the device being a "bond". I think that in this respect Shradha's patches had a better idea - any hostdev (or, by implication <interface type='hostdev'> or, much more usefully <interface type='network'> pointing to a pool of VFs - could have an attribute "ephemeral". If ephemeral was "yes", then the device would always be unplugged prior to migration and re-plugged when migration was completed (the same thing should be done when saving/restoring a domain which also can't currently be done with a domain that has a passthrough device). For that matter, this could be a general-purpose thing (although probably most useful for hostdevs) - just make it possible for *any* hotpluggable device to be "ephemeral"; the meaning of this would be that every device marked as ephemeral should be unplugged prior to migration or save (and libvirt should wait for qemu to notify that the unplug is completed), and re-plugged right after the guest is restarted. (possibly it should be implemented as an <ephemeral> *element* rather than attribute, so that options could be specified). After that is implemented and works properly, then it might be the time to think about auto-creating the bond (although again, my opinion is that this is getting a bit too intrusive into the guest (and making it more likely to fail - I know from long experience with netcf that it is all too easy for some other service on the system (ahem) to mess up all your hard work); I think it would be better to just let the guest deal with setting up a bond in its system network config, and if the bond driver can't handle having a device in the bond unplugging and plugging, then the bond driver should be enhanced).
migrate: add hostdev migrate status to support hostdev migration
docs/schemas/basictypes.rng | 6 ++ docs/schemas/domaincommon.rng | 37 ++++++++ src/conf/domain_conf.c | 195 ++++++++++++++++++++++++++++++++++++++--- src/conf/domain_conf.h | 40 +++++++-- src/conf/networkcommon_conf.c | 17 ---- src/conf/networkcommon_conf.h | 17 ++++ src/libvirt_private.syms | 1 + src/qemu/qemu_agent.c | 196 +++++++++++++++++++++++++++++++++++++++++- src/qemu/qemu_agent.h | 12 +++ src/qemu/qemu_command.c | 3 + src/qemu/qemu_domain.c | 70 +++++++++++++++ src/qemu/qemu_domain.h | 14 +++ src/qemu/qemu_driver.c | 38 ++++++++ src/qemu/qemu_hotplug.c | 8 +- src/qemu/qemu_migration.c | 91 ++++++++++++++++++++ src/qemu/qemu_migration.h | 4 + src/qemu/qemu_process.c | 32 +++++++ src/util/virhostdev.c | 3 + 18 files changed, 745 insertions(+), 39 deletions(-)

Hi Laine, Thanks for reviewing my patches. Do you know whether SolarFlare's patches have been updated since https://www.redhat.com/archives/libvir-list/2012-November/msg01324.html ? If not, I hope to go on and complete this work. ;) Thanks, Chen On 04/20/2015 06:29 AM, Laine Stump wrote:
On 04/17/2015 04:53 AM, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information. This functionality has been on my mind/bug list for a long time, but I haven't been able to pursue it much. See this BZ, along with the original patches submitted by Shradha Shah from SolarFlare:
https://bugzilla.redhat.com/show_bug.cgi?id=896716
(I was a bit optimistic in my initial review of the patches - there are actually a lot of issues that weren't handled by those patches.)
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest. An interesting idea, but I think that is a 2nd level enhancement, not necessary initially (and maybe not ever, due to the high possibility of it being extremely difficult to get right in 100% of the cases).
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily. This isn't quite making sense - the bond will be on the guest, which may not have netcf installed. Anyway, I think it should be up to the guest's own system network config to have the bond already setup. If you try to impose it from outside that infrastructure, you run too much risk of running afoul of something on the guest (e.g. NetworkManager)
- during migration, unplug the passthroughed NIC. then do native migration. Correct. This is the most important part. But not just unplugging it, you also need to wait until the unplug operation completes (it is asynchronous). (After this point, the emulated NIC that is part of the bond would get all of the traffic).
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration. Why does the MAC address need to be different? Are you suggesting doing this with passed-through non-SRIOV NICs? An SRIOV virtual function gets its MAC address from the libvirt config, so it's very simple to use the same MAC address across the migration. Any network card that would be able to do this on any sort of useful scale will be SRIOV-capable (or should be replaced with one that is - some of them are not that expensive).
TODO: 1. when hot add a new NIC in destination side after migration finished, the NIC device need to re-enslave on bonding device in guest. otherwise, it is offline. maybe we should consider bonding driver to support add interfaces dynamically. I never looked at the details of how SolarFlare's code handled the guest side (they have/had their own patchset they maintained for some older version of libvirt which integrated with some sort of enhanced bonding driver on the guests). I assumed the bond driver could handle this already, but have to say I never investigated.
This is an example on how this might work, so I want to hear some voices about this scenario.
Thanks, Chen
Chen Fan (7): qemu-agent: add agent init callback when detecting guest setup qemu: add guest init event callback to do the initialize work for guest hostdev: add a 'bond' type element in <hostdev> element
Putting this into <hostdev> is the wrong approach, for two reasons: 1) it doesn't account for the device to be used being in a different address on the source and destination hosts, 2) the <interface> element already has much of the config you need, and an interface type supporting hostdev passthrough.
It has been possible to do passthrough of an SRIOV VF via <interface type='hostdev'> for a long time now and, even better, via an <interface type='network'> where the network pointed to contains a pool of VFs - As long as the source and destination hosts both have networks with the same name, libvirt will be able to find a currently available device on the destination as it migrates from one host to another instead of relying on both hosts having the exact same device at the exact same address on the host and destination (and also magically unused by any other guest). This page explains the use of a "hostdev network" which has a pool of devices:
http://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs_...
This was designed specifically with the idea in mind that one day it would be possible to migrate a domain with a hostdev device (as long as the guest could handle the hostdev device being temporarily unplugged during the migration).
qemu-agent: add qemuAgentCreateBond interface hostdev: add parse ip and route for bond configure Again, I think that this level of detail about the guest network config belongs on the guest, not in libvirt.
migrate: hot remove hostdev at perform phase for bond device ^^ this is the useful part but I don't think the right method is to make this action dependent on the device being a "bond".
I think that in this respect Shradha's patches had a better idea - any hostdev (or, by implication <interface type='hostdev'> or, much more usefully <interface type='network'> pointing to a pool of VFs - could have an attribute "ephemeral". If ephemeral was "yes", then the device would always be unplugged prior to migration and re-plugged when migration was completed (the same thing should be done when saving/restoring a domain which also can't currently be done with a domain that has a passthrough device).
For that matter, this could be a general-purpose thing (although probably most useful for hostdevs) - just make it possible for *any* hotpluggable device to be "ephemeral"; the meaning of this would be that every device marked as ephemeral should be unplugged prior to migration or save (and libvirt should wait for qemu to notify that the unplug is completed), and re-plugged right after the guest is restarted.
(possibly it should be implemented as an <ephemeral> *element* rather than attribute, so that options could be specified).
After that is implemented and works properly, then it might be the time to think about auto-creating the bond (although again, my opinion is that this is getting a bit too intrusive into the guest (and making it more likely to fail - I know from long experience with netcf that it is all too easy for some other service on the system (ahem) to mess up all your hard work); I think it would be better to just let the guest deal with setting up a bond in its system network config, and if the bond driver can't handle having a device in the bond unplugging and plugging, then the bond driver should be enhanced).
migrate: add hostdev migrate status to support hostdev migration
docs/schemas/basictypes.rng | 6 ++ docs/schemas/domaincommon.rng | 37 ++++++++ src/conf/domain_conf.c | 195 ++++++++++++++++++++++++++++++++++++++--- src/conf/domain_conf.h | 40 +++++++-- src/conf/networkcommon_conf.c | 17 ---- src/conf/networkcommon_conf.h | 17 ++++ src/libvirt_private.syms | 1 + src/qemu/qemu_agent.c | 196 +++++++++++++++++++++++++++++++++++++++++- src/qemu/qemu_agent.h | 12 +++ src/qemu/qemu_command.c | 3 + src/qemu/qemu_domain.c | 70 +++++++++++++++ src/qemu/qemu_domain.h | 14 +++ src/qemu/qemu_driver.c | 38 ++++++++ src/qemu/qemu_hotplug.c | 8 +- src/qemu/qemu_migration.c | 91 ++++++++++++++++++++ src/qemu/qemu_migration.h | 4 + src/qemu/qemu_process.c | 32 +++++++ src/util/virhostdev.c | 3 + 18 files changed, 745 insertions(+), 39 deletions(-)
.

On 04/22/2015 12:22 AM, Chen Fan wrote:
Hi Laine,
Thanks for reviewing my patches.
Do you know whether SolarFlare's patches have been updated since
https://www.redhat.com/archives/libvir-list/2012-November/msg01324.html ?
If not, I hope to go on and complete this work. ;)
I haven't heard of any updates. Their priorities may have changed.

On 04/20/2015 06:29 AM, Laine Stump wrote:
On 04/17/2015 04:53 AM, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information. This functionality has been on my mind/bug list for a long time, but I haven't been able to pursue it much. See this BZ, along with the original patches submitted by Shradha Shah from SolarFlare:
https://bugzilla.redhat.com/show_bug.cgi?id=896716
(I was a bit optimistic in my initial review of the patches - there are actually a lot of issues that weren't handled by those patches.)
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest. An interesting idea, but I think that is a 2nd level enhancement, not necessary initially (and maybe not ever, due to the high possibility of it being extremely difficult to get right in 100% of the cases).
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily. This isn't quite making sense - the bond will be on the guest, which may not have netcf installed. Anyway, I think it should be up to the guest's own system network config to have the bond already setup. If you try to impose it from outside that infrastructure, you run too much risk of running afoul of something on the guest (e.g. NetworkManager)
- during migration, unplug the passthroughed NIC. then do native migration. Correct. This is the most important part. But not just unplugging it, you also need to wait until the unplug operation completes (it is asynchronous). (After this point, the emulated NIC that is part of the bond would get all of the traffic).
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration. Why does the MAC address need to be different? Are you suggesting doing this with passed-through non-SRIOV NICs? An SRIOV virtual function gets its MAC address from the libvirt config, so it's very simple to use the same MAC address across the migration. Any network card that would be able to do this on any sort of useful scale will be SRIOV-capable (or should be replaced with one that is - some of them are not that expensive). Hi Laine,
I think supporting migration with SRIOV virtual NICs is a good idea, but there are also passthrough NICs that are not SRIOV-capable. For those NIC devices we are only able to use <hostdev> to specify the passthrough function, so I think we should support them too. Thanks, Chen
TODO: 1. when hot add a new NIC in destination side after migration finished, the NIC device need to re-enslave on bonding device in guest. otherwise, it is offline. maybe we should consider bonding driver to support add interfaces dynamically. I never looked at the details of how SolarFlare's code handled the guest side (they have/had their own patchset they maintained for some older version of libvirt which integrated with some sort of enhanced bonding driver on the guests). I assumed the bond driver could handle this already, but have to say I never investigated.
This is an example on how this might work, so I want to hear some voices about this scenario.
Thanks, Chen
Chen Fan (7): qemu-agent: add agent init callback when detecting guest setup qemu: add guest init event callback to do the initialize work for guest hostdev: add a 'bond' type element in <hostdev> element
Putting this into <hostdev> is the wrong approach, for two reasons: 1) it doesn't account for the device to be used being in a different address on the source and destination hosts, 2) the <interface> element already has much of the config you need, and an interface type supporting hostdev passthrough.
It has been possible to do passthrough of an SRIOV VF via <interface type='hostdev'> for a long time now and, even better, via an <interface type='network'> where the network pointed to contains a pool of VFs - As long as the source and destination hosts both have networks with the same name, libvirt will be able to find a currently available device on the destination as it migrates from one host to another instead of relying on both hosts having the exact same device at the exact same address on the host and destination (and also magically unused by any other guest). This page explains the use of a "hostdev network" which has a pool of devices:
http://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs_...
This was designed specifically with the idea in mind that one day it would be possible to migrate a domain with a hostdev device (as long as the guest could handle the hostdev device being temporarily unplugged during the migration).
qemu-agent: add qemuAgentCreateBond interface hostdev: add parse ip and route for bond configure Again, I think that this level of detail about the guest network config belongs on the guest, not in libvirt.
migrate: hot remove hostdev at perform phase for bond device ^^ this is the useful part but I don't think the right method is to make this action dependent on the device being a "bond".
I think that in this respect Shradha's patches had a better idea - any hostdev (or, by implication <interface type='hostdev'> or, much more usefully <interface type='network'> pointing to a pool of VFs - could have an attribute "ephemeral". If ephemeral was "yes", then the device would always be unplugged prior to migration and re-plugged when migration was completed (the same thing should be done when saving/restoring a domain which also can't currently be done with a domain that has a passthrough device).
For that matter, this could be a general-purpose thing (although probably most useful for hostdevs) - just make it possible for *any* hotpluggable device to be "ephemeral"; the meaning of this would be that every device marked as ephemeral should be unplugged prior to migration or save (and libvirt should wait for qemu to notify that the unplug is completed), and re-plugged right after the guest is restarted.
(possibly it should be implemented as an <ephemeral> *element* rather than attribute, so that options could be specified).
After that is implemented and works properly, then it might be the time to think about auto-creating the bond (although again, my opinion is that this is getting a bit too intrusive into the guest (and making it more likely to fail - I know from long experience with netcf that it is all too easy for some other service on the system (ahem) to mess up all your hard work); I think it would be better to just let the guest deal with setting up a bond in its system network config, and if the bond driver can't handle having a device in the bond unplugging and plugging, then the bond driver should be enhanced).
migrate: add hostdev migrate status to support hostdev migration
docs/schemas/basictypes.rng | 6 ++ docs/schemas/domaincommon.rng | 37 ++++++++ src/conf/domain_conf.c | 195 ++++++++++++++++++++++++++++++++++++++--- src/conf/domain_conf.h | 40 +++++++-- src/conf/networkcommon_conf.c | 17 ---- src/conf/networkcommon_conf.h | 17 ++++ src/libvirt_private.syms | 1 + src/qemu/qemu_agent.c | 196 +++++++++++++++++++++++++++++++++++++++++- src/qemu/qemu_agent.h | 12 +++ src/qemu/qemu_command.c | 3 + src/qemu/qemu_domain.c | 70 +++++++++++++++ src/qemu/qemu_domain.h | 14 +++ src/qemu/qemu_driver.c | 38 ++++++++ src/qemu/qemu_hotplug.c | 8 +- src/qemu/qemu_migration.c | 91 ++++++++++++++++++++ src/qemu/qemu_migration.h | 4 + src/qemu/qemu_process.c | 32 +++++++ src/util/virhostdev.c | 3 + 18 files changed, 745 insertions(+), 39 deletions(-)
.

On 04/23/2015 04:34 AM, Chen Fan wrote:
On 04/20/2015 06:29 AM, Laine Stump wrote:
On 04/17/2015 04:53 AM, Chen Fan wrote:
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration.
Why does the MAC address need to be different? Are you suggesting doing this with passed-through non-SRIOV NICs? An SRIOV virtual function gets its MAC address from the libvirt config, so it's very simple to use the same MAC address across the migration. Any network card that would be able to do this on any sort of useful scale will be SRIOV-capable (or should be replaced with one that is - some of them are not that expensive).
Hi Laine,
I think supporting migration with SRIOV virtual NICs is a good idea, but there are also passthrough NICs that are not SRIOV-capable. For those NIC devices we are only able to use <hostdev> to specify the passthrough function, so I think we should support them too.
As I think you've already discovered, passing through non-SRIOV NICs is problematic. It is completely impossible for the host to change their MAC address before assigning them to the guest - the guest's driver sees standard netdev hardware and resets it, which resets the MAC address to the original value burned into the firmware. This makes management more complicated, especially when you get into scenarios such as what we're discussing (i.e. migration) where the actual hardware (and thus MAC address) may be different from one run to the next.

Since libvirt's <interface> element requires a fixed MAC address in the XML, it's not possible to have an <interface> that gets the actual device from a network pool (without some serious hacking to that code), and there is no support for plain (non-network) <hostdev> device pools; there would need to be a separate (nonexistent) driver for that. Since the <hostdev> element relies on the PCI address of the device (in the <source> subelement, which also must be fixed) to determine which device to pass through, a domain config with a <hostdev> that could be run on two different machines would require the device to reside at exactly the same PCI address on both machines, which is a very serious limitation to have in an environment large enough that migrating domains is a requirement.

Also, non-SRIOV NICs are limited to a single device per physical port, meaning probably at most 4 devices per physical host PCIe slot, and this results in a greatly reduced density on the host (and even more so on the switch that connects to the host!) compared to even the old Intel 82576 cards, which have 14 VFs (7 VFs x 2 ethernet ports). Think about it - with an 82576, you can get 14 guests into 1 PCIe slot and 2 switch ports, while the same number of guests with non-SRIOV would take 4 PCIe slots and 14(!) switch ports. The difference is even more striking when comparing to chips like the 82599 (64 VFs per port x 2), or a Mellanox (also 64?) or SolarFlare (128?) card.

And don't forget that, because you don't have pools of devices to be automatically chosen from, each guest domain that will be migrated requires a reserved NIC on *every* machine it will be migrated to (no other domain can be configured to use that NIC, in order to avoid conflicts).

Of course you could complicate the software by adding a driver that manages pools of generic hostdevs, and coordinates MAC address changes with the guest (part of what you're suggesting), but all that extra complexity not only takes a lot of time and effort to develop, it also creates more code that needs to be maintained and tested for regressions at each release.

The alternative is to just spend $130 per host for an 82576 or Intel I350 card (these are the cheapest SRIOV options I'm aware of). When compared to the total cost of any hardware installation large enough to support migration and have performance requirements high enough that NIC passthrough is needed, this is a trivial amount.

I guess the bottom line of all this is that (in my opinion, of course :-) supporting useful migration of domains that use passed-through non-SRIOV NICs would be an interesting experiment, but I don't see much utility to it, other than "scratching an intellectual itch", and I'm concerned that it would create more long-term maintenance cost than it is worth.

On Thu, Apr 23, 2015 at 11:01:44AM -0400, Laine Stump wrote:
On 04/23/2015 04:34 AM, Chen Fan wrote:
On 04/20/2015 06:29 AM, Laine Stump wrote:
On 04/17/2015 04:53 AM, Chen Fan wrote:
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration.
Why does the MAC address need to be different? Are you suggesting doing this with passed-through non-SRIOV NICs? An SRIOV virtual function gets its MAC address from the libvirt config, so it's very simple to use the same MAC address across the migration. Any network card that would be able to do this on any sort of useful scale will be SRIOV-capable (or should be replaced with one that is - some of them are not that expensive).
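For illustration, this is roughly what that looks like in the domain XML: an SRIOV VF assigned via <interface type='hostdev'> with the MAC fixed in the config, so the same MAC can be kept across a migration. A minimal sketch only - the MAC, guest name and PCI address below are made-up examples.
cat > vf-nic.xml <<'EOF'
<interface type='hostdev' managed='yes'>
  <mac address='52:54:00:6d:90:02'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x10' function='0x0'/>
  </source>
</interface>
EOF
virsh attach-device demo-guest vf-nic.xml --live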
Hi Laine,
I think supporting migration with SRIOV virtual NICs is a good idea, but some passthrough NICs are not SRIOV-capable. For those devices we are only able to use <hostdev> to specify the passthrough function, so I think we should support those NICs too.
As I think you've already discovered, passing through non-SRIOV NICS is problematic. It is completely impossible for the host to change their MAC address before assigning them to the guest - the guest's driver sees standard netdev hardware and resets it, which resets the MAC address to the original value burned into the firmware. This makes management more complicated, especially when you get into scenarios such as what we're discussing (i.e. migration) where the actual hardware (and thus MAC address) may be different from one run to the next.
Right, passing through PFs is also insecure. Let's get everything working fine with VFs first, worry about PFs later.
Since libvirt's <interface> element requires a fixed MAC address in the XML, it's not possible to have an <interface> that gets the actual device from a network pool (without some serious hacking to that code), and there is no support for plain (non-network) <hostdev> device pools; there would need to be a separate (nonexistent) driver for that. Since the <hostdev> element relies on the PCI address of the device (in the <source> subelement, which also must be fixed) to determine which device to passthrough, a domain config with a <hostdev> that could be run on two different machines would require the device to reside at exactly the same PCI address on both machines, which is a very serious limitation to have in an environment large enough that migrating domains is a requirement.
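To make the limitation concrete, a plain <hostdev> NIC is pinned to a host PCI address like this (a sketch; the address is an example), so the same address would have to exist and be available on every host the domain might migrate to:
cat > passthrough-nic.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF
virsh attach-device demo-guest passthrough-nic.xml --config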
Also, non-SRIOV NICs are limited to a single device per physical port, meaning probably at most 4 devices per physical host PCIe slot, and this results in a greatly reduced density on the host (and even more so on the switch that connects to the host!) compared to even the old Intel 82576 cards, which have 14 VFs (7VFs x 2 ethernet ports). Think about it - with an 82576, you can get 14 guests into 1 PCIe slot and 2 switch ports, while the same number of guests with non-SRIOV would take 4 PCIe slots and 14(!) switch ports. The difference is even more striking when comparing to chips like the 82599 (64 VFs per port x 2), or a Mellanox (also 64?) or SolarFlare (128?) card. And don't forget that, because you don't have pools of devices to be automatically chosen from, that each guest domain that will be migrated requires a reserved NIC on *every* machine it will be migrated to (no other domain can be configured to use that NIC, in order to avoid conflicts).
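For comparison, a sketch of how the VFs are instantiated on an SRIOV-capable PF (the PF name and VF count here are examples):
echo 7 > /sys/class/net/enp4s0f0/device/sriov_numvfs   # create 7 VFs on this PF
lspci -nn | grep -i 'Virtual Function'                 # each VF shows up as its own PCI device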
Of course you could complicate the software by adding a driver that manages pools of generic hostdevs, and coordinates MAC address changes with the guest (part of what you're suggesting), but all that extra complexity not only takes a lot of time and effort to develop, it also creates more code that needs to be maintained and tested for regressions at each release.
The alternative is to just spend $130 per host for an 82576 or Intel I350 card (these are the cheapest SRIOV options I'm aware of). When compared to the total cost of any hardware installation large enough to support migration and have performance requirements high enough that NIC passthrough is needed, this is a trivial amount.
I guess the bottom line of all this is that (in my opinion, of course :-) supporting useful migration of domains that used passed-through non-SRIOV NICs would be an interesting experiment, but I don't see much utility to it, other than "scratching an intellectual itch", and I'm concerned that it would create more long term maintenance cost than it was worth.
I'm not sure it has no utility but it's easy to agree that VFs are more important, and focusing on this first is a good idea.

On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd. IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the guest admin configure bonding in whatever manner they decide is best for their OS install.
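One way a guest admin might do that by hand is a plain active-backup bond, e.g. something like the following run inside the guest (a sketch; the interface names are examples, with eth0 the passed-through NIC and eth1 the emulated virtio NIC):
ip link add bond0 type bond mode active-backup miimon 100
ip link set eth0 down && ip link set eth0 master bond0   # passed-through NIC
ip link set eth1 down && ip link set eth1 master bond0   # emulated virtio NIC
echo eth0 > /sys/class/net/bond0/bonding/primary         # prefer the passed-through NIC while it is present
ip link set bond0 up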
- during migration, unplug the passthroughed NIC. then do native migration.
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, Apr 22, 2015 at 10:23:04AM +0100, Daniel P. Berrange wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
Thinking about it some more, I'm not even convinced this should need direct support in libvirt or QEMU at all. We already have the ability to hotplug and unplug NICs, and the guest OS can be set up to run appropriate scripts when a PCI hotadd/remove event occurs (e.g. via udev rules). So I think this functionality can be done entirely within the mgmt application (oVirt or OpenStack) and the guest OS. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
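A sketch of that udev-based approach (the driver name, bond name and script path are made-up examples; in practice the rule and helper would be baked into the guest image by the mgmt application):
cat > /etc/udev/rules.d/99-reenslave-vf.rules <<'EOF'
ACTION=="add", SUBSYSTEM=="net", DRIVERS=="ixgbevf", RUN+="/usr/local/sbin/reenslave-vf.sh %k"
EOF
cat > /usr/local/sbin/reenslave-vf.sh <<'EOF'
#!/bin/sh
# re-enslave a hot-added VF into the existing bond and make it the primary
ip link set "$1" down
ip link set "$1" master bond0
echo "$1" > /sys/class/net/bond0/bonding/primary
EOF
chmod +x /usr/local/sbin/reenslave-vf.sh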

* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff. At the simplest, something that marks the two NICs in a discoverable way so that they can be seen to be part of a set; with just that ID system, an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere saying that anything with the same ID is automatically added to the bond. However, I agree that you might be able to avoid having to do anything in the guest agent. Dave -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond.
I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest. I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
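For the cloud-init flavour, the bond could be described in the instance's network configuration, along these lines (a sketch only, assuming a cloud-init version whose v1 network schema supports bond definitions; the device names and MAC addresses are made up):
cat > network-config <<'EOF'
version: 1
config:
  - type: physical
    name: vfnic
    mac_address: "52:54:00:6d:90:02"
  - type: physical
    name: virtionic
    mac_address: "52:54:00:6d:90:03"
  - type: bond
    name: bond0
    bond_interfaces:
      - vfnic
      - virtionic
    params:
      bond-mode: active-backup
      bond-miimon: 100
EOF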

* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond.
I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest.
OK, good, that's about the same level I was at.
I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might.
Would either of these work with hotplug though? I guess since the VM starts off with the pair of NICs, when you remove one and add it back after migration you don't need any more information added; so yes, cloud-init or SMBIOS would do it. (I was thinking of SMBIOS stuff in the way that you get the device/slot numbering that NIC naming is sometimes based on.) What if we hot-add a new NIC later on (not during migration)? A normal hot-add of a NIC now turns into a hot-add of two new NICs; how do we pass the information at hot-add time to provide that? Dave
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On Wed, Apr 22, 2015 at 06:12:25PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond.
I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest.
OK, good, that's about the same level I was at.
I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might.
Would either of these work with hotplug though? I guess as the VM starts off with the pair of NICs, then when you remove one and add it back after migration then you don't need any more information added; so yes cloud-init or SMBIOS would do it. (I was thinking SMBIOS stuff in the way that you get device/slot numbering that NIC naming is sometimes based off).
What about if we hot-add a new NIC later on (not during migration); a normal hot-add of a NIC now turns into a hot-add of two new NICs; how do we pass the information at hot-add time to provide that?
Hmm, yes, actually hotplug would be a problem with that. An even simpler idea would be to just keep things real dumb and simply use the same MAC address for both NICs. Once you put them in a bond device, the kernel will be copying the MAC address of the first NIC into the second NIC anyway, so unless I'm missing something, we might as well just use the same MAC address for both right away. That makes it easy for the guest to discover NICs in the same set, and it works with hotplug trivially. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
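Concretely, the "same MAC for both" idea would make the two <interface> elements in the domain's <devices> section look something like this (a sketch; the MAC, network name and PCI address are examples):
# fragment of the domain <devices> section, shown here for illustration only
cat > nic-pair.xml <<'EOF'
<!-- emulated NIC, present all the time -->
<interface type='network'>
  <mac address='52:54:00:6d:90:02'/>
  <source network='default'/>
  <model type='virtio'/>
</interface>
<!-- passed-through VF, detached before migration and re-attached afterwards -->
<interface type='hostdev' managed='yes'>
  <mac address='52:54:00:6d:90:02'/>
  <source>
    <address type='pci' domain='0x0000' bus='0x04' slot='0x10' function='0x0'/>
  </source>
</interface>
EOF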

* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:12:25PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond.
I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest.
OK, good, that's about the same level I was at.
I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might.
Would either of these work with hotplug though? I guess as the VM starts off with the pair of NICs, then when you remove one and add it back after migration then you don't need any more information added; so yes cloud-init or SMBIOS would do it. (I was thinking SMBIOS stuff in the way that you get device/slot numbering that NIC naming is sometimes based off).
What about if we hot-add a new NIC later on (not during migration); a normal hot-add of a NIC now turns into a hot-add of two new NICs; how do we pass the information at hot-add time to provide that?
Hmm, yes, actually hotplug would be a problem with that.
A even simpler idea would be to just keep things real dumb and simply use the same MAC address for both NICs. Once you put them in a bond device, the kernel will be copying the MAC address of the first NIC into the second NIC anyway, so unless I'm missing something, we might as well just use the same MAC address for both right away. That makes it easy for guest to discover NICs in the same set and works with hotplug trivially.
I bet you need to distinguish the two NICs though; you'd want the bond to send all the traffic through the real NIC during normal use; and how does the guest know, when it sees the hotplug of the 1st NIC in the pair, that this is a special NIC and that it's about to see its sibling arrive? Dave
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On 04/22/2015 01:20 PM, Dr. David Alan Gilbert wrote:
On Wed, Apr 22, 2015 at 06:12:25PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote: > backgrond: > Live migration is one of the most important features of virtualization technology. > With regard to recent virtualization techniques, performance of network I/O is critical. > Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant > performance gap with native network I/O. Pass-through network devices have near > native performance, however, they have thus far prevented live migration. No existing > methods solve the problem of live migration with pass-through devices perfectly. > > There was an idea to solve the problem in website: > https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf > Please refer to above document for detailed information. > > So I think this problem maybe could be solved by using the combination of existing > technology. and the following steps are we considering to implement: > > - before boot VM, we anticipate to specify two NICs for creating bonding device > (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses > in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest. > > - when qemu-guest-agent startup in guest it would send a notification to libvirt, > then libvirt will call the previous registered initialize callbacks. so through > the callback functions, we can create the bonding device according to the XML > configuration. and here we use netcf tool which can facilitate to create bonding device > easily. I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install. I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond. I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest. OK, good, that's about the same level I was at.
I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might. Would either of these work with hotplug though? I guess as the VM starts off with the pair of NICs, then when you remove one and add it back after migration then you don't need any more information added; so yes cloud-init or SMBIOS would do it. (I was thinking SMBIOS stuff in the way that you get device/slot numbering that NIC naming is sometimes based off).
What about if we hot-add a new NIC later on (not during migration); a normal hot-add of a NIC now turns into a hot-add of two new NICs; how do we pass the information at hot-add time to provide that? Hmm, yes, actually hotplug would be a problem with that.
An even simpler idea would be to just keep things real dumb and simply use the same MAC address for both NICs. Once you put them in a bond device, the kernel will be copying the MAC address of the first NIC into the second NIC anyway, so unless I'm missing something, we might as well just use the same MAC address for both right away. That makes it easy for the guest to discover NICs in the same set, and it works with hotplug trivially.
I bet you need to distinguish the two NICs though; you'd want the bond to send all the traffic through the real NIC during normal use; and how does the guest know, when it sees the hotplug of the 1st NIC in the pair, that this is a special NIC and that it's about to see its sibling arrive?
Yeah, there needs to be *some way* for the guest OS to differentiate between the emulated NIC (which will be operational all the time, but only used during migration when the passed-through NIC is missing) and the passed-through NIC (which should be preferred for all traffic when it is present). The simplest method of differentiating would be for the admin who configures it to know the MAC address. Another way could be [some bit of magic I don't know how to do] that sets the bonding config based on which driver is used for the NIC (the emulated NIC will almost certainly be virtio, and the passed-through will be igbvf, ixgbevf, or similar).
A complicating factor with using MAC address to differentiate is that it isn't possible for the guest to modify the MAC address of a passed-through SRIOV VF - the only way that could be done would be for the guest to notify the host, then the host could use an RTM_SETLINK message sent for the PF+VF# to change the MAC address; otherwise it is prohibited by the hardware.
Likewise (but at least technically possible to solve with current libvirt+qemu), the default configuration for a macvtap connection to an emulated guest ethernet device (which is probably what the "backup" device of the bond would be) doesn't pass any traffic once the guest has changed the MAC address of the emulated device - qemu does send an RX_FILTER_CHANGED event to libvirt, and if the interface's config has trustGuestRxFilters='yes', then and only then libvirt will modify the MAC address of the host side of the macvtap device.
Thinking about this more, it seems a bit problematic from a security point of view to allow the guest to arbitrarily change its MAC addresses just to support this, so maybe the requirement should be that the MAC addresses be set to the same value, and the guest config be required to figure out which is the "preferred" and which is the "backup" by examining the driver used for the device.
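A sketch of that driver-based differentiation done inside the guest (interface and driver names are assumptions; virtio_net is assumed for the emulated NIC, and the PF/VF names in the comments are examples):
# in the guest: make whichever bond slave is not virtio the bond's primary
for dev in $(cat /sys/class/net/bond0/bonding/slaves); do
    drv=$(basename "$(readlink "/sys/class/net/$dev/device/driver")")
    [ "$drv" != "virtio_net" ] && echo "$dev" > /sys/class/net/bond0/bonding/primary
done
# on the host: a VF's MAC can only be changed through the PF, e.g.
#   ip link set enp4s0f0 vf 0 mac 52:54:00:6d:90:02
# and a macvtap-backed emulated NIC needs trustGuestRxFilters='yes' on its
# <interface> before libvirt will follow a guest-side MAC change.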

On Thu, Apr 23, 2015 at 12:35:28PM -0400, Laine Stump wrote:
On 04/22/2015 01:20 PM, Dr. David Alan Gilbert wrote:
On Wed, Apr 22, 2015 at 06:12:25PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote:
On Wed, Apr 22, 2015 at 06:01:56PM +0100, Dr. David Alan Gilbert wrote:
* Daniel P. Berrange (berrange@redhat.com) wrote: > On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote: >> backgrond: >> Live migration is one of the most important features of virtualization technology. >> With regard to recent virtualization techniques, performance of network I/O is critical. >> Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant >> performance gap with native network I/O. Pass-through network devices have near >> native performance, however, they have thus far prevented live migration. No existing >> methods solve the problem of live migration with pass-through devices perfectly. >> >> There was an idea to solve the problem in website: >> https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf >> Please refer to above document for detailed information. >> >> So I think this problem maybe could be solved by using the combination of existing >> technology. and the following steps are we considering to implement: >> >> - before boot VM, we anticipate to specify two NICs for creating bonding device >> (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses >> in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest. >> >> - when qemu-guest-agent startup in guest it would send a notification to libvirt, >> then libvirt will call the previous registered initialize callbacks. so through >> the callback functions, we can create the bonding device according to the XML >> configuration. and here we use netcf tool which can facilitate to create bonding device >> easily. > I'm not really clear on why libvirt/guest agent needs to be involved in this. > I think configuration of networking is really something that must be left to > the guest OS admin to control. I don't think the guest agent should be trying > to reconfigure guest networking itself, as that is inevitably going to conflict > with configuration attempted by things in the guest like NetworkManager or > systemd-networkd. > > IOW, if you want to do this setup where the guest is given multiple NICs connected > to the same host LAN, then I think we should just let the gues admin configure > bonding in whatever manner they decide is best for their OS install. I disagree; there should be a way for the admin not to have to do this manually; however it should interact well with existing management stuff.
At the simplest, something that marks the two NICs in a discoverable way so that they can be seen that they're part of a set; with just that ID system then an installer or setup tool can notice them and offer to put them into a bond automatically; I'd assume it would be possible to add a rule somewhere that said anything with the same ID would automatically be added to the bond. I didn't mean the admin would literally configure stuff manually. I really just meant that the guest OS itself should decide how it is done, whether NetworkManager magically does the right thing, or the person building the cloud disk image provides a magic udev rule, or $something else. I just don't think that the QEMU guest agent should be involved, as that will definitely trample all over other things that manage networking in the guest. OK, good, that's about the same level I was at.
I could see this being solved in the cloud disk images by using cloud-init metadata to mark the NICs as being in a set, or perhaps there is some magic you could define in SMBIOS tables, or something else again. A cloud-init based solution wouldn't need any QEMU work, but an SMBIOS solution might. Would either of these work with hotplug though? I guess as the VM starts off with the pair of NICs, then when you remove one and add it back after migration then you don't need any more information added; so yes cloud-init or SMBIOS would do it. (I was thinking SMBIOS stuff in the way that you get device/slot numbering that NIC naming is sometimes based off).
What about if we hot-add a new NIC later on (not during migration); a normal hot-add of a NIC now turns into a hot-add of two new NICs; how do we pass the information at hot-add time to provide that? Hmm, yes, actually hotplug would be a problem with that.
An even simpler idea would be to just keep things real dumb and simply use the same MAC address for both NICs. Once you put them in a bond device, the kernel will be copying the MAC address of the first NIC into the second NIC anyway, so unless I'm missing something, we might as well just use the same MAC address for both right away. That makes it easy for the guest to discover NICs in the same set, and it works with hotplug trivially.
I bet you need to distinguish the two NICs though; you'd want the bond to send all the traffic through the real NIC during normal use; and how does the guest know, when it sees the hotplug of the 1st NIC in the pair, that this is a special NIC and that it's about to see its sibling arrive?
Yeah, there needs to be *some way* for the guest OS to differentiate between the emulated NIC (which will be operational all the time, but only used during migration when the passed-through NIC is missing) and the passed-through NIC (which should be preferred for all traffic when it is present). The simplest method of differentiating would be for the admin who configures it to know the MAC address. Another way could be [some bit of magic I don't know how to do] that sets the bonding config based on which driver is used for the NIC (the emulated NIC will almost certainly be virtio, and the passed-through will be igbf, ixgbvf, or similar).
Why not supply this information using the qemu ga?
A complicating factor with using MAC address to differentiate is that it isn't possible for the guest to modify the MAC address of a passed-through SRIOV VF - the only way that could be done would be for the guest to notify the host, then the host could use an RTM_SETLINK message sent for the PF+VF# to change the MAC address, otherwise it is prohibited by the hardware.
Likewise (but at least tehcnically possible to solve with current libvirt+qemu), the default configuration for a macvtap connection to an emulated guest ethernet device (which is probably what the "backup" device of the bond would be) doesn't pass any traffic once the guest has changed the MAC address of the emulated device - qemu does send an RX_FILTER_CHANGED event to libvirt, and if the interface's config has trustGuestRxFilters='yes', then and only then libvirt will modify the MAC address of the host side of the macvtap device.
Thinking about this more, it seems a bit problematic from a security point of view to allow the guest to arbitrarily change its MAC addresses just to support this, so maybe the requirement should be that the MAC addresses be set to the same value, and the guest config required to figure out which is the "preferred" and which is the "backup" by examining the driver used for the device.
That's an unrelated question. Some people want to allow changing the MAC, some don't. Don't use MAC addresses to identify devices, and the problem will go away. -- MST

On Wed, Apr 22, 2015 at 10:23:04AM +0100, Daniel P. Berrange wrote:
On Fri, Apr 17, 2015 at 04:53:02PM +0800, Chen Fan wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
There should not be a conflict. The guest agent should just give NM the information and have NM do the right thing.
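If NM is the component doing "the right thing", the end result would presumably be the nmcli equivalent of the manual bond setup, e.g. (a sketch; connection and interface names are examples):
nmcli con add type bond con-name bond0 ifname bond0 mode active-backup
nmcli con add type bond-slave con-name bond0-vf ifname eth0 master bond0
nmcli con add type bond-slave con-name bond0-virtio ifname eth1 master bond0
nmcli con up bond0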
IOW, if you want to do this setup where the guest is given multiple NICs connected to the same host LAN, then I think we should just let the gues admin configure bonding in whatever manner they decide is best for their OS install.
- during migration, unplug the passthroughed NIC. then do native migration.
- on destination side, check whether need to hotplug new NIC according to specified XML. usually, we use migrate "--xml" command option to specify the destination host NIC mac address to hotplug a new NIC, because source side passthrough NIC mac address is different, then hotplug the deivce according to the destination XML configuration.
Regards, Daniel
Users are actually asking for this functionality. Configuring everything manually is possible but error prone. We probably should leave manual configuration as an option for the 10% of people who want to tweak the guest networking config, but this does not mean we shouldn't have it all work out of the box for the 90% of people who just want networking to go fast with no tweaks.
-- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/19/2015 05:07 AM, Michael S. Tsirkin wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
There should not be a conflict. The guest agent should just give NM the information and have NM do the right thing.
That assumes the guest will have NM running. Unless you want to severely limit the scope of usefulness, you also need to handle systems that have NM disabled, and among those the different styles of system network config. It gets messy very fast.
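For instance, a guest using plain initscripts-style ifcfg files rather than NM would need something like this instead, one of those other "styles of system network config" (a sketch; device names and options are examples, and the emulated NIC would get a similar slave file):
cat > /etc/sysconfig/network-scripts/ifcfg-bond0 <<'EOF'
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=active-backup miimon=100 primary=eth0"
ONBOOT=yes
BOOTPROTO=dhcp
EOF
cat > /etc/sysconfig/network-scripts/ifcfg-eth0 <<'EOF'
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
EOF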
Users are actually asking for this functionality.
Configuring everything manually is possible but error prone.
Yes, but attempting to do it automatically is also error prone (due to the myriad of different guest network config systems, even just within the seemingly narrow category of "Linux guests"). Pick your poison :-)

On Tue, May 19, 2015 at 10:15:17AM -0400, Laine Stump wrote:
On 05/19/2015 05:07 AM, Michael S. Tsirkin wrote:
backgrond: Live migration is one of the most important features of virtualization technology. With regard to recent virtualization techniques, performance of network I/O is critical. Current network I/O virtualization (e.g. Para-virtualized I/O, VMDq) has a significant performance gap with native network I/O. Pass-through network devices have near native performance, however, they have thus far prevented live migration. No existing methods solve the problem of live migration with pass-through devices perfectly.
There was an idea to solve the problem in website: https://www.kernel.org/doc/ols/2008/ols2008v2-pages-261-267.pdf Please refer to above document for detailed information.
So I think this problem maybe could be solved by using the combination of existing technology. and the following steps are we considering to implement:
- before boot VM, we anticipate to specify two NICs for creating bonding device (one plugged and one virtual NIC) in XML. here we can specify the NIC's mac addresses in XML, which could facilitate qemu-guest-agent to find the network interfaces in guest.
- when qemu-guest-agent startup in guest it would send a notification to libvirt, then libvirt will call the previous registered initialize callbacks. so through the callback functions, we can create the bonding device according to the XML configuration. and here we use netcf tool which can facilitate to create bonding device easily.
I'm not really clear on why libvirt/guest agent needs to be involved in this. I think configuration of networking is really something that must be left to the guest OS admin to control. I don't think the guest agent should be trying to reconfigure guest networking itself, as that is inevitably going to conflict with configuration attempted by things in the guest like NetworkManager or systemd-networkd.
There should not be a conflict. The guest agent should just give NM the information and have NM do the right thing.
That assumes the guest will have NM running. Unless you want to severely limit the scope of usefulness, you also need to handle systems that have NM disabled, and among those the different styles of system network config. It gets messy very fast.
Also, OpenStack already has a way to pass the guest information about the required network setup, via cloud-init, so it would not be interested in anything that used the QEMU guest agent to configure NetworkManager. Which is really just another example of why this does not belong anywhere in libvirt or lower. The decision to use NM is a policy decision that will always be wrong for a non-negligible set of use cases and as such does not belong in libvirt or QEMU. It is the job of higher level apps to make that kind of policy decision.
Users are actually asking for this functionality.
Configuring everything manually is possible but error prone.
Yes, but attempting to do it automatically is also error prone (due to the myriad of different guest network config systems, even just within the seemingly narrow category of "Linux guests"). Pick your poison :-)
Also note that I'm not debating the usefulness of the overall concept or the need for automation. It simply doesn't belong in libvirt or lower - it is a job for the higher level management applications to define a policy that fits in with the way they are managing the virtual machines and the networking. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

* Daniel P. Berrange (berrange@redhat.com) wrote:
This is exactly my worry though; why should every higher level management system have its own way of communicating network config for hotpluggable devices? You shouldn't need to reconfigure a VM to move it between them. This just makes it hard to move a VM between management layers; there needs to be some standardisation (or abstraction) of this; if libvirt isn't the place to do it, then what is? Dave
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

On Tue, May 19, 2015 at 04:03:04PM +0100, Dr. David Alan Gilbert wrote:
+1

On Tue, May 19, 2015 at 04:03:04PM +0100, Dr. David Alan Gilbert wrote:
This is exactly my worry though; why should every higher level management system have its own way of communicating network config for hotpluggable devices? You shouldn't need to reconfigure a VM to move it between them.
This just makes it hard to move it between management layers; there needs to be some standardisation (or abstraction) of this; if libvirt isn't the place to do it, then what is?
NB, openstack isn't really defining a custom thing for networking here. It is actually integrating with the standard cloud-init guest tools for this task. Also note that OpenStack has defined a mechanism that works for guest images regardless of what hypervisor they are running on - ie it does not rely on any QEMU or libvirt specific functionality here. Regards, Daniel

On Tue, May 19, 2015 at 04:35:08PM +0100, Daniel P. Berrange wrote:
I'm not sure what the implication is. No new functionality should be implemented unless we also add it to VMware? People who don't want KVM-specific functionality won't use it.

On Tue, May 19, 2015 at 05:39:05PM +0200, Michael S. Tsirkin wrote:
I'm saying that standardization of virtualization policy in libvirt is the wrong solution, because different applications will have different viewpoints as to what "standardization" is useful / appropriate. Creating a standardized policy in libvirt for KVM does not help OpenStack; it may help people who only care about KVM, but that is not the entire ecosystem. OpenStack has a standardized solution for guest configuration information that works across all the hypervisors it targets. This is just yet another example of exactly why libvirt aims to design its APIs such that it exposes direct mechanisms and leaves usage policy decisions up to the management applications. Libvirt is not best placed to decide which policy all these mgmt apps must use for this task. Regards, Daniel

On Tue, May 19, 2015 at 04:45:03PM +0100, Daniel P. Berrange wrote:
I don't think we are pushing policy in libvirt here.
What we want is a mechanism that lets users specify in the XML: interface X is a fallback for pass-through device Y. Then, when requesting migration, specify that it should use device Z on the destination as a replacement for Y.
We are asking libvirt to automatically:
1.- when migration is requested, request unplug of Y
2.- wait until Y is deleted
3.- start migration
4.- wait until migration is completed
5.- plug device Z on the destination
I don't see any policy above: libvirt is in control of migration and seems best placed to implement this.
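For concreteness, the five steps above can already be driven through the existing libvirt C API. The sketch below is purely illustrative and is not code from this patch series: the connection URIs, domain name, and the PCI addresses inside the hostdev XML are invented placeholders, and every waiting/retry/error-handling decision is deliberately left to the calling application.

/* migrate-hostdev.c: illustrative sketch of the five-step flow using
 * existing libvirt APIs (all names and addresses are placeholders).
 * Build (assumed): cc migrate-hostdev.c $(pkg-config --cflags --libs libvirt)
 */
#include <libvirt/libvirt.h>

/* Hypothetical <hostdev> XML for the pass-through NIC "Y" on the source. */
static const char *hostdev_y =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source>"
    "    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>"
    "  </source>"
    "</hostdev>";

/* Hypothetical replacement <hostdev> XML for NIC "Z" on the destination. */
static const char *hostdev_z =
    "<hostdev mode='subsystem' type='pci' managed='yes'>"
    "  <source>"
    "    <address domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>"
    "  </source>"
    "</hostdev>";

int main(void)
{
    virConnectPtr src = NULL, dst = NULL;
    virDomainPtr dom = NULL, migrated = NULL;
    int ret = 1;

    if (!(src = virConnectOpen("qemu:///system")) ||
        !(dst = virConnectOpen("qemu+ssh://dst.example.com/system")))
        goto cleanup;
    if (!(dom = virDomainLookupByName(src, "demo-guest")))
        goto cleanup;

    /* Steps 1+2: request hot-unplug of Y; how long to wait and what to do
     * on failure stays under the application's control. */
    if (virDomainDetachDeviceFlags(dom, hostdev_y, VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto cleanup;

    /* Steps 3+4: live migration; the call returns when migration completes. */
    if (!(migrated = virDomainMigrate(dom, dst, VIR_MIGRATE_LIVE, NULL, NULL, 0)))
        goto cleanup;

    /* Step 5: plug the replacement device Z into the migrated domain. */
    if (virDomainAttachDeviceFlags(migrated, hostdev_z, VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    if (migrated) virDomainFree(migrated);
    if (dom) virDomainFree(dom);
    if (dst) virConnectClose(dst);
    if (src) virConnectClose(src);
    return ret;
}

In this sketch the only waiting libvirt does is whatever the individual API calls already do; any retry or rollback policy lives entirely in the application.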

On Tue, May 19, 2015 at 06:08:10PM +0200, Michael S. Tsirkin wrote:
I don't think we are pushing policy in libvirt here.
What we want is a mechanism that let users specify in the XML: interface X is fallback for pass-through device Y Then when requesting migration, specify that it should use device Z on destination as replacement for Y.
We are asking libvirt to automatically 1.- when migration is requested, request unplug of Y 2.- wait until Y is deleted 3.- start migration 4.- wait until migration is completed 5.- plug device Z on destination
I don't see any policy above: libvirt is in control of migration and seems best placed to implement this.
Even this implies policy in libvirt about handling of failure conditions: how long to wait for unplug, what to do when unplug fails, what to do if plug fails on the target. It is hard to report these errors to the application, and when multiple devices are to be plugged/unplugged, the application will also have trouble determining whether some or all of the devices are still present after a failure. Even beyond that, this is pointless, as all 5 steps you describe here are already possible to perform with existing functionality in libvirt, with the application having direct control over what to do in the failure scenarios. Regards, Daniel

* Michael S. Tsirkin (mst@redhat.com) wrote:
What we want is a mechanism that let users specify in the XML: interface X is fallback for pass-through device Y Then when requesting migration, specify that it should use device Z on destination as replacement for Y.
We are asking libvirt to automatically 1.- when migration is requested, request unplug of Y 2.- wait until Y is deleted 3.- start migration 4.- wait until migration is completed 5.- plug device Z on destination
I don't see any policy above: libvirt is in control of migration and seems best placed to implement this.
The steps that list is missing are:
0. Tell the guest that *this* virtio NIC (X) and *this* real NIC (Y) are a bond pair
6. Tell the guest that *this* real NIC (Z) is now the real half of the bond pair
Step 0 has to happen both at startup and at hotplug of a new pair; I'm not clear whether 6 is actually needed, depending on whether it can be done based on what was in 0.
Dave
-- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
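On the guest side, step 0 / step 6 ultimately come down to enslaving the right interface into the bond. The Linux bonding driver already allows a slave to be added to an existing bond at runtime through sysfs; the fragment below is only a sketch of that guest-side action, with bond0 and eth1 as made-up names, and it is not the qemuAgentCreateBond implementation proposed in this series.

/* Guest-side sketch: add a (hot-plugged) NIC into an existing bond via the
 * bonding driver's sysfs interface.  Interface names are placeholders. */
#include <stdio.h>

static int sysfs_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int rc = (fputs(val, f) == EOF) ? -1 : 0;
    if (fclose(f) != 0)
        rc = -1;
    return rc;
}

int main(void)
{
    /* Assumes bond0 already exists (created when the virtio NIC was first
     * enslaved) and that eth1 is down, which the bonding driver normally
     * requires before it will accept a new slave. */
    if (sysfs_write("/sys/class/net/bond0/bonding/slaves", "+eth1") < 0) {
        perror("enslaving eth1 into bond0");
        return 1;
    }
    return 0;
}

Whether something like this is triggered by the guest agent, by NetworkManager, or by a udev rule inside the guest is exactly the policy question being debated in this thread.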

On Tue, May 19, 2015 at 03:21:49PM +0100, Daniel P. Berrange wrote:
That assumes the guest will have NM running. Unless you want to severely limit the scope of usefulness, you also need to handle systems that have NM disabled, and among those the different styles of system network config. It gets messy very fast.
Also OpenStack already has a way to pass guest information about the required network setup, via cloud-init, so it would not be interested in any thing that used the QEMU guest agent to configure network manager. Which is really just another example of why this does not belong anywhere in libvirt or lower. The decision to use NM is a policy decision that will always be wrong for a non-negligble set of use cases and as such does not belong in libvirt or QEMU. It is the job of higher level apps to make that kind of policy decision.
Using NM is up to users. On some of my VMs, I bring up links manually after each boot. We can provide the info to the guest, and teach NM to use it. If someone wants to write bash scripts to use this info, that's also fine.
Also note I'm not debating the usefulness of the overall concept or the need for automation. It simply doesn't belong in libvirt or lower - it is a job for the higher level management applications to define a policy for that fits in with the way they are managing the virtual machines and the networking.
Users are asking for this automation, so it's useful to them. We can always tell them no, but saying no because we seem unable to decide where this useful functionality fits does not look like a good reason.

On Tue, May 19, 2015 at 10:15:17AM -0400, Laine Stump wrote:
That assumes the guest will have NM running. Unless you want to severely limit the scope of usefulness, you also need to handle systems that have NM disabled, and among those the different styles of system network config. It gets messy very fast.
Systems with system network config can just do the configuration manually; they won't be worse off than they are now.
Yes, but attempting to do it automatically is also error prone (due to the myriad of different guest network config systems, even just within the seemingly narrow category of "Linux guests"). Pick your poison :-)
Make it work well for RHEL guests. Others will work with less integration. -- MST
participants (8)
- Chen Fan
- Daniel P. Berrange
- Dr. David Alan Gilbert
- Eric Blake
- Laine Stump
- Michael S. Tsirkin
- Michal Privoznik
- Olga Krishtal