[libvirt PATCH v2 0/6] <interface> <teaming> element (was: virtio failover / vfio auto-plug-on-migrate)

V1: https://www.redhat.com/archives/libvir-list/2020-January/msg00813.html

This all used different names in V1 - in that incarnation the configuration was done using "failover" and "backupAlias" attributes added to the <driver> subelement of <interface>. But the resulting code was cumbersome and had little bits scattered all over the place due to needing it in both hostdev and interface parsing/formatting. In his review of V1, danpb suggested just adding a new subelement for this configuration to free ourselves from the constraints of <driver> parsing/formatting. This ended up dramatically simplifying the code (hence the lack of V1's refactoring patches in V2, and a decrease in patch count from 12 to 6).

During further discussion in email and on IRC, we decided that naming the element <failover> was too limiting, as it implied the behavior of what is, to libvirt, just two network devices that should be teamed/bonded together - it's completely up to the hypervisor and guest what is done with this information. In light of that, we decided to name the new subelement <teaming>, and to specify the two interfaces as "persistent" (an interface that will always remain plugged in) and "transient" (an interface that may be periodically unplugged; during migration, in the case of QEMU).

So the virtio device will have

  <teaming type='persistent'/>

and the hostdev device will have

  <teaming type='transient' persistent='ua-myvirtio'/>

(note that the persistent interface must have <alias name='ua-myvirtio'/>)

Given this config, libvirt will add "failover=on" to the device commandline arg for the virtio device, and "failover_pair_id=ua-myvirtio" to the arg for the hostdev device (and when a migration is requested, it will notice if there is a hostdev that has <teaming type='transient'/> set, and will allow the migration in this case, but still disallow migrations of domains with normal hostdevs).

In response to these extra commandline options, QEMU will set some extra capabilities in the virtio device PCI capabilities data, and will also automatically unplug/re-plug the hostdev device before and after migration. In the guest, the virtio-net driver will notice the extra PCI capabilities and use this as a clue that it should search for another device with matching MAC address (NB: the guest driver requires the two devices to have matching MAC addresses) to join into a bond with the virtio-net device. This bond is hard-wired to always prefer the hostdev device whenever it is present, and use the virtio device as backup when the hostdev is unplugged.

----

As mentioned in a followup to the V1 cover letter, there is a regression in QEMU 4.2.0 that causes QEMU to segv when a hostdev is unplugged. That bug is fixed with this upstream QEMU patch:

  https://git.qemu.org/?p=qemu.git;a=commitdiff;h=0446f8121723b134ca1d1ed0b73e...

Be sure to use a qemu build with this patch applied, or you may not even be able to start the guest!

Also we've found that the DEVICE_DELETED event is never sent to libvirt when one of these hostdevs is manually unplugged, meaning that libvirt keeps the device marked as "in-use", and it therefore cannot be re-plugged to the guest until after a complete guest "power cycle". AFAIK there isn't yet a fix for that bug, so don't expect manual unplug of the device to work.

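To make the mapping from <teaming> to the QEMU commandline concrete, here is a minimal standalone sketch. It is not the libvirt code: the enum, the struct and both function names are hypothetical stand-ins, and only the option strings ("failover=on", "failover_pair_id=...") and the migration rule are taken from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the teaming type; not libvirt's definition. */
enum teaming_type { TEAMING_NONE, TEAMING_PERSISTENT, TEAMING_TRANSIENT };

/* Hypothetical, trimmed-down view of one <interface>. */
struct nic {
    enum teaming_type teaming;
    const char *persistent; /* alias of the paired virtio NIC (transient only) */
};

/*
 * Write the extra -device option implied by <teaming> into out.
 * Returns 0 on success, -1 if a transient NIC names no persistent
 * alias or if out is too small.
 */
int teaming_device_opt(const struct nic *nic, char *out, size_t outlen)
{
    int n = 0;

    assert(nic != NULL && out != NULL && outlen > 0);
    out[0] = '\0';

    switch (nic->teaming) {
    case TEAMING_NONE:
        return 0; /* nothing to add */
    case TEAMING_PERSISTENT:
        n = snprintf(out, outlen, ",failover=on");
        break;
    case TEAMING_TRANSIENT:
        if (nic->persistent == NULL)
            return -1;
        n = snprintf(out, outlen, ",failover_pair_id=%s", nic->persistent);
        break;
    }
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Migration rule from the cover letter: a domain may migrate only if
 * every assigned hostdev NIC is the transient half of a teaming pair.
 * Returns 1 if migration is allowed, 0 otherwise.
 */
int migration_allowed(const struct nic *hostdevs, size_t count)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (hostdevs[i].teaming != TEAMING_TRANSIENT)
            return 0;
    }
    return 1;
}
```

With a persistent virtio NIC aliased ua-myvirtio and a transient hostdev pointing at it, the first function yields the two option suffixes quoted above, and the second returns 1 only while no plain hostdev is present.
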
Laine Stump (6):
  qemu: add capabilities flag for failover feature
  conf: parse/format <teaming> subelement of <interface>
  qemu: support interface <teaming> functionality
  qemu: allow migration with assigned PCI hostdev if <teaming> is set
  qemu: add wait-unplug to qemu migration status enum
  docs: document <interface> subelement <teaming>

 docs/formatdomain.html.in                     | 100 ++++++++++++++++++
 docs/news.xml                                 |  28 +++++
 docs/schemas/domaincommon.rng                 |  19 ++++
 src/conf/domain_conf.c                        |  45 ++++++++
 src/conf/domain_conf.h                        |  14 +++
 src/qemu/qemu_capabilities.c                  |   4 +
 src/qemu/qemu_capabilities.h                  |   3 +
 src/qemu/qemu_command.c                       |   9 ++
 src/qemu/qemu_domain.c                        |  36 ++++++-
 src/qemu/qemu_migration.c                     |  53 +++++++++-
 src/qemu/qemu_monitor.c                       |   1 +
 src/qemu/qemu_monitor.h                       |   1 +
 src/qemu/qemu_monitor_json.c                  |   1 +
 .../caps_4.2.0.aarch64.xml                    |   1 +
 .../caps_4.2.0.x86_64.xml                     |   1 +
 .../net-virtio-teaming-network.xml            |  37 +++++++
 .../qemuxml2argvdata/net-virtio-teaming.args  |  40 +++++++
 tests/qemuxml2argvdata/net-virtio-teaming.xml |  50 +++++++++
 tests/qemuxml2argvtest.c                      |   4 +
 .../net-virtio-teaming-network.xml            |  51 +++++++++
 .../qemuxml2xmloutdata/net-virtio-teaming.xml |  66 ++++++++++++
 tests/qemuxml2xmltest.c                       |   6 ++
 22 files changed, 563 insertions(+), 7 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.args
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming.xml

-- 
2.24.1

Presence of the virtio-net-pci option called "failover" indicates support in a qemu binary of a simplistic bonding of a virtio-net device with another PCI device. This feature allows migration of guests that have a network device assigned to a guest with VFIO, by creating a network bond device in the guest consisting of the VFIO-assigned device and a virtio-net-pci device, then temporarily (and automatically) unplugging the VFIO net device prior to migration (and hotplugging an equivalent device on the migration destination). (The feature is called "failover" because the bond device uses the vfio-pci netdev for normal guest networking, but "fails over" to the virtio-net-pci netdev once the vfio-pci device is unplugged for migration.)

Full functioning of the feature also requires support in the virtio-net driver in the guest OS (since that is where the bond device resides), but if the "failover" commandline option is present for the virtio-net-pci device in qemu, at least the qemu part of the feature is available, and libvirt can add the proper options to both the virtio-net-pci and vfio-pci device commandlines to indicate qemu should attempt doing the failover during migration.

This patch just adds the qemu capabilities flag "virtio-net.failover".

Signed-off-by: Laine Stump <laine@redhat.com>
---

No change from V1.

 src/qemu/qemu_capabilities.c                      | 4 ++++
 src/qemu/qemu_capabilities.h                      | 3 +++
 tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml  | 1 +
 4 files changed, 9 insertions(+)

diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index 498348ad58..a41eb79e48 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -554,6 +554,9 @@ VIR_ENUM_IMPL(virQEMUCaps,
               "savevm-monitor-nodes",
               "drive-nvme",
               "smp-dies",
+
+              /* 350 */
+              "virtio-net.failover",
     );
 
 
@@ -1276,6 +1279,7 @@ static struct virQEMUCapsStringFlags virQEMUCapsDevicePropsVirtioNet[] = {
     { "disable-legacy", QEMU_CAPS_VIRTIO_PCI_DISABLE_LEGACY },
     { "iommu_platform", QEMU_CAPS_VIRTIO_PCI_IOMMU_PLATFORM },
     { "ats", QEMU_CAPS_VIRTIO_PCI_ATS },
+    { "failover", QEMU_CAPS_VIRTIO_NET_FAILOVER },
 };
 
 static struct virQEMUCapsStringFlags virQEMUCapsDevicePropsSpaprPCIHostBridge[] = {
diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h
index ebcb0d1373..16ca7211a6 100644
--- a/src/qemu/qemu_capabilities.h
+++ b/src/qemu/qemu_capabilities.h
@@ -536,6 +536,9 @@ typedef enum { /* virQEMUCapsFlags grouping marker for syntax-check */
     QEMU_CAPS_DRIVE_NVME, /* -drive file.driver=nvme */
     QEMU_CAPS_SMP_DIES, /* -smp dies= */
 
+    /* 350 */
+    QEMU_CAPS_VIRTIO_NET_FAILOVER, /* virtio-net-*.failover */
+
     QEMU_CAPS_LAST /* this must always be the last item */
 } virQEMUCapsFlags;
 
diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml b/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml
index 184bb7ff77..6af09e1a83 100644
--- a/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml
+++ b/tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml
@@ -176,6 +176,7 @@
   <flag name='savevm-monitor-nodes'/>
   <flag name='drive-nvme'/>
   <flag name='smp-dies'/>
+  <flag name='virtio-net.failover'/>
   <version>4001050</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>61700242</microcodeVersion>
diff --git a/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml b/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml
index afd59a269d..c71791e205 100644
--- a/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml
+++ b/tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml
@@ -219,6 +219,7 @@
   <flag name='savevm-monitor-nodes'/>
   <flag name='drive-nvme'/>
   <flag name='smp-dies'/>
+  <flag name='virtio-net.failover'/>
   <version>4002000</version>
   <kvmVersion>0</kvmVersion>
   <microcodeVersion>43100242</microcodeVersion>
-- 
2.24.1
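
For readers unfamiliar with how a string/flag table like virQEMUCapsDevicePropsVirtioNet is consumed, the following standalone sketch shows the general idea: each property name reported by the qemu binary is looked up in a table, and a match sets the corresponding capability. The enum, struct and function names are hypothetical simplifications, not libvirt's actual probing code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down capability flags; libvirt's real enum is far larger. */
enum caps_flag { CAPS_VIRTIO_PCI_ATS, CAPS_VIRTIO_NET_FAILOVER, CAPS_LAST };

/* One table row: a device property name and the flag it implies. */
struct string_flag {
    const char *value;
    enum caps_flag flag;
};

static const struct string_flag virtio_net_props[] = {
    { "ats", CAPS_VIRTIO_PCI_ATS },
    { "failover", CAPS_VIRTIO_NET_FAILOVER },
};

/*
 * Clear caps, then set caps[flag] = 1 for every reported property
 * name that appears in the table. Unknown names are ignored.
 */
void probe_props(const char *const *props, size_t nprops,
                 unsigned char caps[CAPS_LAST])
{
    size_t ntable = sizeof(virtio_net_props) / sizeof(virtio_net_props[0]);
    size_t i, j;

    assert(caps != NULL);
    memset(caps, 0, CAPS_LAST);

    for (i = 0; i < nprops; i++) {
        for (j = 0; j < ntable; j++) {
            if (strcmp(props[i], virtio_net_props[j].value) == 0)
                caps[virtio_net_props[j].flag] = 1;
        }
    }
}
```

A binary whose virtio-net-pci device lacks the "failover" property simply never gets the flag set, which is what later validation code would check before emitting the new options.
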

On Fri, Jan 24, 2020 at 10:39:16AM -0500, Laine Stump wrote:
Presence of the virtio-net-pci option called "failover" indicates support in a qemu binary of a simplistic bonding of a virtio-net device with another PCI device. This feature allows migration of guests that have a network device assigned to a guest with VFIO, by creating a network bond device in the guest consisting of the VFIO-assigned device and a virtio-net-pci device, then temporarily (and automatically) unplugging the VFIO net device prior to migration (and hotplugging an equivalent device on the migration destination). (The feature is called "failover" because the bond device uses the vfio-pci netdev for normal guest networking, but "fails over" to the virtio-net-pci netdev once the vfio-pci device is unplugged for migration.)
Full functioning of the feature also requires support in the virtio-net driver in the guest OS (since that is where the bond device resides), but if the "failover" commandline option is present for the virtio-net-pci device in qemu, at least the qemu part of the feature is available, and libvirt can add the proper options to both the virtio-net-pci and vfio-pci device commandlines to indicate qemu should attempt doing the failover during migration.
This patch just adds the qemu capabilities flag "virtio-net.failover".
Signed-off-by: Laine Stump <laine@redhat.com> ---
No change from V1.
 src/qemu/qemu_capabilities.c                      | 4 ++++
 src/qemu/qemu_capabilities.h                      | 3 +++
 tests/qemucapabilitiesdata/caps_4.2.0.aarch64.xml | 1 +
 tests/qemucapabilitiesdata/caps_4.2.0.x86_64.xml  | 1 +
 4 files changed, 9 insertions(+)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

The subelement <teaming> of <interface> devices is used to configure a simple teaming association between two interfaces in a domain. Example:

  <interface type='bridge'>
    <source bridge='br0'/>
    <model type='virtio'/>
    <mac address='00:11:22:33:44:55'/>
    <alias name='ua-backup0'/>
    <teaming type='persistent'/>
  </interface>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x02' slot='0x10' function='0x4'/>
    </source>
    <mac address='00:11:22:33:44:55'/>
    <teaming type='transient' persistent='ua-backup0'/>
  </interface>

The interface with <teaming type='persistent'/> is assumed to always be present, while the interface with type='transient' may be unplugged and later re-plugged; the persistent='blah' attribute (and in the one currently available implementation, also the matching MAC addresses) is what associates the two devices with each other. It is up to the hypervisor and the guest network drivers to determine what to do with this information.

Signed-off-by: Laine Stump <laine@redhat.com>
---
 docs/schemas/domaincommon.rng                 | 19 ++++++
 src/conf/domain_conf.c                        | 45 +++++++++++++
 src/conf/domain_conf.h                        | 14 ++++
 .../net-virtio-teaming-network.xml            | 37 +++++++++++
 tests/qemuxml2argvdata/net-virtio-teaming.xml | 50 ++++++++++++++
 .../net-virtio-teaming-network.xml            | 51 ++++++++++++++
 .../qemuxml2xmloutdata/net-virtio-teaming.xml | 66 +++++++++++++++++++
 tests/qemuxml2xmltest.c                       |  6 ++
 8 files changed, 288 insertions(+)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming.xml

diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 76d94b156f..026e753567 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -3158,6 +3158,25 @@
       <optional>
         <ref name="vlan"/>
       </optional>
+      <optional>
+        <element name="teaming">
+          <choice>
+            <group>
+              <attribute name="type">
+                <value>persistent</value>
+              </attribute>
+            </group>
+            <group>
+              <attribute name="type">
+                <value>transient</value>
+              </attribute>
+              <attribute name="persistent">
+                <ref name="aliasName"/>
+              </attribute>
+            </group>
+          </choice>
+        </element>
+      </optional>
     </interleave>
   </define>
 
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index f920d1dc39..ea719e5989 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -554,6 +554,13 @@ VIR_ENUM_IMPL(virDomainNetVirtioTxMode,
               "timer",
 );
 
+VIR_ENUM_IMPL(virDomainNetTeaming,
+              VIR_DOMAIN_NET_TEAMING_TYPE_LAST,
+              "none",
+              "persistent",
+              "transient",
+);
+
 VIR_ENUM_IMPL(virDomainNetInterfaceLinkState,
               VIR_DOMAIN_NET_INTERFACE_LINK_STATE_LAST,
               "default",
@@ -6276,6 +6283,21 @@ virDomainNetDefValidate(const virDomainNetDef *net)
                        virDomainNetTypeToString(net->type));
         return -1;
     }
+
+    if (net->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT) {
+        if (!net->teaming.persistent) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                           _("teaming persistent attribute must be set if teaming type is 'transient'"));
+            return -1;
+        }
+    } else {
+        if (net->teaming.persistent) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("teaming persistent attribute not allowed if teaming type is '%s'"),
+                           virDomainNetTeamingTypeToString(net->teaming.type));
+            return -1;
+        }
+    }
     return 0;
 }
 
@@ -11574,6 +11596,8 @@ virDomainNetDefParseXML(virDomainXMLOptionPtr xmlopt,
     g_autofree char *vhostuser_type = NULL;
     g_autofree char *trustGuestRxFilters = NULL;
     g_autofree char *vhost_path = NULL;
+    g_autofree char *teamingType = NULL;
+    g_autofree char *teamingPersistent = NULL;
     const char *prefix = xmlopt ? xmlopt->config.netPrefix : NULL;
 
     if (!(def = virDomainNetDefNew(xmlopt)))
@@ -11775,6 +11799,10 @@ virDomainNetDefParseXML(virDomainXMLOptionPtr xmlopt,
                 if (!vhost_path && (tmp = virXMLPropString(cur, "vhost")))
                     vhost_path = virFileSanitizePath(tmp);
                 VIR_FREE(tmp);
+            } else if (virXMLNodeNameEqual(cur, "teaming") &&
+                       !teamingType && !teamingPersistent) {
+                teamingType = virXMLPropString(cur, "type");
+                teamingPersistent = virXMLPropString(cur, "persistent");
             }
         }
         cur = cur->next;
@@ -12296,6 +12324,17 @@ virDomainNetDefParseXML(virDomainXMLOptionPtr xmlopt,
         }
     }
 
+    if (teamingType) {
+        if ((def->teaming.type
+             = virDomainNetTeamingTypeFromString(teamingType)) <= 0) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("unknown teaming type '%s'"),
+                           teamingType);
+            goto error;
+        }
+    }
+    def->teaming.persistent = g_steal_pointer(&teamingPersistent);
+
     rv = virXPathULong("string(./tune/sndbuf)", ctxt, &def->tune.sndbuf);
     if (rv >= 0) {
         def->tune.sndbuf_specified = true;
@@ -25741,6 +25780,12 @@ virDomainNetDefFormat(virBufferPtr buf,
         virBufferAddLit(buf, "</tune>\n");
     }
 
+    if (def->teaming.type != VIR_DOMAIN_NET_TEAMING_TYPE_NONE) {
+        virBufferAsprintf(buf, "<teaming type='%s'",
+                          virDomainNetTeamingTypeToString(def->teaming.type));
+        virBufferEscapeString(buf, " persistent='%s'", def->teaming.persistent);
+        virBufferAddLit(buf, "/>\n");
+    }
     if (def->linkstate) {
         virBufferAsprintf(buf, "<link state='%s'/>\n",
                           virDomainNetInterfaceLinkStateTypeToString(def->linkstate));
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 6ae89fa498..ee8eb3ddc0 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -884,6 +884,15 @@ typedef enum {
     VIR_DOMAIN_NET_VIRTIO_TX_MODE_LAST
 } virDomainNetVirtioTxModeType;
 
+/* the type of teaming device */
+typedef enum {
+    VIR_DOMAIN_NET_TEAMING_TYPE_NONE,
+    VIR_DOMAIN_NET_TEAMING_TYPE_PERSISTENT,
+    VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT,
+
+    VIR_DOMAIN_NET_TEAMING_TYPE_LAST
+} virDomainNetTeamingType;
+
 /* link interface states */
 typedef enum {
     VIR_DOMAIN_NET_INTERFACE_LINK_STATE_DEFAULT = 0, /* Default link state (up) */
@@ -958,6 +967,10 @@ struct _virDomainNetDef {
         char *tap;
         char *vhost;
     } backend;
+    struct {
+        virDomainNetTeamingType type;
+        char *persistent; /* alias name of persistent device */
+    } teaming;
     union {
         virDomainChrSourceDefPtr vhostuser;
         struct {
@@ -3425,6 +3438,7 @@ VIR_ENUM_DECL(virDomainFSModel);
 VIR_ENUM_DECL(virDomainNet);
 VIR_ENUM_DECL(virDomainNetBackend);
 VIR_ENUM_DECL(virDomainNetVirtioTxMode);
+VIR_ENUM_DECL(virDomainNetTeaming);
 VIR_ENUM_DECL(virDomainNetInterfaceLinkState);
 VIR_ENUM_DECL(virDomainNetModel);
 VIR_ENUM_DECL(virDomainChrDevice);
diff --git a/tests/qemuxml2argvdata/net-virtio-teaming-network.xml b/tests/qemuxml2argvdata/net-virtio-teaming-network.xml
new file mode 100644
index 0000000000..edab52f3a1
--- /dev/null
+++ b/tests/qemuxml2argvdata/net-virtio-teaming-network.xml
@@ -0,0 +1,37 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory unit='KiB'>219100</memory>
+  <currentMemory unit='KiB'>219100</currentMemory>
+  <vcpu placement='static'>1</vcpu>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-i386</emulator>
+    <disk type='block' device='disk'>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+    </disk>
+    <controller type='usb' index='0'/>
+    <interface type='network'>
+      <mac address='00:11:22:33:44:55'/>
+      <source network='mybridge'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup0'/>
+    </interface>
+    <interface type='network'>
+      <mac address='00:11:22:33:44:55'/>
+      <source network='myhostdevpool'/>
+      <model type='virtio'/>
+      <teaming type='transient' persistent='ua-backup0'/>
+    </interface>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvdata/net-virtio-teaming.xml b/tests/qemuxml2argvdata/net-virtio-teaming.xml
new file mode 100644
index 0000000000..830ce28524
--- /dev/null
+++ b/tests/qemuxml2argvdata/net-virtio-teaming.xml
@@ -0,0 +1,50 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory unit='KiB'>219100</memory>
+  <currentMemory unit='KiB'>219100</currentMemory>
+  <vcpu placement='static'>1</vcpu>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-i386</emulator>
+    <disk type='block' device='disk'>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+    </disk>
+    <controller type='usb' index='0'/>
+    <interface type='user'>
+      <mac address='00:11:22:33:44:55'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup0'/>
+    </interface>
+    <interface type='user'>
+      <mac address='66:44:33:22:11:00'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup1'/>
+    </interface>
+    <interface type='hostdev' managed='yes'>
+      <mac address='00:11:22:33:44:55'/>
+      <source>
+        <address type='pci' domain='0x0000' bus='0x03' slot='0x07' function='0x1'/>
+      </source>
+      <teaming type='transient' persistent='ua-backup0'/>
+    </interface>
+    <interface type='hostdev' managed='yes'>
+      <mac address='66:44:33:22:11:00'/>
+      <source>
+        <address type='pci' domain='0x0000' bus='0x03' slot='0x07' function='0x2'/>
+      </source>
+      <teaming type='transient' persistent='ua-backup1'/>
+    </interface>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml b/tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
new file mode 100644
index 0000000000..e0dbeafe02
--- /dev/null
+++ b/tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
@@ -0,0 +1,51 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory unit='KiB'>219100</memory>
+  <currentMemory unit='KiB'>219100</currentMemory>
+  <vcpu placement='static'>1</vcpu>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-i386</emulator>
+    <disk type='block' device='disk'>
+      <driver name='qemu' type='raw'/>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
+    </disk>
+    <controller type='usb' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <controller type='ide' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
+    </controller>
+    <interface type='network'>
+      <mac address='00:11:22:33:44:55'/>
+      <source network='mybridge'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup0'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
+    </interface>
+    <interface type='network'>
+      <mac address='00:11:22:33:44:55'/>
+      <source network='myhostdevpool'/>
+      <model type='virtio'/>
+      <teaming type='transient' persistent='ua-backup0'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
+    </interface>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmloutdata/net-virtio-teaming.xml b/tests/qemuxml2xmloutdata/net-virtio-teaming.xml
new file mode 100644
index 0000000000..5a5695794a
--- /dev/null
+++ b/tests/qemuxml2xmloutdata/net-virtio-teaming.xml
@@ -0,0 +1,66 @@
+<domain type='qemu'>
+  <name>QEMUGuest1</name>
+  <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+  <memory unit='KiB'>219100</memory>
+  <currentMemory unit='KiB'>219100</currentMemory>
+  <vcpu placement='static'>1</vcpu>
+  <os>
+    <type arch='i686' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-i386</emulator>
+    <disk type='block' device='disk'>
+      <driver name='qemu' type='raw'/>
+      <source dev='/dev/HostVG/QEMUGuest1'/>
+      <target dev='hda' bus='ide'/>
+      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
+    </disk>
+    <controller type='usb' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <controller type='ide' index='0'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
+    </controller>
+    <interface type='user'>
+      <mac address='00:11:22:33:44:55'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup0'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
+    </interface>
+    <interface type='user'>
+      <mac address='66:44:33:22:11:00'/>
+      <model type='virtio'/>
+      <teaming type='persistent'/>
+      <alias name='ua-backup1'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
+    </interface>
+    <interface type='hostdev' managed='yes'>
+      <mac address='00:11:22:33:44:55'/>
+      <source>
+        <address type='pci' domain='0x0000' bus='0x03' slot='0x07' function='0x1'/>
+      </source>
+      <teaming type='transient' persistent='ua-backup0'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
+    </interface>
+    <interface type='hostdev' managed='yes'>
+      <mac address='66:44:33:22:11:00'/>
+      <source>
+        <address type='pci' domain='0x0000' bus='0x03' slot='0x07' function='0x2'/>
+      </source>
+      <teaming type='transient' persistent='ua-backup1'/>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
+    </interface>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 3cefc64833..e54c540ef6 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -451,6 +451,12 @@ mymain(void)
     DO_TEST("net-eth-unmanaged-tap", NONE);
     DO_TEST("net-virtio-network-portgroup", NONE);
     DO_TEST("net-virtio-rxtxqueuesize", NONE);
+    DO_TEST("net-virtio-teaming",
+            QEMU_CAPS_VIRTIO_NET_FAILOVER,
+            QEMU_CAPS_DEVICE_VFIO_PCI);
+    DO_TEST("net-virtio-teaming-network",
+            QEMU_CAPS_VIRTIO_NET_FAILOVER,
+            QEMU_CAPS_DEVICE_VFIO_PCI);
     DO_TEST("net-hostdev", NONE);
     DO_TEST("net-hostdev-bootorder", NONE);
     DO_TEST("net-hostdev-vfio", QEMU_CAPS_DEVICE_VFIO_PCI);
-- 
2.24.1
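
The parse and validate rules in the patch above can be summarized in a small standalone sketch. This is an illustration only, assuming simplified types: the enum, the name table and all three functions are hypothetical, and the error reporting of the real code is reduced to a -1 return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the teaming enum; values follow the patch's order. */
enum teaming_type {
    TEAMING_NONE,
    TEAMING_PERSISTENT,
    TEAMING_TRANSIENT,
    TEAMING_LAST
};

static const char *const teaming_names[TEAMING_LAST] = {
    "none", "persistent", "transient",
};

/* Map an attribute value to the enum; returns -1 for unknown strings. */
int teaming_type_from_string(const char *s)
{
    int i;

    assert(s != NULL);
    for (i = 0; i < TEAMING_LAST; i++) {
        if (strcmp(s, teaming_names[i]) == 0)
            return i;
    }
    return -1;
}

/*
 * A transient interface must name its persistent partner; any other
 * type must not carry the persistent attribute. Returns 0 if valid.
 */
int teaming_validate(int type, const char *persistent)
{
    if (type == TEAMING_TRANSIENT)
        return persistent ? 0 : -1;
    return persistent ? -1 : 0;
}

/*
 * Parse the two attributes of <teaming>. A missing element (both NULL)
 * yields TEAMING_NONE; an explicit type='none' or an unknown type is
 * rejected, matching the "<= 0" check in the parser.
 * Returns the type on success, -1 on error.
 */
int teaming_parse(const char *type_attr, const char *persistent_attr)
{
    int type = TEAMING_NONE;

    if (type_attr != NULL) {
        type = teaming_type_from_string(type_attr);
        if (type <= 0)
            return -1;
    }
    if (teaming_validate(type, persistent_attr) < 0)
        return -1;
    return type;
}
```

Keeping the attribute check separate from string parsing mirrors the split in the patch, where the parser only fills in the struct and the validate callback enforces the cross-attribute rule.
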

On Fri, Jan 24, 2020 at 10:39:17AM -0500, Laine Stump wrote:
The subelement <teaming> of <interface> devices is used to configure a simple teaming association between two interfaces in a domain. Example:
  <interface type='bridge'>
    <source bridge='br0'/>
    <model type='virtio'/>
    <mac address='00:11:22:33:44:55'/>
    <alias name='ua-backup0'/>
    <teaming type='persistent'/>
  </interface>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x02' slot='0x10' function='0x4'/>
    </source>
    <mac address='00:11:22:33:44:55'/>
    <teaming type='transient' persistent='ua-backup0'/>
  </interface>
The interface with <teaming type='persistent'/> is assumed to always be present, while the interface with type='transient' may be unplugged and later re-plugged; the persistent='blah' attribute (and in the one currently available implementation, also the matching MAC addresses) is what associates the two devices with each other. It is up to the hypervisor and the guest network drivers to determine what to do with this information.
Signed-off-by: Laine Stump <laine@redhat.com> ---
 docs/schemas/domaincommon.rng                 | 19 ++++++
 src/conf/domain_conf.c                        | 45 +++++++++++++
 src/conf/domain_conf.h                        | 14 ++++
 .../net-virtio-teaming-network.xml            | 37 +++++++++++
 tests/qemuxml2argvdata/net-virtio-teaming.xml | 50 ++++++++++++++
 .../net-virtio-teaming-network.xml            | 51 ++++++++++++++
 .../qemuxml2xmloutdata/net-virtio-teaming.xml | 66 +++++++++++++++++++
 tests/qemuxml2xmltest.c                       |  6 ++
 8 files changed, 288 insertions(+)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming.xml

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel

The QEMU driver uses the <teaming type='persistent|transient' persistent='blah'/> element to set up a "failover" pair of devices - the persistent device must be a virtio emulated NIC, with the only extra configuration being the addition of ",failover=on" to the device commandline, and the transient device must be a hostdev NIC (<interface type='hostdev'> or <interface type='network'> with a network that is a pool of SRIOV VFs) where the extra configuration is the addition of ",failover_pair_id=$aliasOfVirtio" to the device commandline. These new options are supported in QEMU 4.2.0 and later.

Extra qemu-specific validation is added to ensure that the device type/model is appropriate and that the qemu binary supports these commandline options.

The result of this will be:

1) The virtio device presented to the guest will have an extra bit set in its PCI capabilities indicating that it can be used as a failover backup device. The virtio guest driver will need to be equipped to do something with this information. Unfortunately there is no way for libvirt to learn whether or not the guest driver supports failover - if it doesn't then the extra PCI capability will be ignored and the guest OS will just see two independent devices. (NB: the current virtio guest driver also requires that the MAC addresses of the two NICs match in order to pair them into a bond).

2) When a migration is requested, QEMU will automatically unplug the transient/hostdev NIC from the guest on the source host before starting migration, and automatically re-plug a similar device after restarting the guest CPUs on the destination host. While the transient NIC is unplugged, all network traffic will go through the persistent/virtio device, but when the hostdev NIC is plugged in, it will get all the traffic.

This means that in normal circumstances the guest gets the performance advantage of vfio-assigned "real hardware" networking, but it can still be migrated with the only downside being a performance penalty (due to using an emulated NIC) during the migration.

Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_command.c                       |  9 +++++
 src/qemu/qemu_domain.c                        | 36 +++++++++++++++--
 .../qemuxml2argvdata/net-virtio-teaming.args  | 40 +++++++++++++++++++
 tests/qemuxml2argvtest.c                      |  4 ++
 4 files changed, 86 insertions(+), 3 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.args

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index c66b60fd21..63aa10d3af 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -3833,6 +3833,8 @@ qemuBuildNicDevStr(virDomainDefPtr def,
         }
         virBufferAsprintf(&buf, ",host_mtu=%u", net->mtu);
     }
+    if (usingVirtio && net->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_PERSISTENT)
+        virBufferAddLit(&buf, ",failover=on");
 
     virBufferAsprintf(&buf, ",netdev=host%s", net->info.alias);
     virBufferAsprintf(&buf, ",id=%s", net->info.alias);
@@ -4704,6 +4706,13 @@ qemuBuildPCIHostdevDevStr(const virDomainDef *def,
     if (qemuBuildRomStr(&buf, dev->info) < 0)
         return NULL;
 
+    if (dev->parentnet &&
+        dev->parentnet->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT &&
+        dev->parentnet->teaming.persistent) {
+        virBufferAsprintf(&buf, ",failover_pair_id=%s",
+                          dev->parentnet->teaming.persistent);
+    }
+
     return virBufferContentAndReset(&buf);
 }
 
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index ce0c5b78cd..6bd5d10f09 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -6391,12 +6391,20 @@ qemuDomainValidateActualNetDef(const virDomainNetDef *net,
         return -1;
     }
 
+    if (net->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT &&
+        actualType != VIR_DOMAIN_NET_TYPE_HOSTDEV) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("interface %s - teaming transient device must be type='hostdev', not '%s'"),
+                       macstr, virDomainNetTypeToString(actualType));
+        return -1;
+    }
     return 0;
 }
 
 
 static int
-qemuDomainDeviceDefValidateNetwork(const virDomainNetDef *net)
+qemuDomainDeviceDefValidateNetwork(const virDomainNetDef *net,
+                                   virQEMUCapsPtr qemuCaps)
 {
     bool hasIPv4 = false;
     bool hasIPv6 = false;
@@ -6481,7 +6489,29 @@ qemuDomainDeviceDefValidateNetwork(const virDomainNetDef *net)
         return -1;
     }
 
-    if (net->coalesce && !qemuDomainNetSupportsCoalesce(net->type)) {
+    if (net->teaming.type != VIR_DOMAIN_NET_TEAMING_TYPE_NONE &&
+        !virQEMUCapsGet(qemuCaps, QEMU_CAPS_VIRTIO_NET_FAILOVER)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                       _("virtio-net failover (teaming) is not supported with this QEMU binary"));
+        return -1;
+    }
+    if (net->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_PERSISTENT
+        && !virDomainNetIsVirtioModel(net)) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("virtio-net teaming persistent interface must be <model type='virtio'/>, not '%s'"),
+                       virDomainNetGetModelString(net));
+        return -1;
+    }
+    if (net->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT &&
+        net->type != VIR_DOMAIN_NET_TYPE_HOSTDEV &&
+        net->type != VIR_DOMAIN_NET_TYPE_NETWORK) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                       _("virtio-net teaming transient interface must be type='hostdev', not '%s'"),
+                       virDomainNetTypeToString(net->type));
+        return -1;
+    }
+
+   if (net->coalesce && !qemuDomainNetSupportsCoalesce(net->type)) {
         virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                        _("coalesce settings on interface type %s are not supported"),
                        virDomainNetTypeToString(net->type));
@@ -8377,7 +8407,7 @@ qemuDomainDeviceDefValidate(const virDomainDeviceDef *dev,
 
     switch ((virDomainDeviceType)dev->type) {
     case VIR_DOMAIN_DEVICE_NET:
-        ret = qemuDomainDeviceDefValidateNetwork(dev->data.net);
+        ret = qemuDomainDeviceDefValidateNetwork(dev->data.net, qemuCaps);
         break;
 
     case VIR_DOMAIN_DEVICE_CHR:
diff --git a/tests/qemuxml2argvdata/net-virtio-teaming.args b/tests/qemuxml2argvdata/net-virtio-teaming.args
new file mode 100644
index 0000000000..19e7260843
--- /dev/null
+++ b/tests/qemuxml2argvdata/net-virtio-teaming.args
@@ -0,0 +1,40 @@
+LC_ALL=C \
+PATH=/bin \
+HOME=/tmp/lib/domain--1-QEMUGuest1 \
+USER=test \
+LOGNAME=test \
+XDG_DATA_HOME=/tmp/lib/domain--1-QEMUGuest1/.local/share \
+XDG_CACHE_HOME=/tmp/lib/domain--1-QEMUGuest1/.cache \
+XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \
+QEMU_AUDIO_DRV=none \
+/usr/bin/qemu-system-i386 \
+-name QEMUGuest1 \
+-S \
+-machine pc,accel=tcg,usb=off,dump-guest-core=off \
+-m 214 \
+-realtime mlock=off \
+-smp 1,sockets=1,cores=1,threads=1 \
+-uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \
+-display none \
+-no-user-config \
+-nodefaults \
+-chardev socket,id=charmonitor,path=/tmp/lib/domain--1-QEMUGuest1/monitor.sock,\
+server,nowait \
+-mon chardev=charmonitor,id=monitor,mode=control \
+-rtc base=utc \
+-no-shutdown \
+-no-acpi \
+-usb \
+-drive file=/dev/HostVG/QEMUGuest1,format=raw,if=none,id=drive-ide0-0-0 \
+-device ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 \
+-netdev user,id=hostua-backup0 \
+-device virtio-net-pci,failover=on,netdev=hostua-backup0,id=ua-backup0,\
+mac=00:11:22:33:44:55,bus=pci.0,addr=0x3 \
+-netdev user,id=hostua-backup1 \
+-device virtio-net-pci,failover=on,netdev=hostua-backup1,id=ua-backup1,\
+mac=66:44:33:22:11:00,bus=pci.0,addr=0x4 \
+-device vfio-pci,host=0000:03:07.1,id=hostdev0,bus=pci.0,addr=0x5,\
+failover_pair_id=ua-backup0 \
+-device vfio-pci,host=0000:03:07.2,id=hostdev1,bus=pci.0,addr=0x6,\
+failover_pair_id=ua-backup1 \
+-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index b923590930..4d26fe0b55 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -1308,6 +1308,10 @@ mymain(void)
             QEMU_CAPS_VIRTIO_NET_RX_QUEUE_SIZE,
             QEMU_CAPS_VIRTIO_NET_TX_QUEUE_SIZE);
     DO_TEST_PARSE_ERROR("net-virtio-rxqueuesize-invalid-size",
NONE); + DO_TEST("net-virtio-teaming", + QEMU_CAPS_VIRTIO_NET_FAILOVER, + QEMU_CAPS_DEVICE_VFIO_PCI); + DO_TEST_PARSE_ERROR("net-virtio-teaming", NONE); DO_TEST("net-eth", NONE); DO_TEST("net-eth-ifname", NONE); DO_TEST("net-eth-names", NONE); -- 2.24.1

On Fri, Jan 24, 2020 at 10:39:18AM -0500, Laine Stump wrote:
The QEMU driver uses the <teaming type='persistent|transient' persistent='blah'/> element to set up a "failover" pair of devices - the persistent device must be a virtio emulated NIC, with the only extra configuration being the addition of ",failover=on" to the device commandline, and the transient device must be a hostdev NIC (<interface type='hostdev'> or <interface type='network'> with a network that is a pool of SRIOV VFs) where the extra configuration is the addition of ",failover_pair_id=$aliasOfVirtio" to the device commandline. These new options are supported in QEMU 4.2.0 and later.
Extra qemu-specific validation is added to ensure that the device type/model is appropriate and that the qemu binary supports these commandline options.
The result of this will be:
1) The virtio device presented to the guest will have an extra bit set in its PCI capabilities indicating that it can be used as a failover backup device. The virtio guest driver will need to be equipped to do something with this information. Unfortunately there is no way for libvirt to learn whether or not the guest driver supports failover - if it doesn't then the extra PCI capability will be ignored and the guest OS will just see two independent devices. (NB: the current virtio guest driver also requires that the MAC addresses of the two NICs match in order to pair them into a bond).
2) When a migration is requested, QEMU will automatically unplug the transient/hostdev NIC from the guest on the source host before starting migration, and automatically re-plug a similar device after restarting the guest CPUs on the destination host. While the transient NIC is unplugged, all network traffic will go through the persistent/virtio device, but when the hostdev NIC is plugged in, it will get all the traffic. This means that in normal circumstances the guest gets the performance advantage of vfio-assigned "real hardware" networking, but it can still be migrated with the only downside being a performance penalty (due to using an emulated NIC) during the migration.
Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_command.c                       |  9 +++++
 src/qemu/qemu_domain.c                        | 36 +++++++++++++++--
 .../qemuxml2argvdata/net-virtio-teaming.args  | 40 +++++++++++++++++++
 tests/qemuxml2argvtest.c                      |  4 ++
 4 files changed, 86 insertions(+), 3 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.args
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
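As a rough illustration of the mapping described in the commit message above, the sketch below appends the teaming-related option to an already-built -device argument. It is not the patch's code: TeamingType, Teaming and append_teaming_opts() are hypothetical stand-ins for libvirt's own types and virBuffer helpers, and the option is appended at the end of the string rather than before "netdev=" as the real patch does.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for libvirt's teaming types. */
typedef enum {
    TEAMING_NONE = 0,
    TEAMING_PERSISTENT,
    TEAMING_TRANSIENT
} TeamingType;

typedef struct {
    TeamingType type;
    const char *persistent; /* alias of the virtio NIC; transient only */
} Teaming;

/*
 * Append the teaming-related suffix to a NUL-terminated -device
 * argument held in buf. Returns 0 on success (including when nothing
 * needs to be appended), -1 if the buffer is too small.
 */
int append_teaming_opts(char *buf, size_t buflen, const Teaming *t, int is_virtio)
{
    size_t used = strlen(buf);
    int n = 0;

    if (used >= buflen)
        return -1;

    if (t->type == TEAMING_PERSISTENT && is_virtio) {
        /* persistent virtio NIC: advertise failover support */
        n = snprintf(buf + used, buflen - used, ",failover=on");
    } else if (t->type == TEAMING_TRANSIENT && t->persistent) {
        /* transient hostdev NIC: point at the virtio NIC's alias */
        n = snprintf(buf + used, buflen - used,
                     ",failover_pair_id=%s", t->persistent);
    }

    if (n < 0 || (size_t)n >= buflen - used)
        return -1;
    return 0;
}
```

With the aliases from the test data, a virtio device string would gain ",failover=on" and the matching vfio-pci string would gain ",failover_pair_id=ua-backup0".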

Normally a PCI hostdev can't be migrated, so qemuMigrationSrcIsAllowedHostdev() won't permit it. In the case of a hostdev network interface that has <teaming type='transient'/> set, QEMU will automatically unplug the device prior to migration, and re-plug a corresponding device on the destination. This patch modifies qemuMigrationSrcIsAllowedHostdev() to allow domains with those devices to be migrated.

Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_migration.c | 52 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 29d228a8d9..46612a3c84 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1093,10 +1093,54 @@ qemuMigrationSrcIsAllowedHostdev(const virDomainDef *def)
      * forbidden. */
     for (i = 0; i < def->nhostdevs; i++) {
         virDomainHostdevDefPtr hostdev = def->hostdevs[i];
-        if (hostdev->mode != VIR_DOMAIN_HOSTDEV_MODE_SUBSYS ||
-            hostdev->source.subsys.type != VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB) {
-            virReportError(VIR_ERR_OPERATION_INVALID, "%s",
-                           _("domain has assigned non-USB host devices"));
+        switch ((virDomainHostdevMode)hostdev->mode) {
+        case VIR_DOMAIN_HOSTDEV_MODE_CAPABILITIES:
+            virReportError(VIR_ERR_OPERATION_UNSUPPORTED, "%s",
+                           _("cannot migrate a domain with <hostdev mode='capabilities'>"));
+            return false;
+
+        case VIR_DOMAIN_HOSTDEV_MODE_SUBSYS:
+            switch ((virDomainHostdevSubsysType)hostdev->source.subsys.type) {
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_USB:
+                /* USB devices can be "migrated" */
+                continue;
+
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI:
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_SCSI_HOST:
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_MDEV:
+                virReportError(VIR_ERR_OPERATION_UNSUPPORTED,
+                               _("cannot migrate a domain with <hostdev mode='subsystem' type='%s'>"),
+                               virDomainHostdevSubsysTypeToString(hostdev->source.subsys.type));
+                return false;
+
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI:
+                /*
+                 * if this is a network interface with <teaming
+                 * type='transient'>, migration *is* allowed because
+                 * the device will be auto-unplugged by QEMU during
+                 * migration.
+                 */
+                if (hostdev->parentnet &&
+                    hostdev->parentnet->teaming.type == VIR_DOMAIN_NET_TEAMING_TYPE_TRANSIENT) {
+                    continue;
+                }
+
+                /* all other PCI hostdevs can't be migrated */
+                virReportError(VIR_ERR_OPERATION_UNSUPPORTED,
+                               _("cannot migrate a domain with <hostdev mode='subsystem' type='%s'>"),
+                               virDomainHostdevSubsysTypeToString(hostdev->source.subsys.type));
+                return false;
+
+            case VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_LAST:
+                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                               _("invalid hostdev subsystem type"));
+                return false;
+            }
+            break;
+
+        case VIR_DOMAIN_HOSTDEV_MODE_LAST:
+            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                           _("invalid hostdev mode"));
             return false;
         }
     }
--
2.24.1

On Fri, Jan 24, 2020 at 10:39:19AM -0500, Laine Stump wrote:
Normally a PCI hostdev can't be migrated, so qemuMigrationSrcIsAllowedHostdev() won't permit it. In the case of a hostdev network interface that has <teaming type='transient'/> set, QEMU will automatically unplug the device prior to migration, and re-plug a corresponding device on the destination. This patch modifies qemuMigrationSrcIsAllowedHostdev() to allow domains with those devices to be migrated.
Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_migration.c | 52 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 4 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Regards,
Daniel
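The decision this patch encodes can be reduced to a small standalone function. The following is only a sketch under simplifying assumptions: HostdevKind, Hostdev and migration_allowed() are hypothetical names, the mode/subsystem distinction is collapsed into one enum, and error reporting is omitted.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the hostdev kinds the patch handles. */
typedef enum {
    HOSTDEV_USB,
    HOSTDEV_PCI,
    HOSTDEV_SCSI,
    HOSTDEV_MDEV
} HostdevKind;

typedef struct {
    HostdevKind kind;
    /* true if the parent <interface> has <teaming type='transient'/> */
    bool teaming_transient;
} Hostdev;

/* Return true if every assigned host device permits migration. */
bool migration_allowed(const Hostdev *devs, size_t ndevs)
{
    for (size_t i = 0; i < ndevs; i++) {
        switch (devs[i].kind) {
        case HOSTDEV_USB:
            /* USB devices can be "migrated" */
            continue;
        case HOSTDEV_PCI:
            /* a transient teaming NIC is auto-unplugged by QEMU */
            if (devs[i].teaming_transient)
                continue;
            return false;
        case HOSTDEV_SCSI:
        case HOSTDEV_MDEV:
            return false;
        }
        /* only reached for a value outside the enum: be conservative */
        return false;
    }
    return true;
}
```

A domain with a USB hostdev plus a transient teaming VF passes the check, while adding any ordinary PCI hostdev makes it fail, which matches the behavior described in the commit message.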

Aside from itinerant error (actually warning) messages due to an unrecognized response from qemu, this isn't even necessary - the migration proceeds successfully to completion anyway.

(I'm not sure where to see this status reported in the API though - do we need to add an extra state, or recognition of a new event somewhere?)

Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_migration.c    | 1 +
 src/qemu/qemu_monitor.c      | 1 +
 src/qemu/qemu_monitor.h      | 1 +
 src/qemu/qemu_monitor_json.c | 1 +
 4 files changed, 4 insertions(+)

diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 46612a3c84..b56ccbdc3c 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1457,6 +1457,7 @@ qemuMigrationUpdateJobType(qemuDomainJobInfoPtr jobInfo)
     case QEMU_MONITOR_MIGRATION_STATUS_SETUP:
     case QEMU_MONITOR_MIGRATION_STATUS_ACTIVE:
     case QEMU_MONITOR_MIGRATION_STATUS_CANCELLING:
+    case QEMU_MONITOR_MIGRATION_STATUS_WAIT_UNPLUG:
     case QEMU_MONITOR_MIGRATION_STATUS_LAST:
         break;
     }
diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c
index ccd20b3740..4f547bf5ec 100644
--- a/src/qemu/qemu_monitor.c
+++ b/src/qemu/qemu_monitor.c
@@ -168,6 +168,7 @@ VIR_ENUM_IMPL(qemuMonitorMigrationStatus,
               "device", "postcopy-active",
               "completed", "failed",
               "cancelling", "cancelled",
+              "wait-unplug",
 );
 
 VIR_ENUM_IMPL(qemuMonitorVMStatus,
diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h
index 3f3b81cddd..cca2cdcb27 100644
--- a/src/qemu/qemu_monitor.h
+++ b/src/qemu/qemu_monitor.h
@@ -767,6 +767,7 @@ typedef enum {
     QEMU_MONITOR_MIGRATION_STATUS_ERROR,
     QEMU_MONITOR_MIGRATION_STATUS_CANCELLING,
     QEMU_MONITOR_MIGRATION_STATUS_CANCELLED,
+    QEMU_MONITOR_MIGRATION_STATUS_WAIT_UNPLUG,
 
     QEMU_MONITOR_MIGRATION_STATUS_LAST
 } qemuMonitorMigrationStatus;
diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c
index e5164d218a..5d8c7e9b5e 100644
--- a/src/qemu/qemu_monitor_json.c
+++ b/src/qemu/qemu_monitor_json.c
@@ -3515,6 +3515,7 @@ qemuMonitorJSONGetMigrationStatsReply(virJSONValuePtr reply,
     case QEMU_MONITOR_MIGRATION_STATUS_INACTIVE:
     case QEMU_MONITOR_MIGRATION_STATUS_SETUP:
     case QEMU_MONITOR_MIGRATION_STATUS_CANCELLED:
+    case QEMU_MONITOR_MIGRATION_STATUS_WAIT_UNPLUG:
     case QEMU_MONITOR_MIGRATION_STATUS_LAST:
         break;
 
--
2.24.1

On Fri, Jan 24, 2020 at 10:39:20AM -0500, Laine Stump wrote:
Aside from itinerant error (actually warning) messages due to an unrecognized response from qemu, this isn't even necessary - the migration proceeds successfully to completion anyway.
(I'm not sure where to see this status reported in the API though - do we need to add an extra state, or recognition of a new event somewhere?)
Signed-off-by: Laine Stump <laine@redhat.com>
---
 src/qemu/qemu_migration.c    | 1 +
 src/qemu/qemu_monitor.c      | 1 +
 src/qemu/qemu_monitor.h      | 1 +
 src/qemu/qemu_monitor_json.c | 1 +
 4 files changed, 4 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
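To show why an unlisted status string leads to the "unrecognized response" warning mentioned above, here is a minimal sketch of a string-to-enum lookup. It is not libvirt's VIR_ENUM_IMPL machinery: MigStatus, mig_status_names and mig_status_from_string() are hypothetical, and the table is deliberately reduced to a handful of the strings visible in the patch.

```c
#include <string.h>

/* Hypothetical, reduced status enum; the real one has more entries. */
typedef enum {
    MIG_STATUS_COMPLETED,
    MIG_STATUS_FAILED,
    MIG_STATUS_CANCELLING,
    MIG_STATUS_CANCELLED,
    MIG_STATUS_WAIT_UNPLUG,   /* the value this patch introduces */
    MIG_STATUS_LAST
} MigStatus;

/* String table, kept in the same order as the enum above. */
static const char *const mig_status_names[MIG_STATUS_LAST] = {
    "completed", "failed", "cancelling", "cancelled", "wait-unplug",
};

/*
 * Map a status string reported by the monitor to its enum value.
 * Returns -1 for a string that is not in the table, which is the
 * case a caller would log as an unrecognized response.
 */
int mig_status_from_string(const char *s)
{
    for (int i = 0; i < MIG_STATUS_LAST; i++) {
        if (strcmp(mig_status_names[i], s) == 0)
            return i;
    }
    return -1;
}
```

Without the "wait-unplug" entry the lookup would return -1 for that status, so the caller could only warn and carry on, consistent with the migration still completing as the commit message notes.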

and the QEMU backend implementation using virtio failover.

Signed-off-by: Laine Stump <laine@redhat.com>
---
 docs/formatdomain.html.in | 100 ++++++++++++++++++++++++++++++++++++++
 docs/news.xml             |  28 +++++++++++
 2 files changed, 128 insertions(+)

diff --git a/docs/formatdomain.html.in b/docs/formatdomain.html.in
index 4db9c292b7..a1c2a1e392 100644
--- a/docs/formatdomain.html.in
+++ b/docs/formatdomain.html.in
@@ -5873,6 +5873,106 @@
 &lt;/devices&gt;
 ...</pre>
 
+    <h5><a id="elementsTeaming">Teaming a virtio/hostdev NIC pair</a></h5>
+
+    <p>
+      <span class="since">Since 6.1.0 (QEMU and KVM only, requires
+        QEMU 4.2.0 or newer and a guest virtio-net driver supporting
+        the "failover" feature)
+      </span>
+      The <code>&lt;teaming&gt;</code> element of two interfaces can
+      be used to connect them as a team/bond device in the guest
+      (assuming proper support in the hypervisor and the guest
+      network driver).
+    </p>
+
+<pre>
+...
+&lt;devices&gt;
+  &lt;interface type='network'&gt;
+    &lt;source network='mybridge'/&gt;
+    &lt;mac address='00:11:22:33:44:55'/&gt;
+    &lt;model type='virtio'/&gt;
+    &lt;teaming type='persistent'/&gt;
+    &lt;alias name='ua-backup0'/&gt;
+  &lt;/interface&gt;
+  &lt;interface type='network'&gt;
+    &lt;source network='hostdev-pool'/&gt;
+    &lt;mac address='00:11:22:33:44:55'/&gt;
+    &lt;model type='virtio'/&gt;
+    &lt;teaming type='transient' persistent='ua-backup0'/&gt;
+  &lt;/interface&gt;
+&lt;/devices&gt;
+...</pre>
+
+    <p>
+      The <code>&lt;teaming&gt;</code> element required
+      attribute <code>type</code> will be set to
+      either <code>"persistent"</code> to indicate a device that
+      should always be present in the domain,
+      or <code>"transient"</code> to indicate a device that may
+      periodically be removed, then later re-added to the domain. When
+      type="transient", there should be a second attribute
+      to <code>&lt;teaming&gt;</code> called <code>"persistent"</code>
+      - this attribute should be set to the alias name of the other
+      device in the pair (the one that has <code>&lt;teaming
+      type="persistent'/&gt;</code>).
+    </p>
+    <p>
+      In the particular case of QEMU,
+      libvirt's <code>&lt;teaming&gt;</code> element is used to setup
+      a virtio-net "failover" device pair. For this setup, the
+      persistent device must be an interface with <code>&lt;model
+      type="virtio"/&gt;</code>, and the transient device must
+      be <code>&lt;interface type='hostdev'/&gt;</code>
+      (or <code>&lt;interface type='network'/&gt;</code> where the
+      referenced network defines a pool of SRIOV VFs). The guest will
+      then have a simple network team/bond device made of the virtio
+      NIC + hostdev NIC pair. In this configuration, the
+      higher-performing hostdev NIC will normally be preferred for all
+      network traffic, but when the domain is migrated, QEMU will
+      automatically unplug the VF from the guest, and then hotplug a
+      similar device once migration is completed; while migration is
+      taking place, network traffic will use the virtio NIC. (Of
+      course the emulated virtio NIC and the hostdev NIC must be
+      connected to the same subnet for bonding to work properly).
+    </p>
+    <p>
+      NB1: Since you must know the alias name of the virtio NIC when
+      configuring the hostdev NIC, it will need to be manually set in
+      the virtio NIC's configuration (as with all other manually set
+      alias names, this means it must start with "ua-").
+    </p>
+    <p>
+      NB2: Currently the only implementation of the guest OS
+      virtio-net driver supporting virtio-net failover requires that
+      the MAC addresses of the virtio and hostdev NIC must
+      match. Since that may not always be a requirement in the future,
+      libvirt doesn't enforce this limitation - it is up to the
+      person/management application that is creating the configuration
+      to assure the MAC addresses of the two devices match.
+    </p>
+    <p>
+      NB3: Since the PCI addresses of the SRIOV VFs on the hosts that
+      are the source and destination of the migration will almost
+      certainly be different, either higher level management software
+      will need to modify the <code>&lt;source&gt;</code> of the
+      hostdev NIC (<code>&lt;interface type='hostdev'&gt;</code>) at
+      the start of migration, or (a simpler solution) the
+      configuration will need to use a libvirt "hostdev" virtual
+      network that maintains a pool of such devices, as is implied in
+      the example's use of the libvirt network named "hostdev-pool" -
+      as long as the hostdev network pools on both hosts have the same
+      name, libvirt itself will take care of allocating an appropriate
+      device on both ends of the migration. Similarly the XML for the
+      virtio interface must also either work correctly unmodified on
+      both the source and destination of the migration (e.g. by
+      connecting to the same bridge device on both hosts, or by using
+      the same virtual network), or the management software must
+      properly modify the interface XML during migration so that the
+      virtio device remains connected to the same network segment
+      before and after migration.
+    </p>
 
     <h5><a id="elementsNICSMulticast">Multicast tunnel</a></h5>
 
diff --git a/docs/news.xml b/docs/news.xml
index 056c7ef026..7dc9cc18cb 100644
--- a/docs/news.xml
+++ b/docs/news.xml
@@ -44,6 +44,34 @@
 <libvirt>
   <release version="v6.1.0" date="unreleased">
     <section title="New features">
+      <change>
+        <summary>
+          support for virtio+hostdev NIC &lt;teaming&gt;
+        </summary>
+        <description>
+          QEMU 4.2.0 and later, combined with a sufficiently recent
+          guest virtio-net driver, supports setting up a simple
+          network bond device comprised of one virtio emulated NIC and
+          one hostdev NIC (which must be an SRIOV VF). (in QEMU, this
+          is known as the "virtio failover" feature). The allure of
+          this setup is that the bond will always favor the hostdev
+          device, providing better performance, until the guest is
+          migrated - at that time QEMU will automatically unplug the
+          hostdev NIC and the bond will send all traffic via the
+          virtio NIC until migration is completed, then QEMU on the
+          destination side will hotplug a new hostdev NIC and the bond
+          will switch back to using the hostdev for network
+          traffic. The result is that guests desiring the extra
+          performance of a hostdev NIC are now migratable without
+          network downtime (performance is just degraded during
+          migration) and without requiring a complicated bonding
+          configuration in the guest OS network config and complicated
+          unplug/replug logic in the management application on the
+          host - it can instead all be accomplished in libvirt with
+          the interface &lt;teaming&gt; subelement "type" and
+          "persistent" attributes.
+        </description>
+      </change>
     </section>
     <section title="Improvements">
     </section>
--
2.24.1

On Fri, Jan 24, 2020 at 10:39:21AM -0500, Laine Stump wrote:
and the QEMU backend implementation using virtio failover.
Signed-off-by: Laine Stump <laine@redhat.com>
---
 docs/formatdomain.html.in | 100 ++++++++++++++++++++++++++++++++++++++
 docs/news.xml             |  28 +++++++++++
 2 files changed, 128 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

Regards,
Daniel
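Since the documentation in this patch leaves the alias and MAC address requirements (NB1 and NB2) to whoever writes the configuration, a management application might want a pre-flight check along the following lines. This is only a sketch: Nic and teaming_pair_ok() are hypothetical names, not libvirt API, and strcasecmp() from POSIX <strings.h> is assumed to be available.

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h> /* strcasecmp (POSIX) */

/* Hypothetical, minimal view of one <interface> in the pair. */
typedef struct {
    const char *alias;      /* <alias name='...'/>, may be NULL */
    const char *mac;        /* <mac address='...'/> */
    const char *persistent; /* <teaming persistent='...'/>, transient only */
} Nic;

/*
 * Check the constraints the documentation places on a teaming pair:
 *  - the virtio NIC has a user-set alias starting with "ua-" (NB1)
 *  - the hostdev NIC's "persistent" attribute names that alias
 *  - both NICs carry the same MAC address (NB2, not enforced by libvirt)
 */
bool teaming_pair_ok(const Nic *virtio, const Nic *hostdev)
{
    if (!virtio->alias || strncmp(virtio->alias, "ua-", 3) != 0)
        return false;
    if (!hostdev->persistent ||
        strcmp(hostdev->persistent, virtio->alias) != 0)
        return false;
    if (!virtio->mac || !hostdev->mac)
        return false;
    /* MAC strings may differ in hex digit case */
    return strcasecmp(virtio->mac, hostdev->mac) == 0;
}
```

Checking the MAC match on the client side keeps the configuration valid for the current guest driver without libvirt itself having to enforce a rule that may later be relaxed.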

On Fri, 24 Jan 2020, 17:54 Laine Stump, <laine@redhat.com> wrote:
V1: https://www.redhat.com/archives/libvir-list/2020-January/msg00813.html
This all used different names in V1 - in that incarnation the configuration was done using "failover" and "backupAlias" attributes added to the <driver> subelement of <interface>. But the resulting code was cumbersome and had little bits scattered all over the place due to needing it in both hostdev and interface parsing/formatting.
In his review of V1, danpb suggested just adding a new subelement for this configuration to free ourselves from the constraints of <driver> parsing/formatting. This ended up dramatically simplifying the code (hence the lack of V1's refactoring patches in V2, and a decrease in patch count from 12 to 6).
During further discussion in email and on IRC, we decided that naming the element <failover> was too limiting, as it implied the behavior of what is, to libvirt, just two network devices that should be teamed/bonded together - it's completely up to the hypervisor and guest what is done with this information. In light of that, we decided to name the new subelement <teaming>, and to specify the two interfaces as "persistent" (an interface that will always remain plugged in) and "transient" (an interface that may be periodically unplugged (during migration, in the case of QEMU)). So the virtio device will have
<teaming type='persistent'/>
and the hostdev device will have
<teaming type='transient' persistent='ua-myvirtio'/>
(note that the persistent interface must have <alias name='ua-myvirtio'/>)
Given this config, libvirt will add "failover=on" to the device commandline arg for the virtio device, and "failover_pair_id=ua-myvirtio" to the arg for the hostdev device (and when a migration is requested, it will notice if there is a hostdev that has <teaming type='transient'/> set, and will allow the migration in this case, but still disallow migrations of domains with normal hostdevs).
In response to these extra commandline options, QEMU will set some extra capabilities in the virtio device PCI capabilities data, and will also automatically unplug/re-plug the hostdev device before and after migration.
In the guest, the virtio-net driver will notice the extra PCI capabilities and use this as a clue that it should search for another device with matching MAC address (NB: the guest driver requires the two devices to have matching MAC addresses) to join into a bond with the virtio-net device.
I like the <teaming/> abstraction. As I wrote earlier, as a virt-manager user I'd like to specify that two interfaces are teamed; I would not care to copy the mac address of one onto the other. I prefer that libvirt hides this virtio awkwardness by passing the "persistent" mac address to both qemu nics. Would libvirt do this service to its multiple clients?
This bond is hard-wired to always prefer the hostdev device whenever it is present, and use the virtio device as backup when the hostdev is unplugged.
----
As mentioned in a followup to the V1 cover letter, there is a regression in QEMU 4.2.0 that causes QEMU to segv when a hostdev is unplugged. That bug is fixed with this upstream QEMU patch:
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=0446f8121723b134ca1d1ed0b73e...
Be sure to use a qemu build with this patch applied, or you may not even be able to start the guest! Also we've found that the DEVICE_DELETED event is never sent to libvirt when one of these hostdevs is manually unplugged, meaning that libvirt keeps the device marked as "in-use", and it therefore cannot be re-plugged to the guest until after a complete guest "power cycle". AFAIK there isn't yet a fix for that bug, so don't expect manual unplug of the device to work.
Laine Stump (6):
  qemu: add capabilities flag for failover feature
  conf: parse/format <teaming> subelement of <interface>
  qemu: support interface <teaming> functionality
  qemu: allow migration with assigned PCI hostdev if <teaming> is set
  qemu: add wait-unplug to qemu migration status enum
  docs: document <interface> subelement <teaming>
 docs/formatdomain.html.in                     | 100 ++++++++++++++++++
 docs/news.xml                                 |  28 +++++
 docs/schemas/domaincommon.rng                 |  19 ++++
 src/conf/domain_conf.c                        |  45 ++++++++
 src/conf/domain_conf.h                        |  14 +++
 src/qemu/qemu_capabilities.c                  |   4 +
 src/qemu/qemu_capabilities.h                  |   3 +
 src/qemu/qemu_command.c                       |   9 ++
 src/qemu/qemu_domain.c                        |  36 ++++++-
 src/qemu/qemu_migration.c                     |  53 +++++++++-
 src/qemu/qemu_monitor.c                       |   1 +
 src/qemu/qemu_monitor.h                       |   1 +
 src/qemu/qemu_monitor_json.c                  |   1 +
 .../caps_4.2.0.aarch64.xml                    |   1 +
 .../caps_4.2.0.x86_64.xml                     |   1 +
 .../net-virtio-teaming-network.xml            |  37 +++++++
 .../qemuxml2argvdata/net-virtio-teaming.args  |  40 +++++++
 tests/qemuxml2argvdata/net-virtio-teaming.xml |  50 +++++++++
 tests/qemuxml2argvtest.c                      |   4 +
 .../net-virtio-teaming-network.xml            |  51 +++++++++
 .../qemuxml2xmloutdata/net-virtio-teaming.xml |  66 ++++++++++++
 tests/qemuxml2xmltest.c                       |   6 ++
 22 files changed, 563 insertions(+), 7 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.args
 create mode 100644 tests/qemuxml2argvdata/net-virtio-teaming.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming-network.xml
 create mode 100644 tests/qemuxml2xmloutdata/net-virtio-teaming.xml
-- 2.24.1
participants (3)
- Dan Kenigsberg
- Daniel P. Berrangé
- Laine Stump