[libvirt] Network device abstraction aka virtual switch - V3

This is a followup to https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html (and an even earlier draft) which I alluded to here: https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html

Network device abstraction aka virtual switch - V3
==================================================

The <interface> element of a guest's domain config in libvirt has a <source> element that describes what resources on a host will be used to connect the guest's network interface to the rest of the world. This is very flexible, allowing several different types of connection (virtual network, host bridge, direct macvtap connection to physical interface, qemu usermode, user-defined via an external script), but currently has the problem that unnecessary details of the host resources are embedded into the guest's config; if the guest is migrated to a different host, and that host has a different hardware or network config (or possibly the same hardware, but that hardware is currently in use by a different guest), the migration will fail.

I am proposing a change to libvirt's network XML that will allow us to (optionally - old configs will remain valid) remove the host details from the guest's domain XML (which can move around from host to host) and place them in the network XML (which remains with a single host); the domain XML will then use existing config elements to associate each guest interface with a "network".

The motivating use case for this change is the "direct" connection type (which uses macvtap for vepa and vnlink connections directly between a guest and a physical interface, rather than through a bridge), but it is applicable for all types of connection. (Another hopeful side effect of this change will be to make libvirt's network connection model easier to realize on non-Linux hypervisors (eg, VMWare ESX) and for other network technologies, such as openvswitch, VDE, and various VPN implementations).

Background
==========

(parts lifted from Dan Berrange's last mail on this subject)

Currently <network> supports 3 connectivity modes:

- Non-routed network, separate subnet (no <forward> element present)
- Routed network, separate subnet with NAT (<forward mode='nat'/>)
- Routed network, separate subnet (<forward mode='route'/>)

Each of these is implemented in the existing network driver by creating a bridge device using brctl, and connecting the guest network interfaces via tap devices (a detail which, now that I've stated it, you should promptly forget!). All traffic between that bridge and the outside network is done via the host's IP routing stack (ie, there is no physical device directly connected to the bridge).

In the future, these two additional routed modes might be useful:

- Routed network, IP subnetting
- Routed network, separate subnet with VPN

The core goal of this proposal, though, is to replace type=bridge and type=direct from the domain interface XML with new types of <network> definitions so that the domain can just give "type='network'" and have all the necessary details filled in at runtime.
This basically means we're adding several bridging modes (the submodes of "direct" have been flattened out here):

- Bridged network, eth + bridge + tap
- Bridged network, eth + macvtap + vepa
- Bridged network, eth + macvtap + private
- Bridged network, eth + macvtap + passthrough
- Bridged network, eth + macvtap + bridge

Another "future expansion" could be to add:

- Bridged network, with VPN

Likewise, support for other technologies, such as openvswitch and VDE, would each be another entry on this list.

(Dan also listed each of the above "+sriov" separately, but that ends up being handled in an orthogonal manner (by just specifying a pool of interfaces for a single network), so I'm only giving the abbreviated list)

I. Changes to domain <interface> element
========================================

In many cases, the <interface> element of the domain XML will be identical to what is used now when connecting the interface to a libvirt-style virtual network:

  <interface type='network'>
    <source network='red-network'/>
    <mac address='xx:xx:xx:xx:xx:xx'/>
  </interface>

Depending on the definition of the network "red-network" on the host the guest was started on / migrated to, this could be either a direct (macvtap) connection using one of the various direct modes (vepa/private/bridge/passthrough), a bridge (again, pointed to by the definition of 'red-network'), or a virtual network (using the current network definition syntax). This way the same guest could be migrated not only between macvtap-enabled hosts, but from there to a host using a bridge, or maybe a host in a remote location that used a virtual network with a secure tunnel to connect back to the rest of the red-network.

(Part of the migration process would of course check that the destination host had a network of the proper name with adequate available resources, and fail if it didn't; management software at a level above libvirt would probably filter a list of candidate migration destinations based on available networks and any various details of those networks (eg, it could search for only networks using vepa for the connection), and only attempt migration to one that had the matching network available).

<virtualport> element of <interface>
------------------------------------

Since many of the attributes/sub-elements of <virtualport> (used by some modes of "direct" interface connections) are identical for all interfaces connecting to any given switch, most of the information in <virtualport> will be optional in the domain's interface definition - it can be filled in from a similar <virtualport> element that will be added to the <network> definition.

Some parameters in <virtualport> ("instanceid", for example) must be unique for every interface, though, so those will still be specified in the <interface> XML. The two <virtualport> elements will be OR'ed at runtime to arrive at the actual set of parameters that are used.

(Open Question: What should be the policy when a parameter is specified in both places? Should one take precedence? Or should it be considered an error?)

portgroup attribute of <source>
-------------------------------

The <source> element of an interface definition will be able to optionally specify a "portgroup" attribute. If portgroup is *NOT* given, the default (first) portgroup of the network will be used (if any are defined).
If portgroup *IS* specified, the source network must have a portgroup by that name (or the domain startup/migration will fail), and the attributes of that portgroup will be used for the connection. Here is an example <interface> definition that has both a reduced <virtualport> element, as well as a portgroup attribute:

  <interface type='network'>
    <source network='red-network' portgroup='engineering'/>
    <virtualport type="802.1Qbg">
      <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>

(The specifics of what can be in a portgroup are given below)

II. Changes to <network> definition
===================================

As Dan has pointed out, any additions to <network> must be designed so that existing management applications (written to understand <network> prior to these new additions) will at least recognize that the XML they've been given is for something new that they don't fully understand. At the same time, the new types of network definition should attempt to re-use as much of the existing elements/attributes as possible, both to make it easier to extend these applications, as well as to make the status displays of un-updated applications make as much sense as possible.

Dan's suggestion (which I obviously endorse :-) is that the new types of network should be specified by extending the choices for <forward mode='....'>.

He also suggested adding a new "layer='network|link'" attribute to <forward>. I'm not convinced that item is necessary (it seems redundant), but am including it here for sake of discussion.

The current modes are:

  <forward layer='network' mode='route|nat'/>

(in addition to not listing any mode, which equates to "isolated")

Here are suggested new modes:

  <forward layer='link' mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>

A description of each:

bridge-brctl - equivalent to "<interface type='bridge'>" in the interface definition. The bridge device to use would be given in the existing <forward dev='xxx'>. (Dan also suggests putting this in <network>'s <bridge name='xxx'/> - opinions?) (Question: better name for this?)

vepa - same as "<interface type='direct'>..." with <source mode='vepa'/>

private - <interface type='direct'> ... <source mode='private'/>

passthrough - <interface type='direct'> ... <source mode='passthrough'/>

bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/> (Question: better name for this?)

Interface Pools
---------------

In many cases, a single host network may have multiple physical network devices associated with it (especially in the case of an SRIOV-capable ethernet card, which will have several "virtual functions" associated with a single physical ethernet connection). The host will at least want to balance the load of multiple guests between these multiple devices, and may even require (in the case of passthrough mode, for example) that only a single guest interface be attached to each host device.

The current specification for <forward> only allows for a single "dev" attribute, though. In order to support multiple device names, we will extend <forward> to allow 0 or more <interface> sub-elements:

  <forward mode='vepa' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
  </forward>

Note that, as a convenience, the first of these elements will always be a duplicate of the "dev" attribute in <forward> itself. (Is this necessary/desirable?)
In the case of mode='passthrough', only one guest interface can be connected to a device at a time. libvirt will keep track of which devices are in use, and attempt to assign a free device; failure to assign a device will result in a failure of the domain to start/migrate. For the other direct modes, libvirt will simply keep track of the number of guest interfaces currently using each device, and attempt to keep them balanced.

(Open question: where will we keep track of this allocation/assignment?)

Portgroups
----------

A <portgroup> (sub-element of <network>) is just a way of easily putting connections to the network into different classes, with each class having a different level/type of service. Each <network> can have multiple <portgroup> elements, and each <portgroup> has a name, as well as various attributes associated with it. The first thing we will use portgroups for is as an alternate place to specify <virtualport> parameters:

  <portgroup name='engineering'>
    <virtualport type="802.1Qbg">
      <parameters managerid="11" typeid="1193047" typeidversion="2"/>
    </virtualport>
  </portgroup>

Anything that is valid in an interface's <virtualport> is also valid here.

The next thing to specify in a portgroup will be bandwidth limiting / QoS configuration. Since I don't know exactly what's needed for that, I won't specify it here.

If anything is specified both directly under <network> and in a <portgroup>, the value in portgroup will take precedence. (Again - what will the precedence of items specified in the <interface> be?)

EXAMPLES
--------

Examples of 'red-network' for different types of connections (all of these would work with minor variations of the interface XML given above; eg, the 'vepa' version would require a <virtualport> in the interface that specified an instanceid, and if the <interface> specified a portgroup, it would need to also be in the <network> definition (even if it was empty aside from the name)).

  <!-- Existing usage - a libvirt virtual network -->
  <network>
    <name>red-network</name>
    <bridge name='virbr0'/>
    <forward layer='network' mode='route'/>
    ...
  </network>

  <!-- The simplest - an existing host bridge -->
  <network>
    <name>red-network</name>
    <forward mode='bridge-brctl' dev='br0'/>
  </network>

  <!-- A macvtap connection to a vepa bridge -->
  <network>
    <name>red-network</name>
    <forward layer='link' mode='vepa' dev='eth10'/>
    <virtualport type='802.1Qbg'>
      <parameters managerid='11' typeid='1193047' typeidversion='2'/>
    </virtualport>
    <!-- NB: if <interface> doesn't specify portgroup, -->
    <!-- 'accounting' is assumed -->
    <portgroup name='accounting'>
      <virtualport>
        <parameters typeid='22'/>
      </virtualport>
    </portgroup>
    <portgroup name='engineering'>
      <virtualport>
        <parameters typeid='33'/>
      </virtualport>
    </portgroup>
  </network>

  <!-- A macvtap passthrough connection (one guest interface per dev) -->
  <network>
    <name>red-network</name>
    <forward layer='link' mode='passthrough' dev='eth10'>
      <interface dev='eth10'/>
      <interface dev='eth11'/>
      <interface dev='eth12'/>
      <interface dev='eth13'/>
      <interface dev='eth14'/>
      <interface dev='eth15'/>
      <interface dev='eth16'/>
      <interface dev='eth17'/>
    </forward>
  </network>

=============

Open Questions:

* Is there a good reason to include the "layer='network|link'" attribute in forward? (maybe just because it's useful info for a management application that doesn't know the details of the modes?) Or is it redundant?

* What should be the policy when a virtualport parameter is specified in both the <interface> and the <network>/<portgroup>?
Should one take precedence? Or should it be considered an error?

* Is it okay for the domain's own definition to specify what portgroup it will be in? Or are there cases where we want to allow someone to modify their domain XML, but force them into a particular portgroup beyond their control?

* Is it really necessary/desirable for the first ethernet device in a pool to be duplicated in the <forward dev='xxx'...> attribute? Or can that attribute be omitted when there is a pool of devices?

* Where will we keep track of the count of guest interfaces connected to each host interface device, and where will we keep track of which device is being used by a particular guest interface? In the network/domain XML?

* Does anyone have better names for "brctl-bridge" and "macvtap-bridge"?
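To make the intended workflow concrete, here is a minimal sketch (using the libvirt Python bindings) of how management software might define such a network and connect a guest to it - assuming the proposed <forward> modes were accepted (current libvirt would of course reject mode='vepa' in a <network>), and with 'guest1' as a placeholder domain name:

  import libvirt

  conn = libvirt.open('qemu:///system')

  # Define and start the host-specific network (proposed syntax).
  net_xml = """
  <network>
    <name>red-network</name>
    <forward layer='link' mode='vepa' dev='eth10'/>
    <virtualport type='802.1Qbg'>
      <parameters managerid='11' typeid='1193047' typeidversion='2'/>
    </virtualport>
  </network>
  """
  net = conn.networkDefineXML(net_xml)
  net.create()

  # The guest references the network only by name, so the same snippet
  # is valid on any host that defines a 'red-network'.
  iface_xml = """
  <interface type='network'>
    <source network='red-network'/>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>
  """
  dom = conn.lookupByName('guest1')
  dom.attachDevice(iface_xml)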

On 06/12/2011 08:29 PM, Laine Stump wrote:
* Does anyone have better names for "brctl-bridge" and "macvtap-bridge"?
How about using "direct" instead of macvtap-bridge, and "direct-private", "direct-vepa", and "direct-passthrough" for "private", "vepa", and "passthrough"? I still can't think of anything more generally useful instead of brctl-bridge...

See my comments inline.

Thank you,
Oved

----- Original Message -----
From: "Laine Stump" <laine@laine.org> To: "Libvirt" <libvir-list@redhat.com> Sent: Monday, June 13, 2011 3:29:08 AM Subject: [libvirt] Network device abstraction aka virtual switch - V3 This is a followup to https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html (and an even earlier draft) which I alluded to here:
https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html
Network device abstraction aka virtual switch - V3 ==================================================
The <interface> element of a guest's domain config in libvirt has a <source> element that describes what resources on a host will be used to connect the guest's network interface to the rest of the world. This is very flexible, allowing several different types of connection (virtual network, host bridge, direct macvtap connection to physical interface, qemu usermode, user-defined via an external script), but currently has the problem that unnecessary details of the host resources are embedded into the guest's config; if the guest is migrated to a different host, and that host has a different hardware or network config (or possibly the same hardware, but that hardware is currently in use by a different guest), the migration will fail.
I am proposing a change to libvirt's network XML that will allow us to (optionally - old configs will remain valid) remove the host details from the guest's domain XML (which can move around from host to host) and place them in the network XML (which remains with a single host); the domain XML will then use existing config elements to associate each guest interface with a "network".
The motivating use case for this change is the "direct" connection type (which uses macvtap for vepa and vnlink connections directly between a guest and a physical interface, rather than through a bridge), but it is applicable for all types of connection. (Another hopeful side effect of this change will be to make libvirt's network connection model easier to realize on non-Linux hypervisors (eg, VMWare ESX) and for other network technologies, such as openvswitch, VDE, and various VPN implementations).
Background ==========
(parts lifted from Dan Berrange's last mail on this subject)
Currently <network> supports 3 connectivity modes
- Non-routed network, separate subnet (no <forward> element present)
- Routed network, separate subnet with NAT (<forward mode='nat'/>)
- Routed network, separate subnet (<forward mode='route'/>)
Each of these is implemented in the existing network driver by creating a bridge device using brctl, and connecting the guest network interfaces via tap devices (a detail which, now that I've stated it, you should promptly forget!). All traffic between that bridge and the outside network is done via the host's IP routing stack (ie, there is no physical device directly connected to the bridge)
In the future, these two additional routed modes might be useful:
- Routed network, IP subnetting
- Routed network, separate subnet with VPN
The core goal of this proposal, though, is to replace type=bridge and type=direct from the domain interface XML with new types of <network> definitions so that the domain can just give "type='network'" and have all the necessary details filled in at runtime. This basically means we're adding several bridging modes (the submodes of "direct" have been flattened out here):
- Bridged network, eth + bridge + tap
- Bridged network, eth + macvtap + vepa
- Bridged network, eth + macvtap + private
- Bridged network, eth + macvtap + passthrough
- Bridged network, eth + macvtap + bridge
Another "future expansion" could be to add:
- Bridged network, with VPN
Likewise, support for other technologies, such as openvswitch and VDE would each be another entry on this list.
(Dan also listed each of the above "+sriov" separately, but that ends up being handled in an orthogonal manner (by just specifying a pool of interfaces for a single network), so I'm only giving the abbreviated list)
I. Changes to domain <interface> element ========================================
In many cases, the <interface> element of the domain XML will be identical to what is used now when connecting the interface to a libvirt-style virtual network:
<interface type='network'>
  <source network='red-network'/>
  <mac address='xx:xx:xx:xx:xx:xx'/>
</interface>
Depending on the definition of the network "red-network" on the host the guest was started on / migrated to, this could be either a direct (macvtap) connection using one of the various direct modes (vepa/private/bridge/passthrough), a bridge (again, pointed to by the definition of 'red-network'), or a virtual network (using the current network definition syntax). This way the same guest could be migrated not only between macvtap-enabled hosts, but from there to a host using a bridge, or maybe a host in a remote location that used a virtual network with a secure tunnel to connect back to the rest of the red-network.
(Part of the migration process would of course check that the destination host had a network of the proper name with adequate available resources, and fail if it didn't; management software at a level above libvirt would probably filter a list of candidate migration destinations based on available networks and any various details of those networks (eg. it could search for only networks using vepa for the connection), and only attempt migration to one that had the matching network available).
<virtualport> element of <interface> ------------------------------------
Since many of the attributes/sub-elements of <virtualport> (used by some modes of "direct" interface connections) are identical for all interfaces connecting to any given switch, most of the information in <virtualport> will be optional in the domain's interface definition - it can be filled in from a similar <virtualport> element that will be added to the <network> definition.
Some parameters in <virtualport> ("instanceid", for example) must be unique for every interface, though, so those will still be specified in the <interface> XML. The two <virtualport> elements will be OR'ed at runtime to arrive at the actual set of parameters that are used.
(Open Question: What should be the policy when a parameter is specified in both places? Should one take precedence? Or should it be considered an error?)
I think it depends on the parameter. instanceid, for example, is something that needs to be specified in the guest XML, so I would report an error in case it also exists in the host XML. As for other parameters, like typeid for example, I think that you should take it from the guest XML, and validate that it equals the typeid in the host XML.

I saw below that you specify the managerid, typeid and typeidversion both in the virtualport properties of the network, and in the portgroup. I think that the network should have the managerid, and each portgroup should have the typeid+version. But we should give the option not to have portgroups, and put all devices under the network, and just specify the managerid. You can say that it is the default typeid, but I think it is confusing, because it raises questions like: "Will my guest fail if I specify a different typeid?"

I think there are two cases:

1. There are portgroups:
   a. The guest specified a portgroup.
   b. (I believe this will be the common use-case in RHEV:) The guest specified a typeid - we should check if there is a portgroup that supports it, and if so run the guest on one of its devices.
   c. The guest specified both portgroup and typeid - we will fail in case the portgroup exists, but with another typeid.

2. There are no portgroups:
   a. The guest specified a portgroup - failure.
   b. (I believe this will be the common use-case in RHEV:) The guest specified a typeid - we will just run it on one of the devices in the specified network.
   c. The guest specified both portgroup and typeid - failure.
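To make the above case analysis concrete, a rough sketch in Python-style pseudocode (purely illustrative - it only covers the cases enumerated above, and is not proposed libvirt code):

  def resolve_connection(guest, network):
      # guest:   'portgroup'/'typeid' parameters from the domain XML (either may be absent)
      # network: {'portgroups': {name: typeid}} parsed from the host network XML
      pgs = network.get('portgroups', {})
      pg, typeid = guest.get('portgroup'), guest.get('typeid')
      if pgs:
          if pg is not None and pg not in pgs:
              raise ValueError('no such portgroup')             # startup/migration fails
          if pg is not None and typeid is not None and pgs[pg] != typeid:
              raise ValueError('portgroup has another typeid')  # case 1c
          if pg is not None:
              return pgs[pg]                                    # case 1a
          for name, pg_typeid in pgs.items():
              if pg_typeid == typeid:
                  return typeid                                 # case 1b
          raise ValueError('no portgroup supports this typeid')
      if pg is not None:
          raise ValueError('network has no portgroups')         # cases 2a and 2c
      return typeid                                             # case 2b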
portgroup attribute of <source> -------------------------------
The <source> element of an interface definition will be able to optionally specify a "portgroup" attribute. If portgroup is *NOT* given, the default (first) portgroup of the network will be used (if any are defined). If portgroup *IS* specified, the source network must have a portgroup by that name (or the domain startup/migration will fail), and the attributes of that portgroup will be used for the connection. Here is an example <interface> definition that has both a reduced <virtualport> element, as well as a portgroup attribute:
<interface type='network'>
  <source network='red-network' portgroup='engineering'/>
  <virtualport type="802.1Qbg">
    <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
  </virtualport>
  <mac address='de:ad:be:ef:ca:fe'/>
</interface>
(The specifics of what can be in a portgroup are given below)
II. Changes to <network> definition ===================================
As Dan has pointed out, any additions to <network> must be designed so that existing management applications (written to understand <network> prior to these new additions) will at least recognize that the XML they've been given is for something new that they don't fully understand. At the same time, the new types of network definition should attempt to re-use as much of the existing elements/attributes as possible, both to make it easier to extend these applications, as well as to make the status displays of un-updated applications make as much sense as possible.
Dan's suggestion (which I obviously endorse :-) is that the new types of network should be specified by extending the choices for <forward mode='....'>.
He also suggested adding a new "layer='network|link'" attribute to <forward>. I'm not convinced that item is necessary (it seems redundant), but am including it here for sake of discussion.
The current modes are:
<forward layer='network' mode='route|nat'/>
(in addition to not listing any mode, which equates to "isolated")
Here are suggested new modes:
<forward layer='link' mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>
A description of each:
bridge-brctl - equivalent to "<interface type='bridge'>" in the interface definition. The bridge device to use would be given in the existing <forward dev='xxx'>. (Dan also suggests putting this in <network>'s <bridge name='xxx'/> - opinions?) (Question: better name for this?)
vepa - same as "<interface type='direct'>..." with <source mode='vepa'/>
private - <interface type='direct'> ... <source mode='private'/>
passthrough - <interface type='direct'> ... <source mode='passthrough'/>
bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/> (Question: better name for this?)
Interface Pools ---------------
In many cases, a single host network may have multiple physical network devices associated with it (especially in the case of an SRIOV-capable ethernet card, which will have several "virtual functions" associated with a single physical ethernet connection). The host will at least want to balance the load of multiple guests between these multiple devices, and may even require (in the case of passthrough mode, for example) that only a single guest interface be attached to each host device.
The current specification for <forward> only allows for a single "dev" attribute, though. In order to support multiple device names, we will extend <forward> to allow 0 or more <interface> sub-elements:
<forward mode='vepa' dev='eth10'>
  <interface dev='eth10'/>
  <interface dev='eth11'/>
  <interface dev='eth12'/>
  <interface dev='eth13'/>
</forward>
Note that, as a convenience, the first of these elements will always be a duplicate of the "dev" attribute in <forward> itself. (Is this necessary/desirable?)
In the case of mode='passthrough', only one guest interface can be connected to a device at a time. libvirt will keep track of which devices are in use, and attempt to assign a free device; failure to assign a device will result in a failure of the domain to start/migrate. For the other direct modes, libvirt will simply keep track of the number of guest interfaces currently using each device, and attempt to keep them balanced.
(Open question: where will we keep track of this allocation/assignment?)
Portgroups -----------
A <portgroup> (sub-element of <network>) is just a way of easily putting connections to the network into different classes, with each class having a different level/type of service. Each <network> can have multiple <portgroup> elements, and each <portgroup> has a name, as well as various attributes associated with it. The first thing we will use portgroups for is as an alternate place to specify <virtualport> parameters:
<portgroup name='engineering'>
  <virtualport type="802.1Qbg">
    <parameters managerid="11" typeid="1193047" typeidversion="2"/>
  </virtualport>
</portgroup>
Anything that is valid in an interface's <virtualport> is also valid here.
The next thing to specify in a portgroup will be bandwidth limiting / QoS configuration. Since I don't know exactly what's needed for that, I won't specify it here.
If anything is specified both directly under <network> and in a <portgroup>, the value in portgroup will take precedence. (Again - what will the precedence of items specified in the <interface> be?)
EXAMPLES --------
Examples of 'red-network' for different types of connections (all of these would work with minor variations of the interface XML given above; eg, the 'vepa' version would require a <virtualport> in the interface that specified an instanceid, and if the <interface> specified a portgroup, it would need to also be in the <network> definition (even if it was empty aside from the name)).
<!-- Existing usage - a libvirt virtual network -->
<network>
  <name>red-network</name>
  <bridge name='virbr0'/>
  <forward layer='network' mode='route'/>
  ...
</network>
<!-- The simplest - an existing host bridge -->
<network>
  <name>red-network</name>
  <forward mode='bridge-brctl' dev='br0'/>
</network>
<!-- A macvtap connection to a vepa bridge -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='vepa' dev='eth10'/>
  <virtualport type='802.1Qbg'>
    <parameters managerid='11' typeid='1193047' typeidversion='2'/>
  </virtualport>
  <!-- NB: if <interface> doesn't specify portgroup, -->
  <!-- 'accounting' is assumed -->
  <portgroup name='accounting'>
    <virtualport>
      <parameters typeid='22'/>
    </virtualport>
  </portgroup>
  <portgroup name='engineering'>
    <virtualport>
      <parameters typeid='33'/>
    </virtualport>
  </portgroup>
</network>
<!-- A macvtap passthrough connection (one guest interface per dev) -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='passthrough' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
    <interface dev='eth14'/>
    <interface dev='eth15'/>
    <interface dev='eth16'/>
    <interface dev='eth17'/>
  </forward>
</network>
=============
Open Questions:
* Is there a good reason to include the "layer='network|link'" attribute in forward? (maybe just because it's useful info for a management application that doesn't know the details of the modes?) Or is it redundant?
* What should be the policy when a virtualport parameter is specified in both the <interface> and the <network>/<portgroup>? Should one take precedence? Or should it be considered an error?
See answer above.
* Is it okay for the domain's own definition to specify what portgroup it will be in? Or are there cases where we want to allow someone to modify their domain XML, but force them into a particular portgroup beyond their control?
I think that if you give the ability to define portgroups, then it'll be nice to give the guest control over that. But I think that RHEV guests will use the typeid or profile name, and not specify the portgroup.
* Is it really necessary/desirable for the first ethernet device in a pool to be duplicated in the <forward dev='xxx'...> attribute? Or can that attribute be omitted when there is a pool of devices?
* Where will we keep track of the count of guest interfaces connected to each host interface device, and where will we keep track of which device is being used by a particular guest interface? In the network/domain XML?
I think you should keep it in the host network XML.
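For example, the runtime state could be reflected as an attribute on each pool member in the live network XML (the connections attribute here is purely hypothetical):

  <forward mode='passthrough'>
    <interface dev='eth11' connections='1'/>  <!-- in use by one guest -->
    <interface dev='eth12'/>                  <!-- free -->
  </forward>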
* Does anyone have better names for "brctl-bridge" and "macvtap-bridge"?

On Sun, 2011-06-12 at 20:29 -0400, Laine Stump wrote: ...
II. Changes to <network> definition ===================================
...
He also suggested adding a new "layer='network|link'" attribute to <forward>. I'm not convinced that item is necessary (it seems redundant), but am including it here for sake of discussion.
The current modes are:
<forward layer='network' mode='route|nat'/>
(in addition to not listing any mode, which equates to "isolated")
Here are suggested new modes:
<forward layer='link' mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>
On the "layer='network|link'" question, would "layer='IP|MAC'" not be clearer? Regarding the mode attribute: "mode='bridge|vepa|private|passthrough'" seems sufficient to me, bridge-brctl or bridge-macvtap can be concluded from the "dev" attribute, right? ... -- Best regards, Gerhard Stenzel, ----------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Sun, 2011-06-12 at 20:29 -0400, Laine Stump wrote:
<!-- A macvtap passthrough connection (one guest interface per dev) -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='passthrough' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
    <interface dev='eth14'/>
    <interface dev='eth15'/>
    <interface dev='eth16'/>
    <interface dev='eth17'/>
  </forward>
</network>
If this example describes a scenario with a SR-IOV card, where eth10 is the physical function and eth11-eth17 are the virtual functions and libvirt can attach a VM to any of the VFs, then I would not list eth10 in the interface pool for passthrough devices.

--
Best regards,
Gerhard Stenzel

On Thu, Jun 16, 2011 at 04:40:19PM +0200, Gerhard Stenzel wrote:
On Sun, 2011-06-12 at 20:29 -0400, Laine Stump wrote:
<!-- A macvtap passthrough connection (one guest interface per dev) -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='passthrough' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
    <interface dev='eth14'/>
    <interface dev='eth15'/>
    <interface dev='eth16'/>
    <interface dev='eth17'/>
  </forward>
</network>
If this example describes a scenario with a SR-IOV card, where eth10 is the physical function and eth11-eth17 are the virtual functions and libvirt can attach a VM to any of the VFs, then I would not list eth10 in the interface pool for passthrough devices.
All interfaces listed here should be considered equal for attaching VMs to. I don't think the network code has to even care about whether a NIC in the XML is a virtual or a physical function. The application will discover NICs and whether they are virtual/physical functions via the node device APIs in libvirt. It will then decide which of the NICs to use when creating the network XML.

Daniel
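(For reference, the discovery step described here can be done with the existing node device APIs; a minimal sketch using the Python bindings - telling PFs from VFs based on the returned XML is left to the application:)

  import libvirt

  conn = libvirt.open('qemu:///system')
  # Enumerate host NICs via the node device APIs; the application can
  # inspect each device's XML to decide which devs go into the pool.
  for name in conn.listDevices('net', 0):
      dev = conn.nodeDeviceLookupByName(name)
      print(dev.XMLDesc(0))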

On 06/12/2011 08:29 PM, Laine Stump wrote:
This is a followup to https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html (and an even earlier draft) which I alluded to here:
https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html
Network device abstraction aka virtual switch - V3 ==================================================
[...]
The core goal of this proposal, though, is to replace type=bridge and type=direct from the domain interface XML with new types of <network> definitions so that the domain can just give "type='network'" and have all the necessary details filled in at runtime. This basically means we're adding several bridging modes (the submodes of "direct" have been flattened out here):
- Bridged network, eth + bridge + tap - Bridged network, eth + macvtap + vepa - Bridged network, eth + macvtap + private - Bridged network, eth + macvtap + passthrough - Bridged network, eth + macvtap + bridge
Another "future expansion" could be to add:
- Bridged network, with VPN
This case sounds to me like the first one, with (for example) OpenVPN's tap interface also added to the bridge.
Likewise, support for other technologies, such as openvswitch and VDE would each be another entry on this list.
(Dan also listed each of the above "+sriov" separately, but that ends up being handled in an orthogonal manner (by just specifying a pool of interfaces for a single network), so I'm only giving the abbreviated list)
I. Changes to domain <interface> element ========================================
[...]
<virtualport> element of <interface> ------------------------------------
Since many of the attributes/sub-elements of <virtualport> (used by some modes of "direct" interface connections) are identical for all interfaces connecting to any given switch, most of the information in <virtualport> will be optional in the domain's interface definition - it can be filled in from a similar <virtualport> element that will be added to the <network> definition.
Some parameters in <virtualport> ("instanceid", for example) must be unique for every interface, though, so those will still be specified in the <interface> XML. The two <virtualport> elements will be OR'ed at runtime to arrive at the actual set of parameters that are used.
(Open Question: What should be the policy when a parameter is specified in both places? Should one take precedence? Or should it be considered an error?)
I think the one in the domain XML should take precedence assuming the user wants to make some parameter different for one particular interface.
portgroup attribute of <source> -------------------------------
The <source> element of an interface definition will be able to optionally specify a "portgroup" attribute. If portgroup is *NOT* given, the default (first) portgroup of the network will be used (if any are defined). If portgroup *IS* specified, the source network must have a portgroup by that name (or the domain startup/migration will fail), and the attributes of that portgroup will be used for the connection. Here is an example <interface> definition that has both a reduced <virtualport> element, as well as a portgroup attribute:
<interface type='network'>
  <source network='red-network' portgroup='engineering'/>
  <virtualport type="802.1Qbg">
    <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
  </virtualport>
  <mac address='de:ad:be:ef:ca:fe'/>
</interface>
(The specifics of what can be in a portgroup are given below)
II. Changes to <network> definition ===================================
[...]
A description of each:
bridge-brctl - equivalent to "<interface type='bridge'>" in the interface definition. The bridge device to use would be given in the existing <forward dev='xxx'>. (Dan also suggests putting this in <network>'s <bridge name='xxx'/> - opinions?) (Question: better name for this?)
Just 'bridge'?
vepa - same as "<interface type='direct'>..." with <source mode='vepa'/>
private - <interface type='direct'> ... <source mode='private'/>
passthrough - <interface type='direct'> ... <source mode='passthrough'/>
bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/> (Question: better name for this?)
Interface Pools ---------------
In many cases, a single host network may have multiple physical network devices associated with it (especially in the case of an SRIOV-capable ethernet card, which will have several "virtual functions" associated with a single physical ethernet connection). The host will at least want to balance the load of multiple guests between these multiple devices, and may even require (in the case of passthrough mode, for example) that only a single guest interface be attached to each host device.
The current specification for <forward> only allows for a single "dev" attribute, though. In order to support multiple device names, we will extend <forward> to allow 0 or more <interface> sub-elements:
<forward mode='vepa' dev='eth10'>
  <interface dev='eth10'/>
  <interface dev='eth11'/>
  <interface dev='eth12'/>
  <interface dev='eth13'/>
</forward>
So this becomes a pool now, where libvirt keeps track of which of these interfaces are already in use.

Note that, as a convenience, the first of these elements will always be a duplicate of the "dev" attribute in <forward> itself. (Is this necessary/desirable?)

It feels like this would require special handling in the code. If there was no dev in the forward node, then that would require one to look into the pool right away. So maybe the dev attribute in the forward node would just be ignored if there is a pool of interfaces.
In the case of mode='passthrough', only one guest interface can be connected to a device at a time. libvirt will keep track of which devices are in use, and attempt to assign a free device; failure to assign a device will result in a failure of the domain to start/migrate. For the other direct modes, libvirt will simply keep track of the number of guest interfaces currently using each device, and attempt to keep them balanced.
(Open question: where will we keep track of this allocation/assignment?)
Portgroups -----------
A <portgroup> (sub-element of <network>) is just a way of easily putting connections to the network into different classes, with each class having a different level/type of service. Each <network> can have multiple <portgroup> elements, and each <portgroup> has a name, as well as various attributes associated with it. The first thing we will use portgroups for is as an alternate place to specify <virtualport> parameters:
<portgroup name='engineering'>
  <virtualport type="802.1Qbg">
    <parameters managerid="11" typeid="1193047" typeidversion="2"/>
  </virtualport>
</portgroup>
Anything that is valid in an interface's <virtualport> is also valid here.
The next thing to specify in a portgroup will be bandwidth limiting / QoS configuration. Since I don't know exactly what's needed for that, I won't specify it here.
If anything is specified both directly under <network> and in a <portgroup>, the value in portgroup will take precedence. (Again - what will the precedence of items specified in the <interface> be?)
EXAMPLES --------
[...]
=============
Open Questions:
[...]
* Where will we keep track of the count of guest interfaces connected to each host interface device, and where will we keep track of which device is being used by a particular guest interface? In the network/domain XML?

As a user/administrator I may be interested to see it in both places, network and domain XML. At least that way I wouldn't have to dig too much...
I think this is necessary work, and it feels like a lot of new complexity will need to be added...

Regards,
Stefan

Here are my previous comments to the earlier draft, with subject "migration of vnlink VMs": http://www.redhat.com/archives/libvir-list/2011-May/msg00160.html

Comments for this new draft inline ...
-----Original Message-----
From: libvir-list-bounces@redhat.com [mailto:libvir-list-bounces@redhat.com] On Behalf Of Laine Stump
Sent: Sunday, June 12, 2011 5:29 PM
To: Libvirt
Subject: [libvirt] Network device abstraction aka virtual switch - V3
This is a followup to https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html (and an even earlier draft) which I alluded to here:
https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html
Network device abstraction aka virtual switch - V3 ==================================================
The <interface> element of a guest's domain config in libvirt has a <source> element that describes what resources on a host will be used to connect the guest's network interface to the rest of the world. This is very flexible, allowing several different types of connection (virtual network, host bridge, direct macvtap connection to physical interface, qemu usermode, user-defined via an external script), but currently has the problem that unnecessary details of the host resources are embedded into the guest's config; if the guest is migrated to a different host, and that host has a different hardware or network config (or possibly the same hardware, but that hardware is currently in use by a different guest), the migration will fail.
I am proposing a change to libvirt's network XML that will allow us to (optionally - old configs will remain valid) remove the host details from the guest's domain XML (which can move around from host to host) and place them in the network XML (which remains with a single host); the domain XML will then use existing config elements to associate each guest interface with a "network".
The motivating use case for this change is the "direct" connection type (which uses macvtap for vepa and vnlink connections directly between a guest and a physical interface, rather than through a bridge), but it is applicable for all types of connection. (Another hopeful side effect of this change will be to make libvirt's network connection model easier to realize on non-Linux hypervisors (eg, VMWare ESX) and for other network technologies, such as openvswitch, VDE, and various VPN implementations).
Background ==========
(parts lifted from Dan Berrange's last mail on this subject)
Currently <network> supports 3 connectivity modes
- Non-routed network, separate subnet (no <forward> element present) - Routed network, separate subnet with NAT (<forward mode='nat'/>) - Routed network, separate subnet (<forward mode='route'/>)
Each of these is implemented in the existing network driver by creating a bridge device using brctl, and connecting the guest network interfaces via tap devices (a detail which, now that I've stated it, you should promptly forget!). All traffic between that bridge and the outside network is done via the host's IP routing stack (ie, there is no physical device directly connected to the bridge)
In the future, these two additional routed modes might be useful:
- Routed network, IP subnetting - Routed network, separate subnet with VPN
The core goal of this proposal, though, is to replace type=bridge and type=direct from the domain interface XML with new types of <network> definitions so that the domain can just give "type='network'" and have all the necessary details filled in at runtime. This basically means we're adding several bridging modes (the submodes of "direct" have been flattened out here):
- Bridged network, eth + bridge + tap - Bridged network, eth + macvtap + vepa - Bridged network, eth + macvtap + private - Bridged network, eth + macvtap + passthrough - Bridged network, eth + macvtap + bridge
Another "future expansion" could be to add:
- Bridged network, with VPN
Likewise, support for other technologies, such as openvswitch and VDE would each be another entry on this list.
(Dan also listed each of the above "+sriov" separately, but that ends up being handled in an orthogonal manner (by just specifying a pool of interfaces for a single network), so I'm only giving the abbreviated list)
I. Changes to domain <interface> element ========================================
In many cases, the <interface> element of the domain XML will be identical to what is used now when connecting the interface to a libvirt-style virtual network:
<interface type='network'>
  <source network='red-network'/>
  <mac address='xx:xx:xx:xx:xx:xx'/>
</interface>
Depending on the definition of the network "red-network" on the host the guest was started on / migrated to, this could be either a direct (macvtap) connection using one of the various direct modes (vepa/private/bridge/passthrough), a bridge (again, pointed to by the definition of 'red-network'), or a virtual network (using the current network definition syntax). This way the same guest could be migrated not only between macvtap-enabled hosts, but from there to a host using a bridge, or maybe a host in a remote location that used a virtual network with a secure tunnel to connect back to the rest of the red-network.
This is only possible if you provision for that possibility: the interface config must include all the mandatory config parameters needed by the various remote 'network' definitions you want the domain to be able to migrate to. If you are in 'bridge' mode and you want to migrate to a dst host whose 'network' provides only vepa/vnlink, you need to populate, on the source/origin host, your bridge interface with parameters that are not needed by the local host (but that may be needed in case of a migration to a remote host whose (same name) network only provides vepa/vnlink). All this to say that when you configure the interface/portgroup, you need to do it taking into account the hosts where you want the VM to be able to migrate to. I am OK with this, but I do not particularly like the idea of a semi-flat list of parameters (grouped by portgroup), and I find it cleaner to group them by network type, with or without a priority order (see my comment to the previous draft, whose URL is at the top of this email).

What is the expected behavior in the following scenarios?

==> Scenario_1

  (NetworkA=vepa/vnlink)      (NetworkA=vepa)
          HOST1                    HOST2

      VM(vnlink) ----------------> VM(vepa)
      VM(??????) <---------------- VM(vepa)

Where:
- NetworkA is defined in both hosts host1/host2
- The definition of NetworkA in Host1 is a superset of NetworkA in Host2.
- When VM migrates from Host1 to Host2 it changes, for example, from vnlink to vepa, because vnlink is not part of NetworkA in Host2.

[Q] When the VM migrates back from Host2 to Host1, is the VM going to move back to vnlink?

==> Scenario_2

  (NetworkA=vepa/vnlink)   (NetworkA=bridge)   (NetworkA=bridge/vepa/vnlink)
          HOST1                  HOST2                   HOST3

      VM(vnlink) -------------> VM(bridge)
                                VM(bridge) ------------> VM(???)

Where:
- NetworkA is defined in all hosts host1/host2/host3
- The definition of NetworkA is different in the three hosts
- When VM migrates from Host1 to Host2 it changes from vnlink to bridge (vnlink is not available on host2)

[Q] When the VM migrates from Host2 to Host3, is the VM going to move back to 'vnlink'? Is it going to try to keep 'bridge'?

The question really is: should the algorithm

(a) try to minimize the number of changes (of connectivity type), or
(b) try to match as much as possible the original connectivity type (ie, in the above examples, whatever was in use on HOST1 before the 1st migration)?

An algorithm which uses a sort of priority-based selection/handshake, as the one I have briefly mentioned in my previous post, would:

- be deterministic (ie, the connectivity selected at migration number N does not depend on the connectivity types assigned to the VM at the previous N-1 migrations; this is not true, for example, in (a) above).
- provide an easier-to-read configuration (ie, no flat lists of parameters). This second argument is arguable because the current definition of 'portgroup' is still work in progress, but my feeling is that it may become something like a flat (hard to maintain) list of parameters.

As I stated already, I am not suggesting the use of that priority-based approach, but I wanted to mention it because it is not obvious to me why that approach was not considered/discussed at all.
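As a rough illustration of that priority-based idea (purely hypothetical pseudocode, not anything existing in libvirt): the interface config would carry an ordered list of acceptable connection types, and each host would pick the first one its same-named network supports:

  def pick_mode(guest_prefs, dst_modes):
      # guest_prefs: connection types in priority order, from the interface config
      # dst_modes: the set of modes offered by the same-named network on the dst host
      for mode in guest_prefs:
          if mode in dst_modes:
              return mode
      raise RuntimeError("destination cannot satisfy any acceptable mode")

  # pick_mode(['vnlink', 'vepa', 'bridge'], {'bridge'}) -> 'bridge', and the
  # result is the same no matter how many migrations preceded this one.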
(Part of the migration process would of course check that the destination host had a network of the proper name with adequate available resources, and fail if it didn't; management software at a level above libvirt would probably filter a list of candidate migration destinations based on available networks and any various details of those networks (eg. it could search for only networks using vepa for the connection), and only attempt migration to one that had the matching network available).
Would management be responsible for the creation and maintenance of the network definitions on the various hosts (ie, so that it would not be necessary to poll for changes in the definitions)? How are (as of now) changes made via virsh made visible/propagated to the management app?
<virtualport> element of <interface> ------------------------------------
Since many of the attributes/sub-elements of <virtualport> (used by some modes of "direct" interface connections) are identical for all interfaces connecting to any given switch, most of the information in <virtualport> will be optional in the domain's interface definition - it can be filled in from a similar <virtualport> element that will be added to the <network> definition.
Some parameters in <virtualport> ("instanceid", for example) must be unique for every interface, though, so those will still be specified in the <interface> XML. The two <virtualport> elements will be OR'ed at runtime to arrive at the actual set of parameters that are used.
(Open Question: What should be the policy when a parameter is specified in both places? Should one take precedence? Or should it be considered an error?)
If you consider that case as an error then you force all "consumers" of the port_group to agree on whether that parameter is shared/common or not. A third option (which I do not like that much) would be to add an "optional" keyword to the instance in the network config to tell whether that config is to be enforced or to be used only when not already provided by the interface (ie as a default).
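As a sketch, that third option could look something like this (the optional='yes' attribute is hypothetical syntax):

  <virtualport type='802.1Qbg'>
    <parameters managerid='11' typeid='1193047' optional='yes'/>
  </virtualport>

ie, the network's typeid would be used only as a default, when the interface does not provide its own.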
portgroup attribute of <source> -------------------------------
The <source> element of an interface definition will be able to optionally specify a "portgroup" attribute. If portgroup is *NOT* given, the default (first) portgroup of the network will be used (if any are defined).
Rather than "the first", I would add the possibility of defining explicitly which one is the default (to make it more deterministic).
If portgroup *IS* specified, the source network must have a portgroup by that name (or the domain startup/migration will fail), and the attributes of that portgroup will be used for the connection. Here is an example <interface> definition that has both a reduced <virtualport> element, as well as a portgroup attribute:
<interface type='network'>
  <source network='red-network' portgroup='engineering'/>
  <virtualport type="802.1Qbg">
    <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
  </virtualport>
  <mac address='de:ad:be:ef:ca:fe'/>
</interface>
(The specifics of what can be in a portgroup are given below)
II. Changes to <network> definition ===================================
As Dan has pointed out, any additions to <network> must be designed so that existing management applications (written to understand <network> prior to these new additions) will at least recognize that the XML they've been given is for something new that they don't fully understand. At the same time, the new types of network definition should attempt to re-use as much of the existing elements/attributes as possible, both to make it easier to extend these applications, as well as to make the status displays of un-updated applications make as much sense as possible.
Dan's suggestion (which I obviously endorse :-) is that the new types of network should be specified by extending the choices for <forward mode='....'>.
He also suggested adding a new "layer='network|link'" attribute to <forward>. I'm not convinced that item is necessary (it seems redundant), but am including it here for sake of discussion.
The current modes are:
<forward layer='network' mode='route|nat'/>
What would be the role/goal of this "layer"? This would make sense if most of the network types were able to operate at both layers (and you want to be able to choose one), but this is not the case. On top of that, I can imagine a new network type which identifies a forwarding engine able to operate both at L2 and L3, and which decides at runtime which layer to operate at, based on its own (independent of libvirt) configuration (something like a brouter). The documentation can classify the network types based on the layer they operate at (no need to put this info in the config if it does not do anything).
The current modes are:
<forward layer='network' mode='route|nat'/>
Just for the sake of discussion, would any of the following cases deserve some discussion?

mode=tunnel - ipsec/vpn(/ipip) may be a special case of this. BTW, what is the integration status of the VPN RFC proposal?

mode=application/proxy - This would make it easy to configure a domain such that it can only get connectivity through a proxy (transparently). Part of the config parameters would obviously be the addr/fqdn/port of the server/s (and libvirt would use that config to configure the associated firewall rules). This case can also be seen as a special case of mode=route|nat.
(in addition to not listing any mode, which equates to "isolated")
Here are suggested new modes:
<forward layer='link' mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>
A description of each:
bridge-brctl - equivalent to "<interface type='bridge'>" in the interface definition. The bridge device to use would be given in the existing <forward dev='xxx'>. (Dan also suggests putting this in <network>'s <bridge name='xxx'/> - opinions?) (Question: better name for this?)
vepa - same as "<interface type='direct'>..." with <source mode='vepa'/>
private - <interface type='direct'> ... <source mode='private'/>
passthrough - <interface type='direct'> ... <source mode='passthrough'/>
bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/> (Question: better name for this?)
Interface Pools
---------------
In many cases, a single host network may have multiple physical network devices associated with it (especially in the case of an SRIOV-capable ethernet card, which will have several "virtual functions" associated with a single physical ethernet connection). The host will at least want to balance the load of multiple guests between these multiple devices, and may even require (in the case of passthrough mode, for example) that only a single guest interface be attached to each host device.
Even though vnlink does not use 'passthrough' (it uses 'private' mode), it actually comes with the same requirement: the lower device cannot be shared.
The current specification for <forward> only allows for a single "dev" attribute, though. In order to support multiple device names, we will extend <forward> to allow 0 or more <interface> sub-elements:
<forward mode='vepa' dev='eth10'>
  <interface dev='eth10'/>
  <interface dev='eth11'/>
  <interface dev='eth12'/>
  <interface dev='eth13'/>
</forward>
Note that, as a convenience, the first of these elements will always be a duplicate of the "dev" attribute in <forward> itself. (Is this necessary/desirable?)
I agree with those who commented against the duplicate eth10. On top of the reasons already mentioned (SR/IOV PF/VF, ...), I just find it confusing.
In the case of mode='passthrough', only one guest interface can be connected to a device at a time.
In the case of BH (802.1Qbh) that I mentioned above, the libvirt/BH code does not currently enforce it, but it does have the same requirement.
libvirt will keep track of which devices are in use, and attempt to assign a free device; failure to assign a device will result in a failure of the domain to start/migrate. For the other direct modes, libvirt will simply keep track of the number of guest interfaces currently using each device, and attempt to keep them balanced.
(Open question: where will we keep track of this allocation/assignment?)
If this information does not live in the kernel (ie, as a sort of ref counter or flag in the device), libvirt needs to use some kind of persistent storage to be able to recover properly after a crash, or it needs a reliable way to rebuild such information upon a restart.
Portgroups
----------
A <portgroup> (sub-element of <network>) is just a way of easily putting connections to the network into different classes, with each class having a different level/type of service. Each <network> can have multiple <portgroup> elements, and each <portgroup> has a name, as well as various attributes associated with it. The first thing we will use portgroups for is as an alternate place to specify <virtualport> parameters:
<portgroup name='engineering'>
  <virtualport type="802.1Qbg">
    <parameters managerid="11" typeid="1193047" typeidversion="2"/>
  </virtualport>
</portgroup>
Anything that is valid in an interface's <virtualport> is also valid here.
The next thing to specify in a portgroup will be bandwidth limiting / QoS configuration. Since I don't know exactly what's needed for that, I won't specify it here.
If anything is specified both directly under <network> and in a <portgroup>, the value in portgroup will take precedence. (Again - what will the precedence of items specified in the <interface> be?)
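To make the merging concrete, here is a sketch using the values from the 'vepa' example below (and assuming the open question above is resolved so that the guest <interface> only supplies parameters, like instanceid, that are set nowhere else):

  network-level <virtualport>:      <parameters managerid='11' typeid='1193047' typeidversion='2'/>
  portgroup 'engineering':          <parameters typeid='33'/>
  guest <interface> <virtualport>:  <parameters instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>

Effective parameters used for the connection (portgroup overrides network; the interface adds its unique piece):

  <parameters managerid='11' typeid='33' typeidversion='2'
              instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>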
EXAMPLES
--------
Examples of 'red-network' for different types of connections. All of these would work with minor variations of the interface XML given above; e.g., the 'vepa' version would require a <virtualport> in the interface that specified an instanceid, and if the <interface> specified a portgroup, that portgroup would also need to be present in the <network> definition (even if it was empty aside from its name).
<!-- Existing usage - a libvirt virtual network -->
<network>
  <name>red-network</name>
  <bridge name='virbr0'/>
  <forward layer='network' mode='route'/>
  ...
</network>
<!-- The simplest - an existing host bridge -->
<network>
  <name>red-network</name>
  <forward mode='bridge-brctl' dev='br0'/>
</network>
<!-- A macvtap connection to a vepa bridge -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='vepa' dev='eth10'/>
  <virtualport type='802.1Qbg'>
    <parameters managerid='11' typeid='1193047' typeidversion='2'/>
  </virtualport>
  <!-- NB: if <interface> doesn't specify portgroup, -->
  <!-- 'accounting' is assumed -->
  <portgroup name='accounting'>
    <virtualport>
      <parameters typeid='22'/>
    </virtualport>
  </portgroup>
  <portgroup name='engineering'>
    <virtualport>
      <parameters typeid='33'/>
    </virtualport>
  </portgroup>
</network>
<!-- A macvtap passthrough connection (one guest interface per dev) -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='passthrough' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
    <interface dev='eth14'/>
    <interface dev='eth15'/>
    <interface dev='eth16'/>
    <interface dev='eth17'/>
  </forward>
</network>
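(The one proposed mode missing from these examples is the macvtap 'bridge' mode; purely for completeness, here is a sketch of what it would look like under the tentative 'bridge-macvtap' name - see the naming question below:)

<!-- A macvtap connection in bridge mode -->
<network>
  <name>red-network</name>
  <forward layer='link' mode='bridge-macvtap' dev='eth10'/>
</network>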
=============
Open Questions:
* Is there a good reason to include the "layer='network|link'" attribute in forward? (maybe just because it's useful info for a management application that doesn't know the details of the modes?) Or is it redundant?
I find it redundant.
* What should be the policy when a virtualport parameter is specified in both the <interface> and the <network>/<portgroup>? Should one take precedence? Or should it be considered an error?
"precedence" would give more flexibility to the configuration (ie, you can Have a default value to use when a value is not explicitly configured).
* Is it okay for the domain's own definition to specify what portgroup it will be in?
If you choose the portgroup ... you are also choosing the host-specific parameters ... If you do so, then you relax part of the original goal (moving host-specific parameters from the interface to the network).
Or are there cases where we want to allow someone to modify their domain XML, but force them into a particular portgroup beyond their control?
* Is it really necessary/desirable for the first ethernet device in a pool to be duplicated in the <forward dev='xxx'...> attribute? Or can that attribute be omitted when there is a pool of devices?
I find it confusing.
* Where will we keep track of the count of guest interfaces connected to each host interface device, and where will we keep track of which device is being used by a particular guest interface? In the network/domain XML?
Should libvirt be able to determine/detect whether a device is already in use or not (for example, to determine whether a macvtap/passthrough config can be accepted or not)? For example, what if, according to libvirt, a device appears to be available, but that device is actually not available because it is already in use by another application/VM (for example a VM started independently of libvirt)? I guess we do not need to focus too much on these corner cases, but I believe it would be nice to be able to detect these conditions (if easy to implement). /Chris
* Does anyone have better names for "bridge-brctl" and "bridge-macvtap"?

-----Original Message-----
From: sendmail [mailto:justsendmailnothingelse@gmail.com] On Behalf Of Laine Stump
Sent: Saturday, July 02, 2011 10:36 PM
To: Christian Benvenuti (benve)
Cc: Libvirt
Subject: Re: [libvirt] Network device abstraction aka virtual switch - V3
On 06/16/2011 09:56 PM, Christian Benvenuti (benve) wrote:
Laine Stump wrote:
Interface Pools
---------------
In many cases, a single host network may have multiple physical network devices associated with it (especially in the case of an SRIOV-capable ethernet card, which will have several "virtual functions" associated with a single physical ethernet connection). The host will at least want to balance the load of multiple guests between these multiple devices, and may even require (in the case of passthrough mode, for example) that only a single guest interface be attached to each host device.

Even though vnlink does not use 'passthrough' (it uses 'private' mode), it actually comes with the same requirement: the lower device cannot be shared.

In the case of mode='passthrough', only one guest interface can be connected to a device at a time.

In the case of BH that I mentioned above, the libvirt/BH code does not currently enforce it, but it does have the same requirement.
Christian,
Can this (the fact that the desired mode of operation will not allow for sharing of interfaces) be determined absolutely from the existing config information? In other words, is it safe to say that any time you have the combination of "direct"/"private"/"802.1Qbh" that interfaces can't be shared, but that for direct/private/<not-802.1Qbh> they *can* be shared?
I'm currently writing the code that picks an interface to use from the pool; the information I have is roughly equivalent to what gets configured for current libvirt domain interfaces:
| <interface type='direct'>
|   <source dev='XYZ' mode='private'/>
|   <virtualport type='802.1Qbh'>
|     <parameters
|   </virtualport>
| </interface>
I want to avoid adding an explicit config item to the XML to allow/prevent interface sharing if at all possible (I already prevent sharing for passthrough mode; if adding a check for private mode with virtualport type='802.1Qbh' would be enough, then I'm happy)
Yes, I think that would be enough. BH does not use passthrough mode because it does not need/want to put the lower dev into promiscuous mode. Adding a config item would be more flexible, but as of now only BH would use it (there are no other cases I can think of), therefore it does not seem necessary. /Chris
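To restate the agreed rule as config (illustrative only - the <parameters> are elided, and 802.1Qbg merely stands in for "any non-802.1Qbh virtualport"):

  <!-- lower device must NOT be shared: direct + private + 802.1Qbh -->
  <interface type='direct'>
    <source dev='eth10' mode='private'/>
    <virtualport type='802.1Qbh'>
      ...
    </virtualport>
  </interface>

  <!-- lower device CAN be shared: direct + private + non-802.1Qbh -->
  <interface type='direct'>
    <source dev='eth10' mode='private'/>
    <virtualport type='802.1Qbg'>
      ...
    </virtualport>
  </interface>

(mode='passthrough' remains exclusive regardless of virtualport type.)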

On 07/03/2011 03:42 PM, Christian Benvenuti (benve) wrote:
Yes, I think that would be enough. BH does not use passthrough mode because it does not need/want to put the lower dev into promiscuous mode. Adding a config item would be more flexible, but as of now only BH would use it (there are no other cases I can think of), therefore it does not seem necessary.
Great! It's always easier to add something later if we determine it's necessary, rather than putting it in now and learning later that it's redundant or (even worse) incorrect. Thanks for the firsthand info.

On Sun, Jun 12, 2011 at 08:29:08PM -0400, Laine Stump wrote:
Depending on the definition of the network "red-network" on the host the guest was started on / migrated to, this could be either a direct (macvtap) connection using one of the various direct modes (vepa/private/bridge/passthrough), a bridge (again, pointed to by the definition of 'red-network'), or a virtual network (using the current network definition syntax). This way the same guest could be migrated not only between macvtap-enabled hosts, but from there to a host using a bridge, or maybe a host in a remote location that used a virtual network with a secure tunnel to connect back to the rest of the red-network.
When I originally thought of the goal of making the guest networking XML "host independent", I was mainly thinking in terms of avoidance of physical network device names. Obviously this design could also enable us to change the type of connection, bridge/vepa/etc, but this feels like a secondary goal, because I believe this would result in interruption of the guest network connections, so migration would not be seamless to the guest.
Some parameters in <virtualport> ("instanceid", for example) must be unique for every interface, though, so those will still be specified in the <interface> XML. The two <virtualport> elements will be OR'ed at runtime to arrive at the actual set of parameters that are used.
(Open Question: What should be the policy when a parameter is specified in both places? Should one take precedence? Or should it be considered an error?)
The guest <interface> XML should in general take precedence, since that is considered a specialization. I believe certain of the VEPA parameters shouldn't be overridable per guest though, since they are really host-level configuration.
Here are suggested new modes:
<forward layer='link' mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>
A description of each:
bridge-brctl - equivalent to "<interface type='bridge'>" in the interface definition. The bridge device to use would be given in the existing <forward dev='xxx'>. (Dan also suggests putting this in <network>'s <bridge name='xxx'/> - opinions?) (Question: better name for this?)
vepa - same as "<interface type='direct'>..." with <source mode='vepa'/>
private - <interface type='direct'> ... <source mode='private'/>
passthrough - <interface type='direct'> ... <source mode='passthrough'/>
bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/> (Question: better name for this?)
I like the suggestion elsewhere in this thread of detecting whether to do macvtap vs brctl based on the interface declared, so we could then just use 'bridge' as the name. eg,

Do macvtap mode:

  <forward mode='bridge' dev='eth0'/>

Or do brctl mode:

  <forward mode='bridge'/>
  <bridge dev='br0'/>

(Remember that the '<bridge>' element already exists in our schema, so we might as well use it.)
The current specification for <forward> only allows for a single "dev" attribute, though. In order to support multiple device names, we will extend <forward> to allow 0 or more <interface> sub-elements:
<forward mode='vepa' dev='eth10'>
  <interface dev='eth10'/>
  <interface dev='eth11'/>
  <interface dev='eth12'/>
  <interface dev='eth13'/>
</forward>
Note that, as a convenience, the first of these elements will always be a duplicate of the "dev" attribute in <forward> itself. (Is this necessary/desirable?)
Yes, it is a key backwards/forwards compat issue. Currently applications will be just doing an XPath of "/network/forward/@dev". New applications will want to ignore '@dev' completely and just do "/network/forward/interface/@dev". If we didn't duplicate the <forward @dev/> attribute as the first child <interface>, then new applications would have to run 2 XPath queries to get the information out. We might also want to add further attributes to <interface> in the future, so we want all interfaces listed there regardless.
(Open question: where will we keep track of this allocation/assignment?)
That's a job for the network driver. As we do when running QEMU guests, the network driver would want to keep a persistent state file in /var/lib/libvirt/networks to store any data like this which needs to be preserved across libvirtd restarts/crashes.
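For illustration only - no such file or schema exists yet, and every name below is invented - the minimal data such a state file would need to hold is just a per-device count of attached guest interfaces:

  <!-- hypothetical: /var/lib/libvirt/networks/red-network-ports.xml -->
  <networkstate name='red-network'>
    <!-- number of guest interfaces currently using each pool device -->
    <interface dev='eth10' connections='2'/>
    <interface dev='eth11' connections='1'/>
    <interface dev='eth12' connections='0'/>
  </networkstate>

Recovery after a libvirtd crash would then be a matter of re-reading this file and reconciling it against the actually running domains.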
If anything is specified both directly under <network> and in a <portgroup>, the value in portgroup will take precedence. (Again - what will the precedence of items specified in the <interface> be?)
Precedence should go from most specific to least specific, ie:

1. Guest <interface>
2. Network <portgroup>
3. Network top level
Open Questions:
* Is there a good reason to include the "layer='network|link'" attribute in forward? (maybe just because it's useful info for a management application that doesn't know the details of the modes?) Or is it redundant?
I think it is likely redundant. We can leave it out for now and, if we feel a need, can add it back in the future. It was to be primarily an output-only attribute, though I did perhaps think we could let you do <forward layer='network'/> and use that to auto-pick a suitable mode, but let's not bother.
* Is it okay for the domain's own definition to specify what portgroup it will be in? Or are there cases where we want to allow someone to modify their domain XML, but force them into a particular portgroup beyond their control?
Yes, the domain should be able to specify the portgroup. Access control over portgroup usage is a matter for a more general ACL system in libvirt drivers.
* Is it really necessary/desirable for the first ethernet device in a pool to be duplicated in the <forward dev='xxx'...> attribute? Or can that attribute be omitted when there is a pool of devices?
Yes, it is key to getting good backwards/forwards compatibility and simplifying app usage. It should of course be an error to specify both when giving XML to libvirt, if they are conflicting in what they say. Typically an app should only specify one of them for input though.
* Where will we keep track of the count of guest interfaces connected to each host interface device, and where will we keep track of which device is being used by a particular guest interface? In the network/domain XML?
In the network driver, I reckon.
* Does anyone have better names for "bridge-brctl" and "bridge-macvtap"?
'brctl' and 'macvtap' are both impl details, so we don't really want to expose them. Just have one called 'bridge', which is a reflection of the connection type.

Daniel