Here are my previous comments to the earlier draft with
subject "migration of vnlink VMs":
http://www.redhat.com/archives/libvir-list/2011-May/msg00160.html
Comments for this new draft inline ...
-----Original Message-----
From: libvir-list-bounces(a)redhat.com [mailto:libvir-list-bounces(a)redhat.com] On Behalf Of Laine Stump
Sent: Sunday, June 12, 2011 5:29 PM
To: Libvirt
Subject: [libvirt] Network device abstraction aka virtual switch - V3
This is a followup to
https://www.redhat.com/archives/libvir-list/2011-April/msg00591.html
(and an even earlier draft) which I alluded to here:
https://www.redhat.com/archives/libvir-list/2011-June/msg00383.html
Network device abstraction aka virtual switch - V3
==================================================
The <interface> element of a guest's domain config in libvirt has a
<source> element that describes what resources on a host will be used
to connect the guest's network interface to the rest of the
world. This is very flexible, allowing several different types of
connection (virtual network, host bridge, direct macvtap connection to
physical interface, qemu usermode, user-defined via an external
script), but currently has the problem that unnecessary details of the
host resources are embedded into the guest's config; if the guest is
migrated to a different host, and that host has a different hardware
or network config (or possibly the same hardware, but that hardware is
currently in use by a different guest), the migration will fail.
I am proposing a change to libvirt's network XML that will allow us to
(optionally - old configs will remain valid) remove the host details
from the guest's domain XML (which can move around from host to host)
and place them in the network XML (which remains with a single host);
the domain XML will then use existing config elements to associate
each guest interface with a "network".
The motivating use case for this change is the "direct" connection
type (which uses macvtap for vepa and vnlink connections directly
between a guest and a physical interface, rather than through a
bridge), but it is applicable for all types of connection. (Another
hopeful side effect of this change will be to make libvirt's network
connection model easier to realize on non-Linux hypervisors (eg,
VMWare ESX) and for other network technologies, such as openvswitch,
VDE, and various VPN implementations).
Background
==========
(parts lifted from Dan Berrange's last mail on this subject)
Currently <network> supports 3 connectivity modes
- Non-routed network, separate subnet (no <forward> element present)
- Routed network, separate subnet with NAT (<forward mode='nat'/>)
- Routed network, separate subnet (<forward mode='route'/>)
Each of these is implemented in the existing network driver by
creating a bridge device using brctl, and connecting the guest network
interfaces via tap devices (a detail which, now that I've stated it,
you should promptly forget!). All traffic between that bridge and the
outside network is done via the host's IP routing stack (ie, there is
no physical device directly connected to the bridge)
In the future, these two additional routed modes might be useful:
- Routed network, IP subnetting
- Routed network, separate subnet with VPN
The core goal of this proposal, though, is to replace type=bridge and
type=direct from the domain interface XML with new types of <network>
definitions so that the domain can just give "type='network'" and have
all the necessary details filled in at runtime. This basically means
we're adding several bridging modes (the submodes of "direct" have
been flattened out here):
- Bridged network, eth + bridge + tap
- Bridged network, eth + macvtap + vepa
- Bridged network, eth + macvtap + private
- Bridged network, eth + macvtap + passthrough
- Bridged network, eth + macvtap + bridge
Another "future expansion" could be to add:
- Bridged network, with VPN
Likewise, support for other technologies, such as openvswitch and VDE
would each be another entry on this list.
(Dan also listed each of the above "+sriov" separately, but that ends
up being handled in an orthogonal manner (by just specifying a pool of
interfaces for a single network), so I'm only giving the abbreviated
list)
I. Changes to domain <interface> element
========================================
In many cases, the <interface> element of the domain XML will be
identical to what is used now when connecting the interface to a
libvirt-style virtual network:
  <interface type='network'>
    <source network='red-network'/>
    <mac address='xx:xx:xx:xx:xx:xx'/>
  </interface>
Depending on the definition of the network "red-network" on the host
the guest was started on / migrated to, this could be either a direct
(macvtap) connection using one of the various direct modes
(vepa/private/bridge/passthrough), a bridge (again, pointed to by the
definition of 'red-network'), or a virtual network (using the current
network definition syntax). This way the same guest could be migrated
not only between macvtap-enabled hosts, but from there to a host using
a bridge, or maybe a host in a remote location that used a virtual
network with a secure tunnel to connect back to the rest of the
red-network.
This is only possible if you provision for that possibility:
the interface config must include all the mandatory config
parameters needed by the various remote 'network' definitions
you want the domain to be able to migrate to.
If you are in 'bridge' mode and you want to migrate to a destination
host whose 'network' provides only vepa/vnlink, you need to populate,
on the source/origin host, your bridge interface with parameters that
are not needed by the local host (but that may be needed in case of a
migration to a remote host whose same-named network only provides
vepa/vnlink); see the sketch below.
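A minimal sketch of what that over-provisioning looks like (reusing
the instanceid and MAC values from the examples elsewhere in this
mail): the interface carries a <virtualport> that the local
bridge-based network ignores, but that a vepa (802.1Qbg) destination
would require:

  <interface type='network'>
    <source network='red-network'/>
    <!-- unused while 'red-network' is a plain bridge on this host, -->
    <!-- but mandatory if the destination's 'red-network' is vepa -->
    <virtualport type='802.1Qbg'>
      <parameters instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>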
All this to say that when you configure the interface/port_group you
need to do it taking into account the hosts you want the VM to be able
to migrate to. I am OK with this, but I do not particularly like the
idea of a semi-flat list of parameters (grouped by port_group), and I
find it cleaner to group them by network type, with or without a
priority order (see my comment to the previous draft, whose URL is at
the top of this email).
What is the expected behavior in the following scenarios?
==> Scenario_1

   (NetworkA=vepa/vnlink)              (NetworkA=vepa)
          HOST1                             HOST2
       VM(vnlink)  ---------------->     VM(vepa)
       VM(??????)  <----------------     VM(vepa)

Where:
- NetworkA is defined on both hosts Host1/Host2
- The definition of NetworkA on Host1 is a superset of
  NetworkA on Host2.
- When the VM migrates from Host1 to Host2 it changes, for
  example, from vnlink to vepa because vnlink is not
  part of NetworkA on Host2.
[Q] When the VM migrates back from Host2 to Host1, is the
    VM going to move back to vnlink?
==> Scenario_2

   (NetworkA=vepa/vnlink)   (NetworkA=bridge)   (NetworkA=bridge/vepa/vnlink)
          HOST1                   HOST2                    HOST3
       VM(vnlink) ----------->  VM(bridge)
                                VM(bridge)  ----------->  VM(???)

Where:
- NetworkA is defined on all hosts Host1/Host2/Host3
- The definition of NetworkA is different on the three hosts
- When the VM migrates from Host1 to Host2 it changes from
  vnlink to bridge (vnlink is not available on Host2)
[Q] When the VM migrates from Host2 to Host3, is the
    VM going to move back to 'vnlink'? Is it going to try
    to keep 'bridge'?
The question really is: should the algorithm
(a) try to minimize the number of changes (of connectivity type), or
(b) try to match as closely as possible the original connectivity
    type (ie, in the above examples, whatever was in use on HOST1
    before the 1st migration)?
An algorithm which uses a sort of priority-based selection/handshake,
like the one I briefly mentioned in my previous post, would:
- be deterministic (ie, the connectivity selected at migration number N
  does not depend on the connectivity types assigned to the VM at the
  previous N-1 migrations; this is not true, for example, in (a) above).
- provide an easier-to-read configuration (ie, no flat lists of
  parameters).
This second argument is debatable because the current definition of
'port_group' is still a work in progress, but my feeling is that it may
become something like a flat (hard to maintain) list of parameters.
As I stated already, I am not suggesting the use of that priority-based
approach, but I wanted to mention it because it is not obvious to me why
that approach was not considered/discussed at all.
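To make the alternative concrete, here is a purely hypothetical sketch
(none of this syntax is part of Laine's proposal; the element and
attribute names are invented for illustration) of grouping the
parameters by network type, with an explicit priority order, inside a
portgroup:

  <portgroup name='engineering'>
    <!-- hypothetical: per-connection-type parameter groups, tried in -->
    <!-- priority order on whichever host the VM lands on -->
    <vepa priority='1'>
      <virtualport type='802.1Qbg'>
        <parameters managerid='11' typeid='1193047' typeidversion='2'/>
      </virtualport>
    </vepa>
    <bridge priority='2' dev='br0'/>
  </portgroup>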
(Part of the migration process would of course check that the
destination host had a network of the proper name with adequate
available resources, and fail if it didn't; management software at a
level above libvirt would probably filter a list of candidate
migration destinations based on available networks and any various
details of those networks (eg. it could search for only networks using
vepa for the connection), and only attempt migration to one that had
the matching network available).
Would management be responsible for the creation and maintenance of
the network definitions on the various hosts (ie, so that it is not
necessary to poll for changes in the definitions)?
How are (as of now) changes made via virsh made visible/propagated to
the management app?
<virtualport> element of <interface>
------------------------------------
Since many of the attributes/sub-elements of <virtualport> (used by
some modes of "direct" interface connections) are identical for all
interfaces connecting to any given switch, most of the information in
<virtualport> will be optional in the domain's interface definition -
it can be filled in from a similar <virtualport> element that will be
added to the <network> definition.
Some parameters in <virtualport> ("instanceid", for example) must be
unique for every interface, though, so those will still be specified
in the <interface> XML. The two <virtualport> elements will be OR'ed
at runtime to arrive at the actual set of parameters that are
used.
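For example, a minimal sketch of how that merge might look (the split
of parameters between the two places follows the 802.1Qbg examples
later in this mail):

  <!-- in the domain's <interface> -->
  <virtualport type='802.1Qbg'>
    <parameters instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
  </virtualport>

  <!-- in the host's <network> (or one of its portgroups) -->
  <virtualport type='802.1Qbg'>
    <parameters managerid='11' typeid='1193047' typeidversion='2'/>
  </virtualport>

  <!-- effective set of parameters used for the connection at runtime -->
  <virtualport type='802.1Qbg'>
    <parameters managerid='11' typeid='1193047' typeidversion='2'
                instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
  </virtualport>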
(Open Question: What should be the policy when a parameter is
specified in both places? Should one take precedence? Or should it be
considered an error?)
If you consider that case an error, then you force all "consumers" of
the port_group to agree on whether that parameter is shared/common or
not.
A third option (which I do not like that much) would be to add an
"optional" keyword to the instance in the network config, to tell
whether that config is to be enforced or to be used only when not
already provided by the interface (ie, as a default).
portgroup attribute of <source>
-------------------------------
The <source> element of an interface definition will be able to
optionally specify a "portgroup" attribute. If portgroup is *NOT*
given, the default (first) portgroup of the network will be used (if
any are defined).
Rather than "the first", I would add the possibility of explicitly
defining which one is the default (to make it more deterministic).
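One hypothetical way to express that (this attribute is not part of
the current proposal) would be a flag on the portgroup itself:

  <portgroup name='engineering' default='yes'>
    ...
  </portgroup>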
If portgroup *IS* specified, the source network must
have a portgroup by that name (or the domain startup/migration will
fail), and the attributes of that portgroup will be used for the
connection. Here is an example <interface> definition that has both a
reduced <virtualport> element, as well as a portgroup attribute:
  <interface type='network'>
    <source network='red-network' portgroup='engineering'/>
    <virtualport type="802.1Qbg">
      <parameters instanceid="09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f"/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>
(The specifics of what can be in a portgroup are given below)
II. Changes to <network> definition
===================================
As Dan has pointed out, any additions to <network> must be designed so
that existing management applications (written to understand <network>
prior to these new additions) will at least recognize that the XML
they've been given is for something new that they don't fully
understand. At the same time, the new types of network definition
should attempt to re-use as much of the existing elements/attributes
as possible, both to make it easier to extend these applications, as
well as to make the status displays of un-updated applications make as
much sense as possible.
Dan's suggestion (which I obviously endorse :-) is that the new types
of network should be specified by extending the choices for <forward
mode='....'>.
He also suggested adding a new "layer='network|link'" attribute to
<forward>. I'm not convinced that item is necessary (it seems
redundant), but am including it here for sake of discussion.
The current modes are:
<forward layer='network' mode='route|nat'/>
What would be the role/goal of this "layer" attribute?
It would make sense if most of the network types were able to operate
at both layers (and you wanted to be able to choose one), but this is
not the case.
On top of that, I can imagine a new network type which identifies a
forwarding engine able to operate at both L2 and L3, and which decides
at runtime which layer to operate at, based on its own configuration
(independent of libvirt), something like a brouter.
The documentation can classify the network types based on the layer
they operate at (no need to put this info in the config if it does not
do anything).
The current modes are:
  <forward layer='network' mode='route|nat'/>
(in addition to not listing any mode, which equates to "isolated")
Just for the sake of discussion, would any of the following cases
deserve some discussion?
mode=tunnel
  ipsec/vpn(/ipip) may be a special case of this.
  BTW, what is the integration status of the VPN RFC proposal?
mode=application/proxy
  This would make it easy to configure a domain such that it can only
  get connectivity through a proxy (transparently). Part of the config
  parameters would obviously be the addr/fqdn/port of the server(s)
  (and libvirt would use that config to configure the associated
  firewall rules). This case can also be seen as a special case of
  mode=route|nat.
Here are suggested new modes:

  <forward layer='link'
           mode='bridge-brctl|vepa|private|passthrough|bridge-macvtap'/>

A description of each:

  bridge-brctl   - equivalent to "<interface type='bridge'>" in the
                   interface definition. The bridge device to use would
                   be given in the existing <forward dev='xxx'>. (Dan
                   also suggests putting this in <network>'s
                   <bridge name='xxx'/> - opinions?)
                   (Question: better name for this?)

  vepa           - same as "<interface type='direct'>..." with
                   <source mode='vepa'/>

  private        - <interface type='direct'> ... <source mode='private'/>

  passthrough    - <interface type='direct'> ... <source mode='passthrough'/>

  bridge-macvtap - <interface type='direct'> ... <source mode='bridge'/>
                   (Question: better name for this?)
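For reference, here is a sketch of the kind of existing domain config
that, for instance, the 'vepa' mode is meant to subsume (current
libvirt syntax for a 'direct' interface; the device name and 802.1Qbg
values are reused from examples elsewhere in this mail):

  <interface type='direct'>
    <source dev='eth10' mode='vepa'/>
    <virtualport type='802.1Qbg'>
      <parameters managerid='11' typeid='1193047' typeidversion='2'
                  instanceid='09b11c53-8b5c-4eeb-8f00-d84eaa0aaa4f'/>
    </virtualport>
    <mac address='de:ad:be:ef:ca:fe'/>
  </interface>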
Interface Pools
---------------
In many cases, a single host network may have multiple physical
network devices associated with it (especially in the case of an
SRIOV-capable ethernet card, which will have several "virtual
functions" associated with a single physical ethernet connection). The
host will at least want to balance the load of multiple guests between
these multiple devices, and may even require (in the case of
passthrough mode, for example) that only a single guest interface be
attached to each host device.
Even though vnlink does not use 'passthrough' (it uses 'private' mode),
it actually comes with the same requirement: the lower device cannot be
shared.
The current specification for <forward> only allows for a single "dev"
attribute, though. In order to support multiple device names, we will
extend <forward> to allow 0 or more <interface> sub-elements:

  <forward mode='vepa' dev='eth10'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
  </forward>
Note that, as a convenience, the first of these elements will always
be a duplicate of the "dev" attribute in <forward> itself. (Is this
necessary/desirable?)
I agree with those who commented against the duplicate eth10.
On top of the reasons already mentioned (SR-IOV PF/VF, ...), I just
find it confusing.
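A sketch of that alternative, assuming the 'dev' attribute could
simply be omitted whenever a pool of <interface> sub-elements is given
(one of the open questions below):

  <forward mode='vepa'>
    <interface dev='eth10'/>
    <interface dev='eth11'/>
    <interface dev='eth12'/>
    <interface dev='eth13'/>
  </forward>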
In the case of mode='passthrough', only one guest interface can be
connected to a device at a time.
In the case of 802.1Qbh (vnlink) that I mentioned above, the libvirt
code does not currently enforce it, but it does have the same
requirement.
libvirt will keep track of which
devices are in use, and attempt to assign a free device; failure to
assign a device will result in a failure of the domain to
start/migrate. For the other direct modes, libvirt will simply keep
track of the number of guest interfaces currently using each device,
and attempt to keep them balanced.
(Open question: where will we keep track of this
allocation/assignment?)
If this information does not live in the kernel (ie, as a sort of ref
counter or flag on the device), libvirt needs to use some kind of
persistent storage to be able to recover properly after a crash, or it
needs a reliable way to rebuild such information upon a restart.
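If it ends up in the network XML (one of the options raised in the
open questions below), a purely hypothetical sketch of what such
tracking could look like:

  <forward mode='passthrough'>
    <!-- hypothetical 'connections' attribute counting current users -->
    <interface dev='eth10' connections='1'/>
    <interface dev='eth11' connections='0'/>
  </forward>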
Portgroups
-----------
A <portgroup> (sub-element of <network>) is just a way of easily
putting connections to the network into different classes, with each
class having a different level/type of service. Each <network> can
have multiple <portgroup> elements, and each <portgroup> has a name,
as well as various attributes associated with it. The first thing we
will use portgroups for is as an alternate place to specify
<virtualport> parameters:
  <portgroup name='engineering'>
    <virtualport type="802.1Qbg">
      <parameters managerid="11" typeid="1193047" typeidversion="2"/>
    </virtualport>
  </portgroup>
Anything that is valid in an interface's <virtualport> is also valid
here.
The next thing to specify in a portgroup will be bandwidth limiting /
QoS configuration. Since I don't know exactly what's needed for that,
I won't specify it here.
If anything is specified both directly under <network> and in a
<portgroup>, the value in portgroup will take precedence. (Again -
what will the precedence of items specified in the <interface> be?)
EXAMPLES
--------
Examples of 'red-network' for different types of connections (all of
these would work with minor variations of the interface XML given
above, e.g. the 'vepa' version would require <virtualport> in the
interface that specified an instanceid, and if the <interface>
specified a portgroup, it would need to also be in the <network>
definition (even if it was empty aside from name).
<!-- Existing usage - a libvirt virtual network -->

  <network>
    <name>red-network</name>
    <bridge name='virbr0'/>
    <forward layer='network' mode='route'/>
    ...
  </network>
<!-- The simplest - an existing host bridge -->

  <network>
    <name>red-network</name>
    <forward mode='bridge-brctl' dev='br0'/>
  </network>
<!-- A macvtap connection to a vepa bridge -->

  <network>
    <name>red-network</name>
    <forward layer='link' mode='vepa' dev='eth10'/>
    <virtualport type='802.1Qbg'>
      <parameters managerid='11' typeid='1193047' typeidversion='2'/>
    </virtualport>
    <!-- NB: if <interface> doesn't specify portgroup, -->
    <!-- 'accounting' is assumed -->
    <portgroup name='accounting'>
      <virtualport>
        <parameters typeid='22'/>
      </virtualport>
    </portgroup>
    <portgroup name='engineering'>
      <virtualport>
        <parameters typeid='33'/>
      </virtualport>
    </portgroup>
  </network>
<!-- A macvtap passthrough connection (one guest interface per dev) -->

  <network>
    <name>red-network</name>
    <forward layer='link' mode='passthrough' dev='eth10'>
      <interface dev='eth10'/>
      <interface dev='eth11'/>
      <interface dev='eth12'/>
      <interface dev='eth13'/>
      <interface dev='eth14'/>
      <interface dev='eth15'/>
      <interface dev='eth16'/>
      <interface dev='eth17'/>
    </forward>
  </network>
=============
Open Questions:
* Is there a good reason to include the "layer='network|link'"
attribute in forward? (maybe just because it's useful info for a
management application that doesn't know the details of the modes?)
Or is it redundant?
I find it redundant.
* What should be the policy when a virtualport parameter is specified
  in both the <interface> and the <network>/<portgroup>? Should one
  take precedence? Or should it be considered an error?
"Precedence" would give more flexibility to the configuration (ie, you
can have a default value to use when a value is not explicitly
configured).
* Is it okay for the domain's own definition to specify what portgroup
  it will be in? Or are there cases where we want to allow someone to
  modify their domain XML, but force them into a particular portgroup
  beyond their control?
If you choose the port_group, you are also choosing the host-specific
parameters. If you do so, then you relax part of the original goal
(moving host-specific parameters from the interface to the network).
* Is it really necessary/desirable for the first ethernet device in a
pool to be duplicated in the <forward dev='xxx'...> attribute? Or
can that attribute be omitted when there is a pool of devices?
I find it confusing.
* Where will we keep track of the count of guest interfaces connected
  to each host interface device, and where will we keep track of which
  device is being used by a particular guest interface? In the
  network/domain XML?
Should libvirt be able to determine/detect whether a device is already
in use or not (for example, to determine whether a macvtap/passthrough
config can be accepted or not)?
For example, what if, according to libvirt, a device appears to be
available, but it is actually in use by another application/VM (for
example a VM started independently of libvirt)?
I guess we do not need to focus too much on these corner cases, but I
believe it would be nice to be able to detect these conditions (if easy
to implement).
/Chris
* Does anyone have better names for "bridge-brctl" and
  "bridge-macvtap"?