"Daniel P. Berrange" <berrange@redhat.com> wrote on 01/26/2010 04:21:56 PM:


>
> libvir-list, gerhard.stenzel, Vivek Kashyap, arndb

>
> Please respond to "Daniel P. Berrange"

>
> On Mon, Jan 25, 2010 at 12:47:17PM -0500, Stefan Berger wrote:
> > Hello!
> >
> >  The attached patch provides support for the Linux macvtap device for
> > Qemu by passing a file descriptor to Qemu command line similar to how it
> > is done with a regular tap device. I have modified the network XML code
> > to understand a definition as the following one here:
> >
> > <network>
> >   <name>vepanet</name>
> >   <uuid>4ebd5168-6321-4757-8397-f6e83484f402</uuid>
> >   <extbridge mode='vepa' dev='eth0'/>
> > </network>
>
> I don't think this is the correct place to be adding this kind
> of configuration / functionality. The virNetworkPtr / <network>
> XML is describing a virtual network capability which is *not*
> directly connected to the LAN. It may be configured to route
> from the virtual network to the LAN, with optional NAT applied.
> So while the implementation may use a bridge device, this bridge
> is not connected to any physical device. Since VEPA is about
> directly connecting VMs to the LAN, this doesn't really fit here.


Yes, I have re-purposed the network XML to describe an external bride.

There's the following advantage to this:

- you can migrate a VM between machines that have different types of connectivity, i.e, tap and macvtap

- pushing the eth0 into referenced XML makes it independent of the local configuration of the host, i.e,
  on the one host it may be eth0 and on the other eth1. eth0 in the above XML could be a physical adapter,
  or an SR-IOV physical adapter or virtual function of an SR-IOV adapter.

>
> I'm not sure how familiar people are with the current libvirt
> object model, so here's a 30 second overview
>
>  - virDomainPtr / <domain>  - management & configuration of
>    the virtual machines & their hardware.
>
>  - virNetworkPtr / <network> - virtual networks, either isolated
>    or routed (+NAT) to the LAN
>
>  - virInterfacePtr / <interface> - host interface configuration
>    for physical devices, bridge, bonding & vlan devices

>
>  - virNodeDevPtr / <device>  - host device enumeration / discovery
>


Thanks.

[...]
>
> One warning..
>
>  - The <domain> XML also contains a element called <interface> which
>    declares a guest NIC. Be careful not to mix up with the similarly
>    named virInterfacePtr XML also using <interface> refering to a
>    host NIC.
>
>
> In the context of bridging a guest to a plain ethernet device, these
> fit together as follows
>
>  1. The virNodeDevPtr APIs are used to discover what physical network
>     devices exist, 'eth0'
>
>  2. The virInterfacePtr APIs are used to create a bridge on the host
>     br0, containing the physical device 'eth0'



Yes, I suppose this is all done via 'virsh iface-*' commands.



> >
> >
> > This XML indicates the device that links to the external bridge, here
> > eth0, and the mode this device is supposed to be set into, here 'vepa'.
> > If 'extbridge' is found in the XML, all other XML elements that have
> > been used so far, i.e., bridge, ip, dhcp, etc., are ignored.
> > The above network description can then be referenced from the virtual
> > machine definition using the typical interface description of type
> > 'network':
> >
> >     <interface type='network'>
> >       <source network='vepanet'/>
> >       <model type='virtio'/>
> >     </interface>
> >
> >
> > The above network XML holds the necessary parameters to open a macvtap
> > device. Using command line, one would issue the following command to
> > achieve the same what now libvirt does internally, using a patched
> > version of the 'ip' command:
> >
> > ip link add link eth0 <optional name of if> type macvtap mode vepa
> >
> > This then creates the network interface, i.e., macvtap12, along with
> > entries in /sys/class/net/macvtap12/ where the content of the ifindex
> > file returns the integer necessary to build the /dev/tap%d device to
> > open the file descriptor of and pass to qemu via command line argument.
>
> So every guest VM NIC gets a new uniqely named macvtap device, and
> and unqiue TAP device ?



Yes, macvtap device creates a macvtap%d interface and a /dev/tap%d char device for sending an receiving of packets.



>
> IIUC, with a macvlan + a regular NIC we'd thus be seeing something
> like this arrangement ..
>
>                   /-> macvtap0 -> tap0 -> vm0
>   pcidev0 -> eth0 --> macvtap1 -> tap1 -> vm1
>                   \-> macvtap2 -> tap2 -> vm2
>

correct.

>
> I guess you could also use this with an SR-IOV card too
>
>   pcidev0 -> eth0 -> macvtap0 -> tap0 -> vm0
>      |
>   pcidev1 -> eth1 -> macvtap1 -> tap1 -> vm1
>      |
>   pcidev2 -> eth2 -> macvtap2 -> tap2 -> vm2
>
> Also, again there's no reason why we'd need to restrict ourselves
> to only using a single macvtap device with each ethX device / virtual
> function.
>
> The core question to me is whether we need to introduce any new
> configuration parameters at the host level. It looks to me like
> it can all be done against the guest XML <interface> element.
>
> Specifically, I don't see a need to manage the macvtapN devices
> directly via the virInterfacePtr APIs. This is good, because
> that is all delegated to the netcf library, which in turn is
> just a way to read/write the /etc/sysconfig/networking-scripts.


Yes, macvtaps need to be created on the fly every time for every VM that starts and wants to use macvtap. From the perspective of the macvtap what matters it the device it is linked to, i.e., eth0 or eth1.


> These ifcfg-XXX files have no knowledge of macvtap, and so
> plumbing all that through the networking-scripts, netcf and
> libvirt virInterfacePtr APIs is non-trivial work we'd be better
> off avoiding unless there was a clear compelling need.
>


> So unless I'm missing something major in my reasoning here I think
> in the domain XML we end up with two possible configs for guest
> network interfaces
>
>
> 1. The current one using plain Linux software bridging, which
>    we can't change in an incompatible way
>
>     <interface type='bridge'/>
>        <source bridge='br0'/>
>        <target dev='vnet0'/>
>     </interface>
>
>    Here, the source device is a bridge previously setup
>    to have a physical device enslaved (regular or SR-IOV)
>    The target device is the plain TAP device


plain TAP device -> no need for change here.

>
> 2. A new one using hardware bridging, which we can freely
>    define for our new needs
>
>     <interface type='direct'/>
>       <source dev='eth0' mode='vepa|pepa|bridge'/>
>       <target dev='vnet0'/>
>     </interface>


In contrast to the ACLs ( :-) ), where I would regard the ACLs as VM-attached data that ideally would migrate along when the VM migrates between hosts, in the case of this network attachment I'd not put host-specific information in the domain XML as is the case here with the 'eth0'. Who knows, maybe it's going to be the SR-IOV virtual adapter eth10 on the destination side? With the redirection into the network XML (or similar) one could define a network XML per VM, create that with host-specific information on the destination, i.e., eth10, and then migrate the VM previously linked to eth0 via macvtap that then connected via eth10. It's more work for upper layers, but if there is a need for optimization for throughput, then maybe that's the only way that optimizations can be done. Otherwise if all VMs in the data center are created with above XML and eth0 then they will all need to stay on eth0 I suppose.

In this context, how will the virtual functions of SR-IOV be administered and given to VMs. I suppose their management would be left up to higher layers?

>
>    Here, source device is a physical device (regular or
>    SR-IOV). The target device is a macvtap device.
>
> In both cases the TAP or macvtap device is created on the fly when the
> VM is booted & destroyed at shutdown (either by the kernel, or manually
> by libvirt for macvtap).


Yes, as long as libvirt is running when the VM goes down it can delete the macvtap device. If not, I am trying to delete all macvtap devices at VM startup using the MAC address of the VM (which the macvtap inherits) as search/delete criterion.

>
> Both configs are ultimately a form of 'bridging', the difference is
> that the former is focused on software bridges, as explicitly managed
> objects separate from the physical device, while the latter is hardware
> bridges which don't need to be managed separately.
>
>
> > Some details:
> >
> > In libvirt I am first searching for an unused interface name following
> > the template macvtap%d unless the user has provided a name explicitly
> > or the previous macvtap%d is now taken by another VM. Once that has
> > succeeded, I follow the path through the filesystem to open the
> > corresponding /dev/tap%d.
> > Unlike the regular tap device, where the network interface disappears
> > once the corresponding file descriptor is closed, the macvtap device
> > needs explicit tear-down. So, when a VM terminates, I am deleting all
> > macvtap type interface with the MAC address as the interface of the
> > terminating VM.
>
> Ok, this plan all makes sense. The only thing is that if we want to
> allow the user to manually specify a macvtap name, we should declare
> that they are not allowed to use the prefix 'macvtap' for their name.
> We do similar with TAP, where we declare 'vnet' is a reserved prefix
> for auto-generated names.



Yes, somehow this will need to be documented.


>
>
> > Index: libvirt/src/qemu/qemu_conf.c
> > ===================================================================
> > --- libvirt.orig/src/qemu/qemu_conf.c
> > +++ libvirt/src/qemu/qemu_conf.c
> > @@ -52,6 +52,7 @@
> >  #include "nodeinfo.h"
> >  #include "logging.h"
> >  #include "network.h"
> > +#include "macvtap.h"
> >  #include "cpu/cpu.h"
> >  
> >  #define VIR_FROM_THIS VIR_FROM_QEMU
> > @@ -1384,6 +1385,34 @@ int qemudExtractVersion(virConnectPtr co
> >  }
> >  
> >  
> > +static int
> > +qemudPhysIfaceConnect(virConnectPtr conn,
> > +                      virDomainNetDefPtr net,
> > +                      char *linkdev,
> > +                      char *brmode)
> > +{
> > +    int rc;
> > +#if defined(WITH_MACVTAP)
> > +    char *res_ifname = NULL;
> > +    delMacvtapByMACAddress(conn, net->mac);
> > +    rc = openMacvtapTap(conn, net->ifname, net->mac, linkdev, brmode,
> > +                        &res_ifname);
> > +    if (rc > 0) {
> > +        VIR_FREE(net->ifname);
> > +        net->ifname = res_ifname;
> > +    }
> > +#else
> > +    (void)net;
> > +    (void)linkdev;
> > +    (void)brmode;
> > +    qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
> > +                     "%s", _("No support for macvtap device"));
> > +    rc = -1;
> > +#endif
> > +    return rc;
> > +}
> > +
> > +
> >  int
> >  qemudNetworkIfaceConnect(virConnectPtr conn,
> >                           struct qemud_driver *driver,
> > @@ -1395,6 +1424,7 @@ qemudNetworkIfaceConnect(virConnectPtr c
> >      int tapfd = -1;
> >      int vnet_hdr = 0;
> >      int template_ifname = 0;
> > +    char *brmode = NULL;
> >  
> >      if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK) {
> >          virNetworkPtr network = virNetworkLookupByName(conn,
> > @@ -1402,6 +1432,15 @@ qemudNetworkIfaceConnect(virConnectPtr c
> >          if (!network)
> >              return -1;
> >  
> > +        if (virNetworkGetExtbridgeData(network, &brname, &brmode) == 0) {
> > +            tapfd = qemudPhysIfaceConnect(conn, net,
> > +                                          brname,
> > +                                          brmode);
> > +            VIR_FREE(brname);
> > +            VIR_FREE(brmode);
> > +            return tapfd;
> > +        }
> > +
> >          brname = virNetworkGetBridgeName(network);
> >  
> >          virNetworkFree(network);
>
> As mentioned earlier, we don't want to touch the VIR_DOMAIN_NET_TYPE_NETWORK
> case, since that's not bridging - its separate virtal networks using NAT.
> If my idea above is acceptable, we'd be adding a new
> VIR_DOMAIN_NET_TYPE_DIRECT
> that we'd hook up.


Understood, but the redirection into a referenced XML does have its advantages.

>
> > Index: libvirt/src/util/macvtap.h
> > ===================================================================
> > --- /dev/null
> > +++ libvirt/src/util/macvtap.h
> > @@ -0,0 +1,48 @@
> > +/*
> > + * Copyright (C) 2010 IBM Corporation
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with this library; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA
> > + *
> > + * Authors:
> > + *     Stefan Berger <stefanb@us.ibm.com>
> > + */
> > +
> > +#ifndef __UTIL_MACVTAP_H__
> > +#define __UTIL_MACVTAP_H__
> > +
> > +#include <config.h>
> > +
> > +#if defined(WITH_MACVTAP)
> > +
> > +#include "internal.h"
> > +
> > +int openMacvtapTap(virConnectPtr conn,
> > +                   const char *ifname,
> > +                   const unsigned char *macaddress,
> > +                   const char *linkdev,
> > +                   const char *mode,
> > +                   char **res_ifname);
> > +
> > +int delMacvtapByMACAddress(virConnectPtr conn,
> > +                           const unsigned char *macaddress);
>
> As an interface for internal usage, this looks like a fine
> start.


Thanks :-)
As said, I am not trying to delete the macvtap by interface name but by MAC address in case libvirt forgot about the MAC address or there is a stale interface still around. This is where this parameter comes from. Maybe an ioctl to delete-macvtap-on-tap-close could solve the problem on the device level later on.

>
> > +
> > +#endif /* WITH_MACVTAP */
> > +
> > +#define MACVTAP_MODE_PRIVATE_STR  "private"
> > +#define MACVTAP_MODE_VEPA_STR     "vepa"
> > +#define MACVTAP_MODE_BRIDGE_STR   "bridge"
> > +
> > +
> > +#endif
> > Index: libvirt/src/Makefile.am
> > ===================================================================
> > --- libvirt.orig/src/Makefile.am
> > +++ libvirt/src/Makefile.am
> > @@ -55,6 +55,7 @@ UTIL_SOURCES =                     \
> >        util/ebtables.c util/ebtables.h         \
> >        util/json.c util/json.h            \
> >        util/logging.c util/logging.h         \
> > +      util/macvtap.c util/macvtap.h         \
> >        util/memory.c util/memory.h         \
> >        util/pci.c util/pci.h            \
> >        util/processinfo.c util/processinfo.h      \
> > @@ -784,12 +785,15 @@ if WITH_LINUX
> >  USED_SYM_FILES += libvirt_linux.syms
> >  endif
> >  
> > +USED_SYM_FILES += libvirt_macvtap.syms
> > +
> >  EXTRA_DIST += \
> >    libvirt_public.syms      \
> >    libvirt_private.syms      \
> >    libvirt_driver_modules.syms   \
> >    libvirt_bridge.syms      \
> > -  libvirt_linux.syms
> > +  libvirt_linux.syms      \
> > +  libvirt_macvtap.syms
> >  
> >  BUILT_SOURCES = libvirt.syms
> >  
> > Index: libvirt/src/util/macvtap.c
> > ===================================================================
> > --- /dev/null
> > +++ libvirt/src/util/macvtap.c
> > @@ -0,0 +1,664 @@
> > +/*
> > + * Copyright (C) 2010 IBM Corporation
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General Public
> > + * License along with this library; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA
> > + *
> > + * Authors:
> > + *     Stefan Berger <stefanb@us.ibm.com>
> > + */
> > +
> > +#include <config.h>
> > +
> > +#if defined(WITH_MACVTAP)
>
> [snip].
>
> I've not had time to look at the details of this macvtap.c code yet,
> but I assume its doing all you need :-) Is there any benefit to using
> the network libnl.so library, rather than the ioctl()'s directly ?

 

Haven't looked at that library and its API, but can do so if it's documented. Would it be ok to keep the current implementation, though?


> I'm  not too familiar with either, but we already use libnl.so
> indirectly via netcf, so don't be afraid of adding a dependancy
> on the libnl.so library if you consider it to be useful.
>
> > Index: libvirt/src/conf/network_conf.c
> > ===================================================================
> > --- libvirt.orig/src/conf/network_conf.c
> > +++ libvirt/src/conf/network_conf.c
>
> This file shouldn't get any changes really.
>
> > Index: libvirt/src/libvirt.c
> > ===================================================================
> > --- libvirt.orig/src/libvirt.c
> > +++ libvirt/src/libvirt.c
> > @@ -6034,6 +6034,48 @@ error:
> >      return NULL;
> >  }
> >  
> > +
> > +/**
> > + * virNetworkGetExtbridgeData:
> > + * @network: a network object
> > + * @linkdev : pointer where name of the interface to connect to
> the external
> > + *            bridge is returned
> > + * @brmode  : pointer where mode of the external bridge is returned
> > + *
> > + * Returns 0 in case an external bridge has been configured, -1 otherwise
> > + */
> > +int
> > +virNetworkGetExtbridgeData(virNetworkPtr network,
> > +                           char **linkdev, char **brmode)
> > +{
> > +    virConnectPtr conn;
> > +    DEBUG("network=%p", network);
> > +
> > +    virResetLastError();
> > +
> > +    if (!VIR_IS_CONNECTED_NETWORK(network)) {
> > +        virLibNetworkError(NULL, VIR_ERR_INVALID_NETWORK, __FUNCTION__);
> > +        virDispatchError(NULL);
> > +        return -1;
> > +    }
> > +
> > +    conn = network->conn;
> > +
> > +    if (conn->networkDriver &&
> conn->networkDriver->networkGetExtbridgeData) {
> > +        int ret;
> > +        ret = conn->networkDriver->networkGetExtbridgeData(network,
> > +                                                           linkdev,
> > +                                                           brmode);
> > +        return ret;
> > +    }
> > +
> > +    virLibConnError (conn, VIR_ERR_NO_SUPPORT, __FUNCTION__);
> > +
> > +    virDispatchError(network->conn);
> > +    return -1;
> > +}
>
> Also don't think we need to be adding extra public APIs for this, if we
> do it within scope of the domain XML.


Correct.

>
>
> For the next submission, it'd be very helpful for review if you could
> split it into 3 patches
>
>  1. Adding the extra domain XML syntax to src/conf/domain_conf.{h,c}
>     and docs/schemas/domain.rng
>
>  2. Adding the macvtap helper code in src/util/macvtap.{h,c}
>
>  3. Implementation for the QEMU driver to use the stuff from the first
>     two patches.


Will do.

  Thanks and regards,
     Stefan


>
> Regards,
> Daniel
> --
> |: Red Hat, Engineering, London   -o-  
http://people.redhat.com/berrange/ :|
> |:
http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
> |:
http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
> |: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|