"Daniel P. Berrange" <berrange@redhat.com>
wrote on 01/26/2010 04:21:56 PM:
>
> libvir-list, gerhard.stenzel, Vivek Kashyap, arndb
>
> Please respond to "Daniel P. Berrange"
>
> On Mon, Jan 25, 2010 at 12:47:17PM -0500, Stefan Berger wrote:
> > Hello!
> >
> > The attached patch provides support for the Linux macvtap
device for
> > Qemu by passing a file descriptor to Qemu command line similar
to how it
> > is done with a regular tap device. I have modified the network
XML code
> > to understand a definition as the following one here:
> >
> > <network>
> > <name>vepanet</name>
> > <uuid>4ebd5168-6321-4757-8397-f6e83484f402</uuid>
> > <extbridge mode='vepa' dev='eth0'/>
> > </network>
>
> I don't think this is the correct place to be adding this kind
> of configuration / functionality. The virNetworkPtr / <network>
> XML is describing a virtual network capability which is *not*
> directly connected to the LAN. It may be configured to route
> from the virtual network to the LAN, with optional NAT applied.
> So while the implementation may use a bridge device, this bridge
> is not connected to any physical device. Since VEPA is about
> directly connecting VMs to the LAN, this doesn't really fit here.
Yes, I have re-purposed the network XML to describe
an external bride.
There's the following advantage to this:
- you can migrate a VM between machines that have
different types of connectivity, i.e, tap and macvtap
- pushing the eth0 into referenced XML makes it independent
of the local configuration of the host, i.e,
on the one host it may be eth0 and on the other
eth1. eth0 in the above XML could be a physical adapter,
or an SR-IOV physical adapter or virtual function
of an SR-IOV adapter.
>
> I'm not sure how familiar people are with the current libvirt
> object model, so here's a 30 second overview
>
> - virDomainPtr / <domain> - management & configuration
of
> the virtual machines & their hardware.
>
> - virNetworkPtr / <network> - virtual networks, either
isolated
> or routed (+NAT) to the LAN
>
> - virInterfacePtr / <interface> - host interface configuration
> for physical devices, bridge, bonding & vlan devices
>
> - virNodeDevPtr / <device> - host device enumeration
/ discovery
>
Thanks.
[...]
>
> One warning..
>
> - The <domain> XML also contains a element called <interface>
which
> declares a guest NIC. Be careful not to mix up with the
similarly
> named virInterfacePtr XML also using <interface>
refering to a
> host NIC.
>
>
> In the context of bridging a guest to a plain ethernet device, these
> fit together as follows
>
> 1. The virNodeDevPtr APIs are used to discover what physical
network
> devices exist, 'eth0'
>
> 2. The virInterfacePtr APIs are used to create a bridge on the
host
> br0, containing the physical device 'eth0'
Yes, I suppose this is all done via 'virsh iface-*'
commands.
> >
> >
> > This XML indicates the device that links to the external bridge,
here
> > eth0, and the mode this device is supposed to be set into, here
'vepa'.
> > If 'extbridge' is found in the XML, all other XML elements that
have
> > been used so far, i.e., bridge, ip, dhcp, etc., are ignored.
> > The above network description can then be referenced from the
virtual
> > machine definition using the typical interface description of
type
> > 'network':
> >
> > <interface type='network'>
> > <source network='vepanet'/>
> > <model type='virtio'/>
> > </interface>
> >
> >
> > The above network XML holds the necessary parameters to open
a macvtap
> > device. Using command line, one would issue the following command
to
> > achieve the same what now libvirt does internally, using a patched
> > version of the 'ip' command:
> >
> > ip link add link eth0 <optional name of if> type macvtap
mode vepa
> >
> > This then creates the network interface, i.e., macvtap12, along
with
> > entries in /sys/class/net/macvtap12/ where the content of the
ifindex
> > file returns the integer necessary to build the /dev/tap%d device
to
> > open the file descriptor of and pass to qemu via command line
argument.
>
> So every guest VM NIC gets a new uniqely named macvtap device, and
> and unqiue TAP device ?
Yes, macvtap device creates a macvtap%d interface
and a /dev/tap%d char device for sending an receiving of packets.
>
> IIUC, with a macvlan + a regular NIC we'd thus be seeing something
> like this arrangement ..
>
> /->
macvtap0 -> tap0 -> vm0
> pcidev0 -> eth0 --> macvtap1 -> tap1 -> vm1
> \->
macvtap2 -> tap2 -> vm2
>
correct.
>
> I guess you could also use this with an SR-IOV card too
>
> pcidev0 -> eth0 -> macvtap0 -> tap0 -> vm0
> |
> pcidev1 -> eth1 -> macvtap1 -> tap1 -> vm1
> |
> pcidev2 -> eth2 -> macvtap2 -> tap2 -> vm2
>
> Also, again there's no reason why we'd need to restrict ourselves
> to only using a single macvtap device with each ethX device / virtual
> function.
>
> The core question to me is whether we need to introduce any new
> configuration parameters at the host level. It looks to me like
> it can all be done against the guest XML <interface> element.
>
> Specifically, I don't see a need to manage the macvtapN devices
> directly via the virInterfacePtr APIs. This is good, because
> that is all delegated to the netcf library, which in turn is
> just a way to read/write the /etc/sysconfig/networking-scripts.
Yes, macvtaps need to be created on the fly every
time for every VM that starts and wants to use macvtap. From the perspective
of the macvtap what matters it the device it is linked to, i.e., eth0 or
eth1.
> These ifcfg-XXX files have no knowledge of macvtap, and so
> plumbing all that through the networking-scripts, netcf and
> libvirt virInterfacePtr APIs is non-trivial work we'd be better
> off avoiding unless there was a clear compelling need.
>
> So unless I'm missing something major in my reasoning here I think
> in the domain XML we end up with two possible configs for guest
> network interfaces
>
>
> 1. The current one using plain Linux software bridging, which
> we can't change in an incompatible way
>
> <interface type='bridge'/>
> <source bridge='br0'/>
> <target dev='vnet0'/>
> </interface>
>
> Here, the source device is a bridge previously setup
> to have a physical device enslaved (regular or SR-IOV)
> The target device is the plain TAP device
plain TAP device -> no need for change here.
>
> 2. A new one using hardware bridging, which we can freely
> define for our new needs
>
> <interface type='direct'/>
> <source dev='eth0' mode='vepa|pepa|bridge'/>
> <target dev='vnet0'/>
> </interface>
In contrast to the ACLs ( :-) ), where I would regard
the ACLs as VM-attached data that ideally would migrate along when the
VM migrates between hosts, in the case of this network attachment I'd not
put host-specific information in the domain XML as is the case here with
the 'eth0'. Who knows, maybe it's going to be the SR-IOV virtual adapter
eth10 on the destination side? With the redirection into the network XML
(or similar) one could define a network XML per VM, create that with host-specific
information on the destination, i.e., eth10, and then migrate the VM previously
linked to eth0 via macvtap that then connected via eth10. It's more work
for upper layers, but if there is a need for optimization for throughput,
then maybe that's the only way that optimizations can be done. Otherwise
if all VMs in the data center are created with above XML and eth0 then
they will all need to stay on eth0 I suppose.
In this context, how will the virtual functions of
SR-IOV be administered and given to VMs. I suppose their management would
be left up to higher layers?
>
> Here, source device is a physical device (regular or
> SR-IOV). The target device is a macvtap device.
>
> In both cases the TAP or macvtap device is created on the fly when
the
> VM is booted & destroyed at shutdown (either by the kernel, or
manually
> by libvirt for macvtap).
Yes, as long as libvirt is running when the VM goes
down it can delete the macvtap device. If not, I am trying to delete all
macvtap devices at VM startup using the MAC address of the VM (which the
macvtap inherits) as search/delete criterion.
>
> Both configs are ultimately a form of 'bridging', the difference is
> that the former is focused on software bridges, as explicitly managed
> objects separate from the physical device, while the latter is hardware
> bridges which don't need to be managed separately.
>
>
> > Some details:
> >
> > In libvirt I am first searching for an unused interface name
following
> > the template macvtap%d unless the user has provided a name explicitly
> > or the previous macvtap%d is now taken by another VM. Once that
has
> > succeeded, I follow the path through the filesystem to open the
> > corresponding /dev/tap%d.
> > Unlike the regular tap device, where the network interface disappears
> > once the corresponding file descriptor is closed, the macvtap
device
> > needs explicit tear-down. So, when a VM terminates, I am deleting
all
> > macvtap type interface with the MAC address as the interface
of the
> > terminating VM.
>
> Ok, this plan all makes sense. The only thing is that if we want to
> allow the user to manually specify a macvtap name, we should declare
> that they are not allowed to use the prefix 'macvtap' for their name.
> We do similar with TAP, where we declare 'vnet' is a reserved prefix
> for auto-generated names.
Yes, somehow this will need to be documented.
>
>
> > Index: libvirt/src/qemu/qemu_conf.c
> > ===================================================================
> > --- libvirt.orig/src/qemu/qemu_conf.c
> > +++ libvirt/src/qemu/qemu_conf.c
> > @@ -52,6 +52,7 @@
> > #include "nodeinfo.h"
> > #include "logging.h"
> > #include "network.h"
> > +#include "macvtap.h"
> > #include "cpu/cpu.h"
> >
> > #define VIR_FROM_THIS VIR_FROM_QEMU
> > @@ -1384,6 +1385,34 @@ int qemudExtractVersion(virConnectPtr
co
> > }
> >
> >
> > +static int
> > +qemudPhysIfaceConnect(virConnectPtr conn,
> > +
virDomainNetDefPtr net,
> > +
char *linkdev,
> > +
char *brmode)
> > +{
> > + int rc;
> > +#if defined(WITH_MACVTAP)
> > + char *res_ifname = NULL;
> > + delMacvtapByMACAddress(conn, net->mac);
> > + rc = openMacvtapTap(conn, net->ifname, net->mac,
linkdev, brmode,
> > +
&res_ifname);
> > + if (rc > 0) {
> > + VIR_FREE(net->ifname);
> > + net->ifname = res_ifname;
> > + }
> > +#else
> > + (void)net;
> > + (void)linkdev;
> > + (void)brmode;
> > + qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
> > +
"%s", _("No support for macvtap device"));
> > + rc = -1;
> > +#endif
> > + return rc;
> > +}
> > +
> > +
> > int
> > qemudNetworkIfaceConnect(virConnectPtr conn,
> >
struct qemud_driver *driver,
> > @@ -1395,6 +1424,7 @@ qemudNetworkIfaceConnect(virConnectPtr
c
> > int tapfd = -1;
> > int vnet_hdr = 0;
> > int template_ifname = 0;
> > + char *brmode = NULL;
> >
> > if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK)
{
> > virNetworkPtr network = virNetworkLookupByName(conn,
> > @@ -1402,6 +1432,15 @@ qemudNetworkIfaceConnect(virConnectPtr
c
> > if (!network)
> > return -1;
> >
> > + if (virNetworkGetExtbridgeData(network,
&brname, &brmode) == 0) {
> > + tapfd = qemudPhysIfaceConnect(conn,
net,
> > +
brname,
> > +
brmode);
> > + VIR_FREE(brname);
> > + VIR_FREE(brmode);
> > + return tapfd;
> > + }
> > +
> > brname = virNetworkGetBridgeName(network);
> >
> > virNetworkFree(network);
>
> As mentioned earlier, we don't want to touch the VIR_DOMAIN_NET_TYPE_NETWORK
> case, since that's not bridging - its separate virtal networks using
NAT.
> If my idea above is acceptable, we'd be adding a new
> VIR_DOMAIN_NET_TYPE_DIRECT
> that we'd hook up.
Understood, but the redirection into a referenced
XML does have its advantages.
>
> > Index: libvirt/src/util/macvtap.h
> > ===================================================================
> > --- /dev/null
> > +++ libvirt/src/util/macvtap.h
> > @@ -0,0 +1,48 @@
> > +/*
> > + * Copyright (C) 2010 IBM Corporation
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later
version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty
of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General
Public
> > + * License along with this library; if not, write to the Free
Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA
> > + *
> > + * Authors:
> > + * Stefan Berger <stefanb@us.ibm.com>
> > + */
> > +
> > +#ifndef __UTIL_MACVTAP_H__
> > +#define __UTIL_MACVTAP_H__
> > +
> > +#include <config.h>
> > +
> > +#if defined(WITH_MACVTAP)
> > +
> > +#include "internal.h"
> > +
> > +int openMacvtapTap(virConnectPtr conn,
> > +
const char *ifname,
> > +
const unsigned char *macaddress,
> > +
const char *linkdev,
> > +
const char *mode,
> > +
char **res_ifname);
> > +
> > +int delMacvtapByMACAddress(virConnectPtr conn,
> > +
const unsigned char *macaddress);
>
> As an interface for internal usage, this looks like a fine
> start.
Thanks :-)
As said, I am not trying to delete the macvtap by
interface name but by MAC address in case libvirt forgot about the MAC
address or there is a stale interface still around. This is where this
parameter comes from. Maybe an ioctl to delete-macvtap-on-tap-close could
solve the problem on the device level later on.
>
> > +
> > +#endif /* WITH_MACVTAP */
> > +
> > +#define MACVTAP_MODE_PRIVATE_STR "private"
> > +#define MACVTAP_MODE_VEPA_STR "vepa"
> > +#define MACVTAP_MODE_BRIDGE_STR "bridge"
> > +
> > +
> > +#endif
> > Index: libvirt/src/Makefile.am
> > ===================================================================
> > --- libvirt.orig/src/Makefile.am
> > +++ libvirt/src/Makefile.am
> > @@ -55,6 +55,7 @@ UTIL_SOURCES =
\
> > util/ebtables.c util/ebtables.h
\
> > util/json.c util/json.h
\
> > util/logging.c util/logging.h
\
> > + util/macvtap.c util/macvtap.h
\
> > util/memory.c util/memory.h
\
> > util/pci.c util/pci.h
\
> > util/processinfo.c util/processinfo.h
\
> > @@ -784,12 +785,15 @@ if WITH_LINUX
> > USED_SYM_FILES += libvirt_linux.syms
> > endif
> >
> > +USED_SYM_FILES += libvirt_macvtap.syms
> > +
> > EXTRA_DIST += \
> > libvirt_public.syms \
> > libvirt_private.syms \
> > libvirt_driver_modules.syms \
> > libvirt_bridge.syms \
> > - libvirt_linux.syms
> > + libvirt_linux.syms \
> > + libvirt_macvtap.syms
> >
> > BUILT_SOURCES = libvirt.syms
> >
> > Index: libvirt/src/util/macvtap.c
> > ===================================================================
> > --- /dev/null
> > +++ libvirt/src/util/macvtap.c
> > @@ -0,0 +1,664 @@
> > +/*
> > + * Copyright (C) 2010 IBM Corporation
> > + *
> > + * This library is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU Lesser General Public
> > + * License as published by the Free Software Foundation; either
> > + * version 2.1 of the License, or (at your option) any later
version.
> > + *
> > + * This library is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty
of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See
the GNU
> > + * Lesser General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU Lesser General
Public
> > + * License along with this library; if not, write to the Free
Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA
> > + *
> > + * Authors:
> > + * Stefan Berger <stefanb@us.ibm.com>
> > + */
> > +
> > +#include <config.h>
> > +
> > +#if defined(WITH_MACVTAP)
>
> [snip].
>
> I've not had time to look at the details of this macvtap.c code yet,
> but I assume its doing all you need :-) Is there any benefit to using
> the network libnl.so library, rather than the ioctl()'s directly ?
Haven't looked at that library and its API, but can
do so if it's documented. Would it be ok to keep the current implementation,
though?
> I'm not too familiar with either, but we already use libnl.so
> indirectly via netcf, so don't be afraid of adding a dependancy
> on the libnl.so library if you consider it to be useful.
>
> > Index: libvirt/src/conf/network_conf.c
> > ===================================================================
> > --- libvirt.orig/src/conf/network_conf.c
> > +++ libvirt/src/conf/network_conf.c
>
> This file shouldn't get any changes really.
>
> > Index: libvirt/src/libvirt.c
> > ===================================================================
> > --- libvirt.orig/src/libvirt.c
> > +++ libvirt/src/libvirt.c
> > @@ -6034,6 +6034,48 @@ error:
> > return NULL;
> > }
> >
> > +
> > +/**
> > + * virNetworkGetExtbridgeData:
> > + * @network: a network object
> > + * @linkdev : pointer where name of the interface to connect
to
> the external
> > + * bridge is returned
> > + * @brmode : pointer where mode of the external bridge
is returned
> > + *
> > + * Returns 0 in case an external bridge has been configured,
-1 otherwise
> > + */
> > +int
> > +virNetworkGetExtbridgeData(virNetworkPtr network,
> > +
char **linkdev, char **brmode)
> > +{
> > + virConnectPtr conn;
> > + DEBUG("network=%p", network);
> > +
> > + virResetLastError();
> > +
> > + if (!VIR_IS_CONNECTED_NETWORK(network)) {
> > + virLibNetworkError(NULL, VIR_ERR_INVALID_NETWORK,
__FUNCTION__);
> > + virDispatchError(NULL);
> > + return -1;
> > + }
> > +
> > + conn = network->conn;
> > +
> > + if (conn->networkDriver &&
> conn->networkDriver->networkGetExtbridgeData) {
> > + int ret;
> > + ret = conn->networkDriver->networkGetExtbridgeData(network,
> > +
linkdev,
> > +
brmode);
> > + return ret;
> > + }
> > +
> > + virLibConnError (conn, VIR_ERR_NO_SUPPORT, __FUNCTION__);
> > +
> > + virDispatchError(network->conn);
> > + return -1;
> > +}
>
> Also don't think we need to be adding extra public APIs for this,
if we
> do it within scope of the domain XML.
Correct.
>
>
> For the next submission, it'd be very helpful for review if you could
> split it into 3 patches
>
> 1. Adding the extra domain XML syntax to src/conf/domain_conf.{h,c}
> and docs/schemas/domain.rng
>
> 2. Adding the macvtap helper code in src/util/macvtap.{h,c}
>
> 3. Implementation for the QEMU driver to use the stuff from
the first
> two patches.
Will do.
Thanks and regards,
Stefan
>
> Regards,
> Daniel
> --
> |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/
:|
> |: http://libvirt.org
-o- http://virt-manager.org
-o- http://ovirt.org
:|
> |: http://autobuild.org
-o- http://search.cpan.org/~danberr/
:|
> |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF
F742 7D3B 9505 :|