"Daniel P. Berrange" <berrange(a)redhat.com> wrote on 01/26/2010 04:21:56
PM:
libvir-list, gerhard.stenzel, Vivek Kashyap, arndb
Please respond to "Daniel P. Berrange"
On Mon, Jan 25, 2010 at 12:47:17PM -0500, Stefan Berger wrote:
> Hello!
>
> The attached patch provides support for the Linux macvtap device for
> Qemu by passing a file descriptor to Qemu command line similar to how
it
> is done with a regular tap device. I have modified the network
XML
code
> to understand a definition as the following one here:
>
> <network>
> <name>vepanet</name>
> <uuid>4ebd5168-6321-4757-8397-f6e83484f402</uuid>
> <extbridge mode='vepa' dev='eth0'/>
> </network>
I don't think this is the correct place to be adding this kind
of configuration / functionality. The virNetworkPtr / <network>
XML is describing a virtual network capability which is *not*
directly connected to the LAN. It may be configured to route
from the virtual network to the LAN, with optional NAT applied.
So while the implementation may use a bridge device, this bridge
is not connected to any physical device. Since VEPA is about
directly connecting VMs to the LAN, this doesn't really fit here.
Yes, I have re-purposed the network XML to describe an external bride.
There's the following advantage to this:
- you can migrate a VM between machines that have different types of
connectivity, i.e, tap and macvtap
- pushing the eth0 into referenced XML makes it independent of the local
configuration of the host, i.e,
on the one host it may be eth0 and on the other eth1. eth0 in the above
XML could be a physical adapter,
or an SR-IOV physical adapter or virtual function of an SR-IOV adapter.
I'm not sure how familiar people are with the current libvirt
object model, so here's a 30 second overview
- virDomainPtr / <domain> - management & configuration of
the virtual machines & their hardware.
- virNetworkPtr / <network> - virtual networks, either isolated
or routed (+NAT) to the LAN
- virInterfacePtr / <interface> - host interface configuration
for physical devices, bridge, bonding & vlan devices
- virNodeDevPtr / <device> - host device enumeration / discovery
Thanks.
[...]
One warning..
- The <domain> XML also contains a element called <interface> which
declares a guest NIC. Be careful not to mix up with the similarly
named virInterfacePtr XML also using <interface> refering to a
host NIC.
In the context of bridging a guest to a plain ethernet device, these
fit together as follows
1. The virNodeDevPtr APIs are used to discover what physical network
devices exist, 'eth0'
2. The virInterfacePtr APIs are used to create a bridge on the host
br0, containing the physical device 'eth0'
Yes, I suppose this is all done via 'virsh iface-*' commands.
>
>
> This XML indicates the device that links to the external bridge, here
> eth0, and the mode this device is supposed to be set into, here
'vepa'.
> If 'extbridge' is found in the XML, all other XML
elements that have
> been used so far, i.e., bridge, ip, dhcp, etc., are ignored.
> The above network description can then be referenced from the virtual
> machine definition using the typical interface description of type
> 'network':
>
> <interface type='network'>
> <source network='vepanet'/>
> <model type='virtio'/>
> </interface>
>
>
> The above network XML holds the necessary parameters to open a macvtap
> device. Using command line, one would issue the following command to
> achieve the same what now libvirt does internally, using a patched
> version of the 'ip' command:
>
> ip link add link eth0 <optional name of if> type macvtap mode vepa
>
> This then creates the network interface, i.e., macvtap12, along with
> entries in /sys/class/net/macvtap12/ where the content of the ifindex
> file returns the integer necessary to build the /dev/tap%d device to
> open the file descriptor of and pass to qemu via command line
argument.
So every guest VM NIC gets a new uniqely named macvtap device, and
and unqiue TAP device ?
Yes, macvtap device creates a macvtap%d interface and a /dev/tap%d char
device for sending an receiving of packets.
IIUC, with a macvlan + a regular NIC we'd thus be seeing something
like this arrangement ..
/-> macvtap0 -> tap0 -> vm0
pcidev0 -> eth0 --> macvtap1 -> tap1 -> vm1
\-> macvtap2 -> tap2 -> vm2
correct.
I guess you could also use this with an SR-IOV card too
pcidev0 -> eth0 -> macvtap0 -> tap0 -> vm0
|
pcidev1 -> eth1 -> macvtap1 -> tap1 -> vm1
|
pcidev2 -> eth2 -> macvtap2 -> tap2 -> vm2
Also, again there's no reason why we'd need to restrict ourselves
to only using a single macvtap device with each ethX device / virtual
function.
The core question to me is whether we need to introduce any new
configuration parameters at the host level. It looks to me like
it can all be done against the guest XML <interface> element.
Specifically, I don't see a need to manage the macvtapN devices
directly via the virInterfacePtr APIs. This is good, because
that is all delegated to the netcf library, which in turn is
just a way to read/write the /etc/sysconfig/networking-scripts.
Yes, macvtaps need to be created on the fly every time for every VM that
starts and wants to use macvtap. From the perspective of the macvtap what
matters it the device it is linked to, i.e., eth0 or eth1.
These ifcfg-XXX files have no knowledge of macvtap, and so
plumbing all that through the networking-scripts, netcf and
libvirt virInterfacePtr APIs is non-trivial work we'd be better
off avoiding unless there was a clear compelling need.
So unless I'm missing something major in my reasoning here I
think
in the domain XML we end up with two possible configs for guest
network interfaces
1. The current one using plain Linux software bridging, which
we can't change in an incompatible way
<interface type='bridge'/>
<source bridge='br0'/>
<target dev='vnet0'/>
</interface>
Here, the source device is a bridge previously setup
to have a physical device enslaved (regular or SR-IOV)
The target device is the plain TAP device
plain TAP device -> no need for change here.
2. A new one using hardware bridging, which we can freely
define for our new needs
<interface type='direct'/>
<source dev='eth0' mode='vepa|pepa|bridge'/>
<target dev='vnet0'/>
</interface>
In contrast to the ACLs ( :-) ), where I would regard the ACLs as
VM-attached data that ideally would migrate along when the VM migrates
between hosts, in the case of this network attachment I'd not put
host-specific information in the domain XML as is the case here with the
'eth0'. Who knows, maybe it's going to be the SR-IOV virtual adapter eth10
on the destination side? With the redirection into the network XML (or
similar) one could define a network XML per VM, create that with
host-specific information on the destination, i.e., eth10, and then
migrate the VM previously linked to eth0 via macvtap that then connected
via eth10. It's more work for upper layers, but if there is a need for
optimization for throughput, then maybe that's the only way that
optimizations can be done. Otherwise if all VMs in the data center are
created with above XML and eth0 then they will all need to stay on eth0 I
suppose.
In this context, how will the virtual functions of SR-IOV be administered
and given to VMs. I suppose their management would be left up to higher
layers?
Here, source device is a physical device (regular or
SR-IOV). The target device is a macvtap device.
In both cases the TAP or macvtap device is created on the fly when the
VM is booted & destroyed at shutdown (either by the kernel, or manually
by libvirt for macvtap).
Yes, as long as libvirt is running when the VM goes down it can delete the
macvtap device. If not, I am trying to delete all macvtap devices at VM
startup using the MAC address of the VM (which the macvtap inherits) as
search/delete criterion.
Both configs are ultimately a form of 'bridging', the difference is
that the former is focused on software bridges, as explicitly managed
objects separate from the physical device, while the latter is hardware
bridges which don't need to be managed separately.
> Some details:
>
> In libvirt I am first searching for an unused interface name following
> the template macvtap%d unless the user has provided a name explicitly
> or the previous macvtap%d is now taken by another VM. Once that has
> succeeded, I follow the path through the filesystem to open the
> corresponding /dev/tap%d.
> Unlike the regular tap device, where the network interface disappears
> once the corresponding file descriptor is closed, the macvtap device
> needs explicit tear-down. So, when a VM terminates, I am deleting all
> macvtap type interface with the MAC address as the interface of the
> terminating VM.
Ok, this plan all makes sense. The only thing is that if we want to
allow the user to manually specify a macvtap name, we should declare
that they are not allowed to use the prefix 'macvtap' for their name.
We do similar with TAP, where we declare 'vnet' is a reserved prefix
for auto-generated names.
Yes, somehow this will need to be documented.
> Index: libvirt/src/qemu/qemu_conf.c
> ===================================================================
> --- libvirt.orig/src/qemu/qemu_conf.c
> +++ libvirt/src/qemu/qemu_conf.c
> @@ -52,6 +52,7 @@
> #include "nodeinfo.h"
> #include "logging.h"
> #include "network.h"
> +#include "macvtap.h"
> #include "cpu/cpu.h"
>
> #define VIR_FROM_THIS VIR_FROM_QEMU
> @@ -1384,6 +1385,34 @@ int qemudExtractVersion(virConnectPtr co
> }
>
>
> +static int
> +qemudPhysIfaceConnect(virConnectPtr conn,
> + virDomainNetDefPtr net,
> + char *linkdev,
> + char *brmode)
> +{
> + int rc;
> +#if defined(WITH_MACVTAP)
> + char *res_ifname = NULL;
> + delMacvtapByMACAddress(conn, net->mac);
> + rc = openMacvtapTap(conn, net->ifname, net->mac, linkdev, brmode,
> + &res_ifname);
> + if (rc > 0) {
> + VIR_FREE(net->ifname);
> + net->ifname = res_ifname;
> + }
> +#else
> + (void)net;
> + (void)linkdev;
> + (void)brmode;
> + qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
> + "%s", _("No support for macvtap
device"));
> + rc = -1;
> +#endif
> + return rc;
> +}
> +
> +
> int
> qemudNetworkIfaceConnect(virConnectPtr conn,
> struct qemud_driver *driver,
> @@ -1395,6 +1424,7 @@ qemudNetworkIfaceConnect(virConnectPtr c
> int tapfd = -1;
> int vnet_hdr = 0;
> int template_ifname = 0;
> + char *brmode = NULL;
>
> if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK) {
> virNetworkPtr network = virNetworkLookupByName(conn,
> @@ -1402,6 +1432,15 @@ qemudNetworkIfaceConnect(virConnectPtr c
> if (!network)
> return -1;
>
> + if (virNetworkGetExtbridgeData(network, &brname, &brmode) ==
0) {
> + tapfd = qemudPhysIfaceConnect(conn, net,
> + brname,
> + brmode);
> + VIR_FREE(brname);
> + VIR_FREE(brmode);
> + return tapfd;
> + }
> +
> brname = virNetworkGetBridgeName(network);
>
> virNetworkFree(network);
As mentioned earlier, we don't want to touch the
VIR_DOMAIN_NET_TYPE_NETWORK
case, since that's not bridging - its separate virtal networks
using
NAT.
If my idea above is acceptable, we'd be adding a new
VIR_DOMAIN_NET_TYPE_DIRECT
that we'd hook up.
Understood, but the redirection into a referenced XML does have its
advantages.
> Index: libvirt/src/util/macvtap.h
> ===================================================================
> --- /dev/null
> +++ libvirt/src/util/macvtap.h
> @@ -0,0 +1,48 @@
> +/*
> + * Copyright (C) 2010 IBM Corporation
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free
Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA
> + *
> + * Authors:
> + * Stefan Berger <stefanb(a)us.ibm.com>
> + */
> +
> +#ifndef __UTIL_MACVTAP_H__
> +#define __UTIL_MACVTAP_H__
> +
> +#include <config.h>
> +
> +#if defined(WITH_MACVTAP)
> +
> +#include "internal.h"
> +
> +int openMacvtapTap(virConnectPtr conn,
> + const char *ifname,
> + const unsigned char *macaddress,
> + const char *linkdev,
> + const char *mode,
> + char **res_ifname);
> +
> +int delMacvtapByMACAddress(virConnectPtr conn,
> + const unsigned char *macaddress);
As an interface for internal usage, this looks like a fine
start.
Thanks :-)
As said, I am not trying to delete the macvtap by interface name but by
MAC address in case libvirt forgot about the MAC address or there is a
stale interface still around. This is where this parameter comes from.
Maybe an ioctl to delete-macvtap-on-tap-close could solve the problem on
the device level later on.
> +
> +#endif /* WITH_MACVTAP */
> +
> +#define MACVTAP_MODE_PRIVATE_STR "private"
> +#define MACVTAP_MODE_VEPA_STR "vepa"
> +#define MACVTAP_MODE_BRIDGE_STR "bridge"
> +
> +
> +#endif
> Index: libvirt/src/Makefile.am
> ===================================================================
> --- libvirt.orig/src/Makefile.am
> +++ libvirt/src/Makefile.am
> @@ -55,6 +55,7 @@ UTIL_SOURCES = \
> util/ebtables.c util/ebtables.h \
> util/json.c util/json.h \
> util/logging.c util/logging.h \
> + util/macvtap.c util/macvtap.h \
> util/memory.c util/memory.h \
> util/pci.c util/pci.h \
> util/processinfo.c util/processinfo.h \
> @@ -784,12 +785,15 @@ if WITH_LINUX
> USED_SYM_FILES += libvirt_linux.syms
> endif
>
> +USED_SYM_FILES += libvirt_macvtap.syms
> +
> EXTRA_DIST += \
> libvirt_public.syms \
> libvirt_private.syms \
> libvirt_driver_modules.syms \
> libvirt_bridge.syms \
> - libvirt_linux.syms
> + libvirt_linux.syms \
> + libvirt_macvtap.syms
>
> BUILT_SOURCES = libvirt.syms
>
> Index: libvirt/src/util/macvtap.c
> ===================================================================
> --- /dev/null
> +++ libvirt/src/util/macvtap.c
> @@ -0,0 +1,664 @@
> +/*
> + * Copyright (C) 2010 IBM Corporation
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, write to the Free
Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
02111-1307 USA
> + *
> + * Authors:
> + * Stefan Berger <stefanb(a)us.ibm.com>
> + */
> +
> +#include <config.h>
> +
> +#if defined(WITH_MACVTAP)
[snip].
I've not had time to look at the details of this macvtap.c code yet,
but I assume its doing all you need :-) Is there any benefit to using
the network libnl.so library, rather than the ioctl()'s directly ?
Haven't looked at that library and its API, but can do so if it's
documented. Would it be ok to keep the current implementation, though?
I'm not too familiar with either, but we already use libnl.so
indirectly via netcf, so don't be afraid of adding a dependancy
on the libnl.so library if you consider it to be useful.
> Index: libvirt/src/conf/network_conf.c
> ===================================================================
> --- libvirt.orig/src/conf/network_conf.c
> +++ libvirt/src/conf/network_conf.c
This file shouldn't get any changes really.
> Index: libvirt/src/libvirt.c
> ===================================================================
> --- libvirt.orig/src/libvirt.c
> +++ libvirt/src/libvirt.c
> @@ -6034,6 +6034,48 @@ error:
> return NULL;
> }
>
> +
> +/**
> + * virNetworkGetExtbridgeData:
> + * @network: a network object
> + * @linkdev : pointer where name of the interface to connect to
the external
> + * bridge is returned
> + * @brmode : pointer where mode of the external bridge is returned
> + *
> + * Returns 0 in case an external bridge has been configured, -1
otherwise
> + */
> +int
> +virNetworkGetExtbridgeData(virNetworkPtr network,
> + char **linkdev, char **brmode)
> +{
> + virConnectPtr conn;
> + DEBUG("network=%p", network);
> +
> + virResetLastError();
> +
> + if (!VIR_IS_CONNECTED_NETWORK(network)) {
> + virLibNetworkError(NULL, VIR_ERR_INVALID_NETWORK,
__FUNCTION__);
> + virDispatchError(NULL);
> + return -1;
> + }
> +
> + conn = network->conn;
> +
> + if (conn->networkDriver &&
conn->networkDriver->networkGetExtbridgeData) {
> + int ret;
> + ret = conn->networkDriver->networkGetExtbridgeData(network,
> + linkdev,
> + brmode);
> + return ret;
> + }
> +
> + virLibConnError (conn, VIR_ERR_NO_SUPPORT, __FUNCTION__);
> +
> + virDispatchError(network->conn);
> + return -1;
> +}
Also don't think we need to be adding extra public APIs for this, if we
do it within scope of the domain XML.
Correct.
For the next submission, it'd be very helpful for review if you could
split it into 3 patches
1. Adding the extra domain XML syntax to src/conf/domain_conf.{h,c}
and docs/schemas/domain.rng
2. Adding the macvtap helper code in src/util/macvtap.{h,c}
3. Implementation for the QEMU driver to use the stuff from the first
two patches.
Will do.
Thanks and regards,
Stefan
Regards,
Daniel
--
|: Red Hat, Engineering, London -o-
http://people.redhat.com/berrange/ :|
http://ovirt.org :|
http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B
9505 :|