On Mon, Jan 25, 2010 at 12:47:17PM -0500, Stefan Berger wrote:
Hello!
The attached patch provides support for the Linux macvtap device for
Qemu by passing a file descriptor to Qemu command line similar to how it
is done with a regular tap device. I have modified the network XML code
to understand a definition as the following one here:
<network>
<name>vepanet</name>
<uuid>4ebd5168-6321-4757-8397-f6e83484f402</uuid>
<extbridge mode='vepa' dev='eth0'/>
</network>
I don't think this is the correct place to be adding this kind
of configuration / functionality. The virNetworkPtr / <network>
XML is describing a virtual network capability which is *not*
directly connected to the LAN. It may be configured to route
from the virtual network to the LAN, with optional NAT applied.
So while the implementation may use a bridge device, this bridge
is not connected to any physical device. Since VEPA is about
directly connecting VMs to the LAN, this doesn't really fit here.
I'm not sure how familiar people are with the current libvirt
object model, so here's a 30 second overview
- virDomainPtr / <domain> - management & configuration of
the virtual machines & their hardware.
- virNetworkPtr / <network> - virtual networks, either isolated
or routed (+NAT) to the LAN
- virInterfacePtr / <interface> - host interface configuration
for physical devices, bridge, bonding & vlan devices
- virNodeDevPtr / <device> - host device enumeration / discovery
One warning..
- The <domain> XML also contains a element called <interface> which
declares a guest NIC. Be careful not to mix up with the similarly
named virInterfacePtr XML also using <interface> refering to a
host NIC.
In the context of bridging a guest to a plain ethernet device, these
fit together as follows
1. The virNodeDevPtr APIs are used to discover what physical network
devices exist, 'eth0'
2. The virInterfacePtr APIs are used to create a bridge on the host
br0, containing the physical device 'eth0'
3. The virDomainPtr APIs are used to define a guest configuration
using guest NIC <interface type='bridge'> element, whcih uses
a plain TAP device.
The logical setup of devices is
/-> tap0 -> vm0
pcidev0 -> eth0 -> br0 --> tap1 -> vm1
\-> tap2 -> vm2
Now, throw SR-IOV into the picture....
With SR-IOV there is a single PCI card, with multiple virtual functions.
The number of virtual functions is fixed at time of kernel module load
so there's no live configuration available from userspace. Every virtual
function shows up as a new PCI device in the virNodeDevPtr APIs, and each
of those has an associated ethX device visible too.
So with SR-IOV plus traditional bridging the same 3 step process above
more or less applies unchanged, creating a bridge for each SR-IOV
virtual function ethX device & then attaching guests using TAP.
The logical setup of devices is a little different
pcidev0 -> eth0 -> br0 -> tap0 -> vm0
|
pcidev1 -> eth1 -> br1 -> tap1 -> vm1
|
pcidev2 -> eth2 -> br2 -> tap2 -> vm2
and you can correlate pcidev0/1/2 to determine they are all virtual
function. Of course there's no reason this has to restrict itself
to a 1 bridge == 1 guest arrangement - i've just kept this diagram
simple.
This XML indicates the device that links to the external bridge, here
eth0, and the mode this device is supposed to be set into, here 'vepa'.
If 'extbridge' is found in the XML, all other XML elements that have
been used so far, i.e., bridge, ip, dhcp, etc., are ignored.
The above network description can then be referenced from the virtual
machine definition using the typical interface description of type
'network':
<interface type='network'>
<source network='vepanet'/>
<model type='virtio'/>
</interface>
The above network XML holds the necessary parameters to open a macvtap
device. Using command line, one would issue the following command to
achieve the same what now libvirt does internally, using a patched
version of the 'ip' command:
ip link add link eth0 <optional name of if> type macvtap mode vepa
This then creates the network interface, i.e., macvtap12, along with
entries in /sys/class/net/macvtap12/ where the content of the ifindex
file returns the integer necessary to build the /dev/tap%d device to
open the file descriptor of and pass to qemu via command line argument.
So every guest VM NIC gets a new uniqely named macvtap device, and
and unqiue TAP device ?
IIUC, with a macvlan + a regular NIC we'd thus be seeing something
like this arrangement ..
/-> macvtap0 -> tap0 -> vm0
pcidev0 -> eth0 --> macvtap1 -> tap1 -> vm1
\-> macvtap2 -> tap2 -> vm2
I guess you could also use this with an SR-IOV card too
pcidev0 -> eth0 -> macvtap0 -> tap0 -> vm0
|
pcidev1 -> eth1 -> macvtap1 -> tap1 -> vm1
|
pcidev2 -> eth2 -> macvtap2 -> tap2 -> vm2
Also, again there's no reason why we'd need to restrict ourselves
to only using a single macvtap device with each ethX device / virtual
function.
The core question to me is whether we need to introduce any new
configuration parameters at the host level. It looks to me like
it can all be done against the guest XML <interface> element.
Specifically, I don't see a need to manage the macvtapN devices
directly via the virInterfacePtr APIs. This is good, because
that is all delegated to the netcf library, which in turn is
just a way to read/write the /etc/sysconfig/networking-scripts.
These ifcfg-XXX files have no knowledge of macvtap, and so
plumbing all that through the networking-scripts, netcf and
libvirt virInterfacePtr APIs is non-trivial work we'd be better
off avoiding unless there was a clear compelling need.
So unless I'm missing something major in my reasoning here I think
in the domain XML we end up with two possible configs for guest
network interfaces
1. The current one using plain Linux software bridging, which
we can't change in an incompatible way
<interface type='bridge'/>
<source bridge='br0'/>
<target dev='vnet0'/>
</interface>
Here, the source device is a bridge previously setup
to have a physical device enslaved (regular or SR-IOV)
The target device is the plain TAP device
2. A new one using hardware bridging, which we can freely
define for our new needs
<interface type='direct'/>
<source dev='eth0' mode='vepa|pepa|bridge'/>
<target dev='vnet0'/>
</interface>
Here, source device is a physical device (regular or
SR-IOV). The target device is a macvtap device.
In both cases the TAP or macvtap device is created on the fly when the
VM is booted & destroyed at shutdown (either by the kernel, or manually
by libvirt for macvtap).
Both configs are ultimately a form of 'bridging', the difference is
that the former is focused on software bridges, as explicitly managed
objects separate from the physical device, while the latter is hardware
bridges which don't need to be managed separately.
Some details:
In libvirt I am first searching for an unused interface name following
the template macvtap%d unless the user has provided a name explicitly
or the previous macvtap%d is now taken by another VM. Once that has
succeeded, I follow the path through the filesystem to open the
corresponding /dev/tap%d.
Unlike the regular tap device, where the network interface disappears
once the corresponding file descriptor is closed, the macvtap device
needs explicit tear-down. So, when a VM terminates, I am deleting all
macvtap type interface with the MAC address as the interface of the
terminating VM.
Ok, this plan all makes sense. The only thing is that if we want to
allow the user to manually specify a macvtap name, we should declare
that they are not allowed to use the prefix 'macvtap' for their name.
We do similar with TAP, where we declare 'vnet' is a reserved prefix
for auto-generated names.
Index: libvirt/src/qemu/qemu_conf.c
===================================================================
--- libvirt.orig/src/qemu/qemu_conf.c
+++ libvirt/src/qemu/qemu_conf.c
@@ -52,6 +52,7 @@
#include "nodeinfo.h"
#include "logging.h"
#include "network.h"
+#include "macvtap.h"
#include "cpu/cpu.h"
#define VIR_FROM_THIS VIR_FROM_QEMU
@@ -1384,6 +1385,34 @@ int qemudExtractVersion(virConnectPtr co
}
+static int
+qemudPhysIfaceConnect(virConnectPtr conn,
+ virDomainNetDefPtr net,
+ char *linkdev,
+ char *brmode)
+{
+ int rc;
+#if defined(WITH_MACVTAP)
+ char *res_ifname = NULL;
+ delMacvtapByMACAddress(conn, net->mac);
+ rc = openMacvtapTap(conn, net->ifname, net->mac, linkdev, brmode,
+ &res_ifname);
+ if (rc > 0) {
+ VIR_FREE(net->ifname);
+ net->ifname = res_ifname;
+ }
+#else
+ (void)net;
+ (void)linkdev;
+ (void)brmode;
+ qemudReportError(conn, NULL, NULL, VIR_ERR_INTERNAL_ERROR,
+ "%s", _("No support for macvtap device"));
+ rc = -1;
+#endif
+ return rc;
+}
+
+
int
qemudNetworkIfaceConnect(virConnectPtr conn,
struct qemud_driver *driver,
@@ -1395,6 +1424,7 @@ qemudNetworkIfaceConnect(virConnectPtr c
int tapfd = -1;
int vnet_hdr = 0;
int template_ifname = 0;
+ char *brmode = NULL;
if (net->type == VIR_DOMAIN_NET_TYPE_NETWORK) {
virNetworkPtr network = virNetworkLookupByName(conn,
@@ -1402,6 +1432,15 @@ qemudNetworkIfaceConnect(virConnectPtr c
if (!network)
return -1;
+ if (virNetworkGetExtbridgeData(network, &brname, &brmode) == 0) {
+ tapfd = qemudPhysIfaceConnect(conn, net,
+ brname,
+ brmode);
+ VIR_FREE(brname);
+ VIR_FREE(brmode);
+ return tapfd;
+ }
+
brname = virNetworkGetBridgeName(network);
virNetworkFree(network);
As mentioned earlier, we don't want to touch the VIR_DOMAIN_NET_TYPE_NETWORK
case, since that's not bridging - its separate virtal networks using NAT.
If my idea above is acceptable, we'd be adding a new VIR_DOMAIN_NET_TYPE_DIRECT
that we'd hook up.
Index: libvirt/src/util/macvtap.h
===================================================================
--- /dev/null
+++ libvirt/src/util/macvtap.h
@@ -0,0 +1,48 @@
+/*
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Authors:
+ * Stefan Berger <stefanb(a)us.ibm.com>
+ */
+
+#ifndef __UTIL_MACVTAP_H__
+#define __UTIL_MACVTAP_H__
+
+#include <config.h>
+
+#if defined(WITH_MACVTAP)
+
+#include "internal.h"
+
+int openMacvtapTap(virConnectPtr conn,
+ const char *ifname,
+ const unsigned char *macaddress,
+ const char *linkdev,
+ const char *mode,
+ char **res_ifname);
+
+int delMacvtapByMACAddress(virConnectPtr conn,
+ const unsigned char *macaddress);
As an interface for internal usage, this looks like a fine
start.
+
+#endif /* WITH_MACVTAP */
+
+#define MACVTAP_MODE_PRIVATE_STR "private"
+#define MACVTAP_MODE_VEPA_STR "vepa"
+#define MACVTAP_MODE_BRIDGE_STR "bridge"
+
+
+#endif
Index: libvirt/src/Makefile.am
===================================================================
--- libvirt.orig/src/Makefile.am
+++ libvirt/src/Makefile.am
@@ -55,6 +55,7 @@ UTIL_SOURCES = \
util/ebtables.c util/ebtables.h \
util/json.c util/json.h \
util/logging.c util/logging.h \
+ util/macvtap.c util/macvtap.h \
util/memory.c util/memory.h \
util/pci.c util/pci.h \
util/processinfo.c util/processinfo.h \
@@ -784,12 +785,15 @@ if WITH_LINUX
USED_SYM_FILES += libvirt_linux.syms
endif
+USED_SYM_FILES += libvirt_macvtap.syms
+
EXTRA_DIST += \
libvirt_public.syms \
libvirt_private.syms \
libvirt_driver_modules.syms \
libvirt_bridge.syms \
- libvirt_linux.syms
+ libvirt_linux.syms \
+ libvirt_macvtap.syms
BUILT_SOURCES = libvirt.syms
Index: libvirt/src/util/macvtap.c
===================================================================
--- /dev/null
+++ libvirt/src/util/macvtap.c
@@ -0,0 +1,664 @@
+/*
+ * Copyright (C) 2010 IBM Corporation
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ * Authors:
+ * Stefan Berger <stefanb(a)us.ibm.com>
+ */
+
+#include <config.h>
+
+#if defined(WITH_MACVTAP)
[snip].
I've not had time to look at the details of this macvtap.c code yet,
but I assume its doing all you need :-) Is there any benefit to using
the network libnl.so library, rather than the ioctl()'s directly ?
I'm not too familiar with either, but we already use libnl.so
indirectly via netcf, so don't be afraid of adding a dependancy
on the libnl.so library if you consider it to be useful.
Index: libvirt/src/conf/network_conf.c
===================================================================
--- libvirt.orig/src/conf/network_conf.c
+++ libvirt/src/conf/network_conf.c
This file shouldn't get any changes really.
Index: libvirt/src/libvirt.c
===================================================================
--- libvirt.orig/src/libvirt.c
+++ libvirt/src/libvirt.c
@@ -6034,6 +6034,48 @@ error:
return NULL;
}
+
+/**
+ * virNetworkGetExtbridgeData:
+ * @network: a network object
+ * @linkdev : pointer where name of the interface to connect to the external
+ * bridge is returned
+ * @brmode : pointer where mode of the external bridge is returned
+ *
+ * Returns 0 in case an external bridge has been configured, -1 otherwise
+ */
+int
+virNetworkGetExtbridgeData(virNetworkPtr network,
+ char **linkdev, char **brmode)
+{
+ virConnectPtr conn;
+ DEBUG("network=%p", network);
+
+ virResetLastError();
+
+ if (!VIR_IS_CONNECTED_NETWORK(network)) {
+ virLibNetworkError(NULL, VIR_ERR_INVALID_NETWORK, __FUNCTION__);
+ virDispatchError(NULL);
+ return -1;
+ }
+
+ conn = network->conn;
+
+ if (conn->networkDriver &&
conn->networkDriver->networkGetExtbridgeData) {
+ int ret;
+ ret = conn->networkDriver->networkGetExtbridgeData(network,
+ linkdev,
+ brmode);
+ return ret;
+ }
+
+ virLibConnError (conn, VIR_ERR_NO_SUPPORT, __FUNCTION__);
+
+ virDispatchError(network->conn);
+ return -1;
+}
Also don't think we need to be adding extra public APIs for this, if we
do it within scope of the domain XML.
For the next submission, it'd be very helpful for review if you could
split it into 3 patches
1. Adding the extra domain XML syntax to src/conf/domain_conf.{h,c}
and docs/schemas/domain.rng
2. Adding the macvtap helper code in src/util/macvtap.{h,c}
3. Implementation for the QEMU driver to use the stuff from the first
two patches.
Regards,
Daniel
--
|: Red Hat, Engineering, London -o-
http://people.redhat.com/berrange/ :|
|:
http://libvirt.org -o-
http://virt-manager.org -o-
http://ovirt.org :|
|:
http://autobuild.org -o-
http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|