Re: [libvirt] Supporting vhost-net and macvtap in libvirt for QEMU

25 Jan 2010


      On Mon, Jan 25, 2010 at 11:38:15AM -0600, Anthony Liguori wrote:
...
On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
...
So I think we want to maintain a concept of the qemu backend (virtio,
e1000, etc), the fd that connects the qemu backend to the host (tap,
socket, macvtap, etc), and the bridge.  The bridge bit gets a little
complicated.  We have the following bridge cases:
- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge
  - on SR-IOV card
    - configured to simply fwd to external hw bridge (like VEPA mode)
    - configured as a bridge w/ policies (QoS, ACL, port mirroring,
      etc. and allows inter-guest traffic and looks a bit like above
      sw switch)
  - external
    - need to possibly inform switch of incoming vport
I've got mixed feelings here.  With respect to sw vs. hw bridge, I
really think that that's an implementation detail that should not be
exposed to a user.  A user doesn't typically want to think about whether
they're using a hardware switch vs. software switch.  Instead, they
approach it from, I want to have this network topology, and these
features enabled.
Agree there is a lot of low level detail there, and I think it will be
very hard for users, or apps, to gain enough knowledge to make intelligent
decisions about which they should use. So I don't think we want to expose
all that detail. For a libvirt representation we need to consider it more
in terms of what capabilities each option provides, rather than what
implementation each option uses.
Attached is some background information on VEPA bridging being discussed
in this thread and then a proposal for defining it in libvirt xml.
The 'Edge Virtual Bridging' (eVB) working group has proposed a mechanism
to offload the bridging function from the server to a physical switch on
the network. This is referred to as VEPA (Virtual Ethernet Port
Aggregator). This is described here:
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-...
The VEPA mode implies that the virtual machines on a host communicate
with each other via the physical switch on the network instead of the
bridge in the Linux host.  The filtering, quality of service enforcement,
stats etc. are all done in the external switch.
The newer NICs with embedded switches (such as SR-IOV cards) will
also provide VEPA mode. This implies that the communication between two
virtual functions on the same physical NIC will also require a packet to
travel to the first hop switch on the network and then be reflected back.
The 'macvlan' driver in Linux supports virtual interfaces that can be
attached to virtual machine interfaces. This patch provides a tap backend
to macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2. If
such an interface is used the packets will be forwarded directly onto
the network, bypassing the host bridge. This is exactly what is
required for VEPA mode.
However, the 'macvlan' driver can support both VEPA and 'bridging'
mode. The bridging in this case is among its virtual interfaces only.
There is also a private mode in which the packets are transmitted to
the network but are not forwarded among the VMs.
Similarly, the sr-iov's embedded switch in the future will be settable
as 'VEPA', or 'private' or 'bridging' mode.
In the eVB working group the 'private' mode is referred to as PEPA, and
the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same
terms.
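The mapping between the eVB terms and the macvlan mode names can be sketched as below. This is only an illustration, not part of the proposal: the helper name `macvlan_add_argv` is hypothetical, and it assumes an iproute2 build whose `ip link add ... type macvlan|macvtap mode ...` accepts the `vepa`, `private` and `bridge` keywords. The function only builds the argument list; it does not run anything.

```python
# Hypothetical helper: translate eVB terminology into the mode keyword
# used by the macvlan/macvtap drivers and build an `ip link add` command.
EVB_TO_MACVLAN_MODE = {
    "VEPA": "vepa",     # all traffic goes out to the adjacent switch
    "PEPA": "private",  # sent to the network, never forwarded between VMs
    "VEB": "bridge",    # bridging among the macvlan interfaces themselves
}


def macvlan_add_argv(lower_dev, name, evb_mode, kind="macvlan"):
    """Return the argv for creating a macvlan/macvtap device on lower_dev."""
    if kind not in ("macvlan", "macvtap"):
        raise ValueError("kind must be 'macvlan' or 'macvtap'")
    try:
        mode = EVB_TO_MACVLAN_MODE[evb_mode.upper()]
    except KeyError:
        raise ValueError("unknown eVB mode: %r" % (evb_mode,))
    return ["ip", "link", "add", "link", lower_dev,
            "name", name, "type", kind, "mode", mode]
```

For example, `macvlan_add_argv("eth0", "macvtap0", "VEPA", kind="macvtap")` yields the arguments for a VEPA-mode macvtap device on top of eth0, which could then be passed to `subprocess.run` by a privileged caller.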
The 'VEB' mode of macvlan or sr-iov is no different than the bridge
in Linux. The behaviour of the networking/switching on the network is
unaffected.
Changes in the first-hop adjacent Switch on the network:
---------------------------------------------------------
When the 'VEPA' (or PEPA) mode is used the packet switching occurs
on the first hop switch. Therefore for VM to VM traffic, the first hop
switch must support reflecting the packets back on the port on which
they were received. This is referred to as the 'hairpin' or 'reflective
relay' mode.
The IEEE 802.1 body is standardizing on the protocol with the switch
vendors, and various other server vendors working on the standard. This
is derived from the above mentioned eVB ('edge virtual bridging')
working group.
To enable easy testing the Linux bridge can be put into the 'reflective
relay' (or hairpin) mode. The patches are included in 2.6.32. The mode 
can be set using sysfs or brctl commands (in latest bridge utils bits).
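As a rough sketch of the two ways of setting the mode for testing: the per-port attribute is assumed to live at `/sys/class/net/<port>/brport/hairpin_mode`, and the brctl form is assumed to be `brctl hairpin <bridge> <port> on|off`. The function names are hypothetical, and the `sysfs_root` parameter exists only so the code can be tried against a scratch directory instead of the real /sys.

```python
from pathlib import Path


def hairpin_sysfs_path(port, sysfs_root="/sys"):
    """Path of the hairpin attribute for a device enslaved to a bridge."""
    return Path(sysfs_root) / "class" / "net" / port / "brport" / "hairpin_mode"


def set_hairpin(port, enabled, sysfs_root="/sys"):
    """Write 1 or 0 to the port's hairpin_mode attribute (needs root on /sys)."""
    hairpin_sysfs_path(port, sysfs_root).write_text("1" if enabled else "0")


def brctl_hairpin_argv(bridge, port, enabled):
    """Equivalent brctl invocation, returned as an argv list (not executed)."""
    return ["brctl", "hairpin", bridge, port, "on" if enabled else "off"]
```

With eth0 already added to br0, `set_hairpin("eth0", True)` and running `brctl_hairpin_argv("br0", "eth0", True)` are meant to have the same effect: frames received on eth0 may be sent back out of eth0.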
In the future the switch vendors (in eVB group) expect to support both
VEPA and VEB on the same switch port. That is the Linux host can have
some VM's using VEPA mode and some in VEB mode on the same outgoing
uplink. This protocol is to be fully defined and will require more
changes in the bridging function. The ethernet frame will carry tags to
identify the packet streams (for VEPA or VEB ports). See chart 4 in the
above linked IEEE document.
However, from a libvirt definition point of view it implies that a
'bridge' can be in multiple modes (VEPA or VEB or PEPA). An alternative
is to define separate bridges handling VEB/VEPA or PEPA modes for the
same 'sr-iov' or 'macvlan' backend.
Determining the switch capability:
---------------------------------
The Linux host can determine (and set) whether the remote bridge
supports 'hairpin' mode and also set this capability through a low level
protocol (DCBx) being extended in the above eVB working group.
Some drivers (for NICs/CNAs) are likely to do this determination
themselves and make the information available to the hypervisor/Linux
host.
Summary:
--------
Based on above a virtual machine might be defined to work with the
Linux/hypervisor bridge, with the 'macvlan' in bridge, vepa/pepa modes,
or with sr-iov virtual function with switching in bridge, or vepa/pepa
modes.
Proposal:
--------
To support the above combinations we need to be able to define the bridge
to be used, the 'uplink' it is associated with, and the interface type
that the VM will use to connect to the bridge.
Currently in libvirt we define a bridge and can associate an ethernet
with it (which is the uplink to the network). In the 'macvlan' and the
'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it
is embedded in the 'nic', and in the case of macvlan the function is
enabled when the virtual interface is created.
Describing the bridge and modes:
--------------------------------
So, we can define the bridge function using a new type or maybe extend
the bridge.xml itself.
<interface type='bridge' name='br0'>
<bridge>
<type='hypervisor|embedded|ethernet'/> //hypervisor is default
<mode='all|VEPA|PEPA|VEB'/>          // 'all' is default if supported.
<interface type='ethernet' name='eth0'/>
</bridge>
</interface>
Does this really map to how VEPA works?
For a physical bridge, you create a br0 network interface that also has 
eth0 as a component.
With VEPA and macv{lan,tap}, you do not create a single "br0" 
interface.  Instead, for the given physical port, you create interfaces 
for each tap device and hand them over.  IOW, while something like:
<interface type='bridge' name='br0'>
<bridge>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>
Makes sense, the following wouldn't:
<interface type='bridge' name='br0'>
<bridge mode='VEPA'>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>
I think the only use of the interface tag that would make sense is:
<interface type='ethernet' name='eth0'>
<vepa/>
</interface>
You can imagine doing something similar with SR-IOV:
<interface type='ethernet' name='eth0'>
<sr-iov/>
</interface>
This seems like overkill to me - we don't need to manage these as
top level objects, as we would with traditional bridges.  I'd think
we can keep the config solely within the realm of the domain XML,
and create/delete the macvlan/macvtap devices on the fly, as we do
with plain TAP devices today.
...
and in the guest:
<interface type='direct'>
<source physical='eth0'/>
  ...
</interface>
I like the simplicity of just having this in the guest XML and a way
to just indicate macvlan vs macvtap somehow. 
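
To make the guest-only approach concrete, here is a minimal sketch of how a management tool might generate that element. It uses only the element and attribute names quoted above (`type='direct'`, `source physical=...`); the function name is hypothetical, and how macvlan vs macvtap would be indicated is left out because the thread has not settled it.

```python
import xml.etree.ElementTree as ET


def direct_interface_xml(physical):
    """Build the guest <interface type='direct'> element for one host NIC."""
    iface = ET.Element("interface", {"type": "direct"})
    # The host device the macvlan/macvtap device would be created on.
    ET.SubElement(iface, "source", {"physical": physical})
    return ET.tostring(iface, encoding="unicode")
```

Calling `direct_interface_xml("eth0")` returns the fragment as a string ready to be placed in the devices section of a domain definition.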

Daniel
-- 
|: Red Hat, Engineering, London   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange