On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
>.....
>>
>>>>So I think we want to maintain a concept of the qemu backend (virtio,
>>>>e1000, etc), the fd that connects the qemu backend to the host (tap,
>>>>socket, macvtap, etc), and the bridge. The bridge bit gets a little
>>>>complicated. We have the following bridge cases:
>>>>
>>>>- sw bridge
>>>> - normal existing setup, w/ Linux bridging code
>>>> - macvlan
>>>>- hw bridge
>>>> - on SR-IOV card
>>>> - configured to simply fwd to external hw bridge (like VEPA mode)
>>>> - configured as a bridge w/ policies (QoS, ACL, port mirroring,
>>>> etc. and allows inter-guest traffic and looks a bit like above
>>>> sw switch)
>>>> - external
>>>> - need to possibly inform switch of incoming vport
>>>
>>>I've got mixed feelings here. With respect to sw vs. hw bridge, I
>>>really think that that's an implementation detail that should not be
>>>exposed to a user. A user doesn't typically want to think about whether
>>>they're using a hardware switch vs. software switch. Instead, they
>>>approach it from, I want to have this network topology, and these
>>>features enabled.
>>
>>Agree there is a lot of low level detail there, and I think it will be
>>very hard for users, or apps to gain enough knowledge to make intelligent
>>decisions about which they should use. So I don't think we want to expose
>>all that detail. For a libvirt representation we need to consider it more
>>in terms of what capabilities each option provides, rather than what
>>implementation each option uses.
>>
>
>Attached is some background information on VEPA bridging being discussed in
>this thread and then a proposal for defining it in libvirt xml.
>
>The 'Edge Virtual Bridging' (eVB) working group has proposed a mechanism to
>offload the bridging function from the server to a physical switch on
>the network. This is referred to as VEPA (Virtual Ethernet Port
>Aggregator). This is described here:
>
>http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf
>
>
>The VEPA mode implies that the virtual machines on a host communicate with
>each other via the physical switch on the network instead of the bridge in
>the Linux host. The filtering, quality of service enforcement, stats
>etc. are all done in the external switch.
>
>The newer NICs with embedded switches (such as SR-IOV cards) will
>also provide VEPA mode. This implies that the communication between two
>virtual functions on the same physical NIC will also require a packet to
>travel to the first hop switch on the network and then be reflected back.
>
>The 'macvlan' driver in Linux supports virtual interfaces that can be
>attached to virtual machine interfaces. This patch provides a tap backend
>to macvlan:
>http://marc.info/?l=linux-kernel&m=125986323631311&w=2
>If such an interface is used the packets will be forwarded directly onto
>the network, bypassing the host bridge. This is exactly what is
>required for VEPA mode.
>
>However, the 'macvlan' driver can support both VEPA and 'bridging'
>mode. The bridging in this case is among its virtual interfaces only.
>There is also a private mode in which the packets are transmitted to
>the network but are not forwarded among the VMs.
>
>Similarly, the sr-iov's embedded switch in the future will be settable
>as 'VEPA', or 'private' or 'bridging' mode.
>
>In the eVB working group the 'private' mode is referred to as PEPA, and
>the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same
>terms.
>
>The 'VEB' mode of macvlan or sr-iov is no different than the bridge
>in Linux. The behaviour of the networking/switching on the network is
>unaffected.
>
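As a rough illustration of the three modes described above, here is a minimal Python sketch of where a frame sent by a local VM goes first, and whether it can reach another VM on the same host. The names (Mode, next_hop, reaches_local_peer) are made up for this sketch and are not part of macvlan, sr-iov or libvirt. The PEPA case follows the reading that private mode never forwards among the VMs.

```python
from enum import Enum


class Mode(Enum):
    """Bridging modes, using the eVB terms from this mail (names are illustrative)."""
    VEB = "bridge"    # switching among the virtual interfaces on the host
    VEPA = "vepa"     # all switching done by the first-hop switch
    PEPA = "private"  # sent to the network, never forwarded among the VMs


def next_hop(mode: Mode, dst_is_local_vm: bool) -> str:
    """Return where a frame sent by a local VM goes first."""
    if mode is Mode.VEB and dst_is_local_vm:
        return "local"   # switched inside the host (or the embedded switch)
    return "uplink"      # handed to the first-hop switch on the network


def reaches_local_peer(mode: Mode, hairpin_enabled: bool) -> bool:
    """Can a frame reach another VM on the same host?"""
    if mode is Mode.VEB:
        return True              # delivered locally
    if mode is Mode.VEPA:
        return hairpin_enabled   # needs the switch to reflect it back
    return False                 # PEPA: no VM to VM traffic
```

For example, next_hop(Mode.VEPA, True) is "uplink" even though the destination is on the same host, which is why the external switch has to cooperate.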
>Changes in the first-hop adjacent Switch on the network:
>---------------------------------------------------------
>When the 'VEPA' (or PEPA) mode is used the packet switching
>occurs on the first hop switch. Therefore for VM to VM traffic, the
>first hop switch must support reflecting the packets back on the port
>on which they were received. This is referred to as the 'hairpin' or
>'reflective relay' mode.
>
>The IEEE 802.1 body is standardizing the protocol, with the switch
>vendors and various server vendors working on the standard. This
>is derived from the above mentioned eVB ('edge virtual bridging')
>working group.
>
>To enable easy testing the Linux bridge can be put into the 'reflective
>relay' (or hairpin) mode. The patches are included in 2.6.32. The mode
>can be set using sysfs or brctl commands (in latest bridge utils bits).
>
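For the sysfs route mentioned above, a minimal Python sketch of toggling hairpin mode on one port of a Linux bridge. It assumes the per-port attribute lives at /sys/class/net/<bridge>/brif/<port>/hairpin_mode and takes "1" or "0"; the helper names and the sysfs_root parameter (there only so the code can be tried against a scratch directory) are invented for this example. With a recent bridge-utils the equivalent should be "brctl hairpin <bridge> <port> on".

```python
from pathlib import Path


def hairpin_path(bridge: str, port: str, sysfs_root: str = "/sys") -> Path:
    """Build the assumed sysfs path of the per-port hairpin attribute."""
    return Path(sysfs_root) / "class" / "net" / bridge / "brif" / port / "hairpin_mode"


def set_hairpin(bridge: str, port: str, enabled: bool = True,
                sysfs_root: str = "/sys") -> Path:
    """Write 1 or 0 to the hairpin attribute (needs root on a real host)."""
    path = hairpin_path(bridge, port, sysfs_root)
    path.write_text("1\n" if enabled else "0\n")
    return path
```

On a test host this would be called as set_hairpin("br0", "eth0"), with br0 and eth0 standing in for whatever bridge and port are being tested.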
>In the future the switch vendors (in eVB group) expect to support both
>VEPA and VEB on the same switch port. That is, the Linux host can have
>some VMs using VEPA mode and some in VEB mode on the same outgoing
>uplink. This protocol is to be fully defined and will require more
>changes in the bridging function. The ethernet frame will carry tags to
>identify the packet streams (for VEPA or VEB ports). See chart 4 in the
>above linked IEEE document.
>
>However, from a libvirt definition point of view it implies that a
>'bridge' can be in multiple modes (VEPA or VEB or PEPA). An alternative
>is to define separate bridges handling VEB/VEPA or PEPA modes for the
>same 'sr-iov' or 'macvlan' backend.
>
>Determining the switch capability:
>---------------------------------
>The Linux host can determine whether the remote bridge supports
>'hairpin' mode, and also set this capability, through a low level
>protocol (DCBx) being extended in the above eVB working group.
>Some drivers (for NICs/CNAs) are likely to do this determination
>themselves and make the information available to the hypervisor/Linux
>host.
>
>Summary:
>--------
>
>Based on the above, a virtual machine might be defined to work with the
>Linux/hypervisor bridge, with 'macvlan' in bridge, vepa or pepa mode,
>or with an sr-iov virtual function with switching in bridge, vepa or
>pepa mode.
>
>
>Proposal:
>--------
>
>To support the above combinations we need to be able to define the bridge
>to be used, the 'uplink' it is associated with, and the interface type
>that the VM will use to connect to the bridge.
>
>Currently in libvirt we define a bridge and can associate an ethernet
>with it (which is the uplink to the network). In the 'macvlan' and the
>'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it
>is embedded in the 'nic', and in the case of macvlan the function is
>enabled when the virtual interface is created.
>
>Describing the bridge and modes:
>--------------------------------
>So, we can define the bridge function using a new type or maybe extend
>the bridge.xml itself.
>
><interface type='bridge' name='br0'>
><bridge>
><type='hypervisor|embedded|ethernet'/> //hypervisor is default
><mode='all|VEPA|PEPA|VEB'/> // 'all' is default if supported.
><interface type='ethernet' name='eth0'/>
></bridge>
></interface>
Does this really map to how VEPA works?
For a physical bridge, you create a br0 network interface that also has
eth0 as a component.
With VEPA and macv{lan,tap}, you do not create a single "br0"
interface. Instead, for the given physical port, you create interfaces
for each tap device and hand them over. IOW, while something like:
<interface type='bridge' name='br0'>
<bridge>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>
Makes sense, the following wouldn't:
<interface type='bridge' name='br0'>
<bridge mode='VEPA'>
<interface type='ethernet' name='eth0'/>
<interface type='ethernet' name='eth1'/>
</bridge>
</interface>
I think the only use of the interface tag that would make sense is:
<interface type='ethernet' name='eth0'>
<vepa/>
</interface>
You can imagine doing something similar with SR-IOV:
<interface type='ethernet' name='eth0'>
<sr-iov/>
</interface>
This seems like overkill to me - we don't need to manage these as
top level objects, as we would with traditional bridges. I'd think
we can keep the config solely within the realm of the domain XML,
and create/delete the macvlan/macvtap devices on the fly, as we do
with plain TAP devices today.
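To make the "on the fly" idea concrete, a small Python sketch that only builds the iproute2 command lines for creating and deleting such a device and does not run them. It assumes an iproute2 that accepts "type macvlan|macvtap mode vepa|bridge|private"; the function names and the device names used below are placeholders for this example and say nothing about how libvirt would actually do it.

```python
KINDS = ("macvlan", "macvtap")
MODES = ("vepa", "bridge", "private")


def macv_link_add_argv(lower_dev: str, name: str,
                       kind: str = "macvtap", mode: str = "vepa") -> list:
    """Build the argv for creating a macvlan/macvtap device on lower_dev."""
    if kind not in KINDS:
        raise ValueError(f"unknown device kind: {kind}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return ["ip", "link", "add", "link", lower_dev,
            "name", name, "type", kind, "mode", mode]


def macv_link_del_argv(name: str) -> list:
    """Build the argv for deleting the device again at guest shutdown."""
    return ["ip", "link", "del", "dev", name]
```

The resulting lists could be handed to subprocess.run() at guest start and stop, e.g. macv_link_add_argv("eth0", "macvtap0") and later macv_link_del_argv("macvtap0").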
and in the guest:
<interface type='direct'>
<source physical='eth0'/>
...
</interface>
I like the simplicity of just having this in the guest XML and a way
to just indicate macvlan vs macvtap somehow.
Daniel
--
|: Red Hat, Engineering, London -o-
:|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|