[libvirt] Re: Supporting vhost-net and macvtap in libvirt for QEMU

18 Dec 2009

Chris Wright wrote:
...
* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
...
There are two modes worth supporting for vhost-net in libvirt.  The
first mode is where vhost-net backs to a tun/tap device.  This
behaves in very much the same way that -net tap behaves in qemu
today.  Basically, the difference is that the virtio backend is in
the kernel instead of in qemu so there should be some performance
improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an
already open fd to a tun/tap device.  I suspect that after we merge
vhost-net, libvirt could support vhost-net in this mode by just
doing -net vhost,fd=X.  I think the only real question for libvirt
is whether to provide a user visible switch to use vhost or to just
always use vhost when it's available and it makes sense.
Personally, I think the latter makes sense.
Doesn't sound useful.  Low-level, sure worth being able to turn things
on and off for testing/debugging, but probably not something a user
should be burdened with in libvirt.
But I don't understand your -net vhost,fd=X, that would still be -net
tap,fd=X, no?  IOW, vhost is an internal qemu impl. detail of the virtio
backend (or if you get your wish, $nic_backend).
I don't want to get bogged down in a qemu-devel discussion on 
libvirt-devel :-)

But from a libvirt perspective, I assume that it wants to open up 
/dev/vhost in order to not have to grant the qemu instance privileges 
which means that it needs to hand qemu the file descriptor to it.

Given a file descriptor, I don't think qemu can easily tell whether it's 
a tun/tap fd or whether it's a vhost fd.  Since they have different 
interfaces, we need libvirt to tell us which one it is.  Whether that's 
-net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
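
To make the fd hand-off above concrete, here is a minimal Python sketch of a management process that opens a descriptor, lets a child inherit it, and names the fd type on the command line. It is an illustration under assumptions, not libvirt's or qemu's code: net_arg, spawn_with_fd and demo are hypothetical helpers, a pipe stands in for the tun/tap or vhost fd, and only the "tap,fd=X" form comes from the thread (any other kind string is a placeholder, since that syntax is still undecided).

```python
import os
import subprocess
import sys


def net_arg(kind, fd):
    """Build a -net option for an inherited fd.

    kind="tap" gives the -net tap,fd=X form libvirt uses today.  Any
    other kind (for example "vhost") is a placeholder only.
    """
    return ["-net", "%s,fd=%d" % (kind, fd)]


def spawn_with_fd(argv, fd):
    """Start a child process that inherits fd under the same number."""
    return subprocess.Popen(argv, pass_fds=(fd,))


def demo():
    """Pass the write end of a pipe to a child and read what it wrote."""
    r, w = os.pipe()
    # The child only learns the fd number from its arguments, which is
    # why the parent also has to say what kind of fd it is.
    child = [sys.executable, "-c",
             "import os, sys; os.write(int(sys.argv[1]), b'ok')", str(w)]
    spawn_with_fd(child, w).wait()
    os.close(w)
    data = os.read(r, 2)
    os.close(r)
    return data
```

The child never opens the device itself, so it needs no extra privileges; the cost is that the parent must describe the descriptor explicitly, which is the point made above.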
...
...
The more interesting invocation of vhost-net though is one where the
vhost-net device backs directly to a physical network card.  In this
mode, vhost should get considerably better performance than the
current implementation.  I don't know the syntax yet, but I think
it's reasonable to assume that it will look something like -net
tap,dev=eth0.   The effect will be that eth0 is dedicated to the
guest.
tap?  we'd want either macvtap or raw socket here.
I screwed up.  I meant to say, -net vhost,dev=eth0.  But maybe it 
doesn't matter if libvirt is the one that initializes the vhost device, 
sets up the raw socket (or macvtap), and hands us a file descriptor.

In general, I think it's best to avoid as much network configuration in 
qemu as humanly possible so I'd rather see libvirt configure the vhost 
device ahead of time and pass us an fd that we can start using.
...
...
On most modern systems, there is a small number of network devices
so this model is not all that useful except when dealing with SR-IOV
adapters.  In that case, each physical device can be exposed as many
virtual devices (VFs).  There are a few restrictions here though.
The biggest is that currently, you can only change the number of VFs
by reloading a kernel module so it's really a parameter that must be
set at startup time.
I think there are a few ways libvirt could support vhost-net in this
second mode.  The simplest would be to introduce a new tag similar
to <source network='br0'>.  In fact, if you probed the device type
for the network parameter, you could probably do something like
<source network='eth0'> and have it Just Work.
We'll need to keep track of more than just the other en
We need to 0
Is something missing here?
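
The idea of probing the device type behind a name like 'br0' or 'eth0' could be sketched roughly as below. This is a heuristic under stated assumptions, not what libvirt does: probe_netdev is a hypothetical helper, and it relies on Linux sysfs exposing a "bridge" directory for bridge devices and a "device" link for hardware-backed interfaces (which also matches some paravirtual NICs). The sysfs root is a parameter so the function can be exercised against a fake tree.

```python
import os


def probe_netdev(name, sysfs="/sys/class/net"):
    """Classify a network interface by looking at its sysfs entry.

    Returns "missing", "bridge", "physical" or "virtual".
    """
    path = os.path.join(sysfs, name)
    if not os.path.isdir(path):
        return "missing"
    # Bridge devices carry a "bridge" subdirectory with their settings.
    if os.path.isdir(os.path.join(path, "bridge")):
        return "bridge"
    # Interfaces backed by a bus device have a "device" link.
    if os.path.exists(os.path.join(path, "device")):
        return "physical"
    return "virtual"
```

A caller could then attach the guest to a bridge when the result is "bridge" and fall back to the dedicated-NIC mode when it is "physical".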
...
...
Another model would be to have libvirt see an SR-IOV adapter as a
network pool where it handled all of the VF management.
Considering how inflexible SR-IOV is today, I'm not sure whether
this is the best model.
We already need to know the VF<->PF relationship.  For example, we don't
want to assign a VF to a guest, then a PF to another guest for basic
sanity reasons.  As we get better ability to manage the embedded switch
in an SR-IOV NIC we will need to manage them as well.  So we do need
to have some concept of managing an SR-IOV adapter.
But we still need to support the notion of backing a VNIC to a NIC, no?  
If this just happens to also work with a naive usage of SR-IOV, is that 
so bad? :-)

Long term, yes, I think you want to manage SR-IOV adapters as if they're 
a network pool.  But since they're sufficiently inflexible right now, 
I'm not sure it's all that useful today.
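
The VF<->PF sanity rule discussed above (do not hand out a PF while one of its VFs is in use, or the reverse) can be sketched as a small bookkeeping class. This is only an illustration: SriovAdapter and its method names are hypothetical, device names are plain strings, and it deliberately treats each device as exclusive to one guest, which ignores the shared-VF case.

```python
class SriovAdapter:
    """Track which guest owns the PF or the VFs of one SR-IOV adapter."""

    def __init__(self, pf, vfs):
        self.pf = pf
        self.vfs = set(vfs)
        self.owner = {}  # device name -> guest name

    def assign(self, dev, guest):
        """Assign dev to guest, refusing combinations that make no sense."""
        if dev != self.pf and dev not in self.vfs:
            raise KeyError(dev)
        if dev in self.owner:
            raise ValueError("%s is already assigned" % dev)
        # The PF cannot be given away while any of its VFs is in use.
        if dev == self.pf and any(d in self.vfs for d in self.owner):
            raise ValueError("PF has VFs assigned to guests")
        # A VF cannot be given away once the PF belongs to a guest.
        if dev in self.vfs and self.pf in self.owner:
            raise ValueError("PF is assigned to a guest")
        self.owner[dev] = guest

    def release(self, dev):
        """Return dev to the free set."""
        self.owner.pop(dev, None)
```

Whether this state lives in a "network pool" object or somewhere else is exactly the open question in the thread; the check itself is needed either way.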
...
So I think we want to maintain a concept of the qemu backend (virtio,
e1000, etc), the fd that connects the qemu backend to the host (tap,
socket, macvtap, etc), and the bridge.  The bridge bit gets a little
complicated.  We have the following bridge cases:
- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge
  - on SR-IOV card
    - configured to simply fwd to external hw bridge (like VEPA mode)
    - configured as a bridge w/ policies (QoS, ACL, port mirroring,
      etc.); allows inter-guest traffic and looks a bit like the above
      sw bridge
  - external
    - need to possibly inform switch of incoming vport
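
One way to picture the three concepts and the bridge cases listed above is a small data model. This is a sketch only: every class, member and function name below is hypothetical and chosen to mirror the list, not taken from libvirt.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    VIRTIO = "virtio"
    E1000 = "e1000"


class HostLink(Enum):
    TAP = "tap"
    SOCKET = "socket"
    MACVTAP = "macvtap"


class Bridge(Enum):
    SW_LINUX_BRIDGE = "sw: Linux bridging code"
    SW_MACVLAN = "sw: macvlan"
    HW_SRIOV_VEPA = "hw: SR-IOV card forwarding to an external bridge"
    HW_SRIOV_BRIDGE = "hw: SR-IOV card bridging with policies"
    HW_EXTERNAL = "hw: external switch"


@dataclass(frozen=True)
class GuestNic:
    backend: Backend  # device model qemu presents to the guest
    link: HostLink    # kind of fd connecting qemu to the host
    bridge: Bridge    # what does the switching


def is_hardware(bridge):
    """True for the hw bridge cases in the list."""
    return bridge.name.startswith("HW_")


def may_need_switch_notification(bridge):
    """Only the external case mentions informing the switch of a vport."""
    return bridge is Bridge.HW_EXTERNAL
```

For example, GuestNic(Backend.VIRTIO, HostLink.MACVTAP, Bridge.SW_MACVLAN) describes a virtio NIC attached through macvtap with switching done in software.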
I've got mixed feelings here.  With respect to sw vs. hw bridge, I 
really think that that's an implementation detail that should not be 
exposed to a user.  A user doesn't typically want to think about whether 
they're using a hardware switch vs. software switch.  Instead, they 
approach it from, I want to have this network topology, and these 
features enabled.

I think the notion of network pools as being somewhat opaque really 
works well for this.  Ideally you would create a network pool based on 
the requirements you had, and the management tool would figure out what 
the best set of implementations to use was.

VEPA is really a unique use-case in my mind.  It's when someone wants to 
use an external switch for their network management.
...
And, we can have a hybrid.  E.g., no reason one VF can't be shared by a
few guests.
-- 
Regards,

Anthony Liguori