[libvirt] Supporting vhost-net and macvtap in libvirt for QEMU

Disclaimer: I am neither an SR-IOV nor a vhost-net expert, but I've CC'd people that are who can throw tomatoes at me for getting bits wrong :-)

I wanted to start a discussion about supporting vhost-net in libvirt. vhost-net has not yet been merged into qemu but I expect it will be soon so it's a good time to start this discussion.

There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.

Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.

The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.

On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.

I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.

Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.

Has anyone put any more thought into this problem or how this should be modeled in libvirt? Michael, could you share your current thinking for -net syntax?

--
Regards,
Anthony Liguori

On Wed, Dec 16, 2009 at 07:48:08PM -0600, Anthony Liguori wrote:
Disclaimer: I am neither an SR-IOV nor a vhost-net expert, but I've CC'd people that are who can throw tomatoes at me for getting bits wrong :-)
I wanted to start a discussion about supporting vhost-net in libvirt. vhost-net has not yet been merged into qemu but I expect it will be soon so it's a good time to start this discussion.
There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
I tend to agree, I don't see any compelling reason to expose 'vhost' as a config option, since it is not changing any functionality, merely the internal impl. I don't think apps would be in any position to decide whether it should be on or off. Thus we just need to figure out how to detect that it is supported in kernel+QEMU, and if supported, enable it.
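As a rough illustration (a sketch only; it assumes the final kernel interface ends up as a /dev/vhost-net character device, which is not settled at the time of this thread), the detection could look like:

  # kernel side: is the vhost-net device node present?
  test -c /dev/vhost-net && echo "kernel has vhost-net"
  # QEMU side: libvirt already parses -help output for capability detection
  qemu-kvm -help 2>&1 | grep -q vhost && echo "QEMU advertises vhost"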
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
Ok, so in this model you have to create a dedicated ethXX device for every guest, no sharing ?
On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.
Yes, since the hardware doesn't allow for any usable configurability of the number of VFs, we'll just assume that they have already been set up. Likely the kernel can just enable the max # of VFs at all times.
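For the record, with the Intel SR-IOV drivers of this era the VF count is a module parameter, so "configuring" it really does mean a module reload along these lines (illustrative only; for example igb exposes max_vfs, other drivers may differ):

  modprobe -r igb
  modprobe igb max_vfs=7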
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
Agreed, given the hardware limitations I don't see that it is worth the bother. This new mode is not really what we'd call 'bridging' in libvirt network XML format, so I think we'll want to define a new type of network config for it in libvirt. Perhaps

  <network type='physical'>
    <source dev='eth0'/>
  </network>

Or type='passthru'

Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange wrote:
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
Ok, so in this model you have to create a dedicated ethXX device for every guest, no sharing ?
Yup. You may be sharing a physical network device via SR-IOV, but from libvirt's perspective, we're dedicating a physical device to a guest virtual nic.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
Agreed, given the hardware limitations I don't see that it is worth the bother.
This new mode is not really what we'd call 'bridging' in libvirt network XML format, so I think we'll want to define a new type of network config for it in libvirt. Perhaps
  <network type='physical'>
    <source dev='eth0'/>
  </network>
Or type='passthru'
That certainly simplifies the problem. I don't know whether SR-IOV requires additional setup though wrt programming the VF's mac address. It may make sense for libvirt to at least do that.

--
Regards,
Anthony Liguori

On Thu, Dec 17, 2009 at 07:28:00AM -0600, Anthony Liguori wrote:
Daniel P. Berrange wrote:
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
Ok, so in this model you have to create a dedicated ethXX device for every guest, no sharing ?
Yup. You may be sharing a physical network device via SR-IOV, but from libvirt's perspective, we're dedicating a physical device to a guest virtual nic.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
Agreed, given the hardware limitations I don't see that it is worth the bother.
This new mode is not really what we'd call 'bridging' in libvirt network XML format, so I think we'll want to define a new type of network config for it in libvirt. Perhaps
  <network type='physical'>
    <source dev='eth0'/>
  </network>
Oops, when I write <network> here I actually mean <interface>
Or type='passthru'
That certainly simplifies the problem.
I don't know whether SR-IOV requires additional setup though wrt programming the VF's mac address. It may make sense for libvirt to at least do that.
Oh sure, that's easy enough - if there's no MAC in the XML we autogenerate one anyway, so we always have a mac for every interface & do whatever is needed with that.

Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
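For example (a sketch only, reusing the type='physical' syntax proposed above, which does not exist in libvirt yet), a fully specified interface would then carry the generated MAC explicitly:

  <interface type='physical'>
    <mac address='52:54:00:1a:2b:3c'/>
    <source dev='eth0'/>
  </interface>

The 52:54:00 prefix is the one libvirt already uses when autogenerating MACs for KVM guests.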

* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
I don't know whether SR-IOV requires additional setup though wrt programming the VF's mac address. It may make sense for libvirt to at least do that.
Doesn't require, but will need something in the future. Esp. as we start to actually manage the embedded bridge.

thanks,
-chris

On Thursday 17 December 2009, Daniel P. Berrange wrote:
On Wed, Dec 16, 2009 at 07:48:08PM -0600, Anthony Liguori wrote:
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
Ok, so in this model you have to create a dedicated ethXX device for every guest, no sharing ?
I think so, but it could be any of:

* a physical NIC dedicated to the guest, e.g. if you want to run a firewall on that guest and provide connectivity for all other guests through it, or if you have lots of real NICs
* an IOV adapter with separate physical or virtual functions
* a VMDq adapter that shows multiple queues on the same PCI function as separate network interfaces
* a macvlan device in VEPA or bridge mode

The creation for each of these is different, but once it's there, using it should be possible in identical ways. I think an important question here is whether libvirt should be responsible for creating the devices at all, or just for opening the sockets or taps on them.
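To make the macvlan case concrete, creating and removing such a device looks roughly like this (a sketch; it assumes an iproute2 build that already understands the macvlan mode flag, and 'guest0' is an arbitrary name):

  # create a VEPA-mode macvlan on top of eth0 for one guest
  ip link add link eth0 name guest0 type macvlan mode vepa
  ip link set guest0 up
  # tear it down again when the guest goes away
  ip link del guest0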
Yes, since the hardware doesn't allow for any usable configurability of the number of VFs, we'll just assume that they have already been set up. Likely the kernel can just enable the max # of VFs at all times.
In macvlan, there is no such limitation. How many would you create?
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
Agreed, given the hardware limitations I don't see that it is worth the bother.
This new mode is not really what we'd call 'bridging' in libvirt network XML format, so I think we'll want to define a new type of network config for it in libvirt. Perhaps
  <network type='physical'>
    <source dev='eth0'/>
  </network>
Or type='passthru'
You should also have a parameter mode={'vepa'|'bridge'|'private'} like macvlan now has. Even if SR-IOV NICs today only support bridge mode, they should support at least vepa mode in the future.

Arnd
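In XML terms that might end up as something like the following (purely a sketch of the idea, combining the proposal above with the mode parameter; none of this exists in libvirt yet):

  <interface type='physical'>
    <source dev='eth0' mode='vepa'/>
  </interface>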

Arnd Bergmann wrote:
On Thursday 17 December 2009, Daniel P. Berrange wrote:
On Wed, Dec 16, 2009 at 07:48:08PM -0600, Anthony Liguori wrote:
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
Ok, so in this model you have to create a dedicated ethXX device for every guest, no sharing ?
I think so, but it could be any of:

* a physical NIC dedicated to the guest, e.g. if you want to run a firewall on that guest and provide connectivity for all other guests through it, or if you have lots of real NICs
* an IOV adapter with separate physical or virtual functions
* a VMDq adapter that shows multiple queues on the same PCI function as separate network interfaces
* a macvlan device in VEPA or bridge mode
I don't think a macvlan device is quite the same thing (mainly because there is no finite number of them). I think <source vepa="on" dev="eth0"/> probably would make more sense as a UI. But then libvirt needs to be able to create/destroy macvlan devices on demand.

--
Regards,
Anthony Liguori

On Wed, Dec 16, 2009 at 07:48:08PM -0600, Anthony Liguori wrote:
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
I was currently trying to implement -net tap,fd=X,vhost since I thought this is what you suggested originally.
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
...
modeled in libvirt? Michael, could you share your current thinking for -net syntax?
From a networking POV, these two are similar cases: vepa, where an external bridge loops packets back at the host, and SR-IOV, where this is done internally by the card, but in both cases it is external to the host. For this reason, I thought that we might want to call it like this: -net external,eth0. I don't really care much, though.

On Thursday 17 December 2009, Anthony Liguori wrote:
There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
I think it should be treated like any other option where we have kernel support to make something "go fast", e.g. kvm, or the in-kernel interrupt processing. If we don't enable it by default when it's there, I would prefer an '--enable-vhost' option over replacing the '-net tap' option with '-net vhost', because that would be easier to integrate with existing scripts.
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.
I like to think of this way of using SR-IOV (VMDq actually) as a way to do macvlan with hardware support. Unfortunately, it does not work like this yet, but the way I would like to do it is:

* use 'ip link ... type macvlan' as the configuration frontend for this mode
* as long as there are VFs, PFs or queue pairs available in hardware, use them
* When you run out of PCI functions, register additional unicast MAC addresses with the hardware, as macvlan does today
* As the final fallback, when we run out of unicast MAC addresses in the NIC, put it into promiscuous mode. Again, macvlan handles this fine today.

Right now, if you want to use a VF, you have to set up a raw socket, because you can't add a tun/tap device to the interface without a bridge, which would defeat the whole purpose of doing this. Macvtap should eventually solve this, but only after VMDq is integrated with macvlan.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
Right. The first option (<source network='br0'>) is not so ideal because it assumes that you run a bridge, which you typically don't want in this mode, because those devices have the bridge in hardware (or in macvlan for the software case).

Arnd

* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
Doesn't sound useful. Low-level, sure worth being able to turn things on and off for testing/debugging, but probably not something a user should be burdened with in libvirt. But I don't understand your -net vhost,fd=X, that would still be -net tap,fd=X, no? IOW, vhost is an internal qemu impl. detail of the virtio backend (or if you get your wish, $nic_backend).
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
tap? we'd want either macvtap or raw socket here.
On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
We'll need to keep track of more than just the other en We need to 0
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
We already need to know the VF<->PF relationship. For example, we don't want to assign a VF to a guest, then a PF to another guest, for basic sanity reasons. As we get better ability to manage the embedded switch in an SR-IOV NIC we will need to manage them as well. So we do need to have some concept of managing an SR-IOV adapter. So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:

- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external
  - need to possibly inform switch of incoming vport

And, we can have a hybrid. E.g., no reason one VF can't be shared by a few guests.
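For comparison, the plain 'sw bridge' case above is what libvirt effectively drives today (a sketch; device names are illustrative):

  brctl addbr br0
  brctl addif br0 eth0
  # each guest's tap device is then enslaved the same way
  brctl addif br0 vnet0

whereas the macvlan and hw-bridge cases replace br0 with per-guest devices hanging directly off eth0.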

Chris Wright wrote:
* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
Doesn't sound useful. Low-level, sure worth being able to turn things on and off for testing/debugging, but probably not something a user should be burdened with in libvirt.
But I don't understand your -net vhost,fd=X, that would still be -net tap,fd=X, no? IOW, vhost is an internal qemu impl. detail of the virtio backend (or if you get your wish, $nic_backend).
I don't want to get bogged down in a qemu-devel discussion on libvirt-devel :-) But from a libvirt perspective, I assume that it wants to open up /dev/vhost in order to not have to grant the qemu instance privileges which means that it needs to hand qemu the file descriptor to it. Given a file descriptor, I don't think qemu can easily tell whether it's a tun/tap fd or whether it's a vhost fd. Since they have different interfaces, we need libvirt to tell us which one it is. Whether that's -net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
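To summarize the two spellings being debated here (neither is final), the command line libvirt generates would be one of:

  qemu -net nic,model=virtio -net tap,fd=42,vhost    # the form in Michael's current patches
  qemu -net nic,model=virtio -net vhost,fd=42        # the form suggested above

where fd 42 stands for whatever descriptor libvirt opened and passed down.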
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
tap? we'd want either macvtap or raw socket here.
I screwed up. I meant to say, -net vhost,dev=eth0. But maybe it doesn't matter if libvirt is the one that initializes the vhost device, sets up the raw socket (or macvtap), and hands us a file descriptor. In general, I think it's best to avoid as much network configuration in qemu as humanly possible so I'd rather see libvirt configure the vhost device ahead of time and pass us an fd that we can start using.
On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
We'll need to keep track of more than just the other en We need to 0
Is something missing here?
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
We already need to know the VF<->PF relationship. For example, don't want to assign a VF to a guest, then a PF to another guest for basic sanity reasons. As we get better ability to manage the embedded switch in an SR-IOV NIC we will need to manage them as well. So we do need to have some concept of managing an SR-IOV adapter.
But we still need to support the notion of backing a VNIC to a NIC, no? If this just happens to also work with a naive usage of SR-IOV, is that so bad? :-) Long term, yes, I think you want to manage SR-IOV adapters as if they're a network pool. But since they're sufficiently inflexible right now, I'm not sure it's all that useful today.
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:
- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external
  - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. software switch. Instead, they approach it from, I want to have this network topology, and these features enabled. I think the notion of network pools as being somewhat opaque really works well for this. Ideally you would create a network pool based on the requirements you had, and the management tool would figure out what the best set of implementations to use was. VEPA is really a unique use-case in my mind. It's when someone wants to use an external switch for their network management.
And, we can have a hybrid. E.g., no reason one VF can't be shared by a few guests.
-- Regards, Anthony Liguori

* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
Chris Wright wrote:
* Anthony Liguori (aliguori@linux.vnet.ibm.com) wrote:
There are two modes worth supporting for vhost-net in libvirt. The first mode is where vhost-net backs to a tun/tap device. This behaves in very much the same way that -net tap behaves in qemu today. Basically, the difference is that the virtio backend is in the kernel instead of in qemu so there should be some performance improvement.
Currently, libvirt invokes qemu with -net tap,fd=X where X is an already open fd to a tun/tap device. I suspect that after we merge vhost-net, libvirt could support vhost-net in this mode by just doing -net vhost,fd=X. I think the only real question for libvirt is whether to provide a user visible switch to use vhost or to just always use vhost when it's available and it makes sense. Personally, I think the latter makes sense.
Doesn't sound useful. Low-level, sure worth being able to turn things on and off for testing/debugging, but probably not something a user should be burdened with in libvirt.
But I don't understand your -net vhost,fd=X, that would still be -net tap,fd=X, no? IOW, vhost is an internal qemu impl. detail of the virtio backend (or if you get your wish, $nic_backend).
I don't want to get bogged down in a qemu-devel discussion on libvirt-devel :-)
The reason I brought it up here is in case libvirt would be doing both. /dev/vhost takes an fd for a tap device or raw socket. So libvirt would need to open both, and then becomes a question of whether libvirt only passes the single vhost fd (after setting it up completely) or passes both the vhost fd and connecting fd for qemu to put the two together. I didn't recall migration (if qemu would need tap fd again).
But from a libvirt perspective, I assume that it wants to open up /dev/vhost in order to not have to grant the qemu instance privileges which means that it needs to hand qemu the file descriptor to it.
Given a file descriptor, I don't think qemu can easily tell whether it's a tun/tap fd or whether it's a vhost fd. Since they have different interfaces, we need libvirt to tell us which one it is. Whether that's -net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
Yeah, I agree, just thinking of the workflow as it impacts libvirt.
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
tap? we'd want either macvtap or raw socket here.
I screwed up. I meant to say, -net vhost,dev=eth0. But maybe it doesn't matter if libvirt is the one that initializes the vhost device, sets up the raw socket (or macvtap), and hands us a file descriptor.
Ah, gotcha, yeah.
In general, I think it's best to avoid as much network configuration in qemu as humanly possible so I'd rather see libvirt configure the vhost device ahead of time and pass us an fd that we can start using.
Hard to disagree, but will that make qemu not work w/out libvirt?
On most modern systems, there is a small number of network devices so this model is not all that useful except when dealing with SR-IOV adapters. In that case, each physical device can be exposed as many virtual devices (VFs). There are a few restrictions here though. The biggest is that currently, you can only change the number of VFs by reloading a kernel module so it's really a parameter that must be set at startup time.
I think there are a few ways libvirt could support vhost-net in this second mode. The simplest would be to introduce a new tag similar to <source network='br0'>. In fact, if you probed the device type for the network parameter, you could probably do something like <source network='eth0'> and have it Just Work.
We'll need to keep track of more than just the other en We need to 0
Is something missing here?
I got to it below. Just noting that libvirt will need to track each piece, the backend (virtio), the connector (tap,socket), and any bridge setup.
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
We already need to know the VF<->PF relationship. For example, don't want to assign a VF to a guest, then a PF to another guest for basic sanity reasons. As we get better ability to manage the embedded switch in an SR-IOV NIC we will need to manage them as well. So we do need to have some concept of managing an SR-IOV adapter.
But we still need to support the notion of backing a VNIC to a NIC, no? If this just happens to also work with a naive usage of SR-IOV, is that so bad? :-)
Nope, not at all ;-) We do need to know if a VF is available or not (and if a PF has any of its VFs used). Needed on migration ("can I hook up to a VF on target?"), and for assignment ("can I give this PCI device to a guest? wait, it's a PF and VF's are in use." Although, I don't think libvirt actually goes beyond, "wait it's a PF").
Long term, yes, I think you want to manage SR-IOV adapters as if they're a network pool. But since they're sufficiently inflexible right now, I'm not sure it's all that useful today.
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:
- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external
  - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. software switch. Instead, they approach it from, I want to have this network topology, and these features enabled.
libvirt needs to know what to do w/ the switch. Ideally all of these would show up in Linux with the same mgmt interface, and libvirt would just apply a port profile to a port on a switch, but we aren't there now.
I think the notion of network pools as being somewhat opaque really works well for this. Ideally you would create a network pool based on the requirements you had, and the management tool would figure out what the best set of implementations to use was.
VEPA is really a unique use-case in my mind. It's when someone wants to use an external switch for their network management.
It's an enterprise thing, sure, but we need to be able to manage it. Ditto for a VN-Tag approach. They all require some basic setup.

thanks,
-chris

Chris Wright wrote:
I don't want to get bogged down in a qemu-devel discussion on libvirt-devel :-)
The reason I brought it up here is in case libvirt would be doing both. /dev/vhost takes an fd for a tap device or raw socket. So libvirt would need to open both, and then becomes a question of whether libvirt only passes the single vhost fd (after setting it up completely) or passes both the vhost fd and connecting fd for qemu to put the two together. I didn't recall migration (if qemu would need tap fd again).
I'm heavily leaning towards taking a /dev/vhost fd but we'll see what Michael posts.
But from a libvirt perspective, I assume that it wants to open up /dev/vhost in order to not have to grant the qemu instance privileges which means that it needs to hand qemu the file descriptor to it.
Given a file descriptor, I don't think qemu can easily tell whether it's a tun/tap fd or whether it's a vhost fd. Since they have different interfaces, we need libvirt to tell us which one it is. Whether that's -net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
Yeah, I agree, just thinking of the workflow as it impacts libvirt.
I really prefer -net vhost,fd=X where X is the fd of an open /dev/vhost. When invoking qemu directly, for the first go about, I'd expect -net vhost,dev=eth0 for a raw device and -net vhost,mode=tap,tap-arguments. Long term, there are so many possible ways to layer things, that I'd really like to see: -net vepa,dev=eth0 Which ends up invoking /usr/libexec/qemu-net-helper-vepa --arg-dev=eth0 --socketpair=X --try-vhost. qemu-net-helper-vepa would do all of the fancy stuff of creating a macvtap device, trying to hook that up with vhost, sending us an fd over the socketpair telling us which interface it's using and what features were enabled. That lets people infinitely extend qemu's networking support while allowing us to focus on just implementing backends for the interfaces we're exposed to. AFAICT, that's just /dev/vhost, /dev/net/tun, and a normal socket. The latter two can be reduced to a single read/write interface honestly.
In general, I think it's best to avoid as much network configuration in qemu as humanly possible so I'd rather see libvirt configure the vhost device ahead of time and pass us an fd that we can start using.
Hard to disagree, but will make qemu not work w/out libvirt?
No, net/ would essentially become a series of helper programs. What's nice about this approach is that libvirt could potentially use helpers too which would allow people to run qemu directly based on the output of ps -ef. Would certainly make debugging easier.
But we still need to support the notion of backing a VNIC to a NIC, no? If this just happens to also work with a naive usage of SR-IOV, is that so bad? :-)
Nope, not at all ;-)
We do need to know if a VF is available or not (and if a PF has any of its VFs used).
"We need to know" or "it would be nice to know"? You can make the same argument about a physical network interface.
Needed on migration ("can I hook up to a VF on target?"), and for assignment ("can I give this PCI device to a guest? wait, it's a PF and VF's are in use." Although, I don't think libvirt actually goes beyond, "wait it's a PF").
Migration's definitely tough because the ethX device might carry a different name on a different node. I'm not sure how libvirt handles this today. Is it possible to do a live migration with libvirt where the mount location of a common network file system changes? For instance, if /mount/disk.img becomes /mnt/disk.img?
I think the notion of network pools as being somewhat opaque really works well for this. Ideally you would create a network pool based on the requirements you had, and the management tool would figure out what the best set of implementations to use was.
VEPA is really a unique use-case in my mind. It's when someone wants to use an external switch for their network management.
It's an enterprise thing, sure, but we need to be able to manage. Ditto for a VN-Tag approach. They all require some basic setup.
Clearly I want to punt network setup out of qemu because it's awfully complex. It makes me wonder if the same should be true for libvirt? To what extent is libvirt going to do network management over time? Should I expect to be able to use libvirt to create arbitrarily complex network pools using custom iptable rules? I think libvirt punting the setup of these things to something else isn't such a bad idea.

--
Regards,
Anthony Liguori

On Thursday 17 December 2009, Anthony Liguori wrote:
When invoking qemu directly, for the first go about, I'd expect -net vhost,dev=eth0 for a raw device and -net vhost,mode=tap,tap-arguments.
Long term, there are so many possible ways to layer things, that I'd really like to see:
-net vepa,dev=eth0
Which ends up invoking /usr/libexec/qemu-net-helper-vepa --arg-dev=eth0 --socketpair=X --try-vhost.
qemu-net-helper-vepa would do all of the fancy stuff of creating a macvtap device, trying to hook that up with vhost, sending us an fd over the socketpair telling us which interface it's using and what features were enabled.
We need to make sure not to hardcode the dependency from VEPA to macvtap in your example, so I'm not sure if a VEPA specific helper is helpful. We really have a tuple of policy, kernel implementation and qemu implementation, with many possible combinations, currently at least (ignoring UDP, TCP and VDE modes):

nat-socket-user
nat-bridge-tap
nat-bridge-tap+vhost
route-none-tap
route-none-tap+vhost
route-veth+macvlan-tap
route-veth+macvlan-tap+vhost
route-veth+macvlan-socket
route-veth+macvlan-socket+vhost
veb-bridge-tap
veb-bridge-tap+vhost
veb-macvlan-tap
veb-macvlan-tap+vhost
veb-macvlan-socket
veb-macvlan-socket+vhost
veb-sriov-socket
veb-sriov-socket+vhost
vepa-macvlan-tap
vepa-macvlan-tap+vhost
vepa-macvlan-socket
vepa-macvlan-socket+vhost
vepa-sriov-socket
vepa-sriov-socket+vhost
private-macvlan-tap
private-macvlan-tap+vhost
private-macvlan-socket
private-macvlan-socket+vhost
private-sriov-socket
private-sriov-socket+vhost

If my plans for extending macvlan for SR-IOV work out, we will also have

bridge-sriov-tap
bridge-sriov-tap+vhost
vepa-sriov-tap
vepa-sriov-tap+vhost
private-sriov-tap
private-sriov-tap+vhost

As you can see, the policy is mostly independent from the qemu implementation and even from the kernel implementation. Naming the macvtap code in qemu '-net vepa' would completely mix up things for people that want to use vepa with an SR-IOV card, or macvtap in bridge mode! The concept with the callout to an external program to deal with the enormous number of variations absolutely makes sense, but the naming needs to get better. In particular, I think that the policy should be only known between the helper and libvirt (or the user), but not show up anywhere in qemu, which can just pass all the options to the helper, and let that one decide what to do. E.g. "qemu -net host,mode=vepa,dev=eth0" can result in calling "/usr/libexec/qemu-net-helper --mode=vepa --dev=eth0 --socketpair=X --protocols=tap,socket,vhost". Then qemu-net-helper tries to find the best way to set up a vepa on eth0, given the choice of tap, socket, tap+vhost or socket+vhost, the system capabilities (sr-iov, macvlan, macvtap driver) and the user permissions it is running on.
That lets people infinitely extend qemu's networking support while allow us to focus on just implementing backends for the interfaces we're exposed to. AFAICT, that's just /dev/vhost, /dev/net/tun, and a normal socket. The later two can be reduced to a single read/write interface honestly.
Well, I think you are still required to use sendmsg/recvmsg with the raw socket, not write/read, but aside from that I agree.
No, net/ would essentially become a series of helper programs. What's nice about this approach is that libvirt could potentially use helpers too which would allow people to run qemu directly based on the output of ps -ef. Would certainly make debugging easier.
Right. Also, if we put the helpers into netcf or a similar library, more applications that are unrelated to qemu could use them, e.g. user-mode-linux, if they are interested.
Nope, not at all ;-)
We do need to know if a VF is available or not (and if a PF has any of its VFs used).
"We need to know" or "it would be nice to know"?
You can make the same argument about a physical network interface.
The difference to what we have today is that you can add an arbitrary number of taps to a bridge, so you don't need to know if any other guests are running when you add another one. But when you add a guest to a VF, you need to be sure that no other guest uses the same VF, so this needs system-wide coordination. libvirt can keep the state if it manages all guests, but if you want to run guests without libvirt, you need something like lock-files.

Arnd
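A crude sketch of the lock-file idea for the non-libvirt case (the lock path, VF naming and -net syntax are all made up here, just to show the shape of it):

  # claim VF 3 of eth2 for the lifetime of this guest, failing if it is already taken
  flock -n /var/lock/sriov-eth2-vf3 \
      qemu -net nic,model=virtio -net vhost,dev=eth2v3 ...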

As you can see, the policy is mostly independent from the qemu implementation and even from the kernel implementation. Naming the macvtap code in qemu '-net vepa' would completely mix up things for people that want to use vepa with an SR-IOV card, or macvtap in bridge mode!
Qemu can continue to name the interface '-net tap'. libvirt can invoke it as '-net tap,fd=x', whether the fd is of type tap or macvtap.

The SR-IOV/VMDq/macvlan are all 'offloading' the bridging function to a particular physical device. In the case of the 'vepa' (or even the pepa) mode the offload is to an external switch on the network. The SR-IOV NIC's embedded bridge can also be put into VEPA mode or be used as a bridge for packets among the virtual functions. The embedded bridge might even support the 'pepa' mode. Similarly, the 'macvlan' driver can support the bridge mode for packets among the macvlan interfaces, or support vepa/pepa modes where the packets are sent out on the wire without bridging function.

Therefore, we define an interface type that can be linked to a specific NIC (source dev), with a set of supported modes defined. Expanding on Daniel's suggestion earlier in this thread on the 'physical' type, we can do the following:

  <interface type ='physical' name='somename'/>
    <source dev='eth0'/>
    <type='sr-iov|vmdq|ethernet'/>    // it can be of one type
    <mode='vepa|pepa|bridge'/>
  </interface>

The 'mode' can be left blank since the same NIC can support different modes per VM's network interface defined in the domain xml. The 'mode' can however be used to restrict the supported modes for the particular named instance.

With the above, in the domain xml, we specify:

  <interface type='physical'/>
    <name='somename'/>
    <type='macvtap|tap'/>             // one of the two to be specified
    <target mode='vepa|pepa|bridge'/> // specify the mode needed for the VM
  </interface>

With the above, when instantiating a guest libvirt will determine the type of interface. Example: for a 'vepa' on device eth0, libvirt will create a macvtap interface while setting the mode to vepa. This fd is what is passed to qemu. Since macvtap/tap appear the same to qemu we should not have to modify anything beyond libvirt.

thoughts?

Vivek/Stefan/Gerhard

__
Vivek Kashyap
Linux Technology Center, IBM
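Under the hood, for the 'vepa' + macvtap example above, libvirt would be doing roughly the following (a sketch that assumes the macvtap patches referenced earlier in the thread; the device name and MAC are placeholders):

  # create a VEPA-mode macvtap on eth0 for this guest
  ip link add link eth0 name macvtap0 type macvtap mode vepa
  ip link set macvtap0 address 52:54:00:1a:2b:3c up
  # the tap-like character device shows up as /dev/tapN, where N is the ifindex of macvtap0;
  # libvirt opens it and hands the fd to qemu as -net tap,fd=X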

On Wednesday 20 January 2010, Vivek Kashyap wrote:
As you can see, the policy is mostly independent from the qemu implementation and even from the kernel implementation. Naming the macvtap code in qemu '-net vepa' would completely mix up things for people that want to use vepa with an SR-IOV card, or macvtap in bridge mode!
Qemu can continue to name the interface '-net tap'. libvirt can invoke it as '-net tap,fd=x', whether the fd is of type tap or macvtap.
right.
With the above, in the domain xml, we specify:
  <interface type='physical'/>
    <name='somename'/>
    <type='macvtap|tap'/>             // one of the two to be specified
    <target mode='vepa|pepa|bridge'/> // specify the mode needed for the VM
  </interface>
With the above, when instantiating a guest libvirt will determine the type of interface. Example: for a 'vepa' on device eth0, libvirt will create a macvtap interface while setting the mode to vepa.
Sounds good. So you could pass macvtap with any of the target modes, or tap with bridge mode to get the current behaviour.
This fd is what is passed to qemu. Since macvtap/tap appear the same to qemu we should not have to modify anything beyond libvirt.
thoughts?
There is still the question of whether the macvtap devices should be kept persistent and created when the daemon is started, or only when instantiating a particular VM. Normal taps are not persistent unless you mark them to be so, while macvtap is persistent by default. We could also add the TUNSETPERSIST ioctl to macvtap to give it autodestruct behavior, but I'd rather avoid that if possible in order to keep the lifetime rules simple.

Arnd

On Wed, 27 Jan 2010, Arnd Bergmann wrote:
On Wednesday 20 January 2010, Vivek Kashyap wrote:
As you can see, the policy is mostly independent from the qemu implementation and even from the kernel implementation. Naming the macvtap code in qemu '-net vepa' would completely mix up things for people that want to use vepa with an SR-IOV card, or macvtap in bridge mode!
Qemu can continue to name the interface '-net tap'. libvirt can invoke it as '-net tap,fd=x', whether the fd is of type tap or macvtap.
right.
With the above, in the domain xml, we specify:
  <interface type='physical'/>
    <name='somename'/>
    <type='macvtap|tap'/>             // one of the two to be specified
    <target mode='vepa|pepa|bridge'/> // specify the mode needed for the VM
  </interface>
With the above, when instantiating a guest libvirt will determine the type of interface. Example: for a 'vepa' on device eth0, libvirt will create a macvtap interface while setting the mode to vepa.
Sounds good. So you could pass macvtap with any of the target modes, or tap with bridge mode to get the current behaviour.
Exactly. Also, the 'target mode' comes into play only if one wants to override the default mode for the 'bridge'. For example, a macvlan bridge can allow both 'vepa' and 'veb' (i.e. bridging) modes. The VM can then specify that its interface needs to be in target mode='veb' to override the default of 'vepa'.
This fd is what is passed to qemu. Since macvtap/tap appear the same to qemu we should not have to modify anything beyond libvirt.
thoughts?
There is still the question of whether the macvtap devices should be kept persistent and created when the daemon is started, or only when instantiating a particular VM. Normal taps are not persistent unless you mark them to be so, while macvtap is persistent by default. We could also add the TUNSETPERSIST ioctl to macvtap to give it autodestruct behavior, but I'd rather avoid that if possible in order to keep the lifetime rules simple.
Yes, we ran into this. We are creating the 'macvtap' interface when the VM is created. The problem certainly is when the VM terminates. It seems to me that we should, like 'tap', let macvtap auto-destruct. An option like you suggest would be ok too. Vivek
Arnd
__ Vivek Kashyap Linux Technology Center, IBM

On Thu, Dec 17, 2009 at 02:32:32PM -0800, Chris Wright wrote:
The reason I brought it up here is in case libvirt would be doing both. /dev/vhost takes an fd for a tap device or raw socket. So libvirt would need to open both, and then becomes a question of whether libvirt only passes the single vhost fd (after setting it up completely) or passes both the vhost fd and connecting fd for qemu to put the two together. I didn't recall migration (if qemu would need tap fd again).
libvirt should open both fds, pass them to qemu and qemu will put them together: since vhost might potentially perform direct access anywhere in guest memory, it would not be safe to let management control it, so vhost does not let you do it. -- MST

On Thu, Dec 17, 2009 at 03:39:05PM -0600, Anthony Liguori wrote:
Chris Wright wrote:
Doesn't sound useful. Low-level, sure worth being able to turn things on and off for testing/debugging, but probably not something a user should be burdened with in libvirt.
But I don't understand your -net vhost,fd=X, that would still be -net tap,fd=X, no? IOW, vhost is an internal qemu impl. detail of the virtio backend (or if you get your wish, $nic_backend).
I don't want to get bogged down in a qemu-devel discussion on libvirt-devel :-)
But from a libvirt perspective, I assume that it wants to open up /dev/vhost in order to not have to grant the qemu instance privileges which means that it needs to hand qemu the file descriptor to it.
Given a file descriptor, I don't think qemu can easily tell whether it's a tun/tap fd or whether it's a vhost fd. Since they have different interfaces, we need libvirt to tell us which one it is. Whether that's -net tap,vhost or -net vhost, we can figure that part out on qemu-devel :-)
That is no problem, since we already do that kind of thing for TAP devices it is perfectly feasible for us to also do it for vhost FDs.
The more interesting invocation of vhost-net though is one where the vhost-net device backs directly to a physical network card. In this mode, vhost should get considerably better performance than the current implementation. I don't know the syntax yet, but I think it's reasonable to assume that it will look something like -net tap,dev=eth0. The effect will be that eth0 is dedicated to the guest.
tap? we'd want either macvtap or raw socket here.
I screwed up. I meant to say, -net vhost,dev=eth0. But maybe it doesn't matter if libvirt is the one that initializes the vhost device, setups up the raw socket (or macvtap), and hands us a file descriptor.
In general, I think it's best to avoid as much network configuration in qemu as humanly possible so I'd rather see libvirt configure the vhost device ahead of time and pass us an fd that we can start using.
Agreed, if we can avoid needing to give QEMU CAP_NET_ADMIN then that is preferred - indeed when libvirt runs QEMU as root, we already strip it of CAP_NET_ADMIN (and all other capabilities).
Another model would be to have libvirt see an SR-IOV adapter as a network pool where it handles all of the VF management. Considering how inflexible SR-IOV is today, I'm not sure whether this is the best model.
We already need to know the VF<->PF relationship. For example, don't want to assign a VF to a guest, then a PF to another guest for basic sanity reasons. As we get better ability to manage the embedded switch in an SR-IOV NIC we will need to manage them as well. So we do need to have some concept of managing an SR-IOV adapter.
But we still need to support the notion of backing a VNIC to a NIC, no? If this just happens to also work with a naive usage of SR-IOV, is that so bad? :-)
Long term, yes, I think you want to manage SR-IOV adapters as if they're a network pool. But since they're sufficiently inflexible right now, I'm not sure it's all that useful today.
FYI, we have generic capabilities for creating & deleting host devices via the virNodeDevCreate / virNodeDevDestroy APIs. We use this for creating & deleting NPIV scsi adapters. If we need to support this for some types of NICs too, that fits into the model fine.
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:
- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external
  - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. software switch. Instead, they approach it from, I want to have this network topology, and these features enabled.
Agree there is a lot of low level detail there, and I think it will be very hard for users, or apps, to gain enough knowledge to make intelligent decisions about which they should use. So I don't think we want to expose all that detail. For a libvirt representation we need to consider it more in terms of what capabilities each option provides, rather than what implementation each option uses.

Regards,
Daniel
--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

.....
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:

- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. a software switch. Instead, they approach it as: I want this network topology, with these features enabled.

Agreed, there is a lot of low-level detail there, and I think it will be very hard for users or apps to gain enough knowledge to make intelligent decisions about which they should use. So I don't think we want to expose all that detail. For a libvirt representation we need to consider it more in terms of what capabilities each option provides, rather than what implementation each option uses.
Attached is some background information on the VEPA bridging being discussed in this thread, and then a proposal for defining it in libvirt xml.

The 'Edge Virtual Bridging' (eVB) working group has proposed a mechanism to offload the bridging function from the server to a physical switch on the network. This is referred to as VEPA (Virtual Ethernet Port Aggregator). This is described here:

http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-...

The VEPA mode implies that the virtual machines on a host communicate with each other via the physical switch on the network instead of the bridge in the Linux host. The filtering, quality of service enforcement, stats etc. are all done in the external switch.

The newer NICs with embedded switches (such as SR-IOV cards) will also provide VEPA mode. This implies that communication between two virtual functions on the same physical NIC will also require a packet to travel to the first-hop switch on the network and then be reflected back.

The 'macvlan' driver in Linux supports virtual interfaces that can be attached to virtual machine interfaces. This patch provides a tap backend to macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2. If such an interface is used, the packets will be forwarded directly onto the network, bypassing the host bridge. This is exactly what is required for VEPA mode.

However, the 'macvlan' driver can support both VEPA and 'bridging' mode. The bridging in this case is among its virtual interfaces only. There is also a private mode in which the packets are transmitted to the network but are not forwarded among the VMs.

Similarly, the sr-iov embedded switch will in the future be settable to 'VEPA', 'private' or 'bridging' mode.

In the eVB working group the 'private' mode is referred to as PEPA, and the 'bridging' as VEB (Virtual Ethernet Bridge). I'll use the same terms.

The 'VEB' mode of macvlan or sr-iov is no different from the bridge in Linux. The behaviour of the networking/switching on the network is unaffected.

Changes in the first-hop adjacent switch on the network:
---------------------------------------------------------
When the 'VEPA' (or PEPA) mode is used, the packet switching occurs on the first-hop switch. Therefore, for VM-to-VM traffic, the first-hop switch must support reflecting the packets back on the port on which they were received. This is referred to as the 'hairpin' or 'reflective relay' mode.

The IEEE 802.1 body is standardizing the protocol with the switch vendors, and various other server vendors are working on the standard. This is derived from the above-mentioned eVB ('edge virtual bridging') working group.

To enable easy testing, the Linux bridge can be put into the 'reflective relay' (or hairpin) mode. The patches are included in 2.6.32. The mode can be set using sysfs or brctl commands (in the latest bridge-utils bits).

In the future the switch vendors (in the eVB group) expect to support both VEPA and VEB on the same switch port. That is, the Linux host can have some VMs using VEPA mode and some in VEB mode on the same outgoing uplink. This protocol is yet to be fully defined and will require more changes in the bridging function. The ethernet frame will carry tags to identify the packet streams (for VEPA or VEB ports). See chart 4 in the above linked IEEE document.

However, from a libvirt definition point of view it implies that a 'bridge' can be in multiple modes (VEPA, VEB or PEPA). An alternative is to define separate bridges handling VEB/VEPA or PEPA modes for the same 'sr-iov' or 'macvlan' backend.

Determining the switch capability:
---------------------------------
The Linux host can determine (and set) whether the remote bridge supports 'hairpin' mode, and can also set this capability through a low-level protocol (DCBx) being extended in the above eVB working group. Some drivers (for NICs/CNAs) are likely to do this determination themselves and make the information available to the hypervisor/Linux host.

Summary:
--------
Based on the above, a virtual machine might be defined to work with the Linux/hypervisor bridge, with 'macvlan' in bridge, vepa or pepa modes, or with an sr-iov virtual function with switching in bridge, vepa or pepa modes.

Proposal:
--------
To support the above combinations we need to be able to define the bridge to be used, the 'uplink' it is associated with, and the interface type that the VM will use to connect to the bridge.

Currently in libvirt we define a bridge and can associate an ethernet with it (which is the uplink to the network). In the 'macvlan' and the 'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it is embedded in the 'nic', and in the case of macvlan the function is enabled when the virtual interface is created.

Describing the bridge and modes:
--------------------------------
So, we can define the bridge function using a new type or maybe extend the bridge.xml itself:

  <interface type='bridge' name='br0'>
    <bridge>
      <type='hypervisor|embedded|ethernet'/>   // hypervisor is default
      <mode='all|VEPA|PEPA|VEB'/>              // 'all' is default if supported
      <interface type='ethernet' name='eth0'/>
    </bridge>
  </interface>

The 'type' and 'mode' need not be specified. libvirt could default to the virtual bridge in the hypervisor. Similarly, the supported modes may be determined dynamically by libvirt.

Or, we could invent a new type for macvlan or sr-iov based switching:

  <interface type='physical' name='pbr0'>
    <source dev='eth0'/>
    <type='embedded|ethernet'/>
    <mode='all|VEPA|PEPA|VEB'/>   // all is default if supported
  </interface>

The above two descriptions imply that the bridge may be 'embedded' (e.g. sr-iov or vmdq nics), standard existing bridging (the VEB), or macvlan based.

Describing the VM connectivity:
--------------------------------
With the above, in the domain xml, we can specify:

  <interface type='physical'>        // or type='bridge'
    <name='br0'/>
    <type='macvtap|tap|raw'/>
    <target mode='vepa|pepa|veb'/>   // only one can be specified
  </interface>

Therefore, when instantiating a guest, libvirt will determine the type of interface and bridge. Example: for 'vepa' mode, with a bridge defined as 'ethernet', libvirt will create a macvtap interface while setting the mode to vepa.

thanks, Vivek
__ Vivek Kashyap Linux Technology Center, IBM

On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
.....
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:

- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. a software switch. Instead, they approach it as: I want this network topology, with these features enabled.

Agreed, there is a lot of low-level detail there, and I think it will be very hard for users or apps to gain enough knowledge to make intelligent decisions about which they should use. So I don't think we want to expose all that detail. For a libvirt representation we need to consider it more in terms of what capabilities each option provides, rather than what implementation each option uses.
Attached is some background information on VEPA bridging being discussed in this thread and then a proposal for defining it in libvirt xml.
The 'Edge Virtual Bridging'(eVB) working group has proposed a mechanism to offload the bridging function from the server to a physical switch on the network. This is referred to as VEPA (Virtual Ethernet Port Aggregator). This is described here:
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-...
The VEPA mode implies that the virtual machines on a host communicate to each other via the physical switch on the network instead of the bridge in the Linux host. The filtering, quality of service enforcement, stats etc. are all done in the external switch.
The newer NICs with embedded switches (such as SR-IOV cards) will also provide VEPA mode. This implies that the communication between two virtual functions on the same physical NIC will also require a packet to travel to the first hop switch on the network and then be reflected back.
The 'macvlan' driver in Linux supports virtual interfaces that can be attached to virtual machine interfaces. This patch provides tap backend to macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2. If such an interface is used the packets will be forwarded directly onto the network bypassing the host bridge. This is exactly what is required for VEPA mode.
However, the 'macvlan' driver can support both VEPA and 'bridging' mode. The bridging in this case is among its virtual interfaces only. There is also a private mode in which the packets are transmitted to the network but are not forwarded among the VMs.
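For concreteness, with a kernel and iproute2 build that have macvlan/macvtap support, the three modes described above correspond to the device creation step roughly as follows (the interface names are just examples):

    # VEPA mode: every frame goes out eth0 to the adjacent switch, which must
    # hairpin it back even for VM-to-VM traffic on the same host
    ip link add link eth0 name macvtap0 type macvtap mode vepa

    # Bridge mode: traffic between the macvtap interfaces on eth0 is switched
    # locally by the macvlan driver
    ip link add link eth0 name macvtap1 type macvtap mode bridge

    # Private mode: no forwarding between the virtual interfaces at all
    ip link add link eth0 name macvtap2 type macvtap mode private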
Similarly, the sr-iov's embedded switch in the future will be settable as 'VEPA', or 'private' or 'bridging' mode.
In the eVB working group the 'private' mode is referred to as PEPA, and the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same terms.
The 'VEB' mode of macvlan or sr-iov is no different than the bridge in Linux. The behaviour of the networking/switching on the network is unaffected.
Changes in the first-hop adjacent switch on the network:
---------------------------------------------------------
When the 'VEPA' (or PEPA) mode is used, the packet switching occurs on the first-hop switch. Therefore, for VM-to-VM traffic, the first-hop switch must support reflecting the packets back on the port on which they were received. This is referred to as the 'hairpin' or 'reflective relay' mode.
The IEEE 802.1 body is standardizing on the protocol with the switch vendors, and various other server vendors working on the standard. This is derived from the above mentioned eVB ('edge virtual bridging') working group.
To enable easy testing the Linux bridge can be put into the 'reflective relay' (or hairpin) mode. The patches are included in 2.6.32. The mode can be set using sysfs or brctl commands (in latest bridge utils bits).
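For reference, the hairpin setting is per bridge port, so a test setup against a plain Linux bridge might look like this (the bridge and port names are examples):

    # Via sysfs on a 2.6.32+ kernel: enable reflective relay on br0's uplink port
    echo 1 > /sys/class/net/br0/brif/eth0/hairpin_mode

    # Or with a bridge-utils build that has the new subcommand
    brctl hairpin br0 eth0 on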
In the future the switch vendors (in the eVB group) expect to support both VEPA and VEB on the same switch port. That is, the Linux host can have some VMs using VEPA mode and some in VEB mode on the same outgoing uplink. This protocol is yet to be fully defined and will require more changes in the bridging function. The ethernet frame will carry tags to identify the packet streams (for VEPA or VEB ports). See chart 4 in the above linked IEEE document.

However, from a libvirt definition point of view it implies that a 'bridge' can be in multiple modes (VEPA, VEB or PEPA). An alternative is to define separate bridges handling VEB/VEPA or PEPA modes for the same 'sr-iov' or 'macvlan' backend.
Determining the switch capability:
---------------------------------
The Linux host can determine (and set) whether the remote bridge supports 'hairpin' mode, and can also set this capability through a low-level protocol (DCBx) being extended in the above eVB working group. Some drivers (for NICs/CNAs) are likely to do this determination themselves and make the information available to the hypervisor/Linux host.
Summary: --------
Based on the above, a virtual machine might be defined to work with the Linux/hypervisor bridge, with 'macvlan' in bridge, vepa or pepa modes, or with an sr-iov virtual function with switching in bridge, vepa or pepa modes.
Proposal: --------
To support the above combinations we need to be able to define the bridge to be used, the 'uplink' it is associated with, and the interface type that the VM will use to connect to the bridge.
Currently in libvirt we define a bridge and can associate an ethernet with it (which is the uplink to the network). In the 'macvlan' and the 'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it is embedded in the 'nic', and in the case of macvlan the function is enabled when the virtual interface is created.
Describing the bridge and modes:
--------------------------------
So, we can define the bridge function using a new type or maybe extend the bridge.xml itself.
  <interface type='bridge' name='br0'>
    <bridge>
      <type='hypervisor|embedded|ethernet'/>   // hypervisor is default
      <mode='all|VEPA|PEPA|VEB'/>              // 'all' is default if supported
      <interface type='ethernet' name='eth0'/>
    </bridge>
  </interface>
Does this really map to how VEPA works?

For a physical bridge, you create a br0 network interface that also has eth0 as a component.

With VEPA and macv{lan,tap}, you do not create a single "br0" interface. Instead, for the given physical port, you create interfaces for each tap device and hand them over. IOW, while something like:

  <interface type='bridge' name='br0'>
    <bridge>
      <interface type='ethernet' name='eth0'/>
      <interface type='ethernet' name='eth1'/>
    </bridge>
  </interface>

makes sense, the following wouldn't:

  <interface type='bridge' name='br0'>
    <bridge mode='VEPA'>
      <interface type='ethernet' name='eth0'/>
      <interface type='ethernet' name='eth1'/>
    </bridge>
  </interface>

I think the only use of the interface tag that would make sense is:

  <interface type='ethernet' name='eth0'>
    <vepa/>
  </interface>

And then in the VM definition, instead of:

  <interface type='direct'>
    <source physical='eth0'>
    ...
  </interface>

You can imagine doing something similar with SR-IOV:

  <interface type='ethernet' name='eth0'>
    <sr-iov/>
  </interface>

and in the guest:

  <interface type='direct'>
    <source physical='eth0'>
    ...
  </interface>
The 'type' and 'mode' need not be specified. libvirt could default to the virtual bridge in the hypervisor. Similarly, the supported modes may be determined dynamically by libvirt.
Or, we could invent a new type for macvlan or sr-iov based switching:
  <interface type='physical' name='pbr0'>
    <source dev='eth0'/>
    <type='embedded|ethernet'/>
    <mode='all|VEPA|PEPA|VEB'/>   // all is default if supported
  </interface>
IIUC, when you do macvlan/macvtap, there is no 'pbr0' interface. It's fundamentally different than standard bridging and I think ought to be treated differently. Regards, Anthony Liguori
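To make the guest-side sketch above concrete, such a snippet could eventually be handed to libvirt in the usual way; the XML mirrors the proposal in this thread rather than an agreed schema, and the guest name is hypothetical:

    cat > direct-nic.xml <<'EOF'
    <interface type='direct'>
      <source physical='eth0'/>
      <model type='virtio'/>
    </interface>
    EOF

    # Hot-plug it into a running guest once libvirt understands the new type
    virsh attach-device myguest direct-nic.xml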

On Mon, Jan 25, 2010 at 11:38:15AM -0600, Anthony Liguori wrote:
On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
.....
So I think we want to maintain a concept of the qemu backend (virtio, e1000, etc), the fd that connects the qemu backend to the host (tap, socket, macvtap, etc), and the bridge. The bridge bit gets a little complicated. We have the following bridge cases:

- sw bridge
  - normal existing setup, w/ Linux bridging code
  - macvlan
- hw bridge - on SR-IOV card
  - configured to simply fwd to external hw bridge (like VEPA mode)
  - configured as a bridge w/ policies (QoS, ACL, port mirroring, etc. and allows inter-guest traffic and looks a bit like above sw switch)
- external - need to possibly inform switch of incoming vport
I've got mixed feelings here. With respect to sw vs. hw bridge, I really think that's an implementation detail that should not be exposed to a user. A user doesn't typically want to think about whether they're using a hardware switch vs. a software switch. Instead, they approach it as: I want this network topology, with these features enabled.

Agreed, there is a lot of low-level detail there, and I think it will be very hard for users or apps to gain enough knowledge to make intelligent decisions about which they should use. So I don't think we want to expose all that detail. For a libvirt representation we need to consider it more in terms of what capabilities each option provides, rather than what implementation each option uses.
Attached is some background information on VEPA bridging being discussed in this thread and then a proposal for defining it in libvirt xml.
The 'Edge Virtual Bridging'(eVB) working group has proposed a mechanism to offload the bridging function from the server to a physical switch on the network. This is referred to as VEPA (Virtual Ethernet Port Aggregator). This is described here:
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-...
The VEPA mode implies that the virtual machines on a host communicate to each other via the physical switch on the network instead of the bridge in the Linux host. The filtering, quality of service enforcement, stats etc. are all done in the external switch.
The newer NICs with embedded switches (such as SR-IOV cards) will also provide VEPA mode. This implies that the communication between two virtual functions on the same physical NIC will also require a packet to travel to the first hop switch on the network and then be reflected back.
The 'macvlan' driver in Linux supports virtual interfaces that can be attached to virtual machine interfaces. This patch provides tap backend to macvlan: http://marc.info/?l=linux-kernel&m=125986323631311&w=2. If such an interface is used the packets will be forwarded directly onto the network bypassing the host bridge. This is exactly what is required for VEPA mode.
However, the 'macvlan' driver can support both VEPA and 'bridging' mode. The bridging in this case is among its virtual interfaces only. There is also a private mode in which the packets are transmitted to the network but are not forwarded among the VMs.
Similarly, the sr-iov's embedded switch in the future will be settable as 'VEPA', or 'private' or 'bridging' mode.
In the eVB working group the 'private' mode is referred to as PEPA, and the 'bridging' as VEB (Virtual ethernet bridge). I'll use the same terms.
The 'VEB' mode of macvlan or sr-iov is no different than the bridge in Linux. The behaviour of the networking/switching on the network is unaffected.
Changes in the first-hop adjacent switch on the network:
---------------------------------------------------------
When the 'VEPA' (or PEPA) mode is used, the packet switching occurs on the first-hop switch. Therefore, for VM-to-VM traffic, the first-hop switch must support reflecting the packets back on the port on which they were received. This is referred to as the 'hairpin' or 'reflective relay' mode.
The IEEE 802.1 body is standardizing on the protocol with the switch vendors, and various other server vendors working on the standard. This is derived from the above mentioned eVB ('edge virtual bridging') working group.
To enable easy testing the Linux bridge can be put into the 'reflective relay' (or hairpin) mode. The patches are included in 2.6.32. The mode can be set using sysfs or brctl commands (in latest bridge utils bits).
In the future the switch vendors (in the eVB group) expect to support both VEPA and VEB on the same switch port. That is, the Linux host can have some VMs using VEPA mode and some in VEB mode on the same outgoing uplink. This protocol is yet to be fully defined and will require more changes in the bridging function. The ethernet frame will carry tags to identify the packet streams (for VEPA or VEB ports). See chart 4 in the above linked IEEE document.

However, from a libvirt definition point of view it implies that a 'bridge' can be in multiple modes (VEPA, VEB or PEPA). An alternative is to define separate bridges handling VEB/VEPA or PEPA modes for the same 'sr-iov' or 'macvlan' backend.
Determining the switch capability:
---------------------------------
The Linux host can determine (and set) whether the remote bridge supports 'hairpin' mode, and can also set this capability through a low-level protocol (DCBx) being extended in the above eVB working group. Some drivers (for NICs/CNAs) are likely to do this determination themselves and make the information available to the hypervisor/Linux host.
Summary: --------
Based on the above, a virtual machine might be defined to work with the Linux/hypervisor bridge, with 'macvlan' in bridge, vepa or pepa modes, or with an sr-iov virtual function with switching in bridge, vepa or pepa modes.
Proposal: --------
To support the above combinations we need to be able to define the bridge to be used, the 'uplink' it is associated with, and the interface type that the VM will use to connect to the bridge.
Currently in libvirt we define a bridge and can associate an ethernet with it (which is the uplink to the network). In the 'macvlan' and the 'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it is embedded in the 'nic', and in the case of macvlan the function is enabled when the virtual interface is created.
Describing the bridge and modes: -------------------------------- So, we can define the bridge function using a new type or maybe extend the bridge.xml itself.
<interface type='bridge' name='br0'> <bridge> <type='hypervisor|embedded|ethernet'/> //hypervisor is default <mode='all|VEPA|PEPA|VEB'/> // 'all' is default if supported. <interface type='ethernet' name='eth0'/> </bridge> </interface>
Does this really map to how VEPA works?
For a physical bridge, you create a br0 network interface that also has eth0 as a component.
With VEPA and macv{lan,tap}, you do not create a single "br0" interface. Instead, for the given physical port, you create interfaces for each tap device and hand them over. IOW, while something like:
<interface type='bridge' name='br0'> <bridge> <interface type='ethernet' name='eth0'/> <interface type='ethernet' name='eth1'/> </bridge> </interface>
Makes sense, the following wouldn't:
<interface type='bridge' name='br0'> <bridge mode='VEPA'> <interface type='ethernet' name='eth0'/> <interface type='ethernet' name='eth1'/> </bridge> </interface>
I think the only use of the interface tag that would make sense is:
<interface type='ethernet' name='eth0'> <vepa/> </interface>
You can imagine doing something similar with SR-IOV:
<interface type='ethernet' name='eth0'> <sr-iov/> </interface>
This seems like overkill to me - we don't need to manage these as top level objects, as we would with traditional bridges. I'd think we can keep the config solely within the realm of the domain XML, and create/delete the macvlan/macvtap devices on the fly, as we do with plain TAP devices today.
and in the guest:
<interface type='direct'> <source physical='eth0'> ... </interface>
I like the simplicity of just having this in the guest XML, with a way to indicate macvlan vs macvtap somehow.

Daniel

and in the guest:
<interface type='direct'> <source physical='eth0'> ... </interface>
I like the simplicity of just having this in the guest XML and a way to just indicate macvlan vs macvtap somehow.
macvlan and macvtap devices have to be created per-VM. How they're created is non-trivial, as you need to set flags which eventually get down to twiddling bits on the physical interface. So you essentially have two options:

1) punt on the creation of these devices and just let it be specified per-device. libvirt would be incapable of starting these guests automatically without some helper tool.

2) allow the twiddling bits to be specified as part of the per-domain information. <source physical='eth0' mode='vepa'> might be enough, but I'm not entirely qualified to understand exactly what the various cases are.

Regards, Anthony Liguori
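A rough sketch of how option 2 might play out at VM start time, assuming libvirt does the twiddling itself; the interface names, MAC address and the per-domain mode attribute are all illustrative:

    # Per-VM step: create a macvtap in the mode requested by the domain XML
    ip link add link eth0 name macvtap0 type macvtap mode vepa
    ip link set macvtap0 address 52:54:00:12:34:56 up

    # macvtap exposes a character device named after the interface's ifindex;
    # opening it yields the fd that is handed to QEMU (fd=N), just like a tap fd
    TAPDEV=/dev/tap$(cat /sys/class/net/macvtap0/ifindex)
    exec 3<>"$TAPDEV"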

libvir-list-bounces@redhat.com wrote on 01/25/2010 12:52:37 PM:
and in the guest:
<interface type='direct'> <source physical='eth0'> ... </interface>
I like the simplicity of just having this in the guest XML and a way to just indicate macvlan vs macvtap somehow.
I'd rather push the physical device out of the guest XML into a referenced XML, i.e., network XML, so that VM migration can be agnostic of what the local configuration of the system is, whether it's 'eth0' or 'eth1' or whatever else it may be on different hosts. Stefan
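One way Stefan's indirection could look is sketched below: the host-specific uplink lives in a named network definition and the guest only references the network by name, so only the network XML differs between hosts. The forward element and its mode attribute are speculative here, not an existing libvirt schema:

    # Host-side definition; another host might use eth1 instead of eth0
    cat > vepa-net.xml <<'EOF'
    <network>
      <name>vepa-net</name>
      <forward mode='vepa' dev='eth0'/>
    </network>
    EOF
    virsh net-define vepa-net.xml
    virsh net-start vepa-net

    # The guest XML then stays host-agnostic:
    #   <interface type='network'>
    #     <source network='vepa-net'/>
    #   </interface>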

On Mon, 25 Jan 2010, Anthony Liguori wrote:
On 01/21/2010 03:13 PM, Vivek Kashyap wrote:
.....
....
Proposal: --------
To support the above combinations we need to be able to define the bridge to be used, the 'uplink' it is associated with, and the interface type that the VM will use to connect to the bridge.
Currently in libvirt we define a bridge and can associate an ethernet with it (which is the uplink to the network). In the 'macvlan' and the 'sr-iov' cases there is no creation of the bridge itself. In 'sr-iov' it is embedded in the 'nic', and in the case of macvlan the function is enabled when the virtual interface is created.
Describing the bridge and modes: -------------------------------- So, we can define the bridge function using a new type or maybe extend the bridge.xml itself.
<interface type='bridge' name='br0'> <bridge> <type='hypervisor|embedded|ethernet'/> //hypervisor is default <mode='all|VEPA|PEPA|VEB'/> // 'all' is default if supported. <interface type='ethernet' name='eth0'/> </bridge> </interface>
Does this really map to how VEPA works?
For a physical bridge, you create a br0 network interface that also has eth0 as a component.
Right. So a bridge has at least one 'uplink'. In this case the bridge is an abstract concept. It still has an 'uplink' which is the device (eth0 in this instance).
With VEPA and macv{lan,tap}, you do not create a single "br0" interface. Instead, for the given physical port, you create interfaces for each tap device and hand them over. IOW, while something like:
<interface type='bridge' name='br0'> <bridge> <interface type='ethernet' name='eth0'/> <interface type='ethernet' name='eth1'/> </bridge> </interface>
The above is not in the domain xml but was proposed in the bridge xml. The advantage of using the bridge concept is that it appears the same for macvlan and the virtual Linux host bridge. The 'macvlan' interface itself can support 'bridge' mode in addition to the 'vepa' mode. Therefore, one is creating the bridge, attaching it to the physical device. This device is the one which provides the 'uplink' i.e. is either the sr-iov card or is the device associated with the macvlan driver. The domain xml can now point to the above bridge. For the interfaces it creates it can associate target names. See Stefan's post - we moved away from 'bridge.xml' to network xml since the bridge is abstract in this case.
Makes sense, the following wouldn't:
<interface type='bridge' name='br0'> <bridge mode='VEPA'> <interface type='ethernet' name='eth0'/> <interface type='ethernet' name='eth1'/> </bridge> </interface>
The above would make sense if you had only one device, the 'uplink', associated. It creates the bridge object (in libvirt) associated with the physical device 'eth0' on the host:

  <interface type='bridge' name='br0'>
    <bridge mode='VEPA'>
      <interface type='ethernet' name='eth0'/>
    </bridge>
  </interface>

In the VM xml, all one then does is reference the above bridge. There is no need to specify the interface or bridge mode in the VM xml. The only place that comes in is if the bridge supports 'all' (as the macvlan driver does - vepa, pepa or bridge). In such a case the VM could request a 'target mode'.

thanks, Vivek
I think the only use of the interface tag that would make sense is:
<interface type='ethernet' name='eth0'> <vepa/> </interface>
And then in the VM definition, instead of:
<interface type='direct'> <source physical='eth0'> ... </interface>
You can imagine doing something similar with SR-IOV:
<interface type='ethernet' name='eth0'> <sr-iov/> </interface>
and in the guest:
<interface type='direct'> <source physical='eth0'> ... </interface>
The 'type' and 'mode' need not be specified. libvirt could default to the virtual bridge in the hypervisor. Similarly, the supported modes may be determined dynamically by libvirt.
Or, we could invent a new type for macvlan or sr-iov based switching:
<interface type ='physical' name='pbr0'/> <source dev='eth0'/> <type='embedded|ethernet'/> <mode='all|VEPA|PEPA|VEB'/> // all is default if supported. </interface>
IIUC, when you do macvlan/macvtap, there is no 'pbr0' interface. It's fundamentally different than standard bridging and I think ought to be treated differently.
Regards,
Anthony Liguori
__ Vivek Kashyap Linux Technology Center, IBM

On Tue, Jan 26, 2010 at 07:15:05PM -0800, Vivek Kashyap wrote:
On Mon, 25 Jan 2010, Anthony Liguori wrote:
Describing the bridge and modes: -------------------------------- So, we can define the bridge function using a new type or maybe extend the bridge.xml itself.
<interface type='bridge' name='br0'> <bridge> <type='hypervisor|embedded|ethernet'/> //hypervisor is default <mode='all|VEPA|PEPA|VEB'/> // 'all' is default if supported. <interface type='ethernet' name='eth0'/> </bridge> </interface>
Does this really map to how VEPA works?
For a physical bridge, you create a br0 network interface that also has eth0 as a component.
Right. So a bridge has at least one 'uplink'. In this case the bridge is an abstract concept. It still has an 'uplink' which is the device (eth0 in this instance).
With VEPA and macv{lan,tap}, you do not create a single "br0" interface. Instead, for the given physical port, you create interfaces for each tap device and hand them over. IOW, while something like:
<interface type='bridge' name='br0'> <bridge> <interface type='ethernet' name='eth0'/> <interface type='ethernet' name='eth1'/> </bridge> </interface>
The above is not in the domain xml but was proposed in the bridge xml.
The advantage of using the bridge concept is that it appears the same for macvlan and the virtual Linux host bridge. The 'macvlan' interface itself can support 'bridge' mode in addition to the 'vepa' mode.
Therefore, one is creating the bridge, attaching it to the physical device. This device is the one which provides the 'uplink' i.e. is either the sr-iov card or is the device associated with the macvlan driver. The domain xml can now point to the above bridge. For the interfaces it creates it can associate target names.
The main issue with this is that when using VEPA/macvlan there's no actual host device being created, as there is when using the Linux software bridge. The <interface> descriptions here are mapped straight onto the /etc/sysconfig/network-scripts/ifcfg-XXX files that trigger creation & setup of the physical, bridge, bonding & vlan interfaces. Since there is no actual bridge interface, there's no ifcfg-XXX to map onto in the VEPA case.

Daniel
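To illustrate the mapping Daniel describes: a conventional software bridge corresponds to a concrete pair of ifcfg files roughly like the ones below, whereas a VEPA/macvlan setup has no bridge interface to describe this way. The contents are a typical hand-written example, not anything libvirt generates:

    # /etc/sysconfig/network-scripts/ifcfg-br0
    DEVICE=br0
    TYPE=Bridge
    ONBOOT=yes
    BOOTPROTO=dhcp

    # /etc/sysconfig/network-scripts/ifcfg-eth0
    DEVICE=eth0
    ONBOOT=yes
    BRIDGE=br0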

On Wed, 27 Jan 2010, Daniel P. Berrange wrote: .....
The above is not in the domain xml but was proposed in the bridge xml.
The advantage of using the bridge concept is that it appears the same for macvlan and the virtual Linux host bridge. The 'macvlan' interface itself can support 'bridge' mode in addition to the 'vepa' mode.
Therefore, one is creating the bridge, attaching it to the physical device. This device is the one which provides the 'uplink' i.e. is either the sr-iov card or is the device associated with the macvlan driver. The domain xml can now point to the above bridge. For the interfaces it creates it can associate target names.
The main issue with this is that when using VEPA/macvlan there's no actual host device being created, as there is when using the Linux software bridge. The <interface> descriptions here are mapped straight onto the /etc/sysconfig/network-scripts/ifcfg-XXX files that trigger creation & setup of the physical, bridge, bonding & vlan interfaces. Since there is no actual bridge interface, there's no ifcfg-XXX to map onto in the VEPA case.
OK, thanks for this clarification. We can use a similar setup with a different construct - maybe 'physical' or 'direct'. Vivek
__ Vivek Kashyap Linux Technology Center, IBM
participants (8):
- Anthony Liguori
- Arnd Bergmann
- Arnd Bergmann
- Chris Wright
- Daniel P. Berrange
- Michael S. Tsirkin
- Stefan Berger
- Vivek Kashyap