[libvirt] [RFC][PATCH v2 0/2] Dynamic backend setup for macvtap interfaces

I posted this RFC/patch set a few weeks ago but didn't receive any response. My implementation works, but I'd like to hear from anyone more familiar with libvirt concurrency than I (i.e. nearly everyone) about how it might be improved. In v2 I rebased against libvirt-0.7.7 and made a bunch of minor changes.

Using a bridge to connect a qemu NIC to a host interface offers a fair amount of flexibility to reconfigure the host without restarting the VM. For example, if the bridge connects host interface eth0 to the qemu tap0 interface, eth0 can be hot-removed and hot-plugged without affecting the VM. Similarly, if the bridge connects host VLAN interface vlan0 to the qemu tap0 interface, the admin can easily replace vlan0 with vlan1 without the VM noticing.

Using the macvtap driver instead of a kernel bridge, the host interface is much more tightly tied to the VM. Qemu communicates with the macvtap interface through a file descriptor, and the macvtap interface is bound permanently to a specific host interface when it is first created. What's more, if the underlying host interface disappears, the macvtap interface vanishes along with it, leaving the VM holding a file descriptor for a deleted file.

To avoid race conditions during system startup, I would like libvirt to allow starting up the VM with a NIC even if the underlying host interface doesn't yet exist, deferring creation of the macvtap interface (analogous to starting up the VM with a tap interface bound to an orphan bridge). To support adding and removing a host interface without restarting the VM, I would like libvirt to react to the (re)appearance of the underlying host interface, creating a new macvtap interface and passing the new fd to qemu to reconnect to the NIC.

(It would also be nice if libvirt allowed the user to change which underlying host interface the qemu NIC is connected to. I'm ignoring this issue for now, except to note that implementing the above features should make this easier.)
The libvirt API already supports domainAttachDevice and domainDetachDevice to add or remove an interface while the VM is running. In the qemu implementation, these commands add or remove the VM NIC device as well as reconfiguring the host side. This works only if the OS and application running in the VM can handle PCI hotplug and dynamically reconfigure its network. I would like to isolate the VM from changes to the host network setup, whether you use macvtap or a bridge.

The changes I think are needed to implement this include:

1. Refactor qemudDomainAttachNetDevice/qemudDomainDetachNetDevice, which currently handle both backend (host) setup and adding/removing the VM NIC device; move the backend setup code into separate functions that can be called separately without affecting VM devices.

2. Implement a thread or task that watches for changes to the underlying host interface for each configured macvtap interface, and reacts by invoking the appropriate backend setup code.

3. Change qemudBuildCommandLine to defer backend setup if qemu supports the necessary features for doing it later (e.g. the host_net_add monitor command).

4. Implement appropriate error handling and reporting, and any necessary changes to the configuration schema.

The following patches are a partial implementation of the above as a proof of concept.

Patch 1 implements change (1) above, moving the backend setup code to new functions qemudDomainConnectNetBackend/qemudDomainDisconnectNetBackend, and calling these functions from the existing qemudDomainAttachNetDevice/qemudDomainDetachNetDevice. I think this change is useful on its own: it breaks up two monster functions into more manageable pieces, and eliminates some code duplication (e.g. the try_remove clause at the end of qemudDomainAttachNetDevice).

Patch 2 is a godawful hack roughly implementing changes (2) and (3) above (did I mention that this is a proof of concept?).
It spawns a thread that simply tries reconnecting the backend of each macvtap interface once a second. As long as the interface is already up, the reconnection fails. If the macvtap interface goes away because the underlying host interface disappears, the reconnection fails until the host interface reappears.

I ran into two major issues while implementing (2) and (3):

- Can we use the existing virEvent functions to invoke the reconnection process, triggered either by a timer or by an event from the host? It seems like this ought to work, but it appears that communication between libvirt and the qemu monitor relies on an event, and since all events run in the same thread, there's no way for an event to call the monitor.

- Should the reconnection process use udev or hal to get notifications, or leverage the node device code, which itself uses udev or hal? Currently there doesn't appear to be a way to get notifications of changes to node devices; if there were, we'd still need to address the threading issue. If we use node devices, what changes to the configuration schema would be needed to associate a macvtap interface with the underlying node device?

I'd appreciate input on item (4) as well (e.g. does it always make sense to ignore the missing host interface on the assumption that it could show up later?).

--Ed
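[Editorial note: the once-a-second retry loop that Patch 2 describes can be sketched roughly as follows. This is Python rather than libvirt's C, purely for illustration; `reconnect` is a stand-in callback for qemudDomainConnectNetBackend, and the sysfs check is one possible way to test for the host interface's presence.]

```python
import os
import threading

def host_iface_exists(name, sysfs_root="/sys/class/net"):
    """True if the underlying host interface is currently present."""
    return os.path.exists(os.path.join(sysfs_root, name))

class MacvtapWatcher(threading.Thread):
    """Once per interval, try to (re)connect the backend of each
    configured macvtap interface.  If the backend is already up the
    reconnect attempt fails harmlessly; if the host interface is gone,
    attempts keep failing until it reappears."""

    def __init__(self, ifaces, reconnect, interval=1.0, exists=host_iface_exists):
        super().__init__(daemon=True)
        self.ifaces = list(ifaces)
        self.reconnect = reconnect   # stand-in for qemudDomainConnectNetBackend
        self.interval = interval
        self.exists = exists
        self.stopping = threading.Event()

    def run(self):
        while not self.stopping.wait(self.interval):
            for name in self.ifaces:
                if not self.exists(name):
                    continue         # host interface absent: retry next tick
                try:
                    self.reconnect(name)
                except OSError:
                    pass             # already connected, or lost a race
```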

libvir-list-bounces@redhat.com wrote on 03/08/2010 07:29:42 PM:
Using a bridge to connect a qemu NIC to a host interface offers a fair amount of flexibility to reconfigure the host without restarting the VM. For example, if the bridge connects host interface eth0 to the qemu tap0 interface, eth0 can be hot-removed and hot-plugged without affecting the VM. Similarly, if the bridge connects host VLAN interface vlan0 to the qemu tap0 interface, the admin can easily replace vlan0 with vlan1 without the VM noticing.
Using the macvtap driver instead of a kernel bridge, the host interface is much more tightly tied to the VM. Qemu communicates with the macvtap interface through a file descriptor, and the macvtap interface is bound permanently to a specific host interface when it is first created. What's more, if the underlying host interface disappears, the macvtap interface vanishes along with it, leaving the VM holding a file descriptor for a deleted file.
To avoid race conditions during system startup, I would like libvirt to
The problem, I guess, is that the underlying interface can disappear at any time due to hotplug, leaving you with a race condition at other times as well, until a watcher thread detects the change and can act, no?
allow starting up the VM with a NIC even if the underlying host interface doesn't yet exist, deferring creation of the macvtap interface (analogous to starting up the VM with a tap interface bound to an orphan bridge). To support adding and removing a host interface without
What do you pass on the Qemu command line? Currently we pass a file descriptor of the tap interface...
restarting the VM, I would like libvirt to react to the (re)appearance of the underlying host interface, creating a new macvtap interface and passing the new fd to qemu to reconnect to the NIC.
How do you handle the macvtap description in the VM configuration? Currently we have the 'direct' interface description where the link device is written into the domain description:

<devices>
  ...
  <interface type='direct'>
    <source dev='eth0' mode='vepa'/>
  </interface>
  ...
</devices>

http://libvirt.org/formatdomain.html#elementsNICSDirect

Are you reflecting the change to 'eth0' also in case the hotplugged device were to appear under another name?
(It would also be nice if libvirt allowed the user to change which underlying host interface the qemu NIC is connected to. I'm ignoring this issue for now, except to note that implementing the above features should make this easier.)
The libvirt API already supports domainAttachDevice and domainDetachDevice to add or remove an interface while the VM is running. In the qemu implementation, these commands add or remove the VM NIC device as well as reconfiguring the host side. This works only if the OS and application running in the VM can handle PCI hotplug and dynamically reconfigure its network. I would like to isolate the VM from changes to the host network setup, whether you use macvtap or a bridge.
So currently you would have to attach and detach the direct device. Now, in your implementation, a host unplug would automatically cause the macvtap to get unplugged, and if a host interface appears it would automatically recreate a macvtap linking it to this host interface? Under what conditions does this work? Does the new interface have to have the same name? I'm wondering because some scripts, I believe, check the MAC address of the device, and if it doesn't match the one expected for eth0, the device may appear as eth1. How are cases handled where I would like it to reconnect to vlan 100 of the newly connected host interface, but I probably have to run some command first to create that vlan 100 interface?
The changes I think are needed to implement this include:
1. Refactor qemudDomainAttachNetDevice/qemudDomainDetachNetDevice, which currently handle both backend (host) setup and adding/removing the VM NIC device; move the backend setup code into separate functions that can be called separately without affecting VM devices.
2. Implement a thread or task that watches for changes to the underlying host interface for each configured macvtap interface, and reacts by invoking the appropriate backend setup code.
I suppose the backend setup code is provided and is not some external script that the user can run to, for example, have the vlan 100 interface created on host hotplug.

Stefan
3. Change qemudBuildCommandLine to defer backend setup if qemu supports the necessary features for doing it later (e.g. the host_net_add monitor command).
4. Implement appropriate error handling and reporting, and any necessary changes to the configuration schema.
The following patches are a partial implementation of the above as a proof of concept.
Patch 1 implements change (1) above, moving the backend setup code to new functions qemudDomainConnectNetBackend/qemudDomainDisconnectNetBackend, and calling these functions from the existing qemudDomainAttachNetDevice/qemudDomainDetachNetDevice. I think this change is useful on its own: it breaks up two monster functions into more manageable pieces, and eliminates some code duplication (e.g. the try_remove clause at the end of qemudDomainAttachNetDevice).
Patch 2 is a godawful hack roughly implementing changes (2) and (3) above (did I mention that this is a proof of concept?). It spawns a thread that simply tries reconnecting the backend of each macvtap interface once a second. As long as the interface is already up, the reconnection fails. If the macvtap interface goes away because the underlying host interface disappears, the reconnection fails until the host interface reappears.
I ran into two major issues while implementing (2) and (3):
- Can we use the existing virEvent functions to invoke the reconnection process, triggered either by a timer or by an event from the host? It seems like this ought to work, but it appears that communication between libvirt and the qemu monitor relies on an event, and since all events run in the same thread, there's no way for an event to call the monitor.
- Should the reconnection process use udev or hal to get notifications, or leverage the node device code which itself uses udev or hal? Currently there doesn't appear to be a way to get notifications of changes to node devices; if there were, we'd still need to address the threading issue. If we use node devices, what changes to the configuration schema would be needed to associate a macvtap interface with the underlying node device?
I'd appreciate input on item (4) as well (e.g. does it always make sense to ignore the missing host interface on the assumption that it could show up later?).
--Ed
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

On Tue, Mar 9, 2010 at 5:27 AM, Stefan Berger <stefanb@us.ibm.com> wrote:
The problem I guess is that the underlying interface can disappear at any time due to hotplug leaving you with a race condition at other times as well until a watcher thread detects the change and can act, no?
Right. I would point out that the problem isn't limited to hot-pluggable hardware, though; host vlan and bridge interfaces can come and go as well. These days, tools like NetworkManager make it easy for users to reconfigure the network on the fly (especially on mobile systems) and it's up to applications to deal with such changes.
What do you pass to Qemu command line? Currently we pass a file descriptor of the tap interface...
Currently libvirt passes two options for each interface: -net nic,model=e1000,vlan=1 creates the virtual NIC, and -net tap,fd=42,vlan=1 connects it to a tap interface on the host. If you omit the -net tap option, qemu creates the virtual NIC but leaves it disconnected. (The option syntax keeps changing but it's the same idea with -device and -netdev). You can use the getfd and host_net_add monitor commands to connect the NIC to a backend once qemu is running.
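[Editorial note: to make the two halves concrete, here is a small sketch, in Python only for convenience, that builds the argument list for each mode. The option strings match the 2010-era syntax Ed quotes, not the later -netdev/-device syntax.]

```python
def nic_args(vlan, model="e1000", tap_fd=None):
    """Build the -net options for one guest NIC.  With tap_fd=None
    the NIC is created but left disconnected; a backend can be
    attached later via the monitor (getfd, then host_net_add)."""
    args = ["-net", "nic,model=%s,vlan=%d" % (model, vlan)]
    if tap_fd is not None:
        args += ["-net", "tap,fd=%d,vlan=%d" % (tap_fd, vlan)]
    return args
```

So a connected NIC yields both options, while omitting tap_fd yields only the -net nic option, leaving the virtual NIC disconnected until a backend is attached over the monitor.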
How do you handle the macvtap description in the VM configuration? Currently we have the 'direct' interface description where the link device is written into the domain description:
<devices>
  ...
  <interface type='direct'>
    <source dev='eth0' mode='vepa'/>
  </interface>
  ...
</devices>
http://libvirt.org/formatdomain.html#elementsNICSDirect
Are you reflecting the change to 'eth0' also in case the hotplugged device were to appear under another name?
So currently you would have to attach and detach the direct device. Now, in your implementation, a host unplug would automatically cause the macvtap to get unplugged, and if a host interface appears it would automatically recreate a macvtap linking it to this host interface? Under what conditions does this work? Does the new interface have to have the same name? I'm wondering because some scripts, I believe, check the MAC address of the device, and if it doesn't match the one expected for eth0, the device may appear as eth1. How are cases handled where I would like it to reconnect to vlan 100 of the newly connected host interface, but I probably have to run some command first to create that vlan 100 interface?
I am assuming that the host interface you want to connect to always keeps the same name. So if you define the interface with source dev='eth0' then the hot-plugged NIC has to reappear as eth0 to get libvirt to reconnect it to the virtual NIC. Similarly, if you define the interface with source dev='vlan0' then however you change the vlan configuration, you need to call the interface vlan0 to get it reconnected (so this is probably a bad motivating example; a more descriptive name like "vmvlan" would be better than "vlan0").

To handle changing the host interface name, we'd want to use some other property to identify the host interface (perhaps the MAC address) or perhaps add a layer of indirection to the configuration model (something like the <network> definition you originally proposed). But I view this issue as complementary to my proposal; I'm deliberately focusing on just the minimum required to handle reconnecting macvtap interfaces.
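[Editorial note: the pure name-based rule Ed assumes, plus the MAC-address fallback he mentions as a possible extension, might look like the hypothetical helper below. This is not libvirt code; both the function and its MAC parameters are illustrative only.]

```python
def iface_matches(configured_dev, appeared_dev,
                  configured_mac=None, appeared_mac=None):
    """Match a newly appeared host interface against a macvtap
    backend's configuration.  Name matching is what Ed assumes; the
    MAC comparison is the hypothetical extension for the
    renamed-device case Stefan raises."""
    if appeared_dev == configured_dev:
        return True
    if configured_mac and appeared_mac:
        return configured_mac.lower() == appeared_mac.lower()
    return False
```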
I suppose the backend setup code is provided and not some external script that the user can run to for example have the vlan 100 interface created on host hotplug.
I am assuming that the host interface you want to connect to is created by something other than libvirt (e.g. udev, NetworkManager, scripts, etc.). Again, all I'm concerned with at the moment is providing a way for libvirt to reconnect the host interface to the virtual NIC when it appears.

--Ed

On Mon, Mar 08, 2010 at 04:29:42PM -0800, Ed Swierk wrote:
I posted this RFC/patch set a few weeks ago but didn't receive any response. My implementation works, but I'd like to hear from anyone more familiar with libvirt concurrency than I (i.e. nearly everyone) about how it might be improved.
In v2 I rebased against libvirt-0.7.7 and made a bunch of minor changes.
Thanks for taking the time to update this & remind us about it!
Using a bridge to connect a qemu NIC to a host interface offers a fair amount of flexibility to reconfigure the host without restarting the VM. For example, if the bridge connects host interface eth0 to the qemu tap0 interface, eth0 can be hot-removed and hot-plugged without affecting the VM. Similarly, if the bridge connects host VLAN interface vlan0 to the qemu tap0 interface, the admin can easily replace vlan0 with vlan1 without the VM noticing.
Using the macvtap driver instead of a kernel bridge, the host interface is much more tightly tied to the VM. Qemu communicates with the macvtap interface through a file descriptor, and the macvtap interface is bound permanently to a specific host interface when it is first created. What's more, if the underlying host interface disappears, the macvtap interface vanishes along with it, leaving the VM holding a file descriptor for a deleted file.
There is a related issue which IIRC Stefan raised before for migration, which is that assuming the host interface has the same name on src & dst is somewhat inflexible. I.e., eth0 on src may be plugged into the same LAN as eth3 on the destination. Even in the context of a single host, in your scenario of hotpluggable NICs, the VM tap device could end up having to switch from eth0 to eth1 after hotplug.
To avoid race conditions during system startup, I would like libvirt to allow starting up the VM with a NIC even if the underlying host interface doesn't yet exist, deferring creation of the macvtap interface (analogous to starting up the VM with a tap interface bound to an orphan bridge). To support adding and removing a host interface without restarting the VM, I would like libvirt to react to the (re)appearance of the underlying host interface, creating a new macvtap interface and passing the new fd to qemu to reconnect to the NIC.
(It would also be nice if libvirt allowed the user to change which underlying host interface the qemu NIC is connected to. I'm ignoring this issue for now, except to note that implementing the above features should make this easier.)
The problem I have with the idea of libvirt automatically re-connecting the interfaces is that it embeds a policy in libvirt. We aim to avoid hardcoding policies in libvirt itself because doing so limits the ways in which apps can make use of libvirt. Having libvirt auto-reconnect NICs would make life difficult for an app which didn't want that, or which wanted to reconfigure the NIC to point at a different named interface.

So while I agree that the scenario you raise is one we need a solution to, I don't think libvirt should be auto-reconnecting NICs itself. Instead we should provide a mechanism to allow applications to reconfigure the NIC backends via the libvirt API. A separate management service could then use this to implement automatic re-connection, if desired.
The libvirt API already supports domainAttachDevice and domainDetachDevice to add or remove an interface while the VM is running. In the qemu implementation, these commands add or remove the VM NIC device as well as reconfiguring the host side. This works only if the OS and application running in the VM can handle PCI hotplug and dynamically reconfigure its network. I would like to isolate the VM from changes to the host network setup, whether you use macvtap or a bridge.
The changes I think are needed to implement this include:
1. Refactor qemudDomainAttachNetDevice/qemudDomainDetachNetDevice, which currently handle both backend (host) setup and adding/removing the VM NIC device; move the backend setup code into separate functions that can be called separately without affecting VM devices.
Agreed, this would be a useful refactoring.
2. Implement a thread or task that watches for changes to the underlying host interface for each configured macvtap interface, and reacts by invoking the appropriate backend setup code.
I don't think we should be doing this piece.
3. Change qemudBuildCommandLine to defer backend setup if qemu supports the necessary features for doing it later (e.g. the host_net_add monitor command).
To cope with this, I think we should introduce a new type of network configuration in the XML format: a type='none', which would indicate that a NIC should be created but not connected to any host device.

On the topic of hotplug: for disks, we currently allow change of media for CDROMs/floppies by calling virDomainAttachDevice and detecting that it is an existing device. We could do something similar for NICs, though it is a little gross, so I'm thinking we might want an explicit API called virDomainModifyDevice() that we could use for all types of device reconfiguration. In the context of NICs, we'd use virDomainModifyDevice to enable changing of the host backend type. Thus a guest could be booted with a NIC configured with type=none, and later virDomainModifyDevice could change it to type=direct for macvtap, or type=bridge, etc. This allows an external app to make arbitrary changes to a NIC on the fly.

We probably also want to introduce the idea of link state to the NIC XML format. The virDomainModifyDevice API could use the QEMU monitor "set_link" command for this. Thus the guest OS will see the NIC link state go up and down, allowing it to redo DHCP, etc.
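[Editorial note: a guest NIC defined but left unconnected under this proposal might be written something like the fragment below. This is hypothetical syntax: <mac> and <model> follow the existing interface schema, but type='none' is only the proposed addition, and no such type existed in the schema at the time.]

```xml
<devices>
  <interface type='none'>
    <mac address='52:54:00:12:34:56'/>
    <model type='e1000'/>
  </interface>
</devices>
```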
I ran into two major issues while implementing (2) and (3):
- Can we use the existing virEvent functions to invoke the reconnection process, triggered either by a timer or by an event from the host? It seems like this ought to work, but it appears that communication between libvirt and the qemu monitor relies on an event, and since all events run in the same thread, there's no way for an event to call the monitor.
- Should the reconnection process use udev or hal to get notifications, or leverage the node device code which itself uses udev or hal? Currently there doesn't appear to be a way to get notifications of changes to node devices; if there were, we'd still need to address the threading issue. If we use node devices, what changes to the configuration schema would be needed to associate a macvtap interface with the underlying node device?
I think we need to make it possible for applications to see host devices come and go. We expose the current set of devices via our virNodeDev APIs, and internally track the add/remove events from hal/udev, but we don't expose these events to apps. We really need to add an event API for virNodeDev to allow apps to be notified when a device comes or goes. They would need this in order to decide to change the guest NIC host backend config.

The combination of allowing virDomainModifyDevice to change NIC host backends, and event notifications from the virNodeDev APIs, allows an external management app to implement the kind of policy you are suggesting, and all sorts of other policies too.

Longer term, though, I think we need to add a managed object called a 'switch' / virSwitchPtr. This would encapsulate a connection to a LAN. A guest XML config would just refer to a named switch. The switch itself could be configured with any of the macvtap modes, or traditional bridging. We could allow the switch to be re-configured on the fly to point to a different host device, or to change between macvtap / bridging, all without ever needing to alter the guest config. This is especially useful for the migration scenario.

Regards,
Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
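[Editorial note: the division of labour Daniel proposes — libvirt emits node-device events and exposes a modify API, while the reconnect policy lives in an external management app — could look roughly like the sketch below. Both the virNodeDev event callback and virDomainModifyDevice were hypothetical at the time, so they appear here only as injected plain callables.]

```python
def make_reconnect_policy(modify_device, configs):
    """Build an 'interface appeared' event handler for an external
    management app.  `configs` maps host interface name to a
    (domain, interface_xml) pair; `modify_device` stands in for the
    proposed virDomainModifyDevice API."""
    def on_iface_added(iface_name):
        target = configs.get(iface_name)
        if target is None:
            return False                 # not an interface we manage
        domain, interface_xml = target
        modify_device(domain, interface_xml)  # re-point the guest NIC
        return True
    return on_iface_added
```

The policy, and any alternative (match by MAC, rewrite the XML to a different backend type, do nothing), stays entirely outside libvirt, which is the point of Daniel's objection to baking auto-reconnect into the daemon.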
Participants (3):
- Daniel P. Berrange
- Ed Swierk
- Stefan Berger