Re: [libvirt] [PATCH 0/8] Hostdev-hybrid patches

Thursday, 13 September 2012

Please find my comments inline.

Many Thanks,
Regards,
Shradha Shah

On 09/12/2012 08:01 PM, Laine Stump wrote:
...
 On 09/12/2012 05:59 AM, Daniel P. Berrange wrote:
> On Tue, Sep 11, 2012 at 03:07:25PM -0400, Laine Stump wrote:
>> On 09/07/2012 12:12 PM, Shradha Shah wrote:
>>> This patch series adds the support for interface-type="hostdev-hybrid" and
>>> forward mode="hostdev-hybrid".
>>>
>>> The hostdev-hybrid mode makes migration possible along with PCI-passthrough.
>>> I had posted a RFC on the hostdev-hybrid methodology earlier on the libvirt
>>> mailing list.
>>>
>>> The RFC can be found here:
>>> https://www.redhat.com/archives/libvir-list/2012-February/msg00309.html
>> Before anything else, let me outline what I *think* happens with a
>> hostdev-hybrid device entry, and you can tell me how far off I am :-):
>>
>> * Any hostdev-hybrid interface definition results in 2 PCI devices being
>> added to the guest:
>>
>>    a) a PCI passthrough of an SR-IOV VF (done essentially the same as
>>       <interface type='hostdev'>)
>>    b) a virtio-net device which is connected via macvtap "bridge" mode
>>       (? is that always the case) to the PF of the VF in (a)
>>
>> * Both of these devices are assigned the same MAC address.
>>
>> * Each of these occupies one PCI address on the guest, so a total of 2
>> PCI addresses is needed for each hostdev-hybrid "device". (The
>> redundancy in this statement is to be sure that I'm right, as that's an
>> important point :-)
>>
>> * On the guest, these two network devices with matching MAC addresses
>> are put together into a bond interface, with an extra driver that causes
>> the bond to prefer the pci-passthrough device when it is present. So,
>> under normal circumstances *all* traffic goes through the
>> pci-passthrough device.
>>
>> * At migration time, since guests with attached pci-passthrough devices
>> can't be migrated, the pci-passthrough device (which is found by
>> searching the hostdev array for items with the "ephemeral" flag set) is
>> detached. This reduces the bond interface on the guest to only having
>> the virtio-net device, so traffic now passes through that device - it's
>> slower, but connectivity is maintained.
>>
>> * On the destination, a new VF is found and set up with the proper MAC
>> address, VLAN, and 802.1QbX port info. A virtio-net device attached to the PF
>> associated with this VF (via macvtap bridge mode) is also set up. The
>> qemu commandline includes an entry for both of these devices. (Question:
>> Is it the virtio-net device that uses the guest PCI address given in the
>> <interface> device info?) (Question: actually, I guess the
>> pci-passthrough device won't be attached until after the guest actually
>> starts running on the destination host, correct?)
>>
>> * When migration is finished, the guest is shut down on the source and
>> started up on the destination, leaving the new instance of the guest
>> temporarily with just a single (virtio-net) device in the bond.
>>
>> * Finally, the pci-passthrough of the VF is attached to the guest, and
>> the guest's bond interface resumes preferring this device, thus
>> restoring full speed networking.
>>
>> Is that all correct?
>>
>> If so, one issue I have is that one of the devices (the
>> pci-passthrough?) doesn't have its guest-side PCI address visible
>> anywhere in the guest's XML, does it? This is problematic, because
>> management applications (and libvirt itself) expect to be able to scan
>> the list of devices to learn what PCI slots are occupied on the guest,
>> and where they can add new devices.
> If that description is correct,

 That's a big "if" - keep in mind the author of the description :-)
 (seriously, it's very possible I'm missing some important point)

>  then I have to wonder why we need to
> add all this code for a new "hybrid" device type. It seems to me like
> we can do all this already simply by listing one virtio device and one
> hostdev device in the guest XML.

 Aside from detaching/re-attaching the hostdev, the other thing that
 these patches bring is automatic derivation of the <source> of the
 virtio-net device from the hostdev. The hostdev device will be grabbed
 from a pool of VFs in a <network>, then a "reverse lookup" is done in
 PCI space to determine the PF for that VF - that's where the virtio-net
 device is connected.
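
 The "reverse lookup" described above can be sketched outside libvirt using
 the Linux sysfs layout, where each VF's PCI directory carries a `physfn`
 symlink to its PF. This is only an illustration, not the code in the patch
 series; the function names and the `sysfs_root` parameter are assumptions
 made for the example.

```python
import os


def pf_for_vf(vf_pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the PCI address of the PF owning this VF, or None if not a VF."""
    physfn = os.path.join(sysfs_root, vf_pci_addr, "physfn")
    if not os.path.islink(physfn):
        # No physfn link: the device is not an SR-IOV VF.
        return None
    # The link target's final path component is the PF's PCI address.
    return os.path.basename(os.path.realpath(physfn))


def netdev_for_pci(pci_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return the first network interface name bound to a PCI device, or None."""
    net_dir = os.path.join(sysfs_root, pci_addr, "net")
    try:
        names = sorted(os.listdir(net_dir))
    except FileNotFoundError:
        return None
    return names[0] if names else None
```

 Given the VF picked from the pool, `netdev_for_pci(pf_for_vf(vf))` would
 yield the interface name (e.g. 'eth3') to which the macvtap-backed
 virtio-net device gets connected.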

 I suppose this could be handled by 1) putting only the VFs of a single
 PF in any network definition's device pool, and 2) always having two
 parallel network definitions like this:

     <network>
       <name>net-x-vfs-hostdev</name>
       <forward mode='hostdev' ephemeral='yes'>
         <pf dev='eth3'/> <!-- makes a list of all VFs for PF 'eth3' -->
       </forward>
     </network>

     <network>
       <name>net-x-pf-macvtap</name>
       <forward mode='bridge'>
         <interface dev='eth3'/>
       </forward>
     </network>

 Then each guest would have:

    <interface type='network'>
      <mac address='x:x:x:x:x:x'/>
      <source network='net-x-vfs-hostdev'/>
    </interface>
    <interface type='network'>
      <mac address='x:x:x:x:x:x'/>
      <source network='net-x-pf-macvtap'/>
      <model type='virtio'/>
    </interface>

 The problem with this is that then you can't have a pool that uses more
 than a single PF-worth of VFs. For example, I have an Intel 82576 card
 that has 2 PFs and 7 VFs per PF. This would mean that I can only have 7
 VFs in a network. Let's say I have 10 guests and want to migrate them
 back and forth between two hosts, I would have to make some arbitrary
 decision that some would use "net-x-vfs-hostdev+net-x-pf-macvtap" and
 some others would use "net-y-vfs-hostdev+net-y-pf-macvtap". Even worse
 would be if I had > 14 guests - there would be artificial limits (beyond
 simply "no more than 14 guests/host") on which guests could be moved to
 which machine at any given time (I would have to oversubscribe the
 7-guest limit for one pair of networks, and no more than 7 of that
 subset of guests could be on the same host at the same time).

 If, instead, the PF used for the virtio-net device is derived from the
 particular VF currently assigned to the same guest's hostdev, I can have
 a single network definition with VFs from multiple PFs, and they all
 become one big pool of resources. In that case, my only limit is the far
 simpler "no more than 14 guests/host"; no worries about *which* of the
 guests those 14 are. tl;dr - the two-in-one hostdev-hybrid device
 simplifies administrative decisions when you have/need multiple PFs.
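
 The placement constraint in the last two paragraphs can be made concrete
 with a small sketch. It uses the 2 PFs x 7 VFs figures from the example
 above; the function and the network name 'net-all' are hypothetical and
 exist only for this illustration.

```python
from collections import Counter


def can_place(guest_networks, vfs_per_network):
    """Check whether a set of guests fits on one host.

    guest_networks: one network name per guest (the hostdev network it uses).
    vfs_per_network: number of VFs each network's pool offers on this host.
    """
    used = Counter(guest_networks)
    return all(count <= vfs_per_network.get(net, 0)
               for net, count in used.items())


# One pool per PF: 10 guests, 8 of which were arbitrarily tied to net-x.
split_pools = {"net-x-vfs-hostdev": 7, "net-y-vfs-hostdev": 7}
split_guests = ["net-x-vfs-hostdev"] * 8 + ["net-y-vfs-hostdev"] * 2

# One pool spanning both PFs: the only limit is 14 guests per host.
merged_pool = {"net-all": 14}
merged_guests = ["net-all"] * 10

split_ok = can_place(split_guests, split_pools)     # False: net-x oversubscribed
merged_ok = can_place(merged_guests, merged_pool)   # True: 10 <= 14
```

 With per-PF pools the host rejects the guest set even though 14 VFs exist;
 with a merged pool the same 10 guests fit regardless of which guests they are.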

 (another minor annoyance is that the dual device allows both to use the
 same auto-generated MAC address, but if we just use two individual
 devices, the MAC must be manually specified for each when the device is
 originally defined (so that they will match)).

>  All that's required is to add support
> for the 'ephemeral' against hostdevs, so they are automagically
> unplugged. Technically we don't even need that, since a mgmt app can
> already just use regular hotunplug APIs before issuing the migrate
> API calls.

 I like the idea of having that capability at libvirt's level, so that
 you can easily try things out with virsh (is the ephemeral flag
 implemented so that it also works for virsh save/restore? That would be
 a double plus.) A lot of us don't really use anything higher level than
 virsh or virt-manager, especially for testing.

The ephemeral flag does not currently work with virsh save/restore, but I can
make that change very easily if required.

The ephemeral flag will be an addition to the network XML config similar to
"managed".
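
For comparison, the manual route Daniel mentions (hot-unplug from the
management side, then migrate) might look roughly like the sketch below. It
is written against the method names of the libvirt Python bindings
(XMLDesc, detachDeviceFlags, migrate) but takes the domain and connection
objects as parameters. The ephemeral='yes' attribute on <hostdev> is an
assumption made for the example, not existing libvirt syntax, and the
sketch does not wait for qemu to finish each detach.

```python
import xml.etree.ElementTree as ET

# Numeric values of libvirt.VIR_DOMAIN_AFFECT_LIVE and libvirt.VIR_MIGRATE_LIVE,
# spelled out so the sketch does not need the libvirt module to be importable.
VIR_DOMAIN_AFFECT_LIVE = 1
VIR_MIGRATE_LIVE = 1


def ephemeral_hostdevs(domain_xml):
    """Return the XML of every <hostdev> marked ephemeral='yes' (hypothetical)."""
    root = ET.fromstring(domain_xml)
    return [ET.tostring(dev, encoding="unicode")
            for dev in root.findall("./devices/hostdev")
            if dev.get("ephemeral") == "yes"]


def unplug_and_migrate(dom, dest_conn):
    """Detach ephemeral hostdevs from a running domain, then live-migrate it.

    Returns the migrated domain and the detached device XML, so the caller
    can attach replacement devices on the destination afterwards.
    """
    devices = ephemeral_hostdevs(dom.XMLDesc(0))
    for dev_xml in devices:
        # Detach is asynchronous in qemu; a real caller would have to wait
        # until the device is actually gone before migrating.
        dom.detachDeviceFlags(dev_xml, VIR_DOMAIN_AFFECT_LIVE)
    new_dom = dom.migrate(dest_conn, VIR_MIGRATE_LIVE, None, None, 0)
    return new_dom, devices
```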

...

 (I actually think there's merit to adding the ephemeral flag (can anyone
 think of a better name? When I hear ephemeral, I think of that TV chef -
 Emeril) for hostdevs in general - it would provide a method of easily
 allowing save/restore/migration for guests that have hostdevs that could
 be temporarily detached without ill consequences. I think proper
 operation would require that qemu notify libvirt when it's *really*
 finished detaching a device though (I don't have it at hand right now,
 but there's an open BZ requesting that from qemu).)

>   These patches seem to add a lot of complexity for mere
> syntactic sugar over existing capabilities.

 I agree that the two-in-one device adds a lot of complexity. If we could
 find a way to derive the PF used for the virtio-net device from the VF
 used for the hostdev without having a combined two-in-one device entry
 (and being able to use a common auto-generated mac address would be nice
 too), then I would agree that it should be left as two separate device
 entries (if nothing else, this gives us an obvious place to put the PCI
 address of the 2nd device). I'm not sure how to do that without limiting
 pools to a single PF though. (I know, I know - the solution is for a
 higher level management application to modify the guest's config during
 migration according to what's in use. But if we're going to do that
 anyway, we may as well not have network definitions defining pools of
 interfaces in the first place.)

 --
 libvir-list mailing list
 libvir-list(a)redhat.com
 https://www.redhat.com/mailman/listinfo/libvir-list 
