[libvirt] RFC: Migration with NPIV

Hi,

This proposal tries to work out a solution for migrating a domain which uses a LUN behind a vHBA as its disk device (QEMU emulated disk only at this stage), along with other related NPIV improvements that are not tied to migration. I haven't been lucky enough to get an environment to test whether these thoughts are workable, but I'd like to hear early whether people have better ideas/suggestions.

1) Persistent vHBA support

This is useful functionality that has been missing for a long time. Assume one created a vHBA and did the masking/zoning; everything works as expected. However, after a system reboot, everything is simply lost. If the user wants to get things back, he has to find out the previous WWNN & WWPN and create the vHBA again.

On the other hand, persistent vHBA support is actually required for a domain which uses a LUN behind a vHBA; otherwise the domain could fail to start after a system reboot.

To support persistent vHBAs, new APIs like virNodeDeviceDefineXML and virNodeDeviceUndefine are required. It is also useful to introduce "autostart" for vHBAs, so that a vHBA can be started automatically after a system reboot.

Proposed APIs:

virNodeDevicePtr virNodeDeviceDefineXML(virConnectPtr conn, const char *xml, unsigned int flags);
int virNodeDeviceUndefine(virConnectPtr conn, virNodeDevicePtr dev, unsigned int flags);
int virNodeDeviceSetAutostart(virNodeDevicePtr dev, int autostart, unsigned int flags);
int virNodeDeviceGetAutostart(virNodeDevicePtr dev, int *autostart, unsigned int flags);
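To make the proposal concrete, here is a minimal sketch of how a management client might drive these calls. None of these APIs exist yet, and the vHBA XML is only an assumed shape, modelled on what virNodeDeviceCreateXML accepts today:

#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    /* Assumed vHBA description: an fc_host capability under a parent HBA. */
    const char *xml =
        "<device>"
        "  <parent>scsi_host5</parent>"
        "  <capability type='scsi_host'>"
        "    <capability type='fc_host'/>"
        "  </capability>"
        "</device>";

    /* Proposed: persist the vHBA definition without starting it. */
    virNodeDevicePtr dev = virNodeDeviceDefineXML(conn, xml, 0);
    if (!dev) {
        virConnectClose(conn);
        return 1;
    }

    /* Proposed: recreate the vHBA automatically after a host reboot. */
    if (virNodeDeviceSetAutostart(dev, 1, 0) < 0)
        fprintf(stderr, "failed to set autostart\n");

    virNodeDeviceFree(dev);
    virConnectClose(conn);
    return 0;
}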
2) Associate vHBA with domain XML

There are two ways to attach a LUN to a domain: as a QEMU emulated device, or via passthrough. Since passing through a LUN is not supported in libvirt yet, let's focus on the emulated LUN at this stage.

New attributes "wwnn" and "wwpn" are introduced to identify the LUN behind the vHBA, e.g.:

<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>

Before the domain starts, we have to check whether there is a LUN assigned to the vHBA, and error out if not.

Using the stable path of the LUN also works, e.g.:

<source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>

But the disadvantage is that the user has to figure out the stable path himself, and we would have to check every stable path to see whether it's behind a vHBA in the migration "Begin" stage. Or perhaps a new attribute on the "source" element to indicate that it's behind a vHBA? Such as:

<source dev="disk-by-path" model="vport"/>

3) Migration with vHBA

One possible solution for migration with vHBA is to use two pairs of WWNN & WWPN on the source host: one pair used by the domain, the other reserved for migration purposes. It requires the storage admin to map the same LUN to the two vHBAs when doing the masking and zoning.

One of the two vHBAs is called the "primary vHBA", the other the "secondary vHBA". To maintain the relationship between these two vHBAs, we have to introduce new XML elements for the vHBA, e.g.:

In the XML of the primary vHBA:

<secondary wwpn="2101001b32a90004"/>

In the XML of the secondary vHBA:

<primary wwpn="2101001b32a90002"/>

The primary vHBA is guaranteed not to be used by any domain which is driven by libvirt (we do some checking early, before the domain starts). It's also guaranteed, with sVirt or Sanlock, that the LUN can't be used by another domain. So it's safe to have two vHBAs on the source host.

To prevent someone grabbing the LUN by creating a vHBA with the same WWNN & WWPN on another host, we must create the secondary vHBA on the source host, even though it's not being used.

Both primary and secondary vHBAs must be defined and marked as "autostart", so that the domain can still be started after a system reboot.

When doing migration, we have to bake a bigger cookie with the secondary vHBA's info (basically its WWNN and WWPN) in the migration "Begin" stage, and eat that in the migration "Prepare" stage on the target host.

In the "Begin" stage, the XML representing the secondary vHBA is constructed, and the secondary vHBA is destroyed on the source host, though not undefined.

In the "Prepare" stage, a new vHBA is created (defined and started) on the target host with the same WWNN & WWPN as the secondary vHBA on the source host. The LUN should then become visible to the target host automatically(?), and thus migration can be performed. After migration is finished on the target host, the primary vHBA on the source host is destroyed, but not undefined.

If migration fails, the new vHBA created on the target host is destroyed and undefined, and both primary and secondary vHBAs on the source host are started, so that the domain can be resumed.

Finally, if migration succeeds, the primary vHBA on the source host is transferred to the target host as the secondary vHBA (defined), and both primary and secondary vHBAs on the source host are undefined.

4) Enrich HBA's XML

It's hard to know which vHBAs were created from an HBA with the current implementation. One has to dump the XML of every (v)HBA and find the clue in the "parent" element of the vHBAs. It would be good to introduce a new element for the HBA, such as "vports", so that one can easily see which (and how many) vHBAs are created from the HBA, as well as the maximum number of vports the HBA supports.

Besides these, other useful information should be exposed too, such as the vendor name, the HBA state, the PCI address, etc. The new XML would look like:

<vports num='2' max='64'>
  <vport name="scsi_host40" wwpn="2101001b32a90004"/>
  <vport name="scsi_host40" wwpn="2101001b32a90005"/>
</vports>
<online/>
<vendor>QLogic</vendor>
<address type="pci" domain="0" bus="0" slot="5" function="0"/>

"online", "vendor", and "address" make sense for vHBAs too.

5) Improve the way to look up the LUN's stable path

Currently, looking up a LUN's stable path from its WWNN & WWPN requires iterating over sysfs each time. Maintaining the stable path in the vHBA's XML doesn't make sense, as the LUN assigned to the vHBA could change at the storage admin's whim. I'm wondering whether there is a way to be notified of such changes asynchronously; if there is, then maintaining the stable path internally would make sense.

6) Miscellaneous

This is only about QEMU emulated devices; a passed-through scsi_host backed by a vHBA is still not covered, as we have to support vHBA passthrough first. The good thing is that the solution should be similar.

Regards, Osier

On 2012年11月19日 17:30, Osier Yang wrote:
Proposed APIs:
virNodeDevicePtr virNodeDeviceDefineXML(virConnectPtr conn, const char *xml, unsigned int flags);
int virNodeDeviceUndefine(virConnectPtr conn, virNodeDevicePtr dev, unsigned int flags);
int virNodeDeviceSetAutostart(virNodeDevicePtr dev, int autostart, unsigned int flags);
int virNodeDeviceGetAutostart(virNodeDevicePtr dev, int *autostart, unsigned int flags);
One API I missed is:
int virNodeDeviceCreate(virNodeDevicePtr dev, unsigned int flags);
To create the vHBA.
Regards, Osier

On 2012年11月19日 17:30, Osier Yang wrote:
This proposal tries to work out a solution for migrating a domain which uses a LUN behind a vHBA as its disk device (QEMU emulated disk only at this stage).
Glad to see this topic on the list.
3) Migration with vHBA
One possible solution for migration with vHBA is to use two pairs of WWNN & WWPN on the source host: one pair used by the domain, the other reserved for migration purposes. It requires the storage admin to map the same LUN to the two vHBAs when doing the masking and zoning.
Is the WWNN part of the migration? I mean, isn't the WWNN normally associated with the underlying real vendor HBA, and doesn't carrying it over mean the target of your migration has to match that WWNN? Just for the sake of getting the LUN back after migration, the vHBA would only need the WWNN for the zoning and LUN masking, so you can migrate the domain across different vendor HBAs as long as you make the WWPN naming non-vendor-specific, particularly since the guest VM you are migrating is using LUNs via an NPIV port in the host.
The primary vHBA is guaranteed not to be used by any domain which is driven by libvirt (we do some checking early, before the domain starts). It's also guaranteed, with sVirt or Sanlock, that the LUN can't be used by another domain. So it's safe to have two vHBAs on the source host.
Not familiar with sVirt or Sanlock, but will there be any race condition where two domains starting migration may end up getting the same secondary WWPN? Or, I guess my question should be: how is that prevented? Unless some central database keeps track of it, or the algorithm generating the secondary vHBA's WWPN guarantees it.
In the "Prepare" stage, a new vHBA is created (defined and started) on the target host with the same WWNN & WWPN as the secondary vHBA on the source host. The LUN should then become visible to the target host automatically(?), and thus migration can be performed.
If zoning is correct, then yes.
After migration is finished on the target host, the primary vHBA on the source host is destroyed, but not undefined.
Finally, if migration succeeds, the primary vHBA on the source host is transferred to the target host as the secondary vHBA (defined), and both primary and secondary vHBAs on the source host are undefined.
Maybe you can get rid of having a second vHBA per domain all the time just for migration. I hope I understand you correctly, but maybe not; anyway, bear with me below:
You need a transient vHBA for the target domain that is already zoned to see the LUNs at the same time they are seen on the source via the original (primary) vHBA. Once you have transferred the primary vHBA from source to target, you don't need the secondary vHBA; you only need the transferred primary, since it's already undefined on the source. This cuts down on the WWPN space, which is unique per fabric. You may also reserve a pool of these transient WWPNs for migration purposes only, i.e., whoever wants to do a migration sends a request to get one of them, and the request automatically puts the transient WWPN in the zone of the requesting domain's vHBA WWPN. When migration succeeds, the post-cleanup routine can simply reconfigure the zoning to take the transient WWPN out and put it back for other domains to use for migration.
6) Miscellaneous
This is only about QEMU emulated devices; a passed-through scsi_host backed by a vHBA is still not covered, as we have to support vHBA passthrough first. The good thing is that the solution should be similar.
I assume you meant PCI passthrough here? I am not sure you want to do migration for that, since anyone using passthrough really wants to bind to the real underlying HW. But maybe there is a use case, perhaps for passing through PCI virtual functions?
Thanks, this is a great start on this issue; let me know if I can help.

Yi

On 2012年11月20日 09:36, Zou, Yi wrote:
Is the WWNN part of the migration? I mean, isn't the WWNN normally associated with the underlying real vendor HBA, and doesn't carrying it over mean the target of your migration has to match that WWNN?
No, it doesn't have to match the WWNN of the HBA. Please look through the thread for the current agreement; it's much different from this proposal.

Regards, Osier

On Mon, Nov 19, 2012 at 06:42:42PM +0800, Osier Yang wrote:
One API I missed is:
int virNodeDeviceCreate(virNodeDevicePtr dev, unsigned int flags);
To create the vHBA.
That API + functionality already exists.

Daniel

On 2012年11月20日 18:08, Daniel P. Berrange wrote:
On Mon, Nov 19, 2012 at 06:42:42PM +0800, Osier Yang wrote:
One API I missed is:
int virNodeDeviceCreate(virNodeDevicePtr dev, unsigned int flags);
To create the vHBA.
That API + functionality already exists
The existing API virNodeDeviceCreateXML creates the vHBA from the provided XML. The proposed one starts an already defined vHBA via its node device object.

Regards, Osier
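To illustrate the difference with a minimal sketch (the device name "scsi_host6" is just a placeholder, and virNodeDeviceCreate is the proposed call, which does not exist yet):

#include <libvirt/libvirt.h>

/* conn is an open connection; xml describes a vHBA as in the proposal. */
static int demo(virConnectPtr conn, const char *xml)
{
    /* Existing API: builds a transient vHBA straight from XML; it is
       gone again after a host reboot. */
    virNodeDevicePtr vhba = virNodeDeviceCreateXML(conn, xml, 0);
    if (vhba)
        virNodeDeviceFree(vhba);

    /* Proposed API: start a previously defined (persistent) vHBA
       through its node device object instead. */
    virNodeDevicePtr dev = virNodeDeviceLookupByName(conn, "scsi_host6");
    if (dev) {
        virNodeDeviceCreate(dev, 0);
        virNodeDeviceFree(dev);
    }
    return 0;
}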

On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
1) Persistent vHBA support
I don't really much like this approach. IMHO, this should all be done via the virStoragePool APIs instead. Adding define/undefine/autostart to virNodeDevice is really just duplicating the storage pool functionality.
2) Associate vHBA with domain XML
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
If you change the schema of the <source> element, then you must also create a new type='XXX' attribute to identify it, not just re-use type='block'
I don't much like the idea of mapping a vHBA to <disk> elements, because you have a cardinality mismatch. A <disk> is the equivalent of a single LUN, but a vHBA is something that provides multiple LUNs.

If you want to directly associate a vHBA with a virtual guest, then this is really in the realm of SCSI HBA passthrough, not <disk> devices.

If you want something mapped to the <disk> device, then the approach should be to map to a storage pool volume - something we've long talked about as broadly useful for all storage types, not just NPIV.
3) Migration with vHBA
If we do the mapping of HBAs to guest domains using storage pools, then at the guest level migration requires zero work. It is simply up to the management app to create the storage pool on the destination host with the same name + UUID, but with the secondary WWNN/WWPN.

The nice thing about this is that you don't need to hardcode details of a secondary WWNN/WWPN up-front. The management app can just decide on those at the time it performs the migration, so 99% of the time there will only need to be a single vHBA set up on the SAN. During migration the mgmt app can set up a second vHBA for the target host, and once complete, delete the original vHBA entirely.
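As a rough sketch of that flow from the management app's side (error handling largely elided; pool_xml is assumed to carry the same pool name/UUID as on the source but the secondary WWNN/WWPN, and dconn is a connection to the destination host):

#include <libvirt/libvirt.h>

static int migrate_with_pool(virDomainPtr dom, virConnectPtr dconn,
                             virStoragePoolPtr src_pool,
                             const char *pool_xml)
{
    /* Destination: define and start the pool, which would create the
       second vHBA there and make the LUNs visible. */
    virStoragePoolPtr dst_pool = virStoragePoolDefineXML(dconn, pool_xml, 0);
    if (!dst_pool || virStoragePoolCreate(dst_pool, 0) < 0)
        return -1;

    /* Live-migrate the guest; its disks resolve via the pool. */
    virDomainPtr ddom = virDomainMigrate(dom, dconn, VIR_MIGRATE_LIVE,
                                         NULL, NULL, 0);
    if (!ddom)
        return -1;
    virDomainFree(ddom);

    /* Success: tear down the original vHBA with the source pool. */
    virStoragePoolDestroy(src_pool);
    virStoragePoolUndefine(src_pool);
    virStoragePoolFree(dst_pool);
    return 0;
}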
4) Enrich HBA's XML
I'm trying to remember how we modelled the parent/child relationship for SR-IOV PCI cards. NPIV is a very similar concept, so we should ideally seek to model the parent/child relationship in the same manner.

Daniel

On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
1) Persistent vHBA support
I don't really much like this approach. IMHO, this should all be done via the virStoragePool APIs instead. Adding define/undefine/autostart to virNodeDevice is really just duplicating the storage pool functionality.
I like the idea of making vHBAs persist as part of pools; how do you envision it should work? Extend the scsi pools to take a vHBA descriptor and then instantiate the vHBA as part of starting the pool, or something else?
I don't much like the idea of mapping a vHBA to <disk> elements, because you have a cardinality mismatch. A <disk> is the equivalent of a single LUN, but a vHBA is something that provides multiple LUNs.
If you want to directly associate a vHBA with a virtual guest, then this is really in the realm of SCSI HBA passthrough, not <disk> devices.
If you want something mapped to the <disk> device, then the approach should be to map to a storage pool volume - something we've long talked about as broadly useful for all storage types, not just NPIV.
+1, we really should take this as an opportunity to add storage volumes as <disk> devices.
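For instance, a hypothetical schema (purely illustrative; no such <disk> type exists yet, and the pool/volume names are made up):

#include <libvirt/libvirt.h>

static int attach_volume_disk(virDomainPtr dom)
{
    /* Hypothetical <disk> schema that addresses a storage pool volume
       instead of a raw device path. */
    const char *disk_xml =
        "<disk type='volume' device='disk'>"
        "  <driver name='qemu' type='raw'/>"
        "  <source pool='npiv-pool' volume='unit:0:0:0'/>"
        "  <target dev='vda' bus='virtio'/>"
        "</disk>";

    return virDomainAttachDevice(dom, disk_xml);
}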
If we do the mapping of HBAs to guest domains using storage pools, then at a guest level, migration requires zero work.
Agreed, although there will of course need to be some degree of up-front coordination between the management app and the SAN administrators to avoid having to involve them to migrate a VM.
I'm trying to remember how we modelled the parent/child relationship for SR-IOV PCI cards. NPIV is a very similar concept, so we should ideally seek to model the parent/child relationship in the same manner.
Physical function:

<device>
  <name>pci_0000_01_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igb</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10c9'>82576 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='virt_functions'>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
    </capability>
  </capability>
</device>

Virtual function:

<device>
  <name>pci_0000_01_10_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igbvf</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>16</slot>
    <function>0</function>
    <product id='0x10ca'>82576 Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>

Interestingly, I think there's a bug there; the VF should not be showing <capability type='virt_functions'>, but that's unrelated to the present discussion.

Dave

On Tue, Nov 20, 2012 at 11:26:53AM -0500, Dave Allan wrote:
I like the idea of making vHBAs persist as part of pools; how do you envision it should work? Extend the scsi pools to take a vHBA descriptor and then instantiate the vHBA as part of starting the pool, or something else?
Yes, pretty much that. Create when you start the pool, delete when you destroy the pool.
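Something along those lines might look as follows; the fc_host adapter attributes are an assumed extension of the scsi pool XML (reusing the WWNN/WWPN from the proposal), since the vHBA-backed pool type is itself still being proposed:

#include <libvirt/libvirt.h>

static virStoragePoolPtr define_vhba_pool(virConnectPtr conn)
{
    /* Assumed pool XML: a scsi pool whose source adapter is a vHBA,
       created from the parent HBA when the pool is started. */
    const char *pool_xml =
        "<pool type='scsi'>"
        "  <name>npiv-pool</name>"
        "  <source>"
        "    <adapter type='fc_host' parent='scsi_host5'"
        "             wwnn='2001001b32a9da4e' wwpn='2101001b32a90004'/>"
        "  </source>"
        "  <target>"
        "    <path>/dev/disk/by-path</path>"
        "  </target>"
        "</pool>";

    virStoragePoolPtr pool = virStoragePoolDefineXML(conn, pool_xml, 0);
    /* Starting the pool is what would create the vHBA; destroying the
       pool would delete it again. */
    if (pool && virStoragePoolCreate(pool, 0) < 0) {
        virStoragePoolFree(pool);
        return NULL;
    }
    return pool;
}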
Agreed, although there will of course need to be some degree of up-front coordination between the management app and the SAN administrators to avoid having to involve them to migrate a VM.
Yep, this is in fact why I like to push more of this detail off to the mgmt app. Libvirt is unable to talk to the SAN, so it's better if the mgmt app has direct control of the vHBA setup/teardown via the storage APIs than to do it automagically in virDomainMigrate, where the mgmt app cannot synchronize so easily.
Interestingly, I think there's a bug there; the VF should not be showing <capability type='virt_functions'>, but that's unrelated to the present discussion.
Ok, so we should model vHBA relationships via some kind of <capability> then.

Daniel

On 2012年11月21日 00:26, Dave Allan wrote:
On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
Hi,
This proposal is trying to figure out a solution for migration of domain which uses LUN behind vHBA as disk device (QEMU emulated disk only at this stage). And other related NPIV improvements which are not related with migration. I'm not luck to get a environment to test if the thoughts are workable, but I'd like see if guys have good idea/suggestions earlier.
1) Persistent vHBA support
This is the useful stuff missed for long time. Assuming that one created a vHBA, did masking/zoning, everything works as expected. However, after a system rebooting, everything is just lost. If the user wants to get things back, he has to find out the preivous WWNN& WWPN, and create the vHBA again.
On the other hand, Persistent vHBA support is actually required for domain which uses LUN behind a vHBA. Othewise the domain could fail to start after a system rebooting.
To support the persistent vHBA, new APIs like virNodeDeviceDefineXML, virNodeDeviceUndefine is required. Also it's useful to introduce "autostart" for vHBA, so that the vHBA could be started automatically after system rebooting.
Proposed APIs:
virNodeDevicePtr virNodeDeviceDefineXML(virConnectPtr conn, const char *xml, unsigned int flags);
int virNodeDeviceUndefine(virConnectPtr conn, virNodeDevicePtr dev, unsigned int flags);
int virNodeDeviceSetAutostart(virNodeDevicePtr dev, int autostart, unsigned int flags);
int virNodeDeviceGetAutostart(virNodeDevicePtr dev, int *autostart, unsigned int flags);
I don't really much like this approach. IMHO, this should all be done via the virStoragePool APIs instead. Adding define/undefine/autostart to virNodeDevice is really just duplicating the storage pool functionality.
I like the idea of making vHBAs persist as part of pools; how do you envision it should work? Extend the scsi pools to take a vHBA descriptor and then instantiating the vHBA as part of starting the pool, or something else?
2) Associate vHBA with domain XML
There are two ways to attach a LUN to a domain: as an QEMU emulated device; or passthrough. Since passthrough a LUN is not supported in libvirt yet, let's focus on the emulated LUN at this stage.
New attributes "wwnn" and "wwpn" are introduced to indicate the LUN behind the vHBA. E.g.
<disk type='block' device='disk'> <driver name='qemu' type='raw'/> <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
If you change the schema of the<source> element, then you must also create a new type='XXX' attribute to identify it, not just re-use type='block'
<target dev='vda' bus='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/> </disk>
Before the domain starting, we have to check if there is LUN assigned to the vHBA, error out if not.
Using the stable path of LUN also works, e.g.
<source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
But the disadvantage is the user have to figure out the stable path himself; And we have to do checking of every stable path to see if it's behind a vHBA in migration "Begin" stage. Or an new XML tag for element "source" to indicate that it's behind a vHBA? such as:
<source dev="disk-by-path" model="vport"/>
I don't much like the idea of mapping vHBA to<disk> elements, because you have a cardinality mis-match. A<disk> is equivalent of a single LUN, but a vHBA is something that provides multiple LUNs.
If you want to directly associate a vHBA with a virtual guest, then this is really in the realm of SCSI HBA passthrough, not <disk> devices.
If you want something mapped to the<disk> device, then the approach should be to map to a storage pool volume - something we've long talked about as broadly useful for all storage types, not just NPIV.
+1, we really should take this as an opportunity to add storage volumes as<disk> devices.
3) Migration with vHBA
One possible solution for migration with vHBA is to use one pair of WWNN& WWPN on source host, one is using for domain, one is reserved for migration purpose. It requires the storage admin maps the same LUN to the two vHBAs when doing the masking and zoning.
One of the two vHBA is called "Primary vHBA", another is called "secondary vHBA". To maitain the relationship between these two vHBAs, we have to introduce new XMLs to vHBA. E.g.
In XML of primary vHBA:
<secondary wwpn="2101001b32a90004"/>
In XML of secondary vHBA:
<primary wwpn="2101001b32a90002"/>
Primary vHBA is going to be guaranteed not used by any domain which is driven by libvirt (we do some checking eariler before the domain starting). And it's also guaranteed that the LUN can't be used by other domain with sVirt or Sanlock. So it's safe to have two vHBAs on source host too.
To prevent one using the LUN by creating vHBA using the same WWNN& WWPN on another host, we must create the secondary vHBA on source host, even it's not being used.
Both primary and secondary vHBA must be defined and marked as "autostart" so that the domain could be started after system rebooting.
When do migration, we have to bake a bigger cookie with secondary vHBA's info (basically it's WWNN and WWPN) in migration "Begin" stage, and eat that in migration "Prepare" stage on target host.
In "Begin" stage, the XMLs represents the secondary vHBA is constructed. And the secondary vHBA is destoyed on source host, not undefined though.
In "Prepare" stage, a new vHBA is created (define and start) on target host with the same WWNN& WWPN as secondary vHBA on source host. The LUN then should be visible to target host automatically? and thus migration can be performed. After migration is finished on target host, the primary vHBA on source host is destroyed, not undefined.
If migration fails, the new vHBA created on target host will be destroyed and undefined. And both primary and secondary vHBA on source host will be started, so that the domain could be resumed.
Finally if migration succeeds, primary vHBA on source host will be transtered to target host as secondary vHBA (defined). And both primary and secondary vHBA on source host will be undefined.
If we do the mapping of HBAs to guest domains using storage pools, then at a guest level, migration requires zero work.
It is simply upto the management app to create the storage pool on the destination host with the same Name + UUID, but with the secondary WWNN/WWPN. The nice thing about this, is that you don't need to hardcode details of a secondary WWNN/WWPN up-front. The management app can just decide on those at the time it performs the migration, so 99% of the time there will only need to be a single vHBA setup on the SAN. During migration the mgmt app can setup a second vHBA for the target host, and once complete, delete the original vHBA entirely.
Agreed, although there will of course need to be some degree of up-front coordination between the management app and the SAN administrators to avoid having to involve them to migrate a VM.
4) Enrich HBA's XML
With the current implementation it's hard to know which vHBAs were created from an HBA. One has to dump the XML of each (v)HBA and work out the relationship from the vHBAs' "parent" element. It would be good to introduce a new element for the HBA, such as "vports", so that one can easily see which (and how many) vHBAs were created from the HBA.
It would also be good to expose the maximum number of vports the HBA supports.
Besides these, other useful information should be exposed too, such as the vendor name, the HBA state, the PCI address, etc.
The new XML would look like:
<vports num='2' max='64'>
  <vport name="scsi_host40" wwpn="2101001b32a90004"/>
  <vport name="scsi_host40" wwpn="2101001b32a90005"/>
</vports>
<online/>
<vendor>QLogic</vendor>
<address type="pci" domain="0" bus="0" slot="5" function="0"/>
"online", "vendor", "address" make sense to vHBA too.
I'm trying to remember how we modelled the parent/child relationship for SR-IOV PCI cards. NPIV is a very similar concept, so we should ideally seek to model the parent/child relationship in the same manner.
Physical function:
<device>
  <name>pci_0000_01_00_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igb</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x10c9'>82576 Gigabit Network Connection</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='virt_functions'>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
      <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
      <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
    </capability>
  </capability>
</device>
Virtual function:
<device>
  <name>pci_0000_01_10_0</name>
  <parent>pci_0000_00_01_0</parent>
  <driver>
    <name>igbvf</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>1</bus>
    <slot>16</slot>
    <function>0</function>
    <product id='0x10ca'>82576 Virtual Function</product>
    <vendor id='0x8086'>Intel Corporation</vendor>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>
Interestingly, I think there's a bug there; the VF should not be showing <capability type='virt_functions'>
Yeah, that's a bug. Okay, "capability" sounds good. Regards, Osier
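As a strawman of the "capability" approach, the HBA's node device XML might gain something like the following, mirroring the SR-IOV virt_functions pattern (the 'vports'/'vport' type names here are invented for illustration, reusing the example names from above):

<device>
  <name>scsi_host5</name>
  <capability type='scsi_host'>
    <host>5</host>
    <!-- mirrors <capability type='virt_functions'> on an SR-IOV PF:
         one entry per vHBA created from this HBA -->
    <capability type='vports'>
      <vport name='scsi_host40' wwpn='2101001b32a90004'/>
    </capability>
  </capability>
</device>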

On 2012-11-20 18:17, Daniel P. Berrange wrote:
On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
Hi,
This proposal tries to work out a solution for migrating a domain which uses a LUN behind a vHBA as a disk device (QEMU emulated disk only at this stage), along with other related NPIV improvements that are not connected with migration. I haven't been lucky enough to get an environment to test whether these thoughts are workable, but I'd like to see early ideas/suggestions from you guys.
1) Persistent vHBA support
This is useful functionality that has been missing for a long time. Suppose one created a vHBA and did the masking/zoning; everything works as expected. However, after a system reboot, everything is just lost. If the user wants to get things back, he has to find out the previous WWNN & WWPN and create the vHBA again.
On the other hand, persistent vHBA support is actually required for a domain which uses a LUN behind a vHBA. Otherwise the domain could fail to start after a system reboot.
To support persistent vHBAs, new APIs like virNodeDeviceDefineXML and virNodeDeviceUndefine are required. It's also useful to introduce "autostart" for vHBAs, so that a vHBA can be started automatically after a system reboot.
Proposed APIs:
virNodeDevicePtr virNodeDeviceDefineXML(virConnectPtr conn, const char *xml, unsigned int flags);
int virNodeDeviceUndefine(virConnectPtr conn, virNodeDevicePtr dev, unsigned int flags);
int virNodeDeviceSetAutostart(virNodeDevicePtr dev, int autostart, unsigned int flags);
int virNodeDeviceGetAutostart(virNodeDevicePtr dev, int *autostart, unsigned int flags);
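To make the intended usage concrete, a management app would persist a vHBA roughly as follows. This is only a sketch against the proposed APIs above, which don't exist yet; error reporting is trimmed:

#include <libvirt/libvirt.h>

/* Sketch: persist a vHBA and mark it autostart, using the proposed
 * (not yet existing) virNodeDeviceDefineXML/SetAutostart APIs above. */
static int
persist_vhba(virConnectPtr conn, const char *vhba_xml)
{
    virNodeDevicePtr dev;

    /* Define the vHBA persistently instead of only creating it */
    if (!(dev = virNodeDeviceDefineXML(conn, vhba_xml, 0)))
        return -1;

    /* Ensure the vHBA comes back after a host reboot */
    if (virNodeDeviceSetAutostart(dev, 1, 0) < 0) {
        virNodeDeviceFree(dev);
        return -1;
    }

    virNodeDeviceFree(dev);
    return 0;
}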
I don't really much like this approach. IMHO, this should all be done via the virStoragePool APIs instead. Adding define/undefine/autostart to virNodeDevice is really just duplicating the storage pool functionality.
Agreed, though it means I have to abandon the nearly finished patches. Actually, I wasn't comfortable with that approach either, given the conflicts between the device configuration probed by udev or HAL and the persistent configuration we're trying to support. So the remaining work is to improve the storage pool's XML so that the vHBA it refers to is stable, and to manage the lifecycle of the vHBA along with the pool's lifecycle. As for making sure a pool is not destroyed while one of its volumes is in use by a domain, IMO it's time to integrate storage pools with domains, i.e. map storage volumes to domain disks, and ref/unref the storage volume with the domain's lifecycle.
2) Associate vHBA with domain XML
There are two ways to attach a LUN to a domain: as a QEMU emulated device, or via passthrough. Since passing a LUN through is not supported in libvirt yet, let's focus on the emulated LUN at this stage.
New attributes "wwnn" and "wwpn" are introduced to indicate the LUN behind the vHBA. E.g.
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
If you change the schema of the <source> element, then you must also create a new type='XXX' attribute to identify it, not just re-use type='block'.
  <target dev='vda' bus='virtio'/>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
</disk>
Before the domain starts, we have to check whether there is a LUN assigned to the vHBA, and error out if not.
Using the stable path of the LUN also works, e.g.
<source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
But the disadvantage is that the user has to figure out the stable path himself, and we would have to check every stable path in the migration "Begin" stage to see whether it's behind a vHBA. Or should there be a new attribute on the "source" element to indicate that it's behind a vHBA? Such as:
<source dev="disk-by-path" model="vport"/>
I don't much like the idea of mapping a vHBA to <disk> elements, because you have a cardinality mismatch. A <disk> is the equivalent of a single LUN, but a vHBA is something that provides multiple LUNs.
If you want to directly associate a vHBA with a virtual guest, then this is really in the realm of SCSI HBA passthrough, not <disk> devices.
Agreed, I missed that multiple LUNs can be mapped to one HBA.
If you want something mapped to the <disk> device, then the approach should be to map to a storage pool volume - something we've long talked about as broadly useful for all storage types, not just NPIV.
Okay, finally we are at the point of integrating storage with domains.
3) Migration with vHBA
One possible solution for migration with vHBA is to use two WWNN & WWPN pairs on the source host: one is used by the domain, the other is reserved for migration. It requires that the storage admin map the same LUN to both vHBAs when doing the masking and zoning.
One of the two vHBAs is called the "primary vHBA", the other the "secondary vHBA". To maintain the relationship between these two vHBAs, we have to introduce new XML elements for the vHBA. E.g.
In XML of primary vHBA:
<secondary wwpn="2101001b32a90004"/>
In XML of secondary vHBA:
<primary wwpn="2101001b32a90002"/>
The primary vHBA is guaranteed not to be used by any other domain driven by libvirt (we do some checking before the domain starts). It's also guaranteed, via sVirt or sanlock, that the LUN can't be used by another domain. So it's safe to have two vHBAs on the source host too.
To prevent someone from grabbing the LUN by creating a vHBA with the same WWNN & WWPN on another host, we must create the secondary vHBA on the source host, even though it's not being used.
Both the primary and secondary vHBAs must be defined and marked as "autostart" so that the domain can still be started after a system reboot.
When migrating, we have to bake a bigger cookie containing the secondary vHBA's info (basically its WWNN and WWPN) in the migration "Begin" stage, and consume it in the "Prepare" stage on the target host.
In the "Begin" stage, the XML representing the secondary vHBA is constructed, and the secondary vHBA is destroyed on the source host, though not undefined.
In the "Prepare" stage, a new vHBA is created (defined and started) on the target host with the same WWNN & WWPN as the secondary vHBA on the source host. The LUN should then become visible to the target host automatically, and thus migration can be performed. After migration finishes on the target host, the primary vHBA on the source host is destroyed, but not undefined.
If migration fails, the new vHBA created on the target host is destroyed and undefined, and both the primary and secondary vHBAs on the source host are started again, so that the domain can be resumed.
Finally, if migration succeeds, the primary vHBA on the source host is transferred to the target host as the secondary vHBA (defined there), and both the primary and secondary vHBAs on the source host are undefined.
If we do the mapping of HBAs to guest domains using storage pools, then at the guest level, migration requires zero work.
It is simply up to the management app to create the storage pool on the destination host with the same name + UUID, but with the secondary WWNN/WWPN. The nice thing about this is that you don't need to hardcode details of a secondary WWNN/WWPN up-front. The management app can just decide on those at the time it performs the migration, so 99% of the time there will only need to be a single vHBA set up on the SAN. During migration the mgmt app can set up a second vHBA for the target host, and once complete, delete the original vHBA entirely.
Agreed. And it shows again that it would be good to integrate storage pools with domains. Otherwise, the management app has to iterate over the domain XML and look up the pools by the volume paths used by the domain disks, before setting up the pools on the target host.
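For example, with the APIs that exist today, that lookup would have to be repeated for every disk path, roughly like this (virStorageVolLookupByPath and virStoragePoolLookupByVolume are existing libvirt functions; error reporting trimmed):

#include <libvirt/libvirt.h>

/* Sketch: map a domain disk's source path back to its storage pool.
 * A mgmt app has to do this per-disk before migration if disks
 * aren't expressed as pool volumes in the domain XML. */
static virStoragePoolPtr
pool_for_disk_path(virConnectPtr conn, const char *path)
{
    virStorageVolPtr vol;
    virStoragePoolPtr pool;

    if (!(vol = virStorageVolLookupByPath(conn, path)))
        return NULL; /* path isn't a libvirt-managed volume */

    pool = virStoragePoolLookupByVolume(vol);
    virStorageVolFree(vol);
    return pool;
}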
4) Enrich HBA's XML
With the current implementation it's hard to know which vHBAs were created from an HBA. One has to dump the XML of each (v)HBA and work out the relationship from the vHBAs' "parent" element. It would be good to introduce a new element for the HBA, such as "vports", so that one can easily see which (and how many) vHBAs were created from the HBA.
It would also be good to expose the maximum number of vports the HBA supports.
Besides these, other useful information should be exposed too, such as the vendor name, the HBA state, the PCI address, etc.
The new XML would look like:
<vports num='2' max='64'>
  <vport name="scsi_host40" wwpn="2101001b32a90004"/>
  <vport name="scsi_host40" wwpn="2101001b32a90005"/>
</vports>
<online/>
<vendor>QLogic</vendor>
<address type="pci" domain="0" bus="0" slot="5" function="0"/>
"online", "vendor", "address" make sense to vHBA too.
I'm trying to remember how we modelled the parent/child relationship for SR-IOV PCI cards. NPIV is a very similar concept, so we should ideally seek to model the parent/child relationship in the same manner.
Daniel

On 20/11/2012 11:17, Daniel P. Berrange wrote:
If we do the mapping of HBAs to guest domains using storage pools, then at the guest level, migration requires zero work.
I guess that means adding <disk type='volume'>?
It is simply up to the management app to create the storage pool on the destination host with the same name + UUID, but with the secondary WWNN/WWPN. The nice thing about this is that you don't need to hardcode details of a secondary WWNN/WWPN up-front. The management app can just decide on those at the time it performs the migration, so 99% of the time there will only need to be a single vHBA set up on the SAN. During migration the mgmt app can set up a second vHBA for the target host, and once complete, delete the original vHBA entirely.
Right, I think this is the right approach, because it lets us proceed step by step. As a further step, creation and deletion of the HBAs can be moved into libvirt, as in Osier's proposal. I don't like making the primary/secondary relationship explicit in the XML, but perhaps you could add a pool of WWNN/WWPNs (really just two of them) to the storage pool, and pass the active pair to the destination in the migration cookie. The destination can then pick one that doesn't match.

To really make things "require zero work" for the sysadmin as well, you could have a "delayed open" option for QEMU disks. It would let us recycle the same WWNN/WWPN on both the source and destination, but you would have to shut down the vHBA on the source and bring it up on the destination while the guest is down. I'm afraid that this would cause too much downtime for the guest, since you have to wait for the destination to finish scanning devices.

Paolo
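A sketch of what such a WWNN/WWPN pool could look like in the storage pool XML - the repeated <adapter> form and its attribute names are invented for illustration, reusing the example WWNs from this thread:

<pool type='scsi'>
  <name>npiv-pool</name>
  <source>
    <!-- two candidate identities; the active pair travels in the
         migration cookie and the destination picks the other one -->
    <adapter type='fc_host' wwnn='2001001b32a9da4e' wwpn='2101001b32a90002'/>
    <adapter type='fc_host' wwnn='2001001b32a9da4e' wwpn='2101001b32a90004'/>
  </source>
</pool>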