
Hi,

We wanted to check if it's possible to specify a disk's target as nvme (so that the disk shows up as an NVMe disk to the guest VM).

Per the libvirt documentation it looks like (since libvirt 6.0.0) we can specify the disk type as nvme and the disk's source as an NVMe device. But the documentation does not say anything about being able to specify the disk's target as nvme. Is it possible to present the disk to the guest as an NVMe disk, and if so, how?

Example from the libvirt documentation (https://libvirt.org/formatdomain.html)

-----
<disk type='nvme' device='disk'>
  <driver name='qemu' type='raw'/>
  <source type='pci' managed='yes' namespace='1'>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
  <target dev='vde' bus='virtio'/>
</disk>
-----

But we want to do something similar to the following: two vNVMe controllers, where the first one has one namespace and the second one has two namespaces.

-----
# first NVMe controller, one namespace
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='iscsi' name='iqn.2013-07.com.example:iscsi-nopool/123'>
    <host name='example.com' port='3260'/>
    <auth username='myuser'>
      <secret type='iscsi' usage='libvirtiscsi'/>
    </auth>
  </source>
  <target dev='nvme0' bus='nvme'/>
</disk>

# second NVMe controller, first namespace
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='iscsi' name='iqn.2013-07.com.example:iscsi-nopool/456'>
    <host name='example.com' port='3260'/>
    <auth username='myuser'>
      <secret type='iscsi' usage='libvirtiscsi'/>
    </auth>
  </source>
  <target dev='nvme1' namespace='1' bus='nvme'/>
</disk>

# second NVMe controller, second namespace
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='iscsi' name='iqn.2013-07.com.example:iscsi-nopool/789'>
    <host name='example.com' port='3260'/>
    <auth username='myuser'>
      <secret type='iscsi' usage='libvirtiscsi'/>
    </auth>
  </source>
  <target dev='nvme1' namespace='2' bus='nvme'/>
</disk>
-----

If this is not yet supported, would it be merged if we were to implement it?

Thanks,
Suraj

On Mon, Nov 09, 2020 at 16:38:11 +0000, Suraj Kasi wrote:
Hi,
We wanted to check if it's possible to specify a disk's target as nvme (so that the disk shows up as an NVMe disk to the guest VM).
Per the libvirt documentation it looks like (since libvirt 6.0.0) we can specify the disk type as nvme and the disk's source as an NVMe device. But the documentation does not say anything about being able to specify the disk's target as nvme. Is it possible to present the disk to the guest as an NVMe disk, and if so, how?
NVMe device emulation is not supported at this point. I'm not even sure what the state of the feature in qemu upstream is. If you have a real NVMe device, you can obviously use PCI device assignment with it to pass it to the guest OS.
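For reference, passing a real host NVMe device through is normally expressed with a <hostdev> PCI assignment element rather than a <disk>; a minimal sketch (the PCI address here is only an example and must match the actual device on the host):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <!-- host PCI address of the NVMe device; example values -->
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>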

On 09 November 2020 16:44, Peter Krempa <pkrempa@redhat.com> wrote:
On Mon, Nov 09, 2020 at 16:38:11 +0000, Suraj Kasi wrote:
Hi,
We wanted to check if it's possible to specify a disk's target as nvme (so that the disk shows up as an NVMe disk to the guest VM).
Per the libvirt documentation it looks like (since libvirt 6.0.0) we can specify the disk type as nvme and the disk's source as an NVMe device. But the documentation does not say anything about being able to specify the disk's target as nvme. Is it possible to present the disk to the guest as an NVMe disk, and if so, how?
NVMe device emulation is not supported at this point. I'm not even sure what the state of the feature in qemu upstream is.
In older QEMU versions (~2.12) it was broken, not sure whether it's fixed now. In any case, we plan to provide NVMe emulation using SPDK once the multiprocess QEMU and vfio-user/out-of-process device emulation patch series are merged.
If you have a real NVMe device, you can obviously use PCI device assignment with it to pass it to the guest os.
We want a _virtual_ NVMe controller in the guest where the backend can be connected to anything, e.g. iSCSI, raw block, NVMe, etc.

Hi Peter,

Just wanted to follow up. As Thanos mentioned, we want a virtual NVMe controller in the guest, for which support doesn't yet exist in libvirt. Is it something that would be accepted if we were to implement it?

Thanks,
Suraj

On 11/9/20, 8:54 AM, "Thanos Makatos" <thanos.makatos@nutanix.com> wrote:
[...]

On Mon, Nov 16, 2020 at 23:01:00 +0000, Suraj Kasi wrote:
Hi Peter,
Just wanted to follow up. As Thanos mentioned, we want a virtual NVMe controller in the guest, for which support doesn't yet exist in libvirt. Is it something that would be accepted if we were to implement it?
Sure. Preferably post your proposed design of the XML as an RFC patch on the list so that the design can be discussed first, without wasting any development work. As a separate question, is there any performance benefit to emulating an NVMe controller compared to e.g. virtio-scsi?

As a separate question, is there any performance benefit of emulating a NVMe controller compared to e.g. virtio-scsi?
We haven't measured that yet; I would expect it to be slightly faster and/or more CPU efficient but wouldn't be surprised if it isn't. The main benefit of using NVMe is that we don't have to install VirtIO drivers in the guest.

On Wed, Nov 18, 2020 at 09:57:14 +0000, Thanos Makatos wrote:
As a separate question, is there any performance benefit of emulating a NVMe controller compared to e.g. virtio-scsi?
We haven't measured that yet; I would expect it to be slight faster and/or more CPU efficient but wouldn't be surprised if it isn't. The main benefit of using NVMe is that we don't have to install VirtIO drivers in the guest.
Okay, I'm not sold on the drivers bit, but that is definitely not a problem with regard to adding support for emulating NVMe controllers to libvirt.

As a starting point, a trivial way to model this in the XML would be:

<controller type='nvme' index='1' model='nvme'>

And then add the storage into it as:

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest1.qcow2'/>
  <target dev='sda' bus='nvme'/>
  <address type='drive' controller='1' bus='0' target='0' unit='0'/>
</disk>

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest2.qcow2'/>
  <target dev='sdb' bus='nvme'/>
  <address type='drive' controller='1' bus='0' target='0' unit='1'/>
</disk>

The 'drive' address here maps the disk to the controller. This example uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.

You can theoretically also add your own address type if 'drive' doesn't look right.

This model will have problems with hotplug/unplug if the NVMe spec doesn't actually allow hotplug of a single namespace into a controller, as libvirt's hotplug APIs only deal with one element at a time.

We could theoretically work around this by allowing hotplug of disks corresponding to the namespaces while the controller is not attached yet, with the attach of the controller then attaching both the backends and the controller. This is a bit hacky though.

Another obvious solution is to disallow hotplug of the namespaces and thus also of the controller.

On 11/18/20 11:24 AM, Peter Krempa wrote:
On Wed, Nov 18, 2020 at 09:57:14 +0000, Thanos Makatos wrote:
[...]
Would it make sense to relax the current limitation in libvirt and allow <disk type='nvme'/> (which is meant for cases where the backend is an NVMe disk) to be on a bus other than 'virtio'?

Michal

On Wed, Nov 18, 2020 at 20:31:03 +0100, Michal Privoznik wrote:
On 11/18/20 11:24 AM, Peter Krempa wrote:
On Wed, Nov 18, 2020 at 09:57:14 +0000, Thanos Makatos wrote:
[...]
Would it make sense to relax the current limitation in libvirt and allow <disk type='nvme'/> (which is meant for cases where the backend is a NVMe disk) to be on something else than 'virtio' bus?
This is really orthogonal to the emulated NVMe controller. A <disk type='nvme'> can theoretically back any disk frontend. I don't remember now why we actually mandate it only for virtio. Do you?

Apart from that, it doesn't make that much sense to use <disk type='nvme'> to back an emulated NVMe drive again; you are way better off using direct PCI assignment.

As a starting point a trivial way to model this in the XML will be:
<controller type='nvme' index='1' model='nvme'>
And then add the storage into it as:
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest1.qcow2'/> <target dev='sda' bus='nvme'/>
'target dev' is how the device appears in the guest, right? It should be something like 'nvme0n1'. I'm not sure though this is something that we can put here anyway, I think the guest driver can number controllers arbitrarily. Maybe we should specify something like BDF? Or is this something QEMU will have to figure out how to do?
<address type='drive' controller='1' bus='0' target='0' unit='0'/> </disk>
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='sdb' bus='nvme'/> <address type='drive' controller='1' bus='0' target='0' unit='1'/> </disk>
The 'drive' address here maps the disk to the controller. This example uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.
You can theoretically also add your own address type if 'drive' doesn't look right.
This model will have problems with hotplug/unplug if the NVMe spec doesn't actually allow hotplug of a single namespace into a controller as libvirt's hotplug APIs only deal with one element at time.
The NVMe spec does allow hotplug/unplug of namespaces, so libvirt should be fine supporting this?

On Thu, Nov 19, 2020 at 10:17:56 +0000, Thanos Makatos wrote:
As a starting point a trivial way to model this in the XML will be:
<controller type='nvme' index='1' model='nvme'>
And then add the storage into it as:
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest1.qcow2'/> <target dev='sda' bus='nvme'/>
'target dev' is how the device appears in the guest, right? It should be something like 'nvme0n1'. I'm not sure though this is something that we can put here anyway, I think the guest driver can number controllers arbitrarily.
Well, it was supposed to be like that but really is not. Even with other buses the kernel can name the device arbitrarily, so it doesn't really matter.
Maybe we should specify something like BDF? Or is this something QEMU will have to figure out how to do?
<address type='drive' controller='1' bus='0' target='0' unit='0'/> </disk>
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='sdb' bus='nvme'/> <address type='drive' controller='1' bus='0' target='0' unit='1'/> </disk>
The 'drive' address here maps the disk to the controller. This example uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.
You can theoretically also add your own address type if 'drive' doesn't look right.
This model will have problems with hotplug/unplug if the NVMe spec doesn't actually allow hotplug of a single namespace into a controller as libvirt's hotplug APIs only deal with one element at time.
The NVMe spec does allow hotplug/unplug of namespaces, so libvirt should be fine supporting this?
Ah, cool, in that case there shouldn't be any problem. You can attach a controller and then attach namespaces to it, or the other way around. The problem would be if a namespace needed to be attached simultaneously with the controller.
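As a sketch of that ordering (reusing the XML proposed earlier in this thread; the file path is illustrative and the address syntax is still under discussion):

<!-- step 1: attach the emulated controller, e.g. via virsh attach-device -->
<controller type='nvme' index='1' model='nvme'/>

<!-- step 2: attach a namespace as a disk mapped to that controller -->
<disk type='file' device='disk'>
  <source file='/Host/namespace1.qcow2'/>
  <target dev='sda' bus='nvme'/>
  <address type='drive' controller='1' bus='0' target='0' unit='0'/>
</disk>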

On Thu, Nov 19, 2020 at 10:17:56 +0000, Thanos Makatos wrote:
As a starting point a trivial way to model this in the XML will be:
<controller type='nvme' index='1' model='nvme'>
And then add the storage into it as:
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest1.qcow2'/> <target dev='sda' bus='nvme'/>
'target dev' is how the device appears in the guest, right? It should be something like 'nvme0n1'. I'm not sure though this is something that we can put here anyway, I think the guest driver can number controllers arbitrarily.
Well, it was supposed to be like that but really is not. Even with other buses the kernel can name the device arbitrarily, so it doesn't really matter.
Maybe we should specify something like BDF? Or is this something QEMU will have to figure out how to do?
<address type='drive' controller='1' bus='0' target='0' unit='0'/> </disk>
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='sdb' bus='nvme'/> <address type='drive' controller='1' bus='0' target='0' unit='1'/> </disk>
Revisiting your initial suggestion, it should be something like this (s/sdb/nvme0):

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest2.qcow2'/>
  <target dev='nvme0' bus='nvme'/>
  <address type='drive' controller='1' bus='0' target='0' unit='1'/>
</disk>
The 'drive' address here maps the disk to the controller. This example
IIUC we need a way to associate storage (this XML snippet) with the controller you defined earlier (<controller type='nvme' index='1' model='nvme'>). So shouldn't we only require associating this piece of storage with the controller based on the index?
uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.
I think 'namespace' or 'ns' would be more suitable than 'unit'. What are 'bus' and 'target' here, and why do they have to be 0? Do we really need dev='nvme0' in <target ...>? Specifying the controller index should be enough, no? Wouldn't the following contain the minimum amount of information to unambiguously map this piece of storage to the controller?

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest2.qcow2'/>
  <target dev='sdb' bus='nvme'/>
  <address controller='1' ns='1'/>
</disk>

On Mon, Nov 23, 2020 at 09:47:23 +0000, Thanos Makatos wrote:
On Thu, Nov 19, 2020 at 10:17:56 +0000, Thanos Makatos wrote:
[...]
Revistiting your initial suggestion, it should be something like this (s/sdb/nvme0):
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='nvme0' bus='nvme'/> <address type='drive' controller='1' bus='0' target='0' unit='1'/> </disk>
Note that the parser for 'dev' is a bit quirky, old, and used in many places besides the qemu driver. It's also used with numbers in non-qemu cases. Extending that to parse numbers for nvme but not for sda might become ugly very quickly. Sticking with a letter at the end ('nvmea') might be a more straightforward approach.
The 'drive' address here maps the disk to the controller. This example
IIUC we need a way to associate storage (this XML snippet) with the controller you defined earlier (<controller type='nvme' index='1' model='nvme'>). So shouldn't we only require associating this piece of storage with the controller based on the index?
No. The common approach is to do it via what's specified as <address>.
uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.
I think 'namespace' or 'ns' would be more suitable instead of 'unit'. What are 'bus' and 'target' here? And why do they have to be 0? Do we really need dev='nvme0' in <target ...>? Specifying the controller index should be enough, no?
You certainly can add <address type='nvme' controller='1' ns='2'/>
Wouldn't this contain the minimum amount of information to unambiguously map this piece of storage to the controller?
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='sdb' bus='nvme'/> <address controller='1' ns='1'/> </disk>
That certainly is correct if you include the "type='nvme'" attribute.

On Mon, Nov 23, 2020 at 09:47:23 +0000, Thanos Makatos wrote:
On Thu, Nov 19, 2020 at 10:17:56 +0000, Thanos Makatos wrote:
[...]
Note that the parser for 'dev' is a bit quirky, old, and used in many places besides the qemu driver. It's also used with numbers in non-qemu cases. Extending that to parse numbers for nvme but not for sda might become ugly very quickly. Sticking with a letter at the end ('nvmea' might be a more straightforward approach.
Then I think we should just stick with 'nvme'.
The 'drive' address here maps the disk to the controller. This example
IIUC we need a way to associate storage (this XML snippet) with the
controller
you defined earlier (<controller type='nvme' index='1' model='nvme'>). So shouldn't we only require associating this piece of storage with the controller based on the index?
No. The common approach is to do it via what's specified as <address>
uses unit= as the way to specify the namespace ID. Both 'bus' and
'target'
must be 0.
I think 'namespace' or 'ns' would be more suitable instead of 'unit'. What are 'bus' and 'target' here? And why do they have to be 0? Do we really need dev='nvme0' in <target ...>? Specifying the controller index should be enough, no?
You certainly can add <address type='nvme' controller='1' ns='2'/>
Wouldn't this contain the minimum amount of information to unambiguously map this piece of storage to the controller?
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='sdb' bus='nvme'/> <address controller='1' ns='1'/> </disk>
That certainly is correct if you include the "type='nvme'" attribute.
Great, so the following would be a good place for us to start?

<controller type='nvme' index='1' model='nvme'>

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest2.qcow1'/>
  <target dev='nvme' bus='nvme'/>
  <address type='nvme' controller='1' ns='1'/>
</disk>

<disk type='file' device='disk'>
  <source dev='/Host/QEMUGuest2.qcow2'/>
  <target dev='nvme' bus='nvme'/>
  <address type='nvme' controller='1' ns='2'/>
</disk>

On Mon, Nov 23, 2020 at 13:07:51 +0000, Thanos Makatos wrote:
[...]
Note that the parser for 'dev' is a bit quirky, old, and used in many places besides the qemu driver. It's also used with numbers in non-qemu cases. Extending that to parse numbers for nvme but not for sda might become ugly very quickly. Sticking with a letter at the end ('nvmea' might be a more straightforward approach.
Then I think we should just stick with 'nvme'.
You still need a way to "index" it somehow. The target must be unique for each disk. [...]
That certainly is correct if you include the "type='nvme'" attribute.
Great, so the following would be a good place for us to start?
<controller type='nvme' index='1' model='nvme'>
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow1'/> <target dev='nvme' bus='nvme'/> <address type='nvme' controller='1' ns='1'/> </disk>
<disk type='file' device='disk'> <source dev='/Host/QEMUGuest2.qcow2'/> <target dev='nvme' bus='nvme'/> <address type='nvme' controller='1' ns='2'/> </disk>
The address is reasonable this way.

On Mon, Nov 23, 2020 at 13:07:51 +0000, Thanos Makatos wrote:
[...]
You still need a way to "index" it somehow. The target must be unique for each disk.
I think I've misunderstood something: I thought controller='1' in <address ...> refers to index='1' in <controller ...>. So <address ...> should be:

<address type='nvme' index='1' controller='1' ns='2'/>

What's controller='1' then?

On Mon, Nov 23, 2020 at 16:48:55 +0000, Thanos Makatos wrote:
[...]
You still need a way to "index" it somehow. The target must be unique for each disk.
I think I've misunderstood something, I thought controller='1' in <address ...> refers to index='1' in <controller ...>. So <address ...> should be:
<address type='nvme' index='1' controller='1' ns='2'/>
What's controller='1' then?
What I meant by the above is that the value of "<target dev='THIS'" must be unique for every <disk>. I also wanted to advise not to use numbers for making it unique; numbers used there have a legacy meaning.

Your suggested <address type='nvme'> design looks good.

On 23 November 2020 16:56, Peter Krempa <pkrempa@redhat.com> wrote:
On Mon, Nov 23, 2020 at 16:48:55 +0000, Thanos Makatos wrote:
[...]
You still need a way to "index" it somehow. The target must be unique for each disk.
I think I've misunderstood something, I thought controller='1' in <address ...> refers to index='1' in <controller ...>. So <address ...> should be:
<address type='nvme' index='1' controller='1' ns='2'/>
What's controller='1' then?
What I meant by the above is that the value of "<target dev='THIS'" must be unique for every <disk>. I also wanted to advice to not use numbers for making it unique. Numbers used for it have a legacy meaning.
Your suggested <address type='nvme' design looks good.
OK, so we definitely need dev='...' in target, which is something like [a-zA-Z]+ and is unique. If this identifier is not controlled by the user, I think it would be best not to prefix it with 'nvme' (thus resulting in strings like 'nvmea' or 'nvmeabc'), as they can be rather confusing for people who don't know the details.

However, I still don't understand how index='1' and controller='1' in the address relate to index='1' in the controller:

<address type='nvme' index='1' controller='1' ns='2'/>

and

<controller type='nvme' index='1' model='nvme'>

On Mon, Nov 23, 2020 at 17:40:58 +0000, Thanos Makatos wrote:
[...]
OK, so we definitely need dev='...' in target which is something like [a-zA-Z]+ and is unique. If this identifier is not controlled by the user, I think it would be best not to prefix it with 'nvme' (thus resulting in strings like 'nvmea' or 'nvmeabc'), as they can be rather confusing for people who don't know the details.
However, I still don't understand how index='1' and controller='1' in address relate to index='1' in controller:
<address type='nvme' index='1' controller='1' ns='2'/>
index should not be here at all ..
and
<controller type='nvme' index='1' model='nvme'>
... then it makes sense.

On 23 November 2020 17:47, Peter Krempa <pkrempa@redhat.com> wrote:
On Mon, Nov 23, 2020 at 17:40:58 +0000, Thanos Makatos wrote:
[...]
However, I still don't understand how index='1' and controller='1' in address relate to index='1' in controller:
<address type='nvme' index='1' controller='1' ns='2'/>
index should not be here at all ..
and
<controller type='nvme' index='1' model='nvme'>
... then it makes sense.
Thanks, it makes perfect sense now.
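To recap how the pieces from this subthread would fit together, a consolidated sketch (the target names 'nvmea'/'nvmeb' and file paths are placeholders, and none of this is accepted libvirt schema yet):

<controller type='nvme' index='1' model='nvme'/>

<disk type='file' device='disk'>
  <source file='/Host/ns1.qcow2'/>
  <target dev='nvmea' bus='nvme'/>
  <address type='nvme' controller='1' ns='1'/>
</disk>

<disk type='file' device='disk'>
  <source file='/Host/ns2.qcow2'/>
  <target dev='nvmeb' bus='nvme'/>
  <address type='nvme' controller='1' ns='2'/>
</disk>

Here controller='1' in each <address> refers back to index='1' on the <controller>, and ns= selects the namespace ID within that controller.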

On Wed, Nov 18, 2020 at 11:24:30AM +0100, Peter Krempa wrote:
On Wed, Nov 18, 2020 at 09:57:14 +0000, Thanos Makatos wrote:
[...]
The 'drive' address here maps the disk to the controller. This example uses unit= as the way to specify the namespace ID. Both 'bus' and 'target' must be 0.
FWIW, I think that our overloading of type=drive for FDC, IDE, and SCSI was a mistake in retrospect. We should have had type=fdc, type=ide, type=scsi, since each uses a different subset of the attributes.

Let's not continue this mistake with NVMe - create a type=nvme address type.

I also wonder whether device='disk' makes sense too, as opposed to using device='nvme', since this is really not very similar to classic disks.

Regards,
Daniel

On 11/23/20 3:03 PM, Daniel P. Berrangé wrote:
On Wed, Nov 18, 2020 at 11:24:30AM +0100, Peter Krempa wrote:
[...]
FWIW, I think that our overloeading of type=drive for FDC, IDE, and SCSI was a mistake in retrospect. We should have had type=fdc, type=ide, type=scsi, since each uses a different subset of the attributes.
Lets not continue this mistake with NVME - create a type=nvme address type.
Don't NVMe devices live on a PCI(e) bus? Can't we just treat them as PCI devices? Or are we targeting SATA too? Because we also have that type of address.

Michal

On Mon, Nov 23, 2020 at 03:32:20PM +0100, Michal Prívozník wrote:
On 11/23/20 3:03 PM, Daniel P. Berrangé wrote:
[...]
FWIW, I think that our overloeading of type=drive for FDC, IDE, and SCSI was a mistake in retrospect. We should have had type=fdc, type=ide, type=scsi, since each uses a different subset of the attributes.
Lets not continue this mistake with NVME - create a type=nvme address type.
Don't NVMes live on a PCI(e) bus? Can't we just threat NVMes as PCI devices? Or are we targeting sata too? Bcause we also have that type of address.
IIUC, the NVMe *controller* lives on a PCI bus, and it can have any number of namespaces associated with it. In real hardware the namespaces can be changed dynamically on the fly. So these <disk> elements are the namespaces, not the controller, hence PCI isn't relevant AFAICT except for the <controller> device.

Regards,
Daniel
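In other words, only the emulated controller itself would carry a PCI address, roughly like this (hypothetical, since the nvme controller type does not exist in libvirt yet; the PCI address values are arbitrary):

<controller type='nvme' index='1' model='nvme'>
  <!-- the controller is the PCIe device; namespaces hang off it as <disk> elements -->
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
</controller>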

On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:
On 11/23/20 3:03 PM, Daniel P. Berrangé wrote:
[...]
Lets not continue this mistake with NVME - create a type=nvme address type.
Don't NVMes live on a PCI(e) bus? Can't we just threat NVMes as PCI devices? Or are we targeting sata too? Bcause we also have that type of address.
No, the NVMe controller lives on PCIe. Here we are trying to emulate an NVMe controller (as a <controller>, if you look at the other subthread). The <disk> elements here map to individual emulated namespaces of the emulated NVMe controller.

If we tried to map one <disk> per PCIe device, that would prevent us from emulating multiple namespaces.

On Mon, Nov 23, 2020 at 03:36:42PM +0100, Peter Krempa wrote:
On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:
[...]
Don't NVMes live on a PCI(e) bus? Can't we just threat NVMes as PCI devices? Or are we targeting sata too? Bcause we also have that type of address.
No, the NVMe controller lives on PCIe. Here we are trying to emulate a NVMe controller (as <contoller> if you look elsewhere in the other subthread. The <disk> element here maps to individual emulated namespaces for the emulated NVMe controller.
If we'd try to map one <disk> per PCIe device, you'd prevent us from emulating multiple namespaces.
The odd thing here is that we're trying to expose a different host backing store for each namespace, hence the need to expose multiple <disk> elements.

Does it even make sense to expose a namespace "2" without first exposing a namespace "1"?

It makes me a little uneasy, as it feels like trying to export a regular disk where we have a different host backing store for each partition. The difference, I guess, is that partition tables are a purely software construct, whereas namespaces are a hardware construct.

Exposing individual partitions of a disk was done in Xen, but most people think it was kind of a mistake, as you could get a partition without any containing disk. At least in this case we do have an NVMe controller present, so the namespace isn't orphaned like the old Xen partitions.

The alternative is to say there is only one host backing store, and then either let the guest dynamically carve it up into namespaces, or have some data format in the host backing store to represent the namespaces, or have an XML element to specify the regions of host backing that correspond to namespaces, e.g.

<disk type="file" device="nvme">
  <source file="/some/file.qcow"/>
  <target bus="nvme"/>
  <namespaces>
    <region offset="0" size="1024000"/>
    <region offset="1024000" size="2024000"/>
    <region offset="2024000" size="4024000"/>
  </namespaces>
  <address type="pci" .../>
</disk>

This is of course less flexible, and I'm not entirely serious about suggesting it, but it's an option that exists nonetheless.

Regards,
Daniel

On Mon, Nov 23, 2020 at 15:01:31 +0000, Daniel Berrange wrote:
On Mon, Nov 23, 2020 at 03:36:42PM +0100, Peter Krempa wrote:
On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:
[...]
No, the NVMe controller lives on PCIe. Here we are trying to emulate a NVMe controller (as <contoller> if you look elsewhere in the other subthread. The <disk> element here maps to individual emulated namespaces for the emulated NVMe controller.
If we'd try to map one <disk> per PCIe device, you'd prevent us from emulating multiple namespaces.
The odd thing here is that we're trying expose different host backing store for each namespace, hence the need to expose multiple <disk>.
Does it even make sense if you expose a namespace "2" without first exposing a namespace "1" ?
[1]
It makes me a little uneasy, as it feels like trying to export an regular disk, where we have a different host backing store for each partition. The difference I guess is that partition tables are a purely software construct, where as namespaces are a hardware construct.
For this purpose I viewed the namespace as akin to a LUN on a SCSI bus. For now, controllers usually have just one namespace and the storage is directly connected to it.

In the other subthread I've specifically asked whether the NVMe standard has a notion of namespace hotplug. Since it does, it seems to be very similar to how we deal with SCSI disks.

Ad [1]: that can be a limitation here. I actually wonder if you can have 0 namespaces. If that's possible, then the model still holds. Obviously, if we can't have 0 namespaces, hotplug would be impossible.
Exposing individual partitions to a disk was done in Xen, but most people think it was kind of a mistake, as you could get a partition without any containing disk. At least in this case we do have a NVME controller present so the namespace isn't orphaned, like the old Xen partitons.
Well, the difference is that the NVMe device node in Linux actually consists of 3 separate parts, e.g. /dev/nvme0n1p1:

/dev/nvme0 - controller
n1         - namespace
p1         - partition

In this case we end up at the namespace component, so we don't really deal with partitions in any way. It's actually more similar to SCSI, albeit the SCSI naming in Linux in no way includes the controller, which actually creates a mess.
The alternative is to say only one host backing store, and then either let the guest dynamically carve it up into namespaces, or have some data format in the host backing store to represent the namespaces, or have an XML element to specify the regions of host backing that correspond to namespaces, eg
<disk type="file" device="nvme"> <source file="/some/file.qcow"/> <target bus="nvme"/> <namespaces> <region offset="0" size="1024000"/> <region offset="1024000" size="2024000"/> <region offset="2024000" size="4024000"/> </namespaces> <address type="pci" .../> </disk>
this is of course less flexible, and I'm not entirely serious about suggesting this, but its an option that exists none the less.
Eww. This is disgusting and borderline useless if you ever want to modify the backing image, but it certainly can be achieved with multiple 'raw' format drivers. I don't think the NVMe standard mandates that the memory backing a namespace must be the same for all namespaces.

For a less disgusting and more usable setup, the namespace element could be a collection of <source> elements.

The above will also require use of virDomainUpdateDevice if you want to change the backing store in any way, since that's possible.
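Purely to illustrate that last idea, a sketch of a <namespaces> element holding a collection of sources (this is not existing libvirt schema; the 'id' attribute and file paths are made up):

<disk type='file' device='nvme'>
  <target bus='nvme'/>
  <namespaces>
    <!-- one backing source per namespace ID; hypothetical syntax -->
    <source file='/some/ns1.qcow2' id='1'/>
    <source file='/some/ns2.qcow2' id='2'/>
  </namespaces>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
</disk>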

On 23 November 2020 15:20, Peter Krempa <pkrempa@redhat.com> wrote:
On Mon, Nov 23, 2020 at 15:01:31 +0000, Daniel Berrange wrote:
[...]
For this purpose I viewed the namespace to be akin to a LUN on a SCSI bus. For now controllers usually usually have just one namespace and the storage is directly connected to it.
In the other subthread I've specifically asked whether the nvme standard has a notion of namespace hotplug. Since it does it seems to be very similar to how we deal with SCSI disks.
Ad [1]. That can be a limitation here. I wonder actually if you can have 0 namespaces. If that's possible then the model still holds. Obviously if we can't have 0 namespaces hotplug would be impossible.
It is possible to have a controller with no namespaces at all, or to have gaps in the namespace IDs; there's no requirement to start from 1. Controllers start from 1 since that's the sensible thing to do. We can end up in situations with random namespace IDs simply by adding and deleting namespaces.
Exposing individual partitions to a disk was done in Xen, but most people think it was kind of a mistake, as you could get a partition without any containing disk. At least in this case we do have a NVME controller present so the namespace isn't orphaned, like the old Xen partitons.
Well, the difference is that the nvme device node in linux actually consists of 3 separate parts:
/dev/nvme0n1p1:
/dev/nvme0 - controller
n1         - namespace
p1         - partition
In this case we end up at the namespace component, so we don't really deal in any way with partition. It's actually more similar to SCSI albeit the SCSI naming in linux does in no way include the controller which actually creates a mess.
Agreed, the partition exists solely within the host, so this isn't a problem. Also, I think the analogy of SCSI controller == NVMe controller and SCSI LUN == NVMe namespace is pretty accurate for all practical purposes.
The alternative is to say only one host backing store, and then either let the guest dynamically carve it up into namespaces, or have some data format in the host backing store to represent the namespaces, or have an XML element to specify the regions of host backing that correspond to namespaces, eg
<disk type="file" device="nvme"> <source file="/some/file.qcow"/> <target bus="nvme"/> <namespaces> <region offset="0" size="1024000"/> <region offset="1024000" size="2024000"/> <region offset="2024000" size="4024000"/> </namespaces> <address type="pci" .../> </disk>
this is of course less flexible, and I'm not entirely serious about suggesting this, but its an option that exists none the less.
Eww. This is disgusting and borderline useless if you ever want to modify the backing image, but it certainly can be achieved with multiple 'raw' format drivers.
I agree that this is too limiting.
I don't think the NVMe standard mandates that the memory backing the namespace must be the same for all namespaces.
The NVMe spec says: "A namespace is a quantity of non-volatile memory that may be formatted into logical blocks." (v1.4) So we can pretty much do whatever we want. Having a single NVMe controller through which we can pass all disks to a VM can be useful because it simplifies management and reduces resource consumption both in the guest and the host. But we can definitely add as many controllers as we want should we need to.
For a less disgusting and more usable setup, the namespace element can be a collection of <source> elements.
The above also will require use of virDomainUpdateDevice if you'd want to change the backing store in any way since that's possible.