-----Original Message-----
From: Peter Krempa <pkrempa(a)redhat.com>
Sent: 23 November 2020 15:20
To: Daniel P. Berrangé <berrange(a)redhat.com>
Cc: Michal Prívozník <mprivozn(a)redhat.com>; Thanos Makatos
<thanos.makatos(a)nutanix.com>; Suraj Kasi <suraj.kasi(a)nutanix.com>;
libvirt-list(a)redhat.com; John Levon <john.levon(a)nutanix.com>
Subject: Re: Libvirt NVME support
On Mon, Nov 23, 2020 at 15:01:31 +0000, Daniel Berrange wrote:
> On Mon, Nov 23, 2020 at 03:36:42PM +0100, Peter Krempa wrote:
> > On Mon, Nov 23, 2020 at 15:32:20 +0100, Michal Privoznik wrote:
[...]
> > No, the NVMe controller lives on PCIe. Here we are trying to emulate an
> > NVMe controller (as <controller>, if you look elsewhere in the other
> > subthread). The <disk> element here maps to individual emulated
> > namespaces for the emulated NVMe controller.
> >
> > If we'd try to map one <disk> per PCIe device, you'd prevent us from
> > emulating multiple namespaces.
>
> The odd thing here is that we're trying to expose a different host backing
> store for each namespace, hence the need to expose multiple <disk>.
>
> Does it even make sense if you expose a namespace "2" without first
> exposing a namespace "1" ?
[1]
>
> It makes me a little uneasy, as it feels like trying to export a
> regular disk, where we have a different host backing store for each
> partition. The difference I guess is that partition tables are a purely
> software construct, whereas namespaces are a hardware construct.
For this purpose I viewed the namespace to be akin to a LUN on a
SCSI bus. For now controllers usually have just one namespace
and the storage is directly connected to it.
In the other subthread I've specifically asked whether the nvme standard
has a notion of namespace hotplug. Since it does it seems to be very
similar to how we deal with SCSI disks.
Ad [1]. That can be a limitation here. I wonder actually if you can have
0 namespaces. If that's possible then the model still holds. Obviously
if we can't have 0 namespaces hotplug would be impossible.
It is possible to have a controller with no namespaces at all or to have gaps in
the namespace IDs, there's no requirement to start from 1. Controllers start
from 1 since that's the sensible thing to do. We can end up in situations with
random namespace IDs simply by adding and deleting namespaces.
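To illustrate how gaps arise, here is a minimal Python sketch of a toy namespace table for one emulated controller. The class and method names are made up for this example and are not part of libvirt, QEMU or any NVMe library; it only models the bookkeeping described above.

```python
class NamespaceTable:
    """Toy model of the namespaces attached to one emulated NVMe controller.

    Hypothetical helper for illustration only; not a libvirt or QEMU API.
    """

    def __init__(self):
        # Maps namespace ID (NSID) -> host backing store path.
        self._backing = {}

    def attach(self, nsid, backing):
        # NSID 0 is not a valid namespace ID, so IDs begin at 1.
        if nsid < 1:
            raise ValueError("NSID must be >= 1")
        if nsid in self._backing:
            raise ValueError(f"NSID {nsid} already in use")
        self._backing[nsid] = backing

    def detach(self, nsid):
        # Removing a namespace leaves a hole; other IDs are not renumbered.
        del self._backing[nsid]

    def active_nsids(self):
        return sorted(self._backing)


table = NamespaceTable()          # a controller with zero namespaces is fine
for nsid in (1, 2, 3):
    table.attach(nsid, f"/some/image{nsid}.raw")
table.detach(1)
table.detach(2)
table.attach(7, "/some/image7.raw")
print(table.active_nsids())       # IDs are now sparse and do not start at 1
```

Starting empty and ending up with IDs 3 and 7 is exactly the hotplug case: nothing requires namespace 1 to exist before namespace 2.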
> Exposing individual partitions to a disk was done in Xen, but most
> people think it was kind of a mistake, as you could get a partition
> without any containing disk. At least in this case we do have a
> NVME controller present so the namespace isn't orphaned, like the
> old Xen partitions.
Well, the difference is that the nvme device node in linux actually
consists of 3 separate parts:
/dev/nvme0n1p1:
  /dev/nvme0 - controller
  n1         - namespace
  p1         - partition
In this case we end up at the namespace component, so we don't really
deal in any way with partition. It's actually more similar to SCSI
albeit the SCSI naming in linux does in no way include the controller
which actually creates a mess.
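The three-part naming can be shown with a short Python sketch that splits a Linux NVMe device node into controller, namespace and partition. The function name and the regular expression are assumptions made for this example, and it only covers the plain /dev/nvmeXnYpZ form discussed here.

```python
import re

# Hypothetical helper: matches /dev/nvme<ctrl>[n<ns>[p<part>]] only.
_NVME_NODE = re.compile(r"^/dev/nvme(\d+)(?:n(\d+)(?:p(\d+))?)?$")


def parse_nvme_node(path):
    """Return (controller, namespace, partition); missing parts are None."""
    match = _NVME_NODE.match(path)
    if match is None:
        raise ValueError(f"not an NVMe device node: {path}")
    return tuple(int(g) if g is not None else None for g in match.groups())


print(parse_nvme_node("/dev/nvme0n1p1"))  # controller 0, namespace 1, partition 1
print(parse_nvme_node("/dev/nvme0n1"))    # the namespace level a <disk> maps to
print(parse_nvme_node("/dev/nvme0"))      # the controller itself
```

The namespace level (the middle form) is where the <disk> mapping discussed here ends up; the partition part never enters the picture.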
Agreed, the partition exists solely within the host, so this isn't a problem.
Also, I think the analogy of SCSI controller == NVMe controller and
SCSI LUN == NVMe namespace is pretty accurate for all practical purposes.
> The alternative is to say only one host backing store, and then either
> let the guest dynamically carve it up into namespaces, or have some
> data format in the host backing store to represent the namespaces, or
> have an XML element to specify the regions of host backing that
> correspond to namespaces, eg
>
> <disk type="file" device="nvme">
> <source file="/some/file.qcow"/>
> <target bus="nvme"/>
> <namespaces>
> <region offset="0" size="1024000"/>
> <region offset="1024000" size="2024000"/>
> <region offset="2024000" size="4024000"/>
> </namespaces>
> <address type="pci" .../>
> </disk>
>
> this is of course less flexible, and I'm not entirely serious about
> suggesting this, but it's an option that exists nonetheless.
Eww. This is disgusting and borderline useless if you ever want to
modify the backing image, but it certainly can be achieved with multiple
'raw' format drivers.
I agree that this is too limiting.
I don't think the NVMe standard mandates that the memory backing the
namespace must be the same for all namespaces.
The NVMe spec says:
"A namespace is a quantity of non-volatile memory that may be formatted into
logical blocks." (v1.4)
So we can pretty much do whatever we want. Having a single NVMe controller
through which we can pass all disks to a VM can be useful because it simplifies
management and reduces resource consumption both in the guest and the host. But
we can definitely add as many controllers as we want should we need to.
For a less disgusting and more usable setup, the namespace element can
be a collection of <source> elements.
The above also will require use of virDomainUpdateDevice if you'd want
to change the backing store in any way since that's possible.
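As a rough illustration of the "collection of <source> elements" idea, the Python sketch below builds such a <disk> element with the standard library. The XML layout is purely hypothetical: it follows the shape floated in this thread and is not an existing libvirt schema, and the function name and file paths are invented for the example.

```python
import xml.etree.ElementTree as ET


def namespaces_disk_xml(backing_files):
    """Build a HYPOTHETICAL <disk> with one <source> per namespace.

    This layout is only a sketch of the proposal in this thread;
    libvirt does not define a <namespaces> element like this.
    """
    disk = ET.Element("disk", {"type": "file", "device": "nvme"})
    ET.SubElement(disk, "target", {"bus": "nvme"})
    namespaces = ET.SubElement(disk, "namespaces")
    for path in backing_files:
        # Each namespace gets its own independent host backing store.
        ET.SubElement(namespaces, "source", {"file": path})
    return ET.tostring(disk, encoding="unicode")


print(namespaces_disk_xml(["/some/ns1.qcow2", "/some/ns2.raw"]))
```

Compared with the offset/size regions, each backing image here can be resized or replaced on its own, which is the flexibility argued for above.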