On 2/20/24 17:29, Martin Kletzander wrote:
On Tue, Feb 20, 2024 at 04:53:29PM +0100, Michal Privoznik wrote:
> As of v9.8.0-rc1~7 we check whether two <memory/> devices don't
> overlap (since we allow setting where a <memory/> device should
> be mapped to). We do this pretty straightforward, by comparing
> start and end address of each <memory/> device combination.
> But since only the start address is given (an exposed in the
> XML), the end address is computed trivially as:
>
> start + mem->size * 1024
>
> And for majority of memory device types this works. Except for
> NVDIMMs. For them the <memory/> device consists of two separate
> regions: 1) actual memory device, and 2) label.
>
> Label is where NVDIMM stores some additional information like
> namespaces partition and so on. But it's not mapped into the
> guest the same way as actual memory device. In fact, mem->size is
> a sum of both actual memory device and label sizes. And to make
> things a bit worse, both sizes are subject to alignment (either
> the alignsize value specified in XML, or system page size if not
> specified in XML).
>
> Therefore, to get the size of actual memory device we need to
> take mem->size and substract label size rounded up to alignment.
>
> If we don't do this we report there's an overlap between two
> NVDIMMs even when in reality there's none.
>
> Fixes: 3fd64fb0e236fc80ffa2cc977c0d471f11fc39bf
> Fixes: 91f9a9fb4fc0d34ed8d7a869de3d9f87687c3618
> Resolves:
>
https://issues.redhat.com/browse/RHEL-4452?focusedId=23805174#comment-238...
> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
> ---
> src/conf/domain_validate.c | 50 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 48 insertions(+), 2 deletions(-)
>
> diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
> index 46479f10f2..5a9398b545 100644
> --- a/src/conf/domain_validate.c
> +++ b/src/conf/domain_validate.c
> @@ -2225,6 +2225,52 @@ virDomainHostdevDefValidate(const
> virDomainHostdevDef *hostdev)
> }
>
>
> +/**
> + * virDomainMemoryGetMappedSize:
> + * @mem: memory device definition
> + *
> + * For given memory device definition (@mem) calculate size mapped into
> + * the guest. This is usually mem->size, except for NVDIMM where its
> + * label is mapped elsewhere.
> + *
> + * Returns: Number of bytes a memory device takes when mapped into a
> + * guest.
> + */
> +static unsigned long long
> +virDomainMemoryGetMappedSize(const virDomainMemoryDef *mem)
> +{
> + unsigned long long ret = mem->size;
> +
> + if (mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM) {
> + unsigned long long alignsize = mem->source.nvdimm.alignsize;
> + unsigned long long labelsize = 0;
> +
> + /* For NVDIMM the situation is a bit more complicated. Firstly,
> + * its <label/> is not mapped as a part of memory device, so we
> + * must subtract label size from NVDIMM size. Secondly, label is
> + * also subject to alignment so we need to round it up before
> + * subtraction. */
> +
> + if (alignsize == 0) {
> + long pagesize = virGetSystemPageSizeKB();
> +
> + /* If no alignment is specified in the XML, fallback to
> + * system page size alignment. */
> + if (pagesize > 0)
> + alignsize = pagesize;
I'm not very well versed in memory modules and architectures, but isn't
this restricted a bit more, or rather isn't the QEMU default slightly
different? The following two functions suggest it might be:
qemuDomainGetMemorySizeAlignment
qemuDomainGetMemoryModuleSizeAlignment
Looking into QEMU sources, this is independent. I'm looking at
nvdimm_prepare_memory_region() declared at hw/mem/nvdimm.c in which I
can find:
if (!dimm->hostmem) {
error_setg(errp, "'" PC_DIMM_MEMDEV_PROP "' property must
be set");
return;
}
mr = host_memory_backend_get_memory(dimm->hostmem);
align = memory_region_get_alignment(mr);
size = memory_region_size(mr);
pmem_size = size - nvdimm->label_size;
nvdimm->label_data = memory_region_get_ram_ptr(mr) + pmem_size;
pmem_size = QEMU_ALIGN_DOWN(pmem_size, align);
In here, dimm->hostmem points to corresponding memory-backend-file
object with NVDIMM path (/memory/source/path in our XML). IOW, any huge
pages setting made to domain is independent of this. Then,
memory_region_get_alignment() gets alignment of that memory region,
which (looking at file_backend_memory_alloc()) corresponds to .align
attribute of the aforementioned memory-backend-file object => it
corresponds to /memory/source/alignsize in our XML or
mem->source.nvdimm.alignsize in our code. Finally, size reflects memory
size (.size attribute => /memory/target/size => mem->size).
Long story short, my comment "label is also subject to alignment" is
misleading, as it is not. The remaining memory (after label_size was
substracted) is then aligned down.
Consider the following squashed into my commit:
diff --git i/src/conf/domain_validate.c w/src/conf/domain_validate.c
index 5a9398b545..faa7659f07 100644
--- i/src/conf/domain_validate.c
+++ w/src/conf/domain_validate.c
@@ -2247,9 +2247,10 @@ virDomainMemoryGetMappedSize(const virDomainMemoryDef *mem)
/* For NVDIMM the situation is a bit more complicated. Firstly,
* its <label/> is not mapped as a part of memory device, so we
- * must subtract label size from NVDIMM size. Secondly, label is
- * also subject to alignment so we need to round it up before
- * subtraction. */
+ * must subtract label size from NVDIMM size. Secondly,
+ * remaining memory is then aligned again (rounded down). But
+ * for our purposes we might just round label size up and
+ * achieve the same (numeric) result. */
if (alignsize == 0) {
long pagesize = virGetSystemPageSizeKB();
Nice catch!
Michal