On Wed, Feb 21, 2024 at 05:40:05PM +0100, Michal Prívozník wrote:
On 2/20/24 17:29, Martin Kletzander wrote:
> On Tue, Feb 20, 2024 at 04:53:29PM +0100, Michal Privoznik wrote:
>> As of v9.8.0-rc1~7 we check whether two <memory/> devices don't
>> overlap (since we allow setting where a <memory/> device should
>> be mapped to). We do this pretty straightforward, by comparing
>> start and end address of each <memory/> device combination.
>> But since only the start address is given (an exposed in the
>> XML), the end address is computed trivially as:
>>
>> start + mem->size * 1024
>>
>> And for majority of memory device types this works. Except for
>> NVDIMMs. For them the <memory/> device consists of two separate
>> regions: 1) actual memory device, and 2) label.
>>
>> Label is where NVDIMM stores some additional information like
>> namespaces partition and so on. But it's not mapped into the
>> guest the same way as actual memory device. In fact, mem->size is
>> a sum of both actual memory device and label sizes. And to make
>> things a bit worse, both sizes are subject to alignment (either
>> the alignsize value specified in XML, or system page size if not
>> specified in XML).
>>
>> Therefore, to get the size of actual memory device we need to
>> take mem->size and substract label size rounded up to alignment.
>>
>> If we don't do this we report there's an overlap between two
>> NVDIMMs even when in reality there's none.
>>
>> Fixes: 3fd64fb0e236fc80ffa2cc977c0d471f11fc39bf
>> Fixes: 91f9a9fb4fc0d34ed8d7a869de3d9f87687c3618
>> Resolves:
>>
https://issues.redhat.com/browse/RHEL-4452?focusedId=23805174#comment-238...
>> Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
>> ---
>> src/conf/domain_validate.c | 50 ++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 48 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c
>> index 46479f10f2..5a9398b545 100644
>> --- a/src/conf/domain_validate.c
>> +++ b/src/conf/domain_validate.c
>> @@ -2225,6 +2225,52 @@ virDomainHostdevDefValidate(const
>> virDomainHostdevDef *hostdev)
>> }
>>
>>
>> +/**
>> + * virDomainMemoryGetMappedSize:
>> + * @mem: memory device definition
>> + *
>> + * For given memory device definition (@mem) calculate size mapped into
>> + * the guest. This is usually mem->size, except for NVDIMM where its
>> + * label is mapped elsewhere.
>> + *
>> + * Returns: Number of bytes a memory device takes when mapped into a
>> + * guest.
>> + */
>> +static unsigned long long
>> +virDomainMemoryGetMappedSize(const virDomainMemoryDef *mem)
>> +{
>> + unsigned long long ret = mem->size;
>> +
>> + if (mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM) {
>> + unsigned long long alignsize = mem->source.nvdimm.alignsize;
>> + unsigned long long labelsize = 0;
>> +
>> + /* For NVDIMM the situation is a bit more complicated. Firstly,
>> + * its <label/> is not mapped as a part of memory device, so we
>> + * must subtract label size from NVDIMM size. Secondly, label is
>> + * also subject to alignment so we need to round it up before
>> + * subtraction. */
>> +
>> + if (alignsize == 0) {
>> + long pagesize = virGetSystemPageSizeKB();
>> +
>> + /* If no alignment is specified in the XML, fallback to
>> + * system page size alignment. */
>> + if (pagesize > 0)
>> + alignsize = pagesize;
>
> I'm not very well versed in memory modules and architectures, but isn't
> this restricted a bit more, or rather isn't the QEMU default slightly
> different? The following two functions suggest it might be:
>
> qemuDomainGetMemorySizeAlignment
>
> qemuDomainGetMemoryModuleSizeAlignment
Looking into QEMU sources, this is independent. I'm looking at
nvdimm_prepare_memory_region() declared at hw/mem/nvdimm.c in which I
can find:
if (!dimm->hostmem) {
error_setg(errp, "'" PC_DIMM_MEMDEV_PROP "' property must
be set");
return;
}
mr = host_memory_backend_get_memory(dimm->hostmem);
align = memory_region_get_alignment(mr);
size = memory_region_size(mr);
pmem_size = size - nvdimm->label_size;
nvdimm->label_data = memory_region_get_ram_ptr(mr) + pmem_size;
pmem_size = QEMU_ALIGN_DOWN(pmem_size, align);
In here, dimm->hostmem points to corresponding memory-backend-file
object with NVDIMM path (/memory/source/path in our XML). IOW, any huge
pages setting made to domain is independent of this. Then,
memory_region_get_alignment() gets alignment of that memory region,
which (looking at file_backend_memory_alloc()) corresponds to .align
attribute of the aforementioned memory-backend-file object => it
corresponds to /memory/source/alignsize in our XML or
mem->source.nvdimm.alignsize in our code. Finally, size reflects memory
size (.size attribute => /memory/target/size => mem->size).
I was more afraid about the QEMU vs. libvirt default of the alignment if
left unspecified. But since that would only apply to niche scenarios
and this patch makes the code strictly better, so with the below diff
squashed in
Reviewed-by: Martin Kletzander <mkletzan(a)redhat.com>
Long story short, my comment "label is also subject to
alignment" is
misleading, as it is not. The remaining memory (after label_size was
substracted) is then aligned down.
Consider the following squashed into my commit:
diff --git i/src/conf/domain_validate.c w/src/conf/domain_validate.c
index 5a9398b545..faa7659f07 100644
--- i/src/conf/domain_validate.c
+++ w/src/conf/domain_validate.c
@@ -2247,9 +2247,10 @@ virDomainMemoryGetMappedSize(const virDomainMemoryDef *mem)
/* For NVDIMM the situation is a bit more complicated. Firstly,
* its <label/> is not mapped as a part of memory device, so we
- * must subtract label size from NVDIMM size. Secondly, label is
- * also subject to alignment so we need to round it up before
- * subtraction. */
+ * must subtract label size from NVDIMM size. Secondly,
+ * remaining memory is then aligned again (rounded down). But
+ * for our purposes we might just round label size up and
+ * achieve the same (numeric) result. */
if (alignsize == 0) {
long pagesize = virGetSystemPageSizeKB();
Nice catch!
Michal