[PATCH v1 00/10] move memory module align to PostParse time

Hi,

This is work that I intended to do after the pSeries NVDIMM changes I made in commit v6.7.0-302-gd3f3c2c97f. The reasoning is that if ppc64 NVDIMM alignment can be done at PostParse time, regular ppc64 DIMM alignment can also be done the same way. After that I realized that the same logic can be applied to x86 as well, and then I started to see where we could eliminate explicit align calls.

The end result is that a bug that affects pSeries the most was found and fixed in patch 02, and we're now able to reflect the exact memory being used by the guest in the live XML, since the alignment is done at parse time.

The series can be fetched from:

https://gitlab.com/danielhb/libvirt/-/tree/memory_modules_postparse_v1

NOTE: I worked with the x86 logic based on what I could assert from the code itself, i.e. the alignment mechanics (2MiB align for mem modules, 1MiB align for total memory) are related to QEMU internals, not hypervisor-agnostic. If someone with authority can guarantee that the alignment restrictions for x86 are applicable to all hypervisors (as in the pSeries case), let me know and I'll respin the series putting the x86 code in virDomainMemoryDefPostParse().
Daniel Henrique Barboza (10):
  domain_conf.c: handle all ppc64 dimms in virDomainNVDimmAlignSizePseries
  qemu_domain.c: align memory modules before calculating 'initialmem'
  domain_conf.c: align pSeries mem modules in virDomainMemoryDefPostParse()
  qemu_domain.c: simplify qemuDomainGetMemoryModuleSizeAlignment()
  qemu_domain.c: align x86 memory module in post parse time
  qemu_domain.c: move size check for mem modules after alignment
  qemu_hotplug.c: avoid aligning the mem obj in qemuDomainDetachPrepMemory()
  qemu_hotplug.c: avoid aligning the mem obj in qemuDomainAttachMemory()
  qemu_domain.c: remove qemuDomainMemoryDeviceAlignSize()
  NEWS.rst: document memory alignment improvements

 NEWS.rst                                      |  14 +++
 src/conf/domain_conf.c                        |  25 ++++-
 src/conf/domain_conf.h                        |   2 +-
 src/libvirt_private.syms                      |   2 +-
 src/qemu/qemu_domain.c                        | 106 +++++++----------
 src/qemu/qemu_domain.h                        |   2 -
 src/qemu/qemu_hotplug.c                       |   9 +-
 tests/qemuxml2argvdata/hugepages-nvdimm.xml   |   2 +-
 .../memory-hotplug-nvdimm-access.xml          |   2 +-
 .../memory-hotplug-nvdimm-align.xml           |   2 +-
 .../memory-hotplug-nvdimm-label.xml           |   2 +-
 .../memory-hotplug-nvdimm-pmem.xml            |   2 +-
 .../memory-hotplug-nvdimm-readonly.xml        |   2 +-
 .../memory-hotplug-nvdimm.xml                 |   2 +-
 .../memory-hotplug-ppc64-nonuma.args          |   2 +-
 tests/qemuxml2argvdata/pages-dimm-discard.xml |   2 +-
 .../memory-hotplug-dimm.xml                   |   4 +-
 17 files changed, 92 insertions(+), 90 deletions(-)

-- 
2.26.2
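As background for the cover letter's NOTE: "alignment" throughout this series means rounding a size up to the next multiple of a boundary, as libvirt's VIR_ROUND_UP macro does. A minimal Python sketch of that arithmetic (sizes in KiB; the boundary values mirror the cover letter's description of the QEMU mechanics and are not verified against every hypervisor):

```python
def round_up(size_kib, align_kib):
    """Round size_kib up to the next multiple of align_kib,
    mirroring libvirt's VIR_ROUND_UP macro."""
    return (size_kib + align_kib - 1) // align_kib * align_kib

# x86 per the cover letter: 2MiB for memory modules, 1MiB for total memory
assert round_up(4097, 2 * 1024) == 6144         # 2MiB module boundary
assert round_up(1048577, 1024) == 1049600       # 1MiB total-memory boundary

# pSeries uses a much coarser 256MiB boundary, so misalignment is noticeable
assert round_up(1050624, 256 * 1024) == 1310720  # ~1026MiB -> 1280MiB
```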

[PATCH v1 01/10] domain_conf.c: handle all ppc64 dimms in virDomainNVDimmAlignSizePseries

This simplifies the handling of ppc64 alignment in other parts of the
code, with a single function to handle all cases for the architecture.
No functional changes were made just yet. The function was renamed to
reflect the broader scope.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/domain_conf.c   | 15 +++++++++++----
 src/conf/domain_conf.h   |  2 +-
 src/libvirt_private.syms |  2 +-
 src/qemu/qemu_domain.c   |  4 ++--
 4 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index a3ca332279..c98e760aa2 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -5361,7 +5361,7 @@ virDomainMemoryDefPostParse(virDomainMemoryDefPtr mem,
      */
     if (ARCH_IS_PPC64(def->os.arch) &&
         mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
-        virDomainNVDimmAlignSizePseries(mem) < 0)
+        virDomainMemoryDeviceAlignSizePseries(mem) < 0)
         return -1;
 
     return 0;
@@ -16851,8 +16851,16 @@ virDomainSEVDefParseXML(xmlNodePtr sevNode,
 }
 
 int
-virDomainNVDimmAlignSizePseries(virDomainMemoryDefPtr mem)
+virDomainMemoryDeviceAlignSizePseries(virDomainMemoryDefPtr mem)
 {
+    unsigned long long ppc64AlignSize = 256 * 1024;
+    unsigned long long guestArea;
+
+    if (mem->model != VIR_DOMAIN_MEMORY_MODEL_NVDIMM) {
+        mem->size = VIR_ROUND_UP(mem->size, ppc64AlignSize);
+        return 0;
+    }
+
     /* For NVDIMMs in ppc64 in we want to align down the guest
      * visible space, instead of align up, to avoid writing
      * beyond the end of file by adding a potential 256MiB
@@ -16867,8 +16875,7 @@
      *
      * target_size = AlignDown(target_size - label_size) + label_size
      */
-    unsigned long long ppc64AlignSize = 256 * 1024;
-    unsigned long long guestArea = mem->size - mem->labelsize;
+    guestArea = mem->size - mem->labelsize;
 
     /* Align down guest_area. 256MiB is the minimum size. Error
      * out if target_size is smaller than 256MiB + label_size,
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h
index 77656c8ae3..fed451c4d5 100644
--- a/src/conf/domain_conf.h
+++ b/src/conf/domain_conf.h
@@ -3905,7 +3905,7 @@ virDomainBlockIoTuneInfoEqual(const virDomainBlockIoTuneInfo *a,
                               const virDomainBlockIoTuneInfo *b);
 
 int
-virDomainNVDimmAlignSizePseries(virDomainMemoryDefPtr mem);
+virDomainMemoryDeviceAlignSizePseries(virDomainMemoryDefPtr mem);
 
 bool
 virHostdevIsSCSIDevice(const virDomainHostdevDef *hostdev)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms
index 491bd5bd87..a44d37924a 100644
--- a/src/libvirt_private.syms
+++ b/src/libvirt_private.syms
@@ -546,7 +546,6 @@ virDomainNetTypeToString;
 virDomainNetUpdate;
 virDomainNostateReasonTypeFromString;
 virDomainNostateReasonTypeToString;
-virDomainNVDimmAlignSizePseries;
 virDomainObjAssignDef;
 virDomainObjBroadcast;
 virDomainObjCheckActive;
@@ -849,6 +848,7 @@ virDomainCachePolicyTypeFromString;
 virDomainCachePolicyTypeToString;
 virDomainMemoryAccessTypeFromString;
 virDomainMemoryAccessTypeToString;
+virDomainMemoryDeviceAlignSizePseries;
 virDomainMemoryLatencyTypeFromString;
 virDomainMemoryLatencyTypeToString;
 virDomainNumaCheckABIStability;
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 70896ad80e..375a9e5075 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8105,7 +8105,7 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
     for (i = 0; i < def->nmems; i++) {
         if (def->mems[i]->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
             ARCH_IS_PPC64(def->os.arch)) {
-            if (virDomainNVDimmAlignSizePseries(def->mems[i]) < 0)
+            if (virDomainMemoryDeviceAlignSizePseries(def->mems[i]) < 0)
                 return -1;
         } else {
             align = qemuDomainGetMemoryModuleSizeAlignment(def, def->mems[i]);
@@ -8142,7 +8142,7 @@ qemuDomainMemoryDeviceAlignSize(virDomainDefPtr def,
 {
     if (mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
         ARCH_IS_PPC64(def->os.arch)) {
-        return virDomainNVDimmAlignSizePseries(mem);
+        return virDomainMemoryDeviceAlignSizePseries(mem);
     } else {
         mem->size = VIR_ROUND_UP(mem->size,
                                  qemuDomainGetMemorySizeAlignment(def));
-- 
2.26.2
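A Python rendition of the two branches of the renamed function (sizes in KiB, as in the code; this is a sketch of the arithmetic only, not the libvirt implementation):

```python
PPC64_ALIGN = 256 * 1024  # 256MiB alignment, expressed in KiB

def align_pseries(size, labelsize=0, nvdimm=False):
    """Sketch of virDomainMemoryDeviceAlignSizePseries()'s arithmetic."""
    if not nvdimm:
        # regular DIMMs are rounded *up* to the next 256MiB boundary
        return (size + PPC64_ALIGN - 1) // PPC64_ALIGN * PPC64_ALIGN
    # NVDIMMs align the guest-visible area *down*, preserving the label:
    #   target_size = AlignDown(target_size - label_size) + label_size
    guest_area = (size - labelsize) // PPC64_ALIGN * PPC64_ALIGN
    return guest_area + labelsize

assert align_pseries(523264) == 524288  # 511MiB DIMM rounds up to 512MiB
assert align_pseries(600000, labelsize=128, nvdimm=True) == 524416
```

The error handling from the diff (rejecting NVDIMMs whose aligned guest area would fall below the 256MiB minimum) is omitted here for brevity.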

[PATCH v1 02/10] qemu_domain.c: align memory modules before calculating 'initialmem'

qemuDomainAlignMemorySizes() has an operation order problem. We are calculating 'initialmem' without aligning the memory modules first. Since we're aligning the dimms afterwards, this can create inconsistencies in the end result. x86 has alignment of 1-2MiB and is not severely impacted by it, but pSeries works with 256MiB alignment and the difference is noticeable.

This is the case of the existing 'memory-hotplug-ppc64-nonuma' test. The test consists of a 2GiB (aligned value) guest with 2 ~520MiB dimms, both unaligned. 'initialmem' is calculated by taking total_mem and subtracting the dimms size (via virDomainDefGetMemoryInitial()), which will give us 2GiB - 520MiB - 520MiB, ending up with a little more than 1GiB of 'initialmem'. Note that this value is now unaligned, and will be aligned up via VIR_ROUND_UP(), so we'll end up with an 'initialmem' of 1GiB + 256MiB. Given that the dimms are aligned later on, the end result for QEMU is that the guest will have a 'mem' size of 1310720k, plus the two 512MiB dimms, exceeding by 256MiB the desired 2GiB memory and currentMemory specified in the XML.

This behavior was found when trying to push ppc64 memory alignment to postparse, as done in the next patch. In that case, the memory modules are aligned before entering qemuDomainAlignMemorySizes() and the 'memory-hotplug-ppc64-nonuma' test started to fail.

Fortunately the fix is simple: align the memory modules before calculating 'initialmem' in qemuDomainAlignMemorySizes().
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c                        | 42 ++++++++++---------
 .../memory-hotplug-ppc64-nonuma.args          |  2 +-
 2 files changed, 24 insertions(+), 20 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 375a9e5075..5a17887fa1 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8083,25 +8083,9 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
         virDomainNumaSetNodeMemorySize(def->numa, i, mem);
     }
 
-    /* align initial memory size, if NUMA is present calculate it as total of
-     * individual aligned NUMA node sizes */
-    if (initialmem == 0)
-        initialmem = VIR_ROUND_UP(virDomainDefGetMemoryInitial(def), align);
-
-    if (initialmem > maxmemcapped) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                       _("initial memory size overflowed after alignment"));
-        return -1;
-    }
-
-    def->mem.max_memory = VIR_ROUND_UP(def->mem.max_memory, align);
-    if (def->mem.max_memory > maxmemkb) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
-                       _("maximum memory size overflowed after alignment"));
-        return -1;
-    }
-
-    /* Align memory module sizes */
+    /* Align memory module sizes. This needs to occur before 'initialmem'
+     * calculation because virDomainDefGetMemoryInitial() uses the size
+     * of the modules in the math. */
     for (i = 0; i < def->nmems; i++) {
         if (def->mems[i]->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
             ARCH_IS_PPC64(def->os.arch)) {
@@ -8122,6 +8106,26 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
         }
     }
 
+    /* Align initial memory size, if NUMA is present calculate it as total of
+     * individual aligned NUMA node sizes. */
+    if (initialmem == 0) {
+        align = qemuDomainGetMemorySizeAlignment(def);
+        initialmem = VIR_ROUND_UP(virDomainDefGetMemoryInitial(def), align);
+    }
+
+    if (initialmem > maxmemcapped) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                       _("initial memory size overflowed after alignment"));
+        return -1;
+    }
+
+    def->mem.max_memory = VIR_ROUND_UP(def->mem.max_memory, align);
+    if (def->mem.max_memory > maxmemkb) {
+        virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                       _("maximum memory size overflowed after alignment"));
+        return -1;
+    }
+
     virDomainDefSetMemoryTotal(def, initialmem + hotplugmem);
 
     return 0;
diff --git a/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
index 91cea9d8bf..78406f7f04 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
@@ -11,7 +11,7 @@ QEMU_AUDIO_DRV=none \
 -name QEMUGuest1 \
 -S \
 -machine pseries,accel=kvm,usb=off,dump-guest-core=off \
--m size=1310720k,slots=16,maxmem=4194304k \
+-m size=1048576k,slots=16,maxmem=4194304k \
 -realtime mlock=off \
 -smp 1,sockets=1,cores=1,threads=1 \
 -object memory-backend-ram,id=memdimm0,size=536870912 \
-- 
2.26.2

On Wed, 2020-11-11 at 19:07 -0300, Daniel Henrique Barboza wrote:
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
@@ -11,7 +11,7 @@ QEMU_AUDIO_DRV=none \
 -name QEMUGuest1 \
 -S \
 -machine pseries,accel=kvm,usb=off,dump-guest-core=off \
--m size=1310720k,slots=16,maxmem=4194304k \
+-m size=1048576k,slots=16,maxmem=4194304k \
 -realtime mlock=off \
 -smp 1,sockets=1,cores=1,threads=1 \
 -object memory-backend-ram,id=memdimm0,size=536870912 \
This doesn't look right: AFAIK the initial memory size is guest-visible, so by changing how the alignment is performed you might both change the guest ABI across guest boots (if libvirt is upgraded in between them) and break migration (if either the source or destination host is running the newer libvirt but the other side isn't).

Did I miss something that makes this okay?

-- 
Andrea Bolognani / Red Hat / Virtualization

On 11/13/20 7:30 AM, Andrea Bolognani wrote:
On Wed, 2020-11-11 at 19:07 -0300, Daniel Henrique Barboza wrote:
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
@@ -11,7 +11,7 @@ QEMU_AUDIO_DRV=none \
 -name QEMUGuest1 \
 -S \
 -machine pseries,accel=kvm,usb=off,dump-guest-core=off \
--m size=1310720k,slots=16,maxmem=4194304k \
+-m size=1048576k,slots=16,maxmem=4194304k \
 -realtime mlock=off \
 -smp 1,sockets=1,cores=1,threads=1 \
 -object memory-backend-ram,id=memdimm0,size=536870912 \
This doesn't look right: AFAIK the initial memory size is guest-visible, so by changing how the alignment is performed you might both change the guest ABI across guest boots (if libvirt is upgraded in between them) and break migration (if either the source or destination host is running the newer libvirt but the other side isn't).
Good point. I failed to consider ABI stability for ppc64 guest migration. Yes, this will break older guests that happen to have extra memory. In fact, this can be especially harmful for migration.

This means that I can't proceed with any other ppc64 changes made here. Aligning ppc64 DIMMs in PostParse will achieve the same results even without this patch - the DIMMs will be aligned before qemuDomainAlignMemorySizes() and initialmem will be rounded to a more precise value of 'currentMemory'. I can think of ways to align ppc64 DIMMs while not touching initialmem, but all of them would require extra state in the domain definition. The benefits are there (the DIMMs will be aligned in the live XML) but I'm not sure it's worth the extra code.

Thanks for pointing this out. I'll evaluate whether the x86 bits are still valuable and re-send them, since they're not messing with the initialmem calculation of x86 guests.

DHB
Did I miss something that makes this okay?

On Fri, 2020-11-13 at 09:58 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 7:30 AM, Andrea Bolognani wrote:
On Wed, 2020-11-11 at 19:07 -0300, Daniel Henrique Barboza wrote:
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
@@ -11,7 +11,7 @@ QEMU_AUDIO_DRV=none \
 -name QEMUGuest1 \
 -S \
 -machine pseries,accel=kvm,usb=off,dump-guest-core=off \
--m size=1310720k,slots=16,maxmem=4194304k \
+-m size=1048576k,slots=16,maxmem=4194304k \
 -realtime mlock=off \
 -smp 1,sockets=1,cores=1,threads=1 \
 -object memory-backend-ram,id=memdimm0,size=536870912 \
This doesn't look right: AFAIK the initial memory size is guest-visible, so by changing how the alignment is performed you might both change the guest ABI across guest boots (if libvirt is upgraded in between them) and break migration (if either the source or destination host is running the newer libvirt but the other side isn't).
Good point. I failed to consider ABI stability for ppc64 guest migration. Yes, this will break older guests that happen to have extra memory. In fact, this can be specially harmful for migration.
This means that I can't proceed with any other ppc64 changes made here. Aligning ppc64 DIMMs in PostParse will achieve the same results even without this patch - the DIMMs will be aligned before qemuDomainAlignMemorySizes() and initialmem will be rounded to a more precise value of 'currentMemory'. I can think of ways to align ppc64 DIMMs while not touching initialmem, but all of them will require extra state in the domain definition. The benefits are there (the DIMMs will be aligned in the live XML) but I'm not sure it's worth the extra code.
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?

-- 
Andrea Bolognani / Red Hat / Virtualization

On 11/13/20 10:51 AM, Andrea Bolognani wrote:
On Fri, 2020-11-13 at 09:58 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 7:30 AM, Andrea Bolognani wrote:
On Wed, 2020-11-11 at 19:07 -0300, Daniel Henrique Barboza wrote:
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
@@ -11,7 +11,7 @@ QEMU_AUDIO_DRV=none \
 -name QEMUGuest1 \
 -S \
 -machine pseries,accel=kvm,usb=off,dump-guest-core=off \
--m size=1310720k,slots=16,maxmem=4194304k \
+-m size=1048576k,slots=16,maxmem=4194304k \
 -realtime mlock=off \
 -smp 1,sockets=1,cores=1,threads=1 \
 -object memory-backend-ram,id=memdimm0,size=536870912 \
This doesn't look right: AFAIK the initial memory size is guest-visible, so by changing how the alignment is performed you might both change the guest ABI across guest boots (if libvirt is upgraded in between them) and break migration (if either the source or destination host is running the newer libvirt but the other side isn't).
Good point. I failed to consider ABI stability for ppc64 guest migration. Yes, this will break older guests that happen to have extra memory. In fact, this can be specially harmful for migration.
This means that I can't proceed with any other ppc64 changes made here. Aligning ppc64 DIMMs in PostParse will achieve the same results even without this patch - the DIMMs will be aligned before qemuDomainAlignMemorySizes() and initialmem will be rounded to a more precise value of 'currentMemory'. I can think of ways to align ppc64 DIMMs while not touching initialmem, but all of them will require extra state in the domain definition. The benefits are there (the DIMMs will be aligned in the live XML) but I'm not sure it's worth the extra code.
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?
Interesting. I suppose that moving the logic of qemuDomainAlignMemorySizes() to PostParse might allow us to align DIMMs without touching 'initialmem'. I'm not entirely sure whether that code depends on runtime stuff, though (the NUMA-related code, for instance). I'll investigate.

DHB

(Peter, I cc'ed you because I cited one of your commits and you might want to weigh in)

On 11/13/20 10:51 AM, Andrea Bolognani wrote:
On Fri, 2020-11-13 at 09:58 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 7:30 AM, Andrea Bolognani wrote:
On Wed, 2020-11-11 at 19:07 -0300, Daniel Henrique Barboza wrote:
+++ b/tests/qemuxml2argvdata/memory-hotplug-ppc64-nonuma.args
[...]
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?
I was looking into the code and I think I might have had the wrong idea here. Apparently we're not aligning memory during migration or snapshot restore. This specific line in qemu_command.c got my attention:

-- qemuBuildCommandLine() --

    if (!migrateURI && !snapshot &&
        qemuDomainAlignMemorySizes(def) < 0)
        return NULL;

------

I investigated the history behind this logic and found the following commit:

commit c7d7ba85a6242d789ba3f4dae313e950fbb638c5
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Thu Sep 17 08:14:05 2015 +0200

    qemu: command: Align memory sizes only on fresh starts

    When we are starting a qemu process for an incomming migration or
    snapshot reloading we should not modify the memory sizes in the domain
    since we could potentially change the guest ABI that was tediously
    checked before. Additionally the function now updates the initial
    memory size according to the NUMA node size, which should not happen
    if we are restoring state.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1252685

---------

This means that the changes made in this series will not break migration, since the alignment of 'initialmem' is not being triggered in those cases. Which is good.

However, you also brought up in an earlier reply that these changes might break "the guest ABI across guest boots (if libvirt is upgraded in between them)". This can't be helped, I think - an older ppc64 guest with 4GiB of 'currentMemory' in the XML, which ended up with 4GiB + 256MiB due to the way alignment was being done, will have exactly 4GiB of 'initialmem' after these changes. My point is that we're giving the guest exactly the memory it demanded, as intended by the domain XML, on a fresh guest start. This might be considered an ABI break, but why would one complain that libvirt is now giving precisely what was asked for? Avoiding granting an extra 256MiB of memory to domains seems worth it, given that we're not impacting live domains or migration.

Thanks,

DHB

On Fri, 2020-11-13 at 16:23 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 10:51 AM, Andrea Bolognani wrote:
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?
I was looking into the code and I think that might have the wrong idea here. Apparently we're not aligning memory during migration or snapshot restore. This specific line in qemu_command.c got my attention:
-- qemuBuildCommandLine() --
    if (!migrateURI && !snapshot &&
        qemuDomainAlignMemorySizes(def) < 0)
        return NULL;
------
I investigated the history behind this logic and found the following commit:
commit c7d7ba85a6242d789ba3f4dae313e950fbb638c5
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Thu Sep 17 08:14:05 2015 +0200
qemu: command: Align memory sizes only on fresh starts
When we are starting a qemu process for an incomming migration or snapshot reloading we should not modify the memory sizes in the domain since we could potentially change the guest ABI that was tediously checked before. Additionally the function now updates the initial memory size according to the NUMA node size, which should not happen if we are restoring state.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1252685
---------
This means that the changes made in this series will not break migration, since the alignment of 'initialmem' is not being triggered in those cases. Which is good.
However, you also brought up in an earlier reply that these changes might break "the guest ABI across guest boots (if libvirt is upgraded in between them)". This can't be helped I think - an older ppc64 guest with 4GiB of 'currentMemory' in the XML, that is ending up having 4.256GiB (extra 256MiB) due to the way alignment was being done, will have exactly 4GiB of 'initialmem' after these changes. My point is that we're giving the exact memory the guest is demanding, as intended by the domain XML, in a fresh guest start. This might be considered an ABI break probably, but why would one complain that Libvirt is now giving precisely what was asked for? Avoiding granting extra 256MBs of mem for domains seems worth it, given that we're not impacting live domains or migration.
In general, changing guest ABI between boots is not something that we want to happen.

I have trouble keeping all the details of memory alignment inside my head and I can't quite spend the time necessary to swap them back in right now, so apologies if I'm being vague, and of course correct me if I'm wrong... Having Peter in the thread will also surely help with that :)

The aim of this series should *not* be to change the calculations that happen when aligning memory, but only to reflect them back to the domain XML where they can be queried: so for example if the input

  <memory unit='GiB'>4</memory>
  <devices>
    <memory model='nvdimm'>
      <target>
        <size unit='MiB'>500</size>

results in the command line

  -m 4352m
  -object memory-backend-file,size=512m

(the exact numbers are not relevant), then what we want is for the XML to be updated at define time so that it reads

  <memory unit='MiB'>4864</memory>
  <devices>
    <memory model='nvdimm'>
      <target>
        <size unit='MiB'>512</size>

(again, the numbers are almost certainly wrong) and we want *that* XML to generate the same QEMU command line as before.

If this can't be achieved, or there are other side effects to it that I haven't considered, then we're better off leaving the current behavior alone (documenting the heck out of it if necessary) instead of changing it in ways that would alter guest ABI between boots.

-- 
Andrea Bolognani / Red Hat / Virtualization

On Mon, Nov 16, 2020 at 20:43:09 +0100, Andrea Bolognani wrote:
On Fri, 2020-11-13 at 16:23 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 10:51 AM, Andrea Bolognani wrote:
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?
I was looking into the code and I think that might have the wrong idea here. Apparently we're not aligning memory during migration or snapshot restore. This specific line in qemu_command.c got my attention:
-- qemuBuildCommandLine() --
    if (!migrateURI && !snapshot &&
        qemuDomainAlignMemorySizes(def) < 0)
        return NULL;
------
I investigated the history behind this logic and found the following commit:
commit c7d7ba85a6242d789ba3f4dae313e950fbb638c5
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Thu Sep 17 08:14:05 2015 +0200
qemu: command: Align memory sizes only on fresh starts
When we are starting a qemu process for an incomming migration or snapshot reloading we should not modify the memory sizes in the domain since we could potentially change the guest ABI that was tediously checked before. Additionally the function now updates the initial memory size according to the NUMA node size, which should not happen if we are restoring state.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1252685
---------
This means that the changes made in this series will not break migration, since the alignment of 'initialmem' is not being triggered in those cases. Which is good.
However, you also brought up in an earlier reply that these changes might break "the guest ABI across guest boots (if libvirt is upgraded in between them)". This can't be helped I think - an older ppc64 guest with 4GiB of 'currentMemory' in the XML, that is ending up having 4.256GiB (extra 256MiB) due to the way alignment was being done, will have exactly 4GiB of 'initialmem' after these changes. My point is that we're giving the exact memory the guest is demanding, as intended by the domain XML, in a fresh guest start. This might be considered an ABI break probably, but why would one complain that Libvirt is now giving precisely what was asked for? Avoiding granting extra 256MBs of mem for domains seems worth it, given that we're not impacting live domains or migration.
In general, changing guest ABI between boots is not something that we want to happen.
I have trouble keeping all the details of memory alignment inside my head and I can't quite spend the time necessary to swap them back in right now, so please apologies if I'm being vague and of course correct me if I'm wrong... Having Peter in the thread will also surely help with that :)
Either you have high hopes or my sarcasm detector is failing.

Just as one data point: the 'PostParse' callback happens _always_. Even when migrating the VM. My patch, according to the commit message, is specifically avoiding the alignment on migration. Thus the code can't be moved to the post parse callback.

The outcome documented in the NEWS just mentions that it's to update the live XML. Neither of the commit messages of this series mentions that there is an issue with migration, so the series needs to be re-evaluated in that light.

On 11/18/20 4:44 AM, Peter Krempa wrote:
On Mon, Nov 16, 2020 at 20:43:09 +0100, Andrea Bolognani wrote:
On Fri, 2020-11-13 at 16:23 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 10:51 AM, Andrea Bolognani wrote:
I only skimmed the remaining patches and didn't dig into the code as
[...]
In general, changing guest ABI between boots is not something that we want to happen.
I have trouble keeping all the details of memory alignment inside my head and I can't quite spend the time necessary to swap them back in right now, so please apologies if I'm being vague and of course correct me if I'm wrong... Having Peter in the thread will also surely help with that :)
Either you have high hopes or my sarcasm detector is failing.
Just as one data point: the 'PostParse' callback happen _always_. Even when migrating the VM. My patch according to the commit message is specifically avoiding the alignment on migration. Thus the code can't be moved to the post parse callback.
Not without considering the PARSE_ABI_UPDATE flag, and even then I'm not sure it's worth the trouble, as I mentioned earlier.
The outcome documented in the NEWS just mentions that it's to update the live XML.
Neither of the commit messages of this series mentions that there is an issue with migration so the series needs to be re-evaluated in that light.
This is correct. I got a lot of stuff wrong in v1. v2 will be way shorter in scope, but it will not mess with the existing migration/snapshot mechanics. My aim will be to fix the existing alignment calculation for pSeries DIMMs.

Thanks,

DHB

On 11/16/20 4:43 PM, Andrea Bolognani wrote:
On Fri, 2020-11-13 at 16:23 -0300, Daniel Henrique Barboza wrote:
On 11/13/20 10:51 AM, Andrea Bolognani wrote:
I only skimmed the remaining patches and didn't dig into the code as much, or as recently, as you have, but from a high-level perspective I don't see why you wouldn't be able to simply move the existing rounding logic from the command line generator to PostParse? It's not like the former has access to additional information that the latter can't get to, right?
I was looking into the code and I think that might have the wrong idea here. Apparently we're not aligning memory during migration or snapshot restore. This specific line in qemu_command.c got my attention:
-- qemuBuildCommandLine() --
    if (!migrateURI && !snapshot &&
        qemuDomainAlignMemorySizes(def) < 0)
        return NULL;
------
I investigated the history behind this logic and found the following commit:
commit c7d7ba85a6242d789ba3f4dae313e950fbb638c5
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Thu Sep 17 08:14:05 2015 +0200
qemu: command: Align memory sizes only on fresh starts
When we are starting a qemu process for an incomming migration or snapshot reloading we should not modify the memory sizes in the domain since we could potentially change the guest ABI that was tediously checked before. Additionally the function now updates the initial memory size according to the NUMA node size, which should not happen if we are restoring state.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1252685
---------
This means that the changes made in this series will not break migration, since the alignment of 'initialmem' is not being triggered in those cases. Which is good.
However, you also brought up in an earlier reply that these changes might break "the guest ABI across guest boots (if libvirt is upgraded in between them)". This can't be helped I think - an older ppc64 guest with 4GiB of 'currentMemory' in the XML, that is ending up having 4.256GiB (extra 256MiB) due to the way alignment was being done, will have exactly 4GiB of 'initialmem' after these changes. My point is that we're giving the exact memory the guest is demanding, as intended by the domain XML, in a fresh guest start. This might be considered an ABI break probably, but why would one complain that Libvirt is now giving precisely what was asked for? Avoiding granting extra 256MBs of mem for domains seems worth it, given that we're not impacting live domains or migration.
In general, changing guest ABI between boots is not something that we want to happen.
I have trouble keeping all the details of memory alignment inside my head and I can't quite spend the time necessary to swap them back in right now, so my apologies if I'm being vague, and of course correct me if I'm wrong... Having Peter in the thread will also surely help with that :)
The aim of this series should *not* be to change the calculations that happen when aligning memory, but only to reflect them back to the domain XML where they can be queried: so for example if the input
<memory unit='GiB'>4</memory>
<devices>
  <memory model='nvdimm'>
    <target>
      <size unit='MiB'>500</size>
results in the command line
-m 4352m -object memory-backend-file,size=512m
(the exact numbers are not relevant), then what we want is for the XML to be updated at define time so that it reads
I investigated it more and I think I got a few premises wrong here:

- I pointed out that the alignment changes aren't being reflected. This is wrong. I figured it out when launching real guests and seeing that the live XML was being updated properly. This diminishes the urgency of this patch series considerably, and I apologize for the confusion.

- The reason I claimed that is that I was misled by my own misunderstanding of what qemuxml2xmltest.c does: it checks the result XML only up to PostParse changes, and that's it. Any further changes to the XML aren't reflected by the tests. qemuxml2argvtest.c goes as far as building the command line, and I got the wrong idea that both tests reflect the same stages of a VM launch. For what I'm doing here, qemuxml2xml isn't helpful.

All this out of the way:
<memory unit='MiB'>4864</memory>
<devices>
  <memory model='nvdimm'>
    <target>
      <size unit='MiB'>512</size>
(again, the numbers are almost certainly wrong) and we want *that* XML to generate the same QEMU command line as before.
If this can't be achieved, or there are other side effects to it that I haven't considered, then we're better off leaving the current behavior alone (documenting the heck out of it if necessary) instead of changing it in ways that would alter guest ABI between boots.
This can be achieved, sort of. I can't just move the alignment to PostParse, even without changing how it is calculated, because I would break the premises it's based on today (not being executed on migration or snapshot). To do that I would need to gate the code behind the VIR_DOMAIN_DEF_PARSE_ABI_UPDATE parse flag, but I'm not sure that flag is activated on guest reboot without changing the domain (I'm looking into it ATM, but I don't think that's the case). So even if we did that, we would still need the alignment call after PostParse time, as today, to accommodate existing guests.

Regardless of whether this is worth doing (TBH, I'm not sure), I'll end up sending a couple of patches to move the alignment code out of qemu_command.c (though not up to parse time).

What I still want to do is use VIR_DOMAIN_DEF_PARSE_ABI_UPDATE to force the alignment of pSeries DIMMs in PostParse time. This will not break the ABI of existing guests in any capacity and will fix the alignment calculation for newer ones. We have precedent for such a scenario in qemuDomainControllerDefPostParse(), where we use this same flag to set VIR_DOMAIN_CONTROLLER_MODEL_USB_QEMU_XHCI for new pSeries guests.

I'll post patches today. Let's see how it goes.

DHB
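The flag-gated approach described here can be sketched in a self-contained way (the enum, flag value, and helper below are stand-ins for illustration, not libvirt's real definitions): alignment only kicks in when the caller requested an ABI update, so migration and snapshot restore keep existing sizes untouched.

```c
#include <assert.h>

/* Stand-ins for libvirt's types and flags (hypothetical values). */
typedef enum { ARCH_X86_64, ARCH_PPC64 } Arch;
#define PARSE_ABI_UPDATE (1u << 0) /* role of VIR_DOMAIN_DEF_PARSE_ABI_UPDATE */

#define PPC64_ALIGN_KIB (256 * 1024ULL)
#define ROUND_UP(v, a)  (((v) + (a) - 1) / (a) * (a))

/* Align a pSeries DIMM in PostParse only when the caller asked for an
 * ABI update; existing guests keep their current size. */
static unsigned long long
align_pseries_dimm(unsigned long long size_kib, Arch arch,
                   unsigned int parse_flags)
{
    if (arch != ARCH_PPC64)
        return size_kib;

    if (!(parse_flags & PARSE_ABI_UPDATE))
        return size_kib;   /* existing guest: leave the size untouched */

    return ROUND_UP(size_kib, PPC64_ALIGN_KIB);
}
```

This mirrors the precedent mentioned for qemuDomainControllerDefPostParse(), where the same flag selects the new controller model only for new guests.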

Until now, we were aligning just ppc64 NVDIMM modules in post parse time. Let's extend the treatment to all ppc64 memory models. This is possible because the alignment restrictions for the architecture are imposed by the platform specification, being hypervisor-agnostic.

This will remove the need for mem module alignment for pSeries in the QEMU driver, simplifying the code a bit.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/domain_conf.c |  1 -
 src/qemu/qemu_domain.c | 13 +++----------
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index c98e760aa2..e4aecc9a4d 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -5360,7 +5360,6 @@ virDomainMemoryDefPostParse(virDomainMemoryDefPtr mem,
      * will have the same restriction. */
     if (ARCH_IS_PPC64(def->os.arch) &&
-        mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
         virDomainMemoryDeviceAlignSizePseries(mem) < 0)
         return -1;
 
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 5a17887fa1..33a5765c57 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8087,11 +8087,8 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
      * calculation because virDomainDefGetMemoryInitial() uses the size
      * of the modules in the math. */
     for (i = 0; i < def->nmems; i++) {
-        if (def->mems[i]->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
-            ARCH_IS_PPC64(def->os.arch)) {
-            if (virDomainMemoryDeviceAlignSizePseries(def->mems[i]) < 0)
-                return -1;
-        } else {
+        /* ppc64 memory modules are aligned by virDomainMemoryDefPostParse(). */
+        if (!ARCH_IS_PPC64(def->os.arch)) {
             align = qemuDomainGetMemoryModuleSizeAlignment(def, def->mems[i]);
             def->mems[i]->size = VIR_ROUND_UP(def->mems[i]->size, align);
         }
@@ -8144,13 +8141,9 @@ int
 qemuDomainMemoryDeviceAlignSize(virDomainDefPtr def,
                                 virDomainMemoryDefPtr mem)
 {
-    if (mem->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM &&
-        ARCH_IS_PPC64(def->os.arch)) {
-        return virDomainMemoryDeviceAlignSizePseries(mem);
-    } else {
+    if (!ARCH_IS_PPC64(def->os.arch))
         mem->size = VIR_ROUND_UP(mem->size,
                                  qemuDomainGetMemorySizeAlignment(def));
-    }
 
     return 0;
 }
-- 
2.26.2
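For reference, virDomainMemoryDeviceAlignSizePseries() treats NVDIMMs specially: it aligns the guest-visible area (size minus label) rather than the raw size. A self-contained sketch of both branches follows, with the exact NVDIMM details treated as assumptions inferred from the patch context:

```c
#include <assert.h>

#define PPC64_ALIGN_KIB (256 * 1024ULL)
#define ROUND_UP(v, a)  (((v) + (a) - 1) / (a) * (a))

/* Sketch of the pSeries alignment rules (assumed): regular modules
 * round the whole size up to 256 MiB; NVDIMMs round only the guest
 * area, i.e. size minus the label, then add the label back. */
static unsigned long long
align_pseries(unsigned long long size_kib, unsigned long long label_kib,
              int is_nvdimm)
{
    if (!is_nvdimm)
        return ROUND_UP(size_kib, PPC64_ALIGN_KIB);

    return ROUND_UP(size_kib - label_kib, PPC64_ALIGN_KIB) + label_kib;
}
```

So a 1GiB NVDIMM with a 128KiB label ends up slightly above 1GiB, while a 200MiB DIMM rounds up to a full 256MiB.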

The function is no longer used for ppc64 alignment. This allows us to drop the 'def' argument and simplify its usage. Let's also make it clearer in the comment that the function now returns x86-specific values.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 33a5765c57..e40c122311 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8043,16 +8043,10 @@ qemuDomainGetMemorySizeAlignment(const virDomainDef *def)
 
 
 static unsigned long long
-qemuDomainGetMemoryModuleSizeAlignment(const virDomainDef *def,
-                                       const virDomainMemoryDef *mem G_GNUC_UNUSED)
+qemuDomainGetMemoryModuleSizeAlignment(void)
 {
-    /* PPC requires the memory sizes to be rounded to 256MiB increments, so
-     * round them to the size always. */
-    if (ARCH_IS_PPC64(def->os.arch))
-        return 256 * 1024;
-
-    /* dimm memory modules require 2MiB alignment rather than the 1MiB we are
-     * using elsewhere. */
+    /* For x86, dimm memory modules require 2MiB alignment rather than
+     * the 1MiB we are using elsewhere. */
     return 2048;
 }
 
@@ -8089,7 +8083,7 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
     for (i = 0; i < def->nmems; i++) {
         /* ppc64 memory modules are aligned by virDomainMemoryDefPostParse(). */
         if (!ARCH_IS_PPC64(def->os.arch)) {
-            align = qemuDomainGetMemoryModuleSizeAlignment(def, def->mems[i]);
+            align = qemuDomainGetMemoryModuleSizeAlignment();
             def->mems[i]->size = VIR_ROUND_UP(def->mems[i]->size, align);
         }
-- 
2.26.2

This is already being done for pSeries guests, so let's extend the treatment to x86. The difference here is that the existing x86 alignment in the QEMU driver might be restricted to QEMU only, meaning that we can't put it in virDomainMemoryDefPostParse() without implying that all hypervisors would handle it the same way. Instead, let's create a qemuDomainMemoryDefPostParse() that is called from qemuDomainDeviceDefPostParse() to do that.

After this change, def->nmems will always be aligned after parsing time, so qemuDomainAlignMemorySizes() doesn't need to align each memory module before calculating 'hotplugmem'. We'll also generate aligned values in the resulting XML. Some xml2xml tests were changed to reflect this new behavior.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c                        | 45 +++++++++++--------
 tests/qemuxml2argvdata/hugepages-nvdimm.xml   |  2 +-
 .../memory-hotplug-nvdimm-access.xml          |  2 +-
 .../memory-hotplug-nvdimm-align.xml           |  2 +-
 .../memory-hotplug-nvdimm-label.xml           |  2 +-
 .../memory-hotplug-nvdimm-pmem.xml            |  2 +-
 .../memory-hotplug-nvdimm-readonly.xml        |  2 +-
 .../memory-hotplug-nvdimm.xml                 |  2 +-
 tests/qemuxml2argvdata/pages-dimm-discard.xml |  2 +-
 .../memory-hotplug-dimm.xml                   |  4 +-
 10 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index e40c122311..c06c33f8d0 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -5343,6 +5343,23 @@ qemuDomainTPMDefPostParse(virDomainTPMDefPtr tpm,
 }
 
 
+static int
+qemuDomainMemoryDefPostParse(virDomainMemoryDefPtr mem,
+                             virArch arch)
+{
+    /* For x86, dimm memory modules require 2MiB alignment rather than
+     * the 1MiB we are using elsewhere. */
+    unsigned int x86MemoryModuleSizeAlignment = 2048;
+
+    /* ppc64 memory module alignment is done in
+     * virDomainMemoryDefPostParse(). */
+    if (!ARCH_IS_PPC64(arch))
+        mem->size = VIR_ROUND_UP(mem->size, x86MemoryModuleSizeAlignment);
+
+    return 0;
+}
+
+
 static int
 qemuDomainDeviceDefPostParse(virDomainDeviceDefPtr dev,
                              const virDomainDef *def,
@@ -5400,6 +5417,10 @@ qemuDomainDeviceDefPostParse(virDomainDeviceDefPtr dev,
         ret = qemuDomainTPMDefPostParse(dev->data.tpm, def->os.arch);
         break;
 
+    case VIR_DOMAIN_DEVICE_MEMORY:
+        ret = qemuDomainMemoryDefPostParse(dev->data.memory, def->os.arch);
+        break;
+
     case VIR_DOMAIN_DEVICE_LEASE:
     case VIR_DOMAIN_DEVICE_FS:
     case VIR_DOMAIN_DEVICE_INPUT:
@@ -5412,7 +5433,6 @@ qemuDomainDeviceDefPostParse(virDomainDeviceDefPtr dev,
     case VIR_DOMAIN_DEVICE_MEMBALLOON:
     case VIR_DOMAIN_DEVICE_NVRAM:
     case VIR_DOMAIN_DEVICE_RNG:
-    case VIR_DOMAIN_DEVICE_MEMORY:
     case VIR_DOMAIN_DEVICE_IOMMU:
     case VIR_DOMAIN_DEVICE_AUDIO:
         ret = 0;
@@ -8042,15 +8062,6 @@ qemuDomainGetMemorySizeAlignment(const virDomainDef *def)
 }
 
 
-static unsigned long long
-qemuDomainGetMemoryModuleSizeAlignment(void)
-{
-    /* For x86, dimm memory modules require 2MiB alignment rather than
-     * the 1MiB we are using elsewhere. */
-    return 2048;
-}
-
-
 int
 qemuDomainAlignMemorySizes(virDomainDefPtr def)
 {
@@ -8077,16 +8088,12 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
         virDomainNumaSetNodeMemorySize(def->numa, i, mem);
     }
 
-    /* Align memory module sizes. This needs to occur before 'initialmem'
-     * calculation because virDomainDefGetMemoryInitial() uses the size
-     * of the modules in the math. */
+    /* Calculate hotplugmem. The memory modules are already aligned at this
+     * point:
+     *
+     * - ppc64 mem modules are being aligned by virDomainMemoryDefPostParse();
+     * - x86 mem modules are being aligned by qemuDomainMemoryDefPostParse(). */
     for (i = 0; i < def->nmems; i++) {
-        /* ppc64 memory modules are aligned by virDomainMemoryDefPostParse(). */
-        if (!ARCH_IS_PPC64(def->os.arch)) {
-            align = qemuDomainGetMemoryModuleSizeAlignment();
-            def->mems[i]->size = VIR_ROUND_UP(def->mems[i]->size, align);
-        }
-
         hotplugmem += def->mems[i]->size;
 
         if (def->mems[i]->size > maxmemkb) {
diff --git a/tests/qemuxml2argvdata/hugepages-nvdimm.xml b/tests/qemuxml2argvdata/hugepages-nvdimm.xml
index 144d02b56e..d356ad12a8 100644
--- a/tests/qemuxml2argvdata/hugepages-nvdimm.xml
+++ b/tests/qemuxml2argvdata/hugepages-nvdimm.xml
@@ -39,7 +39,7 @@
         <path>/tmp/nvdimm</path>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-access.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-access.xml
index a1cc1264eb..4ac5589076 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-access.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-access.xml
@@ -48,7 +48,7 @@
         <path>/tmp/nvdimm</path>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-align.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-align.xml
index 018a693aaf..2436b1a1af 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-align.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-align.xml
@@ -49,7 +49,7 @@
         <alignsize unit='KiB'>2048</alignsize>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-label.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-label.xml
index c9d54a6088..6e8902a25e 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-label.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-label.xml
@@ -48,7 +48,7 @@
         <path>/tmp/nvdimm</path>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
         <label>
           <size unit='KiB'>128</size>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.xml
index 391d70f20e..a1a49f68ad 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-pmem.xml
@@ -49,7 +49,7 @@
         <pmem/>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-readonly.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-readonly.xml
index 09b2c5c833..32000acb6d 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm-readonly.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm-readonly.xml
@@ -48,7 +48,7 @@
         <path>/tmp/nvdimm</path>
       </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
         <readonly/>
       </target>
diff --git a/tests/qemuxml2argvdata/memory-hotplug-nvdimm.xml b/tests/qemuxml2argvdata/memory-hotplug-nvdimm.xml
index a32474da06..815fca3412 100644
--- a/tests/qemuxml2argvdata/memory-hotplug-nvdimm.xml
+++ b/tests/qemuxml2argvdata/memory-hotplug-nvdimm.xml
@@ -48,7 +48,7 @@
         <path>/tmp/nvdimm</path>
      </source>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
diff --git a/tests/qemuxml2argvdata/pages-dimm-discard.xml b/tests/qemuxml2argvdata/pages-dimm-discard.xml
index 3d233687e1..a0ba528823 100644
--- a/tests/qemuxml2argvdata/pages-dimm-discard.xml
+++ b/tests/qemuxml2argvdata/pages-dimm-discard.xml
@@ -40,7 +40,7 @@
     </memory>
     <memory model='dimm' access='private' discard='yes'>
       <target>
-        <size unit='KiB'>524287</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='1'/>
diff --git a/tests/qemuxml2xmloutdata/memory-hotplug-dimm.xml b/tests/qemuxml2xmloutdata/memory-hotplug-dimm.xml
index 326b5c954c..985a4fc094 100644
--- a/tests/qemuxml2xmloutdata/memory-hotplug-dimm.xml
+++ b/tests/qemuxml2xmloutdata/memory-hotplug-dimm.xml
@@ -45,7 +45,7 @@
     </memballoon>
     <memory model='dimm'>
       <target>
-        <size unit='KiB'>523264</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='0'/>
@@ -56,7 +56,7 @@
         <pagesize unit='KiB'>2048</pagesize>
       </source>
       <target>
-        <size unit='KiB'>524287</size>
+        <size unit='KiB'>524288</size>
         <node>0</node>
       </target>
       <address type='dimm' slot='1'/>
-- 
2.26.2
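The size bumps in the test data follow directly from the 2MiB (2048 KiB) module alignment; a minimal sketch (ROUND_UP mirrors libvirt's VIR_ROUND_UP macro):

```c
#include <assert.h>

#define X86_MODULE_ALIGN_KIB 2048ULL   /* 2 MiB module alignment */
#define ROUND_UP(v, a) (((v) + (a) - 1) / (a) * (a))

/* Round an x86 memory module size up to the next 2 MiB boundary.
 * This is what turns both 523264 KiB and 524287 KiB into 524288 KiB
 * in the test files. */
static unsigned long long
align_x86_module(unsigned long long size_kib)
{
    return ROUND_UP(size_kib, X86_MODULE_ALIGN_KIB);
}
```

Already-aligned sizes pass through unchanged, so re-running the alignment is idempotent.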

The loop in which 'hotplugmem' is calculated in qemuDomainAlignMemorySizes() also throws a max size overflow error. This is a remnant of the time when this loop did the memory module alignment. It doesn't hurt to leave the check here, but it makes more sense to move it to the places where the alignment is done.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/conf/domain_conf.c |  9 +++++++++
 src/qemu/qemu_domain.c | 20 ++++++++++----------
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index e4aecc9a4d..943084ef71 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -16852,11 +16852,20 @@ virDomainSEVDefParseXML(xmlNodePtr sevNode,
 int
 virDomainMemoryDeviceAlignSizePseries(virDomainMemoryDefPtr mem)
 {
+    unsigned long long maxmemkb = virMemoryMaxValue(false) >> 10;
     unsigned long long ppc64AlignSize = 256 * 1024;
     unsigned long long guestArea;
 
     if (mem->model != VIR_DOMAIN_MEMORY_MODEL_NVDIMM) {
         mem->size = VIR_ROUND_UP(mem->size, ppc64AlignSize);
+
+        if (mem->size > maxmemkb) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                           _("size of memory module overflowed after "
+                             "alignment"));
+            return -1;
+        }
+
         return 0;
     }
 
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index c06c33f8d0..7067d49d7a 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -5350,11 +5350,19 @@ qemuDomainMemoryDefPostParse(virDomainMemoryDefPtr mem,
     /* For x86, dimm memory modules require 2MiB alignment rather than
      * the 1MiB we are using elsewhere. */
     unsigned int x86MemoryModuleSizeAlignment = 2048;
+    unsigned long long maxmemkb = virMemoryMaxValue(false) >> 10;
 
     /* ppc64 memory module alignment is done in
      * virDomainMemoryDefPostParse(). */
-    if (!ARCH_IS_PPC64(arch))
+    if (!ARCH_IS_PPC64(arch)) {
         mem->size = VIR_ROUND_UP(mem->size, x86MemoryModuleSizeAlignment);
 
+        if (mem->size > maxmemkb) {
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s",
+                           _("size of memory module overflowed after "
+                             "alignment"));
+            return -1;
+        }
+    }
 
     return 0;
 }
@@ -8093,17 +8101,9 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
      *
      * - ppc64 mem modules are being aligned by virDomainMemoryDefPostParse();
      * - x86 mem modules are being aligned by qemuDomainMemoryDefPostParse(). */
-    for (i = 0; i < def->nmems; i++) {
+    for (i = 0; i < def->nmems; i++)
         hotplugmem += def->mems[i]->size;
 
-        if (def->mems[i]->size > maxmemkb) {
-            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                           _("size of memory module '%zu' overflowed after "
-                             "alignment"),
-                           i);
-            return -1;
-        }
-    }
-
     /* Align initial memory size, if NUMA is present calculate it as total of
      * individual aligned NUMA node sizes. */
     if (initialmem == 0) {
-- 
2.26.2
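The relocated check can be modeled as below; virMemoryMaxValue() returns bytes, hence the >> 10 to compare in KiB (the limit used in the test is illustrative, not libvirt's actual platform maximum):

```c
#include <assert.h>

#define ROUND_UP(v, a) (((v) + (a) - 1) / (a) * (a))

/* Align a module size and reject it if the rounded value exceeds the
 * platform's maximum memory (passed in bytes, compared in KiB). */
static int
check_module_size(unsigned long long size_kib,
                  unsigned long long align_kib,
                  unsigned long long maxmem_bytes)
{
    unsigned long long maxmemkb = maxmem_bytes >> 10;

    size_kib = ROUND_UP(size_kib, align_kib);
    if (size_kib > maxmemkb)
        return -1;  /* "size of memory module overflowed after alignment" */

    return 0;
}
```

The point of moving the check next to the alignment is that only the rounding can push a previously valid size over the limit.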

The alignment of the mem object in qemuDomainDetachPrepMemory() allows the virDomainMemoryFindByDef() call to find the right memory object for detach. Turns out that, after recent changes, the alignment of the mem object is already guaranteed at this point. This function is used by qemuDomainDetachDeviceLive(), which in turn is used in two places with a memory device:

- qemuDomainDetachDeviceLiveAndConfig(), after a virDomainDeviceDefParse() call that will align the mem object via PostParse callbacks;

- qemuDomainDetachDeviceAliasLiveAndConfig(), after a call to virDomainDefFindDevice() that will get the already aligned mem object from the domain definition.

This means that the qemuDomainMemoryDeviceAlignSize() call at the start of this function is now obsolete and can be removed.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_hotplug.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index c1461ac621..e0495ef59e 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -5682,9 +5682,9 @@ qemuDomainDetachPrepMemory(virDomainObjPtr vm,
 {
     int idx;
 
-    if (qemuDomainMemoryDeviceAlignSize(vm->def, match) < 0)
-        return -1;
-
+    /* 'match' is guaranteed to be aligned at this point, either
+     * by PostParse callbacks when parsing it or by fetching it
+     * from the domain definition. */
     if ((idx = virDomainMemoryFindByDef(vm->def, match)) < 0) {
         virReportError(VIR_ERR_DEVICE_MISSING,
                        _("model '%s' memory device not present "
-- 
2.26.2

Similar to the previous patch, the alignment of the mem object in qemuDomainAttachMemory() is obsolete after recent changes.

qemuDomainAttachMemory() is used by qemuDomainAttachDeviceLive(), which is called by qemuDomainAttachDeviceLiveAndConfig() after a call to virDomainDeviceDefParse(), where the PostParse callback will align the mem object.

This means that we can live without the qemuDomainMemoryDeviceAlignSize() call at the start of qemuDomainAttachMemory().

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_hotplug.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index e0495ef59e..15cbb4d550 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -2395,9 +2395,6 @@ qemuDomainAttachMemory(virQEMUDriverPtr driver,
     int id;
     int ret = -1;
 
-    if (qemuDomainMemoryDeviceAlignSize(vm->def, mem) < 0)
-        goto cleanup;
-
     if (qemuDomainDefValidateMemoryHotplug(vm->def, priv->qemuCaps, mem) < 0)
         goto cleanup;
 
-- 
2.26.2

This function has no callers left after we moved the mem module alignment to PostParse callbacks.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 src/qemu/qemu_domain.c | 20 --------------------
 src/qemu/qemu_domain.h |  2 --
 2 files changed, 22 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 7067d49d7a..8d69be682e 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8130,26 +8130,6 @@ qemuDomainAlignMemorySizes(virDomainDefPtr def)
 }
 
 
-/**
- * qemuDomainMemoryDeviceAlignSize:
- * @mem: memory device definition object
- *
- * Aligns the size of the memory module as qemu enforces it. The size is updated
- * inplace. Default rounding is now to 1 MiB (qemu requires rounding to page,
- * size so this should be safe).
- */
-int
-qemuDomainMemoryDeviceAlignSize(virDomainDefPtr def,
-                                virDomainMemoryDefPtr mem)
-{
-    if (!ARCH_IS_PPC64(def->os.arch))
-        mem->size = VIR_ROUND_UP(mem->size,
-                                 qemuDomainGetMemorySizeAlignment(def));
-
-    return 0;
-}
-
-
 /**
  * qemuDomainGetMonitor:
  * @vm: domain object
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index ca041e207b..8e6fde412e 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -757,8 +757,6 @@ bool qemuDomainHasBlockjob(virDomainObjPtr vm, bool copy_only)
     ATTRIBUTE_NONNULL(1);
 
 int qemuDomainAlignMemorySizes(virDomainDefPtr def);
-int qemuDomainMemoryDeviceAlignSize(virDomainDefPtr def,
-                                    virDomainMemoryDefPtr mem);
 
 virDomainChrDefPtr qemuFindAgentConfig(virDomainDefPtr def);
-- 
2.26.2

Document that we're now reflecting the aligned value in the live domain XML, and that we fixed a bug that occurred in pSeries guests where a guest might start with an extra 256MiB of memory.

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 NEWS.rst | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/NEWS.rst b/NEWS.rst
index 0e56f5dbca..ab1f628c24 100644
--- a/NEWS.rst
+++ b/NEWS.rst
@@ -15,8 +15,22 @@ v6.10.0 (unreleased)
 
 * **Improvements**
 
+  * qemu: reflect the actual memory being used in the live XML
+
+    Memory modules are aligned internally to fit platform and
+    architecture constraints. The aligned value is used by QEMU, but
+    the live XML reflected just the original values. Libvirt will
+    now report the actual memory value passed to QEMU in the live
+    XML.
+
 * **Bug fixes**
 
+  * qemu: pSeries guests booting with extra memory
+
+    In certain conditions, a pSeries guest would start with an extra
+    256MiB of memory aside from what was determined by the ``<memory>``
+    and ``<currentMemory>`` values. This issue has been fixed.
+
 * **Removed features**
 
   * hyperv: removed support for the Hyper-V V1 WMI API
-- 
2.26.2

s/rs/rst/

Jano

On a Wednesday in 2020, Daniel Henrique Barboza wrote:
Document that we're now reflecting the aligned value in the live domain XML, and that we fixed a bug that occurred in pSeries guests where a guest might start with an extra 256MiB of memory.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- NEWS.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+)

On 11/12/20 8:33 AM, Ján Tomko wrote:
s/rs/rst/
Oops! Thanks for catching that. Fixed in my local branch.

DHB
Jano
On a Wednesday in 2020, Daniel Henrique Barboza wrote:
Document that we're now reflecting the aligned value in the live domain XML, and that we fixed a bug that occurred in pSeries guests where a guest might start with an extra 256MiB of memory.
Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com> --- NEWS.rst | 14 ++++++++++++++ 1 file changed, 14 insertions(+)

On 11/11/20 11:07 PM, Daniel Henrique Barboza wrote:
Hi,
This is a work that I intended to do after pSeries NVDIMM changes I made in commit v6.7.0-302-gd3f3c2c97f. The reasoning is that if ppc64 NVDIMM alignment can be done in PostParse time, regular ppc64 DIMMs alignment can also be done the same way. After that I realized that the same logic can be applied to x86 as well, and then I started to see where we could eliminate explicit align calls.
The end result is that a bug that affects pSeries the most was found and fixed in patch 02, and now we're able to reflect the exact memory being used by the guest in the live XML since the alignment is now done in parse time. The series can be fetched from:
https://gitlab.com/danielhb/libvirt/-/tree/memory_modules_postparse_v1
NOTE: I worked with the x86 logic based on what I could assert from the code itself, i.e. the alignment mechanics (2MiB align for mem modules, 1MiB align for total memory) are related to QEMU internals, not being hypervisor-agnostic. If someone with authority can guarantee that the alignment restrictions for x86 are applicable to all hypervisors (like in the pSeries case), let me know and I'll respin the series putting the x86 code in virDomainMemoryDefPostParse().
Daniel Henrique Barboza (10): domain_conf.c: handle all ppc64 dimms in virDomainNVDimmAlignSizePseries qemu_domain.c: align memory modules before calculating 'initialmem' domain_conf.c: align pSeries mem modules in virDomainMemoryDefPostParse() qemu_domain.c: simplify qemuDomainGetMemoryModuleSizeAlignment() qemu_domain.c: align x86 memory module in post parse time qemu_domain.c: move size check for mem modules after alignment qemu_hotplug.c: avoid aligning the mem obj in qemuDomainDetachPrepMemory() qemu_hotplug.c: avoid aligning the mem obj in qemuDomainAttachMemory() qemu_domain.c: remove qemuDomainMemoryDeviceAlignSize() NEWS.rs: document memory alignment improvements
NEWS.rst | 14 +++ src/conf/domain_conf.c | 25 ++++- src/conf/domain_conf.h | 2 +- src/libvirt_private.syms | 2 +- src/qemu/qemu_domain.c | 106 +++++++----------- src/qemu/qemu_domain.h | 2 - src/qemu/qemu_hotplug.c | 9 +- tests/qemuxml2argvdata/hugepages-nvdimm.xml | 2 +- .../memory-hotplug-nvdimm-access.xml | 2 +- .../memory-hotplug-nvdimm-align.xml | 2 +- .../memory-hotplug-nvdimm-label.xml | 2 +- .../memory-hotplug-nvdimm-pmem.xml | 2 +- .../memory-hotplug-nvdimm-readonly.xml | 2 +- .../memory-hotplug-nvdimm.xml | 2 +- .../memory-hotplug-ppc64-nonuma.args | 2 +- tests/qemuxml2argvdata/pages-dimm-discard.xml | 2 +- .../memory-hotplug-dimm.xml | 4 +- 17 files changed, 92 insertions(+), 90 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>

but perhaps you want Andrea to look into it too.

Michal

On 11/12/20 8:28 AM, Michal Privoznik wrote:
On 11/11/20 11:07 PM, Daniel Henrique Barboza wrote:
Hi,
This is a work that I intended to do after pSeries NVDIMM changes I made in commit v6.7.0-302-gd3f3c2c97f. The reasoning is that if ppc64 NVDIMM alignment can be done in PostParse time, regular ppc64 DIMMs alignment can also be done the same way. After that I realized that the same logic can be applied to x86 as well, and then I started to see where we could eliminate explicit align calls.
The end result is that a bug that affects pSeries the most was found and fixed in patch 02, and now we're able to reflect the exact memory being used by the guest in the live XML since the alignment is now done in parse time. The series can be fetched from:
https://gitlab.com/danielhb/libvirt/-/tree/memory_modules_postparse_v1
NOTE: I worked with the x86 logic based on what I could assert from the code itself, i.e. the alignment mechanics (2MiB align for mem modules, 1MiB align for total memory) are related to QEMU internals, not being hypervisor-agnostic. If someone with authority can guarantee that the alignment restrictions for x86 are applicable to all hypervisors (like in the pSeries case), let me know and I'll respin the series putting the x86 code in virDomainMemoryDefPostParse().
Daniel Henrique Barboza (10): domain_conf.c: handle all ppc64 dimms in virDomainNVDimmAlignSizePseries qemu_domain.c: align memory modules before calculating 'initialmem' domain_conf.c: align pSeries mem modules in virDomainMemoryDefPostParse() qemu_domain.c: simplify qemuDomainGetMemoryModuleSizeAlignment() qemu_domain.c: align x86 memory module in post parse time qemu_domain.c: move size check for mem modules after alignment qemu_hotplug.c: avoid aligning the mem obj in qemuDomainDetachPrepMemory() qemu_hotplug.c: avoid aligning the mem obj in qemuDomainAttachMemory() qemu_domain.c: remove qemuDomainMemoryDeviceAlignSize() NEWS.rs: document memory alignment improvements
NEWS.rst | 14 +++ src/conf/domain_conf.c | 25 ++++- src/conf/domain_conf.h | 2 +- src/libvirt_private.syms | 2 +- src/qemu/qemu_domain.c | 106 +++++++----------- src/qemu/qemu_domain.h | 2 - src/qemu/qemu_hotplug.c | 9 +- tests/qemuxml2argvdata/hugepages-nvdimm.xml | 2 +- .../memory-hotplug-nvdimm-access.xml | 2 +- .../memory-hotplug-nvdimm-align.xml | 2 +- .../memory-hotplug-nvdimm-label.xml | 2 +- .../memory-hotplug-nvdimm-pmem.xml | 2 +- .../memory-hotplug-nvdimm-readonly.xml | 2 +- .../memory-hotplug-nvdimm.xml | 2 +- .../memory-hotplug-ppc64-nonuma.args | 2 +- tests/qemuxml2argvdata/pages-dimm-discard.xml | 2 +- .../memory-hotplug-dimm.xml | 4 +- 17 files changed, 92 insertions(+), 90 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Thanks for the review!
but perhaps you want Andrea to look into it too.
Of course. Andrea, not sure if you want to review the whole series, so here's the gist: we already have the ppc64 NVDIMM alignment being done in PostParse time, so I started to move all ppc64 memory module alignment to PostParse as well. In the end I figured I could move the x86 alignment too.

Do you see any potential problem with this design change?

Thanks,

DHB
Michal

On Thu, 2020-11-12 at 18:53 -0300, Daniel Henrique Barboza wrote:
On 11/12/20 8:28 AM, Michal Privoznik wrote:
On 11/11/20 11:07 PM, Daniel Henrique Barboza wrote:
NEWS.rst | 14 +++ src/conf/domain_conf.c | 25 ++++- src/conf/domain_conf.h | 2 +- src/libvirt_private.syms | 2 +- src/qemu/qemu_domain.c | 106 +++++++----------- src/qemu/qemu_domain.h | 2 - src/qemu/qemu_hotplug.c | 9 +- tests/qemuxml2argvdata/hugepages-nvdimm.xml | 2 +- .../memory-hotplug-nvdimm-access.xml | 2 +- .../memory-hotplug-nvdimm-align.xml | 2 +- .../memory-hotplug-nvdimm-label.xml | 2 +- .../memory-hotplug-nvdimm-pmem.xml | 2 +- .../memory-hotplug-nvdimm-readonly.xml | 2 +- .../memory-hotplug-nvdimm.xml | 2 +- .../memory-hotplug-ppc64-nonuma.args | 2 +- tests/qemuxml2argvdata/pages-dimm-discard.xml | 2 +- .../memory-hotplug-dimm.xml | 4 +- 17 files changed, 92 insertions(+), 90 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com>
Thanks for the review!
but perhaps you want Andrea to look into it too.
Of course. Andrea, not sure if you want to review the whole series so here's the gist: we already have the ppc64 NVDIMM alignment being done in PostParse time, so I started to move all ppc64 memory module alignment to PostParse as well. In the end I figured I could also move the alignment for x86 as well. Do you see any potential problem with this design change?
I'll take a look and let you know :) -- Andrea Bolognani / Red Hat / Virtualization