Re: [PATCH] qemu: Relax memory pre-allocation rules

30 Nov 2020

On Mon, Nov 30, 2020 at 11:48:28AM +0100, Michal Privoznik wrote:
...
On 11/30/20 11:16 AM, Daniel P. Berrangé wrote:
...
On Mon, Nov 30, 2020 at 11:06:14AM +0100, Michal Privoznik wrote:
...
Currently, we configure QEMU to prealloc memory almost by
default. Well, by default for NVDIMMs, hugepages and if user
asked us to (via memoryBacking <allocation mode="immediate"/>).
However, there are two cases where this approach is not the best:
1) in case when guest's NVDIMM is backed by real life NVDIMM. In
    this case users should put <pmem/> into the <memory/> device
    <source/>, like this:
<memory model='nvdimm' access='shared'>
      <source>
        <path>/dev/pmem0</path>
        <pmem/>
      </source>
    </memory>
Instructing QEMU to do prealloc in this case means that each
    page of the NVDIMM is "touched" (the first byte is read and
    written back - see QEMU commit v2.9.0-rc1~26^2) which cripples
    device wear.
2) if free-page-reporting is turned on. While the
    free-page-reporting feature might not have a catchy or obvious
    name, when enabled it instructs KVM and subsequently QEMU to
    free pages no longer used by guest resulting in smaller memory
    footprint. And preallocating whole memory goes against this.
The BZ comment 11 mentions another, third case 'virtio-mem' but
that is not implemented in libvirt, yet.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1894053
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
  src/qemu/qemu_command.c                               | 11 +++++++++--
  .../memory-hotplug-nvdimm-pmem.x86_64-latest.args     |  2 +-
  2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 479bcc0b0c..3df8b5ac76 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -2977,7 +2977,11 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
      if (discard == VIR_TRISTATE_BOOL_ABSENT)
          discard = def->mem.discard;
-    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE)
+    /* The whole point of free_page_reporting is that as soon as guest frees
+     * any memory it is freed in the host too. Prealloc doesn't make much sense
+     * then. */
+    if (def->mem.allocation == VIR_DOMAIN_MEMORY_ALLOCATION_IMMEDIATE &&
+        def->memballoon->free_page_reporting != VIR_TRISTATE_SWITCH_ON)
          prealloc = true;
If the user asked for allocation == immediate, we should not be
silently ignoring that request. Isn't the scenario described simply
a wierd user configuration scenario and if they don't want that, then
then they can set     <allocation mode="ondemand"/> instead.
Okay.
...
...
if (virDomainNumatuneGetMode(def->numa, mem->targetNode, &mode) < 0 &&
@@ -3064,7 +3068,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
          if (mem->nvdimmPath) {
              memPath = g_strdup(mem->nvdimmPath);
-            prealloc = true;
...
+            /* If the NVDIMM is a real device then there's nothing to prealloc.
+             * If anyhing, we would be only wearing off the device. */
+            if (!mem->nvdimmPmem)
+                prealloc = true;
I wonder if QEMU itself should take this optimization to skip its
allocation logic ?
Also would make sense. This is that kind of bug which lies in between
libvirt and qemu. Although, since we are worried in silently ignoring user
requests, then wouldn't this be exactly what QEMU would be doing? I mean, if
an user/libvirt put both .prealloc=yes and .pmem=yes onto cmd line then
these would cancel out, wouldn't they?
The difference is that an real NVDIMM is inherantly preallocated. QEMU
would not be ignoring the prealloc=yes arg - its implementation would
merely be a no-op.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|