On 7/7/21 12:30 PM, David Hildenbrand wrote:
> On 23.06.21 12:12, Michal Privoznik wrote:
>> v4 of:
>>
>> https://listman.redhat.com/archives/libvir-list/2021-April/msg01138.html
>>
>> diff to v3:
>> - Rebased code on top of master
>> - Tried to work in all of Peter's review suggestions
>> - Fixed a bug where adjusting <requested/> was viewed as hotplug of a
>>   new <memory/> by the XML validator, and thus if <maxMemory/> was
>>   close enough to <currentMemory/> the validator reported an error
>>   (this was reported by David).
>>
> Hi Michal,

Hi,

> sorry for replying this late.
> I just retested with this version and it mostly works as expected. I
> tested quite some memory configurations and have some comments /
> reports :)
>
> I tested successfully:
> - 1 node with one device
> - 2 nodes with one device on each node
> - 2 nodes with two devices on one node
> - "virsh update-memory-device" on live domains -- works great
> - huge pages and anonymous memory with access=private and
>   access=shared. There is only one issue with hugepages and memfd
>   (prealloc=on gets set).
> - shared memory on memfd and anonymous memory (-> shared file) with
>   access=shared
>
> I only tested on a single host NUMA node so far, but don't expect
> surprises with host NUMA policies.
>
> 1. "virsh update-memory-device" and stopped domains
>
> Once I have more than one virtio-mem device defined for a VM, "virsh
> update-memory-device" cannot be used anymore as aliases don't seem to
> be available on stopped VMs. If I manually define an alias on a stopped
> VM, the alias silently gets dropped. Is there any way to identify a
> virtio-mem device on a stopped domain?

Yes. You want what we call user aliases. They have to have the "ua-"
prefix so that they don't clash with whatever libvirt generates.
Something like this:

  <memory model="virtio-mem">
    <source>
      <pagesize unit="KiB">2048</pagesize>
    </source>
    <target>
      <size unit="KiB">2097152</size>
      <node>0</node>
      <block unit="KiB">2048</block>
      <requested unit="KiB">1048576</requested>
    </target>
    <alias name="ua-virtiomem0"/>
    <address type="pci" domain="0x0000" bus="0x00" slot="0x06"
             function="0x0"/>
  </memory>

But then you get to the fact that I haven't implemented update for
inactive domains. Will do in v5.
2. "virsh update-memory-device" with --config on a running domain
# virsh update-memory-device "Fedora34" --config --alias
"virtiomem1"
--requested-size 16G
error: no memory device found
I guess the issue is again, that alias don't apply to the "!live" XML.
So the "--config" option doesn't really work when having more than one
virtio-mem device defined for a VM.
Good point. I wonder what piece of the input XML I can use to look up
the corresponding virtio-mem device. Since it has a PCI address maybe I
can use that instead of the alias? But then we are back to the old
problem - in general, inactive and active XMLs can differ (due to
hot(un-)plug). So even when I'd find a device on the same PCI address it
may actually be a different device. Therefore, I think the safest is to
use aliases. At any rate - this can be implemented afterwards.
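
For illustration, once user aliases and the inactive-domain update are
in place, the lookup could look like this (just a sketch, reusing the
"ua-virtiomem0" alias from the example above):

  # virsh update-memory-device "Fedora34" --config \
        --alias "ua-virtiomem0" --requested-size 16G
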
3. "virsh update-memory-device" and nodes
In addition to "--alias", something like "--node" would also be nice
to
have -- assuming there is only a single virtio-mem device per NUMA node,
which is usually the case. For example:
"virsh update-memory-device "Fedora34" --node 1 --requested-size
16G"
could come in handy. This would also work on "!live" domains.
Yes, makes sense.
4. "actual" vs. "current"
"<actual unit='KiB'>16777216</actual>" I wonder if
"current" instead of
"actual" would be more in line with "currentMemory". But no strong
opinion.
Yeah, I don't have any opinion either. I can change it.

> 5. Slot handling
>
> As already discussed, virtio-mem and virtio-pmem don't need slots. Yet,
> the "slots" definition is required and libvirt reserves one slot for
> each such device ("error: unsupported configuration: memory device
> count '2' exceeds slots count '1'"). This is certainly future work, if
> we ever want to change that.

I can look into this.
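
For context, a minimal sketch of the interplay (sizes made up): with two
virtio-mem devices the domain XML currently needs at least two slots,
even though neither device actually consumes one at the QEMU level:

  <maxMemory slots='2' unit='GiB'>64</maxMemory>
  ...
  <memory model='virtio-mem'>...</memory>
  <memory model='virtio-mem'>...</memory>
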
> 6. 4k source results in an error
>
>   <source>
>     <pagesize unit='KiB'>4096</pagesize>
>     <nodemask>0-1</nodemask>
>   </source>
>
>   error: internal error: Unable to find any usable hugetlbfs mount for
>   4096 KiB
>
> This example is taken from https://libvirt.org/formatdomain.html for
> DIMMs. Not sure what the expected behavior is.

Ouch, this is a clear bug. Let me investigate and fix in next version.

> 7. File source gets silently dropped
>
>   <source>
>     <path>/dev/shmem/vm0</path>
>   </source>
>
> The statement gets silently dropped, which is somewhat surprising.
> However, I did not test what happens with DIMMs, maybe it's the same.

Yeah, this is somewhat expected. I mean, the part that's expected is
that libvirt drops parts it doesn't parse. Sometimes it is pretty
obvious (<source someRandomAttribute='value'/>) and sometimes it's not
(like in your example, where <path/> makes sense for other memory
models like virtio-pmem). But just to be sure - since virtio-mem can be
backed by any memory-backing-* backend, does it make sense to have
<path/> there? So far my code would use memory-backend-file for
hugepages only.
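
For comparison, <path/> is meaningful for virtio-pmem, where a
configuration like this is accepted (a sketch; the path is made up):

  <memory model='virtio-pmem' access='shared'>
    <source>
      <path>/tmp/virtio_pmem</path>
    </source>
    <target>
      <size unit='KiB'>524288</size>
    </target>
  </memory>
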
> 8. Global preallocation of memory
>
> With
>
>   <memoryBacking>
>     <allocation mode="immediate"/>
>   </memoryBacking>
>
> we also get "prealloc=on" set for the memory backends of the virtio-mem
> devices, which is sub-optimal, because we end up preallocating all
> memory of the memory backend (which is unexpected for a virtio-mem
> device) and virtio-mem will then discard all memory immediately again.
> So it's essentially a dangerous NOP -- dangerous because we temporarily
> consume a lot of memory.
>
> In an ideal world, we would not set this for the memory backend used
> for the virtio-mem devices, but for the virtio-mem devices themselves,
> such that preallocation happens when new memory blocks are actually
> exposed to the VM.
>
> As virtio-mem does not support "prealloc=on" for virtio-mem devices
> yet, this is future work. We might want to error out, though, if
> <allocation mode="immediate"/> is used along with virtio-mem devices
> for now. I'm planning on implementing this in QEMU soon. Until then, it
> might also be good enough to simply document that this setup should be
> avoided.

Right. Meanwhile this was implemented in QEMU and thus I can drop
prealloc=on. But then my question is: what happens when the user wants
to expose additional memory to the guest but doesn't have enough free
hugepages in the pool? Libvirt's using prealloc=on so that QEMU doesn't
get killed later, after the guest has booted up.
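
To illustrate the difference (a sketch based on the QEMU work mentioned
above; the exact option spelling may differ), preallocation would move
from the backend to the device, so it only happens as blocks are
actually plugged:

  # today: backend-level prealloc touches all 16G up front
  -object memory-backend-file,id=mem0,size=16G,prealloc=on,...
  -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,...

  # device-level prealloc: memory is preallocated only when plugged
  -object memory-backend-file,id=mem0,size=16G,...
  -device virtio-mem-pci,id=vmem0,memdev=mem0,requested-size=0,prealloc=on,...
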
> 9. Memfd and huge pages
>
> With
>
>   <memoryBacking>
>     <source type="memfd"/>
>   </memoryBacking>
>
> and
>
>   <memory model='virtio-mem' access='shared'>
>     <source>
>       <pagesize unit='KiB'>2048</pagesize>
>     </source>
>     ...
>   </memory>
>
> I get on the QEMU cmdline:
>
>   -object {"qom-type":"memory-backend-memfd","id":"memvirtiomem0","hugetlb":true,"hugetlbsize":2097152,"share":true,"prealloc":true,"size":17179869184}
>
> Dropping the memfd source I get on the QEMU cmdline:
>
>   -object {"qom-type":"memory-backend-file","id":"memvirtiomem0","mem-path":"/dev/hugepages/libvirt/qemu/2-Fedora34-2","share":true,"size":17179869184}
>
> "prealloc":true should not have been added for virtio-mem in case of
> memfd. !memfd does what's expected.

Okay, I will fix this. But can you shed more light here? I mean, why the
difference?

> 10. Memory locking
>
> With
>
>   <memoryBacking>
>     <locked/>
>   </memoryBacking>
>
> virtio-mem fails with:
>
>   qemu-system-x86_64: -device virtio-mem-pci,node=0,block-size=2097152,requested-size=0,memdev=memvirtiomem0,id=virtiomem0,bus=pci.0,addr=0x2: Incompatible with mlock
>
> Unfortunately, for example on shmem like:
>
>   <memoryBacking>
>     <locked/>
>     <access mode="shared"/>
>     <source type="memfd"/>
>   </memoryBacking>
>
> it seems to fail only after essentially (temporarily) preallocating all
> memory for the memory backend of the virtio-mem device. In the future,
> virtio-mem might be able to support mlock; until then, this is
> suboptimal but at least fails at some point.

> 11. Reservation of memory
>
> With new QEMU versions we'll want to pass "reserve=off" for the memory
> backend used, especially with hugepages and private mappings. While
> this change was merged into QEMU, it's not part of an official release
> yet. Future work.
>
> https://lore.kernel.org/qemu-devel/20210510114328.21835-1-david@redhat.com/
>
> Otherwise, when we don't currently have "size" worth of free and
> "unreserved" hugepages, we'll fail with "qemu-system-x86_64: unable to
> map backing store for guest RAM: Cannot allocate memory". The same
> thing can easily happen on anonymous memory when memory overcommit
> isn't disabled.
>
> So this is future work, but at least the QEMU part is already upstream.
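
For reference, as I read the series linked above, usage would be along
these lines (a sketch; reusing the backend from item 9):

  -object memory-backend-file,id=memvirtiomem0,mem-path=/dev/hugepages/libvirt/qemu/2-Fedora34-2,share=on,reserve=off,size=17179869184
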
So what's the difference between reserve and prealloc?
Michal