[libvirt] [RFC] Memory hotplug for qemu guests and the relevant XML parts

Hi,

qemu recently added support for memory hotplug (hot unplug will arrive later), since roughly commit bef3492d1169a54c966cecd0e0b1dd25e9341582 in qemu.git. For the hotplug to work the VM needs to be started with a certain number of "dimm" slots for plugging virtual memory modules. The memory of the VM at startup has to occupy at least one of the slots. Later on the management can decide to plug more memory into the guest by inserting a virtual memory module.

For representing this in libvirt I'm thinking of using the <devices> section of our domain XML, where we'd add a new device type:

<memory type="ram">
  <source .../> <!-- will be elaborated below -->
  <target .../> <!-- will be elaborated below -->
  <address type="acpi-dimm" slot="1"/>
</memory>

type="ram" denotes that we are adding RAM memory. This will allow possible extensions, for example adding a generic pflash (type="flash", type="rom") device or another memory type mapped into the address space.

To enable this infrastructure qemu needs two command line options supplied: one setting the maximum amount of supportable memory and the second one the maximum number of memory modules (capped at 256 due to ACPI).

The current XML format for specifying memory looks like this:

<memory unit='KiB'>524288</memory>
<currentMemory unit='KiB'>524288</currentMemory>

I'm thinking of adding the following attributes to the memory element:

<memory slots='16' max-memory='1' max-memory-unit='TiB'/>

This would then be updated to the actual size, after summing the sizes of the memory modules, to:

<memory slots='16' max-memory='1' max-memory-unit='TiB' unit='MiB'>512</memory>

This would also make it possible to specify just the line above, and libvirt would then add a memory module holding the whole guest memory.

Representing the memory module as a device will then allow us to use the existing hot(un)plug APIs to do the operations on the actual VM.

For the ram memory type the source and target elements will allow specifying the following options.

For backing the guest with normal memory:

<source type="ram" size="500" unit="MiB" host-node="2"/>

For a hugepage-backed guest:

<source type="hugepage" page-size="2048" count="1024" node="1"/>

Note: the node attribute targets the host NUMA node and is optional.

And possibly others for the rom/flash types:

<source type="file" path="/asdf"/>

For targeting the RAM module the target element could have the following format:

<target model="dimm" node='2' address='0xdeadbeef'/>

"node" determines the guest NUMA node to connect the memory "module" to. The attribute is optional for non-NUMA guests, in which case node 0 is assumed.

"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended to be set by the user (except for special cases).

For expansion the model="pflash" device may be added.

For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line with how we treat devices currently.

My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities, as theirs describes a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually exclusive: when using memory hotplug, the memory will need to be specified using the memory modules; non-hotplug guests could use the approach defined originally.
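For illustration, the qemu side of this would look roughly like the following (option spellings as of qemu 2.1; just a sketch, not the exact command line libvirt would build):

  qemu-system-x86_64 ... -m size=512M,slots=16,maxmem=1T ...

and hot-plugging a module later would translate to monitor commands roughly like:

  (qemu) object_add memory-backend-ram,id=mem1,size=1G
  (qemu) device_add pc-dimm,id=dimm1,memdev=mem1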
That concludes my thoughts on this subject, but I'm open to discussion or other approaches (I haven't started the implementation yet). Thanks, Peter

On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
For the hotplug to work the VM needs to be started with a certain number of "dimm" slots for plugging virtual memory modules. The memory of the VM at startup has to occupy at least one of the slots. Later on the management can decide to plug more memory into the guest by inserting a virtual memory module.
For representing this in libvirt I'm thinking of using the <devices> section of our domain XML where we'd add a new device type:
<memory type="ram">
  <source .../> <!-- will be elaborated below -->
  <target .../> <!-- will be elaborated below -->
  <address type="acpi-dimm" slot="1"/>
</memory>
type="ram" to denote that we are adding RAM memory. This will allow possible extensions for example to add a generic pflash (type="flash", type="rom") device or other memory type mapped in the address space.
To enable this infrastructure qemu needs two command line options supplied, one setting the maximum amount of supportable memory and the second one the maximum number of memory modules (capped at 256 due to ACPI).
The current XML format for specifying memory looks like this: <memory unit='KiB'>524288</memory> <currentMemory unit='KiB'>524288</currentMemory>
I'm thinking of adding the following attributes to the memory element:
<memory slots='16' max-memory='1' max-memory-unit='TiB'/>
This would then be updated to the actual size, after summing the sizes of the memory modules, to: <memory slots='16' max-memory='1' max-memory-unit='TiB' unit='MiB'>512</memory>
Given that we already have <memory> and <currentMemory> it feels a little odd to be adding max-memory as an attribute instead of doing <maxMemory slots='16' unit='TiB'>1</maxMemory>.
This would also make it possible to specify just the line above, and libvirt would then add a memory module holding the whole guest memory.
If a guest has multiple NUMA nodes, this would imply that the one default memory module would span multiple NUMA nodes. That does not make much conceptual sense to me. At the very minimum you need to have 1 memory slot per guest NUMA node.
Representing the memory module as a device will then allow us to use the existing hot(un)plug APIs to do the operations on the actual VM.
Ok, that does sort of make sense as a goal.
For the ram memory type the source and target elements will allow specifying the following options.
For backing the guest with normal memory: <source type="ram" size="500" unit="MiB" host-node="2"/>
For hugepage-backed guest: <source type="hugepage" page-size="2048" count="1024" node="1"/>
This design concerns me because it seems like it is adding a lot of redundant information vs the existing XML schema work we've done to represent NUMA placement / huge page allocation for VMs.
Note: the node attribute targets the host NUMA node and is optional.
And possibly others for the rom/flash types: <source type="file" path="/asdf"/>
For targeting the RAM module the target element could have the following format:
<target model="dimm" node='2' address='0xdeadbeef'/>
"node" determines the guest numa node to connect the memory "module" to. The attribute is optional for non-numa guests or node 0 is assumed.
If I'm thinking about this from a physical hardware POV, it doesn't make a whole lot of sense for the NUMA node to be configurable at the time you plug in the DIMM. The NUMA affinity is a property of how the slot is wired into the memory controller. Plugging the DIMM cannot change that. So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt the 'maxMemory' size, instead of the current 'memory' size, i.e. the actual NUMA topology is all set up upfront even though the DIMMs are not present for some of this topology.
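A rough sketch of what I mean, using the <maxMemory> element suggested above together with the existing <cell> syntax (sizes purely illustrative, cell memory in KiB), where the cells describe the full 16 GiB even though only part of it is plugged at boot:

  <maxMemory slots='16' unit='GiB'>16</maxMemory>
  <cpu>
    <numa>
      <cell id='0' cpus='0-1' memory='8388608'/>
      <cell id='1' cpus='2-3' memory='8388608'/>
    </numa>
  </cpu>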
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line with how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities, as theirs describes a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually exclusive: when using memory hotplug, the memory will need to be specified using the memory modules; non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.

Regards, Daniel

On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
...
For targeting the RAM module the target element could have the following format:
<target model="dimm" node='2' address='0xdeadbeef'/>
"node" determines the guest numa node to connect the memory "module" to. The attribute is optional for non-numa guests or node 0 is assumed.
If I'm thinking about this from a physical hardware POV, it doesn't make a whole lot of sense for the NUMA node to be configurable at the time you plug in the DIMM. The NUMA affinity is a property of how the slot is wired into the memory controller. Plugging the DIMM cannot change that.
While this is true for physical hardware, the emulated one apparently supports changing a slot's position in the NUMA topology. Additionally this allows using a non-uniform mapping of memory modules to NUMA nodes. Are you suggesting that we should bind certain slots to certain NUMA nodes in advance and thus try to emulate the limitations of the physical hardware?
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the NUMA node, whereas the <device>-based approach stores it from the point of view of the memory module. The difference is that the existing approach currently wouldn't allow splitting a NUMA node into more memory devices to allow plugging/unplugging them.
Regards, Daniel
Peter

On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
...
For targeting the RAM module the target element could have the following format:
<target model="dimm" node='2' address='0xdeadbeef'/>
"node" determines the guest numa node to connect the memory "module" to. The attribute is optional for non-numa guests or node 0 is assumed.
If I'm thinking about this from a physical hardware POV, it doesn't make a whole lot of sense for the NUMA node to be configurable at the time you plug in the DIMM. The NUMA affinity is a property of how the slot is wired into the memory controller. Plugging the DIMM cannot change that.
While this is true for physical hardware, the emulated one apparently supports changing a slot's position in the numa topology. Additionally this allows to use a non-uniform mapping of memory modules to numa nodes.
Are you suggesting that we should bind certain slots to certain numa nodes in advance thus try to emulate the limitations of the physical hardware?
Pretty much, yes. Or provide an API that lets us change the slot binding at the time we need to plug the DIMMS if we really need that.
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POVs, but I'm saying that from an application POV it is very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. E.g. if I have a guest with

- memory == 8 GB
- max-memory == 16 GB
- NUMA nodes == 4

then we could allow them to specify 32 memory slots, each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes at 512 MB granularity.

Regards, Daniel
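As a rough illustration of one such 512 MB slot attached to guest node 0, reusing the element names from Peter's proposal at the top of the thread (not an existing schema, attribute placement purely a sketch):

  <memory type='ram'>
    <source type='ram' size='512' unit='MiB'/>
    <target model='dimm' node='0'/>
    <address type='acpi-dimm' slot='3'/>
  </memory>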

On 07/24/14 16:40, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
...
For targeting the RAM module the target element could have the following format:
<target model="dimm" node='2' address='0xdeadbeef'/>
"node" determines the guest numa node to connect the memory "module" to. The attribute is optional for non-numa guests or node 0 is assumed.
If I'm thinking about this from a physical hardware POV, it doesn't make a whole lot of sense for the NUMA node to be configurable at the time you plug in the DIMM. The NUMA affinity is a property of how the slot is wired into the memory controller. Plugging the DIMM cannot change that.
While this is true for physical hardware, the emulated one apparently supports changing a slot's position in the numa topology. Additionally this allows to use a non-uniform mapping of memory modules to numa nodes.
Are you suggesting that we should bind certain slots to certain numa nodes in advance thus try to emulate the limitations of the physical hardware?
Pretty much, yes. Or provide an API that lets us change the slot binding at the time we need to plug the DIMMS if we really need that.
Well, while it might be workable, that idea makes the whole thing far less flexible than what qemu actually allows. I'm not entirely against making it less flexible, but I'm concerned that someone might ask for that flexibility later. Also, adding new APIs is something I wanted to avoid as long as we have a pretty powerful device plug/unplug API.
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POV, but I'm saying that from an application POV is it very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. eg if i have a guest
- memory == 8 GB - max-memory == 16 GB - NUMA nodes == 4
Then we could allow them to specify 32 memory slots each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes in 512 MB granularity.
Well, while this makes it pretty close to real hardware, the emulated one doesn't have a problem with plugging "dimms" of weird (non-power-of-2) sizes, and we are losing flexibility due to that.
Regards, Daniel

On 07/24/14 17:03, Peter Krempa wrote:
On 07/24/14 16:40, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POV, but I'm saying that from an application POV is it very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. eg if i have a guest
- memory == 8 GB - max-memory == 16 GB - NUMA nodes == 4
Then we could allow them to specify 32 memory slots each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes in 512 MB granularity.
In real hardware you can still plug in modules of different sizes (e.g. 1GiB + 2GiB). ...
Well, while this makes it pretty close to real hardware, the emulated one doesn't have a problem with plugging "dimms" of weird (non-power-of-2) sizes, and we are losing flexibility due to that.
Hmm, now that the rest of the hugepage stuff has been pushed and the release is rather soon, what approach should I take? I'd rather avoid crippling the interface for memory hotplug and having to add separate APIs and other machinery, and most of all I'd like to avoid having to re-do it after consumers of libvirt deem it inflexible. Peter

On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
On 07/24/14 17:03, Peter Krempa wrote:
On 07/24/14 16:40, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POV, but I'm saying that from an application POV is it very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. eg if i have a guest
- memory == 8 GB - max-memory == 16 GB - NUMA nodes == 4
Then we could allow them to specify 32 memory slots each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes in 512 MB granularity.
In real hardware you can still plug in modules of different sizes (e.g. 1GiB + 2GiB). ...
I was just illustrating that as an example of the default we'd write into the XML if the app hadn't explicitly given any slot info itself. If doing it manually you can of course list the slots with arbitrary sizes, each slot potentially a different size.
Well, while this makes it pretty close to real hardware, the emulated one doesn't have a problem with plugging "dimms" of weird (non-power-of-2) sizes, and we are losing flexibility due to that.
Hmm, now that the rest of the hugepage stuff has been pushed and the release is rather soon, what approach should I take? I'd rather avoid crippling the interface for memory hotplug and having to add separate APIs and other machinery, and most of all I'd like to avoid having to re-do it after consumers of libvirt deem it inflexible.
NB, as a general point of design, it isn't our goal to always directly expose every possible way of configuring things that QEMU allows. If there are multiple ways to achieve the same end goal it is valid for libvirt to pick a particular approach and not expose all possible QEMU flexibility. This is especially true if this makes cross-hypervisor support of the feature more practical.

Looking at the big picture, we've got a bunch of memory-related configuration sets:

- Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes:

  <cpu>
    <numa>
      <cell id='0' cpus='0' memory='512000'/>
      <cell id='1' cpus='1' memory='512000'/>
      <cell id='2' cpus='2-3' memory='1024000'/>
    </numa>
  </cpu>

- Requesting the use of huge pages, optionally with a different size per guest NUMA node:

  <memoryBacking>
    <hugepages/>
  </memoryBacking>

  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0,1'/>
      <page size='1' unit='GiB' nodeset='2'/>
    </hugepages>
  </memoryBacking>

- Mapping of guest NUMA nodes to host NUMA nodes:

  <numatune>
    <memory mode="strict" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
    <memnode cellid="1" mode="strict" nodeset="2"/>
  </numatune>

At the QEMU level, aside from the size of the DIMM, the memory slot device lets you:

1. Specify the guest NUMA node to attach to
2. Specify the host NUMA node to assign to
3. Request the use of huge pages, optionally with a size

Item 1 is clearly needed.

Item 2 is something that I think is not relevant to expose in libvirt. We already define a mapping of guest nodes to host nodes, so it can be inferred from that. It is true that specifying the host node explicitly is more flexible, because it lets you map different DIMMs within a guest node to different host nodes. I think this flexibility is a feature in search of a problem. It doesn't make sense from a performance optimization POV to have a single guest node with DIMMs mapped to more than one host node. If you find yourself having to do that it is a sign that you didn't configure enough guest nodes in the first place.

Item 3 is a slightly more fuzzy one. If we inferred it from the existing hugepage mapping, then any hotplugged memory would use the same page size as the existing memory in that node. If explicitly specified then you could configure a NUMA node with a mixture of 4k, 2MB and 1GB pages. I could see why you might want this if, say, you have set up a 1GB page size for the node but only want to add 256MB of RAM to the node; you'd have to use 2MB pages. If I consider what it means to the guest from a functional performance POV though, I'm pretty sceptical that it is a sensible thing to want to do. People are using hugepages so that the guest can get predictable memory access latency and better TLB efficiency. Consider a guest with 2 NUMA nodes, where the first node uses 4KB pages and the second node uses 2MB or 1GB pages. Now in the guest OS an application needing predictable-latency memory access can be bound to the second guest NUMA node to achieve that. If we consider configuring a single NUMA node with a mixture of page sizes, then there is no way for the guest administrator to set up their guest applications to take advantage of the specific huge pages allocated to the guest.

Now, from the QEMU CLI there is the ability to configure all these different options, but that doesn't imply that all the configuration possibilities are actually intended for use.
I.e. you need to be able to specify the host NUMA node, huge page usage etc. against the slot, but that doesn't mean it is intended that we use that to configure different settings for multiple DIMMs within the same NUMA node. So I think it is valid for libvirt to expose the memory slot feature specifying just the RAM size and the guest NUMA node, and to infer huge page usage, huge page size and host NUMA node from existing data that libvirt has elsewhere in its domain XML document.

Regards, Daniel
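As a sketch of what that reduced interface could look like (borrowing Peter's element names; purely illustrative, not an existing schema), a slot would then carry only a size and a guest node, with the host node and page size inferred from <numatune> and <memoryBacking> for that guest node:

  <memory type='ram'>
    <source size='1' unit='GiB'/>
    <target model='dimm' node='1'/>
  </memory>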

On Tue, Jul 29, 2014 at 05:05:23PM +0100, Daniel P. Berrange wrote:
On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
On 07/24/14 17:03, Peter Krempa wrote:
On 07/24/14 16:40, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
> "address" determines the address in the guest's memory space where the > memory will be mapped. This is optional and not recommended being set by > the user (except for special cases). > > For expansion the model="pflash" device may be added. > > For migration the target VM needs to be started with the hotplugged > modules already specified on the command line, which is in line how we > treat devices currently. > > My suggestion above contrasts with the approach Michal and Martin took > when adding the numa and hugepage backing capabilities as they describe > a node while this describes the memory device beneath it. I think those > two approaches can co-exist whilst being mutually-exclusive. Simply when > using memory hotplug, the memory will need to be specified using the > memory modules. Non-hotplug guests could use the approach defined > originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POV, but I'm saying that from an application POV is it very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. eg if i have a guest
- memory == 8 GB - max-memory == 16 GB - NUMA nodes == 4
Then we could allow them to specify 32 memory slots each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes in 512 MB granularity.
In real hardware you can still plug in modules of different sizes (e.g. 1GiB + 2GiB). ...
I was just illustrating that as an example of the default we'd write into the XML if the app hadn't explicitly given any slot info themselves. If doing it manually you can of course list the slots with arbitrary sizes, each a different size.
Well, while this makes it pretty close to real hardware, the emulated one doesn't have a problem with plugging "dimms" of weird (non-power-of-2) sizes, and we are losing flexibility due to that.
Hmm, now that the rest of the hugepage stuff has been pushed and the release is rather soon, what approach should I take? I'd rather avoid crippling the interface for memory hotplug and having to add separate APIs and other machinery, and most of all I'd like to avoid having to re-do it after consumers of libvirt deem it inflexible.
NB, as a general point of design, it isn't our goal to always directly expose every possible way to configuring things that QEMU allows. If there are multiple ways to achieve the same end goal it is valid for libvirt to pick a particular approach and not expose all possible QEMU flexibility. This is especially true if this makes cross-hypervisor support of the feature more practical.
Looking at the big picture, we've got a bunch of memory related configuration sets
- Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes
<cpu> <numa> <cell id='0' cpus='0' memory='512000'/> <cell id='1' cpus='1' memory='512000'/> <cell id='2' cpus='2-3' memory='1024000'/> </numa> </cpu>
- Request the use of huge pages, optionally different size per guest NUMA node
<memoryBacking> <hugepages/> </memoryBacking>
<memoryBacking> <hugepages> <page size='2048' unit='KiB' nodeset='0,1'/> <page size='1' unit='GiB' nodeset='2'/> </hugepages> </memoryBacking>
- Mapping of guest NUMA nodes to host NUMA nodes
<numatune> <memory mode="strict" nodeset="1-4,^3"/> <memnode cellid="0" mode="strict" nodeset="1"/> <memnode cellid="1" mode="strict" nodeset="2"/> </numatune>
At the QEMU level, aside from the size of the DIMM, the memory slot device lets you
1. Specify guest NUMA node to attach to 2. Specify host NUMA node to assign to 3. Request use of huge pages, optionally with size
[snip]
So I think it is valid for libvirt to expose the memory slot feature just specifying the RAM size and the guest NUMA node and infer huge page usage, huge page size and host NUMA node from existing data that libvirt has in its domain XML document elsewhere.
I meant to outline how I thought hotplug/unplug would interact with the existing data.

When first booting the guest:

- If the XML does not include any memory slot info, we should add the minimum possible number of memory slots to match the per-guest NUMA node config.
- If the XML does include slots, then we must validate that the sum of the memory for the slots listed against each guest NUMA node matches the memory set in /cpu/numa/cell/@memory.

When hugepages are in use we need to make sure we validate that we're adding slots whose size is a multiple of the huge page size. The code should already be validating that each NUMA node is a multiple of the configured huge page size for that node.

When hotplugging / unplugging:

- Libvirt would update the /cpu/numa/cell/@memory attribute and the /memory element to reflect the newly added/removed DIMM.

Regards, Daniel
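As a worked example of that validation (numbers purely illustrative): a guest NUMA cell with memory='4194304' (4 GiB expressed in KiB) backed by 2 MiB hugepages could be split into eight slots of 524288 KiB each; the per-cell sum (8 x 524288 = 4194304 KiB) matches the cell size, and each slot is a multiple of the 2048 KiB page size, so both checks pass.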

On 07/30/14 12:08, Daniel P. Berrange wrote:
On Tue, Jul 29, 2014 at 05:05:23PM +0100, Daniel P. Berrange wrote:
On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
On 07/24/14 17:03, Peter Krempa wrote:
On 07/24/14 16:40, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
> On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
> So from that POV, I'd say that when we initially configure the
> NUMA / huge page information for a guest at boot time, we should
> be doing that wrt to the 'maxMemory' size, instead of the current
> 'memory' size. ie the actual NUMA topology is all setup upfront
> even though the DIMMS are not present for some of this topology.
>
>> "address" determines the address in the guest's memory space where the
>> memory will be mapped. This is optional and not recommended being set by
>> the user (except for special cases).
>>
>> For expansion the model="pflash" device may be added.
>>
>> For migration the target VM needs to be started with the hotplugged
>> modules already specified on the command line, which is in line how we
>> treat devices currently.
>>
>> My suggestion above contrasts with the approach Michal and Martin took
>> when adding the numa and hugepage backing capabilities as they describe
>> a node while this describes the memory device beneath it. I think those
>> two approaches can co-exist whilst being mutually-exclusive. Simply when
>> using memory hotplug, the memory will need to be specified using the
>> memory modules. Non-hotplug guests could use the approach defined
>> originally.
>
> I don't think it is viable to have two different approaches for configuring
> NUMA / huge page information. Apps should not have to change the way they
> configure NUMA/hugepages when they decide they want to take advantage of
> DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
Sure, they are clearly designed from different POV, but I'm saying that from an application POV is it very unpleasant to have 2 different ways to configure the same concept in the XML. So I really don't want us to go down that route unless there is absolutely no other option to achieve an acceptable level of functionality. If that really were the case, then I would strongly consider reverting everything related to NUMA that we have just done during this dev cycle and not releasing it as is.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
There's no reason why we have to assume 1 memory slot per guest or per node when booting the guest. If the user wants the ability to unplug, they could set their XML config so the guest has arbitrary slot granularity. eg if i have a guest
- memory == 8 GB - max-memory == 16 GB - NUMA nodes == 4
Then we could allow them to specify 32 memory slots each 512 MB in size. This would allow them to plug/unplug memory from NUMA nodes in 512 MB granularity.
In real hardware you can still plug in modules of different sizes (e.g. 1GiB + 2GiB). ...
I was just illustrating that as an example of the default we'd write into the XML if the app hadn't explicitly given any slot info themselves. If doing it manually you can of course list the slots with arbitrary sizes, each a different size.
That was a misunderstanding on my part. I was thinking that the user wouldn't be able to specify the slot sizes manually, which would lead to the inflexibility I was describing. Having the option to do that seems fine to me, along with providing some sane defaults. As for sane defaults, with no configuration I'd stick all the memory into a single module and just allow plugging in more.
Well, while this makes it pretty close to real hardware, the emulated one doesn't have a problem with plugging "dimms" of weird (non-power-of-2) sizes, and we are losing flexibility due to that.
Hmm, now that the rest of the hugepage stuff has been pushed and the release is rather soon, what approach should I take? I'd rather avoid crippling the interface for memory hotplug and having to add separate APIs and other machinery, and most of all I'd like to avoid having to re-do it after consumers of libvirt deem it inflexible.
NB, as a general point of design, it isn't our goal to always directly expose every possible way to configuring things that QEMU allows. If there are multiple ways to achieve the same end goal it is valid for libvirt to pick a particular approach and not expose all possible QEMU flexibility. This is especially true if this makes cross-hypervisor support of the feature more practical.
Looking at the big picture, we've got a bunch of memory related configuration sets
- Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes
<cpu> <numa> <cell id='0' cpus='0' memory='512000'/> <cell id='1' cpus='1' memory='512000'/> <cell id='2' cpus='2-3' memory='1024000'/>
Alternatively, we can allow the user to omit the memory attribute if memory modules are specified and re-calculate it as in the hotplug case.
</numa> </cpu>
- Request the use of huge pages, optionally different size per guest NUMA node
<memoryBacking> <hugepages/> </memoryBacking>
<memoryBacking> <hugepages> <page size='2048' unit='KiB' nodeset='0,1'/> <page size='1' unit='GiB' nodeset='2'/> </hugepages> </memoryBacking>
- Mapping of guest NUMA nodes to host NUMA nodes
<numatune> <memory mode="strict" nodeset="1-4,^3"/> <memnode cellid="0" mode="strict" nodeset="1"/> <memnode cellid="1" mode="strict" nodeset="2"/> </numatune>
At the QEMU level, aside from the size of the DIMM, the memory slot device lets you
1. Specify guest NUMA node to attach to 2. Specify host NUMA node to assign to 3. Request use of huge pages, optionally with size
[snip]
So I think it is valid for libvirt to expose the memory slot feature just specifying the RAM size and the guest NUMA node and infer huge page usage, huge page size and host NUMA node from existing data that libvirt has in its domain XML document elsewhere.
I meant to outline how I thought hotplug/unplug would interact with the existing data.
When first booting the guest
- If the XML does not include any memory slot info, we should add minimum possible memory slots to match the per-guest NUMA node config.
- If XML does include slots, then we must validate that the sum of the memory for slots listed against each guest NUMA node matches the memory set in /cpu/numa/cell/@memory
My idea was that the user would also be able to not specify <memoryBacking> and the other existing info described above, and instead provide just the memory module configuration. On the other hand, making the other information above mandatory when using memory hotplug will just make the code simpler and allow re-using that data.
When hugepages are in use we need to make sure we validate that we're adding slots whose size is a multiple of the huge page size. The code should already be validating that each NUMA node is a multiple of the configured huge page size for that node.
When hotplugging / unplugging
- Libvirt would update the /cpu/numa/cell/@memory attribute and /memory element to reflect the newly added/removed DIMM
Regards, Daniel
Peter

On Wed, Jul 30, 2014 at 01:37:36PM +0200, Peter Krempa wrote:
On 07/30/14 12:08, Daniel P. Berrange wrote:
On Tue, Jul 29, 2014 at 05:05:23PM +0100, Daniel P. Berrange wrote:
Looking at the big picture, we've got a bunch of memory related configuration sets
- Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes
<cpu> <numa> <cell id='0' cpus='0' memory='512000'/> <cell id='1' cpus='1' memory='512000'/> <cell id='2' cpus='2-3' memory='1024000'/>
Alternatively, we can allow the user to omit the memory attribute if memory modules are specified and re-calculate it as in the hotplug case.
Yes, omitting 'memory' would be ok if there are slot devices listed in the XML.
</numa> </cpu>
- Request the use of huge pages, optionally different size per guest NUMA node
<memoryBacking> <hugepages/> </memoryBacking>
<memoryBacking> <hugepages> <page size='2048' unit='KiB' nodeset='0,1'/> <page size='1' unit='GiB' nodeset='2'/> </hugepages> </memoryBacking>
- Mapping of guest NUMA nodes to host NUMA nodes
<numatune> <memory mode="strict" nodeset="1-4,^3"/> <memnode cellid="0" mode="strict" nodeset="1"/> <memnode cellid="1" mode="strict" nodeset="2"/> </numatune>
At the QEMU level, aside from the size of the DIMM, the memory slot device lets you
1. Specify guest NUMA node to attach to 2. Specify host NUMA node to assign to 3. Request use of huge pages, optionally with size
[snip]
So I think it is valid for libvirt to expose the memory slot feature just specifying the RAM size and the guest NUMA node and infer huge page usage, huge page size and host NUMA node from existing data that libvirt has in its domain XML document elsewhere.
I meant to outline how I thought hotplug/unplug would interact with the existing data.
When first booting the guest
- If the XML does not include any memory slot info, we should add minimum possible memory slots to match the per-guest NUMA node config.
- If XML does include slots, then we must validate that the sum of the memory for slots listed against each guest NUMA node matches the memory set in /cpu/numa/cell/@memory
My idea was that the user would be also able to not specify <memoryBacking> and other of those existing info and then provide just the memory module configuration. On the other hand, making the other information above mandatory when using memory hotplug will just make the code simpler and allow to re-use that data.
Yep, I think it is important to use the memoryBacking here, since if we list hugepages against the memory slot we get into a situation where you can have different slots in the same NUMA node specifying different configs, which is not a sensible setup.

Regards, Daniel

On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
On 07/24/14 16:21, Daniel P. Berrange wrote:
On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
...
For targeting the RAM module the target element could have the following format:
<target model="dimm" node='2' address='0xdeadbeef'/>
"node" determines the guest numa node to connect the memory "module" to. The attribute is optional for non-numa guests or node 0 is assumed.
If I'm thinking about this from a physical hardware POV, it doesn't make a whole lot of sense for the NUMA node to be configurable at the time you plug in the DIMM. The NUMA affinity is a property of how the slot is wired into the memory controller. Plugging the DIMM cannot change that.
While this is true for physical hardware, the emulated one apparently supports changing a slot's position in the numa topology. Additionally this allows to use a non-uniform mapping of memory modules to numa nodes.
Are you suggesting that we should bind certain slots to certain numa nodes in advance thus try to emulate the limitations of the physical hardware?
So from that POV, I'd say that when we initially configure the NUMA / huge page information for a guest at boot time, we should be doing that wrt to the 'maxMemory' size, instead of the current 'memory' size. ie the actual NUMA topology is all setup upfront even though the DIMMS are not present for some of this topology.
"address" determines the address in the guest's memory space where the memory will be mapped. This is optional and not recommended being set by the user (except for special cases).
For expansion the model="pflash" device may be added.
For migration the target VM needs to be started with the hotplugged modules already specified on the command line, which is in line how we treat devices currently.
My suggestion above contrasts with the approach Michal and Martin took when adding the numa and hugepage backing capabilities as they describe a node while this describes the memory device beneath it. I think those two approaches can co-exist whilst being mutually-exclusive. Simply when using memory hotplug, the memory will need to be specified using the memory modules. Non-hotplug guests could use the approach defined originally.
I don't think it is viable to have two different approaches for configuring NUMA / huge page information. Apps should not have to change the way they configure NUMA/hugepages when they decide they want to take advantage of DIMM hotplug.
Well, the two approaches are orthogonal in the information they store. The existing approach stores the memory topology from the point of view of the numa node whereas the <device> based approach from the point of the memory module.
The difference is that the existing approach currently wouldn't allow splitting a numa node into more memory devices to allow plugging/unplugging them.
Well, changing '<memnode cellid="1"/>' to '<memnode cellids="0-1"/>' wouldn't require that much work, I guess. I still haven't added the APIs to support changing memnode settings, so that is open too. Just my $0.02, Martin
Participants (3): Daniel P. Berrange, Martin Kletzander, Peter Krempa