On 07/30/14 12:08, Daniel P. Berrange wrote:
> On Tue, Jul 29, 2014 at 05:05:23PM +0100, Daniel P. Berrange wrote:
> On Tue, Jul 29, 2014 at 04:40:50PM +0200, Peter Krempa wrote:
>> On 07/24/14 17:03, Peter Krempa wrote:
>>> On 07/24/14 16:40, Daniel P. Berrange wrote:
>>>> On Thu, Jul 24, 2014 at 04:30:43PM +0200, Peter Krempa wrote:
>>>>> On 07/24/14 16:21, Daniel P. Berrange wrote:
>>>>>> On Thu, Jul 24, 2014 at 02:20:22PM +0200, Peter Krempa wrote:
>>
>>>>
>>>>>> So from that POV, I'd say that when we initially configure the
>>>>>> NUMA / huge page information for a guest at boot time, we should
>>>>>> be doing that wrt to the 'maxMemory' size, instead of the current
>>>>>> 'memory' size. ie the actual NUMA topology is all setup upfront
>>>>>> even though the DIMMS are not present for some of this topology.
>>>>>>
>>>>>>> "address" determines the address in the guest's
memory space where the
>>>>>>> memory will be mapped. This is optional and not recommended
being set by
>>>>>>> the user (except for special cases).
>>>>>>>
>>>>>>> For expansion the model="pflash" device may be added.
>>>>>>>
>>>>>>> For migration the target VM needs to be started with the
>>>>>>> hotplugged modules already specified on the command line, which
>>>>>>> is in line with how we treat devices currently.
>>>>>>>
>>>>>>> My suggestion above contrasts with the approach Michal and
>>>>>>> Martin took when adding the numa and hugepage backing
>>>>>>> capabilities, as they describe a node while this describes the
>>>>>>> memory device beneath it. I think those two approaches can
>>>>>>> co-exist whilst being mutually-exclusive. Simply, when using
>>>>>>> memory hotplug, the memory will need to be specified using the
>>>>>>> memory modules. Non-hotplug guests could use the approach
>>>>>>> defined originally.
>>>>>>
>>>>>> I don't think it is viable to have two different approaches for
>>>>>> configuring NUMA / huge page information. Apps should not have to
>>>>>> change the way they configure NUMA/hugepages when they decide they
>>>>>> want to take advantage of DIMM hotplug.
>>>>>
>>>>> Well, the two approaches are orthogonal in the information they
>>>>> store. The existing approach stores the memory topology from the
>>>>> point of view of the numa node, whereas the <device> based approach
>>>>> stores it from the point of view of the memory module.
>>>>
>>>> Sure, they are clearly designed from different POVs, but I'm saying
>>>> that from an application POV it is very unpleasant to have 2 different
>>>> ways to configure the same concept in the XML. So I really don't want
>>>> us to go down that route unless there is absolutely no other option to
>>>> achieve an acceptable level of functionality. If that really were the
>>>> case, then I would strongly consider reverting everything related to
>>>> NUMA that we have just done during this dev cycle and not releasing it
>>>> as is.
>>>>
>>>>> The difference is that the existing approach currently wouldn't
>>>>> allow splitting a numa node into more memory devices to allow
>>>>> plugging/unplugging them.
>>>>
>>>> There's no reason why we have to assume 1 memory slot per guest or
>>>> per node when booting the guest. If the user wants the ability to
>>>> unplug, they could set their XML config so the guest has arbitrary
>>>> slot granularity. eg if i have a guest
>>>>
>>>> - memory == 8 GB
>>>> - max-memory == 16 GB
>>>> - NUMA nodes == 4
>>>>
>>>> Then we could allow them to specify 32 memory slots each 512 MB
>>>> in size. This would allow them to plug/unplug memory from NUMA
>>>> nodes in 512 MB granularity.
>>
>> In real hardware you still can plug in modules of different sizes. (eg
>> 1GiB + 2GiB) ...
>
> I was just illustrating that as an example of the default we'd
> write into the XML if the app hadn't explicitly given any slot
> info themselves. If doing it manually you can of course list
> the slots with arbitrary sizes, each a different size.
That was a misunderstanding on my part. I was thinking that the user
wouldn't be able to specify the slot sizes manually, which would lead to
the inflexibility I was describing.

Having the option to do that seems fine to me, along with providing some
sane defaults.

As for sane defaults, with no configuration I'd stick all the memory into
a single module and just allow plugging in more.
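
For illustration, assuming a hypothetical <memory> device element along
the lines discussed earlier in this thread (none of the element or
attribute names below are settled syntax), such a default for a non-NUMA
guest with 8 GiB of RAM could come out roughly as:

  <memory unit='KiB'>8388608</memory>
  <devices>
    <!-- all of the initial RAM placed in a single pluggable module -->
    <memory model='dimm'>
      <target>
        <size unit='KiB'>8388608</size>
      </target>
    </memory>
  </devices>

Additional modules would then simply be plugged in next to it.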
>
>>> Well, while this makes it pretty close to real hardware, the emulated
>>> one doesn't have a problem with plugging "dimms" of weird
>>> (non-power-of-2) sizing. And we are losing flexibility due to that.
>>>
>>
>> Hmm, now that the rest of the hugepage stuff was pushed and the release
>> is rather soon, what approach should I take? I'd rather avoid crippling
>> the interface for memory hotplug and having to add separate APIs and
>> other stuff, and mostly I'd like to avoid having to re-do it after
>> consumers of libvirt deem it to be inflexible.
>
> NB, as a general point of design, it isn't our goal to always directly
> expose every possible way of configuring things that QEMU allows. If
> there are multiple ways to achieve the same end goal it is valid for
> libvirt to pick a particular approach and not expose all possible QEMU
> flexibility. This is especially true if this makes cross-hypervisor
> support of the feature more practical.
>
> Looking at the big picture, we've got a bunch of memory related
> configuration sets
>
> - Guest NUMA topology setup, assigning vCPUs and RAM to guest nodes
>
>   <cpu>
>     <numa>
>       <cell id='0' cpus='0' memory='512000'/>
>       <cell id='1' cpus='1' memory='512000'/>
>       <cell id='2' cpus='2-3' memory='1024000'/>
Alternatively, we could allow the user to omit the memory attribute when
memory modules are specified and re-calculate it, as in the hotplug case
(see the sketch after this example).
>     </numa>
>   </cpu>
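
A minimal sketch of that (the module syntax below is only a placeholder):
each cell's size would be computed as the sum of the modules targeting
that node, so with

  <cpu>
    <numa>
      <cell id='0' cpus='0'/>
      <cell id='1' cpus='1'/>
    </numa>
  </cpu>
  <devices>
    <memory model='dimm'>
      <target><size unit='KiB'>262144</size><node>0</node></target>
    </memory>
    <memory model='dimm'>
      <target><size unit='KiB'>524288</size><node>1</node></target>
    </memory>
  </devices>

cell 0 would end up with 262144 KiB and cell 1 with 524288 KiB.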
>
> - Request the use of huge pages, optionally different size
> per guest NUMA node
>
>   <memoryBacking>
>     <hugepages/>
>   </memoryBacking>
>
>   <memoryBacking>
>     <hugepages>
>       <page size='2048' unit='KiB' nodeset='0,1'/>
>       <page size='1' unit='GiB' nodeset='2'/>
>     </hugepages>
>   </memoryBacking>
>
> - Mapping of guest NUMA nodes to host NUMA nodes
>
>   <numatune>
>     <memory mode="strict" nodeset="1-4,^3"/>
>     <memnode cellid="0" mode="strict" nodeset="1"/>
>     <memnode cellid="1" mode="strict" nodeset="2"/>
>   </numatune>
>
>
> At the QEMU level, aside from the size of the DIMM, the memory slot
> device lets you
>
> 1. Specify guest NUMA node to attach to
> 2. Specify host NUMA node to assign to
> 3. Request use of huge pages, optionally with size
[snip]
> So I think it is valid for libvirt to expose the memory slot feature
> by just specifying the RAM size and the guest NUMA node, and infer huge
> page usage, huge page size and host NUMA node from existing data that
> libvirt has in its domain XML document elsewhere.
>
> I meant to outline how I thought hotplug/unplug would interact with
> the existing data.
>
> When first booting the guest
>
>  - If the XML does not include any memory slot info, we should
>    add minimum possible memory slots to match the per-guest
>    NUMA node config.
>
>  - If XML does include slots, then we must validate that the
>    sum of the memory for slots listed against each guest NUMA
>    node matches the memory set in /cpu/numa/cell/@memory
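
To make that validation concrete (placeholder module syntax again): for a
cell declared as

  <cell id='0' cpus='0' memory='512000'/>
  <!-- slots targeting node 0: 262144 KiB + 249856 KiB = 512000 KiB -->

two slots of 262144 KiB and 249856 KiB targeting node 0 add up to exactly
512000 KiB and would pass, while any other total for node 0 would have to
be rejected at define/start time.
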
My idea was that the user would also be able to leave out <memoryBacking>
and the rest of that existing information and provide just the memory
module configuration. On the other hand, making the other information
above mandatory when using memory hotplug will make the code simpler and
allow us to re-use that data.
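
To illustrate the re-use (the module syntax below is a placeholder): a
module carrying only a size and a guest node could take its huge page
size from <memoryBacking> and its host placement from <numatune>:

  <memoryBacking>
    <hugepages>
      <page size='2048' unit='KiB' nodeset='0'/>
    </hugepages>
  </memoryBacking>
  <numatune>
    <memnode cellid='0' mode='strict' nodeset='1'/>
  </numatune>
  ...
  <memory model='dimm'>
    <target>
      <size unit='KiB'>524288</size>
      <node>0</node>
    </target>
  </memory>
  <!-- 2048 KiB pages and host node 1 would be inferred from the
       elements above -->
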
> When hugepages are in use we need to make sure we validate that we're
> adding slots whose size is a multiple of the huge page size. The code
> should already be validating that each NUMA node is a multiple of
> the configured huge page size for that node.
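
For example, with

  <page size='2048' unit='KiB' nodeset='1'/>

a 1048576 KiB slot targeting node 1 is acceptable (512 pages of 2048 KiB),
while a 500000 KiB one would have to be rejected, since 500000 is not a
multiple of 2048.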
>
> When hotplugging / unplugging
>
>  - Libvirt would update the /cpu/numa/cell/@memory attribute
>    and /memory element to reflect the newly added/removed DIMM
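
So, for example, plugging a 524288 KiB DIMM into guest node 0 of the
earlier example would change

  <cell id='0' cpus='0' memory='512000'/>

to

  <cell id='0' cpus='0' memory='1036288'/>

and grow the total in the /memory element by the same 524288 KiB.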
>
> Regards,
> Daniel

Peter