On Thu, 2011-05-05 at 17:38 +0800, Osier Yang wrote:
Hi, All,
This is a simple implementation of NUMA tuning support based on the binary
program 'numactl'. It currently only supports binding memory to specified
nodes, using the "--membind" option. It may need to support more, but I'd
like to send it early to make sure the principle is correct.
Ideally, NUMA tuning support should be added to qemu-kvm first, so that
it provides command-line options and all libvirt needs to do is pass them
through. Unfortunately qemu-kvm doesn't support this yet, so for now the
only option is to use numactl. It forks a process, which is a bit more
expensive than qemu-kvm doing the NUMA tuning internally with libnuma, but
I guess it shouldn't matter much.
The NUMA tuning XML looks like:
<numatune>
  <membind nodeset='+0-4,8-12'/>
</numatune>
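For illustration only, here is a rough Python sketch of the idea: read the nodeset from the XML above and prefix the emulator command line with numactl. The function names and the emulator path are hypothetical and are not taken from the patch; the nodeset string is passed to "--membind" unchanged.

```python
import xml.etree.ElementTree as ET


def membind_nodeset(numatune_xml):
    """Return the nodeset attribute of <membind>, or None if it is absent."""
    root = ET.fromstring(numatune_xml)
    membind = root.find("membind")
    return membind.get("nodeset") if membind is not None else None


def numactl_membind_argv(nodeset, emulator_argv):
    """Prefix an emulator command line with 'numactl --membind=<nodeset>'.

    If no nodeset is given, the command line is returned unchanged.
    """
    if not nodeset:
        return list(emulator_argv)
    return ["numactl", "--membind=" + nodeset] + list(emulator_argv)


xml = "<numatune><membind nodeset='+0-4,8-12'/></numatune>"
# The emulator path and arguments below are placeholders.
argv = numactl_membind_argv(membind_nodeset(xml),
                            ["/usr/bin/qemu-kvm", "-name", "demo"])
print(" ".join(argv))
```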
Any thoughts or feedback are appreciated.
Osier:
A couple of thoughts/observations:
1) You can accomplish the same thing -- restricting a domain's memory to
a specified set of nodes -- using the cpuset cgroup that is already
associated with each domain. E.g.,
cgset -r cpuset.mems=<nodeset> /libvirt/qemu/<domain>
Or the equivalent libcgroup call.
However, numactl is more flexible; especially if you intend to support
more policies: preferred, interleave. Which leads to the question:
2) Do you really want the full "membind" semantics as opposed to
"preferred" by default? Membind policy will restrict the VM's pages to
the specified nodeset. When all of the nodes in the nodeset reach their
minimum watermark, allocations will initiate reclaim/stealing and wait
for pages to become available, or the task will be OOM-killed because of
mempolicy. Membind works the same as cpuset.mems in this respect.
Preferred policy will
keep memory allocations [but not vcpu execution] local to the specified
set of nodes as long as there is sufficient memory, and will silently
"overflow" allocations to other nodes when necessary. I.e., it's a
little more forgiving under memory pressure.
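As a rough sketch of the cpuset approach in 1), the cgset command amounts to writing the nodeset into the domain's cpuset.mems file. The Python below assumes the cpuset controller is mounted somewhere under a directory passed in as cgroup_root; that mount point and the function name are assumptions for illustration, and the mount location varies between systems.

```python
import os


def set_cpuset_mems(cgroup_root, domain, nodeset):
    """Write nodeset (e.g. "0-1") to <cgroup_root>/libvirt/qemu/<domain>/cpuset.mems.

    cgroup_root is wherever the cpuset controller is mounted on this host;
    it is an assumption of this sketch, not something libvirt dictates.
    Returns the path that was written.
    """
    path = os.path.join(cgroup_root, "libvirt", "qemu", domain, "cpuset.mems")
    with open(path, "w") as f:
        f.write(nodeset)
    return path
```

Like membind, this is a hard restriction, so the same caveats about reclaim and OOM under memory pressure apply.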
But then pinning a VM's vcpus to the physical cpus of a set of nodes and
retaining the default local allocation policy will have the same effect
as "preferred" while ensuring that the VM component tasks execute
locally to the memory footprint. Currently, I do this by looking up the
cpulist associated with the node[s] from e.g.,
/sys/devices/system/node/node<i>/cpulist and using that list with the
vcpu.cpuset attribute. Adding a 'nodeset' attribute to the
cputune.vcpupin element would simplify specifying that configuration.
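The lookup I do by hand could be sketched roughly like this in Python. The function names are made up for illustration, and the sysfs_root parameter exists only so the code can be pointed at a different directory for testing; the resulting string is what would go into the cpuset attribute.

```python
import os


def node_cpulist(node, sysfs_root="/sys/devices/system/node"):
    """Return the cpulist string for one NUMA node, e.g. "0-3"."""
    path = os.path.join(sysfs_root, "node%d" % node, "cpulist")
    with open(path) as f:
        return f.read().strip()


def cpuset_for_nodes(nodes, sysfs_root="/sys/devices/system/node"):
    """Join the cpulists of several nodes into one comma-separated cpuset.

    Nodes with an empty cpulist (memory-only nodes) are skipped.
    """
    lists = (node_cpulist(n, sysfs_root) for n in nodes)
    return ",".join(l for l in lists if l)
```

A 'nodeset' attribute would let libvirt do this expansion itself instead of each user repeating it.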
Regards,
Lee