[libvirt] [RFC] NUMA topology specification

Hi,

qemu supports specification of NUMA topology on the command line using the -numa option:

-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]

I see that there is no way to specify such a NUMA topology in libvirt XML. Are there plans to add support for NUMA topology specification? Is anybody already working on this? If not, I would like to add this support to libvirt.

Currently the topology specification available in libvirt (<topology sockets='1' cores='2' threads='1'/>) translates to the "-smp sockets=1,cores=2,threads=1" option of qemu. There is no equivalent in libvirt that could generate the -numa command-line option of qemu.

How about something like this? (OPTION 1)

<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>

And we could specify multiple such lines, one for each node.

The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?

Maybe something like this: (OPTION 2)

<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>

This should result in a 2-node system with each node having 1 socket with 2 cores.

Comments, suggestions?

Regards, Bharata.
--
http://bharata.sulekha.com/blog/posts.htm, http://raobharata.wordpress.com/
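For concreteness, here is a sketch (added for illustration, not part of the original mail) of how OPTION 1 could map to a qemu command line for a 2-node, 4-vCPU guest; the 512 MB per-node memory size and the sockets/cores split are illustrative assumptions:

-smp 4,sockets=2,cores=2,threads=1
-numa node,nodeid=0,cpus=0-1,mem=512
-numa node,nodeid=1,cpus=2-3,mem=512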

On 08/19/2011 14:35, Bharata B Rao wrote:
Hi,
qemu supports specification of NUMA topology on the command line using the -numa option:
-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
I see that there is no way to specify such a NUMA topology in libvirt XML. Are there plans to add support for NUMA topology specification? Is anybody already working on this? If not, I would like to add this support to libvirt.
Currently the topology specification available in libvirt (<topology sockets='1' cores='2' threads='1'/>) translates to the "-smp sockets=1,cores=2,threads=1" option of qemu. There is no equivalent in libvirt that could generate the -numa command-line option of qemu.
How about something like this? (OPTION 1)
<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>
Libvirt already supports NUMA settings (both CPU and memory) on the host, but yes, there is nothing for NUMA settings inside the guest yet. We talked once about the XML when adding support for NUMA memory settings on the host, and finally chose to introduce a new XML node for it, keeping in mind that support for NUMA settings inside the guest might be added one day. The XML is:
<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
</numatune>
So, personally, I think the new XML should go inside <numatune> as a child node.
And we could specify multiple such lines, one for each node.
The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This would give us 3 places for NUMA: one is <numatune>, another is <vcpu>, and now this one. We can't change <vcpu>, as it was introduced much earlier than <numatune>, but now that <numatune> has been introduced, I think it's better to fold all the NUMA settings into it.
This should result in a 2 node system with each node having 1 socket with 2 cores.
Comments, suggestions?
Regards, Bharata.

On Fri, Aug 19, 2011 at 12:55 PM, Osier Yang <jyang@redhat.com> wrote:
On 08/19/2011 14:35, Bharata B Rao wrote:
How about something like this? (OPTION 1)
<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>
Libvirt already supports NUMA settings (both CPU and memory) on the host, but yes, there is nothing for NUMA settings inside the guest yet.
We talked once about the XML when adding support for NUMA memory settings on the host, and finally chose to introduce a new XML node for it, keeping in mind that support for NUMA settings inside the guest might be added one day. The XML is:
<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
</numatune>
But this only specifies the host NUMA policy that should be used for guest VM processes.
So, personally, I think the new XML should go inside <numatune> as a child node.
And we could specify multiple such lines, one for each node.
The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This would give us 3 places for NUMA: one is <numatune>,
As I observed above, this controls the NUMA policy of the guest VM threads on the host.
another is <vcpu>,
vcpu/cpuset specifies how vcpu threads should be pinned on the host.
and this one.
I think what we are addressing here is a bit different from the above two. Here we are actually trying to _define_ the NUMA topology of the guest, while via the other capabilities (numatune, vcpu) we only control the cpu and memory bindings of vcpu threads on the host. Hence I am not sure if <numatune> is the right place for defining the guest NUMA topology, which btw should be independent of the host topology.

Thanks for your response.

Regards, Bharata.
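To make this distinction concrete, a side-by-side sketch (not from the original mails; the cpuset value is an assumed example) contrasting the existing host-side controls with the proposed guest-side definition:

<!-- host-side tuning (existing): where the guest's vcpus and memory land on the host -->
<vcpu cpuset='0-3'>4</vcpu>
<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
</numatune>

<!-- guest-side definition (proposed OPTION 1): the NUMA topology the guest OS sees -->
<cpu>
  ...
  <numa nodeid='0' cpus='0-1' mem='size'>
  <numa nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>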

On 08/19/2011 16:09, Bharata B Rao wrote:
On Fri, Aug 19, 2011 at 12:55 PM, Osier Yang <jyang@redhat.com> wrote:
On 08/19/2011 14:35, Bharata B Rao wrote:
How about something like this? (OPTION 1)
<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>
Libvirt already supports NUMA settings (both CPU and memory) on the host, but yes, there is nothing for NUMA settings inside the guest yet.
We talked once about the XML when adding support for NUMA memory settings on the host, and finally chose to introduce a new XML node for it, keeping in mind that support for NUMA settings inside the guest might be added one day. The XML is:
<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
</numatune>
But this only specifies the host NUMA policy that should be used for guest VM processes.
Yes.
So, personally, I think the new XML should go inside <numatune> as a child node.
And we could specify multiple such lines, one for each node.
The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This would give us 3 places for NUMA: one is <numatune>,
As I observed above, this controls the NUMA policy of the guest VM threads on the host.
Yes, I know what you mean.
another is <vcpu>,
vcpu/cpuset specifies how vcpu threads should be pinned on the host.
and this one.
I think what we are addressing here is a bit different from the above two. Here we are actually trying to _define_ the NUMA topology of the guest, while via the other capabilities (numatune, vcpu) we only control the cpu and memory bindings of vcpu threads on the host.
Hence I am not sure if <numatune> is the right place for defining the guest NUMA topology, which btw should be independent of the host topology.
Maybe something like:
<numatune>
  <guest>
    ......
  </guest>
</numatune>
Thanks for your response.
Regards, Bharata.

On 08/19/2011 01:35 AM, Bharata B Rao wrote:
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This should result in a 2 node system with each node having 1 socket with 2 cores.
Comments, suggestions?
Option 2 (above) seems like the most logical interface to me. I would not support putting this under <numatune> because of the high risk of users confusing guest NUMA topology definition with host NUMA tuning.

I like the idea of merging this into <topology> to prevent errors with specifying incompatible cpu and numa topologies, but I think you can go a step further (assuming my following assertion is valid). Since cpus are assigned to numa nodes at the core level, and you are providing a 'nodeid' attribute, you can infer the 'cpus' attribute using 'cores' and 'nodeid' alone. For your example above:

<topology sockets='1' cores='2' threads='1' nodeid='0' mem='size'>
<topology sockets='1' cores='2' threads='1' nodeid='1' mem='size'>

You have 4 cores total, each node is assigned 2. Assign cores to nodes starting with core 0 and node 0.

--
Adam Litke
IBM Linux Technology Center

On Fri, Aug 19, 2011 at 7:10 PM, Adam Litke <agl@us.ibm.com> wrote:
On 08/19/2011 01:35 AM, Bharata B Rao wrote:
... <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'> <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'> ...
I like the idea of merging this into <topology> to prevent errors with specifying incompatible cpu and numa topologies but I think you can go a step further (assuming my following assertion is valid). Since cpus are assigned to numa nodes at the core level, and you are providing a 'nodeid' attribute, you can infer the 'cpus' attribute using 'cores' and 'nodeid' alone.
For your example above: <topology sockets='1' cores='2' threads='1' nodeid='0' mem='size'> <topology sockets='1' cores='2' threads='1' nodeid='1' mem='size'>
You have 4 cores total, each node is assigned 2. Assign cores to nodes starting with core 0 and node 0.
Sounds good. Unless anyone or any architecture has specific requirements for enumerating CPUs differently across nodes, 'cpus=' is redundant as you observe.

Regards, Bharata.
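To make the inference concrete, a sketch (not from the original mails) of the OPTION 2 example with 'cpus=' dropped; the comments show the values that would be inferred by assigning cores to nodes in order, starting with core 0 and node 0:

<topology sockets='1' cores='2' threads='1' nodeid='0' mem='size'>  <!-- inferred cpus='0-1' -->
<topology sockets='1' cores='2' threads='1' nodeid='1' mem='size'>  <!-- inferred cpus='2-3' -->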

On Fri, Aug 19, 2011 at 12:05:43PM +0530, Bharata B Rao wrote:
Hi,
qemu supports specification of NUMA topology on the command line using the -numa option:
-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
I see that there is no way to specify such a NUMA topology in libvirt XML. Are there plans to add support for NUMA topology specification? Is anybody already working on this? If not, I would like to add this support to libvirt.
Currently the topology specification available in libvirt (<topology sockets='1' cores='2' threads='1'/>) translates to the "-smp sockets=1,cores=2,threads=1" option of qemu. There is no equivalent in libvirt that could generate the -numa command-line option of qemu.
How about something like this? (OPTION 1)
<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>
And we could specify multiple such lines, one for each node.
I'm not sure it really makes sense having the NUMA memory config inside the <cpu> configuration, but I like the simplicity of this specification.
The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?
No matter what we do, libvirt is going to have to do some kind of semantic validation on the different info.
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This should result in a 2 node system with each node having 1 socket with 2 cores.
This has the problem of redundancy between the specification of the sockets, cores & threads and the new 'cpus' attribute. e.g. you can specify weird configs like:

<topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
<topology sockets='2' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'>

Or even bogus configs:

<topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
<topology sockets='4' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'>

That all said, given our current XML schema, it is inevitable that we will have some level of duplication of information.

Some things that are important to consider are how this interacts with possible CPU / memory hotplug in the future, and how we will be able to pin guest NUMA nodes to host NUMA nodes. For the first point, it might be desirable to create a NUMA topology which supports up to 8 logical CPUs, but only have 2 physical sockets actually plugged in at boot time.

Also, I dread to question whether we want to be able to represent a multi-level NUMA topology, or just assume one level. If we want to be able to cope with a multi-level topology, can we assume the levels are solely grouping at the socket, or will we have to consider the possibility of NUMA *inside* a socket? In other words, are we associating socket numbers with NUMA nodes, or are we associating logical CPU numbers with NUMA nodes? This is the difference between configuring something like:

<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node sockets='0-1' mem='0-1024'/>
  <node sockets='2-3' mem='1024-2048'/>
</numa>

vs

<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node cpus='0-7' mem='0-1024'/>
  <node cpus='8-15' mem='1024-2048'/>
</numa>

vs

<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node mems='0-1024'>
    <node cpus='0-3'/>
    <node cpus='4-7'/>
  </node>
  <node mems='1024-2048'>
    <node cpus='8-11'/>
    <node cpus='12-15'/>
  </node>
</numa>

vs ...more horrible examples...

NB, QEMU's -numa argument may well not support some of the things I am talking about here, but we need to consider the real possibility that QEMU's -numa arg will be extended, or replaced, in the future.

Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Tue, Aug 23, 2011 at 7:43 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
On Fri, Aug 19, 2011 at 12:05:43PM +0530, Bharata B Rao wrote:
Hi,
qemu supports specification of NUMA topology on the command line using the -numa option:
-numa node[,mem=size][,cpus=cpu[-cpu]][,nodeid=node]
I see that there is no way to specify such a NUMA topology in libvirt XML. Are there plans to add support for NUMA topology specification? Is anybody already working on this? If not, I would like to add this support to libvirt.
Currently the topology specification available in libvirt (<topology sockets='1' cores='2' threads='1'/>) translates to the "-smp sockets=1,cores=2,threads=1" option of qemu. There is no equivalent in libvirt that could generate the -numa command-line option of qemu.
How about something like this? (OPTION 1)
<cpu>
  ...
  <numa nodeid='node' cpus='cpu[-cpu]' mem='size'>
  ...
</cpu>
And we could specify multiple such lines, one for each node.
I'm not sure it really makes sense having the NUMA memory config inside the <cpu> configuration, but I like the simplicity of this specification.
Yes, regarding the memory specification inside <cpu>: maybe we could define a separate <numa> section, as shown in your examples below, and put it outside of <cpu>.
The -numa and -smp options in qemu do not work all that well together, since they are parsed independently of each other, and one could specify a cpu set with -numa that is incompatible with the sockets, cores and threads specified with -smp. This should be fixed in qemu, but given that such a problem has been observed, should libvirt tie the specification of NUMA and SMP (sockets, threads, cores) together so that one is forced to specify only valid combinations of nodes and cpus in libvirt?
No matter what we do, libvirt is going to have to do some kind of semantic validation on the different info.
Right. Given that we have <vcpus> as well as <vcpu current>, libvirt needs to ensure that the specified topology is sane.
Maybe something like this: (OPTION 2)
<cpu>
  ...
  <topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
  <topology sockets='1' cores='2' threads='1' nodeid='1' cpus='2-3' mem='size'>
  ...
</cpu>
This should result in a 2 node system with each node having 1 socket with 2 cores.
This has the problem of redundancy between the specification of the sockets, cores & threads and the new 'cpus' attribute. e.g. you can specify weird configs like:
Yes, sockets, cores and threads become redundant. One option is to define sockets, cores and threads once (like we currently do inside <cpu>) and have it as a common definition for all the numa nodes defined. Something like this:

<cpu>
  <topology sockets='1' cores='2' threads='1'>
  <numa node cpus='0-1' mems='1024'>
  <numa node cpus='2-3' mems='1024'>
</cpu>

This will result in a 2-node system with each node having 1 socket with 2 cores. But as you can see this will be restrictive, since you can't specify different topologies for different nodes. Are there such non-symmetric systems out there, and should libvirt be flexible enough to support such NUMA topologies for VMs?

Also, it looks like nodeid (from OPTION 2 of my original mail) is redundant; maybe we should assign increasing node ids based on the order in which the numa topology statements appear. In the above example, we can implicitly assign node ids 0 and 1 for the two nodes.

Adam Litke suggested that we can omit cpus= from the specification since it can be derived, but given that there are topologies that don't enumerate the CPUs within a socket serially, it becomes necessary to have an explicit cpus= specification.
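If this form were mapped onto the existing qemu options, the example above might translate to something like the following (a sketch with assumptions: sockets='1' is read as per-node, and mems is in megabytes):

-smp 4,sockets=2,cores=2,threads=1
-numa node,nodeid=0,cpus=0-1,mem=1024
-numa node,nodeid=1,cpus=2-3,mem=1024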
<topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
<topology sockets='2' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'>
Or even bogus configs:
<topology sockets='1' cores='2' threads='1' nodeid='0' cpus='0-1' mem='size'>
<topology sockets='4' cores='1' threads='1' nodeid='1' cpus='2-3' mem='size'>
That all said, given our current XML schema, it is inevitable that we will have some level of duplication of information.
Some things that are important to consider are how this interacts with possible CPU / memory hotplug in the future,
So what are the issues we need to take care of here?
and how we will be able to pin guest NUMA nodes to host NUMA nodes.
This would be a good thing to do in libvirt. I think libvirt should intelligently place VMs on host nodes based on the guest topology. But I don't clearly see what issues we need to take care of now while we come up with a NUMA topology definition for the VM.
For the first point, it might be desirable to create a NUMA topology which supports up to 8 logical CPUs, but only have 2 physical sockets actually plugged in at boot time.
Also, I dread to question whether we want to be able to represent a multi-level NUMA topology, or just assume one level. If we want to be able to cope with multi-level topology, can we assume the levels are solely grouping at the socket, or will we have to consider the possibility of NUMA *inside* a socket.
Given that such (NUMA inside a socket) topologies exist in the real world, maybe libvirt should support them. But I guess this will make the libvirt specification complex.
In other words, are we associating socket numbers with NUMA nodes, or are we associating logical CPU numbers with NUMA nodes.
This is the difference between configuring something like:
<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node sockets='0-1' mem='0-1024'/>
  <node sockets='2-3' mem='1024-2048'/>
</numa>
vs
<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node cpus='0-7' mem='0-1024'/>
  <node cpus='8-15' mem='1024-2048'/>
</numa>
What is the difference between the above two? In the first case, you put 2 sockets in one node and 2 sockets in the 2nd node. Since each socket has 4 cores, you ended up having 8 cores (or CPUs) in each node. In the 2nd case, you specified 8 CPUs per node explicitly, which obviously means that each node should have 2 sockets. Did I miss your point?
vs
<vcpus>16</vcpus>
<cpu>
  <topology sockets='4' cores='4' threads='1'>
</cpu>
<numa>
  <node mems='0-1024'>
    <node cpus='0-3'/>
    <node cpus='4-7'/>
  </node>
  <node mems='1024-2048'>
    <node cpus='8-11'/>
    <node cpus='12-15'/>
  </node>
</numa>
vs
...more horrible examples...
I don't really have the right answer to the multi-level NUMA specification; I need to think a bit.

Regards, Bharata.

Hi,

Here is another attempt at a guest NUMA topology XML specification that should work for different NUMA topologies.

We already specify the number of sockets, cores and threads a system has by using:

<cpu>
  <topology sockets='2' cores='2' threads='2'>
</cpu>

For NUMA, we can add the following:

<numa>
  <node cpus='0-3' mems='1024'>
  <node cpus='4-7' mems='1024'>
</numa>

Specifying only cpus in the NUMA node specification should be enough to represent most of the topologies. Based on the number of cpus specified in each node, we should be able to work out how many cores and sockets will be part of each node. The only other thing needed is an explicit memory specification.

I have taken a few example NUMA topologies here and shown how the above specification can help.

Magny-Cours
-----------
Topology desc: http://code.google.com/p/likwid-topology/wiki/AMD_MagnyCours8

<cpu>
  <topology sockets='4' cores='4' threads='1'>
  <numa>
    <node cpus='0-3' mems='1024'>
    <node cpus='4-7' mems='1024'>
    <node cpus='8-11' mems='1024'>
    <node cpus='12-15' mems='1024'>
  </numa>
</cpu>

OR, if we want to stick to how CPUs get enumerated in real hardware, we can specify it like this:

<cpu>
  <topology sockets='4' cores='4' threads='1'>
  <numa>
    <node cpus='0,2,4,6' mems='1024'>
    <node cpus='8,10,12,14' mems='1024'>
    <node cpus='1,3,5,7' mems='1024'>
    <node cpus='9,11,13,15' mems='1024'>
  </numa>
</cpu>

The above two specifications for Magny-Cours aren't perfect, because we conveniently converted the multi-level NUMA into single-level NUMA. The system has 2 sockets, each containing 2 NUMA domains of 4 cores each, but we aren't really reflecting this in the topology specification. But does this really matter? We are still showing 4 distinct NUMA domains.

Nehalem
-------
Topology desc: http://code.google.com/p/likwid-topology/wiki/Intel_Nehalem

<cpu>
  <topology sockets='2' cores='4' threads='2'>
  <numa>
    <node cpus='0-7' mems='1024'>
    <node cpus='8-15' mems='1024'>
  </numa>
</cpu>

OR, if we want to stick to how CPUs get enumerated in real hardware, we can specify it like this:

<cpu>
  <topology sockets='2' cores='4' threads='2'>
  <numa>
    <node cpus='0-3,8-11' mems='1024'>
    <node cpus='4-7,12-15' mems='1024'>
  </numa>
</cpu>

However there is a problem here. The specification isn't granular enough to specify which CPU is part of which core. As you can see in the topology diagram, CPUs 0,8 belong to one core, CPUs 1,9 belong to one core, etc. So the whole point of specifying all the CPUs explicitly in the specification gets defeated.

Dunnington
----------
Topology desc: 2 nodes, 4 sockets in each node, 6 cores in each socket.

<cpu>
  <topology sockets='8' cores='6' threads='1'>
  <numa>
    <node cpus='0-23' mems='1024'>
    <node cpus='24-47' mems='1024'>
  </numa>
</cpu>

Here also there is the same problem. CPUs 0,4,8,12,16,20 belong to one socket, but the specification doesn't allow for that.

So here are some questions that we need to answer:
- Can we just go with a flat NUMA specification and convert multi-level NUMA into flat NUMA wherever possible (like in the Magny-Cours example above)?
- Are there topologies where this doesn't work?
- Isn't it enough to enumerate CPUs serially among cores and sockets, and not enumerate them exactly as in real hardware?

Regards, Bharata.
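For reference, a sketch (not part of the original mail) of the qemu command line the first Nehalem specification above might generate, assuming mems maps directly to qemu's mem= suboption in megabytes:

-smp 16,sockets=2,cores=4,threads=2
-numa node,nodeid=0,cpus=0-7,mem=1024
-numa node,nodeid=1,cpus=8-15,mem=1024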

On Tue, Aug 30, 2011 at 10:53 AM, Bharata B Rao <bharata.rao@gmail.com> wrote:
Hi,
Here is another attempt at guest NUMA topology XML specification that should work for different NUMA topologies.
Hi Daniel,

Do you think I should go ahead and implement this? Any comments or concerns?

Regards, Bharata.
participants (4)
- Adam Litke
- Bharata B Rao
- Daniel P. Berrange
- Osier Yang