[Libvir] Extending libvirt to probe NUMA topology

Hello all,

I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well as adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.

I'd like to suggest the following for discussion:

(1) A function to discover topology
(2) A function to check available memory
(3) Specifying which cpus to use prior to domain start

Thoughts?

1. http://lists.xensource.com/archives/html/xen-devel/2007-06/msg00298.html
2. http://lists.xensource.com/archives/html/xen-devel/2007-06/msg00299.html

-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
ryanh@us.ibm.com
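
For reference, libvirt can already report a flat summary of the host (number of NUMA cells, sockets, cores, threads, total memory) through the existing virNodeGetInfo() call; the three functions proposed above would add the per-cell detail this summary lacks. A minimal sketch, error handling trimmed, using the standard virConnectOpenReadOnly()/virNodeGetInfo() calls:

---------------------------------
/* Print the flat host summary libvirt already exposes today. */
#include <stdio.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly(NULL);
    virNodeInfo info;

    if (conn == NULL)
        return 1;
    if (virNodeGetInfo(conn, &info) < 0) {
        virConnectClose(conn);
        return 1;
    }

    printf("%u NUMA cell(s), %u cpus (%u sockets x %u cores x %u threads), %lu KiB RAM\n",
           info.nodes, info.cpus, info.sockets, info.cores, info.threads,
           info.memory);

    virConnectClose(conn);
    return 0;
}
---------------------------------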

On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
Hello Ryan,
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Okay, but let's start by defining the scope a bit. Historically NUMA implementations have explored various paths, and I assume we are going to work within a rather small subset of what NUMA (Non-Uniform Memory Access) has meant over time. I assume the following, tell me if I'm wrong:

- we are just considering memory and processor affinity
- the topology, i.e. the affinity between the processors and the various memory areas, is fixed and the kind of mapping is rather simple

To get into more specifics:

- we will need to expand the model of libvirt http://libvirt.org/intro.html to split the Node resources into separate sets containing processors and memory areas which are highly connected together (assuming the model is a simple partition of the resources between the equivalent of sub-Nodes)
- the function (2) would, for a given processor, tell how much of its memory is already allocated (to existing running or paused domains)

Right? Is the partition model sufficient for the architectures? If yes then we will need a new definition and terminology for those sub-Nodes.

For 3 we already have support for pinning the domain virtual CPUs to physical CPUs, but I guess it's not sufficient because you want this to be activated from the definition of the domain:
http://libvirt.org/html/libvirt-libvirt.html#virDomainPinVcpu

So the XML format would have to be extended to allow specifying the subset of processors the domain is supposed to start on:
http://libvirt.org/format.html

I would assume that if nothing is specified, the underlying Hypervisor (in libvirt terminology; that could be a Linux kernel in practice) will by default try to do the optimal placement by itself, i.e. (3) is only useful if you want to override the default behaviour.

Please correct me if I'm wrong,

Daniel

-- 
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
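
As a concrete reminder of what the virDomainPinVcpu() call referenced above already provides at runtime, a small sketch pinning virtual CPU 0 of an existing domain onto physical CPUs 2 and 3; the helper name is illustrative, and the cpumap is the little-endian bitmap (one bit per physical CPU) the API expects. The open question in this thread is how to express the same constraint before the domain starts.

---------------------------------
#include <libvirt/libvirt.h>

/* Pin VCPU 0 of 'dom' onto physical CPUs 2 and 3 (a runtime tunable). */
int pin_vcpu0_to_cpus_2_and_3(virDomainPtr dom)
{
    unsigned char cpumap[1] = { 0 };   /* one byte covers physical CPUs 0-7 */

    cpumap[0] |= 1 << 2;               /* physical CPU 2 */
    cpumap[0] |= 1 << 3;               /* physical CPU 3 */

    return virDomainPinVcpu(dom, 0, cpumap, sizeof(cpumap));
}
---------------------------------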

On Wed, Jun 13, 2007 at 01:48:21PM -0400, Daniel Veillard wrote:
On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
Hello Ryan,
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Okay, but let's start by defining the scope a bit. Historically NUMA have explored various paths, and I assume we are gonna work in a rather small subset of what NUMA (Non Uniform Memory Access) have meant over time.
I assume the following, tell me if I'm wrong: - we are just considering memory and processor affinity - the topology, i.e. the affinity between the processors and the various memory areas is fixed and the kind of mapping is rather simple
to get into more specifics:
- we will need to expand the model of libvirt http://libvirt.org/intro.html to split the Node resources into separate sets containing processors and memory areas which are highly connected together (assuming the model is a simple partition of the resources between the equivalent of sub-Nodes)
- the function (2) would for a given processor tell how much of its memory is already allocated (to existing running or paused domains)
Right ? Is the partition model sufficient for the architectures ? If yes then we will need a new definition and terminology for those sub-Nodes.
We have 3 core models we should refer to when deciding how to present things:

- Linux/Solaris Xen: hypercalls
- Linux non-Xen: libnuma
- Solaris non-Xen: liblgrp

The Xen & Linux modelling seems reasonably similar IIRC, but Solaris takes a slightly different representational approach.
For 3 we already have support for pinning the domain virtual CPUs to physical CPUs but I guess it's not sufficient because you want this to be activated from the definition of the domain:
http://libvirt.org/html/libvirt-libvirt.html#virDomainPinVcpu
So the XML format would have to be extended to allow specifying the subset of processors the domain is supposed to start on:
Yeah, I've previously argued against including VCPU pinning information in the XML since it's a tunable, not a hardware description. Reluctantly, though, we'll have to add this VCPU info, since it's an absolute requirement that this info be provided at domain creation time for NUMA support.
http://libvirt.org/format.html
I would assume that if nothing is specified, the underlying Hypervisor (in libvirt terminology, that could be a linux kernel in practice) will by default try to do the optimal placement by itself, i.e. (3) is only useful if you want to override the default behaviour.
Yes, that is correct. We should not change the default - let the OS apply whatever policy it sees fit by default, since over time OSes are tending towards being able to automagically determine & apply NUMA policy.

Dan

-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
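
For the "Linux non-Xen: libnuma" case Dan mentions, a rough sketch of what the probe might look like, assuming the libnuma 1.x interface that was current at the time (numa_available, numa_max_node, numa_node_size64, numa_node_to_cpus; link with -lnuma - signatures changed in libnuma 2.x):

---------------------------------
/* Print size, free memory and the first CPU-mask word of each NUMA node. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    int node, max;

    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this kernel\n");
        return 1;
    }

    max = numa_max_node();
    for (node = 0; node <= max; node++) {
        long long freemem = 0;
        long long size = numa_node_size64(node, &freemem);
        unsigned long cpumask[4] = { 0 };   /* room for at least 128 CPUs */

        numa_node_to_cpus(node, cpumask, sizeof(cpumask));
        printf("node %d: %lld bytes (%lld free), cpu mask word0 0x%lx\n",
               node, size, freemem, cpumask[0]);
    }
    return 0;
}
---------------------------------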

Daniel P. Berrange wrote:
- Linux/Solaris Xen: hypercalls
- Linux non-Xen: libnuma
- Solaris non-Xen: liblgrp
The Xen & Linux modelling seems reasonably similar IIRC, but Solaris is a slightly different representational approach.
The Solaris approach seems to be fully hierarchical as far as I can work out. That seems to argue for extending the capabilities XML to describe nodes, in as much as the XML can start off as a flat list of NUMA nodes (for IBM) but later be made hierarchical if necessary.
Yes, that is correct. We should not change the default - let the OS apply whatever policy it sees fit by default, since over time OSes are tending towards being able to automagically determine & apply NUMA policy.
Or add another domain creation call which allows passing an additional set of hints. One of the hints would be initial cpu pinning.

Rich.

-- 
Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom.
Registered in England and Wales under Company Registration No. 03798903
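
Purely as an illustration of the "creation call with hints" idea: none of the names below exist in libvirt, they are hypothetical and only sketch what passing an initial CPU placement hint at creation time could look like alongside the existing virDomainCreateLinux() entry point.

---------------------------------
#include <libvirt/libvirt.h>

/* Hypothetical hints structure - illustrative only, not a libvirt API. */
typedef struct {
    const char *initial_cpulist;   /* e.g. "2,3": physical CPUs to start on */
    int         preferred_cell;    /* NUMA cell to allocate memory from, -1 = any */
} exampleDomainCreateHints;

/* Hypothetical variant of virDomainCreateLinux() taking extra hints. */
virDomainPtr exampleDomainCreateWithHints(virConnectPtr conn,
                                          const char *xmlDesc,
                                          const exampleDomainCreateHints *hints,
                                          unsigned int flags);
---------------------------------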

On Thu, Jun 14, 2007 at 02:08:12PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
- Linux/Solaris Xen: hypercalls
- Linux non-Xen: libnuma
- Solaris non-Xen: liblgrp
The Xen & Linux modelling seems reasonably similar IIRC, but Solaris is a slightly different representational approach.
The Solaris approach seems to be fully hierarchical as far as I can work out.
Yup, that's what I infer from lgrp_root, lgrp_children and lgrp_parents - well, except that in a tree you would only ever have one parent, which is the part I'm not sure I really understand: http://docs.sun.com/app/docs/doc/816-5172/6mbb7bu79?a=view
That seems to argue for extending the capabilities XML to describe nodes, in as much as the XML can start off as a flat list of NUMA nodes (for IBM) but later be made hierarchical if necessary.
Agreed, it may be a bit painful in a sense to have to parse XML provided back from libvirt, but 1/ you should only need to do that once per Node, and 2/ it's not the only place in libvirt where XML has to be parsed. Probably needs a bit more thinking though,

Daniel
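
For the Solaris side of this, a rough sketch of walking the locality-group hierarchy with liblgrp (link with -llgrp): lgrp_root and lgrp_children come from the discussion above, while lgrp_init, lgrp_mem_size and lgrp_fini are assumed from the lgrp_init(3LGRP) family of man pages and should be double-checked on a real Solaris box.

---------------------------------
#include <stdio.h>
#include <sys/lgrp_user.h>

/* Recursively print each lgroup and the memory directly attached to it. */
static void walk(lgrp_cookie_t cookie, lgrp_id_t lgrp, int depth)
{
    lgrp_id_t children[64];
    int i, n;
    lgrp_mem_size_t freemem =
        lgrp_mem_size(cookie, lgrp, LGRP_MEM_SZ_FREE, LGRP_CONTENT_DIRECT);

    printf("%*slgroup %d: %lld bytes free (direct)\n",
           depth * 2, "", (int) lgrp, (long long) freemem);

    n = lgrp_children(cookie, lgrp, children, 64);
    for (i = 0; i < n; i++)
        walk(cookie, children[i], depth + 1);
}

int main(void)
{
    lgrp_cookie_t cookie = lgrp_init(LGRP_VIEW_OS);

    if (cookie == LGRP_COOKIE_NONE)
        return 1;
    walk(cookie, lgrp_root(cookie), 0);
    lgrp_fini(cookie);
    return 0;
}
---------------------------------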

* Daniel Veillard <veillard@redhat.com> [2007-06-13 12:52]:
On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
Hello Ryan,
Hey, thanks for the swift reply.
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Okay, but let's start by defining the scope a bit. Historically NUMA have explored various paths, and I assume we are gonna work in a rather small subset of what NUMA (Non Uniform Memory Access) have meant over time.
I assume the following, tell me if I'm wrong: - we are just considering memory and processor affinity - the topology, i.e. the affinity between the processors and the various memory areas is fixed and the kind of mapping is rather simple
Correct. Currently we are not processing the SLIT tables, which provide cost values between cpus and memory.
to get into more specifics: - we will need to expand the model of libvirt http://libvirt.org/intro.html to split the Node resources into separate sets containing processors and memory areas which are highly connected together (assuming the model is a simple partition of the resources between the equivalent of sub-Nodes)
Yeah, the topology of the physical machine is split up into NUMA nodes. Each NUMA node will have a set of cpus and physical memory. This topology is static across reboots, until the admin reconfigures the hardware.
- the function (2) would for a given processor tell how much of its memory is already allocated (to existing running or paused domains)
Memory is tracked by how much is free in a given NUMA node. We could implement the function in terms of a cpu, but we would be probing on a per-NUMA-node basis and then mapping the cpu to the NUMA node it belongs to. We should be able to answer the reverse (what is in use) by examining the domain config.
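
On the Linux/QEMU side that per-node free figure is already exported by the kernel; a rough sketch of reading it directly, assuming the /sys/devices/system/node/nodeN/meminfo layout of recent 2.6 kernels (the helper name is illustrative):

---------------------------------
#include <stdio.h>
#include <string.h>

/* Return the free memory of one NUMA node in kB, or -1 on error. */
long node_free_kb(int node)
{
    char path[128], line[256];
    long kb = -1;
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/meminfo", node);
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof(line), fp) != NULL) {
        /* lines look like "Node 0 MemFree:   123456 kB" */
        char *p = strstr(line, "MemFree:");
        if (p != NULL) {
            sscanf(p, "MemFree: %ld", &kb);
            break;
        }
    }
    fclose(fp);
    return kb;
}
---------------------------------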
Right ? Is the partition model sufficient for the architectures ? If yes then we will need a new definition and terminology for those sub-Nodes.
I believe so.
For 3 we already have support for pinning the domain virtual CPUs to physical CPUs but I guess it's not sufficient because you want this to be activated from the definition of the domain:
Correct. Knowing the cpus that are being used allows for NUMA-node-local memory allocation. For Xen specifically, pinning after memory has been allocated (which happens during domain creation) is not sufficient to ensure that the memory selected will be local to the processors backing the guest virtual CPUs.
http://libvirt.org/html/libvirt-libvirt.html#virDomainPinVcpu
So the XML format would have to be extended to allow specifying the subset of processors the domain is supposed to start on:
http://libvirt.org/format.html
I would assume that if nothing is specified, the underlying Hypervisor (in libvirt terminology, that could be a linux kernel in practice) will by default try to do the optimal placement by itself, i.e. (3) is only useful if you want to override the default behaviour.
(3) is required if one wants to ensure that the resources allocated to the guest are local. It is possible that the hypervisor allocates local resources anyway, but without specifying them there is no guarantee.

Ryan

On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
Let's restart that discussion, I would really like to see this implemented within the next month.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Thoughts?
Okay, following the discussions back in June and what seems available as APIs on various setups, I would like to suggest the following:

1) Provide a function describing the topology as an XML instance:

  char * virNodeGetTopology(virConnectPtr conn);

which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it looked to me like the two might be used at different times and separating them might be a bit cleaner; I could be convinced otherwise, though, and this doesn't change the content much in any way. I think the most important part of the call is to get the topology information, as the number of processors, the memory and the number of NUMA cells are already available from virNodeGetInfo(). I suggest a format exposing the hierarchy in the XML structure, which will allow for more complex topologies, for example on Sun hardware:

---------------------------------
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------

A few things to note:

- the <cells> element lists the top sibling cells

- the <cell> element describes as children the resources available, like the list of CPUs and the size of the local memory; that could be extended by disk descriptions too, <disk dev='/dev/sdb'/>, and possibly other special devices (no idea what ATM).

- in case of a deeper hierarchical topology one may need to be able to name sub-cells, and the format could be extended for example as

  <cells num='2'>
    <cells num='2'>
      <cell id='1'> ... </cell>
      <cell id='2'> ... </cell>
    </cells>
    <cells num='2'>
      <cell id='3'> ... </cell>
      <cell id='4'> ... </cell>
    </cells>
  </cells>

  But that can be discussed/changed when the need arises :-)

- topology may later be extended with other child elements, for example to expand the description with memory access costs from cell to cell. I don't know what's the best way; mapping an array in XML is usually not very nice.

- the memory size is indicated by an attribute (instead of as the content, as we use on domain dumps); to preserve extensibility we may need to express more structure there (memory banks for example). We could also add a free='xxxxx' attribute indicating the amount available there, but as you suggested it's probably better to provide a separate call for this.

I would expect that function to be available even for ReadOnly connections since it's descriptive only, which means it would need to be added to the set of proxy-supported calls. The call will of course be added to the driver block. Implementation on recent Xen could use the hypercall. For KVM I'm wondering a bit, I don't have a NUMA box around (but can probably find one); I assume that we could either use libnuma if found at compile time or get the information from /proc. On Solaris there is a specific library as Dan exposed in the thread. I think coming first with Xen-only support would be fine, other hypervisors or platforms can be added later.

2) Function to get the free memory of a given cell:

  unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);

That's relatively simple and would match the request from the initial mail, but I'm wondering a bit: if the program tries to do a best placement it will usually run that request for a number of cells, no? Maybe a call returning the memory amounts for a range of cells would be more appropriate.

3) Adding Cell/CPU placement information to a domain description

That's where I think things start to get a bit messy. It's not that adding <cell>1</cell> or

  <cpus>
    <pin vcpu='0' cpulist='2,3'/>
    <pin vcpu='1' cpulist='3'/>
  </cpus>

alongside <vcpu>2</vcpu> would be hard, it's rather what to do if the request can't be satisfied. Basically I still think that the hypervisor is in a better position to do the placement, and putting the requirement here breaks:

- the virtualization: the more you rely on physical hardware properties, the more you lose the benefits of virtualizing

- if CPUs 2 and 3 are not available/full, or if the topology changed since the domain was saved, the domain may just not be able to run, or may run worse than if nothing had been specified.

CPU pinning at runtime means a dynamic change, it's adaptability and makes a lot of sense. But saving those dynamic instant values in the process description sounds a bit wrong to me, because the context which led to them may have changed since (or may just not make sense anymore, like after a migration or hardware change). Anyway, I guess that's needed; I would tend to go the simplest way and just allow specifying the vcpu pinning in a very explicit way, hence mapping directly to the kind of capabilities already available in virDomainPinVcpu(), with a cpumap syntax similar to that used in the virsh vcpupin command (i.e. a comma-separated list of CPU numbers):

  <cpus>
    <pin vcpu='0' cpulist='2,3'/>
    <pin vcpu='1' cpulist='3'/>
  </cpus>

If everyone agrees with those suggestions, then I guess we can try to get a first Xen-3.1 based implementation.

Daniel
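
A usage sketch for the two calls proposed above; neither virNodeGetTopology() nor virNodeGetCellFreeMemory() exists in libvirt at this point - the prototypes are taken from the proposal and declared locally - and the helper only illustrates how a management tool might pick the cell with the most free memory before creating a guest:

---------------------------------
#include <libvirt/libvirt.h>

/* Prototypes as proposed in this thread (not yet in libvirt.h). */
char *virNodeGetTopology(virConnectPtr conn);
unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);

/* Return the cell with the most free memory, or -1 on error. */
int pick_best_cell(virConnectPtr conn)
{
    virNodeInfo info;
    int cell, best = -1;
    unsigned long best_free = 0;

    if (virNodeGetInfo(conn, &info) < 0)
        return -1;

    /* virNodeGetInfo() already reports the number of NUMA cells. */
    for (cell = 0; cell < (int) info.nodes; cell++) {
        unsigned long avail = virNodeGetCellFreeMemory(conn, cell);
        if (avail > best_free) {
            best_free = avail;
            best = cell;
        }
    }
    return best;   /* the cell whose CPUs would go into the domain's cpulist */
}
---------------------------------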

Daniel Veillard wrote:
1) Provide a function describing the topology as an XML instance:
char * virNodeGetTopology(virConnectPtr conn);
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise.
I'd definitely prefer to extend virConnectGetCapabilities XML. It avoids changing the remote driver and language bindings, and really callers only need to pull capabilities once per connection.
---------------------------------
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------
A few things to note: - the <cells> element list the top sibling cells
Not <nodes>?
- the <cell> element describes as child the resources available like the list of CPUs, the size of the local memory, that could be extended by disk descriptions too <disk dev='/dev/sdb'/> and possibly other special devices (no idea what ATM).
- in case of deeper hierarchical topology one may need to be able to name sub-cells and the format could be extended for example as <cells num='2'> <cells num='2'> <cell id='1'> ... </cell> <cell id='2'> ... </cell> </cells> <cells num='2'> <cell id='3'> ... </cell> <cell id='4'> ... </cell> </cells> </cells> But that can be discussed/changed when the need arise :-)
Especially note that 4 (or more) socket AMDs have a topology like this, with two different penalties for reaching nodes which are one and two hops away. Do we have a way to describe the penalties along different paths?
2) Function to get the free memory of a given cell:
unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);
that's relatively simple, would match the request from the initial mail but I'm wondering a bit. If the program tries to do a best placement it will usually run that request for a number of cells no ? Maybe a call returning the memory amounts for a range of cells would be more appropriate.
Yes, I guess they'd want to get the free memory for all nodes. But IBM will have a better idea about this.

Rich.

On Thu, Sep 06, 2007 at 03:40:23PM +0100, Richard W.M. Jones wrote:
Daniel Veillard wrote:
1) Provide a function describing the topology as an XML instance:
char * virNodeGetTopology(virConnectPtr conn);
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise.
I'd definitely prefer to extend virConnectGetCapabilities XML. It avoids changing the remote driver and language bindings, and really callers only need to pull capabilities once per connection.
Yeah, I understand that concern - it simplifies a lot of stuff inside - but the goal at the library level is to simplify the user code even if that means a more complex implementation. However, if people think they don't need a separate call then I'm really fine with this.
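
If the topology does end up inside the capabilities XML, the client side stays simple: fetch it once per connection and pull what is needed out with XPath. A sketch using the existing virConnectGetCapabilities() call plus libxml2, assuming the <topology>/<cells>/<cell> layout proposed earlier in the thread:

---------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

/* Count the NUMA cells advertised in the capabilities XML, or -1 on error. */
int count_cells(virConnectPtr conn)
{
    char *caps = virConnectGetCapabilities(conn);
    xmlDocPtr doc;
    xmlXPathContextPtr ctxt;
    xmlXPathObjectPtr obj;
    int ncells = -1;

    if (caps == NULL)
        return -1;

    doc = xmlReadMemory(caps, (int) strlen(caps), "capabilities.xml", NULL, 0);
    if (doc != NULL) {
        ctxt = xmlXPathNewContext(doc);
        if (ctxt != NULL) {
            obj = xmlXPathEvalExpression(
                BAD_CAST "count(//topology/cells/cell)", ctxt);
            if (obj != NULL && obj->type == XPATH_NUMBER)
                ncells = (int) obj->floatval;
            xmlXPathFreeObject(obj);
            xmlXPathFreeContext(ctxt);
        }
        xmlFreeDoc(doc);
    }
    free(caps);
    return ncells;
}
---------------------------------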
---------------------------------
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------
A few things to note: - the <cells> element list the top sibling cells
Not <nodes>?
A Node in libvirt terminology is a single physical machine; "cell" is, I think, a well-accepted term for a sub-node within a NUMA box.
- the <cell> element describes as child the resources available like the list of CPUs, the size of the local memory, that could be extended by disk descriptions too <disk dev='/dev/sdb'/> and possibly other special devices (no idea what ATM).
- in case of deeper hierarchical topology one may need to be able to name sub-cells and the format could be extended for example as <cells num='2'> <cells num='2'> <cell id='1'> ... </cell> <cell id='2'> ... </cell> </cells> <cells num='2'> <cell id='3'> ... </cell> <cell id='4'> ... </cell> </cells> </cells> But that can be discussed/changed when the need arise :-)
Especially note that 4 (or more) socket AMDs have a topology like this, with two different penalties for reaching nodes which are one and two hops away. Do we have a way to describe the penalties along different paths?
As hinted in my mail, I think the access costs will have to be added separately, probably as an array map, unless people come up with a more intelligent way of exposing that information.
2) Function to get the free memory of a given cell:
unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);
that's relatively simple, would match the request from the initial mail but I'm wondering a bit. If the program tries to do a best placement it will usually run that request for a number of cells no ? Maybe a call returning the memory amounts for a range of cells would be more appropriate.
Yes, I guess they'd want to get the free memory for all nodes. But IBM will have a better idea about this.
Well I'm looking for feedback :-)

Daniel

* Richard W.M. Jones <rjones@redhat.com> [2007-09-06 09:45]:
Daniel Veillard wrote:
1) Provide a function describing the topology as an XML instance:
char * virNodeGetTopology(virConnectPtr conn);
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise.
I'd definitely prefer to extend virConnectGetCapabilities XML. It avoids changing the remote driver and language bindings, and really callers only need to pull capabilities once per connection.
---------------------------------
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------
A few things to note: - the <cells> element list the top sibling cells
Not <nodes>?
- the <cell> element describes as child the resources available like the list of CPUs, the size of the local memory, that could be extended by disk descriptions too <disk dev='/dev/sdb'/> and possibly other special devices (no idea what ATM).
- in case of deeper hierarchical topology one may need to be able to name sub-cells and the format could be extended for example as <cells num='2'> <cells num='2'> <cell id='1'> ... </cell> <cell id='2'> ... </cell> </cells> <cells num='2'> <cell id='3'> ... </cell> <cell id='4'> ... </cell> </cells> </cells> But that can be discussed/changed when the need arise :-)
Especially note that 4 (or more) socket AMDs have a topology like this, with two different penalties for reaching nodes which are one and two hops away. Do we have a way to describe the penalties along different paths?
The SLIT table provides distance cost values. Xen isn't messing with the SLIT information at the moment. I'm not sure about Linux or Sun, but I would expect that they do.

Ryan
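
Where Linux does consume the SLIT, the distances are re-exported through sysfs; a short sketch of reading them, assuming the /sys/devices/system/node/nodeN/distance attribute present on reasonably recent 2.6 kernels (10 means local):

---------------------------------
#include <stdio.h>

/* Print the relative access cost from 'node' to node 0, 1, 2, ... in order. */
void print_distances_from(int node)
{
    char path[128];
    int cost;
    FILE *fp;

    snprintf(path, sizeof(path),
             "/sys/devices/system/node/node%d/distance", node);
    fp = fopen(path, "r");
    if (fp == NULL)
        return;

    printf("node %d:", node);
    while (fscanf(fp, "%d", &cost) == 1)
        printf(" %d", cost);
    printf("\n");
    fclose(fp);
}
---------------------------------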

* Daniel Veillard <veillard@redhat.com> [2007-09-06 08:55]:
On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
Let's restart that discussion, I would really like to see this implemented within the next month.
Thanks for starting this back up.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Thoughts?
Okay following the discussions back in June and what seems available as APIs on various setups I would like to suggest the following:
1) Provide a function describing the topology as an XML instance:
char * virNodeGetTopology(virConnectPtr conn);
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise. This doesn't change much the content in any way. I think the most important in the call is to get the topology informations as the number of processors, memory and NUMA cells are already available from virNodeGetInfo(). I suggest a format exposing the hierarchy in the XML structure, which will allow for more complex topologies for example on Sun hardware:
Not having a deep libvirt background, I'm not sure I can argue one way or the other. The topology information (nr_numa_nodes, nr_cpus, cpu_to_node) won't be changing for the lifetime of the libvirt node.
---------------------------------
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------
A few things to note: - the <cells> element list the top sibling cells
- the <cell> element describes as child the resources available like the list of CPUs, the size of the local memory, that could be extended by disk descriptions too <disk dev='/dev/sdb'/> and possibly other special devices (no idea what ATM).
The only concern I have is the memory size -- I don't believe we have a way to get at anything other than the currently available memory. As far as other resources go, yes, that makes sense; I believe there is topology information for PCI resources, though for Xen none of that is available.
- in case of deeper hierarchical topology one may need to be able to name sub-cells and the format could be extended for example as <cells num='2'> <cells num='2'> <cell id='1'> ... </cell> <cell id='2'> ... </cell> </cells> <cells num='2'> <cell id='3'> ... </cell> <cell id='4'> ... </cell> </cells> </cells> But that can be discussed/changed when the need arise :-)
Yep.
- topology may later be extended with other child elements for example to expand the description with memory access costs from cell to cell. I don't know what's the best way, mapping an array in XML is usually not very nice.
- the memory size is indicated on an attribute (instead as the content as we use on domain dumps), to preserve extensibility we may need to express more structure there (memory banks for example). We could also add a free='xxxxx' attribute indicating the amount available there, but as you suggested it's probably better to provide a separate call for this.
I would expect that function to be available even for ReadOnly connections since it's descriptive only, which means it would need to be added to the set of proxy supported call. The call will of course be added to the driver block. Implementation on recent Xen could use the hypercall. For KVM I'm wondering a bit, I don't have a NUMA box around (but can probably find one), I assume that we could either use libnuma if found at compile time or get informations from /proc. On Solaris there is a specific library as Dan exposed in the thread. I think coming first with a Xen only support would be fine, others hypervisors or platforms can be added later.
Agreed.
2) Function to get the free memory of a given cell:
unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);
that's relatively simple, would match the request from the initial mail but I'm wondering a bit. If the program tries to do a best placement it will usually run that request for a number of cells no ? Maybe a call returning the memory amounts for a range of cells would be more appropriate.
The use-case I have in mind for virt-manager would obtain the current free memory on all cells, from which it can choose according to whatever algorithm. Getting the free memory from all cells within a node would be a good call. The Xen hypercall for querying this information will be done on a per-cell basis.
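
To make that concrete, a hypothetical range-based variant along the lines Daniel suggests above - the name and signature are made up for illustration, not an existing libvirt call - filling an array with the free memory of maxCells cells starting at startCell in one round trip:

---------------------------------
#include <libvirt/libvirt.h>

/* Hypothetical bulk query: returns the number of cells filled in, or -1. */
int exampleNodeGetCellsFreeMemory(virConnectPtr conn,
                                  unsigned long long *freeMems,
                                  int startCell,
                                  int maxCells);

/* A virt-manager-style caller would then query every cell in one shot. */
static int query_all_cells(virConnectPtr conn, int ncells,
                           unsigned long long *freeMems)
{
    return exampleNodeGetCellsFreeMemory(conn, freeMems, 0, ncells);
}
---------------------------------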
3) Adding Cell/CPU placement informations to a domain description
That's where I think things starts to get a bit messy, it's not that adding <cell>1</cell> or <cpus> <pin vcpu='0' cpulist='2,3'/> <pin vcpu='1' cpulist='3'/> </cpus> along <vcpu>2</vcpu>
would be hard, it's rather what to do if the request can't be satisfied. Basically I still think that the hypervisor is in a better position to do the placement, and doing the requirement here breaks: - the virtualization, the more you rely on the physical hardware property the more you loose the benefits of virtualizing - if CPU 2 and 3 are not available/full or if the topology changed since the domain was saved the domain may just not be able to run, or run worse than if nothing had been specified. CPU pinning at runtime means a dynamic change, it's adaptbility and makes a lot of sense. But saving those dynamic instant values in the process description sounds a bit wrong to me, because the context which led to them may have changed since (or may just not make sense anymore, like after a migration or hardware change). Anyway I guess that's needed, I would tend to go the simplest way and just allow to specify the vcpu pinning in a very explicit way and hence mapping directly to the kind of capabilities already available in virDomainPinVcpu() with a similar cpumap syntax as used in virsh vcpupin command (i.e. comma separated list of CPU numbers).
With the minimal NUMA support that is available in Xen today, the best we can do is keep guests from crossing node boundaries, that is, ensure that the cpus the guest uses have local memory allocated. The current mechanism for making this happen in Xen is to supply a cpus affinity list in the domain config file. This ensures that the memory is local to those cpus and that the hypervisor does not migrate the guest vcpus to cpus on non-local cells.

While I agree that the hypervisor is in a better position to make those choices, it would end up embedding placement policy in the hypervisor. Xen already does have a placement policy for cpus, but that doesn't matter since we can re-pin vcpus before starting the domain. What I'm looking for here is a way we can ensure that the guest config can include a cpus list. libvirt doesn't have to generate this list; I would expect virt-manager or some other tool which fetched the topology and free memory information to determine a cpulist and then "add" a cpulist property to the domain config.
<cpus> <pin vcpu='0' cpulist='2,3'/> <pin vcpu='1' cpulist='3'/> </cpus>
I like this best.
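
The remaining glue is small: turn the cpulist string from the proposed <pin vcpu='0' cpulist='2,3'/> element into the bitmap that the existing virDomainPinVcpu() call expects. A sketch with an illustrative helper name, handling plain comma-separated CPU numbers only (no ranges) and with error handling trimmed:

---------------------------------
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Pin one vcpu according to a cpulist string such as "2,3". */
int pin_from_cpulist(virDomainPtr dom, unsigned int vcpu, const char *cpulist)
{
    unsigned char cpumap[16] = { 0 };    /* room for 128 physical CPUs */
    char *copy = strdup(cpulist);
    char *tok, *saveptr = NULL;

    if (copy == NULL)
        return -1;

    for (tok = strtok_r(copy, ",", &saveptr); tok != NULL;
         tok = strtok_r(NULL, ",", &saveptr)) {
        int cpu = atoi(tok);
        if (cpu >= 0 && cpu < (int) (sizeof(cpumap) * 8))
            cpumap[cpu / 8] |= 1 << (cpu % 8);
    }
    free(copy);

    return virDomainPinVcpu(dom, vcpu, cpumap, sizeof(cpumap));
}
---------------------------------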
If everyone agrees with those suggestions, then I guess we can try to get a first Xen-3.1 based implementation
Daniel
Ryan

Daniel Veillard wrote:
On Wed, Jun 13, 2007 at 10:40:40AM -0500, Ryan Harper wrote:
Hello all,
I wanted to start a discussion on how we might get libvirt to be able to probe the NUMA topology of Xen and Linux (for QEMU/KVM). In Xen, I've recently posted patches for exporting topology into the [1]physinfo hypercall, as well adding a [2]hypercall to probe the Xen heap. I believe the topology and memory info is already available in Linux. With these, we have enough information to be able to write some simple policy above libvirt that can create guests in a NUMA-aware fashion.
Let's restart that discussion, I would really like to see this implemented within the next month.
I'd like to suggest the following for discussion:
(1) A function to discover topology (2) A function to check available memory (3) Specifying which cpus to use prior to domain start
Thoughts?
Okay following the discussions back in June and what seems available as APIs on various setups I would like to suggest the following:
1) Provide a function describing the topology as an XML instance:
char * virNodeGetTopology(virConnectPtr conn);
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise. This doesn't change much the content in any way. I think the most important in the call is to get the topology informations as the number of processors, memory and NUMA cells are already available from virNodeGetInfo(). I suggest a format exposing the hierarchy in the XML structure, which will allow for more complex topologies for example on Sun hardware:
---------------------------------
<topology>
One small suggestion here... I've seen the term numanode used in some recent Xen patches. It would seem clearer to replace "cell(s)" with "numanode(s)". Then it is immediately evident what is being referred to, yet doesn't interfere with the libvirt term "node".
  <cells num='2'>
    <cell id='0'>
      <cpus num='2'>
        <cpu id='0'/>
        <cpu id='1'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
    <cell id='1'>
      <cpus num='2'>
        <cpu id='2'/>
        <cpu id='3'/>
      </cpus>
      <memory size='2097152'/>
    </cell>
  </cells>
</topology>
---------------------------------
A few things to note: - the <cells> element list the top sibling cells
- the <cell> element describes as child the resources available like the list of CPUs, the size of the local memory, that could be extended by disk descriptions too <disk dev='/dev/sdb'/> and possibly other special devices (no idea what ATM).
- in case of deeper hierarchical topology one may need to be able to name sub-cells and the format could be extended for example as <cells num='2'> <cells num='2'> <cell id='1'> ... </cell> <cell id='2'> ... </cell> </cells> <cells num='2'> <cell id='3'> ... </cell> <cell id='4'> ... </cell> </cells> </cells> But that can be discussed/changed when the need arise :-)
- topology may later be extended with other child elements for example to expand the description with memory access costs from cell to cell. I don't know what's the best way, mapping an array in XML is usually not very nice.
- the memory size is indicated on an attribute (instead as the content as we use on domain dumps), to preserve extensibility we may need to express more structure there (memory banks for example). We could also add a free='xxxxx' attribute indicating the amount available there, but as you suggested it's probably better to provide a separate call for this.
I would expect that function to be available even for ReadOnly connections since it's descriptive only, which means it would need to be added to the set of proxy supported call. The call will of course be added to the driver block. Implementation on recent Xen could use the hypercall. For KVM I'm wondering a bit, I don't have a NUMA box around (but can probably find one), I assume that we could either use libnuma if found at compile time or get informations from /proc. On Solaris there is a specific library as Dan exposed in the thread. I think coming first with a Xen only support would be fine, others hypervisors or platforms can be added later.
2) Function to get the free memory of a given cell:
unsigned long virNodeGetCellFreeMemory(virConnectPtr conn, int cell);
that's relatively simple, would match the request from the initial mail but I'm wondering a bit. If the program tries to do a best placement it will usually run that request for a number of cells no ? Maybe a call returning the memory amounts for a range of cells would be more appropriate.
3) Adding Cell/CPU placement informations to a domain description
That's where I think things starts to get a bit messy, it's not that adding <cell>1</cell> or <cpus> <pin vcpu='0' cpulist='2,3'/> <pin vcpu='1' cpulist='3'/> </cpus> along <vcpu>2</vcpu>
would be hard, it's rather what to do if the request can't be satisfied. Basically I still think that the hypervisor is in a better position to do the placement, and doing the requirement here breaks: - the virtualization, the more you rely on the physical hardware property the more you loose the benefits of virtualizing - if CPU 2 and 3 are not available/full or if the topology changed since the domain was saved the domain may just not be able to run, or run worse than if nothing had been specified. CPU pinning at runtime means a dynamic change, it's adaptbility and makes a lot of sense. But saving those dynamic instant values in the process description sounds a bit wrong to me, because the context which led to them may have changed since (or may just not make sense anymore, like after a migration or hardware change). Anyway I guess that's needed, I would tend to go the simplest way and just allow to specify the vcpu pinning in a very explicit way and hence mapping directly to the kind of capabilities already available in virDomainPinVcpu() with a similar cpumap syntax as used in virsh vcpupin command (i.e. comma separated list of CPU numbers).
<cpus> <pin vcpu='0' cpulist='2,3'/> <pin vcpu='1' cpulist='3'/> </cpus>
If everyone agrees with those suggestions, then I guess we can try to get a first Xen-3.1 based implementation
Daniel
-- 
Elizabeth Kon (Beth)
IBM Linux Technology Center
Open Hypervisor Team
email: eak@us.ibm.com

On Fri, Sep 07, 2007 at 09:55:45AM -0400, beth kon wrote:
Daniel Veillard wrote:
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise. This doesn't change much the content in any way. I think the most important in the call is to get the topology informations as the number of processors, memory and NUMA cells are already available from virNodeGetInfo(). I suggest a format exposing the hierarchy in the XML structure, which will allow for more complex topologies for example on Sun hardware:
--------------------------------- <topology>
One small suggestion here... I've seen the term numanode used in some recent Xen patches. It would seem clearer to replace "cell(s)" with "numanode(s)". Then it is immediately evident what is being referred to, yet doesn't interfere with the libvirt term "node".
<cells num='2'>
Hum, I don't have any strong opinion one way or another; "cell" sounds a bit more distinct, so I guess there is less confusion. Using Google it seems that 'numa cell' shows up more frequently than 'numa node'. On the other hand the NUMA FAQ gives a definition of the latter:
http://lse.sourceforge.net/numa/faq/index.html#what_is_a_node

"Cell" was short and looked unambiguous in our context, since we already use Node to name the physical machine; that's why I suggested it. More opinions on the matter? ;-)

Daniel

Daniel Veillard wrote:
On Fri, Sep 07, 2007 at 09:55:45AM -0400, beth kon wrote:
Daniel Veillard wrote:
which would return an XML instance as in virConnectGetCapabilities. I toyed with the idea of extending virConnectGetCapabilities() to add a topology section in case of NUMA support at the hypervisor level, but it was looking to me that the two might be used at different times and separating both might be a bit cleaner, but I could be convinced otherwise. This doesn't change much the content in any way. I think the most important in the call is to get the topology informations as the number of processors, memory and NUMA cells are already available
from virNodeGetInfo(). I suggest a format exposing the hierarchy in the
XML structure, which will allow for more complex topologies for example on Sun hardware:
--------------------------------- <topology>
One small suggestion here... I've seen the term numanode used in some recent Xen patches. It would seem clearer to replace "cell(s)" with "numanode(s)". Then it is immediately evident what is being referred to, yet doesn't interfere with the libvirt term "node".
<cells num='2'>
Hum, I don't have any strong opinion one way or another, cell sounds a bit more different so I guess there is less confusion. Using google it seems that 'numa cell' show up more frequently than 'numa node'. On the other hand the NUMA FAQ gives a definition of the later http://lse.sourceforge.net/numa/faq/index.html#what_is_a_node
Cell was short and looking unambiguous in our context, since we already use Node to name the physical machine, that's why I suggested this. More opinions on the matter ? ;-)
Daniel
I see that the term "cell" is already sprinkled around the libvirt code, so it may be easier to just leave it as is. It probably won't result in much confusion.

Beth
participants (5): beth kon, Daniel P. Berrange, Daniel Veillard, Richard W.M. Jones, Ryan Harper