[Libvir] Start NUMA work

Okay, enclosed is a first patch to add the new entry point for getting the available memory in the NUMA cells:

/**
 * virNodeGetCellsFreeMemory:
 * @conn: pointer to the hypervisor connection
 * @freeMems: pointer to the array of unsigned long
 * @nbCells: number of entries available in freeMems
 *
 * This call allows one to ask for the amount of free memory in each NUMA
 * cell. The @freeMems array must be allocated by the caller and will be
 * filled with the amounts of free memory in kilobytes for each cell,
 * starting from cell #0 and going up to @nbCells - 1 or the number of
 * cells in the node, whichever is smaller (the number of cells can be
 * found using virNodeGetInfo(); see the nodes entry in the structure).
 *
 * Returns the number of entries filled in freeMems, or -1 in case of error.
 */
int virNodeGetCellsFreeMemory(virConnectPtr conn, unsigned long *freeMems, int nbCells)

Based on the feedback, it seems better to provide an API checking a range of cells. This version always starts at cell 0; it could be extended to start at a base cell number, which is not a big change, but is it needed?

The patch adds it to the driver interfaces and puts the needed entry point, xenHypervisorNodeGetCellsFreeMemory(), in the xen_internal.c module. From there it needs a new function (or set of functions) actually doing one hypercall to get the free memory for a NUMA cell, plus the loop to fill the @freeMems array.

The hard part is of course writing the definitions and code doing the hypercall. We will need to check the current hypercall version, since this was added recently; see how xenHypervisorGetSchedulerType() does the versioning. We will have to write a similar routine, extend xen_op_v2_sys to add support for the availheap call structures, add the define for the availheap system call, glue the whole thing together, and call the new function from the loop in xenHypervisorNodeGetCellsFreeMemory()... this can be a little fun to debug.

Now for extending virConnectGetCapabilities(): it is a bit messy, but not that much. First, it's implemented on Xen using xenHypervisorGetCapabilities(); unfortunately it seems the easiest way to get the NUMA capabilities is by asking through xend. Calling xend_internal.c from xen_internal.c is not nice, but xenHypervisorGetCapabilities() is actually not using any hypervisor call as far as I can see: it's all about opening/parsing files from /proc and /sys and returning the result as XML, so this could just as well be done in the xend_internal.c (or xen_unified.c) module. So we will have a bit of surgery to do, but for the first steps of writing the patch I would not be too concerned about calling a function in xend_internal.c from xenHypervisorGetCapabilities() (or xenHypervisorMakeCapabilitiesXML()); we will just move those two in the end (the only problem may be access to the hv_version variable).

Hope this helps. I will not be online most of the week but I will try to help when possible :-)

Daniel

--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard | virtualization library http://libvirt.org/
veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
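To make the hypercall steps above concrete, here is a rough sketch of the shape this glue could take. The subcommand number, the structure layout, and the names XEN_V2_OP_GETAVAILHEAP, xen_v2s_availheap, SYS_IFACE_MIN_VERS_NUMA, and the u.availheap union member are assumptions, modelled on Xen's sysctl availheap interface, and must be checked against the real hypervisor headers and the existing xen_internal.c code before use:

/* Sketch only: names and numbers above are assumptions to verify. */
#define XEN_V2_OP_GETAVAILHEAP  9   /* assumed availheap subcommand */
#define SYS_IFACE_MIN_VERS_NUMA 4   /* hypothetical minimum version */

struct xen_v2s_availheap {
    uint32_t min_bitwidth;  /* IN: minimum address width, 0 = no limit */
    uint32_t max_bitwidth;  /* IN: maximum address width, 0 = no limit */
    int32_t  node;          /* IN: NUMA node id, -1 = sum over all nodes */
    uint64_t avail_bytes;   /* OUT: amount of free memory in bytes */
};

int
xenHypervisorNodeGetCellsFreeMemory(virConnectPtr conn,
                                    unsigned long *freeMems, int nbCells)
{
    xenUnifiedPrivatePtr priv = (xenUnifiedPrivatePtr) conn->privateData;
    xen_op_v2_sys op;
    int i;

    /* The availheap call only exists in recent hypervisors, so refuse
     * older system interface versions up front, the same way
     * xenHypervisorGetSchedulerType() guards its own subcommand. */
    if (sys_interface_version < SYS_IFACE_MIN_VERS_NUMA)
        return -1;

    for (i = 0; i < nbCells; i++) {
        memset(&op, 0, sizeof(op));
        op.cmd = XEN_V2_OP_GETAVAILHEAP;
        op.u.availheap.node = i;        /* ask for this cell only */
        if (xenHypervisorDoV2Sys(priv->handle, &op) < 0)
            return -1;
        freeMems[i] = op.u.availheap.avail_bytes / 1024;  /* bytes -> KB */
    }
    return nbCells;
}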

Daniel Veillard wrote:
Okay, enclosed is a first patch to add the new entry point for getting the available memory in the NUMA cells:
/**
 * virNodeGetCellsFreeMemory:
 * @conn: pointer to the hypervisor connection
 * @freeMems: pointer to the array of unsigned long
 * @nbCells: number of entries available in freeMems
 *
 * This call allows one to ask for the amount of free memory in each NUMA
 * cell. The @freeMems array must be allocated by the caller and will be
 * filled with the amounts of free memory in kilobytes for each cell,
 * starting from cell #0 and going up to @nbCells - 1 or the number of
 * cells in the node, whichever is smaller (the number of cells can be
 * found using virNodeGetInfo(); see the nodes entry in the structure).
 *
 * Returns the number of entries filled in freeMems, or -1 in case of error.
 */
int virNodeGetCellsFreeMemory(virConnectPtr conn, unsigned long *freeMems, int nbCells)
So you're using "unsigned long" here to mean 32 bits on 32 bit archs, and 64 bits on 64 bit archs? A purely 32 bit freeMem will allow up to 4095 GB of RAM per cell. But in reality up to 2047 GB of RAM because mappings in other languages will probably be signed. High-end users are already decking out PCs with 128 GB of RAM. If they double the RAM every year, we'll hit this limit in 4 years[1]. So is it worth using an explicit 64 bit quantity here, or using another base (MB instead of KB for example)? Or do we just think that all such vast machines will be 64 bit?
Based on the feedback, it seems better to provide an API checking a range of cells. This version always starts at cell 0; it could be extended to start at a base cell number, which is not a big change, but is it needed?
On the one hand, subranges of cells could be useful for simple hierarchical archs. On the other hand (think hypercubes), useful subranges aren't likely to be contiguous anyway!
The patch adds it to the driver interfaces and puts the needed entry point, xenHypervisorNodeGetCellsFreeMemory(), in the xen_internal.c module.
As for the actual patch, I'm guessing nothing will be committed until we have a working prototype? Apart from lack of remote support it looks fine.
Now for extending virConnectGetCapabilities(): it is a bit messy, but not that much. First, it's implemented on Xen using xenHypervisorGetCapabilities(); unfortunately it seems the easiest way to get the NUMA capabilities is by asking through xend. Calling xend_internal.c from xen_internal.c is not nice, but xenHypervisorGetCapabilities() is actually not using any hypervisor call as far as I can see: it's all about opening/parsing files from /proc and /sys and returning the result as XML, so this could just as well be done in the xend_internal.c (or xen_unified.c) module.
Yeah, best just to move that common code up to xen_unified.c probably. In any case the Xen "driver" is so intertwined that it's really just one big lump, so calling between the sub-drivers is unlikely to be a problem.

Rich.

[1] This analysis ignores two factors: (a) it covers the whole machine rather than individual cells; (b) on the other hand, perhaps flash memory (which has dramatically higher density) will become fast enough to replace conventional RAM.

--
Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom.
Registered in England and Wales under Company Registration No. 03798903
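To make the 4095 GB and 2047 GB figures above concrete, here is a small self-contained check of what a 32-bit count of kilobytes can express:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Free memory reported as a 32-bit count of kilobytes. */
    uint64_t max_unsigned_kb = UINT32_MAX;   /* 2^32 - 1 KB */
    uint64_t max_signed_kb   = INT32_MAX;    /* 2^31 - 1 KB */

    /* 1 GB = 1024 * 1024 KB */
    printf("unsigned 32-bit limit: %llu GB\n",
           (unsigned long long)(max_unsigned_kb / (1024 * 1024)));  /* 4095 */
    printf("signed   32-bit limit: %llu GB\n",
           (unsigned long long)(max_signed_kb / (1024 * 1024)));    /* 2047 */

    /* Starting at 128 GB and doubling yearly:
     * 128 -> 256 -> 512 -> 1024 -> 2048 GB, i.e. past the signed
     * limit after 4 doublings, hence the "4 years" in the text. */
    return 0;
}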

On Tue, Sep 11, 2007 at 04:28:00PM +0100, Richard W.M. Jones wrote:
Daniel Veillard wrote:
Okay, enclosed is a first patch to add the new entry point for getting the available memory in the NUMA cells:
/**
 * virNodeGetCellsFreeMemory:
 * @conn: pointer to the hypervisor connection
 * @freeMems: pointer to the array of unsigned long
 * @nbCells: number of entries available in freeMems
 *
 * This call allows one to ask for the amount of free memory in each NUMA
 * cell. The @freeMems array must be allocated by the caller and will be
 * filled with the amounts of free memory in kilobytes for each cell,
 * starting from cell #0 and going up to @nbCells - 1 or the number of
 * cells in the node, whichever is smaller (the number of cells can be
 * found using virNodeGetInfo(); see the nodes entry in the structure).
 *
 * Returns the number of entries filled in freeMems, or -1 in case of error.
 */
int virNodeGetCellsFreeMemory(virConnectPtr conn, unsigned long *freeMems, int nbCells)
So you're using "unsigned long" here to mean 32 bits on 32 bit archs, and 64 bits on 64 bit archs?
A purely 32 bit freeMem will allow up to 4095 GB of RAM per cell. But in reality up to 2047 GB of RAM because mappings in other languages will probably be signed.
High-end users are already decking out PCs with 128 GB of RAM. If they double the RAM every year, we'll hit this limit in 4 years[1]. So is it worth using an explicit 64 bit quantity here, or using another base (MB instead of KB for example)? Or do we just think that all such vast machines will be 64 bit?
Well, we already use unsigned long in KB for memory quantities in libvirt; I just reused that. I doubt we will ever see more than 64 GB on a 32-bit CPU, and that's already stretching the limits.
Based on the feedback, it seems better to provide an API checking a range of cells. This version always starts at cell 0; it could be extended to start at a base cell number, which is not a big change, but is it needed?
On the one hand, subranges of cells could be useful for simple hierarchical archs. On the other hand (think hypercubes), useful subranges aren't likely to be contiguous anyway!
For anything non-flat it's hard to guess, and for anything flat, placement basically means checking all cells to find the optimum.
The patch adds it to the driver interfaces and puts the needed entry point, xenHypervisorNodeGetCellsFreeMemory(), in the xen_internal.c module.
As for the actual patch, I'm guessing nothing will be committed until we have a working prototype? Apart from lack of remote support it looks fine.
Yes, and right, remote is something I didn't try to look at yet; I hope returning arrays of values won't be a problem.
Now for extending virConnectGetCapabilities(): it is a bit messy, but not that much. First, it's implemented on Xen using xenHypervisorGetCapabilities(); unfortunately it seems the easiest way to get the NUMA capabilities is by asking through xend. Calling xend_internal.c from xen_internal.c is not nice, but xenHypervisorGetCapabilities() is actually not using any hypervisor call as far as I can see: it's all about opening/parsing files from /proc and /sys and returning the result as XML, so this could just as well be done in the xend_internal.c (or xen_unified.c) module.
Yeah, best just to move that common code up to xen_unified.c probably.
Yes, my thought too, except for the global variable used.
In any case the Xen "driver" is so intertwined that it's really just one big lump, so calling between the sub-drivers is unlikely to be a problem.
heh :-\

Daniel
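On the global variable point, one minimal way around it would be to keep hv_version static in xen_internal.c and export a getter that the moved capabilities code can call. This is a sketch only; the accessor name xenHypervisorGetVersionCached and the moved function's name are invented for illustration:

/* xen_internal.c keeps the variable private... */
static int hv_version;          /* set when the hypervisor is probed */

int
xenHypervisorGetVersionCached(void)
{
    return hv_version;
}

/* ...and xen_unified.c (or xend_internal.c), after the move, reads it
 * through the accessor instead of touching the global directly. */
char *
xenUnifiedMakeCapabilitiesXML(virConnectPtr conn)
{
    int hv = xenHypervisorGetVersionCached();

    /* open/parse /proc and /sys exactly as the current
     * xenHypervisorMakeCapabilitiesXML() does, using hv where the
     * global was read directly; body elided here */
    (void)conn;
    (void)hv;
    return NULL;
}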

Daniel Veillard wrote:
Yes, and right, remote is something I didn't try to look at yet; I hope returning arrays of values won't be a problem.
It's no problem at all ... otherwise all the CPU pinning stuff wouldn't work (it does, I checked it all by hand).

Rich.

On Tue, Sep 11, 2007 at 11:39:07AM -0400, Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 04:28:00PM +0100, Richard W.M. Jones wrote:
Daniel Veillard wrote:
Okay, enclosed is a first patch to add the new entry point for getting the available memory in the NUMA cells:
/**
 * virNodeGetCellsFreeMemory:
 * @conn: pointer to the hypervisor connection
 * @freeMems: pointer to the array of unsigned long
 * @nbCells: number of entries available in freeMems
 *
 * This call allows one to ask for the amount of free memory in each NUMA
 * cell. The @freeMems array must be allocated by the caller and will be
 * filled with the amounts of free memory in kilobytes for each cell,
 * starting from cell #0 and going up to @nbCells - 1 or the number of
 * cells in the node, whichever is smaller (the number of cells can be
 * found using virNodeGetInfo(); see the nodes entry in the structure).
 *
 * Returns the number of entries filled in freeMems, or -1 in case of error.
 */
int virNodeGetCellsFreeMemory(virConnectPtr conn, unsigned long *freeMems, int nbCells)
So you're using "unsigned long" here to mean 32 bits on 32 bit archs, and 64 bits on 64 bit archs?
A purely 32 bit freeMem will allow up to 4095 GB of RAM per cell. But in reality up to 2047 GB of RAM because mappings in other languages will probably be signed.
High-end users are already decking out PCs with 128 GB of RAM. If they double the RAM every year, we'll hit this limit in 4 years[1]. So is it worth using an explicit 64 bit quantity here, or using another base (MB instead of KB for example)? Or do we just think that all such vast machines will be 64 bit?
Well, we already use unsigned long in KB for memory quantities in libvirt; I just reused that. I doubt we will ever see more than 64 GB on a 32-bit CPU, and that's already stretching the limits.
Just because we mistakenly used 32-bit types for various memory quantities elsewhere doesn't mean we should propagate this mistake to new APIs. As John points out, on Solaris they use a 32-bit userspace even on a 64-bit host. The same can be true of Linux - you can run a 32-bit dom0 on a 64-bit hypervisor - indeed I believe XenEnterprise does this for their Dom0. If we think a quantity may need 64 bits at some point, then we should use long long. I think it is worth using long long in this case.

Regards, Dan.

--
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|
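For illustration, here is what the entry point and a caller might look like with an explicit 64-bit quantity, following Dan's suggestion. The revised signature is an assumption (the API does not exist in libvirt.h yet at this point in the thread), and print_cells_free_memory is just example code; virNodeGetInfo() and the nodes field of virNodeInfo are real:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Assumed revised prototype: same semantics, but 64-bit KB values. */
int virNodeGetCellsFreeMemory(virConnectPtr conn,
                              unsigned long long *freeMems, int nbCells);

/* Example caller, sized from virNodeGetInfo()'s nodes field. */
int
print_cells_free_memory(virConnectPtr conn)
{
    virNodeInfo info;
    unsigned long long *freeMems;
    int i, n;

    if (virNodeGetInfo(conn, &info) < 0)
        return -1;

    freeMems = malloc(info.nodes * sizeof(*freeMems));
    if (freeMems == NULL)
        return -1;

    n = virNodeGetCellsFreeMemory(conn, freeMems, info.nodes);
    if (n < 0) {
        free(freeMems);
        return -1;
    }

    for (i = 0; i < n; i++)
        printf("cell %d: %llu KB free\n", i, freeMems[i]);

    free(freeMems);
    return 0;
}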
participants (3)
- Daniel P. Berrange
- Daniel Veillard
- Richard W.M. Jones