[Libvir] [PATCH] finish NUMA code reorg, plug cpuset at creat time support

The following patch finishes the cleanup for NUMA parsing code: - the cpuset parsing is moved to xml.c - some comments and cleanups of the include then add the output of a (cpus '...') line based on the /domain/vcpu/@cpuset attributes, this is parsed and reserialized as ranges to avoid any possibility of misinterpretation of say ^ or any special syntax we may want to add in the future. A few things to note: - dependant on the tiny patch I sent earlier today - if we notice that the set covers all CPU maybe we should avoid outputing (cpus '...'), trivial to add - mostly untested yet Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Mon, 22 Oct 2007 12:39:50 -0400 Daniel Veillard wrote:
The following patch finishes the cleanup for NUMA parsing code: - the cpuset parsing is moved to xml.c - some comments and cleanups of the include then add the output of a (cpus '...') line based on the /domain/vcpu/@cpuset attributes, this is parsed and reserialized as ranges to avoid any possibility of misinterpretation of say ^ or any special syntax we may want to add in the future. A few things to note: - dependant on the tiny patch I sent earlier today - if we notice that the set covers all CPU maybe we should avoid outputing (cpus '...'), trivial to add - mostly untested yet
It seems work fine for creating a guest, i.e. I can set the cpuset by "virsh create" with XML file containing a cpuset parameter. This fix will solve "create" of the following BZ. - BZ#223833: FEAT RHEL5.2: About the setting of CPU affinity at virsh create/reboot https://bugzilla.redhat.com/show_bug.cgi?id=223833 I think we also need to fix for "start/define/dumpxml". So, I attached the patch for them. Could you check it ? Anyway, I will have some more test for the cpuset. thanks, Saori Fukuta

On Tue, Oct 23, 2007 at 04:07:34PM +0900, Saori Fukuta wrote:
On Mon, 22 Oct 2007 12:39:50 -0400 Daniel Veillard wrote:
The following patch finishes the cleanup for NUMA parsing code: - the cpuset parsing is moved to xml.c - some comments and cleanups of the include then add the output of a (cpus '...') line based on the /domain/vcpu/@cpuset attributes, this is parsed and reserialized as ranges to avoid any possibility of misinterpretation of say ^ or any special syntax we may want to add in the future. A few things to note: - dependant on the tiny patch I sent earlier today - if we notice that the set covers all CPU maybe we should avoid outputing (cpus '...'), trivial to add - mostly untested yet
It seems work fine for creating a guest, i.e. I can set the cpuset by "virsh create" with XML file containing a cpuset parameter.
okay, good,
This fix will solve "create" of the following BZ. - BZ#223833: FEAT RHEL5.2: About the setting of CPU affinity at virsh create/reboot https://bugzilla.redhat.com/show_bug.cgi?id=223833
I think we also need to fix for "start/define/dumpxml". So, I attached the patch for them. Could you check it ?
Okay, I looked at it, the only worry I have is that it takes the cpus values as emitted in the S-Expr and output it directly in the XML. I guess it really depends how xend exports this information, reading the code it seems they only generate a comma separated CPU list on output (in which case reusing libvirt parsing/serialization to compact it to ranges would be nicer), but I don't have machines with enough CPUs to really see this, could you have a lookm before I check this in ?
Anyway, I will have some more test for the cpuset.
okay, thanks ! Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Tue, 23 Oct 2007 05:29:21 -0400 Daniel Veillard wrote:
On Tue, Oct 23, 2007 at 04:07:34PM +0900, Saori Fukuta wrote:
On Mon, 22 Oct 2007 12:39:50 -0400 Daniel Veillard wrote:
The following patch finishes the cleanup for NUMA parsing code: - the cpuset parsing is moved to xml.c - some comments and cleanups of the include then add the output of a (cpus '...') line based on the /domain/vcpu/@cpuset attributes, this is parsed and reserialized as ranges to avoid any possibility of misinterpretation of say ^ or any special syntax we may want to add in the future. A few things to note: - dependant on the tiny patch I sent earlier today - if we notice that the set covers all CPU maybe we should avoid outputing (cpus '...'), trivial to add - mostly untested yet
I think we also need to fix for "start/define/dumpxml". So, I attached the patch for them. Could you check it ?
Okay, I looked at it, the only worry I have is that it takes the cpus values as emitted in the S-Expr and output it directly in the XML. I guess it really depends how xend exports this information, reading the code it seems they only generate a comma separated CPU list on output (in which case reusing libvirt parsing/serialization to compact it to ranges would be nicer), but I don't have machines with enough CPUs to really see this, could you have a lookm before I check this in ?
Well, it may not be good idea to emit the S-Expr directly to output as you are worried. I confirmed how I can get the cpus values at fedora8(xen-3.1.0-10.fc8), and I expected to be emitted "(cpus '0,1')", when I specified "cpuset='0,1'". But xend shows nothing about cpus. Then, I could not get the cpus values by sexpr_node(root, "domain/cpus"). I guess this is a bug of Xen, and this has not been fixed with latest upstream Xen. So I will work about this for Xen before discussing about how xend exports. Regards, Saori Fukuta

On Wed, Oct 24, 2007 at 04:38:50PM +0900, Saori Fukuta wrote:
On Tue, 23 Oct 2007 05:29:21 -0400 Daniel Veillard wrote:
On Tue, Oct 23, 2007 at 04:07:34PM +0900, Saori Fukuta wrote:
On Mon, 22 Oct 2007 12:39:50 -0400 Daniel Veillard wrote:
The following patch finishes the cleanup for NUMA parsing code: - the cpuset parsing is moved to xml.c - some comments and cleanups of the include then add the output of a (cpus '...') line based on the /domain/vcpu/@cpuset attributes, this is parsed and reserialized as ranges to avoid any possibility of misinterpretation of say ^ or any special syntax we may want to add in the future. A few things to note: - dependant on the tiny patch I sent earlier today - if we notice that the set covers all CPU maybe we should avoid outputing (cpus '...'), trivial to add - mostly untested yet
I think we also need to fix for "start/define/dumpxml". So, I attached the patch for them. Could you check it ?
Okay, I looked at it, the only worry I have is that it takes the cpus values as emitted in the S-Expr and output it directly in the XML. I guess it really depends how xend exports this information, reading the code it seems they only generate a comma separated CPU list on output (in which case reusing libvirt parsing/serialization to compact it to ranges would be nicer), but I don't have machines with enough CPUs to really see this, could you have a lookm before I check this in ?
Well, it may not be good idea to emit the S-Expr directly to output as you are worried.
I confirmed how I can get the cpus values at fedora8(xen-3.1.0-10.fc8), and I expected to be emitted "(cpus '0,1')", when I specified "cpuset='0,1'". But xend shows nothing about cpus.
Right don't assume it's there on output of the xend S-Expr for that domain, look at xm info --long it doesn't show it. To extract the cpuset you need to call the existing functionalities we have for the vcpu affinities (and this may be a bit costly).
Then, I could not get the cpus values by sexpr_node(root, "domain/cpus").
I guess this is a bug of Xen, and this has not been fixed with latest upstream Xen. So I will work about this for Xen before discussing about how xend exports.
It was never in xen and I think teh new way being based on Xen-API you won't get a fix for upstream for our case here. I would not expect a fix to xend to get the solution but work it out in libvirt, but you can try ... Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Wed, 24 Oct 2007 05:12:42 -0400 Daniel Veillard wrote:
On Wed, Oct 24, 2007 at 04:38:50PM +0900, Saori Fukuta wrote:
I confirmed how I can get the cpus values at fedora8(xen-3.1.0-10.fc8), and I expected to be emitted "(cpus '0,1')", when I specified "cpuset='0,1'". But xend shows nothing about cpus.
Right don't assume it's there on output of the xend S-Expr for that domain, look at xm info --long it doesn't show it.
Yep, it's my fault. Thank you for pointing that.
To extract the cpuset you need to call the existing functionalities we have for the vcpu affinities (and this may be a bit costly).
Then, I could not get the cpus values by sexpr_node(root, "domain/cpus").
I guess this is a bug of Xen, and this has not been fixed with latest upstream Xen. So I will work about this for Xen before discussing about how xend exports.
It was never in xen and I think teh new way being based on Xen-API you won't get a fix for upstream for our case here. I would not expect a fix to xend to get the solution but work it out in libvirt, but you can try ...
I reconsidered about it and I agreed with you, because we won't get the cpus values when pined several PCPU to each VCPU, like the following case: domA VCPU:0 PCPU:0-1 VCPU:1 PCPU:0-1 VCPU:2 PCPU:2 VCPU:3 PCPU:3 And, we certainly can get the vcpu affinities above by other way (i.e. call "xenDaemonDomainGetVcpus" and analyze the result), but it will be costly at this time, though I would like to make them. So could you commit the patch first to support the cpus value at starting/defining the guest domain for RHEL5.2 ? Please let me know if you have any comments and suggestions. thanks ! Saori

On Thu, Oct 25, 2007 at 05:25:25PM +0900, Saori Fukuta wrote:
On Wed, 24 Oct 2007 05:12:42 -0400 Daniel Veillard wrote:
To extract the cpuset you need to call the existing functionalities we have for the vcpu affinities (and this may be a bit costly).
Then, I could not get the cpus values by sexpr_node(root, "domain/cpus").
I guess this is a bug of Xen, and this has not been fixed with latest upstream Xen. So I will work about this for Xen before discussing about how xend exports.
It was never in xen and I think teh new way being based on Xen-API you won't get a fix for upstream for our case here. I would not expect a fix to xend to get the solution but work it out in libvirt, but you can try ...
I reconsidered about it and I agreed with you, because we won't get the cpus values when pined several PCPU to each VCPU, like the following case: domA VCPU:0 PCPU:0-1 VCPU:1 PCPU:0-1 VCPU:2 PCPU:2 VCPU:3 PCPU:3 And, we certainly can get the vcpu affinities above by other way (i.e. call "xenDaemonDomainGetVcpus" and analyze the result), but it will be costly at this time, though I would like to make them. So could you commit the patch first to support the cpus value at starting/defining the guest domain for RHEL5.2 ?
Please let me know if you have any comments and suggestions.
Okay sorry it took so long, I got distrated by something else and I wanted to come back with a more complete patch for the following: - make sure that cpu expressions coming from the user are reparsed and translated to ranges to make sure we don't have a mismatch between the external representations and what is accepted both by libvirt and xend - Saving the pinning in the XML makes things really complex. alls the XML dump routines of xend assume an SExpr input but the affinity is not available there. Querying xend may be very costly and it way better to ask at the xen_unified level where the hypervisor hypercall will be used if available - as a result I removed the DumpXML entry point from xen_unified because you need to extend parameters in the xend case - also when coming from the proxy you just have the domain id and it's harder to get the cpu list in an efficient way without messing even more with the code I still think CPU affinity in the domain is misplaced, but well, okay I agree to do this for real cpus, this complicates the code a bit though. The associated patch compiles, but I have not yet tested it, it's basically how I would expect to finish the NUMA work, but it certainly need debug and testing. I will look at this tomorrow, but I welcome feedback :-) Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Tue, 30 Oct 2007 16:54:36 -0400 Daniel Veillard wrote:
Okay sorry it took so long, I got distrated by something else and I wanted to come back with a more complete patch for the following: - make sure that cpu expressions coming from the user are reparsed and translated to ranges to make sure we don't have a mismatch between the external representations and what is accepted both by libvirt and xend - Saving the pinning in the XML makes things really complex. alls the XML dump routines of xend assume an SExpr input but the affinity is not available there. Querying xend may be very costly and it way better to ask at the xen_unified level where the hypervisor hypercall will be used if available - as a result I removed the DumpXML entry point from xen_unified because you need to extend parameters in the xend case - also when coming from the proxy you just have the domain id and it's harder to get the cpu list in an efficient way without messing even more with the code
I still think CPU affinity in the domain is misplaced, but well, okay I agree to do this for real cpus, this complicates the code a bit though.
The associated patch compiles, but I have not yet tested it, it's basically how I would expect to finish the NUMA work, but it certainly need debug and testing. I will look at this tomorrow, but I welcome feedback :-)
sounds good to me, I tested with your patch, and I have four fixes for it. This is a test I did : [on RHEL5.1] --------------------------------------------------------------- dumpxml (conf -> XML) state:inactive ...(1) and (2) (SExpr -> XML) state:active define (XML -> conf) create (XML -> SExpr) start (conf -> XML -> SExpr) [on fedora8] --------------------------------------------------------------- dumpxml (SExpr -> XML) state:inactive (cannot get by Xen) ...(3) (SExpr -> XML) state:active (get by Xen) ...(3) define (XML -> SExpr) create (XML -> SExpr) start (nothing translation) (1) numa6_fix1.patch I cannot run the dumpxml command with inactive domain on RHEL5.1. So, I added the check of domain ID and xendConfigVersion. (2) numa6_fix2.patch The discontinuous number is invalid for inactive domain on RHEL5.1. I guess the cause is wrong argument is specified as maxcpu. (e.g.) - normal case conf : cpus = "0,1" dumpxml: <vcpu cpuset='0,1'>4</vcpu> - invalid case conf : cpus = "0,3" dumpxml: <vcpu cpuset='0'>4</vcpu> (3) numa6_fix3.patch The expression is different between inactive domain and active domain, when I don't have vcpu setting for the domain (i.e. "any cpu" is pinned). inactive domain : <vcpu cpuset=''>2</vcpu> active domain : <vcpu>2</vcpu> So, I conform to the expression of active domain. (4) numa6_fix4.patch The expression is different between "xm vcpu-list" and "virsh dumpxml", when I specified continuous number. (e.g.) # xm vcpu-list 200 Name ID VCPUs CPU State Time(s) CPU Affinity test 200 0 1 -b- 10.9 0-1 test 200 1 0 -b- 9.1 0-1 test 200 2 4 -b- 3.1 3-4 test 200 3 5 -b- 3.9 5-6 # ./virsh dumpxml 200| grep cpu <vcpu cpuset='0,1,3-6'>4</vcpu> So, I change the expression from "," to "-"(range) that I can use to create the domain. (e.g.) # ./virsh dumpxml 200| grep cpu <vcpu cpuset='0-1,3-6'>4</vcpu> Regards, Saori.

On Wed, Oct 31, 2007 at 05:20:01PM +0900, Saori Fukuta wrote:
On Tue, 30 Oct 2007 16:54:36 -0400 Daniel Veillard wrote:
The associated patch compiles, but I have not yet tested it, it's basically how I would expect to finish the NUMA work, but it certainly need debug and testing. I will look at this tomorrow, but I welcome feedback :-)
sounds good to me, I tested with your patch, and I have four fixes for it.
Cool thanks a lot for debugging while I was asleep :-)
This is a test I did : [on RHEL5.1] --------------------------------------------------------------- dumpxml (conf -> XML) state:inactive ...(1) and (2) (SExpr -> XML) state:active define (XML -> conf) create (XML -> SExpr) start (conf -> XML -> SExpr)
[on fedora8] --------------------------------------------------------------- dumpxml (SExpr -> XML) state:inactive (cannot get by Xen) ...(3) (SExpr -> XML) state:active (get by Xen) ...(3) define (XML -> SExpr) create (XML -> SExpr) start (nothing translation)
excellent, thanks a lot !
(1) numa6_fix1.patch I cannot run the dumpxml command with inactive domain on RHEL5.1. So, I added the check of domain ID and xendConfigVersion.
yes makes sense ! I also fixed that routine in the XEND case because the value for cpus was leaked if non-NULL.
(2) numa6_fix2.patch The discontinuous number is invalid for inactive domain on RHEL5.1. I guess the cause is wrong argument is specified as maxcpu. (e.g.) - normal case conf : cpus = "0,1" dumpxml: <vcpu cpuset='0,1'>4</vcpu> - invalid case conf : cpus = "0,3" dumpxml: <vcpu cpuset='0'>4</vcpu>
Hum .... right this is a bug, I tried to be smart and failed :-)
(3) numa6_fix3.patch The expression is different between inactive domain and active domain, when I don't have vcpu setting for the domain (i.e. "any cpu" is pinned). inactive domain : <vcpu cpuset=''>2</vcpu> active domain : <vcpu>2</vcpu> So, I conform to the expression of active domain.
oh, yes another bug !
(4) numa6_fix4.patch The expression is different between "xm vcpu-list" and "virsh dumpxml", when I specified continuous number. (e.g.) # xm vcpu-list 200 Name ID VCPUs CPU State Time(s) CPU Affinity test 200 0 1 -b- 10.9 0-1 test 200 1 0 -b- 9.1 0-1 test 200 2 4 -b- 3.1 3-4 test 200 3 5 -b- 3.9 5-6 # ./virsh dumpxml 200| grep cpu <vcpu cpuset='0,1,3-6'>4</vcpu> So, I change the expression from "," to "-"(range) that I can use to create the domain. (e.g.) # ./virsh dumpxml 200| grep cpu <vcpu cpuset='0-1,3-6'>4</vcpu>
Hum, I'm not sure why you prefer to use a range for 2 consecutive values, this should be equivalent both for xend and libvirt, but it's a bit simpler, so no problem ! I commited the aggregated patches, I think at this point we should be okay with NUMA support except for potential bugs showing up on testing and the vcpu pinning informations storage which I still think would be better addressed at the virsh level. What is needed now is more testing I think. thanks a lot for your fast testing and debugging ! Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

On Wed, Oct 31, 2007 at 05:20:01PM +0900, Saori Fukuta wrote:
On Tue, 30 Oct 2007 16:54:36 -0400 Daniel Veillard wrote:
The associated patch compiles, but I have not yet tested it, it's basically how I would expect to finish the NUMA work, but it certainly need debug and testing. I will look at this tomorrow, but I welcome feedback :-)
sounds good to me, I tested with your patch, and I have four fixes for it.
I tested with the latest code, and found a problem with specifying a value in the cpuset in XML that is greater than maxcpu-1. The attached
Daniel Veillard wrote: patch corrects this behavior, but the issue remains that an out of range cpu will prevent the domain from starting. This is the way xm create works today. In talking with Daniel about this, he made the point that it is not necessarily desirable for the failure of a tuning parameter (e.g., config specifies cpu 5 and the domain is now being started on a 4 cpu machine) to cause domain creation to fail. He mentioned phones ringing in the middle of the night somewhere for support when the domain fails to start. I think he has a point. Wouldn't it make more sense to start the domain(s) at perhaps suboptimal performance, than to have everything shut down until someone can come and figure out the problem? It would seem better if the domain started (with no cpu affinity) and a warning was posted. This might also argue for separation of domain config and tuning parameters. Or at least for 2 flavors of create? One that fails if tuning parameters cannot be activated, and one that doesn't. -- Elizabeth Kon (Beth) IBM Linux Technology Center Open Hypervisor Team email: eak@us.ibm.com

On Fri, Nov 02, 2007 at 01:23:32PM -0400, beth kon wrote:
I tested with the latest code, and found a problem with specifying a value in the cpuset in XML that is greater than maxcpu-1. The attached patch corrects this behavior, but the issue remains that an out of range cpu will prevent the domain from starting. This is the way xm create works today.
Okay, thanks, patch applied and commited !
In talking with Daniel about this, he made the point that it is not necessarily desirable for the failure of a tuning parameter (e.g., config specifies cpu 5 and the domain is now being started on a 4 cpu machine) to cause domain creation to fail. He mentioned phones ringing in the middle of the night somewhere for support when the domain fails to start. I think he has a point. Wouldn't it make more sense to start the domain(s) at perhaps suboptimal performance, than to have everything shut down until someone can come and figure out the problem?
It would seem better if the domain started (with no cpu affinity) and a warning was posted. This might also argue for separation of domain config and tuning parameters. Or at least for 2 flavors of create? One that fails if tuning parameters cannot be activated, and one that doesn't.
Well that could be one of the flags passed to the Create (or CreateLinux) calls, but I'm afraid taht would lead the user to assume the behaviour will be consistent from hypervisor to hypervisor and also for all tuning parameters, and honnestly I'm not sure we will ever be able to garantee this. I think we need to discuss more globally tuning parameters, I will make another post today on the topic to start a clean thread :-) thanks a lot ! Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
participants (3)
-
beth kon
-
Daniel Veillard
-
Saori Fukuta