Re: [libvirt] [RFC] phi support in libvirt

On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
Hi all:
As we know, Intel® Xeon Phi targets high-performance computing and other parallel workloads. Now that QEMU supports Phi virtualization, it is time for libvirt to support Phi.
Can you provide a pointer to the relevant QEMU changes?
Xeon Phi Knights Landing (KNL) contains 2 primary hardware features: one is up to 288 CPUs, which needs patches to support and which we are pushing; the other is Multi-Channel DRAM (MCDRAM), which does not need any changes currently. Let me introduce MCDRAM a bit more: MCDRAM is on-package high-bandwidth memory (~500 GB/s). On the KNL platform, the hardware exposes MCDRAM as a separate, CPU-less and remote NUMA node to the OS, so that MCDRAM will not be allocated by default (since the MCDRAM node has no CPUs, every CPU regards the MCDRAM node as a remote node). In this way, MCDRAM can be reserved for certain specific applications.
Different from a traditional x86 server, there is a special NUMA node with Multi-Channel DRAM (MCDRAM) on Phi, but without any CPU.
Now libvirt requires a nonempty cpus attribute for each NUMA cell, such as:

<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
</numa>
In order to support Phi virtualization, libvirt needs to allow a NUMA cell definition without the 'cpus' attribute.
Such as:

<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' memory='16' unit='GiB'/>
</numa>
When a cell has no 'cpus', QEMU will allocate its memory from MCDRAM instead of DDR by default.
There are two separate concepts at play which your description here is mixing up.
First is the question of whether the guest NUMA node can be created with only RAM or CPUs, or a mix of both.
Second is the question of what kind of host RAM (MCDRAM vs DDR) is used as the backing store for the guest.
The guest NUMA node should be created with memory only (the same as the host's), and, more importantly, the memory should be bound to (come from) the host MCDRAM node.
These are separate configuration items which don't need to be conflated in libvirt, i.e. we should be able to create a guest with a node containing only memory and back that with DDR on the host. Conversely, we should be able to create a guest with a node containing memory + CPUs and back that with MCDRAM on the host (even if that means the vCPUs will end up on a different host node from their RAM).
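As a rough sketch of how those two independent settings could sit side by side (assuming the 'cpus' attribute is made optional as proposed above, and using the existing <numatune> binding for the host side; the node numbers are purely illustrative):

<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' memory='16' unit='GiB'/>              <!-- guest shape: a memory-only cell -->
</numa>

<numatune>
  <memnode cellid='1' mode='strict' nodeset='4'/>    <!-- host backing: bind cell 1 to host node 4 -->
</numatune>

Whether host node 4 happens to be MCDRAM or DDR is not visible in this XML, which is exactly the separation described above.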
On the first point, there still appears to be some brokenness in either QEMU or Linux w.r.t. configuration of virtual NUMA where either cpus or memory are absent from nodes.
e.g. if I launch QEMU with:
-numa node,nodeid=0,cpus=0-3,mem=512 \
-numa node,nodeid=1,mem=512 \
-numa node,nodeid=2,cpus=4-7 \
-numa node,nodeid=3,mem=512 \
-numa node,nodeid=4,mem=512 \
-numa node,nodeid=5,cpus=8-11 \
-numa node,nodeid=6,mem=1024 \
-numa node,nodeid=7,cpus=12-15,mem=1024
then the guest reports
# numactl --hardware
available: 6 nodes (0,3-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 487 MB
node 0 free: 230 MB
node 3 cpus: 12 13 14 15
node 3 size: 1006 MB
node 3 free: 764 MB
node 4 cpus:
node 4 size: 503 MB
node 4 free: 498 MB
node 5 cpus:
node 5 size: 503 MB
node 5 free: 499 MB
node 6 cpus:
node 6 size: 503 MB
node 6 free: 498 MB
node 7 cpus:
node 7 size: 943 MB
node 7 free: 939 MB
so it has pushed all the CPUs from nodes without RAM into the first node, and moved the CPUs from the 7th node into the 3rd node.
I am not sure why this happens, but basically I launch QEMU like:

-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
-numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
-numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
-numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
-object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
-numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
-numa node,nodeid=4,memdev=node4 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
-numa node,nodeid=5,memdev=node5 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
-numa node,nodeid=6,memdev=node6 \
-object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
-numa node,nodeid=7,memdev=node7 \

(Please ignore the complex cpus parameters...) As you can see, each pair of `-object memory-backend-ram` and `-numa` specifies where the memory of a guest NUMA node is allocated from. It works well for me :-)
So before considering MCDRAM / Phi, we need to fix this more basic NUMA topology setup.
Now here I'd like to discuss these questions: 1. This feature is only for Phi at present, but we will check the Phi platform for CPU-less NUMA nodes. A NUMA node without CPUs indicates an MCDRAM node.
We should not assume such semantics - it is a concept that is specific to particular Intel x86_64 CPUs. We need to consider that other architectures may have nodes without CPUs that are backed by normal DDR. IOW, we should be explicit about the presence of MCDRAM in the host.
Agreed, but for KNL, that is how we detect MCDRAM on the host:
1. detect that the CPU family is Xeon Phi x200 (meaning KNL)
2. enumerate all NUMA nodes and regard the nodes that contain only memory as MCDRAM nodes
...
Thanks,
-He
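For what it's worth, a minimal sketch of that detection on a Linux host could look like the shell snippet below (the sysfs/procfs paths are standard, but the model-name match is purely illustrative and a real implementation would have to cover all KNL SKUs and also confirm that the node actually has memory):

# step 1: check that the host CPU is a Xeon Phi (KNL); the string match is illustrative only
if grep -q "Xeon Phi" /proc/cpuinfo; then
    # step 2: a NUMA node whose cpulist is empty is a memory-only node, i.e. a candidate MCDRAM node
    for node in /sys/devices/system/node/node*; do
        if [ -z "$(cat "$node/cpulist")" ]; then
            echo "$(basename "$node"): no CPUs, candidate MCDRAM node"
        fi
    done
fi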

Shaohe was dropped from the loop, adding him back.
-----Original Message-----
From: He Chen [mailto:he.chen@linux.intel.com]
Sent: Friday, December 9, 2016 3:46 PM
To: Daniel P. Berrange <berrange@redhat.com>
Cc: libvir-list@redhat.com; Du, Dolpher <dolpher.du@intel.com>; Zyskowski, Robert <robert.zyskowski@intel.com>; Daniluk, Lukasz <lukasz.daniluk@intel.com>; Zang, Rui <rui.zang@intel.com>; jdenemar@redhat.com
Subject: Re: [libvirt] [RFC] phi support in libvirt

Thanks, Dolpher. Reply inline.

On December 21, 2016 at 11:56, Du, Dolpher wrote:
The guest NUMA node should be created with memory only (the same as the host's), and, more importantly, the memory should be bound to (come from) the host MCDRAM node. So I suggest libvirt distinguish the MCDRAM.
And the MCDRAM NUMA config would be as follows, adding a "mcdram" attribute to the "cell" element:

<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>
so it has pushed all the CPUs from nodes without RAM into the first node, and moved the CPUs from the 7th node into the 3rd node.
This seems to be a bug.

He Chen, do you know how QEMU generates the NUMA nodes for the guest? Can QEMU do a sanity check of the host physical NUMA topology and generate a smart guest NUMA topology?
As you can see, each pair of `-object memory-backend-ram` and `-numa` specifies where the memory of a guest NUMA node is allocated from. It works well for me :-)
When "mcdram" appears in a "cell", we bind it to the physical NUMA node by specifying the corresponding memory backend "object":

<numa>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>
When "mcdram" appears in a "cell", we detect the MCDRAM on the host, do some checks, and bind it to the physical NUMA node:

<numa>
  <cell id='1' mcdram='16' unit='GiB'/>
</numa>

On Wed, Dec 21, 2016 at 12:51:29PM +0800, Feng, Shaohe wrote:
And the MCDRAM NUMA config would be as follows, adding a "mcdram" attribute to the "cell" element: <cell id='1' mcdram='16' unit='GiB'/>
No, that is not backwards compatible for applications using libvirt. We already have a place for storing info about memory backing type, which we use for huge pages; mcdram should use the same approach IMHO, e.g.

<domain>
  ...
  <memoryBacking>
    <mcdram nodeset="3-4"/>
  </memoryBacking>
</domain>

to indicate that nodes 3 & 4 should use MCDRAM.

Regards,
Daniel

Thanks, Daniel. So how about this: for the NUMA format, we still use "memory" to describe the MCDRAM, but we remove the cpus attribute:

<numa>
  <cell id='3' memory='8' unit='GiB'/>
  <cell id='4' memory='8' unit='GiB'/>
</numa>

At present, for this kind of CPU-less NUMA cell, we only support MCDRAM as the memory backend:

<domain>
  ...
  <memoryBacking>
    <mcdram nodeset="3-4"/>
  </memoryBacking>
</domain>

And we reject a CPU-less NUMA cell without a memory backend. Maybe we will allow it in the future once QEMU can handle it well.

A question:
1. Should libvirt probe the "host-nodes" for this kind of memory to make a smart mapping? The QEMU arguments will be as follows:

-object memory-backend-ram,size=8G,prealloc=yes,host-nodes=0,policy=bind,id=node3 \
-numa node,nodeid=3,memdev=node3 \
-object memory-backend-ram,size=8G,prealloc=yes,host-nodes=0,policy=bind,id=node4 \
-numa node,nodeid=4,memdev=node4 \

2. Or do we let the user specify the host-nodes?

<memoryBacking>
  <mcdram nodeset="3-4" host-nodes="0-1"/>
</memoryBacking>

BR,
ShaoHe Feng

For your question, I would suggest using the second form; this is consistent with QEMU and will not bring platform-specific knowledge into the libvirt layer:

2. Or do we let the user specify the host-nodes?

<memoryBacking>
  <mcdram nodeset="3-4" host-nodes="0-1"/>
</memoryBacking>

Regards,
Dolpher

(BTW, if possible please try to avoid sending HTML email, or HTML+text email, to the list - plain-text-only email is preferred.)

On Mon, Dec 26, 2016 at 07:57:42PM +0800, Feng, Shaohe wrote:
for the NUMA format, we still use "memory" to describe the MCDRAM, but we remove the cpus attribute:

<numa>
  <cell id='3' memory='8' unit='GiB'/>
  <cell id='4' memory='8' unit='GiB'/>
</numa>
At present, for this kind of CPU-less NUMA cell, we only support MCDRAM as the memory backend.
Yep, that sounds ok.
<domain>
  ...
  <memoryBacking>
    <mcdram nodeset="3-4"/>
  </memoryBacking>
</domain>
And we reject a CPU-less NUMA cell without a memory backend. Maybe we will allow it in the future once QEMU can handle it well.
Yes, that's ok too.
A question: 1. Should libvirt probe the "host-nodes" for this kind of memory to make a smart mapping?
The QEMU arguments will be as follows:

-object memory-backend-ram,size=8G,prealloc=yes,host-nodes=0,policy=bind,id=node3 \
-numa node,nodeid=3,memdev=node3 \
-object memory-backend-ram,size=8G,prealloc=yes,host-nodes=0,policy=bind,id=node4 \
-numa node,nodeid=4,memdev=node4 \
2. Or do we let the user specify the host-nodes?

<memoryBacking>
  <mcdram nodeset="3-4" host-nodes="0-1"/>
</memoryBacking>
You don't need to do that - <numatune> already lets the user say which host node each guest node is attached to:

http://libvirt.org/formatdomain.html#elementsNUMATuning

<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>

Regards,
Daniel
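To connect this back to the example above: for the two memory-only cells 3 and 4, a <numatune> along the following lines would pin each guest cell to a chosen host (MCDRAM) node, and should end up generating the same kind of `-object memory-backend-ram,...,host-nodes=N,policy=bind` plus `-numa node,memdev=...` pairs shown earlier in the thread (the host node numbers here are only illustrative):

<numatune>
  <memnode cellid='3' mode='strict' nodeset='0'/>
  <memnode cellid='4' mode='strict' nodeset='1'/>
</numatune>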
participants (4)
- Daniel P. Berrange
- Du, Dolpher
- Feng, Shaohe
- He Chen