Thanks, Dolpher.
Replies inline.
On 2016-12-21 11:56, Du, Dolpher wrote:
Shaohe was dropped from the loop, adding him back.
> -----Original Message-----
> From: He Chen [mailto:he.chen@linux.intel.com]
> Sent: Friday, December 9, 2016 3:46 PM
> To: Daniel P. Berrange <berrange(a)redhat.com>
> Cc: libvir-list(a)redhat.com; Du, Dolpher <dolpher.du(a)intel.com>; Zyskowski,
> Robert <robert.zyskowski(a)intel.com>; Daniluk, Lukasz
> <lukasz.daniluk(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>;
> jdenemar(a)redhat.com
> Subject: Re: [libvirt] [RFC] phi support in libvirt
>
>> On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
>>> Hi all:
>>>
>>> As we know, Intel® Xeon Phi targets high-performance computing and
>>> other parallel workloads.
>>> Now that QEMU supports Phi virtualization, it is time for libvirt to
>>> support Phi.
>> Can you provide pointer to the relevant QEMU changes.
>>
> Xeon Phi Knights Landing (KNL) has 2 primary hardware features: one is
> up to 288 CPUs, which needs patches that we are pushing, and the other is
> Multi-Channel DRAM (MCDRAM), which does not need any changes currently.
>
> Let me introduce MCDRAM in more detail: MCDRAM is on-package
> high-bandwidth memory (~500 GB/s).
>
> On the KNL platform, the hardware exposes MCDRAM to the OS as a separate,
> CPU-less, remote NUMA node, so MCDRAM is not allocated by default (since
> the MCDRAM node has no CPUs, every CPU regards the MCDRAM node as a
> remote node). In this way, MCDRAM can be reserved for certain specific
> applications.
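For illustration, this is how such a memory-only node is usually consumed on
bare metal (just a sketch; using node 1 as the MCDRAM node id is an
assumption):
    # the MCDRAM node reports memory but an empty cpu list
    numactl --hardware
    # bind an application's allocations to the MCDRAM node explicitly
    numactl --membind=1 ./some_hpc_app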
>
>>> Unlike a traditional x86 server, Phi has a special NUMA node with
>>> Multi-Channel DRAM (MCDRAM) but without any CPUs.
>>>
>>> Currently libvirt requires a non-empty cpus argument for each NUMA cell,
>>> such as:
>>>   <numa>
>>>     <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>>>     <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
>>>   </numa>
>>>
>>> In order to support Phi virtualization, libvirt needs to allow a NUMA
>>> cell definition without the 'cpus' attribute.
>>>
>>> Such as:
>>>   <numa>
>>>     <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>>>     <cell id='1' memory='16' unit='GiB'/>
>>>   </numa>
>>>
>>> When a cell has no 'cpus', QEMU will by default allocate its memory from
>>> MCDRAM instead of DDR.
>> There are two separate concepts at play which your description here is mixing up.
>>
>> First is the question of whether the guest NUMA node can be created with
>> only RAM or CPUs, or a mix of both.
>> Second is the question of what kind of host RAM (MCDRAM vs DDR) is used
>> as the backing store for the guest.
> The guest NUMA node should be created with memory only (the same as on the
> host), and more importantly, the memory should be bound to (i.e. come from)
> the host MCDRAM node.
So I suggest libvirt distinguish MCDRAM explicitly. The MCDRAM NUMA config
could look as follows, adding an "mcdram" attribute to the "cell" element:
  <numa>
    <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
    <cell id='1' mcdram='16' unit='GiB'/>
  </numa>
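For such a cell, libvirt could then generate QEMU arguments along these lines
(only a sketch; size 16G matches the example above, and binding to host node 1
as the MCDRAM node is an assumption):
    -object memory-backend-ram,size=16G,prealloc=yes,host-nodes=1,policy=bind,id=mcdram1 \
    -numa node,nodeid=1,memdev=mcdram1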
>
>> These are separate configuration items which don't need to be conflated in
>> libvirt, i.e. we should be able to create a guest with a node containing only
>> memory, and back that by DDR on the host. Conversely we should be able to
>> create a guest with a node containing memory + cpus and back that by
>> MCDRAM on the host (even if that means the vCPUs will end up on a different
>> host node from its RAM).
>> On the first point, there still appears to be some brokenness in either QEMU
>> or Linux wrt the configuration of virtual NUMA where either cpus or memory
>> are absent from nodes.
>> e.g. if I launch QEMU with
>>
>> -numa node,nodeid=0,cpus=0-3,mem=512
>> -numa node,nodeid=1,mem=512
>> -numa node,nodeid=2,cpus=4-7
>> -numa node,nodeid=3,mem=512
>> -numa node,nodeid=4,mem=512
>> -numa node,nodeid=5,cpus=8-11
>> -numa node,nodeid=6,mem=1024
>> -numa node,nodeid=7,cpus=12-15,mem=1024
>>
>> then the guest reports
>>
>> # numactl --hardware
>> available: 6 nodes (0,3-7)
>> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
>> node 0 size: 487 MB
>> node 0 free: 230 MB
>> node 3 cpus: 12 13 14 15
>> node 3 size: 1006 MB
>> node 3 free: 764 MB
>> node 4 cpus:
>> node 4 size: 503 MB
>> node 4 free: 498 MB
>> node 5 cpus:
>> node 5 size: 503 MB
>> node 5 free: 499 MB
>> node 6 cpus:
>> node 6 size: 503 MB
>> node 6 free: 498 MB
>> node 7 cpus:
>> node 7 size: 943 MB
>> node 7 free: 939 MB
>>
>> so it's pushed all the CPUs from nodes without RAM into the first node, and
>> moved CPUs from the 7th node into the 3rd node.
That seems like a bug.
He Chen, do you know how QEMU generates the NUMA nodes for the guest?
Can QEMU sanity-check the host's physical NUMA topology and generate a
sensible guest NUMA topology?
> I am not sure why this happens, but basically, I launch QEMU like:
>
> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=0,policy=bind,id=node0 \
> -numa node,nodeid=0,cpus=0-14,cpus=60-74,cpus=120-134,cpus=180-194,memdev=node0 \
>
> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=1,policy=bind,id=node1 \
> -numa node,nodeid=1,cpus=15-29,cpus=75-89,cpus=135-149,cpus=195-209,memdev=node1 \
>
> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=2,policy=bind,id=node2 \
> -numa node,nodeid=2,cpus=30-44,cpus=90-104,cpus=150-164,cpus=210-224,memdev=node2 \
>
> -object memory-backend-ram,size=20G,prealloc=yes,host-nodes=3,policy=bind,id=node3 \
> -numa node,nodeid=3,cpus=45-59,cpus=105-119,cpus=165-179,cpus=225-239,memdev=node3 \
>
> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=4,policy=bind,id=node4 \
> -numa node,nodeid=4,memdev=node4 \
>
> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=5,policy=bind,id=node5 \
> -numa node,nodeid=5,memdev=node5 \
>
> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=6,policy=bind,id=node6 \
> -numa node,nodeid=6,memdev=node6 \
>
> -object memory-backend-ram,size=3G,prealloc=yes,host-nodes=7,policy=bind,id=node7 \
> -numa node,nodeid=7,memdev=node7 \
>
> (Please ignore the complex cpus parameters...)
> As you can see, the pair of `-object memory-backend-ram` and `-numa` is
> used to specify where the memory of the guest NUMA node is allocated
> from. It works well for me :-)
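For reference, libvirt can already express this kind of per-cell host binding
through <numatune>/<memnode> (just an illustration; cellid '1' refers to the
guest MCDRAM cell proposed above, and host nodeset '4' as an MCDRAM node is an
assumption):
    <numatune>
      <memnode cellid='1' mode='strict' nodeset='4'/>
    </numatune>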
When a "mcdram" in "cell", we banding it to the Physical numa by
specify
the "object"
<numa>
<cell id='1' mcdram='16' unit='GiB'/> </numa>
>
>> So before considering MCDRAM / Phi, we need to fix this more basic NUMA
>> topology setup.
>>> Now I'd like to discuss these questions:
>>> 1. This feature is only for Phi at present; we will check for the Phi
>>>    platform, and on Phi a NUMA node without CPUs indicates an MCDRAM node.
>> We should not assume such semantics - it is a concept that is specific to
>> particular Intel x86_64 CPUs. We need to consider that other architectures
>> may have nodes without CPUs that are backed by normal DDR.
>> IOW, we should be explicit about the presence of MCDRAM in the host.
>>
> Agreed, but for KNL, this is how we detect MCDRAM on the host:
> 1. detect that the CPU family is Xeon Phi x200 (i.e. KNL)
> 2. enumerate all NUMA nodes and regard the nodes that contain only memory
>    as MCDRAM nodes.
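A rough userspace sketch of that detection (the sysfs paths are standard; the
'Xeon Phi' match on the model name is only an assumed way to identify KNL):
    # 1. the CPU model should look like a Knights Landing part
    grep -m1 'model name' /proc/cpuinfo | grep -qi 'Xeon Phi' || exit 1
    # 2. nodes with an empty cpulist are memory-only, i.e. MCDRAM candidates
    for n in /sys/devices/system/node/node*; do
        [ -z "$(cat $n/cpulist)" ] && echo "MCDRAM candidate: ${n##*/}"
    done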
When a "mcdram" in "cell", we detect the MCDRAM, do some check and
banding it to the Physical numa
<numa>
<cell id='1' mcdram='16' unit='GiB'/> </numa>
>
> ...
>
> Thanks,
> -He