On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
Hi all:
As we know, Intel® Xeon Phi targets high-performance computing and other
parallel workloads.
Now that QEMU supports Phi virtualization, it is time for libvirt to support
Phi.
Can you provide a pointer to the relevant QEMU changes?
Unlike a traditional x86 server, Phi has a special NUMA node with
Multi-Channel DRAM (MCDRAM) but without any CPUs.
Currently libvirt requires a non-empty cpus attribute for each NUMA node,
such as:
<numa>
<cell id='0' cpus='0-239' memory='80' unit='GiB'/>
<cell id='1' cpus='240-243' memory='16' unit='GiB'/>
</numa>
In order to support Phi virtualization, libvirt needs to allow a NUMA cell
definition without the 'cpus' attribute.
Such as:
<numa>
<cell id='0' cpus='0-239' memory='80' unit='GiB'/>
<cell id='1' memory='16' unit='GiB'/>
</numa>
When a cell has no 'cpus' attribute, QEMU will by default allocate its
memory from MCDRAM instead of DDR.
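To make the proposal concrete, here is a minimal Python sketch of how a tool might read the proposed XML and tell memory-only cells apart from cells that have CPUs. It is only an illustration, not libvirt code: the helper names (parse_cpu_ranges, describe_cells) are hypothetical, and it handles only plain ranges such as '0-239', not the '^' exclusion syntax.

```python
import xml.etree.ElementTree as ET

# The proposed XML from the message above: cell 1 has no 'cpus' attribute.
NUMA_XML = """
<numa>
  <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
  <cell id='1' memory='16' unit='GiB'/>
</numa>
"""


def parse_cpu_ranges(spec):
    """Expand a simple cpus spec such as '0-3,8' into a sorted list of ints.

    Assumption: only plain numbers and 'lo-hi' ranges are handled here.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def describe_cells(xml_text):
    """Return one dict per <cell>, treating a missing 'cpus' as an empty list."""
    cells = []
    for cell in ET.fromstring(xml_text).findall("cell"):
        cells.append({
            "id": int(cell.get("id")),
            "cpus": parse_cpu_ranges(cell.get("cpus", "")),
            "memory": int(cell.get("memory")),
            "unit": cell.get("unit", "KiB"),
        })
    return cells


if __name__ == "__main__":
    for c in describe_cells(NUMA_XML):
        kind = "cpus+memory" if c["cpus"] else "memory-only"
        print(c["id"], kind, len(c["cpus"]), "cpus,", c["memory"], c["unit"])
```

Note that nothing in this sketch says what kind of host memory backs a memory-only cell; it only records that the cell has no CPUs.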
There are separate concepts at play which your description here is
mixing up.
First is the question of whether the guest NUMA node can be created
with only RAM or CPUs, or a mix of both.
Second is the question of what kind of host RAM (MCDRAM vs DDR) is
used as the backing store for the guest.
These are separate configuration items which don't need to be
conflated in libvirt, i.e. we should be able to create a guest
with a node containing only memory, and back that by DDR on
the host. Conversely we should be able to create a guest
with a node containing memory + cpus and back that by MCDRAM
on the host (even if that means the vCPUs will end up on a
different host node from their RAM).
On the first point, there still appears to be some brokenness
in either QEMU or Linux wrt configuration of virtual NUMA
where either cpus or memory are absent from nodes.
E.g. if I launch QEMU with
-numa node,nodeid=0,cpus=0-3,mem=512
-numa node,nodeid=1,mem=512
-numa node,nodeid=2,cpus=4-7
-numa node,nodeid=3,mem=512
-numa node,nodeid=4,mem=512
-numa node,nodeid=5,cpus=8-11
-numa node,nodeid=6,mem=1024
-numa node,nodeid=7,cpus=12-15,mem=1024
then the guest reports
# numactl --hardware
available: 6 nodes (0,3-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 487 MB
node 0 free: 230 MB
node 3 cpus: 12 13 14 15
node 3 size: 1006 MB
node 3 free: 764 MB
node 4 cpus:
node 4 size: 503 MB
node 4 free: 498 MB
node 5 cpus:
node 5 size: 503 MB
node 5 free: 499 MB
node 6 cpus:
node 6 size: 503 MB
node 6 free: 498 MB
node 7 cpus:
node 7 size: 943 MB
node 7 free: 939 MB
so it has pushed all the CPUs from nodes without RAM into the
first node, and moved the CPUs from node 7 into node 3.
So before considering MCDRAM / Phi, we need to fix this more
basic NUMA topology setup.
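A quick way to see the mismatch is to compare the requested -numa arguments against what numactl reports in the guest. The following Python sketch does that for the example above; the function names are made up for illustration, the "free" lines of the numactl output are omitted, and it assumes each node has at most one cpus= option.

```python
import re

# Requested topology: the -numa arguments from the example above.
QEMU_NUMA_ARGS = """
-numa node,nodeid=0,cpus=0-3,mem=512
-numa node,nodeid=1,mem=512
-numa node,nodeid=2,cpus=4-7
-numa node,nodeid=3,mem=512
-numa node,nodeid=4,mem=512
-numa node,nodeid=5,cpus=8-11
-numa node,nodeid=6,mem=1024
-numa node,nodeid=7,cpus=12-15,mem=1024
"""

# Observed topology: guest 'numactl --hardware' output ("free" lines omitted).
NUMACTL_OUTPUT = """
available: 6 nodes (0,3-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
node 0 size: 487 MB
node 3 cpus: 12 13 14 15
node 3 size: 1006 MB
node 4 cpus:
node 4 size: 503 MB
node 5 cpus:
node 5 size: 503 MB
node 6 cpus:
node 6 size: 503 MB
node 7 cpus:
node 7 size: 943 MB
"""


def expand_cpus(spec):
    """Turn '0-3' or '5' into a set of ints; an empty spec gives an empty set."""
    if not spec:
        return set()
    if "-" in spec:
        lo, hi = spec.split("-", 1)
        return set(range(int(lo), int(hi) + 1))
    return {int(spec)}


def parse_qemu_numa(args):
    """Map node id -> requested cpus and mem (0 when mem= is absent)."""
    nodes = {}
    for m in re.finditer(r"-numa\s+node,(\S+)", args):
        opts = dict(kv.split("=", 1) for kv in m.group(1).split(","))
        nodes[int(opts["nodeid"])] = {
            "cpus": expand_cpus(opts.get("cpus", "")),
            "mem": int(opts.get("mem", 0)),
        }
    return nodes


def parse_numactl(output):
    """Map node id -> cpus and size (MB) as reported inside the guest."""
    nodes = {}
    for line in output.splitlines():
        line = line.strip()
        m = re.match(r"node (\d+) cpus:(.*)", line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {"cpus": set(), "size": 0})
            node["cpus"] = {int(c) for c in m.group(2).split()}
            continue
        m = re.match(r"node (\d+) size: (\d+) MB", line)
        if m:
            node = nodes.setdefault(int(m.group(1)), {"cpus": set(), "size": 0})
            node["size"] = int(m.group(2))
    return nodes


def cpu_mismatches(requested, observed):
    """Node ids whose observed CPU set differs from the requested one.

    A node missing from the guest output is treated as having no CPUs.
    """
    bad = []
    for node_id in sorted(set(requested) | set(observed)):
        want = requested.get(node_id, {}).get("cpus", set())
        got = observed.get(node_id, {}).get("cpus", set())
        if want != got:
            bad.append(node_id)
    return bad


if __name__ == "__main__":
    requested = parse_qemu_numa(QEMU_NUMA_ARGS)
    observed = parse_numactl(NUMACTL_OUTPUT)
    print("nodes missing in guest:", sorted(set(requested) - set(observed)))
    print("nodes with wrong CPUs:", cpu_mismatches(requested, observed))
```

For this example the sketch flags nodes 1 and 2 as missing from the guest entirely, and only nodes 4 and 6 (plus the absent, CPU-less node 1) end up with the CPU set that was requested.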
Now here I'd like to discuss these questions:
1. This feature is only for Phi at present, but we
will check Phi platform for CPU-less NUMA node.
A NUMA node without CPUs indicates an MCDRAM node.
We should not assume such semantics - it is a concept
that is specific to particular Intel x86_64 CPUs. We
need to consider that other architectures may have
nodes without CPUs that are backed by normal DDR.
IOW, we should be explicit about the presence of MCDRAM
in the host.
And for now MCDRAM is available only on Phi.
However, there is no reason that any other platform
couldn’t define a CPU-less NUMA node using libvirt, so
there is no reason to check whether Phi is used or not.
2. The type of memory of a CPU-less NUMA node will not be
checked during the machine creation/configuration step.
There is no reliable way to distinguish whether the node
is MCDRAM or regular DDR. This step is not concerned
with the type of the memory, only with NUMA assignment.
If we can't distinguish MCDRAM from DDR that's a problem
for apps, given your next point about MCDRAM not supporting
over commit.
3. Unlike traditional memory assigned to a VM, MCDRAM does not
support over commit.
If the memory of a virtual NUMA node is going to be
explicitly bound to a physical NUMA node then it shouldn’t
exceed the size of its corresponding NUMA node; it doesn’t
matter whether it is MCDRAM or DDR.
It is valid to bind guests to NUMA nodes and still have
memory over commit, so we do need to know if a host node
is using MCDRAM or DDR, so apps can determine whether
that node supports over commit or not.
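As a rough sketch of what an app would need, assume the host reports, per NUMA node, its size and whether over commit is allowed. Nothing like the allows_overcommit flag exists today; it, the HostNode type, and binding_fits are hypothetical names, and the guest sizes are made-up numbers. The claim that MCDRAM cannot be over committed is taken from your point 3.

```python
from dataclasses import dataclass


@dataclass
class HostNode:
    node_id: int
    size_mib: int
    # Hypothetical flag: something the host would have to report,
    # e.g. False for an MCDRAM node if it cannot be over committed.
    allows_overcommit: bool


def binding_fits(host, guest_node_sizes_mib):
    """True if the guest nodes bound to this host node are acceptable.

    Bindings that fit in the host node are always fine; larger totals
    are only accepted when the host node allows over commit.
    """
    total = sum(guest_node_sizes_mib)
    return total <= host.size_mib or host.allows_overcommit


if __name__ == "__main__":
    ddr = HostNode(node_id=0, size_mib=80 * 1024, allows_overcommit=True)
    mcdram = HostNode(node_id=1, size_mib=16 * 1024, allows_overcommit=False)

    # Two guests bound to the DDR node, 96 GiB in total: over committed but allowed.
    print(binding_fits(ddr, [64 * 1024, 32 * 1024]))
    # Two guests bound to the MCDRAM node, 24 GiB in total: rejected.
    print(binding_fits(mcdram, [8 * 1024, 16 * 1024]))
```

Without that per-node information the second case cannot be told apart from the first.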
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|