[libvirt] [RFC] phi support in libvirt

Hi all: As we are know Intel® Xeon phi targets high-performance computing and other parallel workloads. Now qemu has supported phi virtualization,it is time for libvirt to support phi. Different from the traditional X86 server, There is a special numa node with Multi-Channel DRAM (MCDRAM) on Phi, but without any CPU . Now libvirt requires nonempty cpus argument for NUMA node, such as. <numa> <cell id='0' cpus='0-239' memory='80' unit='GiB'/> <cell id='1' cpus='240-243' memory='16' unit='GiB'/> </numa> In order to support phi virtualization, libvirt needs to allow a numa cell definition without 'cpu' attribution. Such as: <numa> <cell id='0' cpus='0-239' memory='80' unit='GiB'/> <cell id='1' memory='16' unit='GiB'/> </numa> When a cell without 'cpu', qemu will allocate memory by default MCDRAM instead of DDR. Now here I'd like to discuss these questions: 1. This feature is only for Phi at present, but we will check Phi platform for CPU-less NUMA node. The NUMA node without CPU indicates MCDRAM node. And for now MCDRAM is available only for PHI. However, there is no reason that any other platform couldn’t define CPU-less NUMA node using libvirt, so there is no reason to check if PHI is used or not. 2. Type of memory of CPU-less NUMA node will not be checked during machine creation/configuration step. There is no reliable way to distinguish if the node is MCDRAM or regular DDR. This step is not concerned with type of the memory, only with NUMA assignment. 3. Unlike traditional memory assign to a VM, MCDRAM do not support over commit If the memory of a virtual NUMA node is going to be explicitly bound to physical NUMA node then it shouldn’t exceed the size of its corresponding NUMA node, doesn’t matter if it is MCDRAM or DDR. 4. Make sure at least one numa must include a CPU info field be passed to qemu. At least one NUMA node must define list of CPUs assigned, or if none has it assigned by default assign all CPUs to virtual NUMA node 0. 
More info for Phi (Knights Landing key information and software resources):

• Intel® Xeon Phi™ Product Family Main Page<http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html>
• Intel® Xeon Phi™ Processor Press Kit<https://newsroom.intel.com/press-kits/intel-isc/>
• Intel® Xeon Phi™ Processor SKU Details<http://mark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server>
• Updated Intel® Xeon Phi™ Application Proof Points<https://software.intel.com/en-us/articles/intel-xeon-phi-processor-applications-performance-proof-points>
• Intel® Xeon Phi™ Public Performance Data<http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-competitive-performance.html>
• Developer Access Program (DAP)<http://dap.xeonphi.com/#platformspecs>
• Knights Landing – Public Disclosures<https://software.intel.com/en-us/articles/what-disclosures-has-intel-made-about-knights-landing>
• Intel® Modern Code<https://software.intel.com/en-us/modern-code>
• Intel® Xeon Phi™ Software Code Recipes<https://software.intel.com/en-us/xeon-phi/x200-processor>
• Intel® Xeon Phi™ SW Tools and Libraries<https://software.intel.com/en-us/xeon-phi/x200-processor>
• Machine Learning<https://software.intel.com/en-us/machine-learning>
• Intel® Xeon Phi™ Processor Software<https://software.intel.com/en-us/articles/intel-xeon-phi-processor-software>
• Intel® Software Tools Technical Webinar Series<http://software.intel.com/en-us/articles/intel-software-tools-technical-webinar-series>
• Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog<https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog>
• Intel® MIC Developer Zone<http://software.intel.com/mic-developer>
• Intel® Xeon Phi™ Processor Product Family Early Release Software and Tools<https://software.intel.com/en-us/xeon-phi-nda/tools> (request access to the site here<https://software.intel.com/en-us/form/intel-xeon-phi-nda-site-access-request>)
• Intel® Parallel Studio XE Tools and Libraries<https://software.intel.com/en-us/free_tools_and_libraries>

On Mon, Dec 05, 2016 at 04:12:22PM +0000, Feng, Shaohe wrote:
> Hi all:
> As we know, Intel® Xeon Phi targets high-performance computing and other
> parallel workloads. QEMU now supports Phi virtualization, so it is time
> for libvirt to support Phi as well.
Can you provide a pointer to the relevant QEMU changes?
> Unlike a traditional x86 server, Phi has a special NUMA node with
> Multi-Channel DRAM (MCDRAM) but without any CPUs.
> Currently libvirt requires a non-empty cpus attribute for every NUMA cell,
> such as:
>
>     <numa>
>       <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>       <cell id='1' cpus='240-243' memory='16' unit='GiB'/>
>     </numa>
> In order to support Phi virtualization, libvirt needs to allow a NUMA cell
> definition without the 'cpus' attribute.
> Such as:
>
>     <numa>
>       <cell id='0' cpus='0-239' memory='80' unit='GiB'/>
>       <cell id='1' memory='16' unit='GiB'/>
>     </numa>
> When a cell has no 'cpus', QEMU will allocate its memory from MCDRAM by
> default instead of DDR.
There are two separate concepts at play which your description here is mixing up. The first is the question of whether the guest NUMA node can be created with only RAM, only CPUs, or a mix of both. The second is the question of what kind of host RAM (MCDRAM vs DDR) is used as the backing store for the guest.

These are separate configuration items which don't need to be conflated in libvirt, i.e. we should be able to create a guest with a node containing only memory and back that by DDR on the host. Conversely, we should be able to create a guest with a node containing memory + CPUs and back that by MCDRAM on the host (even if that means the vCPUs will end up on a different host node from their RAM).

On the first point, there still appears to be some brokenness in either QEMU or Linux wrt configuration of virtual NUMA where either cpus or memory are absent from nodes. E.g. if I launch QEMU with

    -numa node,nodeid=0,cpus=0-3,mem=512
    -numa node,nodeid=1,mem=512
    -numa node,nodeid=2,cpus=4-7
    -numa node,nodeid=3,mem=512
    -numa node,nodeid=4,mem=512
    -numa node,nodeid=5,cpus=8-11
    -numa node,nodeid=6,mem=1024
    -numa node,nodeid=7,cpus=12-15,mem=1024

then the guest reports

    # numactl --hardware
    available: 6 nodes (0,3-7)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
    node 0 size: 487 MB
    node 0 free: 230 MB
    node 3 cpus: 12 13 14 15
    node 3 size: 1006 MB
    node 3 free: 764 MB
    node 4 cpus:
    node 4 size: 503 MB
    node 4 free: 498 MB
    node 5 cpus:
    node 5 size: 503 MB
    node 5 free: 499 MB
    node 6 cpus:
    node 6 size: 503 MB
    node 6 free: 498 MB
    node 7 cpus:
    node 7 size: 943 MB
    node 7 free: 939 MB

So it has pushed all the CPUs from nodes without RAM into the first node, and moved the CPUs from the 7th node into the 3rd node. So before considering MCDRAM / Phi, we need to fix this more basic NUMA topology setup.
> Here I'd like to discuss these questions:
> 1. This feature is only for Phi at present, but should we check for the
>    Phi platform when a CPU-less NUMA node is used? A NUMA node without
>    CPUs indicates an MCDRAM node.
We should not assume such semantics - it is a concept that is specific to particular Intel x86_64 CPUs. We need to consider that other architectures may have nodes without CPUs that are backed by normal DDR. IOW, we should be explicit about the presence of MCDRAM in the host.
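Being explicit might mean extending the host NUMA topology already reported in the libvirt capabilities XML. A sketch of what that could look like, where the `memtype` attribute on `<cell>` is purely hypothetical (it is not part of any existing libvirt schema) and the sizes are illustrative:

```xml
<topology>
  <cells num='2'>
    <cell id='0'>
      <memory unit='KiB'>83886080</memory>
    </cell>
    <!-- hypothetical explicit marker, rather than inferring MCDRAM
         from the absence of CPUs in the cell -->
    <cell id='1' memtype='mcdram'>
      <memory unit='KiB'>16777216</memory>
    </cell>
  </cells>
</topology>
```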
> And for now MCDRAM is available only on Phi. However, there is no reason
> that any other platform couldn't define a CPU-less NUMA node using
> libvirt, so there is no reason to check whether Phi is used or not.
> 2. The type of memory of a CPU-less NUMA node will not be checked during
>    the machine creation/configuration step. There is no reliable way to
>    distinguish whether the node is MCDRAM or regular DDR. This step is not
>    concerned with the type of memory, only with NUMA assignment.
If we can't distinguish MCDRAM from DDR, that's a problem for apps, given your next point about MCDRAM not supporting overcommit.
> 3. Unlike traditional memory assigned to a VM, MCDRAM does not support
>    overcommit. If the memory of a virtual NUMA node is going to be
>    explicitly bound to a physical NUMA node, then it shouldn't exceed the
>    size of its corresponding NUMA node, no matter whether it is MCDRAM or
>    DDR.
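(For reference, the explicit binding of a virtual NUMA node to a physical one described above is expressed in the domain XML via the existing <numatune> element; a rough sketch, with illustrative node numbers, pinning the CPU-less guest cell 1 to host node 1:)

```xml
<numatune>
  <memory mode='strict' nodeset='0-1'/>
  <memnode cellid='1' mode='strict' nodeset='1'/>
</numatune>
```

(With such a binding, per point 3 the 16 GiB of guest cell 1 must fit within the chosen host node.)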
It is valid to bind guests to NUMA nodes and still have memory overcommit, so we do need to know whether a host node is using MCDRAM or DDR, so apps can determine whether that node supports overcommit or not.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
participants (2):
- Daniel P. Berrange
- Feng, Shaohe