[libvirt] Designing XML for HMAT

Dear list,

QEMU gained support for configuring HMAT recently (see v4.2.0-415-g9b12dfa03a and friends). HMAT stands for Heterogeneous Memory Attribute Table and defines various attributes for NUMA nodes. A guest OS or application can read this information and fine-tune its optimizations. See [1] for more info (esp. links in the transcript).

QEMU defines a so-called initiator, which is an attribute of a NUMA node and, if specified, points to another node that has the best performance to this node. For instance:

  -machine hmat=on \
  -m 2G,slots=2,maxmem=4G \
  -object memory-backend-ram,size=1G,id=m0 \
  -object memory-backend-ram,size=1G,id=m1 \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1,initiator=0 \
  -smp 2,sockets=2,maxcpus=2 \
  -numa cpu,node-id=0,socket-id=0 \
  -numa cpu,node-id=0,socket-id=1

creates a machine with 2 NUMA nodes: node 0 has CPUs, and node 1 has memory only and its initiator is node 0 (yes, HMAT allows you to create CPU-less "NUMA" nodes). The initiator of node 0 is not specified, but since the node has at least one CPU it is initiator to itself (and has to be per specs).

This could be represented by an attribute to our /domain/cpu/numa/cell element. For instance like this:

  <domain>
    <vcpu>2</vcpu>
    <cpu>
      <numa>
        <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
        <cell id='1' memory='1' unit='GiB' initiator='0'/>
      </numa>
    </cpu>
  </domain>

Then, QEMU allows us to control two other important memory attributes:

1) hmat-lb for Latency and Bandwidth
2) hmat-cache for cache attributes

For example:

  -machine hmat=on \
  -m 2G,slots=2,maxmem=4G \
  -object memory-backend-ram,size=1G,id=m0 \
  -object memory-backend-ram,size=1G,id=m1 \
  -smp 2,sockets=2,maxcpus=2 \
  -numa node,nodeid=0,memdev=m0 \
  -numa node,nodeid=1,memdev=m1,initiator=0 \
  -numa cpu,node-id=0,socket-id=0 \
  -numa cpu,node-id=0,socket-id=1 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
  -numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
  -numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8

This extends the previous example by defining some latencies and cache attributes. Node 0 has an access latency of 5 ns and a bandwidth of 200 MB/s, and node 1 has an access latency of 10 ns and a bandwidth of only 100 MB/s. The level 1 memory cache on both nodes is 10 KB, and the cache line is 8 B long, with write-back policy and direct associativity (whatever that means).

For better future extensibility I'd express these as separate elements, rather than attributes to the <cell/> element. For instance like this:

  <domain>
    <vcpu>2</vcpu>
    <cpu>
      <numa>
        <cell id='0' cpus='0,1' memory='1' unit='GiB'>
          <latencies>
            <latency type='access' value='5'/>
            <bandwidth type='access' unit='MiB' value='200'/>
          </latencies>
          <caches>
            <cache level='1' associativity='direct' policy='write-back'>
              <size unit='KiB' value='10'/>
              <line unit='B' value='8'/>
            </cache>
          </caches>
        </cell>
        <cell id='1' memory='1' unit='GiB' initiator='0'>
          <latencies>
            <latency type='access' value='10'/>
            <bandwidth type='access' unit='MiB' value='100'/>
          </latencies>
          <caches>
            <cache level='1' associativity='direct' policy='write-back'>
              <size unit='KiB' value='10'/>
              <line unit='B' value='8'/>
            </cache>
          </caches>
        </cell>
      </numa>
    </cpu>
  </domain>

Thing is, the @hierarchy argument accepts: memory (referring to whole memory), or first-level|second-level|third-level (referring to side caches for each domain). I haven't figured out yet how to express the levels in XML.

The @data-type argument accepts access|read|write (this is expressed by the @type attribute of the <latency/> and <bandwidth/> elements). Latency and bandwidth can be combined with each type: access-latency, read-latency, write-latency, access-bandwidth, read-bandwidth, write-bandwidth. And these 6 can then be combined with the aforementioned @hierarchy, producing 24 combinations (if I read the qemu cmd line specs correctly [2]).

What are your thoughts?

Michal

1: https://bugzilla.redhat.com/show_bug.cgi?id=1786303
2: https://git.qemu.org/?p=qemu.git;a=blob;f=qemu-options.hx;h=d4b73ef60c1d4589...

On Thu, Jan 09, 2020 at 05:18:02PM +0100, Michal Privoznik wrote:
> Dear list,
>
> QEMU gained support for configuring HMAT recently (see v4.2.0-415-g9b12dfa03a and friends). HMAT stands for Heterogeneous Memory Attribute Table and defines various attributes for NUMA nodes. A guest OS or application can read this information and fine-tune its optimizations. See [1] for more info (esp. links in the transcript).
>
> QEMU defines a so-called initiator, which is an attribute of a NUMA node and, if specified, points to another node that has the best performance to this node.
>
> For instance:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1
>
> creates a machine with 2 NUMA nodes: node 0 has CPUs, and node 1 has memory only and its initiator is node 0 (yes, HMAT allows you to create CPU-less "NUMA" nodes). The initiator of node 0 is not specified, but since the node has at least one CPU it is initiator to itself (and has to be per specs).
>
> This could be represented by an attribute to our /domain/cpu/numa/cell element. For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
>         <cell id='1' memory='1' unit='GiB' initiator='0'/>
>       </numa>
>     </cpu>
>   </domain>
We've gained an 'initiator' attribute on the cell, and 'cpus' is optional if 'initiator' is present. Can we have the opposite, i.e. nodes with CPUs but without local memory? E.g.

  <cell id='0' cpus='0,1' unit='GiB'/>
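For illustration, here is a minimal Python sketch of how the proposed 'initiator' attribute could map onto the QEMU "-numa node" arguments shown above. It is not libvirt code: the helper name numa_node_args and the m<id> memdev naming are assumptions made for this example, and the XML is the proposal under discussion rather than an accepted schema.

```python
import xml.etree.ElementTree as ET

# The XML proposed in this thread (not an accepted libvirt schema).
PROPOSED_XML = """
<domain>
  <vcpu>2</vcpu>
  <cpu>
    <numa>
      <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
      <cell id='1' memory='1' unit='GiB' initiator='0'/>
    </numa>
  </cpu>
</domain>
"""


def numa_node_args(domain_xml):
    """Return '-numa node,...' arguments for each <cell> (hypothetical helper)."""
    root = ET.fromstring(domain_xml)
    args = []
    for cell in root.findall("./cpu/numa/cell"):
        node_id = cell.get("id")
        # Assumption: the memory backend of node N is named mN, as in the example.
        arg = f"node,nodeid={node_id},memdev=m{node_id}"
        initiator = cell.get("initiator")
        if initiator is not None:
            # Only emitted when the cell names an initiator explicitly.
            arg += f",initiator={initiator}"
        args += ["-numa", arg]
    return args


print(" ".join(numa_node_args(PROPOSED_XML)))
```

Only the cell that carries 'initiator' gets the extra option; node 0, which has CPUs, is left to be its own initiator as described in the quoted mail.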
> Then, QEMU allows us to control two other important memory attributes:
>
> 1) hmat-lb for Latency and Bandwidth
> 2) hmat-cache for cache attributes
>
> For example:
>
>   -machine hmat=on \
>   -m 2G,slots=2,maxmem=4G \
>   -object memory-backend-ram,size=1G,id=m0 \
>   -object memory-backend-ram,size=1G,id=m1 \
>   -smp 2,sockets=2,maxcpus=2 \
>   -numa node,nodeid=0,memdev=m0 \
>   -numa node,nodeid=1,memdev=m1,initiator=0 \
>   -numa cpu,node-id=0,socket-id=0 \
>   -numa cpu,node-id=0,socket-id=1 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
>   -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
>   -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
>   -numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
>   -numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8
>
> This extends the previous example by defining some latencies and cache attributes. Node 0 has an access latency of 5 ns and a bandwidth of 200 MB/s, and node 1 has an access latency of 10 ns and a bandwidth of only 100 MB/s. The level 1 memory cache on both nodes is 10 KB, and the cache line is 8 B long, with write-back policy and direct associativity (whatever that means).
This description doesn't match my understanding of the semantics for these latency options. Your description here is talking about latency of a single node at a time. I believe these configs are talking about latency of the *link* between two nodes. So

  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5

is a local node access latency as src+dst nodes are the same, but

  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10

is a cross-node access latency for the link between node 0 and node 1.
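To make the per-link reading concrete, here is a minimal Python sketch that keys each latency by its (initiator, target) pair rather than by a single node. The helper parse_hmat_lb is made up for this example and only splits the option string; it is not part of QEMU or libvirt.

```python
def parse_hmat_lb(option):
    """Split the value of one '-numa hmat-lb,...' option into a dict."""
    kind, *pairs = option.split(",")
    if kind != "hmat-lb":
        raise ValueError(f"not an hmat-lb option: {option!r}")
    return dict(pair.split("=", 1) for pair in pairs)


# The two access-latency options from the example command line.
options = [
    "hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5",
    "hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10",
]

# Latency belongs to the link (initiator, target), not to one node.
links = {}
for opt in options:
    fields = parse_hmat_lb(opt)
    key = (int(fields["initiator"]), int(fields["target"]))
    links[key] = int(fields["latency"])

for (initiator, target), latency in sorted(links.items()):
    kind = "local" if initiator == target else "cross-node"
    print(f"initiator {initiator} -> target {target}: {latency} ({kind})")
```

Keyed this way, the (0, 0) entry is the local access and the (0, 1) entry is the link between node 0 and node 1, so neither value is a property of a single cell.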
> For better future extensibility I'd express these as separate elements, rather than attributes to <cell/> element. For instance like this:
>
>   <domain>
>     <vcpu>2</vcpu>
>     <cpu>
>       <numa>
>         <cell id='0' cpus='0,1' memory='1' unit='GiB'>
>           <latencies>
>             <latency type='access' value='5'/>
>             <bandwidth type='access' unit='MiB' value='200'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>         <cell id='1' memory='1' unit='GiB' initiator='0'>
>           <latencies>
>             <latency type='access' value='10'/>
>             <bandwidth type='access' unit='MiB' value='100'/>
>           </latencies>
>           <caches>
>             <cache level='1' associativity='direct' policy='write-back'>
>               <size unit='KiB' value='10'/>
>               <line unit='B' value='8'/>
>             </cache>
>           </caches>
>         </cell>
>       </numa>
We shouldn't have <latencies> as a child of the <cell>, because we need to describe the latencies for the cross-product of all cells. Putting latency as a child of a cell means we would have 2 possible places to put the same information: either the source or the target node.

The <caches> info is OK as a child of <cell>, though I'd prefer to cull the extra <caches> wrapper and make <cache> a direct child. We can still allow <cache> to be listed multiple times under <cell> without the extra element.

>     </cpu>
>   </domain>
>
> Thing is, the @hierarchy argument accepts: memory (referring to whole memory), or first-level|second-level|third-level (referring to side caches for each domain). I haven't figured out yet how to express the levels in XML.
>
> The @data-type argument accepts access|read|write (this is expressed by the @type attribute of the <latency/> and <bandwidth/> elements). Latency and bandwidth can be combined with each type: access-latency, read-latency, write-latency, access-bandwidth, read-bandwidth, write-bandwidth. And these 6 can then be combined with the aforementioned @hierarchy, producing 24 combinations (if I read the qemu cmd line specs correctly [2]).
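The count in the quoted paragraph can be checked mechanically. Below is a small Python sketch that uses only the values named in this thread; it does not query QEMU, and whether QEMU accepts every pairing is not verified here.

```python
from itertools import product

# Values for @hierarchy and @data-type as listed in the quoted mail.
HIERARCHIES = ["memory", "first-level", "second-level", "third-level"]
ACCESS_TYPES = ["access", "read", "write"]
METRICS = ["latency", "bandwidth"]

# access-latency, read-latency, ..., write-bandwidth: 3 x 2 = 6 data types.
data_types = [f"{access}-{metric}" for metric, access in product(METRICS, ACCESS_TYPES)]

# Each data type can be paired with each hierarchy value: 4 x 6 = 24.
combinations = list(product(HIERARCHIES, data_types))

print(len(data_types), "data types,", len(combinations), "combinations")
for hierarchy, data_type in combinations:
    print(f"hierarchy={hierarchy},data-type={data_type}")
```

Note that this is the count for a single initiator/target pair; each such pair could in principle carry its own set of values.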
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (2)
- Daniel P. Berrangé
- Michal Privoznik