On Thu, Jan 09, 2020 at 05:18:02PM +0100, Michal Privoznik wrote:
Dear list,

QEMU gained support for configuring HMAT recently (see v4.2.0-415-g9b12dfa03a
and friends). HMAT stands for Heterogeneous Memory Attribute Table and defines
various attributes of NUMA nodes. The guest OS/applications can read this
information and fine-tune their optimizations. See [1] for more info (esp. the
links in the transcript).
QEMU defines a so-called initiator, which is an attribute of a NUMA node that,
if specified, points to the node that has the best performance when accessing
this node's memory. For instance:
-machine hmat=on \
-m 2G,slots=2,maxmem=4G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-smp 2,sockets=2,maxcpus=2 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1
This creates a machine with 2 NUMA nodes: node 0 has CPUs, while node 1 has
memory only and its initiator is node 0 (yes, HMAT allows you to create
CPU-less "NUMA" nodes). The initiator of node 0 is not specified, but since
the node has at least one CPU it is its own initiator (and has to be, per the
spec).
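As a side note on how a guest consumes this: a Linux guest with HMAT support
(kernel 5.2 or newer, if I remember correctly) exposes the initiator
relationship and the performance attributes under sysfs, roughly like this
(paths and listing shown only for illustration, not taken from a real run):

  # target node 1 names node 0 as its best-performing initiator
  $ ls /sys/devices/system/node/node1/access0/initiators/
  node0  read_bandwidth  read_latency  write_bandwidth  write_latency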
The initiator could be represented by an attribute of our
/domain/cpu/numa/cell element. For instance like this:
  <domain>
    <vcpu>2</vcpu>
    <cpu>
      <numa>
        <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
        <cell id='1' memory='1' unit='GiB' initiator='0'/>
      </numa>
    </cpu>
  </domain>
We've gained an 'initiator' attribute on the cell, and 'cpus' is
optional if 'initiator' is present.
Can we have the opposite - nodes with CPUs, but without local memory? e.g.

  <cell id='0' cpus='0,1' unit='GiB'/>
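FWIW, I believe QEMU itself accepts memory-less NUMA nodes on the command
line, something along these lines (just a sketch, node/socket ids invented):

  -numa node,nodeid=2 \
  -numa cpu,node-id=2,socket-id=2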
Then, QEMU allows us to control two other important memory attributes:
1) hmat-lb for Latency and Bandwidth
2) hmat-cache for cache attributes
For example:
-machine hmat=on \
-m 2G,slots=2,maxmem=4G \
-object memory-backend-ram,size=1G,id=m0 \
-object memory-backend-ram,size=1G,id=m1 \
-smp 2,sockets=2,maxcpus=2 \
-numa node,nodeid=0,memdev=m0 \
-numa node,nodeid=1,memdev=m1,initiator=0 \
-numa cpu,node-id=0,socket-id=0 \
-numa cpu,node-id=0,socket-id=1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=200M \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=100M \
-numa hmat-cache,node-id=0,size=10K,level=1,associativity=direct,policy=write-back,line=8 \
-numa hmat-cache,node-id=1,size=10K,level=1,associativity=direct,policy=write-back,line=8
This extends the previous example by defining some latencies and cache
attributes. Node 0 has an access latency of 5 ns and a bandwidth of 200 MB/s,
and node 1 has an access latency of 10 ns and a bandwidth of only 100 MB/s.
The level 1 memory cache on both nodes is 10 KiB, the cache line is 8 bytes
long, with a write-back policy and direct associativity (whatever that means).
This description doesn't match my understanding of the semantics
for these latency options. Your description here is talking about
latency of a single node at a time. I believe these configs
are talking about latency of the *link* between two nodes.
So

  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5

is a local node access latency, as src+dst nodes are the same, but

  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10

is a cross-node access latency for the link between node 0 and node 1.
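To make the cross-product explicit (purely my own illustration, with invented
numbers, and assuming both nodes contain CPUs so that both can act as
initiators), a complete description needs one hmat-lb entry per
initiator/target pair:

  -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=5 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
  -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
  -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=5

(In the example above node 1 has no CPUs, so it cannot be an initiator and
only the initiator=0 entries apply.)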
For better future extensibility I'd express these as separate elements, rather
than as attributes of the <cell/> element. For instance like this:
  <domain>
    <vcpu>2</vcpu>
    <cpu>
      <numa>
        <cell id='0' cpus='0,1' memory='1' unit='GiB'>
          <latencies>
            <latency type='access' value='5'/>
            <bandwidth type='access' unit='MiB' value='200'/>
          </latencies>
          <caches>
            <cache level='1' associativity='direct' policy='write-back'>
              <size unit='KiB' value='10'/>
              <line unit='B' value='8'/>
            </cache>
          </caches>
        </cell>
        <cell id='1' memory='1' unit='GiB' initiator='0'>
          <latencies>
            <latency type='access' value='10'/>
            <bandwidth type='access' unit='MiB' value='100'/>
          </latencies>
          <caches>
            <cache level='1' associativity='direct' policy='write-back'>
              <size unit='KiB' value='10'/>
              <line unit='B' value='8'/>
            </cache>
          </caches>
        </cell>
      </numa>
We shouldn't have <latencies> as a child of the <cell>, because
we need to describe the latencies for the cross-product of all
cells. Putting latency as a child of a cell means we would have
2 possible places to put the same information - either the source
or target node.
The <caches> info is ok as a child of <cell>, though I'd prefer
to cull the extra <caches> wrapper and make <cache> a direct
child - we can still allow <cache> to be listed multiple times
under <cell> without the extra element.
    </cpu>
  </domain>
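One possible shape for moving the latency/bandwidth data out of <cell> would
be a sibling element under <numa> that names both endpoints explicitly. This
is only a sketch of the idea; the element and attribute names here are made
up, not an agreed design:

  <numa>
    <cell id='0' cpus='0,1' memory='1' unit='GiB'/>
    <cell id='1' memory='1' unit='GiB' initiator='0'/>
    <interconnects>
      <latency initiator='0' target='0' type='access' value='5'/>
      <latency initiator='0' target='1' type='access' value='10'/>
      <bandwidth initiator='0' target='1' type='access' unit='MiB' value='100'/>
    </interconnects>
  </numa>

That way there is exactly one place to describe each initiator/target pair.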
Thing is, the @hierarchy argument accepts: memory (referring to the whole
memory), or first-level|second-level|third-level (referring to side caches for
each domain). I haven't figured out how to express the levels in the XML yet.
The @data-type argument accepts access|read|write (this is expressed by the
@type attribute of the <latency/> and <bandwidth/> elements). Latency and
bandwidth can be combined with each type: access-latency, read-latency,
write-latency, access-bandwidth, read-bandwidth, write-bandwidth. And these 6
can then be combined with the aforementioned @hierarchy, producing 24
combinations (if I read the qemu cmd line specs correctly [2]).
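For a concrete feel of the side-cache combinations, entries for a first-level
cache would look roughly like this (extrapolated from the qemu docs, values
invented):

  -numa hmat-lb,initiator=0,target=1,hierarchy=first-level,data-type=read-latency,latency=8 \
  -numa hmat-lb,initiator=0,target=1,hierarchy=first-level,data-type=write-bandwidth,bandwidth=150M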
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|