On Fri, May 30, 2014 at 01:41:06PM +0200, Michal Privoznik wrote:
>On 30.05.2014 10:52, Daniel P. Berrange wrote:
>>On Thu, May 29, 2014 at 10:32:35AM +0200, Michal Privoznik wrote:
>>> /**
>>>+ * virNodeHugeTLB:
>>>+ * @conn: pointer to the hypervisor connection
>>>+ * @type: type
>>>+ * @params: pointer to memory parameter object
>>>+ *          (return value, allocated by the caller)
>>>+ * @nparams: pointer to number of memory parameters; input and output
>>>+ * @flags: extra flags; not used yet, so callers should always pass 0
>>>+ *
>>>+ * Get information about host's huge pages. On input, @nparams
>>>+ * gives the size of the @params array; on output, @nparams gives
>>>+ * how many slots were filled with parameter information, which
>>>+ * might be less but will not exceed the input value.
>>>+ *
>>>+ * As a special case, calling with @params as NULL and @nparams
>>>+ * as 0 on input will cause @nparams on output to contain the
>>>+ * number of parameters supported by the hypervisor. The caller
>>>+ * should then allocate @params array, i.e.
>>>+ * (sizeof(@virTypedParameter) * @nparams) bytes and call the API
>>>+ * again. See virDomainGetMemoryParameters() for an equivalent
>>>+ * usage example.
>>>+ *
>>>+ * Returns 0 in case of success, and -1 in case of failure.
>>>+ */
>>>+int
>>>+virNodeHugeTLB(virConnectPtr conn,
>>>+               int type,
>>>+               virTypedParameterPtr params,
>>>+               int *nparams,
>>>+               unsigned int flags)
>>
>>What is the 'type' parameter doing?
>Ah, it should be named numa_node rather than type. If type == -1, then
>overall statistics are returned (the number of {available,free} pages
>accumulated across all NUMA nodes); if type >= 0, info on the specific
>NUMA node is returned.
>>
>>I think in general this API needs a different design. I'd like to have
>>an API that can request info for all page sizes on all NUMA nodes in a
>>single call. I also think the static unchanging data should be part of
>>the cpu + NUMA info in the capabilities XML. So the API only reports
>>info which is changing - i.e. the available pages.
>The only problem is that the size of the huge page pool is not
>immutable. It is already possible for 2M huge pages to be allocated
>dynamically:
>
>  # echo 8 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
>
>and it may become possible for 1GB pages too in the future (what if the
>kernel learns how to do it?). In general, the only thing we can treat
>as unalterable for now is the default huge page size, and I wouldn't
>bet on that either.
Yes, you can in theory change the number of huge pages at an arbitrary
time, but realistically people mostly only do it immediately at boot.
With 1 GB pages it will be impossible to do it at any time except
immediately at boot. If you wait a little while, memory will be too
fragmented for you to be able to dynamically allocate more 1 GB pages.
The same is true of 2MB pages, to a lesser degree.
>>In the <topology> element we should report the number of pages of each
>>size that are present in that node. We shouldn't treat huge pages
>>separately from small pages in this respect.
>Well, if we do that you'll still need two API calls (and hence you'll
>lack atomicity): virConnectGetCapabilities() followed by
>virNodeGetFreePages(). On the other hand, virNodeGetFreePages() itself
>is not atomic either (from the system's POV - other processes can jump
>in and allocate huge pages while libvirtd is executing the API).
I'm not worried about atomicity wrt free pages - no matter what you do
you will always race, because there's inherently a gap between when you
check the free pages and when the VM boots. I'm more interested in the
fact that the capabilities XML shows you the overall NUMA topology of
the host, and the pages allocated per node are inherently part of that
model IMHO.
>>  <host>
>>    <cpu>
>>      <arch>x86_64</arch>
>>      <model>Westmere</model>
>>      <vendor>Intel</vendor>
>>      <topology sockets='1' cores='6' threads='2'/>
>>      <feature name='rdtscp'/>
>>      <feature name='pdpe1gb'/>
>>      <feature name='dca'/>
>>      <feature name='pdcm'/>
>>      <feature name='xtpr'/>
>>      <pages unit='KiB' size='4'/>
>>      <pages unit='MiB' size='2'/>
>>      <pages unit='GiB' size='1'/>
>>    </cpu>
>Right. This makes sense.
>>
>>  <topology>
>>    <cells num='2'>
>>      <cell id='0'>
>>        <memory unit='KiB'>3221225472</memory>
>>        <pages unit='KiB' size='4'>262144</pages>
>>        <pages unit='MiB' size='2'>0</pages>
>>        <pages unit='GiB' size='1'>2</pages>
>>        <cpus num='4'>
>>          <cpu id='0'/>
>>          <cpu id='2'/>
>>          <cpu id='4'/>
>>          <cpu id='6'/>
>>        </cpus>
>>      </cell>
>>      <cell id='1'>
>>        <memory unit='KiB'>3221225472</memory>
>>        <pages unit='KiB' size='4'>0</pages>
>>        <pages unit='MiB' size='2'>1536</pages>
>>        <pages unit='GiB' size='1'>2</pages>
>>        <cpus num='4'>
>>          <cpu id='1'/>
>>          <cpu id='3'/>
>>          <cpu id='5'/>
>>          <cpu id='7'/>
>>        </cpus>
>>      </cell>
>>    </cells>
>>  </topology>
>>
>Well, and this to some extent too. We can report the pool size, but
>since it may vary it's on the same level as the number of free pages.
>What about:
>
>  <pages unit='MiB' size='2' free='8' avail='8'/>
>
>That way we don't even need a new API (not saying we shouldn't
>implement one though - an API is still better than re-fetching the
>capabilities every single time).
I'm reluctant to include the free count in the capabilities because
that will encourage people to query the capabilities overly frequently
which will be pretty inefficient. This is why we only provide free
memory info there currently and have the separate API call.
If we report the used/free count in the capabilities we're forced into
refreshing this data frequently. If we only report the overall count
per node, even if there's the possibility to dynamically change that,
we have more potential to avoid refreshing it frequently.
>>So then an API call to request the available pages on all nodes
>>would look something like
>>
>>  virNodeGetFreePages(virConnectPtr conn,
>>                      unsigned int *pages,
>>                      unsigned int npages,
>>                      unsigned int startcell,
>>                      unsigned int cellcount,
>>                      unsigned long long *counts);
>>
>>In this API
>>
>>  @pages - array whose elements are the page sizes to request info for
>>  @npages - number of elements in @pages
>>  @startcell - ID of first NUMA cell to request data for
>>  @cellcount - number of cells to request data for
>>  @counts - array which is @npages * @cellcount in length
>I wanted to use virTypedParam to keep the API open to future additions
>(e.g. if somebody needs the count of surplus or reserved huge pages).
I'm not sure what you mean by 'reserved huge pages' - all non-4k page
counts we are reporting are "reserved pages".
The typed parameter APIs are really horrible to use from applications,
so I think there needs to be a strong compelling reason to use that
design. It makes sense with the various tunable/stats whose valid
keys vary across hypervisor and are frequently extended, but I'm not
really seeing a good reason for using it for memory page count
reporting.
>Currently, the API I'm proposing returns the tuple (PAGE_SIZE,
>PAGE_AVAILABLE, PAGE_FREE) repeated for each huge page size. For
>instance:
>
>  PAGE_SIZE=2048, PAGE_AVAILABLE=8, PAGE_FREE=8,
>  PAGE_SIZE=1048576, PAGE_AVAILABLE=4, PAGE_FREE=2
>
>If we want an atomic API, how about fitting in the NUMA node number?
>
>  NUMA_NODE=0, PAGE_SIZE=2048, PAGE_AVAILABLE=..., PAGE_SIZE=1048576, ...
>  NUMA_NODE=1, PAGE_SIZE=2048, ...
>
>So the API would then look like:
>
>  virNodeHugeTLB(virConnectPtr conn,
>                 virTypedParameterPtr params,
>                 int *nparams,
>                 unsigned int flags);
IMHO this results in a rather unpleasant API to use.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|