Daniel P. Berrange wrote:
On Wed, Jul 22, 2009 at 09:25:02PM -0400, john cooper wrote:
> This patch allows passing of a "-mem-path <arg>"
> flag to qemu for support of huge page backed
> guests. A guest may request this option via
> specifying:
>
> <hugepage>on</hugepage>
>
> in its domain definition xml file.
This really opens a can of worms. While obviously this maps
very simply onto KVM's -mem-path argument, I can't help
thinking things are going to get much more advanced later.
For example, I don't think a boolean on/off is sufficient for
this, since Xen already has a 3rd option of 'best effort' it
uses by default where it'll try to allocate hugepages and
fall back to normal pages - in fact you can't tell Xen not
to use hugepages AFAIK.
I agree growing beyond a simple on/off switch is likely.
The patch originally had a "prealloc" option (since removed)
to accommodate passing of the existing "-mem-prealloc" flag.
That option defaults to enabled, which is desired in general
and therefore was dropped for the sake of patch simplicity.
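
For illustration, the mapping the patch performs amounts to roughly the
following. This is a Python sketch, not the libvirt C code itself; the
function name hugepage_args is hypothetical, and raising an error when no
mount point is known is an assumption about how a missing qemu.conf entry
might be reported.

```python
def hugepage_args(hugepage, mount_point):
    """Return the extra qemu arguments for a guest's <hugepage> setting.

    hugepage    -- text of the <hugepage> element ("on" or anything else)
    mount_point -- hugetlbfs mount point from qemu.conf, or None
    """
    if hugepage != "on":
        # Guest did not ask for huge page backing: add nothing.
        return []
    if not mount_point:
        # Assumption: a request without a known mount point is an error.
        raise ValueError("hugepage requested but no hugetlbfs mount point")
    # -mem-prealloc is not passed, since preallocation is the default.
    return ["-mem-path", mount_point]
```

With hugepage_mount = "/hugetlbfs" this yields "-mem-path /hugetlbfs" for a
guest that sets the option, and nothing otherwise.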
I'm also wondering whether we need
to be concerned about different hugepage sizes for guest
configs eg 2M vs 1 GB, vs a mix of both - who decides?
Not a consideration currently, but it does underscore the
argument for a more extensible option syntax.
KVM also seems to have ability to request that huge pages
are pre-allocated upfront, vs on demand, though I'm not
sure what happens to a VM if it doesn't pre-allocate and
it later can't be satisfied.
That was the motivation for "-mem-prealloc", prior to which
a VM would SIGSEGV terminate if a huge page fault couldn't
be satisfied during runtime. The (now default) preallocation
behavior guarantees the guest has its working set present at
startup but at the potential cost of overly pessimistic
memory allocation.
> The request
> for huge page backing will be attempted within
> libvirt if the host system has indicated a
> hugetlbfs mount point in qemu.conf, for example:
>
> hugepage_mount = "/hugetlbfs"
Seems like it would be simpler to just open /proc/mounts
and scan it to find whether/where hugetlbfs is mounted,
so it would 'just work' if the user had mounted it.
Relying solely on /proc/mounts seemed a bit too speculative,
which is where the qemu.conf option arose. But I can see
both being useful: first check whether the qemu.conf
mount point exists, otherwise attempt to glean the
information from /proc/mounts, and if neither succeeds
flag the error.
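
A minimal sketch of that combined lookup, in Python for brevity rather than
libvirt's C. The function names are hypothetical, and the /proc/mounts
parsing assumes the usual whitespace-separated "device mountpoint fstype
..." layout without handling escaped spaces in paths.

```python
import os


def parse_hugetlbfs_mount(mounts_text):
    """Return the first hugetlbfs mount point in /proc/mounts text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Field 2 is the mount point, field 3 the filesystem type.
        if len(fields) >= 3 and fields[2] == "hugetlbfs":
            return fields[1]
    return None


def find_hugetlbfs_mount(configured=None, mounts_path="/proc/mounts"):
    """Prefer the qemu.conf setting, then fall back to scanning mounts."""
    if configured and os.path.isdir(configured):
        return configured
    try:
        with open(mounts_path) as f:
            found = parse_hugetlbfs_mount(f.read())
    except OSError:
        found = None
    if found is None:
        # Neither source produced a usable mount point: flag the error.
        raise RuntimeError("no hugetlbfs mount point found")
    return found
```

Keeping the parser separate from the file access makes it easy to exercise
against a canned mount table.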
It looks like this argument is not available in upstream QEMU, only
in the KVM fork? Any idea why it hasn't been sent upstream,
and/or whether it will be soon?
I'd hazard it is due to the existing huge page support
being closely tied to kvm, with no motivation as
yet to reconcile this upstream.
I'm loath to add more KVM
specific options since we've been burnt every time we've done
this in the past with its semantics changing when merged to
QEMU :-(
Quite understandable. It is also why I was attempting
to be as generic (and simple) as possible here and not
excessively cast the existing kvm implementation specifics
into the exported libvirt option.
I agree that setting up hugetlbfs is out of scope for libvirt.
We should just probe to see whether it's available or not.
We ought to have some way of reporting available hugepages
though, both at a host level, and likely per NUMA node too.
Without this a mgmt app using libvirt has no clue whether they'll
be able to actually use hugepages successfully or not.
Agreed. Extracting this host information will be needed
once more comprehensive management of huge pages exists.
Still it would seem a case of best-effort enforcement
without some sort of additional coordination. The
process of gleaning the number of free huge pages from
the host and launching of the guest currently has an
inherent race.
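
To make the host-level check concrete, here is a rough Python sketch that
reads the huge page counters from /proc/meminfo text. The function names
are hypothetical and this is not proposed libvirt code. As noted above, the
answer is only a snapshot: another process can consume the free pages
between this check and the guest launch.

```python
def parse_hugepage_info(meminfo_text):
    """Extract HugePages_* counters and Hugepagesize (kB) from meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and (key.startswith("HugePages_") or key == "Hugepagesize"):
            # Values look like "      64" or "    2048 kB"; keep the number.
            info[key] = int(rest.split()[0])
    return info


def enough_free_hugepages(info, guest_kib):
    """Best-effort check: does free huge page memory cover guest_kib?"""
    free_kib = info.get("HugePages_Free", 0) * info.get("Hugepagesize", 0)
    return free_kib >= guest_kib
```

On a host, the input would come from reading /proc/meminfo; a per NUMA node
variant would need a different source for the counters.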
Thanks,
-john
--
john.cooper(a)redhat.com