On Wed, Jun 08, 2011 at 02:20:23PM -0500, Adam Litke wrote:
Hi all. In this post I would like to bring up three tightly related
issues: 1. unwanted behavior when using CFS hard limits with libvirt,
2. scaling cputune.shares according to the number of vcpus, and 3. an
API proposal for CFS hard limits support.
=== 1 ===
Mark Peloquin (on cc:) has been looking at implementing CFS hard limit
support on top of the existing libvirt cgroups implementation and he has
run into some unwanted behavior when enabling quotas that seems to be
affected by the cgroup hierarchy being used by libvirt.
Here are Mark's words on the subject (posted by me while Mark joins this
mailing list):
------------------
I've conducted a number of measurements using CFS.
The system config is a 2 socket Nehalem system with 64GB ram. Installed
is RHEL6.1-snap4. The guest VMs being used have RHEL5.5 - 32bit. I've
replaced the kernel with 2.6.39-rc6+ with patches from
Paul-V6-upstream-breakout.tar.bz2 for CFS bandwidth. The test config
uses 5 VMs of various vcpu and memory sizes. Being used are 2 VMs with 2
vcpus and 4GB of memory, 1 VM with 4vcpus/8GB, another VM with
8vcpus/16GB and finally a VM with 16vcpus/16GB.
Thus far the tests have been limited to cpu intensive workloads. Each VM
runs a single instance of the workload. The workload is configured to
create one thread for each vcpu in the VM. The workload is then capable
of completely saturating each vcpu in each VM.
CFS was tested using two different topologies.
First vcpu cgroups were created under each VM created by libvirt. The
vcpu threads from the VM's cgroup/tasks were moved to the tasks list of
each vcpu cgroup, one thread to each vcpu cgroup. This tree structure
permits setting CFS quota and period per vcpu. Default values for
cpu.shares (1024), quota (-1) and period (500000us) were used in each VM
cgroup and inherited by the vcpu cgroups. With these settings the workload
generated system cpu utilization (measured in the host) of >99% guest,
<0.1% idle, 0.14% user and 0.38% system.
Second, using the same topology, the CFS quota in each vcpu's cgroup was
set to 250000us allowing each vcpu to consume 50% of a cpu. The cpu
workload was run again. This time the total system cpu utilization was
measured at 75% guest, ~24% idle, 0.15% user and 0.40% system.
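The arithmetic behind that cap is worth spelling out; a minimal sketch
(the helper name is mine, not part of the CFS interface):

```python
# Effective CPU cap implied by a CFS bandwidth setting: a group may run
# for cfs_quota_us out of every cfs_period_us of wall-clock time.
def cfs_cap(quota_us, period_us):
    """Return the fraction of one CPU a group may consume; -1 quota = unlimited."""
    if quota_us < 0:
        return None  # unlimited, the cgroup default
    return quota_us / period_us

# The test above: a 250000us quota against the default 500000us period
print(cfs_cap(250000, 500000))  # 0.5, i.e. 50% of one cpu
```

With five VMs totalling 32 vcpus each capped at 50%, the fully loaded
guests should be able to consume at most half the host's cpu time, which
is consistent with the ~75% guest / ~24% idle numbers seen here on a
partially committed host.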
The topology was changed such that a cgroup for each vcpu was created in
/cgroup/cpu.
The first test used the default/inherited shares and CFS quota and
period. The measured system cpu utilization was >99% guest, ~0.5% idle,
0.13% user and 0.38% system, similar to the default settings using vcpu
cgroups under libvirt.
The next test, as before the topology change, set the vcpu quota
values to 250000us, or 50% of a cpu. In this case the measured system cpu
utilization was ~92% guest, ~7.5% idle, 0.15% user and 0.38% system.
We can see that moving the vcpu cgroups out from under libvirt/qemu
makes a big difference in idle cpu time.
Does this suggest a possible problem with libvirt?
------------------
I can't really understand from your description what the different
setups are. You're talking about libvirt vcpu cgroups, but nothing
in libvirt does vcpu based cgroups, our cgroup granularity is always
per-VM.
=== 2 ===
Something else we are seeing is that libvirt's default setting for
cputune.shares is 1024 for any domain (regardless of how many vcpus are
configured). This ends up hindering performance of really large VMs
(with lots of vcpus) as compared to smaller ones since all domains are
given equal share. Would folks consider changing the default for
'shares' to be a quantity scaled by the number of vcpus such that bigger
domains get to use proportionally more host cpu resource?
Well, that's just the kernel default setting actually. The intent
of the default cgroups configuration for a VM is that it should
be identical to the configuration if the VM was *not* in any
cgroups. So I think that gives some justification for setting
the cpu shares relative to the # of vCPUs by default, otherwise
we have a regression vs not using cgroups.
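A sketch of the proposed scaling (1024 is the kernel's default cpu.shares
value; the helper name is mine):

```python
DEFAULT_SHARES = 1024  # kernel default for cpu.shares in the cpu controller

def scaled_shares(nvcpus):
    """Scale cpu.shares by vcpu count, so an N-vcpu domain competes for
    host cpu time like N un-cgrouped vcpu threads would."""
    return DEFAULT_SHARES * nvcpus

# A 16-vcpu guest vs a 2-vcpu guest under the proposed default
print(scaled_shares(16), scaled_shares(2))  # 16384 2048
```

Since shares are relative weights, this keeps the per-vcpu weight equal
across domains of different sizes, matching the no-cgroups behavior.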
=== 3 ===
Besides the above issues, I would like to open a discussion on what the
libvirt API for enabling cpu hardlimits should look like. Here is what
I was thinking:
Two additional scheduler parameters (based on the names given in the
cgroup fs) will be recognized for qemu domains: 'cfs_period' and
'cfs_quota'. These can use the existing
virDomain[Get|Set]SchedulerParameters() API. The Domain XML schema
would be updated to permit the following:
--- snip ---
<cputune>
...
<cfs_period>1000000</cfs_period>
<cfs_quota>500000</cfs_quota>
</cputune>
--- snip ---
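Driving this through the existing API might look like the following
sketch. The parameter names follow the proposal above, and the helper is
hypothetical; the period bounds are the range the kernel accepts for
cpu.cfs_period_us (1ms to 1s):

```python
# Valid CFS period range enforced by the kernel (1ms .. 1s)
MIN_PERIOD_US, MAX_PERIOD_US = 1000, 1000000

def cfs_sched_params(period_us, quota_us):
    """Assemble the dict a virDomainSetSchedulerParameters-style call
    would take under this proposal. A quota of -1 means 'no limit'."""
    if not MIN_PERIOD_US <= period_us <= MAX_PERIOD_US:
        raise ValueError("cfs_period out of range")
    return {"cfs_period": period_us, "cfs_quota": quota_us}

# The XML example above: 500000us of runtime per 1000000us period
print(cfs_sched_params(1000000, 500000))
```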
I don't think 'cfs_' should be in the names here. These absolute
limits on CPU time could easily be applicable to non-CFS schedulers
or non-Linux hypervisors.
To actuate these configuration settings, we simply apply the values to
the appropriate cgroup(s) for the domain. We would prefer that each
vcpu be in its own cgroup to ensure equal and fair scheduling across all
vcpus running on the system. (We will need to resolve the issues
described by Mark in order to figure out where to hang these cgroups).
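Actuation amounts to writing the two control files in the domain's cpu
cgroup; a hedged sketch, where the /cgroup/cpu/libvirt/qemu/<domain>
path layout is an assumption based on the hierarchy discussed above, not
libvirt's actual API (the demo runs against a scratch directory rather
than a live cgroup mount):

```python
import os
import tempfile

def apply_cfs_limits(cgroup_dir, period_us, quota_us):
    """Write CFS bandwidth settings into an existing cpu-controller cgroup.
    On the setup above, cgroup_dir would be something like
    /cgroup/cpu/libvirt/qemu/<domain> (an assumed layout)."""
    with open(os.path.join(cgroup_dir, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(cgroup_dir, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))

# Demo: cap at 50% of a cpu, as in Mark's second test
d = tempfile.mkdtemp()
apply_cfs_limits(d, 500000, 250000)
print(open(os.path.join(d, "cpu.cfs_quota_us")).read())  # 250000
```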
The reason for putting VMs in cgroups is that, because KVM is multithreaded,
using Cgroups is the only way to control settings of the VM as a whole. If
you just want to control individual VCPU settings, then that can be done
without cgroups, just by setting the process's scheduler priority via the normal
APIs. Creating cgroups at the granularity of individual vCPUs is somewhat
troublesome, because if the administrator has mounted other cgroups
controllers at the same location as the 'cpu' controller, then putting
each VCPU in a separate cgroup will negatively impact other aspects of
the VM. Also KVM has a number of other non-VCPU threads which consume a
non-trivial amount of CPU time, which often come & go over time. So IMHO
the smallest cgroup granularity should remain per-VM.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|