On 06/09/2011 03:20 AM, Adam Litke wrote:
Hi all. In this post I would like to bring up three tightly related issues:
1. unwanted behavior when using CFS hard limits with libvirt, 2. scaling
cputune shares according to the number of vcpus, and 3. an API proposal for
CFS hard limits support.
=== 1 ===
Mark Peloquin (on cc:) has been looking at implementing CFS hard limit
support on top of the existing libvirt cgroups implementation and he has
run into some unwanted behavior when enabling quotas that seems to be
affected by the cgroup hierarchy being used by libvirt.
Here are Mark's words on the subject (posted by me while Mark joins this
mailing list):
------------------
I've conducted a number of measurements using CFS.
The system config is a 2-socket Nehalem system with 64GB of RAM, running
RHEL6.1-snap4. The guest VMs run RHEL5.5 (32-bit). I've replaced the kernel
with 2.6.39-rc6+ plus the CFS bandwidth patches from
Paul-V6-upstream-breakout.tar.bz2. The test config uses 5 VMs of various vcpu
and memory sizes: 2 VMs with 2 vcpus and 4GB of memory each, 1 VM with
4 vcpus/8GB, another VM with 8 vcpus/16GB, and finally a VM with 16 vcpus/16GB.
Thus far the tests have been limited to cpu-intensive workloads. Each VM runs
a single instance of the workload, which is configured to create one thread
for each vcpu in the VM. The workload is therefore capable of completely
saturating each vcpu in each VM.
CFS was tested using two different topologies.
First, vcpu cgroups were created under each VM cgroup created by libvirt. The
vcpu threads listed in the VM cgroup's tasks file were moved to the tasks list
of the vcpu cgroups, one thread per vcpu cgroup. This tree structure permits
setting the CFS quota and period per vcpu. The default values for cpu.shares
(1024), quota (-1) and period (500000us) were used in each VM cgroup and
inherited by the vcpu cgroups. With these settings the workload generated
system cpu utilization (measured on the host) of >99% guest, 0.1% idle,
0.14% user and 0.38% system.
Second, using the same topology, the CFS quota in each vcpu's cgroup was set
to 250000us, allowing each vcpu to consume 50% of a cpu. The cpu workload was
run again. This time the total system cpu utilization was measured at 75%
guest, ~24% idle, 0.15% user and 0.40% system.
The topology was changed such that a cgroup for each vcpu was created in
/cgroup/cpu.
The first test used the default/inherited shares and CFS quota and period. The
measured system cpu utilization was >99% guest, ~0.5% idle, 0.13% user and
0.38% system, similar to the results with default settings and vcpu cgroups
under libvirt.
The next test, like before the topology change, set the vcpu quota
values to 250000us or 50% of a cpu. In this case the measured system cpu
utilization was ~92% guest, ~7.5% idle, 0.15% user and 0.38% system.
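In cgroupfs terms, the two 50%-quota configurations above correspond roughly
to the following settings (the mount point, paths and per-vcpu cgroup names
are illustrative, not necessarily the exact ones used):
--- snip ---
# topology 1: per-vcpu cgroups nested under the VM cgroup created by libvirt
/cgroup/cpu/libvirt/qemu/<vm>/cpu.shares              = 1024    (default)
/cgroup/cpu/libvirt/qemu/<vm>/vcpu0/cpu.cfs_period_us = 500000
/cgroup/cpu/libvirt/qemu/<vm>/vcpu0/cpu.cfs_quota_us  = 250000  (50% of one cpu)

# topology 2: per-vcpu cgroups created directly in /cgroup/cpu
/cgroup/cpu/<vm>-vcpu0/cpu.cfs_period_us = 500000
/cgroup/cpu/<vm>-vcpu0/cpu.cfs_quota_us  = 250000
--- snip ---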
We can see that moving the vcpu cgroups out from under libvirt/qemu makes a
big difference in idle cpu time.
Does this suggest a possible problem with libvirt?
I do not think it is a problem in libvirt.
Libvirt only uses the interface provided by the cgroup subsystem, so it may be
a problem in cgroups or in CFS bandwidth.
------------------
Has anyone else seen this type of behavior when using cgroups with CFS hard
limits? We are working with the kernel community to see if there might be a
bug in cgroups itself.
=== 2 ===
Something else we are seeing is that libvirt's default setting for cputune
shares is 1024 for any domain, regardless of how many vcpus are configured.
This ends up hindering the performance of really large VMs (with lots of
vcpus) compared to smaller ones, since all domains are given an equal share.
Would folks consider changing the default for 'shares' to a quantity scaled by
the number of vcpus, so that bigger domains get to use proportionally more
host cpu resources?
The value 1024 is the kernel's default, not libvirt's.
If you want to change cputune shares, you should edit the domain's XML config
file.
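For illustration only (the vcpu count and shares value below are made up, and
this scaling is done by hand, not by libvirt), a 4-vcpu domain could be given
four times the default weight like this:
--- snip ---
<vcpu>4</vcpu>
<cputune>
  <shares>4096</shares>
</cputune>
--- snip ---
Since cpu.shares is a relative weight, 4096 simply means this domain gets four
times the cpu weight of a domain left at the default 1024 when the host is
under contention.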
=== 3 ===
Besides the above issues, I would like to open a discussion on what the
libvirt API for enabling cpu hard limits should look like. Here is what I was
thinking:
I need this feature as soon as the CFS bandwidth patchset is merged into the
upstream kernel, so I have been working on this recently.
Two additional scheduler parameters (based on the names given in the
cgroup fs) will be recognized for qemu domains: 'cfs_period' and
'cfs_quota'. These can use the existing
virDomain[Get|Set]SchedulerParameters() API. The Domain XML schema
would be updated to permit the following:
--- snip ---
<cputune>
  ...
  <cfs_period>1000000</cfs_period>
  <cfs_quota>500000</cfs_quota>
</cputune>
--- snip ---
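As a rough sketch of how a client might use this (sketch only: the parameter
names are just the ones proposed above, not something libvirt recognizes
today, and the connection URI and domain name are made up), setting the same
values at runtime via virDomainSetSchedulerParameters() could look like:
--- snip ---
/* Sketch: set the proposed "cfs_period"/"cfs_quota" scheduler parameters. */
#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (conn == NULL)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "guest1"); /* hypothetical domain */
    if (dom == NULL) {
        virConnectClose(conn);
        return 1;
    }

    virSchedParameter params[2];
    memset(params, 0, sizeof(params));

    strncpy(params[0].field, "cfs_period", VIR_DOMAIN_SCHED_FIELD_LENGTH - 1);
    params[0].type = VIR_DOMAIN_SCHED_FIELD_ULLONG;
    params[0].value.ul = 1000000;               /* period in usec */

    strncpy(params[1].field, "cfs_quota", VIR_DOMAIN_SCHED_FIELD_LENGTH - 1);
    params[1].type = VIR_DOMAIN_SCHED_FIELD_LLONG;
    params[1].value.l = 500000;                 /* quota in usec; -1 would mean unlimited */

    if (virDomainSetSchedulerParameters(dom, params, 2) < 0)
        fprintf(stderr, "setting CFS parameters failed\n");

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}
--- snip ---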
To actuate these configuration settings, we simply apply the values to
the appropriate cgroup(s) for the domain. We would prefer that each
vcpu be in its own cgroup to ensure equal and fair scheduling across all
vcpus running on the system. (We will need to resolve the issues
described by Mark in order to figure out where to hang these cgroups).
Each vcpu in its own cgroup?
Do you mean each vcpu has a separate thread?
AFAIK, qemu does not create a thread for each vcpu.
Thanks.
Wen Congyang
Thanks for sticking with me through this long email. I greatly
appreciate your thoughts and comments on these topics.