On Thu, Jul 21, 2011 at 08:20:25AM -0500, Adam Litke wrote:
> On 07/21/2011 06:01 AM, Daniel P. Berrange wrote:
> > On Thu, Jul 21, 2011 at 10:11:32AM +0800, Wen Congyang wrote:
> > >> @@ -5851,7 +5953,6 @@ static int qemuSetSchedulerParametersFlags(virDomainPtr dom,
> > >> virTypedParameterPtr param = &params[i];
> > >>
> > >> if (STREQ(param->field, "cpu_shares")) {
> > >> - int rc;
> > >> if (param->type != VIR_TYPED_PARAM_ULLONG) {
> > >> qemuReportError(VIR_ERR_INVALID_ARG, "%s",
> > >> _("invalid type for cpu_shares tunable, expected a 'ullong'"));
> > >> @@ -5870,19 +5971,47 @@ static int qemuSetSchedulerParametersFlags(virDomainPtr dom,
> > >> }
> > >>
> > >> if (flags & VIR_DOMAIN_AFFECT_CONFIG) {
> > >> - persistentDef = virDomainObjGetPersistentDef(driver->caps, vm);
> > >> - if (!persistentDef) {
> > >> - qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> > >> - _("can't get persistentDef"));
> > >> + vmdef->cputune.shares = params[i].value.ul;
> > >> + }
> > >> + } else if (STREQ(param->field, "cfs_period")) {
> >
> > [snip]
> >
> > >> + } else if (STREQ(param->field, "cfs_quota")) {
> >
> >
> > In the XML file we now have
> >
> > <cputune>
> >   <shares>1024</shares>
> >   <period>90000</period>
> >   <quota>0</quota>
> > </cputune>
> >
> >
> > But the schedinfo parameters are being named
> >
> >   cpu_shares: 1024
> >   cfs_period: 90000
> >   cfs_quota: 0
> >
> > These new tunables should be named 'cpu_period' and 'cpu_quota' too.
> You mean 'cfs_period' and 'cfs_quota', right?
No, we should *not* be using 'cfs_' in the public interface since
this is just an internal impl of the current Linux scheduler.
If Linux gains a different scheduler in the future which can still
provide period/quota caps, the naming 'cfs_period' and 'cfs_quota'
will be bad.

The XML already has this correct, by not using 'cfs_' in the XML
naming. The scheduler info parameters are wrong, since they have
got the 'cfs_' prefix.
> > >> + priv = vm->privateData;
> > >> + if (priv->nvcpupids == 0 || priv->vcpupids[0] == vm->pid) {
> > >> + /* We do not create sub dir for each vcpu */
> > >> + rc = qemuGetVcpuBWLive(cgroup, period, quota);
> > >> + if (rc < 0)
> > >> + goto cleanup;
> > >> +
> > >> + if (*quota > 0)
> > >> + *quota /= vm->def->vcpus;
> > >> + goto out;
> > >> + }
> > >> +
> > >> + /* get period and quota for vcpu0 */
> > >> + rc = virCgroupForVcpu(cgroup, 0, &cgroup_vcpu, 0);
> > >> + if (!cgroup_vcpu) {
> > >> + virReportSystemError(-rc,
> > >> + _("Unable to find vcpu cgroup for %s(vcpu: 0)"),
> > >> + vm->def->name);
> > >> + goto cleanup;
> > >> + }
> > >> +
> > >> + rc = qemuGetVcpuBWLive(cgroup_vcpu, period, quota);
> > >> + if (rc < 0)
> > >> + goto cleanup;
> >
> > This is also bad IMHO, giving different semantics for the operation
> > of the API, depending on whether the QEMU impl has VCPU threads or
> > not. Again this just says to me that we need this to apply at the VM
> > level not the VCPU level.
> IMO, this firms up the semantics of the libvirt API. As these patches
> are written, the libvirt semantics are: "Apply cpu capping to this
> domain. Period is the measurement interval. Quota is the amount of
> time each vcpu may run within the period." We are able to insulate the
> users from details of the underlying qemu -- whether it supports threads
> or not.
>
> Another reason a per-vcpu quota setting is more desirable than a global
> setting is seen when changing the number of vcpus assigned to a domain.
> Each time you change the vcpu count, you would have to change the quota
> number too. Imagine all of the complaints from users when they increase
> the vcpu count and find their domain actually performs worse.
I think it is possible to argue that last point either way really,
and we in fact already have that behaviour with the existing
cpu_shares parameter. Having different behaviour for these new
parameters vs cpu_shares is an even worse problem.

Since there is no clear cut answer to your last point either way,
this again suggests to me that we should have two sets of CPU
tunables here, one applying to the whole VM and the other to the
per-VCPU threads (if supported).
> > If we really do need finer per-VCPU controls, in addition to a per-VM
> > control, I think that should likely be done as a separate tunable,
> > or separate CPU
> >
> > eg, we could have 2 sets of tunables, one that applies to the VM and
> > the second set that adds further controls to the VCPUs.
> How would this behave when qemu doesn't support vcpu threads? We'd
> either have to throw an error, or just ignore the vcpu_ settings.
> Either way, you're exposing users to qemu internals. They would need to
> change their domain.xml files when moving from a multi-threaded qemu to
> a single-threaded qemu.
If the user requests a per-VCPU control and we can't support that,
we should return an error to the user, rather than silently
implementing a different type of control.

This doesn't imply that we're exposing users to QEMU internals.
We can simply tell them that we have been unable to support
the configuration that they requested.

This is good, because knowing that the config could not be
supported means they have the knowledge to be able to apply
a different policy.

With this current impl users have no idea we are secretly
applying different semantics and limiting the whole VM.
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|