On Thu, 2 Jul 2015 17:27:21 +0100
"Daniel P. Berrange" <berrange(a)redhat.com> wrote:
> On Thu, Jul 02, 2015 at 04:42:47PM +0200, Henning Schild wrote:
> > On Thu, 2 Jul 2015 15:18:46 +0100
> > "Daniel P. Berrange" <berrange(a)redhat.com> wrote:
>
> > On Thu, Jul 02, 2015 at 04:02:58PM +0200, Henning Schild wrote:
> > > > Hi,
> > > >
> > > > I am currently looking into realtime VMs using libvirt. My first
> > > > starting point was reserving a couple of cores using isolcpus
> > > > and later tuning the affinity to place my vcpus on the reserved
> > > > pcpus.
> > > >
> > > > My first observation was that libvirt ignores isolcpus. Affinity
> > > > masks of new qemus will default to all cpus and will not be
> > > > inherited from libvirtd. A comment in the code suggests that
> > > > this is done on purpose.
> >
> > > Ignore realtime + isolcpus for a minute. It is not unreasonable
> > > for the system admin to decide system services should be
> > > restricted to run on a certain subset of CPUs. If we let VMs
> > > inherit the CPU pinning of libvirtd, we'd be accidentally
> > > confining VMs to a subset of CPUs too. With the new cgroups
> > > layout, libvirtd lives in a separate cgroups tree /system.slice,
> > > while VMs live in /machine.slice. So for both these reasons, when
> > > starting VMs, we explicitly ignore any affinity libvirtd has and
> > > set the VMs' mask to allow any CPU.
Since I started making heavy use of realtime priorities on 100% busy
threads, I have been running into starvation problems.
I just found a stuck qemu that still had an affinity mask of all 'f's
and no high priority yet. It got unlucky and ended up in the scheduling
queue of one of my busy cores ... that qemu never came to life.
I do not remember the details of the last time we discussed the topic,
but the take-away was that libvirt itself does not do policy. The
policy (affinity and priority) comes from nova, but there should be no
window in which the qemu is already running while the policy is not yet
applied. That can starve and disturb realtime workloads.
To me it seems there is such a time-window. If there is, I need a way
to confine such new-born hypervisors to a cpuset; ideally they would
just inherit it from libvirtd ... isolcpus.
> > Sure, that was my first guess as well. Still I wanted to raise the
> > topic again from the realtime POV.
> > I am using a pretty recent libvirt from git but did not come across
> > the system.slice yet. Might be a matter of configuration/invocation
> > of libvirtd.
> Oh, I should mention that I'm referring to OSes that use systemd
> for their init system here, not legacy sysvinit.
> FWIW our cgroups layout is described here:
> http://libvirt.org/cgroups.html
The system.slice does not contain a libvirtd.service in my case, but my
libvirtd is running in a screen and was not started via systemd. Might
that be causing the problem?
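A quick way to check where a process actually landed is to parse
/proc/<pid>/cgroup (a sketch, helper names made up):

```python
def cgroup_paths(cgroup_text):
    """Map controller name -> cgroup path from /proc/<pid>/cgroup content."""
    paths = {}
    for line in cgroup_text.splitlines():
        # Each line has the format: hierarchy-ID:controller-list:path
        _, controllers, path = line.split(':', 2)
        for ctrl in (controllers.split(',') if controllers else ['']):
            paths[ctrl] = path
    return paths

def slice_of(pid):
    """Return the cpuset cgroup path of a process, or '' if none."""
    with open('/proc/%d/cgroup' % pid) as f:
        return cgroup_paths(f.read()).get('cpuset', '')
```

A libvirtd started from a screen would show up under the user session
hierarchy rather than /system.slice/libvirtd.service.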
>
> > > > After that I changed the code to use only the available cpus by
> > > > default. But taskset was still showing all 'f's on my qemus.
> > > > Then I traced my change down to sched_setaffinity, assuming that
> > > > some other mechanism might have reverted my hack, but it is
> > > > still in place.
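For what it's worth, that taskset check can be scripted; a sketch using
Python's wrappers around sched_setaffinity(2)/sched_getaffinity(2)
(the function name is illustrative):

```python
import os

def pin_and_verify(pid, cpus):
    """Pin pid to the given CPU set and read the mask back from the
    kernel; returns the effective set, so a caller can see whether
    something else reverted the pinning."""
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)
```

If the returned set differs from what was requested, some other
mechanism changed the mask after the fact.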
> >
> > > From the libvirt POV, we can't tell whether the admin set isolcpus
> > > because they want to reserve those CPUs only for VMs, or because
> > > they want to stop VMs using those CPUs by default. As such, libvirt
> > > does not try to interpret isolcpus at all; it leaves that up to a
> > > higher-level app to decide on as policy.
>
> > I know, you have to tell libvirt that the reservation is actually
> > for libvirt. My idea was to introduce a config option in libvirt
> > and maybe sanity-check it by looking at whether the pcpus are
> > actually reserved. Rik recently posted a patch to allow easy
> > programmatic checking of isolcpus via sysfs.
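A sketch of the sanity check I have in mind, assuming the sysfs file
ends up at /sys/devices/system/cpu/isolated (path and helper names are
assumptions on my side):

```python
ISOLATED_PATH = '/sys/devices/system/cpu/isolated'

def cpulist_to_set(cpulist):
    """Expand a cpulist string such as '2-4,6' into {2, 3, 4, 6}."""
    cpus = set()
    for part in cpulist.strip().split(','):
        if not part:
            continue
        lo, _, hi = part.partition('-')
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def pinning_is_isolated(requested_cpus, path=ISOLATED_PATH):
    """True only if every requested pcpu is in the kernel's isolcpus set."""
    try:
        with open(path) as f:
            isolated = cpulist_to_set(f.read())
    except IOError:
        return False  # no sysfs interface available -> cannot verify
    return set(requested_cpus) <= isolated
```

That would let libvirt (or a higher-level app) refuse a pinning that
claims reserved cores which the kernel never actually isolated.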
> In libvirt we try to have a general principle that libvirt will
> provide the mechanism but not implement usage policy. So if we
> follow a strict interpretation here, then applying a CPU mask
> based on isolcpus would be out of scope for libvirt, since we
> expose a sufficiently flexible mechanism to implement any
> desired policy at a higher level.
> > > In the case of OpenStack, /etc/nova/nova.conf allows a config
> > > setting 'vcpu_pin_set' to say which set of CPUs VMs should be
> > > allowed to run on, and nova will then update the libvirt XML when
> > > starting each guest.
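To illustrate that flow (the values and the vcpu count are made up),
the nova setting looks like:

```ini
# /etc/nova/nova.conf
[DEFAULT]
vcpu_pin_set = 4-15
```

and the resulting per-guest pinning ends up in the libvirt domain XML
along these lines:

```xml
<cputune>
  <vcpupin vcpu='0' cpuset='4'/>
  <vcpupin vcpu='1' cpuset='5'/>
</cputune>
```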
>
> > I see, but would it not still make sense to have that setting
> > centrally in libvirt? I am thinking about people not using nova
> > but virsh or virt-manager.
> virsh aims to be a completely plain passthrough where the user is
> in total control of their setup. To a large extent that is true
> of virt-manager too. So I'd tend to expect users of both those
> apps to manually configure the CPU affinity of their VMs as and
> when they use isolcpus.
> Where we'd put in policies around isolcpus would be in apps
> like OpenStack and RHEV/oVirt, which define specific usage policies
> for the system as a whole.
>
> Regards,
> Daniel