[libvirt] [RFC] Support cpu hotplug in libvirt.

Hi~

It seems that libvirt is not cpu hotplug aware. Please refer to the following problem.

1. At first, we have 2 cpus.

   # cat /cgroup/cpuset/cpuset.cpus
   0-1
   # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus
   0-1

2. And we have a vm1 with the following configuration.

   <cputune>
     <vcpupin vcpu='0' cpuset='1'/>
     <hypervisorpin cpuset='1'/>
   </cputune>

3. Offline cpu1.

   # echo 0 > /sys/devices/system/cpu/cpu1/online
   # cat /sys/devices/system/cpu/cpu1/online
   0
   # cat /cgroup/cpuset/cpuset.cpus
   0
   # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus
   0

4. Online cpu1.

   # echo 1 > /sys/devices/system/cpu/cpu1/online
   # cat /sys/devices/system/cpu/cpu1/online
   1
   # cat /cgroup/cpuset/cpuset.cpus
   0-1
   # cat /cgroup/cpuset/libvirt/cpuset.cpus
   0
   # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus
   0
   # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus
   0

Here, cgroup updated cpuset.cpus in the root directory, but not in the libvirt directory, nor in the qemu and lxc directories.

vm1 cannot be started again.

   # virsh start vm1
   error: Failed to start domain vm1
   error: Unable to set cpuset.cpus: Permission denied

And libvirtd gave the following error.

   2012-07-17 07:30:22.478+0000: 3118: error : qemuSetupCgroupVcpuPin:498 : Unable to set cpuset.cpus: Permission denied

I am trying to use a netlink socket with the NETLINK_KOBJECT_UEVENT protocol to listen for cpu hotplug events. But I ran into a small problem here.

virNetlinkEventServiceStart() only creates a single global server variable, and creates a NETLINK_ROUTE netlink socket for it.
So if I want to create a different netlink socket, such as NETLINK_KOBJECT_UEVENT, what should I do?

Shall we make the global server variable a global array? (It seems there is a lot of work to do.)

Thanks. :)

-- 
Best Regards,
Tang chen
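For reference, listening for cpu hotplug uevents over NETLINK_KOBJECT_UEVENT can be sketched outside libvirt as follows. This is a minimal Python sketch, not libvirt code: the function names (parse_uevent, cpu_hotplug_event, listen) are made up for illustration, the protocol number 15 is the value from <linux/netlink.h>, and it assumes the kernel emits online/offline uevents with SUBSYSTEM=cpu for /devices/system/cpu/cpuN.

```python
import socket

# Value of NETLINK_KOBJECT_UEVENT in <linux/netlink.h>; defined here because
# Python's socket module may not export this constant.
NETLINK_KOBJECT_UEVENT = 15
KERNEL_UEVENT_GROUP = 1  # multicast group carrying kernel-originated uevents


def parse_uevent(data):
    """Split a raw kernel uevent into a dict of its KEY=VALUE fields.

    The payload is NUL-separated: an "action@devpath" summary first,
    then KEY=VALUE entries. The summary has no "=" and is skipped.
    """
    fields = {}
    for part in data.split(b"\0"):
        key, sep, value = part.partition(b"=")
        if sep:
            fields[key.decode(errors="replace")] = value.decode(errors="replace")
    return fields


def cpu_hotplug_event(fields):
    """Return (cpu_number, is_online) for a cpu online/offline uevent, else None."""
    if fields.get("SUBSYSTEM") != "cpu":
        return None
    action = fields.get("ACTION")
    if action not in ("online", "offline"):
        return None
    name = fields.get("DEVPATH", "").rsplit("/", 1)[-1]  # e.g. "cpu1"
    if not name.startswith("cpu") or not name[3:].isdigit():
        return None
    return int(name[3:]), action == "online"


def listen():
    """Print cpu hotplug events until interrupted (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    # Address is (pid, group mask); pid 0 lets the kernel assign one.
    sock.bind((0, KERNEL_UEVENT_GROUP))
    while True:
        event = cpu_hotplug_event(parse_uevent(sock.recv(8192)))
        if event is not None:
            cpu, online = event
            print("cpu%d is now %s" % (cpu, "online" if online else "offline"))
```

Calling listen() on the host and then toggling /sys/devices/system/cpu/cpu1/online should print one line per transition. Inside libvirt, the equivalent would be a second netlink socket next to the existing NETLINK_ROUTE one, which is the open question in this mail.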

On 07/17/2012 07:33 PM, tangchen wrote:
> Hi~
> It seems that libvirt is not cpu hotplug aware.
Portions of libvirt are aware of host hotplug issues, but you are correct that there are still lingering bugs (I just found one today in nodeinfo.c).
> 4. Online cpu1.
>    # echo 1 > /sys/devices/system/cpu/cpu1/online
>    # cat /sys/devices/system/cpu/cpu1/online
>    1
>    # cat /cgroup/cpuset/cpuset.cpus
>    0-1
>    # cat /cgroup/cpuset/libvirt/cpuset.cpus
>    0
I think this is related to (if not the same as) this known kernel bug:

https://bugzilla.redhat.com/show_bug.cgi?id=714271

Basically, when the kernel suspends and then resumes, it is not properly restoring descendant cgroup information. Hot unplug of a host cpu is more or less a subset of suspending.
> I am trying to use a netlink socket with the NETLINK_KOBJECT_UEVENT protocol to listen for cpu hotplug events. But I ran into a small problem here.
Interesting approach. Is it also possible to use inotify?
> virNetlinkEventServiceStart() only creates a single global server variable, and creates a NETLINK_ROUTE netlink socket for it.
> So if I want to create a different netlink socket, such as NETLINK_KOBJECT_UEVENT, what should I do?
I'm not sure here, having not really coded much with netlink sockets myself. If you have a patch pending, then post it and we can discuss whether it can be improved.
> Shall we make the global server variable a global array? (It seems there is a lot of work to do.)
> Thanks. :)
-- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Hi~

On 07/18/2012 09:44 AM, Eric Blake wrote:
> On 07/17/2012 07:33 PM, tangchen wrote:
>> Hi~
>> It seems that libvirt is not cpu hotplug aware.
>
> Portions of libvirt are aware of host hotplug issues, but you are correct that there are still lingering bugs (I just found one today in nodeinfo.c).
Could you give me some more info about the hotplug support in libvirt? I tried to find it, but I didn't see any.
> I think this is related to (if not the same as) this known kernel bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=714271
>
> Basically, when the kernel suspends and then resumes, it is not properly restoring descendant cgroup information. Hot unplug of a host cpu is more or less a subset of suspending.
Yes, the cpu hot unplug function in the kernel is not usable now. But cgroup doesn't know what value a file used to hold. As a result, it cannot restore any value in the subdirectories.
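Since cgroup does not remember the old values, any restoring has to be done from user space once the cpu is back. A minimal sketch of that step, assuming the hierarchy from the original report (libvirt, libvirt/qemu and libvirt/lxc under the cpuset mount): give each group its parent's value, parent first, because the kernel rejects a cpuset.cpus value that is not a subset of the parent's. The function name and the group list are hypothetical, and this widens the groups to the parent's full set rather than recovering a previous, narrower value.

```python
import os

CPUSET_FILE = "cpuset.cpus"


def widen_to_parent(mount, groups):
    """Give each listed cgroup the same cpuset.cpus value as its parent.

    `groups` are paths relative to `mount` and must be listed parent first,
    because a child cpuset has to stay a subset of its parent.
    Returns the groups whose file was rewritten.
    """
    rewritten = []
    for group in groups:
        group_dir = os.path.normpath(os.path.join(mount, group))
        parent_dir = os.path.dirname(group_dir)

        with open(os.path.join(parent_dir, CPUSET_FILE)) as f:
            parent_cpus = f.read().strip()

        target = os.path.join(group_dir, CPUSET_FILE)
        with open(target) as f:
            current = f.read().strip()

        # Only write when the value differs, to avoid needless cgroup writes.
        if current != parent_cpus:
            with open(target, "w") as f:
                f.write(parent_cpus + "\n")
            rewritten.append(group)
    return rewritten
```

On the host from the report this would be called as widen_to_parent("/cgroup/cpuset", ["libvirt", "libvirt/qemu", "libvirt/lxc"]). Per-domain groups are deliberately left out of the list: their pinning comes from the domain XML and would have to be re-applied from there.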
>> I am trying to use a netlink socket with the NETLINK_KOBJECT_UEVENT protocol to listen for cpu hotplug events. But I ran into a small problem here.
>
> Interesting approach. Is it also possible to use inotify?
I have tried inotify. But it seems that inotify cannot watch special file systems like /proc, /cgroup or /sys. Netlink works fine.
>> virNetlinkEventServiceStart() only creates a single global server variable, and creates a NETLINK_ROUTE netlink socket for it.
>> So if I want to create a different netlink socket, such as NETLINK_KOBJECT_UEVENT, what should I do?
>
> I'm not sure here, having not really coded much with netlink sockets myself. If you have a patch pending, then post it and we can discuss whether it can be improved.
OK. :) I'll try, but I am still wondering if anybody else could give me some advice. Thanks. :)
>> Shall we make the global server variable a global array? (It seems there is a lot of work to do.)
>> Thanks. :)
-- Best Regards, Tang chen

On 07/18/2012 03:44 AM, Eric Blake wrote:
> On 07/17/2012 07:33 PM, tangchen wrote:
>> Hi~
>> It seems that libvirt is not cpu hotplug aware.
>
> Portions of libvirt are aware of host hotplug issues, but you are correct that there are still lingering bugs (I just found one today in nodeinfo.c).
>
>> 4. Online cpu1.
>>    # echo 1 > /sys/devices/system/cpu/cpu1/online
>>    # cat /sys/devices/system/cpu/cpu1/online
>>    1
>>    # cat /cgroup/cpuset/cpuset.cpus
>>    0-1
>>    # cat /cgroup/cpuset/libvirt/cpuset.cpus
>>    0
>
> I think this is related to (if not the same as) this known kernel bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=714271
>
> Basically, when the kernel suspends and then resumes, it is not properly restoring descendant cgroup information. Hot unplug of a host cpu is more or less a subset of suspending.
That really depends on the architecture. On s390, with its multi-level virtualization, it's common to hot unplug host CPUs (the host being a 1st level guest) during lower utilization periods. If these are part of a KVM guest's CPU set, then they are gone for good...
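Whether a pinned CPU has gone away can be checked before a guest is started. A minimal sketch, assuming the kernel's cpulist syntax (for example "0-1,4") for both the vcpupin cpuset and /sys/devices/system/cpu/online; the helper names are hypothetical, and libvirt's "^N" exclusion syntax is not handled.

```python
def parse_cpulist(text):
    """Parse kernel cpulist syntax such as "0-1,4" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, sep, last = chunk.partition("-")
        # A lone number "N" is treated as the range N-N.
        cpus.update(range(int(first), int(last if sep else first) + 1))
    return cpus


def offline_pins(pin_cpuset, online):
    """Return the CPUs named in a pinning that are not in the online list."""
    return parse_cpulist(pin_cpuset) - parse_cpulist(online)


def read_online_cpus(path="/sys/devices/system/cpu/online"):
    """Read the host's online CPU list as a cpulist string."""
    with open(path) as f:
        return f.read()
```

With the configuration from the original report, offline_pins("1", read_online_cpus()) would return a non-empty set while cpu1 is offline, so the caller could refuse to start the guest or fall back to another CPU instead of failing in the cgroup write.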
>> I am trying to use a netlink socket with the NETLINK_KOBJECT_UEVENT protocol to listen for cpu hotplug events. But I ran into a small problem here.
>
> Interesting approach. Is it also possible to use inotify?
>
>> virNetlinkEventServiceStart() only creates a single global server variable, and creates a NETLINK_ROUTE netlink socket for it.
>> So if I want to create a different netlink socket, such as NETLINK_KOBJECT_UEVENT, what should I do?
>
> I'm not sure here, having not really coded much with netlink sockets myself. If you have a patch pending, then post it and we can discuss whether it can be improved.
I was actually considering a workaround in libvirt for what I believe to be a kernel misbehavior; however, it is possible to deconfigure the cpuset controller in the host (i.e. removing it from /etc/cgconfig.conf). The CPU pinning will still work, only that it uses the "legacy" taskset mechanism. All other cgroup-related functionality continues to work as well.

One issue with monitoring the online state of cpus is that libvirtd might not be running all the time (crash, update) and thus can miss hotplug events.
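The "legacy" mechanism is the scheduler affinity mask, which taskset sets through sched_setaffinity(2). A minimal sketch using Python's os.sched_setaffinity, which wraps the same system call; pin_process is a hypothetical name, and a real caller would pin each vCPU thread by its thread ID rather than a whole process.

```python
import os


def pin_process(pid, cpus):
    """Pin a process (or thread ID) to the given CPUs and return the mask read back.

    pid 0 means the calling thread. This does not touch the cpuset cgroup.
    """
    os.sched_setaffinity(pid, cpus)
    return os.sched_getaffinity(pid)
```

For example, pin_process(0, {1}) restricts the caller to cpu1. If cpu1 is offline at that moment, the system call is expected to fail with EINVAL, so the caller still has to handle hot-unplugged CPUs, but no stale cgroup state is left behind.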
>> Shall we make the global server variable a global array? (It seems there is a lot of work to do.)
>> Thanks. :)
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- 
Kind Regards

Viktor Mihajlovski

IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Martin Jetter
Management: Dirk Wittkopp
Registered office: Böblingen
Court of registration: Amtsgericht Stuttgart, HRB 243294

Hi Viktor:

On 07/18/2012 05:06 PM, Viktor Mihajlovski wrote:
> I was actually considering a workaround in libvirt for what I believe to be a kernel misbehavior; however, it is possible to deconfigure the cpuset controller in the host (i.e. removing it from /etc/cgconfig.conf). The CPU pinning will still work, only that it uses the "legacy" taskset mechanism. All other cgroup-related functionality continues to work as well.
>
> One issue with monitoring the online state of cpus is that libvirtd might not be running all the time (crash, update) and thus can miss hotplug events.
I don't think removing the cpuset controller from cgroup is a good idea. Maybe other apps will need it.

So I am considering capturing the hotplug event (which is a uevent) with netlink, and changing the behavior of cgroup in libvirt. What do you think?

Thanks. :)

-- 
Best Regards,
Tang chen

On 07/18/2012 12:53 PM, tangchen wrote:
> Hi Viktor:
>
> On 07/18/2012 05:06 PM, Viktor Mihajlovski wrote:
>> I was actually considering a workaround in libvirt for what I believe to be a kernel misbehavior; however, it is possible to deconfigure the cpuset controller in the host (i.e. removing it from /etc/cgconfig.conf). The CPU pinning will still work, only that it uses the "legacy" taskset mechanism. All other cgroup-related functionality continues to work as well.
>>
>> One issue with monitoring the online state of cpus is that libvirtd might not be running all the time (crash, update) and thus can miss hotplug events.
>
> I don't think removing the cpuset controller from cgroup is a good idea. Maybe other apps will need it.
Maybe ... but a workaround in libvirt will not solve the other apps' issues ;-) I still would maintain this has to be fixed in the kernel.
> So I am considering capturing the hotplug event (which is a uevent) with netlink, and changing the behavior of cgroup in libvirt. What do you think?

Looking forward to seeing a patch...

> Thanks. :)
-- 
Kind Regards

Viktor Mihajlovski

On 07/18/2012 05:17 PM, Viktor Mihajlovski wrote:
> On 07/18/2012 12:53 PM, tangchen wrote:
>> Hi Viktor:
>>
>> On 07/18/2012 05:06 PM, Viktor Mihajlovski wrote:
>>> I was actually considering a workaround in libvirt for what I believe to be a kernel misbehavior; however, it is possible to deconfigure the cpuset controller in the host (i.e. removing it from /etc/cgconfig.conf). The CPU pinning will still work, only that it uses the "legacy" taskset mechanism. All other cgroup-related functionality continues to work as well.
>>>
>>> One issue with monitoring the online state of cpus is that libvirtd might not be running all the time (crash, update) and thus can miss hotplug events.
>>
>> I don't think removing the cpuset controller from cgroup is a good idea. Maybe other apps will need it.
>
> Maybe ... but a workaround in libvirt will not solve the other apps' issues ;-) I still would maintain this has to be fixed in the kernel.
Not too long ago, there was a long thread on the Linux kernel mailing list (see below) where this issue was discussed in detail. The conclusion (in short) was that, for suspend/resume, it's the kernel's responsibility to restore the cpusets to the state they were in before suspend. (Note that suspend/resume invokes CPU hotplug internally, and this is hidden from the user.)

However, in the regular case of CPU hotplug, the user himself initiates CPU hotplug. So it is expected that the user is aware of its consequences and takes appropriate action. In short, there is nothing to fix in the kernel for the case of regular CPU hotplug. It is expected behaviour.

[Note that technically the kernel can be "fixed" for this case as well. But this would break kernel semantics for CPU hotplug. Hence we have chosen not to do it. The suspend/resume case was inevitable (as long as it depends on CPU hotplug).]

The suspend/resume case has been fixed in the 3.6 merge window by this commit:

commit d35be8bab9b0ce44bed4b9453f86ebf64062721e
Author: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Date:   Thu May 24 19:46:26 2012 +0530

    CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume

Links to the earlier discussions:
--------------------------------
[1]. http://thread.gmane.org/gmane.linux.documentation/4805
[2]. http://thread.gmane.org/gmane.linux.kernel/1296339
[3]. http://article.gmane.org/gmane.linux.kernel/1298967
     http://thread.gmane.org/gmane.linux.kernel/1298967/focus=1300380
[4]. http://thread.gmane.org/gmane.linux.kernel/1302893
     (this is the version which went upstream in the 3.6 merge window)

Regards,
Srivatsa S. Bhat
participants (4): Eric Blake, Srivatsa S. Bhat, tangchen, Viktor Mihajlovski