[libvirt] [PATCH v2 0/4] Add cpu hotplug support to libvirt.

It seems that libvirt is not cpu hotplug aware. Please refer to the following problem. 1. At first, we have 2 cpus. # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-1 2. And we have a vm1 with following configuration. <cputune> <vcpupin vcpu='0' cpuset='1'/> <emulatorpin cpuset='1'/> </cputune> 3. Offline cpu1. # echo 0 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 0 # cat /cgroup/cpuset/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0 4. Online cpu1. # echo 1 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 1 # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0 Here,cgroup updated cpuset.cpus,but not for libvirt directory,and also qemu and lxc directory. vm1 cannot be started again. # virsh start vm1 error: Failed to start domain vm1 error: Unable to set cpuset.cpus: Permission denied And libvird gave the following errors. 2012-07-17 07:30:22.478+0000: 3118: error : qemuSetupCgroupVcpuPin:498 : Unable to set cpuset.cpus: Permission denied These patches resolves this problem by listening on the netlink for cpu hotplug event. When the netlink service gets the cpu hotplug event, it will attract the cpuid in the message, and add it into cpuset.cpus in: /cgroup/cpuset/libvirt /cgroup/cpuset/libvirt/qemu /cgroup/cpuset/libvirt/lxc change log of v2: [PATCH v2 1/4]: * Move all the handlers from hotplug.c to modules who want to listen to cpu hotplug events. [PATCH v2 2/4]: * Set /libvirt/cpuset.cpus to the lastest value when libvirtd is being initialized. [PATCH v2 3/4]: * Set /libvirt/qemu/cpuset.cpus to the lastest value when qemu driver is being initialized. [PATCH v2 4/4]: * Set /libvirt/lxc/cpuset.cpus to the lastest value when lxc driver is being initialized. Tang Chen (4): Add helpers to support cpu hotplug in libvirt. Register cpu hotplug netlink handler for libvirtd. Register cpu hotplug netlink handler for qemu driver. Register cpu hotplug netlink handler for lxc driver. daemon/libvirtd.c | 62 +++++++++++ include/libvirt/virterror.h | 2 + src/Makefile.am | 1 + src/libvirt_private.syms | 7 ++ src/lxc/lxc_driver.c | 67 +++++++++++- src/qemu/qemu_driver.c | 64 ++++++++++++ src/util/cgroup.c | 10 +- src/util/cgroup.h | 7 ++ src/util/hotplug.c | 240 +++++++++++++++++++++++++++++++++++++++++++ src/util/hotplug.h | 44 ++++++++ src/util/virterror.c | 3 +- 11 files changed, 499 insertions(+), 8 deletions(-) create mode 100644 src/util/hotplug.c create mode 100644 src/util/hotplug.h -- 1.7.10.1

This patch adds some helpers to support cpu hotplug in libvirt. virHotplugGetCpuidFromMsg(): Get cpuid from cpu hotplug netlink message. The message is of the following format: {online|offline}@/devices/system/cpu/cpuxx (xx is cpuid) virHotplugUpdateCgroupCpuset(): Update the cpuset.cpus in the specified cgroup when a cpu is hotpluged. virHotplugSetCpusetToLastest(): Copy the cpuset from root to the specified cgroup. This function could be used when there are cpu houplug events occurred when libvirtd is not running. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> --- include/libvirt/virterror.h | 2 + src/Makefile.am | 1 + src/libvirt_private.syms | 6 ++ src/util/cgroup.c | 4 +- src/util/cgroup.h | 3 + src/util/hotplug.c | 240 +++++++++++++++++++++++++++++++++++++++++++ src/util/hotplug.h | 44 ++++++++ src/util/virterror.c | 3 +- 8 files changed, 300 insertions(+), 3 deletions(-) create mode 100644 src/util/hotplug.c create mode 100644 src/util/hotplug.h diff --git a/include/libvirt/virterror.h b/include/libvirt/virterror.h index 5140c38..aabda04 100644 --- a/include/libvirt/virterror.h +++ b/include/libvirt/virterror.h @@ -114,6 +114,8 @@ typedef enum { VIR_FROM_SSH = 50, /* Error from libssh2 connection transport */ + VIR_FROM_HOTPLUG = 51, /* Error from Hotplug driver */ + # ifdef VIR_ENUM_SENTINELS VIR_ERR_DOMAIN_LAST # endif diff --git a/src/Makefile.am b/src/Makefile.am index 95e1bea..c65ee37 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -60,6 +60,7 @@ UTIL_SOURCES = \ util/event.c util/event.h \ util/event_poll.c util/event_poll.h \ util/hooks.c util/hooks.h \ + util/hotplug.c util/hotplug.h \ util/iptables.c util/iptables.h \ util/ebtables.c util/ebtables.h \ util/dnsmasq.c util/dnsmasq.h \ diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 6f14763..2ff7e0e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -94,6 +94,7 @@ virCgroupKillPainfully; virCgroupKillRecursive; virCgroupMounted; virCgroupMoveTask; +virCgroupNew; virCgroupPathOfController; virCgroupRemove; virCgroupSetBlkioDeviceWeight; @@ -643,6 +644,11 @@ virHookInitialize; virHookPresent; +# hotplug.h +virHotplugUpdateCgroupCpuset; +virHotplugSetCpusetToLastest; + + # interface_conf.h virInterfaceAssignDef; virInterfaceDefFormat; diff --git a/src/util/cgroup.c b/src/util/cgroup.c index 5dc0764..393d439 100644 --- a/src/util/cgroup.c +++ b/src/util/cgroup.c @@ -608,8 +608,8 @@ static int virCgroupMakeGroup(virCgroupPtr parent, virCgroupPtr group, } -static int virCgroupNew(const char *path, - virCgroupPtr *group) +int virCgroupNew(const char *path, + virCgroupPtr *group) { int rc = 0; char *typpath = NULL; diff --git a/src/util/cgroup.h b/src/util/cgroup.h index 68ac232..e294495 100644 --- a/src/util/cgroup.h +++ b/src/util/cgroup.h @@ -68,6 +68,9 @@ int virCgroupPathOfController(virCgroupPtr group, const char *key, char **path); +int virCgroupNew(const char *path, + virCgroupPtr *group); + int virCgroupAddTask(virCgroupPtr group, pid_t pid); int virCgroupAddTaskController(virCgroupPtr group, diff --git a/src/util/hotplug.c b/src/util/hotplug.c new file mode 100644 index 0000000..7c44299 --- /dev/null +++ b/src/util/hotplug.c @@ -0,0 +1,240 @@ +/* + * Copyright (C) 2012 FUJITSU, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; If not, see + * <http://www.gnu.org/licenses/>. + * + * Authors: + * Tang Chen <tangchen@cn.fujitsu.com> + */ + +#include <sys/types.h> +#include <sys/socket.h> +#include <string.h> +#include <stdlib.h> + +#include "hotplug.h" +#include "cgroup.h" +#include "buf.h" +#include "virterror_internal.h" +#include "virnetlink.h" +#include "logging.h" +#include "memory.h" +#include "util.h" + +#define VIR_FROM_THIS VIR_FROM_HOTPLUG + +#ifdef __linux__ + +/** + * CPU hotplug message is of the following format: + * {online|offline}@/devices/system/cpu/cpuxx (xx is cpuid) + */ +# define CPU_ONLINE_MSG "online@/devices/system/cpu/cpu" +# define CPU_OFFLINE_MSG "offline@/devices/system/cpu/cpu" + +/** + * virHotplugGetCpuidFromMsg: + * + * @msg: The buffer containing the received netlink message + * @length: The length of the received netlink message + * @cpuid: Contains the cpuid in the message + * + * This function get the cpuid from the following netlink message, + * {online|offline}@/devices/system/cpu/cpuxx (xx is cpuid) + * + * Returns 0 on success, or -1 on error. + */ +static int virHotplugGetCpuidFromMsg(unsigned char *msg, + int length, + char **cpuid) +{ + char *p = NULL; + size_t len_online = MIN(length, strlen(CPU_ONLINE_MSG)); + size_t len_offline = MIN(length, strlen(CPU_OFFLINE_MSG)); + + if (VIR_ALLOC_N(*cpuid, 64) < 0) + goto memory_error; + + /** + * For now, we aren't sure if the message is a '/0' ended string. + * So we only test the first len_online or len_offline characters. + */ + if (strncmp((const char *)msg, CPU_ONLINE_MSG, len_online) == 0 || + strncmp((const char *)msg, CPU_OFFLINE_MSG, len_offline) == 0) { + p = strrchr((const char *)msg, '/'); + p = p + 4; + strcpy(*cpuid, p); + return 0; + } else { + VIR_DEBUG("Event is not a cpu hotplug event."); + VIR_FREE(*cpuid); + return 0; + } + +memory_error: + virReportOOMError(); + return -1; +} + +/** + * virHotplugUpdateCgroupCpuset: + * + * @msg: The buffer containing the received netlink message + * @length: The length of the received netlink message + * @cgroup: Contains a cgroup identifier to be modified + * + * Cpu hotplug netlink event handler. It is called when libvirtd + * receives netlink message from kernel, and modifies cpuset.cpus + * of specified cgroup. + * + * Return 0 on success. + */ +int +virHotplugUpdateCgroupCpuset(unsigned char *msg, + int length, + virCgroupPtr cgroup) +{ + int rc = 0; + char *cpuid = NULL; + char *cpus = NULL; + + if (!cgroup) { + virReportSystemError(EINVAL, + "%s", + _("invalid argument")); + return -EINVAL; + } + + if (VIR_ALLOC_N(cpus, 1024) < 0) { + virReportOOMError(); + return -ENOMEM; + } + + rc = virHotplugGetCpuidFromMsg(msg, length, &cpuid); + if (rc < 0) + goto error; + + /** + * If virHotplugGetCpuidFromMsg() succeeds, we are sure the + * netlink message is a '/0' ended string. + */ + VIR_DEBUG("netlink (cpu hotplug): %s", msg); + + if (msg == strstr((const char *)msg, "online")) { + VIR_DEBUG("CPU %s online message received.", cpuid); + + if (virCgroupGetCpusetCpus(cgroup, &cpus) < 0) { + virReportSystemError(errno, + "%s", + _("Unable to get cpuset.cpus")); + rc = -1; + goto error; + } + + if (virAsprintf(&cpus, "%s,%s", cpus, cpuid) < 0) { + virReportOOMError(); + rc = -ENOMEM; + goto error; + } + + if (virCgroupSetCpusetCpus(cgroup, cpus) < 0) { + virReportSystemError(errno, + "%s", + _("Unable to set cpuset.cpus")); + rc = -1; + goto error; + } + + } else if (msg == strstr((const char *)msg, "offline")) { + /* For now, nothing to do. */ + VIR_DEBUG("CPU %s offline message received.", cpuid); + } + +error: + VIR_FREE(cpuid); + VIR_FREE(cpus); + + return rc; +} + +/** + * virHotplugSetCpusetToLastest: + * + * @cgroup: The cgroup to be reset + * + * This function copies the cpuset from root to the specified cgroup. + * + * Return 0 on success or errno on error. + */ +int +virHotplugSetCpusetToLastest(virCgroupPtr cgroup) +{ + int rc = 0; + char *cpus = NULL; + virCgroupPtr mpgrp = NULL; + + /* mpgrp here is just the cpuset controller's mount point, + * which always contains all the online cpus. + */ + rc = virCgroupNew("/", &mpgrp); + if (rc < 0) + goto out; + + rc = virCgroupGetCpusetCpus(mpgrp, &cpus); + if (rc < 0) + goto error; + + rc = virCgroupSetCpusetCpus(cgroup, cpus); + if (rc < 0) + goto error; + +error: + virCgroupFree(&mpgrp); + VIR_FREE(cpus); + +out: + return rc; +} + +#else + +static const char *unsupported = N_("Not a linux system."); + +/** + * virHotplugUpdateCgroupCpuset: This function is called when libvirtd + * receives netlink message from kernel, and modifies cpuset.cpus of + * specified cgroup. + */ +int +virHotplugUpdateCgroupCpuset(unsigned char *msg ATTRIBUTE_UNUSED, + int length ATTRIBUTE_UNUSED, + virCgroupPtr cgroup ATTRIBUTE_UNUSED) +{ + VIR_DEBUG("%s", _(unsupported)); + return 0; +} + +/** + * virHotplugSetCpusetToLastest: This function copies the cpuset from + * root to the specified cgroup. + */ +int +virHotplugSetCpusetToLastest(virCgroupPtr cgroup ATTRIBUTE_UNUSED) +{ + VIR_DEBUG("%s", _(unsupported)); + return 0; +} + +#endif /* __linux__ */ diff --git a/src/util/hotplug.h b/src/util/hotplug.h new file mode 100644 index 0000000..ef3e279 --- /dev/null +++ b/src/util/hotplug.h @@ -0,0 +1,44 @@ +/* + * Copyright (C) 2012 FUJITSU, Inc. + * + * This library is free software; you can redistribute it and/or + * modify it under the terms of the GNU Lesser General Public + * License as published by the Free Software Foundation; either + * version 2.1 of the License, or (at your option) any later version. + * + * This library is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * Lesser General Public License for more details. + * + * You should have received a copy of the GNU Lesser General Public + * License along with this library; If not, see + * <http://www.gnu.org/licenses/>. + * + * Authors: + * Tang Chen <tangchen@cn.fujitsu.com> + */ + +#ifndef __HOTPLUG_H_ +# define __HOTPLUG_H_ + +# include "internal.h" +# include "cgroup.h" + +/** + * virHotplugUpdateCgroupCpuset: This function is called when libvirtd + * receives netlink message from kernel, and modifies cpuset.cpus of + * specified cgroup. + */ +int +virHotplugUpdateCgroupCpuset(unsigned char *msg, int length, + virCgroupPtr opaque); + +/** + * virHotplugSetCpusetToLastest: This function copies the cpuset from + * root to the specified cgroup. + */ +int +virHotplugSetCpusetToLastest(virCgroupPtr cgroup); + +#endif diff --git a/src/util/virterror.c b/src/util/virterror.c index 7caa69e..eff52ad 100644 --- a/src/util/virterror.c +++ b/src/util/virterror.c @@ -115,7 +115,8 @@ VIR_ENUM_IMPL(virErrorDomain, VIR_ERR_DOMAIN_LAST, "Parallels Cloud Server", "Device Config", - "SSH transport layer" /* 50 */ + "SSH transport layer", /* 50 */ + "Hotplug driver" /* 51 */ ) -- 1.7.10.1

This patch first sets cpuset.cpus to lastest value for libvirtd, and then registers 2 callback for cpu hotplug event for libvirtd: - daemonNetlinkCpuHotplugHandleCallback() - daemonNetlinkCpuHotplugRemoveCallback() Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> --- daemon/libvirtd.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++ src/libvirt_private.syms | 1 + src/util/cgroup.c | 6 ++--- src/util/cgroup.h | 4 +++ 4 files changed, 70 insertions(+), 3 deletions(-) diff --git a/daemon/libvirtd.c b/daemon/libvirtd.c index 6973df6..b612231 100644 --- a/daemon/libvirtd.c +++ b/daemon/libvirtd.c @@ -56,6 +56,8 @@ #include "uuid.h" #include "viraudit.h" #include "locking/lock_manager.h" +#include "cgroup.h" +#include "hotplug.h" #ifdef WITH_DRIVER_MODULES # include "driver.h" @@ -935,6 +937,43 @@ enum { OPT_VERSION = 129 }; +static void +daemonNetlinkCpuHotplugHandleCallback(unsigned char *msg, + int length, + struct sockaddr_nl *peer ATTRIBUTE_UNUSED, + bool *handled ATTRIBUTE_UNUSED, + void *opaque) +{ + virCgroupPtr cgroup = NULL; + + /* opaque is a cgroup identifier pointing to libvirt root cgroup. */ + if (!opaque) { + virReportSystemError(EINVAL, + "%s", + _("invalid argument")); + return; + } + + cgroup = (virCgroupPtr)opaque; + + if (virHotplugUpdateCgroupCpuset(msg, length, cgroup) < 0) { + virReportSystemError(errno, + "%s", + _("Unable to update cpuset in cgroup")); + } + + return; +} + +static void +daemonNetlinkCpuHotplugRemoveCallback(int watch ATTRIBUTE_UNUSED, + const virMacAddrPtr macaddr ATTRIBUTE_UNUSED, + void *opaque ATTRIBUTE_UNUSED) +{ + /* For now, nothing to do. */ + VIR_INFO("CPU hotplug netlink handler has been removed."); +} + #define MAX_LISTEN 5 int main(int argc, char **argv) { virNetServerPtr srv = NULL; @@ -954,6 +993,7 @@ int main(int argc, char **argv) { bool implicit_conf = false; char *run_dir = NULL; mode_t old_umask; + virCgroupPtr rootgrp = NULL; struct option opts[] = { { "verbose", no_argument, &verbose, 1}, @@ -1327,11 +1367,32 @@ int main(int argc, char **argv) { #endif #if defined(__linux__) && defined(NETLINK_KOBJECT_UEVENT) + /** + * If we had ever missed any cpu hotplug event, we need to update + * cpuset.cpus to the lastest value + */ + if (virCgroupAppRoot(privileged, &rootgrp, 0) != 0 || + virHotplugSetCpusetToLastest(rootgrp) < 0) { + ret = VIR_DAEMON_ERR_INIT; + goto cleanup; + } + /* Register the netlink event service for NETLINK_KOBJECT_UEVENT */ if (virNetlinkEventServiceStart(NETLINK_KOBJECT_UEVENT, 1) < 0) { ret = VIR_DAEMON_ERR_NETWORK; goto cleanup; } + + /* Register cpu hotplug event handler for libvirtd */ + if (virNetlinkEventServiceIsRunning(NETLINK_KOBJECT_UEVENT)) { + if (virNetlinkEventAddClient(daemonNetlinkCpuHotplugHandleCallback, + daemonNetlinkCpuHotplugRemoveCallback, + rootgrp, NULL, + NETLINK_KOBJECT_UEVENT) < 0) { + ret = VIR_DAEMON_ERR_NETWORK; + goto cleanup; + } + } #endif /* Run event loop. */ @@ -1362,6 +1423,7 @@ cleanup: if (pid_file_fd != -1) virPidFileReleasePath(pid_file, pid_file_fd); + virCgroupFree(&rootgrp); VIR_FREE(sock_file); VIR_FREE(sock_file_ro); VIR_FREE(pid_file); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2ff7e0e..752e3b0 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -64,6 +64,7 @@ virCgroupAddTaskController; virCgroupAllowDevice; virCgroupAllowDeviceMajor; virCgroupAllowDevicePath; +virCgroupAppRoot; virCgroupControllerTypeFromString; virCgroupControllerTypeToString; virCgroupDenyAllDevices; diff --git a/src/util/cgroup.c b/src/util/cgroup.c index 393d439..a1b21de 100644 --- a/src/util/cgroup.c +++ b/src/util/cgroup.c @@ -641,9 +641,9 @@ err: return rc; } -static int virCgroupAppRoot(int privileged, - virCgroupPtr *group, - int create) +int virCgroupAppRoot(int privileged, + virCgroupPtr *group, + int create) { virCgroupPtr rootgrp = NULL; int rc; diff --git a/src/util/cgroup.h b/src/util/cgroup.h index e294495..f31346c 100644 --- a/src/util/cgroup.h +++ b/src/util/cgroup.h @@ -44,6 +44,10 @@ enum { VIR_ENUM_DECL(virCgroupController); +int virCgroupAppRoot(int privileged, + virCgroupPtr *group, + int create); + int virCgroupForDriver(const char *name, virCgroupPtr *group, int privileged, -- 1.7.10.1

This patch sets cpuset.cpus to lastest for qemu driver when the driver is being initialized, and then registers 2 cpu hotplug events handlers for qemu driver: - qemuNetlinkCpuHotplugHandleCallback() - qemuNetlinkCpuHotplugRemoveCallback() Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> --- src/qemu/qemu_driver.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 64 insertions(+) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 53d6e5b..08f7be1 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -92,6 +92,9 @@ #include "virnodesuspend.h" #include "virtime.h" #include "virtypedparam.h" +#include "hotplug.h" +#include "cgroup.h" +#include "virnetlink.h" #define VIR_FROM_THIS VIR_FROM_QEMU @@ -525,6 +528,43 @@ static void qemuDomainNetsRestart(void *payload, virDomainObjUnlock(vm); } +static void +qemuNetlinkCpuHotplugHandleCallback(unsigned char *msg, + int length, + struct sockaddr_nl *peer ATTRIBUTE_UNUSED, + bool *handled ATTRIBUTE_UNUSED, + void *opaque) +{ + virCgroupPtr cgroup = NULL; + + /* opaque is a cgroup identifier pointing to libvirt root cgroup. */ + if (!opaque) { + virReportSystemError(EINVAL, + "%s", + _("invalid argument")); + return; + } + + cgroup = (virCgroupPtr)opaque; + + if (virHotplugUpdateCgroupCpuset(msg, length, cgroup) < 0) { + virReportSystemError(errno, + "%s", + _("Unable to update cpuset in cgroup")); + } + + return; +} + +static void +qemuNetlinkCpuHotplugRemoveCallback(int watch ATTRIBUTE_UNUSED, + const virMacAddrPtr macaddr ATTRIBUTE_UNUSED, + void *opaque ATTRIBUTE_UNUSED) +{ + /* For now, nothing to do. */ + VIR_INFO("CPU hotplug netlink handler has been removed."); +} + /** * qemudStartup: * @@ -682,6 +722,30 @@ qemudStartup(int privileged) { virStrerror(-rc, ebuf, sizeof(ebuf))); } +#if defined(__linux__) && defined(NETLINK_KOBJECT_UEVENT) + /** + * If we had ever missed any cpu hotplug event, we need to update + * cpuset.cpus to the lastest value + */ + rc = virHotplugSetCpusetToLastest(qemu_driver->cgroup); + if (rc < 0) { + VIR_INFO("Unable to reset cgroup for driver: %s", + virStrerror(-rc, ebuf, sizeof(ebuf))); + } + + /* Register cpu hotplug event handler for qemu driver */ + if (virNetlinkEventServiceIsRunning(NETLINK_KOBJECT_UEVENT)) { + rc = virNetlinkEventAddClient(qemuNetlinkCpuHotplugHandleCallback, + qemuNetlinkCpuHotplugRemoveCallback, + qemu_driver->cgroup, NULL, + NETLINK_KOBJECT_UEVENT); + if (rc < 0) { + VIR_INFO("Unable to register cpu hotplug event handler for driver: %s", + virStrerror(-rc, ebuf, sizeof(ebuf))); + } + } +#endif + if (qemudLoadDriverConfig(qemu_driver, driverConf) < 0) { goto error; } -- 1.7.10.1

This patch sets cpuset.cpus to the lastest value for lxc driver when the driver is being initialized, and registers 2 cpu hotplug handlers for lxc driver: - lxcNetlinkCpuHotplugHandleCallback() - lxcNetlinkCpuHotplugRemoveCallback() Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> --- src/lxc/lxc_driver.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 65 insertions(+), 2 deletions(-) diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c index ff11c2c..a2f17de 100644 --- a/src/lxc/lxc_driver.c +++ b/src/lxc/lxc_driver.c @@ -63,6 +63,9 @@ #include "virtime.h" #include "virtypedparam.h" #include "viruri.h" +#include "hotplug.h" +#include "cgroup.h" +#include "virnetlink.h" #define VIR_FROM_THIS VIR_FROM_LXC @@ -1396,11 +1399,48 @@ error: return -1; } +static void +lxcNetlinkCpuHotplugHandleCallback(unsigned char *msg, + int length, + struct sockaddr_nl *peer ATTRIBUTE_UNUSED, + bool *handled ATTRIBUTE_UNUSED, + void *opaque) +{ + virCgroupPtr cgroup = NULL; + + /* opaque is a cgroup identifier pointing to libvirt root cgroup. */ + if (!opaque) { + virReportSystemError(EINVAL, + "%s", + _("invalid argument")); + return; + } + + cgroup = (virCgroupPtr)opaque; + + if (virHotplugUpdateCgroupCpuset(msg, length, cgroup) < 0) { + virReportSystemError(errno, + "%s", + _("Unable to update cpuset in cgroup")); + } + + return; +} + +static void +lxcNetlinkCpuHotplugRemoveCallback(int watch ATTRIBUTE_UNUSED, + const virMacAddrPtr macaddr ATTRIBUTE_UNUSED, + void *opaque ATTRIBUTE_UNUSED) +{ + /* For now, nothing to do. */ + VIR_INFO("CPU hotplug netlink handler has been removed."); +} static int lxcStartup(int privileged) { char *ld; int rc; + char ebuf[1024]; /* Valgrind gets very annoyed when we clone containers, so * disable LXC when under valgrind @@ -1445,14 +1485,37 @@ static int lxcStartup(int privileged) rc = virCgroupForDriver("lxc", &lxc_driver->cgroup, privileged, 1); if (rc < 0) { - char buf[1024] ATTRIBUTE_UNUSED; VIR_DEBUG("Unable to create cgroup for LXC driver: %s", - virStrerror(-rc, buf, sizeof(buf))); + virStrerror(-rc, ebuf, sizeof(ebuf))); /* Don't abort startup. We will explicitly report to * the user when they try to start a VM */ } +#if defined(__linux__) && defined(NETLINK_KOBJECT_UEVENT) + /** + * If we had ever missed any cpu hotplug event, we need to update + * cpuset.cpus to the lastest value + */ + rc = virHotplugSetCpusetToLastest(lxc_driver->cgroup); + if (rc < 0) { + VIR_INFO("Unable to reset cgroup for driver: %s", + virStrerror(-rc, ebuf, sizeof(ebuf))); + } + + /* Register cpu hotplug event handler for lxc driver */ + if (virNetlinkEventServiceIsRunning(NETLINK_KOBJECT_UEVENT)) { + rc = virNetlinkEventAddClient(lxcNetlinkCpuHotplugHandleCallback, + lxcNetlinkCpuHotplugRemoveCallback, + lxc_driver->cgroup, NULL, + NETLINK_KOBJECT_UEVENT); + if (rc < 0) { + VIR_INFO("Unable to register cpu hotplug event handler for driver: %s", + virStrerror(-rc, ebuf, sizeof(ebuf))); + } + } +#endif + /* Call function to load lxc driver configuration information */ if (lxcLoadDriverConfig(lxc_driver) < 0) goto cleanup; -- 1.7.10.1

On Tue, Sep 04, 2012 at 04:45:16PM +0800, Tang Chen wrote:
It seems that libvirt is not cpu hotplug aware. Please refer to the following problem.
1. At first, we have 2 cpus. # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-1
2. And we have a vm1 with following configuration. <cputune> <vcpupin vcpu='0' cpuset='1'/> <emulatorpin cpuset='1'/> </cputune>
3. Offline cpu1. # echo 0 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 0 # cat /cgroup/cpuset/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0
4. Online cpu1. # echo 1 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 1 # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0
Here,cgroup updated cpuset.cpus,but not for libvirt directory,and also qemu and lxc directory.
I'm rather inclined to say this is the kernel's fault. This is the same class of problem that we save with S3/S4 kernel support where the cpuset got blanked out. The kernel should *not* be altering the user specified cgroups settings when offlining CPUs. The problem is that the kernel is not distinguishing between the user requested cpuset mask and the mask of available CPUs - it has overloaded both into one config file. The cgroup cpuset.cpus should only reflect the user config. The kernel should privately AND this with the current mask of CPUs which actually exist.
vm1 cannot be started again. # virsh start vm1 error: Failed to start domain vm1 error: Unable to set cpuset.cpus: Permission denied
And libvird gave the following errors. 2012-07-17 07:30:22.478+0000: 3118: error : qemuSetupCgroupVcpuPin:498 : Unable to set cpuset.cpus: Permission denied
These patches resolves this problem by listening on the netlink for cpu hotplug event. When the netlink service gets the cpu hotplug event, it will attract the cpuid in the message, and add it into cpuset.cpus in: /cgroup/cpuset/libvirt /cgroup/cpuset/libvirt/qemu /cgroup/cpuset/libvirt/lxc
I don't think we should be doing this. eg, Consisder the host has 8 cpus an the admin explicitly configured libvirt to only use cpus 1-4. If the host admin onlines CPU 6, then libvirt should not be adding CPU 6 into its cpuset. In addition we cannot assume that the 'libvirt' cgroup is immediately below the root cgroup. There might be several other layers in the hierarchy above which also loose their correct cpuset data, not to mention the cgroups of all other apps in the system. This is a system-wide flaw, but your patch is only addressing the libvirt impact. So I don't think we should be doing this. The kernel should fix cgroups properly so all apps work correctly. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 09/04/2012 07:25 PM, Daniel P. Berrange wrote:
On Tue, Sep 04, 2012 at 04:45:16PM +0800, Tang Chen wrote:
It seems that libvirt is not cpu hotplug aware. Please refer to the following problem.
1. At first, we have 2 cpus. # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0-1
2. And we have a vm1 with following configuration. <cputune> <vcpupin vcpu='0' cpuset='1'/> <emulatorpin cpuset='1'/> </cputune>
3. Offline cpu1. # echo 0 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 0 # cat /cgroup/cpuset/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0
4. Online cpu1. # echo 1 > /sys/devices/system/cpu/cpu1/online # cat /sys/devices/system/cpu/cpu1/online 1 # cat /cgroup/cpuset/cpuset.cpus 0-1 # cat /cgroup/cpuset/libvirt/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/qemu/cpuset.cpus 0 # cat /cgroup/cpuset/libvirt/lxc/cpuset.cpus 0
Here,cgroup updated cpuset.cpus,but not for libvirt directory,and also qemu and lxc directory.
I'm rather inclined to say this is the kernel's fault. This is the same class of problem that we save with S3/S4 kernel support where the cpuset got blanked out.
The kernel should *not* be altering the user specified cgroups settings when offlining CPUs. The problem is that the kernel is not distinguishing between the user requested cpuset mask and the mask of available CPUs - it has overloaded both into one config file.
The cgroup cpuset.cpus should only reflect the user config. The kernel should privately AND this with the current mask of CPUs which actually exist.
I had posted a Linux kernel patchset[1] some time ago to expose another file so that we can distinguish between the user specified settings vs the actual scenario underneath. But the conclusion in the ensuing discussion was that the existing kernel behaviour is good as is, and trying to "fix" it would break kernel semantics. (However, note that the suspend/resume case has been fixed in the kernel by commit d35be8bab). [1]. http://thread.gmane.org/gmane.linux.documentation/4805 Regards, Srivatsa S. Bhat
vm1 cannot be started again. # virsh start vm1 error: Failed to start domain vm1 error: Unable to set cpuset.cpus: Permission denied
And libvird gave the following errors. 2012-07-17 07:30:22.478+0000: 3118: error : qemuSetupCgroupVcpuPin:498 : Unable to set cpuset.cpus: Permission denied
These patches resolves this problem by listening on the netlink for cpu hotplug event. When the netlink service gets the cpu hotplug event, it will attract the cpuid in the message, and add it into cpuset.cpus in: /cgroup/cpuset/libvirt /cgroup/cpuset/libvirt/qemu /cgroup/cpuset/libvirt/lxc
I don't think we should be doing this. eg, Consisder the host has 8 cpus an the admin explicitly configured libvirt to only use cpus 1-4. If the host admin onlines CPU 6, then libvirt should not be adding CPU 6 into its cpuset.
In addition we cannot assume that the 'libvirt' cgroup is immediately below the root cgroup. There might be several other layers in the hierarchy above which also loose their correct cpuset data, not to mention the cgroups of all other apps in the system. This is a system-wide flaw, but your patch is only addressing the libvirt impact. So I don't think we should be doing this. The kernel should fix cgroups properly so all apps work correctly.
Daniel

Hi Srivatsa, Daniel, Thank you very much for all the comments. :) On 09/05/2012 04:57 AM, Srivatsa S. Bhat wrote:
I had posted a Linux kernel patchset[1] some time ago to expose another file so that we can distinguish between the user specified settings vs the actual scenario underneath. But the conclusion in the ensuing discussion was that the existing kernel behaviour is good as is, and trying to "fix" it would break kernel semantics. (However, note that the suspend/resume case has been fixed in the kernel by commit d35be8bab).
The reason why I made this patch set is that if libvirt doesn't recover the cpuset.cpus, all the domains with vcpus pinned to a *re-pluged* cpu in xml will fail to start. Which means all these domain will be unusable, or we have to modify the configuration. If the cpu is really removed, it is normal for a domain fails to start. We can simply print an error message. But if the cpu is added again, and it is active and usable, the domain should be able to start normally. (am I right here ?) This is the key problem I want to solve. So first, I improved the netlink related code in libvirt, and now libvirt can be notified when cpu hotplug event occurred. I read the emails posted above. In summary, you discussed about the following problems: 1) Make cgroup be able to distinguish actual configuration and user's. - ( Srivatsa's idea: mask = (actual config) & (user config) ) Seems that it is hard to be applied for some cgroup design reasons. 2) Kill all the tasks on the cpu when hot-unplug it. - I don't think this is a good idea. And, this won't solve the problem. For example, a task binded on cpu 3. Suppose cpu 3 is unpluged, * if the task is killed, it's just too rude, and users running important tasks will suffer. * if the task is migrated to other cpus, what if cpu 3 is active again ? Are we going to see the added cpu 3 is not the original cpu 3 ? Whatever, the domain will still fail to start. 3) Make cpu hot unplug fail when there are tasks on it. - This may be unacceptable for hotplug users. And this won't solve the problem either. If the domain is not running when the hot unplug happens, the hot unplug will succeed. And when we start the domain, it will fail anyway, right ? 4) Make libvirt not use cpuset cgroup. - For now, seems impossable. sched_setaffinity() behaves properly, which assumes the repluged cpu is the same one unpluged before. (am I right ?) But with cgroup's control, we cannot resolve this problem using sched_setaffinity(). If I want to solve the start failure problem, what should I do ? Thanks. :)

On 09/05/2012 07:32 AM, Tang Chen wrote:
4) Make libvirt not use cpuset cgroup. - For now, seems impossable. sched_setaffinity() behaves properly, which assumes the repluged cpu is the same one unpluged before. (am I right ?) But with cgroup's control, we cannot resolve this problem using sched_setaffinity().
If I want to solve the start failure problem, what should I do ?
Hi, I posted a comment some time ago about that. If you do not mount the cpuset controller, i.e for RHEL 6 you delete the cpuset line from /etc/cgconfig, the CPU affinity isn't controlled by cgroups any more but uses the old mechanism, which works as expected: take a host CPU offline and it will be removed from the process CPU mask and will show up again after onlining the host CPU. The only issue I currently see is that the display of virsh vcpuinfo and vcpupin is somewhat strange. Using taskset will however show the the correct affinity. I suggest that you try out that approach. -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martin Jetter Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

Hi Viktor, On 09/05/2012 04:54 PM, Viktor Mihajlovski wrote:
I posted a comment some time ago about that. If you do not mount the cpuset controller, i.e for RHEL 6 you delete the cpuset line from /etc/cgconfig, the CPU affinity isn't controlled by cgroups any more but uses the old mechanism, which works as expected: take a host CPU offline and it will be removed from the process CPU mask and will show up again after onlining the host CPU. The only issue I currently see is that the display of virsh vcpuinfo and vcpupin is somewhat strange. Using taskset will however show the the correct affinity.
I suggest that you try out that approach.
I saw your comment before. You are quite right. :) But the situation here is there are some other features in libvirt using cpuset. For example, emulator-pin feature. If we remove cpuset in the system, other features could be unusable. And more, I found different cgroups are widely used in libvirt now. I don't think removing cgroups from system is a good enough idea, though it can be a work around. What do you think? :) Thanks. :)

On Wed, Sep 05, 2012 at 05:19:22PM +0800, Tang Chen wrote:
Hi Viktor,
On 09/05/2012 04:54 PM, Viktor Mihajlovski wrote:
I posted a comment some time ago about that. If you do not mount the cpuset controller, i.e for RHEL 6 you delete the cpuset line from /etc/cgconfig, the CPU affinity isn't controlled by cgroups any more but uses the old mechanism, which works as expected: take a host CPU offline and it will be removed from the process CPU mask and will show up again after onlining the host CPU. The only issue I currently see is that the display of virsh vcpuinfo and vcpupin is somewhat strange. Using taskset will however show the the correct affinity.
I suggest that you try out that approach.
I saw your comment before. You are quite right. :)
But the situation here is there are some other features in libvirt using cpuset. For example, emulator-pin feature. If we remove cpuset in the system, other features could be unusable.
The emulator pinning code should be made to use sched_setaffinity() when the cpuset controller is not available, just as we do for the vCPU pinning.
And more, I found different cgroups are widely used in libvirt now. I don't think removing cgroups from system is a good enough idea, though it can be a work around.
What do you think? :)
Either the kernel fixes the broken cpuset behaviour, or we have to recommend that people don't use it. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, Sep 05, 2012 at 01:32:12PM +0800, Tang Chen wrote:
Hi Srivatsa, Daniel,
Thank you very much for all the comments. :)
On 09/05/2012 04:57 AM, Srivatsa S. Bhat wrote:
I had posted a Linux kernel patchset[1] some time ago to expose another file so that we can distinguish between the user specified settings vs the actual scenario underneath. But the conclusion in the ensuing discussion was that the existing kernel behaviour is good as is, and trying to "fix" it would break kernel semantics. (However, note that the suspend/resume case has been fixed in the kernel by commit d35be8bab).
The reason why I made this patch set is that if libvirt doesn't recover the cpuset.cpus, all the domains with vcpus pinned to a *re-pluged* cpu in xml will fail to start. Which means all these domain will be unusable, or we have to modify the configuration.
If the cpu is really removed, it is normal for a domain fails to start. We can simply print an error message. But if the cpu is added again, and it is active and usable, the domain should be able to start normally. (am I right here ?) This is the key problem I want to solve.
So first, I improved the netlink related code in libvirt, and now libvirt can be notified when cpu hotplug event occurred.
Your patch appears to work in some limited scenarios, but more generally it will fail to work, and resulted in undesirable behaviour. Consider for example, if libvirtd is configured thus: cd /sys/fs/cgroup/cpuset mkdir demo cd demo echo 2-3 > cpuset.cpus echo 0 > cpuset.mems echo $$ > tasks /usr/sbin/libvirtd ie, libvirtd is now running on cpus 2-3, in group 'demo'. VMs will be created in /sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME Your patch attempts to set the cpuset.cpus on 'libvirt/qemu/$VMNAME' but ignores the fact that there could be many higher directories (eg demo here) that need setting. libvirtd, however, should not be responsible for / allowed to change settings in parent cgroups from where it was started. ie in this example, libvirtd should *not* touch the 'demo' cgroup. So consider systemd starting tasks, giving them custom cgroups. Now systemd also has to listen for netlink events and reset the cpuset masks. Things are even worse if the admin has temporarily offlined all the cpus that are associated with the current cpuset. When this happens the kernel throws libvirtd and all its VMs out of their current cgroups and dumps them up in a parent cgroup (potentially even the root group). This is really awful.
I read the emails posted above. In summary, you discussed about the following problems:
1) Make cgroup be able to distinguish actual configuration and user's. - ( Srivatsa's idea: mask = (actual config) & (user config) ) Seems that it is hard to be applied for some cgroup design reasons.
2) Kill all the tasks on the cpu when hot-unplug it. - I don't think this is a good idea. And, this won't solve the problem. For example, a task binded on cpu 3. Suppose cpu 3 is unpluged, * if the task is killed, it's just too rude, and users running important tasks will suffer. * if the task is migrated to other cpus, what if cpu 3 is active again ? Are we going to see the added cpu 3 is not the original cpu 3 ? Whatever, the domain will still fail to start.
IMHO, execution of those tasks should simply be paused (same way that the 'freezer' cgroup pauses tasks). The admin can then either move the tasks to an alternate cgroup, or change the cpuset mask to allow them to continue running. The kernel's current behaviour of pushing all tasks up into a parent cgroup is just crazy - it is just throwing away the users requested cpu mask forever :-(
3) Make cpu hot unplug fail when there are tasks on it. - This may be unacceptable for hotplug users. And this won't solve the problem either. If the domain is not running when the hot unplug happens, the hot unplug will succeed. And when we start the domain, it will fail anyway, right ?
4) Make libvirt not use cpuset cgroup. - For now, seems impossable. sched_setaffinity() behaves properly, which assumes the repluged cpu is the same one unpluged before. (am I right ?) But with cgroup's control, we cannot resolve this problem using sched_setaffinity().
If I want to solve the start failure problem, what should I do ?
I maintain the problems we see with cpuset controller cannot be reasonably solved by libvirtd, or userspace in general. The kernel behaviour is just flawed. If the kernel won't fix it, then we should recommend people not to use the cpuset cgroup at all, and just rely on our sched_setaffinity support instead. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Hi Daniel, On 09/05/2012 05:43 PM, Daniel P. Berrange wrote:
Your patch appears to work in some limited scenarios, but more generally it will fail to work, and resulted in undesirable behaviour.
Consider for example, if libvirtd is configured thus:
cd /sys/fs/cgroup/cpuset mkdir demo cd demo echo 2-3> cpuset.cpus echo 0> cpuset.mems echo $$> tasks /usr/sbin/libvirtd
ie, libvirtd is now running on cpus 2-3, in group 'demo'. VMs will be created in
/sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME
Your patch attempts to set the cpuset.cpus on 'libvirt/qemu/$VMNAME' but ignores the fact that there could be many higher directories (eg demo here) that need setting. libvirtd, however, should not be responsible for / allowed to change settings in parent cgroups from where it was started. ie in this example, libvirtd should *not* touch the 'demo' cgroup.
Yes, I didn't realize this situation. Thanks for remind me. :)
So consider systemd starting tasks, giving them custom cgroups. Now systemd also has to listen for netlink events and reset the cpuset masks.
Things are even worse if the admin has temporarily offlined all the cpus that are associated with the current cpuset. When this happens the kernel throws libvirtd and all its VMs out of their current cgroups and dumps them up in a parent cgroup (potentially even the root group). This is really awful.
Agreed. :)
IMHO, execution of those tasks should simply be paused (same way that the 'freezer' cgroup pauses tasks). The admin can then either move the tasks to an alternate cgroup, or change the cpuset mask to allow them to continue running.
The kernel's current behaviour of pushing all tasks up into a parent cgroup is just crazy - it is just throwing away the users requested cpu mask forever :-(
If I want to solve the start failure problem, what should I do ?
I maintain the problems we see with cpuset controller cannot be reasonably solved by libvirtd, or userspace in general. The kernel behaviour is just flawed. If the kernel won't fix it, then we should recommend people not to use the cpuset cgroup at all, and just rely on our sched_setaffinity support instead.
I like the sched_setaffinity idea. Let's just temporarily shut off cpuset cgroup in libvirt, shall we ? Since cpuset cgroup was turned on when I was working on the emulator-pin job, I will shut if off and improve all these with sched_setaffinity(). And I will send new patches soon. Thanks. :)
Daniel
participants (4)
-
Daniel P. Berrange
-
Srivatsa S. Bhat
-
Tang Chen
-
Viktor Mihajlovski