[libvirt] [PATCH 0 of 2] [RFC] Add cgroup manipulation and LXC driver support

This patch set adds basic cgroup support to the LXC driver. It consists of a small internal cgroup manipulation API, as well as changes to the driver itself to utilize the support. Currently, we just set a memory limit and the allowed devices list. The cgroup.{c,h} interface can be easily redirected to libcgroup in the future if and when the decision to move in that direction is made. Some discussion on the following points is probably warranted, to help determine how deep we want to go with this internal implementation, in terms' of supporting complex system configurations, etc. - What to do if controllers are mounted in multiple places - What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior Some additional caveats are noted per-patch as well.

This patch adds src/cgroup.{c,h} with support for creating and manipulating cgroups. It's quite naive at the moment, but should provide something to work with to move forward with resource controls. All groups created with the internal API are forced under $mount/libvirt/ to keep everything together. The first time a group is created, the libvirt directory is also created, and the settings from the root are inherited. The code scans the mount table to look for the first mount of type cgroup, and assumes that all controllers are mounted there. I think this could/should be updated to prefer a mount with just the controller(s) we want, if there are multiple ones. If you have the cpuset controller enabled, and cpuset.cpus_exclusive is 1, then all cgroups to be created will fail. Since we probably shouldn't blindly set the root to be non-exclusive, we may also want to consider this condition to be "no cgroup support". diff -r 444e2614d0a2 -r 8e948eb88328 src/Makefile.am --- a/src/Makefile.am Wed Sep 17 16:07:03 2008 +0000 +++ b/src/Makefile.am Mon Sep 29 09:37:42 2008 -0700 @@ -96,7 +96,8 @@ lxc_conf.c lxc_conf.h \ lxc_container.c lxc_container.h \ lxc_controller.c \ - veth.c veth.h + veth.c veth.h \ + cgroup.c cgroup.h OPENVZ_DRIVER_SOURCES = \ openvz_conf.c openvz_conf.h \ diff -r 444e2614d0a2 -r 8e948eb88328 src/cgroup.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/cgroup.c Mon Sep 29 09:37:42 2008 -0700 @@ -0,0 +1,526 @@ +/* + * cgroup.c: Tools for managing cgroups + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * Dan Smith <danms@us.ibm.com> + */ +#include <config.h> + +#include <stdio.h> +#include <stdint.h> +#include <inttypes.h> +#include <mntent.h> +#include <fcntl.h> +#include <string.h> +#include <errno.h> +#include <stdlib.h> +#include <stdbool.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <libgen.h> + +#include "internal.h" +#include "util.h" +#include "cgroup.h" + +#define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) +#define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) + +struct virCgroup { + char *path; +}; + +void virCgroupFree(virCgroupPtr *group) +{ + if (*group != NULL) { + free((*group)->path); + free(*group); + *group = NULL; + } +} + +static virCgroupPtr cgroup_get_mount(void) +{ + FILE *mounts; + struct mntent entry; + char buf[512]; + virCgroupPtr root = NULL; + + root = calloc(1, sizeof(*root)); + if (root == NULL) + return NULL; + + mounts = fopen("/proc/mounts", "r"); + if (mounts == NULL) { + DEBUG0("Unable to open /proc/mounts: %m"); + goto err; + } + + while (getmntent_r(mounts, &entry, buf, sizeof(buf)) != NULL) { + if (STREQ(entry.mnt_type, "cgroup")) { + root->path = strdup(entry.mnt_dir); + break; + } + } + + if (root->path == NULL) { + DEBUG0("Did not find cgroup mount"); + goto err; + } + + fclose(mounts); + + return root; +err: + virCgroupFree(&root); + + return NULL; +} + +int virCgroupHaveSupport(void) +{ + virCgroupPtr root; + + root = cgroup_get_mount(); + if (root == NULL) + return -1; + + virCgroupFree(&root); + + return 0; +} + +static int cgroup_path_of(const char *grppath, + const char *key, + char **path) +{ + virCgroupPtr root; + int rc = 0; + + root = cgroup_get_mount(); + if (root == NULL) { + rc = -ENOTDIR; + goto out; + } + + if (asprintf(path, "%s/%s/%s", root->path, grppath, key) == -1) + rc = -ENOMEM; +out: + virCgroupFree(&root); + + return rc; +} + +int virCgroupSetValueStr(virCgroupPtr group, + const char *key, + const char *value) +{ + int fd = -1; + int rc = 0; + char *keypath = NULL; + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc != 0) + return rc; + + fd = open(keypath, O_WRONLY); + if (fd < 0) { + DEBUG("Unable to open %s: %m", keypath); + rc = -ENOENT; + goto out; + } + + DEBUG("Writing '%s' to '%s'", value, keypath); + + rc = safewrite(fd, value, strlen(value)); + if (rc < 0) { + DEBUG("Failed to write value '%s': %m", value); + rc = -errno; + goto out; + } else if (rc != strlen(value)) { + DEBUG("Short write of value '%s'", value); + rc = -ENOSPC; + goto out; + } + + rc = 0; +out: + free(keypath); + close(fd); + + return rc; +} + +int virCgroupSetValueU64(virCgroupPtr group, + const char *key, + uint64_t value) +{ + char *strval = NULL; + int rc; + + if (asprintf(&strval, "%" PRIu64, value) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, key, strval); + + free(strval); + + return rc; +} + +int virCgroupSetValueI64(virCgroupPtr group, + const char *key, + int64_t value) +{ + char *strval = NULL; + int rc; + + if (asprintf(&strval, "%" PRIi64, value) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, key, strval); + + free(strval); + + return rc; +} + +int virCgroupGetValueStr(virCgroupPtr group, + const char *key, + char **value) +{ + int fd = -1; + int rc; + char *keypath = NULL; + char buf[512]; + + memset(buf, 0, sizeof(buf)); + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc != 0) { + DEBUG("No path of %s, %s", group->path, key); + return rc; + } + + fd = open(keypath, O_RDONLY); + if (fd < 0) { + DEBUG("Unable to open %s: %m", keypath); + rc = -ENOENT; + goto out; + } + + rc = saferead(fd, buf, sizeof(buf)); + if (rc < 0) { + DEBUG("Failed to read %s: %m\n", keypath); + rc = -errno; + goto out; + } else if (rc == 0) { + DEBUG("Short read of %s\n", keypath); + rc = -EIO; + goto out; + } + + *value = strdup(buf); + if (*value == NULL) { + rc = -ENOMEM; + goto out; + } + + rc = 0; +out: + free(keypath); + close(fd); + + return rc; +} + +int virCgroupGetValueU64(virCgroupPtr group, + const char *key, + uint64_t *value) +{ + char *strval = NULL; + int rc = 0; + + rc = virCgroupGetValueStr(group, key, &strval); + if (rc != 0) + goto out; + + if (sscanf(strval, "%" SCNu64, value) != 1) + rc = -EINVAL; +out: + free(strval); + + return rc; +} + +int virCgroupGetValueI64(virCgroupPtr group, + const char *key, + int64_t *value) +{ + char *strval = NULL; + int rc = 0; + + rc = virCgroupGetValueStr(group, key, &strval); + if (rc != 0) + goto out; + + if (sscanf(strval, "%" SCNi64, value) != 1) + rc = -EINVAL; +out: + free(strval); + + return rc; +} + +int virCgroupHasValue(virCgroupPtr group, + const char *key) +{ + int rc; + char *keypath = NULL; + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc < 0) + goto out; + + if (access(keypath, F_OK) != 0) + rc = -ENOENT; +out: + free(keypath); + + return rc; +} + +static int _cgroup_inherit(virCgroupPtr parent, + virCgroupPtr group, + const char *key) +{ + int rc; + char *keyval = NULL; + + rc = virCgroupGetValueStr(parent, key, &keyval); + if (rc != 0) + goto out; + + rc = virCgroupSetValueStr(group, key, keyval); + +out: + free(keyval); + + return rc; +} + +static int cgroup_inherit(virCgroupPtr parent, + virCgroupPtr group) +{ + int i; + int rc = 0; + const char *inherit_values[] = { + "cpuset.cpus", + "cpuset.mems", + NULL + }; + + for (i = 0; inherit_values[i] != NULL; i++) { + const char *key = inherit_values[i]; + if (virCgroupHasValue(group, key) == 0) + rc = _cgroup_inherit(parent, group, key); + + if (rc != 0) { + DEBUG("inherit of %s failed\n", key); + break; + } + } + + return rc; +} + +static int cgroup_root(virCgroupPtr *root) +{ + int rc = 0; + char *grppath = NULL; + struct virCgroup _root = { (char *)"." }; + + *root = calloc(1, sizeof(**root)); + if (*root == NULL) { + rc = -ENOMEM; + goto out; + } + + (*root)->path = strdup("libvirt"); + if ((*root)->path == NULL) { + rc = -ENOMEM; + goto out; + } + + rc = cgroup_path_of(".", (*root)->path, &grppath); + if (rc != 0) + goto out; + + if (access(grppath, F_OK) != 0) { + if (mkdir(grppath, 0655) != 0) + rc = -errno; + else + rc = cgroup_inherit(&_root, *root); + } +out: + if (rc != 0) + virCgroupFree(root); + free(grppath); + + return rc; +} + +static int cgroup_new(virCgroupPtr *parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *typpath = NULL; + + if (*parent == NULL) { + rc = cgroup_root(parent); + if (rc != 0) + goto err; + } + + *newgroup = calloc(1, sizeof(**newgroup)); + if (*newgroup == NULL) { + rc = -ENOMEM; + goto err; + } + + rc = asprintf(&((*newgroup)->path), + "%s/%s", + (*parent)->path, + group); + if (rc == -1) { + rc = -ENOMEM; + goto err; + } + + rc = 0; + + return rc; +err: + virCgroupFree(newgroup); + *newgroup = NULL; + + free(typpath); + + return rc; +} + +int virCgroupOpen(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *grppath = NULL; + + rc = cgroup_new(&parent, group, newgroup); + if (rc != 0) + goto err; + virCgroupFree(&parent); + + rc = cgroup_path_of(".", (*newgroup)->path, &grppath); + if (rc != 0) + goto err; + + if (access(grppath, F_OK) != 0) { + rc = -ENOENT; + goto err; + } + + return rc; +err: + virCgroupFree(newgroup); + *newgroup = NULL; + + return rc; +} + +int virCgroupCreate(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *grppath = NULL; + bool free_parent = (parent == NULL); + + rc = cgroup_new(&parent, group, newgroup); + if (rc != 0) { + DEBUG0("Unable to allocate new virCgroup structure"); + goto err; + } + + rc = cgroup_path_of(".", (*newgroup)->path, &grppath); + if (rc != 0) + goto err; + + if (access(grppath, F_OK) == 0) { + rc = -EEXIST; + goto err; + } + + if (mkdir(grppath, 0655) != 0) { + DEBUG("mkdir of %s failed", grppath); + rc = -errno; + goto err; + } + + rc = cgroup_inherit(parent, *newgroup); + if (rc != 0) + goto err; + + free(grppath); + + if (free_parent) + virCgroupFree(&parent); + + return rc; +err: + free(grppath); + virCgroupFree(newgroup); + *newgroup = NULL; + + if (free_parent) + virCgroupFree(&parent); + + return rc; +} + +int virCgroupRemove(virCgroupPtr group) +{ + int rc = 0; + char *grppath = NULL; + + rc = cgroup_path_of(".", group->path, &grppath); + if (rc != 0) + goto out; + + if (rmdir(grppath) != 0) + rc = -errno; +out: + free(grppath); + + return rc; +} + +int virCgroupAddTask(virCgroupPtr group, pid_t pid) +{ + int rc = 0; + char *pidstr = NULL; + + if (asprintf(&pidstr, "%lu", (unsigned long)pid) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, "tasks", pidstr); + + free(pidstr); + + return rc; +} diff -r 444e2614d0a2 -r 8e948eb88328 src/cgroup.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/cgroup.h Mon Sep 29 09:37:42 2008 -0700 @@ -0,0 +1,120 @@ +/* + * cgroup.h: Interface to tools for managing cgroups + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * Dan Smith <danms@us.ibm.com> + */ + +#ifndef CGROUP_H +#define CGROUP_H + +#include <stdint.h> + +struct virCgroup; +typedef struct virCgroup *virCgroupPtr; + +/** + * virCgroupHaveSupport: + * + * Returns 0 if support is present, negative if not + */ +int virCgroupHaveSupport(void); + +/** + * virCgroupCreate: + * @parent: The path to the parent of the desired new group, or NULL if root + * @group: The name of the new group + * @newgroup: The newly-created group. Must be passed to cgroup_free() + * + * Returns 0 on success, negative on failure + */ +int virCgroupCreate(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup); + +/** + * virCgroupOpen: + * @parent: The path to the parent of the desired new group, or NULL if root + * @group: The name of the group + * @newgroup: The group. Must be passed to cgroup_free() + * + * Returns 0 on success, negative on failure + */ +int virCgroupOpen(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup); + +/** + * virCgroupFree: + * @group: A pointer to the virCgroupPtr to be freed + */ +void virCgroupFree(virCgroupPtr *group); + +/** + * virCgroupRemove + * @parent: The path to the parent of the group to be deleted + * @group: The name of the group to be removed + * + * Returns 0 on success, negative on failure + */ +int virCgroupRemove(virCgroupPtr group); + +/** + * virCgroupHasValue: + * @group: The scoping cgroup + * @key: The cgrop node to test for + * + * Returns 0 if present, negative if not + */ +int virCgroupHasValue(virCgroupPtr group, const char *key); + +/** + * virCgroupAddTask: + * @group: The scoping cgroup + * @pid: The pid of the task to add + * + * Returns 0 on success, negative on failure + */ +int virCgroupAddTask(virCgroupPtr group, pid_t pid); + +/** + * virCgroupSetValueXXX: + * @group: The scoping cgroup + * @key: The cgroup node to set + * @value: The value to write into the @key node + * + * Returns 0 on success, negative on failure + */ +int virCgroupSetValueStr(virCgroupPtr group, + const char *key, + const char *value); +int virCgroupSetValueU64(virCgroupPtr group, + const char *key, + uint64_t value); +int virCgroupSetValueI64(virCgroupPtr group, + const char *key, + int64_t value); + +/** + * virCgroupGetValueXXX: + * @group: The scoping cgroup + * @key: The cgroup node to read + * @value: A pointer to the storage area + * + * Returns 0 on success, negative on failure + */ +int virCgroupGetValueStr(virCgroupPtr group, + const char *key, + char **value); +int virCgroupGetValueU64(virCgroupPtr group, + const char *key, + uint64_t *value); +int virCgroupGetValueI64(virCgroupPtr group, + const char *key, + int64_t *value); + +#endif /* CGROUP_H */

On Mon, Sep 29, 2008 at 09:40:42AM -0700, Dan Smith wrote:
This patch adds src/cgroup.{c,h} with support for creating and manipulating cgroups. It's quite naive at the moment, but should provide something to work with to move forward with resource controls.
All groups created with the internal API are forced under $mount/libvirt/ to keep everything together. The first time a group is created, the libvirt directory is also created, and the settings from the root are inherited.
The code scans the mount table to look for the first mount of type cgroup, and assumes that all controllers are mounted there. I think this could/should be updated to prefer a mount with just the controller(s) we want, if there are multiple ones.
If you have the cpuset controller enabled, and cpuset.cpus_exclusive is 1, then all cgroups to be created will fail. Since we probably shouldn't blindly set the root to be non-exclusive, we may also want to consider this condition to be "no cgroup support".
My thought on the overall design of this internal API is that it is too low level & pushing too much work to the caller. eg, the caller would have to lookup virCGroupPtr objects for each controller it cares about, and has to remember the tunables & their data types. While this kind of interface is applicable for a general purpose API, libvirt's needs are not general purpose. While LXC driver is the only current user, as more controllers are added I anticipate that QEMU driver might use cgroups, eg for I/O controls and CPU schedular controls As such I'd expect an API to be at a slightly higher level of abstraction, strongly typed and a single cgroup object associated with a domain object. virDomainObjPtr dom = ...driver's internal domain object... virCGroupPtr cg = virCGroupNew(dom); # Set domain memory limit to 1 GB virCGroupSetMemory(cg, 1024 * 1024 * 1024); # Remove all device permissions virCGroupDenyAllDevices(cg); # Allow access to /dev/null device virCGroupAllowDevice(cg, 3, 1) Some of the existing APIs could be made static since they'd be needed anyway, but some could be re-design to be tailored to the semantics we want.
+static virCgroupPtr cgroup_get_mount(void) +{ + FILE *mounts; + struct mntent entry; + char buf[512]; + virCgroupPtr root = NULL; + + root = calloc(1, sizeof(*root)); + if (root == NULL) + return NULL;
VIR_ALLOC() please
+ + mounts = fopen("/proc/mounts", "r"); + if (mounts == NULL) { + DEBUG0("Unable to open /proc/mounts: %m"); + goto err; + } + + while (getmntent_r(mounts, &entry, buf, sizeof(buf)) != NULL) { + if (STREQ(entry.mnt_type, "cgroup")) { + root->path = strdup(entry.mnt_dir); + break; + } + } + + if (root->path == NULL) { + DEBUG0("Did not find cgroup mount"); + goto err; + } + + fclose(mounts); + + return root; +err: + virCgroupFree(&root); + + return NULL; +} + +int virCgroupHaveSupport(void) +{ + virCgroupPtr root; + + root = cgroup_get_mount(); + if (root == NULL) + return -1; + + virCgroupFree(&root); + + return 0; +} + +static int cgroup_path_of(const char *grppath, + const char *key, + char **path) +{ + virCgroupPtr root; + int rc = 0; + + root = cgroup_get_mount(); + if (root == NULL) { + rc = -ENOTDIR; + goto out; + } + + if (asprintf(path, "%s/%s/%s", root->path, grppath, key) == -1) + rc = -ENOMEM; +out: + virCgroupFree(&root); + + return rc; +} + +int virCgroupSetValueStr(virCgroupPtr group, + const char *key, + const char *value) +{ + int fd = -1; + int rc = 0; + char *keypath = NULL; + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc != 0) + return rc; + + fd = open(keypath, O_WRONLY); + if (fd < 0) { + DEBUG("Unable to open %s: %m", keypath); + rc = -ENOENT; + goto out; + } + + DEBUG("Writing '%s' to '%s'", value, keypath); + + rc = safewrite(fd, value, strlen(value)); + if (rc < 0) { + DEBUG("Failed to write value '%s': %m", value); + rc = -errno; + goto out; + } else if (rc != strlen(value)) { + DEBUG("Short write of value '%s'", value); + rc = -ENOSPC; + goto out; + } + + rc = 0; +out: + free(keypath); + close(fd); + + return rc; +} + +int virCgroupSetValueU64(virCgroupPtr group, + const char *key, + uint64_t value) +{ + char *strval = NULL; + int rc; + + if (asprintf(&strval, "%" PRIu64, value) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, key, strval); + + free(strval);
Not much difference in this case, but VIR_FREE() is the standard API for this, since we like to wrap all alloc related APIs for consistency. There's a few other example of this through the patch, but I won't mention them in this review.
+ *root = calloc(1, sizeof(**root));
VIR_ALLOC..
+ if (*root == NULL) { + rc = -ENOMEM; + goto out; + } + + (*root)->path = strdup("libvirt"); + if ((*root)->path == NULL) { + rc = -ENOMEM; + goto out; + } + + rc = cgroup_path_of(".", (*root)->path, &grppath); + if (rc != 0) + goto out; + + if (access(grppath, F_OK) != 0) { + if (mkdir(grppath, 0655) != 0) + rc = -errno; + else + rc = cgroup_inherit(&_root, *root); + } +out: + if (rc != 0) + virCgroupFree(root); + free(grppath); + + return rc; +}
Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

DB> My thought on the overall design of this internal API is that it DB> is too low level & pushing too much work to the caller. eg, the DB> caller would have to lookup virCGroupPtr objects for each DB> controller it cares about, Well, that's not what I had in mind, actually. DB> and has to remember the tunables & their data types. Not pushing the actual value names up to the caller is a reasonable abstraction, I suppose. It seems like in the future QEMU may want to use one type of memory limit, where LXC wants another. I can see LXC domains having multiple memory limits (soft, hard, per process, per group, etc) where those might not be meaningful to QEMU domains. DB> While this kind of interface is applicable for a general purpose DB> API, libvirt's needs are not general purpose. While LXC driver is DB> the only current user, as more controllers are added I anticipate DB> that QEMU driver might use cgroups, eg for I/O controls and CPU DB> schedular controls Indeed. DB> As such I'd expect an API to be at a slightly higher level of DB> abstraction, strongly typed and a single cgroup object associated DB> with a domain object. DB> virDomainObjPtr dom = ...driver's internal domain object... DB> virCGroupPtr cg = virCGroupNew(dom); DB> # Set domain memory limit to 1 GB DB> virCGroupSetMemory(cg, 1024 * 1024 * 1024); DB> # Remove all device permissions DB> virCGroupDenyAllDevices(cg); DB> # Allow access to /dev/null device DB> virCGroupAllowDevice(cg, 3, 1) DB> Some of the existing APIs could be made static since they'd be DB> needed anyway, but some could be re-design to be tailored to DB> the semantics we want. Okay, I'll reswizzle it and re-post. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
This patch adds src/cgroup.{c,h} with support for creating and manipulating cgroups. It's quite naive at the moment, but should provide something to work with to move forward with resource controls.
All groups created with the internal API are forced under $mount/libvirt/ to keep everything together. The first time a group is created, the libvirt directory is also created, and the settings from the root are inherited.
The code scans the mount table to look for the first mount of type cgroup, and assumes that all controllers are mounted there. I think this could/should be updated to prefer a mount with just the controller(s) we want, if there are multiple ones.
If you have the cpuset controller enabled, and cpuset.cpus_exclusive is 1, then all cgroups to be created will fail. Since we probably shouldn't blindly set the root to be non-exclusive, we may also want to consider this condition to be "no cgroup support".
diff -r 444e2614d0a2 -r 8e948eb88328 src/Makefile.am --- a/src/Makefile.am Wed Sep 17 16:07:03 2008 +0000 +++ b/src/Makefile.am Mon Sep 29 09:37:42 2008 -0700 @@ -96,7 +96,8 @@ lxc_conf.c lxc_conf.h \ lxc_container.c lxc_container.h \ lxc_controller.c \ - veth.c veth.h + veth.c veth.h \ + cgroup.c cgroup.h
OPENVZ_DRIVER_SOURCES = \ openvz_conf.c openvz_conf.h \ diff -r 444e2614d0a2 -r 8e948eb88328 src/cgroup.c --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/cgroup.c Mon Sep 29 09:37:42 2008 -0700 @@ -0,0 +1,526 @@ +/* + * cgroup.c: Tools for managing cgroups + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * Dan Smith <danms@us.ibm.com> + */ +#include <config.h> + +#include <stdio.h> +#include <stdint.h> +#include <inttypes.h> +#include <mntent.h> +#include <fcntl.h> +#include <string.h> +#include <errno.h> +#include <stdlib.h> +#include <stdbool.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <libgen.h> + +#include "internal.h" +#include "util.h" +#include "cgroup.h" + +#define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) +#define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) + +struct virCgroup { + char *path; +}; +
There is no support for permissions, is everything run as root?
+void virCgroupFree(virCgroupPtr *group) +{ + if (*group != NULL) { + free((*group)->path); + free(*group); + *group = NULL; + } +} + +static virCgroupPtr cgroup_get_mount(void) +{ + FILE *mounts; + struct mntent entry; + char buf[512];
Is 512 arbitrary? How do we know it is going to be sufficient?
+ virCgroupPtr root = NULL; + + root = calloc(1, sizeof(*root)); + if (root == NULL) + return NULL; + + mounts = fopen("/proc/mounts", "r"); + if (mounts == NULL) { + DEBUG0("Unable to open /proc/mounts: %m"); + goto err; + } + + while (getmntent_r(mounts, &entry, buf, sizeof(buf)) != NULL) { + if (STREQ(entry.mnt_type, "cgroup")) { + root->path = strdup(entry.mnt_dir); + break; + } + } + + if (root->path == NULL) { + DEBUG0("Did not find cgroup mount");
Or strdup failed due to ENOMEM
+ goto err; + } + + fclose(mounts); + + return root; +err: + virCgroupFree(&root); + + return NULL; +} + +int virCgroupHaveSupport(void) +{ + virCgroupPtr root; + + root = cgroup_get_mount(); + if (root == NULL) + return -1; + + virCgroupFree(&root); +
This is quite a horrible way of wasting computation.
+ return 0; +} + +static int cgroup_path_of(const char *grppath, + const char *key, + char **path) +{ + virCgroupPtr root; + int rc = 0; + + root = cgroup_get_mount();
So every routine calls cgroup_path_of(), reads the mounts entry and find a entry for cgroup and returns it, why not do it just once and use it.
+ if (root == NULL) { + rc = -ENOTDIR; + goto out; + } + + if (asprintf(path, "%s/%s/%s", root->path, grppath, key) == -1) + rc = -ENOMEM; +out: + virCgroupFree(&root); + + return rc; +} + +int virCgroupSetValueStr(virCgroupPtr group, + const char *key, + const char *value) +{ + int fd = -1; + int rc = 0; + char *keypath = NULL; + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc != 0) + return rc; + + fd = open(keypath, O_WRONLY);
I see a mix of open and fopen calls.I would prefer to stick to just one, helps with readability.
+ if (fd < 0) { + DEBUG("Unable to open %s: %m", keypath); + rc = -ENOENT; + goto out; + } + + DEBUG("Writing '%s' to '%s'", value, keypath); + + rc = safewrite(fd, value, strlen(value)); + if (rc < 0) { + DEBUG("Failed to write value '%s': %m", value); + rc = -errno; + goto out; + } else if (rc != strlen(value)) { + DEBUG("Short write of value '%s'", value); + rc = -ENOSPC; + goto out; + } + + rc = 0; +out: + free(keypath); + close(fd); + + return rc; +} + +int virCgroupSetValueU64(virCgroupPtr group, + const char *key, + uint64_t value) +{ + char *strval = NULL; + int rc; + + if (asprintf(&strval, "%" PRIu64, value) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, key, strval); + + free(strval); + + return rc; +} + +int virCgroupSetValueI64(virCgroupPtr group, + const char *key, + int64_t value) +{ + char *strval = NULL; + int rc; + + if (asprintf(&strval, "%" PRIi64, value) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, key, strval); + + free(strval); + + return rc; +} + +int virCgroupGetValueStr(virCgroupPtr group, + const char *key, + char **value) +{ + int fd = -1; + int rc; + char *keypath = NULL; + char buf[512]; +
I don't think 512 bytes will be sufficient. THere are multi-line files like memory.stat.
+ memset(buf, 0, sizeof(buf)); + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc != 0) { + DEBUG("No path of %s, %s", group->path, key); + return rc; + } + + fd = open(keypath, O_RDONLY); + if (fd < 0) { + DEBUG("Unable to open %s: %m", keypath); + rc = -ENOENT; + goto out; + } + + rc = saferead(fd, buf, sizeof(buf)); + if (rc < 0) { + DEBUG("Failed to read %s: %m\n", keypath); + rc = -errno; + goto out; + } else if (rc == 0) { + DEBUG("Short read of %s\n", keypath); + rc = -EIO; + goto out; + } + + *value = strdup(buf); + if (*value == NULL) { + rc = -ENOMEM; + goto out; + } + + rc = 0; +out: + free(keypath); + close(fd); + + return rc; +} + +int virCgroupGetValueU64(virCgroupPtr group, + const char *key, + uint64_t *value) +{ + char *strval = NULL; + int rc = 0; + + rc = virCgroupGetValueStr(group, key, &strval); + if (rc != 0) + goto out; + + if (sscanf(strval, "%" SCNu64, value) != 1) + rc = -EINVAL; +out: + free(strval); + + return rc; +} + +int virCgroupGetValueI64(virCgroupPtr group, + const char *key, + int64_t *value) +{ + char *strval = NULL; + int rc = 0; + + rc = virCgroupGetValueStr(group, key, &strval); + if (rc != 0) + goto out; + + if (sscanf(strval, "%" SCNi64, value) != 1) + rc = -EINVAL; +out: + free(strval); + + return rc; +} + +int virCgroupHasValue(virCgroupPtr group, + const char *key) +{ + int rc; + char *keypath = NULL; + + rc = cgroup_path_of(group->path, key, &keypath); + if (rc < 0) + goto out; + + if (access(keypath, F_OK) != 0) + rc = -ENOENT; +out: + free(keypath); + + return rc; +} + +static int _cgroup_inherit(virCgroupPtr parent, + virCgroupPtr group, + const char *key) +{ + int rc; + char *keyval = NULL; + + rc = virCgroupGetValueStr(parent, key, &keyval); + if (rc != 0) + goto out; + + rc = virCgroupSetValueStr(group, key, keyval); + +out: + free(keyval); + + return rc; +} + +static int cgroup_inherit(virCgroupPtr parent, + virCgroupPtr group) +{ + int i; + int rc = 0; + const char *inherit_values[] = { + "cpuset.cpus", + "cpuset.mems", + NULL + }; + + for (i = 0; inherit_values[i] != NULL; i++) { + const char *key = inherit_values[i]; + if (virCgroupHasValue(group, key) == 0) + rc = _cgroup_inherit(parent, group, key); + + if (rc != 0) { + DEBUG("inherit of %s failed\n", key); + break; + } + } + + return rc; +} + +static int cgroup_root(virCgroupPtr *root) +{ + int rc = 0; + char *grppath = NULL; + struct virCgroup _root = { (char *)"." }; + + *root = calloc(1, sizeof(**root)); + if (*root == NULL) { + rc = -ENOMEM; + goto out; + } + + (*root)->path = strdup("libvirt"); + if ((*root)->path == NULL) { + rc = -ENOMEM; + goto out; + } + + rc = cgroup_path_of(".", (*root)->path, &grppath); + if (rc != 0) + goto out; + + if (access(grppath, F_OK) != 0) { + if (mkdir(grppath, 0655) != 0) + rc = -errno; + else + rc = cgroup_inherit(&_root, *root); + }
Yikes... The whole thing looks messy and magical, what is the code trying to do with dots and libvirt and 0655's?
+out: + if (rc != 0) + virCgroupFree(root); + free(grppath); + + return rc; +} + +static int cgroup_new(virCgroupPtr *parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *typpath = NULL; + + if (*parent == NULL) { + rc = cgroup_root(parent); + if (rc != 0) + goto err; + } + + *newgroup = calloc(1, sizeof(**newgroup)); + if (*newgroup == NULL) { + rc = -ENOMEM; + goto err; + } + + rc = asprintf(&((*newgroup)->path), + "%s/%s", + (*parent)->path, + group); + if (rc == -1) { + rc = -ENOMEM; + goto err; + } + + rc = 0; + + return rc; +err: + virCgroupFree(newgroup); + *newgroup = NULL; + + free(typpath); + + return rc; +} + +int virCgroupOpen(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *grppath = NULL; + + rc = cgroup_new(&parent, group, newgroup); + if (rc != 0) + goto err; + virCgroupFree(&parent); + + rc = cgroup_path_of(".", (*newgroup)->path, &grppath); + if (rc != 0) + goto err; + + if (access(grppath, F_OK) != 0) { + rc = -ENOENT; + goto err; + } + + return rc; +err: + virCgroupFree(newgroup); + *newgroup = NULL; + + return rc; +} + +int virCgroupCreate(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup) +{ + int rc = 0; + char *grppath = NULL; + bool free_parent = (parent == NULL); + + rc = cgroup_new(&parent, group, newgroup); + if (rc != 0) { + DEBUG0("Unable to allocate new virCgroup structure"); + goto err; + } + + rc = cgroup_path_of(".", (*newgroup)->path, &grppath); + if (rc != 0) + goto err; + + if (access(grppath, F_OK) == 0) { + rc = -EEXIST; + goto err; + } +
Why do you need access? mkdir will do that check anyway.
+ if (mkdir(grppath, 0655) != 0) {
Why 0655? Why not use meaningful names see sys/stat.h
+ DEBUG("mkdir of %s failed", grppath); + rc = -errno; + goto err; + } + + rc = cgroup_inherit(parent, *newgroup); + if (rc != 0) + goto err; + + free(grppath); + + if (free_parent) + virCgroupFree(&parent); + + return rc; +err: + free(grppath); + virCgroupFree(newgroup); + *newgroup = NULL; + + if (free_parent) + virCgroupFree(&parent); + + return rc; +} + +int virCgroupRemove(virCgroupPtr group) +{ + int rc = 0; + char *grppath = NULL; + + rc = cgroup_path_of(".", group->path, &grppath); + if (rc != 0) + goto out; + + if (rmdir(grppath) != 0) + rc = -errno; +out: + free(grppath); + + return rc; +} + +int virCgroupAddTask(virCgroupPtr group, pid_t pid) +{ + int rc = 0; + char *pidstr = NULL; + + if (asprintf(&pidstr, "%lu", (unsigned long)pid) == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(group, "tasks", pidstr); + + free(pidstr); + + return rc; +} diff -r 444e2614d0a2 -r 8e948eb88328 src/cgroup.h --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/src/cgroup.h Mon Sep 29 09:37:42 2008 -0700 @@ -0,0 +1,120 @@ +/* + * cgroup.h: Interface to tools for managing cgroups + * + * Copyright IBM Corp. 2008 + * + * See COPYING.LIB for the License of this software + * + * Authors: + * Dan Smith <danms@us.ibm.com> + */ + +#ifndef CGROUP_H +#define CGROUP_H + +#include <stdint.h> + +struct virCgroup; +typedef struct virCgroup *virCgroupPtr; + +/** + * virCgroupHaveSupport: + * + * Returns 0 if support is present, negative if not + */ +int virCgroupHaveSupport(void); + +/** + * virCgroupCreate: + * @parent: The path to the parent of the desired new group, or NULL if root + * @group: The name of the new group + * @newgroup: The newly-created group. Must be passed to cgroup_free() + * + * Returns 0 on success, negative on failure + */ +int virCgroupCreate(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup); + +/** + * virCgroupOpen: + * @parent: The path to the parent of the desired new group, or NULL if root + * @group: The name of the group + * @newgroup: The group. Must be passed to cgroup_free() + * + * Returns 0 on success, negative on failure + */ +int virCgroupOpen(virCgroupPtr parent, + const char *group, + virCgroupPtr *newgroup); + +/** + * virCgroupFree: + * @group: A pointer to the virCgroupPtr to be freed + */ +void virCgroupFree(virCgroupPtr *group); + +/** + * virCgroupRemove + * @parent: The path to the parent of the group to be deleted + * @group: The name of the group to be removed + * + * Returns 0 on success, negative on failure + */ +int virCgroupRemove(virCgroupPtr group); + +/** + * virCgroupHasValue: + * @group: The scoping cgroup + * @key: The cgrop node to test for + * + * Returns 0 if present, negative if not + */ +int virCgroupHasValue(virCgroupPtr group, const char *key); + +/** + * virCgroupAddTask: + * @group: The scoping cgroup + * @pid: The pid of the task to add + * + * Returns 0 on success, negative on failure + */ +int virCgroupAddTask(virCgroupPtr group, pid_t pid); + +/** + * virCgroupSetValueXXX: + * @group: The scoping cgroup + * @key: The cgroup node to set + * @value: The value to write into the @key node + * + * Returns 0 on success, negative on failure + */ +int virCgroupSetValueStr(virCgroupPtr group, + const char *key, + const char *value); +int virCgroupSetValueU64(virCgroupPtr group, + const char *key, + uint64_t value); +int virCgroupSetValueI64(virCgroupPtr group, + const char *key, + int64_t value); + +/** + * virCgroupGetValueXXX: + * @group: The scoping cgroup + * @key: The cgroup node to read + * @value: A pointer to the storage area + * + * Returns 0 on success, negative on failure + */ +int virCgroupGetValueStr(virCgroupPtr group, + const char *key, + char **value); +int virCgroupGetValueU64(virCgroupPtr group, + const char *key, + uint64_t *value); +int virCgroupGetValueI64(virCgroupPtr group, + const char *key, + int64_t *value); + +#endif /* CGROUP_H */
-- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- Balbir

BS> There is no support for permissions, is everything run as root? The LXC driver must run as root, yes. BS> Is 512 arbitrary? How do we know it is going to be sufficient? This should probably be a constant, but it's more than enough for any of the values we're setting. Defining it specifically will help make that clear. BS> So every routine calls cgroup_path_of(), reads the mounts entry BS> and find a entry for cgroup and returns it, why not do it just BS> once and use it. Because maintaining the (potentially stale) state isn't worth the small amount of work necessary to parse the mount table on each domain creation (IMHO). BS> I see a mix of open and fopen calls.I would prefer to stick to BS> just one, helps with readability. You mean because I use open() and getmntent_r() wants a FILE*? I think I'm willing to live with that :) BS> I don't think 512 bytes will be sufficient. THere are multi-line BS> files like memory.stat. We don't read memory.stat (or any other values at the moment). However: % for i in $(find /cgroup | grep memory.stat); do cat $i | wc -c; done 53 53 53 98 I don't see any point in allocating a huge amount of memory on the stack each time if we're not going to use it. If we need to read larger values in the future, we could move to an allocated string. BS> Why do you need access? mkdir will do that check anyway. Yep, I'll change it. BS> Why 0655? Why not use meaningful names see sys/stat.h I don't think that the meaningful names in sys/stat.h lead to better readability, personally. Octal permissions are used elsewhere int the code pretty extensively, so I'm comfortable with this as it is. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

This patch adds code to the controller to set up a cgroup called libvirt/lxc/$name, set the memory limit, and restrict devices. It also adds bits to lxc_driver to properly clean up the cgroup on domain death. If virCgroupHaveSupport() says that no support is available, then we just allow the domain creation to proceed as it did before without resource controls in place. diff -r 8e948eb88328 -r f452190ac168 src/Makefile.am --- a/src/Makefile.am Mon Sep 29 09:37:42 2008 -0700 +++ b/src/Makefile.am Mon Sep 29 09:37:43 2008 -0700 @@ -90,7 +90,8 @@ lxc_conf.c lxc_conf.h \ lxc_container.c lxc_container.h \ lxc_driver.c lxc_driver.h \ - veth.c veth.h + veth.c veth.h \ + cgroup.c cgroup.h LXC_CONTROLLER_SOURCES = \ lxc_conf.c lxc_conf.h \ diff -r 8e948eb88328 -r f452190ac168 src/lxc_controller.c --- a/src/lxc_controller.c Mon Sep 29 09:37:42 2008 -0700 +++ b/src/lxc_controller.c Mon Sep 29 09:37:43 2008 -0700 @@ -42,12 +42,125 @@ #include "veth.h" #include "memory.h" #include "util.h" - +#include "cgroup.h" #define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) #define DEBUG0(msg) VIR_DEBUG(__FILE__, "%s", msg) int debugFlag = 0; + +struct cgroup_device_policy { + char type; + int major; + int minor; +}; + +static int lxcSetMemory(virDomainDefPtr def, virCgroupPtr cgroup) +{ + return virCgroupSetValueU64(cgroup, + "memory.limit_in_bytes", + (uint64_t)(def->maxmem << 10)); +} + +static int lxcSetDevices(virDomainDefPtr def, virCgroupPtr cgroup) +{ + int rc; + int i; + struct cgroup_device_policy devices[] = { + {'c', 1, 3}, + {'c', 1, 5}, + {'c', 1, 7}, + {'c', 1, 8}, + {'c', 1, 9}, + {'c', 5, 1}, + {0, 0, 0}}; + + rc = virCgroupSetValueStr(cgroup, + "devices.deny", + "a"); + if (rc != 0) + return rc; + + for (i = 0; devices[i].type != 0; i++) { + char *devstr = NULL; + + rc = asprintf(&devstr, + "%c %i:%i rwm", + devices[i].type, + devices[i].major, + devices[i].minor); + if (rc == -1) + return -ENOMEM; + + rc = virCgroupSetValueStr(cgroup, "devices.allow", devstr); + free(devstr); + + if (rc != 0) + return rc; + } + + return 0; +} + +static int lxcCreateTypeGroup(virCgroupPtr *tgrp) +{ + int ret; + + if (virCgroupOpen(NULL, "lxc", tgrp) != 0) { + ret = virCgroupCreate(NULL, "lxc", tgrp); + if (ret != 0) + fprintf(stderr, "Failed to create lxc: %s\n", strerror(-ret)); + } + + return ret; +} + +/** + * lxcSetContainerResources + * @def: pointer to virtual machine structure + * + * Creates a cgroup for the container, moves the task inside, + * and sets resource limits + * + * Returns 0 on success or -1 in case of error + */ +static int lxcSetContainerResources(virDomainDefPtr def) +{ + virCgroupPtr lxcgroup; + virCgroupPtr cgroup; + int rc = -1; + + if (virCgroupHaveSupport() != 0) + return 0; /* Not supported, so claim success */ + + rc = lxcCreateTypeGroup(&lxcgroup); + if (rc != 0) + goto out; + + rc = virCgroupCreate(lxcgroup, def->name, &cgroup); + if (rc != 0) { + fprintf(stderr, "Unable to create cgroup for %s\n", def->name); + goto out; + } + + rc = lxcSetMemory(def, cgroup); + if (rc != 0) + goto out; + + rc = lxcSetDevices(def, cgroup); + if (rc != 0) + goto out; + + rc = virCgroupAddTask(cgroup, getpid()); +out: + virCgroupFree(&lxcgroup); + virCgroupFree(&cgroup); + + if (rc != 0) + fprintf(stderr, "Failed to set lxc resources: %s\n", strerror(-rc)); + + return rc; +} static char*lxcMonitorPath(virDomainDefPtr def) { @@ -394,6 +507,9 @@ if (lxcControllerMoveInterfaces(nveths, veths, container) < 0) goto cleanup; + if (lxcSetContainerResources(def) < 0) + goto cleanup; + if (lxcContainerSendContinue(control[0]) < 0) goto cleanup; @@ -412,6 +528,7 @@ kill(container, SIGTERM); waitpid(container, NULL, 0); } + return rc; } diff -r 8e948eb88328 -r f452190ac168 src/lxc_driver.c --- a/src/lxc_driver.c Mon Sep 29 09:37:42 2008 -0700 +++ b/src/lxc_driver.c Mon Sep 29 09:37:43 2008 -0700 @@ -43,7 +43,7 @@ #include "bridge.h" #include "veth.h" #include "event.h" - +#include "cgroup.h" /* debug macros */ #define DEBUG(fmt,...) VIR_DEBUG(__FILE__, fmt, __VA_ARGS__) @@ -376,6 +376,8 @@ int waitRc; int childStatus = -1; virDomainNetDefPtr net; + virCgroupPtr lxcgroup; + virCgroupPtr cgroup; while (((waitRc = waitpid(vm->pid, &childStatus, 0)) == -1) && errno == EINTR) @@ -408,6 +410,19 @@ for (net = vm->def->nets; net; net = net->next) { vethInterfaceUpOrDown(net->ifname, 0); vethDelete(net->ifname); + } + + rc = virCgroupOpen(NULL, "lxc", &lxcgroup); + if (rc == 0) { + rc = virCgroupOpen(lxcgroup, vm->def->name, &cgroup); + if (rc == 0) { + virCgroupRemove(cgroup); + virCgroupFree(&cgroup); + } else { + DEBUG("Unable to find cgroup %s to remove", vm->def->name); + } + } else { + DEBUG0("Unable to find lxc cgroup to remove"); } return rc;

On Mon, Sep 29, 2008 at 09:40:43AM -0700, Dan Smith wrote:
This patch adds code to the controller to set up a cgroup called libvirt/lxc/$name, set the memory limit, and restrict devices. It also adds bits to lxc_driver to properly clean up the cgroup on domain death.
If virCgroupHaveSupport() says that no support is available, then we just allow the domain creation to proceed as it did before without resource controls in place.
This is very nice & simple patch - my main comment is it would be even simpler if the cgroup.h API exposed higher level semantically targetted APIs, rather than general purposes setters/getters. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Dan Smith wrote:
This patch set adds basic cgroup support to the LXC driver. It consists of a small internal cgroup manipulation API, as well as changes to the driver itself to utilize the support. Currently, we just set a memory limit and the allowed devices list. The cgroup.{c,h} interface can be easily redirected to libcgroup in the future if and when the decision to move in that direction is made.
Some discussion on the following points is probably warranted, to help determine how deep we want to go with this internal implementation, in terms' of supporting complex system configurations, etc.
- What to do if controllers are mounted in multiple places
For all practical purposes, it is not possible to mount all controllers at the same place. Consider a simple case of "ns", if the ns controller is mounted, you need root permissions to create new groups, which defeats the whole purpose of the cgroup filesystem and assigning permissions, so that an application can create groups on it own.
- What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior
These need to be fixed as well. -- Balbir

BS> For all practical purposes, it is not possible to mount all BS> controllers at the same place. Consider a simple case of "ns", if BS> the ns controller is mounted, you need root permissions to create BS> new groups, which defeats the whole purpose of the cgroup BS> filesystem and assigning permissions, so that an application can BS> create groups on it own. I don't think I'd go so far as saying that it "defeats the whole purpose", but I understand your point. After just a small amount of playing around, it seems like it might be reasonable to just mount the controllers we care about somewhere just for libvirt.
- What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior
BS> These need to be fixed as well. ...that's why I pointed them out :) I'm thinking that mounting the controllers we care about at daemon startup (as mentioned above) would solve both of these issues as well. Does anyone have an opinion on taking that approach? -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

On Tue, Sep 30, 2008 at 11:11:57AM -0700, Dan Smith wrote:
BS> For all practical purposes, it is not possible to mount all BS> controllers at the same place. Consider a simple case of "ns", if BS> the ns controller is mounted, you need root permissions to create BS> new groups, which defeats the whole purpose of the cgroup BS> filesystem and assigning permissions, so that an application can BS> create groups on it own.
I don't think I'd go so far as saying that it "defeats the whole purpose", but I understand your point.
After just a small amount of playing around, it seems like it might be reasonable to just mount the controllers we care about somewhere just for libvirt.
- What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior
BS> These need to be fixed as well.
...that's why I pointed them out :)
I'm thinking that mounting the controllers we care about at daemon startup (as mentioned above) would solve both of these issues as well.
Does anyone have an opinion on taking that approach?
The trouble is then libvirt would be dictating policy to the host admin, because once you mount a particular controller, you can't change the wayu its mounted. So if libvirt mounted each controller separately, then the admin couldn't have a mount with multiple controllers active, and vica-verca. The kernel cgroups interface really sucks in this regard :-( At the same time having the controllers mounted is mandatory for libvirt to work and asking the admin to set things up manually also sucks. So perhaps we'll need to mount them automatically, but make this behaviuour configurable in some way, so admin can override it Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

DB> The trouble is then libvirt would be dictating policy to the host DB> admin, because once you mount a particular controller, you can't DB> change the wayu its mounted. So if libvirt mounted each controller DB> separately, then the admin couldn't have a mount with multiple DB> controllers active, and vica-verca. Oh, I see. I had left that out of my quick test. I had assumed that it would behave as you would expect. DB> The kernel cgroups interface really sucks in this regard :-( I was going to go with "surprisingly unideal" ...but yeah. DB> At the same time having the controllers mounted is mandatory for DB> libvirt to work and asking the admin to set things up manually DB> also sucks. So perhaps we'll need to mount them automatically, but DB> make this behaviuour configurable in some way, so admin can DB> override it Perhaps we can: - Have a list of controllers we use (memory and devices so far) - Create each group in all mounts required to satisfy our necessary controllers - Select the appropriate mount when setting a cont.key value It will muck things up a bit, but I think it might be doable. -- Dan Smith IBM Linux Technology Center Open Hypervisor Team email: danms@us.ibm.com

Dan Smith wrote:
DB> The trouble is then libvirt would be dictating policy to the host DB> admin, because once you mount a particular controller, you can't DB> change the wayu its mounted. So if libvirt mounted each controller DB> separately, then the admin couldn't have a mount with multiple DB> controllers active, and vica-verca.
Oh, I see. I had left that out of my quick test. I had assumed that it would behave as you would expect.
DB> The kernel cgroups interface really sucks in this regard :-(
I was going to go with "surprisingly unideal" ...but yeah.
The interface, when it was designed was designed to allow flexibility of separating controllers. One might need different resources for tasks, they should not be forced to share the same set of controllers. Cgroups has the notion of busy (as in no new groups are created underneath), so it needs to be not busy for changing the way it is mounted. This has made our life while working on libcgroup very hard. The other thing that gets hard is controller interplay and rules. CPUsets for example has rules about not allowing tasks to attach without adding cpus and mems and other rules about exclusivity and having certain files just in the root.
DB> At the same time having the controllers mounted is mandatory for DB> libvirt to work and asking the admin to set things up manually DB> also sucks. So perhaps we'll need to mount them automatically, but DB> make this behaviuour configurable in some way, so admin can DB> override it
Perhaps we can:
- Have a list of controllers we use (memory and devices so far) - Create each group in all mounts required to satisfy our necessary controllers - Select the appropriate mount when setting a cont.key value
I am not sure how libvirt provides thread safety, but I did not see any explicit coding for that?
It will muck things up a bit, but I think it might be doable.
I would really recommend looking at libcgroup in the long run and using it. -- Balbir

On Wed, Oct 01, 2008 at 08:41:19AM +0530, Balbir Singh wrote:
Dan Smith wrote:
DB> At the same time having the controllers mounted is mandatory for DB> libvirt to work and asking the admin to set things up manually DB> also sucks. So perhaps we'll need to mount them automatically, but DB> make this behaviuour configurable in some way, so admin can DB> override it
Perhaps we can:
- Have a list of controllers we use (memory and devices so far) - Create each group in all mounts required to satisfy our necessary controllers - Select the appropriate mount when setting a cont.key value
I am not sure how libvirt provides thread safety, but I did not see any explicit coding for that?
The thread safety model for libvirt has two levels - A single virConnectPtr object must only be used by one thread. If you have multiple threads, you must provide each with its own conenct object - Within a stateless driver (Xen, OpenVZ, Test), there is no shared state between virConnectPtr objects, so there are no thread issues in this respect - With a stateful driver, the libvirtd daemon ensures that only a single thread is active at once, so against there are no thread issues there either. Now, in a short while I will be making the daemon fully-multithreaded. When this happens, the stateful drivers will be required to maintain mutexes for locking. The locking model wil have 2 levels, one lock over the driver as a whole. This is held only while acquiring a lock against the object being modified (eg the virtual domain object). Each virtual domain, lives in one cgroup, so there is a single virCGroup object associated with each domain. the virCGroup object state is seflf contained, so independant virCGroup objects can be accessed concurrently from multiple threads, without any threads safety issues. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange wrote:
On Wed, Oct 01, 2008 at 08:41:19AM +0530, Balbir Singh wrote:
Dan Smith wrote:
DB> At the same time having the controllers mounted is mandatory for DB> libvirt to work and asking the admin to set things up manually DB> also sucks. So perhaps we'll need to mount them automatically, but DB> make this behaviuour configurable in some way, so admin can DB> override it
Perhaps we can:
- Have a list of controllers we use (memory and devices so far) - Create each group in all mounts required to satisfy our necessary controllers - Select the appropriate mount when setting a cont.key value
I am not sure how libvirt provides thread safety, but I did not see any explicit coding for that?
The thread safety model for libvirt has two levels
- A single virConnectPtr object must only be used by one thread. If you have multiple threads, you must provide each with its own conenct object
- Within a stateless driver (Xen, OpenVZ, Test), there is no shared state between virConnectPtr objects, so there are no thread issues in this respect
- With a stateful driver, the libvirtd daemon ensures that only a single thread is active at once, so against there are no thread issues there either.
Now, in a short while I will be making the daemon fully-multithreaded. When this happens, the stateful drivers will be required to maintain mutexes for locking. The locking model wil have 2 levels, one lock over the driver as a whole. This is held only while acquiring a lock against the object being modified (eg the virtual domain object).
Each virtual domain, lives in one cgroup, so there is a single virCGroup object associated with each domain. the virCGroup object state is seflf contained, so independant virCGroup objects can be accessed concurrently from multiple threads, without any threads safety issues.
Thanks, that was quite insightful. -- Balbir

Daniel P. Berrange wrote:
On Tue, Sep 30, 2008 at 11:11:57AM -0700, Dan Smith wrote:
BS> For all practical purposes, it is not possible to mount all BS> controllers at the same place. Consider a simple case of "ns", if BS> the ns controller is mounted, you need root permissions to create BS> new groups, which defeats the whole purpose of the cgroup BS> filesystem and assigning permissions, so that an application can BS> create groups on it own.
I don't think I'd go so far as saying that it "defeats the whole purpose", but I understand your point.
After just a small amount of playing around, it seems like it might be reasonable to just mount the controllers we care about somewhere just for libvirt.
- What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior BS> These need to be fixed as well.
...that's why I pointed them out :)
I'm thinking that mounting the controllers we care about at daemon startup (as mentioned above) would solve both of these issues as well.
Does anyone have an opinion on taking that approach?
The trouble is then libvirt would be dictating policy to the host admin, because once you mount a particular controller, you can't change the wayu its mounted. So if libvirt mounted each controller separately, then the admin couldn't have a mount with multiple controllers active, and vica-verca. The kernel cgroups interface really sucks in this regard :-(
At the same time having the controllers mounted is mandatory for libvirt to work and asking the admin to set things up manually also sucks. So perhaps we'll need to mount them automatically, but make this behaviuour configurable in some way, so admin can override it
As I mentioned in my previous email, one could use the cgconfigparser to automatically mount the controllers at initscripts time and then also use a policy to automatically classify tasks. -- Balbir

Dan Smith wrote:
BS> For all practical purposes, it is not possible to mount all BS> controllers at the same place. Consider a simple case of "ns", if BS> the ns controller is mounted, you need root permissions to create BS> new groups, which defeats the whole purpose of the cgroup BS> filesystem and assigning permissions, so that an application can BS> create groups on it own.
I don't think I'd go so far as saying that it "defeats the whole purpose", but I understand your point.
After just a small amount of playing around, it seems like it might be reasonable to just mount the controllers we care about somewhere just for libvirt.
I think it would be better to query for mount points as to where the controller is currently mounted and use that information.
- What to do if memory and device controllers aren't present - What to do if the root group is set for exclusive cpuset behavior
BS> These need to be fixed as well.
...that's why I pointed them out :)
I'm thinking that mounting the controllers we care about at daemon startup (as mentioned above) would solve both of these issues as well.
Does anyone have an opinion on taking that approach?
We have a configuration parser in libcgroup, called cgconfigparser, that can parse a configuration file and automatically setup groups and mount points. We also have a PAM plugin (pam_cgroup) and other methods to classify the task to the correct group and an API to query where the task is currently classified. The administrator can setup simple rules to move tasks to the correct group (depending on what sort of resources need to be provided to the task). -- Balbir
participants (3)
-
Balbir Singh
-
Dan Smith
-
Daniel P. Berrange