[libvirt] [PATCH v8 0/7] file descriptor passing using fd sets

libvirt's sVirt security driver provides SELinux MAC isolation for Qemu guest processes and their corresponding image files. In other words, sVirt uses SELinux to prevent a QEMU process from opening files that do not belong to it. sVirt provides this support by labeling guests and resources with security labels that are stored in file system extended attributes. Some file systems, such as NFS, do not support the extended attribute security namespace, and therefore cannot support sVirt isolation. A solution to this problem is to provide fd passing support, where libvirt opens files and passes file descriptors to QEMU. This, along with SELinux policy to prevent QEMU from opening files, can provide image file isolation for NFS files stored on the same NFS mount. This patch series adds the add-fd, remove-fd, and query-fdsets QMP monitor commands, which allow file descriptors to be passed via SCM_RIGHTS, and assigned to specified fd sets. This allows fd sets to be created per file with fds having, for example, different access rights. When QEMU needs to reopen a file with different access rights, it can search for a matching fd in the fd set. Fd sets also allow for easy tracking of fds per file, helping to prevent fd leaks. Support is also added to the block layer to allow QEMU to dup an fd from an fdset when the filename is of the /dev/fdset/nnn format, where nnn is the fd set ID. No new SELinux policy is required to prevent open of NFS files (files with type nfs_t). The virt_use_nfs boolean type simply needs to be set to false, and open will be prevented (and dup will be allowed). For example: # setsebool virt_use_nfs 0 # getsebool virt_use_nfs virt_use_nfs --> off Corey Bryant (7): qemu-char: Add MSG_CMSG_CLOEXEC flag to recvmsg qapi: Introduce add-fd, remove-fd, query-fdsets monitor: Clean up fd sets on monitor disconnect block: Prevent detection of /dev/fdset/ as floppy block: Convert open calls to qemu_open block: Convert close calls to qemu_close block: Enable qemu_open/close to work with fd sets block/raw-posix.c | 46 +++++---- block/raw-win32.c | 6 +- block/vdi.c | 5 +- block/vmdk.c | 25 ++--- block/vpc.c | 4 +- block/vvfat.c | 16 +-- cutils.c | 5 + monitor.c | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++++ monitor.h | 5 + osdep.c | 117 +++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++ qemu-char.c | 12 ++- qemu-common.h | 2 + qemu-tool.c | 20 ++++ qmp-commands.hx | 117 +++++++++++++++++++++ savevm.c | 4 +- 16 files changed, 721 insertions(+), 55 deletions(-) -- 1.7.10.4

Set the close-on-exec flag for the file descriptor received via SCM_RIGHTS. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v4 -This patch is new in v4 (eblake@redhat.com) v5 -Fallback to FD_CLOEXEC if MSG_CMSG_CLOEXEC is not available (eblake@redhat.com, stefanha@linux.vnet.ibm.com) v6 -Set cloexec on correct fd (eblake@redhat.com) v7-v8 -No changes qemu-char.c | 12 ++++++++++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/qemu-char.c b/qemu-char.c index c2aaaee..ab4a928 100644 --- a/qemu-char.c +++ b/qemu-char.c @@ -2238,6 +2238,9 @@ static void unix_process_msgfd(CharDriverState *chr, struct msghdr *msg) if (fd < 0) continue; +#ifndef MSG_CMSG_CLOEXEC + qemu_set_cloexec(fd); +#endif if (s->msgfd != -1) close(s->msgfd); s->msgfd = fd; @@ -2253,6 +2256,7 @@ static ssize_t tcp_chr_recv(CharDriverState *chr, char *buf, size_t len) struct cmsghdr cmsg; char control[CMSG_SPACE(sizeof(int))]; } msg_control; + int flags = 0; ssize_t ret; iov[0].iov_base = buf; @@ -2263,9 +2267,13 @@ static ssize_t tcp_chr_recv(CharDriverState *chr, char *buf, size_t len) msg.msg_control = &msg_control; msg.msg_controllen = sizeof(msg_control); - ret = recvmsg(s->fd, &msg, 0); - if (ret > 0 && s->is_unix) +#ifdef MSG_CMSG_CLOEXEC + flags |= MSG_CMSG_CLOEXEC; +#endif + ret = recvmsg(s->fd, &msg, flags); + if (ret > 0 && s->is_unix) { unix_process_msgfd(chr, &msg); + } return ret; } -- 1.7.10.4

This patch adds support that enables passing of file descriptors to the QEMU monitor where they will be stored in specified file descriptor sets. A file descriptor set can be used by a client like libvirt to store file descriptors for the same file. This allows the client to open a file with different access modes (O_RDWR, O_WRONLY, O_RDONLY) and add/remove the passed fds to/from an fd set as needed. This will allow QEMU to (in a later patch in this series) "open" and "reopen" the same file by dup()ing the fd in the fd set that corresponds to the file, where the fd has the matching access mode flag that QEMU requests. The new QMP commands are: add-fd: Add a file descriptor to an fd set remove-fd: Remove a file descriptor from an fd set query-fdsets: Return information describing all fd sets Note: These commands are not compatible with the existing getfd and closefd QMP commands. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v5: -This patch is new in v5 and replaces the pass-fd QMP command from v4. -By grouping fds in fd sets, we ease managability with an fd set per file, addressing concerns raised in v4 about handling "reopens" and preventing fd leakage. (eblake@redhat.com, kwolf@redhat.com, dberrange@redhat.com) v6 -Make @fd optional for remove-fd (eblake@redhat.com) -Make @fdset-id optional for add-fd (eblake@redhat.com) v7: -Share fd sets among all monitor connections (kwolf@redhat.com) -Added mon_refcount to keep track of monitor connection count. v8: -Add opaque string to add-fd/query-fdsets. (stefanha@linux.vnet.ibm.com) -Use camel case for structures. (stefanha@linux.vnet.ibm.com) -Don't return in-use and refcount from query-fdsets. (stefanha@linux.vnet.ibm.com) -Don't return removed fd's from query-fdsets. (stefanha@linux.vnet.ibm.com) -Use fdset-id rather than fdset_id. (eblake@redhat.com) -Fix fd leak in qmp_add_fd(). (stefanha@linux.vnet.ibm.com) -Update QMP errors. (stefanha@linux.vnet.ibm.com, eblake@redhat.com) monitor.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++++++++++++ qmp-commands.hx | 117 +++++++++++++++++++++++++++++++++ 3 files changed, 403 insertions(+) diff --git a/monitor.c b/monitor.c index 49dccfe..f9a0577 100644 --- a/monitor.c +++ b/monitor.c @@ -140,6 +140,24 @@ struct mon_fd_t { QLIST_ENTRY(mon_fd_t) next; }; +/* file descriptor associated with a file descriptor set */ +typedef struct MonFdsetFd MonFdsetFd; +struct MonFdsetFd { + int fd; + bool removed; + char *opaque; + QLIST_ENTRY(MonFdsetFd) next; +}; + +/* file descriptor set containing fds passed via SCM_RIGHTS */ +typedef struct MonFdset MonFdset; +struct MonFdset { + int64_t id; + int refcount; + QLIST_HEAD(, MonFdsetFd) fds; + QLIST_ENTRY(MonFdset) next; +}; + typedef struct MonitorControl { QObject *id; JSONMessageParser parser; @@ -211,6 +229,8 @@ static inline int mon_print_count_get(const Monitor *mon) { return 0; } #define QMP_ACCEPT_UNKNOWNS 1 static QLIST_HEAD(mon_list, Monitor) mon_list; +static QLIST_HEAD(mon_fdsets, MonFdset) mon_fdsets; +static int mon_refcount; static mon_cmd_t mon_cmds[]; static mon_cmd_t info_cmds[]; @@ -2389,6 +2409,174 @@ int monitor_get_fd(Monitor *mon, const char *fdname) return -1; } +static void monitor_fdset_cleanup(MonFdset *mon_fdset) +{ + MonFdsetFd *mon_fdset_fd; + MonFdsetFd *mon_fdset_fd_next; + + if (mon_fdset->refcount != 0) { + return; + } + + QLIST_FOREACH_SAFE(mon_fdset_fd, &mon_fdset->fds, next, mon_fdset_fd_next) { + if (mon_refcount == 0 || mon_fdset_fd->removed) { + close(mon_fdset_fd->fd); + g_free(mon_fdset_fd->opaque); + QLIST_REMOVE(mon_fdset_fd, next); + g_free(mon_fdset_fd); + } + } + + if (QLIST_EMPTY(&mon_fdset->fds)) { + QLIST_REMOVE(mon_fdset, next); + g_free(mon_fdset); + } +} + +AddfdInfo *qmp_add_fd(bool has_fdset_id, int64_t fdset_id, bool has_opaque, + const char *opaque, Error **errp) +{ + int fd; + Monitor *mon = cur_mon; + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + AddfdInfo *fdinfo; + + fd = qemu_chr_fe_get_msgfd(mon->chr); + if (fd == -1) { + error_set(errp, QERR_FD_NOT_SUPPLIED); + goto error; + } + + if (has_fdset_id) { + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id == fdset_id) { + break; + } + } + if (mon_fdset == NULL) { + error_set(errp, QERR_INVALID_PARAMETER_VALUE, "fdset-id", + "an existing fdset-id or no fdset-id"); + goto error; + } + } else { + int64_t fdset_id_prev = -1; + MonFdset *mon_fdset_cur = QLIST_FIRST(&mon_fdsets); + + /* Use first available fdset ID */ + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + mon_fdset_cur = mon_fdset; + if (fdset_id_prev == mon_fdset_cur->id - 1) { + fdset_id_prev = mon_fdset_cur->id; + continue; + } + break; + } + + mon_fdset = g_malloc0(sizeof(*mon_fdset)); + mon_fdset->id = fdset_id_prev + 1; + mon_fdset->refcount = 0; + + /* The fdset list is ordered by fdset ID */ + if (mon_fdset->id == 0) { + QLIST_INSERT_HEAD(&mon_fdsets, mon_fdset, next); + } else if (mon_fdset->id < mon_fdset_cur->id) { + QLIST_INSERT_BEFORE(mon_fdset_cur, mon_fdset, next); + } else { + QLIST_INSERT_AFTER(mon_fdset_cur, mon_fdset, next); + } + } + + mon_fdset_fd = g_malloc0(sizeof(*mon_fdset_fd)); + mon_fdset_fd->fd = fd; + mon_fdset_fd->removed = false; + if (has_opaque) { + mon_fdset_fd->opaque = g_strdup(opaque); + } + QLIST_INSERT_HEAD(&mon_fdset->fds, mon_fdset_fd, next); + + fdinfo = g_malloc0(sizeof(*fdinfo)); + fdinfo->fdset_id = mon_fdset->id; + fdinfo->fd = mon_fdset_fd->fd; + + return fdinfo; + +error: + if (fd != -1) { + close(fd); + } + return NULL; +} + +void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + char fd_str[20]; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (has_fd && mon_fdset_fd->fd != fd) { + continue; + } + mon_fdset_fd->removed = true; + if (has_fd) { + break; + } + } + monitor_fdset_cleanup(mon_fdset); + return; + } + snprintf(fd_str, sizeof(fd_str), "%" PRId64, fd); + error_set(errp, QERR_FD_NOT_FOUND, fd_str); +} + +FdsetInfoList *qmp_query_fdsets(Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + FdsetInfoList *fdset_list = NULL; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + FdsetInfoList *fdset_info = g_malloc0(sizeof(*fdset_info)); + FdsetFdInfoList *fdsetfd_list = NULL; + + fdset_info->value = g_malloc0(sizeof(*fdset_info->value)); + fdset_info->value->fdset_id = mon_fdset->id; + + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + FdsetFdInfoList *fdsetfd_info; + + if (mon_fdset_fd->removed) { + continue; + } + + fdsetfd_info = g_malloc0(sizeof(*fdsetfd_info)); + fdsetfd_info->value = g_malloc0(sizeof(*fdsetfd_info->value)); + fdsetfd_info->value->fd = mon_fdset_fd->fd; + if (mon_fdset_fd->opaque) { + fdsetfd_info->value->has_opaque = true; + fdsetfd_info->value->opaque = g_strdup(mon_fdset_fd->opaque); + } else { + fdsetfd_info->value->has_opaque = false; + } + + fdsetfd_info->next = fdsetfd_list; + fdsetfd_list = fdsetfd_info; + } + + fdset_info->value->fds = fdsetfd_list; + + fdset_info->next = fdset_list; + fdset_list = fdset_info; + } + + return fdset_list; +} + /* mon_cmds and info_cmds would be sorted at runtime */ static mon_cmd_t mon_cmds[] = { #include "hmp-commands.h" diff --git a/qapi-schema.json b/qapi-schema.json index bd9c450..e13dc7f 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -2199,3 +2199,101 @@ # Since: 0.14.0 ## { 'command': 'closefd', 'data': {'fdname': 'str'} } + +# @AddfdInfo: +# +# Information about a file descriptor that was added to an fd set. +# +# @fdset-id: The ID of the fd set that @fd was added to. +# +# @fd: The file descriptor that was received via SCM rights and +# added to the fd set. +# +# Since: 1.2.0 +## +{ 'type': 'AddfdInfo', 'data': {'fdset-id': 'int', 'fd': 'int'} } + +## +# @add-fd: +# +# Add a file descriptor, that was passed via SCM rights, to an fd set. +# +# @fdset-id: #optional The ID of the fd set to add the file descriptor to. +# +# @opaque: #optional A free-form string that can be used to describe the fd. +# +# Returns: @AddfdInfo on success +# If file descriptor was not received, FdNotSupplied +# If @fdset_id does not exist, InvalidParameterValue +# +# Notes: The list of fd sets is shared by all monitor connections. +# +# If @fdset-id is not specified, a new fd set will be created. +# +# Since: 1.2.0 +## +{ 'command': 'add-fd', 'data': {'*fdset-id': 'int', '*opaque': 'str'}, + 'returns': 'AddfdInfo' } + +## +# @remove-fd: +# +# Remove a file descriptor from an fd set. +# +# @fdset-id: The ID of the fd set that the file descriptor belongs to. +# +# @fd: #optional The file descriptor that is to be removed. +# +# Returns: Nothing on success +# If @fdset_id or @fd is not found, FdNotFound +# +# Since: 1.2.0 +# +# Notes: The list of fd sets is shared by all monitor connections. +# +# If @fd is not specified, all file descriptors in @fdset-id +# will be removed. +## +{ 'command': 'remove-fd', 'data': {'fdset-id': 'int', '*fd': 'int'} } + +## +# @FdsetFdInfo: +# +# Information about a file descriptor that belongs to an fd set. +# +# @fd: The file descriptor value. +# +# @opaque: #optional A free-form string that can be used to describe the fd. +# +# Since: 1.2.0 +## +{ 'type': 'FdsetFdInfo', + 'data': {'fd': 'int', '*opaque': 'str'} } + +## +# @FdsetInfo: +# +# Information about an fd set. +# +# @fdset-id: The ID of the fd set. +# +# @fds: A list of file descriptors that belong to this fd set. +# +# Since: 1.2.0 +## +{ 'type': 'FdsetInfo', + 'data': {'fdset-id': 'int', 'fds': ['FdsetFdInfo']} } + +## +# @query-fdsets: +# +# Return information describing all fd sets. +# +# Returns: A list of @FdsetInfo +# +# Since: 1.2.0 +# +# Note: The list of fd sets is shared by all monitor connections. +# +## +{ 'command': 'query-fdsets', 'returns': ['FdsetInfo'] } diff --git a/qmp-commands.hx b/qmp-commands.hx index ac46638..2160701 100644 --- a/qmp-commands.hx +++ b/qmp-commands.hx @@ -926,6 +926,123 @@ Example: EQMP + { + .name = "add-fd", + .args_type = "fdset-id:i?,opaque:s?", + .params = "add-fd fdset-id opaque", + .help = "Add a file descriptor, that was passed via SCM rights, to an fd set", + .mhandler.cmd_new = qmp_marshal_input_add_fd, + }, + +SQMP +add-fd +------- + +Add a file descriptor, that was passed via SCM rights, to an fd set. + +Arguments: + +- "fdset-id": The ID of the fd set to add the file descriptor to. + (json-int, optional) +- "opaque": A free-form string that can be used to describe the fd. + (json-string, optional) + +Return a json-object with the following information: + +- "fdset-id": The ID of the fd set that the fd was added to. (json-int) +- "fd": The file descriptor that was received via SCM rights and added to the + fd set. (json-int) + +Example: + +-> { "execute": "add-fd", "arguments": { "fdset-id": 1 } } +<- { "return": { "fdset-id": 1, "fd": 3 } } + +Notes: + +(1) The list of fd sets is shared by all monitor connections. +(2) If "fdset-id" is not specified, a new fd set will be created. + +EQMP + + { + .name = "remove-fd", + .args_type = "fdset-id:i,fd:i?", + .params = "remove-fd fdset-id fd", + .help = "Remove a file descriptor from an fd set", + .mhandler.cmd_new = qmp_marshal_input_remove_fd, + }, + +SQMP +remove-fd +--------- + +Remove a file descriptor from an fd set. + +Arguments: + +- "fdset-id": The ID of the fd set that the file descriptor belongs to. + (json-int) +- "fd": The file descriptor that is to be removed. (json-int, optional) + +Example: + +-> { "execute": "remove-fd", "arguments": { "fdset-id": 1, "fd": 3 } } +<- { "return": {} } + +Notes: + +(1) The list of fd sets is shared by all monitor connections. +(2) If "fd" is not specified, all file descriptors in "fdset-id" will be + removed. + +EQMP + + { + .name = "query-fdsets", + .args_type = "", + .help = "Return information describing all fd sets", + .mhandler.cmd_new = qmp_marshal_input_query_fdsets, + }, + +SQMP +query-fdsets +------------- + +Return information describing all fd sets. + +Arguments: None + +Example: + +-> { "execute": "query-fdsets" } +<- { "return": [ + { + "fdset-id": 1 + "fds": [ + { + "fd": 21, + }, + { + "fd": 23, + } + ], + }, + { + "fdset-id": 2 + "fds": [ + { + "fd": 22, + } + ], + } + ] + } + +Note: The list of fd sets is shared by all monitor connections. + +EQMP + { .name = "block_passwd", .args_type = "device:B,password:s", -- 1.7.10.4

On 08/09/2012 08:10 PM, Corey Bryant wrote:
This patch adds support that enables passing of file descriptors to the QEMU monitor where they will be stored in specified file descriptor sets.
A file descriptor set can be used by a client like libvirt to store file descriptors for the same file. This allows the client to open a file with different access modes (O_RDWR, O_WRONLY, O_RDONLY) and add/remove the passed fds to/from an fd set as needed. This will allow QEMU to (in a later patch in this series) "open" and "reopen" the same file by dup()ing the fd in the fd set that corresponds to the file, where the fd has the matching access mode flag that QEMU requests.
The new QMP commands are: add-fd: Add a file descriptor to an fd set remove-fd: Remove a file descriptor from an fd set query-fdsets: Return information describing all fd sets
Note: These commands are not compatible with the existing getfd and closefd QMP commands.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
+void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + char fd_str[20]; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (has_fd && mon_fdset_fd->fd != fd) { + continue; + } + mon_fdset_fd->removed = true; + if (has_fd) { + break; + } + } + monitor_fdset_cleanup(mon_fdset); + return; + } + snprintf(fd_str, sizeof(fd_str), "%" PRId64, fd); + error_set(errp, QERR_FD_NOT_FOUND, fd_str);
This error catches the case of fdset_id not found, but you never raise an error when has_fd is true but fd is not found within fdset_id. You also don't raise an error for back-to-back remove-fd(fdset-id=1,fd=3), because you aren't checking whether mon_fdset_fd->removed was already true. I'm not sure which semantic is better, but I _am_ worried that we have both semantics in teh same function (are we arguing that this command succeeds when the fd is gone, even if it already was gone? Or are we arguing that this command diagnoses failure on an attempt to remove something that doesn't exist). I guess I'm 60-40 towards always issuing an error on an attempt to remove something that can't be found or is already removed.
+} + +FdsetInfoList *qmp_query_fdsets(Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + FdsetInfoList *fdset_list = NULL; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + FdsetInfoList *fdset_info = g_malloc0(sizeof(*fdset_info)); + FdsetFdInfoList *fdsetfd_list = NULL; + + fdset_info->value = g_malloc0(sizeof(*fdset_info->value)); + fdset_info->value->fdset_id = mon_fdset->id; + + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + FdsetFdInfoList *fdsetfd_info; + + if (mon_fdset_fd->removed) { + continue; + }
This means that an fdset with all fds removed will show up as empty in the output; I guess that's okay, as libvirt could use that to infer that the dup()d fds are still in use. The alternative is to keep track of whether we have output any information about an fd within a set before including the set itself in the overall output.
+++ b/qapi-schema.json
+## +# @add-fd: +# +# Add a file descriptor, that was passed via SCM rights, to an fd set. +# +# @fdset-id: #optional The ID of the fd set to add the file descriptor to. +# +# @opaque: #optional A free-form string that can be used to describe the fd. +# +# Returns: @AddfdInfo on success +# If file descriptor was not received, FdNotSupplied +# If @fdset_id does not exist, InvalidParameterValue
Missed one: s/_/-/
+## +# @remove-fd: +# +# Remove a file descriptor from an fd set. +# +# @fdset-id: The ID of the fd set that the file descriptor belongs to. +# +# @fd: #optional The file descriptor that is to be removed. +# +# Returns: Nothing on success +# If @fdset_id or @fd is not found, FdNotFound
and another s/_/-/
+SQMP +query-fdsets +------------- + +Return information describing all fd sets. + +Arguments: None + +Example: + +-> { "execute": "query-fdsets" } +<- { "return": [ + { + "fdset-id": 1
missing a comma
+ "fds": [ + { + "fd": 21,
JSON does not permit trailing commas.
+ }, + { + "fd": 23,
and again
+ } + ], + }, + { + "fdset-id": 2
missing a comma
+ "fds": [ + { + "fd": 22,
trailing comma Also, it might be nice to include something like: "opaque":"rdonly:/path/to/file" in one of the examples to give a hint to the reader how to use opaque. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 08/10/2012 01:57 AM, Eric Blake wrote:
On 08/09/2012 08:10 PM, Corey Bryant wrote:
This patch adds support that enables passing of file descriptors to the QEMU monitor where they will be stored in specified file descriptor sets.
A file descriptor set can be used by a client like libvirt to store file descriptors for the same file. This allows the client to open a file with different access modes (O_RDWR, O_WRONLY, O_RDONLY) and add/remove the passed fds to/from an fd set as needed. This will allow QEMU to (in a later patch in this series) "open" and "reopen" the same file by dup()ing the fd in the fd set that corresponds to the file, where the fd has the matching access mode flag that QEMU requests.
The new QMP commands are: add-fd: Add a file descriptor to an fd set remove-fd: Remove a file descriptor from an fd set query-fdsets: Return information describing all fd sets
Note: These commands are not compatible with the existing getfd and closefd QMP commands.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
+void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + char fd_str[20]; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (has_fd && mon_fdset_fd->fd != fd) { + continue; + } + mon_fdset_fd->removed = true; + if (has_fd) { + break; + } + } + monitor_fdset_cleanup(mon_fdset); + return; + } + snprintf(fd_str, sizeof(fd_str), "%" PRId64, fd); + error_set(errp, QERR_FD_NOT_FOUND, fd_str);
This error catches the case of fdset_id not found, but you never raise an error when has_fd is true but fd is not found within fdset_id. You also don't raise an error for back-to-back remove-fd(fdset-id=1,fd=3), because you aren't checking whether mon_fdset_fd->removed was already true.
I'm not sure which semantic is better, but I _am_ worried that we have both semantics in teh same function (are we arguing that this command succeeds when the fd is gone, even if it already was gone? Or are we arguing that this command diagnoses failure on an attempt to remove something that doesn't exist). I guess I'm 60-40 towards always issuing an error on an attempt to remove something that can't be found or is already removed.
Thanks for catching this. The code used to fall through to the QERR_FD_NOT_FOUND error before I moved the return outside of the mon_fdset_fd loop. Anyway I intended to set QERR_FD_NOT_FOUND when either the fdset or fd is not found. I'll fix that. I hadn't thought of back-to-back remove-fd's requiring an error. I could go either way on that. But since I'll be touching the code again I'll issue an error for that too.
+} + +FdsetInfoList *qmp_query_fdsets(Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + FdsetInfoList *fdset_list = NULL; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + FdsetInfoList *fdset_info = g_malloc0(sizeof(*fdset_info)); + FdsetFdInfoList *fdsetfd_list = NULL; + + fdset_info->value = g_malloc0(sizeof(*fdset_info->value)); + fdset_info->value->fdset_id = mon_fdset->id; + + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + FdsetFdInfoList *fdsetfd_info; + + if (mon_fdset_fd->removed) { + continue; + }
This means that an fdset with all fds removed will show up as empty in the output; I guess that's okay, as libvirt could use that to infer that the dup()d fds are still in use. The alternative is to keep track of whether we have output any information about an fd within a set before including the set itself in the overall output.
You're right, an empty set will be shown in this case. I chose the route of less code. :) I'm going to leave this as-is unless there's an objection.
+++ b/qapi-schema.json
+## +# @add-fd: +# +# Add a file descriptor, that was passed via SCM rights, to an fd set. +# +# @fdset-id: #optional The ID of the fd set to add the file descriptor to. +# +# @opaque: #optional A free-form string that can be used to describe the fd. +# +# Returns: @AddfdInfo on success +# If file descriptor was not received, FdNotSupplied +# If @fdset_id does not exist, InvalidParameterValue
Missed one: s/_/-/
Gah! Sorry.. and thanks for catching.
+## +# @remove-fd: +# +# Remove a file descriptor from an fd set. +# +# @fdset-id: The ID of the fd set that the file descriptor belongs to. +# +# @fd: #optional The file descriptor that is to be removed. +# +# Returns: Nothing on success +# If @fdset_id or @fd is not found, FdNotFound
and another s/_/-/
Ok
+SQMP +query-fdsets +------------- + +Return information describing all fd sets. + +Arguments: None + +Example: + +-> { "execute": "query-fdsets" } +<- { "return": [ + { + "fdset-id": 1
missing a comma
Ok. Perhaps I should just use a real example rather than editing the old one!
+ "fds": [ + { + "fd": 21,
JSON does not permit trailing commas.
Ok
+ }, + { + "fd": 23,
and again
Ok
+ } + ], + }, + { + "fdset-id": 2
missing a comma
Ok
+ "fds": [ + { + "fd": 22,
trailing comma
Ok
Also, it might be nice to include something like:
"opaque":"rdonly:/path/to/file"
in one of the examples to give a hint to the reader how to use opaque.
No problem I'll do that. -- Regards, Corey

On Thu, Aug 09, 2012 at 10:10:44PM -0400, Corey Bryant wrote:
+void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + char fd_str[20]; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (has_fd && mon_fdset_fd->fd != fd) { + continue; + } + mon_fdset_fd->removed = true; + if (has_fd) { + break; + } + } + monitor_fdset_cleanup(mon_fdset); + return; + } + snprintf(fd_str, sizeof(fd_str), "%" PRId64, fd); + error_set(errp, QERR_FD_NOT_FOUND, fd_str);
fd is optional and may be uninitialized. I think the human-readable string should be: if has_fd: fd_str = '%s:%s' % (fdset_id, fd) else: fd_str = '%s' % fdset_id Otherwise, looks good.

On 08/10/2012 03:20 AM, Stefan Hajnoczi wrote:
On Thu, Aug 09, 2012 at 10:10:44PM -0400, Corey Bryant wrote:
+void qmp_remove_fd(int64_t fdset_id, bool has_fd, int64_t fd, Error **errp) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + char fd_str[20]; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (has_fd && mon_fdset_fd->fd != fd) { + continue; + } + mon_fdset_fd->removed = true; + if (has_fd) { + break; + } + } + monitor_fdset_cleanup(mon_fdset); + return; + } + snprintf(fd_str, sizeof(fd_str), "%" PRId64, fd); + error_set(errp, QERR_FD_NOT_FOUND, fd_str);
fd is optional and may be uninitialized. I think the human-readable string should be:
if has_fd: fd_str = '%s:%s' % (fdset_id, fd) else: fd_str = '%s' % fdset_id
Otherwise, looks good.
Good point, thanks. I'll fix this. -- Regards, Corey

Am 10.08.2012 04:10, schrieb Corey Bryant:
This patch adds support that enables passing of file descriptors to the QEMU monitor where they will be stored in specified file descriptor sets.
A file descriptor set can be used by a client like libvirt to store file descriptors for the same file. This allows the client to open a file with different access modes (O_RDWR, O_WRONLY, O_RDONLY) and add/remove the passed fds to/from an fd set as needed. This will allow QEMU to (in a later patch in this series) "open" and "reopen" the same file by dup()ing the fd in the fd set that corresponds to the file, where the fd has the matching access mode flag that QEMU requests.
The new QMP commands are: add-fd: Add a file descriptor to an fd set remove-fd: Remove a file descriptor from an fd set query-fdsets: Return information describing all fd sets
Note: These commands are not compatible with the existing getfd and closefd QMP commands.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v5: -This patch is new in v5 and replaces the pass-fd QMP command from v4. -By grouping fds in fd sets, we ease managability with an fd set per file, addressing concerns raised in v4 about handling "reopens" and preventing fd leakage. (eblake@redhat.com, kwolf@redhat.com, dberrange@redhat.com)
v6 -Make @fd optional for remove-fd (eblake@redhat.com) -Make @fdset-id optional for add-fd (eblake@redhat.com)
v7: -Share fd sets among all monitor connections (kwolf@redhat.com) -Added mon_refcount to keep track of monitor connection count.
v8: -Add opaque string to add-fd/query-fdsets. (stefanha@linux.vnet.ibm.com) -Use camel case for structures. (stefanha@linux.vnet.ibm.com) -Don't return in-use and refcount from query-fdsets. (stefanha@linux.vnet.ibm.com) -Don't return removed fd's from query-fdsets. (stefanha@linux.vnet.ibm.com) -Use fdset-id rather than fdset_id. (eblake@redhat.com) -Fix fd leak in qmp_add_fd(). (stefanha@linux.vnet.ibm.com) -Update QMP errors. (stefanha@linux.vnet.ibm.com, eblake@redhat.com)
monitor.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++++++++++++ qmp-commands.hx | 117 +++++++++++++++++++++++++++++++++ 3 files changed, 403 insertions(+)
diff --git a/monitor.c b/monitor.c index 49dccfe..f9a0577 100644 --- a/monitor.c +++ b/monitor.c @@ -140,6 +140,24 @@ struct mon_fd_t { QLIST_ENTRY(mon_fd_t) next; };
+/* file descriptor associated with a file descriptor set */ +typedef struct MonFdsetFd MonFdsetFd; +struct MonFdsetFd { + int fd; + bool removed; + char *opaque; + QLIST_ENTRY(MonFdsetFd) next; +}; + +/* file descriptor set containing fds passed via SCM_RIGHTS */ +typedef struct MonFdset MonFdset; +struct MonFdset { + int64_t id; + int refcount; + QLIST_HEAD(, MonFdsetFd) fds; + QLIST_ENTRY(MonFdset) next; +}; + typedef struct MonitorControl { QObject *id; JSONMessageParser parser; @@ -211,6 +229,8 @@ static inline int mon_print_count_get(const Monitor *mon) { return 0; } #define QMP_ACCEPT_UNKNOWNS 1
static QLIST_HEAD(mon_list, Monitor) mon_list; +static QLIST_HEAD(mon_fdsets, MonFdset) mon_fdsets; +static int mon_refcount;
You introduce mon_refcount in this patch and check it against 0 in one place, but it never gets changed until patch 3 is applied. Would it make sense to do both in the same patch? Kevin

On 08/10/2012 12:08 PM, Kevin Wolf wrote:
Am 10.08.2012 04:10, schrieb Corey Bryant:
This patch adds support that enables passing of file descriptors to the QEMU monitor where they will be stored in specified file descriptor sets.
A file descriptor set can be used by a client like libvirt to store file descriptors for the same file. This allows the client to open a file with different access modes (O_RDWR, O_WRONLY, O_RDONLY) and add/remove the passed fds to/from an fd set as needed. This will allow QEMU to (in a later patch in this series) "open" and "reopen" the same file by dup()ing the fd in the fd set that corresponds to the file, where the fd has the matching access mode flag that QEMU requests.
The new QMP commands are: add-fd: Add a file descriptor to an fd set remove-fd: Remove a file descriptor from an fd set query-fdsets: Return information describing all fd sets
Note: These commands are not compatible with the existing getfd and closefd QMP commands.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v5: -This patch is new in v5 and replaces the pass-fd QMP command from v4. -By grouping fds in fd sets, we ease managability with an fd set per file, addressing concerns raised in v4 about handling "reopens" and preventing fd leakage. (eblake@redhat.com, kwolf@redhat.com, dberrange@redhat.com)
v6 -Make @fd optional for remove-fd (eblake@redhat.com) -Make @fdset-id optional for add-fd (eblake@redhat.com)
v7: -Share fd sets among all monitor connections (kwolf@redhat.com) -Added mon_refcount to keep track of monitor connection count.
v8: -Add opaque string to add-fd/query-fdsets. (stefanha@linux.vnet.ibm.com) -Use camel case for structures. (stefanha@linux.vnet.ibm.com) -Don't return in-use and refcount from query-fdsets. (stefanha@linux.vnet.ibm.com) -Don't return removed fd's from query-fdsets. (stefanha@linux.vnet.ibm.com) -Use fdset-id rather than fdset_id. (eblake@redhat.com) -Fix fd leak in qmp_add_fd(). (stefanha@linux.vnet.ibm.com) -Update QMP errors. (stefanha@linux.vnet.ibm.com, eblake@redhat.com)
monitor.c | 188 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++++++++++++ qmp-commands.hx | 117 +++++++++++++++++++++++++++++++++ 3 files changed, 403 insertions(+)
diff --git a/monitor.c b/monitor.c index 49dccfe..f9a0577 100644 --- a/monitor.c +++ b/monitor.c @@ -140,6 +140,24 @@ struct mon_fd_t { QLIST_ENTRY(mon_fd_t) next; };
+/* file descriptor associated with a file descriptor set */ +typedef struct MonFdsetFd MonFdsetFd; +struct MonFdsetFd { + int fd; + bool removed; + char *opaque; + QLIST_ENTRY(MonFdsetFd) next; +}; + +/* file descriptor set containing fds passed via SCM_RIGHTS */ +typedef struct MonFdset MonFdset; +struct MonFdset { + int64_t id; + int refcount; + QLIST_HEAD(, MonFdsetFd) fds; + QLIST_ENTRY(MonFdset) next; +}; + typedef struct MonitorControl { QObject *id; JSONMessageParser parser; @@ -211,6 +229,8 @@ static inline int mon_print_count_get(const Monitor *mon) { return 0; } #define QMP_ACCEPT_UNKNOWNS 1
static QLIST_HEAD(mon_list, Monitor) mon_list; +static QLIST_HEAD(mon_fdsets, MonFdset) mon_fdsets; +static int mon_refcount;
You introduce mon_refcount in this patch and check it against 0 in one place, but it never gets changed until patch 3 is applied. Would it make sense to do both in the same patch?
Yes, I'll go ahead and move the mon_refcount code from this patch to patch 3. -- Regards, Corey

Fd sets are shared by all monitor connections. Fd sets are considered to be in use while at least one monitor is connected. When the last monitor disconnects, all fds that are members of an fd set with refcount of zero are closed. This prevents any fd leakage associated with a client disconnect prior to using a passed fd. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v5: -This patch is new in v5. -This support addresses concerns from v4 regarding fd leakage if the client disconnects unexpectedly. (eblake@redhat.com, kwolf@redhat.com, dberrange@redhat.com) v6: -No changes v7: -Removed monitor_fdsets_set_in_use() function since we now use mon_refcount to determine if fdsets are in use. -Added monitor_fdsets_cleanup() function, and increment/decrement of mon_refcount when monitor connects/disconnects. v8: -Use camel case for structures. (stefanha@linux.vnet.ibm.com) monitor.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/monitor.c b/monitor.c index f9a0577..96aafec 100644 --- a/monitor.c +++ b/monitor.c @@ -2433,6 +2433,16 @@ static void monitor_fdset_cleanup(MonFdset *mon_fdset) } } +static void monitor_fdsets_cleanup(void) +{ + MonFdset *mon_fdset; + MonFdset *mon_fdset_next; + + QLIST_FOREACH_SAFE(mon_fdset, &mon_fdsets, next, mon_fdset_next) { + monitor_fdset_cleanup(mon_fdset); + } +} + AddfdInfo *qmp_add_fd(bool has_fdset_id, int64_t fdset_id, bool has_opaque, const char *opaque, Error **errp) { @@ -4781,9 +4791,12 @@ static void monitor_control_event(void *opaque, int event) data = get_qmp_greeting(); monitor_json_emitter(mon, data); qobject_decref(data); + mon_refcount++; break; case CHR_EVENT_CLOSED: json_message_parser_destroy(&mon->mc->parser); + mon_refcount--; + monitor_fdsets_cleanup(); break; } } @@ -4824,6 +4837,12 @@ static void monitor_event(void *opaque, int event) readline_show_prompt(mon->rs); } mon->reset_seen = 1; + mon_refcount++; + break; + + case CHR_EVENT_CLOSED: + mon_refcount--; + monitor_fdsets_cleanup(); break; } } -- 1.7.10.4

Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v8 -This patch is new in v8. It was reported on a prior fd passing approach and I realized it's needed in this series. (kwolf@redhat.com) block/raw-posix.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 0dce089..f606211 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -1052,8 +1052,10 @@ static int floppy_probe_device(const char *filename) struct floppy_struct fdparam; struct stat st; - if (strstart(filename, "/dev/fd", NULL)) + if (strstart(filename, "/dev/fd", NULL) && + !strstart(filename, "/dev/fdset/", NULL)) { prio = 50; + } fd = open(filename, O_RDONLY | O_NONBLOCK); if (fd < 0) { -- 1.7.10.4

This patch converts all block layer open calls to qemu_open. Note that this adds the O_CLOEXEC flag to the changed open paths when the O_CLOEXEC macro is defined. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v2: -Convert calls to qemu_open instead of file_open (kwolf@redhat.com) -Mention introduction of O_CLOEXEC (kwolf@redhat.com) v3-v8: -No changes block/raw-posix.c | 18 +++++++++--------- block/raw-win32.c | 4 ++-- block/vdi.c | 5 +++-- block/vmdk.c | 21 +++++++++------------ block/vpc.c | 2 +- block/vvfat.c | 4 ++-- 6 files changed, 26 insertions(+), 28 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index f606211..08b997e 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -572,8 +572,8 @@ static int raw_create(const char *filename, QEMUOptionParameter *options) options++; } - fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, - 0644); + fd = qemu_open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, + 0644); if (fd < 0) { result = -errno; } else { @@ -846,7 +846,7 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags) if ( bsdPath[ 0 ] != '\0' ) { strcat(bsdPath,"s0"); /* some CDs don't have a partition 0 */ - fd = open(bsdPath, O_RDONLY | O_BINARY | O_LARGEFILE); + fd = qemu_open(bsdPath, O_RDONLY | O_BINARY | O_LARGEFILE); if (fd < 0) { bsdPath[strlen(bsdPath)-1] = '1'; } else { @@ -903,7 +903,7 @@ static int fd_open(BlockDriverState *bs) #endif return -EIO; } - s->fd = open(bs->filename, s->open_flags & ~O_NONBLOCK); + s->fd = qemu_open(bs->filename, s->open_flags & ~O_NONBLOCK); if (s->fd < 0) { s->fd_error_time = get_clock(); s->fd_got_error = 1; @@ -977,7 +977,7 @@ static int hdev_create(const char *filename, QEMUOptionParameter *options) options++; } - fd = open(filename, O_WRONLY | O_BINARY); + fd = qemu_open(filename, O_WRONLY | O_BINARY); if (fd < 0) return -errno; @@ -1057,7 +1057,7 @@ static int floppy_probe_device(const char *filename) prio = 50; } - fd = open(filename, O_RDONLY | O_NONBLOCK); + fd = qemu_open(filename, O_RDONLY | O_NONBLOCK); if (fd < 0) { goto out; } @@ -1110,7 +1110,7 @@ static void floppy_eject(BlockDriverState *bs, bool eject_flag) close(s->fd); s->fd = -1; } - fd = open(bs->filename, s->open_flags | O_NONBLOCK); + fd = qemu_open(bs->filename, s->open_flags | O_NONBLOCK); if (fd >= 0) { if (ioctl(fd, FDEJECT, 0) < 0) perror("FDEJECT"); @@ -1160,7 +1160,7 @@ static int cdrom_probe_device(const char *filename) int prio = 0; struct stat st; - fd = open(filename, O_RDONLY | O_NONBLOCK); + fd = qemu_open(filename, O_RDONLY | O_NONBLOCK); if (fd < 0) { goto out; } @@ -1284,7 +1284,7 @@ static int cdrom_reopen(BlockDriverState *bs) */ if (s->fd >= 0) close(s->fd); - fd = open(bs->filename, s->open_flags, 0644); + fd = qemu_open(bs->filename, s->open_flags, 0644); if (fd < 0) { s->fd = -1; return -EIO; diff --git a/block/raw-win32.c b/block/raw-win32.c index e4b0b75..8d7838d 100644 --- a/block/raw-win32.c +++ b/block/raw-win32.c @@ -255,8 +255,8 @@ static int raw_create(const char *filename, QEMUOptionParameter *options) options++; } - fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, - 0644); + fd = qemu_open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, + 0644); if (fd < 0) return -EIO; set_sparse(fd); diff --git a/block/vdi.c b/block/vdi.c index 57325d6..c4f1529 100644 --- a/block/vdi.c +++ b/block/vdi.c @@ -653,8 +653,9 @@ static int vdi_create(const char *filename, QEMUOptionParameter *options) options++; } - fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, - 0644); + fd = qemu_open(filename, + O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, + 0644); if (fd < 0) { return -errno; } diff --git a/block/vmdk.c b/block/vmdk.c index 18e9b4c..557dc1b 100644 --- a/block/vmdk.c +++ b/block/vmdk.c @@ -1161,10 +1161,9 @@ static int vmdk_create_extent(const char *filename, int64_t filesize, VMDK4Header header; uint32_t tmp, magic, grains, gd_size, gt_size, gt_count; - fd = open( - filename, - O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, - 0644); + fd = qemu_open(filename, + O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, + 0644); if (fd < 0) { return -errno; } @@ -1484,15 +1483,13 @@ static int vmdk_create(const char *filename, QEMUOptionParameter *options) (flags & BLOCK_FLAG_COMPAT6 ? 6 : 4), total_size / (int64_t)(63 * 16 * 512)); if (split || flat) { - fd = open( - filename, - O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, - 0644); + fd = qemu_open(filename, + O_WRONLY | O_CREAT | O_TRUNC | O_BINARY | O_LARGEFILE, + 0644); } else { - fd = open( - filename, - O_WRONLY | O_BINARY | O_LARGEFILE, - 0644); + fd = qemu_open(filename, + O_WRONLY | O_BINARY | O_LARGEFILE, + 0644); } if (fd < 0) { return -errno; diff --git a/block/vpc.c b/block/vpc.c index 5cd13d1..60ebf5a 100644 --- a/block/vpc.c +++ b/block/vpc.c @@ -678,7 +678,7 @@ static int vpc_create(const char *filename, QEMUOptionParameter *options) } /* Create the file */ - fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644); + fd = qemu_open(filename, O_WRONLY | O_CREAT | O_TRUNC | O_BINARY, 0644); if (fd < 0) { return -EIO; } diff --git a/block/vvfat.c b/block/vvfat.c index 7b1dcee..22b586a 100644 --- a/block/vvfat.c +++ b/block/vvfat.c @@ -1162,7 +1162,7 @@ static int open_file(BDRVVVFATState* s,mapping_t* mapping) if(!s->current_mapping || strcmp(s->current_mapping->path,mapping->path)) { /* open file */ - int fd = open(mapping->path, O_RDONLY | O_BINARY | O_LARGEFILE); + int fd = qemu_open(mapping->path, O_RDONLY | O_BINARY | O_LARGEFILE); if(fd<0) return -1; vvfat_close_current_file(s); @@ -2221,7 +2221,7 @@ static int commit_one_file(BDRVVVFATState* s, for (i = s->cluster_size; i < offset; i += s->cluster_size) c = modified_fat_get(s, c); - fd = open(mapping->path, O_RDWR | O_CREAT | O_BINARY, 0666); + fd = qemu_open(mapping->path, O_RDWR | O_CREAT | O_BINARY, 0666); if (fd < 0) { fprintf(stderr, "Could not open %s... (%s, %d)\n", mapping->path, strerror(errno), errno); -- 1.7.10.4

This patch converts all block layer close calls, that correspond to qemu_open calls, to qemu_close. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v5: -This patch is new in v5. (kwolf@redhat.com, eblake@redhat.com) v6-v8: -No changes block/raw-posix.c | 24 ++++++++++++------------ block/raw-win32.c | 2 +- block/vmdk.c | 4 ++-- block/vpc.c | 2 +- block/vvfat.c | 12 ++++++------ osdep.c | 5 +++++ qemu-common.h | 1 + savevm.c | 4 ++-- 8 files changed, 30 insertions(+), 24 deletions(-) diff --git a/block/raw-posix.c b/block/raw-posix.c index 08b997e..6be20b1 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -271,7 +271,7 @@ static int raw_open_common(BlockDriverState *bs, const char *filename, out_free_buf: qemu_vfree(s->aligned_buf); out_close: - close(fd); + qemu_close(fd); return -errno; } @@ -376,7 +376,7 @@ static void raw_close(BlockDriverState *bs) { BDRVRawState *s = bs->opaque; if (s->fd >= 0) { - close(s->fd); + qemu_close(s->fd); s->fd = -1; if (s->aligned_buf != NULL) qemu_vfree(s->aligned_buf); @@ -580,7 +580,7 @@ static int raw_create(const char *filename, QEMUOptionParameter *options) if (ftruncate(fd, total_size * BDRV_SECTOR_SIZE) != 0) { result = -errno; } - if (close(fd) != 0) { + if (qemu_close(fd) != 0) { result = -errno; } } @@ -850,7 +850,7 @@ static int hdev_open(BlockDriverState *bs, const char *filename, int flags) if (fd < 0) { bsdPath[strlen(bsdPath)-1] = '1'; } else { - close(fd); + qemu_close(fd); } filename = bsdPath; } @@ -889,7 +889,7 @@ static int fd_open(BlockDriverState *bs) last_media_present = (s->fd >= 0); if (s->fd >= 0 && (get_clock() - s->fd_open_time) >= FD_OPEN_TIMEOUT) { - close(s->fd); + qemu_close(s->fd); s->fd = -1; #ifdef DEBUG_FLOPPY printf("Floppy closed\n"); @@ -988,7 +988,7 @@ static int hdev_create(const char *filename, QEMUOptionParameter *options) else if (lseek(fd, 0, SEEK_END) < total_size * BDRV_SECTOR_SIZE) ret = -ENOSPC; - close(fd); + qemu_close(fd); return ret; } @@ -1038,7 +1038,7 @@ static int floppy_open(BlockDriverState *bs, const char *filename, int flags) return ret; /* close fd so that we can reopen it as needed */ - close(s->fd); + qemu_close(s->fd); s->fd = -1; s->fd_media_changed = 1; @@ -1072,7 +1072,7 @@ static int floppy_probe_device(const char *filename) prio = 100; outc: - close(fd); + qemu_close(fd); out: return prio; } @@ -1107,14 +1107,14 @@ static void floppy_eject(BlockDriverState *bs, bool eject_flag) int fd; if (s->fd >= 0) { - close(s->fd); + qemu_close(s->fd); s->fd = -1; } fd = qemu_open(bs->filename, s->open_flags | O_NONBLOCK); if (fd >= 0) { if (ioctl(fd, FDEJECT, 0) < 0) perror("FDEJECT"); - close(fd); + qemu_close(fd); } } @@ -1175,7 +1175,7 @@ static int cdrom_probe_device(const char *filename) prio = 100; outc: - close(fd); + qemu_close(fd); out: return prio; } @@ -1283,7 +1283,7 @@ static int cdrom_reopen(BlockDriverState *bs) * FreeBSD seems to not notice sometimes... */ if (s->fd >= 0) - close(s->fd); + qemu_close(s->fd); fd = qemu_open(bs->filename, s->open_flags, 0644); if (fd < 0) { s->fd = -1; diff --git a/block/raw-win32.c b/block/raw-win32.c index 8d7838d..c56bf83 100644 --- a/block/raw-win32.c +++ b/block/raw-win32.c @@ -261,7 +261,7 @@ static int raw_create(const char *filename, QEMUOptionParameter *options) return -EIO; set_sparse(fd); ftruncate(fd, total_size * 512); - close(fd); + qemu_close(fd); return 0; } diff --git a/block/vmdk.c b/block/vmdk.c index 557dc1b..daee426 100644 --- a/block/vmdk.c +++ b/block/vmdk.c @@ -1258,7 +1258,7 @@ static int vmdk_create_extent(const char *filename, int64_t filesize, ret = 0; exit: - close(fd); + qemu_close(fd); return ret; } @@ -1506,7 +1506,7 @@ static int vmdk_create(const char *filename, QEMUOptionParameter *options) } ret = 0; exit: - close(fd); + qemu_close(fd); return ret; } diff --git a/block/vpc.c b/block/vpc.c index 60ebf5a..c0b82c4 100644 --- a/block/vpc.c +++ b/block/vpc.c @@ -744,7 +744,7 @@ static int vpc_create(const char *filename, QEMUOptionParameter *options) } fail: - close(fd); + qemu_close(fd); return ret; } diff --git a/block/vvfat.c b/block/vvfat.c index 22b586a..59d3c5b 100644 --- a/block/vvfat.c +++ b/block/vvfat.c @@ -1105,7 +1105,7 @@ static inline void vvfat_close_current_file(BDRVVVFATState *s) if(s->current_mapping) { s->current_mapping = NULL; if (s->current_fd) { - close(s->current_fd); + qemu_close(s->current_fd); s->current_fd = 0; } } @@ -2230,7 +2230,7 @@ static int commit_one_file(BDRVVVFATState* s, } if (offset > 0) { if (lseek(fd, offset, SEEK_SET) != offset) { - close(fd); + qemu_close(fd); g_free(cluster); return -3; } @@ -2251,13 +2251,13 @@ static int commit_one_file(BDRVVVFATState* s, (uint8_t*)cluster, (rest_size + 0x1ff) / 0x200); if (ret < 0) { - close(fd); + qemu_close(fd); g_free(cluster); return ret; } if (write(fd, cluster, rest_size) < 0) { - close(fd); + qemu_close(fd); g_free(cluster); return -2; } @@ -2268,11 +2268,11 @@ static int commit_one_file(BDRVVVFATState* s, if (ftruncate(fd, size)) { perror("ftruncate()"); - close(fd); + qemu_close(fd); g_free(cluster); return -4; } - close(fd); + qemu_close(fd); g_free(cluster); return commit_mappings(s, first_cluster, dir_index); diff --git a/osdep.c b/osdep.c index c07faf5..7f876ae 100644 --- a/osdep.c +++ b/osdep.c @@ -107,6 +107,11 @@ int qemu_open(const char *name, int flags, ...) return ret; } +int qemu_close(int fd) +{ + return close(fd); +} + /* * A variant of write(2) which handles partial write. * diff --git a/qemu-common.h b/qemu-common.h index f16079f..e53126d 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -207,6 +207,7 @@ const char *path(const char *pathname); void *qemu_oom_check(void *ptr); int qemu_open(const char *name, int flags, ...); +int qemu_close(int fd); ssize_t qemu_write_full(int fd, const void *buf, size_t count) QEMU_WARN_UNUSED_RESULT; ssize_t qemu_send_full(int fd, const void *buf, size_t count, int flags) diff --git a/savevm.c b/savevm.c index 6e82b2d..8ecd5d2 100644 --- a/savevm.c +++ b/savevm.c @@ -513,7 +513,7 @@ static void qemu_fill_buffer(QEMUFile *f) * * Returns f->close() return value, or 0 if close function is not set. */ -static int qemu_close(QEMUFile *f) +static int _qemu_fclose(QEMUFile *f) { int ret = 0; if (f->close) { @@ -535,7 +535,7 @@ int qemu_fclose(QEMUFile *f) { int ret; qemu_fflush(f); - ret = qemu_close(f); + ret = _qemu_fclose(f); /* If any error was spotted before closing, we should report it * instead of the close() return value. */ -- 1.7.10.4

When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open. Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode. Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com> --- v2: -Get rid of file_open and move dup code to qemu_open (kwolf@redhat.com) -Use strtol wrapper instead of atoi (kwolf@redhat.com) v3: -Add note about fd leakage (eblake@redhat.com) v4 -Moved patch to be later in series (lcapitulino@redhat.com) -Update qemu_open to check access mode flags and set flags that can be set (eblake@redhat.com, kwolf@redhat.com) v5: -This patch was overhauled quite a bit in this version, with the addition of fd set and refcount support. -Use qemu_set_cloexec() on dup'd fd (eblake@redhat.com) -Modify flags set by fcntl on dup'd fd (eblake@redhat.com) -Reduce syscalls when setting flags for dup'd fd (eblake@redhat.com) -Fix O_RDWR, O_RDONLY, O_WRONLY checks (eblake@redhat.com) v6: -Pass only the fd to qemu_close() and keep track of dup fds per fd set. (kwolf@redhat.com, eblake@redhat.com) -Handle refcount incr/decr in new dup_fd_add/remove fd functions. -Use qemu_set_cloexec() appropriately in qemu_dup() (kwolf@redhat.com) -Simplify setting of setfl_flags in qemu_dup() (kwolf@redhat.com) -Add preprocessor checks for F_DUPFD_CLOEXEC (eblake@redhat.com) -Simplify flag checking in monitor_fdset_get_fd() (kwolf@redhat.com) v7: -Minor updates to reference global mon_fdsets, and to remove default_mon usage in osdep.c. (kwolf@redhat.com) v8: -Use camel case for structures. (stefanha@linux.vnet.ibm.com) cutils.c | 5 +++ monitor.c | 87 ++++++++++++++++++++++++++++++++++++++++++++ monitor.h | 5 +++ osdep.c | 112 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++ qemu-common.h | 1 + qemu-tool.c | 20 +++++++++++ 6 files changed, 230 insertions(+) diff --git a/cutils.c b/cutils.c index 9d4c570..8b0d2bb 100644 --- a/cutils.c +++ b/cutils.c @@ -382,3 +382,8 @@ int qemu_parse_fd(const char *param) } return fd; } + +int qemu_parse_fdset(const char *param) +{ + return qemu_parse_fd(param); +} diff --git a/monitor.c b/monitor.c index 96aafec..63a43a2 100644 --- a/monitor.c +++ b/monitor.c @@ -155,6 +155,7 @@ struct MonFdset { int64_t id; int refcount; QLIST_HEAD(, MonFdsetFd) fds; + QLIST_HEAD(, MonFdsetFd) dup_fds; QLIST_ENTRY(MonFdset) next; }; @@ -2587,6 +2588,92 @@ FdsetInfoList *qmp_query_fdsets(Error **errp) return fdset_list; } +int monitor_fdset_get_fd(int64_t fdset_id, int flags) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + int mon_fd_flags; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (mon_fdset_fd->removed) { + continue; + } + + mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL); + if (mon_fd_flags == -1) { + return -1; + } + + if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) { + return mon_fdset_fd->fd; + } + } + errno = EACCES; + return -1; + } + errno = ENOENT; + return -1; +} + +int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd_dup; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd_dup, &mon_fdset->dup_fds, next) { + if (mon_fdset_fd_dup->fd == dup_fd) { + return -1; + } + } + mon_fdset_fd_dup = g_malloc0(sizeof(*mon_fdset_fd_dup)); + mon_fdset_fd_dup->fd = dup_fd; + QLIST_INSERT_HEAD(&mon_fdset->dup_fds, mon_fdset_fd_dup, next); + mon_fdset->refcount++; + return 0; + } + return -1; +} + +static int _monitor_fdset_dup_fd_find(int dup_fd, bool remove) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd_dup; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + QLIST_FOREACH(mon_fdset_fd_dup, &mon_fdset->dup_fds, next) { + if (mon_fdset_fd_dup->fd == dup_fd) { + if (remove) { + QLIST_REMOVE(mon_fdset_fd_dup, next); + mon_fdset->refcount--; + if (mon_fdset->refcount == 0) { + monitor_fdset_cleanup(mon_fdset); + } + } + return mon_fdset->id; + } + } + } + return -1; +} + +int monitor_fdset_dup_fd_find(int dup_fd) +{ + return _monitor_fdset_dup_fd_find(dup_fd, false); +} + +int monitor_fdset_dup_fd_remove(int dup_fd) +{ + return _monitor_fdset_dup_fd_find(dup_fd, true); +} + /* mon_cmds and info_cmds would be sorted at runtime */ static mon_cmd_t mon_cmds[] = { #include "hmp-commands.h" diff --git a/monitor.h b/monitor.h index 5f4de1b..30b660f 100644 --- a/monitor.h +++ b/monitor.h @@ -86,4 +86,9 @@ int qmp_qom_set(Monitor *mon, const QDict *qdict, QObject **ret); int qmp_qom_get(Monitor *mon, const QDict *qdict, QObject **ret); +int monitor_fdset_get_fd(int64_t fdset_id, int flags); +int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd); +int monitor_fdset_dup_fd_remove(int dup_fd); +int monitor_fdset_dup_fd_find(int dup_fd); + #endif /* !MONITOR_H */ diff --git a/osdep.c b/osdep.c index 7f876ae..9ff3561 100644 --- a/osdep.c +++ b/osdep.c @@ -48,6 +48,7 @@ extern int madvise(caddr_t, size_t, int); #include "qemu-common.h" #include "trace.h" #include "qemu_socket.h" +#include "monitor.h" static bool fips_enabled = false; @@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice) #endif } +/* + * Dups an fd and sets the flags + */ +static int qemu_dup(int fd, int flags) +{ + int ret; + int serrno; + int dup_flags; + int setfl_flags; + + if (flags & O_CLOEXEC) { +#ifdef F_DUPFD_CLOEXEC + ret = fcntl(fd, F_DUPFD_CLOEXEC, 0); +#else + ret = dup(fd); + if (ret != -1) { + qemu_set_cloexec(ret); + } +#endif + } else { + ret = dup(fd); + } + + if (ret == -1) { + goto fail; + } + + dup_flags = fcntl(ret, F_GETFL); + if (dup_flags == -1) { + goto fail; + } + + if ((flags & O_SYNC) != (dup_flags & O_SYNC)) { + errno = EINVAL; + goto fail; + } + + /* Set/unset flags that we can with fcntl */ + setfl_flags = O_APPEND | O_ASYNC | O_DIRECT | O_NOATIME | O_NONBLOCK; + dup_flags &= ~setfl_flags; + dup_flags |= (flags & setfl_flags); + if (fcntl(ret, F_SETFL, dup_flags) == -1) { + goto fail; + } + + /* Truncate the file in the cases that open() would truncate it */ + if (flags & O_TRUNC || + ((flags & (O_CREAT | O_EXCL)) == (O_CREAT | O_EXCL))) { + if (ftruncate(ret, 0) == -1) { + goto fail; + } + } + + return ret; + +fail: + serrno = errno; + if (ret != -1) { + close(ret); + } + errno = serrno; + return -1; +} /* * Opens a file with FD_CLOEXEC set @@ -87,6 +151,39 @@ int qemu_open(const char *name, int flags, ...) int ret; int mode = 0; +#ifndef _WIN32 + const char *fdset_id_str; + + /* Attempt dup of fd from fd set */ + if (strstart(name, "/dev/fdset/", &fdset_id_str)) { + int64_t fdset_id; + int fd, dupfd; + + fdset_id = qemu_parse_fdset(fdset_id_str); + if (fdset_id == -1) { + errno = EINVAL; + return -1; + } + + fd = monitor_fdset_get_fd(fdset_id, flags); + if (fd == -1) { + return -1; + } + + dupfd = qemu_dup(fd, flags); + if (fd == -1) { + return -1; + } + + ret = monitor_fdset_dup_fd_add(fdset_id, dupfd); + if (ret == -1) { + return -1; + } + + return dupfd; + } +#endif + if (flags & O_CREAT) { va_list ap; @@ -109,6 +206,21 @@ int qemu_open(const char *name, int flags, ...) int qemu_close(int fd) { + int64_t fdset_id; + + /* Close fd that was dup'd from an fdset */ + fdset_id = monitor_fdset_dup_fd_find(fd); + if (fdset_id != -1) { + int ret; + + ret = close(fd); + if (ret == 0) { + monitor_fdset_dup_fd_remove(fd); + } + + return ret; + } + return close(fd); } diff --git a/qemu-common.h b/qemu-common.h index e53126d..9becb32 100644 --- a/qemu-common.h +++ b/qemu-common.h @@ -166,6 +166,7 @@ int qemu_fls(int i); int qemu_fdatasync(int fd); int fcntl_setfl(int fd, int flag); int qemu_parse_fd(const char *param); +int qemu_parse_fdset(const char *param); /* * strtosz() suffixes used to specify the default treatment of an diff --git a/qemu-tool.c b/qemu-tool.c index 318c5fc..b7622f5 100644 --- a/qemu-tool.c +++ b/qemu-tool.c @@ -57,6 +57,26 @@ void monitor_protocol_event(MonitorEvent event, QObject *data) { } +int monitor_fdset_get_fd(int64_t fdset_id, int flags) +{ + return -1; +} + +int monitor_fdset_dup_fd_add(int64_t fdset_id, int dup_fd) +{ + return -1; +} + +int monitor_fdset_dup_fd_remove(int dup_fd) +{ + return -1; +} + +int monitor_fdset_dup_fd_find(int dup_fd) +{ + return -1; +} + int64_t cpu_get_clock(void) { return qemu_get_clock_ns(rt_clock); -- 1.7.10.4

On 08/09/2012 08:10 PM, Corey Bryant wrote:
When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open.
Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set
s/desriptors/descriptors/
can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
+int monitor_fdset_get_fd(int64_t fdset_id, int flags) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + int mon_fd_flags; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (mon_fdset_fd->removed) { + continue; + }
Is this right? According to the commit message, the whole point of leaving an fd in the set is to allow the fd to be reused as a dup() target for as long as the fdset is alive, even if the monitor no longer cares about the existence of the fd. But this will always skip over an fd marked for removal. Maybe this function needs a flag to say whether this is an initial open driven by an explicit user string (in which case, honor the removed flag - if the user removed the O_RDWR fd and then tries a drive_add with the same fdset, the drive_add should fail because from the user's perspective, there is no O_RDWR fd in the set); vs. an internal usage due to a reopen (use an fd even if removed is true, because we may be toggling between O_RDWR and O_RDONLY multiple times long after the monitor has already removed the fdset, based on actions that were not drive by an explicit /dev/fdset name.)
+ + mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL); + if (mon_fd_flags == -1) { + return -1; + } + + if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) { + return mon_fdset_fd->fd; + }
I still wonder if a request for O_RDONLY should be satisfied by an existing O_RDWR fd, especially in light of the fact that libvirt would rather pass in only one RDWR fd but qemu block opening currently opens twice during probing. But if it turns out to be a problem in practice, and if libvirt can't really manage to pass two fds into the set, we can hack that in later. Meanwhile, I'm okay with this first round patch requiring an exact match.
@@ -87,6 +151,39 @@ int qemu_open(const char *name, int flags, ...) int ret; int mode = 0;
+#ifndef _WIN32 + const char *fdset_id_str; + + /* Attempt dup of fd from fd set */ + if (strstart(name, "/dev/fdset/", &fdset_id_str)) { + int64_t fdset_id; + int fd, dupfd; + + fdset_id = qemu_parse_fdset(fdset_id_str); + if (fdset_id == -1) { + errno = EINVAL; + return -1; + } + + fd = monitor_fdset_get_fd(fdset_id, flags); + if (fd == -1) { + return -1; + } + + dupfd = qemu_dup(fd, flags); + if (fd == -1) { + return -1; + } + + ret = monitor_fdset_dup_fd_add(fdset_id, dupfd); + if (ret == -1) { + return -1;
Leaks dupfd (admittedly only on a corner-case failure, but still worth addressing). -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 08/10/2012 02:16 AM, Eric Blake wrote:
On 08/09/2012 08:10 PM, Corey Bryant wrote:
When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open.
Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set
s/desriptors/descriptors/
Thanks
can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
+int monitor_fdset_get_fd(int64_t fdset_id, int flags) +{ + MonFdset *mon_fdset; + MonFdsetFd *mon_fdset_fd; + int mon_fd_flags; + + QLIST_FOREACH(mon_fdset, &mon_fdsets, next) { + if (mon_fdset->id != fdset_id) { + continue; + } + QLIST_FOREACH(mon_fdset_fd, &mon_fdset->fds, next) { + if (mon_fdset_fd->removed) { + continue; + }
Is this right? According to the commit message, the whole point of leaving an fd in the set is to allow the fd to be reused as a dup() target for as long as the fdset is alive, even if the monitor no longer cares about the existence of the fd. But this will always skip over an fd marked for removal. Maybe this function needs a flag to say whether this is an initial open driven by an explicit user string (in which case, honor the removed flag - if the user removed the O_RDWR fd and then tries a drive_add with the same fdset, the drive_add should fail because from the user's perspective, there is no O_RDWR fd in the set); vs. an internal usage due to a reopen (use an fd even if removed is true, because we may be toggling between O_RDWR and O_RDONLY multiple times long after the monitor has already removed the fdset, based on actions that were not drive by an explicit /dev/fdset name.)
Hmm.. something needs to change here, either the commit message or the code. For security purposes, I'm thinking that an fd should no longer be available for opening after libvirt issues a remove-fd command, with no exceptions. That allows libvirt to have complete control over usage of an fd. Note that other fds in the fd set could still be used on a reopen, assuming remove-fd hasn't been called for them. Does that work for you? In that case, the code will remain as-is. Apologies if I've sent conflicting messages on this in the past.
+ + mon_fd_flags = fcntl(mon_fdset_fd->fd, F_GETFL); + if (mon_fd_flags == -1) { + return -1; + } + + if ((flags & O_ACCMODE) == (mon_fd_flags & O_ACCMODE)) { + return mon_fdset_fd->fd; + }
I still wonder if a request for O_RDONLY should be satisfied by an existing O_RDWR fd, especially in light of the fact that libvirt would rather pass in only one RDWR fd but qemu block opening currently opens twice during probing. But if it turns out to be a problem in practice, and if libvirt can't really manage to pass two fds into the set, we can hack that in later. Meanwhile, I'm okay with this first round patch requiring an exact match.
I see your point but I think allowing subset access mode matches could allow for a client to get lazy and cause security implications (client only adds RW fd's and QEMU only needs R access in some cases). So I'll keep this as is.
@@ -87,6 +151,39 @@ int qemu_open(const char *name, int flags, ...) int ret; int mode = 0;
+#ifndef _WIN32 + const char *fdset_id_str; + + /* Attempt dup of fd from fd set */ + if (strstart(name, "/dev/fdset/", &fdset_id_str)) { + int64_t fdset_id; + int fd, dupfd; + + fdset_id = qemu_parse_fdset(fdset_id_str); + if (fdset_id == -1) { + errno = EINVAL; + return -1; + } + + fd = monitor_fdset_get_fd(fdset_id, flags); + if (fd == -1) { + return -1; + } + + dupfd = qemu_dup(fd, flags); + if (fd == -1) { + return -1; + } + + ret = monitor_fdset_dup_fd_add(fdset_id, dupfd); + if (ret == -1) { + return -1;
Leaks dupfd (admittedly only on a corner-case failure, but still worth addressing).
Thanks, I'll fix this. -- Regards, Corey

On 08/10/2012 08:17 AM, Corey Bryant wrote:
can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
Is this right? According to the commit message, the whole point of leaving an fd in the set is to allow the fd to be reused as a dup() target for as long as the fdset is alive, even if the monitor no longer cares about the existence of the fd. But this will always skip over an fd marked for removal.
For security purposes, I'm thinking that an fd should no longer be available for opening after libvirt issues a remove-fd command, with no exceptions.
If that's the case, then you don't need the removed flag. Simply close the original fd and forget about it at the point that the monitor removes the fd. The fdset will still remain as long as there is a refcount, so that libvirt can then re-add to the set. And that also argues that query-fdsets MUST list empty fdsets, where all existing fds have been removed, but where the set can still be added to again just prior to the next reopen operation. That is, consider this life cycle: libvirt> add-fd opaque="RDWR" qemu< fdset=1, fd=3; the set is in use because of the monitor but with refcount 0 libvirt> drive_add /dev/fdset/1 qemu< success; internally, fd 4 was created as a dup of 3, and fdset 1 now has refcount 1 and tracks that fd 4 is tied to the set libvirt> remove-fd fdset=1 fd=3, since the drive_add is complete and libvirt no longer needs to track what fd qemu is using, but does remember internally that qemu is using fdset 1 qemu< success; internally, fdset 1 is still refcount 1 later on, libvirt prepares to do a snapshot, and knows that after the snapshot, qemu will want to reopen the file RDONLY, as part of making it the backing image libvirt> add-fd fdset=1 opaque="RDONLY" qemu< fdset=1, fd=3 (yes, fd 3 is available for use again, since the earlier remove closed it out) libvirt> blockdev-snapshot-sync qemu< success; internally, the block device knows that /dev/fdset/1 is now becoming a backing file, so it tries a reopen to convert its current RDWR fd 4 into something RDONLY; the probe succeeds by returning fd 5 as a dup of the readonly fd 3
Does that work for you? In that case, the code will remain as-is.
I think that lifecycle will work (either libvirt leaves an fd in the set for as long as it thinks qemu might need to do independent open operations, or it removes the fd as soon as it knows that qemu is done with open, but the fdset remains reserved, tracking all dup'd fds that were created from the set. For that matter, your refcount doesn't even have to be a separate field, but can simply be the length of the list of dup'd fds that are tied to the set until a call to qemu_close. That is, a set must remain alive as long as there is either an add-fd that has not yet been removed, or there is at least one dup'd fd that has not been qemu_close()d. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 08/10/2012 11:25 AM, Eric Blake wrote:
On 08/10/2012 08:17 AM, Corey Bryant wrote:
can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
Is this right? According to the commit message, the whole point of leaving an fd in the set is to allow the fd to be reused as a dup() target for as long as the fdset is alive, even if the monitor no longer cares about the existence of the fd. But this will always skip over an fd marked for removal.
For security purposes, I'm thinking that an fd should no longer be available for opening after libvirt issues a remove-fd command, with no exceptions.
If that's the case, then you don't need the removed flag. Simply close the original fd and forget about it at the point that the monitor
You're right. Kevin also brought this up to me offline during a discussion a little while ago. I'll plan on closing the fd immediately when remove-fd is issued.
removes the fd. The fdset will still remain as long as there is a refcount, so that libvirt can then re-add to the set. And that also argues that query-fdsets MUST list empty fdsets, where all existing fds have been removed, but where the set can still be added to again just prior to the next reopen operation.
That is, consider this life cycle:
libvirt> add-fd opaque="RDWR" qemu< fdset=1, fd=3; the set is in use because of the monitor but with refcount 0 libvirt> drive_add /dev/fdset/1 qemu< success; internally, fd 4 was created as a dup of 3, and fdset 1 now has refcount 1 and tracks that fd 4 is tied to the set libvirt> remove-fd fdset=1 fd=3, since the drive_add is complete and libvirt no longer needs to track what fd qemu is using, but does remember internally that qemu is using fdset 1 qemu< success; internally, fdset 1 is still refcount 1 later on, libvirt prepares to do a snapshot, and knows that after the snapshot, qemu will want to reopen the file RDONLY, as part of making it the backing image libvirt> add-fd fdset=1 opaque="RDONLY" qemu< fdset=1, fd=3 (yes, fd 3 is available for use again, since the earlier remove closed it out) libvirt> blockdev-snapshot-sync qemu< success; internally, the block device knows that /dev/fdset/1 is now becoming a backing file, so it tries a reopen to convert its current RDWR fd 4 into something RDONLY; the probe succeeds by returning fd 5 as a dup of the readonly fd 3
This all makes sense to me.
Does that work for you? In that case, the code will remain as-is.
I think that lifecycle will work (either libvirt leaves an fd in the set for as long as it thinks qemu might need to do independent open operations, or it removes the fd as soon as it knows that qemu is done with open, but the fdset remains reserved, tracking all dup'd fds that were created from the set.
Yep
For that matter, your refcount doesn't even have to be a separate field, but can simply be the length of the list of dup'd fds that are tied to the set until a call to qemu_close. That is, a set must remain alive as long as there is either an add-fd that has not yet been removed, or there is at least one dup'd fd that has not been qemu_close()d.
I agree. I'll plan on dropping refcount and just checking the dup_fds list. -- Regards, Corey

Am 10.08.2012 04:10, schrieb Corey Bryant:
When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open.
Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
@@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice) #endif }
+/* + * Dups an fd and sets the flags + */ +static int qemu_dup(int fd, int flags)
qemu_dup() is probably not a good name. We'll want to use it when we need to get a wrapper around dup(). And I suspect that we will need it for making bdrv_reopen() compatible with fdset refcounting.
+{ + int ret; + int serrno; + int dup_flags; + int setfl_flags; + + if (flags & O_CLOEXEC) { +#ifdef F_DUPFD_CLOEXEC + ret = fcntl(fd, F_DUPFD_CLOEXEC, 0); +#else + ret = dup(fd); + if (ret != -1) { + qemu_set_cloexec(ret); + } +#endif + } else { + ret = dup(fd); + }
qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should execute the then branch unconditionally (or we would have to change the qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even necessarily defined doesn't help with that). Kevin

On 08/10/2012 12:34 PM, Kevin Wolf wrote:
Am 10.08.2012 04:10, schrieb Corey Bryant:
When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open.
Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
@@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice) #endif }
+/* + * Dups an fd and sets the flags + */ +static int qemu_dup(int fd, int flags)
qemu_dup() is probably not a good name. We'll want to use it when we need to get a wrapper around dup(). And I suspect that we will need it for making bdrv_reopen() compatible with fdset refcounting.
Do you you have a suggestion for a name?
+{ + int ret; + int serrno; + int dup_flags; + int setfl_flags; + + if (flags & O_CLOEXEC) { +#ifdef F_DUPFD_CLOEXEC + ret = fcntl(fd, F_DUPFD_CLOEXEC, 0); +#else + ret = dup(fd); + if (ret != -1) { + qemu_set_cloexec(ret); + } +#endif + } else { + ret = dup(fd); + }
qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should execute the then branch unconditionally (or we would have to change the qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even necessarily defined doesn't help with that).
Sure I can modify this to always add O_CLOEXEC. (I know you mentioned this before but I think I interpreted the comment as a question to others.) Do you also want me to modify qemu_open to always add O_CLOEXEC? -- Regards, Corey

On 08/10/2012 12:56 PM, Corey Bryant wrote:
On 08/10/2012 12:34 PM, Kevin Wolf wrote:
Am 10.08.2012 04:10, schrieb Corey Bryant:
When qemu_open is passed a filename of the "/dev/fdset/nnn" format (where nnn is the fdset ID), an fd with matching access mode flags will be searched for within the specified monitor fd set. If the fd is found, a dup of the fd will be returned from qemu_open.
Each fd set has a reference count. The purpose of the reference count is to determine if an fd set contains file descriptors that have open dup() references that have not yet been closed. It is incremented on qemu_open and decremented on qemu_close. It is not until the refcount is zero that file desriptors in an fd set can be closed. If an fd set has dup() references open, then we must keep the other fds in the fd set open in case a reopen of the file occurs that requires an fd with a different access mode.
Signed-off-by: Corey Bryant <coreyb@linux.vnet.ibm.com>
@@ -78,6 +79,69 @@ int qemu_madvise(void *addr, size_t len, int advice) #endif }
+/* + * Dups an fd and sets the flags + */ +static int qemu_dup(int fd, int flags)
qemu_dup() is probably not a good name. We'll want to use it when we need to get a wrapper around dup(). And I suspect that we will need it for making bdrv_reopen() compatible with fdset refcounting.
Do you you have a suggestion for a name?
+{ + int ret; + int serrno; + int dup_flags; + int setfl_flags; + + if (flags & O_CLOEXEC) { +#ifdef F_DUPFD_CLOEXEC + ret = fcntl(fd, F_DUPFD_CLOEXEC, 0); +#else + ret = dup(fd); + if (ret != -1) { + qemu_set_cloexec(ret); + } +#endif + } else { + ret = dup(fd); + }
qemu_open() is supposed to add O_CLOEXEC by itself, so I think we should execute the then branch unconditionally (or we would have to change the qemu_dup() call below to add it - but the fact that O_CLOEXEC isn't even necessarily defined doesn't help with that).
Sure I can modify this to always add O_CLOEXEC. (I know you mentioned this before but I think I interpreted the comment as a question to others.)
Do you also want me to modify qemu_open to always add O_CLOEXEC?
My mistake.. it's always set in qemu_open already. -- Regards, Corey

Am 10.08.2012 04:10, schrieb Corey Bryant:
libvirt's sVirt security driver provides SELinux MAC isolation for Qemu guest processes and their corresponding image files. In other words, sVirt uses SELinux to prevent a QEMU process from opening files that do not belong to it.
sVirt provides this support by labeling guests and resources with security labels that are stored in file system extended attributes. Some file systems, such as NFS, do not support the extended attribute security namespace, and therefore cannot support sVirt isolation.
A solution to this problem is to provide fd passing support, where libvirt opens files and passes file descriptors to QEMU. This, along with SELinux policy to prevent QEMU from opening files, can provide image file isolation for NFS files stored on the same NFS mount.
This patch series adds the add-fd, remove-fd, and query-fdsets QMP monitor commands, which allow file descriptors to be passed via SCM_RIGHTS, and assigned to specified fd sets. This allows fd sets to be created per file with fds having, for example, different access rights. When QEMU needs to reopen a file with different access rights, it can search for a matching fd in the fd set. Fd sets also allow for easy tracking of fds per file, helping to prevent fd leaks.
Support is also added to the block layer to allow QEMU to dup an fd from an fdset when the filename is of the /dev/fdset/nnn format, where nnn is the fd set ID.
No new SELinux policy is required to prevent open of NFS files (files with type nfs_t). The virt_use_nfs boolean type simply needs to be set to false, and open will be prevented (and dup will be allowed). For example:
# setsebool virt_use_nfs 0 # getsebool virt_use_nfs virt_use_nfs --> off
Corey Bryant (7): qemu-char: Add MSG_CMSG_CLOEXEC flag to recvmsg qapi: Introduce add-fd, remove-fd, query-fdsets monitor: Clean up fd sets on monitor disconnect block: Prevent detection of /dev/fdset/ as floppy block: Convert open calls to qemu_open block: Convert close calls to qemu_close block: Enable qemu_open/close to work with fd sets
block/raw-posix.c | 46 +++++---- block/raw-win32.c | 6 +- block/vdi.c | 5 +- block/vmdk.c | 25 ++--- block/vpc.c | 4 +- block/vvfat.c | 16 +-- cutils.c | 5 + monitor.c | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++++ monitor.h | 5 + osdep.c | 117 +++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++ qemu-char.c | 12 ++- qemu-common.h | 2 + qemu-tool.c | 20 ++++ qmp-commands.hx | 117 +++++++++++++++++++++ savevm.c | 4 +- 16 files changed, 721 insertions(+), 55 deletions(-)
Apart from the few comments I made, I like this series. Maybe v9 will be the last one. :-) Kevin

On 08/10/2012 12:36 PM, Kevin Wolf wrote:
Am 10.08.2012 04:10, schrieb Corey Bryant:
libvirt's sVirt security driver provides SELinux MAC isolation for Qemu guest processes and their corresponding image files. In other words, sVirt uses SELinux to prevent a QEMU process from opening files that do not belong to it.
sVirt provides this support by labeling guests and resources with security labels that are stored in file system extended attributes. Some file systems, such as NFS, do not support the extended attribute security namespace, and therefore cannot support sVirt isolation.
A solution to this problem is to provide fd passing support, where libvirt opens files and passes file descriptors to QEMU. This, along with SELinux policy to prevent QEMU from opening files, can provide image file isolation for NFS files stored on the same NFS mount.
This patch series adds the add-fd, remove-fd, and query-fdsets QMP monitor commands, which allow file descriptors to be passed via SCM_RIGHTS, and assigned to specified fd sets. This allows fd sets to be created per file with fds having, for example, different access rights. When QEMU needs to reopen a file with different access rights, it can search for a matching fd in the fd set. Fd sets also allow for easy tracking of fds per file, helping to prevent fd leaks.
Support is also added to the block layer to allow QEMU to dup an fd from an fdset when the filename is of the /dev/fdset/nnn format, where nnn is the fd set ID.
No new SELinux policy is required to prevent open of NFS files (files with type nfs_t). The virt_use_nfs boolean type simply needs to be set to false, and open will be prevented (and dup will be allowed). For example:
# setsebool virt_use_nfs 0 # getsebool virt_use_nfs virt_use_nfs --> off
Corey Bryant (7): qemu-char: Add MSG_CMSG_CLOEXEC flag to recvmsg qapi: Introduce add-fd, remove-fd, query-fdsets monitor: Clean up fd sets on monitor disconnect block: Prevent detection of /dev/fdset/ as floppy block: Convert open calls to qemu_open block: Convert close calls to qemu_close block: Enable qemu_open/close to work with fd sets
block/raw-posix.c | 46 +++++---- block/raw-win32.c | 6 +- block/vdi.c | 5 +- block/vmdk.c | 25 ++--- block/vpc.c | 4 +- block/vvfat.c | 16 +-- cutils.c | 5 + monitor.c | 294 +++++++++++++++++++++++++++++++++++++++++++++++++++++ monitor.h | 5 + osdep.c | 117 +++++++++++++++++++++ qapi-schema.json | 98 ++++++++++++++++++ qemu-char.c | 12 ++- qemu-common.h | 2 + qemu-tool.c | 20 ++++ qmp-commands.hx | 117 +++++++++++++++++++++ savevm.c | 4 +- 16 files changed, 721 insertions(+), 55 deletions(-)
Apart from the few comments I made, I like this series. Maybe v9 will be the last one. :-)
Thanks, I hope so too! -- Regards, Corey
participants (4)
-
Corey Bryant
-
Eric Blake
-
Kevin Wolf
-
Stefan Hajnoczi