[libvirt PATCH 00/17] qemu: Implement external limit manager feature

This feature has been requested by KubeVirt developers and will make it possible for them to make some VFIO-related features, such as migration and hotplug, work correctly. https://bugzilla.redhat.com/show_bug.cgi?id=1916346 The first part of the series, especially the first 9 patches, is preparation work: it addresses a few annoying issues with our APIs that deal with process limits, and makes them all nice, consistent and easy to reason about while moving policy code from the generic code to the QEMU driver where it belongs. Andrea Bolognani (17): util: Document limit-related functions util: Simplify stubs util: Always pass a pid to virProcessSetMax*() util: Introduce virProcess{Get,Set}Limit() qemu: Make some minor tweaks qemu: Set all limits at the same time util: Have virCommand remember whether limits are set qemu: Set limits only when explicitly asked to do so util: Don't special-case setting a limit to zero conf: Rename original_memlock -> originalMemlock tests: Mock virProcessGetMaxMemLock() util: Try to get limits from /proc qemu: Don't ignore virProcessGetMaxMemLock() errors qemu: Refactor qemuDomainAdjustMaxMemLock() qemu: Add external_limit_manager config knob qemu: Wire up external limit manager news: Document external limit manager feature NEWS.rst | 10 + src/conf/domain_conf.h | 5 +- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 + src/qemu/qemu_command.c | 4 - src/qemu/qemu_conf.c | 4 + src/qemu/qemu_conf.h | 1 + src/qemu/qemu_domain.c | 47 ++-- src/qemu/qemu_migration.c | 2 + src/qemu/qemu_process.c | 30 ++- src/qemu/test_libvirtd_qemu.aug.in | 1 + src/util/vircommand.c | 21 +- src/util/virprocess.c | 340 ++++++++++++++++++++--------- src/util/virprocess.h | 2 +- tests/virprocessmock.c | 7 + 15 files changed, 354 insertions(+), 133 deletions(-) -- 2.26.2

We're going to change their behavior, so it's good to have the current one documented to serve as baseline. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 46 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 69d64e9466..9fe75d54d6 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -739,6 +739,15 @@ virProcessPrLimit(pid_t pid G_GNUC_UNUSED, #endif #if WITH_SETRLIMIT && defined(RLIMIT_MEMLOCK) +/** + * virProcessSetMaxMemLock: + * @pid: process to be changed (0 for the current process) + * @bytes: new limit (0 for no change) + * + * Sets a new limit on the amount of locked memory for a process. + * + * Returns: 0 on success, <0 on failure. + */ int virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes) { @@ -791,6 +800,15 @@ virProcessSetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long bytes) #endif /* ! (WITH_SETRLIMIT && defined(RLIMIT_MEMLOCK)) */ #if WITH_GETRLIMIT && defined(RLIMIT_MEMLOCK) +/** + * virProcessGetMaxMemLock: + * @pid: process to be queried (0 for the current process) + * @bytes: return location for the limit + * + * Obtain the current limit on the amount of locked memory for a process. + * + * Returns: 0 on success, <0 on failure. + */ int virProcessGetMaxMemLock(pid_t pid, unsigned long long *bytes) @@ -843,6 +861,16 @@ virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, #endif /* ! (WITH_GETRLIMIT && defined(RLIMIT_MEMLOCK)) */ #if WITH_SETRLIMIT && defined(RLIMIT_NPROC) +/** + * virProcessSetMaxProcesses: + * @pid: process to be changed (0 for the current process) + * @procs: new limit (0 for no change) + * + * Sets a new limit on the amount of processes for the user the + * process is running as. + * + * Returns: 0 on success, <0 on failure. + */ int virProcessSetMaxProcesses(pid_t pid, unsigned int procs) { @@ -883,6 +911,15 @@ virProcessSetMaxProcesses(pid_t pid G_GNUC_UNUSED, unsigned int procs) #endif /* ! (WITH_SETRLIMIT && defined(RLIMIT_NPROC)) */ #if WITH_SETRLIMIT && defined(RLIMIT_NOFILE) +/** + * virProcessSetMaxFiles: + * @pid: process to be changed (0 for the current process) + * @files: new limit (0 for no change) + * + * Sets a new limit on the number of opened files for a process. + * + * Returns: 0 on success, <0 on failure. + */ int virProcessSetMaxFiles(pid_t pid, unsigned int files) { @@ -931,6 +968,15 @@ virProcessSetMaxFiles(pid_t pid G_GNUC_UNUSED, unsigned int files) #endif /* ! (WITH_SETRLIMIT && defined(RLIMIT_NOFILE)) */ #if WITH_SETRLIMIT && defined(RLIMIT_CORE) +/** + * virProcessSetMaxCoreSize: + * @pid: process to be changed (0 for the current process) + * @bytes: new limit (0 to disable core dumps) + * + * Sets a new limit on the size of core dumps for a process. + * + * Returns: 0 on success, <0 on failure. + */ int virProcessSetMaxCoreSize(pid_t pid, unsigned long long bytes) { -- 2.26.2

Calling a stub should always result in ENOSYS being raised, regardless of what arguments are passed to it. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 9fe75d54d6..e01ff25540 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -850,11 +850,8 @@ virProcessGetMaxMemLock(pid_t pid, #else /* ! (WITH_GETRLIMIT && defined(RLIMIT_MEMLOCK)) */ int virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, - unsigned long long *bytes) + unsigned long long *bytes G_GNUC_UNUSED) { - if (!bytes) - return 0; - virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); return -1; } @@ -900,11 +897,9 @@ virProcessSetMaxProcesses(pid_t pid, unsigned int procs) } #else /* ! (WITH_SETRLIMIT && defined(RLIMIT_NPROC)) */ int -virProcessSetMaxProcesses(pid_t pid G_GNUC_UNUSED, unsigned int procs) +virProcessSetMaxProcesses(pid_t pid G_GNUC_UNUSED, + unsigned int procs G_GNUC_UNUSED) { - if (procs == 0) - return 0; - virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); return -1; } @@ -957,11 +952,9 @@ virProcessSetMaxFiles(pid_t pid, unsigned int files) } #else /* ! (WITH_SETRLIMIT && defined(RLIMIT_NOFILE)) */ int -virProcessSetMaxFiles(pid_t pid G_GNUC_UNUSED, unsigned int files) +virProcessSetMaxFiles(pid_t pid G_GNUC_UNUSED, + unsigned int files G_GNUC_UNUSED) { - if (files == 0) - return 0; - virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); return -1; } @@ -1004,11 +997,8 @@ virProcessSetMaxCoreSize(pid_t pid, unsigned long long bytes) #else /* ! (WITH_SETRLIMIT && defined(RLIMIT_CORE)) */ int virProcessSetMaxCoreSize(pid_t pid G_GNUC_UNUSED, - unsigned long long bytes) + unsigned long long bytes G_GNUC_UNUSED) { - if (bytes == 0) - return 0; - virReportSystemError(ENOSYS, "%s", _("Not supported on this platform")); return -1; } -- 2.26.2

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
Calling a stub should always result in ENOSYS being raised, regardless of what arguments are passed to it.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-)
Missed virProcessSetMaxMemLock(). Michal

On Mon, 2021-03-08 at 11:31 +0100, Michal Privoznik wrote:
On 3/5/21 8:13 PM, Andrea Bolognani wrote:
Calling a stub should always result in ENOSYS being raised, regardless of what arguments are passed to it.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 22 ++++++---------------- 1 file changed, 6 insertions(+), 16 deletions(-)
Missed virProcessSetMaxMemLock().
Good catch - I must have misplaced it during a rebase O:-) -- Andrea Bolognani / Red Hat / Virtualization

Currently, the functions accept either an explicit pid or zero, in which case the current process should be modified: the latter might sound like a convenient little feature, but in reality obtaining the pid of the current process is a single additional function call away, so it hardly makes a difference. Removing the few cases in which we're passing zero will allow us to simplify and improve the functions later. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/vircommand.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/src/util/vircommand.c b/src/util/vircommand.c index 1a4b77ea24..3eef0767bb 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -780,11 +780,13 @@ virExec(virCommandPtr cmd) } } + pid = getpid(); + if (cmd->pidfile) { int pidfilefd = -1; char c; - pidfilefd = virPidFileAcquirePath(cmd->pidfile, false, getpid()); + pidfilefd = virPidFileAcquirePath(cmd->pidfile, false, pid); if (pidfilefd < 0) goto fork_error; if (virSetInherit(pidfilefd, true) < 0) { @@ -804,14 +806,14 @@ virExec(virCommandPtr cmd) /* pidfilefd is intentionally leaked. */ } - if (virProcessSetMaxMemLock(0, cmd->maxMemLock) < 0) + if (virProcessSetMaxMemLock(pid, cmd->maxMemLock) < 0) goto fork_error; - if (virProcessSetMaxProcesses(0, cmd->maxProcesses) < 0) + if (virProcessSetMaxProcesses(pid, cmd->maxProcesses) < 0) goto fork_error; - if (virProcessSetMaxFiles(0, cmd->maxFiles) < 0) + if (virProcessSetMaxFiles(pid, cmd->maxFiles) < 0) goto fork_error; if (cmd->setMaxCore && - virProcessSetMaxCoreSize(0, cmd->maxCore) < 0) + virProcessSetMaxCoreSize(pid, cmd->maxCore) < 0) goto fork_error; if (cmd->hook) { -- 2.26.2

These functions abstract part of the existing logic, which is the same in all virProcessSetMax*() functions, and changes it so that which underlying syscall is used depends on their availability rather than on the context in which they are called: since prlimit() and {g,s}etrlimit() have slightly different requirements, using the same one every single time should make for a more consistent experience. As part of the change, we also remove the special case for passing zero to virProcessSetMax*() functions: we have removed all callers that depended on that functionality in the previous commit, so this is now safe to do and makes the semantics simpler. This commit is better viewed with 'git show -w'. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 175 +++++++++++++++++++++++------------------- 1 file changed, 95 insertions(+), 80 deletions(-) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index e01ff25540..38b248217e 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -738,10 +738,66 @@ virProcessPrLimit(pid_t pid G_GNUC_UNUSED, } #endif +#if WITH_GETRLIMIT +static int +virProcessGetRLimit(int resource, + struct rlimit *old_limit) +{ + return getrlimit(resource, old_limit); +} +#endif /* WITH_GETRLIMIT */ + +#if WITH_SETRLIMIT +static int +virProcessSetRLimit(int resource, + const struct rlimit *new_limit) +{ + return setrlimit(resource, new_limit); +} +#endif /* WITH_SETRLIMIT */ + +#if WITH_GETRLIMIT +static int +virProcessGetLimit(pid_t pid, + int resource, + struct rlimit *old_limit) +{ + pid_t current_pid = getpid(); + bool same_process = (pid == current_pid); + + if (virProcessPrLimit(pid, resource, NULL, old_limit) == 0) + return 0; + + if (same_process && virProcessGetRLimit(resource, old_limit) == 0) + return 0; + + return -1; +} +#endif /* WITH_GETRLIMIT */ + +#if WITH_SETRLIMIT +static int +virProcessSetLimit(pid_t pid, + int resource, + const struct rlimit *new_limit) +{ + pid_t current_pid = getpid(); + bool same_process = (pid == current_pid); + + if (virProcessPrLimit(pid, resource, new_limit, NULL) == 0) + return 0; + + if (same_process && virProcessSetRLimit(resource, new_limit) == 0) + return 0; + + return -1; +} +#endif /* WITH_SETRLIMIT */ + #if WITH_SETRLIMIT && defined(RLIMIT_MEMLOCK) /** * virProcessSetMaxMemLock: - * @pid: process to be changed (0 for the current process) + * @pid: process to be changed * @bytes: new limit (0 for no change) * * Sets a new limit on the amount of locked memory for a process. @@ -765,21 +821,11 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes) else rlim.rlim_cur = rlim.rlim_max = RLIM_INFINITY; - if (pid == 0) { - if (setrlimit(RLIMIT_MEMLOCK, &rlim) < 0) { - virReportSystemError(errno, - _("cannot limit locked memory to %llu"), - bytes); - return -1; - } - } else { - if (virProcessPrLimit(pid, RLIMIT_MEMLOCK, &rlim, NULL) < 0) { - virReportSystemError(errno, - _("cannot limit locked memory " - "of process %lld to %llu"), - (long long int)pid, bytes); - return -1; - } + if (virProcessSetLimit(pid, RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit locked memory " + "of process %lld to %llu"), + (long long int)pid, bytes); } VIR_DEBUG("Locked memory for process %lld limited to %llu bytes", @@ -802,7 +848,7 @@ virProcessSetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long bytes) #if WITH_GETRLIMIT && defined(RLIMIT_MEMLOCK) /** * virProcessGetMaxMemLock: - * @pid: process to be queried (0 for the current process) + * @pid: process to be queried * @bytes: return location for the limit * * Obtain the current limit on the amount of locked memory for a process. @@ -818,21 +864,12 @@ virProcessGetMaxMemLock(pid_t pid, if (!bytes) return 0; - if (pid == 0) { - if (getrlimit(RLIMIT_MEMLOCK, &rlim) < 0) { - virReportSystemError(errno, - "%s", - _("cannot get locked memory limit")); - return -1; - } - } else { - if (virProcessPrLimit(pid, RLIMIT_MEMLOCK, NULL, &rlim) < 0) { - virReportSystemError(errno, - _("cannot get locked memory limit " - "of process %lld"), - (long long int) pid); - return -1; - } + if (virProcessGetLimit(pid, RLIMIT_MEMLOCK, &rlim) < 0) { + virReportSystemError(errno, + _("cannot get locked memory limit " + "of process %lld"), + (long long int) pid); + return -1; } /* virProcessSetMaxMemLock() sets both rlim_cur and rlim_max to the @@ -860,7 +897,7 @@ virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, #if WITH_SETRLIMIT && defined(RLIMIT_NPROC) /** * virProcessSetMaxProcesses: - * @pid: process to be changed (0 for the current process) + * @pid: process to be changed * @procs: new limit (0 for no change) * * Sets a new limit on the amount of processes for the user the @@ -877,21 +914,13 @@ virProcessSetMaxProcesses(pid_t pid, unsigned int procs) return 0; rlim.rlim_cur = rlim.rlim_max = procs; - if (pid == 0) { - if (setrlimit(RLIMIT_NPROC, &rlim) < 0) { - virReportSystemError(errno, - _("cannot limit number of subprocesses to %u"), - procs); - return -1; - } - } else { - if (virProcessPrLimit(pid, RLIMIT_NPROC, &rlim, NULL) < 0) { - virReportSystemError(errno, - _("cannot limit number of subprocesses " - "of process %lld to %u"), - (long long int)pid, procs); - return -1; - } + + if (virProcessSetLimit(pid, RLIMIT_NPROC, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of subprocesses " + "of process %lld to %u"), + (long long int)pid, procs); + return -1; } return 0; } @@ -908,7 +937,7 @@ virProcessSetMaxProcesses(pid_t pid G_GNUC_UNUSED, #if WITH_SETRLIMIT && defined(RLIMIT_NOFILE) /** * virProcessSetMaxFiles: - * @pid: process to be changed (0 for the current process) + * @pid: process to be changed * @files: new limit (0 for no change) * * Sets a new limit on the number of opened files for a process. @@ -932,22 +961,15 @@ virProcessSetMaxFiles(pid_t pid, unsigned int files) * behavior. */ rlim.rlim_cur = rlim.rlim_max = files + 1; - if (pid == 0) { - if (setrlimit(RLIMIT_NOFILE, &rlim) < 0) { - virReportSystemError(errno, - _("cannot limit number of open files to %u"), - files); - return -1; - } - } else { - if (virProcessPrLimit(pid, RLIMIT_NOFILE, &rlim, NULL) < 0) { - virReportSystemError(errno, - _("cannot limit number of open files " - "of process %lld to %u"), - (long long int)pid, files); - return -1; - } + + if (virProcessSetLimit(pid, RLIMIT_NOFILE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit number of open files " + "of process %lld to %u"), + (long long int)pid, files); + return -1; } + return 0; } #else /* ! (WITH_SETRLIMIT && defined(RLIMIT_NOFILE)) */ @@ -963,7 +985,7 @@ virProcessSetMaxFiles(pid_t pid G_GNUC_UNUSED, #if WITH_SETRLIMIT && defined(RLIMIT_CORE) /** * virProcessSetMaxCoreSize: - * @pid: process to be changed (0 for the current process) + * @pid: process to be changed * @bytes: new limit (0 to disable core dumps) * * Sets a new limit on the size of core dumps for a process. @@ -976,22 +998,15 @@ virProcessSetMaxCoreSize(pid_t pid, unsigned long long bytes) struct rlimit rlim; rlim.rlim_cur = rlim.rlim_max = bytes; - if (pid == 0) { - if (setrlimit(RLIMIT_CORE, &rlim) < 0) { - virReportSystemError(errno, - _("cannot limit core file size to %llu"), - bytes); - return -1; - } - } else { - if (virProcessPrLimit(pid, RLIMIT_CORE, &rlim, NULL) < 0) { - virReportSystemError(errno, - _("cannot limit core file size " - "of process %lld to %llu"), - (long long int)pid, bytes); - return -1; - } + + if (virProcessSetLimit(pid, RLIMIT_CORE, &rlim) < 0) { + virReportSystemError(errno, + _("cannot limit core file size " + "of process %lld to %llu"), + (long long int)pid, bytes); + return -1; } + return 0; } #else /* ! (WITH_SETRLIMIT && defined(RLIMIT_CORE)) */ -- 2.26.2

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
These functions abstract part of the existing logic, which is the same in all virProcessSetMax*() functions, and changes it so that which underlying syscall is used depends on their availability rather than on the context in which they are called: since prlimit() and {g,s}etrlimit() have slightly different requirements, using the same one every single time should make for a more consistent experience.
As part of the change, we also remove the special case for passing zero to virProcessSetMax*() functions: we have removed all callers that depended on that functionality in the previous commit, so this is now safe to do and makes the semantics simpler.
This commit is better viewed with 'git show -w'.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 175 +++++++++++++++++++++++------------------- 1 file changed, 95 insertions(+), 80 deletions(-)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index e01ff25540..38b248217e 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -738,10 +738,66 @@ virProcessPrLimit(pid_t pid G_GNUC_UNUSED, } #endif
+#if WITH_GETRLIMIT +static int +virProcessGetRLimit(int resource, + struct rlimit *old_limit) +{ + return getrlimit(resource, old_limit); +} +#endif /* WITH_GETRLIMIT */ + +#if WITH_SETRLIMIT +static int +virProcessSetRLimit(int resource, + const struct rlimit *new_limit) +{ + return setrlimit(resource, new_limit); +} +#endif /* WITH_SETRLIMIT */ + +#if WITH_GETRLIMIT
Question of preference. I'd rather have these two WITH_GETRLIMIT sections joined together.... [1]
+static int +virProcessGetLimit(pid_t pid, + int resource, + struct rlimit *old_limit) +{ + pid_t current_pid = getpid(); + bool same_process = (pid == current_pid); + + if (virProcessPrLimit(pid, resource, NULL, old_limit) == 0) + return 0; + + if (same_process && virProcessGetRLimit(resource, old_limit) == 0) + return 0; + + return -1; +} +#endif /* WITH_GETRLIMIT */ + +#if WITH_SETRLIMIT
1: ... just like these two WITH_SETRLIMIT. But it's matter of taste, I agree.
+static int +virProcessSetLimit(pid_t pid, + int resource, + const struct rlimit *new_limit) +{ + pid_t current_pid = getpid(); + bool same_process = (pid == current_pid); + + if (virProcessPrLimit(pid, resource, new_limit, NULL) == 0) + return 0; + + if (same_process && virProcessSetRLimit(resource, new_limit) == 0) + return 0; + + return -1; +} +#endif /* WITH_SETRLIMIT */ +
Michal

Doing this now will make the next changes nicer. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/qemu_process.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index d3208e5203..1e1df6da64 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7013,10 +7013,13 @@ qemuProcessLaunch(virConnectPtr conn, goto cleanup; virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData); + virCommandSetUmask(cmd, 0x002); + + VIR_DEBUG("Setting up process limits"); + virCommandSetMaxProcesses(cmd, cfg->maxProcesses); virCommandSetMaxFiles(cmd, cfg->maxFiles); virCommandSetMaxCoreSize(cmd, cfg->maxCore); - virCommandSetUmask(cmd, 0x002); VIR_DEBUG("Setting up security labelling"); if (qemuSecuritySetChildProcessLabel(driver->securityManager, -- 2.26.2

qemuProcessLaunch() is the correct place to set process limits, and in fact is where we were dealing with almost all of them, but the memory locking limit was handled in qemuBuildCommandLine() instead for some reason. The code is rewritten so that the desired limit is calculated and applied in separated steps, which will help with further changes, but this doesn't alter the behavior. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/qemu_command.c | 4 ---- src/qemu/qemu_process.c | 6 ++++++ 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index f255b0f881..348e7488f7 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -10121,10 +10121,6 @@ qemuBuildCommandLine(virQEMUDriverPtr driver, qemuBuildVsockCommandLine(cmd, def, def->vsock, qemuCaps) < 0) return NULL; - /* In some situations, eg. VFIO passthrough, QEMU might need to lock a - * significant amount of memory, so we need to set the limit accordingly */ - virCommandSetMaxMemLock(cmd, qemuDomainGetMemLockLimitBytes(def, false)); - if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MSG_TIMESTAMP) && cfg->logTimestamp) virCommandAddArgList(cmd, "-msg", "timestamp=on", NULL); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 1e1df6da64..92e33707c0 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -6920,6 +6920,7 @@ qemuProcessLaunch(virConnectPtr conn, g_autoptr(virQEMUDriverConfig) cfg = NULL; size_t nnicindexes = 0; g_autofree int *nicindexes = NULL; + unsigned long long maxMemLock = 0; VIR_DEBUG("conn=%p driver=%p vm=%p name=%s if=%d asyncJob=%d " "incoming.launchURI=%s incoming.deferredURI=%s " @@ -7017,6 +7018,11 @@ qemuProcessLaunch(virConnectPtr conn, VIR_DEBUG("Setting up process limits"); + /* In some situations, eg. VFIO passthrough, QEMU might need to lock a + * significant amount of memory, so we need to set the limit accordingly */ + maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); + + virCommandSetMaxMemLock(cmd, maxMemLock); virCommandSetMaxProcesses(cmd, cfg->maxProcesses); virCommandSetMaxFiles(cmd, cfg->maxFiles); virCommandSetMaxCoreSize(cmd, cfg->maxCore); -- 2.26.2

Currently this only happens for the core size, but we want the behavior to be consistent for other limits as well. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/vircommand.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/src/util/vircommand.c b/src/util/vircommand.c index 3eef0767bb..044c5e0500 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -132,8 +132,11 @@ struct _virCommand { bool reap; bool rawStatus; + bool setMaxMemLock; unsigned long long maxMemLock; + bool setMaxProcesses; unsigned int maxProcesses; + bool setMaxFiles; unsigned int maxFiles; bool setMaxCore; unsigned long long maxCore; @@ -806,11 +809,14 @@ virExec(virCommandPtr cmd) /* pidfilefd is intentionally leaked. */ } - if (virProcessSetMaxMemLock(pid, cmd->maxMemLock) < 0) + if (cmd->setMaxMemLock && + virProcessSetMaxMemLock(pid, cmd->maxMemLock) < 0) goto fork_error; - if (virProcessSetMaxProcesses(pid, cmd->maxProcesses) < 0) + if (cmd->setMaxProcesses && + virProcessSetMaxProcesses(pid, cmd->maxProcesses) < 0) goto fork_error; - if (virProcessSetMaxFiles(pid, cmd->maxFiles) < 0) + if (cmd->setMaxFiles && + virProcessSetMaxFiles(pid, cmd->maxFiles) < 0) goto fork_error; if (cmd->setMaxCore && virProcessSetMaxCoreSize(pid, cmd->maxCore) < 0) @@ -1149,6 +1155,7 @@ virCommandSetMaxMemLock(virCommandPtr cmd, unsigned long long bytes) return; cmd->maxMemLock = bytes; + cmd->setMaxMemLock = true; } void @@ -1158,6 +1165,7 @@ virCommandSetMaxProcesses(virCommandPtr cmd, unsigned int procs) return; cmd->maxProcesses = procs; + cmd->setMaxProcesses = true; } void @@ -1167,6 +1175,7 @@ virCommandSetMaxFiles(virCommandPtr cmd, unsigned int files) return; cmd->maxFiles = files; + cmd->setMaxFiles = true; } void virCommandSetMaxCoreSize(virCommandPtr cmd, unsigned long long bytes) -- 2.26.2

The current code is written under the assumption that, for all limits except the core size, asking for the limit to be set to zero is a no-op, and so the operation is performed unconditionally. While this is the behavior we want for the QEMU driver, the virCommand and virProcess facilities are generic, and should not implement this kind of policy: asking for a limit to be set to zero should result in that limit being set to zero every single time. Add some checks in the QEMU driver, effectively moving the policy where it belongs. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/qemu_domain.c | 5 +++-- src/qemu/qemu_migration.c | 2 ++ src/qemu/qemu_process.c | 15 ++++++++++++--- 3 files changed, 17 insertions(+), 5 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 17fa71c21b..54d8bd0d3a 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9259,9 +9259,10 @@ qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, vm->original_memlock = 0; } - /* Trying to set the memory locking limit to zero is a no-op */ - if (virProcessSetMaxMemLock(vm->pid, bytes) < 0) + if (bytes > 0 && + virProcessSetMaxMemLock(vm->pid, bytes) < 0) { return -1; + } return 0; } diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index d7231f68ae..e44931dcfa 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -2950,6 +2950,7 @@ qemuMigrationDstPrepareAny(virQEMUDriverPtr driver, } if (STREQ_NULLABLE(protocol, "rdma") && + vm->def->mem.hard_limit > 0 && virProcessSetMaxMemLock(vm->pid, vm->def->mem.hard_limit << 10) < 0) { goto stopjob; } @@ -4199,6 +4200,7 @@ qemuMigrationSrcRun(virQEMUDriverPtr driver, switch (spec->destType) { case MIGRATION_DEST_HOST: if (STREQ(spec->dest.host.protocol, "rdma") && + vm->def->mem.hard_limit > 0 && virProcessSetMaxMemLock(vm->pid, vm->def->mem.hard_limit << 10) < 0) { goto exit_monitor; } diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 92e33707c0..c05cbe3570 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7022,9 +7022,18 @@ qemuProcessLaunch(virConnectPtr conn, * significant amount of memory, so we need to set the limit accordingly */ maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); - virCommandSetMaxMemLock(cmd, maxMemLock); - virCommandSetMaxProcesses(cmd, cfg->maxProcesses); - virCommandSetMaxFiles(cmd, cfg->maxFiles); + /* For all these settings, zero indicates that the limit should + * not be set explicitly and the default/inherited limit should + * be applied instead */ + if (maxMemLock > 0) + virCommandSetMaxMemLock(cmd, maxMemLock); + if (cfg->maxProcesses > 0) + virCommandSetMaxProcesses(cmd, cfg->maxProcesses); + if (cfg->maxFiles > 0) + virCommandSetMaxFiles(cmd, cfg->maxFiles); + + /* In this case, however, zero means that core dumps should be + * disabled, and so we always need to set the limit explicitly */ virCommandSetMaxCoreSize(cmd, cfg->maxCore); VIR_DEBUG("Setting up security labelling"); -- 2.26.2

This behavior reflects the needs of the QEMU driver and has no place in a generic module such as virProcess. Thanks to the changes made with the previous commit, it is now safe to remove these checks and make all virProcessSetMax*() functions finally behave the same way. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 15 +++------------ 1 file changed, 3 insertions(+), 12 deletions(-) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 38b248217e..abce522bf3 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -798,7 +798,7 @@ virProcessSetLimit(pid_t pid, /** * virProcessSetMaxMemLock: * @pid: process to be changed - * @bytes: new limit (0 for no change) + * @bytes: new limit * * Sets a new limit on the amount of locked memory for a process. * @@ -809,9 +809,6 @@ virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes) { struct rlimit rlim; - if (bytes == 0) - return 0; - /* We use VIR_DOMAIN_MEMORY_PARAM_UNLIMITED internally to represent * unlimited memory amounts, but setrlimit() and prlimit() use * RLIM_INFINITY for the same purpose, so we need to translate between @@ -898,7 +895,7 @@ virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, /** * virProcessSetMaxProcesses: * @pid: process to be changed - * @procs: new limit (0 for no change) + * @procs: new limit * * Sets a new limit on the amount of processes for the user the * process is running as. @@ -910,9 +907,6 @@ virProcessSetMaxProcesses(pid_t pid, unsigned int procs) { struct rlimit rlim; - if (procs == 0) - return 0; - rlim.rlim_cur = rlim.rlim_max = procs; if (virProcessSetLimit(pid, RLIMIT_NPROC, &rlim) < 0) { @@ -938,7 +932,7 @@ virProcessSetMaxProcesses(pid_t pid G_GNUC_UNUSED, /** * virProcessSetMaxFiles: * @pid: process to be changed - * @files: new limit (0 for no change) + * @files: new limit * * Sets a new limit on the number of opened files for a process. * @@ -949,9 +943,6 @@ virProcessSetMaxFiles(pid_t pid, unsigned int files) { struct rlimit rlim; - if (files == 0) - return 0; - /* Max number of opened files is one greater than actual limit. See * man setrlimit. * -- 2.26.2

That's more consistent with our usual naming convention. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/conf/domain_conf.h | 4 ++-- src/qemu/qemu_domain.c | 10 +++++----- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 930eed60de..7d208d881c 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2807,8 +2807,8 @@ struct _virDomainObj { size_t ndeprecations; char **deprecations; - unsigned long long original_memlock; /* Original RLIMIT_MEMLOCK, zero if no - * restore will be required later */ + unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no + * restore will be required later */ }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virDomainObj, virObjectUnref); diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 54d8bd0d3a..73e2473dce 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9248,15 +9248,15 @@ qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, * value so that we can restore it once memory locking is no longer * required. Failing to obtain the current limit is not a critical * failure, it just means we'll be unable to lower it later */ - if (!vm->original_memlock) { - if (virProcessGetMaxMemLock(vm->pid, &(vm->original_memlock)) < 0) - vm->original_memlock = 0; + if (!vm->originalMemlock) { + if (virProcessGetMaxMemLock(vm->pid, &(vm->originalMemlock)) < 0) + vm->originalMemlock = 0; } } else { /* Once memory locking is no longer required, we can restore the * original, usually very low, limit */ - bytes = vm->original_memlock; - vm->original_memlock = 0; + bytes = vm->originalMemlock; + vm->originalMemlock = 0; } if (bytes > 0 && -- 2.26.2

Up until now we've implicitly relied on the fact that failures reported from this function were simply ignored, but that's about to change and so we need a proper mock. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.h | 2 +- tests/virprocessmock.c | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 34210d6c9d..dbf4148e90 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -79,7 +79,7 @@ int virProcessSetMaxProcesses(pid_t pid, unsigned int procs); int virProcessSetMaxFiles(pid_t pid, unsigned int files); int virProcessSetMaxCoreSize(pid_t pid, unsigned long long bytes); -int virProcessGetMaxMemLock(pid_t pid, unsigned long long *bytes); +int virProcessGetMaxMemLock(pid_t pid, unsigned long long *bytes) G_GNUC_NO_INLINE; /* Callback to run code within the mount namespace tied to the given * pid. This function must use only async-signal-safe functions, as diff --git a/tests/virprocessmock.c b/tests/virprocessmock.c index c9386d757a..0356ff2f70 100644 --- a/tests/virprocessmock.c +++ b/tests/virprocessmock.c @@ -21,6 +21,13 @@ #include <config.h> #include "virprocess.h" +int +virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long *bytes G_GNUC_UNUSED) +{ + *bytes = 0; + return 0; +} + int virProcessSetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long bytes G_GNUC_UNUSED) { -- 2.26.2

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
Up until now we've implicitly relied on the fact that failures reported from this function were simply ignored, but that's about to change and so we need a proper mock.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.h | 2 +- tests/virprocessmock.c | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 34210d6c9d..dbf4148e90 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -79,7 +79,7 @@ int virProcessSetMaxProcesses(pid_t pid, unsigned int procs); int virProcessSetMaxFiles(pid_t pid, unsigned int files); int virProcessSetMaxCoreSize(pid_t pid, unsigned long long bytes);
-int virProcessGetMaxMemLock(pid_t pid, unsigned long long *bytes); +int virProcessGetMaxMemLock(pid_t pid, unsigned long long *bytes) G_GNUC_NO_INLINE;
/* Callback to run code within the mount namespace tied to the given * pid. This function must use only async-signal-safe functions, as diff --git a/tests/virprocessmock.c b/tests/virprocessmock.c index c9386d757a..0356ff2f70 100644 --- a/tests/virprocessmock.c +++ b/tests/virprocessmock.c @@ -21,6 +21,13 @@ #include <config.h> #include "virprocess.h"
+int +virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long *bytes G_GNUC_UNUSED)
Ehm, probably coffee hadn't kicked in? Because I can see @bytes used ..
+{ + *bytes = 0;
.. right here :-D
+ return 0; +} + int virProcessSetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long bytes G_GNUC_UNUSED) {
Michal

On Mon, 2021-03-08 at 11:31 +0100, Michal Privoznik wrote:
On 3/5/21 8:13 PM, Andrea Bolognani wrote:
+int +virProcessGetMaxMemLock(pid_t pid G_GNUC_UNUSED, unsigned long long *bytes G_GNUC_UNUSED)
Ehm, probably coffee hadn't kicked in? Because I can see @bytes used ..
+{ + *bytes = 0;
.. right here :-D
More like it had been too long since the last coffee, but yeah, of course you're absolutely right :) -- Andrea Bolognani / Red Hat / Virtualization

Calling prlimit() requires elevated privileges, specifically CAP_SYS_RESOURCE, and getrlimit() only works for the current process which is too limiting for our needs; /proc/$pid/limits, on the other hand, can be read by any process, so implement parsing that file as a fallback for when prlimit() fails. This is useful in containerized environments. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 98 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index abce522bf3..e62ec379a6 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -757,6 +757,95 @@ virProcessSetRLimit(int resource, #endif /* WITH_SETRLIMIT */ #if WITH_GETRLIMIT +static const char* +virProcessLimitResourceToLabel(int resource) +{ + switch (resource) { +# if defined(RLIMIT_MEMLOCK) + case RLIMIT_MEMLOCK: + return "Max locked memory"; +# endif /* defined(RLIMIT_MEMLOCK) */ + +# if defined(RLIMIT_NPROC) + case RLIMIT_NPROC: + return "Max processes"; +# endif /* defined(RLIMIT_NPROC) */ + +# if defined(RLIMIT_NOFILE) + case RLIMIT_NOFILE: + return "Max open files"; +# endif /* defined(RLIMIT_NOFILE) */ + +# if defined(RLIMIT_CORE) + case RLIMIT_CORE: + return "Max core file size"; +# endif /* defined(RLIMIT_CORE) */ + + default: + return NULL; + } +} + +static int +virProcessGetLimitFromProc(pid_t pid, + int resource, + struct rlimit *limit) +{ + g_autofree char *procfile = NULL; + g_autofree char *buf = NULL; + g_auto(GStrv) lines = NULL; + const char *label; + size_t len; + size_t i; + + if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1; + } + + procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1; + + if (!(lines = g_strsplit(buf, "\n", 0))) + return -1; + + for (i = 0; lines[i]; i++) { + g_autofree char *softLimit = NULL; + g_autofree char *hardLimit = NULL; + char *line = lines[i]; + unsigned long long tmp; + + if (!STRPREFIX(line, label)) + continue; + + line += strlen(label); + + if (sscanf(line, "%ms %ms %*s", &softLimit, &hardLimit) < 2) + return -1; + + if (STREQ(softLimit, "unlimited")) { + limit->rlim_cur = RLIM_INFINITY; + } else { + if (virStrToLong_ull(softLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_cur = tmp; + } + if (STREQ(hardLimit, "unlimited")) { + limit->rlim_max = RLIM_INFINITY; + } else { + if (virStrToLong_ull(hardLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_max = tmp; + } + } + + return 0; +} + static int virProcessGetLimit(pid_t pid, int resource, @@ -768,6 +857,15 @@ virProcessGetLimit(pid_t pid, if (virProcessPrLimit(pid, resource, NULL, old_limit) == 0) return 0; + /* For whatever reason, using prlimit() on another process - even + * when it's just to obtain the current limit rather than changing + * it - requires CAP_SYS_RESOURCE, which we might not have in a + * containerized environment; on the other hand, no particular + * permission is needed to poke around /proc, so try that if going + * through the syscall didn't work */ + if (virProcessGetLimitFromProc(pid, resource, old_limit) == 0) + return 0; + if (same_process && virProcessGetRLimit(resource, old_limit) == 0) return 0; -- 2.26.2

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
Calling prlimit() requires elevated privileges, specifically CAP_SYS_RESOURCE, and getrlimit() only works for the current process which is too limiting for our needs; /proc/$pid/limits, on the other hand, can be read by any process, so implement parsing that file as a fallback for when prlimit() fails.
This is useful in containerized environments.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 98 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index abce522bf3..e62ec379a6 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -757,6 +757,95 @@ virProcessSetRLimit(int resource, #endif /* WITH_SETRLIMIT */
#if WITH_GETRLIMIT +static const char* +virProcessLimitResourceToLabel(int resource) +{ + switch (resource) { +# if defined(RLIMIT_MEMLOCK) + case RLIMIT_MEMLOCK: + return "Max locked memory"; +# endif /* defined(RLIMIT_MEMLOCK) */ + +# if defined(RLIMIT_NPROC) + case RLIMIT_NPROC: + return "Max processes"; +# endif /* defined(RLIMIT_NPROC) */ + +# if defined(RLIMIT_NOFILE) + case RLIMIT_NOFILE: + return "Max open files"; +# endif /* defined(RLIMIT_NOFILE) */ + +# if defined(RLIMIT_CORE) + case RLIMIT_CORE: + return "Max core file size"; +# endif /* defined(RLIMIT_CORE) */ + + default: + return NULL; + } +} + +static int +virProcessGetLimitFromProc(pid_t pid, + int resource, + struct rlimit *limit) +{ + g_autofree char *procfile = NULL; + g_autofree char *buf = NULL; + g_auto(GStrv) lines = NULL; + const char *label; + size_t len; + size_t i; + + if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1; + } + + procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1; + + if (!(lines = g_strsplit(buf, "\n", 0))) + return -1; + + for (i = 0; lines[i]; i++) { + g_autofree char *softLimit = NULL; + g_autofree char *hardLimit = NULL; + char *line = lines[i]; + unsigned long long tmp; + + if (!STRPREFIX(line, label)) + continue; + + line += strlen(label);
Or if (!(line = STRSKIP(line, label)) continue;
+ + if (sscanf(line, "%ms %ms %*s", &softLimit, &hardLimit) < 2) + return -1; + + if (STREQ(softLimit, "unlimited")) { + limit->rlim_cur = RLIM_INFINITY; + } else { + if (virStrToLong_ull(softLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_cur = tmp; + } + if (STREQ(hardLimit, "unlimited")) { + limit->rlim_max = RLIM_INFINITY; + } else { + if (virStrToLong_ull(hardLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_max = tmp; + } + } + + return 0; +} + static int virProcessGetLimit(pid_t pid, int resource, @@ -768,6 +857,15 @@ virProcessGetLimit(pid_t pid, if (virProcessPrLimit(pid, resource, NULL, old_limit) == 0) return 0;
+ /* For whatever reason, using prlimit() on another process - even + * when it's just to obtain the current limit rather than changing + * it - requires CAP_SYS_RESOURCE, which we might not have in a + * containerized environment; on the other hand, no particular + * permission is needed to poke around /proc, so try that if going + * through the syscall didn't work */ + if (virProcessGetLimitFromProc(pid, resource, old_limit) == 0) + return 0; + if (same_process && virProcessGetRLimit(resource, old_limit) == 0) return 0;
Michal

On Mon, 2021-03-08 at 11:30 +0100, Michal Privoznik wrote:
On 3/5/21 8:13 PM, Andrea Bolognani wrote:
+ if (!STRPREFIX(line, label)) + continue; + + line += strlen(label);
Or if (!(line = STRSKIP(line, label)) continue;
Oh, I didn't know that existed! Nice suggestion :) -- Andrea Bolognani / Red Hat / Virtualization

On Fri, Mar 05, 2021 at 08:13:59PM +0100, Andrea Bolognani wrote:
Calling prlimit() requires elevated privileges, specifically CAP_SYS_RESOURCE, and getrlimit() only works for the current process which is too limiting for our needs; /proc/$pid/limits, on the other hand, can be read by any process, so implement parsing that file as a fallback for when prlimit() fails.
This is useful in containerized environments.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 98 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index abce522bf3..e62ec379a6 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -757,6 +757,95 @@ virProcessSetRLimit(int resource, #endif /* WITH_SETRLIMIT */
#if WITH_GETRLIMIT +static const char* +virProcessLimitResourceToLabel(int resource) +{ + switch (resource) { +# if defined(RLIMIT_MEMLOCK) + case RLIMIT_MEMLOCK: + return "Max locked memory"; +# endif /* defined(RLIMIT_MEMLOCK) */ + +# if defined(RLIMIT_NPROC) + case RLIMIT_NPROC: + return "Max processes"; +# endif /* defined(RLIMIT_NPROC) */ + +# if defined(RLIMIT_NOFILE) + case RLIMIT_NOFILE: + return "Max open files"; +# endif /* defined(RLIMIT_NOFILE) */ + +# if defined(RLIMIT_CORE) + case RLIMIT_CORE: + return "Max core file size"; +# endif /* defined(RLIMIT_CORE) */ + + default: + return NULL; + } +} + +static int +virProcessGetLimitFromProc(pid_t pid, + int resource, + struct rlimit *limit) +{ + g_autofree char *procfile = NULL; + g_autofree char *buf = NULL; + g_auto(GStrv) lines = NULL; + const char *label; + size_t len; + size_t i; + + if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1;
Setting errors on -1
+ } + + procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1;
Not setting errors on -1
+ + if (!(lines = g_strsplit(buf, "\n", 0))) + return -1;
....
+ + for (i = 0; lines[i]; i++) { + g_autofree char *softLimit = NULL; + g_autofree char *hardLimit = NULL; + char *line = lines[i]; + unsigned long long tmp; + + if (!STRPREFIX(line, label)) + continue; + + line += strlen(label); + + if (sscanf(line, "%ms %ms %*s", &softLimit, &hardLimit) < 2) + return -1; + + if (STREQ(softLimit, "unlimited")) { + limit->rlim_cur = RLIM_INFINITY; + } else { + if (virStrToLong_ull(softLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_cur = tmp; + } + if (STREQ(hardLimit, "unlimited")) { + limit->rlim_max = RLIM_INFINITY; + } else { + if (virStrToLong_ull(hardLimit, NULL, 10, &tmp) < 0) + return -1; + limit->rlim_max = tmp; + } + } + + return 0; +} + static int virProcessGetLimit(pid_t pid, int resource, @@ -768,6 +857,15 @@ virProcessGetLimit(pid_t pid, if (virProcessPrLimit(pid, resource, NULL, old_limit) == 0) return 0;
+ /* For whatever reason, using prlimit() on another process - even + * when it's just to obtain the current limit rather than changing + * it - requires CAP_SYS_RESOURCE, which we might not have in a + * containerized environment; on the other hand, no particular + * permission is needed to poke around /proc, so try that if going + * through the syscall didn't work */ + if (virProcessGetLimitFromProc(pid, resource, old_limit) == 0) + return 0;
This ought to be conditional for Linux only and error reporting needs to be made consistent.
+ if (same_process && virProcessGetRLimit(resource, old_limit) == 0) return 0;
-- 2.26.2
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 2021-03-08 at 10:50 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:13:59PM +0100, Andrea Bolognani wrote:
+ if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1;
Setting errors on -1
This is only hit if virProcessGetLimitFromProc() has been asked to obtain limits for a resource it doesn't know how to fetch, which indicates a bug in libvirt and is thus reported as internal error.
+ procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1;
Not setting errors on -1
This is simply "file couldn't be read", which would be the case on FreeBSD for example.
+ /* For whatever reason, using prlimit() on another process - even + * when it's just to obtain the current limit rather than changing + * it - requires CAP_SYS_RESOURCE, which we might not have in a + * containerized environment; on the other hand, no particular + * permission is needed to poke around /proc, so try that if going + * through the syscall didn't work */ + if (virProcessGetLimitFromProc(pid, resource, old_limit) == 0) + return 0;
This ought to be conditional for Linux only and error reporting needs to be made consistent.
The intent above was to have this fail quietly on non-Linux without adding checks for it, but sure I can have an actual stub on other platforms instead. -- Andrea Bolognani / Red Hat / Virtualization

On Mon, Mar 08, 2021 at 04:21:16PM +0100, Andrea Bolognani wrote:
On Mon, 2021-03-08 at 10:50 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:13:59PM +0100, Andrea Bolognani wrote:
+ if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1;
Setting errors on -1
This is only hit if virProcessGetLimitFromProc() has been asked to obtain limits for a resource it doesn't know how to fetch, which indicates a bug in libvirt and is thus reported as internal error.
The our coding standard is that we only call virReportError if the caller is actually going to honour the errors.
+ procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1;
Not setting errors on -1
This is simply "file couldn't be read", which would be the case on FreeBSD for example.
Right, but we *must* always be consistent in whether a method uses virReportError or not. We have two code paths here returning -1, and only one of which reports errors. Either all must report errors, or none. Since the only caller is ignoring errors, then it looks like none should report errors.
+ /* For whatever reason, using prlimit() on another process - even + * when it's just to obtain the current limit rather than changing + * it - requires CAP_SYS_RESOURCE, which we might not have in a + * containerized environment; on the other hand, no particular + * permission is needed to poke around /proc, so try that if going + * through the syscall didn't work */ + if (virProcessGetLimitFromProc(pid, resource, old_limit) == 0) + return 0;
This ought to be conditional for Linux only and error reporting needs to be made consistent.
The intent above was to have this fail quietly on non-Linux without adding checks for it, but sure I can have an actual stub on other platforms instead.
-- Andrea Bolognani / Red Hat / Virtualization
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
Calling prlimit() requires elevated privileges, specifically CAP_SYS_RESOURCE, and getrlimit() only works for the current process which is too limiting for our needs; /proc/$pid/limits, on the other hand, can be read by any process, so implement parsing that file as a fallback for when prlimit() fails.
This is useful in containerized environments.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/util/virprocess.c | 98 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+)
Sorry in advance for hijacking this thread.
+static int +virProcessGetLimitFromProc(pid_t pid, + int resource, + struct rlimit *limit) +{ + g_autofree char *procfile = NULL; + g_autofree char *buf = NULL; + g_auto(GStrv) lines = NULL; + const char *label; + size_t len; + size_t i; + + if (!(label = virProcessLimitResourceToLabel(resource))) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unknown resource %d requested for process %lld"), + resource, (long long)pid); + return -1; + } + + procfile = g_strdup_printf("/proc/%lld/limits", (long long)pid); + + if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1;
I did not spot this yesterday, but now I'm working on a something else and have to read a contents of a file under /proc. I did not recall the exact name but remembered where I saw it lately - here :-) And now that I am thinking about it - and reading the docs - is this function safe? I mean, it reads file without any limit - which may be fine for /proc files, but I worry that if allowed in one func it may sneak into others and read user provided files, or while its use in a function X might be warranted for now, in the future after some refactor the function X might be used to read user provided files. Therefore, I think it should go onto the list of not-on-my-watch functions and we ought stick with our fine crafted virFileRead*(). BTW: I think the same about g_get_host_name(), which does not reflect hostname changes. Unfortunately, we have three places which slipped through while I wasn't watching. I'll look into how to revert them. Michal

On Tue, 2021-03-09 at 12:43 +0100, Michal Privoznik wrote:
On 3/5/21 8:13 PM, Andrea Bolognani wrote:
+ if (!g_file_get_contents(procfile, &buf, &len, NULL)) + return -1;
I did not spot this yesterday, but now I'm working on a something else and have to read a contents of a file under /proc. I did not recall the exact name but remembered where I saw it lately - here :-)
And now that I am thinking about it - and reading the docs - is this function safe? I mean, it reads file without any limit - which may be fine for /proc files, but I worry that if allowed in one func it may sneak into others and read user provided files, or while its use in a function X might be warranted for now, in the future after some refactor the function X might be used to read user provided files.
You're right. I used pure GLib functions initially because I was implementing this as a tiny stand-alone tool for faster iterative development, and I just forgot to change that specific function back to the libvirt equivalent when I was done :)
Therefore, I think it should go onto the list of not-on-my-watch functions and we ought stick with our fine crafted virFileRead*().
BTW: I think the same about g_get_host_name(), which does not reflect hostname changes. Unfortunately, we have three places which slipped through while I wasn't watching. I'll look into how to revert them.
Sounds good. -- Andrea Bolognani / Red Hat / Virtualization

Now that we've implemented a fallback for the function that obtains the information from /proc, there is no reason we would get a failure unless there's something seriously wrong with the environment we're running in, in which case we're better off reporting the issue to the user rather than pretending everything is fine. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/qemu_domain.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 73e2473dce..1e9178764b 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9246,11 +9246,10 @@ qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, if (bytes) { /* If this is the first time adjusting the limit, save the current * value so that we can restore it once memory locking is no longer - * required. Failing to obtain the current limit is not a critical - * failure, it just means we'll be unable to lower it later */ + * required */ if (!vm->originalMemlock) { if (virProcessGetMaxMemLock(vm->pid, &(vm->originalMemlock)) < 0) - vm->originalMemlock = 0; + return -1; } } else { /* Once memory locking is no longer required, we can restore the -- 2.26.2

Store the current memory locking limit and the desired one separately, which will help with later changes. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/qemu_domain.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 1e9178764b..f8b0e1a62a 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9239,27 +9239,29 @@ int qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, bool forceVFIO) { - unsigned long long bytes = 0; + unsigned long long currentMemLock = 0; + unsigned long long desiredMemLock = 0; - bytes = qemuDomainGetMemLockLimitBytes(vm->def, forceVFIO); + desiredMemLock = qemuDomainGetMemLockLimitBytes(vm->def, forceVFIO); + if (virProcessGetMaxMemLock(vm->pid, ¤tMemLock) < 0) + return -1; - if (bytes) { + if (desiredMemLock > 0) { /* If this is the first time adjusting the limit, save the current * value so that we can restore it once memory locking is no longer * required */ - if (!vm->originalMemlock) { - if (virProcessGetMaxMemLock(vm->pid, &(vm->originalMemlock)) < 0) - return -1; + if (vm->originalMemlock == 0) { + vm->originalMemlock = currentMemLock; } } else { /* Once memory locking is no longer required, we can restore the * original, usually very low, limit */ - bytes = vm->originalMemlock; + desiredMemLock = vm->originalMemlock; vm->originalMemlock = 0; } - if (bytes > 0 && - virProcessSetMaxMemLock(vm->pid, bytes) < 0) { + if (desiredMemLock > 0 && + virProcessSetMaxMemLock(vm->pid, desiredMemLock) < 0) { return -1; } -- 2.26.2

This will be useful when libvirtd is running in a containerized environment with limited capabilities, and in order to make things like VFIO device assignment still work an external privileged process changes the limits from outside of the container. KubeVirt is an example of this setup. Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 ++++++++++++ src/qemu/qemu_conf.c | 4 ++++ src/qemu/qemu_conf.h | 1 + src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 19 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 3c1045858b..f1b024a37f 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -104,6 +104,7 @@ module Libvirtd_qemu = | str_entry "slirp_helper" | str_entry "dbus_daemon" | bool_entry "set_process_name" + | bool_entry "external_limit_manager" | int_entry "max_processes" | int_entry "max_files" | limits_entry "max_core" diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf index 0c1054f198..15cbc3ba38 100644 --- a/src/qemu/qemu.conf +++ b/src/qemu/qemu.conf @@ -662,6 +662,18 @@ # #set_process_name = 1 +# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task. +# +# This also applies to the memory locking limit, which cannot be +# configured here and is instead calculated dynamically based on the +# exact guest configuration: if an external limit manager is in use, +# then libvirt will merely check that the limit has been set +# appropriately. +# +#external_limit_manager = 1 # If max_processes is set to a positive integer, libvirt will use # it to set the maximum number of processes that can be run by qemu diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 2bbc75024c..ee95c124dd 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -673,6 +673,10 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfigPtr cfg, if (virConfGetValueBool(conf, "set_process_name", &cfg->setProcessName) < 0) return -1; + + if (virConfGetValueBool(conf, "external_limit_manager", &cfg->externalLimitManager) < 0) + return -1; + if (virConfGetValueUInt(conf, "max_processes", &cfg->maxProcesses) < 0) return -1; if (virConfGetValueUInt(conf, "max_files", &cfg->maxFiles) < 0) diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 7025b5222e..15e0353253 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -176,6 +176,7 @@ struct _virQEMUDriverConfig { bool nogfxAllowHostAudio; bool setProcessName; + bool externalLimitManager; unsigned int maxProcesses; unsigned int maxFiles; unsigned int maxThreadsPerProc; diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 9310dcec1c..73be55febe 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -77,6 +77,7 @@ module Test_libvirtd_qemu = { "hugetlbfs_mount" = "/dev/hugepages" } { "bridge_helper" = "/usr/libexec/qemu-bridge-helper" } { "set_process_name" = "1" } +{ "external_limit_manager" = "1" } { "max_processes" = "0" } { "max_files" = "0" } { "max_threads_per_process" = "0" } -- 2.26.2

On 3/5/21 8:14 PM, Andrea Bolognani wrote:
This will be useful when libvirtd is running in a containerized environment with limited capabilities, and in order to make things like VFIO device assignment still work an external privileged process changes the limits from outside of the container. KubeVirt is an example of this setup.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 ++++++++++++ src/qemu/qemu_conf.c | 4 ++++ src/qemu/qemu_conf.h | 1 + src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 19 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 3c1045858b..f1b024a37f 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -104,6 +104,7 @@ module Libvirtd_qemu = | str_entry "slirp_helper" | str_entry "dbus_daemon" | bool_entry "set_process_name" + | bool_entry "external_limit_manager" | int_entry "max_processes" | int_entry "max_files" | limits_entry "max_core" diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf index 0c1054f198..15cbc3ba38 100644 --- a/src/qemu/qemu.conf +++ b/src/qemu/qemu.conf @@ -662,6 +662,18 @@ # #set_process_name = 1
+# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task. +# +# This also applies to the memory locking limit, which cannot be +# configured here and is instead calculated dynamically based on the +# exact guest configuration: if an external limit manager is in use, +# then libvirt will merely check that the limit has been set +# appropriately. +# +#external_limit_manager = 1
# If max_processes is set to a positive integer, libvirt will use # it to set the maximum number of processes that can be run by qemu diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 2bbc75024c..ee95c124dd 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -673,6 +673,10 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfigPtr cfg,
if (virConfGetValueBool(conf, "set_process_name", &cfg->setProcessName) < 0) return -1; + + if (virConfGetValueBool(conf, "external_limit_manager", &cfg->externalLimitManager) < 0) + return -1; + if (virConfGetValueUInt(conf, "max_processes", &cfg->maxProcesses) < 0) return -1; if (virConfGetValueUInt(conf, "max_files", &cfg->maxFiles) < 0)
I think we could error out if external_limit_manager is set and one of these max_processes, max_files, max_core is set also. It's not really a combination we can make sense of, is it? It's perfectly okay to write that in a separate patch though.
diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 7025b5222e..15e0353253 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -176,6 +176,7 @@ struct _virQEMUDriverConfig { bool nogfxAllowHostAudio; bool setProcessName;
+ bool externalLimitManager; unsigned int maxProcesses; unsigned int maxFiles; unsigned int maxThreadsPerProc; diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 9310dcec1c..73be55febe 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -77,6 +77,7 @@ module Test_libvirtd_qemu = { "hugetlbfs_mount" = "/dev/hugepages" } { "bridge_helper" = "/usr/libexec/qemu-bridge-helper" } { "set_process_name" = "1" } +{ "external_limit_manager" = "1" } { "max_processes" = "0" } { "max_files" = "0" } { "max_threads_per_process" = "0" }
Michal

On Fri, Mar 05, 2021 at 08:14:02PM +0100, Andrea Bolognani wrote:
This will be useful when libvirtd is running in a containerized environment with limited capabilities, and in order to make things like VFIO device assignment still work an external privileged process changes the limits from outside of the container. KubeVirt is an example of this setup.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 ++++++++++++ src/qemu/qemu_conf.c | 4 ++++ src/qemu/qemu_conf.h | 1 + src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 19 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 3c1045858b..f1b024a37f 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -104,6 +104,7 @@ module Libvirtd_qemu = | str_entry "slirp_helper" | str_entry "dbus_daemon" | bool_entry "set_process_name" + | bool_entry "external_limit_manager" | int_entry "max_processes" | int_entry "max_files" | limits_entry "max_core" diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf index 0c1054f198..15cbc3ba38 100644 --- a/src/qemu/qemu.conf +++ b/src/qemu/qemu.conf @@ -662,6 +662,18 @@ # #set_process_name = 1
+# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task.
Can't users simply not set max_core, max_files, etc already ? I think it is preferrable to have flags tailored specifically to the individual limits, not a global flag. Otherwise you can end up in a case where you want to disable the memory limits, but keep the other limits set which is impossible with this global flag.
+# +# This also applies to the memory locking limit, which cannot be +# configured here and is instead calculated dynamically based on the +# exact guest configuration: if an external limit manager is in use, +# then libvirt will merely check that the limit has been set +# appropriately. +# +#external_limit_manager = 1
# If max_processes is set to a positive integer, libvirt will use # it to set the maximum number of processes that can be run by qemu diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 2bbc75024c..ee95c124dd 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -673,6 +673,10 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfigPtr cfg,
if (virConfGetValueBool(conf, "set_process_name", &cfg->setProcessName) < 0) return -1; + + if (virConfGetValueBool(conf, "external_limit_manager", &cfg->externalLimitManager) < 0) + return -1; + if (virConfGetValueUInt(conf, "max_processes", &cfg->maxProcesses) < 0) return -1; if (virConfGetValueUInt(conf, "max_files", &cfg->maxFiles) < 0) diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 7025b5222e..15e0353253 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -176,6 +176,7 @@ struct _virQEMUDriverConfig { bool nogfxAllowHostAudio; bool setProcessName;
+ bool externalLimitManager; unsigned int maxProcesses; unsigned int maxFiles; unsigned int maxThreadsPerProc; diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 9310dcec1c..73be55febe 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -77,6 +77,7 @@ module Test_libvirtd_qemu = { "hugetlbfs_mount" = "/dev/hugepages" } { "bridge_helper" = "/usr/libexec/qemu-bridge-helper" } { "set_process_name" = "1" } +{ "external_limit_manager" = "1" } { "max_processes" = "0" } { "max_files" = "0" } { "max_threads_per_process" = "0" } -- 2.26.2
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 2021-03-08 at 10:52 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:14:02PM +0100, Andrea Bolognani wrote:
+# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task.
Can't users simply not set max_core, max_files, etc already ?
That works for things that are static and have a corresponding configuration option in qemu.conf, but the memory locking limit is dynamic, per-VM and needs to change as devices are added and removed from the guest.
I think it is preferrable to have flags tailored specifically to the individual limits, not a global flag. Otherwise you can end up in a case where you want to disable the memory limits, but keep the other limits set which is impossible with this global flag.
Since what I'm interested in is the memory locking limit, I guess I could turn this into max_memlock_external = 1 or even max_memlock = "external" with "dynamic" being the other accepted value, which would be the default and would behave as libvirt does today. Do you think that would work better? -- Andrea Bolognani / Red Hat / Virtualization

On Mon, Mar 08, 2021 at 01:56:15PM +0100, Andrea Bolognani wrote:
On Mon, 2021-03-08 at 10:52 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:14:02PM +0100, Andrea Bolognani wrote:
+# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task.
Can't users simply not set max_core, max_files, etc already ?
That works for things that are static and have a corresponding configuration option in qemu.conf, but the memory locking limit is dynamic, per-VM and needs to change as devices are added and removed from the guest.
I think it is preferrable to have flags tailored specifically to the individual limits, not a global flag. Otherwise you can end up in a case where you want to disable the memory limits, but keep the other limits set which is impossible with this global flag.
Since what I'm interested in is the memory locking limit, I guess I could turn this into
max_memlock_external = 1
or even
max_memlock = "external"
with "dynamic" being the other accepted value, which would be the default and would behave as libvirt does today.
Do you think that would work better?
I think that would be better, as it has clearly defined scope which we can maintain more accurately long term. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 3/8/21 11:52 AM, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:14:02PM +0100, Andrea Bolognani wrote:
This will be useful when libvirtd is running in a containerized environment with limited capabilities, and in order to make things like VFIO device assignment still work an external privileged process changes the limits from outside of the container. KubeVirt is an example of this setup.
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 ++++++++++++ src/qemu/qemu_conf.c | 4 ++++ src/qemu/qemu_conf.h | 1 + src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 19 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 3c1045858b..f1b024a37f 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -104,6 +104,7 @@ module Libvirtd_qemu = | str_entry "slirp_helper" | str_entry "dbus_daemon" | bool_entry "set_process_name" + | bool_entry "external_limit_manager" | int_entry "max_processes" | int_entry "max_files" | limits_entry "max_core" diff --git a/src/qemu/qemu.conf b/src/qemu/qemu.conf index 0c1054f198..15cbc3ba38 100644 --- a/src/qemu/qemu.conf +++ b/src/qemu/qemu.conf @@ -662,6 +662,18 @@ # #set_process_name = 1
+# If enabled, libvirt will not attempt to change process limits (as +# configured with the max_processes, max_files and max_core settings +# below) itself but will instead expect an external entity to perform +# this task.
Can't users simply not set max_core, max_files, etc already ?
These two yes, mem lock no. Michal

When the config knob is enabled, we simply skip the part where limits are set; for the memory locking limit, which can change dynamically over the lifetime of the guest, we still make sure that the external process has set it correctly and error out if that turns out not to be the case This commit is better viewed with 'git show -w'. https://bugzilla.redhat.com/show_bug.cgi?id=1916346 Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/conf/domain_conf.h | 1 + src/qemu/qemu_domain.c | 39 ++++++++++++++++++++++-------------- src/qemu/qemu_process.c | 44 +++++++++++++++++++++++------------------ 3 files changed, 50 insertions(+), 34 deletions(-) diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 7d208d881c..d1333020e1 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2807,6 +2807,7 @@ struct _virDomainObj { size_t ndeprecations; char **deprecations; + bool externalLimitManager; /* Whether process limits are handled outside of libvirt */ unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no * restore will be required later */ }; diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index f8b0e1a62a..0d9adb2f9c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9246,23 +9246,32 @@ qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, if (virProcessGetMaxMemLock(vm->pid, ¤tMemLock) < 0) return -1; - if (desiredMemLock > 0) { - /* If this is the first time adjusting the limit, save the current - * value so that we can restore it once memory locking is no longer - * required */ - if (vm->originalMemlock == 0) { - vm->originalMemlock = currentMemLock; + if (!vm->externalLimitManager) { + if (desiredMemLock > 0) { + /* If this is the first time adjusting the limit, save the current + * value so that we can restore it once memory locking is no longer + * required */ + if (vm->originalMemlock == 0) { + vm->originalMemlock = currentMemLock; + } + } else { + /* Once memory locking is no longer required, we can restore the + * original, usually very low, limit */ + desiredMemLock = vm->originalMemlock; + vm->originalMemlock = 0; } - } else { - /* Once memory locking is no longer required, we can restore the - * original, usually very low, limit */ - desiredMemLock = vm->originalMemlock; - vm->originalMemlock = 0; - } - if (desiredMemLock > 0 && - virProcessSetMaxMemLock(vm->pid, desiredMemLock) < 0) { - return -1; + if (desiredMemLock > 0 && + virProcessSetMaxMemLock(vm->pid, desiredMemLock) < 0) { + return -1; + } + } else { + if (currentMemLock < desiredMemLock) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("insufficient memlock limit (%llu < %llu)"), + currentMemLock, desiredMemLock); + return -1; + } } return 0; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index c05cbe3570..2eac3934c7 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7016,25 +7016,31 @@ qemuProcessLaunch(virConnectPtr conn, virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData); virCommandSetUmask(cmd, 0x002); - VIR_DEBUG("Setting up process limits"); - - /* In some situations, eg. VFIO passthrough, QEMU might need to lock a - * significant amount of memory, so we need to set the limit accordingly */ - maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); - - /* For all these settings, zero indicates that the limit should - * not be set explicitly and the default/inherited limit should - * be applied instead */ - if (maxMemLock > 0) - virCommandSetMaxMemLock(cmd, maxMemLock); - if (cfg->maxProcesses > 0) - virCommandSetMaxProcesses(cmd, cfg->maxProcesses); - if (cfg->maxFiles > 0) - virCommandSetMaxFiles(cmd, cfg->maxFiles); - - /* In this case, however, zero means that core dumps should be - * disabled, and so we always need to set the limit explicitly */ - virCommandSetMaxCoreSize(cmd, cfg->maxCore); + if (cfg->externalLimitManager) { + VIR_DEBUG("Not setting up process limits (handled externally)"); + + vm->externalLimitManager = true; + } else { + VIR_DEBUG("Setting up process limits"); + + /* In some situations, eg. VFIO passthrough, QEMU might need to lock a + * significant amount of memory, so we need to set the limit accordingly */ + maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); + + /* For all these settings, zero indicates that the limit should + * not be set explicitly and the default/inherited limit should + * be applied instead */ + if (maxMemLock > 0) + virCommandSetMaxMemLock(cmd, maxMemLock); + if (cfg->maxProcesses > 0) + virCommandSetMaxProcesses(cmd, cfg->maxProcesses); + if (cfg->maxFiles > 0) + virCommandSetMaxFiles(cmd, cfg->maxFiles); + + /* In this case, however, zero means that core dumps should be + * disabled, and so we always need to set the limit explicitly */ + virCommandSetMaxCoreSize(cmd, cfg->maxCore); + } VIR_DEBUG("Setting up security labelling"); if (qemuSecuritySetChildProcessLabel(driver->securityManager, -- 2.26.2

On 3/5/21 8:14 PM, Andrea Bolognani wrote:
When the config knob is enabled, we simply skip the part where limits are set; for the memory locking limit, which can change dynamically over the lifetime of the guest, we still make sure that the external process has set it correctly and error out if that turns out not to be the case
This commit is better viewed with 'git show -w'.
https://bugzilla.redhat.com/show_bug.cgi?id=1916346
Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- src/conf/domain_conf.h | 1 + src/qemu/qemu_domain.c | 39 ++++++++++++++++++++++-------------- src/qemu/qemu_process.c | 44 +++++++++++++++++++++++------------------ 3 files changed, 50 insertions(+), 34 deletions(-)
diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 7d208d881c..d1333020e1 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2807,6 +2807,7 @@ struct _virDomainObj { size_t ndeprecations; char **deprecations;
+ bool externalLimitManager; /* Whether process limits are handled outside of libvirt */ unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no * restore will be required later */ }; diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index f8b0e1a62a..0d9adb2f9c 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -9246,23 +9246,32 @@ qemuDomainAdjustMaxMemLock(virDomainObjPtr vm, if (virProcessGetMaxMemLock(vm->pid, ¤tMemLock) < 0) return -1;
- if (desiredMemLock > 0) { - /* If this is the first time adjusting the limit, save the current - * value so that we can restore it once memory locking is no longer - * required */ - if (vm->originalMemlock == 0) { - vm->originalMemlock = currentMemLock; + if (!vm->externalLimitManager) { + if (desiredMemLock > 0) { + /* If this is the first time adjusting the limit, save the current + * value so that we can restore it once memory locking is no longer + * required */ + if (vm->originalMemlock == 0) { + vm->originalMemlock = currentMemLock; + } + } else { + /* Once memory locking is no longer required, we can restore the + * original, usually very low, limit */ + desiredMemLock = vm->originalMemlock; + vm->originalMemlock = 0; } - } else { - /* Once memory locking is no longer required, we can restore the - * original, usually very low, limit */ - desiredMemLock = vm->originalMemlock; - vm->originalMemlock = 0; - }
- if (desiredMemLock > 0 && - virProcessSetMaxMemLock(vm->pid, desiredMemLock) < 0) { - return -1; + if (desiredMemLock > 0 && + virProcessSetMaxMemLock(vm->pid, desiredMemLock) < 0) { + return -1; + } + } else { + if (currentMemLock < desiredMemLock) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("insufficient memlock limit (%llu < %llu)"), + currentMemLock, desiredMemLock);
Should we go this way? Our limit computation is nothing but a guess (even though it might be more educated than others). I think we should reduce this to a warning. User had chosen not to set any limits and might have tracked down (e.g. via strace) the actual value needed which may be smaller than our guess. Another reason is that if domain is started with a VFIO/mdev device then no memlock is set at all but here, during hotplug it would be.
+ return -1; + } }
return 0; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index c05cbe3570..2eac3934c7 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7016,25 +7016,31 @@ qemuProcessLaunch(virConnectPtr conn, virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData); virCommandSetUmask(cmd, 0x002);
- VIR_DEBUG("Setting up process limits"); - - /* In some situations, eg. VFIO passthrough, QEMU might need to lock a - * significant amount of memory, so we need to set the limit accordingly */ - maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); - - /* For all these settings, zero indicates that the limit should - * not be set explicitly and the default/inherited limit should - * be applied instead */ - if (maxMemLock > 0) - virCommandSetMaxMemLock(cmd, maxMemLock); - if (cfg->maxProcesses > 0) - virCommandSetMaxProcesses(cmd, cfg->maxProcesses); - if (cfg->maxFiles > 0) - virCommandSetMaxFiles(cmd, cfg->maxFiles); - - /* In this case, however, zero means that core dumps should be - * disabled, and so we always need to set the limit explicitly */ - virCommandSetMaxCoreSize(cmd, cfg->maxCore); + if (cfg->externalLimitManager) { + VIR_DEBUG("Not setting up process limits (handled externally)"); + + vm->externalLimitManager = true; + } else { + VIR_DEBUG("Setting up process limits"); + + /* In some situations, eg. VFIO passthrough, QEMU might need to lock a + * significant amount of memory, so we need to set the limit accordingly */ + maxMemLock = qemuDomainGetMemLockLimitBytes(vm->def, false); + + /* For all these settings, zero indicates that the limit should + * not be set explicitly and the default/inherited limit should + * be applied instead */ + if (maxMemLock > 0) + virCommandSetMaxMemLock(cmd, maxMemLock); + if (cfg->maxProcesses > 0) + virCommandSetMaxProcesses(cmd, cfg->maxProcesses); + if (cfg->maxFiles > 0) + virCommandSetMaxFiles(cmd, cfg->maxFiles); + + /* In this case, however, zero means that core dumps should be + * disabled, and so we always need to set the limit explicitly */ + virCommandSetMaxCoreSize(cmd, cfg->maxCore); + }
VIR_DEBUG("Setting up security labelling"); if (qemuSecuritySetChildProcessLabel(driver->securityManager,
Michal

Signed-off-by: Andrea Bolognani <abologna@redhat.com> --- NEWS.rst | 10 ++++++++++ 1 file changed, 10 insertions(+) diff --git a/NEWS.rst b/NEWS.rst index bd40373a80..3a3e3962c2 100644 --- a/NEWS.rst +++ b/NEWS.rst @@ -13,6 +13,16 @@ v7.2.0 (unreleased) * **New features** + * qemu: Allow process limits to be managed by an external process + + Whereas in traditional virtualization scenarios libvirt takes full + reponsibility of setting up the environment for a QEMU process, when it's + used as part of things like KubeVirt some of these tasks are delegated to + external, more privileged processes. + + The ``external_limit_manager`` feature can now be enabled in ``qemu.conf`` + to tell libvirt that it shouldn't attempt to set process limits itself. + * **Improvements** * **Bug fixes** -- 2.26.2

On 3/5/21 8:13 PM, Andrea Bolognani wrote:
This feature has been requested by KubeVirt developers and will make it possible for them to make some VFIO-related features, such as migration and hotplug, work correctly.
https://bugzilla.redhat.com/show_bug.cgi?id=1916346
The first part of the series, especially the first 9 patches, is preparation work: it addresses a few annoying issues with our APIs that deal with process limits, and makes them all nice, consistent and easy to reason about while moving policy code from the generic code to the QEMU driver where it belongs.
Andrea Bolognani (17): util: Document limit-related functions util: Simplify stubs util: Always pass a pid to virProcessSetMax*() util: Introduce virProcess{Get,Set}Limit() qemu: Make some minor tweaks qemu: Set all limits at the same time util: Have virCommand remember whether limits are set qemu: Set limits only when explicitly asked to do so util: Don't special-case setting a limit to zero conf: Rename original_memlock -> originalMemlock tests: Mock virProcessGetMaxMemLock() util: Try to get limits from /proc qemu: Don't ignore virProcessGetMaxMemLock() errors qemu: Refactor qemuDomainAdjustMaxMemLock() qemu: Add external_limit_manager config knob qemu: Wire up external limit manager news: Document external limit manager feature
NEWS.rst | 10 + src/conf/domain_conf.h | 5 +- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf | 12 + src/qemu/qemu_command.c | 4 - src/qemu/qemu_conf.c | 4 + src/qemu/qemu_conf.h | 1 + src/qemu/qemu_domain.c | 47 ++-- src/qemu/qemu_migration.c | 2 + src/qemu/qemu_process.c | 30 ++- src/qemu/test_libvirtd_qemu.aug.in | 1 + src/util/vircommand.c | 21 +- src/util/virprocess.c | 340 ++++++++++++++++++++--------- src/util/virprocess.h | 2 +- tests/virprocessmock.c | 7 + 15 files changed, 354 insertions(+), 133 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal

On Fri, Mar 05, 2021 at 08:13:47PM +0100, Andrea Bolognani wrote:
This feature has been requested by KubeVirt developers and will make it possible for them to make some VFIO-related features, such as migration and hotplug, work correctly.
https://bugzilla.redhat.com/show_bug.cgi?id=1916346
The first part of the series, especially the first 9 patches, is preparation work: it addresses a few annoying issues with our APIs that deal with process limits, and makes them all nice, consistent and easy to reason about while moving policy code from the generic code to the QEMU driver where it belongs.
IIUC, the code was already supposed to attempt to set the limits and gracefully carry on if it was unable todo so. Can you outline the actual problem being solved, as wading through the bug comments to understand the precise detail of the problem is not very clear. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 2021-03-08 at 10:54 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:13:47PM +0100, Andrea Bolognani wrote:
This feature has been requested by KubeVirt developers and will make it possible for them to make some VFIO-related features, such as migration and hotplug, work correctly.
https://bugzilla.redhat.com/show_bug.cgi?id=1916346
The first part of the series, especially the first 9 patches, is preparation work: it addresses a few annoying issues with our APIs that deal with process limits, and makes them all nice, consistent and easy to reason about while moving policy code from the generic code to the QEMU driver where it belongs.
IIUC, the code was already supposed to attempt to set the limits and gracefully carry on if it was unable todo so. Can you outline the actual problem being solved, as wading through the bug comments to understand the precise detail of the problem is not very clear.
Not being able to set the memory locking limit shouldn't be a soft failure, because that might cause QEMU to start up apparently fine and then error out later on during the lifetime of the VM, which is much worse than not starting up at all. We're actually doing this correctly for the most part, which is why hotplug requests for VFIO devices in KubeVirt result in errors under certain conditions. The reason why VFIO device assignment is currently not completely broken in KubeVirt is that, when the QEMU process is initially started, we set the memory locking limit after fork(), so we can do that using setrlimit() which doesn't require additional capabilities, but in the hotplug scenario libvirtd needs to change the limits of a different process: in that case we are forced to use prlimit(), which fails due to the lack of CAP_SYS_RESOURCE. -- Andrea Bolognani / Red Hat / Virtualization

On Mon, Mar 08, 2021 at 02:11:56PM +0100, Andrea Bolognani wrote:
On Mon, 2021-03-08 at 10:54 +0000, Daniel P. Berrangé wrote:
On Fri, Mar 05, 2021 at 08:13:47PM +0100, Andrea Bolognani wrote:
This feature has been requested by KubeVirt developers and will make it possible for them to make some VFIO-related features, such as migration and hotplug, work correctly.
https://bugzilla.redhat.com/show_bug.cgi?id=1916346
The first part of the series, especially the first 9 patches, is preparation work: it addresses a few annoying issues with our APIs that deal with process limits, and makes them all nice, consistent and easy to reason about while moving policy code from the generic code to the QEMU driver where it belongs.
IIUC, the code was already supposed to attempt to set the limits and gracefully carry on if it was unable todo so. Can you outline the actual problem being solved, as wading through the bug comments to understand the precise detail of the problem is not very clear.
Not being able to set the memory locking limit shouldn't be a soft failure, because that might cause QEMU to start up apparently fine and then error out later on during the lifetime of the VM, which is much worse than not starting up at all. We're actually doing this correctly for the most part, which is why hotplug requests for VFIO devices in KubeVirt result in errors under certain conditions.
Yep, I agree we shoudln't have it be a mere warning. IIUC, we would try to set the limit, and if we fail, then we query the current limit to see if its satisfactory and report a fatal error if not.
The reason why VFIO device assignment is currently not completely broken in KubeVirt is that, when the QEMU process is initially started, we set the memory locking limit after fork(), so we can do that using setrlimit() which doesn't require additional capabilities, but in the hotplug scenario libvirtd needs to change the limits of a different process: in that case we are forced to use prlimit(), which fails due to the lack of CAP_SYS_RESOURCE.
Since you added code to parse existing limits from /proc, I'm wondering if we can just do without the config option. Simply try to use prlimit and if it fails, query existing limits to determine if we sould treat the prlimit as fatal or ignore it. Overall I'd prefer libvirt to "just work" out of the box rather than requiring people to know about setting a "make-vfio-hotplug-work=yes" flag in the config file. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 2021-03-08 at 13:17 +0000, Daniel P. Berrangé wrote:
On Mon, Mar 08, 2021 at 02:11:56PM +0100, Andrea Bolognani wrote:
The reason why VFIO device assignment is currently not completely broken in KubeVirt is that, when the QEMU process is initially started, we set the memory locking limit after fork(), so we can do that using setrlimit() which doesn't require additional capabilities, but in the hotplug scenario libvirtd needs to change the limits of a different process: in that case we are forced to use prlimit(), which fails due to the lack of CAP_SYS_RESOURCE.
Since you added code to parse existing limits from /proc, I'm wondering if we can just do without the config option. Simply try to use prlimit and if it fails, query existing limits to determine if we sould treat the prlimit as fatal or ignore it. Overall I'd prefer libvirt to "just work" out of the box rather than requiring people to know about setting a "make-vfio-hotplug-work=yes" flag in the config file.
The problem with that approach is what to do when *lowering* the limit, for example as a consequence of hot-unplugging the last VFIO device from the VM. If we're controlling the memory locking limit ourselves, then failure to lower it should be an error, because leaving the limit much higher than necessary creates potential for DoS by a compromised QEMU; on the other hand, if the limit is controlled by an external process, all we can really do is assume they will do the right thing after hot-unplugging has happened. I don't think discoverability is too much of an issue, as anyone who needs to use this option will already have needed to figure out a lot more in order to effectively take over memory locking limit management responsibilities from libvirt... -- Andrea Bolognani / Red Hat / Virtualization

On Mon, Mar 08, 2021 at 04:32:26PM +0100, Andrea Bolognani wrote:
On Mon, 2021-03-08 at 13:17 +0000, Daniel P. Berrangé wrote:
On Mon, Mar 08, 2021 at 02:11:56PM +0100, Andrea Bolognani wrote:
The reason why VFIO device assignment is currently not completely broken in KubeVirt is that, when the QEMU process is initially started, we set the memory locking limit after fork(), so we can do that using setrlimit() which doesn't require additional capabilities, but in the hotplug scenario libvirtd needs to change the limits of a different process: in that case we are forced to use prlimit(), which fails due to the lack of CAP_SYS_RESOURCE.
Since you added code to parse existing limits from /proc, I'm wondering if we can just do without the config option. Simply try to use prlimit and if it fails, query existing limits to determine if we sould treat the prlimit as fatal or ignore it. Overall I'd prefer libvirt to "just work" out of the box rather than requiring people to know about setting a "make-vfio-hotplug-work=yes" flag in the config file.
The problem with that approach is what to do when *lowering* the limit, for example as a consequence of hot-unplugging the last VFIO device from the VM.
If we're controlling the memory locking limit ourselves, then failure to lower it should be an error, because leaving the limit much higher than necessary creates potential for DoS by a compromised QEMU; on the other hand, if the limit is controlled by an external process, all we can really do is assume they will do the right thing after hot-unplugging has happened.
IMHO once QEMU vCPUs start running, immediately assume QEMU is compromised / hostile. IOW, the DoS risk arrived the moment it was given the higher limit. We're just failing to close off the existing risk we've already accepted, which doesn't worry me much. On unplug the only thing we actually do when memory lock reduce fails is to log a warning message, it is never treated as a fatal error. So the only difference is whether we skip the warning message when we get EPERM from prlimit(), or always emit the warning. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 2021-03-08 at 15:57 +0000, Daniel P. Berrangé wrote:
On Mon, Mar 08, 2021 at 04:32:26PM +0100, Andrea Bolognani wrote:
On Mon, 2021-03-08 at 13:17 +0000, Daniel P. Berrangé wrote:
Since you added code to parse existing limits from /proc, I'm wondering if we can just do without the config option. Simply try to use prlimit and if it fails, query existing limits to determine if we sould treat the prlimit as fatal or ignore it. Overall I'd prefer libvirt to "just work" out of the box rather than requiring people to know about setting a "make-vfio-hotplug-work=yes" flag in the config file.
The problem with that approach is what to do when *lowering* the limit, for example as a consequence of hot-unplugging the last VFIO device from the VM.
If we're controlling the memory locking limit ourselves, then failure to lower it should be an error, because leaving the limit much higher than necessary creates potential for DoS by a compromised QEMU; on the other hand, if the limit is controlled by an external process, all we can really do is assume they will do the right thing after hot-unplugging has happened.
IMHO once QEMU vCPUs start running, immediately assume QEMU is compromised / hostile. IOW, the DoS risk arrived the moment it was given the higher limit. We're just failing to close off the existing risk we've already accepted, which doesn't worry me much.
On unplug the only thing we actually do when memory lock reduce fails is to log a warning message, it is never treated as a fatal error.
So the only difference is whether we skip the warning message when we get EPERM from prlimit(), or always emit the warning.
You're right, we're currently just soft-failing when we can't lower the memlock limit on unplug. Given this and your assessment of the security implications, which I trust, we should indeed be able to avoid introducing the qemu.conf knob and just behave sanely in all scenarios out of the box. I'll give it a try. -- Andrea Bolognani / Red Hat / Virtualization
participants (3)
-
Andrea Bolognani
-
Daniel P. Berrangé
-
Michal Privoznik