Re: [libvirt] [PATCH v3 09/12] qemu: Start PR daemon on domain startup

22 Mar 2018

      On Thu, Mar 22, 2018 at 11:16:47AM +0100, Michal Privoznik wrote:
...
On 03/15/2018 01:31 PM, Peter Krempa wrote:
...
On Wed, Mar 14, 2018 at 17:05:38 +0100, Michal Privoznik wrote:
...
Before we exec() qemu we have to spawn pr-helper processes for
all managed reservations (well, technically there can only one).
The only caveat there is that we should place the process into
the same namespace and cgroup as qemu (so that it shares the same
view of the system). But we can do that only after we've forked.
That means calling the setup function between fork() and exec().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 169 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 169 insertions(+)
...
...
+
+    priv->prPid = cpid;
+    ret = 0;
+ cleanup:
+    if (ret < 0) {
+        virCommandAbort(cmd);
+        virProcessKillPainfully(cpid, true);
+    }
+    virCommandFree(cmd);
+    if (pidfile) {
+        if (unlink(pidfile) < 0 &&
+            errno != ENOENT &&
+            !virGetLastError())
+            virReportSystemError(errno,
+                                 _("Cannot remove stale PID file %s"),
+                                 pidfile);
+        VIR_FREE(pidfile);
By removing the pidfile, you can't make sure later that the PR helper
program is still the same process, thus when killing the VM by the pid
number only you might actually kill a different process.
I unfortunately already deleted the top of the message so you'll need to
fix the function qemuProcessKillPRDaemon to see whether it's the same
process, similarly to what we do when reconnecting to qemu.
In general when pidfiles are locked using lockf()/fcntl(), you should never
unlink them unless you have first acquired the lock, and even then the code
that acquires the lock needs to be made robust against races. 
a race condition

Our virPidFileAcquirePath copes with race'ing unlinks, but QEMU's code does
not. We should fix QEMU in this respect too.
...
Okay, let me write here what I told you in person. I've tried couple of
approaches to do proper locking of pidfile:
a) run qemu-pr-helper -f $pidfile in combination with
virPidFileForceCleanupPath(). This doesn't work because qemu-pr-helper
uses lockf() to lock the pidfile. Libvirt uses F_SETLK. These two locks
don't know about each other. Sure, we can adapt virPid* to do
lockf()-ing. But we will fail miserably if qemu ever decides to change
that to say fnctl().
I don't think that is correct - lockf() is just a thing wrapper around
fnctl() F_SETLK from what I understand.  Many systems do this equivalance,
although technically POSIX leaves it undefined.
...
b) I've changed qemu-pr-helper to use fnctl() to lock pidfile. They have
a nice wrapper over it - qemu_lock_fd(). This still did not fly. Their
wrapper prefers F_OFD_SETLK over F_SETLK. OFD stands for Open File
Description Locks. The lock is not associated with process but with FD
and thus survives fork(). Again, incompatible locks. If file is locked
with F_OFD_SETLK another process can F_SETLK it successfully.
QEMU does the OFD_SETLK/SETLK  magic because its APIs were primarily
design for use with the block layer, where it must have sane semantics
to prevent accidental loss of the lock.

The limitations of F_SETLK are not really an issue for pidfiles which
are opened once and never closed. So it would be valid to change
QEMU to use fcntl(F_SETLK) exclusively for pidfiles. This would be
semantically compatible with lockf() on most systems, and get QEMU
away from under-specified semantics of lockf().
...
c) My third attempt was to not rely on qemu-pr-helper locking at all
(after all they can change it anytime which could break our tailored
solution). So I've modified qemuProcessStartPRDaemonHook() so it calls
virPidFileAcquirePath() and leaks the pidfile FD to qemu-pr-helper. This
looked promising: command hooks are called just before exec(), after all
FDs are closed. Unfortunately, virPidFileAcquirePath() sets FD_CLOEXEC
flag, so no leaking is possible. Sure I can call virSetInherit() to
cancel the CLOEXEC flag. BUT, here the proposal:
IIUC the only problem we have with storing pid and killing it later is
that we might end up killing a different process than qemu-pr-helper
because qemu-pr-helper may die and another process might be spawn with
its PID. Well, this would be solved with the events I'm mentioning in
cover letter. The idea is that qemu sends us an event as soon as
qemu-pr-helper process dies. So we would spawn a new one and update the
stored PID. This way we would never kill wrong process.
BTW: don't look into qemuProcessReconnect. It suffers from the same
problem. If libvirtd is stopped, qemu quits and another process is
spawned with its PID, we kill the new process as soon as we try to
reconnect to the monitor, because connecting to the monitor fails
(obviously) so we jump onto error label where qemuProcessStop() is
called. Worse, if there's OOM situation and we fail to asprintf()
pidfile path we jump onto the label too (and massacre the process we
think is qemu).
My proposal is to:
1) keep tracking of PID as-is now,
2) forbid starting domains with <reservations managed='yes'/> until the
events are implemented.
Alternatively, I can go with c) (in the end I have it working locally)
and do all the changes necessary. Frankly, I don't have preference. I
also wonder if API design for events and query-* command for reconnect
might makes us lean towards one of the options.
I would suggest doing both a) and b).

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [PATCH v3 09/12] qemu: Start PR daemon on domain startup

Daniel P. Berrangé