There were cases where the process was gone (no /proc/<pid> entry anymore),
but kill with signal 0 was still able to reach the process.
This can happen while the kernel is still cleaning up resources.
In the most common cases a /proc/<pid> entry is left behind, with the
process in zombie state until it is reaped (by init). Those cases usually
resolve rather quickly, as init periodically calls wait to reap.
The more critical and confusing cases are those where the process is
gone entirely (no /proc/<pid> entry anymore), but the kernel still
considers it reachable by kill with signal 0.
This is because kill(2) only checks for the "existence of a process ID",
not for the process itself.
This effect has mostly been seen when using plenty of SR-IOV resources in the
guest (which might explain the extra cleanup phase in the kernel). To
resolve those issues, libvirt now also accepts /proc/<pid> being gone as a
valid exit (on top of signal 0 returning ESRCH).
Signed-off-by: Christian Ehrhardt <christian.ehrhardt(a)canonical.com>
---
src/util/virprocess.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
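
For reviewers, the combined check described above can be sketched outside of
libvirt roughly as follows. This is only an illustration under assumptions:
pid_looks_exited() is a hypothetical name, and it uses plain POSIX calls
(kill, access, snprintf) instead of the virAsprintf/virFileExists helpers
used in the patch itself.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not libvirt API): returns true if @pid should be
 * treated as exited, using both checks discussed in the commit message. */
bool pid_looks_exited(pid_t pid)
{
    char path[64];

    /* Classic check: signal 0 delivers nothing and only tests whether
     * the process ID can still be signalled. */
    if (kill(pid, 0) < 0 && errno == ESRCH)
        return true;

    /* Additional check: a missing /proc/<pid> entry is accepted as a
     * valid exit as well, even if signal 0 still succeeds. */
    snprintf(path, sizeof(path), "/proc/%lld", (long long)pid);
    if (access(path, F_OK) < 0 && errno == ENOENT)
        return true;

    return false;
}
```

The sketch assumes a Linux system with /proc mounted; without it the second
check would report every pid as exited.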
diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index 10952b0980..8d863a6777 100644
--- a/src/util/virprocess.c
+++ b/src/util/virprocess.c
@@ -352,9 +352,14 @@ virProcessKillPainfully(pid_t pid, bool force)
int ret = -1;
int maxwait = (force ? 200 : 75 );
const char *signame = "TERM";
+ char *procPath = NULL;
VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force);
+ if (virAsprintf(&procPath, "/proc/%lld", (long long) pid) < 0)
+ VIR_WARN("Can't allocate procPath to check for exit of pid %lld,\n",
+ (long long)pid);
+
/* This loop sends SIGTERM, then waits a few iterations (10 seconds)
* to see if it dies. If the process still hasn't exited, and
* @force is requested, a SIGKILL will be sent, and this will
@@ -393,6 +398,24 @@ virProcessKillPainfully(pid_t pid, bool force)
ret = signum == SIGTERM ? 0 : 1;
goto cleanup; /* process is dead */
}
+ /*
+ * There were cases where the process was gone (no /proc/<pid> entry
+ * anymore), but kill with signal=0 was still able to reach it, as
+ * kill(2) only checks for "existence of a process ID" but not the
+ * process itself. This can happen, for example, while the kernel is
+ * still cleaning up resources. We accept having no /proc/<pid> entry
+ * left as a valid exit of the process as well.
+ */
+ if (procPath != NULL && !virFileExists(procPath)) {
+ if (errno == ENOENT) {
+ ret = signum == SIGTERM ? 0 : 1;
+ /* DEBUG as it could be just the race from signal to cleanup */
+ VIR_DEBUG("Process with pid %lld still reachable with signals\n"
+ "but %s no longer exists",
+ (long long) pid, procPath);
+ goto cleanup;
+ }
+ }
usleep(200 * 1000);
}
@@ -402,6 +425,7 @@ virProcessKillPainfully(pid_t pid, bool force)
(long long)pid, signame);
cleanup:
+ VIR_FREE(procPath);
return ret;
}
--
2.17.1