[libvirt] [PATCH v3 0/2] Fix detection of slow guest shutdown

Hi, after a good discussion a few days ago in https://www.redhat.com/archives/libvir-list/2018-August/msg00122.html and a short lived but back then untested v2 in https://www.redhat.com/archives/libvir-list/2018-August/msg00199.html I finally get access to the right HW again and completed the series. Being finally retested and working I finally feel safe to submit without a RFC prefix. I think this would be a great addition for a better handling of guests with plenty of host devices passed through. With the new code in place I can shutdown systems that have 12, 16 or even more hostdevs attached without getting into the "zombie" mode where libvirt will forever consider the guest as "in shutdown" as it gave up waiting too early because the signal zero still was able to reach it. Scaling examples (extracted with gdb): 16 Devices: virProcessKillPainfullyDelay (pid=67096, force=true, extradelay=32) 12 Devices: virProcessKillPainfullyDelay (pid=68251, force=true, extradelay=24) *Updates in v3* - fixup some issues found in testing and code checks *Updates in v2* - removed the "accept the lack of /proc/<pid> as valid process removal" approach due to valid concerns about reusing ressources. - added a dynamic extra wait scaling with the amount of hostdevs Christian Ehrhardt (2): process: wait longer on kill per assigned Hostdev process: wait longer 5->30s on hard shutdown src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 20 ++++++++++++++++---- src/util/virprocess.h | 1 + 4 files changed, 21 insertions(+), 6 deletions(-) -- 2.17.1

It was found that in cases with host devices virProcessKillPainfully might be able to send signal zero to the target PID for quite a while with the process already being gone from /proc/<PID>. That is due to cleanup and reset of devices which might include a secondary bus reset that on top of the actions taken has a 1s delay to let the bus settle. Due to that guests with plenty of Host devices could easily exceed the default timeouts. To solve that, this adds an extra delay of 2s per hostdev that is associated to a VM. Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 18 +++++++++++++++--- src/util/virprocess.h | 1 + 4 files changed, 20 insertions(+), 5 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ca4a192a4a..47ea35f864 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2605,6 +2605,7 @@ virProcessGetPids; virProcessGetStartTime; virProcessKill; virProcessKillPainfully; +virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInMountNamespace; virProcessSchedPolicyTypeFromString; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 02fdc55156..b7bf8813da 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -6814,8 +6814,9 @@ qemuProcessKill(virDomainObjPtr vm, unsigned int flags) return 0; } - ret = virProcessKillPainfully(vm->pid, - !!(flags & VIR_QEMU_PROCESS_KILL_FORCE)); + ret = virProcessKillPainfullyDelay(vm->pid, + !!(flags & VIR_QEMU_PROCESS_KILL_FORCE), + vm->def->nhostdevs * 2); return ret; } diff --git a/src/util/virprocess.c b/src/util/virprocess.c index ecea27a2d4..46360cc051 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -341,15 +341,19 @@ int virProcessKill(pid_t pid, int sig) * Returns 0 if it was killed gracefully, 1 if it * was killed forcibly, -1 if it is still alive, * or another error occurred. + * + * Callers can proide an extra delay to wait longer + * than the default. */ int -virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5); const char *signame = "TERM"; - VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force); + VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force, pid); /* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and @@ -357,9 +361,12 @@ virProcessKillPainfully(pid_t pid, bool force) * wait up to 5 seconds more for the process to exit before * returning. * + * An extra delay can be specified for cases that are expected to clean + * up slower than usual. + * * Note that setting @force could result in dataloss for the process. */ - for (i = 0; i < 75; i++) { + for (i = 0; i < delay; i++) { int signum; if (i == 0) { signum = SIGTERM; /* kindly suggest it should exit */ @@ -402,6 +409,11 @@ virProcessKillPainfully(pid_t pid, bool force) } +int virProcessKillPainfully(pid_t pid, bool force) +{ + return virProcessKillPainfullyDelay(pid, force, 0); +} + #if HAVE_SCHED_GETAFFINITY int virProcessSetAffinity(pid_t pid, virBitmapPtr map) diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 3c5a882772..b72603ca8e 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -55,6 +55,7 @@ virProcessWait(pid_t pid, int *exitstatus, bool raw) int virProcessKill(pid_t pid, int sig); int virProcessKillPainfully(pid_t pid, bool force); +int virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int); int virProcessSetAffinity(pid_t pid, virBitmapPtr map); -- 2.17.1

Christian Ehrhardt <christian.ehrhardt@canonical.com> [2018-08-14, 11:27AM +0200]:
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index ecea27a2d4..46360cc051 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -341,15 +341,19 @@ int virProcessKill(pid_t pid, int sig) * Returns 0 if it was killed gracefully, 1 if it * was killed forcibly, -1 if it is still alive, * or another error occurred. + * + * Callers can proide an extra delay to wait longer + * than the default. */ int -virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5); const char *signame = "TERM";
- VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force); + VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force, pid); ^ This probably should be delay, right?

On Tue, Aug 14, 2018 at 12:13 PM Bjoern Walk <bwalk@linux.ibm.com> wrote:
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index ecea27a2d4..46360cc051 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -341,15 +341,19 @@ int virProcessKill(pid_t pid, int sig) * Returns 0 if it was killed gracefully, 1 if it * was killed forcibly, -1 if it is still alive, * or another error occurred. + * + * Callers can proide an extra delay to wait longer + * than the default. */ int -virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5); const char *signame = "TERM";
- VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force); + VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force,
Christian Ehrhardt <christian.ehrhardt@canonical.com> [2018-08-14, 11:27AM +0200]: pid); ^ This probably should be delay, right?
Absolutely correct, bad old copy-pasta from the beginning of this change. Thanks for spotting it. Example is with 12 Devices passed through and force, which means with both patches applied: - raise base value to wait after kill by 30 seconds (force is set) - second patch - request 12*2 seconds extra for the devices - this patch That should be 200+(12*2*5)=320 and that matches what I see int he log: 2018-08-16 13:04:51.999+0000: 61243: debug : virProcessKillPainfullyDelay:359 : vpid=60939 force=1 delay=320 I'm not pushing a V2 just for the change, but I have it queued with the fixed variable. Any other feedback on this or the sibling patch? Especially since those aren't my usual apparmor changes I'd really want some more reviews/acks before considering a push. -- Christian Ehrhardt Software Engineer, Ubuntu Server Canonical Ltd

On Tue, Aug 14, 2018 at 11:27:33AM +0200, Christian Ehrhardt wrote:
It was found that in cases with host devices virProcessKillPainfully might be able to send signal zero to the target PID for quite a while with the process already being gone from /proc/<PID>.
That is due to cleanup and reset of devices which might include a secondary bus reset that on top of the actions taken has a 1s delay to let the bus settle. Due to that guests with plenty of Host devices could easily exceed the default timeouts.
To solve that, this adds an extra delay of 2s per hostdev that is associated to a VM.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 18 +++++++++++++++--- src/util/virprocess.h | 1 + 4 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ca4a192a4a..47ea35f864 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2605,6 +2605,7 @@ virProcessGetPids; virProcessGetStartTime; virProcessKill; virProcessKillPainfully; +virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInMountNamespace; virProcessSchedPolicyTypeFromString; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 02fdc55156..b7bf8813da 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -6814,8 +6814,9 @@ qemuProcessKill(virDomainObjPtr vm, unsigned int flags) return 0; }
- ret = virProcessKillPainfully(vm->pid, - !!(flags & VIR_QEMU_PROCESS_KILL_FORCE)); + ret = virProcessKillPainfullyDelay(vm->pid, + !!(flags & VIR_QEMU_PROCESS_KILL_FORCE), + vm->def->nhostdevs * 2);
This API contract is a bit wierd. You've got an arbitray x2 multiplier here...
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index ecea27a2d4..46360cc051 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -341,15 +341,19 @@ int virProcessKill(pid_t pid, int sig) * Returns 0 if it was killed gracefully, 1 if it * was killed forcibly, -1 if it is still alive, * or another error occurred. + * + * Callers can proide an extra delay to wait longer + * than the default.
Mention that it is in "seconds"
*/ int -virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5);
...and it gets another x5 multiplier here. Feels nicer to just have the caller provide a x10 multiplier and honour that directly
const char *signame = "TERM";
- VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force); + VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force, pid);
s/pid/extradelay/
/* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and @@ -357,9 +361,12 @@ virProcessKillPainfully(pid_t pid, bool force) * wait up to 5 seconds more for the process to exit before * returning. * + * An extra delay can be specified for cases that are expected to clean + * up slower than usual. + * * Note that setting @force could result in dataloss for the process. */ - for (i = 0; i < 75; i++) { + for (i = 0; i < delay; i++) { int signum; if (i == 0) { signum = SIGTERM; /* kindly suggest it should exit */ @@ -402,6 +409,11 @@ virProcessKillPainfully(pid_t pid, bool force) }
+int virProcessKillPainfully(pid_t pid, bool force) +{ + return virProcessKillPainfullyDelay(pid, force, 0); +} + #if HAVE_SCHED_GETAFFINITY
int virProcessSetAffinity(pid_t pid, virBitmapPtr map) diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 3c5a882772..b72603ca8e 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -55,6 +55,7 @@ virProcessWait(pid_t pid, int *exitstatus, bool raw) int virProcessKill(pid_t pid, int sig);
int virProcessKillPainfully(pid_t pid, bool force); +int virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int);
Missing parameter name here Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Aug 21, 2018 at 1:15 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Tue, Aug 14, 2018 at 11:27:33AM +0200, Christian Ehrhardt wrote:
It was found that in cases with host devices virProcessKillPainfully might be able to send signal zero to the target PID for quite a while with the process already being gone from /proc/<PID>.
That is due to cleanup and reset of devices which might include a secondary bus reset that on top of the actions taken has a 1s delay to let the bus settle. Due to that guests with plenty of Host devices could easily exceed the default timeouts.
To solve that, this adds an extra delay of 2s per hostdev that is associated to a VM.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 18 +++++++++++++++--- src/util/virprocess.h | 1 + 4 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ca4a192a4a..47ea35f864 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2605,6 +2605,7 @@ virProcessGetPids; virProcessGetStartTime; virProcessKill; virProcessKillPainfully; +virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInMountNamespace; virProcessSchedPolicyTypeFromString; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 02fdc55156..b7bf8813da 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -6814,8 +6814,9 @@ qemuProcessKill(virDomainObjPtr vm, unsigned int flags) return 0; }
- ret = virProcessKillPainfully(vm->pid, - !!(flags & VIR_QEMU_PROCESS_KILL_FORCE)); + ret = virProcessKillPainfullyDelay(vm->pid, + !!(flags & VIR_QEMU_PROCESS_KILL_FORCE), + vm->def->nhostdevs * 2);
This API contract is a bit wierd. You've got an arbitray x2 multiplier here...
Out of the discussion with Alex I derived that due to the 1 sec settling of the bus plus some extra work 2 seconds per device would be a good rule of thumb. The delay is meant to be in seconds and here it requests 2*nhostdevs to get the amount needed. And I'd not want to make the "unit" of the call "double-seconds" :-) Let me add a comment at this call to explain the reasoning on the multiplier here.
diff --git a/src/util/virprocess.c b/src/util/virprocess.c
index ecea27a2d4..46360cc051 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -341,15 +341,19 @@ int virProcessKill(pid_t pid, int sig) * Returns 0 if it was killed gracefully, 1 if it * was killed forcibly, -1 if it is still alive, * or another error occurred. + * + * Callers can proide an extra delay to wait longer + * than the default.
Mention that it is in "seconds"
Ack
*/
int -virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5);
...and it gets another x5 multiplier here.
That *5 is due to the internal unit it polls being 0.2 seconds. I think externally anything other than seconds (double-seconds, half-deca-seconds, ...) makes no sense. I'll call it "polldelay" instead to make it more clear that. That uncommon *5 ratio was there before, just not as redable (it just defined 75 in the loop header). I'll add some text about it as well to be sure.
Feels nicer to just have the caller provide a x10 multiplier and honour that directly
See above, the unit being seconds seems the most readable IMHO. With the changes made I think it won't be as confusing.
const char *signame = "TERM";
- VIR_DEBUG("vpid=%lld force=%d", (long long)pid, force); + VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force,
pid);
s/pid/extradelay/
Yes, that was mentioned by Bjoern and queued that way. Also in the text the delay= should be extradelay= which is ready that way.
/* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and @@ -357,9 +361,12 @@ virProcessKillPainfully(pid_t pid, bool force) * wait up to 5 seconds more for the process to exit before * returning. * + * An extra delay can be specified for cases that are expected to
+ * up slower than usual. + * * Note that setting @force could result in dataloss for the
clean process.
*/ - for (i = 0; i < 75; i++) { + for (i = 0; i < delay; i++) { int signum; if (i == 0) { signum = SIGTERM; /* kindly suggest it should exit */ @@ -402,6 +409,11 @@ virProcessKillPainfully(pid_t pid, bool force) }
+int virProcessKillPainfully(pid_t pid, bool force) +{ + return virProcessKillPainfullyDelay(pid, force, 0); +} + #if HAVE_SCHED_GETAFFINITY
int virProcessSetAffinity(pid_t pid, virBitmapPtr map) diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 3c5a882772..b72603ca8e 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -55,6 +55,7 @@ virProcessWait(pid_t pid, int *exitstatus, bool raw) int virProcessKill(pid_t pid, int sig);
int virProcessKillPainfully(pid_t pid, bool force); +int virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int);
Missing parameter name here
Ack, and then breaking it into new lines for 80 chars Thanks all these will be part of a v4 ready after some syntax check and test builds ... Regards,
Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
-- Christian Ehrhardt Software Engineer, Ubuntu Server Canonical Ltd

On Tue, Aug 21, 2018 at 02:20:21PM +0200, Christian Ehrhardt wrote:
On Tue, Aug 21, 2018 at 1:15 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
On Tue, Aug 14, 2018 at 11:27:33AM +0200, Christian Ehrhardt wrote:
It was found that in cases with host devices virProcessKillPainfully might be able to send signal zero to the target PID for quite a while with the process already being gone from /proc/<PID>.
That is due to cleanup and reset of devices which might include a secondary bus reset that on top of the actions taken has a 1s delay to let the bus settle. Due to that guests with plenty of Host devices could easily exceed the default timeouts.
To solve that, this adds an extra delay of 2s per hostdev that is associated to a VM.
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 18 +++++++++++++++--- src/util/virprocess.h | 1 + 4 files changed, 20 insertions(+), 5 deletions(-)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ca4a192a4a..47ea35f864 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2605,6 +2605,7 @@ virProcessGetPids; virProcessGetStartTime; virProcessKill; virProcessKillPainfully; +virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInMountNamespace; virProcessSchedPolicyTypeFromString; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 02fdc55156..b7bf8813da 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -6814,8 +6814,9 @@ qemuProcessKill(virDomainObjPtr vm, unsigned int flags) return 0; }
- ret = virProcessKillPainfully(vm->pid, - !!(flags & VIR_QEMU_PROCESS_KILL_FORCE)); + ret = virProcessKillPainfullyDelay(vm->pid, + !!(flags & VIR_QEMU_PROCESS_KILL_FORCE), + vm->def->nhostdevs * 2);
This API contract is a bit wierd. You've got an arbitray x2 multiplier here...
Out of the discussion with Alex I derived that due to the 1 sec settling of the bus plus some extra work 2 seconds per device would be a good rule of thumb. The delay is meant to be in seconds and here it requests 2*nhostdevs to get the amount needed.
And I'd not want to make the "unit" of the call "double-seconds" :-)
Let me add a comment at this call to explain the reasoning on the multiplier here.
-virProcessKillPainfully(pid_t pid, bool force) +virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; + unsigned int delay = 75 + (extradelay*5);
...and it gets another x5 multiplier here.
That *5 is due to the internal unit it polls being 0.2 seconds. I think externally anything other than seconds (double-seconds, half-deca-seconds, ...) makes no sense.
Opps, yes, I mis-understood the behaviour of the code. A comment is fine Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

In cases where virProcessKillPainfully already reailizes that SIGTERM wasn't enough we are partially on a bad path already. Maybe the system is overloaded or having serious trouble to free and reap resources in time. In those case give the SIGKILL that was sent after 10 seconds some more time to take effect if force was set (only then we are falling back to SIGKILL anyway). Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/util/virprocess.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 46360cc051..dda8916284 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -350,7 +350,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; - unsigned int delay = 75 + (extradelay*5); + unsigned int delay = (force ? 200 : 75) + (extradelay*5); const char *signame = "TERM"; VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force, pid); @@ -358,7 +358,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) /* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and * @force is requested, a SIGKILL will be sent, and this will - * wait up to 5 seconds more for the process to exit before + * wait up to 30 seconds more for the process to exit before * returning. * * An extra delay can be specified for cases that are expected to clean -- 2.17.1

On Tue, Aug 14, 2018 at 11:27:34AM +0200, Christian Ehrhardt wrote:
In cases where virProcessKillPainfully already reailizes that SIGTERM wasn't enough we are partially on a bad path already. Maybe the system is overloaded or having serious trouble to free and reap resources in time.
In those case give the SIGKILL that was sent after 10 seconds some more time to take effect if force was set (only then we are falling back to SIGKILL anyway).
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/util/virprocess.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 46360cc051..dda8916284 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -350,7 +350,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; - unsigned int delay = 75 + (extradelay*5); + unsigned int delay = (force ? 200 : 75) + (extradelay*5); const char *signame = "TERM";
VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force, pid); @@ -358,7 +358,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) /* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and * @force is requested, a SIGKILL will be sent, and this will - * wait up to 5 seconds more for the process to exit before + * wait up to 30 seconds more for the process to exit before * returning. * * An extra delay can be specified for cases that are expected to clean
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Aug 21, 2018 at 1:16 PM Daniel P. Berrangé <berrange@redhat.com> wrote:
In cases where virProcessKillPainfully already reailizes that SIGTERM wasn't enough we are partially on a bad path already. Maybe the system is overloaded or having serious trouble to free and reap resources in time.
In those case give the SIGKILL that was sent after 10 seconds some more time to take effect if force was set (only then we are falling back to SIGKILL anyway).
Signed-off-by: Christian Ehrhardt <christian.ehrhardt@canonical.com> --- src/util/virprocess.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 46360cc051..dda8916284 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -350,7 +350,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) { size_t i; int ret = -1; - unsigned int delay = 75 + (extradelay*5); + unsigned int delay = (force ? 200 : 75) + (extradelay*5); const char *signame = "TERM";
VIR_DEBUG("vpid=%lld force=%d delay=%u", (long long)pid, force,
On Tue, Aug 14, 2018 at 11:27:34AM +0200, Christian Ehrhardt wrote: pid);
@@ -358,7 +358,7 @@ virProcessKillPainfullyDelay(pid_t pid, bool force, unsigned int extradelay) /* This loop sends SIGTERM, then waits a few iterations (10 seconds) * to see if it dies. If the process still hasn't exited, and * @force is requested, a SIGKILL will be sent, and this will - * wait up to 5 seconds more for the process to exit before + * wait up to 30 seconds more for the process to exit before * returning. * * An extra delay can be specified for cases that are expected to clean
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Thanks, added to the v4 submission queue ...
Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
-- Christian Ehrhardt Software Engineer, Ubuntu Server Canonical Ltd

Ping - as there was no other review yet for the series except a copy-paste bug spotted (thanks Bjoern) On Tue, Aug 14, 2018 at 11:27 AM Christian Ehrhardt < christian.ehrhardt@canonical.com> wrote:
Hi, after a good discussion a few days ago in https://www.redhat.com/archives/libvir-list/2018-August/msg00122.html and a short lived but back then untested v2 in https://www.redhat.com/archives/libvir-list/2018-August/msg00199.html I finally get access to the right HW again and completed the series.
Being finally retested and working I finally feel safe to submit without a RFC prefix. I think this would be a great addition for a better handling of guests with plenty of host devices passed through.
With the new code in place I can shutdown systems that have 12, 16 or even more hostdevs attached without getting into the "zombie" mode where libvirt will forever consider the guest as "in shutdown" as it gave up waiting too early because the signal zero still was able to reach it.
Scaling examples (extracted with gdb): 16 Devices: virProcessKillPainfullyDelay (pid=67096, force=true, extradelay=32) 12 Devices: virProcessKillPainfullyDelay (pid=68251, force=true, extradelay=24)
*Updates in v3* - fixup some issues found in testing and code checks
*Updates in v2* - removed the "accept the lack of /proc/<pid> as valid process removal" approach due to valid concerns about reusing ressources. - added a dynamic extra wait scaling with the amount of hostdevs
Christian Ehrhardt (2): process: wait longer on kill per assigned Hostdev process: wait longer 5->30s on hard shutdown
src/libvirt_private.syms | 1 + src/qemu/qemu_process.c | 5 +++-- src/util/virprocess.c | 20 ++++++++++++++++---- src/util/virprocess.h | 1 + 4 files changed, 21 insertions(+), 6 deletions(-)
-- 2.17.1
-- Christian Ehrhardt Software Engineer, Ubuntu Server Canonical Ltd
participants (3)
-
Bjoern Walk
-
Christian Ehrhardt
-
Daniel P. Berrangé