On Fri, Aug 3, 2018 at 6:39 PM Alex Williamson <alex.williamson@redhat.com> wrote:

On Fri, 3 Aug 2018 08:29:39 +0200
Christian Ehrhardt <christian.ehrhardt@canonical.com> wrote:

> Hi,
> I was recently looking into a case which essentially looked like this:
> 1. virsh shutdown guest
> 2. after <1 second the qemu process was gone from /proc/
> 3. but libvirt spun in virProcessKillPainfully because the process
> was still reachable via signals
> 4. virProcessKillPainfully eventually fails after 15 seconds and the
> guest stays in "in shutdown" state forever
>
> This is not one of the common cases I've found for
> virProcessKillPainfully to break:
> - bad I/O e.g. NFS gets qemu stuck
> - CPU overload stalls things to death
> - qemu not being reaped (by init)
> All of the above would have the process still available in /proc/<pid>
> as Zombie or in uninterruptible sleep, but that is not true in my case.
>
> It turned out that the case depended on the number of hostdev resources
> passed to the guest. Debugging showed that with 8, and more likely with 16,
> GPUs passed through, it took ~18 seconds from SIGTERM until the process was
> no longer reachable with signal 0.
> I haven't run many more tests and stayed with the 16 GPU case, but
> I'm rather sure more devices would make it take even longer.

If it's dependent on device assignment, then it's probably either
related to unmapping DMA or resetting devices. The former should scale
with the size of the VM, not the number of devices attached. The
latter could increase with each device. Typically with physical GPUs
we don't have a function level reset mechanism, so we need to do a
secondary bus reset on the upstream bridge to reset the device; this
requires a 1s delay to let the bus settle after reset. So if we're
gated by these sorts of resets, your scaling doesn't sound
unreasonable,

So the scaling makes sense: ~16*1s plus a little baseline cleanup time matches the ~18 seconds I see.
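
As a rough sanity check of that arithmetic, here is a minimal Python sketch. The 1 s per-device delay is the bus settle time mentioned above; the 2 s baseline is only inferred from the ~18 s observed with 16 GPUs and is an assumption, as is the premise that the resets run strictly one after another. The function name is made up for illustration.

```python
def estimated_teardown_seconds(num_devices, reset_delay_s=1.0, baseline_s=2.0):
    """Estimate the time from SIGTERM until the qemu PID stops answering signal 0.

    Assumes one serialized secondary bus reset per assigned device
    (reset_delay_s each) plus a fixed baseline cleanup cost (baseline_s,
    inferred from the observed ~18 s with 16 GPUs, not measured separately).
    """
    if num_devices < 0:
        raise ValueError("num_devices must be non-negative")
    return baseline_s + num_devices * reset_delay_s


if __name__ == "__main__":
    for n in (1, 16, 32):
        print(f"{n:2d} devices -> ~{estimated_teardown_seconds(n):.0f} s")
```

Under these assumed numbers, 16 devices come out at ~18 s, which is past the 15 seconds after which virProcessKillPainfully gives up.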

Thanks for that explanation!

though I'm not sure how these factor into the process
state you're seeing.

Yeah, I'd have expected to still see it in some form, as a Zombie or similar.

But it really is gone.

I'd also be surprised if you have a system that
can host 16 physical GPUs, so maybe this is a vGPU example?

16*physical GPU it is :-)

See https://www.nvidia.com/en-us/data-center/dgx-2/

Any mdev
device should provide a reset callback for roughly the equivalent of a
function level reset. Implementation of such a reset would be vendor
specific.

Since this is not a classic mdev [1][2] but just 16 physical GPUs, the callback suggestion would not make sense, right?

In that case I wonder what the libvirt community thinks of the proposed general "Pid is gone means we can assume it is dead" approach?

An alternative would be to understand on the kernel side why the PID is gone "too early" and fix that so it stays until fully cleaned up.

But even then we would need the extended timeout values on the libvirt side.
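
To make the proposal concrete, here is a minimal Python sketch of a wait loop that treats a vanished /proc/<pid> entry as "dead" even while signal 0 still succeeds. This is not libvirt's code (virProcessKillPainfully is C and differs in detail); the function names, the 30-second default, and the polling interval are all assumptions for illustration.

```python
import errno
import os
import time


def signalable(pid):
    """Return True if signal 0 can still be delivered to pid."""
    try:
        os.kill(pid, 0)
    except OSError as exc:
        # ESRCH: no such process. EPERM: it exists but belongs to someone else.
        return exc.errno != errno.ESRCH
    return True


def in_proc(pid):
    """Return True if /proc/<pid> still exists (Linux only)."""
    return os.path.exists(f"/proc/{pid}")


def wait_until_dead(pid, timeout_s=30.0, poll_s=0.2,
                    is_signalable=signalable, is_in_proc=in_proc,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until pid counts as dead or timeout_s expires.

    Hypothetical policy from this thread: the process counts as dead as
    soon as it is either unreachable via signal 0 or gone from /proc.
    The probes are injectable so the policy can be tested without a
    real qemu process.
    """
    deadline = clock() + timeout_s
    while True:
        if not is_signalable(pid) or not is_in_proc(pid):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

Dropping the `is_in_proc` check leaves the stricter "signal 0 only" behaviour, so the two policies differ only in that one condition. Whether a missing /proc entry is safe to rely on is exactly the open question here.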

[1]: https://libvirt.org/drvnodedev.html#MDEV

[2]: https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt

Thanks,

Alex

Christian Ehrhardt

Software Engineer, Ubuntu Server

Canonical Ltd