On Mon, Aug 6, 2018 at 10:47 AM Daniel P. Berrangé <berrange(a)redhat.com> wrote:
On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> In that case I wonder what the libvirt community thinks of the proposed
> general "Pid is gone means we can assume it is dead" approach?
The key thing with the shutdown process is that we use the disappearance of
the PID as the flag to indicate that it is safe to release any resources that
the PID was using, e.g. the hostdevs are now available for another guest to
use.
I'd be concerned that if we used /proc/$PID going away as the flag, then
we would be releasing the hostdevs for reuse before the kernel has cleaned
them up. In the best case this would result in a 2nd guest failing to start
because the device was still in use; in the worst case we could crash
the entire host (though I'd be hopeful vfio prevents that).
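To make the "PID is gone" check concrete, here is a minimal Python sketch of polling for a PID to disappear within a deadline. It is an illustration only, not libvirt's code: the names `pid_exists` and `wait_for_pid_gone` and the 0.2 s poll interval are assumptions, and it probes the PID with signal 0 on a POSIX system rather than looking at /proc/$PID.

```python
import os
import time


def pid_exists(pid: int) -> bool:
    """Probe a PID with signal 0: existence is checked, no signal is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True   # EPERM: the process exists but belongs to another user
    return True


def wait_for_pid_gone(pid: int, timeout_s: float, poll_s: float = 0.2) -> bool:
    """Return True once the PID has disappeared, False if timeout_s elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if not pid_exists(pid):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Note that a signal-0 probe still succeeds for a zombie that has not been reaped yet, so in this sketch "gone" means the PID has been fully released.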
Yeah, I agree that resources still being in use could lead to bad and rather
hard-to-debug problems.
> An alternative would be to understand on the Kernel side why the PID is
> gone "too early" and fix that so it stays until fully cleaned up.
> But even then on the Libvirt side we would need the extended timeout values.
Yeah, looks like extended timeouts are unavoidable. The only real optimization
would be to pass an explicit timeout to the kill method, increasing it by 2
seconds for each hostdev that is assigned. That way we'll scale the timeout
up as we need, so we don't have to predict the worst case number of assigned
devices.
I'd do both:
- extend the KILL path timeout (if force is set) in general to give bad
  systems a chance
- extend the maximum by 2s per hostdev
I'll submit that in a few minutes as a reply.
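
For illustration only, the per-hostdev scaling discussed above could be sketched like this in Python. This is not the patch itself: the function name `kill_timeout` and the 15 s base value are hypothetical placeholders, and only the 2 seconds per assigned hostdev comes from this thread.

```python
BASE_KILL_TIMEOUT_S = 15   # hypothetical base wait, NOT libvirt's actual value
EXTRA_PER_HOSTDEV_S = 2    # from the thread: 2 seconds per assigned hostdev


def kill_timeout(num_hostdevs: int, base_s: int = BASE_KILL_TIMEOUT_S) -> int:
    """Maximum time in seconds to wait for the PID to go away after the kill."""
    if num_hostdevs < 0:
        raise ValueError("num_hostdevs must be non-negative")
    # Scale with the number of assigned devices instead of guessing a worst case.
    return base_s + EXTRA_PER_HOSTDEV_S * num_hostdevs


print(kill_timeout(0))  # guest without hostdevs: base timeout only
print(kill_timeout(4))  # guest with 4 hostdevs: base + 8 seconds
```

Raising the general maximum then only means changing the base value, while the per-device part keeps growing with the guest's configuration.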
--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd