Re: [libvirt] [PATCH 2/5] qemu: Avoid dangling migration-in job on shutoff domains

Wednesday, 21 March 2012

On Tue, Mar 20, 2012 at 15:56:39 -0600, Eric Blake wrote:
 On 03/19/2012 10:18 AM, Jiri Denemark wrote:
 > Destination daemon should not rely on the client or source daemon
 > (depending on the type of migration) to call Finish when migration
 > fails, because the client may crash before it can do so. The domain
 > prepared for incoming migration is set to be destroyed (and migration
 > job cleaned up) when connection with the client closes but this is not
 > enough. If the associated qemu process crashes after Prepare step and
 > the domain is cleaned up before the connection gets closed, autodestroy
 > is not called for the domain and migration jobs remains set. In case the
 > domain is defined on destination host (i.e., it is not completely
 > removed once destroyed) we keep the job set for ever. To fix this, we
 > register a cleanup callback which is responsible to clean migration-in
 > job when a domain dies anywhere between Prepare and Finish steps. Note
 > that we can't blindly clean any job when spotting EOF on monitor since
 > normally an API is running at that time.
 > ---
 >  src/qemu/qemu_domain.c    |    2 --
 >  src/qemu/qemu_domain.h    |    2 ++
 >  src/qemu/qemu_migration.c |   22 ++++++++++++++++++++++
 >  3 files changed, 24 insertions(+), 2 deletions(-)

 I'm restating my understanding of the bug, to make sure I am sure why
 your patch helps:

 - src requests a migration

Right :-)

 - dest starts a qemu process using information from the src, but the
 destination happens to be running an older qemu that can't support the
 full migration

Perhaps, but there might be several reasons for qemu to die during migration,
even if it's exactly the same version as on the source.

 - qemu dies, but the destination hasn't seen a 'Finish' from the source,
 so the job remains open and the domain remains

The domain is persistent so it is still there but inactive since we got EOF on
qemu monitor. And because we haven't seen Finish, the job remains open even on
the inactive domain.

 - connection is broken, but the open job prevents reclaiming the
 autodestroy domain on the destination

The domain is inactive so there's nothing to autodestroy in the first place.
The EOF handler which destroyed the domain also removed its autodestroy
callback. If we were lucky and we saw broken connection earlier than monitor
EOF, we'd be fine since autodestroy always removes any async job active on the
domain.

 - new connection is made, but source can't migrate because destination
 is already locked up on the stale attempt

Right.

 and the fix is adding a new callback, which says if qemu dies while the
 callback is registered, we cancel the migration job; therefore, even
 without a 'Finish' from the source, the autodestroy can now kick in

...therefore, even without a 'Finish' from the source, we don't end up with a
stale job in case monitor EOF handler is faster and destroys the domain (and
unregisters autodestroy) before autodestroy is called on broken connection
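
The ordering issue discussed above can be illustrated with a toy Python model. This is only a sketch of the race, not libvirt code: every name in it (Domain, prepare, monitor_eof, connection_closed, job_cleanup, run) is made up for illustration, and it assumes that the monitor EOF handler unregisters autodestroy and runs any registered cleanup callbacks.

```python
class Domain:
    """Toy stand-in for a persistent domain on the destination host."""

    def __init__(self):
        self.active = False
        self.async_job = None        # e.g. "migration-in"
        self.autodestroy = False     # registered during Prepare
        self.cleanup_callbacks = []  # run when the qemu process is gone


def job_cleanup(dom):
    """Hypothetical cleanup callback: drop a dangling migration-in job."""
    if dom.async_job == "migration-in":
        dom.async_job = None


def prepare(dom, with_fix):
    """Prepare step: start qemu, open the job, register autodestroy."""
    dom.active = True
    dom.async_job = "migration-in"
    dom.autodestroy = True
    if with_fix:
        dom.cleanup_callbacks.append(job_cleanup)


def monitor_eof(dom):
    """qemu died: the domain goes inactive and autodestroy is unregistered."""
    dom.active = False
    dom.autodestroy = False
    for callback in dom.cleanup_callbacks:
        callback(dom)
    dom.cleanup_callbacks.clear()


def connection_closed(dom):
    """Client connection dropped: autodestroy runs only if still registered."""
    if dom.autodestroy:
        dom.active = False
        dom.async_job = None  # autodestroy removes any async job
        dom.autodestroy = False


def run(with_fix, eof_first):
    """Return the job left on the domain when Finish never arrives."""
    dom = Domain()
    prepare(dom, with_fix)
    events = [monitor_eof, connection_closed]
    if not eof_first:
        events.reverse()
    for event in events:
        event(dom)
    return dom.async_job


print(run(with_fix=False, eof_first=True))   # migration-in (stale job)
print(run(with_fix=False, eof_first=False))  # None
print(run(with_fix=True, eof_first=True))    # None
```

Only the case without the callback, where monitor EOF is seen before the broken connection, leaves the job behind on the inactive domain.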

 ACK.

Thanks.

Should I clarify the commit message a bit before pushing?

Jirka
