Re: [libvirt] [PATCH] RFC: Support QEMU live upgrade

19 Nov 2013

      on 2013/11/13 21:10, Daniel P. Berrange wrote:
...
On Wed, Nov 13, 2013 at 12:15:30PM +0800, Zheng Sheng ZS Zhou wrote:
...
Hi Daniel,
on 2013/11/12 20:23, Daniel P. Berrange wrote:
...
...
On Tue, Nov 12, 2013 at 08:14:11PM +0800, Zheng Sheng ZS Zhou wrote:
Hi all,
QEMU developers have recently been working on a feature that allows
upgrading a live QEMU instance to a new version without restarting the
VM. This is implemented as live migration between the old and new QEMU
processes on the same host [1]. Here is the use case:
1) Guests are running QEMU release 1.6.1.
2) Admin installs QEMU release 1.6.2 via RPM or deb.
3) Admin starts a new VM using the updated QEMU binary, and asks the old
QEMU process to migrate the VM to the newly started VM.
I think it will be very useful to support QEMU live upgrade in libvirt.
After some investigation, I found that migrating to the same host
breaks the current migration code. I'd like to propose a new work flow
for QEMU live migration that implements step 3) above.
How does it break the migration code? Your patch below is effectively
re-implementing the multistep migration workflow, leaving out many
important features (seamless reconnect to SPICE clients for example)
which is really bad for our ongoing code support burden, so not
something I want to see.
Daniel
Actually, I wrote another experimental patch to investigate how we
can re-use the existing framework to do local migration. I found
the following problems.
(1) When migrating to a different host, the destination domain uses
the same UUID and name as the source, and this is OK. When migrating
to localhost, the destination domain's UUID and name conflict
with those of the source. The QEMU driver maintains a hash table of
domain objects whose key is the UUID of the virtual
machine. closeCallbacks is also a hash table with the domain
UUID as key, and there may be other data structures that use the
UUID as key. This implies we must use a different name and UUID
for the destination domain. In the migration framework, the
Begin and Prepare stages call virDomainDefCheckABIStability,
which prevents us from using a different UUID, and they also check
that the hostname and host UUID are different. If we want to enable
local migration, we have to skip these checks and generate a new
UUID and name for the destination domain. Of course we would restore
the original UUID after migration. The UUID is used by higher-level
management software to identify virtual machines, so it should
stay the same after a QEMU live upgrade.
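To make the conflict in (1) concrete, here is a minimal sketch in Python, not libvirt code. The DomainRegistry class and its add() method are hypothetical stand-ins for a UUID-keyed domain table; they only illustrate why a second domain object with the same UUID cannot be registered on the same host.

```python
import uuid


class DomainRegistry:
    """Hypothetical stand-in for a driver's UUID-keyed domain table."""

    def __init__(self):
        self._by_uuid = {}

    def add(self, dom_uuid, name):
        # One entry per UUID: a second domain with the same UUID is rejected.
        if dom_uuid in self._by_uuid:
            raise KeyError(f"domain with UUID {dom_uuid} already exists")
        self._by_uuid[dom_uuid] = name


registry = DomainRegistry()
src_uuid = uuid.uuid4()
registry.add(src_uuid, "guest1")  # the source domain

try:
    # A local migration would need a destination domain with the same UUID.
    registry.add(src_uuid, "guest1")
    conflict = False
except KeyError:
    conflict = True

print("conflict on local migration:", conflict)
```

Across two hosts each side has its own table, so the same UUID never collides; on one host both domain objects would land in the same table.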
This point is something that needs to be solved regardless of
whether using migration framework, or re-inventing the migration
framework. The QEMU driver fundamentally assumes that there is
only ever one single VM with a given UUID, and a VM has only
1 process. IMHO name + uuid must be preserved during any live
upgrade process, otherwise mgmt will get confused. This has
more problems because 'name' is used for various resources
created by QEMU on disk - eg the monitor command path. We can't
have 2 QEMUs using the same name, but at the same time that's
exactly what we'd need here.
Thanks Daniel. I agree with you that we should not change the QEMU UUID.
I also think refactoring and re-using the existing migration code is
great, so I did some investigation in this direction. I found that the
assumption of one process per VM and UUID is made not only in the QEMU
driver but also in libvirt's higher-level data structures and functions,
such as the virDomainObj structure and the virCloseCallbacksXXX
functions. The hypervisor process ID is stored directly in
virDomainObj.pid, and virDomainObj contains only one pid field. The
virCloseCallbacksXXX functions maintain the invariant that only one
callback and connection can be registered for each VM UUID. For
example, in a non-p2p migration, the client opens two connections to
libvirtd, one for the source domain and one for the destination domain.
When it tries to register a close callback for the dst connection and
the dst domain, libvirt reports an error saying that another connection
has already registered a callback for this UUID: the one registered by
the src connection for the src domain.

Is it acceptable if we start the new QEMU process with the same
UUID while referring to it in the libvirt virDomainObj by a different
UUID? I mean we can generate a new UUID for the destination VM, thus
avoiding all the conflicts in libvirt. We would also store the original
UUID in the VM definition and use it when we start the new QEMU
process. After migration, we drop the new virDomainObj and let the
original virDomainObj attach to the new QEMU process. In this way the
guest should not notice any change in the UUID, and we avoid conflicts
in libvirt.
...
...
(2) If I understand the code correctly, libvirt uses a thread
pool to handle RPC requests. This means local migration may
cause a deadlock in P2P migration mode. Suppose there are several
concurrent local migration requests and all the worker threads
are occupied by them. When the source libvirtd connects to the
destination libvirtd on the same host to negotiate the migration,
the negotiation request is queued but will never be handled:
the original migration request from the client is waiting for
the negotiation request to finish before it can progress, while
the negotiation request is queued waiting for the original
request to end. This is one of the deadlock risks I can think of.
I guess in traditional migration mode, in which the client
opens two connections to the source and destination libvirtd,
there is also a risk of deadlock.
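The scenario in (2) can be reproduced in miniature with any fixed-size worker pool. The following Python sketch is only an illustration, not libvirt's RPC code: the migrate() and negotiate() functions and the single-worker pool are assumptions chosen to make the effect easy to see, and a timeout is used so the example terminates instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# A pool with one worker stands in for "all workers are busy".
pool = ThreadPoolExecutor(max_workers=1)


def negotiate():
    # Stand-in for the negotiation request sent to the destination side.
    return "negotiated"


def migrate():
    # The migration request queues the negotiation on the SAME pool and
    # then waits for it, while still occupying the only worker.
    inner = pool.submit(negotiate)
    try:
        return inner.result(timeout=0.5)
    except TimeoutError:
        # Without the timeout this wait would never return.
        return "stuck: no free worker to run the negotiation"


outcome = pool.submit(migrate).result()
pool.shutdown()
print(outcome)
```

The negotiation only gets a worker after the migration request gives up, which is the circular wait described above.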
Yes, it sounds like you could get deadlock even with 2 separate
libvirtds, if both of them were migrating to the other concurrently.
We will try to locate and fix deadlock problems when implementing local
migration. This seems like the right way to go.
...
...
(3) Libvirt supports Unix domain socket transport, but
this is only used in a tunnelled migration. For native
migration, it only supports TCP. We need to enable Unix
domain socket transport in native migration. Now we already
have a hypervisor migration URI argument in the migration
API, but there is no support for parsing and verifying a
"unix:/full/path" URI and passing that URI transparently
to QEMU. We can add this to current migration framework
but direct Unix socket transport looks meaningless for
normal migration.
Actually as far as QEMU is concerned libvirt uses fd: migration
only. Again though this point seems pretty much unrelated to
the question of how we design the APIs & structure the code.
Yes. I just want to point out that native Unix socket transport is what
the QEMU developers decided to use for local migration with
page-flipping. You may have noticed that the vmsplice() system call
needs a pipe. The old QEMU process and the new one are not parent and
child, so QEMU uses the ancillary (out-of-band) data API of Unix domain
sockets to transfer the pipe fd from one QEMU process to the other.
This is not supported by TCP. That's why I need to enable direct Unix
domain socket transport for QEMU live upgrade.
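For readers unfamiliar with fd passing, here is a minimal Python sketch of the mechanism (ancillary SCM_RIGHTS data on a Unix domain socket). It is not QEMU's implementation: both socket ends live in one process for brevity, the payload strings are made up, and it assumes Python 3.9 or later on a Unix-like system, where socket.send_fds() and socket.recv_fds() are available.

```python
import os
import socket

# Two connected Unix domain sockets stand in for the old and new processes.
old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The pipe whose read end we want to hand over.
read_fd, write_fd = os.pipe()

# Send the read end as ancillary data next to a small ordinary message.
socket.send_fds(old_end, [b"pipe"], [read_fd])

# The receiver gets a new descriptor that refers to the same pipe.
msg, fds, _flags, _addr = socket.recv_fds(new_end, 1024, 1)
received_fd = fds[0]

# Data written into the pipe is readable through the received descriptor.
os.write(write_fd, b"page data")
data = os.read(received_fd, 9)
print(msg, data)

for fd in (read_fd, write_fd, received_fd):
    os.close(fd)
old_end.close()
new_end.close()
```

A TCP socket has no equivalent of this ancillary channel, which is why the transfer of the pipe fd depends on a Unix domain socket.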
...
...
(4) When migration fails, the source domain is resumed, and
this may not work if we enable page-flipping in QEMU. With
page-flipping enabled, QEMU transfers memory page ownership
to the destination QEMU, so when the migration fails the source
virtual machine would have to be restarted rather than resumed.
IMHO that is not an acceptable approach. The whole point of doing
live upgrades in place, is that you consider the VMs to be
"precious". If you were OK with VMs being killed & restarted then
we'd not bother doing any of this live upgrade pain at all.
So if we're going to support live upgrades, we *must* be able to
guarantee that they will either succeed, or the existing QEMU is
left intact.  Killing the VM and restarting is not an option on
failure.
Yes. I'll check with QEMU developers to see if a page-flipped guest can
resume vCPU or not.
...
...
So I propose a new and compact work flow dedicated to QEMU
live upgrade. After all, it's an upgrade operation based on
tricky migration. When developing the previous RFC patch for
the new API, I focused on the correctness of the work flow,
so many other things are missing. I think I can add things
like Spice seamless migration when I submit new versions.
This way lies madness. We do not want 2 impls of the internal
migration framework.
...
I would also be really happy if you could give me some advice on
re-using the migration framework. Re-using the current framework
can save a lot of effort.
I consider using the internal migration framework a mandatory
requirement here, even if the public API is different.
Daniel
I also think re-using the migration code is good. I'm trying to find
ways to avoid the UUID and name conflict problems. If there is no simple
way, I need to investigate how much refactoring should be done, and
propose some solutions to enable the QEMU driver to manage multiple
processes.

Thanks and best regards!

-- 
Zhou Zheng Sheng / 周征晟