There appears to be a race condition wherein a 'cont' command sent
immediately on qemu startup can prevent an inbound migration specified
via -incoming from occurring. libvirt's process for starting up qemu
domains with an incoming migration includes a 'cont' command at the
end of qemudInitCpus, shortly after a successful connection with the
monitor is made. Although the qemu monitor is generally unresponsive
while an inbound migration is ongoing, forcing the 'cont' to take effect
only after the migration has completed, this isn't always true (as
demonstrated below).
I strongly suspect that this is responsible for an occasional failure
I'm seeing when restoring libvirt domains from file.
This is highly reproducible using qemu-kvm-0.11.0-rc2, and
straightforward to demonstrate by the following means:
[ONE-TIME SETUP]
- Build an appropriate ramsave file by migrating a stopped guest
to disk.
- Mark any backing store used by this guest read-only.
[COMMON STEPS]
- Create an empty qcow2 file backed by the read-only store, if your
guest has any disks.
- Invoke qemu with arguments appropriate to the VM being resumed,
and also the following: -S -monitor stdio -incoming 'exec:echo
START_DELAY >&2 && sleep 5 && echo END_DELAY >&2 &&
cat <ramsave.raw &&
echo LOAD_DONE >&2'.
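
The -incoming argument above is easy to mistype, so here is a minimal
sketch that assembles it. incoming_arg is a hypothetical helper (not part
of qemu or libvirt); it assumes the ramsave path contains no whitespace,
and a delay of 0 drops the 'sleep' entirely.

```shell
# Sketch only: build the -incoming argument from the steps above.
# incoming_arg is a hypothetical helper, not part of qemu or libvirt.
#   $1: path to the ramsave file (assumed to contain no whitespace)
#   $2: artificial delay in seconds; 0 omits the sleep entirely
incoming_arg() {
    _ramsave=$1
    _delay=${2:-5}
    _cmd='echo START_DELAY >&2 && '
    if [ "$_delay" -gt 0 ]; then
        _cmd="${_cmd}sleep $_delay && "
    fi
    _cmd="${_cmd}echo END_DELAY >&2 && cat <$_ramsave && echo LOAD_DONE >&2"
    printf 'exec:%s\n' "$_cmd"
}

# Print the argument for the 5-second variant used above.
incoming_arg ramsave.raw 5
```

The result would be passed as -incoming "$(incoming_arg ramsave.raw 5)",
alongside -S -monitor stdio and whatever arguments the VM itself needs.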
[VALIDATING CORRECT OPERATION]
- Wait until 'LOAD_DONE' is displayed, and run 'cont'
- The VM will correctly resume.
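
The manual "wait for LOAD_DONE, then 'cont'" step could in principle be
scripted. The following is a rough sketch under stated assumptions, not
something I have run against qemu: send_cont_after_marker is a
hypothetical helper that reads lines on stdin (for instance qemu's
stderr, redirected by some means of your choosing) and prints 'cont'
only once the marker has been seen.

```shell
# Sketch only: emit 'cont' once a marker line has been observed.
# send_cont_after_marker is a hypothetical helper, not a libvirt or
# qemu facility.
#   $1: marker text to wait for (e.g. LOAD_DONE)
# Returns 0 after printing 'cont', or 1 if input ends without the marker.
send_cont_after_marker() {
    while IFS= read -r _line; do
        case $_line in
            *"$1"*)
                printf 'cont\n'
                return 0
                ;;
        esac
    done
    return 1
}

# Simulated stderr stream standing in for a real qemu run.
printf 'START_DELAY\nEND_DELAY\nLOAD_DONE\n' | send_cont_after_marker LOAD_DONE
```

Its output would then be fed to the monitor; how qemu's stderr is routed
into the helper is left open here.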
[REPRODUCING THE BUG]
- Run 'cont' after START_DELAY is displayed, but before END_DELAY.
- 'cat: write error: Broken pipe' will be displayed.
- The guest VM will reboot, enter a catatonic state, or otherwise
fail to load correctly.
[REPRODUCING WITHOUT ARTIFICIAL DELAY]
As the 'sleep 5' used in the above may be considered cheating, this
issue may also be reproduced without any delay by removing the 'sleep'
and appending <<<$'cont\n' to the shell command used to invoke qemu, so
that 'cont' is already waiting on the monitor's stdin at startup.
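
For shells without here-strings, a pipe gives the same effect of having
'cont' queued on stdin before the command starts. A minimal sketch;
run_with_immediate_cont is a hypothetical wrapper, and the qemu
invocation in the comment elides the VM-specific arguments:

```shell
# Sketch only: run a command with 'cont' already queued on its stdin.
# run_with_immediate_cont is a hypothetical wrapper name.
run_with_immediate_cont() {
    printf 'cont\n' | "$@"
}

# Intended use (VM-specific arguments elided):
#   run_with_immediate_cont qemu ... -S -monitor stdio \
#       -incoming 'exec:cat <ramsave.raw'

# Stand-in demonstration: 'cat' simply echoes what the monitor would read.
run_with_immediate_cont cat
```

Unlike the here-string form, this sends exactly one newline after 'cont'
(bash appends its own newline to a here-string).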
[REPRODUCING OVER A UNIX SOCKET]
Included for completeness, as libvirt 0.7.x uses UNIX sockets here.
- Invoke the following in a separate window:
socat - UNIX-LISTEN:/tmp/test.monitor <<<$'cont\n'
- Invoke qemu as above, but with -monitor unix:/tmp/test.monitor
in place of -monitor stdio.
I have a work-in-progress patch which modifies libvirt to use -daemonize
for startup; waiting for the guest to detach before attempting to
interact with the monitor may avoid this issue. However, as this patch
is against libvirt master, and the master branch has other issues which
expose themselves on virDomainRestore, I am unable to test it here.
Thoughts (and workarounds) welcome.