On 01/12/2011 05:13 PM, Jim Fehlig wrote:
> libvirt 0.8.7
> qemu 0.13
>
> I'm looking into a problem with qemu save/restore via the JSON monitor.
> On restore, the vm is left in a paused state with the following error
> returned for the 'cont' command:
>
>   An incoming migration is expected before this command can be executed
>
> I was trying to debug the issue in gdb, but stepping through the code
> introduces enough delay between qemudStartVMDaemon() and doStartCPUs()
> that the latter succeeds. Any suggestions on how to determine when it
> is safe to call doStartCPUs() to prevent the above error? I don't see
> this issue with the text monitor btw.

I'm pretty sure this is related to a bug I reported on qemu-devel last
April:
http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg00635.html
(be sure to read my own followup if you want a correct description of
the circumstances). In that case libvirt was using the text monitor, and
there was a race condition between qemudStartVMDaemon (which executes
qemu with '-S -incoming') and doStartCPUs() (which issues a 'cont'
command to the qemu monitor). The result would be that sometimes the
'cont' would be received and processed by qemu before the incoming
migration had started, meaning that qemu would be executing garbage
memory instead of the saved/restored image of the guest.
The solution to this was posted to upstream qemu in July:
http://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01574.html
and I believe it is in qemu 0.13. That patch adds a check to the 'cont'
command so that if '-incoming' was specified on the command line, 'cont'
will only execute after a migration has successfully completed, and will
otherwise return an error.
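
As a rough illustration of what that check amounts to, here is a toy
Python model. It is not qemu's code: the class, method, and attribute
names are made up for the example, and only the error text is taken from
the report above.

```python
class MigrationExpected(Exception):
    """Stand-in for the error 'cont' returns while a migration is pending."""


class ToyQemu:
    """Toy model of the guard only; not qemu's real code or data structures."""

    def __init__(self, incoming=False):
        # True while '-incoming' was given and no migration has completed yet.
        self.incoming_expected = incoming
        self.running = False

    def migration_completed(self):
        # Called once the saved image has been fully loaded.
        self.incoming_expected = False

    def cont(self):
        # Refuse to start the CPUs until the incoming migration is done.
        if self.incoming_expected:
            raise MigrationExpected(
                "An incoming migration is expected before this command "
                "can be executed")
        self.running = True


def try_cont(vm):
    """Return True if 'cont' was accepted, False if it was refused."""
    try:
        vm.cont()
        return True
    except MigrationExpected:
        return False
```

In this model a 'cont' issued too early is refused and the guest stays
paused, which matches the behavior described in the original report.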
Actually, thinking about this "fix", it seems that it isn't really a
solution, because instead of the guest starting up in an indeterminate
state, doStartCPUs() will just fail (as you've seen), making the entire
guest startup fail.
You can almost surely make it work properly by putting in a 250 msec
delay between those two function calls in libvirt. It would be nice if
it could be totally fixed in qemu, though, so that libvirt didn't need
such a hack :-(
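
For comparison, a bounded retry is one possible alternative to a fixed
delay. The sketch below is only an assumption about how that could look,
not what libvirt does: `send_cont` is a hypothetical callable that issues
'cont' and raises `MigrationExpected` when qemu refuses it, and the
timeout and interval values are arbitrary. It only makes sense against a
qemu that has the check, since an older qemu would act on a premature
'cont' rather than reject it.

```python
import time


class MigrationExpected(Exception):
    """Hypothetical error raised when qemu refuses 'cont' (see lead-in)."""


def start_cpus_with_retry(send_cont, timeout=5.0, interval=0.05,
                          sleep=time.sleep, clock=time.monotonic):
    """Retry 'cont' until it is accepted or the timeout expires.

    Returns True if 'cont' was accepted, False if the deadline passed
    while qemu was still waiting for the incoming migration.
    """
    deadline = clock() + timeout
    while True:
        try:
            send_cont()
            return True
        except MigrationExpected:
            # Migration has not completed yet; give up only after the deadline.
            if clock() >= deadline:
                return False
            sleep(interval)
```

Unlike a fixed sleep, this waits no longer than necessary on a fast
machine and still gives a slow one more time, at the cost of having to
pick a timeout.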
(I had unfortunately lost track of the bug by the time the patch was
posted. It had been there for so long that I'd just gotten used to
manually pausing/unpausing any guest I wanted to save on the one machine
that displays the problem. Too bad I got so used to living with it, as
I'd otherwise have been forced to try the patch out. This machine is
running F13, which is still at qemu-0.12.5, which doesn't have the
patch.)