Daniel P. Berrange wrote:
On Wed, Jan 12, 2011 at 11:29:27PM -0500, Laine Stump wrote:
> On 01/12/2011 05:13 PM, Jim Fehlig wrote:
>
>> libvirt 0.8.7
>> qemu 0.13
>>
>> I'm looking into a problem with qemu save/restore via JSON monitor. On
>> restore, the vm is left in a paused state with following error returned
>> for 'cont' command
>>
>> An incoming migration is expected before this command can be executed
>>
>> I was trying to debug the issue in gdb, but stepping through the code
>> introduces enough delay between qemudStartVMDaemon() and doStartCPUs()
>> that the latter succeeds. Any suggestions on how to determine when it
>> is safe to call doStartCPUs() to prevent the above error? I don't see
>> this issue with the text monitor btw.
>>
> I'm pretty sure this is related to a bug I reported on qemu-devel
> last April:
>
>
http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg00635.html
>
> (be sure to read my own followup if you want a correct description
> of the circumstances). In this case libvirt was using the text
> monitor, and there was a race condition between qemudStartVMDaemon
> (which executes qemu with '-S -incoming') and doStartCPUs() (which
> issues a 'cont' command to the qemu monitor). The result would be
> that sometimes the 'cont' would be received and processed by qemu
> before the incoming migration had started, meaning that qemu would
> be executing garbage memory instead of the saved/restored image of
> the guest.
>
> The solution to this was posted to upstream qemu in July:
>
>
http://lists.gnu.org/archive/html/qemu-devel/2010-07/msg01574.html
>
> and I believe is in qemu 0.13. That patch adds a check to the 'cont'
> command so that if '-incoming' was specified on the commandline,
> 'cont' will only execute after a migration has successfully
> completed, but will otherwise return an error.
>
> Actually, thinking about this "fix", it seems that it isn't really a
> solution, because instead of the guest starting up in an
> indeterminate state, doStartCPUs() will just fail (as you've seen)
> making the entire guest startup fail.
>
Yep, it wasn't really intended as a fix. It was intended to make
the error scenario clearly detectable, which has succeeded as per
Jim's report. The fact that QMP returned an error in this way,
means we can now reliably detect, usleep(1000), and then retry
the 'cont' again at which point we should succeed.
Something like the attached patch? I'm not quite sure about the retry
bounds (currently 1 second). In my testing, it always succeeds on the
second attempt - even with large memory vms.
Regards,
Jim