Re: [libvirt-users] Migration hangs on Gentoo with KVM

[re-adding the list]
On 08/01/2011 04:22 PM, Jonathan Stoppani wrote:
On Aug 1, 2011, at 16:20 , Eric Blake wrote:
On 08/01/2011 04:11 PM, Jonathan Stoppani wrote:
Hi there,
I'm trying to migrate a domain between two Gentoo hosts using KVM as the hypervisor, but the migration hangs. I tried both live and offline migration modes without success.
Does your 'nc' command have a -q option? If so, then this is probably
Thanks for the prompt answer Eric! Yes, nc has a q option:
-q, --hold-timeout=SEC1[:SEC2] Set hold timeout(s) for local [and remote]
Glad to hear that we found root cause to your problems, then.
The bug specifically refers to ssh, does that mean that it should work over tcp?
The problem is that libvirt is trying to start a remote nc session over ssh; but looking at http://libvirt.org/remote.html, it looks like ssh is the only protocol using nc in that manner (so yes, you can probably avoid the issue by using tcp or tls). Meanwhile, I think you can work around it without patching libvirt, by using this as your remote URI:
qemu+ssh://user@remotehost/system?netcat=/path/to/nc-wrapper
where nc-wrapper is an executable script installed on remotehost, looking like:
#!/bin/sh
exec /path/to/real/nc -q0 "$@"
--
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
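For anyone wanting to try the wrapper idea, here is a minimal sketch of how it could be built and sanity-checked before pointing a libvirt URI at it. Everything in it beyond the two-line wrapper itself is an assumption for illustration: the temporary directory, the stand-in "fake-nc" script (used so the sketch runs on a machine without nc), and the sample arguments, which only approximate what libvirt might pass.

```shell
#!/bin/sh
# Sketch: build the nc-wrapper described above and check what it hands to nc.
# REAL_NC is a stand-in that only prints its arguments; on a real host, point
# it at the actual nc binary instead (that path is site-specific).
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
REAL_NC="$workdir/fake-nc"
WRAPPER="$workdir/nc-wrapper"

# Stand-in for the real nc: echo whatever arguments it receives.
printf '#!/bin/sh\necho "nc args: $*"\n' > "$REAL_NC"
chmod +x "$REAL_NC"

# The wrapper itself: force -q0, then pass every original argument through.
cat > "$WRAPPER" <<EOF
#!/bin/sh
exec "$REAL_NC" -q0 "\$@"
EOF
chmod +x "$WRAPPER"

# Call the wrapper with sample arguments (assumed, for illustration only).
wrapper_output=$("$WRAPPER" -U /var/run/libvirt/libvirt-sock)
echo "$wrapper_output"
```

Once a wrapper like this is installed on remotehost with the real nc path filled in, its location is what goes after netcat= in the URI.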

On Aug 1, 2011, at 16:33 , Eric Blake wrote:
[re-adding the list]
Sorry about that, still not used to mailman lists which don't put the list address in the reply-to field. ;-)
The problem is that libvirt is trying to start a remote nc session over ssh; but looking at http://libvirt.org/remote.html, it looks like ssh is the only protocol using nc in that manner (so yes, you can probably avoid the issue by using tcp or tls). Meanwhile, I think you can work around it without patching libvirt, by using this as your remote URI:
qemu+ssh://user@remotehost/system?netcat=/path/to/nc-wrapper
where nc-wrapper is an executable script installed on remotehost, looking like:
#!/bin/sh
exec /path/to/real/nc -q0 "$@"
Just tried this, but still hangs; will try tcp and report the results.
~Jonathan
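To compare transports side by side, a dry-run helper along these lines may be handy. It only prints the virsh migrate command for each transport and executes nothing. The function name, host name, and domain name are placeholders invented for this sketch; note also that qemu+tcp needs libvirtd on the destination to be listening on TCP, and an ssh URI would normally carry a user@ prefix as in the URI quoted above.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a virsh live-migration command for a
# given transport. Host and domain names below are placeholders.
migrate_cmd() {
    transport=$1 dest_host=$2 domain=$3
    case $transport in
        ssh|tcp|tls) ;;
        *) echo "unsupported transport: $transport" >&2; return 1 ;;
    esac
    echo "virsh migrate --live $domain qemu+$transport://$dest_host/system"
}

# Show the command for each transport mentioned in the thread.
for t in ssh tcp tls; do
    migrate_cmd "$t" desthost.example.org guest1
done
```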

On Aug 1, 2011, at 16:50 , Jonathan Stoppani wrote:
Just tried this, but still hangs; will try tcp and report the results.
~Jonathan
Tested using qemu+tcp and it hangs the same. If I interrupt the migration (^C), the domain is correctly destroyed on the destination but left in the paused state on the source. If I try to start it manually, I obtain this error:
# virsh resume 1
error: Failed to resume domain 1
error: Timed out during operation: cannot acquire state change lock
Any insights?
~Jonathan

On Aug 1, 2011, at 18:01 , Jonathan Stoppani wrote:
Tested using qemu+tcp and it hangs the same. If I interrupt the migration (^C), the domain is correctly destroyed on the destination but left in the paused state on the source. If I try to start it manually, I obtain this error:
# virsh resume 1
error: Failed to resume domain 1
error: Timed out during operation: cannot acquire state change lock
Any insights?
Can someone shed some light on the libvirt locking possibilities? It seems to me that sanlock is not supported on Gentoo (and libvirt is compiled using --without-sanlock); could this be the cause of the problem? Is there some way to explicitly set the locking mechanism to a no-op in the libvirt configuration?
~Jonathan

On 08/17/2011 02:30 PM, Jonathan Stoppani wrote:
Thanks for the prompt answer Eric! Yes, nc has a q option:
-q, --hold-timeout=SEC1[:SEC2] Set hold timeout(s) for local [and remote]
We still haven't incorporated patches to autodetect nc usage on the remote side (some have been proposed by Guido, but there were some additional issues to address first). Hopefully by 0.9.5... Until that is fixed, it very well could be that you are deadlocking the libvirtd handling of the remote connection due to nc holding the connection open too long, which would explain why all further attempts to do something with the domain are getting stuck waiting for the nc connection to resolve.
Tested using qemu+tcp and it hangs the same. If I interrupt the migration (^C), the domain is correctly destroyed on the destination but left in the paused state on the source. If I try to start it manually, I obtain this error:
# virsh resume 1
error: Failed to resume domain 1
error: Timed out during operation: cannot acquire state change lock
This is the internal mutex lock used for serializing access to libvirt internal structures, such as when coordinating with a remote server (which coordination involves the use of nc). When you get this message, about the only thing you can do is restart libvirtd. Which version of libvirt were you testing? 0.9.4 adds quite a few improvements on being able to gracefully recover from failed migrations.
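As a quick way to answer the version question, something like the following could compare the installed version against 0.9.4. This is only a sketch: the helper name version_ge is made up here, it handles plain dotted numeric versions only (a suffix such as -rc1 would break the numeric comparison), and the default of 0.9.3 is just a sample value.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted numeric version A is >= B.
# Components are compared left to right; missing components count as 0.
version_ge() {
    a=$1 b=$2
    while [ -n "$a" ] || [ -n "$b" ]; do
        a_head=${a%%.*}
        b_head=${b%%.*}
        [ -n "$a_head" ] || a_head=0
        [ -n "$b_head" ] || b_head=0
        if [ "$a_head" -gt "$b_head" ]; then return 0; fi
        if [ "$a_head" -lt "$b_head" ]; then return 1; fi
        # Drop the component just compared.
        case $a in *.*) a=${a#*.} ;; *) a= ;; esac
        case $b in *.*) b=${b#*.} ;; *) b= ;; esac
    done
    return 0
}

# Pass the version as the first argument; 0.9.3 is only a sample default.
installed=${1:-0.9.3}
if version_ge "$installed" 0.9.4; then
    echo "libvirt $installed: includes the 0.9.4 migration recovery improvements"
else
    echo "libvirt $installed: older than 0.9.4"
fi
```

On a host with libvirt installed, the argument can be taken from `virsh --version`, which prints the version number.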
Any insights?
Can someone shed some light on the libvirt locking possibilities? It seems to me that sanlock is not supported on gentoo (and libvirt is compiled using --without-sanlock); could this be the cause of the problem?
Completely unrelated. sanlock is a program for controlling access to shared file storage, and has nothing to do with the internal mutex lock failure message you quoted above.
Is there some way to explicitly set the locking mechanism to a noop in the libvirt configuration?
You are confusing two terms; using the sanlock or no-op disk manager has nothing to do with libvirtd getting confused and deadlocking on internal data structures. If you built --without-sanlock, then you are already using the no-op disk manager; but if sanlock is compiled in, you control whether to use it by modifying /etc/libvirt/qemu.conf. But making a configuration change there won't affect the problem you actually saw above.
--
Eric Blake eblake@redhat.com +1-801-349-2682
Libvirt virtualization library http://libvirt.org
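For completeness, here is a small sketch of how one might check which lock manager a qemu.conf selects. It assumes the setting is spelled lock_manager = "sanlock" and that an absent or commented-out line means the no-op manager, which the sketch reports as "nop" (a label chosen here). The helper name and the temporary sample file are made up for illustration; as noted above, this setting has no bearing on the state change lock error.

```shell
#!/bin/sh
# Print the lock manager selected in a qemu.conf-style file.
# Commented-out lines are ignored; with no active setting, report "nop"
# (label chosen for this sketch to mean the no-op manager).
lock_manager_of() {
    conf_file=$1
    value=$(sed -n 's/^[[:space:]]*lock_manager[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$conf_file" | tail -n 1)
    echo "${value:-nop}"
}

# Demonstrate on a throwaway sample file rather than the real configuration.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '%s\n' '# lock_manager = "sanlock"' > "$sample"
lock_manager_of "$sample"
printf '%s\n' 'lock_manager = "sanlock"' >> "$sample"
lock_manager_of "$sample"
```

On a real host the function would be called as `lock_manager_of /etc/libvirt/qemu.conf`.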
participants (2): Eric Blake, Jonathan Stoppani