On 09/29/2015 10:06 AM, Shivaprasad bhat wrote:
[...]
> Perhaps I should clarify my "query-migrate has no timeout" comment... It
> seems based on what I've read so far, the 'query-migrate' command
> started successfully, because if it hadn't we would have received a
> failure (as shown below). Thus libvirt has sent the command via the
> monitor and is waiting for a response (e.g. the virCondWait after the
> qemuMonitorSend in the trace below). The response isn't coming because
> either "A" qemu didn't send it back or "B" libvirt missed it - that
> should be determinable.
>
> There's a way to turn on debugging so the monitor dialog can be seen -
> via changes to /etc/libvirt/libvirtd.conf. I use :
>
> log_level = 1
> log_filters="3:remote 4:event 3:json 3:rpc"
> log_outputs="1:file:/var/log/libvirt/libvirtd.log"
>
> But you may need to remove the "3:json" in order to see the dialog since
> that's where it "feels like" the issue might be. Then start libvirtd in
> the debugger again. Once it's hung - you should be able to scan (e.g.,
> edit) the libvirtd.log file and search for the "query-migrate" command
> being sent and then follow the copious output looking for the presence
> of a returned command. If there is none, then something in qemu isn't
> returning the failure correctly and it would need to be fixed there I
> would think as opposed to throwing down the big hammer of closing the fd.
Had a chance to run with your log settings. The query-migrate doesn't seem
to have a corresponding "return" in the logs. So, as you say, there may be
a qemu bug where no response is returned while the fd is still open (libvirt
didn't close it) but nothing is actually reading from it. I felt qemu can't
sense the failure while the fd is open, so I posted this patch. Still, qemu
should return the current state of the migration as it sees it instead of
not returning at all. Hope we are on the same page.
Thanks,
Shivaprasad
Meant to respond yesterday but got wrapped up in other things. OK - so
at least now it makes a bit more sense why purely adding a stream_abort
didn't work - we're not getting a reply from the monitor.
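For anyone wanting to repeat that check without paging through the whole
log by hand, here is a rough Python sketch of the idea. It is not part of
libvirt. It assumes each monitor command is logged with an "id" of the form
"libvirt-N" and that the matching reply line carries the same "id" along
with a "return" or "error" key; the function name is made up for
illustration, and the patterns may need adjusting to whatever the log
actually contains.

```python
import re

# Assumption: a command and its reply share the same "id":"libvirt-N" tag
# somewhere on their respective log lines.
ID_RE = re.compile(r'"id"\s*:\s*"(libvirt-\d+)"')
EXEC_RE = re.compile(r'"execute"\s*:\s*"([^"]+)"')
REPLY_RE = re.compile(r'"(?:return|error)"\s*:')


def unanswered_commands(lines):
    """Return {id: command} for commands that never got a reply."""
    pending = {}
    for line in lines:
        id_match = ID_RE.search(line)
        if id_match is None:
            continue  # events and unrelated log lines carry no command id
        cmd_id = id_match.group(1)
        exec_match = EXEC_RE.search(line)
        if exec_match:
            # A command may be logged more than once before its reply;
            # keep the first sighting.
            pending.setdefault(cmd_id, exec_match.group(1))
        elif REPLY_RE.search(line):
            pending.pop(cmd_id, None)
    return pending
```

Feeding it the open log file, e.g. unanswered_commands(open(path)) with
path pointing at the libvirtd.log configured above, should leave
query-migrate in the result if it really never got a reply.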
So there are perhaps 3 ways to "resolve" this issue (that come to my mind):
1. As you've done with the close() in the error path of
qemuMigrationIOFunc when the virStream{Send|Finish} fails. Although this
does feel like a work-around, I suppose since the tunnel is a libvirt
created thing and qemu isn't aware of it, then it feels reasonable.
Although that does make me wonder how qemu could be hung up. What would
something like "virsh qemu-monitor-command $dom
'{"execute":"query-migrate"}'" return when the source is hung? Or does it
hang too?
2. Adding some sort of "timeout" logic in qemuMonitorSend (e.g.
virCondWaitUntil instead of virCondWait) to handle when a command
doesn't get a response. Not sure this is right either since it's not
clear to me there is a "time" that "all" commands are guaranteed to be
run in/by, especially async ones.
3. Dig into qemu to figure out why it's not returning anything for a
migrate-status request. Currently a bit beyond what I've done, but I
believe it would require attaching to the running qemu process to see if
there was some thread "stuck" somewhere "waiting" on something that
won't return because the stream closed.
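To make the difference in option 2 a bit more concrete, here is a small
Python analogy, not libvirt code. The Monitor class and its methods are
hypothetical stand-ins: send() with timeout=None behaves like an unbounded
condition wait (the virCondWait style), while passing a number bounds the
wait (the virCondWaitUntil style) and turns a missing reply into an error
instead of a hang.

```python
import threading


class Monitor:
    """Toy monitor: one command in flight, reply set by another thread."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reply = None

    def deliver(self, reply):
        # Called by the "I/O thread" when a reply arrives.
        with self._cond:
            self._reply = reply
            self._cond.notify_all()

    def send(self, timeout=None):
        # timeout=None waits forever; a number of seconds bounds the wait.
        with self._cond:
            got = self._cond.wait_for(
                lambda: self._reply is not None, timeout)
            if not got:
                raise TimeoutError("no reply from monitor")
            reply, self._reply = self._reply, None
            return reply
```

The sketch only shows the mechanics. It does nothing about the real
problem with this option, which is picking a timeout value that is safe
for every command.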
Hopefully Jiri (or perhaps Daniel) could provide some other insights.
John