Hi John,

Sorry about the delay. I got to experiment with qemu a bit, and here are
my responses inline.

On Wed, Sep 30, 2015 at 8:18 PM, John Ferlan <jferlan(a)redhat.com> wrote:
> On 09/29/2015 10:06 AM, Shivaprasad bhat wrote:
> [...]
>
>>> Perhaps I should clarify my "query-migrate has no timeout" comment... It
>>> seems, based on what I've read so far, that the 'query-migrate' command
>>> started successfully, because if it hadn't we would have received a
>>> failure (as shown below). Thus libvirt has sent the command via the
>>> monitor and is waiting for a response (e.g. the virCondWait after the
>>> qemuMonitorSend in the trace below). The response isn't coming because
>>> either "A" qemu didn't send it back or "B" libvirt missed it - that
>>> should be determinable.
>>>
>>> There's a way to turn on debugging so the monitor dialog can be seen -
>>> via changes to /etc/libvirt/libvirtd.conf. I use:
>>>
>>> log_level = 1
>>> log_filters="3:remote 4:event 3:json 3:rpc"
>>> log_outputs="1:file:/var/log/libvirt/libvirtd.log"
>>>
>>> But you may need to remove the "3:json" in order to see the dialog,
>>> since that's where it "feels like" the issue might be. Then start
>>> libvirtd in the debugger again. Once it's hung, you should be able to
>>> scan (e.g. edit) the libvirtd.log file, search for the "query-migrate"
>>> command being sent, and then follow the copious output looking for the
>>> presence of a returned command. If there is none, then something in
>>> qemu isn't returning the failure correctly, and I would think it would
>>> need to be fixed there, as opposed to throwing down the big hammer of
>>> closing the fd.
>>
>> Had a chance to run with your log settings. The query-migrate doesn't
>> seem to have a corresponding "return" in the logs. So, as you say, there
>> may be a qemu bug where no response is returned while the fd is still
>> open (as libvirt didn't close it) but no read is actually happening on
>> it. I felt qemu can't sense the failure as long as the fd is open, so I
>> posted this patch. That said, qemu should return the current state of
>> the migration as it sees it, instead of not returning at all. Hope we
>> are on the same page.
>>
>> Thanks,
>> Shivaprasad
>>
> Meant to respond yesterday but got wrapped up in other things. OK - so at
> least now it makes a bit more sense why purely adding a stream_abort
> didn't work - we're not getting a reply from the monitor.
>
> So there are perhaps 3 ways to "resolve" this issue (that come to my
> mind):
>
> 1. As you've done, with the close() in the error path of
> qemuMigrationIOFunc when virStream{Send|Finish} fails. Although this does
> feel like a work-around, I suppose since the tunnel is a libvirt-created
> thing and qemu isn't aware of it, it feels reasonable. Although that does
> make me wonder how qemu could be hung up. What would something like
> "virsh qemu-monitor-command $dom '{"execute":"query-migrate"}'" return
> when the source is hung? Or does it hang too?

Yes. Debugging qemu with gdb, what I find is that the thread which handles
the qmp commands is stuck in os_host_main_loop() waiting for the iothread
lock, which is held by qemu_savevm_state_complete() on the other thread,
which is itself waiting for a writev() call on the source FD to finish.
Unless the FD is closed, that thread is never going to release the
iothread lock, and the qmp commands will not be serviced.
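
For illustration, the deadlock has roughly this shape (a standalone
sketch, not qemu code; "iothread_lock", "migration_thread" and
"qmp_thread" are just stand-ins for the real players):

/* Thread A holds the lock while blocked in write() on a socket whose
 * peer has stopped reading, so thread B (standing in for the QMP
 * handler) can never take the lock.  Closing the peer's fd makes the
 * write fail, and only then is the lock released. */
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static pthread_mutex_t iothread_lock = PTHREAD_MUTEX_INITIALIZER;
static int fds[2];  /* fds[0]: the "libvirt" end, fds[1]: the "qemu" end */

static void *migration_thread(void *opaque)
{
    char buf[4096] = { 0 };
    ssize_t n;

    pthread_mutex_lock(&iothread_lock); /* qemu_savevm_state_complete() */
    do {                  /* blocks once the socket send buffer fills */
        n = write(fds[1], buf, sizeof(buf));
    } while (n > 0);
    printf("write failed (%s); releasing the lock\n", strerror(errno));
    pthread_mutex_unlock(&iothread_lock);
    return NULL;
}

static void *qmp_thread(void *opaque)
{
    pthread_mutex_lock(&iothread_lock); /* stuck until the writer fails */
    printf("got the lock; query-migrate could be answered now\n");
    pthread_mutex_unlock(&iothread_lock);
    return NULL;
}

int main(void)
{
    pthread_t mig, qmp;

    signal(SIGPIPE, SIG_IGN); /* so the writer sees EPIPE, not SIGPIPE */
    socketpair(AF_UNIX, SOCK_STREAM, 0, fds);

    pthread_create(&mig, NULL, migration_thread, NULL);
    sleep(1);             /* let the writer fill the buffer and block */
    pthread_create(&qmp, NULL, qmp_thread, NULL);
    sleep(1);             /* the qmp thread is now stuck on the lock */

    close(fds[0]);        /* what the patch does: close libvirt's end */
    pthread_join(mig, NULL);
    pthread_join(qmp, NULL);
    return 0;
}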

> 2. Adding some sort of "timeout" logic in qemuMonitorSend (e.g.
> virCondWaitUntil instead of virCondWait) to handle when a command
> doesn't get a response. Not sure this is right either, since it's not
> clear to me there is a "time" that "all" commands are guaranteed to be
> run in/by, especially async ones.

Since the qmp commands won't be honored without the source FD being
closed, having any such timeout logic call migrate_cancel would be
useless.
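
(For completeness, this is the kind of change I understood option 2 to
mean: only a fragment of qemuMonitorSend(), with field names from my
reading of the current monitor code and an arbitrary 30-second deadline.
Even if the wait times out, qemu stays wedged, so nothing useful can
follow the timeout.)

    unsigned long long now, deadline;

    if (virTimeMillisNow(&now) < 0)
        return -1;
    deadline = now + 30 * 1000; /* arbitrary; async commands may need more */

    while (!mon->msg->finished) {
        if (virCondWaitUntil(&mon->notify, &mon->parent.lock, deadline) < 0) {
            if (errno == ETIMEDOUT)
                virReportError(VIR_ERR_OPERATION_TIMEOUT, "%s",
                               _("no reply from the monitor"));
            else
                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                               _("unable to wait on monitor condition"));
            goto cleanup;
        }
    }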

> 3. Dig into qemu to figure out why it's not returning anything for a
> migrate-status request. Currently a bit beyond what I've done, but I
> believe it would require attaching to the running qemu process to see if
> there is some thread "stuck" somewhere, "waiting" on something that
> won't return because the stream closed.
>
> Hopefully Jiri (or perhaps Daniel) could provide some other insights.
>
> John

I think we have to go with option 1. I am planning to send a V2 with the
Coverity fix that you suggested.
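
(For anyone skimming the thread, the essence of the change is only this,
trimmed down to the one line that matters on the error path of
qemuMigrationIOFunc(); the surrounding error handling is elided here and
the exact shape may differ in V2.)

 error:
    /* qemu is blocked in writev() on this fd and will only notice the
     * failed migration once the write errors out, so close our end. */
    VIR_FORCE_CLOSE(data->sock);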

Daniel, Jiri,

Would you agree with this approach?

Thanks and Regards,
Shivaprasad