On 07/09/2012 03:29 PM, Eric Blake wrote:
On 07/09/2012 02:00 PM, Anthony Liguori wrote:
>> with the fd:name approach, the sequence is:
>>
>> libvirt calls getfd:name1 over normal monitor
>> qemu responds
>> libvirt calls getfd:name2 over normal monitor
>> qemu responds
>> libvirt calls transaction around blockdev-snapshot-sync over normal
>> monitor, using fd:name1 and fd:name2
>> qemu responds
This general layout is true whether we rewrite all commands to
understand fd:nnn (proposal 1) or whether we add new magic parsing
(/dev/fd/nnn of proposal 3, or even /dev/fdset/nnn of proposal 5), all
as called out in these messages:
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00227.html
https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01098.html
>>
>> but with -open-hook-fd, the approach would be:
>>
>> libvirt calls transaction
>> qemu calls open(file1) over hook
>> libvirt responds
>> qemu calls open(file2) over hook
>> libvirt responds
>> qemu responds to the original transaction
whereas this approach is quite different in semantics, but may indeed be
easier for qemu to implement, at the expense of some more complexity on
the part of libvirt.
At the high level, I think both approaches have one thing in common: by
refactoring all qemu code to go through qemu_open(), we can then
implement our desired complexity (whether fd:nn, /dev/fd/nnn,
/dev/fdset/nnn, or some other magic name parsing; or whether it is an
rpc call over a second socket in parallel to the monitor socket) in just
one location. Likewise, both approaches have to deal with libvirtd
restarts (magic name parsing by changing an 'inuse' flag when the
monitor detects EOF; rpc passing by failing a qemu_open() when the rpc
socket detects EOF).
Ack.
>>
>> The 'transaction' operation is thus blocked by the time it takes to do
>> two intermediate opens over a second channel, which kind of defeats the
>> purpose of making the transaction take effect with minimal guest
>> downtime.
>
> How are you defining "guest down time"?
>
> It's important to note that code running in QEMU does not equate to
> guest visible down time unless QEMU does an explicit vm_stop() which is
> not happening here.
>
> Instead, a VCPU may become blocked *if* it attempts to acquire qemu_mute
> while QEMU is holding it.
>
> If your concern is qemu_mutex being held while waiting for libvirt, it
> would be fairly easy to implement a qemu_open_async() that dropped
> allowed dropping back to the main loop and then calling a callback when
> the open completes.
>
> It would be pretty trivial to convert qmp_transaction to use such a
> command.
In other words, remembering that transactions are divided into phases:
phase 1 - prepare: obtain all needed fds (whether by pre-opening them
via 'pass-fd' or other new 'getfd' relative, or whether by RPC calls);
no guest downtime, and with cleanup that avoids any leaks on any failures
phase 2 - commit: flush all devices and actually make the changes in
qemu state to use the fds obtained in phase 1
and where the guest downtime (if any) is more likely due to flushing
changes in phase 2
Not quite. A synchronous flush can cause lock contention. We need to separate
out the problem of lock contention from guest down time.
Also, there's no obvious need to move the flushes before opens. The main issue
is that we use qemu_mutex to effectively create a write queue.
You can imagine a simple write queueing mechanism that would obviate the need
need for this such that we could flush, queue upcoming writes, and drop
qemu_mutex to sleep waiting for libvirt to send us our fds.
> But this is all speculative. There's no reason to believe
that an RPC
> would have a noticable guest visible latency unless you assume there's
> lot contention. I would strongly suspect that the bdrv_flush() is going
> to be a much greater source of lock contention than the RPC would be.
> An RPC is only bound by scheduler latency whereas synchronous disk I/O
> is bound spinning a platter.
>
>> And libvirt code becomes a lot trickier to deal with the fact
>> that two channels are in use, and that the channel that issued the
>> 'transaction' command must block while the other channel for handling
>> hooks must be responsive.
>
> All libvirt needs to do is listen on a socket and delegate access
> according to a white list. Whatever is providing fd's needs to have no
> knowledge of anythign other than what the guest is allowed to access
> which shouldn't depend on an executing command.
That's not quite accurate. What the guest is allowed to access should
indeed change depending on the executing command. That is, if I start a
guest with:
I should have spoke more clearly. libvirt may change the white list for various
reasons dynamically. But there shouldn't be a direct dependency on whatever is
serving up fd's and whatever is changing the white list.
Basically, you just need a shared hash table for each guest. It should be quite
simple.
Maybe the only reason that I'm still leaning towards a
'pass-fd'
solution instead of a hook fd solution is that libvirt would have less
code to write to get it working. But it was originally Dan's complaint
that an rpc solution has too much risk for deadlock or leaks;
The reason I came back to this is that after reading through the threads, I
started thinking about how to solve the leak problem.
You need clear ownership. Having QEMU "own" a file descriptor because it
"asks"
for an fd allows QEMU to be the clear owner. OTOH, having libvirt give QEMU an
fd with a floating reference that QEMU may or not may not pin ends up being
extremely complex in practice. I'm not sure you can really solve the reliable
closing problem either. If you did have a "kill all floating references"
command, that could introduce other problems (what about multiple clients?).
if we are
confident that we can avoid deadlock,
I don't think deadlocks are possible FWIW.
and that the idea of passing in
fds in response to an rpc involves less bookkeeping and speculation on
libvirt's part about which monitor commands will require opening an fd,
then maybe it really is a better technical solution, even if it costs
more libvirt code to implement.
I think the important part is that it allows libvirt to not have to have
intimate knowledge of how QEMU commands work. If we decide we need to change
the flags/perms on a file descriptor down the road, it's a lot easier to cope
with that as it is to cope with changing the order in which we open files.
Plus, once you implement this in libvirt, you don't have to worry about it for
future block commands. With fdsets, would need to deal with figuring out the
magic incantation of setfd commands for all future block commands.
Plus, making /dev/fdset be treated as not a valid file path is just asking for
trouble down the road...
Regards,
Anthony Liguori