
On Tue, Sep 7, 2010 at 4:00 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count
Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether 1) libvirt would want to expose a similar stream interface and let management software determine idle time, or 2) libvirt would attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example by tracking the number of in-flight requests and setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
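Something along these lines, purely as an illustrative sketch (plain Python rather than QEMU code; GRACE_SECONDS and the method names are made up for this example):

import time

GRACE_SECONDS = 2.0   # assumed idle threshold; a real value would need tuning

class IdleDetector:
    """Track guest I/O and report when background streaming may proceed."""

    def __init__(self):
        self.inflight = 0                        # outstanding guest requests
        self.last_completion = time.monotonic()

    def request_started(self):
        self.inflight += 1

    def request_completed(self):
        self.inflight -= 1
        if self.inflight == 0:
            self.last_completion = time.monotonic()

    def guest_is_idle(self):
        # Idle = no outstanding requests and the grace period has elapsed
        # since the last completion; any new request pauses streaming again.
        return (self.inflight == 0 and
                time.monotonic() - self.last_completion >= GRACE_SECONDS)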
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream.
I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-)
Isn't this what the block I/O controller cgroup is meant to solve? If you give vm-1 50% of the block bandwidth and vm-2 50% of the block bandwidth, then vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
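As a rough sketch of that approach, proportional weights can be written through the blkio controller's blkio.weight file; the mount point and group names below are assumptions that depend on the local cgroup setup:

import os

BLKIO_ROOT = "/sys/fs/cgroup/blkio"   # assumed mount point; may differ

def set_blkio_weight(group, weight):
    """Assign a proportional I/O weight (100-1000) to a cgroup."""
    path = os.path.join(BLKIO_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "blkio.weight"), "w") as f:
        f.write(str(weight))

# Equal weights: under contention each VM gets half the bandwidth, while an
# idle VM's share remains available to the other, so streaming only uses slack.
set_blkio_weight("vm-1", 500)
set_blkio_weight("vm-2", 500)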
That assumes you're capping I/O. But sometimes you care about overall system throughput more than you care about any individual VM.
Another way to look at it: a user relies on a cron job that runs at midnight and starts streaming as necessary. However, the user wants to be able to interrupt the streaming should there be a sudden demand.
If the user drives the streaming through an interface like I've specified, they're in full control. It's pretty simple to build interfaces on top of this that implement streaming as an aggressive or conservative background task too.
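For example, a hedged sketch of such a wrapper; monitor_stream() below is just a stand-in for whatever mechanism issues "stream <device> <sector offset>" and returns the number of sectors streamed, not an existing API:

import time

def background_stream(device, image_size, monitor_stream,
                      delay=0.0, should_pause=lambda: False):
    """Drive the incremental stream command until the whole image is copied."""
    offset = 0
    while offset < image_size:
        if should_pause():        # e.g. the guest became I/O bound again
            time.sleep(1.0)
            continue
        count = monitor_stream(device, offset)
        offset += count
        time.sleep(delay)         # delay=0: aggressive; delay>0: conservative

An aggressive policy calls this with delay=0 and no pause check; a conservative one passes a small delay and a should_pause() hook wired up to idle detection.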
Also, I'm not sure we should worry about the priority of the I/O too much: perhaps the user wants their vm to stream more than they want an unimportant local vm that is currently I/O bound to have all resources to itself. So I think it makes sense to defer this and not try for system-wide knowledge inside a QEMU process.
Right, so that argues for an incremental interface like I started with :-)
BTW, this whole discussion is also relevant for other background tasks like online defragmentation so keep that use-case in mind too.
Right, I'm a little hesitant to get too far into discussing the management interface because I remember long threads about polling and async. I never fully read them, but I bet some wisdom came out of them that applies here.

There are two ways to do a long-running (async?) task:

1. Multiple smaller pokes. Perhaps completion of a single poke is async, but the key is that the interface is incremental and driven by the management stack.

2. State. Turn on streaming and watch it go. You can find out its current state using another command, which will tell you whether it is enabled/disabled and its progress. Use a command to disable it. (A rough client-side sketch of this style follows below.)

Stefan
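The sketch of style 2, assuming hypothetical stream_start/stream_status/stream_stop commands rather than any existing monitor API:

import time

def run_streaming(monitor, device, poll_interval=5.0):
    """Enable streaming, poll progress, and return when it finishes."""
    monitor.stream_start(device)                 # hypothetical: turn it on
    while True:
        status = monitor.stream_status(device)   # hypothetical progress query
        if not status["enabled"]:
            return                               # finished or disabled elsewhere
        print("streaming %s: %d%% done" % (device, status["progress"]))
        time.sleep(poll_interval)

# A management stack that needs to interrupt streaming (sudden I/O demand)
# would call the hypothetical monitor.stream_stop(device) from elsewhere.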