
On Tue, Sep 7, 2010 at 4:00 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count
Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether 1) libvirt would want to expose a similar stream interface and let management software determine idle time, or 2) libvirt would attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example by tracking the number of in-flight requests and setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
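Something along these lines, purely as an illustrative sketch (plain Python rather than QEMU code; GRACE_SECONDS and the method names are made up for this example):

import time

GRACE_SECONDS = 2.0   # assumed idle threshold; a real value would need tuning

class IdleDetector:
    """Track guest I/O and report when background streaming may proceed."""

    def __init__(self):
        self.inflight = 0                        # outstanding guest requests
        self.last_completion = time.monotonic()

    def request_started(self):
        self.inflight += 1

    def request_completed(self):
        self.inflight -= 1
        if self.inflight == 0:
            self.last_completion = time.monotonic()

    def guest_is_idle(self):
        # Idle = no outstanding requests and the grace period has elapsed
        # since the last completion; any new request pauses streaming again.
        return (self.inflight == 0 and
                time.monotonic() - self.last_completion >= GRACE_SECONDS)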
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream.
I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-)
Isn't this what the block I/O controller cgroup is meant to solve? If you give vm-1 50% of the block bandwidth and vm-2 50% of the block bandwidth, then vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
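As a rough sketch of that approach, proportional weights can be written through the blkio controller's blkio.weight file; the mount point and group names below are assumptions that depend on the local cgroup setup:

import os

BLKIO_ROOT = "/sys/fs/cgroup/blkio"   # assumed mount point; may differ

def set_blkio_weight(group, weight):
    """Assign a proportional I/O weight (100-1000) to a cgroup."""
    path = os.path.join(BLKIO_ROOT, group)
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "blkio.weight"), "w") as f:
        f.write(str(weight))

# Equal weights: under contention each VM gets half the bandwidth, while an
# idle VM's share remains available to the other, so streaming only uses slack.
set_blkio_weight("vm-1", 500)
set_blkio_weight("vm-2", 500)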
That assumes you're capping I/O. But sometimes you care about overall system throughput more than you care about any individual VM.
Another way to look at it: a user relies on a cron job that runs at midnight and starts streaming as necessary. However, the user wants to be able to interrupt the streaming should there be a sudden demand.
If the user drives the streaming through an interface like I've specified, they're in full control. It's pretty simple to build interfaces on top of this that implement streaming as an aggressive or conservative background task too.
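For example, a hedged sketch of such a wrapper; monitor_stream() below is just a stand-in for whatever mechanism issues "stream <device> <sector offset>" and returns the number of sectors streamed, not an existing API:

import time

def background_stream(device, image_size, monitor_stream,
                      delay=0.0, should_pause=lambda: False):
    """Drive the incremental stream command until the whole image is copied."""
    offset = 0
    while offset < image_size:
        if should_pause():        # e.g. the guest became I/O bound again
            time.sleep(1.0)
            continue
        count = monitor_stream(device, offset)
        offset += count
        time.sleep(delay)         # delay=0: aggressive; delay>0: conservative

An aggressive policy calls this with delay=0 and no pause check; a conservative one passes a small delay and a should_pause() hook wired up to idle detection.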
Also, I'm not sure we should worry about the priority of the I/O too much: perhaps the user wants their vm to stream more than they want an unimportant local vm that is currently I/O bound to have all resources to itself. So I think it makes sense to defer this and not try for system-wide knowledge inside a QEMU process.
Right, so that argues for an incremental interface like I started with :-)
BTW, this whole discussion is also relevant for other background tasks like online defragmentation so keep that use-case in mind too.
Right, I'm a little hesitant to get too far into discussing the management interface because I remember long threads about polling and async. I never fully read them, but I bet some wisdom came out of them that applies here.

There are two ways to do a long-running (async?) task:

1. Multiple smaller pokes. Perhaps completion of a single poke is async, but the key is that the interface is incremental and driven by the management stack.

2. State. Turn on streaming and watch it go. You can find out its current state using another command, which will tell you whether it is enabled/disabled and its progress. Use a command to disable it. (A rough client-side sketch of this style follows below.)

Stefan
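The sketch of style 2, assuming hypothetical stream_start/stream_status/stream_stop commands rather than any existing monitor API:

import time

def run_streaming(monitor, device, poll_interval=5.0):
    """Enable streaming, poll progress, and return when it finishes."""
    monitor.stream_start(device)                 # hypothetical: turn it on
    while True:
        status = monitor.stream_status(device)   # hypothetical progress query
        if not status["enabled"]:
            return                               # finished or disabled elsewhere
        print("streaming %s: %d%% done" % (device, status["progress"]))
        time.sleep(poll_interval)

# A management stack that needs to interrupt streaming (sudden I/O demand)
# would call the hypothetical monitor.stream_stop(device) from elsewhere.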