[libvirt] QEMU interfaces for image streaming and post-copy block migration

Hi,

We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.

Here's the basic idea:

Today, you can create images based on base images that are copy on write. With QED, we also support copy on read, which forces a copy from the backing image on read requests as well as write requests.

In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.

The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand, while also having a deterministic mechanism to complete the transfer.

The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:

stream <device> <sector offset>

which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed. The idea about how to drive this interface is a loop like:

offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.

A related topic is block migration. Today we support pre-copy migration, which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server, using image streaming on the destination to move the device.

With QED, to implement this one would:

1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on the source

This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this. Again though, I think the question is what type of interfaces libvirt would prefer: low level interfaces plus recipes for how to do high level things, or higher level interfaces?

Regards,

Anthony Liguori

On 07.09.2010, at 15:41, Anthony Liguori wrote:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
I'm torn here too. Why not expose both? Have a qemu internal daemon available that gets a sleep time as parameter and an external "pull sectors" command. We'll see which one is more useful, but I don't think it's too much code to justify only having one of the two. And the internal daemon could be started using a command line parameter, which helps non-managed users.
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on the source
This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this. Again though, I think the question is what type of interfaces would libvirt prefer? Low level interfaces + recipes on how to do high level things or higher level interfaces?
Is there anything keeping us from making the QMP socket multiplexable? I was thinking of something like:

{ command = "nbd_server" ; block = "qemu_block_name" }
{ result = "done" }
<qmp socket turns into nbd socket>

This way we don't require yet another port, don't have to care about conflicts and get internal qemu block names for free.

Alex

On 09/07/2010 09:01 AM, Alexander Graf wrote:
I'm torn here too. Why not expose both? Have a qemu internal daemon available that gets a sleep time as parameter and an external "pull sectors" command. We'll see which one is more useful, but I don't think it's too much code to justify only having one of the two. And the internal daemon could be started using a command line parameter, which helps non-managed users.
Let me turn it around and ask, how would libvirt do this? Would they just use a sleep time parameter and just make use of our command or would they do something more clever and attempt to detect system idle? Could we just do that in qemu? Or would they punt to the end user?
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on the source
This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this. Again though, I think the question is what type of interfaces would libvirt prefer? Low level interfaces + recipes on how to do high level things or higher level interfaces?
Is there anything keeping us from making the QMP socket multiplexable? I was thinking of something like:
{ command = "nbd_server" ; block = "qemu_block_name" } { result = "done" } <qmp socket turns into nbd socket>
This way we don't require yet another port, don't have to care about conflicts and get internal qemu block names for free.
Possibly, but something that complicates life here is that an nbd session would be source -> destination but there's no QMP session between source -> destination. Instead, there's a session from source -> management node and destination -> management node so you'd have to proxy nbd traffic between the two. That gets ugly quick. Regards, Anthony Liguori
Alex

On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example tracking the number of in-flight requests and then setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again. Stefan
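A minimal sketch of that grace-timer idea in C, assuming hypothetical helper names and a caller-provided clock rather than any existing QEMU API:

#include <stdbool.h>
#include <stdint.h>

#define STREAM_GRACE_MS 500            /* arbitrary grace period before streaming resumes */

typedef struct IdleStreamer {
    int in_flight;                     /* guest I/O requests currently outstanding */
    bool streaming;                    /* are we currently issuing stream steps? */
    int64_t grace_deadline_ms;         /* earliest time streaming may resume */
} IdleStreamer;

/* Guest request submitted: stop streaming immediately. */
static void guest_req_start(IdleStreamer *s)
{
    s->in_flight++;
    s->streaming = false;
}

/* Guest request completed: once the device is idle, arm the grace timer. */
static void guest_req_complete(IdleStreamer *s, int64_t now_ms)
{
    if (--s->in_flight == 0) {
        s->grace_deadline_ms = now_ms + STREAM_GRACE_MS;
    }
}

/* Periodic tick (e.g. from a timer): resume streaming after the grace period. */
static void idle_tick(IdleStreamer *s, int64_t now_ms)
{
    if (s->in_flight == 0 && !s->streaming && now_ms >= s->grace_deadline_ms) {
        s->streaming = true;
        /* issue the next stream(device, offset) step here */
    }
}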

On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example tracking the number of in-flight requests and then setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream. I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-) Regards, Anthony Liguori
Stefan

On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example tracking the number of in-flight requests and then setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream.
I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-)
Isn't this what block I/O controller cgroups is meant to solve? If you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then vm-1 can do streaming without eating into vm-2's guaranteed bandwidth. Also, I'm not sure we should worry about the priority of the I/O too much: perhaps the user wants their vm to stream more than they want an unimportant local vm that is currently I/O bound to have all resources to itself. So I think it makes sense to defer this and not try for system-wide knowledge inside a QEMU process. Stefan

On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example tracking the number of in-flight requests and then setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream.
I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-)
Isn't this what block I/O controller cgroups is meant to solve? If you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
That assumes you're capping I/O. But sometimes you care about overall system throughput more than you care about any individual VM.

Another way to look at it: a user may wait for a cron job that runs at midnight and starts streaming as necessary. However, the user wants to be able to interrupt the streaming should there be a sudden demand.

If the user drives the streaming through an interface like I've specified, they're in full control. It's pretty simple to build interfaces on top of this that implement streaming as an aggressive or conservative background task too.
Also, I'm not sure we should worry about the priority of the I/O too much: perhaps the user wants their vm to stream more than they want an unimportant local vm that is currently I/O bound to have all resources to itself. So I think it makes sense to defer this and not try for system-wide knowledge inside a QEMU process.
Right, so that argues for an incremental interface like I started with :-) BTW, this whole discussion is also relevant for other background tasks like online defragmentation so keep that use-case in mind too. Regards, Anthony Liguori
Stefan

On Tue, Sep 7, 2010 at 4:00 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:55 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:51 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:33 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 2:41 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
A self-tuning solution is attractive because it reduces the need for other components (management stack) or the user to get involved. In this case self-tuning should be possible. We need to detect periods of I/O inactivity, for example tracking the number of in-flight requests and then setting a grace timer when it reaches zero. When the grace timer expires, we start streaming until the guest initiates I/O again.
That detects idle I/O within a single QEMU guest, but you might have another guest running that's I/O bound which means that from an overall system throughput perspective, you really don't want to stream.
I think libvirt might be able to do a better job here by looking at overall system I/O usage. But I'm not sure hence this RFC :-)
Isn't this what block I/O controller cgroups is meant to solve? If you give vm-1 50% block bandwidth and vm-2 50% block bandwidth then vm-1 can do streaming without eating into vm-2's guaranteed bandwidth.
That assumes you're capping I/O. But sometimes you care about overall system throughput more than you care about any individual VM.
Another way to look at it: a user may wait for a cron job that runs at midnight and starts streaming as necessary. However, the user wants to be able to interrupt the streaming should there be a sudden demand.
If the user drives the streaming through an interface like I've specified, they're in full control. It's pretty simple to build interfaces on top of this that implement streaming as an aggressive or conservative background task too.
Also, I'm not sure we should worry about the priority of the I/O too much: perhaps the user wants their vm to stream more than they want an unimportant local vm that is currently I/O bound to have all resources to itself. So I think it makes sense to defer this and not try for system-wide knowledge inside a QEMU process.
Right, so that argues for an incremental interface like I started with :-)
BTW, this whole discussion is also relevant for other background tasks like online defragmentation so keep that use-case in mind too.
Right, I'm a little hesitant to get too far into discussing the management interface because I remember long threads about polling and async. I never fully read them but I bet some wisdom came out of them that applies here.

There are two ways to do a long running (async?) task:

1. Multiple smaller pokes. Perhaps completion of a single poke is async. But the key is that the interface is incremental and driven by the management stack.

2. State. Turn on streaming and watch it go. You can find out its current state using another command which will tell you whether it is enabled/disabled and progress. Use a command to disable it.

Stefan
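As an illustration of option 2, a rough sketch of the state a stateful interface would track and expose; all of the names here are made up for the example:

#include <stdbool.h>
#include <stdint.h>

typedef struct StreamJob {
    bool enabled;            /* is streaming currently turned on? */
    uint64_t offset;         /* next sector to stream */
    uint64_t image_sectors;  /* total sectors in the image */
} StreamJob;

/* "Turn on streaming and watch it go." */
static void stream_job_enable(StreamJob *job)  { job->enabled = true; }

/* "Use a command to disable it." */
static void stream_job_disable(StreamJob *job) { job->enabled = false; }

/* Query command: report enabled/disabled state and progress. */
static void stream_job_query(const StreamJob *job, bool *enabled, double *progress)
{
    *enabled = job->enabled;
    *progress = job->image_sectors ?
        (double)job->offset / (double)job->image_sectors : 1.0;
}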

On 09/07/2010 10:09 AM, Stefan Hajnoczi wrote:
Right, so that argues for an incremental interface like I started with :-)
BTW, this whole discussion is also relevant for other background tasks like online defragmentation so keep that use-case in mind too.
Right, I'm a little hesitant to get too far into discussing the management interface because I remember long threads about polling and async. I never fully read them but I bet some wisdom came out of them that applies here.
There are two ways to do a long running (async?) task:

1. Multiple smaller pokes. Perhaps completion of a single poke is async. But the key is that the interface is incremental and driven by the management stack.

2. State. Turn on streaming and watch it go. You can find out its current state using another command which will tell you whether it is enabled/disabled and progress. Use a command to disable it.

If everyone is going to do (1) by just doing a tight loop or just using the same simple mechanism (a sleep(5)), then I agree, we should do (2). I can envision people wanting to make very complex decisions about the right time to do the next poke, though, and I'm looking for feedback about what other people think. I expected people to do complex heuristics with respect to migration convergence but in reality, I don't think anyone does today. So while I generally like being flexible, I realize that too much flexibility isn't always a good thing :-)

Regards,

Anthony Liguori
Stefan

Am 07.09.2010 17:09, schrieb Stefan Hajnoczi:
Right, I'm a little hesitant to get too far into discussing the management interface because I remember long threads about polling and async. I never fully read them but I bet some wisdom came out of them that applies here.
There are two ways to do a long running (async?) task:

1. Multiple smaller pokes. Perhaps completion of a single poke is async. But the key is that the interface is incremental and driven by the management stack.

2. State. Turn on streaming and watch it go. You can find out its current state using another command which will tell you whether it is enabled/disabled and progress. Use a command to disable it.
I think we need option 2 in any case for users not using libvirt. I for one wouldn't really love to type in monitor commands every few seconds to get the streaming done. ;-) Let's start with this. We can always add option 1 for more sophisticated cases later if it's desired by users. Kevin

Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file. Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
I think libvirt shouldn't have to care about sector offsets. You should just tell qemu to fetch the image and it should do so. We could have something like -drive backing_mode=[cow|cor|stream].
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on the source
Hm, that's an interesting idea. :-) Kevin

On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
I agree that streaming should be generic, like block migration. The trivial generic implementation is:

void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
            bdrv_read(bs, sector, ...);
            bdrv_write(bs, sector, ...);
        }
    }
}
Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count

Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is whether libvirt would 1) want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
I think libvirt shouldn't have to care about sector offsets. You should just tell qemu to fetch the image and it should do so. We could have something like -drive backing_mode=[cow|cor|stream].
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running
2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd
3) run qemu -incoming on the destination with the qed file
4) execute the migration
5) when migration completes, begin streaming on the destination to complete the copy
6) when the streaming is complete, shut down the qemu-nbd instance on the source
Hm, that's an interesting idea. :-)
Kevin

On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {

Three problems here.

First, bdrv_is_allocated is synchronous.

Second, streaming makes the most sense when it's the smallest useful piece of work, whereas bdrv_is_allocated() may return a very large range. You could cap it here, but you then need to make sure that the cap is at least cluster_size to avoid a lot of unnecessary I/O. The QED streaming implementation is 140 LOCs too, so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.

Third, streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection, so you need another interface to mark a specific write as one that should be checked for zeros.

Regards,

Anthony Liguori
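To make the capping point concrete, one way to clamp the range reported by bdrv_is_allocated() before issuing a streaming step; stream_chunk_sectors and cluster_sectors are assumed tunables, not existing fields:

/* Clamp one streaming step: keep it small, but never below a cluster,
 * otherwise partial-cluster copy-on-read causes unnecessary I/O. */
static int clamp_stream_step(int n, int stream_chunk_sectors, int cluster_sectors)
{
    int cap = stream_chunk_sectors;
    if (cap < cluster_sectors) {
        cap = cluster_sectors;
    }
    return n > cap ? cap : n;
}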

On Tue, Sep 7, 2010 at 3:57 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
Three problems here. First problem is that bdrv_is_allocated is synchronous. The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range.
You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O.
The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
Good points. I agree that it is easiest to write features into the block driver, but there is a significant amount of code duplication, plus the barrier for enabling other block drivers with these features is increased. These points (except the lines of code argument) can be addressed with the proper extensions to the block driver interface. Stefan

On 09/07/2010 10:05 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:57 PM, Anthony Liguori <aliguori@linux.vnet.ibm.com> wrote:
On 09/07/2010 09:49 AM, Stefan Hajnoczi wrote:
On Tue, Sep 7, 2010 at 3:34 PM, Kevin Wolf <kwolf@redhat.com> wrote:
Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device, which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
Three problems here. First problem is that bdrv_is_allocated is synchronous. The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range.
You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O.
The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
Good points. I agree that it is easiest to write features into the block driver, but there is a significant amount of code duplication,
There are two ways to attack code duplication. The first is to move the feature into block.c and add interfaces to the block drivers to support it. The second is to keep it in qed.c but to abstract out things that could really be common to multiple drivers (like the find_cluster functionality and some of the request handling functionality).

I prefer the latter approach because it keeps a high quality implementation of copy-on-read, whereas the former is almost certainly going to dumb down the implementation.
plus the barrier for enabling other block drivers with these features is increased. These points (except the lines of code argument) can be addressed with the proper extensions to the block driver interface.
Regards, Anthony Liguori
Stefan

On 09/07/2010 05:57 PM, Anthony Liguori wrote:
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
Three problems here. First problem is that bdrv_is_allocated is synchronous.
Put the whole thing in a thread.
The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range.
You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O.
That seems like a nice solution. You probably want a multiple of the cluster size to retain efficiency.
The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.
bdrv_is_allocated() already exists (and is needed for commit), what else is needed? cluster size?
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros.

--
error compiling committee.c: too many arguments to function

On 09/12/2010 07:41 AM, Avi Kivity wrote:
On 09/07/2010 05:57 PM, Anthony Liguori wrote:
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
Three problems here. First problem is that bdrv_is_allocated is synchronous.
Put the whole thing in a thread.
It doesn't fix anything. You don't want stream to serialize all I/O operations.
The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range.
You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O.
That seems like a nice solution. You probably want a multiple of the cluster size to retain efficiency.
What you basically do is:

stream_step_three():
    complete()

stream_step_two(offset, length):
    bdrv_aio_readv(offset, length, buffer, stream_step_three)

bdrv_aio_stream():
    bdrv_aio_find_free_cluster(stream_step_two)

And that's exactly what the current code looks like. The only change this makes to the patch is that some of qed's internals become block layer interfaces.

One of the things Stefan has mentioned is that a lot of the QED code could be reused by other formats. All formats implement things like CoW on their own today, but if you exposed interfaces like bdrv_aio_find_free_cluster(), you could actually implement a lot more in the generic block layer.

So, I agree with you in principle that this all should be common code. I think it's a larger effort though.
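For illustration, the kind of block-layer hook being described might look roughly like this; the function name is taken from the message above, but the callback type and signature are assumptions:

#include <stdint.h>

typedef struct BlockDriverState BlockDriverState;

/* Callback reporting the next unallocated range found in the leaf image. */
typedef void FindClusterCompletionFunc(void *opaque, int64_t sector_num,
                                       int nb_sectors);

/* Asynchronously locate the next range at or after sector_num that is not
 * yet allocated in the leaf image (i.e. still served by the backing file),
 * so the caller can force a copy-on-read of it. */
void bdrv_aio_find_free_cluster(BlockDriverState *bs, int64_t sector_num,
                                FindClusterCompletionFunc *cb, void *opaque);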
The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.
bdrv_is_allocated() already exists (and is needed for commit), what else is needed? cluster size?
Synchronous implementations are not reusable to implement asynchronous anything. But you need the code to be cluster aware too.
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros.
My QED branch now does that FWIW. At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED. In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today. Regards, Anthony Liguori
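The zero-detection step itself is simple; a byte-wise sketch (a real implementation would use something wider or vectorized, and would only run on reads that hit unallocated clusters):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true if the buffer contains nothing but zero bytes. */
static bool buffer_is_all_zero(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            return false;
        }
    }
    return true;
}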

On 09/12/2010 03:25 PM, Anthony Liguori wrote:
On 09/12/2010 07:41 AM, Avi Kivity wrote:
On 09/07/2010 05:57 PM, Anthony Liguori wrote:
I agree that streaming should be generic, like block migration. The trivial generic implementation is:
void bdrv_stream(BlockDriverState* bs)
{
    for (sector = 0; sector < bdrv_getlength(bs); sector += n) {
        if (!bdrv_is_allocated(bs, sector, &n)) {
Three problems here. First problem is that bdrv_is_allocated is synchronous.
Put the whole thing in a thread.
It doesn't fix anything. You don't want stream to serialize all I/O operations.
Why would it serialize all I/O operations? It's just like another vcpu issuing reads.
The second problem is that streaming makes the most sense when it's the smallest useful piece of work whereas bdrv_is_allocated() may return a very large range.
You could cap it here but you then need to make sure that cap is at least cluster_size to avoid a lot of unnecessary I/O.
That seems like a nice solution. You probably want a multiple of the cluster size to retain efficiency.
What you basically do is:
stream_step_three(): complete()
stream_step_two(offset, length): bdrv_aio_readv(offset, length, buffer, stream_step_three)
bdrv_aio_stream(): bdrv_aio_find_free_cluster(stream_step_two)
Isn't there a write() missing somewhere?
And that's exactly what the current code looks like. The only change this makes to the patch is that some of qed's internals become block layer interfaces.
Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset.

IOW:

bdrv_aio_stream(): bdrv_aio_read(offset, stream_2)
stream_2(): if all zeros: increment offset if more: bdrv_aio_stream() bdrv_aio_write(offset, stream_3)
stream_3(): bdrv_aio_write(offset, stream_4)
stream_4(): increment offset if more: bdrv_aio_stream()

Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
One of the things Stefan has mentioned is that a lot of the QED code could be reused by other formats. All formats implement things like CoW on their own today but if you exposed interfaces like bdrv_aio_find_free_cluster(), you could actually implement a lot more in the generic block layer.
So, I agree with you in principle that this all should be common code. I think it's a larger effort though.
Not that large I think; and it will make commit async as a side effect.
The QED streaming implementation is 140 LOCs too so you quickly end up adding more code to the block formats to support these new interfaces than it takes to just implement it in the block format.
bdrv_is_allocated() already exists (and is needed for commit), what else is needed? cluster size?
Synchronous implementations are not reusable to implement asynchronous anything.
Surely this is easy to fix, at least for qed. What we need is thread infrastructure that allows us to convert between the two methods.
But you need the code to be cluster aware too.
Yes, another variable in BlockDriverState.
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros.
My QED branch now does that FWIW. At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED.
In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today.
Parts of it are: commit. Of course, that's horribly synchronous.

--
error compiling committee.c: too many arguments to function

On 09/12/2010 08:40 AM, Avi Kivity wrote:
Why would it serialize all I/O operations? It's just like another vcpu issuing reads.
Because the block layer isn't re-entrant.
What you basically do is:
stream_step_three(): complete()
stream_step_two(offset, length): bdrv_aio_readv(offset, length, buffer, stream_step_three)
bdrv_aio_stream(): bdrv_aio_find_free_cluster(stream_step_two)
Isn't there a write() missing somewhere?
Streaming relies on copy-on-read to do the writing.
And that's exactly what the current code looks like. The only change this makes to the patch is that some of qed's internals become block layer interfaces.
Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset.
IOW:
bdrv_aio_stream(): bdrv_aio_read(offset, stream_2)
It's an optimization. If you've got a fully missing L1 entry, then you're going to memset() 2GB worth of zeros. That's just wasted work. With a 1TB image with a 1GB allocation, it's a huge amount of wasted work.
stream_2(): if all zeros: increment offset if more: bdrv_aio_stream() bdrv_aio_write(offset, stream_3)
stream_3(): bdrv_aio_write(offset, stream_4)
I don't understand why stream_3() is needed.
stream_4(): increment offset if more: bdrv_aio_stream()
Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a0...
Third problem is that streaming really requires being able to do zero write detection in a meaningful way. You don't want to always do zero write detection so you need another interface to mark a specific write as a write that should be checked for zeros.
You can do that in bdrv_stream(), above, before the actual write, and call bdrv_unmap() if you detect zeros.
My QED branch now does that FWIW. At the moment, it only detects zero reads to unallocated clusters and writes a special zero cluster marker. However, the detection code is in the generic path so once the fsck() logic is working, we can implement a free list in QED.
In QED, the detection code needs to have a lot of knowledge about cluster boundaries and the format of the device. In principle, this should be common code but it's not for the same reason copy-on-write is not common code today.
Parts of it are: commit. Of course, that's horribly synchronous.
If you've got AIO internally, making commit work is pretty easy. Doing asynchronous commit at a generic layer is not easy though unless you expose lots of details.

Generally, I think the block layer makes more sense if the interfaces to the formats are high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux.

So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code.

Regards,

Anthony Liguori
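A sketch of the shape that split might take; the function names follow the paragraph above, but the signatures and the completion callback type are assumptions:

typedef struct BlockDriverState BlockDriverState;
typedef void BlockDriverCompletionFunc(void *opaque, int ret);

/* Public entry point: dispatch to the format's own implementation if it
 * provides one, otherwise fall back to the generic version. */
void bdrv_aio_commit(BlockDriverState *bs,
                     BlockDriverCompletionFunc *cb, void *opaque);

/* Generic implementation built only on existing block-layer primitives
 * (allocation queries plus reads and writes); formats such as qed or
 * qcow2 could override it with something format-aware. */
void generic_aio_commit(BlockDriverState *bs,
                        BlockDriverCompletionFunc *cb, void *opaque);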

On 09/12/2010 05:23 PM, Anthony Liguori wrote:
On 09/12/2010 08:40 AM, Avi Kivity wrote:
Why would it serialize all I/O operations? It's just like another vcpu issuing reads.
Because the block layer isn't re-entrant.
A threaded block layer is reentrant. Of course pushing the thing into a thread requires that.
What you basically do is:
stream_step_three(): complete()
stream_step_two(offset, length): bdrv_aio_readv(offset, length, buffer, stream_step_three)
bdrv_aio_stream(): bdrv_aio_find_free_cluster(stream_step_two)
Isn't there a write() missing somewhere?
Streaming relies on copy-on-read to do the writing.
Ah. You can avoid the copy-on-read implementation in the block format driver and do it completely in generic code.
And that's exactly what the current code looks like. The only change this makes to the patch is that some of qed's internals become block layer interfaces.
Why do you need find_free_cluster()? That's a physical offset thing. Just write to the same logical offset.
IOW:
bdrv_aio_stream(): bdrv_aio_read(offset, stream_2)
It's an optimization. If you've got a fully missing L1 entry, then you're going to memset() 2GB worth of zeros. That's just wasted work. With a 1TB image with a 1GB allocation, it's a huge amount of wasted work.
Ok. And it's a logical offset, not physical as I thought, which confused me.
stream_2(): if all zeros: increment offset if more: bdrv_aio_stream() bdrv_aio_write(offset, stream_3)
stream_3(): bdrv_aio_write(offset, stream_4)
I don't understand why stream_3() is needed.
This implementation doesn't rely on copy-on-read code in the block format driver. It is generic and uses existing block layer interfaces. It would need copy-on-read support in the generic block layer as well.
stream_4(): increment offset if more: bdrv_aio_stream()
Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a0...
Clever - it pushes all the synchronization into the copy-on-read implementation. But the serialization there hardly jumps out of the code. Do I understand correctly that you can only have one allocating read or write running?
Parts of it are: commit. Of course, that's horribly synchronous.
If you've got AIO internally, making commit work is pretty easy. Doing asynchronous commit at a generic layer is not easy though unless you expose lots of details.
I don't see why. Commit is a simple loop that copies all clusters. All it needs to know is if a cluster is allocated or not. When commit is running you need additional serialization against guest writes, and to direct guest writes and reads to the committed region to the backing file instead of the temporary image. But the block layer already knows of all guest writes.
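In the same simplified style as the bdrv_stream() sketch earlier in the thread, the loop described here might look something like the following; it is synchronous, ignores serialization against guest writes, and the chunking constant and backing_hd access are assumptions for the sketch:

void bdrv_commit_sketch(BlockDriverState *bs)
{
    /* Arbitrary chunk size so one step stays small; a real implementation
     * would use something cluster-aligned. */
    enum { CHUNK_SECTORS = 128 };
    uint8_t buf[CHUNK_SECTORS * 512];
    int64_t end = bdrv_getlength(bs) / 512;   /* bdrv_getlength() returns bytes */
    int64_t sector;
    int n;

    for (sector = 0; sector < end; sector += n) {
        if (bdrv_is_allocated(bs, sector, &n)) {
            if (n > CHUNK_SECTORS) {
                n = CHUNK_SECTORS;
            }
            bdrv_read(bs, sector, buf, n);               /* allocated in the leaf */
            bdrv_write(bs->backing_hd, sector, buf, n);  /* push down to the backing file */
        }
    }
}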
Generally, I think the block layer makes more sense if the interfaces to the formats are high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux.
So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code.
What Linux does is have an equivalent of bdrv_generic_aio_commit() which most implementations call (or default to), and only do something if they want something special. Something like commit (or copy-on-read, or copy-on-write, or streaming) can be implemented 100% in terms of the generic functions (and indeed qcow2 backing files can be any format).

--
error compiling committee.c: too many arguments to function

On 09/12/2010 11:45 AM, Avi Kivity wrote:
Streaming relies on copy-on-read to do the writing.
Ah. You can avoid the copy-on-read implementation in the block format driver and do it completely in generic code.
Copy on read takes advantage of temporal locality. You wouldn't want to stream without copy on read because you decrease your idle I/O time by not effectively caching.
stream_4(): increment offset if more: bdrv_aio_stream()
Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a0...
Clever - it pushes all the synchronization into the copy-on-read implementation. But the serialization there hardly jumps out of the code.
Do I understand correctly that you can only have one allocating read or write running?
Cluster allocation, L2 cache allocation, or on-disk L2 allocation? You only have one on-disk L2 allocation at one time. That's just an implementation detail at the moment. An on-disk L2 allocation happens only when writing to a new cluster that requires a totally new L2 entry. Since L2s cover 2GB of logical space, it's a rare event so this turns out to be pretty reasonable for a first implementation. Parallel on-disk L2 allocations is not that difficult, it's just a future TODO.
Generally, I think the block layer makes more sense if the interfaces to the formats are high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux.
So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code.
What Linux does is have an equivalent of bdrv_generic_aio_commit() which most implementations call (or default to), and only do something if they want something special. Something like commit (or copy-on-read, or copy-on-write, or streaming) can be implemented 100% in terms of the generic functions (and indeed qcow2 backing files can be any format).
Yes, what I'm really saying is that we should take the bdrv_generic_aio_commit() approach. I think we're in agreement here. Regards, Anthony Liguori
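A rough sketch of what that default-plus-override dispatch could look like, using the function names proposed above (the structure is illustrative, not actual QEMU code):

    # Format drivers that can do better (e.g. qed) provide their own
    # aio_commit; everyone else falls back to the generic implementation.
    def bdrv_aio_commit(bs, cb, opaque):
        drv = bs.drv
        if getattr(drv, "bdrv_aio_commit", None) is not None:
            return drv.bdrv_aio_commit(bs, cb, opaque)
        return bdrv_generic_aio_commit(bs, cb, opaque)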

On 09/12/2010 07:19 PM, Anthony Liguori wrote:
On 09/12/2010 11:45 AM, Avi Kivity wrote:
Streaming relies on copy-on-read to do the writing.
Ah. You can avoid the copy-on-read implementation in the block format driver and do it completely in generic code.
Copy on read takes advantage of temporal locality. You wouldn't want to stream without copy on read because you decrease your idle I/O time by not effectively caching.
I meant, implement copy-on-read in generic code side by side with streaming. Streaming becomes just a prefetch operation (read and discard) which lets copy-on-read do the rest. This is essentially your implementation, yes?
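So streaming would reduce to something like the following (pseudocode; the read call stands in for whatever prefetch primitive the block layer ends up exposing):

    # Stream by prefetching: with copy-on-read enabled, reading an
    # unallocated region forces the data into the leaf image, so the
    # streamer just reads sequentially and throws the buffers away.
    def stream_image(bs, nb_sectors, chunk=128):
        offset = 0
        while offset < nb_sectors:
            bs.read(offset, chunk)   # triggers copy-on-read; result discarded
            offset += chunk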
stream_4():
    increment offset
    if more:
        bdrv_aio_stream()
Of course, need to serialize wrt guest writes, which adds a bit more complexity. I'll leave it to you to code the state machine for that.
http://repo.or.cz/w/qemu/aliguori.git/commitdiff/d44ea43be084cc879cd1a33e1a0...
Clever - it pushes all the synchronization into the copy-on-read implementation. But the serialization there hardly jumps out of the code.
Do I understand correctly that you can only have one allocating read or write running?
Cluster allocation, L2 cache allocation, or on-disk L2 allocation?
You only have one on-disk L2 allocation at one time. That's just an implementation detail at the moment. An on-disk L2 allocation happens only when writing to a new cluster that requires a totally new L2 entry. Since L2s cover 2GB of logical space, it's a rare event so this turns out to be pretty reasonable for a first implementation.
Parallel on-disk L2 allocations are not that difficult; they're just a future TODO.
Really, you can just preallocate all L2s. Most filesystems will touch all of them very soon. qcow2 might save some space for snapshots which share L2s (doubtful) or for 4k clusters (historical) but for qed with 64k clusters, it doesn't save any space. Linear L2s will also make your fsck *much* quicker. Size is .01% of logical image size. 1MB for a 10GB guest; by the time you install something on it, that's a drop in the bucket. If you install a guest on a 100GB disk, what percentage of L2s are allocated?
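For reference, a back-of-the-envelope check of those numbers, assuming QED's default geometry of 64k clusters and 256k L2 tables (so each L2 table covers 2GB):

    # Rough arithmetic behind ".01% of logical image size".
    cluster_size = 64 * 1024                       # 64k clusters
    l2_size = 256 * 1024                           # one L2 table (assumed default)
    l2_coverage = (l2_size // 8) * cluster_size    # 32768 entries -> 2GB

    image_size = 10 * 1024 ** 3                    # 10GB guest
    n_l2 = image_size // l2_coverage               # 5 tables
    metadata = n_l2 * l2_size                      # 1.25MB
    print(metadata, metadata / image_size)         # ~1MB, ~0.01% of the image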
Generally, I think the block layer makes more sense if the interfaces to the formats are high level and code sharing is achieved not by mandating a world view but rather by making libraries of common functionality. This is more akin to how the FS layer works in Linux.
So IMHO, we ought to add a bdrv_aio_commit function, turn the current code into a generic_aio_commit, implement a qed_aio_commit, then somehow do qcow2_aio_commit, and look at what we can refactor into common code.
What Linux does is have an equivalent of bdrv_generic_aio_commit() which most implementations call (or default to), and only do something if they want something special. Something like commit (or copy-on-read, or copy-on-write, or streaming) can be implemented 100% in terms of the generic functions (and indeed qcow2 backing files can be any format).
Yes, what I'm really saying is that we should take the bdrv_generic_aio_commit() approach. I think we're in agreement here.
Strange feeling. -- error compiling committee.c: too many arguments to function

On 09/07/2010 09:34 AM, Kevin Wolf wrote:
Am 07.09.2010 15:41, schrieb Anthony Liguori:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
In addition to copy on read, we introduce a notion of streaming a block device which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create.
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
The way it's implemented in QED is that it's a compatible feature. This means that implementations are allowed to ignore it if they want to. It's really a suggestion. So yes, you could have a run time switch that overrides the feature bit on disk and either forces copy-on-read on or off. Do we have a way to pass block drivers run time options?
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
To do it efficiently, it really needs to be in the format for the same reason that copy-on-write is part of the format. You need to understand the cluster boundaries in order to optimize the metadata updates. Sure, you can expose interfaces to the block layer to give all of this info but that's solving the same problem for doing block level copy-on-write. The other challenge is that for copy-on-read to be efficient, you really need a format that can distinguish between unallocated sectors and zero sectors and do zero detection during the copy-on-read operation. Otherwise, if you have a 10G virtual disk with a backing file that's 1GB in size, copy-on-read will result in the leaf being 10G instead of ~1GB.
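A sketch of the zero-detection idea described here (the helper names are made up for illustration; the point is that a cluster of zeroes read from the backing file is recorded as a zero cluster instead of being written out):

    # Hypothetical copy-on-read with zero detection, so a mostly-empty
    # backing file doesn't balloon the leaf to its full virtual size.
    def copy_on_read_cluster(image, backing, cluster):
        data = backing.read_cluster(cluster)
        if all(b == 0 for b in data):
            image.mark_zero_cluster(cluster)       # metadata only, no data written
        else:
            image.write_cluster(cluster, data)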
Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count
Obviously, the "wait_for_idle_time()" requires wide system awareness. The thing I'm not sure about is 1) would libvirt want to expose a similar stream interface and let management software determine idle time, or 2) attempt to detect idle time on its own and provide a higher level interface. If (2), the question then becomes whether we should try to do this within qemu and provide libvirt a higher level interface.
I think libvirt shouldn't have to care about sector offsets. You should just tell qemu to fetch the image and it should do so. We could have something like -drive backing_mode=[cow|cor|stream].
This interface lets libvirt decide when the I/O system is idle. The sector offset is really just a token to keep track of our overall progress. One thing I envisioned was that a tool like virt-manager could have a progress bar showing the streaming progress. It could update the progress bar based on (offset * 512) / image_size. If libvirt isn't driving it, we need to detect idle I/O time and we need to provide an interface to query status. Not a huge problem, but I'm not sure that a single QEMU instance can properly detect idle I/O time. Regards, Anthony Liguori
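The progress calculation is just the returned sector offset scaled by the sector size; a management-side driving loop with progress reporting might look roughly like this (stream() and wait_for_idle_time() are the hypothetical primitives from the proposal):

    # Sketch of a libvirt/virt-manager style driver with a progress bar.
    SECTOR_SIZE = 512
    offset = 0
    while offset * SECTOR_SIZE < image_size:
        wait_for_idle_time()                        # policy lives in management
        offset += stream(device, offset)            # returns sectors streamed
        update_progress_bar(offset * SECTOR_SIZE / image_size)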
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running 2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd 3) run qemu -incoming on the destination with the qed file 4) execute the migration 5) when migration completes, begin streaming on the destination to complete the copy 6) when the streaming is complete, shut down the qemu-nbd instance on the source
Hm, that's an interesting idea. :-)
Kevin

Am 07.09.2010 16:49, schrieb Anthony Liguori:
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
The way it's implemented in QED is that it's a compatible feature. This means that implementations are allowed to ignore it if they want to. It's really a suggestion.
Well, the point is that I see no reason why an image should contain this suggestion. There's really nothing about an image that could reasonably indicate "use this better with copy-on-read than with copy-on-write". It's a decision you make when using the image.
So yes, you could have a run time switch that overrides the feature bit on disk and either forces copy-on-read on or off.
Do we have a way to pass block drivers run time options?
We'll get them with -blockdev. Today we're using colons for format specific and separate -drive options for generic things.
Doing it this way has the additional advantage that you need no image format support for this, so we could implement copy-on-read for other formats, too.
To do it efficiently, it really needs to be in the format for the same reason that copy-on-write is part of the format.
Copy-on-write is not part of the format; it's a way of using the format. Backing files are part of the format, and they are used for both copy-on-write and copy-on-read. Any driver implementing a format that has support for backing files should be able to implement copy-on-read.
You need to understand the cluster boundaries in order to optimize the metadata updates. Sure, you can expose interfaces to the block layer to give all of this info but that's solving the same problem for doing block level copy-on-write.
The other challenge is that for copy-on-read to be efficient, you really need a format that can distinguish between unallocated sectors and zero sectors and do zero detection during the copy-on-read operation. Otherwise, if you have a 10G virtual disk with a backing file that's 1GB in size, copy-on-read will result in the leaf being 10G instead of ~1GB.
That's a good point. But it's not a reason to make the interface specific to QED just because other formats would probably not implement it as efficiently. Kevin

On 09/07/2010 10:02 AM, Kevin Wolf wrote:
Am 07.09.2010 16:49, schrieb Anthony Liguori:
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
The way it's implemented in QED is that it's a compatible feature. This means that implementations are allowed to ignore it if they want to. It's really a suggestion.
Well, the point is that I see no reason why an image should contain this suggestion. There's really nothing about an image that could reasonably indicate "use this better with copy-on-read than with copy-on-write".
It's a decision you make when using the image.
Copy-on-read is, in many cases, a property of the backing file because it suggests that the backing file is either very slow or potentially volatile. IOW, let's say I'm an image distributor and I want to provide my images in a QED format that actually streams the image from an http server. I could provide a QED file without a copy-on-read bit set but I'd really like to convey this information as part of the image. You can argue that I should provide a config file too that contained the copy-on-read flag set but you could make the same argument about backing files too.
So yes, you could have a run time switch that overrides the feature bit on disk and either forces copy-on-read on or off.
Do we have a way to pass block drivers run time options?
We'll get them with -blockdev. Today we're using colons for format specific and separate -drive options for generic things.
That's right. I think I'd rather wait for -blockdev.
You need to understand the cluster boundaries in order to optimize the metadata updates. Sure, you can expose interfaces to the block layer to give all of this info but that's solving the same problem for doing block level copy-on-write.
The other challenge is that for copy-on-read to be efficient, you really need a format that can distinguish between unallocated sectors and zero sectors and do zero detection during the copy-on-read operation. Otherwise, if you have a 10G virtual disk with a backing file that's 1GB in size, copy-on-read will result in the leaf being 10G instead of ~1GB.
That's a good point. But it's not a reason to make the interface specific to QED just because other formats would probably not implement it as efficiently.
You really can't do as good of a job in the block layer because you have very little info about the characteristics of the disk image. Regards, Anthony Liguori
Kevin

Am 07.09.2010 17:11, schrieb Anthony Liguori:
On 09/07/2010 10:02 AM, Kevin Wolf wrote:
Am 07.09.2010 16:49, schrieb Anthony Liguori:
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
The way it's implemented in QED is that it's a compatible feature. This means that implementations are allowed to ignore it if they want to. It's really a suggestion.
Well, the point is that I see no reason why an image should contain this suggestion. There's really nothing about an image that could reasonably indicate "use this better with copy-on-read than with copy-on-write".
It's a decision you make when using the image.
Copy-on-read is, in many cases, a property of the backing file because it suggests that the backing file is either very slow or potentially volatile.
The simple copy-on-read without actively streaming the rest of the image is not enough anyway for volatile backing files.
IOW, let's say I'm an image distributor and I want to provide my images in a QED format that actually streams the image from an http server. I could provide a QED file without a copy-on-read bit set but I'd really like to convey this information as part of the image.
You can argue that I should provide a config file too that contained the copy-on-read flag set but you could make the same argument about backing files too.
No. The image is perfectly readable when using COW instead of COR. On the other hand, it's completely meaningless without its backing file.
So yes, you could have a run time switch that overrides the feature bit on disk and either forces copy-on-read on or off.
Do we have a way to pass block drivers run time options?
We'll get them with -blockdev. Today we're using colons for format specific and separate -drive options for generic things.
That's right. I think I'd rather wait for -blockdev.
Well, then I consider -blockdev a dependency of QED (the copy-on-read part at least) and we can't merge it before we have -blockdev.
You need to understand the cluster boundaries in order to optimize the metadata updates. Sure, you can expose interfaces to the block layer to give all of this info but that's solving the same problem for doing block level copy-on-write.
The other challenge is that for copy-on-read to be efficient, you really need a format that can distinguish between unallocated sectors and zero sectors and do zero detection during the copy-on-read operation. Otherwise, if you have a 10G virtual disk with a backing file that's 1GB in size, copy-on-read will result in the leaf being 10G instead of ~1GB.
That's a good point. But it's not a reason to make the interface specific to QED just because other formats would probably not implement it as efficiently.
You really can't do as good of a job in the block layer because you have very little info about the characteristics of the disk image.
I'm not saying that the generic block layer should implement copy-on-read. I just think that it should pass a run-time option to the driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the information in the image file. From a user perspective it should look the same for qed, qcow2 and whatever else (like copy-on-write today). Kevin

On 09/07/2010 10:20 AM, Kevin Wolf wrote:
Am 07.09.2010 17:11, schrieb Anthony Liguori:
On 09/07/2010 10:02 AM, Kevin Wolf wrote:
Am 07.09.2010 16:49, schrieb Anthony Liguori:
Shouldn't it be a runtime option? You can use the very same image with copy-on-read or copy-on-write and it will behave the same (except for performance), so it's not an inherent feature of the image file.
The way it's implemented in QED is that it's a compatible feature. This means that implementations are allowed to ignore it if they want to. It's really a suggestion.
Well, the point is that I see no reason why an image should contain this suggestion. There's really nothing about an image that could reasonably indicate "use this better with copy-on-read than with copy-on-write".
It's a decision you make when using the image.
Copy-on-read is, in many cases, a property of the backing file because it suggests that the backing file is either very slow or potentially volatile.
The simple copy-on-read without actively streaming the rest of the image is not enough anyway for volatile backing files.
But as a web site owner, it's extremely useful for me to associate copy-on-read with an image because it significantly reduces my bandwidth. I have a hard time believing this isn't a valuable use-case and not one that's actually pretty common.
IOW, let's say I'm an image distributor and I want to provide my images in a QED format that actually streams the image from an http server. I could provide a QED file without a copy-on-read bit set but I'd really like to convey this information as part of the image.
You can argue that I should provide a config file too that contained the copy-on-read flag set but you could make the same argument about backing files too.
No. The image is perfectly readable when using COW instead of COR. On the other hand, it's completely meaningless without its backing file.
N.B. the whole concept of compat features in QED is that if the features are ignored, the image is still perfectly readable. It's extra information that lets an implementation do smarter things with a given image.
So yes, you could have a run time switch that overrides the feature bit on disk and either forces copy-on-read on or off.
Do we have a way to pass block drivers run time options?
We'll get them with -blockdev. Today we're using colons for format specific and separate -drive options for generic things.
That's right. I think I'd rather wait for -blockdev.
Well, then I consider -blockdev a dependency of QED (the copy-on-read part at least) and we can't merge it before we have -blockdev.
If we determine that having copy-on-read be a part of the image is universally a bad idea, then I'd agree with you. Keep in mind, I don't expect to merge the cor or streaming stuff with the first merge of QED. I'm still not convinced that having cor as a compat feature is a bad idea though.
You really can't do as good of a job in the block layer because you have very little info about the characteristics of the disk image.
I'm not saying that the generic block layer should implement copy-on-read. I just think that it should pass a run-time option to the driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the information in the image file. From a user perspective it should look the same for qed, qcow2 and whatever else (like copy-on-write today)
Okay, the only place I'm disagreeing slightly is that I think an image format should be able to request copy_on_read such that the default behavior if an explicit flag isn't specified is to do what the image suggests we do. Regards, Anthony Liguori
Kevin

Am 07.09.2010 17:30, schrieb Anthony Liguori:
On 09/07/2010 10:20 AM, Kevin Wolf wrote:
Am 07.09.2010 17:11, schrieb Anthony Liguori:
Copy-on-read is, in many cases, a property of the backing file because it suggests that the backing file is either very slow or potentially volatile.
The simple copy-on-read without actively streaming the rest of the image is not enough anyway for volatile backing files.
But as a web site owner, it's extremely useful for me to associate copy-on-read with an image because it significantly reduces my bandwidth.
I have a hard time believing this isn't a valuable use-case and not one that's actually pretty common.
As a web site user, I don't necessarily want you to control the behaviour of my qemu. :-) But I do see your point there.
You really can't do as good of a job in the block layer because you have very little info about the characteristics of the disk image.
I'm not saying that the generic block layer should implement copy-on-read. I just think that it should pass a run-time option to the driver - maybe just a BDRV_O_COPY_ON_READ flag - instead of having the information in the image file. From a user perspective it should look the same for qed, qcow2 and whatever else (like copy-on-write today)
Okay, the only place I'm disagreeing slightly is that I think an image format should be able to request copy_on_read such that the default behavior if an explicit flag isn't specified is to do what the image suggests we do.
Maybe we can agree on that. I'm not completely decided yet if allowing the image to contain such a hint is a good or a bad thing. Kevin

On 09/07/2010 10:39 AM, Kevin Wolf wrote:
Am 07.09.2010 17:30, schrieb Anthony Liguori:
On 09/07/2010 10:20 AM, Kevin Wolf wrote:
Am 07.09.2010 17:11, schrieb Anthony Liguori:
Copy-on-read is, in many cases, a property of the backing file because it suggests that the backing file is either very slow or potentially volatile.
The simple copy-on-read without actively streaming the rest of the image is not enough anyway for volatile backing files.
But as a web site owner, it's extremely useful for me to associate copy-on-read with an image because it significantly reduces my bandwidth.
I have a hard time believing this isn't a valuable use-case and not one that's actually pretty common.
As a web site user, I don't necessarily want you to control the behaviour of my qemu. :-)
That's why I understand your argument about -blockdev and making sure all compat features can be overridden. I'm happy with that as a requirement.
Okay, the only place I'm disagreeing slightly is that I think an image format should be able to request copy_on_read such that the default behavior if an explicit flag isn't specified is to do what the image suggests we do.
Maybe we can agree on that. I'm not completely decided yet if allowing the image to contain such a hint is a good or a bad thing.
It's a tough space. We don't want to include crazy amounts of metadata (and basically become OVF) but there's metadata that we would like to have. backing_format is a good example. It's a suggestion and it's something you really want to let a user override. Regards, Anthony Liguori
Kevin

On Tue, Sep 07, 2010 at 08:41:44AM -0500, Anthony Liguori wrote:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
[snip]
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running 2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd 3) run qemu -incoming on the destination with the qed file 4) execute the migration 5) when migration completes, begin streaming on the destination to complete the copy 6) when the streaming is complete, shut down the qemu-nbd instance on the source
IMHO, adding further network sockets is the one thing we absolutely don't want to do to migration. I don't much like the idea of launching extra daemons either.
This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this. Again though, I think the question is what type of interfaces would libvirt prefer? Low level interfaces + recipes on how to do high level things or higher level interfaces?
I think it should be done entirely within the main QEMU migration socket. I know this isn't possible with the current impl, since it is unidirectional, preventing the target sending the source requests for specific data blocks. If we made the migration socket bi-directional I think we could do it all within qemu with no external helpers or extra sockets: 1. Create an empty qed file on the destination with copy-on-read enabled and a backing file pointing to a special 'migrate:' protocol 2. Run qemu -incoming on the destination with the qed file 3. execute the migration 4. when migration completes, target QEMU continues streaming blocks from the source qemu. 5. when streaming is complete, source qemu can shut down. Both your original proposal and mine here seem to have a pretty bad failure scenario though. After the cut-over point where the VM cpus start running on the destination QEMU, AFAICT, any failure on the source before block streaming completes leaves you dead in the water. The source VM no longer has up-to-date RAM contents and the destination VM does not yet have a complete disk image. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 09/07/2010 10:03 AM, Daniel P. Berrange wrote:
On Tue, Sep 07, 2010 at 08:41:44AM -0500, Anthony Liguori wrote:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
[snip]
A related topic is block migration. Today we support pre-copy migration which means we transfer the block device and then do a live migration. Another approach is to do a live migration, and on the source, run a block server using image streaming on the destination to move the device.
With QED, to implement this one would:
1) launch qemu-nbd on the source while the guest is running 2) create a qed file on the destination with copy-on-read enabled and a backing file using nbd: to point to the source qemu-nbd 3) run qemu -incoming on the destination with the qed file 4) execute the migration 5) when migration completes, begin streaming on the destination to complete the copy 6) when the streaming is complete, shut down the qemu-nbd instance on the source
IMHO, adding further network sockets is the one thing we absolutely don't want to do to migration. I don't much like the idea of launching extra daemons either.
One of the use cases I'm trying to accommodate is migration to free resources. By launching a qemu-nbd daemon, we can kill the source qemu process and free up all of the associated memory.
This is a bit involved and we could potentially automate some of this in qemu by launching qemu-nbd and providing commands to do some of this. Again though, I think the question is what type of interfaces would libvirt prefer? Low level interfaces + recipes on how to do high level things or higher level interfaces?
I think it should be done entirely within the main QEMU migration socket. I know this isn't possible with the current impl, since it is unidirectional, preventing the target sending the source requests for specific data blocks. If we made the migration socket bi-directional I think we could do it all within qemu with no external helpers or extra sockets:
1. Create an empty qed file on the destination with copy-on-read enabled and a backing file pointing to a special 'migrate:' protocol
Why not just point migration and nbd to a unix domain socket and then multiplex the two protocols at a higher level?
2. Run qemu -incoming on the destination with the qed file 3. execute the migration 4. when migration completes, target QEMU continues streaming blocks from the source qemu. 5. when streaming is complete, source qemu can shut down.
Both your original proposal and mine here seem to have a pretty bad failure scenario though. After the cut-over point where the VM cpus start running on the destination QEMU, AFAICT, any failure on the source before block streaming completes leaves you dead in the water. The source VM no longer has up-to-date RAM contents and the destination VM does not yet have a complete disk image.
Yes. It's a trade off. However, pre-copy doesn't really change your likelihood of catastrophic failure because if you were going to fail on the source, it was going to happen before you completed the block transfer anyway. The advantage of post-copy is that you immediately free resources on the source, so as a reaction to pressure from overcommit it's tremendously useful. I still think pre-copy has its place though. Regards, Anthony Liguori
Regards, Daniel

On 09/07/2010 04:41 PM, Anthony Liguori wrote:
Hi,
We've got copy-on-read and image streaming working in QED and before going much further, I wanted to bounce some interfaces off of the libvirt folks to make sure our final interface makes sense.
Here's the basic idea:
Today, you can create images based on base images that are copy on write. With QED, we also support copy on read which forces a copy from the backing image on read requests and write requests.
Is copy on read QED specific? It looks very similar to the commit command, except with I/O directions reversed. IIRC, commit looks like:

    for each sector:
        if image.mapped(sector):
            backing_image.write(sector, image.read(sector))

whereas copy-on-read looks like:

    def copy_on_read():
        set_ioprio(idle)
        for each sector:
            if not image.mapped(sector):
                image.write(sector, backing_image.read(sector))
    run_in_thread(copy_on_read)

With appropriate locking.
In addition to copy on read, we introduce a notion of streaming a block device which means that we search for an unallocated region of the leaf image and force a copy-on-read operation.
The combination of copy-on-read and streaming means that you can start a guest based on slow storage (like over the network) and bring in blocks on demand while also having a deterministic mechanism to complete the transfer.
The interface for copy-on-read is just an option within qemu-img create. Streaming, on the other hand, requires a bit more thought. Today, I have a monitor command that does the following:
stream <device> <sector offset>
Which will try to stream the minimal amount of data for a single I/O operation and then return how many sectors were successfully streamed.
The idea about how to drive this interface is a loop like:
offset = 0
while offset < image_size:
    wait_for_idle_time()
    count = stream(device, offset)
    offset += count
This is way too low level for the management stack. Have you considered using the idle class I/O priority to implement this? That would allow host-wide prioritization. Not sure how to do cluster-wide, I don't think NFS has the concept of I/O priority. -- error compiling committee.c: too many arguments to function
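For reference, the idle I/O class mentioned here is the one exposed by ionice -c3; a streaming thread could put itself there via the ioprio_set() syscall. A minimal sketch, assuming Linux on x86_64 (the syscall number differs per architecture and there is no glibc wrapper):

    # Put the calling process/thread into the idle I/O scheduling class so
    # streaming I/O only runs when the disk is otherwise idle.
    import ctypes

    SYS_ioprio_set = 251          # x86_64 (assumed; architecture-specific)
    IOPRIO_WHO_PROCESS = 1
    IOPRIO_CLASS_IDLE = 3
    IOPRIO_CLASS_SHIFT = 13

    def set_idle_ioprio():
        libc = ctypes.CDLL(None, use_errno=True)
        ioprio = IOPRIO_CLASS_IDLE << IOPRIO_CLASS_SHIFT
        if libc.syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, ioprio) != 0:
            raise OSError(ctypes.get_errno(), "ioprio_set failed")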