[libvirt] PATCH: Disable QEMU drive caching

QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:

- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.

This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.

Daniel

diff -r 4a0ccc9dc530 src/qemu_conf.c
--- a/src/qemu_conf.c	Wed Oct 08 11:53:45 2008 +0100
+++ b/src/qemu_conf.c	Wed Oct 08 11:59:33 2008 +0100
@@ -460,6 +460,8 @@
         flags |= QEMUD_CMD_FLAG_DRIVE;
     if (strstr(help, "boot=on"))
         flags |= QEMUD_CMD_FLAG_DRIVE_BOOT;
+    if (strstr(help, "cache=on"))
+        flags |= QEMUD_CMD_FLAG_DRIVE_CACHE;
     if (version >= 9000)
         flags |= QEMUD_CMD_FLAG_VNC_COLON;
@@ -959,13 +961,15 @@
         break;
     }

-    snprintf(opt, PATH_MAX, "file=%s,if=%s,%sindex=%d%s",
+    snprintf(opt, PATH_MAX, "file=%s,if=%s,%sindex=%d%s%s",
              disk->src ? disk->src : "", bus,
              media ? media : "", idx,
              bootable &&
              disk->device == VIR_DOMAIN_DISK_DEVICE_DISK
-             ? ",boot=on" : "");
+             ? ",boot=on" : "",
+             qemuCmdFlags & QEMUD_CMD_FLAG_DRIVE_BOOT
+             ? ",cache=off" : "");

     ADD_ARG_LIT("-drive");
     ADD_ARG_LIT(opt);

diff -r 4a0ccc9dc530 src/qemu_conf.h
--- a/src/qemu_conf.h	Wed Oct 08 11:53:45 2008 +0100
+++ b/src/qemu_conf.h	Wed Oct 08 11:59:33 2008 +0100
@@ -44,7 +44,8 @@
     QEMUD_CMD_FLAG_NO_REBOOT   = (1 << 2),
     QEMUD_CMD_FLAG_DRIVE       = (1 << 3),
     QEMUD_CMD_FLAG_DRIVE_BOOT  = (1 << 4),
-    QEMUD_CMD_FLAG_NAME        = (1 << 5),
+    QEMUD_CMD_FLAG_DRIVE_CACHE = (1 << 5),
+    QEMUD_CMD_FLAG_NAME        = (1 << 6),
 };

 /* Main driver state */

-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:
- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.
I'm for this in general, but I'm a little worried about the "performance regression" aspect of this. People are going to upgrade to 0.4.7 (or whatever), and suddenly find that their KVM guests perform much more slowly. This is better in the end for their data, but we might hear large complaints about it.

Might it be a better idea to make the default "cache=off", but provide a toggle in the domain XML to turn it back to "cache=on" for the people who really want it and know what they are doing?

-- 
Chris Lalancette

On Wed, Oct 08, 2008 at 01:15:46PM +0200, Chris Lalancette wrote:
Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:
- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.
I'm for this in general, but I'm a little worried about the "performance regression" aspect of this. People are going to upgrade to 0.4.7 (or whatever), and suddenly find that their KVM guests perform much more slowly. This is better in the end for their data, but we might hear large complaints about it.
Yes & no. They will find their guests perform more consistently. With the current system their guests will perform very erratically depending on memory & I/O pressure on the host. If the host I/O cache is empty & has no I/O load, current guests will be "fast", but if the host I/O cache is full and they do something which requires more host memory (eg start up another guest), then all existing guests get their I/O performance trashed as the I/O cache has to be flushed out, and future I/O is unable to be cached.

Xen went through this same change and there were not any serious complaints, particularly when it was explained that the previous system had zero data integrity guarantees. The current system merely provides an illusion of performance - any attempt to show that performance has decreased is impossible, because running benchmarks with the existing caching just produces meaningless garbage.

https://bugzilla.redhat.com/show_bug.cgi?id=444047

The idea that a guest can have 5x the performance of the underlying host device is just ridiculous.
Might it be a better idea to make the default "cache=off", but provide a toggle in the domain XML to turn it back to "cache=on" for the people who really want it and know what they are doing?
Perhaps, but that's a separate issue for discussion. The immediate need is data integrity & consistent performance, so we can actually measure performance going forward.

Daniel

Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 01:15:46PM +0200, Chris Lalancette wrote:
Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:
- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.

I'm for this in general, but I'm a little worried about the "performance regression" aspect of this. People are going to upgrade to 0.4.7 (or whatever), and suddenly find that their KVM guests perform much more slowly. This is better in the end for their data, but we might hear large complaints about it.
Yes & no. They will find their guests perform more consistently. With the current system their guests will perform very erratically depending on memory & I/O pressure on the host. If the host I/O cache is empty & has no I/O load, current guests will be "fast",
They will perform marginally better than if cache=off. This is because the Linux host knows more about the underlying hardware than the guest, and is able to do smarter read-ahead. When using cache=off, the host cannot perform any sort of read-ahead.
but if host I/O cache is full and they do something which requires more host memory (eg start up another guest), then all existing guests get their I/O performance trashed as the I/O cache has to be flushed out, and future I/O is unable to be cached.
This is not accurate. Dirty pages in the host page cache are not reclaimable until they're written to disk. If you're in a seriously low memory situation, then the thing allocating memory is going to sleep until the data is written to disk. If an existing guest is trying to do I/O, then things will degenerate to basically cache=off, since the guest must wait for other pending I/O to complete.
Xen went through this same change and there were not any serious complaints, particularly when it was explained that the previous system had zero data integrity guarantees. The current system merely provides an illusion of performance - any attempt to show that performance has decreased is impossible, because any attempt to run benchmarks with the existing caching just results in meaningless garbage.
I can't see this bug, but a quick grep of ioemu in xen-unstable for O_DIRECT reveals that they are not in fact using O_DIRECT. O_DIRECT, O_SYNC, and fsync are not the same mechanism.

Regards,

Anthony Liguori

On Wed, Oct 08, 2008 at 11:06:27AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 01:15:46PM +0200, Chris Lalancette wrote:
Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:
- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.

I'm for this in general, but I'm a little worried about the "performance regression" aspect of this. People are going to upgrade to 0.4.7 (or whatever), and suddenly find that their KVM guests perform much more slowly. This is better in the end for their data, but we might hear large complaints about it.
Yes & no. They will find their guests perform more consistently. With the current system their guests will perform very erratically depending on memory & I/O pressure on the host. If the host I/O cache is empty & has no I/O load, current guests will be "fast",
They will perform marginally better than if cache=off. This is because the Linux host knows more about the underlying hardware than the guest, and is able to do smarter read-ahead. When using cache=off, the host cannot perform any sort of read-ahead.
but if host I/O cache is full and they do something which requires more host memory (eg start up another guest), then all existing guests get their I/O performance trashed as the I/O cache has to be flushed out, and future I/O is unable to be cached.
This is not accurate. Dirty pages in the host page cache are not reclaimable until they're written to disk. If you're in a seriously low memory situation, then the thing allocating memory is going to sleep until the data is written to disk. If an existing guest is trying to do I/O, then things will degenerate to basically cache=off, since the guest must wait for other pending I/O to complete.
Xen went through this same change and there were not any serious complaints, particularly when it was explained that the previous system had zero data integrity guarantees. The current system merely provides an illusion of performance - any attempt to show that performance has decreased is impossible, because any attempt to run benchmarks with the existing caching just results in meaningless garbage.
I can't see this bug, but a quick grep of ioemu in xen-unstable for O_DIRECT reveals that they are not in fact using O_DIRECT.
Sorry, it was mistakenly private - fixed now.

Xen does use O_DIRECT for the paravirt driver case - blktap is using the combo of AIO+O_DIRECT. The QEMU code is only used for the IDE emulation case, which isn't interesting from a performance POV.

Daniel

Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:06:27AM -0500, Anthony Liguori wrote:
Sorry, it was mistakenly private - fixed now.
Xen does use O_DIRECT for paravirt driver case - blktap is using the combo of AIO+O_DIRECT.
You have to use O_DIRECT with linux-aio. And blktap is well known to have terrible performance. Most serious users use blkback/blkfront, and blkback does not avoid the host page cache. It maintains data integrity by passing through barriers from the guest to the host. You can approximate this in userspace by using fdatasync.

The issue the bug addresses, iozone performing better than native, can be addressed in the following way:

1) For IDE, you have to disable write-caching in the guest. This should force an fdatasync in the host.
2) For virtio-blk, we need to implement barrier support. This is what blkfront/blkback do.
3) For SCSI, we should support ordered queuing, which would result in an fdatasync when barriers are injected.

This would result in write performance being what was expected in the guest, while still letting the host coalesce I/O requests and perform scheduling with other guests (while respecting each guest's own ordering requirements).

Regards,

Anthony Liguori
QEMU code is only used for the IDE emulation case which isn't interesting from a performance POV.
Daniel

On Wed, Oct 08, 2008 at 11:49:19AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:06:27AM -0500, Anthony Liguori wrote:
Sorry, it was mistakenly private - fixed now.
Xen does use O_DIRECT for paravirt driver case - blktap is using the combo of AIO+O_DIRECT.
You have to use O_DIRECT with linux-aio. And blktap is well known to have terrible performance. Most serious users use blkback/blkfront and blkback does not avoid the host page cache. It maintains data integrity by passing through barriers from the guest to the host. You can approximate this in userspace by using fdatasync.
The issue the bug addresses, iozone performs better than native, can be addressed in the following way:
1) For IDE, you have to disable write-caching in the guest. This should force an fdatasync in the host.
2) For virtio-blk, we need to implement barrier support. This is what blkfront/blkback do.
3) For SCSI, we should support ordered queuing, which would result in an fdatasync when barriers are injected.
Ok, ignore my libvirt patch then. We'll punt this problem back to the QEMU & virtio developers to solve properly.

Daniel

Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:06:27AM -0500, Anthony Liguori wrote:
Sorry, it was mistakenly private - fixed now. Xen does use O_DIRECT for the paravirt driver case - blktap is using the combo of AIO+O_DIRECT.
You have to use O_DIRECT with linux-aio. And blktap is well known to have terrible performance. Most serious users use blkback/blkfront and blkback does not avoid the host page cache. It maintains data integrity by passing through barriers from the guest to the host. You can approximate this in userspace by using fdatasync.
This is not accurate (at least for HVM guests using PV drivers on Xen 3.2). blkback does indeed bypass the host page cache completely. Its I/O behavior is akin to O_DIRECT. I/O is DMA'd directly to/from guest pages without involving any dom0 buffering.

blkback barrier support only enforces write ordering of the blkback I/O stream(s). It does nothing to synchronize data in the host page cache. Data written through blkback will modify the storage "underneath" any data in the host page cache (w/o flushing the page cache). Subsequent access to the page cache by qemu-dm will access stale data. In our own Xen product we must explicitly flush the host page cache backing store data at qemu-dm start up, to guarantee proper data access. It is not safe to access the same backing object with both qemu-dm and blkback simultaneously.
The issue the bug addresses, iozone performs better than native, can be addressed in the following way:
1) For IDE, you have to disable write-caching in the guest. This should force an fdatasync in the host.
2) For virtio-blk, we need to implement barrier support. This is what blkfront/blkback do.
I don't think this is enough. Barrier semantics are local to a particular I/O stream. There would be no reason for the barrier to affect the host page cache (unless the I/Os are buffered by the cache).
3) For SCSI, we should support ordered queuing which would result in an fdatasync when barriers are injected.
This would result in write performance being what was expected in the guest, while still letting the host coalesce I/O requests and perform scheduling with other guests (while respecting each guest's own ordering requirements).
I generally agree with your suggestion that host page cache performance benefits shouldn't be discarded just to make naive benchmark data collection easier. Anyone suggesting that QEMU emulated disk I/O could somehow outperform the host I/O system should know that something is wrong with their benchmark setup. Unfortunately this discussion continues to reappear in the Xen community. I am sure that as QEMU/KVM/virtio matures, a similar thread will continue to resurface.

Steve
Regards,
Anthony Liguori
QEMU code is only used for the IDE emulation case which isn't interesting from a performance POV.
Daniel
-- 
Libvir-list mailing list
Libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list

Steve Ofsthun wrote:
Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:06:27AM -0500, Anthony Liguori wrote:
Sorry, it was mistakenly private - fixed now. Xen does use O_DIRECT for the paravirt driver case - blktap is using the combo of AIO+O_DIRECT.
You have to use O_DIRECT with linux-aio. And blktap is well known to have terrible performance. Most serious users use blkback/blkfront and blkback does not avoid the host page cache. It maintains data integrity by passing through barriers from the guest to the host. You can approximate this in userspace by using fdatasync.
This is not accurate (at least for HVM guests using PV drivers on Xen 3.2). blkback does indeed bypass the host page cache completely. Its I/O behavior is akin to O_DIRECT.
I reread the code more closely and convinced myself that you are correct. While it was obvious that the bio's were being constructed from granted pages, my initial impression was that the requests were still going through the scheduler and could still be satisfied from the host page cache. But that is not that case.
I/O is dma'd directly to/from guest pages without involving any dom0 buffering. blkback barrier support only enforces write ordering of the blkback I/O stream(s). It does nothing to synchronize data in the host page cache. Data written through blkback will modify the storage "underneath" any data in the host page cache (w/o flushing the page cache). Subsequent access to the page cache by qemu-dm will access stale data. In our own Xen product we must explicitly flush the host page cache backing store data at qemu-dm start up, to guarantee proper data access. It is not safe to access the same backing object with both qemu-dm and blkback simultaneously.
The issue the bug addresses, iozone performs better than native, can be addressed in the following way:
1) For IDE, you have to disable write-caching in the guest. This should force an fdatasync in the host.
2) For virtio-blk, we need to implement barrier support. This is what blkfront/blkback do.
I don't think this is enough. Barrier semantics are local to a particular I/O stream. There would be no reason for the barrier to affect the host page cache (unless the I/Os are buffered by the cache).
If we implement barriers in terms of fdatasync, it should be sufficient.

Regards,

Anthony Liguori

On Wed, Oct 08, 2008 at 12:03:33PM +0100, Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems:
- It is a waste of memory, because the guest already caches I/O ops.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.
Right! I think for integrity reasons we should revert that default at the libvirt level and switch caching to off. I would not be against a way to reactivate it optionally, assuming we have a clean way to express it at the XML level (I don't think we have one currently; maybe an optional cache="on|off" attribute could be added to device/disk/target), because in some circumstances, like caching of read-only devices available to multiple domains, it can make sense to keep caching on the host OS. So I'm fine with the patch going in as-is, but maybe we need one patch on top to re-enable the cache explicitly on a case by case basis.

Daniel

P.S.: can you try to generate patches with -p to get the contextual function? Without it, it's harder to review exactly where things go, especially when there is a line number shift due to other pending patches. Thanks!

-- 
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
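The toggle suggested above might look like the following in the domain XML. This is purely a hypothetical sketch of the proposed cache="on|off" attribute on the target element; no such attribute existed in libvirt at the time:

```xml
<disk type='file' device='disk'>
  <source file='/var/lib/libvirt/images/guest.img'/>
  <!-- hypothetical attribute, following the suggestion above;
       omitting it would keep the proposed safe default of cache=off -->
  <target dev='hda' bus='ide' cache='on'/>
</disk>
```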

Daniel P. Berrange wrote:
QEMU defaults to allowing the host OS to cache all disk I/O. This has a couple of problems
Oh, say it ain't so. This is precisely what I didn't want to see happen :-(
- It is a waste of memory because the guest already caches I/O ops
Page cache memory is easily reclaimable and has relatively low priority. If a guest needs memory, the size of the page cache will be reduced.
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
This has nothing to do with cache=off. The IDE device defaults to write-back caching. As such, IDE makes no guarantee that when a data write completes, it's actually completed on disk. This only comes into play when write-back is disabled. I'm perfectly happy to accept a patch that adds explicit syncs when write-back is disabled.

For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, I'm perfectly happy to take patches here.

More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros. So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
- It makes benchmarking more or less impossible / worthless, because what the benchmark thinks are disk writes just sit around in memory, so guest disk performance appears to exceed host disk performance.
It just means you have to understand the extra level of caching. A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
This patch disables caching on all QEMU guests. NB, Xen has long done this for both PV & HVM guests - QEMU only gained this ability when -drive was introduced, and sadly kept the default to unsafe cache=on settings.

They don't for HVM, actually. When using file: for PV disks, it also goes through the host page cache. For HVM, Xen uses the write-back disabled synchronization stuff I mentioned earlier. This is a really bad thing to do by default. I don't even think it should be an option for users, because it's so terribly misunderstood.

Regards,

Anthony Liguori
Daniel
diff -r 4a0ccc9dc530 src/qemu_conf.c
--- a/src/qemu_conf.c	Wed Oct 08 11:53:45 2008 +0100
+++ b/src/qemu_conf.c	Wed Oct 08 11:59:33 2008 +0100
@@ -460,6 +460,8 @@
         flags |= QEMUD_CMD_FLAG_DRIVE;
     if (strstr(help, "boot=on"))
         flags |= QEMUD_CMD_FLAG_DRIVE_BOOT;
+    if (strstr(help, "cache=on"))
+        flags |= QEMUD_CMD_FLAG_DRIVE_CACHE;
     if (version >= 9000)
         flags |= QEMUD_CMD_FLAG_VNC_COLON;
@@ -959,13 +961,15 @@
         break;
     }

-    snprintf(opt, PATH_MAX, "file=%s,if=%s,%sindex=%d%s",
+    snprintf(opt, PATH_MAX, "file=%s,if=%s,%sindex=%d%s%s",
              disk->src ? disk->src : "", bus,
              media ? media : "", idx,
              bootable &&
              disk->device == VIR_DOMAIN_DISK_DEVICE_DISK
-             ? ",boot=on" : "");
+             ? ",boot=on" : "",
+             qemuCmdFlags & QEMUD_CMD_FLAG_DRIVE_BOOT
+             ? ",cache=off" : "");

     ADD_ARG_LIT("-drive");
     ADD_ARG_LIT(opt);

diff -r 4a0ccc9dc530 src/qemu_conf.h
--- a/src/qemu_conf.h	Wed Oct 08 11:53:45 2008 +0100
+++ b/src/qemu_conf.h	Wed Oct 08 11:59:33 2008 +0100
@@ -44,7 +44,8 @@
     QEMUD_CMD_FLAG_NO_REBOOT   = (1 << 2),
     QEMUD_CMD_FLAG_DRIVE       = (1 << 3),
     QEMUD_CMD_FLAG_DRIVE_BOOT  = (1 << 4),
-    QEMUD_CMD_FLAG_NAME        = (1 << 5),
+    QEMUD_CMD_FLAG_DRIVE_CACHE = (1 << 5),
+    QEMUD_CMD_FLAG_NAME        = (1 << 6),
 };

 /* Main driver state */

On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
NB, this has no impact on caching of backing files - QEMU masks out the O_DIRECT flag when opening the backing file - so in a shared master image scenario, all reads for the shared file will still be cached; only writes to the cow file are impacted.

Daniel

Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
NB, this has no impact on caching of backing files - QEMU masks out the O_DIRECT flag when opening the backing file
It doesn't mask out O_DIRECT, it just doesn't pass any flags to the backing file when it opens it. IMHO, this is a bug.

Regards,

Anthony Liguori
- so in a shared master image scenario, all reads for the shared file will still be cached; only writes to the cow file are impacted.
Daniel

On Wed, Oct 08, 2008 at 11:53:14AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
NB, this has no impact on caching of backing files - QEMU masks out the O_DIRECT flag when opening the backing file
It doesn't mask out O_DIRECT, it just doesn't pass any flags to the backing file when it opens it. IMHO, this is a bug.
Perhaps I'm interpreting the wrong bit of code, but I was looking at QEMU's block.c in the bdrv_open2() function. The last thing it does is this, which masks out all flags except for the open mode:

    if (bs->backing_file[0] != '\0') {
        if (bdrv_open(bs->backing_hd, backing_filename,
                      flags & (BDRV_O_RDONLY | BDRV_O_RDWR)) < 0)
            goto fail;
    }

Daniel

Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:53:14AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
NB, this has no impact on caching of backing files - QEMU masks out the O_DIRECT flag when opening the backing file
It doesn't mask out O_DIRECT, it just doesn't pass any flags to the backing file when it opens it. IMHO, this is a bug.
Perhaps I'm interpreting the wrong bit of code, but I was looking at QEMU's block.c in the bdrv_open2() function. The last thing it does is this, which masks out all flags except for the open mode:
if (bs->backing_file[0] != '\0') {
    if (bdrv_open(bs->backing_hd, backing_filename,
                  flags & (BDRV_O_RDONLY | BDRV_O_RDWR)) < 0)
        goto fail;
}
    if (bs->backing_file[0] != '\0') {
        /* if there is a backing file, use it */
        bs->backing_hd = bdrv_new("");
        if (!bs->backing_hd) {
        fail:
            bdrv_close(bs);
            return -ENOMEM;
        }
        path_combine(backing_filename, sizeof(backing_filename),
                     filename, bs->backing_file);
        if (bdrv_open(bs->backing_hd, backing_filename, 0) < 0)
            goto fail;
    }

That is what's in the latest QEMU tree. Is what you're looking at carrying a patch, perhaps? If so, there may be a bug in the patch.

Regards,

Anthony Liguori
Daniel

On Wed, Oct 08, 2008 at 12:16:00PM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 11:53:14AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
A great deal of virtualization users are doing some form of homogeneous consolidation. If they have a good set of management tools or sophisticated storage, then their guests will be sharing base images or something like that. Caching in the host will result in major performance improvements because otherwise, the same data will be fetched multiple times.
NB, this has no impact on caching of backing files - QEMU masks out the O_DIRECT flag when opening the backing file
It doesn't mask out O_DIRECT, it just doesn't pass any flags to the backing file when it opens it. IMHO, this is a bug.
Perhaps I'm interpreting the wrong bit of code, but I was looking at QEMU's block.c in the bdrv_open2() function. The last thing it does is this, which masks out all flags except for the open mode:
if (bs->backing_file[0] != '\0') {
    if (bdrv_open(bs->backing_hd, backing_filename,
                  flags & (BDRV_O_RDONLY | BDRV_O_RDWR)) < 0)
        goto fail;
}
if (bs->backing_file[0] != '\0') {
    /* if there is a backing file, use it */
    bs->backing_hd = bdrv_new("");
    if (!bs->backing_hd) {
    fail:
        bdrv_close(bs);
        return -ENOMEM;
    }
    path_combine(backing_filename, sizeof(backing_filename),
                 filename, bs->backing_file);
    if (bdrv_open(bs->backing_hd, backing_filename, 0) < 0)
        goto fail;
}
Is what's in the latest QEMU tree. Is what you're looking at carrying a patch, perhaps? If so, there may be a bug in the patch.
No, I had forgotten that I had updated my checkout to an old changeset to trace some unrelated issue, so I wasn't looking at the latest code. The change to pass '0' was made way back in changeset 2075 when AIO was added.

Daniel
-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Hi, Not greatly familiar with this subject, but trying to follow your logic ... On Wed, 2008-10-08 at 10:51 -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
This has nothing to do with cache=off. The IDE device defaults to write-back caching. As such, IDE makes no guarantee that when a data write completes, it's actually completed on disk. This only comes into play when write-back is disabled. I'm perfectly happy to accept a patch that adds explicit sync's when write-back is disabled.
i.e. with write-back caching enabled, the IDE protocol makes no guarantees about when data is committed to disk. So, from a protocol correctness POV, qemu is behaving correctly with cache=on and write-back caching enabled on the disk.
For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, perfectly happy to take patches here.
Queue ordering and write-back caching sound like very different things. Are they two distinct SCSI options, or ...? Surely an ordered queue doesn't do much to help prevent fs corruption if the host crashes, right? You would still need write-back caching disabled?
More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros.
i.e. implementing barriers for virtio won't help most ext3 deployments? And again, if barriers are just about ordering, don't you need to disable caching anyway?
So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
This is the bit I really don't buy - we're equating qemu caching to IDE write-back caching and saying the risk of corruption is the same in both cases. But doesn't qemu cache data for far, far longer than a typical IDE disk with write-back caching would do? Doesn't that mean you're far, far more likely to see fs corruption with qemu caching? Or put it another way, if we fix it by implementing the disabling of write-back caching ... users running a virtual machine will need to run "hdparm -W 0 /dev/sda" where they would never have run it on bare metal? Cheers, Mark.

Mark McLoughlin wrote:
i.e. with write-back caching enabled, the IDE protocol makes no guarantees about when data is committed to disk.
So, from a protocol correctness POV, qemu is behaving correctly with cache=on and write-back caching enabled on the disk.
Yes, the host page cache is basically a big on-disk cache.
For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, perfectly happy to take patches here.
Queue ordering and write-back caching sound like very different things. Are they two distinct SCSI options, or ...?
Yes.
Surely an ordered queue doesn't do much to help prevent fs corruption if the host crashes, right? You would still need write-back caching disabled?
You need both. In theory, a guest would use queue ordering to guarantee that certain writes made it to disk before other writes. Enabling write-through guarantees that the data is actually on disk. Since we advertise an unordered queue, we're okay from a safety point-of-view but for performance reasons, we'll want to do ordered queuing.
More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros.
i.e. implementing barriers for virtio won't help most ext3 deployments?
Yes, ext3 doesn't use barriers by default. See http://kerneltrap.org/mailarchive/linux-kernel/2008/5/19/1865314
And again, if barriers are just about ordering, don't you need to disable caching anyway?
Well, virtio doesn't have a notion of write-caching. My thinking is that we ought to implement barriers via fdatasync because posix-aio already has an op for it. This would effectively use barriers as a point to force something on disk. I think this would take care of most of the data corruption issues since the cases where the guest cares about data corruption would be handled (barriers should be used for journal writes and any O_DIRECT write for instance, although, yeah, not the case today with ext3).
So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
This is the bit I really don't buy - we're equating qemu caching to IDE write-back caching and saying the risk of corruption is the same in both cases.
Yes.
But doesn't qemu cache data for far, far longer than a typical IDE disk with write-back caching would do? Doesn't that mean you're far, far more likely to see fs corruption with qemu caching?
It caches more data; I don't know how much longer it caches than a typical IDE disk. The guest can crash and that won't cause data loss. The only thing that will really cause data loss is the host crashing, so it's slightly better than write-back caching in that regard.
Or put it another way, if we fix it by implementing the disabling of write-back caching ... users running a virtual machine will need to run "hdparm -W 0 /dev/sda" where they would never have run it on bare metal?
I don't see it as something needing to be fixed because I don't see that the exposure is significantly greater for a VM than for a real machine. And let's take a step back too. If people are really concerned about this point, let's introduce a sync=on option that opens the image with O_SYNC. This will effectively make the cache write-through without the baggage associated with O_DIRECT. While I object to libvirt always setting cache=off, I think sync=on for IDE and SCSI may be reasonable (you don't want it for virtio-blk once we implement proper barriers with fdatasync I think). Regards, Anthony Liguori
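In open(2) terms, the difference between the proposed sync=on and cache=off would look roughly like this (the helper and option plumbing are illustrative assumptions, not QEMU's actual option parsing):

```c
/* Sketch of the open(2) flags behind the two options discussed
 * above; hypothetical helper, not QEMU code. */
#define _GNU_SOURCE             /* O_DIRECT is a GNU extension on Linux */
#include <fcntl.h>

static int open_image(const char *path, int sync_on, int cache_off)
{
    int flags = O_RDWR | O_CREAT;

    if (sync_on)
        flags |= O_SYNC;        /* write-through: a write completes only
                                 * once it is on stable storage, but reads
                                 * are still served from the host cache */
    if (cache_off)
        flags |= O_DIRECT;      /* bypass the host page cache entirely;
                                 * buffers must satisfy alignment rules */

    return open(path, flags, 0600);
}
```

This is why sync=on avoids the "baggage" of O_DIRECT: it keeps the read cache and imposes no alignment restrictions, while still making writes write-through.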
Cheers, Mark.

Anthony Liguori wrote:
Mark McLoughlin wrote:
And let's take a step back too. If people are really concerned about this point, let's introduce a sync=on option that opens the image with O_SYNC. This will effectively make the cache write-through without the baggage associated with O_DIRECT.
I'm starting to slowly convince myself we should always open files with O_SYNC. Barriers should just force ordering within the thread pool. posix-aio has no interface for this, but we could create one with our own thread pool implementation. Ryan: could you give the following patch a perf-run so we can see how this would affect us? Thanks, Anthony Liguori
While I object to libvirt always setting cache=off, I think sync=on for IDE and SCSI may be reasonable (you don't want it for virtio-blk once we implement proper barriers with fdatasync I think).
Regards,
Anthony Liguori
Cheers, Mark.

* Anthony Liguori (anthony@codemonkey.ws) wrote:
Mark McLoughlin wrote:
This is the bit I really don't buy - we're equating qemu caching to IDE write-back caching and saying the risk of corruption is the same in both cases.
Yes.
I'm with Mark here.
But doesn't qemu cache data for far, far longer than a typical IDE disk with write-back caching would do? Doesn't that mean you're far, far more likely to see fs corruption with qemu caching?
It caches more data; I don't know how much longer it caches than a typical IDE disk. The guest can crash and that won't cause data loss. The only thing that will really cause data loss is the host crashing, so it's slightly better than write-back caching in that regard.
Or put it another way, if we fix it by implementing the disabling of write-back caching ... users running a virtual machine will need to run "hdparm -W 0 /dev/sda" where they would never have run it on bare metal?
I don't see it as something needing to be fixed because I don't see that the exposure is significantly greater for a VM than for a real machine.
One host crash corrupting all VMs' data? What is the benefit? Seems like the benefit of caching is only useful when VMs aren't all that busy. Once the host is heavily committed - the case where it might benefit most from the extra caching - the host cache will shrink to essentially nothing. Also, many folks will be running heterogeneous guests (or at least not template based), so in that case it's really just double caching (i.e. memory overhead). Seems a no-brainer to me, so I must be confused and/or missing something. thanks, -chris

Chris Wright wrote:
* Anthony Liguori (anthony@codemonkey.ws) wrote:
Mark McLoughlin wrote:
This is the bit I really don't buy - we're equating qemu caching to IDE write-back caching and saying the risk of corruption is the same in both cases.
Yes.
I'm with Mark here.
I've been persuaded. Relying on the host's integrity for guest data integrity is not a good idea by default. I don't think we should use cache=off to address this though. I've sent a patch and started a thread on qemu-devel. Let's continue the conversation there. Regards, Anthony Liguori

* Anthony Liguori (anthony@codemonkey.ws) wrote:
I've been persuaded. Relying on the host's integrity for guest data integrity is not a good idea by default. I don't think we should use cache=off to address this though. I've sent a patch and started a thread on qemu-devel. Let's continue the conversation there.
Thanks, both for the patch, and redirecting conversation to proper spot. -chris

On Thu, Oct 09, 2008 at 12:03:28PM -0500, Anthony Liguori wrote:
I've been persuaded. Relying on the host's integrity for guest data integrity is not a good idea by default. I don't think we should use cache=off to address this though. I've sent a patch and started a thread on qemu-devel. Let's continue the conversation there.
Well, it's really good to be able to expect a good default behaviour, but that doesn't mean there shouldn't be a user setting to activate or deactivate caching explicitly. It will still be available at the QEmu level, and I think it makes some sense in other environments even if it's not applicable to container-based virtualization. I think we should still provide an optional flag in the libvirt device XML description to indicate the caching preference on the hypervisor for a device. Off by default of course.

Daniel
-- 
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/
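For illustration, such an optional per-device flag might look like this in the domain XML; the attribute name and placement are assumptions, not an existing libvirt schema element:

```xml
<!-- Hypothetical: optional cache preference on a disk device,
     off by default as suggested above -->
<disk type='file' device='disk'>
  <source file='/var/lib/libvirt/images/guest.img'/>
  <target dev='hda' bus='ide'/>
  <driver name='qemu' cache='off'/>
</disk>
```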

On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
This has nothing to do with cache=off. The IDE device defaults to write-back caching. As such, IDE makes no guarantee that when a data write completes, it's actually completed on disk. This only comes into play when write-back is disabled. I'm perfectly happy to accept a patch that adds explicit sync's when write-back is disabled.
For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, perfectly happy to take patches here.
More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros. So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
Interesting discussion. I'm wondering about the non-local storage effect though: if the Node is caching writes, how can we ensure a coherent view on remote storage, for example when migrating a domain? Maybe migration is easy to fix because qemu is aware and can issue a sync, but as we start adding cloning APIs to libvirt, we could face the issue of issuing an LVM snapshot operation on the guest storage while the Node still caches some of the data. The more layers of caching, the harder it is to have a predictable behaviour, no?

Daniel
-- 
Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/
http://veillard.com/ | virtualization library http://libvirt.org/

Daniel Veillard wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
This has nothing to do with cache=off. The IDE device defaults to write-back caching. As such, IDE makes no guarantee that when a data write completes, it's actually completed on disk. This only comes into play when write-back is disabled. I'm perfectly happy to accept a patch that adds explicit sync's when write-back is disabled.
For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, perfectly happy to take patches here.
More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros. So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
Interesting discussion. I'm wondering about the non-local storage effect though: if the Node is caching writes, how can we ensure a coherent view on remote storage, for example when migrating a domain? Maybe migration is easy to fix because qemu is aware and can issue a sync, but as we start adding cloning APIs to libvirt, we could face the issue of issuing an LVM snapshot operation on the guest storage while the Node still caches some of the data. The more layers of caching, the harder it is to have a predictable behaviour, no?
Any live migration infrastructure must guarantee the write ordering between guest writes generated on the "old" node and guest writes generated on the "new" node. This usually happens as the live migration crosses the point of no return where the guest is allowed to execute code on the "new" node. The "old" node must flush its writes and/or the "new" node must delay any new writes until it is safe to do so. In the case of LVM snapshots, only one node is able to safely access the snapshot at a time, so an organized transfer of the active snapshot is necessary during the live migration. For the case of CLVM, I would think the "cluster-aware" bits would coordinate the transfer. Even in this case though, the data must be flushed out of the page cache on the "old" node and onto the storage itself. Steve

Daniel Veillard wrote:
On Wed, Oct 08, 2008 at 10:51:16AM -0500, Anthony Liguori wrote:
Daniel P. Berrange wrote:
- It is unsafe on host OS crash - all unflushed guest I/O will be lost, and there are no ordering guarantees, so metadata updates could be flushed to disk while the journal updates were not. Say goodbye to your filesystem.
This has nothing to do with cache=off. The IDE device defaults to write-back caching. As such, IDE makes no guarantee that when a data write completes, it's actually completed on disk. This only comes into play when write-back is disabled. I'm perfectly happy to accept a patch that adds explicit sync's when write-back is disabled.
For SCSI, an unordered queue is advertised. Again, everything depends on whether or not write-back caching is enabled. Again, perfectly happy to take patches here.
More importantly, the most common journaled filesystem, ext3, does not enable write barriers by default (even for journal updates). This is how it ships in Red Hat distros. So there is no greater risk of corrupting a journal in QEMU than there is on bare metal.
Interesting discussion. I'm wondering about the non-local storage effect though: if the Node is caching writes, how can we ensure a coherent view on remote storage, for example when migrating a domain?
In the case of remote storage, cache coherency is part of the network storage protocol/architecture. In NFS for instance, the most common coherency model is close-to-open. Other network storage solutions provide stronger coherency models.
Maybe migration is easy to fix because qemu is aware and can issue a sync, but as we start adding cloning APIs to libvirt, we could face the issue of issuing an LVM snapshot operation on the guest storage while the Node still caches some of the data. The more layers of caching, the harder it is to have a predictable behaviour, no?
With respect to migration, QEMU does a flush(), but not an fdatasync. Even if we did an fdatasync, I'm not sure that's good enough with NFS because I don't know if fdatasync on the source *after* the target has opened a file and read data will guarantee consistency. Regards, Anthony Liguori
Daniel
participants (7)
- Anthony Liguori
- Chris Lalancette
- Chris Wright
- Daniel P. Berrange
- Daniel Veillard
- Mark McLoughlin
- Steve Ofsthun