Mark McLoughlin wrote:
> i.e. with write-back caching enabled, the IDE protocol makes no
> guarantees about when data is committed to disk.
>
> So, from a protocol correctness POV, qemu is behaving correctly with
> cache=on and write-back caching enabled on the disk.
Yes, the host page cache is basically acting as a big write-back disk cache.
>> For SCSI, an unordered queue is advertised. Again, everything depends
>> on whether or not write-back caching is enabled or not. Again,
>> perfectly happy to take patches here.
>
> Queue ordering and write-back caching sound like very different things.
> Are they two distinct SCSI options, or ...?
Yes.
> Surely an ordered queue doesn't do much to help prevent fs corruption
> if the host crashes, right? You would still need write-back caching
> disabled?
You need both. In theory, a guest would use queue ordering to guarantee
that certain writes made it to disk before other writes. Enabling
write-through guarantees that the data is actually on disk. Since we
advertise an unordered queue, we're okay from a safety point-of-view but
for performance reasons, we'll want to do ordered queuing.
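
As a rough userspace analogy of the two guarantees (this is not qemu
code; the file name and sizes are made up, it just illustrates why a
journalling filesystem wants both ordering and durability):

    /* Illustrative only.  Write journal blocks, force them to stable
     * storage, then write the commit record.  The fdatasync() stands in
     * for what queue ordering plus write-through (or a real barrier)
     * would give the guest: the commit record cannot reach the disk
     * before the journal data, and the journal data is actually durable. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char journal[512], commit[512];
        int fd;

        memset(journal, 'J', sizeof(journal));
        memset(commit, 'C', sizeof(commit));

        fd = open("journal.img", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        pwrite(fd, journal, sizeof(journal), 0);   /* journal blocks */
        fdatasync(fd);                             /* "barrier": order + flush */
        pwrite(fd, commit, sizeof(commit), 512);   /* commit record */
        fdatasync(fd);

        return close(fd);
    }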
>> More importantly, the most common journaled filesystem, ext3, does not
>> enable write barriers by default (even for journal updates). This is
>> how it ships in Red Hat distros.
>
> i.e. implementing barriers for virtio won't help most ext3 deployments?
Yes, ext3 doesn't use barriers by default. See
http://kerneltrap.org/mailarchive/linux-kernel/2008/5/19/1865314
> And again, if barriers are just about ordering, don't you need to
> disable caching anyway?
Well, virtio doesn't have a notion of write-caching. My thinking is
that we ought to implement barriers via fdatasync because posix-aio
already has an op for it. That would effectively treat each barrier as
a point at which data is forced out to disk. I think this would take
care of most of the data corruption issues, since the cases where the
guest cares about data integrity would be handled (barriers should be
used for journal writes and any O_DIRECT write, for instance, although,
yeah, that's not the case today with ext3).
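
Very roughly, the shape I have in mind is something like this (just a
sketch, none of these names exist in qemu; it only shows the posix-aio
call a barrier would boil down to):

    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>

    /* Hypothetical helper: when the guest issues a barrier for a
     * virtio-blk device, queue an asynchronous data sync on the backing
     * image.  O_DSYNC gives fdatasync() semantics (O_SYNC would give
     * fsync()); completion would be reported back like any other aio
     * request. */
    int submit_barrier(int image_fd, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = image_fd;
        return aio_fsync(O_DSYNC, cb);
    }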
>> So there is no greater risk of corrupting a journal in QEMU than there
>> is on bare metal.
>
> This is the bit I really don't buy - we're equating qemu caching to IDE
> write-back caching and saying the risk of corruption is the same in
> both cases.
Yes.
> But doesn't qemu cache data for far, far longer than a typical IDE disk
> with write-back caching would do? Doesn't that mean you're far, far
> more likely to see fs corruption with qemu caching?
It caches more data; I don't know how much longer it caches it than a
typical IDE disk would. The guest can crash and that won't cause data
loss. The only thing that will really cause data loss is the host
crashing, so it's slightly better than write-back caching in that
regard.
> Or put it another way: if we fix it by implementing the disabling of
> write-back caching ... users running a virtual machine will need to run
> "hdparm -W 0 /dev/sda" where they would never have run it on bare metal?
I don't see it as something needing to be fixed because I don't see that
the exposure is significantly greater for a VM than for a real machine.
And let's take a step back too. If people are really concerned about
this point, let's introduce a sync=on option that opens the image with
O_SYNC. This will effectively make the cache write-through without the
baggage associated with O_DIRECT.
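
In open() terms the difference is roughly this (the helper and flag
handling are made up, just to illustrate):

    #include <fcntl.h>

    /* sync=on: keep using the host page cache for reads, but make writes
     * complete only once they are on stable storage (write-through).
     * cache=off would use O_DIRECT instead, bypassing the page cache
     * entirely and dragging in its alignment and copy overhead. */
    int open_image(const char *path, int sync_on)
    {
        int flags = O_RDWR;

        if (sync_on)
            flags |= O_SYNC;

        return open(path, flags);
    }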
While I object to libvirt always setting cache=off, I think sync=on for
IDE and SCSI may be reasonable (you don't want it for virtio-blk once we
implement proper barriers with fdatasync I think).
Regards,
Anthony Liguori
> Cheers,
> Mark.