Mark McLoughlin wrote:
> i.e. with write-back caching enabled, the IDE protocol makes no
> guarantees about when data is committed to disk.
>
> So, from a protocol correctness POV, qemu is behaving correctly with
> cache=on and write-back caching enabled on the disk.
Yes, the host page cache is basically acting as a big write-back disk cache.
>> For SCSI, an unordered queue is advertised. Again, everything depends
>> on whether or not write-back caching is enabled or not. Again,
>> perfectly happy to take patches here.
>
> Queue ordering and write-back caching sound like very different things.
> Are they two distinct SCSI options, or ...?
Yes.
> Surely an ordered queue doesn't do much to help prevent fs corruption
> if the host crashes, right? You would still need write-back caching
> disabled?
You need both. In theory, a guest would use queue ordering to guarantee
that certain writes made it to disk before other writes. Enabling
write-through guarantees that the data is actually on disk. Since we
advertise an unordered queue, we're okay from a safety point-of-view but
for performance reasons, we'll want to do ordered queuing.
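
As a rough userspace analogy of the two guarantees (this is not qemu
code; the file name and sizes are made up, it just illustrates why a
journalling filesystem wants both ordering and durability):

    /* Illustrative only.  Write journal blocks, force them to stable
     * storage, then write the commit record.  The fdatasync() stands in
     * for what queue ordering plus write-through (or a real barrier)
     * would give the guest: the commit record cannot reach the disk
     * before the journal data, and the journal data is actually durable. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char journal[512], commit[512];
        int fd;

        memset(journal, 'J', sizeof(journal));
        memset(commit, 'C', sizeof(commit));

        fd = open("journal.img", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;

        pwrite(fd, journal, sizeof(journal), 0);   /* journal blocks */
        fdatasync(fd);                             /* "barrier": order + flush */
        pwrite(fd, commit, sizeof(commit), 512);   /* commit record */
        fdatasync(fd);

        return close(fd);
    }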
>> More importantly, the most common journaled filesystem, ext3, does not
>> enable write barriers by default (even for journal updates). This is
>> how it ships in Red Hat distros.
>
> i.e. implementing barriers for virtio won't help most ext3 deployments?
Yes, ext3 doesn't use barriers by default. See
http://kerneltrap.org/mailarchive/linux-kernel/2008/5/19/1865314
> And again, if barriers are just about ordering, don't you need to
> disable caching anyway?
Well, virtio doesn't have a notion of write-caching. My thinking is
that we ought to implement barriers via fdatasync because posix-aio
already has an op for it. That would effectively treat each barrier as
a point at which data is forced out to disk. I think this would take
care of most of the data corruption issues, since the cases where the
guest cares about data integrity would be handled (barriers should be
used for journal writes and any O_DIRECT write, for instance, although,
yeah, that's not the case today with ext3).
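
Very roughly, the shape I have in mind is something like this (just a
sketch, none of these names exist in qemu; it only shows the posix-aio
call a barrier would boil down to):

    #include <aio.h>
    #include <fcntl.h>
    #include <string.h>

    /* Hypothetical helper: when the guest issues a barrier for a
     * virtio-blk device, queue an asynchronous data sync on the backing
     * image.  O_DSYNC gives fdatasync() semantics (O_SYNC would give
     * fsync()); completion would be reported back like any other aio
     * request. */
    int submit_barrier(int image_fd, struct aiocb *cb)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = image_fd;
        return aio_fsync(O_DSYNC, cb);
    }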
>> So there is no greater risk of corrupting a journal in QEMU than there
>> is on bare metal.
>
> This is the bit I really don't buy - we're equating qemu caching to IDE
> write-back caching and saying the risk of corruption is the same in
> both cases.
Yes.
> But doesn't qemu cache data for far, far longer than a typical IDE disk
> with write-back caching would do? Doesn't that mean you're far, far
> more likely to see fs corruption with qemu caching?
It caches more data; I don't know how much longer it caches it than a
typical IDE disk would. The guest can crash and that won't cause data
loss. The only thing that will really cause data loss is the host
crashing, so it's slightly better than write-back caching in that
regard.
> Or put it another way: if we fix it by implementing the disabling of
> write-back caching ... users running a virtual machine will need to run
> "hdparm -W 0 /dev/sda" where they would never have run it on bare metal?
I don't see it as something needing to be fixed because I don't see that
the exposure is significantly greater for a VM than for a real machine.
And let's take a step back too. If people are really concerned about
this point, let's introduce a sync=on option that opens the image with
O_SYNC. This will effectively make the cache write-through without the
baggage associated with O_DIRECT.
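
In open() terms the difference is roughly this (the helper and flag
handling are made up, just to illustrate):

    #include <fcntl.h>

    /* sync=on: keep using the host page cache for reads, but make writes
     * complete only once they are on stable storage (write-through).
     * cache=off would use O_DIRECT instead, bypassing the page cache
     * entirely and dragging in its alignment and copy overhead. */
    int open_image(const char *path, int sync_on)
    {
        int flags = O_RDWR;

        if (sync_on)
            flags |= O_SYNC;

        return open(path, flags);
    }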
While I object to libvirt always setting cache=off, I think sync=on for
IDE and SCSI may be reasonable (you don't want it for virtio-blk once we
implement proper barriers with fdatasync I think).
Regards,
Anthony Liguori
> Cheers,
> Mark.