[libvirt] potential bug in qemu snapshots

Right now, we create qemu snapshots to a file by using an exec: monitor command that passes 'compressor | { dd && dd; }' with stdout connected to the target file. However, since dd is using a larger bs= than PIPE_MAX, it is conceivable that under heavy machine load, that dd will get a short read from the pipe, and unless we use the GNU extension of iflag=fullblock, that short read will be padded out to the output block size and result in data corruption in the destination file. The possibility of corruption due to short reads when dd is fed input through a pipe but produces output through a large block size has occurred several times on the coreutils mailing list: http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00092.html Coreutils currently obeys the (non-intuitive) POSIX requirements for this short read behavior, and while it is being considered to make dd when POSIXLY_CORRECT is unset be more sensible, you can't rely on that being a default. Should we refuse to make snapshots if we don't detect the GNU extension of 'dd iflag=fullblock'? Probably safe to do on GNU/Linux machines, but I'm wondering if it will have negative effects on BSD machines where GNU coreutils is not installed. Is there a way to make monitor commands use fd: style migration, where we can avoid dd altogether by just passing an already open fd that has already positioned correctly via lseek()? Then again, it seems like fd: migration requires passing an open fd, which only seems possible on the command line at startup rather than something you can do on the fly with monitor commands. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 23/12/2010, at 5:35 AM, Eric Blake wrote: <snip>
Should we refuse to make snapshots if we don't detect the GNU extension of 'dd iflag=fullblock'? Probably safe to do on GNU/Linux machines, but I'm wondering if it will have negative effects on BSD machines where GNU coreutils is not installed.
As a thought, that option isn't known to the dd supplied with OSX 10.6.5: $ dd if=/dev/zero of=foo bs=1024 count=1 1+0 records in 1+0 records out 1024 bytes transferred in 0.000041 secs (24970740 bytes/sec) $ rm foo $ dd if=/dev/zero of=foo bs=1024 count=1 iflag=fullblock dd: unknown operand iflag $

On 12/22/2010 11:35 AM, Eric Blake wrote:
Right now, we create qemu snapshots to a file by using an exec: monitor command that passes 'compressor | { dd && dd; }' with stdout connected to the target file. However, since dd is using a larger bs= than PIPE_MAX, it is conceivable that under heavy machine load, that dd will get a short read from the pipe, and unless we use the GNU extension of iflag=fullblock, that short read will be padded out to the output block size and result in data corruption in the destination file.
The possibility of corruption due to short reads when dd is fed input through a pipe but produces output through a large block size has occurred several times on the coreutils mailing list: http://lists.gnu.org/archive/html/bug-coreutils/2010-11/msg00092.html Coreutils currently obeys the (non-intuitive) POSIX requirements for this short read behavior, and while it is being considered to make dd when POSIXLY_CORRECT is unset be more sensible, you can't rely on that being a default.
Should we refuse to make snapshots if we don't detect the GNU extension of 'dd iflag=fullblock'? Probably safe to do on GNU/Linux machines, but I'm wondering if it will have negative effects on BSD machines where GNU coreutils is not installed.
I've gone back and read POSIX again. It states: If the bs=expr operand is specified and no conversions other than sync, noerror, or notrunc are requested, the data returned from each input block shall be written as a separate output block; if the read returns less than a full block and the sync conversion is not specified, the resulting output block shall be the same size as the input block. If the bs=expr operand is not specified, or a conversion other than sync, noerror, or notrunc is requested, the input shall be processed and collected into full-sized output blocks until the end of the input is reached. So I stand corrected - because we aren't using conv=sync, there is no zero padding or data corruption; however, behavior can still be improved. With bs=, a short read results in a short write, so we aren't using the full 1M block size that we asked of dd, and end up with more syscalls than necessary. Swapping to ibs= obs= instead of bs= allows dd to do fewer syscalls on output. Patch coming up soon, but I'm grateful that this is not a data corruption bug after all. And the GNU extension of iflag=fullblock is only useful when mixing bs= and conv=sync, so there's no need to use that extension. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

POSIX states about dd: If the bs=expr operand is specified and no conversions other than sync, noerror, or notrunc are requested, the data returned from each input block shall be written as a separate output block; if the read returns less than a full block and the sync conversion is not specified, the resulting output block shall be the same size as the input block. If the bs=expr operand is not specified, or a conversion other than sync, noerror, or notrunc is requested, the input shall be processed and collected into full-sized output blocks until the end of the input is reached. Since we aren't using conv=trunc, there is no zero-padding, but our use of bs= means that a short read results in a short write. If instead we use ibs= and obs=, then short reads are collected and dd only has to do a single write, which can make dd more efficient. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Avoid 'dd bs=', since it can cause short writes. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. --- src/qemu/qemu_monitor_json.c | 3 ++- src/qemu/qemu_monitor_text.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 7877731..284f9c4 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,11 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " "{ dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, + "dd ibs=%llu; obs=%llu } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, safe_target) < 0) { virReportOOMError(); goto cleanup; diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 11a9391..9c139ed 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1280,10 +1280,11 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " "{ dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, + "dd ibs=%llu obs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, safe_target) < 0) { virReportOOMError(); goto cleanup; -- 1.7.3.3

On Wed, Dec 22, 2010 at 22:06, Eric Blake <eblake@redhat.com> wrote:
diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 7877731..284f9c4 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,11 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " "{ dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, + "dd ibs=%llu; obs=%llu } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX,
typo: semicolon before obs. Should be: "dd ibs=%llu obs=%llu } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, -- Pawel

On 12/30/2010 01:48 AM, Paweł Krześniak wrote:
On Wed, Dec 22, 2010 at 22:06, Eric Blake <eblake@redhat.com> wrote:
diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 7877731..284f9c4 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,11 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " "{ dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, + "dd ibs=%llu; obs=%llu } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX,
typo: semicolon before obs. Should be: "dd ibs=%llu obs=%llu } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX,
Thanks for the review. But your correction is also wrong. It should be: "dd ibs=%llu obs=%llu; } 1<>%s" -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
participants (3)
-
Eric Blake
-
Justin Clift
-
Paweł Krześniak