[libvirt] [PATCH 0/2] speed up qemu domain save by increasing dd blocksize

These two patches are posted together because applying the 2nd exposes the bug fixed by the first. Here are the results of tests I made with various block sizes before deciding the 1MB really was the best balance (all tests were done on a paused 512MB domain, saving to local disk on a Lenovo T61 laptop) BS M:SS save image size ----- ---- --------------- 2048K - 0:56 476135451 1024K - 0:56 475090953 512k - 1:02 474564173 256k - 1:10 474303797 128k - 1:25 474176859 512 - 3:47 474085423 - the original I didn't bother testing sizes between 512 and 128k, as there was still significant improvement from 128k to 256k.

The pointer to the xml describing the domain is saved into an object prior to calling VIR_REALLOC_N() to make the size of the memory it points to a multiple of QEMU_MONITOR_MIGRATE_TO_FILE_BS. If that operation needs to allocate new memory, the pointer that was saved is no longer valid. To avoid this situation, adjust the size *before* saving the pointer. (This showed up when experimenting with very large values of QEMU_MONITOR_MIGRATE_TO_FILE_BS). --- src/qemu/qemu_driver.c | 12 ++++++------ 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6f77ea0..2dc32fa 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -4959,12 +4959,6 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, is_reg = S_ISREG(sb.st_mode); } - - /* Setup hook data needed by virFileOperation hook function */ - hdata.dom = dom; - hdata.path = path; - hdata.xml = xml; - hdata.header = &header; offset = sizeof(header) + header.xml_len; /* Due to way we append QEMU state on our header with dd, @@ -4985,6 +4979,12 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, header.xml_len += pad; } + /* Setup hook data needed by virFileOperation hook function */ + hdata.dom = dom; + hdata.path = path; + hdata.xml = xml; + hdata.header = &header; + /* Write header to file, followed by XML */ /* First try creating the file as root */ -- 1.7.0.1

On 06/03/2010 09:57 PM, Laine Stump wrote:
The pointer to the xml describing the domain is saved into an object prior to calling VIR_REALLOC_N() to make the size of the memory it points to a multiple of QEMU_MONITOR_MIGRATE_TO_FILE_BS. If that operation needs to allocate new memory, the pointer that was saved is no longer valid.
To avoid this situation, adjust the size *before* saving the pointer.
ACK. Subtle bugs like that can be a bear to track down. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

See https://bugzilla.redhat.com/show_bug.cgi?id=599091 Saving a paused 512MB domain took 3m47s with the old block size of 512 bytes. Changing the block size to 1024*1024 decreased the time to 56 seconds. (Doubling again to 2048*1024 yielded 0 improvement; lowering to 512k increased the save time to 1m10s, about 20%) --- src/qemu/qemu_monitor.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 1870b22..fd39b1c 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -261,7 +261,7 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv); -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 2048) int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, -- 1.7.0.1

On Thu, Jun 03, 2010 at 11:57:33PM -0400, Laine Stump wrote:
See https://bugzilla.redhat.com/show_bug.cgi?id=599091
Saving a paused 512MB domain took 3m47s with the old block size of 512 bytes. Changing the block size to 1024*1024 decreased the time to 56 seconds. (Doubling again to 2048*1024 yielded 0 improvement; lowering to 512k increased the save time to 1m10s, about 20%)
Surely we should have gone for 1024*1024 in this case ? NB, since our XML header gets rounded up to a multiple of the block size, smaller is better, because we're filling the disk with zeros here :-)
--- src/qemu/qemu_monitor.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 1870b22..fd39b1c 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -261,7 +261,7 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv);
-# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 2048)
int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, --
Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 06/04/2010 04:28 AM, Daniel P. Berrange wrote:
On Thu, Jun 03, 2010 at 11:57:33PM -0400, Laine Stump wrote:
See https://bugzilla.redhat.com/show_bug.cgi?id=599091
Saving a paused 512MB domain took 3m47s with the old block size of 512 bytes. Changing the block size to 1024*1024 decreased the time to 56 seconds. (Doubling again to 2048*1024 yielded 0 improvement; lowering to 512k increased the save time to 1m10s, about 20%)
Surely we should have gone for 1024*1024 in this case ?
Of course! I accidentally committed before hitting save on the buffer where I switched it back to 1024 :-P (if this mail were html-ized, the emoticon would be red with embarrassment)
NB, since our XML header gets rounded up to a multiple of the block size, smaller is better, because we're filling the disk with zeros here :-)
Yep, that's why I tested all the different sizes. If only I'd stayed awake until the final step of sending the mail... So I assume you're okay with what I intended (see below)? --- src/qemu/qemu_monitor.h | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 1870b22..dd12620 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -261,7 +261,7 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv); -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 1024) int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, -- 1.7.0.1

On 06/04/2010 02:28 AM, Daniel P. Berrange wrote:
On Thu, Jun 03, 2010 at 11:57:33PM -0400, Laine Stump wrote:
See https://bugzilla.redhat.com/show_bug.cgi?id=599091
Saving a paused 512MB domain took 3m47s with the old block size of 512 bytes. Changing the block size to 1024*1024 decreased the time to 56 seconds. (Doubling again to 2048*1024 yielded 0 improvement; lowering to 512k increased the save time to 1m10s, about 20%)
Surely we should have gone for 1024*1024 in this case ?
NB, since our XML header gets rounded up to a multiple of the block size, smaller is better, because we're filling the disk with zeros here :-)
In that case, it may be better to have two constants - the preferred transfer size (1M), and the XML padding block size (512 as before, or perhaps 4k given newer disk architectures that prefer 4k), along with code that allows the final transfer to be shorter than the preferred size so long as it is still a multiple of the block size. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Fri, Jun 04, 2010 at 08:23:32AM -0600, Eric Blake wrote:
On 06/04/2010 02:28 AM, Daniel P. Berrange wrote:
On Thu, Jun 03, 2010 at 11:57:33PM -0400, Laine Stump wrote:
See https://bugzilla.redhat.com/show_bug.cgi?id=599091
Saving a paused 512MB domain took 3m47s with the old block size of 512 bytes. Changing the block size to 1024*1024 decreased the time to 56 seconds. (Doubling again to 2048*1024 yielded 0 improvement; lowering to 512k increased the save time to 1m10s, about 20%)
Surely we should have gone for 1024*1024 in this case ?
NB, since our XML header gets rounded up to a multiple of the block size, smaller is better, because we're filling the disk with zeros here :-)
In that case, it may be better to have two constants - the preferred transfer size (1M), and the XML padding block size (512 as before, or perhaps 4k given newer disk architectures that prefer 4k), along with code that allows the final transfer to be shorter than the preferred size so long as it is still a multiple of the block size.
That doesn't work because dd can't seek into the file a size less than its transfer size. ie the seek=NN parameter is specifying a multiple of the bs=XX value, not an absolute seek position in bytes. Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 06/04/2010 10:38 AM, Daniel P. Berrange wrote:
On Fri, Jun 04, 2010 at 08:23:32AM -0600, Eric Blake wrote:
In that case, it may be better to have two constants - the preferred transfer size (1M), and the XML padding block size (512 as before, or perhaps 4k given newer disk architectures that prefer 4k), along with code that allows the final transfer to be shorter than the preferred size so long as it is still a multiple of the block size.
That doesn't work because dd can't seek into the file a size less than its transfer size. ie the seek=NN parameter is specifying a multiple of the bs=XX value, not an absolute seek position in bytes.
Yup, very bothersome, but true. Maybe we should file a bug against dd ;-) So do I have the ACK or can someone think of a straightforward way to avoid the wastage?

On 06/04/2010 08:38 AM, Daniel P. Berrange wrote:
In that case, it may be better to have two constants - the preferred transfer size (1M), and the XML padding block size (512 as before, or perhaps 4k given newer disk architectures that prefer 4k), along with code that allows the final transfer to be shorter than the preferred size so long as it is still a multiple of the block size.
That doesn't work because dd can't seek into the file a size less than its transfer size. ie the seek=NN parameter is specifying a multiple of the bs=XX value, not an absolute seek position in bytes.
One invocation of dd can't. But two invocations can! Thanks to the fact that fd offsets are one of the few things that a child process propagates back to the parent, when an fd is shared among processes. The trick is to use the first dd to position the fd, then the second dd to write larger blocks that are unaligned per the block size, but still aligned based on the fundamental block size. For an example: $ perl -e 'print "1"x20' > sample1 $ { dd bs=5 seek=1 if=/dev/null; dd bs=10 count=1 if=sample1; } > sample2 0+0 records in 0+0 records out 0 bytes (0 B) copied, 2.2349e-05 s, 0.0 kB/s 1+0 records in 1+0 records out 10 bytes (10 B) copied, 6.4604e-05 s, 155 kB/s $ tr '\0' 0 < sample2 000001111111111 See - I produced sample2 by skipping past a smaller header (in this case, bs=5), then writing a single larger block (bs=10, a multiple of the smaller size, but not aligned to the smaller offset). Would you like me to try and write this up into a followup patch? -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding. * src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. --- This should result in much less wasted space (but I don't have a good environment set up for testing migration at the moment; Laine, would you mind testing this patch, since you wrote the last one?). Also, should we bump QEMU_MONITOR_MIGRATE_TO_FILE_BS to 4k, for the benefit of newer disks that have 4k blocks? src/qemu/qemu_driver.c | 7 ------- src/qemu/qemu_monitor.h | 12 ++++++++---- src/qemu/qemu_monitor_json.c | 13 +++++++++---- src/qemu/qemu_monitor_text.c | 13 +++++++++---- 4 files changed, 26 insertions(+), 19 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index bc8dcfa..4edf5ec 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -5000,13 +5000,6 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, * we need to ensure there's a 512 byte boundary. Unfortunately * we don't have an explicit offset in the header, so we fake * it by padding the XML string with NULLs. - * - * XXX: This means there will be (QEMU_MONITOR_MIGRATE_TO_FILE_BS - * - strlen(xml)) bytes of wastage in each file. - * Unfortunately, a large BS is needed for reasonable - * performance. It would be nice to find a replacement for dd - * that could specify the start offset in bytes rather than - * blocks, to eliminate this waste. */ if (offset % QEMU_MONITOR_MIGRATE_TO_FILE_BS) { unsigned long long pad = diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index e03b4df..8391ebd 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -1,7 +1,7 @@ /* * qemu_monitor.h: interaction with QEMU monitor console * - * Copyright (C) 2006-2009 Red Hat, Inc. + * Copyright (C) 2006-2010 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -261,10 +261,14 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv); -/* In general, a larger BS means better domain save performance, - * at the expense of a larger resulting file - see qemu_driver.c +/* In general, BS is the smallest fundamental block size we can use to + * access a block device; everything must be aligned to a multiple of + * this. However, operating on blocks this small is painfully slow, + * so we also have a transfer size that is larger but only aligned to + * the smaller block size. */ -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 1024) +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE (1024llu * 1024) int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e89ad43..5e6cedd 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,15 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } >%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index c0ebe5f..b914c94 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1273,10 +1273,15 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } >%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } -- 1.7.0.1

On Wed, Jun 09, 2010 at 08:33:49PM -0600, Eric Blake wrote:
Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding.
* src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
This should result in much less wasted space (but I don't have a good environment set up for testing migration at the moment; Laine, would you mind testing this patch, since you wrote the last one?).
Also, should we bump QEMU_MONITOR_MIGRATE_TO_FILE_BS to 4k, for the benefit of newer disks that have 4k blocks?
src/qemu/qemu_driver.c | 7 ------- src/qemu/qemu_monitor.h | 12 ++++++++---- src/qemu/qemu_monitor_json.c | 13 +++++++++---- src/qemu/qemu_monitor_text.c | 13 +++++++++---- 4 files changed, 26 insertions(+), 19 deletions(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index bc8dcfa..4edf5ec 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -5000,13 +5000,6 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, * we need to ensure there's a 512 byte boundary. Unfortunately * we don't have an explicit offset in the header, so we fake * it by padding the XML string with NULLs. - * - * XXX: This means there will be (QEMU_MONITOR_MIGRATE_TO_FILE_BS - * - strlen(xml)) bytes of wastage in each file. - * Unfortunately, a large BS is needed for reasonable - * performance. It would be nice to find a replacement for dd - * that could specify the start offset in bytes rather than - * blocks, to eliminate this waste. */ if (offset % QEMU_MONITOR_MIGRATE_TO_FILE_BS) { unsigned long long pad = diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index e03b4df..8391ebd 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -1,7 +1,7 @@ /* * qemu_monitor.h: interaction with QEMU monitor console * - * Copyright (C) 2006-2009 Red Hat, Inc. + * Copyright (C) 2006-2010 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -261,10 +261,14 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv);
-/* In general, a larger BS means better domain save performance, - * at the expense of a larger resulting file - see qemu_driver.c +/* In general, BS is the smallest fundamental block size we can use to + * access a block device; everything must be aligned to a multiple of + * this. However, operating on blocks this small is painfully slow, + * so we also have a transfer size that is larger but only aligned to + * the smaller block size. */ -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 1024) +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE (1024llu * 1024)
int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e89ad43..5e6cedd 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,15 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, goto cleanup; }
- if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } >%s",
Does this work with block devices as the target ? We previously switched from cat>> to dd, because it didn't work correctly with block devs. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 06/10/2010 04:16 AM, Daniel P. Berrange wrote:
- if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } >%s",
Does this work with block devices as the target ? We previously switched from cat>> to dd, because it didn't work correctly with block devs.
Yes - the problem with >>dev was that it appended (which doesn't make sense for a block device); while >dev opens a seekable fd. One difference between 'dd >file' and 'dd of=file' is which process opens the device; but to share the fd between two dd invocations, we have to open the file in the shell rather than with of=file. The other difference is that >file truncates; for block devices, truncation is a no-op, but for regular files, this wipes out any pre-existing contents (such as the header we are skipping over) - is that an issue? If so, then we should use <>file instead of >file, to open a read-write fd without forcing truncation (even if we only use the fd for writing). And from the coreutils manual, 'info dd', this example validates the use of dual dd invocations on a block device (although it goes the opposite direction with a seek larger than the transfer size): Use different `dd' invocations to use different block sizes for skipping and I/O. For example, the following shell commands copy data in 512 KiB blocks between a disk and a tape, but do not save or restore a 4 KiB label at the start of the disk: disk=/dev/rdsk/c0t1d0s2 tape=/dev/rmt/0 # Copy all but the label from disk to tape. (dd bs=4k skip=1 count=0 && dd bs=512k) <$disk >$tape # Copy from tape back to disk, but leave the disk label alone. (dd bs=4k seek=1 count=0 && dd bs=512k) <$tape >$disk -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Thu, Jun 10, 2010 at 05:03:10AM -0600, Eric Blake wrote:
On 06/10/2010 04:16 AM, Daniel P. Berrange wrote:
- if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } >%s",
Does this work with block devices as the target ? We previously switched from cat>> to dd, because it didn't work correctly with block devs.
Yes - the problem with >>dev was that it appended (which doesn't make sense for a block device); while >dev opens a seekable fd. One difference between 'dd >file' and 'dd of=file' is which process opens the device; but to share the fd between two dd invocations, we have to open the file in the shell rather than with of=file. The other difference is that >file truncates; for block devices, truncation is a no-op, but for regular files, this wipes out any pre-existing contents (such as the header we are skipping over) - is that an issue? If so, then we should use <>file instead of >file, to open a read-write fd without forcing truncation (even if we only use the fd for writing).
Yes, we need to avoid truncation - the seek is there to skip over the metadata header, so we can't truncate that Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On 06/10/2010 05:20 AM, Daniel P. Berrange wrote:
difference is that >file truncates; for block devices, truncation is a no-op, but for regular files, this wipes out any pre-existing contents (such as the header we are skipping over) - is that an issue? If so, then we should use <>file instead of >file, to open a read-write fd without forcing truncation (even if we only use the fd for writing).
Yes, we need to avoid truncation - the seek is there to skip over the metadata header, so we can't truncate that
Simple test on regular files: $ perl -e 'print "1"x12' > sample1 $ perl -e 'print "2"x12' > sample2 $ cat sample1 | { dd bs=5 seek=1 if=/dev/null && dd bs=10; } 1<>sample2 0+0 records in 0+0 records out 0 bytes (0 B) copied, 7.054e-06 s, 0.0 kB/s 1+1 records in 1+1 records out 12 bytes (12 B) copied, 2.8006e-05 s, 428 kB/s $ cat sample2 22222111111111111 I'll adjust the patch accordingly (again, since I'm on the road today, I'm not in the easiest position to test actual migration to a block device at the moment, so I'd appreciate someone else testing libvirt's usage of this same paradigm). -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding. * src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. --- Interdiff from v1 appears below the patch proper. Basically, change to a non-truncating open for regular files. src/qemu/qemu_driver.c | 7 ------- src/qemu/qemu_monitor.h | 12 ++++++++---- src/qemu/qemu_monitor_json.c | 14 ++++++++++---- src/qemu/qemu_monitor_text.c | 14 ++++++++++---- 4 files changed, 28 insertions(+), 19 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index bc8dcfa..4edf5ec 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -5000,13 +5000,6 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, * we need to ensure there's a 512 byte boundary. Unfortunately * we don't have an explicit offset in the header, so we fake * it by padding the XML string with NULLs. - * - * XXX: This means there will be (QEMU_MONITOR_MIGRATE_TO_FILE_BS - * - strlen(xml)) bytes of wastage in each file. - * Unfortunately, a large BS is needed for reasonable - * performance. It would be nice to find a replacement for dd - * that could specify the start offset in bytes rather than - * blocks, to eliminate this waste. */ if (offset % QEMU_MONITOR_MIGRATE_TO_FILE_BS) { unsigned long long pad = diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index e03b4df..8391ebd 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -1,7 +1,7 @@ /* * qemu_monitor.h: interaction with QEMU monitor console * - * Copyright (C) 2006-2009 Red Hat, Inc. + * Copyright (C) 2006-2010 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -261,10 +261,14 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv); -/* In general, a larger BS means better domain save performance, - * at the expense of a larger resulting file - see qemu_driver.c +/* In general, BS is the smallest fundamental block size we can use to + * access a block device; everything must be aligned to a multiple of + * this. However, operating on blocks this small is painfully slow, + * so we also have a transfer size that is larger but only aligned to + * the smaller block size. */ -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 1024) +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS 512llu +# define QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE (1024llu * 1024) int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e89ad43..40218aa 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,16 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index c0ebe5f..c086962 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1273,10 +1273,16 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } -- 1.7.0.1 Interdiff from v1: diff --git i/src/qemu/qemu_monitor_json.c w/src/qemu/qemu_monitor_json.c index 5e6cedd..40218aa 100644 --- i/src/qemu/qemu_monitor_json.c +++ w/src/qemu/qemu_monitor_json.c @@ -1697,9 +1697,10 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, /* Two dd processes, sharing the same stdout, are necessary to * allow starting at an alignment of 512, but without wasting - * padding to get to the larger alignment useful for speed. */ + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } >%s", + "dd bs=%llu; } 1<>%s", argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, diff --git i/src/qemu/qemu_monitor_text.c w/src/qemu/qemu_monitor_text.c index b914c94..c086962 100644 --- i/src/qemu/qemu_monitor_text.c +++ w/src/qemu/qemu_monitor_text.c @@ -1275,9 +1275,10 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, /* Two dd processes, sharing the same stdout, are necessary to * allow starting at an alignment of 512, but without wasting - * padding to get to the larger alignment useful for speed. */ + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } >%s", + "dd bs=%llu; } 1<>%s", argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE,

Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding. * src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. --- v2: avoid truncating regular file destination v3: change minimum block size from 512 to 4k, to avoid unaligned RMW cycles on newer disks with 4k sectors src/qemu/qemu_driver.c | 7 ------- src/qemu/qemu_monitor.h | 14 ++++++++++---- src/qemu/qemu_monitor_json.c | 14 ++++++++++---- src/qemu/qemu_monitor_text.c | 14 ++++++++++---- 4 files changed, 30 insertions(+), 19 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index bc8dcfa..4edf5ec 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -5000,13 +5000,6 @@ static int qemudDomainSaveFlag(virDomainPtr dom, const char *path, * we need to ensure there's a 512 byte boundary. Unfortunately * we don't have an explicit offset in the header, so we fake * it by padding the XML string with NULLs. - * - * XXX: This means there will be (QEMU_MONITOR_MIGRATE_TO_FILE_BS - * - strlen(xml)) bytes of wastage in each file. - * Unfortunately, a large BS is needed for reasonable - * performance. It would be nice to find a replacement for dd - * that could specify the start offset in bytes rather than - * blocks, to eliminate this waste. */ if (offset % QEMU_MONITOR_MIGRATE_TO_FILE_BS) { unsigned long long pad = diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index e03b4df..6c104f1 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -1,7 +1,7 @@ /* * qemu_monitor.h: interaction with QEMU monitor console * - * Copyright (C) 2006-2009 Red Hat, Inc. + * Copyright (C) 2006-2010 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -261,10 +261,16 @@ int qemuMonitorMigrateToCommand(qemuMonitorPtr mon, unsigned int background, const char * const *argv); -/* In general, a larger BS means better domain save performance, - * at the expense of a larger resulting file - see qemu_driver.c +/* In general, BS is the smallest fundamental block size we can use to + * access a block device; everything must be aligned to a multiple of + * this. Linux generally supports a BS as small as 512, but with + * newer disks with 4k sectors, performance is better if we guarantee + * alignment to the sector size. However, operating on BS-sized + * blocks is painfully slow, so we also have a transfer size that is + * larger but only aligned to the smaller block size. */ -# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 1024) +# define QEMU_MONITOR_MIGRATE_TO_FILE_BS (1024llu * 4) +# define QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE (1024llu * 1024) int qemuMonitorMigrateToFile(qemuMonitorPtr mon, unsigned int background, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e89ad43..40218aa 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1695,10 +1695,16 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index c0ebe5f..c086962 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1273,10 +1273,16 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, goto cleanup; } - if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS) < 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. Use + * <> redirection to avoid truncating a regular file. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target) < 0) { virReportOOMError(); goto cleanup; } -- 1.7.0.1

On 06/10/2010 03:31 PM, Eric Blake wrote:
Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding.
* src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
v2: avoid truncating regular file destination
v3: change minimum block size from 512 to 4k, to avoid unaligned RMW cycles on newer disks with 4k sectors
Okay, I've finally gotten around to trying this out, and in addition to reducing the file size, it also works! ACK (Note: I haven't tested saving to a block device, so I'm taking the word of the dd documentation that this works as well ;-)

On 06/17/2010 01:32 AM, Laine Stump wrote:
On 06/10/2010 03:31 PM, Eric Blake wrote:
Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding.
* src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
v2: avoid truncating regular file destination
v3: change minimum block size from 512 to 4k, to avoid unaligned RMW cycles on newer disks with 4k sectors
Okay, I've finally gotten around to trying this out, and in addition to reducing the file size, it also works!
ACK
(Note: I haven't tested saving to a block device, so I'm taking the word of the dd documentation that this works as well ;-)
Thanks; pushed. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 06/10/2010 06:31 AM, Eric Blake wrote:
Followup to https://bugzilla.redhat.com/show_bug.cgi?id=599091, commit 20206a4b, to reduce disk waste in padding.
* src/qemu/qemu_monitor.h (QEMU_MONITOR_MIGRATE_TO_FILE_BS): Drop back to 512. (QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE): New macro. * src/qemu/qemu_driver.c (qemudDomainSaveFlag): Update comment. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use two invocations of dd to output non-aligned large blocks. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise.
- if (virAsprintf(&dest, "exec:%s | dd of=%s bs=%llu seek=%llu", - argstr, safe_target, - QEMU_MONITOR_MIGRATE_TO_FILE_BS, - offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS)< 0) { + /* Two dd processes, sharing the same stdout, are necessary to + * allow starting at an alignment of 512, but without wasting + * padding to get to the larger alignment useful for speed. Use + *<> redirection to avoid truncating a regular file. */ + if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null&& " + "dd bs=%llu; } 1<>%s", + argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, + offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, + QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, + safe_target)< 0) {
On IRC, Matthias pointed out that stock dash 0.5.5 has a bug where using the '1<>file' redirection operator mistakenly truncates file. I hadn't seen it in my testing, because I was on a system where /bin/sh is bash, and bash is immune, as is zsh 4.3.10 and Solaris /bin/sh. However, when /bin/sh is a buggy dash, then this mistaken truncation clobbers the header, breaking managed save of qemu guests. The redirection bug has since been fixed in dash 0.5.6, and that fix has been backported to at least Ubuntu's dash 0.5.5 build. However, it raises the question - should libvirt be testing for this shell bug for the benefit of people hand-building new libvirt on systems with buggy /bin/sh (such as Ubuntu 10.04), and if so, should I work on a patch to provide a workaround? -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

* configure.ac (VIR_WRAPPER_SHELL): Define to a replacement shell, if /bin/sh is broken on <> redirection. * src/qemu/qemu_monitor.h (VIR_WRAPPER_SHELL_PREFIX) (VIR_WRAPPER_SHELL_SUFFIX): New macros. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. --- Took longer than I thought, but this should do the trick with no overhead on decent systems, and while still avoiding buggy dash == /bin/sh with something that works elsewhere. This passes for me on Fedora, where /bin/sh is bash; testing on Ubuntu or other debian-based system where /bin/sh is dash 0.5.5 would be appreciated. configure.ac | 50 ++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor.h | 13 +++++++++++ src/qemu/qemu_monitor_json.c | 5 ++- src/qemu/qemu_monitor_text.c | 5 ++- 4 files changed, 69 insertions(+), 4 deletions(-) diff --git a/configure.ac b/configure.ac index e41f2b5..6aa09f7 100644 --- a/configure.ac +++ b/configure.ac @@ -583,6 +583,56 @@ if test "$with_lxc" = "yes" ; then fi AM_CONDITIONAL([WITH_LXC], [test "$with_lxc" = "yes"]) +dnl +dnl check for shell that understands <> redirection without truncation, +dnl needed by src/qemu/qemu_monitor_{text,json}.c. +dnl +if test "$with_qemu" = yes; then + lv_wrapper_shell= + AC_CACHE_CHECK([for shell that supports <> redirection], + [lv_cv_wrapper_shell], + [ + # If cross-compiling, guess that /bin/sh is good enough except for + # Linux, where it might be dash 0.5.5 which is known broken; and on + # Linux, we have a good chance that /bin/bash will exist. + # If we guess wrong, a user can override the cache variable. + # Going through /bin/bash is a slight slowdown if /bin/sh works. + if test "$cross_compiling" = yes; then + case $host_os in + linux*) lv_cv_wrapper_shell=/bin/bash ;; + *) lv_cv_wrapper_shell=/bin/sh ;; + esac + else + for lv_cv_wrapper_shell in /bin/sh bash ksh zsh none; do + test $lv_cv_wrapper_shell = none && + AC_MSG_ERROR([could not find decent shell]) + echo a > conftest.a + $lv_cv_wrapper_shell -c ': 1<>conftest.a' + case `cat conftest.a`.$lv_cv_wrapper_shell in + a./*) break;; dnl /bin/sh is good enough + a.*) dnl bash, ksh, and zsh all understand 'command', use that + dnl to determine the absolute path of the shell + lv_cv_wrapper_shell=`$lv_cv_wrapper_shell -c \ + 'command -v $lv_cv_wrapper_shell'` + case $lv_cv_wrapper_shell in + /*) break;; + esac + ;; + esac + done + rm -f conftest.a + fi + ]) + if test "x$lv_cv_wrapper_shell" != x/bin/sh; then + lv_wrapper_shell=$lv_cv_wrapper_shell + fi + if test "x$lv_wrapper_shell" != x; then + AC_DEFINE_UNQUOTED([VIR_WRAPPER_SHELL], [$lv_wrapper_shell], + [Define to the absolute path of a shell that does not truncate on + <> redirection, if /bin/sh does not fit the bill]) + fi +fi + dnl dnl check for kernel headers required by src/bridge.c diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 48f4c20..7d09145 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -391,4 +391,17 @@ int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply); +/** + * When running two dd process and using <> redirection, we need a + * shell that will not truncate files. These two strings serve that + * purpose. + */ +# ifdef VIR_WRAPPER_SHELL +# define VIR_WRAPPER_SHELL_PREFIX VIR_WRAPPER_SHELL " -c '" +# define VIR_WRAPPER_SHELL_SUFFIX "'" +# else +# define VIR_WRAPPER_SHELL_PREFIX /* nothing */ +# define VIR_WRAPPER_SHELL_SUFFIX /* nothing */ +# endif + #endif /* QEMU_MONITOR_H */ diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index d3ab25f..d2c6f0a 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1698,8 +1698,9 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, * allow starting at an alignment of 512, but without wasting * padding to get to the larger alignment useful for speed. Use * <> redirection to avoid truncating a regular file. */ - if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s", + if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " + "{ dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 69971a6..d7e128c 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1277,8 +1277,9 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, * allow starting at an alignment of 512, but without wasting * padding to get to the larger alignment useful for speed. Use * <> redirection to avoid truncating a regular file. */ - if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null && " - "dd bs=%llu; } 1<>%s", + if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " + "{ dd bs=%llu seek=%llu if=/dev/null && " + "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, -- 1.7.2.3

On 10/22/2010 07:29 PM, Eric Blake wrote:
* configure.ac (VIR_WRAPPER_SHELL): Define to a replacement shell, if /bin/sh is broken on<> redirection. * src/qemu/qemu_monitor.h (VIR_WRAPPER_SHELL_PREFIX) (VIR_WRAPPER_SHELL_SUFFIX): New macros. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
Took longer than I thought, but this should do the trick with no overhead on decent systems, and while still avoiding buggy dash == /bin/sh with something that works elsewhere.
This passes for me on Fedora, where /bin/sh is bash; testing on Ubuntu or other debian-based system where /bin/sh is dash 0.5.5 would be appreciated.
configure.ac | 50 ++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor.h | 13 +++++++++++ src/qemu/qemu_monitor_json.c | 5 ++- src/qemu/qemu_monitor_text.c | 5 ++- 4 files changed, 69 insertions(+), 4 deletions(-)
diff --git a/configure.ac b/configure.ac index e41f2b5..6aa09f7 100644 --- a/configure.ac +++ b/configure.ac @@ -583,6 +583,56 @@ if test "$with_lxc" = "yes" ; then fi AM_CONDITIONAL([WITH_LXC], [test "$with_lxc" = "yes"])
+dnl +dnl check for shell that understands<> redirection without truncation, +dnl needed by src/qemu/qemu_monitor_{text,json}.c. +dnl +if test "$with_qemu" = yes; then + lv_wrapper_shell= + AC_CACHE_CHECK([for shell that supports<> redirection], + [lv_cv_wrapper_shell], + [ + # If cross-compiling, guess that /bin/sh is good enough except for + # Linux, where it might be dash 0.5.5 which is known broken; and on + # Linux, we have a good chance that /bin/bash will exist. + # If we guess wrong, a user can override the cache variable. + # Going through /bin/bash is a slight slowdown if /bin/sh works. + if test "$cross_compiling" = yes; then + case $host_os in + linux*) lv_cv_wrapper_shell=/bin/bash ;; + *) lv_cv_wrapper_shell=/bin/sh ;; + esac + else + for lv_cv_wrapper_shell in /bin/sh bash ksh zsh none; do + test $lv_cv_wrapper_shell = none&& + AC_MSG_ERROR([could not find decent shell]) + echo a> conftest.a + $lv_cv_wrapper_shell -c ': 1<>conftest.a' + case `cat conftest.a`.$lv_cv_wrapper_shell in + a./*) break;; dnl /bin/sh is good enough + a.*) dnl bash, ksh, and zsh all understand 'command', use that + dnl to determine the absolute path of the shell + lv_cv_wrapper_shell=`$lv_cv_wrapper_shell -c \ + 'command -v $lv_cv_wrapper_shell'` + case $lv_cv_wrapper_shell in + /*) break;; + esac + ;; + esac + done + rm -f conftest.a + fi + ]) + if test "x$lv_cv_wrapper_shell" != x/bin/sh; then + lv_wrapper_shell=$lv_cv_wrapper_shell + fi + if test "x$lv_wrapper_shell" != x; then + AC_DEFINE_UNQUOTED([VIR_WRAPPER_SHELL], [$lv_wrapper_shell], + [Define to the absolute path of a shell that does not truncate on +<> redirection, if /bin/sh does not fit the bill]) + fi +fi +
dnl dnl check for kernel headers required by src/bridge.c diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 48f4c20..7d09145 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -391,4 +391,17 @@ int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name);
int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply);
+/** + * When running two dd process and using<> redirection, we need a + * shell that will not truncate files. These two strings serve that + * purpose. + */ +# ifdef VIR_WRAPPER_SHELL +# define VIR_WRAPPER_SHELL_PREFIX VIR_WRAPPER_SHELL " -c '" +# define VIR_WRAPPER_SHELL_SUFFIX "'" +# else +# define VIR_WRAPPER_SHELL_PREFIX /* nothing */ +# define VIR_WRAPPER_SHELL_SUFFIX /* nothing */ +# endif + #endif /* QEMU_MONITOR_H */ diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index d3ab25f..d2c6f0a 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -1698,8 +1698,9 @@ int qemuMonitorJSONMigrateToFile(qemuMonitorPtr mon, * allow starting at an alignment of 512, but without wasting * padding to get to the larger alignment useful for speed. Use *<> redirection to avoid truncating a regular file. */ - if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null&& " - "dd bs=%llu; } 1<>%s", + if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " + "{ dd bs=%llu seek=%llu if=/dev/null&& " + "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE, diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 69971a6..d7e128c 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -1277,8 +1277,9 @@ int qemuMonitorTextMigrateToFile(qemuMonitorPtr mon, * allow starting at an alignment of 512, but without wasting * padding to get to the larger alignment useful for speed. Use *<> redirection to avoid truncating a regular file. */ - if (virAsprintf(&dest, "exec:%s | { dd bs=%llu seek=%llu if=/dev/null&& " - "dd bs=%llu; } 1<>%s", + if (virAsprintf(&dest, "exec:" VIR_WRAPPER_SHELL_PREFIX "%s | " + "{ dd bs=%llu seek=%llu if=/dev/null&& " + "dd bs=%llu; } 1<>%s" VIR_WRAPPER_SHELL_SUFFIX, argstr, QEMU_MONITOR_MIGRATE_TO_FILE_BS, offset / QEMU_MONITOR_MIGRATE_TO_FILE_BS, QEMU_MONITOR_MIGRATE_TO_FILE_TRANSFER_SIZE,
Cannot test it, but looks good. ACK. Stefan

On 10/23/2010 10:29 AM, Eric Blake wrote:
* configure.ac (VIR_WRAPPER_SHELL): Define to a replacement shell, if /bin/sh is broken on <> redirection. * src/qemu/qemu_monitor.h (VIR_WRAPPER_SHELL_PREFIX) (VIR_WRAPPER_SHELL_SUFFIX): New macros. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
Took longer than I thought, but this should do the trick with no overhead on decent systems, and while still avoiding buggy dash == /bin/sh with something that works elsewhere.
This passes for me on Fedora, where /bin/sh is bash; testing on Ubuntu or other debian-based system where /bin/sh is dash 0.5.5 would be appreciated.
No obvious errors crop up when testing this on Ubuntu 10.04: virsh # managedsave 1 Domain 1 state saved by libvirt virsh # As a usability thing though, there doesn't seem to be a matching "managedrestore" command in virsh, and the "help managedsave" command is bit light on "how do we restart a saved domain?": # help managedsave NAME managedsave - managed save of a domain state SYNOPSIS managedsave <domain> DESCRIPTION Save and stop a running domain, so libvirt can restart it from the same state OPTIONS [--domain] <string> domain name, id or uuid Are we supposed to restart it using "start" or something? (we need to mention this, which I can follow up with a patch)

On 10/25/2010 08:28 PM, Justin Clift wrote:
virsh # managedsave 1 Domain 1 state saved by libvirt
virsh #
As a usability thing though, there doesn't seem to be a matching "managedrestore" command in virsh, and the "help managedsave" command is bit light on "how do we restart a saved domain?":
A successful managedsave will leave an image file in a standard location that the "start" command checks - if there's a file there, it restores the saved image, if not, it does a power-on. That's actually an important part of testing the save, because it will tell you if the saved image is actually usable.

On 10/25/2010 06:28 PM, Justin Clift wrote:
On 10/23/2010 10:29 AM, Eric Blake wrote:
* configure.ac (VIR_WRAPPER_SHELL): Define to a replacement shell, if /bin/sh is broken on<> redirection. * src/qemu/qemu_monitor.h (VIR_WRAPPER_SHELL_PREFIX) (VIR_WRAPPER_SHELL_SUFFIX): New macros. * src/qemu/qemu_monitor_text.c (qemuMonitorTextMigrateToFile): Use them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONMigrateToFile): Likewise. ---
Took longer than I thought, but this should do the trick with no overhead on decent systems, and while still avoiding buggy dash == /bin/sh with something that works elsewhere.
This passes for me on Fedora, where /bin/sh is bash; testing on Ubuntu or other debian-based system where /bin/sh is dash 0.5.5 would be appreciated.
No obvious errors crop up when testing this on Ubuntu 10.04:
virsh # managedsave 1 Domain 1 state saved by libvirt
Thanks for testing, and for Stefan's review. I've pushed this now.
As a usability thing though, there doesn't seem to be a matching "managedrestore" command in virsh, and the "help managedsave" command is bit light on "how do we restart a saved domain?":
# help managedsave NAME managedsave - managed save of a domain state
SYNOPSIS managedsave<domain>
DESCRIPTION Save and stop a running domain, so libvirt can restart it from the same state
Maybe wording along these lines would help: help managedsave ... Save and stop a running domain, so the next start will reuse the same state. help start ... Start a domain, either from the last managedsave state, or via a fresh boot.
Are we supposed to restart it using "start" or something? (we need to mention this, which I can follow up with a patch)
Yes, a followup patch would be nice. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

--- Note that this includes carriage returns in the output string at appropriate locations. This lets us have reasonably verbose output, with good descriptions of the purpose of each command, that also looks decent without wrapping. tools/virsh.c | 9 +++++++-- 1 files changed, 7 insertions(+), 2 deletions(-) diff --git a/tools/virsh.c b/tools/virsh.c index a182d4c..11efaea 100644 --- a/tools/virsh.c +++ b/tools/virsh.c @@ -1359,7 +1359,9 @@ cmdUndefine(vshControl *ctl, const vshCmd *cmd) */ static const vshCmdInfo info_start[] = { {"help", N_("start a (previously defined) inactive domain")}, - {"desc", N_("Start a domain.")}, + {"desc", N_("Start a domain, either from the last managedsave\n" + " state, or via a fresh boot if no managedsave state\n" + " is present.")}, {NULL, NULL} }; @@ -1462,7 +1464,10 @@ cmdSave(vshControl *ctl, const vshCmd *cmd) */ static const vshCmdInfo info_managedsave[] = { {"help", N_("managed save of a domain state")}, - {"desc", N_("Save and stop a running domain, so libvirt can restart it from the same state")}, + {"desc", N_("Save and destroy a running domain, so it can be restarted from\n" + " the same state at a later time. When the virsh 'start'\n" + " command is next run for the domain, it will automatically\n" + " be started from this saved state.")}, {NULL, NULL} }; -- 1.7.2.3

On 10/27/2010 10:53 PM, Justin Clift wrote:
--- Note that this includes carriage returns in the output string at
Actually, \n is newline; \r is carriage return.
appropriate locations. This lets us have reasonably verbose output, with good descriptions of the purpose of each command, that also looks decent without wrapping. tools/virsh.c | 9 +++++++-- 1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/tools/virsh.c b/tools/virsh.c index a182d4c..11efaea 100644 --- a/tools/virsh.c +++ b/tools/virsh.c @@ -1359,7 +1359,9 @@ cmdUndefine(vshControl *ctl, const vshCmd *cmd) */ static const vshCmdInfo info_start[] = { {"help", N_("start a (previously defined) inactive domain")}, - {"desc", N_("Start a domain.")}, + {"desc", N_("Start a domain, either from the last managedsave\n" + " state, or via a fresh boot if no managedsave state\n" + " is present.")}, {NULL, NULL} };
@@ -1462,7 +1464,10 @@ cmdSave(vshControl *ctl, const vshCmd *cmd) */ static const vshCmdInfo info_managedsave[] = { {"help", N_("managed save of a domain state")}, - {"desc", N_("Save and stop a running domain, so libvirt can restart it from the same state")}, + {"desc", N_("Save and destroy a running domain, so it can be restarted from\n" + " the same state at a later time. When the virsh 'start'\n" + " command is next run for the domain, it will automatically\n" + " be started from this saved state.")},
ACK - both descriptions look okay to me. But before you push, be sure virsh.pod also has similar wording. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 10/29/2010 01:35 AM, Eric Blake wrote:
ACK - both descriptions look okay to me. But before you push, be sure virsh.pod also has similar wording.
Thanks, pushed. :) Good point about the .pod file too. Adjusted the descriptions in there as follows: For managedsave Save and destroy a running domain, so it can be restarted from the same state at a later time. When the virsh start command is next run for the domain, it will automatically be started from this saved state. For managedsave-remove Remove the managedsave state file for a domain, if it exists. This ensures the domain will do a full boot the next time it is started. For start Start a (previously defined) inactive domain, either from the last managedsave state, or via a fresh boot if no managedsave state is present. The domain will be paused if the --paused option is used and supported by the driver; otherwise it will be running. If --console is requested, attach to the console after creation.
participants (5)
-
Daniel P. Berrange
-
Eric Blake
-
Justin Clift
-
Laine Stump
-
Stefan Berger