Re: [libvirt] [Qemu-devel] [PATCH v4] XBZRLE delta for live migration of large memory apps

On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
From: Aidan Shribman <aidan.shribman@sap.com>
By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime and total live-migration time for VMs running memory-write-intensive workloads typical of large enterprise applications such as SAP ERP systems, and more generally for any application with a sparse memory-update pattern.
On the sender side XBZRLE is used as a compact delta encoding of page updates, retrieving the old page content from an LRU cache (default size of 64 MB). The receiving side uses the existing page content and XBZRLE to decode the new page content.
Work was originally based on research results published at VEE 2011: "Evaluation of Delta Compression Techniques for Efficient Live Migration of Large Virtual Machines" by Benoit, Svard, Tordsson and Elmroth. The delta encoder was subsequently improved, replacing the original XBRLE with XBZRLE.
XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads, making it ideal for in-line, real-time encoding such as is needed for live migration.
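To make the encoding concrete, here is a minimal byte-granularity sketch of the XOR + zero-run idea (the actual patch works on 64-bit words and uses a different wire format; all names below are illustrative only). Unchanged bytes XOR to zero, so a sparsely updated page collapses to a few short literal runs; a decoder walks the same (zrun, nzrun, data) records to rebuild the page from the cached old copy:

    #include <stdint.h>
    #include <string.h>

    /* Delta-encode one page as (zero-run, literal-run) records over the
     * implicit XOR of old and new content.  Returns the encoded length,
     * or -1 if the delta would not beat sending the raw page. */
    static int delta_encode(uint8_t *out, const uint8_t *old,
                            const uint8_t *new, int len)
    {
        int i = 0, d = 0;

        while (i < len) {
            int zrun = 0, nzrun = 0;

            /* run of unchanged bytes: store only its length */
            while (i + zrun < len && old[i + zrun] == new[i + zrun]) {
                zrun++;
            }
            i += zrun;
            /* run of changed bytes: store length plus the new data */
            while (i + nzrun < len && old[i + nzrun] != new[i + nzrun]) {
                nzrun++;
            }
            /* overflow: raw page is cheaper, caller sends it uncompressed */
            if (d + 4 + nzrun >= len) {
                return -1;
            }
            out[d++] = zrun & 0xff;
            out[d++] = zrun >> 8;
            out[d++] = nzrun & 0xff;
            out[d++] = nzrun >> 8;
            memcpy(out + d, new + i, nzrun);
            d += nzrun;
            i += nzrun;
        }
        return d;
    }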
A typical usage scenario:

    {qemu} migrate_set_cachesize 256m
    {qemu} migrate -x -d tcp:destination.host:4444
    {qemu} info migrate
    ...
    transferred ram-duplicate: A kbytes
    transferred ram-duplicate: B pages
    transferred ram-normal: C kbytes
    transferred ram-normal: D pages
    transferred ram-xbrle: E kbytes
    transferred ram-xbrle: F pages
    overflow ram-xbrle: G pages
    cache-hit ram-xbrle: H pages
    cache-lookup ram-xbrle: J pages
Testing: live migration with XBZRLE completed in 110 seconds; without XBZRLE, live migration was not able to complete.
A simple synthetic memory r/w load generator:

    #include <stdlib.h>
    #include <stdio.h>

    int main()
    {
        char *buf = (char *) calloc(4096, 4096);
        while (1) {
            int i;
            for (i = 0; i < 4096 * 4; i++) {
                buf[i * 4096 / 4]++;
            }
            printf(".");
        }
    }
Signed-off-by: Benoit Hudzia <benoit.hudzia@sap.com>
Signed-off-by: Petter Svard <petters@cs.umu.se>
Signed-off-by: Aidan Shribman <aidan.shribman@sap.com>
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.

I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.

It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.

Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0. That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.

Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol. That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.

I'm curious what people think about this type of approach. CC'ing libvirt to get their input.

Regards,

Anthony Liguori
---
 Makefile.target   |    1 +
 arch_init.c       |  351 ++++++++++++++++++++++++++++++++++++++++++++++------
 block-migration.c |    3 +-
 hash.h            |   72 +++++++++++
 hmp-commands.hx   |   36 ++++--
 hw/hw.h           |    3 +-
 lru.c             |  142 +++++++++++++++++++++
 lru.h             |   13 ++
 migration-exec.c  |    6 +-
 migration-fd.c    |    6 +-
 migration-tcp.c   |    6 +-
 migration-unix.c  |    6 +-
 migration.c       |  119 +++++++++++++++++-
 migration.h       |   25 +++-
 qmp-commands.hx   |   43 ++++++-
 savevm.c          |   13 ++-
 sysemu.h          |   13 ++-
 xbzrle.c          |  126 +++++++++++++++++++
 xbzrle.h          |   12 ++
 19 files changed, 917 insertions(+), 79 deletions(-)
diff --git a/Makefile.target b/Makefile.target
index 2800f47..b3215de 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -186,6 +186,7 @@ endif #CONFIG_BSD_USER
 ifdef CONFIG_SOFTMMU

 obj-y = arch_init.o cpus.o monitor.o machine.o gdbstub.o balloon.o
+obj-y += lru.o xbzrle.o
 # virtio has to be here due to weird dependency between PCI and virtio-net.
 # need to fix this properly
 obj-y += virtio-blk.o virtio-balloon.o virtio-net.o virtio-serial-bus.o
diff --git a/arch_init.c b/arch_init.c
old mode 100644
new mode 100755
index 4486925..d67dc82
--- a/arch_init.c
+++ b/arch_init.c
@@ -40,6 +40,17 @@
 #include "net.h"
 #include "gdbstub.h"
 #include "hw/smbios.h"
+#include "lru.h"
+#include "xbzrle.h"
+
+//#define DEBUG_ARCH_INIT
+#ifdef DEBUG_ARCH_INIT
+#define DPRINTF(fmt, ...) \
+    do { fprintf(stdout, "arch_init: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
 #ifdef TARGET_SPARC
 int graphic_width = 1024;
@@ -88,6 +99,161 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_PAGE     0x08
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
+#define RAM_SAVE_FLAG_XBZRLE   0x40
+
+/***********************************************************/
+/* RAM Migration State */
+typedef struct ArchMigrationState {
+    int use_xbrle;
+    int64_t xbrle_cache_size;
+} ArchMigrationState;
+
+static ArchMigrationState arch_mig_state;
+
+void arch_set_params(int blk_enable, int shared_base, int use_xbrle,
+    int64_t xbrle_cache_size, void *opaque)
+{
+    arch_mig_state.use_xbrle = use_xbrle;
+    arch_mig_state.xbrle_cache_size = xbrle_cache_size;
+}
+
+#define BE16_MAGIC 0x0123
+
+/***********************************************************/
+/* XBZRLE (Xor Binary Zero Run-Length Encoding) */
+typedef struct XBZRLEHeader {
+    uint32_t xh_cksum; /* not used */
+    uint16_t xh_magic;
+    uint16_t xh_len;
+    uint8_t xh_flags;
+} XBZRLEHeader;
+
+static uint8_t dup_buf[TARGET_PAGE_SIZE];
+
+/***********************************************************/
+/* accounting */
+typedef struct AccountingInfo {
+    uint64_t dup_pages;
+    uint64_t norm_pages;
+    uint64_t xbrle_bytes;
+    uint64_t xbrle_pages;
+    uint64_t xbrle_overflow;
+    uint64_t xbrle_cache_lookup;
+    uint64_t xbrle_cache_hit;
+    uint64_t iterations;
+} AccountingInfo;
+
+static AccountingInfo acct_info;
+
+static void acct_clear(void)
+{
+    memset(&acct_info, 0, sizeof(acct_info));
+}
+
+uint64_t dup_mig_bytes_transferred(void)
+{
+    return acct_info.dup_pages;
+}
+
+uint64_t dup_mig_pages_transferred(void)
+{
+    return acct_info.dup_pages;
+}
+
+uint64_t norm_mig_bytes_transferred(void)
+{
+    return acct_info.norm_pages * TARGET_PAGE_SIZE;
+}
+
+uint64_t norm_mig_pages_transferred(void)
+{
+    return acct_info.norm_pages;
+}
+
+uint64_t xbrle_mig_bytes_transferred(void)
+{
+    return acct_info.xbrle_bytes;
+}
+
+uint64_t xbrle_mig_pages_transferred(void)
+{
+    return acct_info.xbrle_pages;
+}
+
+uint64_t xbrle_mig_pages_overflow(void)
+{
+    return acct_info.xbrle_overflow;
+}
+
+uint64_t xbrle_mig_pages_cache_hit(void)
+{
+    return acct_info.xbrle_cache_hit;
+}
+
+uint64_t xbrle_mig_pages_cache_lookup(void)
+{
+    return acct_info.xbrle_cache_lookup;
+}
+
+static void save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
+    int cont, int flag)
+{
+    qemu_put_be64(f, offset | cont | flag);
+    if (!cont) {
+        qemu_put_byte(f, strlen(block->idstr));
+        qemu_put_buffer(f, (uint8_t *)block->idstr,
+            strlen(block->idstr));
+    }
+}
+
+#define ENCODING_FLAG_XBZRLE 0x1
+
+static int save_xbrle_page(QEMUFile *f, uint8_t *current_page,
+    ram_addr_t current_addr, RAMBlock *block, ram_addr_t offset, int cont)
+{
+    int encoded_len = 0, bytes_sent = 0;
+    XBZRLEHeader hdr = {0, BE16_MAGIC};
+    uint8_t *encoded = NULL, *old_page;
+
+    /* abort if page not cached */
+    acct_info.xbrle_cache_lookup++;
+    old_page = lru_lookup(current_addr);
+    if (!old_page) {
+        goto done;
+    }
+    acct_info.xbrle_cache_hit++;
+
+    /* XBZRLE (XOR+ZRLE) encoding */
+    encoded = (uint8_t *) qemu_malloc(TARGET_PAGE_SIZE);
+    encoded_len = xbzrle_encode(encoded, old_page, current_page,
+        TARGET_PAGE_SIZE);
+
+    if (encoded_len < 0) {
+        DPRINTF("XBZRLE encoding overflow - sending uncompressed\n");
+        acct_info.xbrle_overflow++;
+        goto done;
+    }
+
+    hdr.xh_len = encoded_len;
+    hdr.xh_flags |= ENCODING_FLAG_XBZRLE;
+
+    /* Send XBZRLE compressed page */
+    save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_XBZRLE);
+    qemu_put_be32(f, hdr.xh_cksum);
+    qemu_put_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic));
+    qemu_put_be16(f, hdr.xh_len);
+    qemu_put_byte(f, hdr.xh_flags);
+    qemu_put_buffer(f, encoded, encoded_len);
+    acct_info.xbrle_pages++;
+    bytes_sent = encoded_len + sizeof(hdr);
+    acct_info.xbrle_bytes += bytes_sent;
+
+done:
+    qemu_free(encoded);
+    return bytes_sent;
+}
 static int is_dup_page(uint8_t *page, uint8_t ch)
 {
@@ -107,7 +273,7 @@ static int is_dup_page(uint8_t *page, uint8_t ch)
 static RAMBlock *last_block;
 static ram_addr_t last_offset;

-static int ram_save_block(QEMUFile *f)
+static int ram_save_block(QEMUFile *f, int stage)
 {
     RAMBlock *block = last_block;
     ram_addr_t offset = last_offset;
@@ -120,6 +286,7 @@ static int ram_save_block(QEMUFile *f)
     current_addr = block->offset + offset;

     do {
+        lru_free_cb_t free_cb = qemu_free;
         if (cpu_physical_memory_get_dirty(current_addr, MIGRATION_DIRTY_FLAG)) {
             uint8_t *p;
             int cont = (block == last_block) ? RAM_SAVE_FLAG_CONTINUE : 0;
@@ -128,28 +295,35 @@ static int ram_save_block(QEMUFile *f)
                 current_addr + TARGET_PAGE_SIZE,
                 MIGRATION_DIRTY_FLAG);

-            p = block->host + offset;
+            if (arch_mig_state.use_xbrle) {
+                p = qemu_malloc(TARGET_PAGE_SIZE);
+                memcpy(p, block->host + offset, TARGET_PAGE_SIZE);
+            } else {
+                p = block->host + offset;
+            }

             if (is_dup_page(p, *p)) {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_COMPRESS);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
-                }
+                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_COMPRESS);
                 qemu_put_byte(f, *p);
                 bytes_sent = 1;
-            } else {
-                qemu_put_be64(f, offset | cont | RAM_SAVE_FLAG_PAGE);
-                if (!cont) {
-                    qemu_put_byte(f, strlen(block->idstr));
-                    qemu_put_buffer(f, (uint8_t *)block->idstr,
-                                    strlen(block->idstr));
+                acct_info.dup_pages++;
+                if (arch_mig_state.use_xbrle && !*p) {
+                    p = dup_buf;
+                    free_cb = NULL;
                 }
+            } else if (stage == 2 && arch_mig_state.use_xbrle) {
+                bytes_sent = save_xbrle_page(f, p, current_addr, block,
+                    offset, cont);
+            }
+            if (!bytes_sent) {
+                save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
                 qemu_put_buffer(f, p, TARGET_PAGE_SIZE);
                 bytes_sent = TARGET_PAGE_SIZE;
+                acct_info.norm_pages++;
+            }
+            if (arch_mig_state.use_xbrle) {
+                lru_insert(current_addr, p, free_cb);
             }
-
             break;
         }
@@ -221,6 +395,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     if (stage < 0) {
         cpu_physical_memory_set_dirty_tracking(0);
+        if (arch_mig_state.use_xbrle) {
+            lru_fini();
+        }
         return 0;
     }

@@ -235,6 +412,11 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         last_block = NULL;
         last_offset = 0;

+        if (arch_mig_state.use_xbrle) {
+            lru_init(arch_mig_state.xbrle_cache_size/TARGET_PAGE_SIZE, 0);
+            acct_clear();
+        }
+
         /* Make sure all dirty bits are set */
         QLIST_FOREACH(block, &ram_list.blocks, next) {
             for (addr = block->offset; addr < block->offset + block->length;
@@ -264,8 +446,9 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
     while (!qemu_file_rate_limit(f)) {
         int bytes_sent;

-        bytes_sent = ram_save_block(f);
+        bytes_sent = ram_save_block(f, stage);
         bytes_transferred += bytes_sent;
+        acct_info.iterations++;
         if (bytes_sent == 0) { /* no more blocks */
             break;
         }
@@ -285,19 +468,79 @@ int ram_save_live(Monitor *mon, QEMUFile *f, int stage, void *opaque)
         int bytes_sent;

         /* flush all remaining blocks regardless of rate limiting */
-        while ((bytes_sent = ram_save_block(f)) != 0) {
+        while ((bytes_sent = ram_save_block(f, stage))) {
             bytes_transferred += bytes_sent;
         }
         cpu_physical_memory_set_dirty_tracking(0);
+        if (arch_mig_state.use_xbrle) {
+            lru_fini();
+        }
     }
qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
expected_time = ram_save_remaining() * TARGET_PAGE_SIZE / bwidth;
+    DPRINTF("ram_save_live: expected(%ld) <= max(%ld)?\n", expected_time,
+        migrate_max_downtime());
+
     return (stage == 2) && (expected_time <= migrate_max_downtime());
 }
+static int load_xbrle(QEMUFile *f, ram_addr_t addr, void *host)
+{
+    int len, rc = -1;
+    uint8_t *encoded = NULL;
+    XBZRLEHeader hdr = {0};
+
+    /* extract ZRLE header */
+    hdr.xh_cksum = qemu_get_be32(f);
+    qemu_get_buffer(f, (uint8_t *)&hdr.xh_magic, sizeof (hdr.xh_magic));
+    hdr.xh_len = qemu_get_be16(f);
+    hdr.xh_flags = qemu_get_byte(f);
+
+    if (!(hdr.xh_flags & ENCODING_FLAG_XBZRLE)) {
+        fprintf(stderr, "Failed to load XBZRLE page - wrong compression!\n");
+        goto done;
+    }
+
+    if (hdr.xh_len > TARGET_PAGE_SIZE) {
+        fprintf(stderr, "Failed to load XBZRLE page - len overflow!\n");
+        goto done;
+    }
+
+    /* load data and decode */
+    encoded = (uint8_t *) qemu_malloc(hdr.xh_len);
+    qemu_get_buffer(f, encoded, hdr.xh_len);
+
+    /* convert endianness if magic indicates destination differs from source */
+    if (hdr.xh_magic != BE16_MAGIC) {
+        const uint64_t *end = (uint64_t *) encoded +
+            hdr.xh_len / sizeof (uint64_t);
+        uint64_t *p;
+        for (p = (uint64_t *) encoded; p < end; p++) {
+            bswap64s(p);
+        }
+    }
+
+    /* decode ZRLE */
+    len = xbzrle_decode(host, host, encoded, hdr.xh_len);
+    if (len == -1) {
+        fprintf(stderr, "Failed to load XBZRLE page - decode error!\n");
+        goto done;
+    }
+
+    if (len != TARGET_PAGE_SIZE) {
+        fprintf(stderr, "Failed to load XBZRLE page - size %d expected %d!\n",
+            len, TARGET_PAGE_SIZE);
+        goto done;
+    }
+
+    rc = 0;
+done:
+    qemu_free(encoded);
+    return rc;
+}
+
 static inline void *host_from_stream_offset(QEMUFile *f,
                                             ram_addr_t offset,
                                             int flags)
@@ -328,16 +571,38 @@ static inline void *host_from_stream_offset(QEMUFile *f,
     return NULL;
 }

+static inline void *host_from_stream_offset_versioned(int version_id,
+    QEMUFile *f, ram_addr_t offset, int flags)
+{
+    void *host;
+
+    if (version_id == 3) {
+        host = qemu_get_ram_ptr(offset);
+    } else {
+        host = host_from_stream_offset(f, offset, flags);
+    }
+    if (!host) {
+        fprintf(stderr, "Failed to convert RAM address to host"
+            " for offset 0x%lX!\n", offset);
+        abort();
+    }
+    return host;
+}
+
 int ram_load(QEMUFile *f, void *opaque, int version_id)
 {
     ram_addr_t addr;
-    int flags;
+    int flags, ret = 0;
+    static uint64_t seq_iter;
+
+    seq_iter++;

     if (version_id < 3 || version_id > 4) {
-        return -EINVAL;
+        ret = -EINVAL;
+        goto done;
     }

     do {
+        void *host;
         addr = qemu_get_be64(f);

         flags = addr & ~TARGET_PAGE_MASK;
@@ -346,7 +611,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
         if (flags & RAM_SAVE_FLAG_MEM_SIZE) {
             if (version_id == 3) {
                 if (addr != ram_bytes_total()) {
-                    return -EINVAL;
+                    ret = -EINVAL;
+                    goto done;
                 }
             } else {
                 /* Synchronize RAM block list */
@@ -365,8 +631,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)

                 QLIST_FOREACH(block, &ram_list.blocks, next) {
                     if (!strncmp(id, block->idstr, sizeof(id))) {
-                        if (block->length != length)
-                            return -EINVAL;
+                        if (block->length != length) {
+                            ret = -EINVAL;
+                            goto done;
+                        }
                         break;
                     }
                 }
@@ -374,7 +642,8 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
                 if (!block) {
                     fprintf(stderr, "Unknown ramblock \"%s\", cannot "
                             "accept migration\n", id);
-                    return -EINVAL;
+                    ret = -EINVAL;
+                    goto done;
                 }

                 total_ram_bytes -= length;
@@ -383,17 +652,10 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
         }

         if (flags & RAM_SAVE_FLAG_COMPRESS) {
-            void *host;
             uint8_t ch;

-            if (version_id == 3)
-                host = qemu_get_ram_ptr(addr);
-            else
-                host = host_from_stream_offset(f, addr, flags);
-            if (!host) {
-                return -EINVAL;
-            }
-
+            host = host_from_stream_offset_versioned(version_id,
+                f, addr, flags);
             ch = qemu_get_byte(f);
             memset(host, ch, TARGET_PAGE_SIZE);
 #ifndef _WIN32
@@ -403,21 +665,28 @@ int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
 #endif
         } else if (flags & RAM_SAVE_FLAG_PAGE) {
-            void *host;
-
-            if (version_id == 3)
-                host = qemu_get_ram_ptr(addr);
-            else
-                host = host_from_stream_offset(f, addr, flags);
-
+            host = host_from_stream_offset_versioned(version_id,
+                f, addr, flags);
             qemu_get_buffer(f, host, TARGET_PAGE_SIZE);
+        } else if (flags & RAM_SAVE_FLAG_XBZRLE) {
+            host = host_from_stream_offset_versioned(version_id,
+                f, addr, flags);
+            if (load_xbrle(f, addr, host) < 0) {
+                ret = -EINVAL;
+                goto done;
+            }
         }
+
         if (qemu_file_has_error(f)) {
-            return -EIO;
+            ret = -EIO;
+            goto done;
         }
     } while (!(flags & RAM_SAVE_FLAG_EOS));

-    return 0;
+done:
+    DPRINTF("Completed load of VM with exit code %d seq iteration %ld\n",
+        ret, seq_iter);
+    return ret;
 }
 void qemu_service_io(void)
diff --git a/block-migration.c b/block-migration.c
index 3e66f49..504df70 100644
--- a/block-migration.c
+++ b/block-migration.c
@@ -689,7 +689,8 @@ static int block_load(QEMUFile *f, void *opaque, int version_id)
     return 0;
 }

-static void block_set_params(int blk_enable, int shared_base, void *opaque)
+static void block_set_params(int blk_enable, int shared_base,
+    int use_xbrle, int64_t xbrle_cache_size, void *opaque)
 {
     block_mig_state.blk_enable = blk_enable;
     block_mig_state.shared_base = shared_base;
diff --git a/hash.h b/hash.h
new file mode 100644
index 0000000..7109905
--- /dev/null
+++ b/hash.h
@@ -0,0 +1,72 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for ints, longs and pointers.
+   (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+
+typedef uint64_t u64;
+typedef uint32_t u32;
+#define BITS_PER_LONG TARGET_LONG_BITS
+
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME_32 0x9e370001UL
+/* 2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME_64 0x9e37fffffffc0001UL
+
+#if BITS_PER_LONG == 32
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_32
+#define hash_long(val, bits) hash_32(val, bits)
+#elif BITS_PER_LONG == 64
+#define hash_long(val, bits) hash_64(val, bits)
+#define GOLDEN_RATIO_PRIME GOLDEN_RATIO_PRIME_64
+#else
+#error Wordsize not 32 or 64
+#endif
+
+static inline u64 hash_64(u64 val, unsigned int bits)
+{
+    u64 hash = val;
+
+    /* Sigh, gcc can't optimise this alone like it does for 32 bits. */
+    u64 n = hash;
+    n <<= 18;
+    hash -= n;
+    n <<= 33;
+    hash -= n;
+    n <<= 3;
+    hash += n;
+    n <<= 3;
+    hash -= n;
+    n <<= 4;
+    hash += n;
+    n <<= 2;
+    hash += n;
+
+    /* High bits are more random, so use them. */
+    return hash >> (64 - bits);
+}
+
+static inline u32 hash_32(u32 val, unsigned int bits)
+{
+    /* On some cpus multiply is faster, on others gcc will do shifts */
+    u32 hash = val * GOLDEN_RATIO_PRIME_32;
+
+    /* High bits are more random, so use them. */
+    return hash >> (32 - bits);
+}
+
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+    return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff --git a/hmp-commands.hx b/hmp-commands.hx
old mode 100644
new mode 100755
index e5585ba..e49d5be
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -717,24 +717,27 @@ ETEXI
     {
         .name = "migrate",
-        .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params = "[-d] [-b] [-i] uri",
-        .help = "migrate to URI (using -d to not wait for completion)"
-                "\n\t\t\t -b for migration without shared storage with"
-                " full copy of disk\n\t\t\t -i for migration without "
-                "shared storage with incremental copy of disk "
-                "(base image shared between src and destination)",
+        .args_type = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
+        .params = "[-d] [-b] [-i] [-x] uri",
+        .help = "migrate to URI"
+                "\n\t -d to not wait for completion"
+                "\n\t -b for migration without shared storage with"
+                " full copy of disk"
+                "\n\t -i for migration without"
+                " shared storage with incremental copy of disk"
+                " (base image shared between source and destination)"
+                "\n\t -x to use XBRLE page delta compression",
         .user_print = monitor_user_noop,
        .mhandler.cmd_new = do_migrate,
     },

 STEXI
-@item migrate [-d] [-b] [-i] @var{uri}
+@item migrate [-d] [-b] [-i] [-x] @var{uri}
 @findex migrate
 Migrate to @var{uri} (using -d to not wait for completion).
  -b for migration with full copy of disk
  -i for migration with incremental copy of disk (base image is shared)
+ -x to use XBRLE page delta compression
 ETEXI

     {
@@ -753,10 +756,23 @@ Cancel the current VM migration.
 ETEXI

     {
+        .name = "migrate_set_cachesize",
+        .args_type = "value:s",
+        .params = "value",
+        .help = "set cache size (in MB) for XBRLE migrations",
+        .mhandler.cmd = do_migrate_set_cachesize,
+    },
+
+STEXI
+@item migrate_set_cachesize @var{value}
+Set cache size (in MB) for xbrle migrations.
+ETEXI
+
+    {
         .name = "migrate_set_speed",
         .args_type = "value:o",
         .params = "value",
-        .help = "set maximum speed (in bytes) for migrations. "
+        .help = "set maximum XBRLE cache size (in bytes) for migrations. "
                 "Defaults to MB if no size suffix is specified, ie. B/K/M/G/T",
         .user_print = monitor_user_noop,
         .mhandler.cmd_new = do_migrate_set_speed,
diff --git a/hw/hw.h b/hw/hw.h
index 9d2cfc2..aa336ec 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -239,7 +239,8 @@ static inline void qemu_get_sbe64s(QEMUFile *f, int64_t *pv)
 int64_t qemu_ftell(QEMUFile *f);
 int64_t qemu_fseek(QEMUFile *f, int64_t pos, int whence);
-typedef void SaveSetParamsHandler(int blk_enable, int shared, void * opaque);
+typedef void SaveSetParamsHandler(int blk_enable, int shared,
+    int use_xbrle, int64_t xbrle_cache_size, void *opaque);
 typedef void SaveStateHandler(QEMUFile *f, void *opaque);
 typedef int SaveLiveStateHandler(Monitor *mon, QEMUFile *f, int stage,
                                  void *opaque);
diff --git a/lru.c b/lru.c
new file mode 100644
index 0000000..e7230d0
--- /dev/null
+++ b/lru.c
@@ -0,0 +1,142 @@
+#include <assert.h>
+#include <math.h>
+#include "qemu-common.h"
+#include "qemu-queue.h"
+#include "host-utils.h"
+#include "lru.h"
+#include "hash.h"
+
+typedef struct CacheItem {
+    ram_addr_t it_addr;
+    uint8_t *it_data;
+    lru_free_cb_t it_free;
+    QCIRCLEQ_ENTRY(CacheItem) it_lru_next;
+    QCIRCLEQ_ENTRY(CacheItem) it_bucket_next;
+} CacheItem;
+
+typedef QCIRCLEQ_HEAD(, CacheItem) CacheBucket;
+static CacheBucket *page_hash;
+static int64_t cache_table_size;
+static uint64_t cache_max_items;
+static int64_t cache_num_items;
+static uint8_t cache_hash_bits;
+
+static QCIRCLEQ_HEAD(page_lru, CacheItem) page_lru;
+
+static uint64_t next_pow_of_2(uint64_t v)
+{
+    v--;
+    v |= v >> 1;
+    v |= v >> 2;
+    v |= v >> 4;
+    v |= v >> 8;
+    v |= v >> 16;
+    v |= v >> 32;
+    v++;
+    return v;
+}
+
+void lru_init(int64_t max_items, void *param)
+{
+    int i;
+
+    cache_num_items = 0;
+    cache_max_items = max_items;
+    /* add 20% to table size to reduce collisions */
+    cache_table_size = next_pow_of_2(1.2 * max_items);
+    cache_hash_bits = ctz64(cache_table_size) - 1;
+
+    QCIRCLEQ_INIT(&page_lru);
+
+    page_hash = qemu_mallocz(sizeof(CacheBucket) * cache_table_size);
+    assert(page_hash);
+    for (i = 0; i < cache_table_size; i++) {
+        QCIRCLEQ_INIT(&page_hash[i]);
+    }
+}
+
+static CacheBucket *page_bucket_list(ram_addr_t addr)
+{
+    return &page_hash[hash_long(addr, cache_hash_bits)];
+}
+
+static void do_lru_remove(CacheItem *it)
+{
+    assert(it);
+
+    QCIRCLEQ_REMOVE(&page_lru, it, it_lru_next);
+    QCIRCLEQ_REMOVE(page_bucket_list(it->it_addr), it, it_bucket_next);
+    if (it->it_free) {
+        (*it->it_free)(it->it_data);
+    }
+    qemu_free(it);
+    cache_num_items--;
+}
+
+static int do_lru_remove_first(void)
+{
+    CacheItem *first;
+
+    if (QCIRCLEQ_EMPTY(&page_lru)) {
+        return -1;
+    }
+    first = QCIRCLEQ_FIRST(&page_lru);
+    do_lru_remove(first);
+    return 0;
+}
+
+void lru_fini(void)
+{
+    while (!do_lru_remove_first()) {
+    }
+    qemu_free(page_hash);
+}
+
+static CacheItem *do_lru_lookup(ram_addr_t addr)
+{
+    CacheBucket *head = page_bucket_list(addr);
+    CacheItem *it;
+
+    if (QCIRCLEQ_EMPTY(head)) {
+        return NULL;
+    }
+    QCIRCLEQ_FOREACH(it, head, it_bucket_next) {
+        if (addr == it->it_addr) {
+            return it;
+        }
+    }
+    return NULL;
+}
+
+uint8_t *lru_lookup(ram_addr_t addr)
+{
+    CacheItem *it = do_lru_lookup(addr);
+    return it ? it->it_data : NULL;
+}
+
+void lru_insert(ram_addr_t addr, uint8_t *data, lru_free_cb_t free_cb)
+{
+    CacheItem *it;
+
+    /* remove old if item exists */
+    it = do_lru_lookup(addr);
+    if (it) {
+        do_lru_remove(it);
+    }
+
+    /* evict LRU if require free space */
+    if (cache_num_items == cache_max_items) {
+        do_lru_remove_first();
+    }
+
+    /* add new entry */
+    it = qemu_mallocz(sizeof(*it));
+    it->it_addr = addr;
+    it->it_data = data;
+    it->it_free = free_cb;
+    QCIRCLEQ_INSERT_HEAD(page_bucket_list(addr), it, it_bucket_next);
+    QCIRCLEQ_INSERT_TAIL(&page_lru, it, it_lru_next);
+    cache_num_items++;
+}
diff --git a/lru.h b/lru.h
new file mode 100644
index 0000000..6c70095
--- /dev/null
+++ b/lru.h
@@ -0,0 +1,13 @@
+#ifndef _LRU_H_
+#define _LRU_H_
+
+#include <unistd.h>
+#include <stdint.h>
+#include "cpu-all.h"
+
+typedef void (*lru_free_cb_t)(void *);
+
+void lru_init(ssize_t num_items, void *param);
+void lru_fini(void);
+void lru_insert(ram_addr_t id, uint8_t *pdata, lru_free_cb_t free_cb);
+uint8_t *lru_lookup(ram_addr_t addr);
+#endif
diff --git a/migration-exec.c b/migration-exec.c
index 14718dd..fe8254a 100644
--- a/migration-exec.c
+++ b/migration-exec.c
@@ -67,7 +67,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
                                               int64_t bandwidth_limit,
                                               int detach,
                                               int blk,
-                                              int inc)
+                                              int inc,
+                                              int use_xbrle,
+                                              int64_t xbrle_cache_size)
 {
     FdMigrationState *s;
     FILE *f;
@@ -99,6 +101,8 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-fd.c b/migration-fd.c
index 6d14505..4a1ddbd 100644
--- a/migration-fd.c
+++ b/migration-fd.c
@@ -56,7 +56,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
                                             int64_t bandwidth_limit,
                                             int detach,
                                             int blk,
-                                            int inc)
+                                            int inc,
+                                            int use_xbrle,
+                                            int64_t xbrle_cache_size)
 {
     FdMigrationState *s;
@@ -82,6 +84,8 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..4ca5bf6 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -81,7 +81,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc)
+                                             int inc,
+                                             int use_xbrle,
+                                             int64_t xbrle_cache_size)
 {
     struct sockaddr_in addr;
     FdMigrationState *s;
@@ -101,6 +103,8 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration-unix.c b/migration-unix.c
index 57232c0..0813902 100644
--- a/migration-unix.c
+++ b/migration-unix.c
@@ -80,7 +80,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc)
+                                             int inc,
+                                             int use_xbrle,
+                                             int64_t xbrle_cache_size)
 {
     FdMigrationState *s;
     struct sockaddr_un addr;
@@ -100,6 +102,8 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
     s->mig_state.blk = blk;
     s->mig_state.shared = inc;
+    s->mig_state.use_xbrle = use_xbrle;
+    s->mig_state.xbrle_cache_size = xbrle_cache_size;

     s->state = MIG_STATE_ACTIVE;
     s->mon = NULL;
diff --git a/migration.c b/migration.c
old mode 100644
new mode 100755
index 9ee8b17..ccacf81
--- a/migration.c
+++ b/migration.c
@@ -34,6 +34,11 @@
 /* Migration speed throttling */
 static uint32_t max_throttle = (32 << 20);

+/* Migration XBRLE cache size */
+#define DEFAULT_MIGRATE_CACHE_SIZE (64 * 1024 * 1024)
+
+static int64_t migrate_cache_size = DEFAULT_MIGRATE_CACHE_SIZE;
+
 static MigrationState *current_migration;
 int qemu_start_incoming_migration(const char *uri)
@@ -80,6 +85,7 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
     int detach = qdict_get_try_bool(qdict, "detach", 0);
     int blk = qdict_get_try_bool(qdict, "blk", 0);
     int inc = qdict_get_try_bool(qdict, "inc", 0);
+    int use_xbrle = qdict_get_try_bool(qdict, "xbrle", 0);
     const char *uri = qdict_get_str(qdict, "uri");

     if (current_migration &&
@@ -90,17 +96,21 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)

     if (strstart(uri, "tcp:", &p)) {
         s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
-                                         blk, inc);
+                                         blk, inc, use_xbrle,
+                                         migrate_cache_size);
 #if !defined(WIN32)
     } else if (strstart(uri, "exec:", &p)) {
         s = exec_start_outgoing_migration(mon, p, max_throttle, detach,
-                                          blk, inc);
+                                          blk, inc, use_xbrle,
+                                          migrate_cache_size);
     } else if (strstart(uri, "unix:", &p)) {
         s = unix_start_outgoing_migration(mon, p, max_throttle, detach,
-                                          blk, inc);
+                                          blk, inc, use_xbrle,
+                                          migrate_cache_size);
     } else if (strstart(uri, "fd:", &p)) {
         s = fd_start_outgoing_migration(mon, p, max_throttle, detach,
-                                        blk, inc);
+                                        blk, inc, use_xbrle,
+                                        migrate_cache_size);
 #endif
     } else {
         monitor_printf(mon, "unknown migration protocol: %s\n", uri);
@@ -185,6 +195,36 @@ static void migrate_print_status(Monitor *mon, const char *name,
                        qdict_get_int(qdict, "total") >> 10);
 }

+static void migrate_print_ram_status(Monitor *mon, const char *name,
+    const QDict *status_dict)
+{
+    QDict *qdict;
+    uint64_t overflow, cache_hit, cache_lookup;
+
+    qdict = qobject_to_qdict(qdict_get(status_dict, name));
+
+    monitor_printf(mon, "transferred %s: %" PRIu64 " kbytes\n", name,
+        qdict_get_int(qdict, "bytes") >> 10);
+    monitor_printf(mon, "transferred %s: %" PRIu64 " pages\n", name,
+        qdict_get_int(qdict, "pages"));
+    overflow = qdict_get_int(qdict, "overflow");
+    if (overflow > 0) {
+        monitor_printf(mon, "overflow %s: %" PRIu64 " pages\n", name,
+            overflow);
+    }
+    cache_hit = qdict_get_int(qdict, "cache-hit");
+    if (cache_hit > 0) {
+        monitor_printf(mon, "cache-hit %s: %" PRIu64 " pages\n", name,
+            cache_hit);
+    }
+    cache_lookup = qdict_get_int(qdict, "cache-lookup");
+    if (cache_lookup > 0) {
+        monitor_printf(mon, "cache-lookup %s: %" PRIu64 " pages\n", name,
+            cache_lookup);
+    }
+}
+
 void do_info_migrate_print(Monitor *mon, const QObject *data)
 {
     QDict *qdict;
@@ -198,6 +238,18 @@ void do_info_migrate_print(Monitor *mon, const QObject *data)
         migrate_print_status(mon, "ram", qdict);
     }

+    if (qdict_haskey(qdict, "ram-duplicate")) {
+        migrate_print_ram_status(mon, "ram-duplicate", qdict);
+    }
+
+    if (qdict_haskey(qdict, "ram-normal")) {
+        migrate_print_ram_status(mon, "ram-normal", qdict);
+    }
+
+    if (qdict_haskey(qdict, "ram-xbrle")) {
+        migrate_print_ram_status(mon, "ram-xbrle", qdict);
+    }
+
     if (qdict_haskey(qdict, "disk")) {
         migrate_print_status(mon, "disk", qdict);
     }
@@ -214,6 +266,23 @@ static void migrate_put_status(QDict *qdict, const char *name,
     qdict_put_obj(qdict, name, obj);
 }

+static void migrate_put_ram_status(QDict *qdict, const char *name,
+    uint64_t bytes, uint64_t pages,
+    uint64_t overflow, uint64_t cache_hit,
+    uint64_t cache_lookup)
+{
+    QObject *obj;
+
+    obj = qobject_from_jsonf("{ 'bytes': %" PRId64 ", "
+        "'pages': %" PRId64 ", "
+        "'overflow': %" PRId64 ", "
+        "'cache-hit': %" PRId64 ", "
+        "'cache-lookup': %" PRId64 " }",
+        bytes, pages, overflow, cache_hit,
+        cache_lookup);
+    qdict_put_obj(qdict, name, obj);
+}
+
 void do_info_migrate(Monitor *mon, QObject **ret_data)
 {
     QDict *qdict;
@@ -228,6 +297,21 @@ void do_info_migrate(Monitor *mon, QObject **ret_data)
         migrate_put_status(qdict, "ram", ram_bytes_transferred(),
                            ram_bytes_remaining(), ram_bytes_total());

+        if (s->use_xbrle) {
+            migrate_put_ram_status(qdict, "ram-duplicate",
+                dup_mig_bytes_transferred(),
+                dup_mig_pages_transferred(), 0, 0, 0);
+            migrate_put_ram_status(qdict, "ram-normal",
+                norm_mig_bytes_transferred(),
+                norm_mig_pages_transferred(), 0, 0, 0);
+            migrate_put_ram_status(qdict, "ram-xbrle",
+                xbrle_mig_bytes_transferred(),
+                xbrle_mig_pages_transferred(),
+                xbrle_mig_pages_overflow(),
+                xbrle_mig_pages_cache_hit(),
+                xbrle_mig_pages_cache_lookup());
+        }
+
         if (blk_mig_active()) {
             migrate_put_status(qdict, "disk", blk_mig_bytes_transferred(),
                                blk_mig_bytes_remaining(),
@@ -341,7 +425,8 @@ void migrate_fd_connect(FdMigrationState *s)

     DPRINTF("beginning savevm\n");
     ret = qemu_savevm_state_begin(s->mon, s->file, s->mig_state.blk,
-                                  s->mig_state.shared);
+                                  s->mig_state.shared, s->mig_state.use_xbrle,
+                                  s->mig_state.xbrle_cache_size);
     if (ret < 0) {
         DPRINTF("failed, %d\n", ret);
         migrate_fd_error(s);
@@ -448,3 +533,27 @@ int migrate_fd_close(void *opaque)
     qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
     return s->close(s);
 }
+
+void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict)
+{
+    ssize_t bytes;
+    const char *value = qdict_get_str(qdict, "value");
+
+    bytes = strtosz(value, NULL);
+    if (bytes < 0) {
+        monitor_printf(mon, "invalid cache size: %s\n", value);
+        return;
+    }
+
+    /* On 32-bit hosts, QEMU is limited by virtual address space */
+    if (bytes > (2047 << 20) && HOST_LONG_BITS == 32) {
+        monitor_printf(mon, "cache can't exceed 2047 MB RAM limit on host\n");
+        return;
+    }
+    if (bytes != (uint64_t) bytes) {
+        monitor_printf(mon, "cache size too large\n");
+        return;
+    }
+    migrate_cache_size = bytes;
+}
diff --git a/migration.h b/migration.h
index d13ed4f..6dc0543 100644
--- a/migration.h
+++ b/migration.h
@@ -32,6 +32,8 @@ struct MigrationState
     void (*release)(MigrationState *s);
     int blk;
     int shared;
+    int use_xbrle;
+    int64_t xbrle_cache_size;
 };
 typedef struct FdMigrationState FdMigrationState;
@@ -76,7 +78,9 @@ MigrationState *exec_start_outgoing_migration(Monitor *mon,
                                               int64_t bandwidth_limit,
                                               int detach,
                                               int blk,
-                                              int inc);
+                                              int inc,
+                                              int use_xbrle,
+                                              int64_t xbrle_cache_size);

 int tcp_start_incoming_migration(const char *host_port);

@@ -85,7 +89,9 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
                                              int64_t bandwidth_limit,
                                              int detach,
                                              int blk,
-                                             int inc);
+                                             int inc,
+                                             int use_xbrle,
+                                             int64_t xbrle_cache_size);

 int unix_start_incoming_migration(const char *path);

@@ -94,7 +100,9 @@ MigrationState *unix_start_outgoing_migration(Monitor *mon,
                                               int64_t bandwidth_limit,
                                               int detach,
                                               int blk,
-                                              int inc);
+                                              int inc,
+                                              int use_xbrle,
+                                              int64_t xbrle_cache_size);

 int fd_start_incoming_migration(const char *path);

@@ -103,7 +111,9 @@ MigrationState *fd_start_outgoing_migration(Monitor *mon,
                                             int64_t bandwidth_limit,
                                             int detach,
                                             int blk,
-                                            int inc);
+                                            int inc,
+                                            int use_xbrle,
+                                            int64_t xbrle_cache_size);

 void migrate_fd_monitor_suspend(FdMigrationState *s, Monitor *mon);

@@ -134,4 +144,11 @@ static inline FdMigrationState *migrate_to_fms(MigrationState *mig_state)
     return container_of(mig_state, FdMigrationState, mig_state);
 }

+void do_migrate_set_cachesize(Monitor *mon, const QDict *qdict);
+
+void arch_set_params(int blk_enable, int shared_base,
+    int use_xbrle, int64_t xbrle_cache_size, void *opaque);
+
+int xbrle_mig_active(void);
+
 #endif
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 793cf1c..8fbe64b 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -431,13 +431,16 @@ EQMP
     {
         .name = "migrate",
-        .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
-        .params = "[-d] [-b] [-i] uri",
-        .help = "migrate to URI (using -d to not wait for completion)"
-                "\n\t\t\t -b for migration without shared storage with"
-                " full copy of disk\n\t\t\t -i for migration without "
-                "shared storage with incremental copy of disk "
-                "(base image shared between src and destination)",
+        .args_type = "detach:-d,blk:-b,inc:-i,xbrle:-x,uri:s",
+        .params = "[-d] [-b] [-i] [-x] uri",
+        .help = "migrate to URI"
+                "\n\t -d to not wait for completion"
+                "\n\t -b for migration without shared storage with"
+                " full copy of disk"
+                "\n\t -i for migration without"
+                " shared storage with incremental copy of disk"
+                " (base image shared between source and destination)"
+                "\n\t -x to use XBRLE page delta compression",
         .user_print = monitor_user_noop,
         .mhandler.cmd_new = do_migrate,
     },
@@ -453,6 +456,7 @@ Arguments:
 - "blk": block migration, full disk copy (json-bool, optional)
 - "inc": incremental disk copy (json-bool, optional)
 - "uri": Destination URI (json-string)
+- "xbrle": to use XBRLE page delta compression
Example:
@@ -494,6 +498,31 @@ Example:
 EQMP

     {
+        .name = "migrate_set_cachesize",
+        .args_type = "value:s",
+        .params = "value",
+        .help = "set cache size (in MB) for xbrle migrations",
+        .mhandler.cmd = do_migrate_set_cachesize,
+    },
+
+SQMP
+migrate_set_cachesize
+---------------------
+
+Set cache size to be used by XBRLE migration
+
+Arguments:
+
+- "value": cache size in bytes (json-number)
+
+Example:
+
+-> { "execute": "migrate_set_cachesize", "arguments": { "value": 500M } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name = "migrate_set_speed",
         .args_type = "value:f",
         .params = "value",
diff --git a/savevm.c b/savevm.c
index 4e49765..93b512b 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1141,7 +1141,8 @@ int register_savevm(DeviceState *dev,
                     void *opaque)
 {
     return register_savevm_live(dev, idstr, instance_id, version_id,
-                                NULL, NULL, save_state, load_state, opaque);
+                                arch_set_params, NULL, save_state,
+                                load_state, opaque);
 }
 void unregister_savevm(DeviceState *dev, const char *idstr, void *opaque)
@@ -1428,15 +1429,17 @@ static int vmstate_save(QEMUFile *f, SaveStateEntry *se)
 #define QEMU_VM_SUBSECTION           0x05

 int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared)
+                            int shared, int use_xbrle,
+                            int64_t xbrle_cache_size)
 {
     SaveStateEntry *se;

     QTAILQ_FOREACH(se, &savevm_handlers, entry) {
         if(se->set_params == NULL) {
             continue;
-        }
-        se->set_params(blk_enable, shared, se->opaque);
+        }
+        se->set_params(blk_enable, shared, use_xbrle, xbrle_cache_size,
+            se->opaque);
     }

     qemu_put_be32(f, QEMU_VM_FILE_MAGIC);
@@ -1577,7 +1580,7 @@ static int qemu_savevm_state(Monitor *mon, QEMUFile *f)

     bdrv_flush_all();

-    ret = qemu_savevm_state_begin(mon, f, 0, 0);
+    ret = qemu_savevm_state_begin(mon, f, 0, 0, 0, 0);
     if (ret < 0)
         goto out;
diff --git a/sysemu.h b/sysemu.h
index b81a70e..eb53bf7 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -44,6 +44,16 @@ uint64_t ram_bytes_remaining(void);
 uint64_t ram_bytes_transferred(void);
 uint64_t ram_bytes_total(void);

+uint64_t dup_mig_bytes_transferred(void);
+uint64_t dup_mig_pages_transferred(void);
+uint64_t norm_mig_bytes_transferred(void);
+uint64_t norm_mig_pages_transferred(void);
+uint64_t xbrle_mig_bytes_transferred(void);
+uint64_t xbrle_mig_pages_transferred(void);
+uint64_t xbrle_mig_pages_overflow(void);
+uint64_t xbrle_mig_pages_cache_lookup(void);
+uint64_t xbrle_mig_pages_cache_hit(void);
+
 int64_t cpu_get_ticks(void);
 void cpu_enable_ticks(void);
 void cpu_disable_ticks(void);
@@ -74,7 +84,8 @@ void qemu_announce_self(void);
 void main_loop_wait(int nonblocking);
 int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
-                            int shared);
+                            int shared, int use_xbrle,
+                            int64_t xbrle_cache_size);
 int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
 int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
 void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
diff --git a/xbzrle.c b/xbzrle.c
new file mode 100644
index 0000000..e9285e0
--- /dev/null
+++ b/xbzrle.c
@@ -0,0 +1,126 @@
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include "cpu-all.h"
+#include "xbzrle.h"
+
+typedef struct {
+    uint64_t c;
+    uint64_t num;
+} zero_encoding_t;
+
+typedef struct {
+    uint64_t c;
+} char_encoding_t;
+
+static int rle_encode(uint64_t *in, int slen, uint8_t *out, const int dlen)
+{
+    int dl = 0;
+    uint64_t cp = 0, c, run_len = 0;
+
+    if (slen <= 0)
+        return -1;
+
+    while (1) {
+        if (!slen)
+            break;
+        c = *in++;
+        slen--;
+        if (!(cp || c)) {
+            run_len++;
+        } else if (!cp) {
+            ((zero_encoding_t *)out)->c = cp;
+            ((zero_encoding_t *)out)->num = run_len;
+            dl += sizeof(zero_encoding_t);
+            out += sizeof(zero_encoding_t);
+            run_len = 1;
+        } else {
+            ((char_encoding_t *)out)->c = cp;
+            dl += sizeof(char_encoding_t);
+            out += sizeof(char_encoding_t);
+        }
+        cp = c;
+    }
+
+    if (!cp) {
+        ((zero_encoding_t *)out)->c = cp;
+        ((zero_encoding_t *)out)->num = run_len;
+        dl += sizeof(zero_encoding_t);
+        out += sizeof(zero_encoding_t);
+    } else {
+        ((char_encoding_t *)out)->c = cp;
+        dl += sizeof(char_encoding_t);
+        out += sizeof(char_encoding_t);
+    }
+    return dl;
+}
+
+static int rle_decode(const uint8_t *in, int slen, uint64_t *out, int dlen)
+{
+    int tb = 0;
+    uint64_t run_len, c;
+
+    while (slen > 0) {
+        c = ((char_encoding_t *) in)->c;
+        if (c) {
+            slen -= sizeof(char_encoding_t);
+            in += sizeof(char_encoding_t);
+            *out++ = c;
+            tb++;
+            continue;
+        }
+        run_len = ((zero_encoding_t *) in)->num;
+        slen -= sizeof(zero_encoding_t);
+        in += sizeof(zero_encoding_t);
+        while (run_len-- > 0) {
+            *out++ = c;
+            tb++;
+        }
+    }
+    return tb;
+}
+
+static void xor_encode_word(uint8_t *dst, const uint8_t *src1,
+    const uint8_t *src2)
+{
+    int len = TARGET_PAGE_SIZE / sizeof (uint64_t);
+    uint64_t *dstw = (uint64_t *) dst;
+    const uint64_t *srcw1 = (const uint64_t *) src1;
+    const uint64_t *srcw2 = (const uint64_t *) src2;
+
+    while (len--) {
+        *dstw++ = *srcw1++ ^ *srcw2++;
+    }
+}
+
+int xbzrle_encode(uint8_t *xbzrle, const uint8_t *old, const uint8_t *curr,
+    const size_t max_compressed_len)
+{
+    int compressed_len;
+    uint8_t xor_buf[TARGET_PAGE_SIZE];
+    uint8_t work_buf[TARGET_PAGE_SIZE * 2]; /* worst case xbzrle is 150% */
+
+    xor_encode_word(xor_buf, old, curr);
+    compressed_len = rle_encode((uint64_t *)xor_buf,
+        sizeof(xor_buf)/sizeof(uint64_t), work_buf,
+        sizeof(work_buf));
+    if (compressed_len > max_compressed_len) {
+        return -1;
+    }
+    memcpy(xbzrle, work_buf, compressed_len);
+    return compressed_len;
+}
+
+int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
+    const size_t compressed_len)
+{
+    uint8_t xor_buf[TARGET_PAGE_SIZE];
+
+    int len = rle_decode(xbrle, compressed_len,
+        (uint64_t *)xor_buf, sizeof(xor_buf)/sizeof(uint64_t));
+    if (len < 0) {
+        return len;
+    }
+    xor_encode_word(curr, old, xor_buf);
+    return len * sizeof(uint64_t);
+}
diff --git a/xbzrle.h b/xbzrle.h
new file mode 100644
index 0000000..5d625a0
--- /dev/null
+++ b/xbzrle.h
@@ -0,0 +1,12 @@
+#ifndef _XBZRLE_H_
+#define _XBZRLE_H_
+
+#include <stdio.h>
+
+int xbzrle_encode(uint8_t *xbrle, const uint8_t *old, const uint8_t *curr,
+    const size_t len);
+int xbzrle_decode(uint8_t *curr, const uint8_t *old, const uint8_t *xbrle,
+    const size_t len);
+
+#endif

On 08.08.2011, at 15:29, Anthony Liguori wrote:
On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps
From: Aidan Shribman <aidan.shribman@sap.com>
By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime and total live-migration time for VMs running memory-write-intensive workloads typical of large enterprise applications such as SAP ERP systems, and more generally for any application with a sparse memory-update pattern.
On the sender side XBZRLE is used as a compact delta encoding of page updates, retrieving the old page content from an LRU cache (default size of 64 MB). The receiving side uses the existing page content and XBZRLE to decode the new page content.
Work was originally based on research results published at VEE 2011: "Evaluation of Delta Compression Techniques for Efficient Live Migration of Large Virtual Machines" by Benoit, Svard, Tordsson and Elmroth. The delta encoder was subsequently improved, replacing the original XBRLE with XBZRLE.
XBZRLE has a sustained bandwidth of 2-2.5 GB/s for typical workloads, making it ideal for in-line, real-time encoding such as is needed for live migration.
A typical usage scenario:

    {qemu} migrate_set_cachesize 256m
    {qemu} migrate -x -d tcp:destination.host:4444
    {qemu} info migrate
    ...
    transferred ram-duplicate: A kbytes
    transferred ram-duplicate: B pages
    transferred ram-normal: C kbytes
    transferred ram-normal: D pages
    transferred ram-xbrle: E kbytes
    transferred ram-xbrle: F pages
    overflow ram-xbrle: G pages
    cache-hit ram-xbrle: H pages
    cache-lookup ram-xbrle: J pages
Testing: live migration with XBZRLE completed in 110 seconds; without XBZRLE, live migration was not able to complete.
A simple synthetic memory r/w load generator:

    #include <stdlib.h>
    #include <stdio.h>

    int main()
    {
        char *buf = (char *) calloc(4096, 4096);
        while (1) {
            int i;
            for (i = 0; i < 4096 * 4; i++) {
                buf[i * 4096 / 4]++;
            }
            printf(".");
        }
    }
Signed-off-by: Benoit Hudzia <benoit.hudzia@sap.com>
Signed-off-by: Petter Svard <petters@cs.umu.se>
Signed-off-by: Aidan Shribman <aidan.shribman@sap.com>
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0. That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.
Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol. That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.
I'm curious what people think about this type of approach. CC'ing libvirt to get their input.
In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer. Directly working with the migration protocol basically ties us to libvirt if we want to do migration, killing competition in the management stack. Just look at how xm is tied to xen - it's one of the major points I dislike about it :).

Alex

On 08/08/2011 08:41 AM, Alexander Graf wrote:
On 08.08.2011, at 15:29, Anthony Liguori wrote:
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0. That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.
Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol. That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.
I'm curious what people think about this type of approach. CC'ing libvirt to get their input.
In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer. Directly working with the migration protocol basically ties us to libvirt if we want to do migration, killing competition in the management stack. Just look at how xm is tied to xen - it's one of the major points I dislike about it :).
The way I originally envisioned things, you'd have:

    (qemu) migrate xbzrle://destination?opt1=value1&opt2=value2

Which would in turn be equivalent to:

    (qemu) migrate exec:///usr/libexec/qemu/migration-helper-xbzrle --opt1=value1 --opt2=value2

But even if we supported that, it wouldn't get exposed via libvirt unless the libvirt guys exposed QEMU URIs directly.

So I think the open question is, how do we do transport plugins in a way that makes libvirt and QEMU both happy?

Regards,

Anthony Liguori
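As a rough sketch of such a helper's shape (the helper name is hypothetical, and this pass-through version only copies bytes where a real helper would encode or decode the stream in the middle): an exec: helper simply receives the migration stream on stdin and emits its output on stdout.

    /* Hypothetical exec: migration helper skeleton: QEMU writes the
     * migration stream to our stdin; we forward it (and could transform
     * it, e.g. XBZRLE-encode it) on stdout. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[65536];
        ssize_t n;

        while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0) {
            ssize_t off = 0;
            while (off < n) {
                ssize_t w = write(STDOUT_FILENO, buf + off, n - off);
                if (w < 0) {
                    return 1;   /* write error: abort the migration */
                }
                off += w;
            }
        }
        return n < 0;           /* nonzero exit on read error */
    }

Such a filter could then be wired up with something like (command line illustrative):

    (qemu) migrate "exec:./migration-helper-xbzrle | nc destination.host 4444"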
Alex

On 08/08/2011 04:41 PM, Alexander Graf wrote:
In general, I believe it's a good idea to keep looking at libvirt as a vm management layer and only a vm management layer.
Very much yes.

--
error compiling committee.c: too many arguments to function

On 08/08/2011 04:29 PM, Anthony Liguori wrote:
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
Poking bytes at random places in memory is fairly generic. If you have a lot of small objects, and modify a subset of them, this is the pattern you get.
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
A better model would be plugin based.

--
error compiling committee.c: too many arguments to function

On 08/08/2011 08:51 AM, Avi Kivity wrote:
On 08/08/2011 04:29 PM, Anthony Liguori wrote:
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
Poking bytes at random places in memory is fairly generic. If you have a lot of small objects, and modify a subset of them, this is the pattern you get.
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
A better model would be plugin based.
exec helpers are plugins. They just live in a different address space, with a pipe as the channel to exchange data.

If we did .so plugins, which I'm really not opposed to, I'd want the interface to be something like:

    typedef struct MigrationTransportClass {
        ssize_t (*writev)(MigrationTransport *obj, struct iovec *iov,
                          int iovcnt);
    } MigrationTransportClass;

I think it's useful to use an interface like this because it makes it easy to put the transport in a dedicated thread that didn't hold qemu_mutex (which is sort of equivalent to using a fork'd helper but is zero-copy at the expense of less isolation).

Regards,

Anthony Liguori
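A hedged sketch of what a plugin built against such an interface might look like; only the class struct above is from the proposal, while the MigrationTransport layout, its fd field, and the class-instance wiring are assumptions:

    #include <sys/uio.h>
    #include <unistd.h>

    typedef struct MigrationTransport MigrationTransport;

    typedef struct MigrationTransportClass {
        ssize_t (*writev)(MigrationTransport *obj, struct iovec *iov,
                          int iovcnt);
    } MigrationTransportClass;

    /* Assumed object layout: the migration core hands the plugin a
     * connected file descriptor to write the (possibly encoded) stream to. */
    struct MigrationTransport {
        const MigrationTransportClass *klass;
        int fd;
    };

    /* Trivial transport that forwards the vectored buffers unmodified; an
     * XBZRLE transport would delta-encode the iov contents before writing. */
    static ssize_t passthrough_writev(MigrationTransport *obj,
                                      struct iovec *iov, int iovcnt)
    {
        return writev(obj->fd, iov, iovcnt);
    }

    static const MigrationTransportClass passthrough_class = {
        .writev = passthrough_writev,
    };

Because the only entry point is writev, the core can pump data to the plugin from a dedicated thread without the plugin ever needing to touch guest state or qemu_mutex.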

On 08/08/2011 05:15 PM, Anthony Liguori wrote:
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
A better model would be plugin based.
exec helpers are plugins. They just live in a different address space and a channel to exchange data (pipe).
libvirt isn't an exec helper.
If we did .so plugins, which I'm really not opposed to, I'd want the interface to be something like:
typedef struct MigrationTransportClass {
    ssize_t (*writev)(MigrationTransport *obj, struct iovec *iov,
                      int iovcnt);
} MigrationTransportClass;
I think it's useful to use an interface like this because it makes it easy to put the transport in a dedicated thread that didn't hold qemu_mutex (which is sort of equivalent to using a fork'd helper but is zero-copy at the expense of less isolation).
If we have a shared object helper, the thread should be maintained by qemu proper, not the plugin.

I wouldn't call it "migration transport", but instead a compression/decompression plugin.

I don't think it merits a plugin at all though. There's limited scope for compression and it best sits in qemu proper. If anything, it needs to be more integrated (for example turning itself off if it doesn't match enough).

--
error compiling committee.c: too many arguments to function

On 08/08/2011 09:23 AM, Avi Kivity wrote:
On 08/08/2011 05:15 PM, Anthony Liguori wrote:
If we did .so plugins, which I'm really not opposed to, I'd want the interface to be something like:
typedef struct MigrationTransportClass { ssize_t (*writev)(MigrationTransport *obj, struct iovec *iov, int iovcnt); } MigrationTransportClass;
I think it's useful to use an interface like this because it makes it easy to put the transport in a dedicated thread that didn't hold qemu_mutex (which is sort of equivalent to using a fork'd helper but is zero-copy at the expense of less isolation).
If we have a shared object helper, the thread should be maintained by qemu proper, not the plugin.
I wouldn't call it "migration transport", but instead a compression/decompression plugin.
I don't think it merits a plugin at all though. There's limited scope for compression and it best sits in qemu proper. If anything, it needs to be more integrated (for example turning itself off if it doesn't match enough).
That adds a tremendous amount of complexity to QEMU. If we're going to change our compression algorithm, we would need to use a single algorithm that worked well for a wide variety of workloads. We struggle enough with migration as it is; it would only get worse if we had 10 different algorithms that we were dynamically enabling/disabling.

The other option is to allow 1-off compression algorithms in the form of plugins. I think in this case, plugins are a pretty good compromise in terms of isolating complexity while allowing something that at least works very well for one particular type of workload.

Regards,

Anthony Liguori

On 08/08/2011 05:33 PM, Anthony Liguori wrote:
If we have a shared object helper, the thread should be maintained by qemu proper, not the plugin.
I wouldn't call it "migration transport", but instead a compression/decompression plugin.
I don't think it merits a plugin at all though. There's limited scope for compression and it best sits in qemu proper. If anything, it needs to be more integrated (for example turning itself off if it doesn't match enough).
That adds a tremendous amount of complexity to QEMU.
Tremendous? You exaggerate. It's a lot simpler than the block or char layers, for example.
If we're going to change our compression algorithm, we would need to use a single algorithm that worked well for a wide variety of workloads.
That algorithm will have to include XBZRLE as a subset, since it matches what workloads actually do (touch memory sparsely).
We struggle enough with migration as it is, it only would get worse if we have 10 different algorithms that we were dynamically enabling/disabling.
The other option is to allow 1-off compression algorithms in the form of plugins. I think in this case, plugins are a pretty good compromise in terms of isolating complexity while allowing something that at least works very well for one particular type of workload.
I think you underestimate the generality of XBZRLE (or maybe I'm overestimating it?). It's not reasonable to ask users to match a compression algorithm to their workload; most times they won't be interacting with the host at all. We need compression to be enabled at all times, turning itself off if it finds it isn't effective so it can consume less cpu.

--
error compiling committee.c: too many arguments to function

On 08/08/2011 09:39 AM, Avi Kivity wrote:
The other option is to allow 1-off compression algorithms in the form of plugins. I think in this case, plugins are a pretty good compromise in terms of isolating complexity while allowing something that at least works very well for one particular type of workload.
I think you underestimate the generality of XBZRLE (or maybe I'm overestimating it?).
This is really my fundamental concern. When it comes to something that we have to support for a very long time, no one should be estimating anything. We should make these decisions based on an awful lot of analysis on a wide variety of workloads.

It's hard to do this in QEMU today because we don't have a module mechanism to make it easy for users to try out new things without fully committing to including something in the tree. But I don't think that's the root of the problem I have. I really am just extremely reluctant to commit to something that we have to support forever.

Thinking more about it though, I think there can be another solution--feature negotiation. I view adding feature negotiation as a pre-requisite to adding any type of transport compression such as XBZRLE. That will let us support migration to older QEMUs and also to eventually remove XBZRLE if we decide it doesn't make sense anymore.

Regards,

Anthony Liguori
It's not reasonable to ask users to match a compression algorithm to their workload; most times they won't be interacting with the host at all. We need compression to be enabled at all times, turning itself off if it finds it isn't effective so it can consume less cpu.

On Mon, Aug 08, 2011 at 08:29:51AM -0500, Anthony Liguori wrote:
On 08/08/2011 03:42 AM, Shribman, Aidan wrote:
Subject: [PATCH v4] XBZRLE delta for live migration of large memory apps From: Aidan Shribman<aidan.shribman@sap.com>
By using XBZRLE (Xor Binary Zero Run-Length-Encoding) we can reduce VM downtime and total live-migration time of VMs running memory write intensive workloads typical of large enterprise applications such as SAP ERP Systems, and generally speaking for any application with a sparse memory update pattern.
[snip]
One thing that strikes me about this algorithm is that it's very good for a particular type of workload--shockingly good really.
I think workload aware migration compression is possible for a lot of different types of workloads. That makes me a bit wary of QEMU growing quite a lot of compression mechanisms.
It makes me think that this logic may really belong at a higher level where more information is known about the workload. For instance, I can imagine XBZRLE living in something like libvirt.
Today, parsing migration traffic is pretty horrible but I think we're pretty strongly committed to fixing that in 1.0. That makes me wonder if it would be nicer architecturally for a higher level tool to own something like this.
Originally, when I added migration, I had the view that we would have transport plugins based on the exec: protocol. That hasn't really happened since libvirt really owns migration but I think having XBZRLE as a transport plugin for libvirt is something worth considering.
NB I've not been much of a fan of the exec: migration code, since it has proved rather buggy in practice when we used it for 'save/restore to/from file' support. It has been hard to diagnose when things go wrong, and difficult for QEMU to report any useful error messages.

Even with the tcp: protocol, QEMU is seemingly unable to provide any useful error reporting even of things as simple as "unable to connect to remote host". So with one exception, current libvirt now uses the 'fd:' protocol for everything, and the last exception will be removed soon too.
I'm curious what people think about this type of approach. CC'ing libvirt to get their input.
In "normal" migration though, even when using fd:, we don't make any attempt to touch the data stream. We just pass a pre-connected TCP socket into QEMU and let it write directly to it. This avoids extra data copying via libvirt.

In our alternative "tunnelled" migration mode, libvirt does touch the data stream, passing a pipe FD into QEMU, copying the data from the pipe into packets to be sent over libvirtd's existing secure RPC stream, and then copying it back to QEMU on the destination. The downside here is that we've added several extra data copies.

In our "save/restore to file" code, we use 'fd:' and always have to send the data via a filter program. For example, we have the ability to compress/decompress data via gzip, bzip, xz, and lzop, for which we instead pass QEMU a pipe FD to the external compression helper program. We also have another new option where we send data via another I/O helper program that uses O_DIRECT, so save/restore does not pollute the page cache.

With this kind of existing precedent, I won't strongly argue against libvirt adding a filter to support this XBZRLE encoding scheme for migration, or indeed save/restore too, if it proves better than lzop, which is our current optimal speed/compression winner.

My main concern with all these scenarios where libvirt touches the actual data stream, though, is that we're introducing extra data copies into the migration path which potentially waste CPU cycles. If QEMU can directly XBZRLE-encode data into the FD passed via 'fd:' then we minimize data copies. Whether this is a big enough benefit to offset the burden of having to maintain various compression code options in QEMU I can't answer.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
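For reference, the filter approach described above is the same one a QEMU user can drive by hand for save/restore to file, with something like the following (file paths illustrative):

    # save VM state through an external compressor
    (qemu) migrate "exec:gzip -c > /var/lib/vm/state.gz"

    # restore on the incoming side, decompressing the same stream
    $ qemu -incoming "exec:gzip -d -c /var/lib/vm/state.gz" ...

An XBZRLE filter could slot into the same pipeline position, the difference being that it needs the destination's page contents to decode against, which is what pushes the logic toward QEMU itself.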

On 08/08/2011 05:04 PM, Daniel P. Berrange wrote:
My main concern with all these scenarios where libvirt touches the actual data stream though is that we're introducing extra data copies into the migration path which potentially waste CPU cycles. If QEMU can directly XBZRLE encode data into the FD passed via 'fd:' then we minimize data copies. Whether this is a big enough benefit to offset the burden of having to maintain various compression code options in QEMU I can't answer.
It's counterproductive to force an unneeded data copy in order to increase bandwidth.

--
error compiling committee.c: too many arguments to function
participants (4):
- Alexander Graf
- Anthony Liguori
- Avi Kivity
- Daniel P. Berrange