On 9/25/19 11:11 AM, Vladimir Sementsov-Ogievskiy wrote:
25.09.2019 16:52, John Snow wrote:
>
>
> On 8/20/19 6:25 PM, John Snow wrote:
>> Hi, downstream here at Red Hat I've been fielding some questions about
>> the usability and feature readiness of Bitmaps (and related features) in
>> QEMU.
>>
>> Here are some questions I answered internally that I am copying to the
>> list for two reasons:
>>
>> (1) To make sure my answers are actually correct, and
>> (2) To share this pseudo-reference with the community at large.
>>
>> This is long, and mostly for reference. There's a summary at the bottom
>> with some todo items and observations about the usability of the feature
>> as it exists in QEMU.
>>
>> Before too long, I intend to send a more summarized "roadmap" mail
>> which details all of the current and remaining work to be done in and
>> around the bitmaps feature in QEMU.
>>
>>
>> Questions:
>>
>>> "What format(s) is/are required for this functionality?"
>>
>> From the QEMU API, any format can be used to create and author
>> incremental backups. The only known format limitations are:
>>
>> 1. Persistent bitmaps cannot be created on any format except qcow2,
>> although there are hooks to add support to other formats at a later date
>> if desired.
>>
>> DANGER CAVEAT #1: Adding bitmaps to QEMU by default creates transient
>> bitmaps instead of persistent ones.
>>
>> Possible TODO: Allow users to 'upgrade' transient bitmaps to persistent
>> ones in case they made a mistake.
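A minimal sketch of caveat #1, assuming a node named "drive0" and a bitmap named "bitmap0" (both hypothetical): the QMP command block-dirty-bitmap-add takes an optional "persistent" argument, so a client that wants a persistent bitmap has to say so explicitly. The helper below only builds the JSON; sending it over a QMP socket is left out.

```python
import json


def bitmap_add_command(node, name, persistent=True):
    """Build a QMP block-dirty-bitmap-add command as a dict.

    'persistent' is passed explicitly so we never rely on QEMU's default,
    which creates a transient bitmap.
    """
    return {
        "execute": "block-dirty-bitmap-add",
        "arguments": {"node": node, "name": name, "persistent": persistent},
    }


# "drive0" and "bitmap0" are placeholder names for illustration only.
cmd = bitmap_add_command("drive0", "bitmap0")
print(json.dumps(cmd))
```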
>>
>>
>> 2. When using push backups (blockdev-backup, drive-backup), you may use
>> any format as a target format.
>>
>> DANGER CAVEAT #2: without backing file and/or filesystem-less sparse
>> support, these images will be unusable.
>>
>> EXAMPLE: Backing up to a raw file loses allocation information, so we
>> can no longer distinguish between zeroes and unallocated regions. The
>> cluster size is also lost. This file will not be usable without
>> additional metadata recorded elsewhere.*
>>
>> (* This is complicated, but it is in theory possible to do a push backup
>> to e.g. an NBD target with custom server code that saves allocation
>> information to a metadata file, which would allow you to reconstruct
>> backups. For instance, recording in a .json file which extents were
>> written out would allow you to -- with a custom binary -- write this
>> information on top of a base file to reconstruct a backup.)
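To make the footnote concrete, here is a rough sketch of what such a "custom binary" could look like. The metadata layout is entirely an assumption made for illustration: a JSON list of {"offset", "length"} records, plus a data file holding the written extents back to back in the same order. Nothing in QEMU produces this format.

```python
import json


def apply_extents(base_path, data_path, extents_json):
    """Write recorded extents on top of a base file, in place.

    extents_json: JSON list of {"offset": int, "length": int} (hypothetical
    format). data_path holds the extents' payloads concatenated in order.
    """
    extents = json.loads(extents_json)
    with open(base_path, "r+b") as base, open(data_path, "rb") as data:
        for ext in extents:
            payload = data.read(ext["length"])
            if len(payload) != ext["length"]:
                raise ValueError("data file is shorter than the recorded extents")
            base.seek(ext["offset"])
            base.write(payload)
```

Regions not named in the metadata are left exactly as they were in the base file, which is the allocation information a plain raw target would otherwise lose.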
>>
>>
>> 3. Any format can be used for either shared storage or live storage
>> migrations. There are TWO distinct mechanisms for migrating bitmaps:
>>
>> A) The bitmap is flushed to storage and re-opened on the destination.
>> This is only supported for qcow2 and shared-storage migrations.
>>
>> B) The bitmap is live-migrated to the destination. This is supported for
>> any format and can be used for either shared storage or live storage
>> migrations.
>>
>> DANGER CAVEAT #3: The second bitmap migration technique there is an
>> optional migration capability that must be enabled explicitly.
>> Otherwise, some migration combinations may drop bitmaps.
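For reference, a small sketch of how mechanism B is switched on, assuming the capability is the one named "dirty-bitmaps" in the QMP schema. The helper only builds the command; per gotcha #3 in the summary, it has to be issued to both the source and the destination QEMU before migrating.

```python
def enable_bitmap_migration():
    """Build the QMP command that enables the bitmap migration capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [
                # Capability name as I understand the QMP schema.
                {"capability": "dirty-bitmaps", "state": True},
            ]
        },
    }


command = enable_bitmap_migration()
```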
>>
>> Matrix:
>>
>>> migrate = migrate_capability or (persistent and shared_storage)
>>
>> Enumerated:
>>
>> live-storage   + raw   : transient  + no-capability: Dropped
>> live-storage   + raw   : transient  + bm-capability: Migrated
>> live-storage   + qcow2 : transient  + no-capability: Dropped
>> live-storage   + qcow2 : transient  + bm-capability: Migrated
>> live-storage   + qcow2 : persistent + no-capability: Dropped (!)
>> live-storage   + qcow2 : persistent + bm-capability: Migrated
>>
>> shared-storage + raw   : transient  + no-capability: Dropped
>> shared-storage + raw   : transient  + bm-capability: Migrated
>> shared-storage + qcow2 : transient  + no-capability: Dropped
>> shared-storage + qcow2 : transient  + bm-capability: Migrated
>> shared-storage + qcow2 : persistent + no-capability: Migrated
>> shared-storage + qcow2 : persistent + bm-capability: Migrated
>>
>> Enabling the bitmap migration capability will ALWAYS migrate the bitmap.
>> If it's disabled, we will only migrate the bitmaps for shared storage
>> migrations where the bitmap is persistent, which is a qcow2-only case.
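The rule can be written down directly. This is just the one-line matrix formula from the mail turned into a function and enumerated; the function and argument names are mine. Image format is not an input because it only matters indirectly: persistent bitmaps exist only on qcow2.

```python
from itertools import product


def bitmap_migrates(capability, persistent, shared_storage):
    """migrate = migrate_capability or (persistent and shared_storage)"""
    return capability or (persistent and shared_storage)


for shared, persistent, capability in product([False, True], repeat=3):
    outcome = "Migrated" if bitmap_migrates(capability, persistent, shared) else "Dropped"
    print(
        "shared-storage" if shared else "live-storage",
        "persistent" if persistent else "transient",
        "bm-capability" if capability else "no-capability",
        outcome,
    )
```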
>>
>> There is no warning or error if you attempt to migrate in a manner that
>> loses your bitmaps.
>>
>> (I might be persuaded to add a case for when you are doing a live
>> storage migration of qcow2 with persistent bitmaps, which is somewhat a
>> conflicting case: you've asked for the bitmap to be persistent, but it
>> seems likely that if this ever happens in practice, it's because you
>> have neglected to ask for it to be migrated to the new host.)
>>
>> See iotest 169 for more details on this.
>>
>> At present, these are the only format limitations I am consciously aware
>> of. From a management API/GUI perspective, it makes sense to restrict
>> the feature set to "qcow2 only" to minimize edge cases.
>>
>>
>>> "Is libvirt aware of these 'gotcha' cases?"
>>
>> From talks I've had with Eric Blake and Peter Krempa, they certainly are
>> now.
>>
>>
>>> "Is it possible to make persistent the default?"
>>
>> Not quickly.
>>
>> In QEMU, not without a deprecation period or some other incompatibility.
>> Default values are not (yet?) introspectable via the schema. We need
>> (possibly) up to two QAPI extensions:
>>
>> I) The ability to return deprecation warnings when issuing a command
>> that will cease to work in the future.
>>
>> This has been discussed somewhat on-list recently. It seems like
>> there is not a big appetite for tackling something perceived as
>> low-value because it is likely to be ignored.
>>
>> II) The ability to document default values in the QAPI schema for the
>> purposes of introspection.
>>
>> With one or both of these extensions, we could remove the default value
>> for persistence and promote it to a required argument with a
>> transitionary period where it will work with a warning. Then, in the
>> future, users will be forced to specify if they want persistent=true or
>> persistent=false.
>>
>> This is not on my personal roadmap to implement.
>>
>>
>>> "Is it possible to make bitmap migration the default?"
>>
>> I don't know at present. Migration capabilities are either "on" or "off"
>> and the existing negotiation mechanisms for capabilities are extremely
>> rudimentary.
>>
>> Changing this might require fiddling with machine compat properties,
>> adding features to the migration protocol, or more. I don't know what I
>> don't know, so I will estimate this change as likely invasive.
>>
>> I've discussed this with David Gilbert and it seems like a complicated
>> project for the benefit of this sub-project alone, so this isn't on my
>> personal roadmap to resolve.
>>
>> The general consensus appears to be that protecting the user is
>> libvirt's job.
>>
>>
>>> "Where do we stand with external snapshot support?"
>>
>> Still broken. In the aftermath of 4.1, it's the most obvious outstanding
>> broken feature. Vladimir has patches to fix it, but they need some
>> attention.
>>
>
> It looks as if the fix is a little risky, but the correct fix is
> going to be much harder. Our reopen support simply does not accommodate
> images needing to write dirty bits on open in a hierarchical graph.
I tried the hard way, you may look through previous series versions.
Kevin disliked it.

Yes, sadly... At the moment it looks like we're going to take the
workaround fix, but Kevin's on PTO so maybe that idea will change.
>
>>
>>> "What needs to happen to bitmaps when doing stream or commit?"
>>
>> Uncertain in QEMU; creating an external snapshot implicitly ends the
>> timeslice represented by the old bitmap, but an explicit checkpoint is
>> better.
>>
>> I think some little ascii charts will help people understand what we're
>> talking about here, so let's cover some examples.
>>
>>
>> SCENARIO 1)
>>
>> Here's a timeline for a single node (one image, no backing files), with
>> some points in time highlighted:
>>
>> Time T = 0.........................n
>> +rec:    [--X------Y------Z--------]
>> -rec:    [---------x------y--------]
>> region:  [aabbbbbbbcccccccddddddddd]
>>
>>
>> X, Y, and Z are points in time where bitmaps 'x', 'y', and 'z' were
>> created and began recording. x and y are points in time where bitmaps
>> 'x' and 'y' stopped recording.
>>
>> This creates a few distinct regions / timeslices.
>>
>> a: Data written before we began tracking writes.
>> b: Data written to bitmap 'x'
>> c: Data written to bitmap 'y'
>> d: Data written to bitmap 'z'
>>
>> region 'a' is of an unknown size and indeterminate length, because there
>> is no reference point (checkpoint) prior to it.
>>
>> regions 'b' and 'c' are of finite size and determinate length, because
>> they have fixed reference points on either side of their timeslice.
>>
>> region 'd' is also of an unknown size and indeterminate length, because
>> it is actively recording and has no checkpoint to its right. It may be
>> fixed at any time by disabling bitmap 'z'.
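A toy model may help here. This is not QEMU code: the Bitmap and Node classes are invented for illustration, a "write" is just a cluster index, and checkpoint() stands in for doing "create the new bitmap, disable the old one" at a single atomic moment.

```python
class Bitmap:
    """A toy dirty bitmap: a set of dirty cluster indices plus an enabled flag."""

    def __init__(self):
        self.enabled = True
        self.dirty = set()


class Node:
    """A toy block node that records each write in every enabled bitmap."""

    def __init__(self):
        self.bitmaps = {}

    def write(self, cluster):
        for bitmap in self.bitmaps.values():
            if bitmap.enabled:
                bitmap.dirty.add(cluster)

    def checkpoint(self, new, old=None):
        # Model "create(new) + disable(old)" happening at one atomic moment.
        if new in self.bitmaps:
            raise ValueError(f"bitmap {new!r} already exists")
        if old is not None:
            self.bitmaps[old].enabled = False
        self.bitmaps[new] = Bitmap()

    def dirty_since(self, names):
        # Clusters an incremental backup spanning these timeslices must copy.
        return set().union(*(self.bitmaps[n].dirty for n in names))


node = Node()
node.write(0)                  # region a: nothing is tracking yet
node.checkpoint("x")           # point X
node.write(1)                  # region b
node.write(2)
node.checkpoint("y", old="x")  # points Y and x
node.write(3)                  # region c
node.checkpoint("z", old="y")  # points Z and y
node.write(4)                  # region d, still open
```

In this model a backup "from Y to now" copies node.dirty_since(["y", "z"]), and the write made in region a is invisible to every bitmap.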
>>
>> In QEMU, what we generally want is to do several things at one atomic
>> moment to keep these regions adjacent, contiguous, and disjoint. So
>> from a high level (using a fictional simplified syntax), we do:
>>
>> Transaction(
>>   create('y'),
>>   disable('x'),
>>   backup('x')
>> )
>>
>> which together performs a backup+checkpoint.
>>
>> We can do a backup without a checkpoint:
>>
>> 4.1:
>> Transaction(
>>   create('tmp'),
>>   merge('tmp', 'x'),
>>   backup('tmp')
>> )
>>
>> 4.2:
>>> backup('x', bitmap_sync=never)
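As a sketch of how the 4.2 form might look on the wire, assuming the blockdev-backup arguments "sync": "bitmap" and "bitmap-mode": "never" as I understand the 4.2 schema. The device, target and job names are placeholders. With bitmap-mode "never" the bitmap is left untouched whether the job succeeds or fails, which is what makes this a backup without a checkpoint.

```python
def backup_without_checkpoint(device, target, bitmap, job_id):
    """Build a QMP blockdev-backup command that reads a bitmap but never modifies it."""
    return {
        "execute": "blockdev-backup",
        "arguments": {
            "job-id": job_id,
            "device": device,
            "target": target,
            "sync": "bitmap",
            "bitmap": bitmap,
            "bitmap-mode": "never",
        },
    }


# All names below are hypothetical.
cmd = backup_without_checkpoint("drive0", "backup-target0", "x", "backup-job0")
```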
>>
>> Or a checkpoint without a backup:
>>
>> Transaction(
>>   create('y'),
>>   disable('x')
>> )
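The checkpoint-only case maps fairly directly onto a QMP transaction. A sketch, assuming the transaction action types block-dirty-bitmap-add and block-dirty-bitmap-disable; the node name is a placeholder and the "persistent" choice is mine, not something the fictional syntax specifies.

```python
def checkpoint_only(node, old_bitmap, new_bitmap, persistent=True):
    """Build a QMP transaction: create new_bitmap and disable old_bitmap atomically."""
    return {
        "execute": "transaction",
        "arguments": {
            "actions": [
                {
                    "type": "block-dirty-bitmap-add",
                    "data": {"node": node, "name": new_bitmap, "persistent": persistent},
                },
                {
                    "type": "block-dirty-bitmap-disable",
                    "data": {"node": node, "name": old_bitmap},
                },
            ]
        },
    }


txn = checkpoint_only("drive0", old_bitmap="x", new_bitmap="y")
```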
>>
>
> Concerning the following scenario:
>
>>
>> SCENARIO 2)
>>
>> Now, what happens when we make an external snapshot and do nothing at
>> all to our bitmaps?
>>
>> Time T = 0.......................................n
>> +rec:    [--X------Y------Z--------] <-- [-------]
>> -rec:    [---------x------y--------] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd] <-- [eeeeeee]
>>          {          base           } <-- {  top  }
>>
>> We've created a new implicit timeslice, "e", without creating a new
>> bitmap. Because the bitmap 'z' was still active at the time of the
>> snapshot, it now has a temporarily-determinate endpoint to its region.
>>
>> This is kind of like an "implied checkpoint", but it's a very poor one
>> because it's not really addressable.
>>
>> DANGER CAVEAT #4: We have no way to create incremental backups anymore,
>> because the current moment in time has no addressable point.
>>
>> That's not great; but it is likely a fixable scenario when commit is
>> fixed: committing the top layer back down into the base layer will add
>> all new writes to the end of the old region; restoring our backup chain:
>>
>> Time T = 0.........................C.......n
>> +rec:    [--X------Y------Z-------- -------]
>> -rec:    [---------x------y-------- -------]
>> region:  [aabbbbbbbcccccccddddddddd ddddddd]
>>
>> Here, region 'e' just gets appended to region d, and we can make
>> incremental backups from any checkpoint X, Y, Z to the current moment again.
>>
>
> It's been brought to my attention that oVirt wants to be able to create
> snapshots offline.
>
> It's not clear if they are willing to make these snapshots using
> libvirt's offline support, or if they want to do it using qemu-img directly.
>
> If using libvirt, libvirt will be able to manage bitmaps as it sees fit,
> even offline, using qemu and QMP to manage the images (offline).
>
> If it's the second, this snapshot scenario is the one they will
> encounter, where we have a top layer that has no inherent checkpoint or
> bitmap information.
>
> Ramifications of this were discussed below in the original email:
> [scroll ...]
>
>>
>> SCENARIO 3)
>>
>> What happens if we make a firm checkpoint at the same time we make the
>> snapshot?
>>
>> Transaction(
>>   disable('z'),
>>   snapshot('top'),
>>   create('w')
>> )
>>
>> Time T = 0......................... ......n
>> +rec:    [--X------Y------Z-------- ] <-- [W------]
>> -rec:    [---------x------y--------z] <-- [-------]
>> region:  [aabbbbbbbcccccccddddddddd ] <-- [eeeeeee]
>>          {           base           } <-- {  top  }
>>
>> Now instead of the new region 'e' being implied, it's explicit. We can
>> make backups between any point and the current moment *across* the gap.
>>
>> It was my thought that this was the most preferable method that libvirt
>> should use, but there is some doubt from Peter Krempa. We'll see how it
>> shakes out.
>>
>>
>>
>> There are questions about what QEMU should do by default, without
>> libvirt's help. At the moment, it's "nothing" but there have been
>> questions about "something".
>>
>> Keeping in mind that we likely can't change our existing behavior
>> without some kind of flag, there are still some API/usability questions:
>>
>>
>>> If we create an external snapshot on top of an image with actively
>>> recording bitmaps, should we disable them?
>>
>> We can leave them enabled, but they'll never see any writes. Or we can
>> explicitly disable them. Explicitly disabling them may make more sense
>> to prevent modifying bitmaps accidentally on commit.
>>
>> My guess: No, we should leave them alone; allow checkpoint creation
>> mechanisms to do the disable+create dance for bitmaps as needed.
>>
>> Potential problems: The backing image is read-only, and if we change our
>> mind later, we will need to find a way to re-open the backing image as
>> read-write for the purposes of toggling the recording bit prior to any
>> legitimate guest usage of that node. Then, re-open as RO again.
>>
>>
>>
>>> Should we fork bitmaps (on snapshot)?
>>
>> If a bitmap named 'z' is recording when we create an external snapshot,
>> should that bitmap be *copied* into the top layer?
>>
>> My guess: No.
>>
>> This would allow us to create external snapshots *without* creating a
>> checkpoint, but conceptually that's a nightmare: It would allow for
>> mutually independent creation of snapshots OR checkpoints. This would be
>> hard to corral when undoing a snapshot, for instance.
>>
>> In my opinion, snapshots MUST be checkpoints, and therefore allowing a
>> snapshot without creating a checkpoint is a no-go.
>>
>>
>>> (Should we fork bitmaps) if we're not using checkpoints?
>>
>> If we are using a checkpoint-less paradigm (i.e. the rolling incremental
>> backup using only one bitmap) we might want to copy the bitmap up to
>> make the next incremental backup as if nothing ever happened.
>>
>> However, rolling incremental backups don't need any kind of auto-copy
>> feature. This is possible today:
>>
>>> create('base', 'A')
>>> transact(snapshot('top'), create('top', 'B'))
>>> merge('B', [('base', 'A'), ('top', 'B')])
>>
>> i.e., we create a new bitmap on the top layer, then merge in the old
>> data from the backing file, which remains addressable.
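A sketch of the "copy up" step as a QMP command, assuming block-dirty-bitmap-merge accepts source bitmaps given as {"node", "name"} pairs so that a bitmap on the backing node can be merged into one on the top node. Node and bitmap names are placeholders taken from the fictional example.

```python
def copy_up(top_node, target_bitmap, sources):
    """Build a QMP block-dirty-bitmap-merge command.

    sources: iterable of (node, bitmap_name) pairs whose dirty bits are
    merged into target_bitmap on top_node.
    """
    return {
        "execute": "block-dirty-bitmap-merge",
        "arguments": {
            "node": top_node,
            "target": target_bitmap,
            "bitmaps": [{"node": node, "name": name} for node, name in sources],
        },
    }


# Merge bitmap 'A' from the backing node into 'B' on the new top layer.
cmd = copy_up("top", "B", [("base", "A")])
```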
>>
>> Whether the user wants to copy up or not, there are commands that will
>> do that already.
>>
>>
>
> ... this following section covers some ways of avoiding the problems of
> the scenario I replied to above, but mostly in the context of what QEMU
> can do to prevent the scenario -- to which the conclusion was "nothing,"
> especially if snapshots are created without QEMU's facilitation (via
> qemu-img.)
>
>>> Should we create new bitmaps by default when we can?
>>
>> If a backing image has bitmaps, should QEMU automatically create a new
>> bitmap for the top layer? Should it be named something new, something
>> user-provided, or based on existing active bitmaps?
>>
>> If a user creates a new external snapshot with no consideration paid to
>> bitmaps (like "SCENARIO 2" above), they temporarily lose the ability to
>> do incremental backups. They might be able to commit the image back to
>> "try again."
>>
>> That's not great. Here are some options for resolving this:
>>
>> - Automatic names: Might cause collisions out-of-band with management
>> tooling by accident, tooling has to query to discover what bitmaps were
>> automatically created.
>>
>> - Same names: Can create namespace confusion when committing snapshots
>> later; although each layer of a backing chain can have bitmaps named the
>> same thing, it causes future problems when committing together that can
>> be hard to resolve.
>>
>> - User-provided name: This is workable, and requires an amendment to the
>> snapshot command to automatically create a new bitmap on the snapshot.
>>
>>
>> My guess: No, we can't automatically create a new bitmap for the user.
>> We can amend the snapshot commands to accept bitmap names, but at that
>> point we've just duplicated transactions:
>>
>> Transact(
>>   snapshot('top'),
>>   create('top', 'new-bitmap')
>> )
>>
>
> There's one last relevant mitigation discussed further down: [scroll ...]
>
>>
>> All that said (Mostly a lot of "No, let's not do anything"), maybe
>> there's room for an "assistive" mode for users, a bitmap-aware snapshot
>> creation command. It could do the following well-defined magic:
>>
>> bitmap-snapshot(base, top, bitmap_name):
>>   1. disable any active bitmaps in the base node.
>>   2. create a bitmap named "bitmap_name" in the top node, failing if
>>      a bitmap by that name already exists on either node.
>>
>> What this accomplishes:
>> - Disables any bitmaps in the base layer ahead of time, in preparation
>> for an eventual commit operation.
>> - Always creates a new, enabled bitmap on the snapshot node which is
>> guaranteed not to conflict with a name on the base node. This bitmap can
>> be used to create additional copies post-hoc, if desired.
>> - Formalizes our "best practice" suggestion for mixing bitmaps and
>> snapshots into a single, documented command.
>>
>> Is this strictly needed? No, if you have the foresight, you can do this
>> instead:
>>
>> Transact(
>>   disable('a'),
>>   disable('b'),
>>   disable('c'),
>>   # plus however many more ...
>>   snapshot('top', ...),
>>   create('top', 'd')
>> )
>>
>> but a convenience command might still have a role to play in helping
>> take the guesswork out for non-libvirt users.
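Until such a command exists, a client could generate the transaction itself. The helper below is a sketch of the proposed bitmap-snapshot logic done client-side, not an existing QEMU command: it assumes the caller has already queried which bitmaps exist on each node, and that the overlay node has been added so that the blockdev-snapshot action can be used. All names are placeholders.

```python
def bitmap_snapshot_actions(base, top, bitmap_name, base_bitmaps, top_bitmaps=()):
    """Build transaction actions for a bitmap-aware snapshot.

    base_bitmaps: dict mapping bitmap name -> True if currently enabled.
    top_bitmaps: names of bitmaps already present on the top node.
    """
    # Step 2's failure condition: the new name must be unused on both nodes.
    if bitmap_name in base_bitmaps or bitmap_name in top_bitmaps:
        raise ValueError(f"bitmap {bitmap_name!r} already exists")

    # Step 1: disable every bitmap that is still recording in the base node.
    actions = [
        {"type": "block-dirty-bitmap-disable", "data": {"node": base, "name": name}}
        for name, enabled in base_bitmaps.items()
        if enabled
    ]
    # Then take the snapshot and create the new bitmap on the top node.
    actions.append({"type": "blockdev-snapshot", "data": {"node": base, "overlay": top}})
    actions.append({"type": "block-dirty-bitmap-add", "data": {"node": top, "name": bitmap_name}})
    return actions


actions = bitmap_snapshot_actions(
    "base", "top", "d", {"a": False, "b": False, "c": True}
)
```

The returned list would be sent as the "actions" argument of a single QMP transaction command, so that disabling, snapshotting and creating happen together.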
>>
>>
>>
>> That's the bulk of what was discussed.
>>
>> Summary:
>>
>>
>> GOTCHAs:
>> #1: Bitmaps are created non-persistent by default, and can't be changed.
>>
>> #2: Push backups will happily write to a destination format that isn't
>> semantically useful.
>>
>> #3: Migrating non-shared block storage can drop even persistent bitmaps
>> if you don't pass the bitmap migration capability flag to both QEMU
>> instances.
>>
>> #4: Creating a snapshot without doing some bitmap manipulation
>> beforehand can temporarily render your bitmaps unusable. Failing to
>> disable bitmaps before creating a snapshot might make commits very
>> tricky later on.
>>
>> Gotchas 1 and 4 can be at least partially alleviated. Gotcha 2 remains a
>> pain point we cannot intercept at the QEMU layer. Gotcha 3 has potential
>> remedies, but they are complicated.
>>
>>
>> QEMU todo items:
>> - Fix bitmap data corruption on commit (Ongoing, by Vladimir@Virtuozzo)
>>
>> - Add a set_persistence method for bitmaps that allows us to change the
>> storage class of a bitmap after creation. (Helps alleviate gotcha #1.)
>>
>> - Add a command that allows us to merge allocation data into a bitmap.
>> This helps alleviate gotcha #4: If we create a new image but neglected
>> to do the proper transaction dance, we can simply copy the allocation
>> data into a new bitmap. (Note, we'd still need set_persistence to help
>> us disable the old bitmap before any commit happens.)
>>
>
> ... This was perceived at the time to be an unnecessary convenience
> feature, because the belief was that libvirt should simply avoid this
> from happening in the first place.
>
> However, if we acknowledge that snapshots may be made without libvirt's
> help, this is a quick and easy way to "fix" checkpoint consistency
> post-hoc.

Still, even without libvirt, a management tool should prevent this from
happening. Or are we talking about an end user running qemu-img by hand,
without any management layer?

That's the fear. I think it's best to avoid it if at all possible, but
unless I find a way to prohibit this workflow, we should be prepared to
accommodate it.

Nir, can you comment on oVirt's use case for needing to do external
snapshots without libvirt facilitating it?

From my perspective: it is of course possible, but there will be
mitigation and recovery work that libvirt will need to do in order to
make qcow2 graphs consistent again after a manual manipulation, which
increases complexity of the design and introduces new failure points.

Once you start using libvirt to manage checkpoints in an image, it would
be best if libvirt was used to manage snapshots from that point forward.

Namely, if libvirt uses special names to track dependencies between
bitmaps in QEMU, the odds of a user successfully applying the manual
actions necessary to make a cohesive tree that doesn't accidentally
invalidate a design invariant in libvirt seem low.

Or, put another way, the checkpoint abstraction exists only in libvirt,
so there's little we can do if you use checkpoint-unaware tools to
manipulate images with checkpoints in them.

And I'm still sure that qemu-img is the wrong instrument, and that it is
better to use qemu in a stopped state for offline manipulations.

Oh, I completely agree. Anything touching bitmaps should go through
QEMU. The case we are wondering about here is adding a new, blank top
layer with qemu-img which doesn't require any bitmap-specific knowledge.

But I'm not opposed to the idea, it should work of course.
>
> --js
>
>> - Add convenience command for easy + safe combination of bitmaps +
>> snapshots. Helps prevent #4.
>>
>>
>> Research items:
>> - How hard is it to reopen a backing image as RW while it's in-use,
>> disable a bitmap, and then reopen as RO? This is to partially address
>> gotcha #4; if we forget to disable bitmaps before creating the snapshot.
>>
>> - How hard is the reverse operation? Can we reopen a backing image RW,
>> enable a bitmap, and then reopen as RO? This gives us better control
>> over what happens on commit.
>>
>> - After we fix the commit bug, what does/should commit actually do with
>> bitmaps? What about bitmaps that collide? The current behavior is that
>> bitmaps in the top are not transferred to the base, and any bitmaps
>> active in the base record all the new writes from the top.
>>
>>
>> That's all!
>> --js
>>