What's wrong with internal VM snapshots?

Hi!

I am trying to improve the support for VM snapshots in the Cockpit web console, and I am afraid I have questions...

We have been asked to prefer the "external" over the "internal" snapshot format, at least on RHEL. I haven't yet figured out why, and consequently I am struggling with deciding how hard the Cockpit UI should push people towards external snapshots.

So, what's wrong with internal snapshots?

I heard they are "unreliable", but how so in detail? Does the data structure inside the qcow2 files get corrupted easily? Do they behave poorly when the snapshot process runs out of disk in the middle?

That sort of thing would help me a lot to figure out what Cockpit should be doing on platforms other than RHEL.

And how well (or how soon) can external snapshots be expected to work?

I have severely messed up my libvirt state a couple of times while playing around with them, and my confidence in them right now isn't great. :-) Are you surprised by this? Or are external snapshots not yet considered ready?

(Most recent example from my experiments: Deleting a full system snapshot of a paused machine fails with "internal error: unable to execute QEMU command 'block-commit': Block node is read-only". Reverting to it works and after that it can also be deleted, all while the VM is paused.)

Thanks!

On Thu, May 02, 2024 at 14:14:25 +0300, Marius Vollmer wrote:
> Hi!
> I am trying to improve the support for VM snapshots in the Cockpit web console, and I am afraid I have questions...
> We have been asked to prefer the "external" over the "internal" snapshot format, at least on RHEL. I haven't yet figured out why, and consequently I am struggling with deciding how hard the Cockpit UI should push people towards external snapshots.
This is because development preferentially went into external snapshots. This unfortunately also meant that internal snapshots were neglected.
> So, what's wrong with internal snapshots?
Firstly, internal snapshots only work with storage formats that support them. Basically, you can't snapshot a VM with a 'raw' disk. That's not the case with external snapshots, since the qcow2 image is created as an overlay. Currently, internal snapshots (at least when done via libvirt) don't allow you to do a partial (not all disks) snapshot and don't work with UEFI (historical reasons -> memory image would be stored in the UEFI image). Both of those can be solved but will require some work.
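[Editorial illustration, not part of the original message.] To make the internal/external distinction above concrete, here is a minimal sketch that builds libvirt's <domainsnapshot> XML for a disk-only snapshot using only the Python standard library. The function name, the overlay directory, and the overlay file naming scheme are assumptions invented for this example. The resulting document is the kind of input that `virsh snapshot-create` or the Python binding's `snapshotCreateXML()` accepts, but nothing here talks to libvirt.

```python
import xml.etree.ElementTree as ET


def snapshot_xml(name, disk_targets, external,
                 overlay_dir="/var/lib/libvirt/images"):
    """Return <domainsnapshot> XML for a disk-only snapshot.

    disk_targets: disk target names such as ["vda"].
    external:     True  -> each disk gets a new qcow2 overlay file;
                  False -> the snapshot lives inside the existing image,
                           which therefore has to be qcow2 already.
    overlay_dir and the file naming are hypothetical choices.
    """
    root = ET.Element("domainsnapshot")
    ET.SubElement(root, "name").text = name
    disks = ET.SubElement(root, "disks")
    for target in disk_targets:
        disk = ET.SubElement(
            disks, "disk", name=target,
            snapshot="external" if external else "internal")
        if external:
            # The overlay is always qcow2, even if the base image is raw.
            ET.SubElement(disk, "driver", type="qcow2")
            ET.SubElement(disk, "source",
                          file=f"{overlay_dir}/{target}.{name}.qcow2")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(snapshot_xml("before-upgrade", ["vda"], external=True))
    print(snapshot_xml("before-upgrade", ["vda"], external=False))
```

In the external case each disk names a new overlay file, which is why the base image's own format does not matter; in the internal case there is no new file at all.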
> I heard they are "unreliable", but how so in detail? Does the data structure inside the qcow2 files get corrupted easily?
I don't think this ever was true.
> Do they behave poorly when the snapshot process runs out of disk in the middle?
The main problem is that the VM is paused and the interaction with libvirt blocks until the snapshot is done. The only disk consumption is from when the memory is snapshotted; in either case that should lead to a failure in snapshotting, and the VM should continue.
> That sort of thing would help me a lot to figure out what Cockpit should be doing on platforms other than RHEL.
I don't think having different behaviour is a good idea.
> And how well (or how soon) can external snapshots be expected to work?
Very technically, libvirt already expects external snapshots to work. That said, it's a relatively recent implementation, so there may be bugs.
> I have severely messed up my libvirt state a couple of times while playing around with them, and my confidence in them right now isn't great. :-) Are you surprised by this? Or are external snapshots not yet considered ready?
Please do report them, including the steps by which you managed to break stuff.
> (Most recent example from my experiments: Deleting a full system snapshot of a paused machine fails with "internal error: unable to execute QEMU command 'block-commit': Block node is read-only". Reverting to it works and after that it can also be deleted, all while the VM is paused.)
Ah right. This is a corner case oversight in the implementation. QEMU apparently removes write flags from the storage while the VM is paused. Once again, please report this in our issue tracker.

Peter Krempa <pkrempa@redhat.com> writes:
> On Thu, May 02, 2024 at 14:14:25 +0300, Marius Vollmer wrote:
>> We have been asked to prefer the "external" over the "internal" snapshot format, at least on RHEL. I haven't yet figured out why, and consequently I am struggling with deciding how hard the Cockpit UI should push people towards external snapshots.
> This is because development preferentially went into external snapshots. This unfortunately also meant that internal snapshots were neglected.
I see, thanks! So, would it be fair to say that internal snapshots are deprecated by upstream libvirt itself (since 0.10), not just in RHEL?
>> [...] That sort of thing would help me a lot to figure out what Cockpit should be doing on platforms other than RHEL.
> I don't think having different behaviour is a good idea.
Yes, it's also less work. :)
>> And how well (or how soon) can external snapshots be expected to work?
> Very technically, libvirt already expects external snapshots to work. That said, it's a relatively recent implementation, so there may be bugs. [...] Please do report them, including the steps by which you managed to break stuff.
Will do! To be fair, most of my troubles were probably caused by using external snapshots wrong (such as using --diskspec sda,source=blah without knowing what I was doing and creating broken snapshots that way), and Cockpit will of course prevent people from making those mistakes. Thanks again!

On Mon, May 06, 2024 at 10:03:23 +0300, Marius Vollmer wrote:
> Peter Krempa <pkrempa@redhat.com> writes:
>> On Thu, May 02, 2024 at 14:14:25 +0300, Marius Vollmer wrote:
>>> We have been asked to prefer the "external" over the "internal" snapshot format, at least on RHEL. I haven't yet figured out why, and consequently I am struggling with deciding how hard the Cockpit UI should push people towards external snapshots.
>> This is because development preferentially went into external snapshots. This unfortunately also meant that internal snapshots were neglected.
> I see, thanks! So, would it be fair to say that internal snapshots are deprecated by upstream libvirt itself (since 0.10), not just in RHEL?
No, that is not fair to say from the upstream point of view. We do not plan to remove the functionality and will accept any form of improvements. The same applies to qemu.
>>> [...] That sort of thing would help me a lot to figure out what Cockpit should be doing on platforms other than RHEL.
>> I don't think having different behaviour is a good idea.
> Yes, it's also less work. :)
>>> And how well (or how soon) can external snapshots be expected to work?
>> Very technically, libvirt already expects external snapshots to work. That said, it's a relatively recent implementation, so there may be bugs. [...] Please do report them, including the steps by which you managed to break stuff.
> Will do! To be fair, most of my troubles were probably caused by using external snapshots wrong (such as using --diskspec sda,source=blah without knowing what I was doing and creating broken snapshots that way), and Cockpit will of course prevent people from making those mistakes.
Libvirt shouldn't allow creating overtly broken snapshots either. That said, there are multiple ways that users can shoot themselves in the foot which can't be validated/refused, as those make sense in certain scenarios. Either way, make sure to report anything broken; we can at the very least improve the documentation if there's nothing we can fix code-wise.

Peter Krempa <pkrempa@redhat.com> writes:
> On Mon, May 06, 2024 at 10:03:23 +0300, Marius Vollmer wrote:
>> I see, thanks! So, would it be fair to say that internal snapshots are deprecated by upstream libvirt itself (since 0.10), not just in RHEL?
[ version 10, not 0.10. Sorry, no idea why I thought you are on 0.10... ]
> No, that is not fair to say from the upstream point of view. We do not plan to remove the functionality and will accept any form of improvements. The same applies to qemu.
Ok! Are external snapshots preferred over internal ones? How should a user decide which format to use when both are possible in a given situation?
> [...] Either way, make sure to report anything broken; we can at the very least improve the documentation if there's nothing we can fix code-wise.
Yes, will do.

On Mon, May 06, 2024 at 11:10:57 +0300, Marius Vollmer wrote:
> Peter Krempa <pkrempa@redhat.com> writes:
>> On Mon, May 06, 2024 at 10:03:23 +0300, Marius Vollmer wrote:
>>> I see, thanks! So, would it be fair to say that internal snapshots are deprecated by upstream libvirt itself (since 0.10), not just in RHEL?
> [ version 10, not 0.10. Sorry, no idea why I thought you are on 0.10... ]
>> No, that is not fair to say from the upstream point of view. We do not plan to remove the functionality and will accept any form of improvements. The same applies to qemu.
> Ok! Are external snapshots preferred over internal ones?
External snapshots are preferred in the sense that they have received more development recently. Internal snapshots still lack the refactor to the new QEMU APIs which would allow libvirt to use them more effectively.
> How should a user decide which format to use when both are possible in a given situation?
It really depends on what the user wants. Internal snapshots are more self-contained, so the user might prefer them if they want to move the image around. External snapshots, on the other hand, allow more control over where the data is stored and are possible even if the original image is not yet qcow2 (or another format supporting internal snapshots). Libvirt will need to default to internal snapshots to preserve historical compatibility.
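[Editorial illustration, not part of the original message.] The constraints mentioned across this thread can be collected into a small decision helper, roughly the kind of check a front end such as Cockpit might run before offering a snapshot type. This is only a sketch under stated assumptions: the `Disk` class, the function name, and its parameters are hypothetical, it treats qcow2 as the only format supporting internal snapshots, and it encodes only the limitations described above (storage format, partial snapshots, and UEFI, as they stand for internal snapshots done via libvirt at the time of the thread). It falls back to internal because that is libvirt's historical default.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    target: str  # e.g. "vda"
    fmt: str     # image format, e.g. "qcow2" or "raw"


def choose_snapshot_type(disks, uses_uefi, all_disks=True):
    """Return (kind, reasons), where kind is "internal" or "external".

    Hypothetical helper: it only reflects the limitations of internal
    snapshots named in the thread, not libvirt's actual validation.
    """
    reasons = []
    non_qcow2 = [d.target for d in disks if d.fmt != "qcow2"]
    if non_qcow2:
        reasons.append("disks without internal snapshot support: "
                       + ", ".join(non_qcow2))
    if uses_uefi:
        reasons.append("internal snapshots do not work with UEFI")
    if not all_disks:
        reasons.append("internal snapshots cannot cover only some disks")
    if reasons:
        return "external", reasons
    # Nothing rules out internal snapshots, so keep the historical default.
    return "internal", ["no limitation applies; using libvirt's default"]


if __name__ == "__main__":
    print(choose_snapshot_type([Disk("vda", "qcow2")], uses_uefi=False))
    print(choose_snapshot_type([Disk("vda", "raw")], uses_uefi=True))
```

A real UI would likely still let the user override the result, since, as noted above, the choice ultimately depends on what the user wants (a self-contained image versus control over where the data lives).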