Authoritative info on backup-begin versus snapshots/other state capture

Hello all
Apologies for the basic nature of the question, but having recently started working with libvirt - and virtualisation in general - I find there is a lot of out-of-date and sometimes contradictory material out there across blogs, articles, stackoverflow, the usual sources... I thought I might be able to get definitive answers here. For the record, I assume libvirt.org is authoritative but while there is a lot of material there, its structure is not always clear to me. Also the lack of dates on any pages leaves some room for doubt.
I am wondering if there is a recent, reliable summary of the various approaches and current best practices for backing up VMs that covers snapshots both internal and external, approaches that use backup-begin and third-party approaches which simply stop the VM and copy off files.
If there is not such a summary, can anyone confirm my reading of https://libvirt.org/kbase/domainstatecapture.html that a simple backup-begin <domain-name> will:
- pause the VM and quiesce the disk (in which case is qemu agent a requirement on the guest?)
- generate a date-suffixed disk-only copy of a VM's disks alongside the originals wherever that storage is
- not generate any backing image chains or metadata that needs to be retained
Furthermore, is it then possible to restore to that point by stopping a VM, and associating that backup file with the VM either by virsh-editing its xml or overwriting the original file with the backup file?
This seems to be my experience in testing this, but there are very few references to this tool compared to the many lengthy discussions about snapshots and other approaches which is a bit puzzling. It would be great to have this understanding confirmed or refined!
Many thanks for any pointers

On Thu, Jan 16, 2025 at 16:48:59 -0000, camccuk--- via Users wrote:
Hello all
Apologies for the basic nature of the question, but having recently started working with libvirt - and virtualisation in general - I find there is a lot of out-of-date and sometimes contradictory material out there across blogs, articles, stackoverflow, the usual sources... I thought I might be able to get definitive answers here. For the record, I assume libvirt.org is authoritative but while there is a lot of material there, its structure is not always clear to me. Also the lack of dates on any pages leaves some room for doubt.
I am wondering if there is a recent, reliable summary of the various approaches and current best practices for backing up VMs that covers snapshots both internal and external, approaches that use backup-begin and third-party approaches which simply stop the VM and copy off files.
If there is not such a summary, can anyone confirm my reading of https://libvirt.org/kbase/domainstatecapture.html that a simple backup-begin <domain-name> will:
That document is mostly accurate but it was created before the actual implementation of backups was finished so some things were not actually implemented in the end.
- pause the VM and quiesce the disk (in which case is qemu agent a requirement on the guest?)
The disk quiescing is not part of the backup operation and needs to be done manually via 'virsh domfsfreeze' if required. The original intention was to mirror the snapshot code, which does disk quiescing, but it's a bit problematic to fold all operations into one, so we didn't do it here. Also, the backup operation doesn't actually (need to) pause the VM. It can create a point-in-time backup/copy of the disks without pausing the VM at all. You can even thaw/un-quiesce the disks right after the backup operation starts, while it is still running.
- generate a date-suffixed disk-only copy of a VMs disks alongside the originals wherever that storage is
Yes, if you don't override the path in the XML, the backup images will be stored in the same path as the originals, with a suffix of the UNIX timestamp of the time when the backup was started.
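As an illustration of overriding the path, here is a minimal Python sketch that builds a push-mode backup XML along the lines of https://libvirt.org/formatbackup.html. The function name, the disk name 'vda' and the output path are made up for the example, and the qcow2 driver type is an assumption about the desired output format.

```python
import xml.etree.ElementTree as ET


def push_backup_xml(targets):
    """Build <domainbackup> XML for a full push-mode backup.

    targets maps a disk target name (e.g. "vda") to the file the
    backup of that disk should be written to.
    """
    root = ET.Element("domainbackup", mode="push")
    disks = ET.SubElement(root, "disks")
    for name, path in targets.items():
        disk = ET.SubElement(disks, "disk", name=name, type="file")
        # Explicit output location instead of the default timestamp suffix.
        ET.SubElement(disk, "target", file=path)
        # Assumption: the backup image should be written as qcow2.
        ET.SubElement(disk, "driver", type="qcow2")
    return ET.tostring(root, encoding="unicode")
```

The resulting text would be saved to a file and passed as the backup XML argument, for example 'virsh backup-begin <domain-name> backup.xml'.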
- not generate any backing image chains or metadata that needs to be retained
By default, a full backup creates a stand-alone image. If you use incremental backups, the resulting images do depend on each other.
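For the incremental case, a rough Python sketch of the two XML documents involved might look like this. It assumes an earlier backup already created the checkpoint being referenced; the checkpoint names, file names and helper functions are hypothetical, and disk selection is left at the defaults.

```python
import xml.etree.ElementTree as ET


def incremental_backup_xml(since_checkpoint, new_checkpoint):
    """Return (backup_xml, checkpoint_xml) for an incremental push backup.

    since_checkpoint: name of an existing checkpoint to diff against.
    new_checkpoint:   name of the checkpoint to create with this backup,
                      so that the next incremental can start from it.
    """
    backup = ET.Element("domainbackup", mode="push")
    ET.SubElement(backup, "incremental").text = since_checkpoint

    checkpoint = ET.Element("domaincheckpoint")
    ET.SubElement(checkpoint, "name").text = new_checkpoint

    return (ET.tostring(backup, encoding="unicode"),
            ET.tostring(checkpoint, encoding="unicode"))


def backup_begin_argv(domain, backup_file, checkpoint_file):
    """Build the virsh command line that takes both XML files."""
    return ["virsh", "backup-begin", domain,
            "--backupxml", backup_file,
            "--checkpointxml", checkpoint_file]
```

Since each incremental image only holds the differences, the earlier images in the chain have to be kept for a restore to be possible.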
Furthermore, is it then possible to restore to that point by stopping a VM, and associating that backup file with the VM either by virsh-editing its xml or overwriting the original file with the backup file.
Yes, it is. Note, though, that since the VM was likely running when you took the backup, the 'restore' operation will look like a cold boot after a power failure at the exact time the backup was taken.
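The "overwrite the original file" variant of that restore could be sketched in Python as below. This is only an illustration: the function names are hypothetical, it handles a single disk image, and it relies on 'virsh domstate' printing "shut off" for a stopped domain.

```python
import shutil
import subprocess


def domain_state(domain, run=subprocess.run):
    """Return the state string printed by 'virsh domstate'."""
    result = run(["virsh", "domstate", domain],
                 capture_output=True, text=True, check=True)
    return result.stdout.strip()


def restore_disk(domain, backup_path, original_path,
                 run=subprocess.run, copy=shutil.copyfile):
    """Overwrite the domain's disk image with a backup copy.

    Refuses to touch the image unless the domain is shut off, because
    replacing a disk under a running guest would corrupt it.
    """
    state = domain_state(domain, run)
    if state != "shut off":
        raise RuntimeError(f"{domain} is '{state}', expected 'shut off'")
    copy(backup_path, original_path)
```

After the copy, starting the domain boots from the restored image, which the guest sees as a boot after a power failure.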
This seems to be my experience in testing this, but there are very few references to this tool compared to the many lengthy discussions about snapshots and other approaches which is a bit puzzling. It would be great to have this understanding confirmed or refined!
Snapshots also allow you to capture memory state, and they pre-date backups, so they are documented in a bit more depth.

This is really helpful, thanks.
The disk quiescing is not part of the backup operation and needs to be done manually via 'virsh domfsfreeze' if required. The original
I assume quiescing *would* be necessary for workloads like databases and if we can live with a crash-consistent backup then we can bypass this, but if I was to include this, the sequence would be:
virsh domfsfreeze <domain-name>
virsh backup-begin <domain-name>
virsh domfsthaw <domain-name>
Again, I assume the qemu-agent would need to be running on the guest to allow freeze/thaw.
I was about to ask how backup-begin is different from creating a disk-only, no-metadata snapshot but I think it is equivalent - the advantage is that we don't need to deal with merging the overlay file and pivoting afterwards, is that right?
I also realised this is very like the sequence described at the bottom of that domainstatecapture page comparing 'direct backup' and 'Backup via temporary snapshot' - what confused me there and which I still don't understand are the two references to events. For direct backup, this step is:
- wait for push mode event, or pull data over NBD # most time spent here
Can you expand this any? I am assuming direct backup is a 'push' mode backup as per the description at https://libvirt.org/kbase/live_full_disk_backup.html - what is this push mode event?
By default a full backup creates a stand-alone image. If you'd use incremental backups, then it is actually creating images that depend on each other.
OK, and that would be by populating an appropriate xml as per https://libvirt.org/formatbackup.html - which I think you answered on this list a year or two ago.
Yes it is. Note though that since the VM was likely running at the point when you took the backup the 'restore' operation will look like a cold-boot after a power failure at the exact time when the backup was taken.
Snapshots also allow you to capture memory state and also pre-date backups thus they are documented a bit more in depth.
OK - just to make this explicit - if we want to capture memory state as well as disk then we *must* use snapshots, either internal or external?
And - last question! - while we are covering the bases... managedsave sounds like it is designed for preserving a one-off recovery position for a potentially relatively long outage such as a hypervisor restart. VM restart will pick up just this latest saved image, but it *will* capture memory also?
Once again thanks for your clarifications - it's clearing up a lot of confusion for me.

On Fri, Jan 17, 2025 at 00:29:55 -0000, camccuk--- via Users wrote:
This is really helpful, thanks.
The disk quiescing is not part of the backup operation and needs to be done manually via 'virsh domfsfreeze' if required. The original
I assume quiescing *would* be necessary for workloads like databases
So normally the quiescing restricts writes to the device and flushes filesystem caches inside the guest OS. In addition, the guest agent should allow you to register scripts which are executed before the FS is quiesced, allowing e.g. database memory state to be flushed to disk so that the application data is also consistent.
and if we can live with a crash-consistent backup then we can bypass
The application consistency mentioned above is especially important when using the backup API or disk-only snapshots, because using the saved state is equivalent to pulling the power plug on a real machine.
this, but if I was to include this, the sequence would be:
virsh domfsfreeze <domain-name>
virsh backup-begin <domain-name>
virsh domfsthaw <domain-name>
Again, I assume the qemu-agent would need to be running on the guest to allow freeze/thaw.
Yes, the guest agent is needed, as this operation actually happens inside the guest OS.
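The freeze / backup-begin / thaw sequence can be wrapped so that the guest is thawed even when starting the backup fails. A minimal Python sketch, assuming the guest agent is running in the guest and 'virsh' is on the PATH (the function name is made up for the example):

```python
import subprocess


def quiesced_backup(domain, run=subprocess.run):
    """Freeze guest filesystems, start a backup, then thaw again.

    backup-begin returns as soon as the job has started, so the guest
    only stays frozen for a short moment, not for the whole backup.
    """
    run(["virsh", "domfsfreeze", domain], check=True)
    try:
        run(["virsh", "backup-begin", domain], check=True)
    finally:
        # Always thaw, otherwise a failed backup-begin would leave the
        # guest with frozen filesystems.
        run(["virsh", "domfsthaw", domain], check=True)
```

The try/finally is the point of the sketch: a guest left frozen would block all writes until someone thaws it by hand.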
I was about to ask how backup-begin is different from creating a disk-only, no-metadata snapshot but I think it is equivalent - the advantage is that we don't need to deal with merging the overlay file and pivoting afterwards, is that right?
So the basic 'push' mode of doing a full backup is indeed semantically equivalent to creating a disk-only, no-metadata snapshot, then copying out the data to a standalone image and then merging the overlay back. The backup API, though, also allows tracking differences since the last backup and creating an incremental backup, which would be a thinner image containing only the differences. Additionally, the backup API also allows a PULL mode, in which an NBD connection is used by an application that does the backup of the actual blocks.
I also realised this is very like the sequence described at the bottom of that domainstatecapture page comparing 'direct backup' and 'Backup via temporary snapshot' - what confused me there and which I still don't understand are the two references to events. For direct backup, this step is:
- wait for push mode event, or pull data over NBD # most time spent here
Can you expand this any? I am assuming direct backup is a 'push' mode backup as per the description at https://libvirt.org/kbase/live_full_disk_backup.html - what is this push mode event?
So the backup operation is potentially long-running if you're backing up a huge disk. The 'virsh backup-begin' command kicks off the operation and returns right away, while the backup progresses in the background. In push mode, where qemu writes the backup image, the job keeps running while data is being written; when it finishes, an event is fired to clients listening for it, notifying them that the job is complete and the output images are finished. Note that the state of the backup will still correspond to the point in time when the operation was *started*, even when the guest OS subsequently overwrites any blocks. For a pull mode backup, the client doing the backup knows when it's done, so the job is not auto-finished (which would fire the event) but rather needs to be terminated manually.
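For a script that cannot easily listen for the event, polling 'virsh domjobinfo' is a simpler substitute. The Python sketch below does that rather than using the event mechanism described above. It assumes the output contains a "Job type:" line that reads "None" once no job is active, and the function names are hypothetical.

```python
import subprocess
import time


def parse_job_type(domjobinfo_output):
    """Extract the value of the 'Job type:' line, or None if absent."""
    for line in domjobinfo_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Job type":
            return value.strip()
    return None


def wait_for_backup(domain, interval=2.0,
                    run=subprocess.run, sleep=time.sleep):
    """Poll until no job is active, then report the completed job type."""
    while True:
        out = run(["virsh", "domjobinfo", domain],
                  capture_output=True, text=True, check=True).stdout
        if parse_job_type(out) in (None, "None"):
            break
        sleep(interval)
    # --completed shows statistics of the most recently finished job.
    done = run(["virsh", "domjobinfo", domain, "--completed"],
               capture_output=True, text=True, check=True).stdout
    return parse_job_type(done)
```

This only makes sense for push mode. A pull-mode job never finishes on its own and is ended by the client, for example with 'virsh domjobabort <domain-name>'.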
By default a full backup creates a stand-alone image. If you'd use incremental backups, then it is actually creating images that depend on each other.
OK, and that would be by populating an appropriate xml as per https://libvirt.org/formatbackup.html - which I think you answered on this list a year or two ago.
Yes it is. Note though that since the VM was likely running at the point when you took the backup the 'restore' operation will look like a cold-boot after a power failure at the exact time when the backup was taken.
Snapshots also allow you to capture memory state and also pre-date backups thus they are documented a bit more in depth.
OK - just to make this explicit - if we want to capture memory state as well as disk then we *must* use snapshots, either internal or external?
Yes, exactly. Currently only snapshots allow memory state capture synchronized with disk state capture.
And - last question! - while we are covering the bases... managedsave sounds like it is designed for preserving a one-off recovery position for a potentially relatively long outage such as a hypervisor restart. VM restart will pick up just this latest saved image, but it *will* capture memory also?
A (managed)-save saves only the memory state to an image; disk images are kept as they are. No preservation points for the disks are created. Resuming from the (managed)-save will continue using/modifying the disk image, without any possibility of going back. It is indeed meant to e.g. preserve the state of a VM while the host OS reboots. 'managed' is in brackets as there is also a non-managed save.
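To make the managed versus non-managed distinction concrete, here is a small Python sketch of the virsh command lines for both variants. The function names and the state file path are illustrative only. With the managed variant libvirt keeps track of the state file itself, while with the plain variant the caller names the file and restores from it explicitly.

```python
def managed_save_commands(domain):
    """Save memory state to a libvirt-managed file; a later start resumes it."""
    return [
        ["virsh", "managedsave", domain],
        ["virsh", "start", domain],
    ]


def plain_save_commands(domain, state_file):
    """Save memory state to a caller-chosen file and restore from that file."""
    return [
        ["virsh", "save", domain, state_file],
        ["virsh", "restore", state_file],
    ]
```

In both cases only memory state is written, so the disk images carry on from wherever they were when the guest resumes.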
Once again thanks for your clarifications - it's clearing up a lot of confusion for me.
participants (2)
- camccuk@yahoo.com
- Peter Krempa