On Thu, Mar 11, 2021 at 3:24 PM Peter Krempa <pkrempa(a)redhat.com> wrote:
On Thu, Mar 11, 2021 at 10:51:13 +0200, Liran Rotenberg wrote:
> We recently had this bug [1]. The thought that came from it is about the
> handling of the error code after running virDomainSnapshotCreateXML; we
> encountered VIR_ERR_OPERATION_ABORTED (78).
VIR_ERR_OPERATION_ABORTED is an error code which is emitted by the
migration code only. That means that the error comes from the failure to
take a memory image/snapshot of the VM.
A quick skim through the bug report mentions a timeout, so your code
probably aborted the snapshot because it was taking too long.
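For reference, here is a minimal sketch of how such an abort can be
recognized on the caller's side with the libvirt Python bindings; the
snapshot XML and the take_snapshot() helper are placeholders for
illustration, not the code oVirt/VDSM actually uses:

    import libvirt

    # Placeholder snapshot XML; a real request lists the memory and disk targets.
    SNAPSHOT_XML = "<domainsnapshot><name>snap1</name></domainsnapshot>"

    def take_snapshot(dom):
        try:
            # Blocks until the snapshot (including the memory image) completes;
            # aborting the job makes it fail with VIR_ERR_OPERATION_ABORTED.
            return dom.snapshotCreateXML(SNAPSHOT_XML, 0)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_ABORTED:
                # The abort happened during the memory phase, i.e. before the
                # disk overlays were installed into the backing chain.
                print("snapshot aborted before the overlays were installed")
            raise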
> Apparently, the new volume is in use. Are there cases where this will
> happen and the new volume won't appear in the volumes chain? Can we detect
> / know when?
In the vast majority of cases, if virDomainSnapshotCreateXML returns
failure, the new disk volumes are NOT in use at that point.
Libvirt tries very hard to ensure that everything is atomic. The memory
snapshot is taken before installing volumes into the backing chain, so
if that one fails we don't even attempt to do anything with the disks.
There are three extremely unlikely cases in which the snapshot API returns
failure even though the new images were already installed into the backing
chain:
1) resuming of the VM failed after the snapshot
2) thawing (domfsthaw) of the filesystems failed
   (easily avoided by not using the _QUIESCE flag and instead freezing
    manually; see the sketch below)
3) saving of the internal VM state XML failed
Any error other than those above can happen only if the images weren't
installed or the VM died while installing the images.
In addition, if resuming the CPUs after the snapshot fails, the CPUs never
ran, so the guest couldn't have written anything to the images.
Since the snapshot is supposed to flush qemu's caches, if you destroy the
VM without resuming the vCPUs, it's safe to discard the overlays, as the
guest hasn't written anything into them yet.
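As an illustration of the manual-freeze alternative mentioned in point 2
above, a rough sketch with the libvirt Python bindings; it assumes the
qemu guest agent is running in the guest, and the error handling is
deliberately simplified:

    import libvirt

    def snapshot_with_manual_freeze(dom, snapshot_xml, flags=0):
        # Freeze the guest filesystems through the guest agent ourselves
        # instead of passing VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, so that a
        # failed thaw is handled separately from the snapshot result.
        dom.fsFreeze()
        try:
            return dom.snapshotCreateXML(snapshot_xml, flags)
        finally:
            try:
                dom.fsThaw()
            except libvirt.libvirtError:
                # A thaw failure no longer makes snapshotCreateXML itself
                # report failure; log it and handle it on its own terms.
                pass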
> Thinking aloud, if we can detect such cases we can prevent rolling back by
> reporting it back from VDSM to oVirt. Or, if it can't be detected, to err on
> the safe side in order to avoid data corruption and prevent the rollback as
> well.
In general, except for the case when saving of the guest XML has failed,
the new disk images will not be used by the VM so it's safe to delete
them.
> Currently, in ovirt, if the job is aborted, we will look into the chain to
> decide whether to rollback or not.
This is okay; we update the XML only if qemu successfully installed the
overlays.
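For completeness, a rough sketch of such a check against the live domain
XML with the libvirt Python bindings; overlay_in_use() and the path
handling are illustrative only, not VDSM's actual implementation:

    import xml.etree.ElementTree as ET

    def overlay_in_use(dom, overlay_path):
        # Inspect the live definition; libvirt points the disk <source> at
        # the new overlay only after qemu successfully installed it.
        xml_desc = dom.XMLDesc(0)
        root = ET.fromstring(xml_desc)
        for source in root.findall("./devices/disk/source"):
            # File-backed disks use the 'file' attribute; block-backed disks
            # would use 'dev' instead.
            if source.get("file") == overlay_path:
                return True
        return False

If the overlay does not show up as a disk source in the live XML, then per
the explanation above it was never installed and deleting it as part of the
rollback is safe.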
Thanks Peter!