Disk extend during migration

Hi,

as a follow-up of BZ #1883399 [1], we are reviewing the vdsm VM migration flows and solving a few follow-up bugs, e.g. BZ #1981079 [2]. I have a couple of questions related to libvirt:

* If we run a disk extend during migration, the migration can finish sooner than the disk extend. In that case we will try to set the disk threshold on an already stopped VM (we handle the libvirt event that the VM was stopped, but due to the Python GIL there can be a delay between obtaining the signal from libvirt and handling it). We then get libvirt VIR_ERR_OPERATION_INVALID when setting the disk threshold. Is it safe to catch this exception and ignore it, or is it thrown for various reasons and the root cause can be something else than a stopped VM?

* After a disk extend, we resume the VM if it is stopped (usually because it ran out of disk space). Is it safe to do so also when we do the disk extend during migration and the VM may be stopped because it was already migrated? I.e. can we assume that libvirt will handle such a situation and won't resume the VM in that case? We do some checks before resuming (see the sketch below) and try to avoid resuming a migrated VM, but there can be corner cases, and it would be useful to know whether we can rely on libvirt to prevent resuming the VM in unwanted cases like this one, where the VM is stopped after migration.

Thanks
Vojta

[1] https://bugzilla.redhat.com/1883399
[2] https://bugzilla.redhat.com/1981079
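To make the second question concrete, here is a minimal sketch of the kind of pre-resume check meant above, written against the libvirt Python bindings (the helper name is illustrative, not vdsm's actual code):

    import libvirt

    def maybe_resume_after_extend(dom):
        # Resume only if the VM is still paused because of an I/O
        # error (typically ENOSPC after running out of disk space).
        try:
            state, reason = dom.state()
        except libvirt.libvirtError:
            return  # domain already gone, e.g. migrated away
        if (state == libvirt.VIR_DOMAIN_PAUSED
                and reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR):
            dom.resume()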

On Mon, Aug 02, 2021 at 14:20:44 +0200, Vojtech Juranek wrote:
> Hi, as a follow-up of BZ #1883399 [1], we are reviewing the vdsm VM migration flows and solving a few follow-up bugs, e.g. BZ #1981079 [2]. I have a couple of questions related to libvirt:
>
> * If we run a disk extend during migration, the migration can finish sooner than the disk extend. In that case we will try to set the disk threshold on an already stopped VM (we handle the libvirt event that the VM was stopped, but due to the Python GIL there can be a delay between obtaining the signal from libvirt and handling it). We then get libvirt VIR_ERR_OPERATION_INVALID when setting the disk threshold. Is it safe to catch this exception and ignore it, or is it thrown for various reasons and the root cause can be something else than a stopped VM?
The API to set the block threshold level can return the following errors, including the cases when each can happen:

VIR_ERR_OPERATION_UNSUPPORTED  <- unlikely, new qemu supports it
VIR_ERR_INVALID_ARG            <- disk was not found in the VM definition
VIR_ERR_INTERNAL_ERROR         <- on error from qemu

Thus VIR_ERR_OPERATION_INVALID seems to be safe to ignore in your specific case, while not ignoring the others can be used to catch problems.
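Expressed as a sketch against the libvirt Python bindings (the helper name is illustrative), that triage could look like:

    import libvirt

    def set_block_threshold_safe(dom, dev, threshold):
        try:
            dom.setBlockThreshold(dev, threshold)
        except libvirt.libvirtError as e:
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
                # the VM is not running any more (e.g. the migration
                # finished); nothing to monitor on this host
                return
            # VIR_ERR_OPERATION_UNSUPPORTED, VIR_ERR_INVALID_ARG and
            # VIR_ERR_INTERNAL_ERROR point to real problems, so let
            # them propagate
            raise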

On Monday, 2 August 2021 14:30:05 CEST Peter Krempa wrote:
> On Mon, Aug 02, 2021 at 14:20:44 +0200, Vojtech Juranek wrote:
> > Hi, as a follow-up of BZ #1883399 [1], we are reviewing the vdsm VM migration flows and solving a few follow-up bugs, e.g. BZ #1981079 [2]. I have a couple of questions related to libvirt:
> >
> > * If we run a disk extend during migration, the migration can finish sooner than the disk extend. In that case we will try to set the disk threshold on an already stopped VM (we handle the libvirt event that the VM was stopped, but due to the Python GIL there can be a delay between obtaining the signal from libvirt and handling it). We then get libvirt VIR_ERR_OPERATION_INVALID when setting the disk threshold.
Actually I was wrong here; the issue is in fact caused by a delay in the libvirt setBlockThreshold() call. From the vdsm log:

2021-08-02 09:06:01,918-0400 WARN (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting theshold using dom <vdsm.virt.virdomain.Notifying object at 0x7fd06610df28> (drivemonitor:122)
[...]
2021-08-02 09:06:03,967-0400 WARN (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') libvirt event Stopped detail 3 opaque None (vm:5657)
[...]
2021-08-02 09:06:03,969-0400 WARN (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Domain not connected, skipping set block threshold for drive 'sdc' (drivemonitor:133)

So it took about 2 seconds for the libvirt setBlockThreshold() call to return, and in the meantime the migration finished and we got a VIR_ERR_OPERATION_INVALID error from the setBlockThreshold() call. What is the reason for this delay? Is this operation intentionally delayed until the migration finishes?

I posted the relevant libvirt debug log at https://pastebin.com/YkdKYKM5
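For context, "Stopped detail 3" in the event line above is VIR_DOMAIN_EVENT_STOPPED_MIGRATED, i.e. the migration finished. A minimal sketch of how such lifecycle events are received through the Python bindings (the callback body is illustrative, not vdsm's actual code):

    import libvirt

    def lifecycle_cb(conn, dom, event, detail, opaque):
        if (event == libvirt.VIR_DOMAIN_EVENT_STOPPED
                and detail == libvirt.VIR_DOMAIN_EVENT_STOPPED_MIGRATED):
            # the VM was migrated away; mark it as gone so the drive
            # monitor skips further setBlockThreshold() calls
            pass

    # an event loop must be running for callbacks to fire
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open('qemu:///system')
    conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                lifecycle_cb, None)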
> > Is it safe to catch this exception and ignore it, or is it thrown for various reasons and the root cause can be something else than a stopped VM?
>
> The API to set the block threshold level can return the following errors, including the cases when each can happen:
>
> VIR_ERR_OPERATION_UNSUPPORTED  <- unlikely, new qemu supports it
> VIR_ERR_INVALID_ARG            <- disk was not found in the VM definition
> VIR_ERR_INTERNAL_ERROR         <- on error from qemu
>
> Thus VIR_ERR_OPERATION_INVALID seems to be safe to ignore in your specific case, while not ignoring the others can be used to catch problems.
thanks for your answer

On Mon, Aug 02, 2021 at 15:34:52 +0200, Vojtech Juranek wrote:
> On Monday, 2 August 2021 14:30:05 CEST Peter Krempa wrote:
> > On Mon, Aug 02, 2021 at 14:20:44 +0200, Vojtech Juranek wrote:
> > > Hi, as a follow-up of BZ #1883399 [1], we are reviewing the vdsm VM migration flows and solving a few follow-up bugs, e.g. BZ #1981079 [2]. I have a couple of questions related to libvirt:
> > >
> > > * If we run a disk extend during migration, the migration can finish sooner than the disk extend. In that case we will try to set the disk threshold on an already stopped VM (we handle the libvirt event that the VM was stopped, but due to the Python GIL there can be a delay between obtaining the signal from libvirt and handling it). We then get libvirt VIR_ERR_OPERATION_INVALID when setting the disk threshold.
>
> Actually I was wrong here; the issue is in fact caused by a delay in the libvirt setBlockThreshold() call. From the vdsm log:
>
> 2021-08-02 09:06:01,918-0400 WARN (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') setting theshold using dom <vdsm.virt.virdomain.Notifying object at 0x7fd06610df28> (drivemonitor:122)
> [...]
> 2021-08-02 09:06:03,967-0400 WARN (libvirt/events) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') libvirt event Stopped detail 3 opaque None (vm:5657)
> [...]
> 2021-08-02 09:06:03,969-0400 WARN (mailbox-hsm/3) [virt.vm] (vmId='2dad9038-3e3a-4b5e-8d20-b0da37d9ef79') Domain not connected, skipping set block threshold for drive 'sdc' (drivemonitor:133)
>
> So it took about 2 seconds for the libvirt setBlockThreshold() call to return, and in the meantime the migration finished and we got a VIR_ERR_OPERATION_INVALID error from the setBlockThreshold() call.
>
> What is the reason for this delay? Is this operation intentionally delayed until the migration finishes?
Actually, qemuDomainSetBlockThreshold, which is the backend for virDomainSetBlockThreshold, requires a QEMU_JOB_MODIFY job on the domain, so the threshold actually can't even be set _during_ a migration. What happens in fact is that the API call waits until it can acquire the MODIFY job, which is possible only after the migration is finished, thus it always serializes after the migration.
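From the caller's side that wait is invisible except for the elapsed time; a small sketch of what the observed behaviour looks like with the Python bindings (illustrative, not vdsm's actual code):

    import time
    import libvirt

    def set_threshold_timed(dom, dev, threshold):
        # while a migration holds the MODIFY job this call blocks;
        # once the migration job ends the domain is gone on the
        # source host and the call fails with
        # VIR_ERR_OPERATION_INVALID, matching the ~2 s gap in the
        # vdsm log above
        start = time.monotonic()
        try:
            dom.setBlockThreshold(dev, threshold)
        except libvirt.libvirtError as e:
            waited = time.monotonic() - start
            if e.get_error_code() != libvirt.VIR_ERR_OPERATION_INVALID:
                raise
            print("domain gone, threshold not set (waited %.1fs)" % waited)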

> > So it took about 2 seconds for the libvirt setBlockThreshold() call to return, and in the meantime the migration finished and we got a VIR_ERR_OPERATION_INVALID error from the setBlockThreshold() call.
> >
> > What is the reason for this delay? Is this operation intentionally delayed until the migration finishes?
>
> Actually, qemuDomainSetBlockThreshold, which is the backend for virDomainSetBlockThreshold, requires a QEMU_JOB_MODIFY job on the domain, so the threshold actually can't even be set _during_ a migration.
>
> What happens in fact is that the API call waits until it can acquire the MODIFY job, which is possible only after the migration is finished, thus it always serializes after the migration.
Makes sense, thanks for the clarification!