Re: [Questions] non-shared disk migration: jobs abort and bandwidth

9 Jun 2022

      On Wed, Jun 8, 2022 at 6:49 PM Peter Krempa <pkrempa@redhat.com> wrote:
...
...
On Wed, Jun 08, 2022 at 17:32:57 +0800, Han Han wrote:
Hi developers,
Recently I have been looking into migration with non-shared disks (flags
VIR_MIGRATE_NON_SHARED_DISK and VIR_MIGRATE_NON_SHARED_INC).
As we know, non-shared disk migration uses block jobs to copy the
disk image from the src host to the dst host. So here are my questions
for non-shared disk migration:
q1. For the API virDomainMigrate3 with the bandwidth param, can it set
the bandwidth of block jobs?
q2. For the API virDomainMigrateSetMaxSpeed, can it set the bandwidth
of block jobs?
q3. For the domain job abort API virDomainAbortJob, can it stop the
block job of non-shared disk migration?
q4. For the block job bandwidth API virDomainBlockJobSetSpeed, can it
set the bandwidth of the block job of non-shared disk migration?
q5. For the block job abort API virDomainBlockJobAbort, can it stop the
block job of non-shared disk migration?
Here are my test results with libvirt-8.4.0-1.el9.x86_64 and
qemu-kvm-7.0.0-4.el9.x86_64:
q1: The bandwidth limit of virDomainMigrate3 applies to the block job:
➜  ~ virsh migrate OVMF qemu+ssh://root@hhan-rhel9--1/system --live --p2p \
      --tls --tls-destination hhan-rhel9--1 --copy-storage-all \
      --disks-uri tcp://hhan-rhel9--1:49156 --bandwidth 2
➜  ~ virsh blockjob OVMF vda
Block Copy: [  0 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)
This is expected and desired.
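
For anyone scripting this check, here is a minimal sketch that pulls the limit out of the `virsh blockjob` line above and compares it with the value passed to --bandwidth. It assumes --bandwidth is given in MiB/s (which matches the output shown) and that the line format stays as printed here; the function names are made up for illustration.

```python
import re

MIB = 1024 * 1024

# Matches the "Bandwidth limit: N bytes/s" part of a `virsh blockjob` line.
_LIMIT_RE = re.compile(r"Bandwidth limit:\s*(\d+)\s*bytes/s")


def blockjob_limit_bytes(line):
    """Return the bandwidth limit in bytes/s from the line, or None if absent."""
    m = _LIMIT_RE.search(line)
    return int(m.group(1)) if m else None


def matches_migrate_bandwidth(line, bandwidth_mib):
    """True if the block job limit equals the given MiB/s migration bandwidth."""
    return blockjob_limit_bytes(line) == bandwidth_mib * MIB


line = "Block Copy: [  0 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)"
print(matches_migrate_bandwidth(line, 2))  # True
```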
...
q2: virDomainMigrateSetMaxSpeed doesn't change the bandwidth of
block jobs:
➜  ~ virsh migrate-setspeed OVMF 8
➜  ~ virsh blockjob OVMF vda
Block Copy: [  9 %]    Bandwidth limit: 2097152 bytes/s (2.000 MiB/s)
This is a bug, though. Since we want the global migration speed setting
to be used for disks too, setting the migration speed should also apply
to the disk migration streams.
Filed a bug here: https://bugzilla.redhat.com/show_bug.cgi?id=2095093
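
The mismatch in q2 could also be detected from the API side. The following is only a sketch: it assumes `dom` is a libvirt-python virDomain object, that blockJobInfo() reports 'bandwidth' in MiB/s when called with flags=0, and that it returns an empty dict when the disk has no block job. The helper name is hypothetical.

```python
def blockjob_follows_migration_speed(dom, disk):
    """Compare the migration max speed with the block job limit (both MiB/s).

    Returns True/False for match/mismatch, or None when no block job is
    reported for the disk.
    """
    migration_mib = dom.migrateGetMaxSpeed(0)
    info = dom.blockJobInfo(disk, 0)  # assumed: empty dict when there is no job
    if not info:
        return None
    return info["bandwidth"] == migration_mib
```

In the state shown above (migrate-setspeed 8, block job still at 2 MiB/s) this would be expected to return False, provided virDomainMigrateGetMaxSpeed reports the newly set value.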
...
...
q3: virDomainAbortJob can stop a block job of non-shared disk
migration:
➜  ~ virsh migrate OVMF qemu+ssh://root@hhan-rhel9--1/system --live --p2p \
      --tls --tls-destination hhan-rhel9--1 --copy-storage-all \
      --disks-uri tcp://hhan-rhel9--1:49156 --bandwidth 2
Then start a virsh event on another terminal:
➜  ~ virsh event --loop --all
Abort the domain job:
➜  ~ virsh domjobabort OVMF
The error "error: operation aborted: migration out: canceled by client"
appears at the terminal of "virsh migrate"
The terminal of "virsh event" shows that the block job has failed:
event 'block-job' for domain 'OVMF': Block Copy for
/var/lib/libvirt/images/OVMF.qcow2 failed
event 'block-job-2' for domain 'OVMF': Block Copy for vda failed
This is again expected: the blockjobs are started by the migration, so
when you cancel the migration we also need to cancel the blockjobs.
...
q4: The block job bandwidth of non-shared disk migration cannot be set by
virDomainBlockJobSetSpeed:
➜  ~ virsh blockjob OVMF vda --bandwidth 10
error: Timed out during operation: cannot acquire state change lock (held
by monitor=remoteDispatchDomainMigratePerform3Params)
This is okay, but we could take it a sa feature request to allow tuning
of the individual blockjobs.
Assuming that tuning the individual blockjobs is supported, it would be
hard to tell whether the bandwidth returned by
virDomainMigrateGetMaxSpeed is the speed of the VM migration or the speed
of the blockjob.
In contrast, the bandwidth set by virDomainMigrateSetMaxSpeed is meant to
apply to both.

I am not sure whether there is such a use case: the VM migration data is
transported via sub-netA while the block data is transported via
sub-netB. In that case it may be necessary to set a different bandwidth
for each sub-net.
If all the data is transported via the same network interface, just keep
it as it is now.

BTW, what is the meaning of "sa feature"?
...
...
q5: The block job of non-shared disk migration cannot be aborted by
virDomainBlockJobAbort:
➜  ~ virsh blockjob OVMF vda --abort
error: Timed out during operation: cannot acquire state change lock (held
by monitor=remoteDispatchDomainMigratePerform3Params)
This is expected. Same as above, we don't want to allow users to
control this. In contrast to 'q4' I'd refuse an RFE to allow cancelling
of individual jobs.
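
Both q4 and q5 fail with the same lock timeout, so a test script may want to recognise it and report who holds the job. Below is a minimal sketch that works purely on the error text as printed by virsh above; the message format is taken from this thread and may differ in other versions, and the function name is hypothetical.

```python
import re

_LOCK_RE = re.compile(
    r"cannot acquire state change lock \(held by monitor=(\w+)\)"
)


def lock_holder(message):
    """Return the name of the job holding the state change lock, or None."""
    # Collapse wrapped lines and repeated spaces before matching.
    flat = " ".join(message.split())
    m = _LOCK_RE.search(flat)
    return m.group(1) if m else None


msg = """error: Timed out during operation: cannot acquire state change lock (held
by monitor=remoteDispatchDomainMigratePerform3Params)"""
print(lock_holder(msg))  # remoteDispatchDomainMigratePerform3Params
```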
...
Are the results above expected?
Here are my personal thoughts:
For the bandwidth in q1 and q2, both are documented as migration
bandwidth
(https://gitlab.com/libvirt/libvirt/-/blob/master/include/libvirt/libvirt-dom...,
https://gitlab.com/libvirt/libvirt/-/blob/master/src/libvirt-domain.c#L9696),
but one works for block jobs while the other doesn't. So we should make
the comments clear about whether they mean the bandwidth of the VM
migration only or the bandwidth of migration including blockjobs. What's
more, we could add a flag to virDomainMigrateMaxSpeedFlags to support
setting the bandwidth of the blockjobs in migration.
For q4 and q5, if we are not going to support changing the block job of
non-shared disk migration through the blockjob APIs, we should note that
in the migration doc or the block job doc, to make clear the difference
between this type of block job and the others.