Thanks, Peter, for your feedback. Interestingly, the version of virsh is newer
than 1.2.18 and thus should contain the fix:
$ virsh --version
1.3.1
$ uname -a
Linux agsserver 4.4.0-91-generic #114-Ubuntu SMP Tue Aug 8 11:56:56 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.3 LTS
Release: 16.04
Codename: xenial
But we're still having the issue. Is there anything else that you can think
of? Feel free to ask me for more information. I'm willing to help
wherever I can because this bites us quite regularly. We could probably
improve our daily backup cronjob to retry blockcommit after a blockjob
abort, but it feels so hacky that I would do that only as a last resort.
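For reference, such a retry could look roughly like the sketch below. It is
only an illustration: the function name commit_with_retry, the VIRSH and
RETRY_DELAY variables and the default of three attempts are assumptions made
up for this example; the two virsh invocations are the ones we already run by
hand.

```shell
#!/bin/sh
# Hypothetical helper for the backup cronjob (names and defaults are
# assumptions, not part of our current script).
# VIRSH can be overridden, e.g. to point at a wrapper or a test stub.
VIRSH="${VIRSH:-virsh}"

commit_with_retry() {
    domain="$1"
    disk="$2"
    max_tries="${3:-3}"   # assumed limit on the number of attempts
    try=1
    while [ "$try" -le "$max_tries" ]; do
        # Same command the cronjob runs today.
        if $VIRSH blockcommit "$domain" "$disk" --active --pivot; then
            return 0
        fi
        echo "blockcommit attempt $try failed for disk $disk" >&2
        # Abort the job that is still active so that the next attempt
        # can start a fresh one (this is what we do by hand in the morning).
        $VIRSH blockjob "$domain" "$disk" --abort || true
        try=$((try + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 1
}
```

The cronjob would then call `commit_with_retry domain hda` instead of the
plain blockcommit and still report an error if every attempt fails.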
2017-08-14 17:05 GMT+02:00 Peter Krempa <pkrempa(a)redhat.com>:
On Mon, Aug 14, 2017 at 08:42:24 +0200, Dominik Psenner wrote:
> Hi,
Hi,
>
> a small update on this. We have migrated the virtualized host to use the
> virtio drivers and now the drive performance is improved so that we can see
> a constant transfer rate. Before it used to be the same rate but regularly
> dropped to a few bytes/sec for a few seconds and then was fast again.
>
> However, we still observe that the following fails regularly:
>
> $ virsh snapshot-create-as --domain domain --name backup --no-metadata
> --atomic --disk-only --diskspec hda,snapshot=external
> $ virsh blockcommit domain hda --active --pivot
> error: failed to pivot job for disk hda
> error: block copy still active: disk 'hda' not ready for pivot yet
> Could not merge changes for disk hda of domain. VM may be in invalid state.
Since this thread was renamed, please re-state the version of libvirt
you are using. I don't really want to dig through the old thread.
> Then running the following in the morning succeeds and successfully pivots
> the snapshot into the base image while the vm is live:
>
> $ virsh blockjob domain hda --abort
> $ virsh blockcommit domain hda --active --pivot
> Successfully pivoted
>
> We run the backup process once every day and it failed on the following
> days:
>
> 2017-07-07
> 2017-07-20
> 2017-07-27
> 2017-08-12
> 2017-08-14
>
> Looking at this, it happens roughly once a week, and from then on the guest
> writes into the snapshot backlog. That snapshot backlog file grows by about
> 8 GB every day and thus the issue always needs immediate attention.
>
> Any ideas what could cause this issue? Is this a bug (race condition) of
> `virsh blockcommit` that sometimes fails because it is invoked at the wrong
> time?
So the 'virsh blockcommit domain hda --active --pivot' operation
consists of 3 parts:
1) virsh blockcommit domain hda --active
2) waiting until the block job finishes
3) virsh blockjob --pivot domain hda
The problem is that sometimes 2) finishes too soon and then operation 3)
fails. This should not happen any more, since there's code in virsh [1]
which waits for the completion event from libvirtd, which is fired only
when the job is actually ready to be pivoted.
This code has a lot of fallback options for cases where libvirtd is old or
similar.
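The three steps above can also be run separately, roughly as sketched below.
This is a minimal illustration, not what virsh does internally: the function
name commit_then_pivot, the VIRSH and PIVOT_DELAY variables and the retry
count are made up for the example, and it assumes that `--wait` on the
installed virsh returns once the active commit has reached the synchronized
phase.

```shell
#!/bin/sh
# Hypothetical sketch of the three steps run by hand (names and defaults
# are assumptions for illustration).
VIRSH="${VIRSH:-virsh}"

commit_then_pivot() {
    domain="$1"
    disk="$2"
    max_tries="${3:-5}"   # assumed number of pivot attempts

    # Steps 1) and 2): start the active commit and wait for it.
    $VIRSH blockcommit "$domain" "$disk" --active --wait || return 1

    # Step 3): pivot. If the job is not ready for pivot yet, wait a
    # little and try again instead of giving up right away.
    try=1
    while [ "$try" -le "$max_tries" ]; do
        if $VIRSH blockjob "$domain" "$disk" --pivot; then
            return 0
        fi
        try=$((try + 1))
        sleep "${PIVOT_DELAY:-2}"
    done
    return 1
}
```

Unlike aborting and restarting the job, this keeps the already synchronized
job around and only repeats the pivot request.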
At any rate, manual pivoting later should help. Updating to a more recent
version would probably help as well.
In case you are using a fairly recent version, it's possible that there
are still bugs though.
Peter
[1]:
commit 7408403560f7d054da75acaab855a95c51a92e2b
Author: Peter Krempa <pkrempa(a)redhat.com>
Date: Mon Jul 13 17:04:49 2015 +0200
virsh: Refactor block job waiting in cmdBlockCommit
Reuse the vshBlockJobWait infrastructure to refactor cmdBlockCommit to
use the common code. This additionally fixes a bug when working with
new qemus, where when doing an active commit with --pivot the pivoting
would fail, since qemu reaches 100% completion but the job doesn't
switch to synchronized phase right away.
$ git describe --contains 7408403560f7d054da75acaab855a95c51a92e2b
v1.2.18-rc1~33
--
Dominik Psenner