
2018-01-18 17:49 GMT+02:00 Michal Privoznik <mprivozn@redhat.com>:
On 01/18/2018 08:25 AM, Ján Tomko wrote:
On Wed, Jan 17, 2018 at 04:45:38PM +0200, Serhii Kharchenko wrote:
Hello libvirt-users list,
We keep hitting the same bug since version 3.4.0 (3.3.0 works OK). We have a process that is permanently connected to libvirtd via the socket; it collects stats, listens for events and controls the VPSes.
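Roughly, the client is shaped like the following minimal libvirt-python sketch (illustrative only, not our actual code; the connection URI, polling interval and exact calls here are assumptions):

import threading
import time
import libvirt

def lifecycle_cb(conn, dom, event, detail, opaque):
    # react to lifecycle events (started, shut down, ...)
    print("lifecycle event: %s event=%d detail=%d" % (dom.name(), event, detail))

def poll_stats(conn):
    while True:
        # bulk stats query -- this is what shows up as
        # remoteDispatchConnectGetAllDomainStats in the libvirtd log
        for dom, stats in conn.getAllDomainStats():
            print(dom.name(), stats.get("cpu.time"))
        time.sleep(1)

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
threading.Thread(target=poll_stats, args=(conn,), daemon=True).start()
while True:
    libvirt.virEventRunDefaultImpl()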
When we try to 'shutdown' a number of VPSes, we often hit the bug: one of the VPSes gets stuck in the 'in shutdown' state, no corresponding 'qemu' process is present, and the following errors appear in the log:
Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.005+0000: 20438: warning : qemuGetProcessInfo:1460 : cannot parse process status data
Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virFileReadAll:1420 : Failed to open file '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d36\x2dDOMAIN1.scope/cpuacct.usage': No such file or directory
Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virCgroupGetValueStr:844 : Unable to read from '/sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d36\x2dDOMAIN1.scope/cpuacct.usage': No such file or directory
Jan 17 13:54:20 server1 libvirtd[20437]: 2018-01-17 13:54:20.006+0000: 20441: error : virCgroupGetDomainTotalCpuStats:3319 : unable to get cpu account: Operation not permitted
Jan 17 13:54:23 server1 libvirtd[20437]: 2018-01-17 13:54:23.805+0000: 20522: warning : qemuDomainObjBeginJobInternal:4862 : Cannot start job (destroy, none) for domain DOMAIN1; current job is (query, none) owned by (20440 remoteDispatchConnectGetAllDomainStats, 0 <null>) for (30s, 0s)
Jan 17 13:54:23 server1 libvirtd[20437]: 2018-01-17 13:54:23.805+0000: 20522: error : qemuDomainObjBeginJobInternal:4874 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
I think only the last line matters. The bug is highly reproducible; we can trigger it easily even when we call 'virsh shutdown' for several domains one by one in the shell, as in the sketch below.
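For illustration, the reproduction is roughly this (a sketch only, with the stats-collecting client above left running; the connection URI is an assumption):

import libvirt

conn = libvirt.open("qemu:///system")
for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
    print("shutting down", dom.name())
    dom.shutdown()   # same request as 'virsh shutdown <domain>'
# With the bug, one of the domains stays 'in shutdown' even though its
# qemu process is already gone.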
When we shut down the process connected to the socket, everything becomes OK and the bug is gone.
The system used is Gentoo Linux. We have tried all recent versions of libvirt (3.4.0, 3.7.0, 3.8.0, 3.9.0, 3.10.0, 4.0.0-rc2 (today's version from git)) and they all have this bug; 3.3.0 works OK.
I don't see anything obviously stats-related in the diff between 3.3.0 and 3.4.0. We added reporting of the shutdown reason, but that is just parsing one more JSON reply we previously ignored.
Can you try running 'git bisect' to pinpoint the exact commit that caused this issue?
I am able to reproduce this issue. I ran bisect and found that the commit which broke it is aeda1b8c56dc58b0a413acc61bbea938b40499e1:
https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=aeda1b8c56dc58b0a413acc61bbea938b40499e1;hp=ec337aee9b20091d6f9f60b78f210d55f812500b
But it's very unlikely that this commit itself causes the error; if anything, it just exposes a bug we already have there. Still, if I revert the commit on top of current HEAD, I can no longer reproduce the issue.
Michal
Michal, Ján, I've got the same results:

Bisecting: 0 revisions left to test after this (roughly 0 steps)
[aeda1b8c56dc58b0a413acc61bbea938b40499e1] qemu: monitor: do not report error on shutdown

And yes, when I revert it in HEAD, the problem is gone.