
On Thu, Sep 27, 2018 at 10:15 AM +0200, Michal Privoznik <mprivozn@redhat.com> wrote:
On 09/27/2018 09:01 AM, Bjoern Walk wrote:
Michal Privoznik <mprivozn@redhat.com> [2018-09-19, 11:45AM +0200]:
On 09/19/2018 11:17 AM, Bjoern Walk wrote:
Bjoern Walk <bwalk@linux.ibm.com> [2018-09-12, 01:17PM +0200]:
Michal Privoznik <mprivozn@redhat.com> [2018-09-12, 11:32AM +0200]:
Still seeing the same timeout. Is this expected behaviour?
Nope. I wonder if something has locked the path and forgotten to unlock it (although virtlockd should have unlocked all the paths owned by a PID on connection close), or whether something is still holding the lock with the connection open.
Can you see the timeout even when you turn off the SELinux driver (security_driver="none" in qemu.conf)? I tried to reproduce the issue yesterday and was unsuccessful. Do you have any steps to reproduce?
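(For reference, that is the following setting in /etc/libvirt/qemu.conf, followed by a libvirtd restart:)

# /etc/libvirt/qemu.conf
security_driver = "none"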
So, I haven't been able to actually dig into the debugging yet, but we have been able to reproduce this behaviour on x86 (both with SELinux and DAC). It looks like a general problem, if it is a problem at all, because from what I can see in the code the 60-second timeout is actually intended, isn't it? The security manager tries for 60 seconds to acquire the lock and only then bails out. Why is that?
The ideal solution would be to just tell virtlockd "these are the paths I want you to lock on my behalf" and virtlockd would use F_SETLKW so that the moment all paths are unlocked virtlockd will lock them and libvirtd can continue its execution (i.e. chown() and setfcon()). However, we can't do this because virtlockd runs single threaded [1], and therefore if one thread is waiting for a lock to be acquired there is no other thread to unlock the path.
Therefore I had to move the logic into libvirtd, which tries repeatedly to lock all the needed paths. And this is where the timeout comes in - the lock acquiring attempts are capped at 60 seconds. This is an arbitrarily chosen timeout. We can make it smaller, but that will not solve your problem - only mask it.
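To illustrate the idea (this is just a minimal sketch, not the actual libvirt code; the function name and the retry interval are made up), the retry approach boils down to trying a non-blocking F_SETLK repeatedly - as opposed to the blocking F_SETLKW described above - and giving up after roughly 60 seconds:

/* Minimal sketch of the retry-with-timeout approach described above
 * (illustrative only, not the actual libvirt code). */
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

#define LOCK_TIMEOUT_SECS 60

static int
lock_path_with_timeout(int fd)
{
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,                         /* 0 = lock the whole file */
    };

    for (int i = 0; i < LOCK_TIMEOUT_SECS * 10; i++) {
        if (fcntl(fd, F_SETLK, &fl) == 0)
            return 0;                       /* lock acquired */
        if (errno != EACCES && errno != EAGAIN)
            return -1;                      /* real error, not "file is busy" */
        usleep(100 * 1000);                 /* wait 100 ms, then retry */
    }

    errno = ETIMEDOUT;
    return -1;                              /* still locked after ~60 seconds */
}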
However, an actual bug is that the whole libvirtd hangs while the security manager waits to acquire the lock; but from what I understood, and what Marc explained to me, this is more a pathological error in libvirt's behaviour with the various domain locks.
The whole libvirtd shouldn't hang.
The main loop doesn’t hang.
Only those threads which try to acquire a domain object lock should block. IOW it should be possible to run APIs over different domains (or other objects, for that matter). For instance, dumpxml over a different domain works just fine.
Oh sorry, my assumption that thread A is also holding the virDomainObjList lock is wrong! Here is the actual backtrace of a waiting thread (in this case the worker thread for the "virsh list" command):

(gdb) bt
#0  0x00007f850887b7ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8508874cf4 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007f850c26cfd9 in virMutexLock (m=<optimized out>) at util/virthread.c:89
#3  0x00007f850c2455df in virObjectLock (anyobj=<optimized out>) at util/virobject.c:429
#4  0x00007f850c2ceaba in virDomainObjListFilter (list=list@entry=0x7f84f47a97c0, nvms=nvms@entry=0x7f84f47a97c8, conn=conn@entry=0x7f84dc01ad00, filter=filter@entry=0x7f850c318c70 <virConnectListAllDomainsCheckACL>, flags=flags@entry=1) at conf/virdomainobjlist.c:919
#5  0x00007f850c2cfef2 in virDomainObjListCollect (domlist=0x7f848c10e290, conn=conn@entry=0x7f84dc01ad00, vms=vms@entry=0x7f84f47a9820, nvms=nvms@entry=0x7f84f47a9830, filter=0x7f850c318c70 <virConnectListAllDomainsCheckACL>, flags=1) at conf/virdomainobjlist.c:961
#6  0x00007f850c2d01a6 in virDomainObjListExport (domlist=<optimized out>, conn=0x7f84dc01ad00, domains=0x7f84f47a9890, filter=<optimized out>, flags=<optimized out>) at conf/virdomainobjlist.c:1042
#7  0x00007f850c426be9 in virConnectListAllDomains (conn=0x7f84dc01ad00, domains=0x7f84f47a9890, flags=1) at libvirt-domain.c:6493
#8  0x000055a82b2d905d in remoteDispatchConnectListAllDomains (server=0x55a82c9e16d0, msg=0x55a82ca53380, args=0x7f84dc01ae30, args=0x7f84dc01ae30, ret=0x7f84dc005bb0, rerr=0x7f84f47a9960, client=<optimized out>) at remote/remote_daemon_dispatch_stubs.h:1372
#9  remoteDispatchConnectListAllDomainsHelper (server=0x55a82c9e16d0, client=<optimized out>, msg=0x55a82ca53380, rerr=0x7f84f47a9960, args=0x7f84dc01ae30, ret=0x7f84dc005bb0) at remote/remote_daemon_dispatch_stubs.h:1348
#10 0x00007f850c338734 in virNetServerProgramDispatchCall (msg=0x55a82ca53380, client=0x55a82ca52a40, server=0x55a82c9e16d0, prog=0x55a82ca4be70) at rpc/virnetserverprogram.c:437
…

virDomainObjListFilter needs the lock of every domain… That's the actual problem.
Anyway, we need to get to the bottom of this. It looks like something keeps the file locked, and then when libvirt wants to lock it for metadata the timeout is hit and the whole operation fails. Do you mind running 'lslocks -u' when starting a domain, before the timeout is hit?
Michal
1: The reason that virtlockd has to run single threaded is the stupidity of POSIX file locks. Imagine one thread doing open() + fcntl(fd, F_SETLKW, ...) and entering a critical section. If another thread does open() + close() on the same file, the file is unlocked. Because we can't guarantee this will not happen in multithreaded libvirt, we had to introduce a separate process to take care of the locking. And this process has to be single threaded so there is never a second thread that could call close() and unintentionally release the lock.
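A minimal sketch of that scenario (illustrative only; the file name and function are made up):

/* Sketch of the POSIX lock pitfall described above: all fcntl() locks a
 * process holds on a file are dropped as soon as *any* descriptor for
 * that file is closed - even one opened by an unrelated thread. */
#include <fcntl.h>
#include <unistd.h>

void posix_lock_pitfall(void)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

    /* Thread A: take the lock and enter its critical section. */
    int fd1 = open("/tmp/some.lock", O_RDWR | O_CREAT, 0600);
    fcntl(fd1, F_SETLKW, &fl);

    /* Thread B: merely opens and closes the same file... */
    int fd2 = open("/tmp/some.lock", O_RDONLY);
    close(fd2);
    /* ...and the lock taken via fd1 is now gone, even though fd1 is
     * still open. Hence virtlockd must stay single threaded. */

    close(fd1);
}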
Perhaps we could use OFD locks but those are not part of POSIX and are available on Linux only.
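For comparison, a sketch using an OFD lock (Linux-specific F_OFD_SETLKW; again just illustrative, the function name is made up). The lock is tied to the open file description rather than the process, so closing an unrelated fd does not release it:

/* Minimal sketch of an open file description (OFD) lock - Linux only,
 * not part of POSIX. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int ofd_lock(const char *path)
{
    int fd = open(path, O_RDWR);
    struct flock fl = {
        .l_type = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start = 0,
        .l_len = 0,
        .l_pid = 0,                  /* must be 0 for OFD locks */
    };

    if (fd < 0)
        return -1;
    if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {   /* blocks until acquired */
        close(fd);
        return -1;
    }
    return fd;    /* lock is held as long as this description stays open */
}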
--
Kind regards / Beste Grüße
Marc Hartmayer
IBM Deutschland Research & Development GmbH