On 03/06/2018 10:58 AM, Daniel P. Berrangé wrote:
Currently both virtlogd and virtlockd use a single worker thread for
dispatching RPC messages. Even this is overkill: their RPC message
handling callbacks all run in short, finite time, so blocking the
main loop is not an issue like you'd see in libvirtd with long-running
QEMU commands.
By setting max_workers==0, we can turn off the worker thread and run
these daemons single-threaded. This in turn fixes a serious problem in
the virtlockd daemon whereby it loses all fcntl() locks at re-exec due
to multiple threads existing. fcntl() locks only get preserved if the
process is single-threaded at the time of exec().
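To make the failure mode concrete, the pattern that breaks is roughly the
following (a minimal sketch, not virtlockd's actual code, error checking
omitted for brevity): take an fcntl() write lock, clear FD_CLOEXEC so the
fd survives exec, create a second thread, then re-exec. On an affected
kernel the lock is gone after the execve():

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *idle_thread(void *arg)
{
    (void)arg;
    pause();                        /* just keep a second thread alive */
    return NULL;
}

int main(void)
{
    struct flock fl = {
        .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 42,
    };
    int fd = open("lock.txt", O_WRONLY | O_CREAT | O_TRUNC, 0755);

    fcntl(fd, F_SETLK, &fl);        /* take the write lock */
    fcntl(fd, F_SETFD, 0);          /* clear FD_CLOEXEC so the fd survives exec */

    pthread_t tid;
    pthread_create(&tid, NULL, idle_thread, NULL);  /* now multi-threaded */

    printf("holding lock as pid %d, re-exec'ing\n", (int)getpid());
    execl("/bin/sleep", "sleep", "60", (char *)NULL);
    return 1;                       /* only reached if execl() failed */
}

While the exec'd sleep runs, /proc/locks should still list a POSIX WRITE
lock for that pid covering bytes 0-41 of lock.txt; on an affected kernel
the entry disappears at the execve(). Drop the pthread_create() call and
the lock survives as documented.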
I suppose this change has no effect when, e.g., starting many domains in
parallel with locking enabled. Before the change there was still only one
worker thread to process requests.
I've tested the series and locks are now preserved across re-execs of virtlockd.
The question is whether we want this change or to pursue fixing the underlying kernel bug.
FYI, via the non-public bug I asked a glibc maintainer about the lost lock
behavior. He agreed it is a kernel bug and posted the below comment to the bug.
Regards,
Jim
First, I agree that POSIX file record locks (i.e. the fcntl F_SETLK ones, which
you're using) _are_ to be preserved over execve (absent any FD_CLOEXEC of
course, which you aren't using). Relevant quote from fcntl(2):
Record locks are not inherited by a child created via fork(2),
but are preserved across an execve(2).
Second, I agree that the existence or non-existence of threads must not play
a role in the above.
Third, I see this from an strace of your reproducer (only the relevant
snippets, with a bit of explanation):
13:54:09.581429 clone(child_stack=0x7f74b0517ff0,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7f74b05189d0, tls=0x7f74b0518700, child_tidptr=0x7f74b05189d0)
= 30911
Process 30911 attached
So, here we created the thread 30911. PID 30910 is now taking the lock:
[pid 30910] 13:54:09.581562 open("lock.txt", O_WRONLY|O_CREAT|O_TRUNC, 0755
<unfinished ...>
[pid 30911] 13:54:09.581595 set_robust_list(0x7f74b05189e0, 24 <unfinished ...>
[pid 30910] 13:54:09.581647 <... open resumed> ) = 3
[pid 30911] 13:54:09.581671 <... set_robust_list resumed> ) = 0
[pid 30910] 13:54:09.581693 fcntl(3, F_SETLK, {type=F_WRLCK, whence=SEEK_SET,
start=0, len=42} <unfinished ...>
[pid 30911] 13:54:09.581722 rt_sigprocmask(SIG_BLOCK, [CHLD], <unfinished ...>
[pid 30910] 13:54:09.581750 <... fcntl resumed> ) = 0
Lock acquired. Now we do a little debug output and sleeping, and the only
actions on FD 3 during this are:
[pid 30910] 13:54:09.581790 fcntl(3, F_GETFD <unfinished ...>
[pid 30910] 13:54:09.581872 <... fcntl resumed> ) = 0
[pid 30910] 13:54:09.581911 fcntl(3, F_SETFD, 0 <unfinished ...>
[pid 30910] 13:54:09.581958 <... fcntl resumed> ) = 0
Then comes the execve from 30910, with no intervening thread exit or the like.
But of course during the execve the thread 30911 is killed:
[pid 30910] 13:54:19.582600 execve("/suse/matz/lock-strangeness",
["/suse/matz/lock-strangeness", "3"], [/* 0 vars */] <unfinished
...>
[pid 30911] 13:54:19.583422 +++ exited with 0 +++
13:54:19.583749 <... execve resumed> ) = 0
13:54:19.583845 brk(0) = 0xc0c000
So, 30910 is alone again and is executing, loading the shared libs, relocating
and so on. The first action done to FD 3 (retained over the execve) is:
13:54:19.588016 fcntl(3, F_GETLK, {type=F_UNLCK, whence=SEEK_SET, start=0,
len=42, pid=0}) = 0
13:54:19.588101 fcntl(3, F_GETFD) = 0
13:54:19.588480 fcntl(3, F_SETFD, FD_CLOEXEC) = 0
Bleah! F_UNLCK we get, which isn't what we're supposed to get.
As there are no intervening syscalls that could in any way have removed the
lock (in fact there are no other syscalls on file descriptor 3 at all), it
can't be glibc or libpthread (or anything else userspace is doing) that
caused the lock to be removed. It's the kernel itself, and hence a bug
therein.
If I may hazard a guess, it would be that the forced exits of all non-main
threads done by the kernel somehow lead to cleaning up the locks, possibly
because the actions that need doing during fork(2) (which include
other-threads-exit, but also lock-removal) are conflated with the actions
that need to happen during execve(2) (which must _not_ include lock-removal).
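For completeness, here is a small stand-alone probe (hypothetical, not part
of the series or the bug report) that a second process can run to check
whether some other process still holds a write lock on a file. Since record
locks never conflict with locks held by the probing process itself, the
probe has to run in a different process than the lock holder:

/* lockprobe.c: does any other process hold a lock on bytes [0,42)
 * of the given file?  (The range matches the sketch/strace above;
 * adjust it for other lock files.) */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 2;
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        return 2;
    }

    struct flock fl = {
        .l_type = F_WRLCK, .l_whence = SEEK_SET, .l_start = 0, .l_len = 42,
    };

    /* F_GETLK only reports locks that would conflict with *this* process */
    if (fcntl(fd, F_GETLK, &fl) < 0) {
        perror("fcntl(F_GETLK)");
        return 2;
    }

    if (fl.l_type == F_UNLCK)
        printf("no conflicting lock: it was lost (or never taken)\n");
    else
        printf("lock still held by pid %d\n", (int)fl.l_pid);

    close(fd);
    return 0;
}

Running it against lock.txt from the sketch earlier (or against a virtlockd
lock file, with the byte range adjusted) before and after the re-exec is a
quick way to confirm whether the lock survived.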