
On 03/07/2018 06:07 AM, Daniel P. Berrangé wrote:
On Wed, Mar 07, 2018 at 10:10:29AM +0000, Daniel P. Berrangé wrote:
On Tue, Mar 06, 2018 at 04:46:05PM -0700, Jim Fehlig wrote:
On 03/06/2018 10:58 AM, Daniel P. Berrangé wrote:
Currently both virtlogd and virtlockd use a single worker thread for dispatching RPC messages. Even this is overkill: their RPC message handling callbacks all run in short, finite time, so blocking the main loop is not an issue the way it is in libvirtd with long-running QEMU commands.
By setting max_workers==0, we can turn off the worker thread and run these daemons single threaded. This in turn fixes a serious problem in the virtlockd daemon whereby it loses all fcntl() locks at re-exec due to multiple threads existing. fcntl() locks only get preserved if the process is single threaded at the time of exec().
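To make the behaviour described above concrete, here is a minimal sketch of a check for it. It is not the reproducer used in this thread; the names HOLDER_SOURCE and lock_survives_exec are made up for illustration. A helper process takes an fcntl() record lock, optionally starts an idle thread, and re-execs itself; the caller then probes the lock from a separate process. The single-threaded case should report the lock as preserved; the threaded result depends on the kernel in use.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Helper script (hypothetical, for illustration only). Stage "lock" takes the
# lock and re-execs itself; stage "hold" runs after the exec and just waits.
HOLDER_SOURCE = """
import fcntl, os, sys, threading, time
path, stage, threaded = sys.argv[1:4]
if stage == "lock":
    fd = os.open(path, os.O_RDWR)
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # POSIX record lock
    os.set_inheritable(fd, True)                    # clear FD_CLOEXEC
    if threaded == "1":
        # A second, idle thread exists at the time of exec().
        threading.Thread(target=time.sleep, args=(60,), daemon=True).start()
    os.execv(sys.executable, [sys.executable, sys.argv[0], path, "hold", "0"])
else:
    print("ready", flush=True)  # tell the parent the exec has happened
    sys.stdin.read()            # keep the process alive until stdin closes
"""


def lock_survives_exec(threaded: bool) -> bool:
    """Return True if the helper still holds its lock after re-exec."""
    with tempfile.TemporaryDirectory() as tmp:
        script = os.path.join(tmp, "holder.py")
        lockfile = os.path.join(tmp, "lockfile")
        with open(script, "w") as f:
            f.write(HOLDER_SOURCE)
        open(lockfile, "w").close()

        proc = subprocess.Popen(
            [sys.executable, script, lockfile, "lock", "1" if threaded else "0"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        try:
            if proc.stdout.readline().strip() != b"ready":
                raise RuntimeError("helper did not reach the post-exec stage")
            fd = os.open(lockfile, os.O_RDWR)
            try:
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                return False  # we got the lock, so the helper lost it
            except OSError:
                return True   # conflict: the helper still holds the lock
            finally:
                os.close(fd)
        finally:
            proc.stdin.close()
            proc.wait()
```

Calling lock_survives_exec(threaded=False) exercises the single-threaded path that max_workers==0 relies on, and lock_survives_exec(threaded=True) exercises the path that loses locks on affected kernels.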
I suppose this change has no effect when e.g. starting many domains in parallel with locking enabled. Before the change, there was still only one worker thread to process requests.
I've tested the series and locks are now preserved across re-execs of virtlockd. The question is whether we want this change, or whether to pursue fixing the underlying kernel bug.
FYI, via the non-public bug I asked a glibc maintainer about the lost lock behavior. He agreed it is a kernel bug and posted the below comment to the bug.
Regards, Jim
First, I agree that POSIX file record locks (i.e. the fcntl F_SETLK ones, which you're using) _are_ to be preserved over execve (absent any FD_CLOEXEC of course, which you aren't using). Relevant quote from fcntl(2):
Record locks are not inherited by a child created via fork(2), but are preserved across an execve(2).
Second, I agree that the existence or non-existence of threads must not play a role in the above.
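The fork(2) half of the quoted fcntl(2) sentence can be checked with a few lines as well. This is only an illustrative sketch, and child_inherits_lock is a made-up name. The parent takes a record lock, forks, and the child tries to take the same lock through the inherited descriptor. If the child had inherited the lock, its request would succeed; a conflict means the lock stayed with the parent.

```python
import fcntl
import os
import tempfile


def child_inherits_lock() -> bool:
    """Return True if a forked child appears to own the parent's record lock."""
    with tempfile.NamedTemporaryFile() as f:
        # Parent takes an exclusive POSIX record lock on the whole file.
        fcntl.lockf(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)

        pid = os.fork()
        if pid == 0:
            # Child: the descriptor is inherited, but is the lock?
            code = 2
            try:
                fcntl.lockf(f.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
                code = 1  # no conflict: child behaves as the lock owner
            except OSError:
                code = 0  # conflict: the lock still belongs to the parent
            finally:
                os._exit(code)  # skip cleanup so the temp file is left alone

        _, status = os.waitpid(pid, 0)
        return os.WEXITSTATUS(status) == 1
```

On a system that follows the documented behaviour, child_inherits_lock() returns False.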
I've asked some Red Hat experts too and they suggest it looks like a kernel bug. The question is whether this is a recent kernel regression that is easily fixed, or whether it's a long-term problem.
I've at least verified that this broken behaviour existed in RHEL-7 (but it's possible it was backported when OFD locks were implemented). I still want to test RHEL-6 and RHEL-5 to see if this problem goes back indefinitely.
I've checked RHEL-6 and RHEL-5 and both are affected, so this is a long-standing Linux problem, and we'll need to work around it.
We have some vintage distros around for long term support and I managed to "bisect" the problem a bit: The reproducer works on kernel 2.6.16 but breaks on 2.6.32.
FYI I've got a kernel bug open here to track it from the RHEL side:
Thanks! Regards, Jim