On Tue, Mar 06, 2018 at 04:46:05PM -0700, Jim Fehlig wrote:
On 03/06/2018 10:58 AM, Daniel P. Berrangé wrote:
> Currently both virtlogd and virtlockd use a single worker thread for
> dispatching RPC messages. Even this is overkill, as their RPC message
> handling callbacks all run in short, finite time, so blocking the
> main loop is not an issue like you'd see in libvirtd with long running
> QEMU commands.
>
> By setting max_workers==0, we can turn off the worker thread and run
> these daemons single threaded. This in turn fixes a serious problem in
> the virtlockd daemon whereby it loses all fcntl() locks at re-exec due
> to multiple threads existing. fcntl() locks only get preserved if the
> process is single threaded at time of exec().
I suppose this change has no effect when, e.g., starting many domains in
parallel with locking enabled. Before the change, there was still only one
worker thread to process requests.
I've tested the series and locks are now preserved across re-execs of
virtlockd. The question is whether we want this change, or whether we
should pursue fixing the underlying kernel bug.
FYI, via the non-public bug I asked a glibc maintainer about the lost lock
behavior. He agreed it is a kernel bug and posted the below comment to the
bug.
Regards,
Jim
First, I agree that POSIX file record locks (i.e. the fcntl F_SETLK ones, which
you're using) _are_ to be preserved over execve (absent any FD_CLOEXEC of
course, which you aren't using). Relevant quote from fcntl(2):

    Record locks are not inherited by a child created via fork(2),
    but are preserved across an execve(2).

Second, I agree that the existence or non-existence of threads must not play
a role in the above.
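
The documented behaviour quoted above can be checked with a small test. The following is only a minimal sketch in Python, not code from libvirt or from the bug report: the helper name lock_survives_exec() is hypothetical, and it assumes a POSIX host with a working os.fork(). A forked, single-threaded child takes a record lock, clears FD_CLOEXEC on the locked descriptor, and re-execs itself; the parent then probes whether the lock is still held. It covers only the single-threaded case and does not attempt to reproduce the multi-threaded failure discussed in this thread.

```python
import fcntl
import os
import signal
import sys
import tempfile


def lock_survives_exec():
    """Return True if a record lock taken before execv() is still held after it."""
    fd, path = tempfile.mkstemp()
    # The parent must hold no descriptor on the file when it forks: the child
    # would inherit it as close-on-exec, and closing ANY descriptor for a file
    # drops all of that process's record locks on the file.
    os.close(fd)
    rd, wr = os.pipe()  # the exec'd program reports readiness on this pipe
    pid = os.fork()
    if pid == 0:
        # Child: single threaded after fork(). Lock, then replace ourselves.
        try:
            os.close(rd)
            lock_fd = os.open(path, os.O_RDWR)
            fcntl.lockf(lock_fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            os.set_inheritable(lock_fd, True)  # clear FD_CLOEXEC on the locked fd
            os.set_inheritable(wr, True)
            code = "import os, time; os.write(%d, b'x'); time.sleep(60)" % wr
            os.execv(sys.executable, [sys.executable, "-c", code])
        finally:
            os._exit(1)  # only reached if something failed before or in execv()
    os.close(wr)
    try:
        # Block until the post-exec program is running (EOF means it never was).
        if os.read(rd, 1) != b"x":
            raise RuntimeError("child failed before or during exec")
        probe = os.open(path, os.O_RDWR)
        try:
            fcntl.lockf(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return False  # we obtained the lock, so the child lost it
        except OSError:
            return True   # still held by the exec'd child
        finally:
            os.close(probe)
    finally:
        os.close(rd)
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
        os.unlink(path)
```

Calling print(lock_survives_exec()) should report True wherever the fcntl(2) text above holds for a single-threaded process.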
I've asked some Red Hat experts too and they suggest it looks like a kernel
bug. The question is whether this is a recent kernel regression that is easily
fixed, or whether it's a long-term problem.
I've at least verified that this broken behaviour existed in RHEL-7 (but it's
possible it was backported when OFD locks were implemented). I still want to
test RHEL-6 and RHEL-5 to see if this problem goes back indefinitely.
My inclination though is that we'll need to work around the problem in
libvirt regardless.
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|