On Wed, Mar 15, 2023 at 08:10:26PM +0100, Michal Privoznik wrote:
We've run into so many clang bugs recently that I'm strongly in favor of
making it optional and aiming at gcc only. Debugging these issues burns our
time needlessly. Just consider what's happening here.
I recently pushed patches that call some virNuma*() APIs during QEMU
command line generation, and while at it I removed the virNuma*() mocks from
qemuxml2argvmock.c (v9.1.0-213-g95ae91fdd4). Andrea told me in review to
make sure our CI works, so I fired up my FreeBSD VM and compiled my
code there. It worked. Except that I compile with -O0 while our CI
compiles with the default (-O2), and of course that is the source of the
current build breakage.
Long story short, clang is so eager to produce the fastest code that it
simply ignores the written code and generates what it *assumes* behaves the
same. Well, it doesn't, and this breaks our mocking and thus our tests.
[snip]
And of course it's happening only with optimizations.
[snip]
To reproduce this issue even on Linux, configure with:
export CC=clang
meson setup -Dsystem=true -Dexpensive_tests=enabled -Dnumactl=disabled _build
At this point, I think we can call clang broken and focus our time on
developing libvirt, not on coming up with hacks around broken compilers.
Clang's behaviour is not broken or wrong. The code it generates is
functionally correct, because according to the C standard it is
forbidden to define the same function more than once. Apps linked
against libvirt.so, and our daemons, all operate perfectly.
The problem is confined to our test suite, where we're trying to
selectively replace code at runtime, and in doing so we're making
assumptions about code optimization that are not valid. In doing
this we're violating the language standard, even though symbol
interposition is a feature that ELF explicitly supports despite being
outside the language standard. What this means is that we can't assume
the compiler's default behaviour will do what we want.
The specific problem we're facing here is that the compiler is
doing inter-procedural analysis because it can see the body of
both functions. We've already tried to suppress the optimization
by using the 'noinline' annotation, but unfortunately it is possible
for a function to remain non-inlined and still behave wrongly. What we're
seeing here is that a separate optimization looks at the
return value, determines that it is a compile-time constant,
and thus elides the conditional while still calling the original
function!
I can demonstrate this a little more clearly with an isolated
example:
$ cat inline.h
int IsAvailable(int i) __attribute__((noinline));
int IsSomethingOK(int i);
$ cat inline.c
#include <stdbool.h>
#include <stdio.h>

#include "inline.h"

int IsAvailable(int i)
{
    fprintf(stderr, "Bad hit\n");
    return false;
}

int IsSomethingOK(int i)
{
    while (i--) {
        if (IsAvailable(i))
            continue;

        fprintf(stderr, "Bad stuff\n");
        return -1;
    }
    fprintf(stderr, "OK\n");
    return 0;
}
$ cat mock.c
#include <stdbool.h>
#include <stdio.h>

#include "inline.h"

int IsAvailable(int i)
{
    fprintf(stderr, "Good hit\n");
    return true;
}
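For completeness, a minimal driver and build recipe along these lines
will reproduce it; the file name, the count passed to IsSomethingOK()
and the exact compiler invocations are just one plausible setup, not
taken from libvirt:

$ cat main.c
#include "inline.h"

int main(void)
{
    /* any non-zero count exercises the IsAvailable() check */
    return IsSomethingOK(3);
}

$ clang -O2 -fPIC -shared -o libinline.so inline.c
$ clang -O2 -fPIC -shared -o mock.so mock.c
$ clang -O2 -o something main.c -L. -linline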
$ LD_LIBRARY_PATH=. ./something
Bad hit
Bad stuff
Here we see 'Bad hit', so it must be running the original
IsAvailable() code body, and we see 'Bad stuff', so it
must be using the 'return false' value.
Now running with the override:
$ LD_PRELOAD=mock.so LD_LIBRARY_PATH=. ./something
Good hit
Bad stuff
we see 'Good hit', so it is running our replacement
mocked impl of IsAvailable(), but we still see 'Bad stuff',
so it must still be using the old 'return false' value.
This proves that the "inlining" optimization is separate
from the "constant return value" optimization.
We're only disabling the former optimization, so our mocks
are not safe.
We hit this about 6 years ago, which led to using the 'weak' annotation
too:
commit e4b980c853d2114b25fa805a84ea288384416221
Author: Daniel P. Berrangé <berrange(a)redhat.com>
Date: Wed Jul 5 11:46:28 2017 +0100
Prevent more compiler optimization of mockable functions
Currently all mockable functions are annotated with the 'noinline'
attribute. This is insufficient to guarantee that a function can
be reliably mocked with an LD_PRELOAD. The C language spec allows
the compiler to assume there is only a single implementation of
each function. It can thus do things like propagating constant
return values into the caller at compile time, or creating
multiple specialized copies of the function body each optimized
for a different caller. To prevent these optimizations we must
also set the 'noclone' and 'weak' attributes.
This fixes the test suite when libvirt.so is built with CLang
with optimization enabled.
Signed-off-by: Daniel P. Berrange <berrange(a)redhat.com>
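Applied to the isolated example, the approach that commit describes
amounts to annotating the mockable function along these lines (a sketch
only, not libvirt's actual macro):

$ cat inline.h
/* per the commit above: noinline + noclone + weak;
 * note that 'noclone' is a GCC attribute */
int IsAvailable(int i)
    __attribute__((noinline)) __attribute__((noclone)) __attribute__((weak));
int IsSomethingOK(int i);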
but this caused unwanted side effects:
commit 407a281a8e2b6c5078ba1148535663ea64fd9314
Author: Daniel P. Berrangé <berrange(a)redhat.com>
Date: Wed Jul 12 11:07:17 2017 +0100
Revert "Prevent more compiler optimization of mockable functions"
This reverts commit e4b980c853d2114b25fa805a84ea288384416221.
When a binary links against a .a archive (as opposed to a shared library),
any symbols which are marked as 'weak' get silently dropped. As a result
when the binary later runs, those 'weak' functions have an address of
0x0 and thus crash when run.
This happened with virtlogd and virtlockd because they don't link to
libvirt.so, but instead just libvirt_util.a and libvirt_rpc.a. The
virRandomBits symbol was weak and so left out of the virtlogd &
virtlockd binaries, despite being required by virHashTable functions.
Various other binaries like libvirt_lxc, libvirt_iohelper, etc also
link directly to .a files instead of libvirt.so, so are potentially
at risk of dropping symbols leading to a later runtime crash.
This is normal linker behaviour because a weak symbol is not treated
as undefined, so nothing forces it to be pulled in from the .a. You
have to force the linker to pull in weak symbols using -u$SYMNAME,
which is not a practical approach.
This risk of silent bad linkage that affects runtime behaviour is
not acceptable for a fix that was merely trying to fix the test
suite. So stop using __weak__ again.
Signed-off-by: Daniel P. Berrange <berrange(a)redhat.com>
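The linker behaviour that commit describes can be seen in isolation with
a small sketch like this (file and function names are hypothetical; it
only illustrates the mechanism, not libvirt's actual build):

$ cat weakfn.c
/* the "mockable" function, archived into a static library */
int WeakFn(void)
{
    return 42;
}

$ cat caller.c
/* the weak declaration makes the reference weak, so the linker is not
 * obliged to extract weakfn.o from the archive to resolve it */
int WeakFn(void) __attribute__((weak));

int main(void)
{
    /* if the symbol was silently dropped, WeakFn is 0x0 at runtime */
    return WeakFn ? WeakFn() : 1;
}

$ clang -c weakfn.c && ar rcs libweak.a weakfn.o
$ clang -o caller caller.c libweak.a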
There's a bit of a thread here:
https://listman.redhat.com/archives/libvir-list/2017-July/150772.html
I tried another approach here:
https://listman.redhat.com/archives/libvir-list/2017-July/150809.html
which didn't work well enough, and then IIRC we abandoned our attempts
at that time.
Meanwhile, though, GCC gained a 'noipa' attribute (no inter-procedural
analysis) which can suppress these return value optimizations, amongst
other things. CLang doesn't support it yet though:
https://reviews.llvm.org/D101011
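Where 'noipa' is available it could be applied to mockable functions
roughly like this, falling back to plain 'noinline' elsewhere (the
MOCKABLE macro name is purely illustrative, not an existing libvirt
macro):

/* use 'noipa' where the compiler knows it, otherwise fall back to
 * 'noinline', which only stops inlining, not return value propagation */
#ifdef __has_attribute
# if __has_attribute(noipa)
#  define MOCKABLE __attribute__((noipa))
# else
#  define MOCKABLE __attribute__((noinline))
# endif
#else
# define MOCKABLE __attribute__((noinline))
#endif

MOCKABLE int IsAvailable(int i);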
Looking at the compiler flags, though, there is another option, which is
to pass '-fsemantic-interposition'. With CLang this defaults to off,
which means it is free to assume replacement functions have the same
semantics. Passing -fsemantic-interposition forces it to assume that
replacements might have different semantics, and thus it cannot do the
return value optimizations.
The downside is that it would have an impact on performance across
everything we compile with that arg, but I don't think libvirt stuff
is generally that performance sensitive, so we can probably just
go with -fsemantic-interposition on clang unconditionally.
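In terms of the isolated example, rebuilding just the library with the
flag should be enough for the preloaded mock to take effect, since
IsSomethingOK() can no longer assume IsAvailable() returns false (the
command lines are illustrative):

$ clang -O2 -fPIC -fsemantic-interposition -shared -o libinline.so inline.c
$ LD_PRELOAD=mock.so LD_LIBRARY_PATH=. ./something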
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|