We ran into so many clang bugs recently, that I'm strongly in favor of
making it optional and aim on gcc only. Debugging these issues burns our
time needlessly. Just consider what's happening here.
I've pushed patches recently that call some virNuma*() APIs during QEMU
cmd line generation. And while at it, I removed virNuma*() mocks from
qemuxml2argvmock.c (v9.1.0-213-g95ae91fdd4). Andrea told me in review to
make sure our CI works. So I've fired up my freebsd VM and compiled my
code there. It worked. Except, I'm compiling with -O0 and our CI
compiles with the default (-O2). And of course this is the source of the
current build breakage.
Long story short, clang is so eager to produce the fastest code that it
simply ignores written code and generates what it *assumes* behaves the
same. Well, it doesn't and this breaks our mocking and thus our tests.
And of course it's happening only with optimizations.
In this particular case, I've noticed that while
virNumaNodesetIsAvailable() calls virNumaNodeIsAvailable(), it
disregards the return value and virReportError()-s immediately.
This is because the !WITH_NUMACTL variant of virNumaNodeIsAvailable()
calls virNumaGetMaxNode() which fails and thus false is returned.
WORSE, then I stepped in gdb into the mock, I've seen random numbers as
integers. This is because the function from virnumamock.c wasn't
optimized as much and followed standard calling convention (it grabbed
the integer argument from the stack). But the caller
(virNumaNodesetIsAvailable()) was so optimized that it did not even
bothered to push anything onto stack.
After these patches, there is still one qemuxml2argvtest case failing,
and it's simply because no matter how hard I try, I can't stop clang
from optimizing the function. Consider the following code:
virNumaCPUSetToNodeset(virBitmap *cpuset,
virBitmap **nodeset)
{
g_autoptr(virBitmap) nodes = virBitmapNew(0);
ssize_t pos = -1;
while ((pos = virBitmapNextSetBit(cpuset, pos)) >= 0) {
int node = virNumaGetNodeOfCPU(pos);
if (node < 0) {
virReportSystemError(errno,
_("Unable to get NUMA node of cpu %zd"),
pos);
return -1;
}
...
}
...
}
And this is the assembly code that clang generates (even after
VIR_OPTNONE treatment):
116f5f: e8 0c b7 f9 ff call b2670 <virBitmapNew@plt>
116f64: 48 89 44 24 40 mov %rax,0x40(%rsp)
116f69: 48 c7 44 24 18 ff ff movq $0xffffffffffffffff,0x18(%rsp)
116f70: ff ff
116f72: 48 8b 7c 24 38 mov 0x38(%rsp),%rdi
116f77: 48 8b 74 24 18 mov 0x18(%rsp),%rsi
116f7c: e8 0f 74 f9 ff call ae390 <virBitmapNextSetBit@plt>
116f81: eb 00 jmp 116f83 <virNumaCPUSetToNodeset+0x43>
116f83: 48 89 44 24 18 mov %rax,0x18(%rsp)
116f88: 48 83 f8 00 cmp $0x0,%rax
116f8c: 0f 8c b2 00 00 00 jl 117044
<virNumaCPUSetToNodeset+0x104>
116f92: 48 8b 7c 24 18 mov 0x18(%rsp),%rdi
116f97: e8 d4 a1 f9 ff call b1170 <virNumaGetNodeOfCPU@plt>
116f9c: c7 44 24 10 ff ff ff movl $0xffffffff,0x10(%rsp)
116fa3: ff
116fa4: 83 7c 24 10 00 cmpl $0x0,0x10(%rsp)
116fa9: 7d 74 jge 11701f <virNumaCPUSetToNodeset+0xdf>
116fab: e8 80 72 f9 ff call ae230 <__errno_location@plt>
116fb0: 8b 18 mov (%rax),%ebx
116fb2: 48 8d 3d 27 35 2b 00 lea 0x2b3527(%rip),%rdi # 3ca4e0
<vmdk4GetBackingStore.prefix+0xf260>
116fb9: 48 8d 35 8c e3 25 00 lea 0x25e38c(%rip),%rsi # 37534c
<modeMap+0x215c>
116fc0: ba 05 00 00 00 mov $0x5,%edx
116fc5: e8 e6 a7 f9 ff call b17b0 <dcgettext@plt>
116fca: 48 8b 4c 24 18 mov 0x18(%rsp),%rcx
116fcf: 48 89 0c 24 mov %rcx,(%rsp)
116fd3: 48 8d 15 ea e0 25 00 lea 0x25e0ea(%rip),%rdx # 3750c4
<modeMap+0x1ed4>
116fda: 48 8d 0d 54 e3 25 00 lea 0x25e354(%rip),%rcx # 375335
<modeMap+0x2145>
116fe1: 31 ff xor %edi,%edi
116fe3: 89 de mov %ebx,%esi
116fe5: 41 b8 10 04 00 00 mov $0x410,%r8d
116feb: 49 89 c1 mov %rax,%r9
116fee: 31 c0 xor %eax,%eax
116ff0: e8 8b c1 f9 ff call b3180 <virReportSystemErrorFull@plt>
My assembler is a bit rusty, but at virNumaGetNodeOfCPU@plt is called
116f97, after which, -1 is stored on a stack (which corresponds to @pos
variable), after which comparison against 0 is made and the jump happens
iff the value (-1) is greater or equal than zero. Otherwise
virReportSystemError() is called.
To reproduce this issue even on Linux, configure with:
export CC=clang
meson setup -Dsystem=true -Dexpensive_tests=enabled -Dnumactl=disabled _build
At this point, I think we can call clang broken and focus our time on
developing libvirt. Not coming up with hacks around broken compilers.
Michal Prívozník (2):
internal.h: Invent VIR_OPTNONE
virnuma: Annotate some functions as VIR_OPTNONE
src/internal.h | 20 ++++++++++++++++++++
src/util/virnuma.c | 20 ++++++++++----------
2 files changed, 30 insertions(+), 10 deletions(-)
--
2.39.2