[libvirt] Entering freeze for libvirt-3.5.0

Slightly late, so little advance warning (see my mail earlier today): I have now tagged RC1 in git head and pushed signed tarball and rpms to the usual place:

  ftp://libvirt.org/libvirt/

Seems to work well in my limited testing; https://ci.centos.org/view/libvirt/ is all green except one failed build, so things look smooth overall. Please give it a try to find issues, especially for portability across systems.

If everything goes well I will push an RC2 on Friday, and then make the release on Tuesday.

  thanks for trying it out,

Daniel

-- 
Daniel Veillard | Red Hat Developers Tools http://developer.redhat.com/
veillard@redhat.com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/

On Wed, 2017-06-28 at 22:23 +0200, Daniel Veillard wrote:
Slightly late so little advance warning see my mail earlier today, I have now tagged RC1 in git head and pushed signed tarball and rpms to the usual place: ftp://libvirt.org/libvirt/ Seems to work well in my limited testing, https://ci.centos.org/view/libvirt/ is all green except one failed build, so things look smooth overall.
The failure to build on FreeBSD looks like a CI issue rather than a problem in our code:

  FATAL: command execution failed
  java.io.IOException: Connection reset by peer
  [...]
  Build step 'Execute shell' marked build as failure
  FATAL: channel is already closed
  java.io.IOException: Connection reset by peer
  [...]
  ERROR: Step 'E-mail Notification' failed: no workspace for libvirt-master-build/systems=libvirt-freebsd #666
  Finished: FAILURE

CC'ing Yash so he can take a look.

I've compiled master on FreeBSD myself and I didn't run into any error, so I'm pretty confident we're good release-wise.

-- 
Andrea Bolognani / Red Hat / Virtualization

On Thu, Jun 29, 2017 at 08:39:45AM +0200, Andrea Bolognani wrote:
I've compiled master on FreeBSD myself and I didn't run into any error, so I'm pretty confident we're good release-wise.
Okay, thanks for digging in :-)

BTW what is the list of platforms we compile on in CentOS CI ?

  thanks,

Daniel

-- 
Daniel Veillard | Red Hat Developers Tools http://developer.redhat.com/
veillard@redhat.com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/

On Thu, 2017-06-29 at 09:04 +0200, Daniel Veillard wrote:
BTW what is the list of platforms we compile on in CentOS CI ?

  * CentOS 6 and 7
  * Fedora 23, 24, 25 and rawhide
  * Debian (oldstable?)
  * FreeBSD (11?)

https://ci.centos.org/view/libvirt/job/libvirt-master-build/

Not all builders run all jobs, e.g. the test suite is skipped on FreeBSD, but they all at least compile the library.

-- 
Andrea Bolognani / Red Hat / Virtualization

On Thu, Jun 29, 2017 at 12:13:40PM +0200, Andrea Bolognani wrote:
* CentOS 6 and 7 * Fedora 23, 24, 25 and rawhide * Debian (oldstable?) * FreeBSD (11?)
https://ci.centos.org/view/libvirt/job/libvirt-master-build/
Travis adds Ubuntu (build + test) and OS-X (build only) coverage too. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani wrote:
Not all builders run all jobs, eg. the test suite is skipped on FreeBSD, but they all at least compile the library.
Is there any specific reason why it doesn't run test suite on FreeBSD? Generally, '(g)make check' should run fine, with the only exception that virnetsockettest fails from time to time (maybe once in 4-5 runs). 'syntax-check' will not work without local hacks though because it hits argmax limit that results in 'argument list too long' for a lot of checks.
Roman Bogorodskiy

On Fri, Jun 30, 2017 at 05:11:09PM +0400, Roman Bogorodskiy wrote:
Is there any specific reason why it doesn't run test suite on FreeBSD? Generally, '(g)make check' should run fine, with the only exception that virnetsockettest fails from time to time (maybe once in 4-5 runs).
'syntax-check' will not work without local hacks though because it hits argmax limit that results in 'argument list too long' for a lot of checks.
Our 'check' jobs depend on the 'syntax-check' jobs as a pre-requisite, so it's fallout from not running syntax-check on BSD.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, 2017-06-30 at 14:19 +0100, Daniel P. Berrange wrote:
Not all builders run all jobs, eg. the test suite is skipped on FreeBSD, but they all at least compile the library. Is there any specific reason why it doesn't run test suite on FreeBSD? Generally, '(g)make check' should run fine, with the only exception that virnetsockettest fails from time to time (maybe once in 4-5 runs).
qemuxml2argvtest fails consistently in my FreeBSD guest. virnetsockettest also fails pretty often for me, certainly more than your figure; even if that wasn't the case, 1/5 failure rate is way too high for a CI job.
'syntax-check' will not work without local hacks though because it hits argmax limit that results in 'argument list too long' for a lot of checks. Our 'check' jobs depend on the 'syntax-check' jobs as a pre-requisite, so its fallout from not running syntax-check on BSD
Can we invert the dependency so that syntax-check requires check instead? Not that it would be a good idea until the issues mentioned above are solved and check can run 100% reliably on FreeBSD, of course. -- Andrea Bolognani / Red Hat / Virtualization

On Fri, Jun 30, 2017 at 05:45:04PM +0200, Andrea Bolognani wrote:
Can we invert the dependency so that syntax-check requires check instead?
I think we could actually just let them run in parallel, and then make the rpm job depend on both Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Andrea Bolognani wrote:
On Fri, 2017-06-30 at 14:19 +0100, Daniel P. Berrange wrote:
Not all builders run all jobs, eg. the test suite is skipped on FreeBSD, but they all at least compile the library. Is there any specific reason why it doesn't run test suite on FreeBSD? Generally, '(g)make check' should run fine, with the only exception that virnetsockettest fails from time to time (maybe once in 4-5 runs).
qemuxml2argvtest fails consistently in my FreeBSD guest.
I guess that's caused by clang inlining functions that are mocked (specifically, some numa related stuff); I think that was discussed several times already. Anyway, it should work fine with '-O0' in CFLAGS.
virnetsockettest also fails pretty often for me, certainly more than your figure; even if that wasn't the case, 1/5 failure rate is way too high for a CI job.
I played a little more with virnetsockettest to get real stats and figured the following:

 1. On my desktop (i5) and laptop (i3), I didn't get any failures in 50 'check' runs
 2. On a VM that I use to run test builds in Jenkins, out of 50 runs it fails from 1 to 6 times; I did this test a couple of times and either I was lucky or the failure rate is higher when my Jenkins performs regular builds.

Anyway, I'll try to find a way to debug what's going on with virnetsockettest.
Roman Bogorodskiy

[cc: Guido] On Sat, Jul 01, 2017 at 02:18:58PM +0400, Roman Bogorodskiy wrote:
Andrea Bolognani wrote:
virnetsockettest also fails pretty often for me, certainly more than your figure; even if that wasn't the case, 1/5 failure rate is way too high for a CI job.
I played a little more with virnetsockettest to get real stats and figured the following:
1. On my desktop (i5) and laptop (i3), I didn't get any failures in 50 'check' runs 2. On a VM that I use to run test builds in Jenkins, out of 50 runs it fails from 1 to 6 times; I did this test a couple of times and either I was lucky or failure rate is higher when my Jenkins perform regular builds.
Anyway, I'll try to find a way to debug what's going on with virnetsockettest.
IIRC Debian disabled this test years ago. Guido, have you ever discovered the cause? Jan

Ján Tomko wrote:
[cc: Guido]
IIRC Debian disabled this test years ago.
Guido, have you ever discovered the cause?
Jan
I made some experiments on the weekend, and here are my results:

On a box where the test fails from time to time, it fails at this point:

    virObjectUnref(csock);

    for (i = 0; i < nlsock; i++) {
        if (virNetSocketAccept(lsock[i], &ssock) != -1 && ssock) {
            char c = 'a';
            if (virNetSocketWrite(ssock, &c, 1) != -1 &&
                virNetSocketRead(ssock, &c, 1) != -1) {
                VIR_DEBUG("Unexpected client socket present");   <--- HERE
                goto cleanup;
            }
        }
        virObjectUnref(ssock);
        ssock = NULL;
    }

On a box where this test never fails, it reaches this block, but:

 * virNetSocketWrite(ssock, &c, 1) != -1
 * virNetSocketRead(ssock, &c, 1) == -1

It's enough to make the test pass. On a failing box both Write() and Read() return != -1 when the test fails.

I'm not quite sure what this specific block is testing though. My guess was that calling "virObjectUnref(csock);" will destroy the client socket and Accept() will not work (this should also make the test pass I guess).

Anyway, I tried to insert sleep(1) right after virObjectUnref(csock), the one before the virNetSocketAccept() call, and the test stopped failing. I've started it in a loop like:

    while test $? -eq 0; do ./virnetsockettest; done

and actually forgot to stop it, so it has been running since Saturday without failures.

I haven't had a chance yet to debug it further.

Roman Bogorodskiy

On Mon, Jul 03, 2017 at 05:20:13PM +0400, Roman Bogorodskiy wrote:
I made some experiments on the weekend, and here are my results:
On a box where test fails from time to time, it fails at this point:
    virObjectUnref(csock);

    for (i = 0; i < nlsock; i++) {
        if (virNetSocketAccept(lsock[i], &ssock) != -1 && ssock) {
            char c = 'a';
            if (virNetSocketWrite(ssock, &c, 1) != -1 &&
                virNetSocketRead(ssock, &c, 1) != -1) {
                VIR_DEBUG("Unexpected client socket present");   <--- HERE
                goto cleanup;
            }
        }
        virObjectUnref(ssock);
        ssock = NULL;
    }

On a box where this test never fails, it reaches this block, but:

 * virNetSocketWrite(ssock, &c, 1) != -1
 * virNetSocketRead(ssock, &c, 1) == -1
It's enough to make the test pass. On a failing box both Write() and Read() return != -1 when the test fails.
We discussed this on IRC and what is happening here is a race condition.

The test suite is connecting a client to the server and then immediately closing the client connection. When the server tries to accept the client, usually it'll get -1 because the client has already gone away. Socket termination is a multi-stage process at the network layer, and so there is a non-zero chance that Accept will succeed. The test suite is assuming that if the accept did succeed, then we'll get an I/O error on read & write, but this is also not actually guaranteed. It is possible that write may succeed by buffering output, and read may simply see '0' for EOF. When this happens the test suite will fail. Roman debugged a failing run & confirmed this is exactly what's happening.

IOW, this test suite is just plain broken. It needs rewriting to do a more sensible real world test, i.e. spawn a thread to act as the server, and have the server just read from the client & echo it back to the client. The main thread acts as the client and tests this echo'ing. This is race free and is a real-world example of usage.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
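The echo-style test proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain POSIX calls rather than libvirt's virNetSocket API, echo_server() and echo_roundtrip() are hypothetical names, and socketpair() stands in for the listen/accept step so that the sketch stays short. The point it tries to show is that the client only checks data it actually gets back, so nothing depends on how quickly a closed connection is torn down.

```c
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Server side (hypothetical helper): echo every byte back to the
 * client until the client shuts down its writing side (read() == 0). */
static void *echo_server(void *opaque)
{
    int fd = *(int *)opaque;
    char buf[64];
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(fd, buf + off, (size_t)(n - off));
            if (w <= 0)
                return NULL;
            off += w;
        }
    }
    return NULL;
}

/* Client side (hypothetical helper): send msg, read the echo back and
 * compare. Returns 0 if the echo matches, -1 on any error or mismatch.
 * msg must be between 1 and 64 bytes long. */
int echo_roundtrip(const char *msg)
{
    int fds[2];
    pthread_t server;
    char reply[64];
    size_t len = strlen(msg);
    size_t got = 0;
    int ret = -1;

    if (len == 0 || len > sizeof(reply))
        return -1;

    /* Assumption: a connected AF_UNIX pair replaces listen()/accept(). */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    if (pthread_create(&server, NULL, echo_server, &fds[1]) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (write(fds[0], msg, len) != (ssize_t)len)
        goto cleanup;

    while (got < len) {
        ssize_t n = read(fds[0], reply + got, len - got);
        if (n <= 0)
            goto cleanup;
        got += (size_t)n;
    }

    ret = memcmp(msg, reply, len) == 0 ? 0 : -1;

 cleanup:
    /* Signal EOF to the server thread, then wait for it to finish
     * before closing either descriptor. */
    shutdown(fds[0], SHUT_WR);
    pthread_join(server, NULL);
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

A test would call echo_roundtrip("ping") from its main thread and expect 0. On older C libraries the build may need -pthread. A real rewrite of virnetsockettest would presumably keep the listen/accept step and use the virNetSocket functions on both sides; that part is not covered by this sketch.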

Daniel P. Berrange wrote:
IOW, this test suite is just plain broken. It needs rewriting to do a more sensible real world test, i.e. spawn a thread to act as the server, and have the server just read from the client & echo it back to the client. The main thread acts as the client and tests this echo'ing. This is race free and is a real-world example of usage.
Yeah, I'll try to re-write this test, this weekend hopefully. Roman Bogorodskiy

On Mon, Jul 03, 2017 at 10:49:46AM +0200, Ján Tomko wrote:
IIRC Debian disabled this test years ago.
Guido, have you ever discovered the cause?
No, sorry. I disabled it ages ago since it failed randomly and never got around to having a look.

Cheers,
 -- Guido

On Mon, 2017-07-03 at 18:47 +0200, Guido Günther wrote:
Anyway, I'll try to find a way to debug what's going on with virnetsockettest. IIRC Debian disabled this test years ago. Guido, have you ever discovered the cause? No, sorry. I disabled it ages ago since it failed randomly and never got around to having a look.
I just ran 10000 iterations of the test on a Debian guest without running into a single failure. Maybe it's time to revisit that decision, especially with Stretch out of the door? Any other test cases that have been blacklisted and might not need it anymore? -- Andrea Bolognani / Red Hat / Virtualization

On Tue, Jul 04, 2017 at 12:27:19PM +0200, Andrea Bolognani wrote:
I just ran 10000 iterations of the test on a Debian guest without running into a single failure. Maybe it's time to revisit that decision, especially with Stretch out of the door? Any other test cases that have been blacklisted and might not need it anymore?
I just checked again and noticed that I reenabled the test back in 2015 with fbb27088eec1b54fcd5a0950b11c31d27a2598d4 fixing the cause. The only tests we have disabled are in gnulib (where upstream rejected the patches)

  https://github.com/agx/libvirt-debian/blob/debian/sid/debian/patches/openpty...
  https://github.com/agx/libvirt-debian/blob/debian/sid/debian/patches/test-po...
  https://github.com/agx/libvirt-debian/blob/debian/sid/debian/patches/Disable...

caused by running the tests in a chroot prepared by pbuilder and the vircgrouptest where the mock was incomplete last time I checked:

  https://github.com/agx/libvirt-debian/blob/debian/sid/debian/patches/Skip-vi...

Cheers,
 -- Guido

On Sat, 2017-07-01 at 14:18 +0400, Roman Bogorodskiy wrote:
qemuxml2argvtest fails consistently in my FreeBSD guest. I guess that's caused by clang inlining functions that are mocked (specifically, some numa related stuff); I think that was discussed several times already. Anyway, it should work fine with '-O0' in CFLAGS.
Well, would you look at that. It does indeed work flawlessly when compiled without optimizations! :O I'm not sure if that would be considered a reasonable compromise to get the test suite running on FreeBSD in the context of CI, though. I think it working reliably without messing with CFLAGS would be a requirement; others might disagree. -- Andrea Bolognani / Red Hat / Virtualization

On Tue, Jul 04, 2017 at 01:03:52PM +0200, Andrea Bolognani wrote:
I'm not sure if that would be considered a reasonable compromise to get the test suite running on FreeBSD in the context of CI, though. I think it working reliably without messing with CFLAGS would be a requirement; others might disagree.
Hmm, I thought I fixed that problem when I introduced this patch:

  commit 728cacc8abed2b8de39e7b96fa42fde6850ec23a
  Author: Daniel P. Berrange <berrange@redhat.com>
  Date:   Fri Apr 7 15:07:49 2017 +0100

      annotate all mocked functions with noinline

This made us annotate all mocked functions with noinline, which was sufficient to make CLang builds pass tests on Ubuntu VMs.

Perhaps my syntax-check rule is missing some functions that still need to be marked noinline to get BSD working ?

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Jul 04, 2017 at 12:21:21PM +0100, Daniel P. Berrange wrote:
This made us annotate all mocked functions with noinline, which was sufficient to make CLang builds pass tests on Ubuntu VMs.
So I did some debugging and this is weirder than I can imagine possible.

I put a printf statement in virNumaGetMaxNode in qemuxml2argvmock.c to print out the return value.

I also put a printf statement in virNumaNodeIsAvailable in virnuma.c (in the non-NUMACTL conditional block), that prints out the return value received from virNumaGetMaxNode.

The first printf displays 7, while the second printf displays -1.

So we're definitely calling our mock override, but the return value is getting mangled when seen by the caller, which is a giant wtf to me.

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Jul 04, 2017 at 05:32:03PM +0100, Daniel P. Berrange wrote:
So we're definitely calling our mock override, but the return value is getting mangled when seen by the caller, which is a giant wtf to me.
I wrote an isolated test case

  $ cat lib.h
  int get_max(void) __attribute__((noinline));
  int is_ok(int i);

  $ cat lib.c
  #include <stdio.h>
  #include "lib.h"

  int get_max(void)
  {
    fprintf(stderr, "In original max, returning 3\n");
    return 3;
  }

  int is_ok(int i)
  {
    int max = get_max();
    fprintf(stderr, "Received max %d\n", max);
    return i > 0 && i <= max;
  }

  $ cat mock.c
  #include <stdio.h>
  #include "lib.h"

  int get_max(void)
  {
    fprintf(stderr, "In mock max, returning 7\n");
    return 7;
  }

  $ cat run.c
  #include <stdio.h>
  #include "lib.h"

  int main(int argc, char **argv)
  {
    fprintf(stderr, "Is 5 in range ? %d\n", is_ok(5));
  }

  $ clang -O2 -g -Wall -shared -o libdemo.so -fPIC lib.c
  $ clang -O2 -g -Wall -shared -o mock.so -fPIC mock.c
  $ clang -O2 -Wall -o run run.c -L. -ldemo

  $ ./run
  In original max, returning 3
  Received max 3
  Is 5 in range ? 0

  $ LD_PRELOAD=mock.so ./run
  In mock max, returning 7
  Received max 3
  Is 5 in range ? 0

  $ clang -O0 -g -Wall -shared -o libdemo.so -fPIC lib.c
  $ LD_PRELOAD=mock.so ./run
  In mock max, returning 7
  Received max 7
  Is 5 in range ? 1

So clang is definitely *not* inlining the function, *is* running our mock function, but nonetheless getting the return value from the original function.
Turning on optimizer debugging I see

  $ clang -O2 -g -Wall -shared -o libdemo.so -Rpass=.* -fPIC lib.c
  lib.c:7:3: remark: marked this call a tail call candidate [-Rpass=tailcallelim]
    fprintf(stderr, "In original max, returning 3\n");
    ^
  lib.c:11:15: remark: marked this call a tail call candidate [-Rpass=tailcallelim]
  int is_ok(int i)
                ^
  lib.c:13:13: remark: marked this call a tail call candidate [-Rpass=tailcallelim]
    int max = get_max();
              ^
  lib.c:13:7: remark: marked this call a tail call candidate [-Rpass=tailcallelim]
    int max = get_max();
        ^
  lib.c:14:3: remark: marked this call a tail call candidate [-Rpass=tailcallelim]
    fprintf(stderr, "Received max %d\n", max);
    ^

so, I'm thinking this problem is a result of tail call optimization making an assumption that is violated when mocking the function.

I'm unclear how to prevent tail call optimization without the big hammer of -O0

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Jul 04, 2017 at 06:26:22PM +0100, Daniel P. Berrange wrote:
so, I'm thinking this problem is a result of tail call optimization making an assumption that is violated when mocking the function.
The tail call stuff was a red-herring and not related.

After much trial and error I've found that it is possible to make this work by annotating the functions with the attribute "weak". This explicitly tells the compiler that the function is designed to be overridden by an external source, thus preventing any of the call convention optimization clang does.

With 'weak' added to virNumaGetMaxNode() the test suite passes on FreeBSD !

Regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
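For illustration, this is roughly how lib.c from the isolated test case earlier in the thread might look with 'weak' added. It is a sketch under assumptions, not the actual libvirt change: putting 'weak' alongside 'noinline' on get_max() is an assumption made here, and the thread only reports the result for virNumaGetMaxNode(). The declaration is folded into the same file so that the block stands alone.

```c
#include <stdio.h>

/* 'noinline' keeps the call to get_max() as a real call; 'weak' marks
 * the definition as one that may be replaced by another definition at
 * link or load time, so the compiler should not bake assumptions about
 * this body into its callers. Combining the two here is an assumption
 * made for this sketch. */
int get_max(void) __attribute__((noinline, weak));
int is_ok(int i);

int get_max(void)
{
    fprintf(stderr, "In original max, returning 3\n");
    return 3;
}

int is_ok(int i)
{
    /* With no override loaded, this resolves to the definition above. */
    int max = get_max();
    fprintf(stderr, "Received max %d\n", max);
    return i > 0 && i <= max;
}
```

Without any override loaded, is_ok(5) still returns 0 because the original get_max() returns 3. If the library is rebuilt with the same clang -O2 command as in the test case and run with LD_PRELOAD=mock.so, the expectation, going by the virNumaGetMaxNode() result reported above, is that is_ok() would see the mock's value instead; that step has not been verified for this reduced example.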

As planned, it is tagged in git and signed tarballs and rpms are available from the usual place:

  ftp://libvirt.org/libvirt/

Seems just fine in my very limited testing. Learning about the larger than I thought automated testing for other platforms is comforting. Assuming nobody finds issues I think I will try to make the final release on Tuesday, but in the meantime, please double check :)

  thanks,

Daniel

-- 
Daniel Veillard | Red Hat Developers Tools http://developer.redhat.com/
veillard@redhat.com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/
http://veillard.com/ | virtualization library http://libvirt.org/
participants (6)
- Andrea Bolognani
- Daniel P. Berrange
- Daniel Veillard
- Guido Günther
- Ján Tomko
- Roman Bogorodskiy