[libvirt PATCH 0/2] Fix tests on macOS

NB, even with this done there is still a latent bug affecting all platforms. When we call g_source_destroy the removal is async but we usually close the FD synchronously. This leads to poll'ing on a bad FD. We've actually had this race in libvirt since day 1 - our previous poll() event loop impl before glib would also implement the virEventRemoveHandle call async by just writing to a pipe to interrupt the other thread in poll, just as glib does. We've always relied on parallelism to make this async call almost instantaneous but under the right load conditions we trigger the POLLNVAL / EBADF issue. The only viable solution to this that I see is to only ever call g_source_destroy + g_source_unref from an idle callback, to guarantee that poll() isn't currently running. We know this has a bit of a perf hit on code that is sensitive to main loop iterations, so we tried to avoid it where possible right now: https://listman.redhat.com/archives/libvir-list/2020-November/212411.html I think we'll need to revisit this though, as known BADF problems are not good. Daniel P. Berrangé (2): ci: print stack traces on macOS if any tests fail tests: don't set G_DEBUG=fatal-warnings on macOS ci/cirrus/build.yml | 2 +- tests/meson.build | 17 ++++++++++++++++- 2 files changed, 17 insertions(+), 2 deletions(-) -- 2.35.1

Mostly the tests should fail cleanly, but sometimes the test might crash or abort. In this case we'll need to know the stack traces in order to debug the problem. Fortunately on the Cirrus CI macOS instances, the crash reporting service is active and saving crashes to the directory ~/Library/Logs/DiagnosticReports/ Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> --- ci/cirrus/build.yml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ci/cirrus/build.yml b/ci/cirrus/build.yml index f03ad58143..9332e968bd 100644 --- a/ci/cirrus/build.yml +++ b/ci/cirrus/build.yml @@ -26,4 +26,4 @@ build_task: - meson setup build - meson dist -C build --no-tests - meson compile -C build - - meson test -C build --no-suite syntax-check --print-errorlogs + - meson test -C build --no-suite syntax-check --print-errorlogs || (cat ~/Library/Logs/DiagnosticReports/*.crash && exit 1) -- 2.35.1

On macOS when BROKEN_POLL is set in GLib, our tests will periodically trigger a warning: (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor. Our code is inherantly racy, calling g_source_destroy which removes the FD from the event thread poll asynchronously but we close the FD immediately after g_source_destroy returns. With poll() this results in POLLNVAL which we're ignoring, but with select() it generates the BADF error on macOS. We need to ignore the warnings on macOS to avoid abort()ing our test programs. Signed-off-by: Daniel P. Berrangé <berrange@redhat.com> --- tests/meson.build | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/tests/meson.build b/tests/meson.build index fd78d8a2fd..08d2ae67cb 100644 --- a/tests/meson.build +++ b/tests/meson.build @@ -39,9 +39,24 @@ tests_env = [ 'abs_top_srcdir=@0@'.format(meson.source_root()), 'LC_ALL=C', 'LIBVIRT_AUTOSTART=0', - 'G_DEBUG=fatal-warnings', ] + +# On macOS when BROKEN_POLL is set for GLib, our tests will +# periodically trigger a warning: +# +# (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor. +# +# Our code is inherantly racy, calling g_source_destroy which +# removes the FD from the event thread poll asynchronously +# but we close the FD immediately after g_source_destroy returns. +# +# With poll() this results in POLLNVAL, but with select() it +# generates the BADF error on macOS +if host_machine.system() != 'darwin' + tests_env += ['G_DEBUG=fatal-warnings'] +endif + if use_expensive_tests tests_env += 'VIR_TEST_EXPENSIVE=1' else -- 2.35.1

On Thu, Apr 28, 2022 at 12:22:14PM +0100, Daniel P. Berrangé wrote:
+# On macOS when BROKEN_POLL is set for GLib, our tests will +# periodically trigger a warning: +# +# (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor. +# +# Our code is inherantly racy, calling g_source_destroy which +# removes the FD from the event thread poll asynchronously +# but we close the FD immediately after g_source_destroy returns. +# +# With poll() this results in POLLNVAL, but with select() it +# generates the BADF error on macOS +if host_machine.system() != 'darwin' + tests_env += ['G_DEBUG=fatal-warnings'] +endif
Doesn't this just sweep the issue under the carpet? Shouldn't GLib be changed to do something like #ifndef BROKEN_POLL if (poll() < 0) if (!POLLNVAL) return -1; #else if (select() < 0) if (!EBADF) return -1; #endif instead? If acting on an fd that's already been closed is okay when using the poll()-based implementation, the same should apply to the select()-based one as well. -- Andrea Bolognani / Red Hat / Virtualization

On Thu, Apr 28, 2022 at 06:39:42AM -0700, Andrea Bolognani wrote:
On Thu, Apr 28, 2022 at 12:22:14PM +0100, Daniel P. Berrangé wrote:
+# On macOS when BROKEN_POLL is set for GLib, our tests will +# periodically trigger a warning: +# +# (process:50880): GLib-WARNING **: 02:54:15.272: poll(2) failed due to: Bad file descriptor. +# +# Our code is inherantly racy, calling g_source_destroy which +# removes the FD from the event thread poll asynchronously +# but we close the FD immediately after g_source_destroy returns. +# +# With poll() this results in POLLNVAL, but with select() it +# generates the BADF error on macOS +if host_machine.system() != 'darwin' + tests_env += ['G_DEBUG=fatal-warnings'] +endif
Doesn't this just sweep the issue under the carpet? Shouldn't GLib be changed to do something like
#ifndef BROKEN_POLL if (poll() < 0) if (!POLLNVAL) return -1;
That's not the way POLLNVAL works - this is an even that is set in each 'struct pollfd', so it is associated with individual file descriptors. The application code can thus receive and handle the G_IO_NVAL condition - it isn't fatal from POV of g_poll
#else if (select() < 0) if (!EBADF) return -1; #endif
instead? If acting on an fd that's already been closed is okay when using the poll()-based implementation, the same should apply to the select()-based one as well.
The only way to get the same semantics for select() would be to see EBADF and then iterate over struct pollfd, calling fcntl() to see which one(s) are BADF. Possible, but its unpleasantly non-scalable, so I imagine that's why they didn't do this. I'll still probably inclined to file a bug against glib, if nothing else to provide a bug that other people can discover to explain why the behaviour is different. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Apr 28, 2022 at 02:52:45PM +0100, Daniel P. Berrangé wrote:
On Thu, Apr 28, 2022 at 06:39:42AM -0700, Andrea Bolognani wrote:
Shouldn't GLib be changed to do something like
#ifndef BROKEN_POLL if (poll() < 0) if (!POLLNVAL) return -1;
That's not the way POLLNVAL works - this is an even that is set in each 'struct pollfd', so it is associated with individual file descriptors. The application code can thus receive and handle the G_IO_NVAL condition - it isn't fatal from POV of g_poll
Right, I might have oversimplified things :) But your correction made the situation clearer in my head, thanks!
#else if (select() < 0) if (!EBADF) return -1; #endif
instead? If acting on an fd that's already been closed is okay when using the poll()-based implementation, the same should apply to the select()-based one as well.
The only way to get the same semantics for select() would be to see EBADF and then iterate over struct pollfd, calling fcntl() to see which one(s) are BADF. Possible, but its unpleasantly non-scalable, so I imagine that's why they didn't do this.
IIUC g_poll(), just like any other function used as GPollFunc, is supposed to have the same semantics as poll()[1], which means that passing an invalid FD should result in the corresponding flag being set rather than an outright failure. In other words, the current implementation of g_poll() on macOS doesn't follow the contract defined by GLib itself. It seems to me that this is a (fairly serious) bug in the library, no? [1] https://docs.gtk.org/glib/callback.PollFunc.html -- Andrea Bolognani / Red Hat / Virtualization

On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
On Thu, Apr 28, 2022 at 02:52:45PM +0100, Daniel P. Berrangé wrote:
#else if (select() < 0) if (!EBADF) return -1; #endif
instead? If acting on an fd that's already been closed is okay when using the poll()-based implementation, the same should apply to the select()-based one as well.
The only way to get the same semantics for select() would be to see EBADF and then iterate over struct pollfd, calling fcntl() to see which one(s) are BADF. Possible, but its unpleasantly non-scalable, so I imagine that's why they didn't do this.
IIUC g_poll(), just like any other function used as GPollFunc, is supposed to have the same semantics as poll()[1], which means that passing an invalid FD should result in the corresponding flag being set rather than an outright failure.
In other words, the current implementation of g_poll() on macOS doesn't follow the contract defined by GLib itself. It seems to me that this is a (fairly serious) bug in the library, no?
It is significant, but long standing. GLib actually had this behaviour forever on macOS, but it regressed when Meson was introduced, until the recent fix. The question is whether efficiency trumps API semantics. Normally I'm heavily biased towards API semantics, but poll is a performance critical API, so it isn't so easy to declare we must workaround all the quirks. I filed a bug to raise the subject for discussion though https://gitlab.gnome.org/GNOME/glib/-/issues/2644 With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
In other words, the current implementation of g_poll() on macOS doesn't follow the contract defined by GLib itself. It seems to me that this is a (fairly serious) bug in the library, no?
It is significant, but long standing. GLib actually had this behaviour forever on macOS, but it regressed when Meson was introduced, until the recent fix.
The question is whether efficiency trumps API semantics. Normally I'm heavily biased towards API semantics, but poll is a performance critical API, so it isn't so easy to declare we must workaround all the quirks.
I filed a bug to raise the subject for discussion though
That's an excellent bug report! Thanks for spending time on it, and let's see where the upstream discussion leads :) -- Andrea Bolognani / Red Hat / Virtualization

On Fri, Apr 29, 2022 at 01:05:42AM -0700, Andrea Bolognani wrote:
On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
In other words, the current implementation of g_poll() on macOS doesn't follow the contract defined by GLib itself. It seems to me that this is a (fairly serious) bug in the library, no?
It is significant, but long standing. GLib actually had this behaviour forever on macOS, but it regressed when Meson was introduced, until the recent fix.
The question is whether efficiency trumps API semantics. Normally I'm heavily biased towards API semantics, but poll is a performance critical API, so it isn't so easy to declare we must workaround all the quirks.
I filed a bug to raise the subject for discussion though
That's an excellent bug report! Thanks for spending time on it, and let's see where the upstream discussion leads :)
Just so I don't sway the potential discussion in the issue to the wrong way, I'll ask here. Since BROKEN_POLL is used due to poll() on OS/X not behaving correctly with fds=NULL, nfds=0 [0], wouldn't it be possible to "fix" at least the internal calls to g_poll() which are done when running the main loop since they always have at least one FD (the eventfd used to signal the main loop [1])? I know it would not fix the problem, but it would at least workaround the part that is used the most, I presume. Martin [0] https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gpoll.c#L59 [1] this is a presumption I did not find in the code, but I've seen in gdb and strace

On Mon, May 02, 2022 at 03:28:50PM +0200, Martin Kletzander wrote:
On Fri, Apr 29, 2022 at 01:05:42AM -0700, Andrea Bolognani wrote:
On Thu, Apr 28, 2022 at 05:55:41PM +0100, Daniel P. Berrangé wrote:
On Thu, Apr 28, 2022 at 08:33:46AM -0700, Andrea Bolognani wrote:
In other words, the current implementation of g_poll() on macOS doesn't follow the contract defined by GLib itself. It seems to me that this is a (fairly serious) bug in the library, no?
It is significant, but long standing. GLib actually had this behaviour forever on macOS, but it regressed when Meson was introduced, until the recent fix.
The question is whether efficiency trumps API semantics. Normally I'm heavily biased towards API semantics, but poll is a performance critical API, so it isn't so easy to declare we must workaround all the quirks.
I filed a bug to raise the subject for discussion though
That's an excellent bug report! Thanks for spending time on it, and let's see where the upstream discussion leads :)
Just so I don't sway the potential discussion in the issue to the wrong way, I'll ask here. Since BROKEN_POLL is used due to poll() on OS/X not behaving correctly with fds=NULL, nfds=0 [0], wouldn't it be possible to "fix" at least the internal calls to g_poll() which are done when running the main loop since they always have at least one FD (the eventfd used to signal the main loop [1])? I know it would not fix the problem, but it would at least workaround the part that is used the most, I presume.
The comment is somewhat misleading - the nfds=0 case is not actually the main problem they were fixing. The real killer issue is that poll() doesn't work with block/char devices - ie any /dev/NNN FD. This is what broke QEMU because it failed with TAP device FDs.
[0] https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gpoll.c#L59
Dunno why the comment mentions nfds=0 because the the original commit message describes the device node FDs scenario. With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (4)
-
Andrea Bolognani
-
Daniel P. Berrangé
-
Martin Kletzander
-
Peter Krempa