This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread)
is apparently temporarily binding to a NETLINK_ROUTE raw socket with
an address of "pid of libvirtd" during startup. This is the same
address used by libnl for the first netlink socket it binds, and the
netlink socket allocated for virNetlinkEventServiceStart() happens to
be that first socket; the result is that nl_connect() fails about
15-20% of the time (but apparently only if there is a guest running at
the time libvirtd starts).
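
For illustration only (this is not code from the patch), the collision
can be sketched outside of libnl. The helper name bind_route_socket()
is hypothetical, and the sketch assumes a Linux system where binding a
second NETLINK_ROUTE socket to an already-used port ID is rejected
with EADDRINUSE:

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

/* Hypothetical helper: open a NETLINK_ROUTE raw socket and bind it to
 * an explicit netlink port ID (the "address" discussed above).
 * Returns the fd on success, or -1 with errno set on failure. */
static int bind_route_socket(unsigned int port_id)
{
    struct sockaddr_nl addr;
    int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = port_id;   /* e.g. getpid(), the first address libnl picks */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        int saved_errno = errno;
        close(fd);
        errno = saved_errno; /* keep the bind() error for the caller */
        return -1;
    }
    return fd;
}
```

Calling the helper twice with getpid() while the first fd is still
open should make the second call fail, which is the situation
nl_connect() appears to run into here.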
Testing has shown that when nl_connect() fails the first time,
retrying it after a 500msec sleep succeeds 100% of the time, so this
patch doubles that delay (which also has a 100% success rate).
An alternate patch is to allocate an extra nl_handle that will never
be used, thus effectively "reserving" the "pid of libvirtd" address
for the mystery library. I will be sending that in a separate patch so
everyone has the chance to choose.
(Note that a similar-looking problem came up over a year ago with the
libnl usage by macvtap code. At that time Stefan Berger found bugs in
libnl itself. These new errors are encountered while using the patched
libnl; the main problem remaining in libnl is with the semantics of
the API, which assumes that libnl is the only entity on the system (or
at least in the current process) using netlink sockets, and it can
thus make an assumption about what address to use for binding.)
---
src/util/virnetlink.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c
index b2e9d51..b9dae86 100644
--- a/src/util/virnetlink.c
+++ b/src/util/virnetlink.c
@@ -355,9 +355,18 @@ virNetlinkEventServiceStart(void)
}
if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) {
- virReportSystemError(errno,
- "%s", _("cannot connect to netlink socket"));
- goto error_server;
+ /* the address that libnl wants to use for this connect ("pid
+ * of libvirtd") is sometimes temporarily in use by some other
+ * unidentified code. Retrying after a 500msec sleep has
+ * achieved 100% success rates, so we sleep for 1000msec and
+ * retry.
+ */
+ usleep(1000000);
+ if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) {
+ virReportSystemError(errno,
+ "%s", _("cannot connect to netlink socket"));
+ goto error_server;
+ }
}
fd = nl_socket_get_fd(srv->netlinknh);
--
1.7.10