
On 05/01/2012 01:10 PM, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. This is the same address used by libnl for the first netlink socket it binds, and the netlink socket allocated for virNetlinkEventServiceStart() happens to be that first socket; the result is that nl_connect() fails about 15-20% of the time (but apparently only if there is a guest running at the time libvirtd starts).
Testing has shown that when nl_connect() fails the first time, retrying it after a 500msec sleep leads to success 100% of the time, so this patch doubles that delay (which also has a 100% success rate).
+++ b/src/util/virnetlink.c
@@ -355,9 +355,18 @@ virNetlinkEventServiceStart(void)
     }

     if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) {
-        virReportSystemError(errno,
-                             "%s", _("cannot connect to netlink socket"));
-        goto error_server;
+        /* the address that libnl wants to use for this connect ("pid
+         * of libvirtd") is sometimes temporarily in use by some other
+         * unidentified code.  Retrying after a 500msec sleep has
+         * achieved 100% success rates, so we sleep for 1000msec and
+         * retry.
+         */
+        usleep(1000000);

Sleeping for 1 entire second is user-visible; if we go with this approach, I'd rather see it as a retry loop that probes something like once every 200ms for 5 tries (or something similar), for better response time.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org