[libvirt] [PATCH - 2 alternatives] util: fix libvirtd startup failure due to netlink error

The two following patches fix the same problem (described in https://bugzilla.redhat.com/show_bug.cgi?id=816465) in two alternate ways - one by retrying the failing operation after a delay, the other by using knowledge of how libnl works internally to artificially "reserve" a particular address so libnl doesn't attempt to bind to it. You might think that libnl is the right place to fix this bug. Unfortunately, that isn't possible, because it would involve changing libnl's API - currently libnl decides what address to use for future binds of a netlink socket at the time nl_handle_alloc() is called, but doesn't actually attempt to bind it. Later, during nl_connect(), it calls bind() with the address it had earlier decided. It would be nice if we could just retry the bind with a different address when the first attempt failed, but libnl allows client applications to retrieve the bind address *before nl_connect() is called*, so an application may have already gotten the address prior to calling nl_connect(), and changing it would render the applications information incorrect. So the best we can do (for now at least) is work around the problem, and these are two possible workarounds.

This patch is one alternative to solve the problem detailed in: https://bugzilla.redhat.com/show_bug.cgi?id=816465 Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. This is the same address used by libnl for the first netlink socket it binds, and the netlink socket allocated for virNetlinkEventServiceStart() happens to be that first socket; the result is that nl_connect() fails about 15-20% of the time (but apparently only if there is a guest running at the time libvirtd starts). Testing has shown that in the case that nl_connect fails the first time, retrying it after a 500msec sleep leads to success 100% of the time, so this patch doubles that delay (which also has 100% success rate. An alternate patch is to allocate an extra nl_handle that will never be used, thus effectively "reserving" the "pid of libvirtd" address for the mystery library. I will be sending that in a separate patch so everyone has the change to choose. (Note that a similar-looking problem came up over a year ago with the libnl usage by macvtap code. At that time Stefan Berger found bugs in libnl itself. These new errors are encountered while using the patched libnl; the main problem remaining in libnl is with the semantics of the API, which assumes that libnl is the only entity on the system (or at least in the current process) using netlink sockets, and it can thus make an assumption about what address to use for binding.) --- src/util/virnetlink.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index b2e9d51..b9dae86 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -355,9 +355,18 @@ virNetlinkEventServiceStart(void) } if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) { - virReportSystemError(errno, - "%s", _("cannot connect to netlink socket")); - goto error_server; + /* the address that libnl wants to use for this connect ("pid + * of libvirtd") is sometimes temporarily in use by some other + * unidentified code. Retrying after a 500msec sleep has + * achieved 100% success rates, so we sleep for 1000msec and + * retry. + */ + usleep(1000000); + if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) { + virReportSystemError(errno, + "%s", _("cannot connect to netlink socket")); + goto error_server; + } } fd = nl_socket_get_fd(srv->netlinknh); -- 1.7.10

On 05/01/2012 01:10 PM, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. This is the same address used by libnl for the first netlink socket it binds, and the netlink socket allocated for virNetlinkEventServiceStart() happens to be that first socket; the result is that nl_connect() fails about 15-20% of the time (but apparently only if there is a guest running at the time libvirtd starts).
Testing has shown that in the case that nl_connect fails the first time, retrying it after a 500msec sleep leads to success 100% of the time, so this patch doubles that delay (which also has 100% success rate.
+++ b/src/util/virnetlink.c @@ -355,9 +355,18 @@ virNetlinkEventServiceStart(void) }
if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) { - virReportSystemError(errno, - "%s", _("cannot connect to netlink socket")); - goto error_server; + /* the address that libnl wants to use for this connect ("pid + * of libvirtd") is sometimes temporarily in use by some other + * unidentified code. Retrying after a 500msec sleep has + * achieved 100% success rates, so we sleep for 1000msec and + * retry. + */ + usleep(1000000);
Sleeping for 1 entire second is user-visible; if we go with this approach, I'd rather see it be as a retry loop that probes something like once every 200ms for 5 tries (or something similar), for better response time. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup.
Can you identify this library. It should be possible to do so using systemtap without all that much trouble. I'd rather we knew exactly what was causing the problem, before we consider fixes or workarounds Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. Can you identify this library.
I made a few attempts, but didn't have any luck and decided to post these patches based on the other evidence I'd gathered. I agree that I would much prefer understanding who is doing this, even if it doesn't change the workaround method.
It should be possible to do so using systemtap without all that much trouble.
My full experience with systemtap is using some of the examples from your blog posting on the topic :-) I would love to figure this out, though. The complicating factor I can see (aside from me needing to learn how to write a systemtap script) is that in this case stap needs to be run on a daemonizing process, from the very beginning. If you can give me any better advice than "go read the systemtap website", please do. In the meantime, the people who developed the netlink event service code have agreed to test out the patch in alternative 2 to verify that it doesn't break communication between lldpad and libvirtd.

On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. Can you identify this library.
I made a few attempts, but didn't have any luck and decided to post these patches based on the other evidence I'd gathered. I agree that I would much prefer understanding who is doing this, even if it doesn't change the workaround method.
It should be possible to do so using systemtap without all that much trouble.
My full experience with systemtap is using some of the examples from your blog posting on the topic :-)
I assume you mean this one http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using...
I would love to figure this out, though. The complicating factor I can see (aside from me needing to learn how to write a systemtap script) is that in this case stap needs to be run on a daemonizing process, from the very beginning. If you can give me any better advice than "go read the systemtap website", please do.
I can't help today, but ping me on IRC tomorrow and I'll help you get sorted with systemtap. You can start the stap scripts before even running libvirtd, so there's no issue with the daemonizing side of things. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. Can you identify this library.
I made a few attempts, but didn't have any luck and decided to post these patches based on the other evidence I'd gathered. I agree that I would much prefer understanding who is doing this, even if it doesn't change the workaround method.
It should be possible to do so using systemtap without all that much trouble. My full experience with systemtap is using some of the examples from your blog posting on the topic :-) I assume you mean this one
http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using...
Yes, that's the one. I wasn't actually interested in watching the rpc protocol, but the interaction between libvirtd and the qemu monitor, which was very helpful.
I would love to figure this out, though. The complicating factor I can see (aside from me needing to learn how to write a systemtap script) is that in this case stap needs to be run on a daemonizing process, from the very beginning. If you can give me any better advice than "go read the systemtap website", please do. I can't help today, but ping me on IRC tomorrow and I'll help you get sorted with systemtap. You can start the stap scripts before even running libvirtd, so there's no issue with the daemonizing side of things.
With some help from mjw in #systemtap on freenode, I was able to figure out how to use systemtap to print a backtrace all calls to bind, and although the failures ceased as soon as I turned on the tracing (of course), it did at least give me a list of bind calls to research. It turns out that this is the interesting one (or one example of it, anyway): [23876,init 0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so] 0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so] 0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so] 0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10] 0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd] 0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10] 0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd] 0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10] 0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so] ] __check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel, netlink_bind() uses 0 as an indicator that it should auto-bind, preferring the pid of the calling process (i.e. "pid of libvirtd") as its nl_pid in the nladdr. This NETLINK socket is used for a short period to get a list of interface addresses, and is then closed. Once main() has started up its other threads, these threads may call virSocketAddrParse (and thus __check_pf()) any number of times, creating many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main thread, virNetlinkEventServiceStart() is the first function in libvirtd to call libnl's nl_handle_alloc(), which mistakenly assumes that it has all control over netlink sockets, and that it can assign the address of "pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is called, which calls bind() with a *fixed* address of "pid of libvirtd". If another thread happens to currently be in a call to __pf_check(), we lose the lottery and bind() fails. If not, we win the lottery, bind() succeeds, and future calls to bind() by __check_pf() will auto-bind to a different address (unlike with libnl, which assigns subsequent sockets the address of "pid + (n << 22)" with a maximum of 1024 sockets per process (i.e. it will always be positive), auto-binds in the kernel will assign the first free address found between -2047 and -2,147,483,648 (i.e. it will always be negative)). So, the conclusions to draw from this analysis are: 1) my "alternative 1" patch was only coincidentally succeeding, and would be about as useful as everyone removing their shoes at airport security checkpoints. 2) If libvirtd has multiple threads started up before any netlink sockets have been bound to "pid of libvirtd", there is a possibility that the first call to nl_connect will fail (due to another thread being in getaddrinfo/__check_pf()). This is just as true for the macvtap and netcf uses of libnl as for the virNetlinkEventService use. 3) Once the first call to nl_connect is successfully completed (and/or if an extra (and otherwise unused) nlhandle is created with nl_handle_alloc() before creating any nlhandles that are subsequently nl_connect()ed), the likelyhood of a subsequent nl_connect() failure is effectively 0, since the address space used by libnl is all positive 32 bit numbers, and the address space used by the auto-bind address in the kernel is (almost) all negative 32 bit numbers. 4) libnl should, at the very least, be modified to not use exactly nl_pid = pid, since there is a very high likelihood that particular address will already be taken by a library function that is calling bind directly, rather than through libnl. Really, its API shouldn't allow applications to retrieve the bind address used until after nl_connect() has already completed successfully; unfortunately, that would require an incompatible change in the API. Now that I completely understand the problem, I actually think that neither of these patches is quite correct; the first because it is simply bogus, and the second because it only solves the problem to virNetlinkEventService - it still leaves open the possibility that macvtap or netcf usage of libnl could result in a failure (although *only* if one of those uses happened to be called prior to virNetlinkEventService). To be 100% safe, I think what we need to do is put an extra call to nl_handle_alloc() very early in main, prior to calling virNetServerNew(), which is when all the other worker threads are created. I'll put together such a patch and send it to the list later tonight.

On Wed, May 02, 2012 at 04:35:48PM -0400, Laine Stump wrote:
On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
This patch is one alternative to solve the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Some other unidentified library in use by libvirtd (in another thread) is apparently temporarily binding to a NETLINK_ROUTE raw socket with an address of "pid of libvirtd" during startup. Can you identify this library.
I made a few attempts, but didn't have any luck and decided to post these patches based on the other evidence I'd gathered. I agree that I would much prefer understanding who is doing this, even if it doesn't change the workaround method.
It should be possible to do so using systemtap without all that much trouble. My full experience with systemtap is using some of the examples from your blog posting on the topic :-) I assume you mean this one
http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-using...
Yes, that's the one. I wasn't actually interested in watching the rpc protocol, but the interaction between libvirtd and the qemu monitor, which was very helpful.
I would love to figure this out, though. The complicating factor I can see (aside from me needing to learn how to write a systemtap script) is that in this case stap needs to be run on a daemonizing process, from the very beginning. If you can give me any better advice than "go read the systemtap website", please do. I can't help today, but ping me on IRC tomorrow and I'll help you get sorted with systemtap. You can start the stap scripts before even running libvirtd, so there's no issue with the daemonizing side of things.
With some help from mjw in #systemtap on freenode, I was able to figure out how to use systemtap to print a backtrace all calls to bind, and although the failures ceased as soon as I turned on the tracing (of course), it did at least give me a list of bind calls to research.
It turns out that this is the interesting one (or one example of it, anyway):
[23876,init 0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so] 0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so] 0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so] 0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10] 0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10] 0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd] 0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10] 0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd] 0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10] 0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so] 0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so] ]
__check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel, netlink_bind() uses 0 as an indicator that it should auto-bind, preferring the pid of the calling process (i.e. "pid of libvirtd") as its nl_pid in the nladdr. This NETLINK socket is used for a short period to get a list of interface addresses, and is then closed.
Once main() has started up its other threads, these threads may call virSocketAddrParse (and thus __check_pf()) any number of times, creating many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main thread, virNetlinkEventServiceStart() is the first function in libvirtd to call libnl's nl_handle_alloc(), which mistakenly assumes that it has all control over netlink sockets, and that it can assign the address of "pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is called, which calls bind() with a *fixed* address of "pid of libvirtd". If another thread happens to currently be in a call to __pf_check(), we lose the lottery and bind() fails. If not, we win the lottery, bind() succeeds, and future calls to bind() by __check_pf() will auto-bind to a different address (unlike with libnl, which assigns subsequent sockets the address of "pid + (n << 22)" with a maximum of 1024 sockets per process (i.e. it will always be positive), auto-binds in the kernel will assign the first free address found between -2047 and -2,147,483,648 (i.e. it will always be negative)).
So, the conclusions to draw from this analysis are:
1) my "alternative 1" patch was only coincidentally succeeding, and would be about as useful as everyone removing their shoes at airport security checkpoints.
2) If libvirtd has multiple threads started up before any netlink sockets have been bound to "pid of libvirtd", there is a possibility that the first call to nl_connect will fail (due to another thread being in getaddrinfo/__check_pf()). This is just as true for the macvtap and netcf uses of libnl as for the virNetlinkEventService use.
3) Once the first call to nl_connect is successfully completed (and/or if an extra (and otherwise unused) nlhandle is created with nl_handle_alloc() before creating any nlhandles that are subsequently nl_connect()ed), the likelyhood of a subsequent nl_connect() failure is effectively 0, since the address space used by libnl is all positive 32 bit numbers, and the address space used by the auto-bind address in the kernel is (almost) all negative 32 bit numbers.
4) libnl should, at the very least, be modified to not use exactly nl_pid = pid, since there is a very high likelihood that particular address will already be taken by a library function that is calling bind directly, rather than through libnl. Really, its API shouldn't allow applications to retrieve the bind address used until after nl_connect() has already completed successfully; unfortunately, that would require an incompatible change in the API.
Now that I completely understand the problem, I actually think that neither of these patches is quite correct; the first because it is simply bogus, and the second because it only solves the problem to virNetlinkEventService - it still leaves open the possibility that macvtap or netcf usage of libnl could result in a failure (although *only* if one of those uses happened to be called prior to virNetlinkEventService).
To be 100% safe, I think what we need to do is put an extra call to nl_handle_alloc() very early in main, prior to calling virNetServerNew(), which is when all the other worker threads are created. I'll put together such a patch and send it to the list later tonight.
Wow, thanks for figuring this out. It is all far worse than I imagined :-( Clearly libnl is broken here, but I guess it is dead upstream in favour of libnl3. I wonder if that shares the same problem. Agree that creating a netlink handle in libvirtd main() sounds like a way to workaround it. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/03/2012 04:09 AM, Daniel P. Berrange wrote:
On Wed, May 02, 2012 at 04:35:48PM -0400, Laine Stump wrote:
[lots of explanation snipped]
So, the conclusions to draw from this analysis are:
1) my "alternative 1" patch was only coincidentally succeeding, and would be about as useful as everyone removing their shoes at airport security checkpoints.
2) If libvirtd has multiple threads started up before any netlink sockets have been bound to "pid of libvirtd", there is a possibility that the first call to nl_connect will fail (due to another thread being in getaddrinfo/__check_pf()). This is just as true for the macvtap and netcf uses of libnl as for the virNetlinkEventService use.
3) Once the first call to nl_connect is successfully completed (and/or if an extra (and otherwise unused) nlhandle is created with nl_handle_alloc() before creating any nlhandles that are subsequently nl_connect()ed), the likelyhood of a subsequent nl_connect() failure is effectively 0, since the address space used by libnl is all positive 32 bit numbers, and the address space used by the auto-bind address in the kernel is (almost) all negative 32 bit numbers.
4) libnl should, at the very least, be modified to not use exactly nl_pid = pid, since there is a very high likelihood that particular address will already be taken by a library function that is calling bind directly, rather than through libnl. Really, its API shouldn't allow applications to retrieve the bind address used until after nl_connect() has already completed successfully; unfortunately, that would require an incompatible change in the API.
Now that I completely understand the problem, I actually think that neither of these patches is quite correct; the first because it is simply bogus, and the second because it only solves the problem to virNetlinkEventService - it still leaves open the possibility that macvtap or netcf usage of libnl could result in a failure (although *only* if one of those uses happened to be called prior to virNetlinkEventService).
To be 100% safe, I think what we need to do is put an extra call to nl_handle_alloc() very early in main, prior to calling virNetServerNew(), which is when all the other worker threads are created. I'll put together such a patch and send it to the list later tonight. Wow, thanks for figuring this out. It is all far worse than I imagined :-(
Actually, investigating this was the most fun I've had in awhile :-) It could have been worse - it could have turned out that the address spaces continued to collide even after reserving "pid". Still, it's not a pretty situation.
Clearly libnl is broken here, but I guess it is dead upstream in favour of libnl3. I wonder if that shares the same problem.
As far as I can tell, the libnl3 code is unchanged in this respect - nl_handle_alloc() is for some reason replaced with nl_socket_alloc(), but the address (stored in nl_pid) is still generated at that time, and in the same manner, is available at any time via nl_socket_get_local_port(), and is used by nl_connect() to call bind(), with no attempts to adjust it if EADDRINUSE is encountered. By the way, a workaround in libnl (as suggested in my conclusion (4) would also be very simple - that could be done by patching libnl's generate_local_port to skip over "pid + (0 << 22)", like this: --- a/lib/socket.c +++ b/lib/socket.c @@ -58,7 +58,7 @@ static uint32_t generate_local_port(void) if (used_ports_map[i] == 0xFFFFFFFF) continue; - for (n = 0; n < 32; n++) { + for (n = (i ? 0 : 1); n < 32; n++) { if (1UL & (used_ports_map[i] >> n)) continue; (this is based on upstream git, but generate_local_port() is identical in libnl-1.1). Note that I call this a "workaround" rather than a "fix" - in my opinion, a true fix isn't possible with the existing libnl API - as I've said before, not only should libnl not be unilaterally deciding the bind address, it certainly shouldn't be allowing the application to retrieve the bind address before a successful bind has taken place, as this eliminates the possibility of changing that address in case of EADDRINUSE. The downside to the above patch (compared to patching libvirt) is that there may be some other application that is (knowingly or unknowingly) depending on the bind address being based on "pid" rather than "pid + (1 << 22)", and that application would mysteriously begin to fail. So as a localized patch that is certain to fix libvirt's libnl problem with no chance of making life worse for anyone else, I still prefer a patch to libvirt.
Agree that creating a netlink handle in libvirtd main() sounds like a way to workaround it.
I just sent a new patch upstream that does this: https://www.redhat.com/archives/libvir-list/2012-May/msg00235.html As noted there, I don't want to push it until I get experimental verification that it doesn't damage lldpad<-->libvirtd communication.

This is an alternate method of solving the problem detailed in: https://bugzilla.redhat.com/show_bug.cgi?id=816465 Testing has shown that if we "reserve" the bind address of "pid of libvirtd" so that libnl never tries to bind it, our bind of "pid of libnl + (1 << 22)" (which is what libnl ends up using for the 2nd socket it binds) will *always* succeed the first time. The way to make this reservation is to allocate a handle from libnl (it internally assigns the address it will use at handle alloc time), then just never use that handle - it is a place holder for whatever other code in the process is using netlink sockets directly (i.e. not through libnl). The advantage of this patch over the other is that it doesn't rely on timing at all. The disadvantage (?maybe? maybe not, I'm not sure - someone more knowledgeable about the libvirtd<->lldpad communication please inform) is that the bind address used by the netlink socket that communicates with lldpad will be "pid of libvirtd + (1 << 22)", not "pid of libvirtd". --- src/util/virnetlink.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index b2e9d51..aca0c07 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -68,6 +68,7 @@ struct _virNetlinkEventSrvPrivate { int eventwatch; int netlinkfd; struct nl_handle *netlinknh; + struct nl_handle *dummy_netlinknh; /*Events*/ int handled; size_t handlesCount; @@ -286,6 +287,7 @@ virNetlinkEventServiceStop(void) virNetlinkEventServerLock(srv); nl_close(srv->netlinknh); nl_handle_destroy(srv->netlinknh); + nl_handle_destroy(srv->dummy_netlinknh); virEventRemoveHandle(srv->eventwatch); /* free any remaining clients on the list */ @@ -345,13 +347,24 @@ virNetlinkEventServiceStart(void) virNetlinkEventServerLock(srv); + srv->netlinknh = NULL; + /* Allocate a dummy nl_handle to reserve the address "pid of + * libvirtd" for whatever library is using it. + */ + srv->dummy_netlinknh = nl_handle_alloc(); + if (!srv->dummy_netlinknh) { + virReportSystemError(errno, "%s", + _("cannot allocate space holder nlhandle for virNetlinkEvent server")); + goto error_server; + } + /* Allocate a new socket and get fd */ srv->netlinknh = nl_handle_alloc(); if (!srv->netlinknh) { virReportSystemError(errno, "%s", _("cannot allocate nlhandle for virNetlinkEvent server")); - goto error_locked; + goto error_server; } if (nl_connect(srv->netlinknh, NETLINK_ROUTE) < 0) { @@ -391,10 +404,14 @@ virNetlinkEventServiceStart(void) error_server: if (ret < 0) { - nl_close(srv->netlinknh); - nl_handle_destroy(srv->netlinknh); + if (srv->netlinknh) { + nl_close(srv->netlinknh); + nl_handle_destroy(srv->netlinknh); + } + if (srv->dummy_netlinknh) { + nl_handle_destroy(srv->dummy_netlinknh); + } } -error_locked: virNetlinkEventServerUnlock(srv); if (ret < 0) { virMutexDestroy(&srv->lock); -- 1.7.10

On 05/01/2012 01:11 PM, Laine Stump wrote:
This is an alternate method of solving the problem detailed in:
https://bugzilla.redhat.com/show_bug.cgi?id=816465
Testing has shown that if we "reserve" the bind address of "pid of libvirtd" so that libnl never tries to bind it, our bind of "pid of libnl + (1 << 22)" (which is what libnl ends up using for the 2nd
s/pid of libnl/pid of libvirtd/
socket it binds) will *always* succeed the first time. The way to make this reservation is to allocate a handle from libnl (it internally assigns the address it will use at handle alloc time), then just never use that handle - it is a place holder for whatever other code in the process is using netlink sockets directly (i.e. not through libnl).
Kind of gross to be relying on undocumented internals, but we already know that libnl will be changing API when we add libnl-3 support (and possibly again, once you report this design flaw upstream); and we also have the argument that libnl is open source code so we are doing this based on actual code inspection (and not random guesses) - if libnl were a proprietary library, we'd be stuck.
The advantage of this patch over the other is that it doesn't rely on timing at all.
And that's why I like this more than alternative 1.
The disadvantage (?maybe? maybe not, I'm not sure - someone more knowledgeable about the libvirtd<->lldpad communication please inform) is that the bind address used by the netlink socket that communicates with lldpad will be "pid of libvirtd + (1 << 22)", not "pid of libvirtd".
I don't know of any downside, but I also am not knowledgeable about lldpad, so I could be entirely missing something. If that proves to be a problem in practice, speak up now, and we can go with option 1; but if no one speaks up, I like option 2 better. ACK to the actual code for this workaround, although...
+ srv->netlinknh = NULL; + /* Allocate a dummy nl_handle to reserve the address "pid of + * libvirtd" for whatever library is using it.
Adding a URL to this thread (and/or a thread on the libnl list reporting the problem) might be nice for future reference. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 05/01/2012 01:05 PM, Laine Stump wrote:
The two following patches fix the same problem (described in https://bugzilla.redhat.com/show_bug.cgi?id=816465) in two alternate ways - one by retrying the failing operation after a delay, the other by using knowledge of how libnl works internally to artificially "reserve" a particular address so libnl doesn't attempt to bind to it.
You might think that libnl is the right place to fix this bug. Unfortunately, that isn't possible, because it would involve changing libnl's API - currently libnl decides what address to use for future binds of a netlink socket at the time nl_handle_alloc() is called, but doesn't actually attempt to bind it. Later, during nl_connect(), it calls bind() with the address it had earlier decided. It would be nice if we could just retry the bind with a different address when the first attempt failed, but libnl allows client applications to retrieve the bind address *before nl_connect() is called*, so an application may have already gotten the address prior to calling nl_connect(), and changing it would render the applications information incorrect.
Does libnl-3 have the same issue? It would be interesting if Serge Hallyn's patches to support libnl-3 ended up allowing us to move to a version of the library that doesn't have the same fundamental flaw. Have we complained about this flaw to the libnl upstream folks?
So the best we can do (for now at least) is work around the problem, and these are two possible workarounds.
I'm still debating which workaround is more palatable, but agree that we have to do something. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 05/01/2012 03:16 PM, Eric Blake wrote:
On 05/01/2012 01:05 PM, Laine Stump wrote:
The two following patches fix the same problem (described in https://bugzilla.redhat.com/show_bug.cgi?id=816465) in two alternate ways - one by retrying the failing operation after a delay, the other by using knowledge of how libnl works internally to artificially "reserve" a particular address so libnl doesn't attempt to bind to it.
You might think that libnl is the right place to fix this bug. Unfortunately, that isn't possible, because it would involve changing libnl's API - currently libnl decides what address to use for future binds of a netlink socket at the time nl_handle_alloc() is called, but doesn't actually attempt to bind it. Later, during nl_connect(), it calls bind() with the address it had earlier decided. It would be nice if we could just retry the bind with a different address when the first attempt failed, but libnl allows client applications to retrieve the bind address *before nl_connect() is called*, so an application may have already gotten the address prior to calling nl_connect(), and changing it would render the applications information incorrect. Does libnl-3 have the same issue? It would be interesting if Serge Hallyn's patches to support libnl-3 ended up allowing us to move to a version of the library that doesn't have the same fundamental flaw. Have we complained about this flaw to the libnl upstream folks?
Yes, libnl-3 has the same flaw in the API/code, but no I haven't contacted them about it yet - I've been too busy gathering information. That's on my list of things to do, though (actually I'm thinking it would be good to have the libnl maintainer take a look at Serge's patches - he had earlier agreed to help out with making libvirt libnl-3-compliant when we got around to it).
participants (3)
-
Daniel P. Berrange
-
Eric Blake
-
Laine Stump