On 05/02/2012 11:32 AM, Daniel P. Berrange wrote:
> On Wed, May 02, 2012 at 11:29:36AM -0400, Laine Stump wrote:
>> On 05/02/2012 05:11 AM, Daniel P. Berrange wrote:
>>> On Tue, May 01, 2012 at 03:10:42PM -0400, Laine Stump wrote:
>>>> This patch is one alternative to solve the problem detailed in:
>>>>
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=816465
>>>>
>>>> Some other unidentified library in use by libvirtd (in another thread)
>>>> is apparently temporarily binding to a NETLINK_ROUTE raw socket with
>>>> an address of "pid of libvirtd" during startup.
>>> Can you identify this library?
>>
>> I made a few attempts, but didn't have any luck and decided to post
>> these patches based on the other evidence I'd gathered. I agree that I
>> would much prefer understanding who is doing this, even if it doesn't
>> change the workaround method.
>>
>>
>>> It should be possible to do so using
>>> systemtap without all that much trouble.
>> My full experience with systemtap is using some of the examples from
>> your blog posting on the topic :-)
> I assume you mean this one
>
> http://berrange.com/posts/2011/11/30/watching-the-libvirt-rpc-protocol-us...
Yes, that's the one. I wasn't actually interested in watching the rpc
protocol, but the interaction between libvirtd and the qemu monitor,
which was very helpful.
>> I would love to figure this out, though. The complicating factor I can
>> see (aside from me needing to learn how to write a systemtap script) is
>> that in this case stap needs to be run on a daemonizing process, from
>> the very beginning. If you can give me any better advice than "go read
>> the systemtap website", please do.
> I can't help today, but ping me on IRC tomorrow and I'll help you
> get sorted with systemtap. You can start the stap scripts before
> even running libvirtd, so there's no issue with the daemonizing
> side of things.
With some help from mjw in #systemtap on freenode, I was able to figure
out how to use systemtap to print a backtrace of all calls to bind, and
although the failures ceased as soon as I turned on the tracing (of
course), it did at least give me a list of bind calls to research.
It turns out that this is the interesting one (or one example of it,
anyway):
[23876,init
0x35b90e8277 : bind+0x7/0x30 [/lib64/libc-2.12.so]
0x35b910e540 : __check_pf+0x80/0xf0 [/lib64/libc-2.12.so]
0x35b90d1ab7 : getaddrinfo+0xe7/0x890 [/lib64/libc-2.12.so]
0x7fa695f1e61d : virSocketAddrParse+0x4d/0x190 [/usr/lib64/libvirt.so.0.9.10]
0x7fa695f47f2a : virNetworkIPParseXML+0xaa/0x4c0 [/usr/lib64/libvirt.so.0.9.10]
0x7fa695f48f37 : virNetworkDefParseNode+0xbf7/0x19e0 [/usr/lib64/libvirt.so.0.9.10]
0x7fa695f49d77 : virNetworkDefParse+0x57/0x70 [/usr/lib64/libvirt.so.0.9.10]
0x7fa695f49e2c : virNetworkLoadConfig+0x8c/0x1b0 [/usr/lib64/libvirt.so.0.9.10]
0x7fa695f49fb3 : virNetworkLoadAllConfigs+0x63/0x100 [/usr/lib64/libvirt.so.0.9.10]
0x4d5f97 : networkStartup+0x157/0x460 [/usr/sbin/libvirtd]
0x7fa695f806d0 : virStateInitialize+0x60/0xd0 [/usr/lib64/libvirt.so.0.9.10]
0x420ff1 : daemonRunStateInit+0x11/0x80 [/usr/sbin/libvirtd]
0x7fa695f08749 : virThreadHelper+0x29/0x40 [/usr/lib64/libvirt.so.0.9.10]
0x35b9c07851 : start_thread+0xd1/0x3d4 [/lib64/libpthread-2.12.so]
0x35b90e767d : __clone+0x6d/0x90 [/lib64/libc-2.12.so]
]
__check_pf() is in glibc - sysdeps/unix/sysv/linux/check_pf.c, and it
does directly (not through libnl) call socket(PF_NETLINK, SOCK_RAW,
NETLINK_ROUTE), set the nladdr to 0's, then bind() it. In the kernel,
netlink_bind() uses 0 as an indicator that it should auto-bind,
preferring the pid of the calling process (i.e. "pid of libvirtd") as
its nl_pid in the nladdr. This NETLINK socket is used for a short period
to get a list of interface addresses, and is then closed.
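
The socket/bind sequence described above can be reproduced from
userspace. What follows is a minimal sketch in Python, not the glibc
code itself. It assumes a Linux host where the socket module exposes
AF_NETLINK, and the helper name autobind_nl_pid is made up for
illustration. It binds with a zeroed address and then asks the kernel
which nl_pid it assigned.

```python
import os
import socket


def autobind_nl_pid(keep_open=None):
    """Bind a NETLINK_ROUTE raw socket with nl_pid=0 and return the
    nl_pid the kernel assigned, or None if netlink is unavailable.

    If keep_open is a list, the socket is appended to it instead of
    being closed, so a later call sees that address as still in use.
    """
    if not hasattr(socket, "AF_NETLINK"):
        return None  # not a Linux host
    try:
        sock = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                             socket.NETLINK_ROUTE)
    except OSError:
        return None  # netlink sockets are not permitted here
    try:
        sock.bind((0, 0))  # (nl_pid, nl_groups); nl_pid 0 requests auto-bind
        nl_pid, _groups = sock.getsockname()
    except OSError:
        sock.close()
        return None
    if keep_open is not None:
        keep_open.append(sock)
    else:
        sock.close()
    return nl_pid


if __name__ == "__main__":
    held = []
    first = autobind_nl_pid(keep_open=held)
    second = autobind_nl_pid()
    print("process pid:", os.getpid())
    print("first auto-bound nl_pid:", first)
    print("second auto-bound nl_pid (first still open):", second)
    for s in held:
        s.close()
```

Python reports nl_pid as an unsigned 32-bit value, so an address the
kernel chose from its negative range shows up as a large positive
number here.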
Once main() has started up its other threads, these threads may call
virSocketAddrParse (and thus __check_pf()) any number of times, creating
many socket/bind/close cycles of NETLINK sockets. Meanwhile, in the main
thread, virNetlinkEventServiceStart() is the first function in libvirtd
to call libnl's nl_handle_alloc(), which mistakenly assumes that it has
all control over netlink sockets, and that it can assign the address of
"pid of libvirtd" to this nlhandle. Shortly after that, nl_connect() is
called, which calls bind() with a *fixed* address of "pid of libvirtd".
If another thread happens to currently be in a call to __check_pf(), we
lose the lottery and bind() fails. If not, we win the lottery, bind()
succeeds, and future calls to bind() by __check_pf() will auto-bind to a
different address. The two address ranges differ: libnl assigns
subsequent sockets the address "pid + (n << 22)", with a maximum of 1024
sockets per process (i.e. it will always be positive), while auto-binds
in the kernel assign the first free address found between -2047 and
-2,147,483,648 (i.e. it will always be negative).
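
The losing side of that lottery can be shown in isolation. This is a
rough sketch under the same assumptions as any userspace netlink test
(Linux, netlink sockets permitted). The function name
fixed_bind_collision is hypothetical, and a single thread stands in for
the two racing threads: one socket is auto-bound and held open, then a
second socket tries to bind to that same address explicitly, the way
nl_connect() is described as doing.

```python
import errno
import socket


def fixed_bind_collision():
    """Hold an auto-bound NETLINK_ROUTE socket open, then try to bind a
    second socket to the same nl_pid explicitly.

    Returns None if netlink sockets are unavailable, 0 if the second
    bind unexpectedly succeeded, or the errno of the failed bind.
    """
    if not hasattr(socket, "AF_NETLINK"):
        return None  # not a Linux host
    try:
        holder = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                               socket.NETLINK_ROUTE)
    except OSError:
        return None  # netlink sockets are not permitted here
    try:
        holder.bind((0, 0))              # kernel chooses the address
        taken = holder.getsockname()[0]  # the address now in use
        fixed = socket.socket(socket.AF_NETLINK, socket.SOCK_RAW,
                              socket.NETLINK_ROUTE)
        try:
            fixed.bind((taken, 0))       # fixed address, like nl_connect()
        except OSError as exc:
            return exc.errno
        finally:
            fixed.close()
        return 0
    except OSError:
        return None
    finally:
        holder.close()


if __name__ == "__main__":
    result = fixed_bind_collision()
    print("fixed bind result:", errno.errorcode.get(result, result))
```

On a Linux kernel the failing bind is expected to report EADDRINUSE.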
So, the conclusions to draw from this analysis are:
1) my "alternative 1" patch was only coincidentally succeeding, and
would be about as useful as everyone removing their shoes at airport
security checkpoints.
2) If libvirtd has multiple threads started up before any netlink
sockets have been bound to "pid of libvirtd", there is a possibility
that the first call to nl_connect will fail (due to another thread being
in getaddrinfo/__check_pf()). This is just as true for the macvtap and
netcf uses of libnl as for the virNetlinkEventService use.
3) Once the first call to nl_connect is successfully completed (and/or
if an extra (and otherwise unused) nlhandle is created with
nl_handle_alloc() before creating any nlhandles that are subsequently
nl_connect()ed), the likelihood of a subsequent nl_connect() failure is
effectively 0, since the address space used by libnl is all positive 32
bit numbers, and the address space used by the auto-bind address in the
kernel is (almost) all negative 32 bit numbers.
4) libnl should, at the very least, be modified to not use exactly
nl_pid = pid, since there is a very high likelihood that particular
address will already be taken by a library function that is calling bind
directly, rather than through libnl. Really, its API shouldn't allow
applications to retrieve the bind address used until after nl_connect()
has already completed successfully; unfortunately, that would require an
incompatible change in the API.
Now that I completely understand the problem, I actually think that
neither of these patches is quite correct; the first because it is
simply bogus, and the second because it only solves the problem for
virNetlinkEventService - it still leaves open the possibility that
macvtap or netcf usage of libnl could result in a failure (although
*only* if one of those uses happened to be called prior to
virNetlinkEventService).
To be 100% safe, I think what we need to do is put an extra call to
nl_handle_alloc() very early in main, prior to calling
virNetServerNew(), which is when all the other worker threads are
created. I'll put together such a patch and send it to the list later
tonight.
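
Why an early, otherwise unused handle helps can be sketched with a toy
model of the address scheme described above ("pid + (n << 22)", at most
1024 handles per process). This is not libnl code. The class
PortAllocator and the function ports_after_workaround are hypothetical
names, and the pid is an arbitrary example value.

```python
class PortAllocator:
    """Toy model: the n-th handle gets pid + (n << 22), n in 0..1023."""

    MAX_HANDLES = 1024

    def __init__(self, pid):
        self.pid = pid
        self.used = set()  # slot numbers already handed out

    def alloc(self):
        """Return the address for the lowest free slot."""
        for n in range(self.MAX_HANDLES):
            if n not in self.used:
                self.used.add(n)
                return self.pid + (n << 22)
        raise RuntimeError("no free slots left")


def ports_after_workaround(pid, handles):
    """Allocate one placeholder handle first, then `handles` real ones.

    The placeholder takes slot 0 (address == pid) and is never
    connected, so no handle that is later bound uses the address that
    a kernel auto-bind is most likely to pick.
    """
    alloc = PortAllocator(pid)
    placeholder = alloc.alloc()
    return placeholder, [alloc.alloc() for _ in range(handles)]


if __name__ == "__main__":
    placeholder, ports = ports_after_workaround(23876, 3)
    print("placeholder address:", placeholder)
    print("addresses of later handles:", ports)
```

In this model only the placeholder ever holds the bare pid. Every
handle allocated afterwards gets an address offset by a multiple of
1 << 22.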
Wow, thanks for figuring this out. It is all far worse than I imagined :-(
Clearly libnl is broken here, but I guess it is dead upstream in favour
of libnl3. I wonder if that shares the same problem.
Agree that creating a netlink handle in libvirtd main() sounds like a
way to work around it.
Daniel