On 02/28/2013 11:37 AM, Daniel P. Berrange wrote:
On Fri, Mar 01, 2013 at 12:31:34AM +0800, Daniel Veillard wrote:
> On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
>> On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote:
> [...]
>> Oh joy, it is worse than you could possibly imagine.
Heh. I didn't see this message before my followup, or I would have just
replied here.
>>
>> On libnl1 the return value is a valid -errno, while in libnl3
>> the return value is an error code of its own invention.
Ugh. Another indication that we can't use the return value.
>>
>> Further in libnl1 we can';t rely on the global errno, because
>> other calls libnl does may have overritten it with garbage.
>> We must use the return value from the function.
Or call nl_geterror(), which your patch does.
>>
>> For yet more fun, libnl1's error handling is not threadsafe.
>> Whenever it hits an error with a syscall, it internally
>> runs __nl_error() which mallocs/frees a static global
That static global (errbuf) has been turned into
static __thread char *errbuf;
by patches in RHEL6 and Fedora.
>> variable containing the contents of strerror() which is
>> itself also not threadsafe :-(
I think at this point it's down to this one problem (not sure why we
didn't fix that the last time we visited this problem)
>>
>> Did I mention we should just throw out all versions of libnl
>> entirely and talk to the kernel ourselves..... It has caused
>> us no end of pain in all its versions.
> No chance of educating them instead ? We can't rewrite everything :)
Sure, it has been getting better over time, but that doesn't help us
for all existing distros, particular rhel-5 and rhel-6 which libvirt
is going to be crash-prone due to unsolvable libnl design flaws in
those versions.
Except for libnl's calling strerror, the thread-safety of error handling
in libnl1 has been avoided in the versions in RHEL6 and older Fedoras
(F18 libvirt uses libnl3) by changing the static global you mention
above (errbuf) to thread-local:
static __thread char *errbuf;
In RHEL5, neither netcf nor macvtap is supported (netcf originally
wasn't supported in RHEL5 because there was no libnl, and macvtap isn't
even in the kernel - it requires kernel 2.6.34, but RHEL5 is using
2.6.18). This means that libnl isn't compiled into libvirt on RHEL5, and
also isn't indirectly used, since netcf isn't supported. So we don't
need to worry about RHEL5.
Still there is the fact that glibc's strerror is getting called :-(
Looking at the code there are two basic sets of APIs we rely on
nl_XXXX
nla_XXXX
The nl_XXX APis are basically just wrappers around the normal socket()
based APIs, hiding a few bits about the AF_NETLINK socket type. It
would be trivially to do all that work ourselves, since socket() handling
is nothing special. These are the APIs which have caused us multiple
thread safety crash problems over the years.
The nla_XXX APIs are all about complex data formatting, and we wouldn't
want to try todo that ourselves. Fortunately the nla_XXX APIs are not
the ones that are causing us trouble - AFAICT those look pretty safe
in what they do fro a thread POV, since they're all just working on the
object instances you pass in, no global state.
That is an interesting idea, though. (It would have been an even better
idea 2-3 years ago, if we were only clairvoyant :-)
Or maybe we could just push to get strerror replaced with strerror_r in
libnl1 (and push that into the various libnl1-using distros, then cross
our fingers for the next thread safety problem).