On Thu, Feb 28, 2013 at 1:11 PM, Laine Stump <laine(a)laine.org> wrote:
On 02/28/2013 11:37 AM, Daniel P. Berrange wrote:
> On Fri, Mar 01, 2013 at 12:31:34AM +0800, Daniel Veillard wrote:
>> On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
>>> On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote:
>> [...]
>>> Oh joy, it is worse than you could possibly imagine.
Heh. I didn't see this message before my followup, or I would have just
replied here.
>>>
>>> On libnl1 the return value is a valid -errno, while in libnl3
>>> the return value is an error code of its own invention.
Ugh. Another indication that we can't use the return value.
>>>
>>> Further in libnl1 we can';t rely on the global errno, because
>>> other calls libnl does may have overritten it with garbage.
>>> We must use the return value from the function.
Or call nl_geterror(), which your patch does.
>>>
>>> For yet more fun, libnl1's error handling is not threadsafe.
>>> Whenever it hits an error with a syscall, it internally
>>> runs __nl_error() which mallocs/frees a static global
That static global (errbuf) has been turned into
static __thread char *errbuf;
by patches in RHEL6 and Fedora.
>>> variable containing the contents of strerror() which is
>>> itself also not threadsafe :-(
I think at this point it's down to this one problem (not sure why we
didn't fix that the last time we visited this problem)
>>>
>>> Did I mention we should just throw out all versions of libnl
>>> entirely and talk to the kernel ourselves..... It has caused
>>> us no end of pain in all its versions.
>> No chance of educating them instead ? We can't rewrite everything :)
> Sure, it has been getting better over time, but that doesn't help us
> for all existing distros, particular rhel-5 and rhel-6 which libvirt
> is going to be crash-prone due to unsolvable libnl design flaws in
> those versions.
Except for libnl's calling strerror, the thread-safety of error handling
in libnl1 has been avoided in the versions in RHEL6 and older Fedoras
(F18 libvirt uses libnl3) by changing the static global you mention
above (errbuf) to thread-local:
static __thread char *errbuf;
In RHEL5, neither netcf nor macvtap is supported (netcf originally
wasn't supported in RHEL5 because there was no libnl, and macvtap isn't
even in the kernel - it requires kernel 2.6.34, but RHEL5 is using
2.6.18). This means that libnl isn't compiled into libvirt on RHEL5, and
also isn't indirectly used, since netcf isn't supported. So we don't
need to worry about RHEL5.
Still there is the fact that glibc's strerror is getting called :-(
>
> Looking at the code there are two basic sets of APIs we rely on
>
> nl_XXXX
> nla_XXXX
>
> The nl_XXX APis are basically just wrappers around the normal socket()
> based APIs, hiding a few bits about the AF_NETLINK socket type. It
> would be trivially to do all that work ourselves, since socket() handling
> is nothing special. These are the APIs which have caused us multiple
> thread safety crash problems over the years.
>
> The nla_XXX APIs are all about complex data formatting, and we wouldn't
> want to try todo that ourselves. Fortunately the nla_XXX APIs are not
> the ones that are causing us trouble - AFAICT those look pretty safe
> in what they do fro a thread POV, since they're all just working on the
> object instances you pass in, no global state.
That is an interesting idea, though. (It would have been an even better
idea 2-3 years ago, if we were only clairvoyant :-)
Or maybe we could just push to get strerror replaced with strerror_r in
libnl1 (and push that into the various libnl1-using distros, then cross
our fingers for the next thread safety problem).
Honestly, I think this is likely one of your better bets. Given that
you'll have to support libnl1 due to RHEL6 for another roughly 5
years. But long term you guys will likely want to push Fedora and
RHEL7 to libnl3, any distros have shifted away from libnl1 (I know
Gentoo has simply because my personal bad experiences with libnl1).
I've spent the past few minutes checking various distros that I know
publish their depends in an easy to access fashion. Ubuntu, Arch,
Gentoo, and Mageia are using libnl3 for libvirt. Fedora, Debian,
Mandriva (discontinued for Mageia) and RHEL are using libnl1. SuSE's
libvirt packages don't link to libnl at all.
While not an exhaustive list, it does show that there is traction
towards going with libnl3. The big holdout IMHO is Fedora and
switching to libnl3 should at the very least be on the roadmap for
Fedora 19.
--
Doug Goldstein