[libvirt] [PATCH] Fix error report from nl_recvmsg

From: "Daniel P. Berrange" <berrange@redhat.com> The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length)); return; } -- 1.7.11.7

On 02/28/2013 08:37 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length));
My recollection is that we specifically avoided calling nl_geterror() because it isn't threadsafe. I'll go take another look to verify.

On Thu, Feb 28, 2013 at 11:11:53AM -0500, Laine Stump wrote:
On 02/28/2013 08:37 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length));
My recollection is that we specifically avoided calling nl_geterror() because it isn't threadsafe.
I'll go take another look to verify.
I did check this, but only for libnl3 which merely does a static string table lookup: const char *nl_geterror(int error) { error = abs(error); if (error > NLE_MAX) error = NLE_FAILURE; return errmsg[error]; } Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 11:11:53AM -0500, Laine Stump wrote:
On 02/28/2013 08:37 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length));
My recollection is that we specifically avoided calling nl_geterror() because it isn't threadsafe.
I'll go take another look to verify.
I did check this, but only for libnl3 which merely does a static string table lookup:
Oh joy, it is worse than you could possibly imagine. On libnl1 the return value is a valid -errno, while in libnl3 the return value is an error code of its own invention. Further in libnl1 we can';t rely on the global errno, because other calls libnl does may have overritten it with garbage. We must use the return value from the function. For yet more fun, libnl1's error handling is not threadsafe. Whenever it hits an error with a syscall, it internally runs __nl_error() which mallocs/frees a static global variable containing the contents of strerror() which is itself also not threadsafe :-( Did I mention we should just throw out all versions of libnl entirely and talk to the kernel ourselves..... It has caused us no end of pain in all its versions. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote: [...] Oh joy, it is worse than you could possibly imagine.
On libnl1 the return value is a valid -errno, while in libnl3 the return value is an error code of its own invention.
Further in libnl1 we can';t rely on the global errno, because other calls libnl does may have overritten it with garbage. We must use the return value from the function.
For yet more fun, libnl1's error handling is not threadsafe. Whenever it hits an error with a syscall, it internally runs __nl_error() which mallocs/frees a static global variable containing the contents of strerror() which is itself also not threadsafe :-(
Did I mention we should just throw out all versions of libnl entirely and talk to the kernel ourselves..... It has caused us no end of pain in all its versions.
No chance of educating them instead ? We can't rewrite everything :) Daniel -- Daniel Veillard | Open Source and Standards, Red Hat veillard@redhat.com | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | virtualization library http://libvirt.org/

On Fri, Mar 01, 2013 at 12:31:34AM +0800, Daniel Veillard wrote:
On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote: [...] Oh joy, it is worse than you could possibly imagine.
On libnl1 the return value is a valid -errno, while in libnl3 the return value is an error code of its own invention.
Further in libnl1 we can';t rely on the global errno, because other calls libnl does may have overritten it with garbage. We must use the return value from the function.
For yet more fun, libnl1's error handling is not threadsafe. Whenever it hits an error with a syscall, it internally runs __nl_error() which mallocs/frees a static global variable containing the contents of strerror() which is itself also not threadsafe :-(
Did I mention we should just throw out all versions of libnl entirely and talk to the kernel ourselves..... It has caused us no end of pain in all its versions.
No chance of educating them instead ? We can't rewrite everything :)
Sure, it has been getting better over time, but that doesn't help us for all existing distros, particular rhel-5 and rhel-6 which libvirt is going to be crash-prone due to unsolvable libnl design flaws in those versions. Looking at the code there are two basic sets of APIs we rely on nl_XXXX nla_XXXX The nl_XXX APis are basically just wrappers around the normal socket() based APIs, hiding a few bits about the AF_NETLINK socket type. It would be trivially to do all that work ourselves, since socket() handling is nothing special. These are the APIs which have caused us multiple thread safety crash problems over the years. The nla_XXX APIs are all about complex data formatting, and we wouldn't want to try todo that ourselves. Fortunately the nla_XXX APIs are not the ones that are causing us trouble - AFAICT those look pretty safe in what they do fro a thread POV, since they're all just working on the object instances you pass in, no global state. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 02/28/2013 11:37 AM, Daniel P. Berrange wrote:
On Fri, Mar 01, 2013 at 12:31:34AM +0800, Daniel Veillard wrote:
On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote: [...] Oh joy, it is worse than you could possibly imagine.
Heh. I didn't see this message before my followup, or I would have just replied here.
On libnl1 the return value is a valid -errno, while in libnl3 the return value is an error code of its own invention.
Ugh. Another indication that we can't use the return value.
Further in libnl1 we can';t rely on the global errno, because other calls libnl does may have overritten it with garbage. We must use the return value from the function.
Or call nl_geterror(), which your patch does.
For yet more fun, libnl1's error handling is not threadsafe. Whenever it hits an error with a syscall, it internally runs __nl_error() which mallocs/frees a static global
That static global (errbuf) has been turned into static __thread char *errbuf; by patches in RHEL6 and Fedora.
variable containing the contents of strerror() which is itself also not threadsafe :-(
I think at this point it's down to this one problem (not sure why we didn't fix that the last time we visited this problem)
Did I mention we should just throw out all versions of libnl entirely and talk to the kernel ourselves..... It has caused us no end of pain in all its versions.
No chance of educating them instead ? We can't rewrite everything :) Sure, it has been getting better over time, but that doesn't help us for all existing distros, particular rhel-5 and rhel-6 which libvirt is going to be crash-prone due to unsolvable libnl design flaws in those versions.
Except for libnl's calling strerror, the thread-safety of error handling in libnl1 has been avoided in the versions in RHEL6 and older Fedoras (F18 libvirt uses libnl3) by changing the static global you mention above (errbuf) to thread-local: static __thread char *errbuf; In RHEL5, neither netcf nor macvtap is supported (netcf originally wasn't supported in RHEL5 because there was no libnl, and macvtap isn't even in the kernel - it requires kernel 2.6.34, but RHEL5 is using 2.6.18). This means that libnl isn't compiled into libvirt on RHEL5, and also isn't indirectly used, since netcf isn't supported. So we don't need to worry about RHEL5. Still there is the fact that glibc's strerror is getting called :-(
Looking at the code there are two basic sets of APIs we rely on
nl_XXXX nla_XXXX
The nl_XXX APis are basically just wrappers around the normal socket() based APIs, hiding a few bits about the AF_NETLINK socket type. It would be trivially to do all that work ourselves, since socket() handling is nothing special. These are the APIs which have caused us multiple thread safety crash problems over the years.
The nla_XXX APIs are all about complex data formatting, and we wouldn't want to try todo that ourselves. Fortunately the nla_XXX APIs are not the ones that are causing us trouble - AFAICT those look pretty safe in what they do fro a thread POV, since they're all just working on the object instances you pass in, no global state.
That is an interesting idea, though. (It would have been an even better idea 2-3 years ago, if we were only clairvoyant :-) Or maybe we could just push to get strerror replaced with strerror_r in libnl1 (and push that into the various libnl1-using distros, then cross our fingers for the next thread safety problem).

On Thu, Feb 28, 2013 at 1:11 PM, Laine Stump <laine@laine.org> wrote:
On 02/28/2013 11:37 AM, Daniel P. Berrange wrote:
On Fri, Mar 01, 2013 at 12:31:34AM +0800, Daniel Veillard wrote:
On Thu, Feb 28, 2013 at 04:24:17PM +0000, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 04:16:37PM +0000, Daniel P. Berrange wrote: [...] Oh joy, it is worse than you could possibly imagine.
Heh. I didn't see this message before my followup, or I would have just replied here.
On libnl1 the return value is a valid -errno, while in libnl3 the return value is an error code of its own invention.
Ugh. Another indication that we can't use the return value.
Further in libnl1 we can';t rely on the global errno, because other calls libnl does may have overritten it with garbage. We must use the return value from the function.
Or call nl_geterror(), which your patch does.
For yet more fun, libnl1's error handling is not threadsafe. Whenever it hits an error with a syscall, it internally runs __nl_error() which mallocs/frees a static global
That static global (errbuf) has been turned into
static __thread char *errbuf;
by patches in RHEL6 and Fedora.
variable containing the contents of strerror() which is itself also not threadsafe :-(
I think at this point it's down to this one problem (not sure why we didn't fix that the last time we visited this problem)
Did I mention we should just throw out all versions of libnl entirely and talk to the kernel ourselves..... It has caused us no end of pain in all its versions.
No chance of educating them instead ? We can't rewrite everything :) Sure, it has been getting better over time, but that doesn't help us for all existing distros, particular rhel-5 and rhel-6 which libvirt is going to be crash-prone due to unsolvable libnl design flaws in those versions.
Except for libnl's calling strerror, the thread-safety of error handling in libnl1 has been avoided in the versions in RHEL6 and older Fedoras (F18 libvirt uses libnl3) by changing the static global you mention above (errbuf) to thread-local:
static __thread char *errbuf;
In RHEL5, neither netcf nor macvtap is supported (netcf originally wasn't supported in RHEL5 because there was no libnl, and macvtap isn't even in the kernel - it requires kernel 2.6.34, but RHEL5 is using 2.6.18). This means that libnl isn't compiled into libvirt on RHEL5, and also isn't indirectly used, since netcf isn't supported. So we don't need to worry about RHEL5.
Still there is the fact that glibc's strerror is getting called :-(
Looking at the code there are two basic sets of APIs we rely on
nl_XXXX nla_XXXX
The nl_XXX APis are basically just wrappers around the normal socket() based APIs, hiding a few bits about the AF_NETLINK socket type. It would be trivially to do all that work ourselves, since socket() handling is nothing special. These are the APIs which have caused us multiple thread safety crash problems over the years.
The nla_XXX APIs are all about complex data formatting, and we wouldn't want to try todo that ourselves. Fortunately the nla_XXX APIs are not the ones that are causing us trouble - AFAICT those look pretty safe in what they do fro a thread POV, since they're all just working on the object instances you pass in, no global state.
That is an interesting idea, though. (It would have been an even better idea 2-3 years ago, if we were only clairvoyant :-)
Or maybe we could just push to get strerror replaced with strerror_r in libnl1 (and push that into the various libnl1-using distros, then cross our fingers for the next thread safety problem).
Honestly, I think this is likely one of your better bets. Given that you'll have to support libnl1 due to RHEL6 for another roughly 5 years. But long term you guys will likely want to push Fedora and RHEL7 to libnl3, any distros have shifted away from libnl1 (I know Gentoo has simply because my personal bad experiences with libnl1). I've spent the past few minutes checking various distros that I know publish their depends in an easy to access fashion. Ubuntu, Arch, Gentoo, and Mageia are using libnl3 for libvirt. Fedora, Debian, Mandriva (discontinued for Mageia) and RHEL are using libnl1. SuSE's libvirt packages don't link to libnl at all. While not an exhaustive list, it does show that there is traction towards going with libnl3. The big holdout IMHO is Fedora and switching to libnl3 should at the very least be on the roadmap for Fedora 19. -- Doug Goldstein

On 02/28/2013 11:16 AM, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 11:11:53AM -0500, Laine Stump wrote:
On 02/28/2013 08:37 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length)); My recollection is that we specifically avoided calling nl_geterror() because it isn't threadsafe.
I'll go take another look to verify. I did check this, but only for libnl3 which merely does a static string table lookup:
const char *nl_geterror(int error) { error = abs(error);
if (error > NLE_MAX) error = NLE_FAILURE;
return errmsg[error]; }
nl_geterror() in libnl-1.1 calls strerror() which isn't threadsafe: char *nl_geterror(void) { if (errbuf) return errbuf; if (nlerrno) return strerror(nlerrno); return "Sucess\n"; } Of course strerror is only called from here if errbuf hasn't been set, and I *think* that is rare, and we added a patch to all Fedora/RHEL builds of libnl-1.1 that put errbuf and nlerrno into thread-local storage quite a long time ago (August 2010: https://bugzilla.redhat.com/show_bug.cgi?id=617291 ). *But* the function that sets errbuf, __nl_error(), itself calls strerror(). And when I look at strerror() in glibc, I'm not exactly getting warm fuzzies about what could happen if it was called simultaneously from two threads. It mallocs a global char* (after freeing the original) then uses strerror_r to duplicate the standard system error string into the malloc'ed buffer. So it's possible for one thread to be dereferencing a pointer to a buffer that has already been free'd by another thread. Of course that's all academic, since __nl_error() is called when an error occurs anyway, regardless of whether you subsequently decide to call nl_geterror() or not. So the conclusion is that I see no extra harm in calling nl_geterror(). ACK.

On Thu, Feb 28, 2013 at 11:59:42AM -0500, Laine Stump wrote:
On 02/28/2013 11:16 AM, Daniel P. Berrange wrote:
On Thu, Feb 28, 2013 at 11:11:53AM -0500, Laine Stump wrote:
On 02/28/2013 08:37 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
The nl_recvmsg does not always set errno. Instead it returns its own custom set of error codes. Thus we were reporting the wrong data. --- src/util/virnetlink.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/util/virnetlink.c b/src/util/virnetlink.c index 0b36fdc..8b47ede 100644 --- a/src/util/virnetlink.c +++ b/src/util/virnetlink.c @@ -335,8 +335,9 @@ virNetlinkEventCallback(int watch, if (length == 0) return; if (length < 0) { - virReportSystemError(errno, - "%s", _("nl_recv returned with error")); + virReportError(VIR_ERR_INTERNAL_ERROR, + _("nl_recv returned with error: %s"), + nl_geterror(length)); My recollection is that we specifically avoided calling nl_geterror() because it isn't threadsafe.
I'll go take another look to verify. I did check this, but only for libnl3 which merely does a static string table lookup:
const char *nl_geterror(int error) { error = abs(error);
if (error > NLE_MAX) error = NLE_FAILURE;
return errmsg[error]; }
nl_geterror() in libnl-1.1 calls strerror() which isn't threadsafe:
char *nl_geterror(void) { if (errbuf) return errbuf;
if (nlerrno) return strerror(nlerrno);
return "Sucess\n"; }
Of course strerror is only called from here if errbuf hasn't been set, and I *think* that is rare, and we added a patch to all Fedora/RHEL builds of libnl-1.1 that put errbuf and nlerrno into thread-local storage quite a long time ago (August 2010: https://bugzilla.redhat.com/show_bug.cgi?id=617291 ).
*But* the function that sets errbuf, __nl_error(), itself calls strerror(). And when I look at strerror() in glibc, I'm not exactly getting warm fuzzies about what could happen if it was called simultaneously from two threads. It mallocs a global char* (after freeing the original) then uses strerror_r to duplicate the standard system error string into the malloc'ed buffer. So it's possible for one thread to be dereferencing a pointer to a buffer that has already been free'd by another thread.
Of course that's all academic, since __nl_error() is called when an error occurs anyway, regardless of whether you subsequently decide to call nl_geterror() or not.
So the conclusion is that I see no extra harm in calling nl_geterror(). ACK.
Except the API signature is different, so my patch won't work with both versions :-( Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 03/01/2013 03:18 AM, Daniel P. Berrange wrote:
So the conclusion is that I see no extra harm in calling nl_geterror(). ACK.
Except the API signature is different, so my patch won't work with both versions :-(
Sounds like its time for a wrapper function that #ifdefs away the difference in underlying API calls between libnl1 and libnl3, with the bulk of our code using our wrapper instead of direct nl_* functions. And by the time we convert our code to all go through our wrappers, how much simpler is it to have our wrapper do the direct socket work instead of catering to two parallel nl_ abstractions? -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Fri, Mar 1, 2013 at 8:44 AM, Eric Blake <eblake@redhat.com> wrote:
On 03/01/2013 03:18 AM, Daniel P. Berrange wrote:
So the conclusion is that I see no extra harm in calling nl_geterror(). ACK.
Except the API signature is different, so my patch won't work with both versions :-(
Sounds like its time for a wrapper function that #ifdefs away the difference in underlying API calls between libnl1 and libnl3, with the bulk of our code using our wrapper instead of direct nl_* functions. And by the time we convert our code to all go through our wrappers, how much simpler is it to have our wrapper do the direct socket work instead of catering to two parallel nl_ abstractions?
-- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
This is the same road I went down with for an application at work. By the time I got close to abstracting away all the badness, I had a working netlink layer and dropped libnl entirely. I wasn't using all the complex gennl types that libnl contains and abstracts however. A reasonable read is libnfnetlink to see how really simple it is to use netlink. https://git.netfilter.org/libnfnetlink/ -- Doug Goldstein

On Sat, Mar 02, 2013 at 11:42:01AM -0600, Doug Goldstein wrote:
On Fri, Mar 1, 2013 at 8:44 AM, Eric Blake <eblake@redhat.com> wrote:
On 03/01/2013 03:18 AM, Daniel P. Berrange wrote:
So the conclusion is that I see no extra harm in calling nl_geterror(). ACK.
Except the API signature is different, so my patch won't work with both versions :-(
Sounds like its time for a wrapper function that #ifdefs away the difference in underlying API calls between libnl1 and libnl3, with the bulk of our code using our wrapper instead of direct nl_* functions. And by the time we convert our code to all go through our wrappers, how much simpler is it to have our wrapper do the direct socket work instead of catering to two parallel nl_ abstractions?
-- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
This is the same road I went down with for an application at work. By the time I got close to abstracting away all the badness, I had a working netlink layer and dropped libnl entirely. I wasn't using all the complex gennl types that libnl contains and abstracts however. A reasonable read is libnfnetlink to see how really simple it is to use netlink.
Thanks for the tip. I don't have time to work on this myself, but I'd encourage anyone interested to cook up a patch to remove libvirt's use of the nl_ prefixed functions (leaving only the nla_ functions) Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (5)
-
Daniel P. Berrange
-
Daniel Veillard
-
Doug Goldstein
-
Eric Blake
-
Laine Stump