On 08/05/2015 12:09 PM, Brian Rak wrote:
I recently compiled 1.2.18 to start testing with it, and was getting
this error on startup:
*** stack smashing detected ***: libvirtd terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x7fe1ac631527]
/lib64/libc.so.6(__fortify_fail+0x0)[0x7fe1ac6314f0]
//lib/libvirt.so.0(+0xa7927)[0x7fe1aeda2927]
//lib/libvirt/connection-driver/libvirt_driver_nodedev.so(+0x947d)[0x7fe1958a047d]
//lib/libvirt/connection-driver/libvirt_driver_nodedev.so(+0xa6c2)[0x7fe1958a16c2]
//lib/libvirt/connection-driver/libvirt_driver_nodedev.so(+0xaf4e)[0x7fe1958a1f4e]
//lib/libvirt.so.0(virStateInitialize+0xb8)[0x7fe1aee6d0a8]
libvirtd(+0x15120)[0x7fe1afae6120]
//lib/libvirt.so.0(+0xd4975)[0x7fe1aedcf975]
/lib64/libpthread.so.0(+0x30316079d1)[0x7fe1ada8c9d1]
/lib64/libc.so.6(clone+0x6d)[0x7fe1ac6178fd]
(gdb) bt
#0 0x00007ffff4a8f625 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a90e05 in abort () from /lib64/libc.so.6
#2 0x00007ffff4acd537 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff4b5f527 in __fortify_fail () from /lib64/libc.so.6
#4 0x00007ffff4b5f4f0 in __stack_chk_fail () from /lib64/libc.so.6
#5 0x00007ffff72d0927 in virNetDevGetFeatures (ifname=<value
optimized out>, out=<value optimized out>) at util/virnetdev.c:3200
#6 0x00007fffdddce47d in udevProcessNetworkInterface
(device=0x7fffd4071f70, def=0x6) at node_device/node_device_udev.c:694
#7 udevGetDeviceDetails (device=0x7fffd4071f70, def=0x6) at
node_device/node_device_udev.c:1272
#8 0x00007fffdddcf6c2 in udevAddOneDevice (device=0x7fffd4071f70) at
node_device/node_device_udev.c:1394
#9 0x00007fffdddcff4e in udevProcessDeviceListEntry
(privileged=<value optimized out>, callback=<value optimized out>,
opaque=<value optimized out>)
at node_device/node_device_udev.c:1433
#10 udevEnumerateDevices (privileged=<value optimized out>,
callback=<value optimized out>, opaque=<value optimized out>) at
node_device/node_device_udev.c:1463
#11 nodeStateInitialize (privileged=<value optimized out>,
callback=<value optimized out>, opaque=<value optimized out>) at
node_device/node_device_udev.c:1773
#12 0x00007ffff739b0a8 in virStateInitialize (privileged=true,
callback=0x555555569070 <daemonInhibitCallback>,
opaque=0x5555557f1db0) at libvirt.c:777
#13 0x0000555555569120 in daemonRunStateInit (opaque=<value optimized
out>) at libvirtd.c:947
#14 0x00007ffff72fd975 in virThreadHelper (data=<value optimized out>)
at util/virthread.c:206
#15 0x00007ffff5fba9d1 in start_thread () from /lib64/libpthread.so.0
#16 0x00007ffff4b458fd in clone () from /lib64/libc.so.6
In IRC, we tracked this down to this bit of code:
g_cmd.cmd = ETHTOOL_GFEATURES;
g_cmd.size = GFEATURES_SIZE;
if (virNetDevGFeatureAvailable(ifname, &g_cmd))
    ignore_value(virBitmapSetBit(*out, VIR_NET_DEV_FEAT_TXUDPTNL));
GFEATURES_SIZE is currently defined as 2, but this value needs to be
higher in order to support newer kernels. It looks like this code was
added in commit ac3ed2085fcbeecaf5aa347c0b1bffaf94fff293.
ethtool calculates this value based on the number of supported
features:
http://lxr.free-electrons.com/source/net/core/ethtool.c#L55
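For reference, the relevant definition in that file looks roughly like
this (both names are kernel-internal and not visible to userspace, so
this is only for illustration):

#define ETHTOOL_DEV_FEATURE_WORDS   ((NETDEV_FEATURE_COUNT + 31) / 32)

i.e. one 32-bit feature block per 32 features, rounded up.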
I don't know enough about this to fix it properly, but raising
GFEATURES_SIZE to 3 has fixed the issue for me (though this will
obviously need to go higher as more features get added).
(In a later IRC conversation, Brian noted that raising GFEATURES_SIZE
*didn't* always eliminate the issue...)
The problem goes beyond that:
1) As far as I can see, g_cmd.size needs to be set to the number of
items in the g_cmd.features array, and we're setting it to 2
(GFEATURES_SIZE), but we have allocated space for exactly *0* items in
that array. If we're going to tell the kernel we have 2 items in the
array, we need to actually have that space available, or the kernel
will overwrite something else (there's a rough sketch of what I mean
below, after point 2).
2) The feature we're looking for is called "TX_UDP_TNL" in libvirt,
and is manually #defined to be bit 25. From the title of the commit
log for the patch that added this code to libvirt, you can see that
what we want to check for is the feature called
"tx-udp_tnl-segmentation", and if you
look at the ethtool.c source from the kernel that Brian has linked to
above, you'll see that the
netdev_features_strings[NETIF_F_GSO_UDP_TUNNEL_BIT] is initialized to
"tx-upd_tnl-segmentation". When you look up NETIF_F_GSO_UDP_TUNNEL_BIT,
it seems to be *26* in the enum where it is defined:
http://osxr.org/linux/source/include/linux/netdev_features.h#0047
So are we checking for the wrong feature?
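Back to point 1: here's a rough, untested sketch of what I mean by
actually having the space available - the trailing features[] array in
struct ethtool_gfeatures is zero-length in <linux/ethtool.h>, so the
blocks have to be allocated behind the struct. The helper name is made
up, and the block count would still have to come from somewhere
sensible:

#include <stdlib.h>
#include <linux/ethtool.h>

/* Illustrative helper (not existing libvirt code): allocate an
 * ETHTOOL_GFEATURES command with room for 'nblocks' feature blocks,
 * so the kernel has somewhere to write its reply. */
static struct ethtool_gfeatures *
gfeaturesAlloc(unsigned int nblocks)
{
    struct ethtool_gfeatures *cmd;

    cmd = calloc(1, sizeof(*cmd) +
                 nblocks * sizeof(struct ethtool_get_features_block));
    if (!cmd)
        return NULL;
    cmd->cmd = ETHTOOL_GFEATURES;
    cmd->size = nblocks;
    return cmd;
}

Whatever number ends up in cmd->size, the same number of blocks has to
actually be allocated behind the struct; that's the part the current
code gets wrong, independent of what GFEATURES_SIZE is set to.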
It would be "really nice" if we could avoid #defining magic values
like TX_UDP_TNL and GFEATURES_SIZE in our source. In my quick
investigation I couldn't see a way around that, though (since
NETIF_F_GSO_UDP_TUNNEL_BIT isn't available outside the kernel source).
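One possible way around the magic bit number, at least, would be to
ask the kernel for its feature-name strings at runtime
(ETHTOOL_GSSET_INFO plus ETHTOOL_GSTRINGS with ETH_SS_FEATURES, which
is how the ethtool utility maps names to bits) and search for
"tx-udp_tnl-segmentation" instead of hard-coding 25 or 26. A rough,
untested sketch of the idea (the function name is made up and error
handling is minimal):

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Hypothetical helper: return the bit index of the named feature on
 * ifname, or -1 if it can't be determined. */
static int
ethtoolFeatureBit(const char *ifname, const char *feature)
{
    int fd = -1;
    int ret = -1;
    struct ifreq ifr;
    struct {
        struct ethtool_sset_info hdr;
        __u32 count;            /* storage for hdr.data[0] */
    } sset_info;
    struct ethtool_gstrings *strings = NULL;
    __u32 i, nstrings;

    if ((fd = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    if (strlen(ifname) >= IFNAMSIZ)
        goto cleanup;
    strcpy(ifr.ifr_name, ifname);

    /* How many feature strings does this kernel have? */
    memset(&sset_info, 0, sizeof(sset_info));
    sset_info.hdr.cmd = ETHTOOL_GSSET_INFO;
    sset_info.hdr.sset_mask = 1ULL << ETH_SS_FEATURES;
    ifr.ifr_data = (char *)&sset_info;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0 || !sset_info.hdr.sset_mask)
        goto cleanup;
    nstrings = sset_info.count;

    /* Fetch the strings; the index of a string is its feature bit. */
    if (!(strings = calloc(1, sizeof(*strings) +
                           (size_t)nstrings * ETH_GSTRING_LEN)))
        goto cleanup;
    strings->cmd = ETHTOOL_GSTRINGS;
    strings->string_set = ETH_SS_FEATURES;
    strings->len = nstrings;
    ifr.ifr_data = (char *)strings;
    if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
        goto cleanup;

    for (i = 0; i < nstrings; i++) {
        const char *name = (const char *)strings->data +
                           i * ETH_GSTRING_LEN;
        if (strncmp(name, feature, ETH_GSTRING_LEN) == 0) {
            ret = i;
            break;
        }
    }

 cleanup:
    free(strings);
    if (fd >= 0)
        close(fd);
    return ret;
}

The strings come from the same netdev_features_strings table that
'ethtool -k' displays, so matching on the name should be no more
fragile than the current hard-coded bit, and nothing can be overrun
because the buffer is sized from what the kernel reports.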
This crash was occurring on a CentOS 6 system running the ELRepo
kernel-ml kernel. The stock CentOS 6 kernel (2.6.32) does not appear
to have enough features available to trigger this.
I guess it would depend on whether or not ETHTOOL_GFEATURES is defined
for the 2.6 kernels and, if so, what was being overwritten beyond the
end of g_cmd. (There are other locals defined both before and after
g_cmd; all of them are used only *before* g_cmd is used. I'm not sure
what order the locals end up in on the stack, so I don't know which of
them are being overwritten.)