[libvirt] [PATCH] nodedev_udev: Fix missing events when kernel report lots of udev events within a short time

newer
[libvirt] [PATCH] Add missing deps...

ZhiPeng Lu

9 May 2017 9 May '17

11:09 a.m.

From: "ning.bo" <ning.bo9@zte.com.cn> When create Virtual Function for Inter XL710 use below commands: for i in `seq 0 1`; do echo 63 > /sys/devices/pci0000:00/0000:00:03.2/0000:07:00.$i/sriov_numvfs done for i in `seq 0 3`; do echo 31 > /sys/devices/pci0000:80/0000:80:02.2/0000:82:00.$i/sriov_numvfs done The libvirtd will missing some udev events, the result of libvirt-python API listDevices('pci') will not list all pci devices. The reason is that the buffer of udev monitor default size cann't save all udev events reported by kernel. So we need change buffer size so that we can receive as much events as possible whitin a short time. Signed-off-by: ning.bo <ning.bo9@zte.com.cn> --- src/node_device/node_device_udev.c | 1 + 1 file changed, 1 insertion(+) diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 6e706a1..d813206 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1564,6 +1564,7 @@ static int nodeStateInitialize(bool privileged, } udev_monitor_enable_receiving(priv->udev_monitor); + udev_monitor_set_receive_buffer_size(priv->udev_monitor, 128 * 1024 * 1024); /* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing -- 2.8.3

Show replies by date

Erik Skultety

10 May 10 May

8:38 p.m.

On Tue, May 09, 2017 at 10:09:07AM +0800, ZhiPeng Lu wrote:

...

From: "ning.bo" <ning.bo9@zte.com.cn>

When create Virtual Function for Inter XL710 use below commands: for i in `seq 0 1`; do echo 63 > /sys/devices/pci0000:00/0000:00:03.2/0000:07:00.$i/sriov_numvfs done for i in `seq 0 3`; do echo 31 > /sys/devices/pci0000:80/0000:80:02.2/0000:82:00.$i/sriov_numvfs done

Hi, I think this is worth a BZ, so we can track and test the issue. Now, I'm working on a similar issue which is very hard to reproduce and this looks like it could be reproduced easily with 100% chance, but I don't have any HW to try it on (it might take a while to get my hands on some). So, would you mind opening a BZ for this, attaching not only libvirtd's debug logs, but preferably udev debug logs as well (libudev API is kinda poorly designed in this aspect and the only way to check for the real error is to enable the debugging and look into the logs).

...

The libvirtd will missing some udev events, the result of libvirt-python API listDevices('pci') will not list all pci devices. The reason is that the buffer of udev monitor default size cann't save all udev events reported by kernel. So we need change buffer size so that we can receive as much events as possible whitin a short time.

Signed-off-by: ning.bo <ning.bo9@zte.com.cn> --- src/node_device/node_device_udev.c | 1 + 1 file changed, 1 insertion(+)

diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 6e706a1..d813206 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1564,6 +1564,7 @@ static int nodeStateInitialize(bool privileged, }

udev_monitor_enable_receiving(priv->udev_monitor); + udev_monitor_set_receive_buffer_size(priv->udev_monitor, 128 * 1024 * 1024);

The reason why I want to investigate the logs (ideally try it myself) is rather than blindly increasing the buffer size, maybe we receive just too many events, out of which some we might filter out and the problem would disappear. Anyhow, isn't 129 MiB bit of an overkill for a buffer size? I mean, you didn't provide any commentary on why you chose 128MiB specifically, I can only guess it somehow relates to the fact, that XL710 is capable of 128 VFs?? Thanks, Erik

...

/* We register the monitor with the event callback so we are * notified by udev of device changes before we enumerate existing -- 2.8.3

-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Daniel P. Berrange

8:44 p.m.

On Wed, May 10, 2017 at 01:38:00PM +0200, Erik Skultety wrote:

...

On Tue, May 09, 2017 at 10:09:07AM +0800, ZhiPeng Lu wrote:

...
From: "ning.bo" <ning.bo9@zte.com.cn>

When create Virtual Function for Inter XL710 use below commands: for i in `seq 0 1`; do echo 63 > /sys/devices/pci0000:00/0000:00:03.2/0000:07:00.$i/sriov_numvfs done for i in `seq 0 3`; do echo 31 > /sys/devices/pci0000:80/0000:80:02.2/0000:82:00.$i/sriov_numvfs done

Hi, I think this is worth a BZ, so we can track and test the issue. Now, I'm working on a similar issue which is very hard to reproduce and this looks like it could be reproduced easily with 100% chance, but I don't have any HW to try it on (it might take a while to get my hands on some). So, would you mind opening a BZ for this, attaching not only libvirtd's debug logs, but preferably udev debug logs as well (libudev API is kinda poorly designed in this aspect and the only way to check for the real error is to enable the debugging and look into the logs).

...
The libvirtd will missing some udev events, the result of libvirt-python API listDevices('pci') will not list all pci devices. The reason is that the buffer of udev monitor default size cann't save all udev events reported by kernel. So we need change buffer size so that we can receive as much events as possible whitin a short time.

Signed-off-by: ning.bo <ning.bo9@zte.com.cn> --- src/node_device/node_device_udev.c | 1 + 1 file changed, 1 insertion(+)

diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 6e706a1..d813206 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1564,6 +1564,7 @@ static int nodeStateInitialize(bool privileged, }

udev_monitor_enable_receiving(priv->udev_monitor); + udev_monitor_set_receive_buffer_size(priv->udev_monitor, 128 * 1024 * 1024);

The reason why I want to investigate the logs (ideally try it myself) is rather than blindly increasing the buffer size, maybe we receive just too many events, out of which some we might filter out and the problem would disappear. Anyhow, isn't 129 MiB bit of an overkill for a buffer size? I mean, you didn't provide any commentary on why you chose 128MiB specifically, I can only guess it somehow relates to the fact, that XL710 is capable of 128 VFs??

This matches what udevd sets for the buffer. This method ends up setting the SO_RCVRBUFFORCE socket options. So we're not allocating 128 MB in userspace - IIUC it just lets the kernel expand it upto 128 MB if needed. This is a privileged operation however, so we better make sure we don't call this when running non-root. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Erik Skultety

11 May 11 May

12:09 a.m.

On Wed, May 10, 2017 at 12:44:22PM +0100, Daniel P. Berrange wrote:

...

On Wed, May 10, 2017 at 01:38:00PM +0200, Erik Skultety wrote:

...
On Tue, May 09, 2017 at 10:09:07AM +0800, ZhiPeng Lu wrote:

...
From: "ning.bo" <ning.bo9@zte.com.cn>

When create Virtual Function for Inter XL710 use below commands: for i in `seq 0 1`; do echo 63 > /sys/devices/pci0000:00/0000:00:03.2/0000:07:00.$i/sriov_numvfs done for i in `seq 0 3`; do echo 31 > /sys/devices/pci0000:80/0000:80:02.2/0000:82:00.$i/sriov_numvfs done

Hi, I think this is worth a BZ, so we can track and test the issue. Now, I'm working on a similar issue which is very hard to reproduce and this looks like it could be reproduced easily with 100% chance, but I don't have any HW to try it on (it might take a while to get my hands on some). So, would you mind opening a BZ for this, attaching not only libvirtd's debug logs, but preferably udev debug logs as well (libudev API is kinda poorly designed in this aspect and the only way to check for the real error is to enable the debugging and look into the logs).

...
The libvirtd will missing some udev events, the result of libvirt-python API listDevices('pci') will not list all pci devices. The reason is that the buffer of udev monitor default size cann't save all udev events reported by kernel. So we need change buffer size so that we can receive as much events as possible whitin a short time.

Signed-off-by: ning.bo <ning.bo9@zte.com.cn> --- src/node_device/node_device_udev.c | 1 + 1 file changed, 1 insertion(+)

diff --git a/src/node_device/node_device_udev.c b/src/node_device/node_device_udev.c index 6e706a1..d813206 100644 --- a/src/node_device/node_device_udev.c +++ b/src/node_device/node_device_udev.c @@ -1564,6 +1564,7 @@ static int nodeStateInitialize(bool privileged, }

udev_monitor_enable_receiving(priv->udev_monitor); + udev_monitor_set_receive_buffer_size(priv->udev_monitor, 128 * 1024 * 1024);

The reason why I want to investigate the logs (ideally try it myself) is rather than blindly increasing the buffer size, maybe we receive just too many events, out of which some we might filter out and the problem would disappear. Anyhow, isn't 129 MiB bit of an overkill for a buffer size? I mean, you didn't provide any commentary on why you chose 128MiB specifically, I can only guess it somehow relates to the fact, that XL710 is capable of 128 VFs??

This matches what udevd sets for the buffer.

Ah, right.

...

This method ends up setting the SO_RCVRBUFFORCE socket options. So we're not allocating 128 MB in userspace - IIUC it just lets the kernel expand it upto 128 MB if needed. This is a privileged operation however, so

...

we better make sure we don't call this when running non-root. Yes, good point, but what about unprivileged daemon, you should be safe to use SO_RCVBUF safely in that case, as long as it's less than the global rmem_max

Yeah, I just checked how kernel handles sockets and buffers associated with them and you're absolutely right, when a datagram is to be delivered to a socket, sk_rmem_alloc size is incremented by the size of the buffer holding the datagram and in case the incremented size is to be bigger than sk_rcvbuf (that's the limit set by SO_RCVBUF) the datagram is dropped and ENOMEM is returned. limit. Now, since I'm dealing with a similar BZ, IMHO trying to come up with a suitable buffer size is non-trivial and setting it according to the udevd's manner is quite neat I have to admin, so the unprivileged case was just out of curiousity whether we need to take account for that as well. Thanks, Erik

...

Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

3126

Age (days ago)

3127

Last active (days ago)

List overview

Download

3 comments

3 participants

participants (3)

Daniel P. Berrange
Erik Skultety
ZhiPeng Lu