On 04/26/2018 04:38 AM, Daniel P. Berrangé wrote:
On Thu, Apr 26, 2018 at 08:09:47AM +0200, Christian Ehrhardt wrote:
On Wed, Apr 25, 2018 at 11:25 PM, Laine Stump <laine(a)laine.org>
<laine(a)laine.org> wrote:
When an nwfilter rule sets the parameter CTRL_IP_LEARNING to "dhcp",
this turns on the "dhcpsnoop" thread, which uses libpcap to monitor
traffic on the domain's tap device and extract the IP address from the
DHCP response.
If libpcap on the host is built with TPACKET_V3 defined, the dhcpsnoop
code's initialization of the libpcap socket fails with the following
error:
virNWFilterSnoopDHCPOpen:1134 : internal error: pcap_setfilter: can't
remove kernel filter: Bad file descriptor
It turns out that this was because libpcap with TPACKET_V3 defined
requires a larger buffer size than libvirt was setting (we were
setting it to 128k). Changing the buffer size to 256k eliminates the
error, and the dhcpsnoop thread once again works properly.
Thanks to Christian Ehrhardt <paelzer(a)gmail.com> <paelzer(a)gmail.com> for
discovering that
buffer size was the problem.
Resolves:
https://bugzilla.redhat.com/show_bug.cgi?id=1547237
Signed-off-by: Laine Stump <laine(a)laine.org> <laine(a)laine.org>
---
src/nwfilter/nwfilter_dhcpsnoop.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/nwfilter/nwfilter_dhcpsnoop.c b/src/nwfilter/nwfilter_
dhcpsnoop.c
index 6069e70460..62eb617515 100644
--- a/src/nwfilter/nwfilter_dhcpsnoop.c
+++ b/src/nwfilter/nwfilter_dhcpsnoop.c
@@ -259,7 +259,7 @@ struct _virNWFilterDHCPDecodeJob {
* libpcap 1.5 requires a 128kb buffer
* 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
*/
I just started building with the change for a few tests on this - no
results yet.
But we are all puzzled/unsure enough on the size that I'd already ask to
modify the comment above to explain the new size.
Maybe we should explain:
- why 128 isn't enough
- why you chose "only" 256
- why the default size might be too big
- your size considerations for many guest scenarios
Okay, so I looked into this in more detail. I don't know that I'm totally
at the bottom of it, but this is what I found in libpcap:
* The function create_ring() calls setsockopt(fd, SOL_PACKET,
PACKET_RX_RING, &req ...) to setup a ring buffer for receiving packets.
* one of the attributes of req is tp_frame_size, and another is
tp_frame_nr. tp_frame_nr is set to the user-supplied (or default) buffer
size (for us it was 128k) / tp_frame_nr.
* if TPACKETV3 isn't used, tp_frame_size is set to (roughly, there is some
aligning done, and a bit additional added for one ethernet header) the
user-supplied snaplen (in our case 576). So tp_frame_nr is going to be
something > 200.
* if TPACKET3 *is* used, tp_frame_size is set to a predefined constant
MAXIMUM_SNAPLEN, which is 262144. This means that tp_frame_nr will be
131072/262144, which is 0.
So the result will be that setsockopt() is called requesting a ring buffer
that can read 0 frames. I didn't trace through the call to see exactly what
happens, but it obviously can't be good. (Right after that, tp_frame_nr is
also used in a call to malloc a buffer to hold pointers to frame headers,
and that of course would also fail if we ever got that far. I'm assuming we
don't.
Why are the frames in the ring buffer so large for TPACKETV3? The code
says this:
/* The "frames" for this are actually buffers that
* contain multiple variable-sized frames.
*
* We pick a "frame" size of 128K to leave enough
* room for at least one reasonably-sized packet
* in the "frame".
*/
This all explains why changing the buffer size to 256k works for us - that
gets tp_frame_nr to 1 (just barely!) so there is no multiplying by 0. And
because multiple packets are received into this single "frame", we still
operate correctly.
However, this all seems very arbitrary - there's an API for setting the
size of a buffer, but it's not defined anywhere in the API what are the
units of measure (that's not an exactly accurate description, but I'm too
tired to think of the right way to say it). pcap-int.h itself even admits
this in the comment above its definition of MAXIMUM_SNAPLEN:
/*
* Maximum snapshot length.
*
* Somewhat arbitrary, but chosen to be:
*
* [blah blah blah details details...]
*/
#define MAXIMUM_SNAPLEN 262144
Also, note that the comment where MAXIMUM_SNAPLEN is used in pcap-linux.c
says that it is 128k, but the actual definition sets it at 256k. :-/
This gives me 0 confidence that any value we choose would continue to be
viable in the future.
That will help the next one stumbling over this code.
FWIW, having just checked libpcap git history, the current 2 MB default
size has been there since 2008 ! I'm guessing we don't use that as we
don't want to consume lots of RAM when many guests are running.
Yeah, I'm guessing this is why Stefan chose to set the buffer size.
In the case of static IP (the learnIPAddress codepath) we open the pcap
socket, get a single packet that has an IP address, and then close the pcap
socket, so we don't normally have more than a few around at any time, and
the size of the buffer is a non-issue so it's left at the default.
For the case of dhcp snooping, though, the handle remains open for the
entire lifetime of the domain (I guess in case the address is released and
re-requested and a different address is handed out the next time). So if
we're using the default 2MB buffer, there would be 2MB for each client. On
one hand, for a system with 100 clients, that's 200MB of memory. On the
other hand, any system that's miniscule in relation to the total amount of
memory (it might reduce the capacity of a very large host by one guest at
most).
(NB: I checked this out by putting VIR_WARNs in the source at strategic
places, and found that we are actually keeping open *TWO* pcap sockets for
each domain that has CTRL_IP_LEARNING=dhcp. We should probably look into
that...)
So if everyone is okay with that extra memory usage, I'd suggest Christian
could send his original patch (deleting the pcap_set_buffer_size()) to the
list (but including the comments from my patch about the exact error
message, Resolves: blah.bugzilla.blah, etc).
I can do so if "everyone is okay with that extra memory usage" applies.
So waiting for opinions on that for now.
(Or if you don't have time, I can send it and credit you in the
commit
log).
> I recently see more and more Resolves: instead of "Fixes:" did we
change the recommended format for some tools and I missed it?
Who is "we"? :-) The overwhelming majority of these tags in libvirt commit
logs have always been Resolves:
laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Resolves: h" | wc
508 1017 33340
laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Fixes: h" | wc
9 18 654
I do notice that the few Fixes: are for bugs on launchpad (with one
exception), and the Resolves: are all
bugzilla.redhat.com and a few
opensuse.org/suse.com