On 04/26/2018 04:38 AM, Daniel P. Berrangé wrote:
On Thu, Apr 26, 2018 at 08:09:47AM +0200, Christian Ehrhardt wrote:
> On Wed, Apr 25, 2018 at 11:25 PM, Laine Stump <laine(a)laine.org> wrote:
>
>> When an nwfilter rule sets the parameter CTRL_IP_LEARNING to "dhcp",
>> this turns on the "dhcpsnoop" thread, which uses libpcap to monitor
>> traffic on the domain's tap device and extract the IP address from the
>> DHCP response.
>>
>> If libpcap on the host is built with TPACKET_V3 defined, the dhcpsnoop
>> code's initialization of the libpcap socket fails with the following
>> error:
>>
>> virNWFilterSnoopDHCPOpen:1134 : internal error: pcap_setfilter: can't
>> remove kernel filter: Bad file descriptor
>>
>> It turns out that this was because libpcap with TPACKET_V3 defined
>> requires a larger buffer size than libvirt was setting (we were
>> setting it to 128k). Changing the buffer size to 256k eliminates the
>> error, and the dhcpsnoop thread once again works properly.
>>
>> Thanks to Christian Ehrhardt <paelzer(a)gmail.com> for discovering that
>> buffer size was the problem.
>>
>> Resolves:
https://bugzilla.redhat.com/show_bug.cgi?id=1547237
>> Signed-off-by: Laine Stump <laine(a)laine.org>
>> ---
>> src/nwfilter/nwfilter_dhcpsnoop.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/src/nwfilter/nwfilter_dhcpsnoop.c b/src/nwfilter/nwfilter_
>> dhcpsnoop.c
>> index 6069e70460..62eb617515 100644
>> --- a/src/nwfilter/nwfilter_dhcpsnoop.c
>> +++ b/src/nwfilter/nwfilter_dhcpsnoop.c
>> @@ -259,7 +259,7 @@ struct _virNWFilterDHCPDecodeJob {
>> * libpcap 1.5 requires a 128kb buffer
>> * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
>> */
>>
> I just started building with the change for a few tests on this - no
> results yet.
>
> But we are all puzzled/unsure enough on the size that I'd already ask to
> modify the comment above to explain the new size.
>
> Maybe we should explain:
> - why 128 isn't enough
> - why you chose "only" 256
> - why the default size might be too big
> - your size considerations for many guest scenarios
Okay, so I looked into this in more detail. I don't know that I'm
totally at the bottom of it, but this is what I found in libpcap:
* The function create_ring() calls setsockopt(fd, SOL_PACKET,
PACKET_RX_RING, &req ...) to setup a ring buffer for receiving packets.
* one of the attributes of req is tp_frame_size, and another is
tp_frame_nr. tp_frame_nr is set to the user-supplied (or default) buffer
size (for us it was 128k) / tp_frame_nr.
* if TPACKETV3 isn't used, tp_frame_size is set to (roughly, there is
some aligning done, and a bit additional added for one ethernet header)
the user-supplied snaplen (in our case 576). So tp_frame_nr is going to
be something > 200.
* if TPACKET3 *is* used, tp_frame_size is set to a predefined constant
MAXIMUM_SNAPLEN, which is 262144. This means that tp_frame_nr will be
131072/262144, which is 0.
So the result will be that setsockopt() is called requesting a ring
buffer that can read 0 frames. I didn't trace through the call to see
exactly what happens, but it obviously can't be good. (Right after that,
tp_frame_nr is also used in a call to malloc a buffer to hold pointers
to frame headers, and that of course would also fail if we ever got that
far. I'm assuming we don't.
Why are the frames in the ring buffer so large for TPACKETV3? The code
says this:
/* The "frames" for this are actually buffers that
* contain multiple variable-sized frames.
*
* We pick a "frame" size of 128K to leave enough
* room for at least one reasonably-sized packet
* in the "frame".
*/
This all explains why changing the buffer size to 256k works for us -
that gets tp_frame_nr to 1 (just barely!) so there is no multiplying by
0. And because multiple packets are received into this single "frame",
we still operate correctly.
However, this all seems very arbitrary - there's an API for setting the
size of a buffer, but it's not defined anywhere in the API what are the
units of measure (that's not an exactly accurate description, but I'm
too tired to think of the right way to say it). pcap-int.h itself even
admits this in the comment above its definition of MAXIMUM_SNAPLEN:
/*
* Maximum snapshot length.
*
* Somewhat arbitrary, but chosen to be:
*
* [blah blah blah details details...]
*/
#define MAXIMUM_SNAPLEN 262144
Also, note that the comment where MAXIMUM_SNAPLEN is used in
pcap-linux.c says that it is 128k, but the actual definition sets it at
256k. :-/
This gives me 0 confidence that any value we choose would continue to be
viable in the future.
> That will help the next one stumbling over this code.
FWIW, having just checked libpcap git history, the current 2 MB default
size has been there since 2008 ! I'm guessing we don't use that as we
don't want to consume lots of RAM when many guests are running.
Yeah, I'm guessing this is why Stefan chose to set the buffer size.
In the case of static IP (the learnIPAddress codepath) we open the pcap
socket, get a single packet that has an IP address, and then close the
pcap socket, so we don't normally have more than a few around at any
time, and the size of the buffer is a non-issue so it's left at the default.
For the case of dhcp snooping, though, the handle remains open for the
entire lifetime of the domain (I guess in case the address is released
and re-requested and a different address is handed out the next time).
So if we're using the default 2MB buffer, there would be 2MB for each
client. On one hand, for a system with 100 clients, that's 200MB of
memory. On the other hand, any system that's miniscule in relation to
the total amount of memory (it might reduce the capacity of a very large
host by one guest at most).
(NB: I checked this out by putting VIR_WARNs in the source at strategic
places, and found that we are actually keeping open *TWO* pcap sockets
for each domain that has CTRL_IP_LEARNING=dhcp. We should probably look
into that...)
So if everyone is okay with that extra memory usage, I'd suggest
Christian could send his original patch (deleting the
pcap_set_buffer_size()) to the list (but including the comments from my
patch about the exact error message, Resolves: blah.bugzilla.blah, etc).
(Or if you don't have time, I can send it and credit you in the commit log).
I recently see more and more Resolves: instead of "Fixes:"
did we
change the recommended format for some tools and I missed it?
Who is "we"? :-) The overwhelming majority of these tags in libvirt
commit logs have always been Resolves:
laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Resolves: h" | wc
508 1017 33340
laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Fixes: h" | wc
9 18 654
I do notice that the few Fixes: are for bugs on launchpad (with one
exception), and the Resolves: are all
bugzilla.redhat.com and a few
opensuse.org/suse.com