On Thu, Apr 26, 2018 at 5:55 PM, Laine Stump <laine@laine.org> wrote:
On 04/26/2018 04:38 AM, Daniel P. Berrangé wrote:
On Thu, Apr 26, 2018 at 08:09:47AM +0200, Christian Ehrhardt wrote:
On Wed, Apr 25, 2018 at 11:25 PM, Laine Stump <laine@laine.org> wrote:

When an nwfilter rule sets the parameter CTRL_IP_LEARNING to "dhcp",
this turns on the "dhcpsnoop" thread, which uses libpcap to monitor
traffic on the domain's tap device and extract the IP address from the
DHCP response.

If libpcap on the host is built with TPACKET_V3 defined, the dhcpsnoop
code's initialization of the libpcap socket fails with the following
error:

  virNWFilterSnoopDHCPOpen:1134 : internal error: pcap_setfilter: can't
remove kernel filter: Bad file descriptor

It turns out that this was because libpcap with TPACKET_V3 defined
requires a larger buffer size than libvirt was setting (we were
setting it to 128k). Changing the buffer size to 256k eliminates the
error, and the dhcpsnoop thread once again works properly.

Thanks to Christian Ehrhardt <paelzer@gmail.com> for discovering that
buffer size was the problem.

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1547237
Signed-off-by: Laine Stump <laine@laine.org>
---
 src/nwfilter/nwfilter_dhcpsnoop.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/nwfilter/nwfilter_dhcpsnoop.c b/src/nwfilter/nwfilter_
dhcpsnoop.c
index 6069e70460..62eb617515 100644
--- a/src/nwfilter/nwfilter_dhcpsnoop.c
+++ b/src/nwfilter/nwfilter_dhcpsnoop.c
@@ -259,7 +259,7 @@ struct _virNWFilterDHCPDecodeJob {
  * libpcap 1.5 requires a 128kb buffer
  * 128 kb is bigger than (DHCP_PKT_BURST * PCAP_PBUFSIZE / 2)
  */

I just started building with the change for a few tests on this - no
results yet.

But we are all puzzled/unsure enough on the size that I'd already ask to
modify the comment above to explain the new size.

Maybe we should explain:
- why 128 isn't enough
- why you chose "only" 256
- why the default size might be too big
- your size considerations for many guest scenarios

Okay, so I looked into this in more detail. I don't know that I'm totally at the bottom of it, but this is what I found in libpcap:

* The function create_ring() calls setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &req ...) to setup a ring buffer for receiving packets.

* one of the attributes of req is tp_frame_size, and another is tp_frame_nr. tp_frame_nr is set to the user-supplied (or default) buffer size (for us it was 128k) / tp_frame_nr.

* if TPACKETV3 isn't used, tp_frame_size is set to (roughly, there is some aligning done, and a bit additional added for one ethernet header) the user-supplied snaplen (in our case 576). So tp_frame_nr is going to be something > 200.

* if TPACKET3 *is* used, tp_frame_size is set to a predefined constant MAXIMUM_SNAPLEN, which is 262144. This means that tp_frame_nr will be 131072/262144, which is 0.

So the result will be that setsockopt() is called requesting a ring buffer that can read 0 frames. I didn't trace through the call to see exactly what happens, but it obviously can't be good. (Right after that, tp_frame_nr is also used in a call to malloc a buffer to hold pointers to frame headers, and that of course would also fail if we ever got that far. I'm assuming we don't.

Why are the frames in the ring buffer so large for TPACKETV3? The code says this:

   /* The "frames" for this are actually buffers that
    * contain multiple variable-sized frames.
    *
    * We pick a "frame" size of 128K to leave enough
    * room for at least one reasonably-sized packet
    * in the "frame".
    */

This all explains why changing the buffer size to 256k works for us - that gets tp_frame_nr to 1 (just barely!) so there is no multiplying by 0. And because multiple packets are received into this single "frame", we still operate correctly.

However, this all seems very arbitrary - there's an API for setting the size of a buffer, but it's not defined anywhere in the API what are the units of measure (that's not an exactly accurate description, but I'm too tired to think of the right way to say it). pcap-int.h itself even admits this in the comment above its definition of MAXIMUM_SNAPLEN:

   /*
    * Maximum snapshot length.
    *
    * Somewhat arbitrary, but chosen to be:
    *
    * [blah blah blah details details...]
    */
    #define MAXIMUM_SNAPLEN       262144


Also, note that the comment where MAXIMUM_SNAPLEN is used in pcap-linux.c says that it is 128k, but the actual definition sets it at 256k. :-/

This gives me 0 confidence that any value we choose would continue to be viable in the future.

That will help the next one stumbling over this code.
FWIW, having just checked libpcap git history, the current 2 MB default
size has been there since 2008 !   I'm guessing we don't use that as we
don't want to consume lots of RAM when many guests are running.


Yeah, I'm guessing this is why Stefan chose to set the buffer size.

In the case of static IP (the learnIPAddress codepath) we open the pcap socket, get a single packet that has an IP address, and then close the pcap socket, so we don't normally have more than a few around at any time, and the size of the buffer is a non-issue so it's left at the default.

For the case of dhcp snooping, though, the handle remains open for the entire lifetime of the domain (I guess in case the address is released and re-requested and a different address is handed out the next time). So if we're using the default 2MB buffer, there would be 2MB for each client. On one hand, for a system with 100 clients, that's 200MB of memory. On the other hand, any system that's miniscule in relation to the total amount of memory (it might reduce the capacity of a very large host by one guest at most).

(NB: I checked this out by putting VIR_WARNs in the source at strategic places, and found that we are actually keeping open *TWO* pcap sockets for each domain that has CTRL_IP_LEARNING=dhcp. We should probably look into that...)

So if everyone is okay with that extra memory usage, I'd suggest Christian could send his original patch (deleting the pcap_set_buffer_size()) to the list (but including the comments from my patch about the exact error message, Resolves: blah.bugzilla.blah, etc).

I can do so if "everyone is okay with that extra memory usage" applies.
So waiting for opinions on that for now.
 
(Or if you don't have time, I can send it and credit you in the commit log).

> I recently see more and more Resolves: instead of "Fixes:" did we change the recommended format for some tools and I missed it?

Who is "we"? :-) The overwhelming majority of these tags in libvirt commit logs have always been Resolves:

laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Resolves: h" | wc
    508    1017   33340
laine@vhost2 ~/devel/libvirt (master*)>git log | grep "Fixes: h" | wc
      9      18     654

I do notice that the few Fixes: are for bugs on launchpad (with one exception), and the Resolves: are all bugzilla.redhat.com and a few opensuse.org/suse.com



--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd