[libvirt] virDomainMemoryPeek & maximum remote message buffer size

The kernel images that I want to snoop in virt-mem are around 16 MB in size. In the qemu / KVM case, these images have to travel over the remote connection. Because of limits on the maximum message size, they currently have to travel in 64 KB chunks, and it turns out that this is slow. Apparently the dominating factors are how long it takes to issue the 'memsave' command in the qemu monitor (there is some large constant overhead), and the extra network round trips.

The current remote message size is intentionally limited to 256 KB (fully serialized, including all XDR headers and overhead), so the most we could practically send in a single message at the moment is 128 KB if we stick to powers of two, or ~255 KB if we don't.

The reason we limit it is to avoid denial of service attacks, where a rogue client or server sends excessively large messages and causes the peer to allocate lots of memory [eg. if we didn't have any limit, then you could send a message which was several GB in size and cause problems at the other end, because the message is slurped in before it is fully parsed].

There is a second problem with reading the kernel in small chunks, namely that it allows the virtual machine to make a lot of progress, so we don't get anything near an 'instantaneous' snapshot (getting the kernel in a single chunk doesn't guarantee this either, but it's better).

As an experiment, I tried increasing the maximum message size to 32 MB, so that I could send the whole kernel in one go. Unfortunately just increasing the limit doesn't work, for two reasons, one prosaic and one very weird:

(1) The current code likes to keep message buffers on the stack, and because Linux limits the stack to something artificially small, this fails. Increasing the stack ulimit is a short-term fix while testing. In the long term we could rewrite any code which does this to use heap buffers instead.

(2) There is some really odd problem with our use of recv(2) which causes messages > 64 KB to fail. I have no idea what is really happening, but the sequence of events seems to be this:

    server                                client
    write(sock,buf,len) = len-k           recv(sock,buf,len) = len-k
    write(sock,buf+(len-k),k) = k
                                          recv(sock,buf,k) = 0   [NOT k]

At this point the client assumes that the server has unexpectedly closed the connection and fails. I have stared at this for a while, but I've got no idea at all what's going on.

A test program is attached. You'll need a 32-bit KVM guest.

Rich.

--
Richard Jones, Emerging Technologies, Red Hat  http://et.redhat.com/~rjones
virt-df lists disk usage of guests without needing to install any software inside the virtual machine. Supports Linux and Windows. http://et.redhat.com/~rjones/virt-df/
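For reference, the chunked transfer described above maps onto the public API roughly as follows. This is a minimal sketch, not code from virt-mem: the domain name, start address, and image size are illustrative, and error handling is reduced to the bare minimum.

    /* Sketch: fetch a ~16 MB guest kernel image in 64 KB chunks via
     * virDomainMemoryPeek.  Each chunk is one RPC round trip plus one
     * 'memsave' in the qemu monitor, which is where the constant
     * per-chunk overhead described above comes from. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libvirt/libvirt.h>

    #define CHUNK (64 * 1024)

    int main(void)
    {
        unsigned long long start = 0xc0100000ULL; /* example i386 kernel base */
        size_t total = 16 * 1024 * 1024;          /* example image size */
        size_t off;
        char *image = malloc(total);
        virConnectPtr conn = virConnectOpen("qemu:///system");
        virDomainPtr dom = conn ? virDomainLookupByName(conn, "guest") : NULL;

        if (!image || !dom) {
            fprintf(stderr, "setup failed\n");
            return 1;
        }
        for (off = 0; off < total; off += CHUNK) {
            size_t n = total - off < CHUNK ? total - off : CHUNK;
            if (virDomainMemoryPeek(dom, start + off, n,
                                    image + off, VIR_MEMORY_VIRTUAL) < 0) {
                fprintf(stderr, "peek failed at offset %zu\n", off);
                return 1;
            }
        }
        /* ... snoop the 16 MB image ... */
        virDomainFree(dom);
        virConnectClose(conn);
        free(image);
        return 0;
    }

With 256 such round trips for a 16 MB image, even a few milliseconds of fixed cost per 'memsave' adds seconds to the whole snapshot, which matches the slowness described above.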

On Wed, Jul 09, 2008 at 08:26:47PM +0100, Richard W.M. Jones wrote:
> (2) There is some really odd problem with our use of recv(2) which causes messages > 64 KB to fail. I have no idea what is really happening, but the sequence of events seems to be this:
>
>     server                                client
>     write(sock,buf,len) = len-k           recv(sock,buf,len) = len-k
>     write(sock,buf+(len-k),k) = k
>                                           recv(sock,buf,k) = 0   [NOT k]
>
> At this point the client assumes that the server has unexpectedly closed the connection and fails. I have stared at this for a while, but I've got no idea at all what's going on.
I don't think you can expect the second recv() to return exactly k, as this can get fragmented (nor expect that the first recv() would get len-k either), but if you got 0, that would mean a packet has been received and there is no more data; that would be a bug IMHO. That's strange ...

Daniel

--
Red Hat Virtualization group  http://redhat.com/virtualization/
Daniel Veillard       | virtualization library  http://libvirt.org/
veillard@redhat.com   | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/  | Rpmfind RPM search engine  http://rpmfind.net/
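Daniel's point is the general rule for stream sockets: recv() may return any prefix of the outstanding data, so both ends must loop until the expected count has arrived, and only a return of 0 signals end-of-stream. A minimal sketch of such a loop (the function name is made up):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Read exactly len bytes, tolerating short reads.  Returns len on
     * success, 0 if the peer performed an orderly shutdown before the
     * message was complete, -1 on error. */
    static ssize_t recv_full(int sock, char *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(sock, buf + got, len - got, 0);
            if (n < 0) {
                if (errno == EINTR)
                    continue;   /* interrupted by a signal: retry */
                return -1;      /* genuine error */
            }
            if (n == 0)
                return 0;       /* orderly shutdown -- the case reported above */
            got += (size_t)n;
        }
        return (ssize_t)got;
    }

A loop like this makes fragmentation harmless; the unexplained part of the report is only the 0 return, since that should never happen while the server still has k bytes in flight.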

On Wed, Jul 09, 2008 at 08:26:47PM +0100, Richard W.M. Jones wrote:
> The kernel images that I want to snoop in virt-mem are around 16 MB in size. In the qemu / KVM case, these images have to travel over the remote connection. Because of limits on the maximum message size, they currently have to travel in 64 KB chunks, and it turns out that this is slow. Apparently the dominating factors are how long it takes to issue the 'memsave' command in the qemu monitor (there is some large constant overhead), and the extra network round trips.
>
> The current remote message size is intentionally limited to 256 KB (fully serialized, including all XDR headers and overhead), so the most we could practically send in a single message at the moment is 128 KB if we stick to powers of two, or ~255 KB if we don't.
>
> The reason we limit it is to avoid denial of service attacks, where a rogue client or server sends excessively large messages and causes the peer to allocate lots of memory [eg. if we didn't have any limit, then you could send a message which was several GB in size and cause problems at the other end, because the message is slurped in before it is fully parsed].
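The shape of that defence is worth spelling out: validate the length prefix against the cap before any allocation, so a 4-byte header can never force a multi-gigabyte malloc. A sketch, with made-up names and the 256 KB figure from above:

    #include <stdint.h>
    #include <stdlib.h>

    #define MSG_MAX (256 * 1024)    /* serialized-message cap from above */

    /* Reject an oversized length prefix before allocating anything. */
    static char *alloc_message(uint32_t wire_len)
    {
        if (wire_len == 0 || wire_len > MSG_MAX)
            return NULL;            /* treat as fatal: drop the connection */
        return malloc(wire_len);    /* bounded by MSG_MAX, so safe */
    }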
> There is a second problem with reading the kernel in small chunks, namely that it allows the virtual machine to make a lot of progress, so we don't get anything near an 'instantaneous' snapshot (getting the kernel in a single chunk doesn't guarantee this either, but it's better).
>
> As an experiment, I tried increasing the maximum message size to 32 MB, so that I could send the whole kernel in one go.
>
> Unfortunately just increasing the limit doesn't work, for two reasons, one prosaic and one very weird:
>
> (1) The current code likes to keep message buffers on the stack, and because Linux limits the stack to something artificially small, this fails. Increasing the stack ulimit is a short-term fix while testing. In the long term we could rewrite any code which does this to use heap buffers instead.
Yeah, we should fix this. I've had a patch refactoring the main dispatch method pending for quite a while which dramatically reduces stack usage.
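The mechanical version of that fix, sketched here with illustrative sizes: move the buffer from an automatic array to a heap allocation, so the message limit is no longer coupled to the thread stack size.

    #include <stdlib.h>

    /* Before: ~256 KB on the stack; raising the limit to 32 MB would
     * overflow any default thread stack. */
    void dispatch_on_stack(void)
    {
        char buf[256 * 1024];
        /* ... decode and dispatch the message from buf ... */
        (void)buf;
    }

    /* After: the same buffer on the heap; only a pointer lives on the
     * stack, and allocation failure becomes a reportable error instead
     * of a crash. */
    void dispatch_on_heap(void)
    {
        char *buf = malloc(32 * 1024 * 1024);
        if (!buf)
            return; /* report ENOMEM to the caller in real code */
        /* ... decode and dispatch the message from buf ... */
        free(buf);
    }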
> (2) There is some really odd problem with our use of recv(2) which causes messages > 64 KB to fail. I have no idea what is really happening, but the sequence of events seems to be this:
>
>     server                                client
>     write(sock,buf,len) = len-k           recv(sock,buf,len) = len-k
>     write(sock,buf+(len-k),k) = k
>                                           recv(sock,buf,k) = 0   [NOT k]
Bizarre. The docs quite clearly say:

    These calls return the number of bytes received, or -1 if an error
    occurred. The return value will be 0 when the peer has performed an
    orderly shutdown.

So it's clearly thinking there's a shutdown here. Were you doing this over the UNIX socket, or TCP? If the latter, you might want to turn off all authentication and use the TCP socket to ensure none of the encryption routines are in use.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
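For anyone reproducing this, the test Daniel suggests amounts to something like the following in libvirtd.conf (standard options, shown purely as a debugging setup; auth_tcp = "none" must never be left on a real host), with libvirtd started with --listen and the client connecting via a qemu+tcp:// URI such as qemu+tcp://localhost/system:

    # /etc/libvirt/libvirtd.conf -- debugging only
    listen_tcp = 1        # enable the plain, unencrypted TCP socket
    auth_tcp = "none"     # no SASL, so no encryption layer in the read path

If the 0-byte recv() still shows up on this path, TLS/SASL record handling is ruled out and the problem would have to be in the raw socket code.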
participants (3)
- Daniel P. Berrange
- Daniel Veillard
- Richard W.M. Jones