[libvirt] virNetClientPtr leak in remote driver

doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.

    static virNetClientPtr virNetClientNew(virNetSocketPtr sock,
                                           const char *hostname)
    {
        [...]
        client->refs = 1;
        [...]
        /* Set up a callback to listen on the socket data */
        client->refs++;
        if (virNetSocketAddIOCallback(client->sock,
                                      VIR_EVENT_HANDLE_READABLE,
                                      virNetClientIncomingEvent,
                                      client,
                                      virNetClientEventFree) < 0) {
            client->refs--;
            VIR_DEBUG("Failed to add event watch, disabling events");
        }
        [...]
    }

virNetClientNew adds a ref before calling virNetSocketAddIOCallback but only removes it when virNetSocketAddIOCallback fails. This seems wrong to me, and the ref should be removed in the success case too.

The same pattern was added in 0302391ee643ad91fdc6d2ecf7e4cf0fc9724516 (Fix race in ref counting when handling RPC jobs):

    --- a/src/rpc/virnetserverclient.c
    +++ b/src/rpc/virnetserverclient.c
    @@ -763,10 +763,12 @@ readmore:

         /* Send off to for normal dispatch to workers */
         if (msg) {
    +        client->refs++;
             if (!client->dispatchFunc ||
                 client->dispatchFunc(client, msg, client->dispatchOpaque) < 0) {
                 virNetMessageFree(msg);
                 client->wantClose = true;
    +            client->refs--;
                 return;
             }
         }

Again, this seems wrong, and the ref should be removed in the success case here too. Before spending time figuring out how the refcounting is supposed to work here, I'm just reporting it in the hope that Dan can easily fix this.

--
Matthias Bolte
http://photron.blogspot.com
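To make the reported failure mode concrete, here is a toy model of the lifecycle described above. This is deliberately not libvirt code; the struct and helper names are invented for illustration only:

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int refs; } Client;

    static Client *clientNew(void)
    {
        Client *client = calloc(1, sizeof(*client));
        client->refs = 1;   /* the caller's ref */
        client->refs++;     /* the ref handed to the socket I/O callback */
        return client;
    }

    static void clientFree(Client *client)
    {
        if (--client->refs > 0)
            return;         /* still referenced, nothing is freed */
        free(client);
    }

    int main(void)
    {
        Client *client = clientNew();   /* refs == 2 */
        clientFree(client);             /* refs == 1: nobody ever drops the
                                         * I/O callback's ref, so the
                                         * client leaks */
        printf("leaked with refs == %d\n", client->refs);
        return 0;
    }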

2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
    static virNetClientPtr virNetClientNew(virNetSocketPtr sock,
                                           const char *hostname)
    {
        [...]
        client->refs = 1;
        [...]
        /* Set up a callback to listen on the socket data */
        client->refs++;
        if (virNetSocketAddIOCallback(client->sock,
                                      VIR_EVENT_HANDLE_READABLE,
                                      virNetClientIncomingEvent,
                                      client,
                                      virNetClientEventFree) < 0) {
            client->refs--;
            VIR_DEBUG("Failed to add event watch, disabling events");
        }
        [...]
    }
virNetClientNew adds a ref before calling virNetSocketAddIOCallback but only removes it when virNetSocketAddIOCallback fails. This seems wrong to me, and the ref should be removed in the success case too.
The same pattern was added in 0302391ee643ad91fdc6d2ecf7e4cf0fc9724516 (Fix race in ref counting when handling RPC jobs)
    --- a/src/rpc/virnetserverclient.c
    +++ b/src/rpc/virnetserverclient.c
    @@ -763,10 +763,12 @@ readmore:

         /* Send off to for normal dispatch to workers */
         if (msg) {
    +        client->refs++;
             if (!client->dispatchFunc ||
                 client->dispatchFunc(client, msg, client->dispatchOpaque) < 0) {
                 virNetMessageFree(msg);
                 client->wantClose = true;
    +            client->refs--;
                 return;
             }
         }
Again, this seems wrong, and the ref should be removed in the success case here too. Before spending time figuring out how the refcounting is supposed to work here, I'm just reporting it in the hope that Dan can easily fix this.
Okay, after some discussion on IRC with Michal and Eric, and further testing, I think that the ref counting is okay here. virNetClientNew adds the second ref because the free callback (virNetClientEventFree) passed to virNetSocketAddIOCallback will call virNetClientFree.

Another observation is that virNetClientFree calls virNetSocketRemoveIOCallback when the last ref was removed. Actually that's pointless, because virNetSocketRemoveIOCallback might indirectly remove the last ref. So this looks like an ordering problem. The ordering gets fixed by calling virNetClientClose before virNetClientFree, because virNetClientClose calls virNetSocketRemoveIOCallback.

Another problem is that virNetSocketRemoveIOCallback only indirectly results in a call to virNetClientFree, because the event loop will call virNetClientFree when removing callbacks marked for deletion -- virNetSocketRemoveIOCallback does this marking.

The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.

I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.

To me the general problem seems to be the entanglement between the virNetClientPtr refcounting and the event loop. I'm not sure how to improve this in general. Maybe we should have a thread calling virEventRunDefaultImpl, as the comment on virEventRunDefaultImpl suggests.

--
Matthias Bolte
http://photron.blogspot.com
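For reference, a minimal sketch of that virsh-style workaround, using only the public libvirt API (virEventRegisterDefaultImpl, virConnectOpen, virConnectClose, virEventRunDefaultImpl). The single extra loop iteration after close is the workaround, and as said above it is not the correct general solution:

    #include <libvirt/libvirt.h>
    #include <stdio.h>

    int main(void)
    {
        virConnectPtr conn;

        if (virEventRegisterDefaultImpl() < 0)
            return 1;

        conn = virConnectOpen("qemu:///system");
        if (!conn)
            return 1;

        /* ... use the connection ... */

        virConnectClose(conn);

        /* Drive one event loop iteration so callbacks marked for
         * deletion by virNetSocketRemoveIOCallback are purged,
         * dropping the last ref on the virNetClientPtr. Caveat: this
         * may block if no event is pending, which is why it is not a
         * general fix. */
        if (virEventRunDefaultImpl() < 0)
            fprintf(stderr, "event loop iteration failed\n");

        return 0;
    }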

On 07/28/2011 12:07 PM, Matthias Bolte wrote:
2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.
I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.
Where do we stand with 0.9.4? Is this something where we need the general fix before the release, or is the virsh hack of calling virEventRunDefaultImpl good enough for the release, or is this problem not severe enough and we can release as-is?
To me the general problem seems to be the entanglement between the virNetClientPtr refcounting and the event loop. I'm not sure how to improve this in general. Maybe we should have a thread calling virEventRunDefaultImpl, as the comment on virEventRunDefaultImpl suggests.
I'm hoping danpb will be able to chime in soon.

--
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org

2011/8/1 Eric Blake <eblake@redhat.com>:
On 07/28/2011 12:07 PM, Matthias Bolte wrote:
2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.
I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.
Where do we stand with 0.9.4? Is this something where we need the general fix before the release, or is the virsh hack of calling virEventRunDefaultImpl good enough for the release, or is this problem not severe enough and we can release as-is?
virsh is a bit special here, as it registers/initializes the default event loop but doesn't drive it properly; it only drives it for the console command. That's the problem in virsh. This can be avoided by calling virEventRunDefaultImpl after virConnectClose, but only if it's a remote connection; in other cases virEventRunDefaultImpl might block. Therefore, we shouldn't call it in general after a virConnectClose in virsh.

If we assume that an application that registers an event loop will drive it properly, then this problem is not critical, as it doesn't exist there; virsh is just a special case that leaks 260kb per closed remote connection. If we assume that a typical virsh user opens a single connection per virsh invocation, then we can view this as a static leak.

Therefore, I'd say we could just leave this as-is for 0.9.4, unless Dan is back soon and can easily fix the problem.

Another problem is the libvirtd crash that Wen discovered. That's more serious, as a misbehaving client shouldn't be able to crash libvirtd.

https://www.redhat.com/archives/libvir-list/2011-August/msg00005.html

--
Matthias Bolte
http://photron.blogspot.com

On Mon, Aug 01, 2011 at 06:35:21PM +0200, Matthias Bolte wrote:
2011/8/1 Eric Blake <eblake@redhat.com>:
On 07/28/2011 12:07 PM, Matthias Bolte wrote:
2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.
I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.
Where do we stand with 0.9.4? Is this something where we need the general fix before the release, or is the virsh hack of calling virEventRunDefaultImpl good enough for the release, or is this problem not severe enough and we can release as-is?
virsh is a bit special here, as it registers/initializes the default event loop but doesn't drive it properly; it only drives it for the console command. That's the problem in virsh. This can be avoided by calling virEventRunDefaultImpl after virConnectClose, but only if it's a remote connection; in other cases virEventRunDefaultImpl might block. Therefore, we shouldn't call it in general after a virConnectClose in virsh.
If we assume that an application that registers an event loop will drive it properly, then this problem is not critical, as it doesn't exist there; virsh is just a special case that leaks 260kb per closed remote connection. If we assume that a typical virsh user opens a single connection per virsh invocation, then we can view this as a static leak.
Yeah, this is a case I never thought of. The "easy" fix is to just make virsh spawn a new thread to run the event loop in the background. The "hard" fix is to make virsh I/O entirely event driven, so that we don't just sit in a blocking read of stdin waiting for input, but instead use the event loop to process stdin.

Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
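A rough sketch of that "easy" fix, assuming a plain pthread worker; the helper names are hypothetical and this is not the actual virsh patch:

    #include <libvirt/libvirt.h>
    #include <pthread.h>
    #include <stdbool.h>

    /* Quit flag; left unsynchronized here for brevity. */
    static bool eventLoopQuit = false;

    /* Background thread: drives the default event loop until told to
     * quit, so I/O callbacks marked for deletion get purged promptly. */
    static void *eventLoopWorker(void *opaque)
    {
        (void)opaque;
        while (!eventLoopQuit) {
            if (virEventRunDefaultImpl() < 0)
                break;
        }
        return NULL;
    }

    /* Call once at startup, after virEventRegisterDefaultImpl(). */
    static int startEventLoopThread(pthread_t *thread)
    {
        return pthread_create(thread, NULL, eventLoopWorker, NULL);
    }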

2011/8/2 Daniel P. Berrange <berrange@redhat.com>:
On Mon, Aug 01, 2011 at 06:35:21PM +0200, Matthias Bolte wrote:
2011/8/1 Eric Blake <eblake@redhat.com>:
On 07/28/2011 12:07 PM, Matthias Bolte wrote:
2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.
I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.
Where do we stand with 0.9.4? Is this something where we need the general fix before the release, or is the virsh hack of calling virEventRunDefaultImpl good enough for the release, or is this problem not severe enough and we can release as-is?
virsh is a bit special here, as it registers/initializes the default event loop but doesn't drive it properly; it only drives it for the console command. That's the problem in virsh. This can be avoided by calling virEventRunDefaultImpl after virConnectClose, but only if it's a remote connection; in other cases virEventRunDefaultImpl might block. Therefore, we shouldn't call it in general after a virConnectClose in virsh.
If we assume that an application that registers an event loop will drive it properly, then this problem is not critical, as it doesn't exist there; virsh is just a special case that leaks 260kb per closed remote connection. If we assume that a typical virsh user opens a single connection per virsh invocation, then we can view this as a static leak.
Yeah, this is a case I never thought of. The "easy" fix is to just make virsh spawn a new thread to run the event loop in the background.
The problem here will probably be the console command, with these lines in console.c:

    while (!con->quit) {
        if (virEventRunDefaultImpl() < 0)
            break;
    }

Having two threads calling virEventRunDefaultImpl probably isn't a good idea.
The "hard" fix is to make virsh I/O entirely event driven, so that we don't just sit in a blocking read of stdin waiting for input, but instead use the event loop to process stdin.
Daniel
--
Matthias Bolte
http://photron.blogspot.com

On Thu, Aug 04, 2011 at 05:26:31PM +0200, Matthias Bolte wrote:
2011/8/2 Daniel P. Berrange <berrange@redhat.com>:
On Mon, Aug 01, 2011 at 06:35:21PM +0200, Matthias Bolte wrote:
2011/8/1 Eric Blake <eblake@redhat.com>:
On 07/28/2011 12:07 PM, Matthias Bolte wrote:
2011/7/27 Matthias Bolte <matthias.bolte@googlemail.com>:
doRemoteClose doesn't free the virNetClientPtr, and this creates a 260kb leak per remote connection. This happens because virNetClientFree doesn't remove the last ref: virNetClientNew creates the virNetClientPtr with a refcount of 2.
The memory leak I saw was due to virsh calling virEventRegisterDefaultImpl, but not calling virEventRunDefaultImpl. Because the event loop is initialized, the call to virNetSocketAddIOCallback in virNetClientNew succeeds. But virsh not driving the event loop results in not removing callbacks that were marked for deletion. In the end, this leaks the virNetClientPtr used in the remote driver. I used virsh -c qemu:///system to test.
I was able to fix this by calling virEventRunDefaultImpl after virConnectClose in virsh. But I don't think that this is the correct general solution.
Where do we stand with 0.9.4? Is this something where we need the general fix before the release, or is the virsh hack of calling virEventRunDefaultImpl good enough for the release, or is this problem not severe enough and we can release as-is?
virsh is a bit special here, as it registers/initializes the default event loop but doesn't drive it properly; it only drives it for the console command. That's the problem in virsh. This can be avoided by calling virEventRunDefaultImpl after virConnectClose, but only if it's a remote connection; in other cases virEventRunDefaultImpl might block. Therefore, we shouldn't call it in general after a virConnectClose in virsh.
If we assume that an application that registers an event loop will drive it properly, then this problem is not critical, as it doesn't exist there; virsh is just a special case that leaks 260kb per closed remote connection. If we assume that a typical virsh user opens a single connection per virsh invocation, then we can view this as a static leak.
Yeah, this is a case I never thought of. The "easy" fix is to just make virsh spawn a new thread to run the event loop in the background.
The problem here will probably be the console command, with these lines in console.c:

    while (!con->quit) {
        if (virEventRunDefaultImpl() < 0)
            break;
    }

Having two threads calling virEventRunDefaultImpl probably isn't a good idea.
Oh yes, we'd have to change that code too, to delegate to the background event loop thread.
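A sketch of what that delegation could look like: the console command blocks on a condition variable instead of calling virEventRunDefaultImpl itself, and the console's I/O callbacks, now running in the background event loop thread, signal it when the session ends. All names here are hypothetical:

    #include <pthread.h>
    #include <stdbool.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        bool quit;
    } vshConsole;

    /* Runs in the command's thread; all console I/O is handled by
     * callbacks in the background event loop thread. */
    static void vshConsoleWait(vshConsole *con)
    {
        pthread_mutex_lock(&con->lock);
        while (!con->quit)
            pthread_cond_wait(&con->cond, &con->lock);
        pthread_mutex_unlock(&con->lock);
    }

    /* Called from an I/O callback in the event loop thread when the
     * console session ends. */
    static void vshConsoleQuit(vshConsole *con)
    {
        pthread_mutex_lock(&con->lock);
        con->quit = true;
        pthread_cond_signal(&con->cond);
        pthread_mutex_unlock(&con->lock);
    }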
The "hard" fix is to make virsh I/O entirely event driven, so that we don't just sit in a blocking read of stdin waiting for input, but instead use the event loop to process stdin.
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|