On Tue, Jan 25, 2022 at 02:18:29PM -0500, Masayoshi Mizuma wrote:
Hello,
I found an issue that libvirt isn't close an unix domain socket to connect to the qemu
monitor
even after the VM is destroyed.
This issue happens since commit 695bdb3841 ("src: ensure GSource background unref
happens in correct event loop")
on the system whose glib version is 2.56.
I would appreciate it if you could give any ideas to solve the issue.
My very rough and fast guess is that this happens due to our workaround
while running on a system on which the fix was actually backported. You
can remove the workaround (which is implemented for glib < 2.64.0), keep
only the new code and see if it reproduces. I know it's not very
scientific, but we already faced so many workarounds for old buggy glib
that it's tiresome to try and fix all that when it goes away with update
of glib. If the above does not help, then I would look at what could
actually happening (maybe the unref does not get scheduled because the
event loop it is requested on does not run any more, but that's just a
guess).
The socket is allocated in qemuMonitorOpenUnix(), and used by the
monitor->socket
and monitor->watch:
qemuMonitorOpen
qemuMonitorOpenUnix
if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) {
qemuMonitorOpenInternal
mon->socket = g_socket_new_from_fd(fd, &gerr);
qemuMonitorRegister
mon->watch = g_socket_create_source(mon->socket,
Usually, the socket is closed when the reference counter of the glib object
(mon->socket and mon->watch) gets 0:
qemuMonitorClose
qemuMonitorUnregister
vir_g_source_unref(mon->watch, mon->context);
g_source_set_callback(idle, virEventGLibSourceUnrefIdle, src, NULL);
virEventGLibSourceUnrefIdle
g_source_unref(src); <== unref monitor->watch
g_object_unref(mon->socket); <== unref monitor->socket
It seems that the callback virEventGLibSourceUnrefIdle() to unref the monitor->watch
doesn't
work when qemuMonitorUnregister() is called via qemuProcessStop(), so the socket isn't
closed.
I'm not sure why the callback doesn't work at that time. I suspect that the VM is
closing
so the main loop of the monitor doesn't work any more.
We can close the socket to add g_socket_close() before unref the mon->socket, however,
it may remain a memory leak because of mon->watch (GSource object), so probably
it isn't a good idea to close the socket.
We can unref the mon->watch to set NULL to the second argument of vir_g_source_unref()
because the default main loop still works at that time, however, I'm not
sure it's an appropriate way to avoid the gobject issue which the commit solves...
I found this issue on the qemu monitor, and probably the qemu agent has the same issue
because the socket procedure is almost the same as the monitor.
I would appreciate it if you could give any ideas to solve this issue.
Following is to observe the callback working:
---
diff --git a/src/util/glibcompat.c b/src/util/glibcompat.c
index eb6dcc0111..b8b1770424 100644
--- a/src/util/glibcompat.c
+++ b/src/util/glibcompat.c
@@ -24,6 +24,9 @@
#include "glibcompat.h"
+#include "virlog.h"
+
+VIR_LOG_INIT("util.glibcompat");
/*
* Note that because of the GLIB_VERSION_MAX_ALLOWED constant in
* config-post.h, allowing use of functions from newer GLib via
@@ -244,6 +247,7 @@ virEventGLibSourceUnrefIdle(gpointer data)
GSource *src = data;
g_source_unref(src);
+ VIR_DEBUG("unrefed: %p", src);
return FALSE;
}
@@ -257,6 +261,7 @@ void vir_g_source_unref(GSource *src, GMainContext *ctx)
g_source_attach(idle, ctx);
g_source_unref(idle);
+ VIR_DEBUG("unref registered: %p ctx: %p", src, ctx);
}
#endif
---
Case the mon->watch (0xffff28008af0) is unreffed correctly
(via qemuMonitorUpdateWatch()):
18:54:15.403+0000: 16845: debug : qemuMonitorEmitResume:1159 : mon=0xffff683ac020
18:54:15.403+0000: 16845: debug : qemuProcessHandleResume:713 : Transitioned guest test-vm
into running state, reason 'booted', event detail 0
18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered:
0xffff28008af0 ctx: 0xffff780169a0
18:54:15.404+0000: 16845: debug : qemuMonitorJSONIOProcessLine:222 : Line
[{"return": {}, "id": "libvirt-10"}]
18:54:15.404+0000: 16845: info : qemuMonitorJSONIOProcessLine:242 :
QEMU_MONITOR_RECV_REPLY: mon=0xffff683ac020 reply={"return": {}, "id":
"libvirt-10"}
18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered:
0xffff2819a260 ctx: 0xffff780169a0
18:54:15.404+0000: 16845: debug : virEventGLibSourceUnrefIdle:250 : unrefed:
0xffff28008af0
Case the mon->watch (0xffff7802bb30) isn't unreffed
(via qemuProcessStop()):
18:54:15.642+0000: 16589: debug : qemuProcessStop:8008 : Shutting down vm=0xaaaad40edec0
name=test-vm id=3 pid=16842, reason=destroyed, asyncJob=none, flags=0x0
18:54:15.642+0000: 16589: debug : qemuDomainLogAppendMessage:6733 : Append log message
(vm='test-vm' message='2022-01-25 18:54:15.642+0000: shutting down,
reason=destroyed) stdioLogD=1
18:54:15.643+0000: 16589: info : qemuMonitorClose:834 : QEMU_MONITOR_CLOSE:
mon=0xffff683ac020
18:54:15.643+0000: 16589: debug : vir_g_source_unref:264 : unref registered:
0xffff7802bb30 ctx: 0xffff780169a0
18:54:15.643+0000: 16845: debug : qemuMonitorJSONIOProcessEvent:209 : handle SHUTDOWN
handler=0xffff4ef057c0 data=0xffff28007da0
18:54:15.643+0000: 16845: debug : qemuMonitorEmitShutdown:1132 : mon=0xffff683ac020
guest=2
18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:572 : vm=0xaaaad40edec0
18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:592 : Transitioned guest
test-vm to shutdown state
18:54:15.643+0000: 16845: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm
pid=16842 flags=0x2
18:54:15.643+0000: 16845: debug : qemuMonitorDispose:216 : mon=0xffff683ac020
18:54:15.643+0000: 16589: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm
pid=16842 flags=0x5
18:54:15.643+0000: 16589: debug : qemuDomainCleanupRun:7313 : driver=0xffff281441b0,
vm=test-vm
18:54:15.644+0000: 16589: debug : qemuProcessAutoDestroyRemove:8395 : vm=test-vm
18:54:15.645+0000: 16589: debug : qemuRemoveCgroup:1221 : Failed to terminate cgroup for
test-vm
18:54:15.647+0000: 16589: debug : qemuDomainObjEndJob:1141 : Stopping job: destroy
(async=none vm=0xaaaad40edec0 name=test-vm)
Thanks!
Masa