[libvirt] [PATCH 0/4]

For https://bugzilla.redhat.com/show_bug.cgi?id=620847

We have had sporadic reports of:

  # virsh capabilities
  error: failed to get capabilities
  error: server closed connection:

This normally means that libvirtd has crashed, closing the connection,
but in this case libvirtd has always remained running. It turns out that
the capabilities XML was too large for the remote RPC message size,
which caused XDR serialization to fail, which in turn caused libvirtd to
close the client connection immediately. The cause of the oversized XML
was the node code mishandling an edge case in libnuma, where it returns
a CPU mask of all 1s to indicate a non-existent node.

Machines that exhibit this problem will show this symptom in the logs:

  # grep NUMA /var/log/messages
  Aug 16 10:30:34 sgi-xe270-01 libvirtd: 10:30:34.933: warning :
  nodeCapsInitNUMA:388 : NUMA topology for cell 1 of 2 not available, ignoring

and will have a sparse NUMA topology (i.e. empty nodes).

This series does several things:

 - Adds explicit warnings in the places where XDR serialization fails,
   so we see an indication of the problem in /var/log/messages
 - Tries to send a real remote_error back to the client, instead of
   closing its connection
 - Adds logging of the capabilities XML in libvirt.c so we can identify
   the too-large document in libvirtd
 - Adds a fix to cope with the all-1s node mask

This may also fix some other unexplained bug reports we've had with
'server closed connection' messages, or at least make it possible to
diagnose them.

Daniel
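As an aside, the underlying failure mode is easy to reproduce outside
libvirt. Here is a minimal standalone sketch (not libvirt code; the
buffer size and payload are arbitrary stand-ins) showing how XDR
encoding into a fixed-size memory buffer fails once the payload
outgrows it, which is exactly what an oversized capabilities document
did to the RPC reply:

  /* Sketch only: on modern glibc the SunRPC headers come from
   * libtirpc, so build with e.g. `gcc xdrdemo.c -ltirpc` */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <rpc/xdr.h>

  int main(void)
  {
      char buf[256];           /* stand-in for the RPC message buffer */
      XDR xdr;
      char *payload;

      /* A payload comfortably larger than the buffer, standing in
       * for an oversized capabilities XML document */
      payload = malloc(1024);
      if (!payload)
          return 1;
      memset(payload, 'x', 1023);
      payload[1023] = '\0';

      xdrmem_create(&xdr, buf, sizeof(buf), XDR_ENCODE);

      /* xdr_string() returns FALSE when the encoded data would
       * overrun the 256-byte buffer */
      if (!xdr_string(&xdr, &payload, 1024))
          fprintf(stderr, "XDR encode failed: payload too large for buffer\n");

      xdr_destroy(&xdr);
      free(payload);
      return 0;
  }

Without the warnings added in patch 1, this kind of failure inside
libvirtd was completely silent from the server's point of view.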

When libvirtd fails to serialize a message to XDR the client
connection is terminated immediately. To enable this to be diagnosed,
log the message which caused the problem on the server.

* daemon/dispatch.c: Log XDR serialization failures
---
 daemon/dispatch.c |    6 ++++++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/daemon/dispatch.c b/daemon/dispatch.c
index 8f55eaa..9d1abc0 100644
--- a/daemon/dispatch.c
+++ b/daemon/dispatch.c
@@ -194,6 +194,8 @@ remoteSerializeError(struct qemud_client *client,
         return 0;
 
 xdr_error:
+    VIR_WARN("Failed to serialize remote error '%s' as XDR",
+             rerr->message ? *rerr->message : "<unknown>");
     xdr_destroy(&xdr);
     VIR_FREE(msg);
 fatal_error:
@@ -581,6 +583,8 @@ xdr_error:
     xdr_free (data->ret_filter, (char*)&ret);
     xdr_destroy (&xdr);
 fatal_error:
+    VIR_WARN("Failed to serialize reply for program '%d' proc '%d' as XDR",
+             msg->hdr.prog, msg->hdr.proc);
     return -1;
 }
 
@@ -664,5 +668,7 @@ xdr_error:
     xdr_destroy (&xdr);
 fatal_error:
     VIR_FREE(msg);
+    VIR_WARN("Failed to serialize stream data for proc %d as XDR",
+             stream->procedure);
     return -1;
 }
-- 
1.7.2.1

When failing to serialize the normal RPC reply, try harder to send an
error message back to the client, instead of immediately closing the
connection.

* daemon/dispatch.c: Improve error messages when the RPC reply cannot
  be sent
---
 daemon/dispatch.c |   45 ++++++++++++++++++++++++++++-----------------
 1 files changed, 28 insertions(+), 17 deletions(-)

diff --git a/daemon/dispatch.c b/daemon/dispatch.c
index 9d1abc0..a24c811 100644
--- a/daemon/dispatch.c
+++ b/daemon/dispatch.c
@@ -527,7 +527,8 @@ remoteDispatchClientCall (struct qemud_server *server,
 
     if (remoteEncodeClientMessageHeader(msg) < 0) {
         xdr_free (data->ret_filter, (char*)&ret);
-        goto fatal_error;
+        remoteDispatchFormatError(&rerr, "%s", _("failed to serialize reply header"));
+        goto xdr_hdr_error;
     }
 
@@ -537,22 +538,30 @@ remoteDispatchClientCall (struct qemud_server *server,
                    msg->bufferLength,
                    XDR_ENCODE);
 
-    if (xdr_setpos(&xdr, msg->bufferOffset) == 0)
+    if (xdr_setpos(&xdr, msg->bufferOffset) == 0) {
+        remoteDispatchFormatError(&rerr, "%s", _("failed to change XDR reply offset"));
         goto xdr_error;
+    }
 
     /* If OK, serialise return structure, if error serialise error. */
     /* Serialise reply data */
-    if (!((data->ret_filter) (&xdr, &ret)))
+    if (!((data->ret_filter) (&xdr, &ret))) {
+        remoteDispatchFormatError(&rerr, "%s", _("failed to serialize reply payload (probable message size limit)"));
         goto xdr_error;
+    }
 
     /* Update the length word. */
     msg->bufferOffset += xdr_getpos (&xdr);
     len = msg->bufferOffset;
-    if (xdr_setpos (&xdr, 0) == 0)
+    if (xdr_setpos (&xdr, 0) == 0) {
+        remoteDispatchFormatError(&rerr, "%s", _("failed to change XDR reply offset"));
         goto xdr_error;
+    }
 
-    if (!xdr_u_int (&xdr, &len))
+    if (!xdr_u_int (&xdr, &len)) {
+        remoteDispatchFormatError(&rerr, "%s", _("failed to update reply length header"));
         goto xdr_error;
+    }
 
     xdr_destroy (&xdr);
 
     xdr_free (data->ret_filter, (char*)&ret);
@@ -567,25 +576,27 @@ remoteDispatchClientCall (struct qemud_server *server,
 
     return 0;
 
+xdr_error:
+    /* Bad stuff serializing reply. Try to send a little info
+     * back to client to assist in bug reporting/diagnosis */
+    xdr_free (data->ret_filter, (char*)&ret);
+    xdr_destroy (&xdr);
+    /* fallthrough */
+
+xdr_hdr_error:
+    VIR_WARN("Failed to serialize reply for program '%d' proc '%d' as XDR",
+             msg->hdr.prog, msg->hdr.proc);
+    /* fallthrough */
+
 rpc_error:
-    /* Semi-bad stuff happened, we can still try to send back
-     * an RPC error message to client */
+    /* Bad stuff (de-)serializing message, but we have an
+     * RPC error message we can send back to the client */
     rv = remoteSerializeReplyError(client, &rerr, &msg->hdr);
 
     if (rv >= 0)
         VIR_FREE(msg);
 
     return rv;
-
-xdr_error:
-    /* Seriously bad stuff happened, so we'll kill off this client
-       and not send back any RPC error */
-    xdr_free (data->ret_filter, (char*)&ret);
-    xdr_destroy (&xdr);
-fatal_error:
-    VIR_WARN("Failed to serialize reply for program '%d' proc '%d' as XDR",
-             msg->hdr.prog, msg->hdr.proc);
-    return -1;
 }
-- 
1.7.2.1
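The control-flow change is easier to see in isolation. Below is a
minimal sketch (plain C with hypothetical names; not libvirt code) of
the fallthrough-label cascade this patch introduces: each
serialization failure records an error message, then control falls
through the cleanup labels into the common path that still sends an
RPC error reply rather than dropping the client:

  #include <stdio.h>

  /* stand-in for remoteSerializeReplyError() */
  static int send_error_reply(const char *msg)
  {
      fprintf(stderr, "reply to client with error: %s\n", msg);
      return 0;
  }

  static int dispatch(int header_ok, int payload_ok)
  {
      const char *err = NULL;

      if (!header_ok) {
          err = "failed to serialize reply header";
          goto hdr_error;
      }
      if (!payload_ok) {
          err = "failed to serialize reply payload";
          goto payload_error;
      }
      return 0;                /* reply sent normally */

  payload_error:
      /* free payload-encoding state here, then fall through */
  hdr_error:
      fprintf(stderr, "warning: XDR serialization failed\n");
      /* fall through: still try to tell the client what went wrong */
      return send_error_reply(err);
  }

  int main(void)
  {
      return dispatch(1, 0);
  }

Only a failure to serialize the error reply itself now results in the
connection being closed.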

Enabling debug doesn't show the capabilities XML for a connection.
Add an extra debug statement for the return value.

* src/libvirt.c: Enable debug logging of capabilities XML
---
 src/libvirt.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/src/libvirt.c b/src/libvirt.c
index c40f20f..e124bcc 100644
--- a/src/libvirt.c
+++ b/src/libvirt.c
@@ -4047,6 +4047,7 @@ virConnectGetCapabilities (virConnectPtr conn)
 
     ret = conn->driver->getCapabilities (conn);
     if (!ret)
         goto error;
+    DEBUG("conn=%p ret=%s", conn, ret);
     return ret;
 }
-- 
1.7.2.1
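For reference, this is the public entry point involved. A minimal
client sketch (the qemu:///system URI is just an example; build with
`gcc caps.c -lvirt`) showing where the newly logged document comes
from:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      virConnectPtr conn;
      char *caps;

      conn = virConnectOpenReadOnly("qemu:///system");
      if (!conn)
          return 1;

      /* With debug logging enabled, the new DEBUG() statement now
       * records this XML, so an oversized document can be spotted
       * on the server side */
      caps = virConnectGetCapabilities(conn);
      if (caps) {
          printf("capabilities XML is %zu bytes\n", strlen(caps));
          free(caps);
      }

      virConnectClose(conn);
      return 0;
  }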

When finding a sparse NUMA topology, libnuma will return ENOENT the
first time it is invoked. On subsequent invocations it will return
success, but with an all-1's CPU mask. Check for this, to avoid
polluting the capabilities XML with 4096 bogus CPUs.

* src/nodeinfo.c: Check for all-1s CPU mask
---
 src/nodeinfo.c |   14 +++++++++++++-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/src/nodeinfo.c b/src/nodeinfo.c
index 982240a..9aba025 100644
--- a/src/nodeinfo.c
+++ b/src/nodeinfo.c
@@ -369,6 +369,7 @@ nodeCapsInitNUMA(virCapsPtr caps)
 {
     int n;
     unsigned long *mask = NULL;
+    unsigned long *allonesmask = NULL;
     int *cpus = NULL;
     int ret = -1;
     int max_n_cpus = NUMA_MAX_N_CPUS;
@@ -379,13 +380,23 @@ nodeCapsInitNUMA(virCapsPtr caps)
     int mask_n_bytes = max_n_cpus / 8;
 
     if (VIR_ALLOC_N(mask, mask_n_bytes / sizeof *mask) < 0)
         goto cleanup;
+    if (VIR_ALLOC_N(allonesmask, mask_n_bytes / sizeof *mask) < 0)
+        goto cleanup;
+    memset(allonesmask, 0xff, mask_n_bytes);
 
     for (n = 0 ; n <= numa_max_node() ; n++) {
         int i;
         int ncpus;
+        /* The first time this returns -1, ENOENT if node doesn't exist... */
         if (numa_node_to_cpus(n, mask, mask_n_bytes) < 0) {
             VIR_WARN("NUMA topology for cell %d of %d not available, ignoring",
-                     n, numa_max_node());
+                     n, numa_max_node()+1);
+            continue;
+        }
+        /* second, third... times it returns an all-1's mask */
+        if (memcmp(mask, allonesmask, mask_n_bytes) == 0) {
+            VIR_DEBUG("NUMA topology for cell %d of %d is all ones, ignoring",
+                      n, numa_max_node()+1);
             continue;
         }
 
@@ -414,6 +425,7 @@ nodeCapsInitNUMA(virCapsPtr caps)
 cleanup:
     VIR_FREE(cpus);
     VIR_FREE(mask);
+    VIR_FREE(allonesmask);
     return ret;
 }
-- 
1.7.2.1
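The libnuma behaviour being worked around can be probed directly on an
affected machine. A minimal sketch (assumes the legacy v1
numa_node_to_cpus(node, mask, bytes) API that nodeinfo.c uses; the
node number is taken from the bug report; build with
`gcc probe.c -lnuma`):

  #define NUMA_VERSION1_COMPATIBILITY 1
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>
  #include <numa.h>

  int main(void)
  {
      int node = 1;  /* cell 1 is the empty node in the bug report */
      int attempt;
      unsigned long mask[64], allones[64];  /* 4096 bits, as in nodeinfo.c */

      memset(allones, 0xff, sizeof(allones));

      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support on this host\n");
          return 1;
      }

      /* First call for a missing node should fail with ENOENT;
       * later calls succeed but hand back a mask of all 1s */
      for (attempt = 0; attempt < 2; attempt++) {
          memset(mask, 0, sizeof(mask));
          if (numa_node_to_cpus(node, mask, sizeof(mask)) < 0)
              printf("attempt %d: error, errno=%d (ENOENT=%d)\n",
                     attempt, errno, ENOENT);
          else if (memcmp(mask, allones, sizeof(mask)) == 0)
              printf("attempt %d: bogus all-1s mask\n", attempt);
          else
              printf("attempt %d: valid mask\n", attempt);
      }
      return 0;
  }

Without the memcmp() check in the patch, the second form of the bug
would materialize as 4096 phantom CPUs in the capabilities XML.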

On Tue, Aug 17, 2010 at 05:00:22PM +0100, Daniel P. Berrange wrote:
For
https://bugzilla.redhat.com/show_bug.cgi?id=620847
[...]
ACK for the 4 patches, as for upstream,

Daniel

-- 
Daniel Veillard      | libxml Gnome XML XSLT toolkit  http://xmlsoft.org/
daniel@veillard.com  | Rpmfind RPM search engine  http://rpmfind.net/
http://veillard.com/ | virtualization library  http://libvirt.org/

On 08/17/2010 03:10 PM, Daniel Veillard wrote:
On Tue, Aug 17, 2010 at 05:00:22PM +0100, Daniel P. Berrange wrote:
For
ACK for the 4 patches, as for upstream,
This _is_ upstream; and a duplicate of the earlier posting of the same
series here:
https://www.redhat.com/archives/libvir-list/2010-August/msg00352.html
:)

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org