For
https://bugzilla.redhat.com/show_bug.cgi?id=620847
We have had sporadic reports of

  # virsh capabilities
  error: failed to get capabilities
  error: server closed connection:

This normally means that libvirtd has crashed, closing the connection,
but in this case libvirtd has always remained running. It turns out
that the capabilities XML was too large for the remote RPC message
size, so XDR serialization failed and libvirtd closed the client
connection immediately. The XML was so large because we were not
handling an edge case in libnuma, where it returns a CPU mask of
all-1s to indicate a non-existent node.
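
To illustrate the edge case, here is a minimal sketch of detecting an
all-1s CPU mask and treating the node as empty. It is not the code from
this series: mask_is_all_ones() and node_cpu_count() are hypothetical
helpers, and the mask is assumed to be a plain array of unsigned long
words rather than whatever type libnuma hands back.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a libvirt or libnuma function.
 * Returns true when every bit of every word in the mask is set. */
static bool mask_is_all_ones(const unsigned long *mask, size_t nwords)
{
    assert(mask != NULL);
    for (size_t i = 0; i < nwords; i++) {
        if (mask[i] != ULONG_MAX)
            return false;
    }
    return nwords > 0;
}

/* Hypothetical helper: count the CPUs in a node's mask, treating an
 * all-1s mask as "node does not exist" and reporting zero CPUs. */
static size_t node_cpu_count(const unsigned long *mask, size_t nwords)
{
    const size_t bits_per_word = sizeof(unsigned long) * CHAR_BIT;
    size_t count = 0;

    if (mask_is_all_ones(mask, nwords))
        return 0;

    for (size_t i = 0; i < nwords; i++) {
        for (size_t b = 0; b < bits_per_word; b++) {
            if ((mask[i] >> b) & 1UL)
                count++;
        }
    }
    return count;
}
```

Without a check of this kind, every bit in the mask would be reported
as a CPU belonging to the node, which is how the capabilities document
grows past the message size limit.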
Machines that exhibit this problem will show the following symptom in
the logs:

  # grep NUMA /var/log/messages
  Aug 16 10:30:34 sgi-xe270-01 libvirtd: 10:30:34.933: warning :
  nodeCapsInitNUMA:388 : NUMA topology for cell 1 of 2 not available, ignoring

They also have a sparse NUMA topology (i.e. empty nodes).

This series does several things:

 - Add explicit warnings in places where XDR serialization fails,
   so we see an indication of the problem in /var/log/messages
 - Try to send a real remote_error back to the client, instead of
   closing its connection
 - Add logging of the capabilities XML in libvirt.c so we can identify
   the too-large document in libvirtd
 - Add a fix to cope with the all-1s node mask
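
The first two items can be sketched roughly as below. This is only an
illustration of the idea (warn, then reply with an error rather than
dropping the client), not the code in the patches: struct reply,
build_reply() and the MSG_MAX limit are all made up for the example and
do not correspond to the real remote protocol structures or limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MSG_MAX 64  /* hypothetical message size limit, for illustration */

/* Hypothetical reply structure, not the real RPC message layout. */
struct reply {
    bool is_error;
    char body[MSG_MAX];
};

/* Build a reply for the client. If the payload does not fit in the
 * message, log a warning and fill in an error reply instead of
 * failing silently. */
static void build_reply(const char *payload, struct reply *out)
{
    assert(payload != NULL && out != NULL);

    size_t len = strlen(payload);
    if (len >= sizeof(out->body)) {
        fprintf(stderr, "warning: payload of %zu bytes exceeds limit of %zu\n",
                len, sizeof(out->body) - 1);
        out->is_error = true;
        snprintf(out->body, sizeof(out->body), "reply too large for message");
        return;
    }

    out->is_error = false;
    memcpy(out->body, payload, len + 1);  /* include the trailing NUL */
}
```

With this shape the client sees a proper error message, and the
warning on stderr stands in for the entry that would end up in
/var/log/messages.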

This may also fix some other unexplained bug reports we've had with
'server closed connection' messages, or at least make it possible
to diagnose them.

Daniel