
I have two identical hypervisors running the same operating system: Ubuntu 22.04.2 LTS
Recently virsh on both stopped talking to libvirtd. Both stopped within a few days of each other.
Currently if I run:
    virsh uri
    virsh version
    virsh list
    # virsh list
    ..nothing just hangs
When I ran strace on these broken machines it gets stuck at the same spot:
    strace virsh list
    ...
    access("/var/run/libvirt/virtqemud-sock", F_OK) = -1 ENOENT (No such file or directory)
    access("/var/run/libvirt/libvirt-sock", F_OK) = 0
    socket(AF_UNIX, SOCK_STREAM, 0) = 5
    connect(5, {sa_family=AF_UNIX, sun_path="/var/run/libvirt/libvirt-sock"}, 110) = 0
    getsockname(5, {sa_family=AF_UNIX}, [128 => 2]) = 0
    futex(0x7fa716a672f0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    fcntl(5, F_GETFD) = 0
    fcntl(5, F_SETFD, FD_CLOEXEC) = 0
    fcntl(5, F_GETFL) = 0x2 (flags O_RDWR)
    fcntl(5, F_SETFL, O_RDWR|O_NONBLOCK) = 0
    futex(0x7fa716a67348, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    eventfd2(0, EFD_CLOEXEC|EFD_NONBLOCK) = 6
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    write(4, "\1\0\0\0\0\0\0\0", 8) = 8
    write(4, "\1\0\0\0\0\0\0\0", 8) = 8
    futex(0x7fa70c001cb0, FUTEX_WAKE_PRIVATE, 1) = 1
    futex(0x7fa716a6786c, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    futex(0x7fa716a67378, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    write(4, "\1\0\0\0\0\0\0\0", 8) = 8
    futex(0x7fa70c001cb0, FUTEX_WAKE_PRIVATE, 1) = 1
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0
    poll([{fd=5, events=POLLOUT}, {fd=6, events=POLLIN}], 2, -1) = 2 ([{fd=5, revents=POLLOUT}, {fd=6, revents=POLLIN}])
    read(6, "\2\0\0\0\0\0\0\0", 16) = 8
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    futex(0x5628ce6e9710, FUTEX_WAKE_PRIVATE, 2147483647) = 0
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
    write(5, "\0\0\0\34 \0\200\206\0\0\0\1\0\0\0B\0\0\0\0\0\0\0\0\0\0\0\0", 28) = 28
    write(6, "\1\0\0\0\0\0\0\0", 8) = 8
    rt_sigprocmask(SIG_BLOCK, [PIPE CHLD WINCH], [], 8) = 0
    poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 2, -1) = 1 ([{fd=6, revents=POLLIN}])
    read(6, "\5\0\0\0\0\0\0\0", 16) = 8
    poll([{fd=5, events=POLLIN}, {fd=6, events=POLLIN}], 2, -1
It gets stuck at this poll(). Note I tested strace on an identical new install of Ubuntu 22.04 where virsh connects fine and got an identical strace, except after this poll() it continues on with read/write ..etc.
I turned on debugging for libvirtd and get no errors while virsh is trying to connect.
I am able to get a virsh# shell. The shell only hangs when I try "connect, uri, version".
Another method of debugging I tried was:
    LIBVIRT_DEBUG=error LIBVIRT_LOG_FILTERS="1:* " virsh uri
    ..
    ..
    2023-06-06 20:51:22.312+0000: 1647: debug : doRemoteOpen:1128 : Trying authentication
    2023-06-06 20:51:22.312+0000: 1647: debug : virNetMessageNew:44 : msg=0x55b996539680 tracked=0
    2023-06-06 20:51:22.312+0000: 1647: debug : virNetMessageEncodePayload:383 : Encode length as 28
    2023-06-06 20:51:22.312+0000: 1647: info : virNetClientSendInternal:2151 : RPC_CLIENT_MSG_TX_QUEUE: client=0x55b996538010 len=28 prog=536903814 vers=1 proc=66 type=0 status=0 serial=0
    2023-06-06 20:51:22.312+0000: 1647: debug : virNetClientCallNew:2107 : New call 0x55b996535f80: msg=0x55b996539680, expectReply=1, nonBlock=0
    2023-06-06 20:51:22.312+0000: 1647: debug : virNetClientIO:1920 : Outgoing message prog=536903814 version=1 serial=0 proc=66 type=0 length=28 dispatch=(nil)
    2023-06-06 20:51:22.312+0000: 1647: debug : virNetClientIO:1978 : We have the buck head=0x55b996535f80 call=0x55b996535f80
    2023-06-06 20:51:22.312+0000: 1647: info : virEventGLibHandleUpdate:195 : EVENT_GLIB_UPDATE_HANDLE: watch=1 events=0
    2023-06-06 20:51:22.312+0000: 1647: debug : virEventGLibHandleUpdate:206 : Update handle data=0x55b996534d30 watch=1 fd=5 events=0
    2023-06-06 20:51:22.312+0000: 1647: debug : virEventGLibHandleUpdate:229 : Removed old handle source=0x55b996534de0
    2023-06-06 20:51:22.312+0000: 1648: debug : virEventRunDefaultImpl:341 : running default event implementation
Any help would be appreciated.
thanks
jerry
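
For reference, the 28 bytes that virsh writes to fd 5 just before the final poll() can be decoded by hand. This is only a sketch: it assumes the message is a 4-byte length followed by six big-endian 32-bit fields in the order prog, vers, proc, type, serial, status. That layout is an assumption about libvirt's RPC header, not something stated in the thread, and decode_header is a made-up helper name.

```shell
#!/bin/sh
# decode_header: read raw bytes on stdin and print seven big-endian
# 32-bit fields. The field order is an assumption (see the text above).
decode_header() {
    # od prints every byte as an unsigned decimal; word splitting is intended.
    set -- $(od -An -v -tu1)
    for name in len prog vers proc type serial status; do
        # Four consecutive bytes form one big-endian field.
        echo "$name=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))"
        shift 4
    done
}

# Octal escapes copied from the write(5, ...) line of the strace output
# (the space is \40 and the letter B is \102).
printf '\0\0\0\34\40\0\200\206\0\0\0\1\0\0\0\102\0\0\0\0\0\0\0\0\0\0\0\0' | decode_header
```

The decoded values (len=28, prog=536903814, vers=1, proc=66) match the RPC_CLIENT_MSG_TX_QUEUE line in the client debug log, which suggests the request left virsh complete and the client is waiting on a reply that never arrives on fd 5.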

On Tue, Jun 06, 2023 at 04:56:38PM -0400, Jerry Buburuz wrote:
I have identical two hypervisors same operating system: Ubuntu 22.04.2 LTS
Recently both virsh stopped talking to the libvirtd. Both stopped within a few days of each other.
Currently if I run:
virsh uri virsh version virsh list
# virsh list ..nothing just hangs
When I ran strace on these broken machines it get stuck at same spot:
Is libvirtd running? It might be that you have socket activation with systemd and the socket this virsh is connecting to is not properly associated with the service. One of the things that might happen is that you want to debug the service, stop the service and a socket unit, but that will not remove it.
Before debugging this make sure everything related in systemd is stopped and then try running libvirtd (or virtqemud, there are two services) with debugging enabled and then run the virsh commands with debugging enabled as well.
Martin
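
Martin's suggestion could look roughly like the sketch below. The unit names are the monolithic libvirtd ones mentioned elsewhere in this thread; a host using the split daemons would need the virtqemud units instead. The run and stop_libvirt_units helpers are hypothetical, and by default the script only prints what it would do.

```shell
#!/bin/sh
# run: print the command by default; set DRY_RUN=0 to really execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Stop the service and its socket units together so that systemd
# cannot re-activate the daemon through a socket that is still listening.
stop_libvirt_units() {
    run systemctl stop libvirtd.service libvirtd.socket \
        libvirtd-ro.socket libvirtd-admin.socket
}

stop_libvirt_units

# With everything stopped, start the daemon by hand in one terminal:
#   libvirtd --verbose
# and run the client with client-side debug logging in another:
#   LIBVIRT_DEBUG=1 virsh -c qemu:///system list
```

Stopping the sockets in the same command as the service is a design choice here: if only the service is stopped, the socket units keep listening and the next virsh call starts the daemon again.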

On Tue, Jun 06, 2023 at 04:56:38PM -0400, Jerry Buburuz wrote:
Recently both virsh stopped talking to the libvirtd. Both stopped within a few days of each other.
I've run into exactly the same problem.
I'm running libvirt (libvirt-9.0.0-3.fc38.x86_64) on Fedora 38. On Fedora, libvirtd is configured by default to use socket activation and is run with the `--timeout 120` option.
After some recent upgrades, I'm seeing the exact same symptoms that Jerry described -- virsh commands simply get stuck at the same call to `poll()`.
It looks like libvirtd is either crashing or failing to start, because when virsh is in this state the `libvirtd` process isn't running. This makes it *sound* like a systemd problem, but I'm not seeing errors anywhere -- either from libvirtd or from systemd.
I've worked around the problem locally by re-configuring libvirtd to run persistently rather than using socket activation:
    systemctl disable --now libvirtd{,-ro,-admin}.socket
    cat > /etc/systemd/system/libvirtd.service.d/override.conf <<EOF
    [Service]
    EnvironmentFile=
    EOF
    systemctl restart libvirtd
Package versions in case this helps correlate something:
- libvirt-9.0.0-3.fc38.x86_64
- systemd-253.5-1.fc38.x86_64
- kernel-6.3.6-200.fc38.x86_64
Libvirt uri: qemu:///system
--
Lars Kellogg-Stedman <lars@redhat.com> | larsks @ {irc,twitter,github}
http://blog.oddbit.com/ | N1LKS

Thank you Lars.
My next step is to try TCP rather than the unix socket.
Just to clarify:
* I am using Ubuntu 22.04 LTS
* systemd shows libvirtd with no errors; it's running and creates unix sockets in /run/libvirt/libvirt-sock
* none of the services are failing.
* I have been trying to turn on every debugging feature possible, no errors with virsh or libvirtd services.
* recently tried attaching gdb to the libvirtd and virsh processes and am not seeing any errors.
Recently tried an identical VM with 22.04 and all patches and compared permissions, open files (lsof), logs ..etc. On the new VM virsh connects with no problem. My two existing hypervisors are still dead.
The only difference between my new test VM and the dead hypervisors is that the problem hypervisors use a mounted cephfs to store virtual machines. I have not tried to unmount the cephfs yet. Maybe it's causing delays in something? The virsh and /etc/libvirtd/ are local to the hypervisor. I only use the cephfs to store the images. This problem started around the same time my cephfs storage had issues.
thanks
jerry

Just found my issue.
After I removed the cephfs mounts it worked!
I will debug ceph.
I assumed because I could touch files on mounted cephfs it was working.
Now virsh list works!
thanks
jerry

Just a brief update:
As soon as I umount cephfs virsh is able to talk to libvirtd.
I tested the cephfs with:
    df
    (no problem)
    dd if=/dev/zero of=/cephstorage/a.img bs=1G count=1 oflag=dsync
    (no problem, created a random 1G file, no I/O issues.)
After mounting cephfs and restarting libvirtd, "virsh" hangs again. Obviously virsh and libvirtd don't like the cephfs mount.
I am just starting to debug the potential problem with cephfs and libvirtd/virsh.
I originally noted that when this problem occurred on the two hypervisors, it occurred a couple of days apart, which matched some updates that took place. I have not tried rolling back patches yet.
I am curious if anyone uses the cephfs filesystem and had similar problems recently.
I will update the forum if I find a solution.
Thanks
jerry

As soon as I umount cephfs virsh is able to talk to libvirtd.
I tested the cephfs with:
df (no problem) dd if=/dev/zero of=/cephstorage/a.img bs=1G count=1 oflag=dsync (no problem created random1G file, no I/O issues.)
After mounting cephfs and restarting libvirtd "virsh" hangs again. Obviously virsh and libvirtd don't like the cephfs mount.
I am just starting to debug the potential problem with cephfs and libvirtd/virsh.
I originally noted when this problem occurred on two hypervisors the problem occurred a couple days a part which matched some updates that took place. I have not tried rolling back patches yet.
I am curious if anyone uses cephfs filesystem and had similar problems recently.
I used to use cephfs until I got issues with the kernel mount version, and fuse was just too slow. So now I have an nfs-ganesha mount. Occasionally I have (had) connection issues with virsh/virt-manager, which are related to guests having ISO images attached as media. It looks like this happened a lot less once I started removing the ISO media from VMs.
When I do have this lockup, I can fix it by doing a 'umount -l /mnt/vps-isos'. I don't think this was specific to libvirt, as an ls -l /mnt/vps-isos would 'hang' also.
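
A quick way to look for the kind of unresponsive mount described above is to stat every network mount with a time limit. The sketch below is an assumption-laden example: the list of filesystem types (ceph, nfs, nfs4, cifs, fuse.ceph-fuse) may need extending, the helper names are invented, and the 5-second limit is arbitrary.

```shell
#!/bin/sh
# list_network_mounts [FILE]: print mount points whose filesystem type looks
# like a network filesystem. FILE defaults to /proc/mounts.
list_network_mounts() {
    awk '$3 ~ /^(ceph|nfs|nfs4|cifs|fuse\.ceph-fuse)$/ { print $2 }' "${1:-/proc/mounts}"
}

# probe_path PATH [SECONDS]: stat PATH in the background and report whether
# it finished within the limit. The stat is never waited on, so a stuck
# mount cannot block the probe itself.
probe_path() {
    path=$1
    limit=${2:-5}
    stat -- "$path" >/dev/null 2>&1 &
    pid=$!
    waited=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "HUNG $path"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "OK $path"
}

# Probe every network mount the kernel currently lists.
list_network_mounts | while IFS= read -r mnt; do
    probe_path "$mnt" || true
done
```

An "OK" result does not clear the mount: earlier in this thread df and dd both succeeded on the cephfs mount while virsh still hung, so this only catches mounts that are stuck outright.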

On 6/12/23 20:17, Jerry Buburuz wrote:
Just found my issue.
After I removed the cephfs mounts it worked!
I will debug ceph.
I assumed because I could touch files on mounted cephfs it was working.
Now virsh list works!
Out of curiosity: do you perhaps have a storage pool defined over cephfs? I can see two possible sources for the problem:
1) an autostarted storage pool that makes libvirt mount cephfs, or
2) a storage pool defined over a path where cephfs is mounted.
The problem with 1) is obvious (in fact it's not specific to ceph; if it was NFS/iSCSI and the server wasn't responding then libvirtd would just hang).
The problem with 2) is that for some types of storage pools ('dir' typically) libvirt assumes they are always 'running', and proceeds to enumerate volumes in that pool (i.e. files under the dir). If there's a stale mount point, this might get libvirtd stuck. But again, this is not limited to ceph; any network FS might do this.
Michal
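
Both cases can be checked without going through virsh, which matters when virsh itself hangs. The sketch below assumes the default system layout, where pool definitions are XML files in /etc/libvirt/storage and autostarted pools have a symlink in /etc/libvirt/storage/autostart. It also assumes /cephstorage is the mount point, taken from the dd command earlier in the thread. pools_under_mount is a hypothetical helper, and it only looks at <path> elements.

```shell
#!/bin/sh
# pools_under_mount DIR MOUNTPOINT: for every pool XML file in DIR, print
# "<file> <autostart|manual> <path>" when the pool's <path> is MOUNTPOINT
# itself or lies below it.
pools_under_mount() {
    dir=$1
    mnt=${2%/}
    for xml in "$dir"/*.xml; do
        [ -f "$xml" ] || continue
        # An entry of the same name under autostart/ marks an autostarted pool.
        if [ -e "$dir/autostart/${xml##*/}" ]; then
            flag=autostart
        else
            flag=manual
        fi
        # Pull out the text of each <path>...</path> element.
        sed -n 's:.*<path>\(.*\)</path>.*:\1:p' "$xml" |
        while IFS= read -r p; do
            case "$p/" in
                "$mnt"/*) echo "${xml##*/} $flag $p" ;;
            esac
        done
    done
}

pools_under_mount /etc/libvirt/storage /cephstorage
```

Any line printed points at a pool that libvirtd will touch on startup or when listing volumes, which is where a stale network mount could stall it.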
participants (5)
- Jerry Buburuz
- Lars Kellogg-Stedman
- Marc
- Martin Kletzander
- Michal Prívozník