Frequent network collapse, possibly due to bridging

Hi,

I would like some help troubleshooting a problem I have been having lately with my VM host, which runs 5 VMs, one of which provides pi-hole and unbound services. In the last few weeks it has been a relatively common occurrence for me to come home from work and find that the host machine has lost its network. Restarting the VM/VMs does not fix the problem; the host needs to be rebooted, otherwise there is loss of both name resolution and internet connectivity; I cannot even ping IPs such as 8.8.8.8. Since I use the pi-hole VM as the DNS server for my LAN, my whole LAN is disconnected from the internet until the host machine is rebooted.

The host machine has a somewhat complicated network setup: the two gigabit connections are bonded and bridged to the VMs. However, this setup has served me well for several years; the problem only appeared a few weeks ago. It doesn't happen every day, but often enough to be annoying and disruptive for my family.

My question is: how can I troubleshoot this problem and figure out whether or not it is truly due to the network bridging somehow collapsing? I tried to find some log files, but all I could find were the /var/log/libvirt/qemu/$VM files, and the log file for the pi-hole VM reported the following lines. However, I am not sure if they are associated with a real crash or just with shutting down and restarting the host:

char device redirected to /dev/pts/2 (label charserial0)
qxl_send_events: spice-server bug: guest stopped, ignoring
2022-01-20T23:41:17.012445Z qemu-system-x86_64: terminating on signal 15 from pid 1 (/sbin/init)
2022-01-20 23:41:17.716+0000: shutting down, reason=crashed
2022-01-20 23:42:46.059+0000: starting up libvirt version: 7.10.0, qemu version: 6.2.0, kernel: 5.10.89-1-MANJARO, hostname: -redacted-

Please excuse my ignorance, but is there a way to restart the networking without rebooting the host machine? That alone will not solve my problem, since I won't be able to reach the host remotely if the networking is down. The real solution would be preventing these network crashes, and the first step toward that, in my opinion, is effective troubleshooting. Any input/guidance will be greatly appreciated. I can provide more info about my host/VM(s) if the above is not adequate.

Thanks,
Hakan Duran

On Fri, Jan 21, 2022 at 08:42:58AM -0600, Hakan E. Duran wrote:
Hi,
I would like some help troubleshooting a problem I have been having lately with my VM host, which runs 5 VMs, one of which provides pi-hole and unbound services. In the last few weeks it has been a relatively common occurrence for me to come home from work and find that the host machine has lost its network. Restarting the VM/VMs does not fix the problem; the host needs to be rebooted, otherwise there is loss of both name resolution and internet connectivity; I cannot even ping IPs such as 8.8.8.8. Since I use the pi-hole VM as the DNS server for my LAN, my whole LAN is disconnected from the internet until the host machine is rebooted.

The host machine has a somewhat complicated network setup: the two gigabit connections are bonded and bridged to the VMs. However, this setup has served me well for several years; the problem only appeared a few weeks ago. It doesn't happen every day, but often enough to be annoying and disruptive for my family.
It is always good to check what changed a few weeks ago, but I understand it is difficult to find out what you were updating and where.
My question is: how can I troubleshoot this problem and figure out whether or not it is truly due to the network bridging somehow collapsing? I tried to find some log files, but all I could find were the /var/log/libvirt/qemu/$VM files, and the log file for the pi-hole VM reported the following lines. However, I am not sure if they are associated with a real crash or just with shutting down and restarting the host:
char device redirected to /dev/pts/2 (label charserial0)
qxl_send_events: spice-server bug: guest stopped, ignoring
2022-01-20T23:41:17.012445Z qemu-system-x86_64: terminating on signal 15 from pid 1 (/sbin/init)
Probably restarting the host, as it got SIGTERM'd by init. Maybe it was restarted at a bad time and there is some inconsistency on the disk? Using something like libvirt-guests, which can manage your machines when rebooting, would be a good idea.
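For example, on a systemd-based host, something like the following would enable it (a sketch; the configuration file lives at /etc/sysconfig/libvirt-guests on Fedora-like systems and /etc/default/libvirt-guests on Debian-like ones):

systemctl enable --now libvirt-guests.service

# and in the libvirt-guests configuration file, for instance:
ON_BOOT=start          # start guests that were running at shutdown
ON_SHUTDOWN=shutdown   # shut guests down cleanly instead of killing them
SHUTDOWN_TIMEOUT=120   # seconds to wait for guests before giving up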
2022-01-20 23:41:17.716+0000: shutting down, reason=crashed
2022-01-20 23:42:46.059+0000: starting up libvirt version: 7.10.0, qemu version: 6.2.0, kernel: 5.10.89-1-MANJARO, hostname: -redacted-
Please excuse my ignorance, but is there a way to restart the networking without rebooting the host machine? That alone will not solve my
You can do:

virsh net-destroy <network_name>
virsh net-start <network_name>

but depending on what the network looks like, how it is set up, etc., you might need to restart some of the VMs or manually plug them back in.
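A hypothetical example of plugging a single guest back in by hand (the VM name and MAC address below are placeholders; virsh domiflist shows the real values):

# look up the guest's interface (source network/bridge, model and MAC)
virsh domiflist pihole-vm

# detach and re-attach it so it reconnects to the restarted network
virsh detach-interface pihole-vm network --mac 52:54:00:11:22:33
virsh attach-interface pihole-vm network default --mac 52:54:00:11:22:33 --model virtio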
problem, since I won't be able to reach the host remotely if the networking is down. The real solution would be preventing these network crashes, and the first step toward that, in my opinion, is effective troubleshooting. Any input/guidance will be greatly appreciated.
I can provide more info about my host/VM(s) if the above is not adequate.
I'm not sure how much more I can help, as I do not understand what the actual setup is. What I would do is try to figure out what exactly happens when it breaks, and then go from that (setting up logging, etc.); just general tips, I guess.
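As an illustration of the kind of state worth capturing the moment it breaks (a sketch; br0 is an assumption for the bridge name in the setup described above):

# which ports (physical uplink and VM tap devices) the bridge currently has
bridge link show

# link, address and route state at a glance
ip -br link
ip -br addr
ip route show

# kernel messages around the time of the failure
journalctl -k --since "1 hour ago"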
Thanks,
Hakan Duran

On 1/24/22 4:35 AM, Martin Kletzander wrote:
On Fri, Jan 21, 2022 at 08:42:58AM -0600, Hakan E. Duran wrote:
Hi,
I would like some help troubleshooting a problem I have been having lately with my VM host, which runs 5 VMs, one of which provides pi-hole and unbound services. In the last few weeks it has been a relatively common occurrence for me to come home from work and find that the host machine has lost its network. Restarting the VM/VMs does not fix the problem; the host needs to be rebooted, otherwise there is loss of both name resolution and internet connectivity; I cannot even ping IPs such as 8.8.8.8. Since I use the pi-hole VM as the DNS server for my LAN, my whole LAN is disconnected from the internet until the host machine is rebooted.

The host machine has a somewhat complicated network setup: the two gigabit connections are bonded and bridged to the VMs. However, this setup has served me well for several years; the problem only appeared a few weeks ago. It doesn't happen every day, but often enough to be annoying and disruptive for my family.
It is always good to check what changed a few weeks ago, but I understand it is difficult to find out what you were updating and where.
My question is: how can I troubleshoot this problem and figure out whether or not it is truly due to the network bridging somehow collapsing? I tried to find some log files, but all I could find were the /var/log/libvirt/qemu/$VM files, and the log file for the pi-hole VM reported the following lines. However, I am not sure if they are associated with a real crash or just with shutting down and restarting the host:
char device redirected to /dev/pts/2 (label charserial0)
qxl_send_events: spice-server bug: guest stopped, ignoring
2022-01-20T23:41:17.012445Z qemu-system-x86_64: terminating on signal 15 from pid 1 (/sbin/init)
Probably restarting the host, as it got SIGTERM'd by init. Maybe it was restarted at a bad time and there is some inconsistency on the disk? Using something like libvirt-guests, which can manage your machines when rebooting, would be a good idea.
2022-01-20 23:41:17.716+0000: shutting down, reason=crashed
2022-01-20 23:42:46.059+0000: starting up libvirt version: 7.10.0, qemu version: 6.2.0, kernel: 5.10.89-1-MANJARO, hostname: -redacted-
Please excuse my ignorance, but is there a way to restart the networking without rebooting the host machine? That alone will not solve my
You can do:
virsh net-destroy <network_name>
virsh net-start <network_name>
but depending on what the network looks like, how it is set up, etc., you might need to restart some of the VMs or manually plug them back in.
The connection between any guest tap device and a host bridge device will be broken by virsh net-destroy, and not restored by virsh net-start (because the network driver has no good way of notifying the QEMU driver that it has restarted a network). This is something that's been on my "list of annoying things I should fix some day" for a very long time, but I've never been motivated enough to figure out a clean solution.

In the meantime, if you destroy/start a network, you can get all the guest tap devices reconnected by restarting libvirtd:

systemctl restart libvirtd.service

or, if you're using split daemons:

systemctl restart virtqemud.service

One of the things the QEMU driver does when it's initializing is to check where each guest tap device *should* be connected, compare that to where it *is* connected, and if those don't match then fix it.
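To see the mismatch this fixes (a sketch; virbr0 and the VM name are illustrative), you can compare where libvirt thinks each guest NIC belongs against what the kernel bridge actually has attached:

# what libvirt thinks: source network/bridge and MAC per guest NIC
virsh domiflist pihole-vm

# what the kernel has: tap devices currently enslaved to the bridge
ip link show master virbr0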

Thank you for the replies. I truly appreciate them. I used virt-manager to set up my VMs and manage them on a daily basis. After reading your responses, I used the commands below to gather some information:

$ sudo virsh net-list --all
 Name      State      Autostart   Persistent
----------------------------------------------
 default   inactive   no          yes

$ sudo virsh net-info default
Name:           default
UUID:           some-number
Active:         no
Persistent:     yes
Autostart:      no
Bridge:         virbr0

I thought it was interesting that the default network was not marked to start automatically at boot, and changed that:

$ sudo virsh net-autostart default
Network default marked as autostarted

Then, due to its inactive status, I thought it would be a good idea to start it with:

$ sudo virsh net-start default
Network default started

Of note, even though the default network was marked as inactive as above, it was working. In other words, I was able to reach the VMs, which are part of that network, even before the `virsh net-start default` command. Nothing seemed to break with the command either, and everything still seemed to work afterwards.

$ sudo virsh net-info default
Name:           default
UUID:           some-number
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0

I would really appreciate it if you could confirm that this is the desired state for my network for the purposes I discussed previously. I apologize if I am oversimplifying things here; it is because of my lack of in-depth understanding of the appropriate setup.

Thanks,
Hakan

On 22/01/24 05:30PM, Laine Stump wrote:
On 1/24/22 4:35 AM, Martin Kletzander wrote:
On Fri, Jan 21, 2022 at 08:42:58AM -0600, Hakan E. Duran wrote:
Hi,
I would like some help troubleshooting a problem I have been having lately with my VM host, which runs 5 VMs, one of which provides pi-hole and unbound services. In the last few weeks it has been a relatively common occurrence for me to come home from work and find that the host machine has lost its network. Restarting the VM/VMs does not fix the problem; the host needs to be rebooted, otherwise there is loss of both name resolution and internet connectivity; I cannot even ping IPs such as 8.8.8.8. Since I use the pi-hole VM as the DNS server for my LAN, my whole LAN is disconnected from the internet until the host machine is rebooted.

The host machine has a somewhat complicated network setup: the two gigabit connections are bonded and bridged to the VMs. However, this setup has served me well for several years; the problem only appeared a few weeks ago. It doesn't happen every day, but often enough to be annoying and disruptive for my family.
It is always good to check what changed a few weeks ago, but I understand it is difficult to find out what you were updating and where.
My question is: how can I troubleshoot this problem and figure out whether or not it is truly due to the network bridging somehow collapsing? I tried to find some log files, but all I could find were the /var/log/libvirt/qemu/$VM files, and the log file for the pi-hole VM reported the following lines. However, I am not sure if they are associated with a real crash or just with shutting down and restarting the host:
char device redirected to /dev/pts/2 (label charserial0)
qxl_send_events: spice-server bug: guest stopped, ignoring
2022-01-20T23:41:17.012445Z qemu-system-x86_64: terminating on signal 15 from pid 1 (/sbin/init)
Probably restarting the host, as it got SIGTERM'd by init. Maybe it was restarted at a bad time and there is some inconsistency on the disk? Using something like libvirt-guests, which can manage your machines when rebooting, would be a good idea.
2022-01-20 23:41:17.716+0000: shutting down, reason=crashed
2022-01-20 23:42:46.059+0000: starting up libvirt version: 7.10.0, qemu version: 6.2.0, kernel: 5.10.89-1-MANJARO, hostname: -redacted-
Please excuse my ignorance, but is there a way to restart the networking without rebooting the host machine? That alone will not solve my
You can do:
virsh net-destroy <network_name>
virsh net-start <network_name>
but depending on what the network looks like, how it is set up, etc., you might need to restart some of the VMs or manually plug them back in.
The connection between any guest tap device and a host bridge device will be broken by virsh net-destroy, and not restored by virsh net-start (because the network driver has no good way of notifying the QEMU driver that it has restarted a network). This is something that's been on my "list of annoying things I should fix some day" for a very long time, but I've never been motivated enough to figure out a clean solution.
In the meantime, if you destroy/start a network, you can get all the guest tap devices reconnected by restarting libvirtd:
systemctl restart libvirtd.service
or if you're using split daemons:
systemctl restart virtqemud.service
One of the things the QEMU driver does when it's initializing is to check where each guest tap device *should* be connected, compare that to where it *is* connected, and if those don't match then fix it.

On 1/24/22 17:29, Hakan E. Duran wrote:
Then, due to its inactive status, I thought it would be a good idea to start it with:
$ sudo virsh net-start default
Network default started
Of note, even though the default network was marked as inactive as above, it was working. In other words, I was able to reach the VMs, which are part of that network, even before the `virsh net-start default` command. Nothing seemed to break with the command either, and everything still seemed to work afterwards.
$ sudo virsh net-info default
Name:           default
UUID:           some-number
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0
I would really appreciate it if you could confirm that this is the desired state for my network for the purposes I discussed previously. I apologize if I am oversimplifying things here; it is because of my lack of in-depth understanding of the appropriate setup.
Seeing as there are no other replies yet, for what it's worth, on my hypervisor I see similar results for a working system, and to my knowledge it's all running correctly:

# virsh net-info default
Name:           default
UUID:           32ecb497-5a0b-46fd-9786-df4a6ceec9ce
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0

Also, this from my library of scripts:

# ---- cut here ----
#!/bin/bash
#
# Yury V. Zaytsev <yury@shurup.com> (C) 2011
# cf. http://git.zaytsev.net/?p=anubis-puppet.git;a=blob;f=manifests/files/common/...
#
# This work is herewith placed in public domain.
#
# Use this script to cleanly restart the default libvirt network after its
# definition has been changed (e.g. added new static MAC+IP mappings) in order
# for the changes to take effect. Restarting the network alone, however, causes
# the guests to lose connectivity with the host until their network interfaces
# are re-attached.
#
# The script re-attaches the interfaces by obtaining the information about them
# from the current libvirt definitions. It has the following dependencies:
#
# - virsh (obviously)
# - tail / head / grep / awk / cut
# - XML::XPath (e.g. perl-XML-XPath package)
#
# Note that it assumes that the guests have exactly 1 NIC each attached to the
# given network! Extensions to account for more (or no) interfaces etc. are,
# of course, most welcome.
#
# ZYV
#

set -e
set -u

NETWORK_NAME=default
NETWORK_HOOK=/etc/libvirt/hooks/qemu

virsh net-define /opt/config/libvirt/network-$NETWORK_NAME.xml
virsh net-destroy $NETWORK_NAME
virsh net-start $NETWORK_NAME

MACHINES=$( virsh list | tail -n +3 | head -n -1 | awk '{ print $2; }' )

for m in $MACHINES ; do

    MACHINE_INFO=$( virsh dumpxml "$m" | xpath /domain/devices/interface[1] 2> /dev/null )
    MACHINE_MAC=$( echo "$MACHINE_INFO" | grep "mac address" | cut -d '"' -f 2 )
    MACHINE_MOD=$( echo "$MACHINE_INFO" | grep "model type" | cut -d '"' -f 2 )

    set +e
    virsh detach-interface "$m" network --mac "$MACHINE_MAC" && sleep 3
    virsh attach-interface "$m" network $NETWORK_NAME --mac "$MACHINE_MAC" --model "$MACHINE_MOD"
    set -e

    $NETWORK_HOOK "$m" stopped && sleep 3
    $NETWORK_HOOK "$m" start

done
# ---- cut here ----

On 22/01/28 10:47PM, Charles Polisher wrote:
... Seeing as there are no other replies yet, for what it's worth, on my hypervisor I see similar results for a working system, and to my knowledge it's all running correctly:
# virsh net-info default
Name:           default
UUID:           32ecb497-5a0b-46fd-9786-df4a6ceec9ce
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0
Also, this from my library of scripts:
Thank you so much for confirming this!
# ---- cut here ----
#!/bin/bash
#
# Yury V. Zaytsev <yury@shurup.com> (C) 2011
# cf. http://git.zaytsev.net/?p=anubis-puppet.git;a=blob;f=manifests/files/common/...
#
# This work is herewith placed in public domain.
#
# Use this script to cleanly restart the default libvirt network after its
# definition has been changed (e.g. added new static MAC+IP mappings) in order
# for the changes to take effect. Restarting the network alone, however, causes
# the guests to lose connectivity with the host until their network interfaces
# are re-attached.
#
# The script re-attaches the interfaces by obtaining the information about them
# from the current libvirt definitions. It has the following dependencies:
#
# - virsh (obviously)
# - tail / head / grep / awk / cut
# - XML::XPath (e.g. perl-XML-XPath package)
#
# Note that it assumes that the guests have exactly 1 NIC each attached to the
# given network! Extensions to account for more (or no) interfaces etc. are,
# of course, most welcome.
#
# ZYV
#
set -e
set -u
NETWORK_NAME=default
NETWORK_HOOK=/etc/libvirt/hooks/qemu
virsh net-define /opt/config/libvirt/network-$NETWORK_NAME.xml
virsh net-destroy $NETWORK_NAME
virsh net-start $NETWORK_NAME
MACHINES=$( virsh list | tail -n +3 | head -n -1 | awk '{ print $2; }' )
for m in $MACHINES ; do
    MACHINE_INFO=$( virsh dumpxml "$m" | xpath /domain/devices/interface[1] 2> /dev/null )
    MACHINE_MAC=$( echo "$MACHINE_INFO" | grep "mac address" | cut -d '"' -f 2 )
    MACHINE_MOD=$( echo "$MACHINE_INFO" | grep "model type" | cut -d '"' -f 2 )
    set +e
    virsh detach-interface "$m" network --mac "$MACHINE_MAC" && sleep 3
    virsh attach-interface "$m" network $NETWORK_NAME --mac "$MACHINE_MAC" --model "$MACHINE_MOD"
    set -e
    $NETWORK_HOOK "$m" stopped && sleep 3
    $NETWORK_HOOK "$m" start
done
# ---- cut here ----
Thank you so much for sharing your script as well! This is so helpful.

I had my LAN crash again yesterday, and I did a few experiments. I tested pinging the hosts within the LAN from one of my VMs, and this wasn't possible (host unreachable). So, just to clarify, pinging any host within the LAN, as well as outside the LAN, was not possible from the VM. The hypervisor WAS able to ping the VM, though, and I was able to ssh into the VM from the hypervisor. The hypervisor could not ping any other computer in the LAN either (same host-unreachable error). Name resolution within the LAN was confirmed to be intact by pinging both by hostname and by IP.

Just to test it, I destroyed the libvirt network and restarted it; however, I didn't use several of the commands this script uses, such as virsh detach/attach-interface and $NETWORK_HOOK, so looking back, it was an incomplete attempt. This did not solve the problem.

My LAN architecture starts with a cable modem that is connected via ethernet cable to my router, which is connected to my 24-port switch the same way. All computers that were pinged and could not be reached are connected to the switch via ethernet cables, just like the hypervisor itself. The wireless access point is also connected to the switch via ethernet cable, and internet access was lost on the wireless network as well. I use one of the VMs on this hypervisor as the primary DNS server for my LAN through pi-hole and unbound. I believe the loss of internet is attributable to the loss of the bridge network on this VM. The internal network between the hypervisor and the VM seems to have been preserved somehow (see above). The cable modem seemed OK and connected, judging by its lights during the crash. Rebooting the hypervisor solved the problem, and the LAN was restored without needing to do anything else. I am stumped.

For now, I changed my primary DNS server from the VM to a physical Raspberry Pi as a first step. The Pi used to be the secondary DNS server; now it is primary. The VM, on the other hand, is now the secondary DNS server. I am hoping that, if the libvirt network crashes again(?), I won't lose the whole LAN, thanks to this change. I am not sure what the root cause of this problem may be; virsh net-info default during the crash gave the same output as if nothing was wrong with it.

Thank you for your time. I truly appreciate it, knowing that it may not be possible to identify what actually is wrong/failing.

Hakan
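Given the symptoms described above (host-to-VM traffic over the bridge works, host-to-LAN traffic does not), one way to narrow things down during the next crash would be to look at the uplink side of the bridge rather than the bridge itself. A sketch, assuming the bond is named bond0; all interface names here are illustrative:

# the bonding driver's view of each slave (link up/down, failure counts,
# currently active slave)
cat /proc/net/bonding/bond0

# carrier state of every interface at a glance
ip -br link

# MAC addresses the bridge has learned; if nothing from the LAN side shows
# up here, frames are not making it in through the bond
bridge fdb show | grep -v permanent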

On 1/29/22 10:06, Hakan E. Duran wrote:
Thank you so much for sharing your script as well! This is so helpful.

Thank the author, Yury V. Zaytsev <yury@shurup.com>.

I truly appreciate it, knowing that it may not be possible to identify what actually is wrong/failing.
Please post the version of your hypervisor's OS ("cat /etc/os-release"). If that file doesn't exist, any of: /etc/system-release, /etc/redhat-release, /etc/SuSE-release, /etc/debian_version, /etc/arch-release, /etc/gentoo-release, /etc/slackware-version, /etc/frugalware-release, /etc/altlinux-release, /etc/mandriva-release, or /etc/meego-release.

Also please post the version of spice-server, considering https://bugzilla.redhat.com/show_bug.cgi?id=912218 , which is described as:

    boot up guest in rhel7 host then, do screen via qmp monitor in loop,
    qemu reports error message: "qxl_send_events: spice-server bug: guest
    stopped, ignoring"
    Version-Release number of selected component:
    spice-server-0.12.2-1.el7.x86_64, qemu-kvm-1.3.0-5.el7.x86_64

and which is claimed to be fixed. If this is a regression, and you're running a Red Hat system, you might report it there.

Finally, there is a method for increasing the verbosity of error logging, but it's a bit tedious to set up and to interpret the results. See: https://libvirt.org/logging.html#log_config For example, setting LIBVIRT_DEBUG=2 in the startup code for libvirt. Where exactly to set this will depend on your OS.

Best luck,
--
Charles
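For reference, a sketch of the daemon-side logging configuration that page describes (the filter values are illustrative, not a recommendation):

# /etc/libvirt/libvirtd.conf (or virtqemud.conf with split daemons)
# levels: 1=debug, 2=info, 3=warning, 4=error
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

# restart the daemon for the settings to take effect
systemctl restart libvirtd.service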