[libvirt] Reproducible VM start bug which affects libvirt 5.1.0 and 5.2.0

Hello,

I am currently running into a reproducible libvirt bug which affects libvirt 5.1.0 and 5.2.0. There seems to be a race condition in the nwfilter-define and virsh start commands. Several times a day I'm not able to start a VM anymore, with the following error message:

    error: Failed to start domain test
    error: internal error: Failed to apply firewall rules /sbin/iptables -w -I FORWARD 1 -j libvirt-in: iptables v1.6.0: Couldn't load target `libvirt-in':No such file or directory

To fix this issue I have to restart libvirt. Some iptables chains are missing, which is probably caused by a nwfilter-define operation. I'm able to reproduce this bug within 2 hours by running 2 loops. One loop is defining nwfilters and the second loop is destroying and starting multiple VMs.

I found an entry in the changelog of libvirt 5.1.0 which seems related to this bug:

    Create private chains for virtual network firewall rules

    Historically firewall rules for virtual networks were added straight into the base chains. This works but has a number of bugs and design limitations. To address them, libvirt now puts firewall rules into its own chains. Note that with this change the filter, nat and mangle tables are required for both IPv4 and IPv6.

So far I am not able to reproduce this bug on libvirt 5.0.0.

Is there any information I can provide to the mailing list to help debug and/or fix this bug? I am also willing to test patches.

With kind regards,
Frank Schreuder

On Wed, Apr 03, 2019 at 10:52:57AM +0000, Frank Schreuder wrote:
> Hello,
> I am currently running into a reproducible libvirt bug which affects libvirt 5.1.0 and 5.2.0.
> There seems to be a race condition in the nwfilter-define and virsh start commands. Several times a day I'm not able to start a VM anymore, with the following error message:
>
>   error: Failed to start domain test
>   error: internal error: Failed to apply firewall rules /sbin/iptables -w -I FORWARD 1 -j libvirt-in: iptables v1.6.0: Couldn't load target `libvirt-in':No such file or directory

"libvirt-in" is a chain created by the nwfilter driver to hold its rules. AFAICT, it tries to re-create this chain every time. The code runs this series of ops:

    virFirewallAddRuleFull(fw, layer,
                           true, NULL, NULL,
                           "-N", VIRT_IN_CHAIN, NULL);

    virFirewallAddRuleFull(fw, layer,
                           true, NULL, NULL,
                           "-D", "FORWARD", "-j", VIRT_IN_CHAIN, NULL);

    virFirewallAddRule(fw, layer,
                       "-I", "FORWARD", "1", "-j", VIRT_IN_CHAIN, NULL);

where "VIRT_IN_CHAIN" is a macro expanding to "libvirt-in".

So AFAICT it should be impossible to hit the problem you are showing, since we just created the "libvirt-in" chain before trying to add it to the FORWARD chain!
> To fix this issue I have to restart libvirt. Some iptables chains are missing, which is probably caused by a nwfilter-define operation. I'm able to reproduce this bug within 2 hours by running 2 loops. One loop is defining nwfilters and the second loop is destroying and starting multiple VMs.

The fact that we recreate it every time we try to start a guest also means any problem should be self-correcting, which makes it even more strange that you need to have a restart.
> I found an entry in the changelog of libvirt 5.1.0 which seems related to this bug:
>
>   Create private chains for virtual network firewall rules
>
>   Historically firewall rules for virtual networks were added straight into the base chains. This works but has a number of bugs and design limitations. To address them, libvirt now puts firewall rules into its own chains. Note that with this change the filter, nat and mangle tables are required for both IPv4 and IPv6.

This relates to the virtual network driver's firewall rules, which are completely independent of the nwfilter driver's firewall rules. This changelog entry is referring to the new top-level chains we create:

    INPUT --> LIBVIRT_INP          (filter)

    OUTPUT --> LIBVIRT_OUT         (filter)

    FORWARD +-> LIBVIRT_FWX        (filter)
            +-> LIBVIRT_FWO
            \-> LIBVIRT_FWI

    POSTROUTING --> LIBVIRT_PRT    (nat & mangle)

As you can see, these are different chains from the nwfilter ones you mention.
> So far I am not able to reproduce this bug on libvirt 5.0.0.

This is interesting, because AFAICT we had no changes to the nwfilter driver between 5.0.0 and 5.1.0 that would affect this behaviour. We did have the changes to the virtual network driver but that should not interfere with the nwfilter driver.

> Is there any information I can provide to the mailing list to help debug and/or fix this bug? I am also willing to test patches.

Probably the only thing is to turn on debugging

    log_filters="1:libvirt 1:firewall 1:iptables 1:nwfilter 1:network"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"

in libvirtd.conf & restart libvirtd. Since you say this takes 2 hours to reproduce, though, I think this is going to create a *HUGE* logfile. We'll probably only need the last few 100 KB or so of the logfile.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
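Since only the tail of the log is wanted, one possible way to cut it down before sharing is sketched below in Python. This is an illustration only: the `tail_bytes` helper and the 300 KB default are assumptions chosen to match "the last few 100 KB", and the path in the usage note is the one from the log_outputs setting above.

```python
import os


def tail_bytes(path, max_bytes=300 * 1024):
    """Return roughly the last max_bytes of a file, starting on a line boundary."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        data = f.read()
    if size > max_bytes:
        # We started somewhere inside the file, so drop everything up to and
        # including the first newline to avoid emitting a partial line.
        newline = data.find(b"\n")
        if newline != -1:
            data = data[newline + 1:]
    return data
```

For example, `open("libvirtd-tail.log", "wb").write(tail_bytes("/var/log/libvirt/libvirtd.log"))` would write the trimmed excerpt to a new file (the output filename is arbitrary).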

>> To fix this issue I have to restart libvirt. Some iptables chains are missing, which is probably caused by a nwfilter-define operation. I'm able to reproduce this bug within 2 hours by running 2 loops. One loop is defining nwfilters and the second loop is destroying and starting multiple VMs.

> The fact that we recreate it every time we try to start a guest also means any problem should be self-correcting, which makes it even more strange that you need to have a restart.

It seems that the problem is a race condition between libvirt and our reload-iptables script. Libvirt inserts and removes rules one by one, while reload-iptables uses iptables-save and iptables-restore. The reload-iptables script saves libvirt's firewall rules to a temp file, appends Puppet's rules, and then imports that temp file. When libvirt inserts firewall rules between the save and the import done by reload-iptables, we get unexpected behaviour.
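The interleaving described above can be sketched with a toy Python model. Nothing here is libvirt or iptables code: the `Table` class and `start_guest` function are hypothetical, and the model assumes the restore step replaces the whole table with the saved snapshot (as iptables-restore does when run without --noflush).

```python
import copy


class Table:
    """Toy stand-in for one iptables table: chain name -> list of jump targets."""

    def __init__(self):
        self.chains = {"FORWARD": []}

    def new_chain(self, name):
        # Like `iptables -N name`, with "already exists" errors ignored.
        self.chains.setdefault(name, [])

    def insert_jump(self, chain, target):
        # Like `iptables -I chain 1 -j target`; fails if the target chain is missing.
        if target not in self.chains:
            raise LookupError(f"Couldn't load target `{target}'")
        self.chains[chain].insert(0, target)

    def save(self):
        return copy.deepcopy(self.chains)

    def restore(self, snapshot):
        # Assumed behaviour: the saved snapshot replaces the current table contents.
        self.chains = copy.deepcopy(snapshot)


def start_guest(race):
    """Run libvirt's chain setup, optionally interleaved with a save/restore cycle."""
    table = Table()
    snapshot = table.save()          # reload script: save the current rules
    table.new_chain("libvirt-in")    # libvirt: create the chain
    if race:
        table.restore(snapshot)      # reload script: import the stale snapshot
    try:
        table.insert_jump("FORWARD", "libvirt-in")  # libvirt: hook chain into FORWARD
    except LookupError as exc:
        return str(exc)
    return "ok"


if __name__ == "__main__":
    print(start_guest(race=False))
    print(start_guest(race=True))
```

In this model the restore wipes out the chain that was created after the snapshot was taken, so the following insert fails with an error of the same shape as the one reported.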
>> So far I am not able to reproduce this bug on libvirt 5.0.0.

> This is interesting, because AFAICT we had no changes to the nwfilter driver between 5.0.0 and 5.1.0 that would affect this behaviour.
> We did have the changes to the virtual network driver but that should not interfere with the nwfilter driver.

The hypervisor running libvirt 5.0.0 was not using this reload-iptables script.

Regards,
Frank

On Thu, Apr 04, 2019 at 02:08:41PM +0000, Frank Schreuder wrote:
>>> To fix this issue I have to restart libvirt. Some iptables chains are missing, which is probably caused by a nwfilter-define operation. I'm able to reproduce this bug within 2 hours by running 2 loops. One loop is defining nwfilters and the second loop is destroying and starting multiple VMs.

>> The fact that we recreate it every time we try to start a guest also means any problem should be self-correcting, which makes it even more strange that you need to have a restart.

> It seems that the problem is a race condition between libvirt and our reload-iptables script. Libvirt inserts and removes rules one by one, while reload-iptables uses iptables-save and iptables-restore. The reload-iptables script saves libvirt's firewall rules to a temp file, appends Puppet's rules, and then imports that temp file. When libvirt inserts firewall rules between the save and the import done by reload-iptables, we get unexpected behaviour.

Ah right, I should have thought of something like that. Protecting against a concurrent app that dumps & recreates iptables rules in parallel with libvirt doing its work with iptables is not really practical, I'm afraid :-( It is one of the big pain points of dealing with iptables.
>>> So far I am not able to reproduce this bug on libvirt 5.0.0.

>> This is interesting, because AFAICT we had no changes to the nwfilter driver between 5.0.0 and 5.1.0 that would affect this behaviour.
>> We did have the changes to the virtual network driver but that should not interfere with the nwfilter driver.

> The hypervisor running libvirt 5.0.0 was not using this reload-iptables script.

Ok, that explains it!

Regards,
Daniel
participants (2)
- Daniel P. Berrangé
- Frank Schreuder