[libvirt] Zombie process after open libvirt connection

Hello,

I have the following Perl test script, and when I open a libvirt connection with qemu+tls I get zombie processes.

Let me show the output of the script for the different versions:

* connection with qemu+libssh2 or test:///

$ perl test-chldhandle-bug.pl
init... pid=15305
while...
fork 1
end... pid=15307
receive chld
fork 2
end... pid=15308
receive chld
connection open
fork 3
end... pid=15350
receive chld
fork 4
receive chld
go next...

* connection with qemu+tls

$ perl test-chldhandle-bug.pl
init... pid=13723
while...
fork 1
end... pid=13725
receive chld
fork 2
end... pid=13726
receive chld
2014-03-13 18:15:56.639+0000: 13723: info : libvirt version: 1.0.5.7, package: 2.fc19 (Fedora Project, 2013-11-17-23:21:57, buildvm-18.phx2.fedoraproject.org)
2014-03-13 18:15:56.639+0000: 13723: warning : virNetTLSContextCheckCertificate:1099 : Certificate check failed Certificate [session] owner does not match the hostname 10.10.4.249
connection open
fork 3
end... pid=13773
fork 4
end... pid=13800
go next...

Does anyone know how I can solve this issue?

Regards,
-- 
Carlos Rodrigues
Senior Software Engineer
Eurotux Informática, S.A. | www.eurotux.com
(t) +351 253 680 300 (m) +351 911 926 110

Hello,

Can anyone help me? I need a solution for this, because I have a daemon process that reaches about 200 zombie processes after opening a new libvirt connection with qemu+tls.

Best regards,
-- 
Carlos Rodrigues

On Thu, 2014-03-13 at 18:27 +0000, Carlos Rodrigues wrote:
> Hello,
> I have the following Perl test script, and when I open a libvirt connection with qemu+tls I get zombie processes.
> [...]
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On 17.03.2014 11:13, Carlos Rodrigues wrote:
> Hello,
> Can anyone help me? I need a solution for this, because I have a daemon process that reaches about 200 zombie processes after opening a new libvirt connection with qemu+tls.
> Best regards,

From the script you've attached:

$SIG{'CHLD'} = sub { print "receive chld","\n"; };

You should either waitpid() the child here or at the end of the script. The former approach won't leave any zombies hanging around; the latter requires you to keep the return values of fork(). So I think your script needs to look like this:

$ diff -up test-chldhandle-bug.pl test-chldhandle-bug-fixed.pl
--- test-chldhandle-bug.pl  2014-03-17 11:59:48.972238074 +0100
+++ test-chldhandle-bug-fixed.pl  2014-03-17 12:03:08.693875968 +0100
@@ -5,10 +5,12 @@ use strict;
 use Sys::Virt;
 use POSIX qw(:signal_h);
+use POSIX ":sys_wait_h";

 sub main {
     print "init... pid=$$","\n";
-    $SIG{'CHLD'} = sub { print "receive chld","\n"; };
+    $SIG{'CHLD'} = sub { my $kid; print "receive chld","\n";
+        do { $kid = waitpid(-1, WNOHANG); } while $kid > 0; };

     while(1){
         print "while...","\n";

Michal

Hello Michal,

Thank you for your answer, but this doesn't fix my problem.

Running your fixed script gives the same behavior:

$ perl test-chldhandle-bug-fixed.pl
init... pid=29713
while...
fork 1
end... pid=29716
receive chld
fork 2
end... pid=29717
receive chld
2014-03-17 11:10:37.234+0000: 29713: info : libvirt version: 1.0.5.7, package: 2.fc19 (Fedora Project, 2013-11-17-23:21:57, buildvm-18.phx2.fedoraproject.org)
2014-03-17 11:10:37.234+0000: 29713: warning : virNetTLSContextCheckCertificate:1099 : Certificate check failed Certificate [session] owner does not match the hostname 10.10.4.249
connection open
fork 3
end... pid=29827
fork 4
end... pid=29930
go next...

In my daemon version, I also use waitpid to handle the children.

Regards,
-- 
Carlos Rodrigues

On Mon, 2014-03-17 at 12:04 +0100, Michal Privoznik wrote:
> From the script you've attached:
> $SIG{'CHLD'} = sub { print "receive chld","\n"; };
> You should either waitpid() the child here or at the end of the script.
> [...]

On 17.03.2014 12:13, Carlos Rodrigues wrote:
> Hello Michal,
> Thank you for your answer, but this doesn't fix my problem.
> Running your fixed script gives the same behavior:
> [...]

I'm not a Perl expert, but I don't think it's a libvirt bug anyhow. Moreover, I don't see any zombies:

$ perl test-chldhandle-bug-fixed.pl & sleep 5 && echo && ps axf | grep perl && echo
[1] 11239
init... pid=11239
while...
fork 1
end... pid=11241
receive chld
fork 2
end... pid=11242
receive chld
connection open

11239 pts/18 S 0:00 | \_ perl test-chldhandle-bug-fixed.pl
11245 pts/18 S+ 0:00 | \_ grep --colour=auto perl

fork 3
end... pid=11246
receive chld
fork 4
end... pid=11247
receive chld
go next...

BTW, with the older version I'm seeing this:

11399 pts/18 S 0:00 | \_ perl test-chldhandle-bug.pl
11401 pts/18 Z 0:00 | | \_ [perl] <defunct>
11402 pts/18 Z 0:00 | | \_ [perl] <defunct>
11405 pts/18 S+ 0:00 | \_ grep --colour=auto perl

What zombies are you seeing? Perl ones, libvirt ones, or something else?

Michal

Hello Michal,

I am using libvirt 1.1.3, perl-Sys-Virt 1.1.3 and perl-5.16 on Fedora 19 x86_64.

The zombie processes appear after opening a libvirt connection with qemu+tls; the Perl module is an XS binding for the libvirt library.

Here is my running example with zombie processes:

$ perl test-chldhandle-bug-fixed.pl & sleep 15 && echo && ps axf | grep perl && echo
[2] 12427
init... pid=12427
while...
fork 1
end... pid=12430
receive chld
fork 2
end... pid=12431
receive chld
2014-03-19 11:06:38.712+0000: 12427: info : libvirt version: 1.1.3.1, package: 2.fc19 (Unknown, 2014-03-17-15:02:00, cmar-laptop.lan)
2014-03-19 11:06:38.712+0000: 12427: warning : virNetTLSContextCheckCertificate:1140 : Certificate check failed Certificate [session] owner does not match the hostname 10.10.4.249
connection open
fork 3
end... pid=12432
fork 4
end... pid=12440

12427 pts/2 S 0:00 | \_ perl test-chldhandle-bug-fixed.pl
12432 pts/2 Z 0:00 | | \_ [perl] <defunct>
12440 pts/2 Z 0:00 | | \_ [perl] <defunct>
12442 pts/2 S+ 0:00 | \_ grep --color=auto perl

Regards,
-- 
Carlos Rodrigues

On Mon, 2014-03-17 at 13:37 +0100, Michal Privoznik wrote:
> I'm not a Perl expert, but I don't think it's a libvirt bug anyhow. Moreover, I don't see any zombies:
> [...]
> What zombies are you seeing? Perl ones, libvirt ones, or something else?

On 19.03.2014 12:10, Carlos Rodrigues wrote:
> Hello Michal,
> I am using libvirt 1.1.3, perl-Sys-Virt 1.1.3 and perl-5.16 on Fedora 19 x86_64.
> The zombie processes appear after opening a libvirt connection with qemu+tls; the Perl module is an XS binding for the libvirt library.
> Here is my running example with zombie processes:
> [...]

Aha! It seems like this is only present when using tls; I was unable to reproduce it with tcp or unix sockets. And when using tcp I can see SIGCHLD being delivered, while with tls it is not. That makes me wonder whether either libvirt or gnutls silently sets the signal mask and does not restore it. Because if I take a look at the signal mask, I can clearly see SIGCHLD being blocked (from /proc/$pid/status):

SigPnd: 0000000000000000
ShdPnd: 0000000000010000
SigBlk: 0000000008011000
SigIgn: 0000000000001080
SigCgt: 0000000180010000

What we can see here is that SigBlk (the bitmask of blocked signals) contains 0x8011000, which is SIGPIPE, SIGCHLD and SIGWINCH. Right, why would libvirt care about SIGWINCH anyway? Git grepping for it leads us to virNetClientSetTLSSession(). I can clearly see that there we add just those three signals to a mask, then set this mask just prior to calling poll(), and then restore the original one. Oh wait, we do not! pthread_sigmask(SIG_BLOCK, ...) just adds new signals to the mask; it does not overwrite the old one. So yes, this is clearly a libvirt bug.

If I use SIG_SETMASK there, I no longer get any zombies. I'll post the patch shortly.

Michal
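The reading of the SigBlk value 0000000008011000 from the status output can be checked with a small C sketch. It assumes the Linux /proc/<pid>/status convention that bit n-1 of the hexadecimal mask stands for signal number n, and the Linux x86 signal numbers SIGPIPE = 13, SIGCHLD = 17 and SIGWINCH = 28 (other platforms may number them differently). The helper names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse a hexadecimal mask as printed in /proc/<pid>/status. */
uint64_t parse_sigmask(const char *hex)
{
    return (uint64_t)strtoull(hex, NULL, 16);
}

/* Return 1 if signal number signo is set in the mask, else 0.
 * Assumption: bit (signo - 1) represents signal signo. */
int mask_has_signal(uint64_t mask, int signo)
{
    if (signo < 1 || signo > 64)
        return 0;
    return (int)((mask >> (signo - 1)) & 1u);
}

/* Print the number of every signal contained in the mask. */
void print_sigmask(uint64_t mask)
{
    int signo;

    for (signo = 1; signo <= 64; signo++)
        if (mask_has_signal(mask, signo))
            printf("signal %d\n", signo);
}
```

For the SigBlk value, `print_sigmask(parse_sigmask("0000000008011000"))` lists signals 13, 17 and 28. Applied to the ShdPnd value 0000000000010000, the same check reports only signal 17, which is consistent with a SIGCHLD that is pending but cannot be delivered while it is blocked.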

Thank you Michal, this is good news for me. I'll wait for this patch.

Regards,
-- 
Carlos Rodrigues

On Wed, 2014-03-19 at 18:27 +0100, Michal Privoznik wrote:
> Aha! It seems like this is only present when using tls; I was unable to reproduce it with tcp or unix sockets.
> [...]
> If I use SIG_SETMASK there, I no longer get any zombies. I'll post the patch shortly.

On 19.03.2014 18:55, Carlos Rodrigues wrote:
> Thank you Michal, this is good news for me.
> I'll wait for this patch.
> Regards,

I've just pushed it to the repository:

commit 3d4b4f5ac634c123af1981084add29d3a2ca6ab0
Author:     Michal Privoznik <mprivozn@redhat.com>
AuthorDate: Wed Mar 19 18:10:34 2014 +0100
Commit:     Michal Privoznik <mprivozn@redhat.com>
CommitDate: Wed Mar 19 18:54:51 2014 +0100

    virNetClientSetTLSSession: Restore original signal mask

    Currently, we use pthread_sigmask(SIG_BLOCK, ...) prior to calling
    poll(). This is okay, as we don't want poll() to be interrupted.
    However, then - immediately as we fall out from the poll() - we try
    to restore the original sigmask - again using SIG_BLOCK. But as the
    man page says, SIG_BLOCK adds signals to the signal mask:

    SIG_BLOCK
        The set of blocked signals is the union of the current set and
        the set argument.

    Therefore, when restoring the original mask, we need to completely
    overwrite the one we set earlier and hence we should be using:

    SIG_SETMASK
        The set of blocked signals is set to the argument set.

    Signed-off-by: Michal Privoznik <mprivozn@redhat.com>

Michal
participants (2): Carlos Rodrigues, Michal Privoznik