[Libvir] PATCH: Prevent zombie ssh tunnels

I noticed that when using the SSH tunnel for the remote driver I ended up with alot of zombie SSH processes. We simply forgot to waitpid() on the child when a connection attempt failed, or when shutting down an open remote connection. Attached is a possible patch Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel P. Berrange wrote:
I noticed that when using the SSH tunnel for the remote driver I ended up with alot of zombie SSH processes. We simply forgot to waitpid() on the child when a connection attempt failed, or when shutting down an open remote connection. Attached is a possible patch
This is a good catch. I would prefer to factor out that common wait/kill code into a separate function, particularly because it seems to kill the ssh client after 3 seconds of waiting, so it'd be good to document that. But apart from that, +1. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

On Tue, Sep 11, 2007 at 12:59:59AM +0100, Daniel P. Berrange wrote:
I noticed that when using the SSH tunnel for the remote driver I ended up with alot of zombie SSH processes. We simply forgot to waitpid() on the child when a connection attempt failed, or when shutting down an open remote connection. Attached is a possible patch
Looks fine, like Rich maybe a bit of refactoring might be good. The only worries I have is the following scenario: - the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck (not completely unlikely since the ssh may have been started a long time ago) - we end-up killing a random process in the system I think this is mostly avoidable by resetting priv->pid to -1 or 0 on any child communication error, and before doing the kill in the patch. Even better would be to be able to check that the process corresponding to priv->pid is still a child of the current process, I wonder if this can be achieved without blocking with an initial waitpid() Maybe I'm too cautious, I'm fine with the principle of the patch though
@@ -646,6 +648,19 @@ doRemoteOpen (virConnectPtr conn, struct gnutls_bye (priv->session, GNUTLS_SHUT_RDWR); close (priv->sock); } + if (priv->pid > 0) { + pid_t reap; + int status, n = 0; + kill(priv->pid, SIGTERM); + do { + if (n) + usleep(n*1000); + if (n > 3) + kill(priv->pid, SIGKILL); + reap = waitpid(priv->pid, &status, WNOHANG); + n++; + } while (reap != -1 && reap != priv->pid); + }
/* Free up the URL and strings. */ xmlFreeURI (uri);
Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Daniel Veillard wrote:
- the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck
Can not happen as long as libvirt hasn't asked for the exist status via waitpid() because the pid is still in use by the zombie ssh process. cheers, Gerd

On Tue, Sep 11, 2007 at 11:35:46AM +0200, Gerd Hoffmann wrote:
Daniel Veillard wrote:
- the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck
Can not happen as long as libvirt hasn't asked for the exist status via waitpid() because the pid is still in use by the zombie ssh process.
Hum, which is precisely why we need the patch. Still I would feel a bit better if we could check that priv->pid is a child of the current process something like (getppid(priv->pid) == getpid()) test before any kill would do this easilly I think. Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/

Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 11:35:46AM +0200, Gerd Hoffmann wrote:
Daniel Veillard wrote:
- the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck Can not happen as long as libvirt hasn't asked for the exist status via waitpid() because the pid is still in use by the zombie ssh process.
Hum, which is precisely why we need the patch. Still I would feel a bit better if we could check that priv->pid is a child of the current process something like (getppid(priv->pid) == getpid()) test before any kill would do this easilly I think.
I think Gerd's point was that as long as we haven't waited for the PID within this process before, the PID cannot be reused. That doesn't mean the situation cannot arise -- for example the main program might be using other libraries as well as libvirt, and those other libraries might blindly wait(2) for children. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

On Tue, Sep 11, 2007 at 12:00:33PM +0100, Richard W.M. Jones wrote:
Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 11:35:46AM +0200, Gerd Hoffmann wrote:
Daniel Veillard wrote:
- the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck Can not happen as long as libvirt hasn't asked for the exist status via waitpid() because the pid is still in use by the zombie ssh process.
Hum, which is precisely why we need the patch. Still I would feel a bit better if we could check that priv->pid is a child of the current process something like (getppid(priv->pid) == getpid()) test before any kill would do this easilly I think.
I think Gerd's point was that as long as we haven't waited for the PID within this process before, the PID cannot be reused.
AFAIK there is no API to give you the parent PID of an arbitrary PID. The getppid() call returns your own parent - you can't ask it for someone else's parent.
That doesn't mean the situation cannot arise -- for example the main program might be using other libraries as well as libvirt, and those other libraries might blindly wait(2) for children.
There is an issue if the app has set SIGCHLD to SIG_IGN - the kernel will automatically reap zombies then. This would allow the race that Daniel illustrates above, where we might 'kill' a program that is no longer our own SSH client. In the case of cleaning up after a failed doRemoteOpen call we should be safe enough, since we only spawned the SSH process 10 lines higher up and the system was have to be insanely busy to cycle through 65536 PIDs before we completed those 10 lines. In the case of doRemoteClose we've not got alot of good options. Either take the risk that SIGCHILD is SIG_IGN or someone else called wait() and do the kill() anyway. Another option is to double-fork() when running the SSH tunnel so it gets inherited by init. Then we assume that SSH will die & exit when we close the socket in doRemoteClose. ie closing our end of the socket should make SSH get a SIGPIPE / EOF and exit - or equivalently if the server closes its end. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 11:35:46AM +0200, Gerd Hoffmann wrote:
Daniel Veillard wrote:
- the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck Can not happen as long as libvirt hasn't asked for the exist status via waitpid() because the pid is still in use by the zombie ssh process.
Hum, which is precisely why we need the patch. Still I would feel a bit better if we could check that priv->pid is a child of the current process something like (getppid(priv->pid) == getpid()) test before any kill would do this easilly I think.
Simply keep track which processes you've (successfully) waitpid()'ed already, then you'll know. cheers, Gerd

On Tue, Sep 11, 2007 at 04:56:13AM -0400, Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 12:59:59AM +0100, Daniel P. Berrange wrote:
I noticed that when using the SSH tunnel for the remote driver I ended up with alot of zombie SSH processes. We simply forgot to waitpid() on the child when a connection attempt failed, or when shutting down an open remote connection. Attached is a possible patch
Looks fine, like Rich maybe a bit of refactoring might be good. The only worries I have is the following scenario: - the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck (not completely unlikely since the ssh may have been started a long time ago) - we end-up killing a random process in the system
I think this is mostly avoidable by resetting priv->pid to -1 or 0 on any child communication error, and before doing the kill in the patch. Even better would be to be able to check that the process corresponding to priv->pid is still a child of the current process, I wonder if this can be achieved without blocking with an initial waitpid()
After some testing I believe we can safely do away with kill() completely. If we simply close(priv->sock) which is our end of the socketpair used to talk to SSH, then SSH detects the end-of-file condition and exits of its on accord. So there's no need to kill() - just close the socket and waitpid() seems to do the trickm, unless I'm missing something bad. Dan. -- |=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=| |=- Perl modules: http://search.cpan.org/~danberr/ -=| |=- Projects: http://freshmeat.net/~danielpb/ -=| |=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Daniel P. Berrange wrote: > On Tue, Sep 11, 2007 at 04:56:13AM -0400, Daniel Veillard wrote: >> On Tue, Sep 11, 2007 at 12:59:59AM +0100, Daniel P. Berrange wrote: >>> I noticed that when using the SSH tunnel for the remote driver I ended up >>> with alot of zombie SSH processes. We simply forgot to waitpid() on the >>> child when a connection attempt failed, or when shutting down an open remote >>> connection. Attached is a possible patch >> Looks fine, like Rich maybe a bit of refactoring might be good. The >> only worries I have is the following scenario: >> - the ssh process dies >> - libvirt based application takes some time to notice it >> - the OS span a new process with the same PID after a PID rollabck >> (not completely unlikely since the ssh may have been started a long >> time ago) >> - we end-up killing a random process in the system >> >> I think this is mostly avoidable by resetting priv->pid to -1 or 0 on >> any child communication error, and before doing the kill in the patch. >> Even better would be to be able to check that the process corresponding >> to priv->pid is still a child of the current process, I wonder if this >> can be achieved without blocking with an initial waitpid() > > After some testing I believe we can safely do away with kill() completely. > If we simply close(priv->sock) which is our end of the socketpair used > to talk to SSH, then SSH detects the end-of-file condition and exits of > its on accord. So there's no need to kill() - just close the socket and > waitpid() seems to do the trickm, unless I'm missing something bad. +1, but be nicer if that common code was in a function. Rich. -- Emerging Technologies, Red Hat - http://et.redhat.com/~rjones/ Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SL4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 03798903

On Fri, Sep 14, 2007 at 03:14:02AM +0100, Daniel P. Berrange wrote:
On Tue, Sep 11, 2007 at 04:56:13AM -0400, Daniel Veillard wrote:
On Tue, Sep 11, 2007 at 12:59:59AM +0100, Daniel P. Berrange wrote:
I noticed that when using the SSH tunnel for the remote driver I ended up with alot of zombie SSH processes. We simply forgot to waitpid() on the child when a connection attempt failed, or when shutting down an open remote connection. Attached is a possible patch
Looks fine, like Rich maybe a bit of refactoring might be good. The only worries I have is the following scenario: - the ssh process dies - libvirt based application takes some time to notice it - the OS span a new process with the same PID after a PID rollabck (not completely unlikely since the ssh may have been started a long time ago) - we end-up killing a random process in the system
I think this is mostly avoidable by resetting priv->pid to -1 or 0 on any child communication error, and before doing the kill in the patch. Even better would be to be able to check that the process corresponding to priv->pid is still a child of the current process, I wonder if this can be achieved without blocking with an initial waitpid()
After some testing I believe we can safely do away with kill() completely. If we simply close(priv->sock) which is our end of the socketpair used to talk to SSH, then SSH detects the end-of-file condition and exits of its on accord. So there's no need to kill() - just close the socket and waitpid() seems to do the trickm, unless I'm missing something bad.
Looks fine and safe to me, I like it +1 ! Daniel -- Red Hat Virtualization group http://redhat.com/virtualization/ Daniel Veillard | virtualization library http://libvirt.org/ veillard@redhat.com | libxml GNOME XML XSLT toolkit http://xmlsoft.org/ http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
participants (4)
-
Daniel P. Berrange
-
Daniel Veillard
-
Gerd Hoffmann
-
Richard W.M. Jones